Measuring context switch time for threads - c

I want to calculate the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
Then the context switch time is (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can loop over it many times and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop and take the average, the smaller the result gets.
Edit
Here is a snippet of the code that I have:
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;
...
// each thread function looks similar to this
void* thread_1_func(void* arg)
{
    thread_time* t = (thread_time*) arg;  // a variable named thread_time would shadow the typedef
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->start));
    for (x = 0; x < loop; ++x)
    {
        // where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->end));
    return NULL;
}
void* thread_2_func(void* arg)
{
    // similar to the above
}
int main()
{
    ...
    pthread_t thread_1;
    pthread_t thread_2;
    thread_time thread_1_time;
    thread_time thread_2_time;
    struct timespec start, end;

    // stamp the start time
    clock_gettime(CLOCK_MONOTONIC, &start);

    // create the two threads with the time structs as arguments
    pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
    pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);

    // wait for the two threads to terminate
    pthread_join(thread_1, NULL);
    pthread_join(thread_2, NULL);

    // stamp the end time
    clock_gettime(CLOCK_MONOTONIC, &end);

    // then I calculate the difference between the total execution time and the CPU time of the two threads
}

First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong: that clock gives the CPU time spent in that thread, in user mode, but the context switch itself does not happen in user mode, so you'd want to use another clock. Also, on multiprocessor systems the clocks can give different values from one processor to another! I therefore suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. Be warned, though, that even if you read either of these twice in rapid succession, the two timestamps will usually already be tens of nanoseconds apart.
As for context switches - there are many kinds of context switches. The fastest approach is to switch from one thread to another entirely in software: push the old registers onto the stack, set the task-switched flag so that the SSE/FP registers are saved lazily, save the stack pointer, load the new stack pointer, and return from that function - since the other thread did the same earlier, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower, because the user-space page tables must be switched by reloading the CR3 register; this flushes the TLB, which maps virtual addresses to physical ones, and the resulting TLB misses are expensive.
However, a sub-nanosecond context switch/system call overhead does not really seem plausible - most likely the two threads are running in parallel on two hardware threads (hyperthreading) or two CPU cores. So I suggest that you set the CPU affinity of the process so that Linux only ever runs it on, say, the first CPU core:
#include <sched.h>   // compile with -D_GNU_SOURCE for sched_setaffinity

cpu_set_t mask;
int result;

CPU_ZERO(&mask);
CPU_SET(0, &mask);                                     // allow only CPU 0
result = sched_setaffinity(0, sizeof(mask), &mask);    // 0 = the calling process
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching the floating point / SSE state (this happens lazily), you should have some floating point variables and do calculations on them prior to the context switch, then add, say, 0.1 to some volatile floating point variable after the context switch to see whether it has an effect on the switching time.
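To illustrate the suggestions above (CLOCK_MONOTONIC plus CPU affinity), here is a minimal sketch of a condition-variable ping-pong between two threads pinned to one core. It is only an upper bound, since each round trip also pays for the mutex and futex calls; names such as ITERATIONS and ping_pong() are made up for illustration, this is not the poster's code.
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <time.h>

#define ITERATIONS 100000

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static int turn = 0;                 /* 0 -> thread A may run, 1 -> thread B may run */

static void* ping_pong(void* arg)
{
    int me = *(int*)arg;
    for (int i = 0; i < ITERATIONS; ++i) {
        pthread_mutex_lock(&lock);
        while (turn != me)
            pthread_cond_wait(&cond, &lock);   /* forces a switch to the other thread */
        turn = 1 - me;                         /* hand the turn over */
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    /* pin the whole process to CPU 0 so the two threads really have to switch */
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(0, &mask);
    sched_setaffinity(0, sizeof(mask), &mask);

    int a = 0, b = 1;
    pthread_t ta, tb;
    struct timespec start, end;

    clock_gettime(CLOCK_MONOTONIC, &start);
    pthread_create(&ta, NULL, ping_pong, &a);
    pthread_create(&tb, NULL, ping_pong, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);
    clock_gettime(CLOCK_MONOTONIC, &end);

    double ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    /* each iteration contains two switches (A->B and B->A) plus mutex/condvar overhead */
    printf("~%.0f ns per switch (upper bound, includes futex overhead)\n",
           ns / (2.0 * ITERATIONS));
    return 0;
}
The per-switch figure this prints should be in the hundreds of nanoseconds to low microseconds on typical hardware, which is a useful sanity check against the sub-nanosecond numbers in the question.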

This is not straightforward, but as usual someone has already done a lot of the work. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -lpthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note that while the user in the link above was running the test, they were also hammering the machine with games and compilation, which is why the context switches were taking so long. Some more info here...
how can you measure the time spent in a context switch under java platform

Related

Linux timer interval

I want to run a timer with an interval of 5 ms. I created a Linux timer, and when sigalrm_handler is called I check the time elapsed since the previous call. I'm getting times like 4163, 4422, 4266, 4443, 4470, 4503, 4288 microseconds, when I want the intervals to be about 5000 microseconds with the least possible error. I don't know why the interval is not constant; it varies and is much lower than it should be.
Here is my code:
static int time_count;
static int counter;
struct itimerval timer = {0};

void sigalrm_handler(int signum)
{
    Serial.print("SIGALRM received, time: ");
    Serial.println(micros() - time_count);
    time_count = micros();
}

void setup() {
    Serial.begin(9600);

    timer.it_value.tv_sec = 1;
    timer.it_interval.tv_usec = 5000;

    signal(SIGALRM, &sigalrm_handler);
    setitimer(ITIMER_REAL, &timer, NULL);
    time_count = micros();
}
I want to run a timer with interval of 5 ms.
You probably cannot get that period reliably, because it is smaller than what reasonable PC hardware can handle.
As a rule of thumb, 50 Hz (or perhaps 100 Hz) is probably the highest reliable frequency you can get. And it is not a matter of software, but of hardware.
Think of your typical processor cache (a few megabytes): you may need a few milliseconds just to fill it. Or think of the time needed to handle a page fault; it can easily exceed a millisecond.
And the Intel Edison is not a top-fast processor. I wouldn't be surprised if converting a number to a string and displaying that string on some screen took about a millisecond (but I leave that for you to check). This could explain your figures.
Regarding software, see also time(7) (or consider perhaps some busy-waiting approach inside the kernel; I don't recommend that).
Also look into /proc/interrupts several times (see proc(5)), e.g. by running cat /proc/interrupts a few times in a shell. You'll probably see that the kernel gets interrupted less often than once every millisecond or few.
BTW, your signal handler calls non-async-signal-safe functions (so it is undefined behavior). Read signal(7) & signal-safety(7).
So it looks like your entire approach is wrong.
Maybe you want some RTOS, at least if you need some hard real-time (and then, you might consider upgrading your hardware to something faster and more costly).
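If soft timing is actually good enough, a common alternative is a periodic loop driven by clock_nanosleep with an absolute deadline, which also sidesteps the async-signal-safety problem mentioned above. The following is a minimal sketch (not the asker's Edison/Arduino code); expect jitter of the same order the question reports, but the deadlines do not drift because each one is computed from the previous one.
#include <stdio.h>
#include <time.h>

#define PERIOD_NS 5000000L           /* 5 ms */

int main(void)
{
    struct timespec next;
    clock_gettime(CLOCK_MONOTONIC, &next);

    for (int tick = 0; tick < 10; ++tick) {
        /* advance the absolute deadline by one period */
        next.tv_nsec += PERIOD_NS;
        if (next.tv_nsec >= 1000000000L) {
            next.tv_nsec -= 1000000000L;
            next.tv_sec += 1;
        }

        /* sleep until the absolute deadline on the monotonic clock */
        clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);

        struct timespec now;
        clock_gettime(CLOCK_MONOTONIC, &now);
        printf("tick %d at %ld.%09ld\n", tick, (long)now.tv_sec, now.tv_nsec);
    }
    return 0;
}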

Linux/C: Check if context switch has occurred from inside thread

In a Linux/GNU/C environment, is there any visibility a running thread has into whether it has been put to sleep? For example, say you have a function like
void foo() {
    startClock();
    bar();
    endClock();
}
But you're only concerned with the running time of the code itself, i.e. you don't care about any time during which the thread was suspended mid-run. Ideally you'd be able to leverage some system call or library function, like countThreadSwitches(), such that:
void foo() {
    int lastCount = countThreadSwitches();
    startClock();
    bar();
    endClock();
    if (countThreadSwitches() != lastCount)
        discardClock();
}
Being able to tell whether the thread was switched out between two statements would allow us to measure only the runs unaffected by context switches.
So, is there anything like that hypothetical countThreadSwitches() call? Or is that information opaque to the thread itself?
In Linux, int getrusage(int who, struct rusage *usage); can be used to fill a struct containing timeval ru_utime (user CPU time used) and timeval ru_stime (system CPU time used) for a thread or a process.
These values, along with the system clock, will tell you how much CPU time your process/thread actually spent running compared to how much wall-clock time elapsed.
For example, something like (ru_utime + ru_stime) / (clock_end - clock_start) * 100 will give you CPU usage as a percentage of the time elapsed between start and end.
There are also some stats in there for the number of context switches under certain circumstances, but that info isn't very useful.
On Linux you can read and parse the nonvoluntary_ctxt_switches: line from /proc/self/status (probably best to just do a single 4096-byte read() before and after, then parse both buffers afterwards).
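A related option on Linux is getrusage with glibc's RUSAGE_THREAD extension, which reports per-thread voluntary and involuntary context switch counts directly. A minimal sketch (the helper name thread_ctx_switches() is made up for illustration):
#define _GNU_SOURCE
#include <stdio.h>
#include <sys/resource.h>

static long thread_ctx_switches(void)
{
    struct rusage ru;
    getrusage(RUSAGE_THREAD, &ru);          /* stats for the calling thread only */
    return ru.ru_nvcsw + ru.ru_nivcsw;      /* voluntary + involuntary switches */
}

int main(void)
{
    long before = thread_ctx_switches();

    /* ... the code being timed would go here (bar() in the question) ... */

    long after = thread_ctx_switches();
    printf("context switches during the measured region: %ld\n", after - before);
    return 0;
}
If the count changed between the two reads, the measured run can be discarded, which is essentially the hypothetical countThreadSwitches() pattern from the question.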

Implementing time delay function in C

I want to implement a delay function using empty loops, but the time needed to complete one loop iteration is compiler- and machine-dependent. I want my program to determine that time on its own and delay for the specified amount of time. Can anyone give me an idea how to do this?
N.B. There is a function named delay() which suspends execution for the specified number of milliseconds. Is it possible to suspend execution without using this function?
First of all, you should never sit in a loop doing nothing. Not only does it waste energy (it keeps your CPU 100% busy counting your loop counter) - on a multitasking system it also degrades overall system performance, because your process keeps getting time slices as it appears to be doing something.
Next point: I don't know of any delay() function; it is not standard C. In fact, until C11 there was no standard at all for things like this.
POSIX to the rescue: there is usleep(3) (deprecated) and nanosleep(2). If you're on a POSIX-compliant system, you'll be fine with those. They block (meaning the scheduler of your OS knows they have nothing to do and schedules them again only after the interval has passed), so you don't waste CPU power.
If you're on Windows, for a direct delay in code you only have Sleep(). Note that this function takes milliseconds, but normally has a precision of only about 15 ms. Often good enough, but not always. If you need better precision on Windows, you can request more timer interrupts using timeBeginPeriod(): timeBeginPeriod(1); requests a timer interrupt every millisecond. Don't forget to call timeEndPeriod() with the same value as soon as you no longer need the precision, because more timer interrupts come at a cost: they keep the system busy, wasting more energy.
I had a somewhat similar problem developing a little game recently: I needed constant ticks in 10 ms intervals. What I came up with, for POSIX-compliant systems and for Windows, was a ticker_wait() function that just suspends until the next tick; maybe that's helpful if your original intent was some timing issue.
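A minimal sketch of the Windows pattern described above (not the answerer's actual ticker code): raise the timer resolution, sleep, then restore it. It assumes you link against winmm (e.g. -lwinmm) for timeBeginPeriod/timeEndPeriod.
#include <windows.h>
#include <stdio.h>

int main(void)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    timeBeginPeriod(1);                  /* ask for ~1 ms scheduler granularity */

    QueryPerformanceCounter(&t0);
    Sleep(10);                           /* now lands much closer to 10 ms than the default ~15 ms grid */
    QueryPerformanceCounter(&t1);

    timeEndPeriod(1);                    /* always undo the request with the same value */

    double ms = (t1.QuadPart - t0.QuadPart) * 1000.0 / (double)freq.QuadPart;
    printf("slept for about %.2f ms\n", ms);
    return 0;
}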
Unless you're on a real-time operating system, anything you program yourself directly is not going to be accurate. You need to use a system function to sleep for some amount of time, like usleep in Linux or Sleep in Windows.
Because the operating system can interrupt the process sooner or later than the exact time expected, you should read the system time before and after you sleep to determine how long you actually slept.
Edit:
On Linux, you can get the current system time with gettimeofday, which has microsecond resolution (whether the actual clock is that accurate is a different story). On Windows, you can do something similar with GetSystemTimeAsFileTime:
int gettimeofday(struct timeval *tv, struct timezone *tz)
{
    /* microseconds between the Windows epoch (1601) and the Unix epoch (1970) */
    const unsigned __int64 epoch_diff = 11644473600000000;
    unsigned __int64 tmp;
    FILETIME t;

    if (tv) {
        GetSystemTimeAsFileTime(&t);

        tmp = 0;
        tmp |= t.dwHighDateTime;
        tmp <<= 32;
        tmp |= t.dwLowDateTime;

        tmp /= 10;               /* 100-nanosecond intervals -> microseconds */
        tmp -= epoch_diff;       /* shift to the Unix epoch */

        tv->tv_sec = (long)(tmp / 1000000);
        tv->tv_usec = (long)(tmp % 1000000);
    }
    return 0;
}
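Tying this to the advice above about checking how long you actually slept, here is a minimal usage sketch. The POSIX version is shown; on Windows the gettimeofday() shim above plus Sleep() would play the same roles. The 5 ms figure is just an example value.
#include <stdio.h>
#include <sys/time.h>
#include <unistd.h>

int main(void)
{
    struct timeval before, after;

    gettimeofday(&before, NULL);
    usleep(5000);                               /* ask for 5 ms */
    gettimeofday(&after, NULL);

    long actual_us = (after.tv_sec - before.tv_sec) * 1000000L
                   + (after.tv_usec - before.tv_usec);
    printf("requested 5000 us, actually slept %ld us\n", actual_us);
    return 0;
}
Knowing the actual elapsed time lets the caller compensate for over-sleeping on the next iteration.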
You could find the exact time at one point and then sit in a while loop that rechecks the time until it reaches your target, then break out and continue executing the rest of your program. I'm not sure I see much of a benefit in looping like this rather than just using a delay function, though.

QueryPerformanceCounter and QueryPerformanceFrequency in Windows

#include <windows.h>
#include <stdio.h>
#include <stdint.h>

// assuming we return times with microsecond resolution
#define STOPWATCH_TICKS_PER_US 1

uint64_t GetStopWatch()
{
    LARGE_INTEGER t, freq;
    QueryPerformanceCounter(&t);
    QueryPerformanceFrequency(&freq);
    return (uint64_t) (t.QuadPart / (double) freq.QuadPart * 1000000);
}

void task()
{
    printf("hi\n");
}

int main()
{
    uint64_t start = GetStopWatch();
    task();
    uint64_t stop = GetStopWatch();
    printf("Elapsed time (microseconds): %llu\n", (unsigned long long)(stop - start));
}
The above uses QueryPerformanceCounter, which retrieves the current value of the high-resolution performance counter, and QueryPerformanceFrequency, which retrieves the frequency of that counter. If I call the task() function multiple times, the difference between the start and stop times varies, but I thought I should get the same time difference for each call. Could anyone help me identify the mistake in the above code?
The thing is, Windows is a pre-emptive multi-tasking operating system. What the hell does that mean, you ask?
'Simple' - Windows allocates time slices to each of the running processes in the system. This gives the illusion of dozens or hundreds of processes running in parallel. In reality, you are limited to 2, 4, 8 or perhaps 16 parallel threads of execution on a typical desktop/laptop. An Intel i3 has 2 physical cores, each of which can give the impression of doing two things at once (in reality there are hardware tricks that switch execution between the two threads each core can handle at once). This is in addition to the software context switching that Windows/Linux/MacOSX do.
These time slices are not guaranteed to be of the same duration each time. You may find the PC does a sync with windows.time to update your clock, or that the virus scanner decides to start working, or any one of a number of other things. All of these events may occur after your task() function has begun, yet before it ends.
In the DOS days, you'd get very nearly the same result every time you timed a single iteration of task(). Though, thanks to TSR programs, you could still find that an interrupt fired and some machine time was stolen during execution.
It is for just these reasons that a more accurate determination of the time a task takes to execute is obtained by running it N times and dividing the elapsed time by N to get the time per iteration.
For some functions in the past, I have used values for N as large as 100 million.
EDIT: A short snippet:
LARGE_INTEGER tStart, tEnd;
LARGE_INTEGER tFreq;
double tSecsElapsed;

QueryPerformanceFrequency(&tFreq);
QueryPerformanceCounter(&tStart);

int i, n = 100;
for (i = 0; i < n; i++)
{
    // Do Something
}

QueryPerformanceCounter(&tEnd);

tSecsElapsed = (tEnd.QuadPart - tStart.QuadPart) / (double)tFreq.QuadPart;
double tMsElapsed = tSecsElapsed * 1000;
double tMsPerIteration = tMsElapsed / (double)n;
Code execution time on modern operating systems and processors is very unpredictable. There is no scenario in which you can be sure that the elapsed time actually measured only the time taken by your code; your program may well have lost the processor to another process while it was executing. The caches used by the processor play a big role: code is always a lot slower the first time it executes, when the caches do not yet contain the code and data the program uses. The memory bus is very slow compared to the processor.
It gets especially meaningless when you measure a printf() statement. The console window is owned by another process, so there's a significant chunk of process-interop overhead whose execution time critically depends on the state of that process. You'll suddenly see a huge difference when the console window needs to be scrolled, for example. And most of all, there isn't actually anything you can do to make it faster, so measuring it is only interesting out of curiosity.
Profile only code that you can improve. Take many samples so you can get rid of the outliers. Never pick the lowest measurement; that just creates unrealistic expectations. Don't pick the average either; it is affected too much by the long delays that other processes can incur on your test. The median value is a good choice.
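A minimal sketch of the "many samples, take the median" advice above (the helper names measure_once and compare_dbl are made up for illustration; this is not a library API):
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

#define SAMPLES 101

static double measure_once(void (*fn)(void))
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);
    QueryPerformanceCounter(&t0);
    fn();                                        /* the code under test */
    QueryPerformanceCounter(&t1);
    return (t1.QuadPart - t0.QuadPart) * 1e6 / (double)freq.QuadPart;   /* microseconds */
}

static int compare_dbl(const void *a, const void *b)
{
    double d = *(const double *)a - *(const double *)b;
    return (d > 0) - (d < 0);
}

static void task(void) { printf("hi\n"); }

int main(void)
{
    double samples[SAMPLES];
    for (int i = 0; i < SAMPLES; i++)
        samples[i] = measure_once(task);

    /* sort and report the median, which is robust against scheduling outliers */
    qsort(samples, SAMPLES, sizeof samples[0], compare_dbl);
    printf("median: %.2f us (min %.2f, max %.2f)\n",
           samples[SAMPLES / 2], samples[0], samples[SAMPLES - 1]);
    return 0;
}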

How to make a thread sleep/block for nanoseconds (or at least milliseconds)?

How can I block my thread (or maybe the whole process) for a period of nanoseconds, or at least milliseconds?
Please note that I can't use sleep, because its argument is always in seconds.
nanosleep or clock_nanosleep is the function you should be using (the latter allows you to specify absolute time rather than relative time, and use the monotonic clock or other clocks rather than just the realtime clock, which might run backwards if an operator resets it).
Be aware, however, that you'll rarely get better than several microseconds of resolution, and the duration of the sleep is always rounded up, never down. (Rounding down would generally be impossible anyway since, on most machines, entering and exiting kernel space takes more than a microsecond.)
Also, if possible I would suggest using a call that blocks waiting for an event rather than sleeping for tiny intervals and polling. For instance, pthread_cond_wait, pthread_cond_timedwait, sem_wait, sem_timedwait, select, read, etc., depending on what task your thread is performing and how it synchronizes with other threads and/or communicates with the outside world.
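A minimal sketch of the clock_nanosleep approach, assuming a POSIX system; it sleeps on the monotonic clock and retries if interrupted by a signal. The helper name sleep_ns() is made up for illustration.
#include <errno.h>
#include <stdio.h>
#include <time.h>

static void sleep_ns(long nsec)
{
    struct timespec req = { nsec / 1000000000L, nsec % 1000000000L };
    struct timespec rem;

    /* relative sleep; on EINTR the remaining time is returned in rem */
    while (clock_nanosleep(CLOCK_MONOTONIC, 0, &req, &rem) == EINTR)
        req = rem;
}

int main(void)
{
    sleep_ns(2500000L);   /* ask for 2.5 ms; expect rounding up to the timer resolution */
    puts("done");
    return 0;
}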
One relatively portable way is to use select() or pselect() with no file descriptors:
#include <sys/select.h>

/* use a name other than sleep() so it does not clash with the standard function */
void short_sleep(unsigned long nsec) {
    struct timespec delay = { nsec / 1000000000, nsec % 1000000000 };
    pselect(0, NULL, NULL, NULL, &delay, NULL);
}
Try usleep(). It won't give you nanosecond precision, but microseconds will work => milliseconds too.
Using any variant of sleep with pthreads, the behaviour is not guaranteed: with older user-level thread implementations, all the threads could end up sleeping, since the kernel was not aware of the individual threads. Hence a solution is required that the pthread library handles rather than the kernel.
A safer and cleaner solution is pthread_cond_timedwait...
#include <stdio.h>
#include <pthread.h>
#include <sys/time.h>

pthread_mutex_t fakeMutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t fakeCond = PTHREAD_COND_INITIALIZER;

void mywait(int timeInSec)
{
    struct timespec timeToWait;
    struct timeval now;
    int rt;

    gettimeofday(&now, NULL);

    // pthread_cond_timedwait takes an absolute time, so add the interval to "now"
    timeToWait.tv_sec = now.tv_sec + timeInSec;
    timeToWait.tv_nsec = now.tv_usec * 1000;

    pthread_mutex_lock(&fakeMutex);
    rt = pthread_cond_timedwait(&fakeCond, &fakeMutex, &timeToWait);
    pthread_mutex_unlock(&fakeMutex);

    printf("\nDone\n");
}

void* fun(void* arg)
{
    printf("\nIn thread\n");
    mywait(5);
    return NULL;
}

int main()
{
    pthread_t thread;
    void *ret;

    pthread_create(&thread, NULL, fun, NULL);
    pthread_join(thread, &ret);
}
pthread_cond_timedwait takes an absolute time, so you need to add the amount of time you want to wait to the current time, as above.
Now, by using mywait(), only the thread calling it sleeps, and not the other pthreads.
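If the system clock may be adjusted while you are waiting, a variant worth considering (a sketch, not part of the answer above, assuming pthread_condattr_setclock is available) is to put the condition variable on CLOCK_MONOTONIC so the timeout is immune to wall-clock changes:
#include <pthread.h>
#include <time.h>

static pthread_mutex_t mono_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  mono_cond;

static void mywait_monotonic(int timeInSec)
{
    pthread_condattr_t attr;
    struct timespec timeToWait;

    pthread_condattr_init(&attr);
    pthread_condattr_setclock(&attr, CLOCK_MONOTONIC);   /* timedwait now measures against the monotonic clock */
    pthread_cond_init(&mono_cond, &attr);

    clock_gettime(CLOCK_MONOTONIC, &timeToWait);
    timeToWait.tv_sec += timeInSec;                      /* still an absolute time, on that clock */

    pthread_mutex_lock(&mono_mutex);
    pthread_cond_timedwait(&mono_cond, &mono_mutex, &timeToWait);
    pthread_mutex_unlock(&mono_mutex);

    pthread_cond_destroy(&mono_cond);
    pthread_condattr_destroy(&attr);
}

int main(void)
{
    mywait_monotonic(1);   /* wait roughly one second regardless of clock adjustments */
    return 0;
}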
nanosleep allows you to specify the sleep duration down to nanoseconds. However, the actual resolution of your sleep is likely to be much coarser due to kernel/CPU limitations.
Accurate nanosecond resolution is going to be impossible on a general Linux OS, because general Linux distributions are not (hard) real-time OSes. If you really need that fine-grained control over timing, consider using such an operating system.
Wikipedia has a list of real-time operating systems here: http://en.wikipedia.org/wiki/RTOS (note that it doesn't say whether they are soft or hard real-time, so you'll have to do some research).
On an embedded system with access to multiple hardware timers, create a high-speed timer for your nanosecond or microsecond waits: create a macro to enable and disable it, and handle your high-resolution processing in the timer's interrupt service routine.
If wasting power and busy-waiting is not an issue, perform some no-op instructions - but verify that the compiler does not optimize your no-ops away. Try using volatile types.
