This concerns the Microsoft/Visual Studio and Intel/AMD-specific implementation only.
Say I declare a global variable:
volatile __declspec(align(16)) ULONG vFlags = 0;
And, say, I have multiple contending threads:
//Thread 1
ULONG prevFlags;
prevFlags = InterlockedExchange(&vFlags, 0);
if(prevFlags != 0)
{
//Do work
}
and then from other threads, I do:
//Thread N
vFlags = SomeNonZeroValue;
So say, on a multi-CPU system, at the moment thread 1 is executing the locked InterlockedExchange instruction, some other threads come to execute the vFlags = 2 and vFlags = 4 assignments.
What would happen in that case? Would vFlags = 2 and vFlags = 4 be stalled until InterlockedExchange completes, or would they disregard that lock?
Or do I need to use this instead?
//Thread N
InterlockedOr(&vFlags, SomeNonZeroValue);
Instructions that don't use locks to update a variable do not interact with instructions that do. Locking is a cooperative process that all participants must observe in order for it to work. So, yes, updating the flag with a simple assignment on one thread will not be blocked by another thread calling InterlockedExchange.
On the other hand, assigning different values to variables that are read by other threads raises the issue of visibility across cores since other threads may not immediately, or indeed ever, see the updates. InterlockedExchange solves this issue as well by providing implicit memory fences.
In conclusion, I would use InterlockedExchange in all threads updating the flag.
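For illustration, a minimal sketch of what that looks like (signal_work is a hypothetical helper; SomeNonZeroValue is the placeholder from the question):
#include <windows.h>

volatile __declspec(align(16)) ULONG vFlags = 0;

// Thread N: publish a non-zero value with a locked instruction, so the
// store is atomic and carries a full memory fence.
void signal_work(ULONG SomeNonZeroValue)
{
    InterlockedExchange(&vFlags, SomeNonZeroValue);
    // Or, if concurrent setters must not overwrite each other's bits:
    // InterlockedOr(&vFlags, SomeNonZeroValue);
}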
Related
I want to calculate the context switch time, and I am thinking of using a mutex and condition variables to signal between 2 threads so that only one thread runs at a time. I can use CLOCK_MONOTONIC to measure the entire execution time and CLOCK_THREAD_CPUTIME_ID to measure how long each thread runs.
Then the context switch time is the (total_time - thread_1_time - thread_2_time).
To get a more accurate result, I can just loop over it and take the average.
Is this a correct way to approximate the context switch time? I can't think of anything that might go wrong, but I am getting answers that are under 1 nanosecond.
I forgot to mention that the more times I loop it over and take the average, the smaller the results I get.
Edit
Here is a snippet of the code that I have:
typedef struct
{
    struct timespec start;
    struct timespec end;
} thread_time;
...
// each thread function looks similar to this
void* thread_1_func(void* time)
{
    // renamed from 'thread_time': a variable named after the typedef
    // hides the type mid-declaration and breaks the cast
    thread_time* t = (thread_time*) time;
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->start));
    for(int x = 0; x < loop; ++x) // 'loop' is an iteration count defined elsewhere
    {
        //where it switches to another thread
    }
    clock_gettime(CLOCK_THREAD_CPUTIME_ID, &(t->end));
    return NULL;
}
void* thread_2_func(void* time)
{
//similar as above
}
int main()
{
...
pthread_t thread_1;
pthread_t thread_2;
thread_time thread_1_time;
thread_time thread_2_time;
struct timespec start, end;
// stamps the start time
clock_gettime(CLOCK_MONOTONIC, &start);
// create two threads with the time structs as the arguments
pthread_create(&thread_1, NULL, &thread_1_func, (void*) &thread_1_time);
pthread_create(&thread_2, NULL, &thread_2_func, (void*) &thread_2_time);
// waits for the two threads to terminate
pthread_join(thread_1, NULL);
pthread_join(thread_2, NULL);
// stamps the end time
clock_gettime(CLOCK_MONOTONIC, &end);
// then I calculate the difference between the total execution time and the CPU times of the two threads..
}
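(For reference, the subtraction in that last comment would look something like this sketch; elapsed_ns is just an illustrative helper:)
// Illustrative helper: difference between two timespecs, in nanoseconds.
static long long elapsed_ns(const struct timespec* a, const struct timespec* b)
{
    return (long long)(b->tv_sec - a->tv_sec) * 1000000000LL
         + (b->tv_nsec - a->tv_nsec);
}

/* after the two pthread_join calls in main: */
long long switch_ns = elapsed_ns(&start, &end)
                    - elapsed_ns(&thread_1_time.start, &thread_1_time.end)
                    - elapsed_ns(&thread_2_time.start, &thread_2_time.end);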
First of all, using CLOCK_THREAD_CPUTIME_ID is probably very wrong: this clock gives the time spent in that thread, in user mode. The context switch, however, does not happen in user mode, so you'd want to use another clock. Also, on multiprocessing systems the clocks can give different values from one processor to another! Thus I suggest you use CLOCK_REALTIME or CLOCK_MONOTONIC instead. But be warned that even if you read either of these twice in rapid succession, the timestamps will usually already be tens of nanoseconds apart.
As for context switches - there are many kinds of context switches. The fastest approach is to switch from one thread to another entirely in software. This just means that you push the old registers onto the stack, set the task-switched flag so that the SSE/FP registers will be lazily saved, save the stack pointer, load the new stack pointer, and return from that function - since the other thread had done the same, the return from that function happens in the other thread.
This thread-to-thread switch is quite fast; its overhead is about the same as for any system call. Switching from one process to another is much slower: the user-space page tables must be flushed and switched by reloading the CR3 register, and this causes misses in the TLB, which maps virtual addresses to physical ones.
However, the <1 ns context switch/system call overhead does not really seem plausible - most likely there is either hyperthreading or 2 CPU cores in play here, so I suggest you set the CPU affinity on the process so that Linux only ever runs it on, say, the first CPU core:
#define _GNU_SOURCE /* sched_setaffinity is a GNU extension */
#include <sched.h>

cpu_set_t mask;
int result;

CPU_ZERO(&mask);
CPU_SET(0, &mask);
result = sched_setaffinity(0, sizeof(mask), &mask); /* pid 0 = calling process */
Then you should be pretty sure that the time you're measuring comes from a real context switch. Also, to measure the time for switching the floating point/SSE state (this happens lazily), you should have some floating point variables and do calculations on them prior to the context switch, then add, say, 0.1 to some volatile floating point variable after the context switch to see if it has an effect on the switching time. A sketch follows.
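A sketch of that experiment (the function names and the placement of the switch point are assumptions, not a complete harness):
volatile double fp_probe = 1.0; /* volatile so the touches are not optimized away */

/* Run in a thread before the switch point to dirty the FP/SSE registers. */
void dirty_fp_state(void)
{
    double acc = 0.0;
    for (int i = 0; i < 10000; i++)
        acc += fp_probe * 1.0001;
    fp_probe = acc; /* keep the result live */
}

/* Run after the switch: the first FP use pays the lazy save/restore. */
void touch_fp_after_switch(void)
{
    fp_probe += 0.1;
}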
This is not straightforward, but as usual someone has already done a lot of work on this. (I'm not including the source here because I cannot see any license mentioned.)
https://github.com/tsuna/contextswitch/blob/master/timetctxsw.c
If you copy that file to a Linux machine as context_switch_time.c, you can compile and run it like this:
gcc -D_GNU_SOURCE -Wall -O3 -std=c11 context_switch_time.c -lpthread
./a.out
I got the following result on a small VM
2000000 thread context switches in 2178645536ns (1089.3ns/ctxsw)
This question has come up before... for Linux you can find some material here.
Write a C program to measure time spent in context switch in Linux OS
Note that while the user in the above link was running the test, they were also hammering the machine with games and compilation jobs, which is why the context switches were taking so long. Some more info here:
how can you measure the time spent in a context switch under java platform
I have an application which plays the role of a network server: it uses pthread_create to start multiple threads, and each thread listens on a particular TCP port and accepts multiple TCP socket connections.
Now suppose that after some time (for example, 60 seconds) all network clients (TCP socket clients) have closed, but my application is still running those threads and listening on those ports. How do I collect data (such as total_bytes received) from the threads my application created?
One solution I currently use: whenever new data arrives on an accepted socket, the corresponding thread updates a static variable under a pthread_mutex_t. But I suspect this is inefficient and that the mutex wastes time.
Is there any lock-free way to do this?
Is there any solution using "per_cpu" counters, like those used in network drivers, without a lock/mutex?
Or: I could avoid updating Receiver_Total_Bytes when receiving n bytes from the socket (read()), and instead keep accumulating the total within each thread. But then the question is, how do I get the byte total from a thread that has not finished?
=== pseudo-code ===
static long Receiver_Total_Bytes = 0; /* 'register' is not valid at file scope */
static pthread_mutex_t Summarizer_Mutex = PTHREAD_MUTEX_INITIALIZER;

void add_server_transfer_bytes(long bytes)
{
    pthread_mutex_lock( &Summarizer_Mutex );
    Receiver_Total_Bytes += bytes;
    pthread_mutex_unlock( &Summarizer_Mutex );
}

void reset_server_transfer_bytes(void)
{
    pthread_mutex_lock( &Summarizer_Mutex );
    Receiver_Total_Bytes = 0;
    pthread_mutex_unlock( &Summarizer_Mutex );
}
Then in socket read:
if((n = read(i, buffer, bytes_to_be_read)) > 0) {
............
add_server_transfer_bytes(n);
Another option is to allocate a structure for each thread, and have that structure include the desired counters, say connections and total_bytes, at least.
The thread itself should increment these using atomic built-ins:
__sync_fetch_and_add(&(threadstruct->connections), 1);
__sync_fetch_and_add(&(threadstruct->total_bytes), bytes);
or
__atomic_fetch_add(&(threadstruct->connections), 1, __ATOMIC_SEQ_CST);
__atomic_fetch_add(&(threadstruct->total_bytes), bytes, __ATOMIC_SEQ_CST);
These are slightly slower than non-atomic operations, but the overhead is very small, if there is no contention. (In my experience, cacheline ping-pong -- when different CPUs try to access the variable at the same time -- is a significant risk and a real-world cause for slowdown, but here the risk is minimal. At worst, only the current thread and the main thread may access the variables at the same time. Of course, the main thread should not calculate the summaries too often; say, once or twice a second should be enough.)
Because the structure is allocated in the main thread, the main thread can also access the counters. To collect the totals, it'll use a loop, and inside the loop,
overall_connections += __sync_fetch_and_add(&(threadstruct[thread]->connections), 0);
overall_total_bytes += __sync_fetch_and_add(&(threadstruct[thread]->total_bytes), 0);
or
overall_connections += __atomic_load_n(&(threadstruct[thread]->connections), __ATOMIC_SEQ_CST);
overall_total_bytes += __atomic_load_n(&(threadstruct[thread]->total_bytes), __ATOMIC_SEQ_CST);
See the GCC manual for further information on the __atomic and __sync built-in functions. Other C compilers like Intel CC also provide these -- or at least used to; the last time I verified this was a few years ago. The __sync ones are older (and more widely supported in older compiler versions), but the __atomic ones reflect the memory models specified in C11, so are more likely to be supported by future C compilers.
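For concreteness, the per-thread structure those built-ins operate on might look like this (the allocation scheme is an assumption):
/* One instance per thread, allocated by the main thread and passed to the
   worker via pthread_create(); only the owning thread increments the
   counters, and the main thread reads them with the atomics shown above. */
struct thread_stats {
    unsigned long connections;
    unsigned long total_bytes;
};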
Yes, your concerns are warranted. The worst thing you can do here is to use a mutex, as suggested in another answer. Mutexes preempt threads, so they really are a multithreader's worst enemy. The other thing which might come to mind is to use atomic operations for incrementing (also mentioned in the same answer). A terrible idea as well! Atomic operations perform very poorly under contention (an atomic increment can actually be a loop, retrying until it succeeds). Since I imagine contention will be high in the case described, atomics will behave badly.
The other problem with atomics and mutexes alike is that they enforce memory ordering and impose barriers. Not a good thing for performance!
The real solution to the question is, of course, having each thread use its own private counter. It is not per-CPU, it is per-thread. Once the threads are done, those counters can be accumulated, as in the sketch below.
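A minimal sketch of that idea, assuming the totals are only needed after the threads finish (all names are illustrative):
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Written only by the owning thread; read by main after pthread_join(). */
typedef struct {
    long long total_bytes;
} per_thread_stats;

static per_thread_stats stats[NUM_THREADS];

static void* worker(void* arg)
{
    per_thread_stats* st = arg;
    for (int i = 0; i < 1000; i++)
        st->total_bytes += 42; /* stand-in for bytes returned by read() */
    return NULL;
}

int main(void)
{
    pthread_t tid[NUM_THREADS];
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, &stats[i]);

    long long total = 0;
    for (int i = 0; i < NUM_THREADS; i++) {
        pthread_join(tid[i], NULL);    /* join is the synchronization point */
        total += stats[i].total_bytes; /* safe to read after the join */
    }
    printf("total: %lld\n", total);
    return 0;
}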
I have a C application, part of which does some threaded stuff that I'm having some difficulty implementing. I'm using pthread.h (POSIX threads) as a guideline.
I need to synchronize two threads that repeat a certain task a predefined number of times, and with each repetition the two tasks need to start at the same time. My idea is to let each thread initialize and do its work before the synchronized task begins; when this happens, thread 1 (let's call it TX) will signal thread 2 (RX) that it can begin doing the task.
Here's an example:
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tx_condvar = PTHREAD_COND_INITIALIZER;
static bool tx_ready = false;
These are declared in a header file. The TX thread is shown below:
while (reps > 0 && !task->quit) {
pthread_mutex_lock(&mutex);
tx_ready = true;
pthread_cond_signal(&tx_condvar);
pthread_mutex_unlock(&mutex);
status = do_stuff();
if (status != 0) {
print_error();
goto tx_task_out;
}
reps--;
// one task done, wait till it's time to do the next
usleep(delay_value_us);
tx_ready = false;
}
And then the RX:
while (!done && !task->quit) {
// wait for the tx_ready signal before carrying on
pthread_mutex_lock(&mutex);
while (!tx_ready){
pthread_cond_wait(&tx_condvar, &mutex);
}
pthread_mutex_unlock(&mutex);
status = do_stuff();
if (status != 0) {
print_error();
goto rx_task_out;
}
n = fwrite(samples, 2 * sizeof(samples[0]), to_rx, p->out_file);
num_rx += to_rx;
if (num_rx == s->rx_length){
done = true;
}
}
Is there a better way to handle this, and am I even doing it correctly? It's incredibly important that the two tasks within the tx/rx threads start at the same time for each repetition.
Thanks in advance for your input!
What you are looking for is called a barrier. Basically, it blocks threads entering the barrier until a certain number of threads have arrived, and then it releases them all.
I believe pthreads has a barrier, although it might be an optional extension.
https://computing.llnl.gov/tutorials/pthreads/man/pthread_barrier_wait.txt
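A minimal sketch of how a barrier would slot into the TX/RX loops from the question (error handling omitted; names are illustrative):
#include <pthread.h>
#include <stdio.h>

#define REPS 3

static pthread_barrier_t rep_barrier;

static void* worker(void* name)
{
    for (int rep = 0; rep < REPS; rep++) {
        /* Both threads block here; once both have arrived they are
           released together, so each repetition starts in lock-step.
           This replaces the tx_ready flag/condvar handshake. */
        pthread_barrier_wait(&rep_barrier);
        printf("%s starting repetition %d\n", (const char*)name, rep);
        /* ... do_stuff() for this repetition ... */
    }
    return NULL;
}

int main(void)
{
    pthread_t tx, rx;
    pthread_barrier_init(&rep_barrier, NULL, 2); /* 2 participants */
    pthread_create(&tx, NULL, worker, "TX");
    pthread_create(&rx, NULL, worker, "RX");
    pthread_join(tx, NULL);
    pthread_join(rx, NULL);
    pthread_barrier_destroy(&rep_barrier);
    return 0;
}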
First, if you are not using a multi-core processor, you cannot run two tasks at precisely the same time.
Starting two tasks at precisely the same time requires a multi-core system (an SMP architecture, where multiple cores with shared memory run under a single OS). Several development environments offer extensions specifically for taking advantage of features such as processor affinity, where you can dedicate a particular thread to run only on a specific core, or dedicate processors to running specific threads, and control when to run them.
I use LabWindows/CVI (ANSI C with extensions for instrumentation). There is a small white paper on multi-core capabilities, and another that is mostly NI-specific but also includes (toward the bottom) some generic techniques for time-critical loops that apply to any ANSI C compiler.
I've written a program that executes some calculations and then merges the results.
I've used multi-threading to calculate in parallel.
During the merge phase, each thread locks the global array, appends its individual part to it, and does some extra work to eliminate repetitions.
I tested it and found that the cost of merging increases with the number of threads, at an unexpected rate:
2 threads: 40,116,084 (us)
6 threads: 511,791,532 (us)
Why? What happens as the number of threads increases, and how do I change this?
Edit
Actually, the code is very simple; here is the pseudo-code:
typedef struct {
    long no;
    int count;
    double value;
    //something else
} my_object_t;
static my_object_t** global_result_array; // about ten thousand entries
static pthread_mutex_t global_lock = PTHREAD_MUTEX_INITIALIZER;
void* thread_function(void* arg){
my_object_t** local_result;
int local_result_number;
int i;
my_object_t* ptr;
for(;;){
if( exit_condition ){ return NULL;}
if( merge_condition){
//start time point to log
pthread_mutex_lock( &global_lock);
for( i = local_result_number-1; i >= 0; i-- ){
ptr = local_result[ i] ;
if( NULL == global_result_array[ ptr->no] ){
global_result_array[ ptr->no] = ptr; //step 4
}else{
global_result_array[ ptr->no] -> count += ptr->count; // step 5
global_result_array[ ptr->no] -> value += ptr->value; // step 6
}
}
pthread_mutex_unlock( &global_lock); // end time point to log
}else{
//do some calculation and produce the partial, thread-local result, namely local_result and local_result_number
}
}
}
As above, the difference between two threads and six threads lies in steps 5 and 6; I counted on the order of hundreds of millions of executions of steps 5 and 6. Everything else is the same.
So, in my view, the merge operation is very light; whether using 2 threads or 6 threads, they all need to take the lock and merge exclusively.
Another astonishing thing: when using six threads, the cost of step 4 exploded! It was the root cause of the total cost exploding!
btw: The test server has two CPUs, and each CPU has four cores.
There are various reasons for the behaviour shown:
More threads mean more locking and more blocking time among threads. As is apparent from your description, your implementation uses mutex locks or something similar. The speed-up with threads is better if the data sets are largely disjoint.
Unless your system has as many processors/cores as there are threads, not all of them can run concurrently. You can set the desired concurrency level using pthread_setconcurrency; a short sketch follows.
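(For reference, a sketch of that call; note that on Linux/NPTL pthread_setconcurrency only records a hint and does not change scheduling:)
#define _XOPEN_SOURCE 600 /* pthread_setconcurrency is an XSI interface */
#include <pthread.h>

int main(void)
{
    pthread_setconcurrency(6); /* hint: we would like 6 threads running concurrently */
    /* ... create threads and do the work ... */
    return 0;
}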
Context switching is an overhead; hence the difference. If your computer had 6 cores, it would be faster. Otherwise you need more context switches for the threads.
This is a huge performance difference between 2 and 6 threads. I'm sorry, but you have to try very hard indeed to produce such a huge discrepancy. You seem to have succeeded :((
As others have pointed out, using multiple threads on one data set only becomes worth it if the time spent on inter-thread communication (locks etc.) is less than the time gained by the concurrent operations.
If, for example, you find that you are merging successively smaller data sections (e.g. with a merge sort), you end up spending ever more of the time on inter-thread comms and cache-thrashing. This is why multi-threaded merge-sorts frequently switch to an in-place sort once the data has been divided up into chunks smaller than the L1 cache.
'each thread will lock the global array' - try not to do this. Locking large data structures for extended periods, or continually locking them for successive short periods, is a very bad plan. Locking the global array once serializes the threads, effectively giving you one thread plus a lot of inter-thread comms. Continually locking/releasing gives you one thread with far, far too much inter-thread comms.
Once the operations get so short that the returns are diminished to the point of uselessness, you would be better off queueing those operations to one thread that finishes off the job on its own.
Locking is often grossly over-used and/or misused. If I find myself locking anything for longer than the time taken to push/pop a pointer onto a queue or similar, I start to get jittery.
Without seeing/analysing the code and, more importantly, the data (I guess both are complex), it's difficult to give any direct advice :(
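To make the earlier "one thread finishes off the job" suggestion concrete, here is a sketch, under the assumption that the merge can wait until the workers finish (it reuses my_object_t and global_result_array from the question; other names are illustrative):
/* Each worker fills only its own private output; no lock needed. */
typedef struct {
    my_object_t** local_result;
    int local_result_number;
} worker_output;

/* Main thread, after pthread_join() on every worker: the whole merge
   runs serially in one thread, so no lock is held at all. */
void merge_all(worker_output* out, int nworkers)
{
    for (int w = 0; w < nworkers; w++) {
        for (int i = 0; i < out[w].local_result_number; i++) {
            my_object_t* ptr = out[w].local_result[i];
            if (global_result_array[ptr->no] == NULL) {
                global_result_array[ptr->no] = ptr;
            } else {
                global_result_array[ptr->no]->count += ptr->count;
                global_result_array[ptr->no]->value += ptr->value;
            }
        }
    }
}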
I have a fairly specific question about concurrent programming in C. I have done a fair bit of research on this but have seen several conflicting answers, so I'm hoping for some clarification. I have a program that's something like the following (sorry for the longish code block):
typedef struct {
pthread_mutex_t mutex;
/* some shared data */
int eventCounter;
} SharedData;
SharedData globalSharedData;
typedef struct {
/* details unimportant */
} NewData;
void newData(NewData data) {
int localCopyOfCounter;
if (/* information contained in new data triggers an
event */) {
pthread_mutex_lock(&globalSharedData.mutex);
localCopyOfCounter = ++globalSharedData.eventCounter;
pthread_mutex_unlock(&globalSharedData.mutex);
}
else {
return;
}
/* Perform long running computation. */
if (localCopyOfCounter != globalSharedData.eventCounter) {
/* A new event has happened, old information is stale and
the current computation can be aborted. */
return;
}
/* Perform another long running computation whose results
depend on the previous one. */
if (localCopyOfCounter != globalSharedData.eventCounter) {
/* Another check for new event that causes information
to be stale. */
return;
}
/* Final stage of computation whose results depend on two
previous stages. */
}
There is a pool of threads servicing the connection for incoming data, so multiple instances of newData can be running at the same time. In a multi-processor environment there are two problems I'm aware of in getting the counter handling part of this code correct: preventing the compiler from caching the shared counter copy in a register so other threads can't see it, and forcing the CPU to write the store of the counter value to memory in a timely fashion so other threads can see it. I would prefer not to use a synchronization call around the counter checks because a partial read of the counter value is acceptable (it will produce a value different than the local copy, which should be adequate to conclude that an event has occurred). Would it be sufficient to declare the eventCounter field in SharedData to be volatile, or do I need to do something else here? Also is there a better way to handle this?
Unfortunately, the C standard says very little about concurrency. However, most compilers (gcc and msvc, anyway) will treat a volatile read as having acquire semantics - the volatile variable will be reloaded from memory on every access. That is desirable: your code as it stands may end up comparing values cached in registers. I wouldn't even be surprised if both comparisons were optimized out.
So the answer is yes, make the eventCounter volatile. Alternatively, if you don't want to restrict your compiler too much, you can use the following function to perform reads of eventCounter.
int load_acquire(volatile int * counter) { return *counter; }
if (localCopy != load_acquire(&sharedCopy))
// ...
"preventing the compiler from caching the local counter copy in a register so other threads can't see it"
Your local counter copy is "local": created on the execution stack and visible only to the running thread. Every other thread runs on a different stack and has its own local counter variable, so there is no concurrency issue there.
Your global counter should be declared volatile to avoid register optimization.
You can also use hand-coded assembly or compiler intrinsics, which guarantee atomic checks without needing your mutex; they can also atomically ++ and -- your counter.
volatile is useless these days, for the most part; you should look at memory barriers, which are another low-level CPU facility for dealing with multi-core contention.
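As a rough sketch of the intrinsics route (GCC __sync built-ins; MSVC has InterlockedIncrement/InterlockedExchangeAdd equivalents), reusing the SharedData from the question:
/* Atomic increment with a full barrier, replacing the mutex-guarded ++ : */
int bump_event_counter(void)
{
    return __sync_add_and_fetch(&globalSharedData.eventCounter, 1);
}

/* Barriered read of the counter, for the staleness checks: */
int read_event_counter(void)
{
    return __sync_fetch_and_add(&globalSharedData.eventCounter, 0);
}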
However, the best advice I can give would be for you to bone up on the various managed and native multi-core support libraries. Some of the older ones like OpenMP or MPI (message-based) are still kicking, and people will go on about how cool they are; for most developers, though, something like Intel's TBB or Microsoft's newer APIs is a better fit. I also just dug up this Code Project article, whose author apparently uses cmpxchg8b - the low-level hardware route I mentioned initially...
Good luck.