Linux, C: Accumulate data from multiple threads

I have an application that acts as a network server: it uses pthread_create to start multiple threads, and each thread listens on a particular TCP port and accepts multiple TCP socket connections.
Now suppose that after some time, say 60 seconds, all network clients (TCP socket clients) have disconnected, but my application is still running those threads and listening on those ports. How do I collect data (such as total_bytes received) from the threads my application created?
One solution I currently use: when new data arrives on an accepted socket, the corresponding thread updates a static variable under a pthread_mutex_t. But I suspect this is inefficient and wastes time on the mutex.
Is there any lock-free way to do this?
Is there any solution like the "per-cpu" counters used in network drivers, i.e. without a lock/mutex?
Or: I don't update Receiver_Total_Bytes each time I read() n bytes from a socket. Instead, I keep a running total within the thread. But then the question is, how do I get the total byte count from a thread that hasn't finished yet?
===pseudo code===
static long Receiver_Total_Bytes = 0;
static pthread_mutex_t Summarizer_Mutex = PTHREAD_MUTEX_INITIALIZER;

void add_server_transfer_bytes(long bytes)
{
    pthread_mutex_lock(&Summarizer_Mutex);
    Receiver_Total_Bytes += bytes;
    pthread_mutex_unlock(&Summarizer_Mutex);
}

void reset_server_transfer_bytes(void)
{
    pthread_mutex_lock(&Summarizer_Mutex);
    Receiver_Total_Bytes = 0;
    pthread_mutex_unlock(&Summarizer_Mutex);
}
Then in socket read:
if ((n = read(i, buffer, bytes_to_be_read)) > 0) {
    /* ... */
    add_server_transfer_bytes(n);
}

Another option is to allocate a structure for each thread, and have that structure include the desired counters, say connections and total_bytes, at least.
The thread itself should increment these using atomic built-ins:
__sync_fetch_and_add(&(threadstruct->connections), 1);
__sync_fetch_and_add(&(threadstruct->total_bytes), bytes);
or
__atomic_fetch_add(&(threadstruct->connections), 1, __ATOMIC_SEQ_CST);
__atomic_fetch_add(&(threadstruct->total_bytes), bytes, __ATOMIC_SEQ_CST);
These are slightly slower than non-atomic operations, but the overhead is very small, if there is no contention. (In my experience, cacheline ping-pong -- when different CPUs try to access the variable at the same time -- is a significant risk and a real-world cause for slowdown, but here the risk is minimal. At worst, only the current thread and the main thread may access the variables at the same time. Of course, the main thread should not calculate the summaries too often; say, once or twice a second should be enough.)
Because the structure is allocated in the main thread, the main thread can also access the counters. To collect the totals, it'll use a loop, and inside the loop,
overall_connections += __sync_fetch_and_add(&(threadstruct[thread]->connections), 0);
overall_total_bytes += __sync_fetch_and_add(&(threadstruct[thread]->total_bytes), 0);
or
overall_connections += __atomic_load_n(&(threadstruct[thread]->connections), __ATOMIC_SEQ_CST);
overall_total_bytes += __atomic_load_n(&(threadstruct[thread]->total_bytes), __ATOMIC_SEQ_CST);
See the GCC manual for further information on the __atomic and __sync built-in functions. Other C compilers like Intel CC also provide these -- or at least used to; the last time I verified this was a few years ago. The __sync ones are older (and more widely supported in older compiler versions), but the __atomic ones reflect the memory models specified in C11, so are more likely to be supported by future C compilers.
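For concreteness, here is a minimal sketch of that scheme; the struct layout and helper names are illustrative, not taken from the question's code:
#include <stddef.h>
#include <stdint.h>

struct thread_stats {
    uint64_t connections;
    uint64_t total_bytes;
};

/* One entry per worker thread, allocated by the main thread. */
static struct thread_stats **threadstruct;
static size_t num_threads;

/* Called from a worker thread whenever read() returns n > 0 bytes. */
static void account_bytes(struct thread_stats *ts, uint64_t bytes)
{
    __atomic_fetch_add(&ts->total_bytes, bytes, __ATOMIC_SEQ_CST);
}

/* Called from the main thread, e.g. once or twice per second. */
static uint64_t collect_total_bytes(void)
{
    uint64_t overall = 0;
    for (size_t i = 0; i < num_threads; i++)
        overall += __atomic_load_n(&threadstruct[i]->total_bytes, __ATOMIC_SEQ_CST);
    return overall;
}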

Yes, your concerns are warranted. The worst thing you can do here is to use a mutex as suggested in another answer. Mutexes preempt threads, so they really are a multithreaded program's worst enemy. The other thing which might come to mind is to use atomic operations for incrementing (also mentioned in the same answer). A terrible idea as well! Atomic operations perform very poorly under contention (an atomic increment is actually a loop which retries until it succeeds). Since in the case described I imagine the contention will be high, atomics will behave badly.
The other problem with atomics and mutexes alike is that they enforce memory ordering and impose barriers. Not a good thing for performance!
The real solution to the question is, of course, to have each thread use its own private counter. It is not per-CPU, it is per-thread. Once the threads are done, those counters can be accumulated.
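A minimal sketch of that per-thread idea, assuming the worker threads eventually finish so the main thread can join them (all names are illustrative):
#include <pthread.h>
#include <stdint.h>

struct worker {
    pthread_t tid;
    uint64_t  bytes;    /* written only by this worker: no sharing, no lock */
};

static void *worker_main(void *arg)
{
    struct worker *w = arg;
    /* ... accept()/read() loop ... */
    /* w->bytes += n;   -- plain increment, no lock, no atomic */
    return NULL;
}

/* Main thread, once the workers are done: */
static uint64_t sum_workers(struct worker *workers, int n)
{
    uint64_t total = 0;
    for (int i = 0; i < n; i++) {
        pthread_join(workers[i].tid, NULL);
        total += workers[i].bytes;   /* the join makes the writes visible */
    }
    return total;
}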

Related

Trying to understand Interlocked* functions

This concerns the Microsoft/Visual Studio and Intel/AMD-specific implementation only.
Say I declare a global variable:
volatile __declspec(align(16)) ULONG vFlags = 0;
And, say, I have multiple contending threads:
//Thread 1
ULONG prevFlags;
prevFlags = InterlockedExchange(&vFlags, 0);
if (prevFlags != 0)
{
    //Do work
}
and then from other threads, I do:
//Thread N
vFlags = SomeNonZeroValue;
So say, on a multi-CPU system, that at the moment thread 1 is executing the locked InterlockedExchange instruction, some other threads are executing the vFlags = 2 and vFlags = 4 assignments.
What would happen in that case? Would vFlags = 2 and vFlags = 4 be stalled until InterlockedExchange completes, or will it disregard that lock?
Or do I need to use this instead?
//Thread N
InterlockedOr(&vFlags, SomeNonZeroValue);
Instructions that don't use locks to update a variable do not interact with instructions that do. Locking is a cooperative process that all participants must observe in order for it to work. So, yes, updating the flag with a simple assignment on one thread will not be blocked by another thread calling InterlockedExchange.
On the other hand, assigning different values to variables that are read by other threads raises the issue of visibility across cores since other threads may not immediately, or indeed ever, see the updates. InterlockedExchange solves this issue as well by providing implicit memory fences.
In conclusion, I would use InterlockedExchange in all threads updating the flag.
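As a short illustrative sketch of that conclusion (Windows-specific; vFlags is declared LONG here to match the Interlocked* signatures, and the function names are placeholders):
#include <windows.h>

static volatile LONG vFlags = 0;

/* Thread N: publish work by OR-ing in a bit, atomically and visibly. */
void publish_work(LONG someNonZeroValue)
{
    InterlockedOr(&vFlags, someNonZeroValue);
}

/* Thread 1: atomically take and clear all pending flags. */
void consume_work(void)
{
    LONG prevFlags = InterlockedExchange(&vFlags, 0);
    if (prevFlags != 0)
    {
        /* Do work based on prevFlags. */
    }
}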

Monitoring Thread performance of server

I have developed a C server using gcc and pthreads that receives UDP packets and depending on the configuration either drops or forwards them to specific targets. In some cases these packets are untouched and just redirected, in some cases headers in the packet are modified, in other cases there is another module of the server that modifies every byte of the packet.
To configure this server, there is a GUI written in Java that connects to the C Server using TCP (to exchange configuration commands). There can be multiple connected GUIs at the same time.
In order to measure utilization of the server I have written kind of a module that starts two separate threads (#2 & #3). The main thread (#1) that does the whole forwarding work essentially works like the following:
struct monitoring_struct data; //contains 2 * uint64_t for start and end time among other fields
for (;;) {
    recvfrom();
    data.start = current_time();
    modifyPacket();
    sendPacket();   //sometimes to multiple destinations
    data.end = current_time();
    writeDataToPipe();
}
The current_time function:
//give a timestamp in microsecond precision
uint64_t current_time(void){
struct timespec spec;
clock_gettime(CLOCK_REALTIME, &spec);
uint64_t ts = (uint64_t) ((((double) spec.tv_sec) * 1.0e6) +
(((double) spec.tv_nsec) / 1.0e3));
return ts;
}
As indicated in the main thread's loop, the data struct is written into a pipe, from which thread #2 waits to read. Every time there is data to be read from the pipe, thread #2 uses a given aggregation function that stores the data in another place in memory. Thread #3 is a loop that sleeps for ~1 sec and then sends out the aggregated values (median, avg, min, max, lower quartile, upper quartile, ...) and then resets the aggregated data. Threads #2 and #3 are synchronized by mutexes.
The GUI listens to this data (if the monitoring window is open) which is sent out via UDP to listeners (there can be more) and the GUI then converts the numbers into diagrams, graphs and "pressure" indicators.
I came up with this as this is in my mind the solution that interferes least of all with thread #1 (assuming that it is run on a multicore system, which it always is, and exclusively besides OS and maybe SSH).
As performance is critical for my server (version "1.0", with a simpler configuration, was able to handle the maximum number of streams possible over gigabit Ethernet), I would like to ask whether my solution is really as good as I think it is at minimizing the performance hit on thread #1, and whether you think there are better designs for this. At least I am unable to think of another solution that does not either use locks on the data itself (avoiding the pipe, but potentially blocking thread #1) or a shared list implementation using an rwlock, with possible reader starvation.
There are scenarios where packets are larger, but for performance measuring we currently use a mode where one stream sends exactly 1000 packets per second. We want to ensure that version 2.0 can work with at least 12 streams (hence 12,000 packets per second); previously the server was able to manage 84 streams.
In the future I would like to add other milestone timestamps to thread #1, e.g. inside modifyPacket() (there are multiple steps) and before sendPacket().
I have tried tinkering with the current_time() function, mostly trying to remove it and save time by just storing the raw clock_gettime() value, but in my simple test program the current_time() function always beat the raw clock_gettime() approach.
Thanks in advance for any input.
if you think there are better designs for that?
The short answer is to use Data Plane Development Kit (DPDK) with its design patterns and libraries. It might be quite a learning curve, but in terms of performance it is the best solution at the moment. It is free and open source (BSD license).
A bit more detailed answer:
the data struct is written into a pipe
Since threads #1 and #2 belong to the same process, it would be much faster to pass data through shared memory rather than a pipe, just as you already do between threads #2 and #3.
thread #2 uses a given aggregation function that stores the data in another place in memory
Those two threads seem unnecessary. Couldn't thread #2 read the data passed by thread #1, aggregate it, and send it out itself?
I am unable to think of another solution that is not using locks on the data itself
Have a look at the lockless queues which are called "rings" in DPDK. The idea is to have a common circular buffer between threads and use lockless algorithms to enqueue/dequeue to/from the buffer.
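To illustrate the idea only (this is not DPDK's actual rte_ring API), a hand-rolled single-producer/single-consumer ring between thread #1 and thread #2 could look roughly like this, using GCC's __atomic built-ins; the names and the drop-on-full policy are assumptions:
#include <stdbool.h>
#include <stdint.h>

#define RING_SIZE 1024                  /* must be a power of two */

struct monitoring_struct;               /* as in the question */

struct spsc_ring {
    struct monitoring_struct *slot[RING_SIZE];
    uint32_t head;                      /* written only by the consumer */
    uint32_t tail;                      /* written only by the producer */
};

/* Thread #1 (producer): returns false if the ring is full (drop or count it). */
static bool ring_enqueue(struct spsc_ring *r, struct monitoring_struct *m)
{
    uint32_t tail = r->tail;            /* only this thread writes tail */
    uint32_t head = __atomic_load_n(&r->head, __ATOMIC_ACQUIRE);
    if (tail - head == RING_SIZE)
        return false;                   /* full */
    r->slot[tail & (RING_SIZE - 1)] = m;
    __atomic_store_n(&r->tail, tail + 1, __ATOMIC_RELEASE);
    return true;
}

/* Thread #2 (consumer): returns NULL if the ring is empty. */
static struct monitoring_struct *ring_dequeue(struct spsc_ring *r)
{
    uint32_t head = r->head;            /* only this thread writes head */
    uint32_t tail = __atomic_load_n(&r->tail, __ATOMIC_ACQUIRE);
    if (head == tail)
        return NULL;                    /* empty */
    struct monitoring_struct *m = r->slot[head & (RING_SIZE - 1)];
    __atomic_store_n(&r->head, head + 1, __ATOMIC_RELEASE);
    return m;
}
No locks are needed because each index is written by exactly one thread; the release/acquire pairs make the record visible to the consumer before the index update is seen.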
We want to ensure that version 2.0 can work with at least 12 streams (hence 12,000 packets per second); previously the server was able to manage 84 streams.
Measure the performance and find the bottlenecks (it seems you are still not 100% sure where the bottleneck in the code is).
Just for reference, Intel publishes performance reports for DPDK. The reference numbers for L3 forwarding (i.e. routing) are up to 30 million packets per second.
Sure, you might have a less powerful processor and NIC, but a few million packets per second are achievable quite easily using the right techniques.

Cost of a blocked operation increases with the number of threads

I've written a program that executes some calculations and then merges the results.
I've used multi-threading to calculate in parallel.
During the merge phase, each thread locks the global array, appends its individual part to it, and does some extra work to eliminate duplicates.
I tested it and found that the cost of merging increases with the number of threads, and the rate is unexpected:
2 threads: 40,116,084 (us)
6 threads: 511,791,532 (us)
Why: what occurs when the number of threads increases? How do I change this?
--------------------------slash line -----------------------------------------------------
Actually, the code is very simple; here is the pseudo-code:
typedef struct my_object {
    long no;
    int count;
    double value;
    //something else
} my_object_t;

static my_object_t **global_result_array;   //about ten thousand entries
static pthread_mutex_t global_lock;

void *thread_function(void *arg)
{
    my_object_t **local_result;
    int local_result_number;
    int i;
    my_object_t *ptr;
    for (;;) {
        if (exit_condition) { return NULL; }
        if (merge_condition) {
            //start time point to log
            pthread_mutex_lock(&global_lock);
            for (i = local_result_number - 1; i >= 0; i--) {
                ptr = local_result[i];
                if (NULL == global_result_array[ptr->no]) {
                    global_result_array[ptr->no] = ptr;                 // step 4
                } else {
                    global_result_array[ptr->no]->count += ptr->count;  // step 5
                    global_result_array[ptr->no]->value += ptr->value;  // step 6
                }
            }
            pthread_mutex_unlock(&global_lock);  // end time point to log
        } else {
            //do some calculation and produce the partial, thread-local result,
            //namely local_result and local_result_number
        }
    }
}
As shown above, the difference between two threads and six threads lies in steps 5 and 6; I have counted that steps 5 and 6 execute on the order of hundreds of millions of times. Everything else is the same.
So, from my point of view, the merge operation is very light; whether using 2 threads or 6 threads, they both need to take the lock and do the merge exclusively.
Another astonishing thing: when using six threads, the cost of step 4 exploded! That was the root cause of the total cost blowing up.
BTW: the test server has two CPUs, and each CPU has four cores.
There are various reasons for the behaviour shown:
More threads means more locks and more blocking time among threads. As is apparent from your description, your implementation uses mutex locks or something similar. The speed-up with threads is better if the data sets are largely exclusive.
Unless your system has as many processors/cores as the number of threads, not all of them can run concurrently. You can set the maximum concurrency using pthread_setconcurrency.
Context switching is an overhead. Hence the difference. If your computer had 6 cores it would be faster; otherwise you need more context switches for the threads.
That is a huge performance difference between 2 and 6 threads. I'm sorry, but you have to try very hard indeed to create such a huge discrepancy. You seem to have succeeded :((
As others have pointed out, using multiple threads on one data set only becomes worth it if the time spent on inter-thread communication, (locks etc.), is less than the time gained by the concurrent operations.
If, for example, you find that you are merging successively smaller data sections, (eg. with a merge sort), you are effectively optimizing the time wasted on inter-thread comms and cache-thrashing. This is why multi-threaded merge-sorts are frequently started with an in-place sort once the data has been divided up into a chunk less than the size of the L1 cache.
'each thread will lock the global array' - try to not do this. Locking large data structures for extended periods, or continually locking them for successive short periods, is a very bad plan. Locking the global once serializes the threads and generates one thread with too much inter-thread comms. Continually locking/releasing generates one thread with far, far too much inter-thread comms.
Once the operations get so short that the returns are diminished to the point of uselessness, you would be better off queueing those operations to one thread that finishes off the job on its own (see the sketch after this answer).
Locking is often grossly over-used and/or misused. If I find myself locking anything for longer than the time taken to push/pop a pointer onto a queue or similar, I start to get jittery..
Without seeing/analysing the code and, more importantly, the data (I guess both are complex), it's difficult to give any direct advice :(
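To make the "queue the merge to one thread" suggestion above concrete, here is a rough sketch; it reuses my_object_t and global_result_array from the question's pseudo-code, but the job queue itself and its names are illustrative:
#include <pthread.h>
#include <stddef.h>

struct merge_job {
    my_object_t      **local_result;         /* a worker's finished batch */
    int                local_result_number;
    struct merge_job  *next;
};

static struct merge_job *job_head;
static pthread_mutex_t   job_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t    job_cond = PTHREAD_COND_INITIALIZER;

/* Worker side: the lock is held only long enough to push one pointer. */
void submit_merge(struct merge_job *job)
{
    pthread_mutex_lock(&job_lock);
    job->next = job_head;
    job_head  = job;
    pthread_cond_signal(&job_cond);
    pthread_mutex_unlock(&job_lock);
}

/* Dedicated merger thread: it owns global_result_array, so the actual merge
   (steps 4-6 from the question) runs with no lock held at all. */
void *merger_main(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&job_lock);
        while (job_head == NULL)
            pthread_cond_wait(&job_cond, &job_lock);
        struct merge_job *batch = job_head;   /* drain the whole list */
        job_head = NULL;
        pthread_mutex_unlock(&job_lock);

        for (struct merge_job *job = batch; job != NULL; job = job->next) {
            for (int i = job->local_result_number - 1; i >= 0; i--) {
                my_object_t *ptr = job->local_result[i];
                if (global_result_array[ptr->no] == NULL)
                    global_result_array[ptr->no] = ptr;
                else {
                    global_result_array[ptr->no]->count += ptr->count;
                    global_result_array[ptr->no]->value += ptr->value;
                }
            }
        }
        /* Freeing or recycling the jobs and local_result arrays is left to
           whatever ownership scheme the real program uses. */
    }
    return NULL;
}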

Implementing a FIFO mutex in pthreads

I'm trying to implement a binary tree supporting concurrent insertions (which could occur even between nodes), but without having to allocate a global lock or a separate mutex or mutexes for each node. Rather, the quantity of such locks allocated should be on the order of the quantity of threads using the tree.
Consequently, I end up with a type of lock convoy problem. Explained more simply, it's what potentially happens when two or more threads do the following:
1  for (;;) {
2      lock(mutex)
3      do_stuff
4      unlock(mutex)
5  }
That is, if Thread#1 executes instructions 4->5->1->2 all in one "CPU burst", then Thread#2 is starved of execution.
On the other hand, if there was a FIFO-type locking option for mutexes in pthreads, then such a problem could be avoided. So, is there a way to implement FIFO-type mutex locking in pthreads? Can altering thread priorities accomplish this?
You can implement a fair queuing system where each thread is added to a queue when it blocks, and the first thread on the queue always gets the resource when it becomes available. Such a "fair" ticket lock built on pthreads primitives might look like this:
#include <pthread.h>
typedef struct ticket_lock {
pthread_cond_t cond;
pthread_mutex_t mutex;
unsigned long queue_head, queue_tail;
} ticket_lock_t;
#define TICKET_LOCK_INITIALIZER { PTHREAD_COND_INITIALIZER, PTHREAD_MUTEX_INITIALIZER }
void ticket_lock(ticket_lock_t *ticket)
{
unsigned long queue_me;
pthread_mutex_lock(&ticket->mutex);
queue_me = ticket->queue_tail++;
while (queue_me != ticket->queue_head)
{
pthread_cond_wait(&ticket->cond, &ticket->mutex);
}
pthread_mutex_unlock(&ticket->mutex);
}
void ticket_unlock(ticket_lock_t *ticket)
{
pthread_mutex_lock(&ticket->mutex);
ticket->queue_head++;
pthread_cond_broadcast(&ticket->cond);
pthread_mutex_unlock(&ticket->mutex);
}
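Illustrative usage of the ticket lock above (the worker function is hypothetical): callers are granted the critical section strictly in the order in which they called ticket_lock().
ticket_lock_t list_lock = TICKET_LOCK_INITIALIZER;

void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        ticket_lock(&list_lock);
        /* do_stuff -- the critical section */
        ticket_unlock(&list_lock);
        /* Even if this thread loops straight back to ticket_lock(), any
           thread already waiting is ahead of it in the queue and runs first. */
    }
}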
You could do something like this:
define a "queued lock" that consists of a free/busy flag plus a linked-list of pthread condition variables. access to the queued_lock is protected by a mutex
to lock the queued_lock:
seize the mutex
check the 'busy' flag
if not busy; set busy = true; release mutex; done
if busy; create a new condition # end of queue & wait on it (releasing mutex)
to unlock:
seize the mutex
if no other thread is queued, busy = false; release mutex; done
pthread_cond_signal the first waiting condition
do not clear the 'busy' flag - ownership is passing to the other thread
release mutex
when waiting thread unblocked by pthread_cond_signal:
remove our condition var from head of queue
release mutex
Note that the mutex is locked only while the state of the queued_lock is being altered, not for the whole duration that the queued_lock is held.
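A rough sketch of that queued lock, with one condition variable per waiter kept in a FIFO list; the names are illustrative, the dequeue is done by the unlocking thread for simplicity, and a per-waiter granted flag is added to cope with spurious wake-ups:
#include <pthread.h>
#include <stdbool.h>

struct qlock_waiter {
    pthread_cond_t       cond;
    bool                 granted;
    struct qlock_waiter *next;
};

typedef struct {
    pthread_mutex_t      mutex;   /* protects only the fields below */
    bool                 busy;
    struct qlock_waiter *head;    /* oldest waiter */
    struct qlock_waiter *tail;    /* newest waiter */
} queued_lock_t;

void queued_lock_acquire(queued_lock_t *q)
{
    pthread_mutex_lock(&q->mutex);
    if (!q->busy) {                        /* uncontended: take it and go */
        q->busy = true;
        pthread_mutex_unlock(&q->mutex);
        return;
    }
    struct qlock_waiter me;                /* queue ourselves at the tail */
    pthread_cond_init(&me.cond, NULL);
    me.granted = false;
    me.next = NULL;
    if (q->tail) q->tail->next = &me; else q->head = &me;
    q->tail = &me;
    while (!me.granted)                    /* guards against spurious wake-ups */
        pthread_cond_wait(&me.cond, &q->mutex);
    pthread_cond_destroy(&me.cond);
    pthread_mutex_unlock(&q->mutex);       /* we now own the queued lock */
}

void queued_lock_release(queued_lock_t *q)
{
    pthread_mutex_lock(&q->mutex);
    struct qlock_waiter *w = q->head;
    if (w == NULL) {
        q->busy = false;                   /* nobody queued */
    } else {
        q->head = w->next;                 /* dequeue the oldest waiter */
        if (q->head == NULL)
            q->tail = NULL;
        w->granted = true;                 /* busy stays true: ownership passes */
        pthread_cond_signal(&w->cond);
    }
    pthread_mutex_unlock(&q->mutex);
}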
You can obtain a fair mutex with the idea sketched by @caf, but using atomic operations to acquire the ticket before doing the actual lock.
#if defined(_MSC_VER)
typedef volatile LONG Sync32_t;
#define SyncFetchAndIncrement32(V) (InterlockedIncrement(V) - 1)
#elif (__GNUC__ * 10000 + __GNUC_MINOR__ * 100 + __GNUC_PATCHLEVEL__) > 40100
typedef volatile uint32_t Sync32_t;
#define SyncFetchAndIncrement32(V) __sync_fetch_and_add(V, 1)
#else
#error No atomic operations
#endif
class FairMutex {
private:
Sync32_t _nextTicket;
Sync32_t _curTicket;
pthread_mutex_t _mutex;
pthread_cond_t _cond;
public:
inline FairMutex() : _nextTicket(0), _curTicket(0), _mutex(PTHREAD_MUTEX_INITIALIZER), _cond(PTHREAD_COND_INITIALIZER)
{
}
inline ~FairMutex()
{
pthread_cond_destroy(&_cond);
pthread_mutex_destroy(&_mutex);
}
inline void lock()
{
unsigned long myTicket = SyncFetchAndIncrement32(&_nextTicket);
pthread_mutex_lock(&_mutex);
while (_curTicket != myTicket) {
pthread_cond_wait(&_cond, &_mutex);
}
}
inline void unlock()
{
_curTicket++;
pthread_cond_broadcast(&_cond);
pthread_mutex_unlock(&_mutex);
}
};
More broadly, I would not call this a FIFO mutex, because it gives the impression of maintaining an order that was not there in the first place. If your threads call lock() in parallel, they cannot have an order before calling the lock, so it makes no sense to create a mutex preserving an order relationship that is not there.
The example as you post it has no solution. Basically you only have one critical section and there is no place for parallelism.
That said, you see that it is important to reduce the period that your threads hold the mutex to a minimum, just a handful of instructions. This is difficult for insertion in a dynamic data structure such as a tree. The conceptually simplest solution is to have one read-write lock per tree node.
If you don't want to have individual locks per tree node you could have one lock structure per level of the tree. I'd experiment with read-write locks for that. You may use just read-locking of the level of the node in hand (plus the next level) when you traverse the tree. Then when you have found the right one to insert lock that level for writing.
The solution could be to use atomic operations. No locking, no context switching, no sleeping, and much, much faster than mutexes or condition variables. Atomic ops are not the end-all solution to everything, but we have created a lot of thread-safe versions of common data structures using just atomic ops. They are very fast.
Atomic ops are a series of simple operations like increment, decrement, or assignment that are guaranteed to execute atomically in a multithreaded environment. If two threads hit the op at the same time, the CPU makes sure one thread executes the op at a time. Atomic ops are hardware instructions, so they are fast. "Compare and swap" is very useful for thread-safe data structures. In our testing, atomic compare and swap is about as fast as 32-bit integer assignment. Maybe 2x as slow. When you consider how much CPU is consumed with mutexes, atomic ops are infinitely faster.
It's not trivial to do rotations to balance your tree with atomic operations, but not impossible. I ran into this requirement in the past and cheated by making a thread-safe skiplist, since a skiplist can be done really easily with atomic operations. Sorry, I can't give you a copy of our code... my company would fire me, but it's easy enough to do yourself.
How atomic ops work to make lock-free data structures can be visualized by the simple thread-safe linked list example. To add an item to a global linked list (_pHead) without using locks: first save a copy of _pHead, pOld. I think of these copies as "the state of the world" when executing concurrent ops. Next create a new node, pNew, and set its pNext to the copy. Then use atomic "compare and swap" to change _pHead to pNew ONLY IF _pHead IS STILL pOld. The atomic op will succeed only if _pHead hasn't changed. If it fails, loop back, get a copy of the new _pHead, and repeat.
If the op succeeds, the rest of the world will now see a new head. If a thread got the old head a nanosecond before, that thread won't see the new item, but the list will still be safe to iterate through. Since we preset the pNext to the old head BEFORE we added our new item to the list, if a thread sees the new head a nanosecond after we added it, the list is safe to traverse.
Global stuff:
typedef struct _TList {
    int data;
    struct _TList *pNext;
} TList;

TList *_pHead;
Add to list:
TList *pOld, *pNew;
...
// allocate/fill/whatever to make pNew
...
while (1) {                        // concurrency loop
    pOld = _pHead;                 // copy the state of the world. We operate on the copy
    pNew->pNext = pOld;            // chain the new node to the current head of recycled items
    if (CAS(&_pHead, pOld, pNew))  // switch head of recycled items to new node
        break;                     // success
}
CAS is shorthand for __sync_bool_compare_and_swap or the like. See how easy? No mutexes... no locks! In the rare event that 2 threads hit that code at the same time, one simply loops a second time. We only see the second loop because the scheduler swaps a thread out while it is in the concurrency loop, so it is rare and inconsequential.
Things can be pulled off the head of a linked list in a similar way. You can atomically change more than one value if you use unions, and you can use up to 128-bit atomic ops. We have tested 128-bit on 32-bit Red Hat Linux and they are about the same speed as the 32- and 64-bit atomic ops.
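For illustration, popping from the head with the same CAS loop might look like this; note that a naive pop is exposed to the ABA problem (a freed and recycled node reappearing at the same address can make the CAS succeed wrongly), which is exactly what the versioning technique mentioned in the next paragraph is meant to address:
TList *pop_head(void)
{
    TList *pOld, *pNext;
    do {
        pOld = _pHead;                     // snapshot the current head
        if (pOld == NULL)
            return NULL;                   // list is empty
        pNext = pOld->pNext;               // what the new head should become
    } while (!CAS(&_pHead, pOld, pNext));  // retry if another thread won the race
    return pOld;                           // caller now owns this node
}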
You will have to figure out how to use this type of technique with your tree. A binary tree node will have two pointers to child nodes. You can CAS them to change them. The balancing problem is tough. I can see how you could analyze a tree branch before you add something and make a copy of the branch from a certain point. When you finish changing the branch, you CAS the new one in. This would be a problem for large branches. Maybe balancing can be done "later" when the threads are not fighting over the tree. Maybe you can make it so the tree is still searchable even though you haven't cascaded the rotation all the way... in other words, if thread A added a node and is recursively rotating nodes, thread B can still read or add nodes. Just some ideas. In some cases, we make a structure that has version numbers or lock flags in the 32 bits after the 32 bits of pNext. We use 64-bit CAS then. Maybe you could make the tree safe to read at all times without locks, but you might have to use the versioning technique on a branch that is being modified.
Here are a bunch of posts I have made talking about the advantages of atomic ops:
Pthreads and mutexes; locking part of an array
Efficient and fast way for thread argument
Configuration auto reloading with pthreads
Advantages of using condition variables over mutex
single bit manipulation
Is memory allocation in linux non-blocking?
You might take a look at the pthread_mutexattr_setprioceiling function.
int pthread_mutexattr_setprioceiling(pthread_mutexattr_t *attr,
                                     int prioceiling,
                                     int *oldceiling);
From the documentation:
pthread_mutexattr_setprioceiling(3THR) sets the priority ceiling attribute of a mutex attribute object.
attr points to a mutex attribute object created by an earlier call to pthread_mutexattr_init().
prioceiling specifies the priority ceiling of initialized mutexes. The ceiling defines the minimum priority level at which the critical section guarded by the mutex is executed. prioceiling will be within the maximum range of priorities defined by SCHED_FIFO. To avoid priority inversion, prioceiling will be set to a priority higher than or equal to the highest priority of all the threads that might lock the particular mutex.
oldceiling contains the old priority ceiling value.
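Note that the prototype quoted above is the older Solaris form; current POSIX declares pthread_mutexattr_setprioceiling with just the attribute pointer and the ceiling. A minimal sketch of creating a priority-ceiling mutex with the POSIX interface (error handling omitted; the ceiling value is illustrative and must be a valid SCHED_FIFO priority on the target system):
#include <pthread.h>

pthread_mutex_t m;

void init_ceiling_mutex(int ceiling)
{
    pthread_mutexattr_t attr;
    pthread_mutexattr_init(&attr);
    pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_PROTECT);  /* ceiling protocol */
    pthread_mutexattr_setprioceiling(&attr, ceiling);            /* POSIX two-argument form */
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);
}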

concurrent variable access in c

I have a fairly specific question about concurrent programming in C. I have done a fair bit of research on this but have seen several conflicting answers, so I'm hoping for some clarification. I have a program that's something like the following (sorry for the longish code block):
typedef struct {
    pthread_mutex_t mutex;
    /* some shared data */
    int eventCounter;
} SharedData;

SharedData globalSharedData;

typedef struct {
    /* details unimportant */
} NewData;

void newData(NewData data) {
    int localCopyOfCounter;
    if (/* information contained in new data triggers an
           event */) {
        pthread_mutex_lock(&globalSharedData.mutex);
        localCopyOfCounter = ++globalSharedData.eventCounter;
        pthread_mutex_unlock(&globalSharedData.mutex);
    }
    else {
        return;
    }
    /* Perform long running computation. */
    if (localCopyOfCounter != globalSharedData.eventCounter) {
        /* A new event has happened, old information is stale and
           the current computation can be aborted. */
        return;
    }
    /* Perform another long running computation whose results
       depend on the previous one. */
    if (localCopyOfCounter != globalSharedData.eventCounter) {
        /* Another check for new event that causes information
           to be stale. */
        return;
    }
    /* Final stage of computation whose results depend on two
       previous stages. */
}
There is a pool of threads servicing the connection for incoming data, so multiple instances of newData can be running at the same time. In a multi-processor environment there are two problems I'm aware of in getting the counter handling part of this code correct: preventing the compiler from caching the shared counter copy in a register so other threads can't see it, and forcing the CPU to write the store of the counter value to memory in a timely fashion so other threads can see it. I would prefer not to use a synchronization call around the counter checks because a partial read of the counter value is acceptable (it will produce a value different than the local copy, which should be adequate to conclude that an event has occurred). Would it be sufficient to declare the eventCounter field in SharedData to be volatile, or do I need to do something else here? Also is there a better way to handle this?
Unfortunately, the C standard says very little about concurrency. However, most compilers (gcc and msvc, anyway) will regard a volatile read as if it had acquire semantics -- the volatile variable will be reloaded from memory on every access. That is desirable: your code as it is now may end up comparing values cached in registers. I wouldn't even be surprised if both comparisons were optimized out.
So the answer is yes, make the eventCounter volatile. Alternatively, if you don't want to restrict your compiler too much, you can use the following function to perform reads of eventCounter.
int load_acquire(volatile int * counter) { return *counter; }
if (localCopy != load_acquire(&sharedCopy))
// ...
preventing the compiler from caching the local counter copy in a register so other threads can't see it
Your local counter copy is "local", created on the execution stack and visible only to the running thread. Every other thread runs on a different stack and has its own local counter variable (no concurrency).
Your global counter should be declared volatile to avoid register optimization.
You can also use hand-coded assembly or compiler intrinsics which will guarantee atomic checks against your mutex; they can also atomically ++ and -- your counter.
volatile is useless these days, for the most part; you should look at memory barriers, which are another low-level CPU facility for dealing with multi-core contention.
However, the best advice I can give would be for you to read up on the various managed and native multi-core support libraries. Some of the older ones, like OpenMP or MPI (message based), are still kicking and people will go on about how cool they are... however, for most developers, something like Intel's TBB or Microsoft's newer APIs is a better fit. I also just dug up a Code Project article whose author is apparently using cmpxchg8b, which is the low-level hardware route I mentioned initially...
Good luck.
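If the shared eventCounter is the main concern, one sketch that avoids relying on volatile is to access it through GCC's __atomic built-ins, which provide both the atomicity and the barriers mentioned above (the helper names are made up; the counter stands in for the question's SharedData.eventCounter):
static int eventCounter;   /* stands in for SharedData.eventCounter */

/* Writer side: bump the counter and publish the update. */
int bump_event_counter(void)
{
    return __atomic_add_fetch(&eventCounter, 1, __ATOMIC_RELEASE);
}

/* Reader side: has a new event invalidated our local copy? */
int event_counter_changed(int localCopyOfCounter)
{
    return localCopyOfCounter != __atomic_load_n(&eventCounter, __ATOMIC_ACQUIRE);
}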
