I have a C application, part of which does some threaded stuff, which I'm having some difficulty to implement. I'm using pthread.h (POSIX thread programming) as a guideline.
I need to synchronize two threads that repeat a certain task a predefined number of times, and with each repetition the two tasks need to start at the same time. My idea is to let each thread initialize and do their work before the sync'd task begins, and when this happens thread one (let's call this thread TX) will signal thread 2 (RX) that it can begin doing the task.
Here's an example:
static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t tx_condvar = PTHREAD_COND_INITIALIZER;
static bool tx_ready = false;
These are declared in a header file. The TX thread is shown below:
while (reps > 0 && !task->quit) {
pthread_mutex_lock(&mutex);
tx_ready = true;
pthread_cond_signal(&tx_condvar);
pthread_mutex_unlock(&mutex);
status = do_stuff();
if (status != 0) {
print_error();
goto tx_task_out;
}
reps--;
// one task done, wait till it's time to do the next
usleep(delay_value_us);
tx_ready = false;
}
And then the RX
while (!done && !task->quit) {
// wait for the tx_ready signal before carrying on
pthread_mutex_lock(&mutex);
while (!tx_ready){
pthread_cond_wait(&tx_condvar, &mutex);
}
pthread_mutex_unlock(&mutex);
status = do stuff();
if (status != 0) {
print_error();
goto rx_task_out;
}
n = fwrite(samples, 2 * sizeof(samples[0]), to_rx, p->out_file);
num_rx += to_rx;
if (num_rx == s->rx_length){
done = true;
}
}
Is there a better way to handle this, and am I even doing it correctly? It's incredibly important that the two tasks within the tx/rx threads start at the same time for each repetition.
Thanks in advance for your input!
What you are looking for is called a barrier. Basically it blocks threads entering the barrier until a certain number of threads have entered and then it releases them all.
I believe pthreads have a barrier, although it might be a extension.
https://computing.llnl.gov/tutorials/pthreads/man/pthread_barrier_wait.txt
First, if you are not using a multi-core processor, you cannot run two tasks at precisely the same time.
Starting two tasks at precisely the same time requires a multi-core system. (SMP architecture where multiple cores with shared memory are under a single OS). There are extensions available on several dev environments specifically for taking advantage of features such as processor affinity, where you can dedicate a particular thread to run only on a specific core, or set up processors to be dedicated to running specific threads, and when to run them.
I use LabWindows/CVI (ANSI C with extensions for instrumentation). HERE is a small white paper on capabilities using multi-core. Here is another that is mostly NI specific, but also includes some generic techniques applicable using any ANSI C compiler (toward bottom) dealing with time critical loops.
Related
I want to create two tasks that run simultaneously in FreeRTOS. The first task will deal with the LED, the second task will monitor the temperature.
I have two questions:
Will this code create two tasks that run simultaneously?
How do I send the data between the tasks, for example: if the temperature is more than x degrees, turn on the LED?
void firstTask(void *pvParameters) {
while (1) {
puts("firstTask");
}
}
void secondTask(void *pvParameters) {
while (1) {
puts("secondTask");
}
}
int main() {
xTaskCreate(firstTask, "firstTask", STACK_SIZE, NULL, TASK_PRIORITY, NULL);
xTaskCreate(secondTask, "secondTask", STACK_SIZE, NULL, TASK_PRIORITY, NULL);
vTaskStartScheduler();
}
Tasks of equal priority are round-robin scheduled. This means that firstTask will run continuously until the end of its time-slice or until it is blocked, then secondTask will run for a complete timeslice or until it is blocked then back to firstTask repeating indefinitely.
On the face of it you have no blocking calls, but it is possible that if you have implemented RTOS aware buffered I/O for stdio, then puts() may well be blocking when its buffer is full.
The tasks on a single core processor are never truly concurrent, but are scheduled to run as necessary depending on the scheduling algorithm. FreeRTOS is a priority-based preemptive scheduler.
Your example may or not behave as you intend, but both tasks will get CPU time and run in some fashion. It is probably largely academic as this is not a very practical or useful use of an RTOS.
Tasks never really run simultaneously - assuming you only have one core. In you case you are creating the tasks with the same priority and they never block (although they do output strings, probably in a way that is not thread safe), so they will 'share' CPU time by time slicing. Each task will execute up to the next tick interrupt, at which point it will switch to the other.
I recommend reading the free pdf version of the FreeRTOS book for a gentle introduction into the basics https://www.freertos.org/Documentation/RTOS_book.html
This is concerning Microsoft/Visual Studio and Intel/AMD-specific implementation only.
Say, if declare a global variable:
volatile __declspec(align(16)) ULONG vFlags = 0;
And, say, I have multiple contending threads:
//Thread 1
ULONG prevFlags;
prevFlags = InterlockedExchange(&vFlags, 0);
if(prevFlags != 0)
{
//Do work
}
and then from other threads, I do:
//Thread N
vFlags = SomeNonZeroValue;
So say, on a multi-CPU system, at the moment in time while thread 1 is executing a locked InterlockedExchange instruction, some other threads come to executing vFlags = 2 and vFlags = 4 instructions.
What would happen in that case? Would vFlags = 2 and vFlags = 4 be stalled until InterlockedExchange completes, or will it disregard that lock?
Or do I need to use this instead?
//Thread N
InterlockedOr(&vFlags, SomeNonZeroValue);
Instructions that don't use locks to update a variable do not interact with instructions that do. Locking is a cooperative process that all participants must observe in order for it to work. So, yes, updating the flag with a simple assignment on one thread will not be blocked by another thread calling InterlockedExchange.
On the other hand, assigning different values to variables that are read by other threads raises the issue of visibility across cores since other threads may not immediately, or indeed ever, see the updates. InterlockedExchange solves this issue as well by providing implicit memory fences.
In conclusion, I would use InterlockedExchange in all threads updating the flag.
I have a program in C, which takes arbitrary number of files as commandline argument, and calculates sha1sum for every file. I am using pthreads, so that I can take advantage of all 4 my cores.
Currently, my code runs all threads in parallel at the same time.
Here is a snippet:
c = 0;
for (n = optind; n < argc; n++) {
if (pthread_create(&t[c], NULL, &sha1sum, (void *) argv[n])) {
fprintf(stderr, "Error creating thread\n");
return 1;
}
c++;
}
c = 0;
for (n = optind; n < argc; n++) {
pthread_join(t[c], NULL);
c++;
}
Obviously, it is not efficient (or scalable) to start all threads at once.
What would be the best way to make sure, that only 4 threads are running at any time? Somehow I need to start 4 threads at the beginning, and then "replace" a thread with new one as soon as it completes.
How can I do that ?
Obviously, it is not efficient (or scalable) to start all threads at once.
Creating 4 threads is not necessarily provides the best performance on a 4 core machine. If the threads are doing IO or waiting on something, then creating more than 4 threads could also result in better performance/efficiency. You just need to figure out an approximate number based on the work your threads do and possbily a mini-benchmark.
Regardless of what number you choose (i.e. number of threads), what you are looking for is a thread pool. The idea is to create a fixed number of threads and feed them work as soon as they complete.
See C: What's the way to make a poolthread with pthreads? for a simple skeleton. This repo also shows a self-contained example (check the license if you are going to use it). You can find many similar examples online.
The thing you are looking for is a semaphore; it will allow you to restrict only 4 threads to be running at a time. You would/could start them all up initially, and it will take care of letting a new one proceed when a running one finishes.
I have an application which plays a network server role, and will pthread_create multiple threads and each thread will listen on a particular TCP port and accept multiple TCP socket connections.
Now, after some time for example, 60 seconds, all network clients (TCP socket clients) have been closed (but my application is still running those threads and listening on those ports), how do I collect data (such as total_bytes received) from those threads created by my application?
One solution I currently used is: in each socket accept(), when new data arrives, the corresponding thread will update a static variable with pthread_mutex_t. But I suspect this is low efficiency and waste time by the mutex.
Is there any lock-free way to do this?
If there any solution of "per_cpu" counters just like it is used in network driver/without lock/mutex?
Or, I don't update the Receiver_Total_Bytes when receiving n bytes from socket (read()). Instead, I keep calculate the total bytes within the thread. But the question is, how do I get the total bytes number from a un-finished thread?
===sudo code===
register long Receiver_Total_Bytes = 0;
static pthread_mutex_t Summarizer_Mutex = PTHREAD_MUTEX_INITIALIZER;
void add_server_transfer_bytes(register long bytes )
{
pthread_mutex_lock( &Summarizer_Mutex );
Receiver_Total_Bytes += bytes;
pthread_mutex_unlock( &Summarizer_Mutex );
}
void reset_server_transfer_bytes( )
{
pthread_mutex_lock( &Summarizer_Mutex );
Receiver_Total_Bytes = 0;
pthread_mutex_unlock( &Summarizer_Mutex );
}
Then in socket read:
if((n = read(i, buffer, bytes_to_be_read)) > 0) {
............
add_server_transfer_bytes(n);
Another option is to allocate a structure for each thread, and have that structure include the desired counters, say connections and total_bytes, at least.
The thread itself should increment these using atomic built-ins:
__sync_fetch_and_add(&(threadstruct->connections), 1);
__sync_fetch_and_add(&(threadstruct->total_bytes), bytes);
or
__atomic_fetch_add(&(threadstruct->connections), 1, __ATOMIC_SEQ_CST);
__atomic_fetch_add(&(threadstruct->total_bytes), bytes, __ATOMIC_SEQ_CST);
These are slightly slower than non-atomic operations, but the overhead is very small, if there is no contention. (In my experience, cacheline ping-pong -- when different CPUs try to access the variable at the same time -- is a significant risk and a real-world cause for slowdown, but here the risk is minimal. At worst, only the current thread and the main thread may access the variables at the same time. Of course, the main thread should not calculate the summaries too often; say, once or twice a second should be enough.)
Because the structure is allocated in the main thread, the main thread can also access the counters. To collect the totals, it'll use a loop, and inside the loop,
overall_connections += __sync_fetch_and_add(&(threadstruct[thread]->connections), 0);
overall_total_bytes += __sync_fetch_and_add(&(threadstruct[thread]->total_bytes), 0);
or
overall_connections += __atomic_load_n(&(threadstruct[thread]->connections));
overall_total_bytes += __atomic_load_n(&(threadstruct[thread]->total_bytes));
See the GCC manual for further information on the __atomic and __sync built-in functions. Other C compilers like Intel CC also provide these -- or at least used to; the last time I verified this was a few years ago. The __sync ones are older (and more widely supported in older compiler versions), but the __atomic ones reflect the memory models specified in C11, so are more likely to be supported by future C compilers.
Yes, your concerns are warranted. The worst thing you can do here is to use mutex as suggested in another answer. Mutexes preempt threads, so they really are multithreaders worst enemy. The other thing which might come to mind is to use atomic operations for incrementing (also mentioned in the same answer). Terrible idea indeed! Atomic operations perform very poor under contention (atomic increment is a actually a loop, which will try to incrememnt until succeeds). Since in the case described I imagine the conention will be high, atomics will behave bad.
The other problem with atomics and mutexes a like is that enforce memory ordering and impose bariers. Not a good thing for performance!
The real solution to the question, is, of course, having each thread using it's own private counter. It is not per-cpu, it is per thread. Once the threads are done, those counters can be accumulated.
I have a fairly specific question about concurrent programming in C. I have done a fair bit of research on this but have seen several conflicting answers, so I'm hoping for some clarification. I have a program that's something like the following (sorry for the longish code block):
typedef struct {
pthread_mutex_t mutex;
/* some shared data */
int eventCounter;
} SharedData;
SharedData globalSharedData;
typedef struct {
/* details unimportant */
} NewData;
void newData(NewData data) {
int localCopyOfCounter;
if (/* information contained in new data triggers an
event */) {
pthread_mutex_lock(&globalSharedData.mutex);
localCopyOfCounter = ++globalSharedData.eventCounter;
pthread_mutex_unlock(&globalSharedData.mutex);
}
else {
return;
}
/* Perform long running computation. */
if (localCopyOfCounter != globalSharedData.eventCounter) {
/* A new event has happened, old information is stale and
the current computation can be aborted. */
return;
}
/* Perform another long running computation whose results
depend on the previous one. */
if (localCopyOfCounter != globalSharedData.eventCounter) {
/* Another check for new event that causes information
to be stale. */
return;
}
/* Final stage of computation whose results depend on two
previous stages. */
}
There is a pool of threads servicing the connection for incoming data, so multiple instances of newData can be running at the same time. In a multi-processor environment there are two problems I'm aware of in getting the counter handling part of this code correct: preventing the compiler from caching the shared counter copy in a register so other threads can't see it, and forcing the CPU to write the store of the counter value to memory in a timely fashion so other threads can see it. I would prefer not to use a synchronization call around the counter checks because a partial read of the counter value is acceptable (it will produce a value different than the local copy, which should be adequate to conclude that an event has occurred). Would it be sufficient to declare the eventCounter field in SharedData to be volatile, or do I need to do something else here? Also is there a better way to handle this?
Unfortunately, the C standard says very little about concurrency. However, most compilers (gcc and msvc, anyway) will regard a volatile read as if having acquire semantics -- the volatile variable will be reloaded from memory on every access. That is desirable, your code as it is now may end up comparing values cached in registers. I wouldn't even be surprised if the both comparisons were optimized out.
So the answer is yes, make the eventCounter volatile. Alternatively, if you don't want to restrict your compiler too much, you can use the following function to perform reads of eventCounter.
int load_acquire(volatile int * counter) { return *counter; }
if (localCopy != load_acquire(&sharedCopy))
// ...
preventing the compiler from caching
the local counter copy in a register
so other threads can't see it
Your local counter copy is "local", created on the execution stack and visible only to the running thread. Every other thread runs in a different stack and has the own local counter variable (no concurrency).
Your global counter should be declared volatile to avoid register optimization.
You can also use hand coded assembly or compiler intrinsics which will garuntee atomic checks against your mutex, they can also atomically ++ and -- your counter.
volatile is useless these days, for the most part, you should look at memory barrier's which are other low level CPU facility to help with multi-core contention.
However the best advice I can give, would be for you to bone up on the various managed and native multi-core support libraries. I guess some of the older one's like OpenMP or MPI (message based), are still kicking and people will go on about how cool they are... however for most developers, something like intel's TBB or Microsoft's new API's, I also just dug up this code project article, he's apparently using cmpxchg8b which is the lowlevel hardware route which I mentioned initially...
Good luck.