How to optimize two threads running heavy loops - C

I have more of a conceptual question.
Assume one program running two threads.
Both threads are running loops all the time.
One thread is responsible for streaming data and the other thread is responsible for receiving the file that the first thread has to stream.
So the file transfer thread loops to receive data, which it writes to a file, and the streaming thread reads that data from the file as it needs it and streams it.
The problem I see here is how to avoid starvation when the file transfer takes too many CPU cycles for itself and thus makes the streaming thread lag.
How would I be able to share the CPU effectively between those two threads, knowing that the streamer streams data far slower than the file transfer receives it?
I thank you for your advice.

Quite often this kind of problem is solved by using some kind of flow control:
block the sender when the receiver is busy.
This also causes problems: if your program must be able to fast forward (seek forward),
then it is not a good idea.
In your case, you could block the file transfer thread when there is more than 2 MB of unstreamed data in the file, and resume it when there is less than 1 MB of unstreamed data.
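A minimal sketch of that high/low-watermark flow control with pthreads; the shared byte counter, the thresholds, and the two callback names are illustrative, not taken from the question:

#include <pthread.h>
#include <stddef.h>

#define HIGH_WATERMARK (2u * 1024 * 1024)   /* pause the transfer above 2 MB */
#define LOW_WATERMARK  (1u * 1024 * 1024)   /* resume it below 1 MB */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  can_receive = PTHREAD_COND_INITIALIZER;
static size_t unstreamed = 0;   /* bytes written to the file but not yet streamed */
static int paused = 0;

/* File transfer thread: call after writing n received bytes to the file. */
void on_bytes_received(size_t n)
{
    pthread_mutex_lock(&lock);
    unstreamed += n;
    if (unstreamed > HIGH_WATERMARK)
        paused = 1;
    while (paused)
        pthread_cond_wait(&can_receive, &lock);   /* block; gives the CPU to the streamer */
    pthread_mutex_unlock(&lock);
}

/* Streaming thread: call after streaming n bytes from the file. */
void on_bytes_streamed(size_t n)
{
    pthread_mutex_lock(&lock);
    unstreamed -= n;
    if (paused && unstreamed < LOW_WATERMARK) {
        paused = 0;
        pthread_cond_signal(&can_receive);        /* wake the file transfer thread */
    }
    pthread_mutex_unlock(&lock);
}

Blocking rather than spinning is what actually frees the CPU for the streamer, so no manual time-slicing is needed.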

See if pthread_setschedparam() helps you balance out the threads' usage of the CPU.
From the man page of pthread_setschedparam, you can change the thread priorities:
int pthread_setschedparam(pthread_t thread, int policy,
                          const struct sched_param *param);

struct sched_param {
    int sched_priority;   /* Scheduling priority */
};
As can be seen, only one scheduling parameter is supported. For details of
the permitted ranges for scheduling priorities in each scheduling policy, see
sched_setscheduler(2).
Also,
"the file transfer is taking too many CPU cycles for itself"
If you read this SO post, it seems to suggest that changing thread priorities may not help, because the reason the file transfer thread is consuming more CPU cycles is that it needs them. But in your case, you are OK with the file transfer being slowed down, since the streamer thread cannot keep up anyway! Hence I suggest you change the priorities and deprive the file transfer thread of some cycles even if it needs them.
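For illustration, a hedged sketch of how the call might be used to favour the streaming thread; the SCHED_FIFO policy and the priority values are arbitrary examples (real-time policies usually require elevated privileges), not something prescribed above:

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Illustrative only: give the streaming thread a higher real-time priority
 * than the file transfer thread. SCHED_FIFO typically needs CAP_SYS_NICE
 * or root on Linux, and the priority values below are arbitrary. */
static void set_thread_priority(pthread_t thread, int priority)
{
    struct sched_param param;
    memset(&param, 0, sizeof(param));
    param.sched_priority = priority;

    int err = pthread_setschedparam(thread, SCHED_FIFO, &param);
    if (err != 0)
        fprintf(stderr, "pthread_setschedparam: %s\n", strerror(err));
}

/* Usage (streamer_tid and transfer_tid are hypothetical pthread_t handles):
 *     set_thread_priority(streamer_tid, 20);
 *     set_thread_priority(transfer_tid, 10);
 */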

Related

Best way to synchronise threads and measure performance at sub-microsecond frequency

I'm working on a standard x86 six core SMP machine, 3.6GHz clock speed, plain C code.
I have a threaded producer/consumer scheme in which my "producer" thread is reading from file at roughly 1,000,000 lines/second, and handing the data it reads off to either two or four "consumer" threads which do a bit of work on it and then stick it into a database. While they are consuming, it is busy reading the next line.
So both producer and consumers have to have some means of synchronisation which works at sub-microsecond frequency, for which I use a "busy spin wait" loop, because all the normal synchronisation mechanisms I can find are just too slow. In pseudo code terms:
Producer thread
While(something in file)
{
    read a line
    populate 1/2 of data double buffer
    wait for consumers to idle
    set some key data
    set memory fence
    swap buffers
}
And the consumer threads likewise
while(not told to die)
{
    wait for key data change event
    consume data
}
At both sides the "wait" loop is coded:
while(waiting)
{
    _mm_pause();            /* Intel say this is a good hint to processor that this is a spin wait */
    if(#iterations > 1000)
        yield_thread();     /* Sleep(0) on Windows, pthread_yield() on Linux */
}
This all works, and I get some quite nice speed-ups compared to the equivalent serial code, but my profiler (Intel's VTune Amplifier) shows that I am spending a horrendous amount of time in my busy wait loops, and the ratio of "spin" to "useful work done" is depressingly high. Given the way the profiler concentrates its feedback on the busiest sections, this also means that the lines of code doing useful work tend not to be reported, since (relatively speaking) their percentage of total CPU is down at the noise level... or at least that is what the profiler is saying. They must be doing something, otherwise I wouldn't see any speed-up!
I can and do time things, but it is hard to distinguish between delays imposed by disk latency in the producer thread, and delays spent while the threads synchronise.
So is there a better way to measure what is actually going on? By which I mean just how much time are these threads really spending waiting for one another? Measuring time accurately is really hard at sub-microsecond resolution, the profiler doesn't seem to give me much help, and I am struggling to optimise the scheme.
Or maybe my spin wait scheme is rubbish, but I can't seem to find a better solution for sub-microsecond synchronisation.
Any hints would be really welcome :-)
Even better than fast locks is not locking at all. Try switching to a lock-free queue. Producers and consumers wouldn't need to wait at all.
Lock-free data structures are process, thread and interrupt safe (i.e. the same data structure instance can be safely used concurrently and simultaneously across cores, processes, threads and both inside and outside of interrupt handlers), never sleep (and so are safe for kernel use when sleeping is not permitted), operate without context switches, cannot fail (no need to handle error cases, as there are none), perform and scale literally orders of magnitude better than locking data structures, and liblfds itself (as of release 7.0.0) is implemented such that it performs no allocations (and so works with NUMA, stack, heap and shared memory) and compiles not just on a freestanding C89 implementation, but on a bare C89 implementation.
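To make the idea concrete, here is a minimal single-producer/single-consumer lock-free ring buffer using C11 atomics; it is a sketch, not liblfds itself, and with the question's two-to-four consumers you would need either one such queue per consumer or a true multi-consumer structure. The type names and capacity are illustrative:

#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QCAP 1024                      /* must be a power of two */

struct spsc_queue {
    void *slots[QCAP];
    _Atomic size_t head;               /* next slot the consumer reads  */
    _Atomic size_t tail;               /* next slot the producer writes */
};

/* Producer side: returns false if the queue is full. */
static bool spsc_push(struct spsc_queue *q, void *item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&q->head, memory_order_acquire);
    if (tail - head == QCAP)
        return false;                  /* full */
    q->slots[tail & (QCAP - 1)] = item;
    atomic_store_explicit(&q->tail, tail + 1, memory_order_release);
    return true;
}

/* Consumer side: returns false if the queue is empty. */
static bool spsc_pop(struct spsc_queue *q, void **item)
{
    size_t head = atomic_load_explicit(&q->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&q->tail, memory_order_acquire);
    if (head == tail)
        return false;                  /* empty */
    *item = q->slots[head & (QCAP - 1)];
    atomic_store_explicit(&q->head, head + 1, memory_order_release);
    return true;
}

The producer is the only writer of tail and the consumer the only writer of head, so neither side ever waits on a lock; fullness and emptiness are detected purely from the two counters.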
Thank you to all who commented above; the suggestion of making the quantum of work bigger was the key. I have now implemented a queue (a 1000-entry rotating buffer) for my consumer threads, so the producer only has to wait if that queue is full, rather than waiting for its half of the double buffer in my previous scheme. So its synchronisation time is now sub-millisecond instead of sub-microsecond - well, that's a surmise, but it's definitely 1000x longer than before!
If the producer hits "queue full" I can now yield its thread immediately, instead of spin waiting, safe in the knowledge that any time slice it loses will be used gainfully by the consumer threads. This does indeed show up as a small amount of sleep/spin time in the profiler. The consumer threads benefit too since they have a more even workload.
The net outcome is a 10% reduction in the overall time to read a file, and given that only part of the file can be processed in a threaded manner, that suggests the threaded part of the process is around 15% or more faster.

Multiple threads on different cores reading same set of files

I have a multi threaded process, where each thread runs on one core. I am reading the same set of files from each of the threads and processing them. Will reading the same set of files by multiple threads affect the performance of the process?
Not necessarily, but there are a few factors to be taken into account.
When you open a file for READING you don't need to put a read lock on it.
That means multiple threads can be reading from the same file.
In fact all threads from a process share the process memory, so you can use that for your benefit by caching the whole set (or part of it, depending on the size) on the process memory. That will reduce access time.
Otherwise, if we assume all files are on the same device, the problem is that reading multiple files from the same device at the same time is slow and, depending on the number of threads and the storage type, it can be noticeably slower.
Reading the same set of files from each different thread may reduce the performance of the process, because IO requests are normally costly and slow, in addition to repeating the same read operation for each thread.
One possible solution to deal with this is having one thread dealing with the IO reads/writes and the rest processing the data, for example as a producer consumer.
You may consider memory-mapped files for concurrent read access.
It avoids the overhead of copying data into every process's address space.
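As a rough illustration of the memory-mapped approach with POSIX mmap; the helper name and the deliberately minimal error handling are mine:

#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only; the returned pointer can be shared by every
 * thread in the process, so each file is read from disk only once. */
static const char *map_file_readonly(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }

    void *addr = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (addr == MAP_FAILED)
        return NULL;

    *len_out = (size_t)st.st_size;
    return (const char *)addr;       /* release later with munmap(addr, len) */
}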

Mitigate effects of polling ring buffers

I have a single-producer, multiple-consumer program with threads for each role. I am thinking of implementing a circular buffer for TCP on each of the consumers and allowing the producer to keep pointers to the circular buffers' memory, then handing out pointer space for TCP to offload data into.
My problem: how do the consumer threads know when data is in?
I am thinking of busy-wait checking the pointer location for something other than a 0; I don't mind being a CPU hog.
I should mention that each thread is pinned with a cpuset and runs soft real-time under SCHED_FIFO, and of course it is implemented in C.
In my experience, the problem with multiple-consumer data structures is to properly handle concurrency while avoiding issues with false sharing or excessively wasting CPU cycles.
So if your problem allows it, I would use pipe() to create a pipe to each consumer and put items into these pipes in a round-robin fashion. The consumers can then use epoll to watch the file handles. This avoids having to implement and optimize a concurrent data structure, and you won't burn CPU cycles needlessly. The cost is that you have to go through syscalls.
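A rough sketch of what the consumer side of that pipe-plus-epoll scheme could look like on Linux; the assumption here is that the producer writes pointer-sized items into each consumer's pipe with write(fd, &ptr, sizeof ptr), and process_item() is a hypothetical stand-in for the real work:

#include <sys/epoll.h>
#include <unistd.h>

/* Consumer side: block on this consumer's pipe and pull out pointer-sized
 * items; pointer-sized writes are well below PIPE_BUF, so they are atomic. */
static void consumer_loop(int pipe_read_fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = pipe_read_fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, pipe_read_fd, &ev);

    for (;;) {
        struct epoll_event ready;
        int n = epoll_wait(epfd, &ready, 1, -1);   /* sleep until data arrives */
        if (n <= 0)
            continue;

        void *item;
        if (read(pipe_read_fd, &item, sizeof item) == (ssize_t)sizeof item) {
            /* process_item(item);  -- hypothetical per-item work */
        }
    }
}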
If you want to do everything yourself with polling to avoid syscalls, you can build a circular buffer, but you have to make sure that only one consumer reads an item at a time and only after the item has been written. Usually this is done with 4 pointers and proper mutexes.
This article about Xen's I/O ringbuffers might be of interest.

What is a Thread-pool?

What is the concept behind implementing a thread pool (in C, with help from pthreads)?
How can a thread be assigned to execute work from the thread pool?
A thread-pool is a collection of a fixed number of threads which are created on application startup. The threads then sit waiting for requests to come to them, typically via a queue controlled by a semaphore. When a request is made, and there is at least one thread waiting, the thread is woken up, services the request, and goes back to waiting on the semaphore. If no threads are available, requests queue up until one is.
Thread-pools are a generally more efficient way of managing resources than simply starting a new thread for every request. However, some architectures allow new threads to be created and added to the pool as the application runs, depending on request loading.
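A bare-bones sketch of that arrangement with pthreads, using a POSIX semaphore to count queued requests and a mutex to protect the queue itself; the names, the task structure, and the fixed pool size of four are illustrative only:

#include <pthread.h>
#include <semaphore.h>
#include <stdlib.h>

struct task {
    void (*fn)(void *);
    void *arg;
    struct task *next;
};

static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static sem_t tasks_available;          /* counts queued tasks */
static struct task *queue_head, *queue_tail;

/* Each pool thread runs this loop forever. */
static void *worker(void *unused)
{
    (void)unused;
    for (;;) {
        sem_wait(&tasks_available);    /* sleep until a request is queued */

        pthread_mutex_lock(&queue_lock);
        struct task *t = queue_head;
        queue_head = t->next;
        if (!queue_head)
            queue_tail = NULL;
        pthread_mutex_unlock(&queue_lock);

        t->fn(t->arg);                 /* service the request */
        free(t);
    }
    return NULL;
}

/* Called by any thread to hand work to the pool. */
void pool_submit(void (*fn)(void *), void *arg)
{
    struct task *t = malloc(sizeof *t);
    t->fn = fn;
    t->arg = arg;
    t->next = NULL;

    pthread_mutex_lock(&queue_lock);
    if (queue_tail)
        queue_tail->next = t;
    else
        queue_head = t;
    queue_tail = t;
    pthread_mutex_unlock(&queue_lock);

    sem_post(&tasks_available);        /* wake one waiting worker */
}

/* Startup (error handling omitted):
 *     sem_init(&tasks_available, 0, 0);
 *     pthread_t tid[4];
 *     for (int i = 0; i < 4; i++)
 *         pthread_create(&tid[i], NULL, worker, NULL);
 */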
To clarify something in previous answers:
The reason that instantiating more and more threads leads to inefficiency is context switching time. The OS periodically switches one thread for another on the processor. This involves saving one thread's state and loading another thread's state from memory, so it takes non-negligible time, N ms, per context switch.
For example, if you have 10 threads, the context switching takes 10*N ms. If you have 1000 threads, it's 1000*N ms. As the number of concurrent threads increases, eventually the context switching begins to overwhelm any efficiencies derived from multithreading. Your application has a sweet spot in terms of the best number of threads. Once you determine this sweet number by experimentation, you can set your thread pool max size to that number of threads, thereby obtaining maximum efficiency from multithreading.
Adding to anon's answer, I'd like to mention that there are fixed thread pools, which have a fixed number of threads running in them; cached thread pools, which can dynamically grow and then shrink when no work is available; and dynamic thread pools, which can also be bounded by a maximum number of threads and/or a maximum length of the work queue. I don't think there is actually a set terminology for this kind of stuff, and one rarely encounters non-fixed TPs written in C, but at least one should know that a fixed TP is not the only kind out there.

Efficient way to save data to disk while running a computationally intensive task

I'm working on a piece of scientific software that is very CPU-intensive (it's processor-bound), but it needs to write data to disk fairly often (I/O-bound).
I'm adding parallelization to this (OpenMP) and I'm wondering about the best way to address the write-to-disk needs.
There's no reason the simulation should wait on the HDD (which is what it's doing now).
I'm looking for a 'best practice' for this, and speed is what I care about most (these can be hugely long simulations).
Thanks
~Alex
First thoughts:
Having a separate process do the actual writing to disk, so the simulation has two processes: one CPU-bound (the simulation) and one IO-bound (writing the file). This sounds complicated.
Possibly a pipe/buffer? I'm kind of new to these, so maybe that could be a solution.
I'd say the best way would be to spawn a different thread to save the data, not a completely new process; with a new process, you run into the trouble of having to communicate the data to be saved across the process boundary, which introduces a new set of difficulties.
The first solution that comes to mind is pretty much what you've said - having disk writes in their own process with a one-way pipe from the sim to the writer. The writer does writes as fast as possible (drawing new data off the pipe). The problem with this is that if the sim gets too far ahead of the writer, the sim is going to be blocking on the pipe writes anyway, and it will be I/O bound at one remove.
The problem is that in fact your simulation cycle isn't complete until it's spit out the results.
The second thing that occurs to me is to use non-blocking I/O. Whenever the sim needs to write, it should do so via non-blocking I/O. On the next need to write, it can then pick up the results of the previous I/O operation (possibly incurring a small wait) before starting the new one. This keeps the simulation running as much as possible in parallel with the I/O without letting the simulation get very far ahead of the writing.
The first solution would be better if the simulation processing cycle varies (sometimes smaller than the time for a write, sometimes longer) because on average the writes might keep up with the sim.
If the processing cycle is always (or almost always) going to be shorter than the write time, then you might as well not bother with the pipe and just use non-blocking I/O, because if you use the pipe it will eventually fill up and the sim will get hung up on the I/O anyway.
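One possible shape of that non-blocking write idea, sketched with POSIX AIO (aio_write); whether this is the right asynchronous I/O mechanism depends on your platform, and the single-pending-write policy and function name are assumptions of mine (on glibc, link with -lrt):

#include <aio.h>
#include <errno.h>
#include <string.h>
#include <sys/types.h>

/* One in-flight write at a time: start an asynchronous write, keep
 * simulating, and only wait for it just before issuing the next one. */
static struct aiocb cb;
static int write_pending = 0;

void async_write_step(int fd, const void *buf, size_t len, off_t offset)
{
    if (write_pending) {
        /* Collect the previous write (may incur a small wait). */
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        aio_return(&cb);               /* fetch and clear the result */
    }

    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = (void *)buf;       /* buffer must stay valid until the write completes */
    cb.aio_nbytes = len;
    cb.aio_offset = offset;

    aio_write(&cb);                    /* returns immediately; the write proceeds in the background */
    write_pending = 1;
}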
If you are adding OpenMP to your program, then it is better to use #pragma omp single or #pragma omp master from the parallel section to save to file. These pragmas allow only one thread to execute something. So your code may look like the following:
#pragma omp parallel
{
    // Calculate the first part
    Calculate();

    // Use a barrier to wait for all threads
    #pragma omp barrier
    #pragma omp master
    SaveFirstPartOfResults();

    // Calculate the second part
    Calculate2();

    #pragma omp barrier
    #pragma omp master
    SaveSecondPart();

    Calculate3();

    // ... and so on
}
Here the team of threads will do the calculation, but only a single thread will save results to disk.
It looks like a software pipeline. I suggest you consider the tbb::pipeline pattern from the Intel Threading Building Blocks library. I may refer you to the tutorial on software pipelines at http://cache-www.intel.com/cd/00/00/30/11/301132_301132.pdf#page=25. Please read paragraph 4.2. They solved this problem: one thread reads from the drive, a second one processes the read strings, and a third one saves to the drive.
Since you are CPU and IO bound: let me guess, there is still plenty of memory available, right?
If so, you should buffer the data that has to be written to disk in memory to a certain extent. Writing huge chunks of data is usually a lot faster than writing small pieces.
For the writing itself: consider using memory-mapped IO. It's been a while since I've benchmarked it, but last time I did it was significantly faster.
Also, you can always trade off CPU vs. IO a bit. I think you're currently writing the data as some kind of raw, uncompressed data, right? You may get some IO performance if you use a simple compression scheme to reduce the amount of data to be written. The ZLIB library is pretty easy to work with and compresses very fast on the lowest compression level. It depends on the nature of your data, but if there is a lot of redundancy in it, even a very crude compression algorithm may eliminate the IO-bound problem.
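For example, a small sketch using zlib's one-shot compress2() at its fastest level before writing a block out; the length-prefix framing and the function name are just illustrative choices:

#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

/* Compress a block with zlib at the fastest level before writing it out.
 * compressBound() gives the worst-case output size for a given input. */
static int write_compressed(FILE *out, const unsigned char *data, uLong len)
{
    uLongf out_len = compressBound(len);
    unsigned char *out_buf = malloc(out_len);
    if (!out_buf)
        return -1;

    if (compress2(out_buf, &out_len, data, len, Z_BEST_SPEED) != Z_OK) {
        free(out_buf);
        return -1;
    }

    /* Store the compressed size first so the block can be read back later. */
    fwrite(&out_len, sizeof out_len, 1, out);
    fwrite(out_buf, 1, out_len, out);
    free(out_buf);
    return 0;
}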
One thread continually executes a step of the computationally-intensive process and then adds the partial result to a queue of partial results. Another thread continually removes partial results from the queue and writes them to disk. Make sure to synchronize access to the queue. A queue is a list-like data structure where you can add items to the end and remove items from the front.
Make your application have two threads, one for CPU and one for the hard disk.
Have the CPU thread push completed data into a queue which the hard disk thread then pulls from as data comes in.
This way the CPU just gets rid of the data and lets someone else handle it, and the hard drive just patiently waits for any data in its queue.
Implementation wise, you could do the queue as a shared memory type of object, but I think a pipe would be exactly what you would be looking for. The CPU simply writes to the pipe when needed. On the hard disk side, you would just read the pipe and whenever you got valid data, proceed from there.
