When considering performance as the only factor, for extremely fast addition in a multithreaded context, is it better to use the GCC builtin sync / atomic operations to add to a single variable, or is it more performant to add to a single counter per thread?
For example, if I have 8 threads, where a total count of processed items must be incremented (at an extremely high rate), would it be better to have a single variable and increment it from each thread using the atomic operations, or would it be better to have 8 separate variables, one for each thread, and then aggregate the data from the 8 variables at some interval?
It would most likely be much faster for each thread to do its work separately and then aggregate it at the end. ADD instructions are some of the simplest in the instruction set and run very quickly (~1 clock cycle). The overhead to lock a mutex or similar would be larger than the actual computation. Perhaps more importantly, if it's not shared the counter can reside in a register instead of in main memory which is also significantly faster.
In general, it's both faster and easier to avoid sharing state unless you have to.
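For illustration, here is a minimal sketch of the two approaches using GCC builtins; the names (NTHREADS, counters, count_*) are mine, not from any particular codebase:

    #include <stdint.h>

    #define NTHREADS 8

    /* Option 1: one shared counter, bumped with an atomic read-modify-write.
       Correct, but every increment contends for the same cache line. */
    uint64_t shared_count;
    static inline void count_shared(void) {
        __atomic_fetch_add(&shared_count, 1, __ATOMIC_RELAXED);
    }

    /* Option 2: one counter per thread, padded so counters don't share a
       cache line (no false sharing), aggregated only when the total is needed. */
    struct percpu { uint64_t n; char pad[64 - sizeof(uint64_t)]; };
    struct percpu counters[NTHREADS];

    static inline void count_local(int tid) {
        counters[tid].n++;   /* plain add on a thread-private slot; if another thread
                                aggregates while counting, make this a relaxed
                                __atomic store to stay within ISO C */
    }

    uint64_t aggregate(void) {
        uint64_t total = 0;
        for (int i = 0; i < NTHREADS; i++)
            total += __atomic_load_n(&counters[i].n, __ATOMIC_RELAXED);
        return total;
    }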
Related
I'm working on an assignment in an operating systems course on Xv6. I need to implement a data status structure for a process: its creation time, termination time, sleep time, etc.
As of now I've decided to use the ticks variable directly, without taking tickslock, because it doesn't seem like a good idea to take a lock and slow the system down for such a low-priority objective.
Since the ticks variable is only ever updated like so: ticks++, is there a way I could retrieve the current number of ticks and get a wrong number?
I don't mind the number being off by +-10 ticks, but could it ever be really off? For example, when the number 01111111111111111 is incremented, two bytes have to change. So my question is this: is it possible that the CPU stores the data in stages, and that another CPU can fetch the value at that memory location between the start and the completion of the store?
So as I see it, whether the compiler emits a mov instruction or an inc instruction, what I want to know is whether the store can be observed partway between its start and its end.
There's no problem in asm: aligned loads/stores done with a single instruction on x86 are atomic up to qword (8-byte) width. Why is integer assignment on a naturally aligned variable atomic on x86?
(On 486, the guarantee is only for 4-byte aligned values, and maybe not even that for 386, so possibly this is why Xv6 uses locking? I'm not sure if it's supposed to be multi-core safe on 386; my understanding is that the rare 386 SMP machines didn't exactly implement the modern x86 memory model (memory ordering and so on).)
But C is not asm. Using a plain non-atomic variable from multiple "threads" at once is undefined behaviour, unless all threads are only reading. This means compilers can assume that a normal C variable isn't changed asynchronously by other threads.
Using ticks in a loop in C will let the compiler read it once and keep using the same value repeatedly. You need a READ_ONCE macro like the Linux kernel uses, e.g. *(volatile int*)&ticks. Or simply declare it as volatile unsigned ticks;
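A minimal sketch of such a macro and how it might be used, assuming an Xv6-style global unsigned ticks (typeof is a GCC extension, which is fine here since Xv6 is built with GCC):

    /* Force a fresh load of x on every use, like the Linux kernel's READ_ONCE */
    #define READ_ONCE(x)  (*(volatile typeof(x) *)&(x))

    extern unsigned int ticks;          /* declared as a plain (non-volatile) uint in Xv6 */

    void wait_ten_ticks(void) {
        unsigned int start = READ_ONCE(ticks);
        while (READ_ONCE(ticks) - start < 10)
            ;   /* without the volatile cast the compiler may hoist the load
                   out of the loop and spin forever on a stale value */
    }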
For a variable narrow enough to fit in one integer register, it's probably safe to assume that a sane compiler will write it with a single dword store, whether that's a mov or a memory-destination inc or add dword [mem], 1. (You can't assume that a compiler will use a memory-destination inc/add, though, so you can't depend on an increment being single-core-atomic with respect to interrupts.)
With one writer and multiple readers, yes the readers can simply read it without any need for any kind of locking, as long as they use volatile.
Even in portable ISO C, volatile sig_atomic_t has some very limited guarantees of working safely when written by a signal handler and read by the thread that ran the signal handler. (Not necessarily by other threads, though: in ISO C volatile doesn't avoid data-race UB. But in practice on x86 with non-hostile compilers it's fine.)
(POSIX signals are the user-space equivalent of interrupts.)
See also Can num++ be atomic for 'int num'?
For one thread to publish a wider counter in two halves, you'd usually use a SeqLock. With 1 writer and multiple readers, there's no actual locking, just retry by the readers if a write overlapped with their read. See Implementing 64 bit atomic counter with 32 bit atomics
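As a rough sketch (not Xv6 code), a SeqLock-published 64-bit counter with one writer and many readers could look like this with GCC __atomic builtins; all names here are illustrative:

    #include <stdint.h>

    static unsigned seq;            /* even = stable, odd = write in progress */
    static uint32_t lo, hi;         /* the two 32-bit halves of the counter   */

    void counter_store(uint64_t val)               /* single writer only */
    {
        __atomic_store_n(&seq, seq + 1, __ATOMIC_RELAXED);   /* seq becomes odd */
        __atomic_thread_fence(__ATOMIC_RELEASE);             /* seq before data */
        __atomic_store_n(&lo, (uint32_t)val, __ATOMIC_RELAXED);
        __atomic_store_n(&hi, (uint32_t)(val >> 32), __ATOMIC_RELAXED);
        __atomic_store_n(&seq, seq + 1, __ATOMIC_RELEASE);   /* data before seq; even again */
    }

    uint64_t counter_load(void)                    /* any number of readers */
    {
        unsigned s0, s1;
        uint32_t l, h;
        do {
            s0 = __atomic_load_n(&seq, __ATOMIC_ACQUIRE);
            l = __atomic_load_n(&lo, __ATOMIC_RELAXED);
            h = __atomic_load_n(&hi, __ATOMIC_RELAXED);
            __atomic_thread_fence(__ATOMIC_ACQUIRE);         /* data loads before re-check */
            s1 = __atomic_load_n(&seq, __ATOMIC_RELAXED);
        } while ((s0 & 1) || s0 != s1);            /* retry if a write was in progress */
        return ((uint64_t)h << 32) | l;
    }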
First, using locks or not isn't a matter of whether your objective is low priority or not, but a matter of solving a race condition.
Second, in the specific case you describe, it is safe to read the ticks variable without any lock. This is not a problematic race, because two separate CPUs cannot access the same RAM region (here even the same address) at exactly the same time, and because a write to ticks only increments the value by 1 rather than making any major change you would really miss.
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to give a first overview of how threads are managed by the system. Nowadays all processors are actually multi-core, with multiple hardware threads per core, but for the sake of simplicity let's first imagine a single-core processor with a single hardware thread. It is physically limited to performing only one task at a time, yet we are still capable of running multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches between one and the other, giving the illusion of multitasking. This process of changing from one task to another is called a context switch.
During a context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, that data can be saved in registers, cache, RAM, etc. The more the technology advances, the more performant the solutions that have been found. When the task is resumed, its data is fetched back and the task continues its operations.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As previously said, switching between one task and another requires a context switch. Performing one means storing and fetching data, and those operations are pure overhead for your computation: they don't directly give you any advantage. So having too many tasks requires a lot of context switching, which means a lot of computational time wasted! In the end your work might run slower than it would with fewer tasks.
Also, since you tagged this question with pthreads, it is necessary to check that the code is compiled to run on multiple hardware cores. Having a multi-core CPU does not guarantee that your multithreaded code will actually run on multiple hardware cores!
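As a hedged sketch of what that can look like with pthreads: pick the thread count from the number of online cores and split the grid's rows between workers. sysconf() is POSIX; everything else here (worker, struct slice, the row count) is made up for illustration.

    #include <pthread.h>
    #include <unistd.h>

    #define MAX_THREADS 64

    struct slice { int row_begin, row_end; /* plus a pointer to the grid, etc. */ };

    static void *worker(void *arg) {
        struct slice *s = arg;
        for (int r = s->row_begin; r < s->row_end; r++) {
            /* process row r of the grid */
        }
        return 0;
    }

    int main(void) {
        long ncores = sysconf(_SC_NPROCESSORS_ONLN);    /* hardware threads available */
        int nthreads = (ncores > 0) ? (int)ncores : 1;  /* more threads than this mostly
                                                           adds context-switch overhead */
        if (nthreads > MAX_THREADS) nthreads = MAX_THREADS;

        int nrows = 1000;                               /* N, for illustration */
        pthread_t tid[MAX_THREADS];
        struct slice part[MAX_THREADS];

        for (int i = 0; i < nthreads; i++) {
            part[i].row_begin = i * nrows / nthreads;
            part[i].row_end   = (i + 1) * nrows / nthreads;
            pthread_create(&tid[i], 0, worker, &part[i]);
        }
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], 0);
        return 0;
    }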
In the particular case of your application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent, data-independent computation. This sort of task runs very well on a GPU, since the operations have no data dependencies between them and the concurrency is handled in hardware (modern GPUs have thousands of compute cores!).
I am working on a project with OpenMP. In this project I need to do a computation, in particular:
    gap = (DEFAULT_HEIGHT / nthreads);
where DEFAULT_HEIGHT is a constant and nthreads is the number of threads inside my parallel region. My problem is that I can't compute the variable gap outside the parallel region, because I need to be inside it to know nthreads. But on the other hand I don't want to compute gap in every thread. Moreover, I can't write code like this:
    if (tid == 0) {
        gap = (DEFAULT_HEIGHT / nthreads);
    }
because I don't know the order of execution of the threads, so it could be that thread 0 starts last and all my other computations that need gap will be wrong (because it will not have been set yet). So, is there a way to do this computation only once without this problem?
Thanks
Ensure that gap is a shared variable and enclose it in an OpenMP single directive, something like
    #pragma omp single
    {
        gap = (DEFAULT_HEIGHT / nthreads);
    }
Only one thread will execute the code enclosed in the single directive; the other threads will wait at the end of the enclosed block of code.
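For context, a minimal sketch of that suggestion inside a full parallel region (DEFAULT_HEIGHT and the per-thread band computation are placeholders, not your actual code):

    #include <omp.h>

    #define DEFAULT_HEIGHT 1024

    int main(void) {
        int gap;                            /* declared outside: shared by default */
        #pragma omp parallel
        {
            int nthreads = omp_get_num_threads();
            #pragma omp single
            {
                gap = DEFAULT_HEIGHT / nthreads;   /* executed by exactly one thread */
            }
            /* implicit barrier at the end of single: every thread sees gap here */
            int tid = omp_get_thread_num();
            int start = tid * gap;                 /* e.g. each thread takes its own band */
            (void)start;
        }
        return 0;
    }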
An alternative would be to make gap private and let all threads compute their own value. This might be faster, since the single option requires some synchronisation, which always takes time. If you are concerned, try both and compare the results. (I think this is what ComicSansMS is suggesting.)
Here's the tradeoff: If only one thread does the computation you have to synchronize access to that value. In particular also threads that only want to read the value do have to synchronize, as they usually have no other means of determining whether the write is already finished. If initialization of the variable is so expensive that it can compensate for this, you should go for it. But it's probably not.
Keep in mind that you can do a lot of computation on the CPU in the time it takes to fetch data from memory. Synchronizing this access properly will eat away additional cycles and can lead to undesired stalling effects. Even worse, the impact of such effects usually increases drastically with the number of threads sharing the resource.
It's not uncommon to accept some redundancy in parallel computation as the synchronization overhead easily nullifies any benefits from saved computation time for redundant data.
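The redundant alternative described above would look roughly like this (a fragment, assuming the same DEFAULT_HEIGHT and omp.h as in the sketch above): each thread does its own cheap division and no synchronisation is needed at all.

    #pragma omp parallel
    {
        /* gap is private because it is declared inside the parallel region */
        int gap = DEFAULT_HEIGHT / omp_get_num_threads();
        /* ... use gap ... */
    }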
I've implemented the Barnes-Hut gravity algorithm in C as follows:
Build a tree of clustered stars.
For each star, traverse the tree and apply the gravitational forces from each applicable node.
Update the star velocities and positions.
Stage 2 is the most expensive stage, and so is implemented in parallel by dividing the set of stars. E.g. with 1000 stars and 2 threads, I have one thread processing the first 500 stars and the second thread processing the second 500.
In practice this works: it speeds the computation by about 30% with two threads on a two-core machine, compared to the non-threaded version. Additionally, it yields the same numerical results as the original non-threaded version.
My concern is that the two threads are accessing the same resource (namely, the tree) simultaneously. I have not added any synchronisation to the thread workers, so it's likely they will attempt to read from the same location at some point. Although access to the tree is strictly read-only I am not 100% sure it's safe. It has worked when I've tested it but I know this is no guarantee of correctness!
Questions
Do I need to make a private copy of the tree for each thread?
Even if it is safe, are there performance problems of accessing the same memory from multiple threads?
Update: Benchmark results for the curious:
Machine: Intel Atom CPU N270 @ 1.60GHz (cpu MHz 800), cache size 512 KB
    Threads    real     user     sys
    0          69.056   67.324    1.720
    1          76.821   66.268    5.296
    2          50.272   63.608   10.585
    3          55.510   55.907   13.169
    4          49.789   43.291   29.838
    5          54.245   41.423   31.094
0 means no threading at all; 1 and above means spawn that many worker threads and have the main thread wait for them. I would not expect much of an improvement beyond 2 threads, since the work is entirely CPU-bound and that's how many cores there are. It's interesting that an odd number of threads is slightly worse than an even number.
Looking at sys, it's apparent that there's a cost to creating threads. Currently the threads are created for each frame (so N*1000 thread creations). This was easy to program (during my 15 minutes on the train this morning). I'll need to think a bit about how to reuse threads...
Update #2: I've made it use a pool of threads, synchronised with two barriers. This has no noticeable performance advantage over recreating the threads each frame.
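For reference, a sketch of such a pool-plus-two-barriers scheme with pthreads; run_frames, worker and the stage comments are illustrative names, not the actual code, and it assumes nthreads stays within MAX_WORKERS:

    #include <pthread.h>

    #define MAX_WORKERS 16

    static pthread_barrier_t frame_start, frame_done;
    static int running = 1;

    static void *worker(void *arg) {
        long id = (long)arg;
        for (;;) {
            pthread_barrier_wait(&frame_start);   /* wait for the main thread to publish a frame */
            if (!running)
                break;
            /* stage 2: apply forces to this worker's slice of the stars (id) */
            (void)id;
            pthread_barrier_wait(&frame_done);    /* report that this slice is finished */
        }
        return 0;
    }

    void run_frames(int nthreads, int nframes) {
        pthread_t tid[MAX_WORKERS];
        pthread_barrier_init(&frame_start, 0, nthreads + 1);  /* +1 for the main thread */
        pthread_barrier_init(&frame_done,  0, nthreads + 1);

        for (long i = 0; i < nthreads; i++)
            pthread_create(&tid[i], 0, worker, (void *)i);

        for (int f = 0; f < nframes; f++) {
            /* stage 1: build the tree (single-threaded) */
            pthread_barrier_wait(&frame_start);   /* release the workers */
            pthread_barrier_wait(&frame_done);    /* wait for stage 2 to complete */
            /* stage 3: update velocities and positions */
        }

        running = 0;
        pthread_barrier_wait(&frame_start);       /* wake the workers so they can exit */
        for (int i = 0; i < nthreads; i++)
            pthread_join(tid[i], 0);
    }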
You don't specify how your data is structured, but in general reading memory from multiple threads simultaneously is safe and does not introduce any performance issues. You only get problems if someone is writing.
It is interesting that you say you're only getting a 30% speedup out of two threads. If you have an otherwise idle machine, two or more CPUs and only read-only shared data (i.e. no synchronization), I would expect to see much closer to a 50% speed improvement. This suggests that your operation is actually completing so quickly that the overhead of creating the threads is becoming significant in your numbers. Are you running on a hyperthreaded CPU?
If your data is read-only, then no, you do not need to make a private copy of the tree for each thread. This is the biggest advantage that a shared memory threading model offers!
I'm not aware of any performance problems with such a model. If anything, it should be faster depending on if your CPUs can share some of their cache.
I have a large tree structure on which several threads are working at the same time. Ideally, I would like to have an individual mutex lock for each cell.
I looked at the definition of pthread_mutex_t in bits/pthreadtypes.h and it is fairly short, so the memory usage should not be an issue in my case.
However, is there any performance penalty when using many (let's say a few thousand) different pthread_mutex_ts for only 8 threads?
If you are locking and unlocking very frequently, there can be a penalty, since obtaining and releasing locks does take some time, and can take a fair amount of time if the locks are contended.
When using many locks in a structure like this, you will have to be very specific about what each lock actually locks, and make sure you are careful of AB-BA deadlocks. For example, if you are changing the tree's structure during a locking operation, you will need to lock all the nodes that will be changed, in a consistent order, and make sure that threads working on descendants do not become confused.
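One common way to keep the order consistent when two nodes must be held at once is to always lock in address order; a tiny sketch (illustrative only, and it assumes the two mutexes are distinct):

    #include <pthread.h>
    #include <stdint.h>

    /* Acquire two node mutexes in a fixed (address) order so that two threads
       locking the same pair can never deadlock against each other. */
    void lock_pair(pthread_mutex_t *a, pthread_mutex_t *b) {
        if ((uintptr_t)a < (uintptr_t)b) {
            pthread_mutex_lock(a);
            pthread_mutex_lock(b);
        } else {
            pthread_mutex_lock(b);
            pthread_mutex_lock(a);
        }
    }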
If you have a very large number of locks, spread out across memory, caching issues could cause performance problems, depending on the architecture, as locking operations will generally invalidate at least some part of the cache.
Your best bet is probably to implement a simple locking structure, then profile it, then refine it to improve performance, if necessary. I'm not sure what you're doing with the tree, but a good place to start might be a single reader-writer lock for the whole tree, if you expect to read much more than you update.
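A sketch of that starting point with a pthreads reader-writer lock; struct node, tree_search and tree_insert are placeholders for your own types and operations:

    #include <pthread.h>

    struct node;                                    /* your tree node type */
    void *tree_search(struct node *root, int key);  /* placeholders for your operations */
    void  tree_insert(struct node *root, int key, void *value);

    static pthread_rwlock_t tree_lock = PTHREAD_RWLOCK_INITIALIZER;

    void *lookup(struct node *root, int key) {
        pthread_rwlock_rdlock(&tree_lock);          /* many readers may hold this at once */
        void *result = tree_search(root, key);
        pthread_rwlock_unlock(&tree_lock);
        return result;
    }

    void update(struct node *root, int key, void *value) {
        pthread_rwlock_wrlock(&tree_lock);          /* writers get exclusive access */
        tree_insert(root, key, value);
        pthread_rwlock_unlock(&tree_lock);
    }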
"We should forget about small efficiencies, say about 97% of the time: premature optimization is the root of all evil."
-- Donald Knuth
Your locking/access patterns need to be stated in order to properly evaluate this. If each thread only holds one or a few locks at a time, and the probability that two or more threads want the same lock at the same time is low (either a random access pattern, or 8 runners at different positions on a circular track moving at roughly the same speed, or other more complicated situations), then you will mostly avoid the worst case, where a thread has to sleep to get a lock (or in some cases get the OS involved to decide who wins), simply because you have so few threads and so many locks.
If each thread might want hundreds or thousands of locks at any one time then things will start to change.
I won't touch deadlock avoidance because I don't know anything about the container that you are using, but you need to be aware of the need to avoid them.