OpenMP not speeding up parallel loop - c

I have the following embarrassingly parallel loop:
//#pragma omp parallel for
for(i=0; i<tot; i++)
    pointer[i] = val;
Why does uncommenting the #pragma line cause performance to drop? I'm getting a slight increase in program run time when I use OpenMP to parallelize this for loop. Since each access is independent, shouldn't it greatly increase the speed of the program?
Is it possible that if this for loop isn't run for large values of tot, the overhead is slowing things down?

Achieving performance with multiple threads in a Shared Memory environment usually depends on:
The task granularity;
Load balance between parallel tasks;
The number of parallel tasks relative to the number of cores used;
The amount of synchronization among parallel tasks;
The type of bound of the algorithm;
The machine architecture.
I will give a brief overview of each of the aforementioned points.
You need to check whether the granularity of the parallel tasks is large enough to overcome the overhead of parallelization (e.g., thread creation and synchronization). The number of iterations of your loop, combined with how little work pointer[i] = val; does, may not be enough to justify the overhead of thread creation. Note, however, that too coarse a task granularity can also cause problems, for instance load imbalance.
You have to test the load balance (the amount of work per thread). Ideally, each thread should compute the same amount of work. In your code example this is not problematic;
Are you using hyper-threading? Are you using more threads than cores? If so, the threads will start "competing" for resources, and this can lead to a drop in performance;
Usually, one wants to reduce the amount of synchronization among threads. Consequently, sometimes one uses finer-grain synchronization mechanisms and even data redundancy (among other approaches) to achieve that. Your code does not have this issue.
Before attempting to parallelize your code, you should analyze whether it is memory-bound, CPU-bound, and so on. If it is memory-bound, you may want to start by improving cache usage before tackling the parallelization. Using a profiler is highly recommended for this task.
To extract the most out of the underlying architecture, the multi-threaded approach needs to account for the constraints of that architecture. For example, implementing an efficient multi-threaded approach for an SMP architecture is different from implementing one for a NUMA architecture, since in the latter one has to take memory affinity into account.
EDIT: Suggestion from @Hristo Iliev
Thread affinity: "Binding threads to cores improves performance in general and even more on NUMA systems since it improves data locality."
Btw, I recommend reading the Intel Guide for Developing Multithreaded Applications.

Related

Modern System Architecture?

What could happen if we used Peterson's solution to the critical section problem on a modern computer? It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems? Are there any advantages to using semaphores VS mutex locks?
Hey, interesting question! To understand the answer, you first need to be clear about the terms. The critical section is just the part of a program that should not be executed concurrently by more than one of that program's processes or threads at a time. Multiple concurrent accesses are not allowed, which means that only one process is interacting with the shared resource at a time. Typically this "critical section" accesses a resource like a data structure or a network connection.
Mutual Exclusion or mutex just describes the requirement that only one concurrent process is in the critical section at a time, so concurrent access to shared data must ensure this "mutual exclusion".
So this introduces the problem! How do we ensure that processes run completely independently of other processes? In other words, how do we ensure "atomic access" to the various critical sections by the threads?
There are a few solutions to the "critical-section problem" but the one you mention is Peterson's solution so we will discuss that.
Peterson's algorithm is designed for mutual exclusion and allows two tasks to share a single-use resource. They use shared memory for communicating.
In the algorithm, two tasks compete for the critical section. You'll have to look into mutual exclusion, bounded waiting, and the other properties a bit more for a full understanding, but the gist of it is that in Peterson's method a process waits at most one turn to enter the critical section: if it gives priority to the other task, that task runs its critical section to completion and thereby allows the first process to enter.
That is the original solution proposed.
However, this has no guarantee of working on today's multiprocessor architectures, and it only works for two concurrent tasks. Modern CPUs may execute memory reads and writes out of order, so operations that look sequential in the code can take effect in a different order, and the algorithm's assumptions no longer hold unless that reordering is prevented. I suggest you also take a look at locks. Hope that helps :)
Can anyone else think of anything to add that I might have missed?
It is my understanding that systems with multiple CPUs can run into difficulty because of the ordering of memory reads and writes with respect to other reads and writes in memory, but is this the problem with most modern systems?
No. Any modern systems with "less strict" memory ordering will have ways to make the memory ordering more strict where it matters (e.g. fences).
Are there any advantages to using semaphores VS mutex locks?
Mutexes are typically simpler and faster (in the same way that a boolean is simpler than a counter); but ignoring overhead a mutex is equivalent to a semaphore with "resource count = 1".
What could happen if we used Peterson's solution to the critical section problem on a modern computer?
The big problem here is that most modern operating systems support some kind of multi-tasking (e.g. multiple processes, where each process can have multiple threads), there's usually 100 other processes (just for the OS alone), and modern hardware has power management (where you try to avoid power consumption by putting CPUs to sleep when they can't do useful work). This means that (unbounded) spinning/busy waiting is a horrible idea (e.g. you can have N CPUs being wasted spinning/trying to acquire a lock while the task that currently holds the lock isn't running on any CPU because the scheduler decided that 1234 other tasks should get 10 ms of CPU time each).
Instead; to avoid (excessive) spinning you want to ask the scheduler to block your task until/unless the lock actually can be acquired; and (especially for heavily contended locks) you probably want "fairness" (to avoid the risk of timing problems that lead to some tasks being repeatedly lucky while other tasks starve and make no progress).
This ends up being "no spinning", or "brief spinning" (to avoid scheduler overhead in cases where the task holding the lock actually can/does release it quickly); followed by the task being put on a FIFO queue and the scheduler giving the CPU to a different task or putting the CPU to sleep; where if the lock is released the scheduler wakes up the first task on the FIFO queue. Of course it's never that simple (e.g. for performance you want to do as much as you can in user-space; and you need special care and cooperation between user-space and kernel to avoid race conditions - the lock being released before a task is put on the wait queue).
Fortunately modern systems also provide simpler ways to implement locks (e.g. "atomic compare and swap"), so there's no need to resort to Peterson's algorithm (even if it's just for insertion/removal of tasks from the real lock's FIFO queue).

At what point does adding more threads stop helping?

I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
I think the best way to answer is to first give an overview of how threads are managed by the system. Nowadays nearly all processors are multi-core, often with multiple hardware threads per core, but for the sake of simplicity let's first imagine a single-core processor with a single hardware thread. Such a processor is physically limited to performing a single task at a time, yet we are still able to run multitasking programs.
So how is this possible? Well, it is simply an illusion!
The CPU is still performing a single task at a time, but it switches between one task and another, giving the illusion of multitasking. This process of changing from one task to the other is named Context switching.
During a Context switch, all the data related to the running task is saved and the data related to the next task is loaded. Depending on the architecture of the CPU, this data can be held in registers, cache, RAM, etc. As technology has advanced, faster solutions have been found. When a task is resumed, its saved data is restored and the task continues where it left off.
This concept introduces many issues in managing tasks, like:
Race condition
Synchronization
Starvation
Deadlock
There are other points, but this is just a quick list since the question does not focus on this.
Getting back to your question:
If we have an optional parameter to select the number of threads used, what number of threads would best optimize the busy work to make it run as quickly as possible?
Would using 4 threads be 4 times as fast as using 1 thread? What about 15 threads? 50? At some point I feel like we will be limited by the hardware (number of cores) in our computer and adding more threads will stop helping (and might even hinder?)
Short answer: It depends!
As previously said, switching from one task to another requires a Context switch. This involves storing and fetching data, and these operations are pure overhead for your computation; they give you no direct benefit. Having too many tasks therefore requires a lot of Context switching, which means a lot of computational time is wasted. In the end, your program might run slower than it would with fewer tasks.
Also, since you tagged this question with pthreads, it is also necessary to check that the code is compiled to run on multiple HW cores. Having a multi-core CPU does not guarantee that your multitasking code will run on multiple HW cores!
In your particular case of application:
I have a computer with 4 cores, and I have a program that creates an N x M grid, which could range from a 1 by 1 square up to a massive grid. The program then fills it with numbers and performs calculations on each number, averaging them all together until they reach roughly the same number. The purpose of this is to create a LOT of busy work, so that computing with parallel threads is a necessity.
This is a good example of concurrent, data-independent computing. Tasks of this sort run well on a GPU, since the operations have no data dependencies and the concurrency is implemented in hardware (modern GPUs have thousands of compute cores!)

pthreads offer no performance increase when using virtual cores

I am playing around with pthreads for the first time and have noticed something strange when running on my machine.
I have an Intel i5 with 2 physical cores and 4 virtual cores.
When running my program with 2 threads, I get roughly double the performance, yet when running with 4 threads, I get the same performance as two threads. Why is this the case?
Results with 2 threads:
real 0m9.335s
user 0m18.233s
sys 0m0.132s
Results with 4 threads:
real 0m9.427s
user 0m34.130s
sys 0m0.180s
Edit: The code is fully parallelizable and the threads are running independently without any shared resources.
Because you only really have 2 cores. Hyper-threading will not magically create 2 more cores for you. Hyper-threading makes it possible to run 4 threads on the CPU but not simultaneously. It will still allocate the threads on the two physical cores and switch the threads back and forth in the execution pipeline.
The performance increase you may expect is at BEST 30%.
Keep in mind that hyperthreading is basically a way of reusing spare execution units on the CPU for a separate thread of execution. You're still working with the horsepower of two cores, it's just split four ways.
If your code is optimized such that it fully utilizes most of the available EUs, there's no spare resources left once it's running on both physical cores, so the hyperthreaded cores can't do any better.
This old article from when HyperThreading (HT) was first introduced provides a lot of details on how it works (though I'm sure many improvements have been made over the last 10 years). http://www.intel.com/technology/itj/2002/volume06issue01/vol6iss1_hyper_threading_technology.pdf:
Each logical processor maintains a complete set of the architecture state. The architecture state consists of registers including the general-purpose registers, the control registers, the advanced programmable interrupt controller (APIC) registers, and some machine state registers. From a software perspective, once the architecture state is duplicated, the processor appears to be two processors. The number of transistors to store the architecture state is an extremely small fraction of the total.
However, the following sentence shows where HT can bottleneck:
Logical processors share nearly all other resources on the physical processor, such as caches, execution units, branch predictors, control logic, and buses.
If the executing threads each keep one or more of those shared resources (such as the execution units or buses) 100% busy, then hyperthreading will not improve throughput. Since benchmarks often exercise one aspect of a system (intentionally or not), it's not surprising that one of these shared processor resources would end up being a bottleneck and prevent HT from showing a benefit.
The performance gain when using multiple threads is very difficult to determine. Hyperthreading is also "less than one extra core" in performance for sure.
Besides that, you may run into memory throughput issues, or your code may be contending over locks or the like now that you have more threads. Even if your own code is lock-free, that doesn't mean that, for example, I/O or some functions you call are able to run completely in parallel; there are sometimes "hidden" shared resources.
But most likely, your processor just can't go any faster.

Mutex vs busy wait for tcp io

I do not care about being a CPU hog, as I have one thread assigned to each core and the system threads blocked off to their own set. My understanding is that a mutex is of use when other tasks are to run; in this case that is not important, so I am considering having a consumer thread loop on an address in memory, waiting for its value to become non-zero, which would mean that the single producer thread (which loops calling recv() with TCP_NONBLOCK set) has just deposited information there.
Is my implementation a smart one given my circumstances, or should I be using a mutex or custom interrupt even though no other tasks will run?
In addition to points by @ugoren and comments by others:
Even if you have a valid use-case for busy-waiting and burning a core, which is admittedly rare, you need to:
Protect the data shared between threads. This is where locks come into play: you need mutual exclusion when accessing any complex shared data structure. People tend to look into lock-free algorithms here, but these are far from obvious, error-prone, and still considered deep black magic. Don't even try them until you have a solid understanding of concurrency.
Notify threads about changed state. This is where you'd use condition variables or monitors. There are other methods too, eventfd(2) on Linux, for example.
Here are some links to show you that it's much harder than you seem to think:
Memory Ordering
Out-of-order execution
ABA problem
Cache coherence
Busy-wait can give you a lower latency and somewhat better performance in some cases.
Letting other threads use the CPU is the obvious reason not to do it, but there are others:
You consume more power. An idle CPU goes into a low power state, reducing consumption very significantly. Power consumption is a major issue in data centers, and any serious application must not waste power.
If your code runs in a virtual machine (and everything is being virtualized these days), your machine competes for CPU with others. Consuming 100% CPU leaves less for the others, and may cause the hypervisor to give your machine less CPU when it's really needed.
You should always stick to mainstream methods, unless there's a good reason not to. In this case, the mainstream is to use select or poll (or epoll). This lets you do other stuff while waiting, if you want, and doesn't waste CPU time. Is the performance difference large enough to justify busy wait?

How can I evaluate the performance of a lockless queue?

I have implemented a lockless queue using the hazard pointer methodology explained in http://www.research.ibm.com/people/m/michael/ieeetpds-2004.pdf using GCC CAS instructions for the implementation and pthread local storage for thread local structures.
I'm now trying to evaluate the performance of the code I have written; in particular, I'm trying to compare this implementation with one that uses locks (pthread mutexes) to protect the queue.
I'm asking this question here because when I compared it with the "locked" queue, I found that the locked queue performs better than the lockless implementation. The only test I tried creates 4 threads on a 4-core x86_64 machine doing 10.000.000 random operations on the queue, and the locked version is significantly faster than the lockless one.
I want to know if you can suggest an approach to follow, i.e. what kind of operations I should test on the queue and what kind of tool I can use to see where my lockless code is wasting its time.
I also want to understand whether it is possible that the performance is worse for the lockless queue just because 4 threads are not enough to see a major improvement...
Thanks
First point: lock-free programming doesn't necessarily improve speed. Lock-free programming (when done correctly) guarantees forward progress. When you use locks, it's possible for one thread to crash (e.g., go into an infinite loop) while holding a mutex. When/if that happens, no other thread waiting on that mutex can make any more progress. If that mutex is central to normal operation, you may easily have to restart the entire process before any more work can be done at all. With lock-free programming, no such circumstance can arise. Other threads can make forward progress, regardless of what happens in any one thread (see footnote 1).
That said, yes, one of the things you hope for is often better performance -- but to see it, you'll probably need more than four threads. Somewhere in the range of dozens to hundreds of threads would give your lock-free code a much better chance of showing improved performance over a lock-based queue. To really do a lot of good, however, you not only need more threads, but more cores as well -- at least based on what I've seen so far, with four cores and well-written code, there's unlikely to be enough contention over a lock for lock-free programming to show much (if any) performance benefit.
Bottom line: More threads (at least a couple dozen) will improve the chances of the lock-free queue showing a performance benefit, but with only four cores, it won't be terribly surprising if the lock-based queue still keeps up. If you add enough threads and cores, it becomes almost inevitable that the lock-free version will win. The exact number of threads and cores necessary is hard to predict, but you should be thinking in terms of dozens at a minimum.
1 At least with respect to something like a mutex. Something like a fork-bomb that just ate all the system resources might be able to deprive the other threads of enough resources to get anything done -- but some care with things like quotas can usually prevent that as well.
The real question is which workloads you are optimizing for. If contention is rare, lock structures on a modern OS are probably not too bad. They mainly use CAS instructions under the hood as long as they are on the fast path. Since these are highly optimized, it will be difficult to beat them with your own code.
Your own implementation can only win substantially in the contended case. Purely random operations on the queue (you are not too precise in your question) will probably not produce contention if the average queue length is much longer than the number of threads working on it in parallel. So you must ensure that the queue stays short, perhaps by biasing the choice of random operation when the queue is too long or too short. I would also load the system with at least twice as many threads as there are cores. This would ensure that wait times (for memory) don't play in favor of the lock version.
The best way, in my opinion, is to identify hotspots in your lock-based application by profiling the code. Introduce the lockless mechanism and measure the same again. As mentioned already by other posters, there may not be a significant improvement at lower scale (number of threads, application scale, number of cores), but you might see throughput improvements as you scale up the system. This is because deadlock situations have been eliminated and threads are always making forward progress.
Another way of looking at the advantage of lockless schemes is that, to some extent, one decouples system state from application performance, because there is no kernel/scheduler involvement and much of the code is userland, except for CAS, which is a hw instruction.
With locks that are heavily contended, threads block and are scheduled once locks are obtained, which basically means they are placed at the end of the run queue (for a specific prio level). Inadvertently, this links the application to system state, and response time for the app now depends on the run queue length.
Just my 2 cents.
