OpenMP Parallel for-loop showing little performance increase


I am in the process of learning how to use OpenMP in C, and as a HelloWorld exercise I am writing a program to count primes. I then parallelise this as follows:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes ++;
}
I compile this code using gcc -g -Wall -fopenmp -o primes primes.c -lm (-lm for the math.h functions I am using). Then I run this code on an Intel® Core™2 Duo CPU E8400 @ 3.00GHz × 2, and as expected, the performance is better than for a serial program.
The problem, however, comes when I try to run this on a much more powerful machine. (I have also tried to manually set the number of threads to use with num_threads, but this did not change anything.) Counting all the primes up to 10 000 000 gives me the following times (using time):
8-core machine:
real 0m8.230s
user 0m50.425s
sys 0m0.004s
dual-core machine:
real 0m10.846s
user 0m17.233s
sys 0m0.004s
This pattern continues when counting more primes: the machine with more cores shows a slight performance increase, but not as much as I would expect for having so many more cores available. (I would expect 4 times more cores to imply almost 4 times less running time?)
Counting primes up to 50 000 000:
8-core machine:
real 1m29.056s
user 8m11.695s
sys 0m0.017s
dual-core machine:
real 1m51.119s
user 2m50.519s
sys 0m0.060s
If anyone can clarify this for me, it would be much appreciated.
EDIT
This is my prime-checking function.
static int is_prime(int n)
{
    /* handle special cases */
    if (n == 0) return 0;
    else if (n == 1) return 0;
    else if (n == 2) return 1;
    int i;
    for(i=2;i<=(int)(sqrt((double) n));i++)
        if (n%i==0) return 0;
    return 1;
}

This performance is happening because:
is_prime(i) takes longer the higher i gets, and
Your OpenMP implementation uses static scheduling by default for parallel for constructs without the schedule clause, i.e. it chops the for loop into equal sized contiguous chunks.
In other words, the highest-numbered thread is doing all of the hardest operations.
Explicitly selecting a more appropriate scheduling type with the schedule clause allows you to divide work among the threads fairly.
This version will divide the work better:
int numprimes = 0;
#pragma omp parallel for schedule(dynamic, 1) reduction(+:numprimes)
for (i = 1; i <= n; i++)
{
    if (is_prime(i) == true)
        numprimes ++;
}
Information on scheduling syntax is available via MSDN and Wikipedia.
schedule(dynamic, 1) may not be optimal, as High Performance Mark notes in his answer. There is a more in-depth discussion of scheduling granularity in this OpenMP whitepaper.
Thanks also to Jens Gustedt and Mahmoud Fayez for contributing to this answer.

The reason for the apparently poor scaling of your program is, as @naroom has suggested, the variability in the run time of each call to your is_prime function. The run time does not simply increase with the value of i. Your code shows that the test terminates as soon as the first factor of i is found so the longest run times will be for numbers with few (and large) factors, including the prime numbers themselves.
As you've already been told, the default schedule for your parallelisation will parcel out the iterations of the master loop a chunk at a time to the available threads. For your case of 5*10^7 integers to test and 8 cores to use, the first thread will get the integers 1..6250000 to test, the second will get 6250001..12500000 and so on. This will lead to a severely unbalanced load across the threads because, of course, the prime numbers are not uniformly distributed.
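To put rough numbers on that imbalance, here is a small sketch that counts how many modulus operations the question's trial-division test performs for each number, and sums them over the contiguous block each thread would get from an equal-sized split. The function names are made up for this illustration, and i * i <= n is used in place of the sqrt() bound (an assumption that gives the same loop count without needing -lm).

```c
#include <stdio.h>

/* Number of modulus operations the question's trial-division test performs
   for n (i * i <= n stands in for the sqrt() bound). */
long trial_divisions(int n)
{
    long ops = 0;
    if (n < 3) return 0;                /* 0, 1 and 2 are handled as special cases */
    for (int i = 2; (long)i * i <= n; i++) {
        ops++;
        if (n % i == 0) break;          /* first factor found: the test stops here */
    }
    return ops;
}

/* Total work in the contiguous block that thread `tid` of `nthreads`
   would receive when 1..n is split into equal-sized chunks. */
long static_chunk_work(int n, int nthreads, int tid)
{
    int chunk = n / nthreads;
    int lo = tid * chunk + 1;
    int hi = (tid == nthreads - 1) ? n : lo + chunk - 1;
    long total = 0;
    for (int k = lo; k <= hi; k++)
        total += trial_divisions(k);
    return total;
}

/* Print the per-thread work for a given n and thread count. */
void print_static_split(int n, int nthreads)
{
    for (int t = 0; t < nthreads; t++)
        printf("thread %d: %ld divisions\n", t, static_chunk_work(n, nthreads, t));
}
```

Calling print_static_split(10000000, 8) should show the later blocks carrying noticeably more divisions than the first one, which is the imbalance described above.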
Rather than using the default scheduling you should experiment with dynamic scheduling. The following statement tells the run-time to parcel out the iterations of your master loop m iterations at a time to the threads in your computation:
#pragma omp parallel for schedule(dynamic,m)
Once a thread has finished its m iterations it will be given m more to work on. The trick for you is to find the sweet spot for m. Too small and your computation will be dominated by the work that the run time does in parcelling out iterations, too large and your computation will revert to the unbalanced loads that you have already seen.
Take heart though, you will learn some useful lessons about the costs, and benefits, of parallel computation by working through all of this.
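As a starting point for that experiment, here is a minimal sketch that takes the chunk size m as a parameter, so the same binary can be timed with several values. count_primes and its chunk argument are names chosen for this example, and is_prime uses an integer bound instead of sqrt(), which is an assumption rather than the asker's exact code.

```c
/* Trial-division primality test, as in the question, with an integer bound
   instead of sqrt(). */
int is_prime(int n)
{
    if (n < 2) return 0;
    for (int i = 2; (long)i * i <= n; i++)
        if (n % i == 0) return 0;
    return 1;
}

/* Count the primes in 1..n, handing out `chunk` iterations at a time. */
int count_primes(int n, int chunk)
{
    int numprimes = 0;
    #pragma omp parallel for schedule(dynamic, chunk) reduction(+:numprimes)
    for (int i = 1; i <= n; i++)
        if (is_prime(i))
            numprimes++;
    return numprimes;
}
```

Timing count_primes(10000000, m) for, say, m = 1, 16, 256 and 4096 (with omp_get_wtime() or the time command) should give a first idea of where the sweet spot lies on a particular machine.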

I think your code needs to use dynamic scheduling, so that each thread can consume a different number of iterations, because your iterations have different workloads. The current code gives every thread the same number of iterations, which won't help in your case. Try this out please:
int numprimes = 0;
#pragma omp parallel for reduction (+:numprimes) schedule(dynamic,1)
for (i = 1; i <= n; i++){
    if (is_prime(i) == true)
        ++numprimes;
}

Related

Why is there no speed up when using OpenMP to generate random numbers?

I am looking to run a type of monte carlo simulations which require the generation of random numbers, and a set of instructions based on those random numbers.
I wish to make use of parallel processing but when testing my code (written in C) there seems to be an inverse speed up with more cores! I'm not sure what I could be doing wrong. I then copied the code from another answer and still get this effect.
The code, slightly modified from the answer, is
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define NRANDS 1000000

int main() {
    int a[NRANDS];
    #pragma omp parallel default(none) shared(a)
    {
        int i;
        unsigned int myseed = omp_get_thread_num();
        #pragma omp for
        for(i=0; i<NRANDS; i++)
            a[i] = rand_r(&myseed);
    }
    double sum = 0.;
    for (long int i=0; i<NRANDS; i++) {
        sum += a[i];
    }
    printf("sum = %lf\n", sum);
    return 0;
}
I timed each run with the time command in the terminal, and varied the number of threads allowed using export OMP_NUM_THREADS=2 (and likewise for the other thread counts). The output of my terminal is:
Thread total: 1
sum = 1074808568711883.000000
real 0m0,041s
user 0m0,036s
sys 0m0,004s
Thread total: 2
sum = 1074093295878604.000000
real 0m0,037s
user 0m0,058s
sys 0m0,008s
Thread total: 3
sum = 1073700114076905.000000
real 0m0,032s
user 0m0,061s
sys 0m0,010s
Thread total: 4
sum = 1073422298606608.000000
real 0m0,035s
user 0m0,074s
sys 0m0,024s

Note that the time command adds up the time spent on all cores when it prints the user and sys values. Observe that your wall time (real) is nearly constant.
Also, your benchmark is too small. There is a significant cost of creating and managing threads. This overhead may be overshadowing the actual execution time of the random number generation. A million values isn't that many. In other words, the time taken to actually compute the random numbers is so small that it's lost in the noise and dwarfed by the setup/teardown costs. If you generate a whole lot more, you may start to see the advantage due to parallelism.

CPU runs faster than GPU (OpenCL code)

I wrote a code in OpenCL to find the first 5000 prime numbers. Here's that code:
__kernel void dataParallel(__global int* A)
{
    A[0]=2;
    A[1]=3;
    A[2]=5;
    int pnp;//pnp=probable next prime
    int pprime;//previous prime
    int i,j;
    for(i=3;i<5000;i++)
    {
        j=0;
        pprime=A[i-1];
        pnp=pprime+2;
        while((j<i) && A[j]<=sqrt((float)pnp))
        {
            if(pnp%A[j]==0)
            {
                pnp+=2;
                j=0;
            }
            j++;
        }
        A[i]=pnp;
    }
}
Then I found out the execution time of this kernel code using OpenCL profiling. Here's the code:
cl_event event;//link an event when launch a kernel
ret=clEnqueueTask(cmdqueue,kernel,0, NULL, &event);
clWaitForEvents(1, &event);//make sure kernel has finished
clFinish(cmdqueue);//make sure all enqueued tasks finished
//get the profiling data and calculate the kernel execution time
cl_ulong time_start, time_end;
double total_time;
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_START, sizeof(time_start), &time_start, NULL);
clGetEventProfilingInfo(event, CL_PROFILING_COMMAND_END, sizeof(time_end), &time_end, NULL);
//total_time = (cl_double)(time_end - time_start)*(cl_double)(1e-06);
printf("OpenCl Execution time is: %10.5f[ms] \n",(time_end - time_start)/1000000.0);
I ran these codes on various devices and this is what I got:
Platform:Intel(R) OpenCL
Device:Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
OpenCl Execution time is: 3.54796[ms]
Platform:AMD Accelerated Parallel Processing
Device:Pitcairn (AMD FirePro W7000 GPU)
OpenCl Execution time is: 194.18133[ms]
Platform:AMD Accelerated Parallel Processing
Device:Intel(R) Xeon(R) CPU X5660 @ 2.80GHz
OpenCl Execution time is: 3.58488[ms]
Platform:NVIDIA CUDA
Device:Tesla C2075
OpenCl Execution time is: 125.26886[ms]
But aren't GPUs supposed to be faster than CPUs? Or, is there anything wrong with my code/implementation?
Please explain this behaviour.

clEnqueueTask() runs the kernel as a single work item. So basically, you are running one single "thread" (work item) on the GPU. A GPU will never beat a CPU in single-thread performance.
You need to convert your code, such that you divide each prime calculation to a thread and then you run 5000+ work items(ideally millions). Then, the GPU will beat the CPU simply because it will run all that in parallel and CPU can't.
In order to use multiple work items, call your kernel with clEnqueueNDRangeKernel()

The provided code is a sequential algorithm that relies on previous values.
If you are running it with a global_work_size > 1, you are just performing same calculation over and over.
The opencl implementation should compute primes less than N sequentially, then run a test in parallel for numbers [N+1; N*N] if they are divisible by any of those primes and fill in the sieve array with 0 if the number is not prime and 1 if the number is prime.
E.g. (not my code, someone's homework, and I did not check whether it really works)
If you need more than N^2 elements, calculate a prefix sum of the sieve array (exclusive scan). The AMD APP SDK contains a sample of this operation.
This will give you offsets of the prime numbers for copying into prime number array and you'll be able to populate it:
__kernel void scatter(__global uint* numbers, __global uint* sieve_prefix_sum, __global uint* sieve, uint offset, __global uint* prime_numbers)
{
    if (sieve[get_global_id(0)])
        prime_numbers[offset + sieve_prefix_sum[get_global_id(0)]] = numbers[get_global_id(0)];
}
This algorithm works like a tree - you compute primes up to N sequentially and then evaluate K blocks in [N+1, N*N] range, then you repeat and grow next set of branches for [N^2, N^4] etc.
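To make the two phases concrete, here is a rough CPU-side sketch in plain C, with an OpenMP loop standing in for the GPU work items (that substitution is an assumption made for illustration; it is not OpenCL host code). small_primes and mark_block are hypothetical names. The first function is the sequential step; the second tests every number in a block independently, which is the part that maps onto one work item per number.

```c
#include <stddef.h>

/* Sequential step: store the primes <= n in out[] and return how many there are.
   out[] must have room for them (n entries is always enough). */
size_t small_primes(unsigned n, unsigned *out)
{
    size_t count = 0;
    for (unsigned c = 2; c <= n; c++) {
        int prime = 1;
        for (size_t k = 0; k < count && out[k] * out[k] <= c; k++)
            if (c % out[k] == 0) { prime = 0; break; }
        if (prime) out[count++] = c;
    }
    return count;
}

/* Parallel step: sieve[m - lo] = 1 if m is prime, else 0, for m in [lo, hi].
   Requires n < lo and hi <= n * n, where primes[] holds all primes <= n.
   Each m is independent, so every iteration could be one GPU work item. */
void mark_block(const unsigned *primes, size_t nprimes,
                unsigned lo, unsigned hi, unsigned char *sieve)
{
    #pragma omp parallel for
    for (long m = lo; m <= (long)hi; m++) {
        unsigned char is_p = 1;
        for (size_t k = 0; k < nprimes; k++)
            if ((unsigned)m % primes[k] == 0) { is_p = 0; break; }
        sieve[m - lo] = is_p;
    }
}
```

With the primes up to 10 computed first, mark_block(primes, 4, 11, 100, sieve) fills a 90-entry sieve for the range [11, 100]; the prefix sum and scatter steps described above would then compact it into the prime array.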

OpenMP for beginners

I just got started with openMP; I wrote a little C code in order to check if what I have studied is correct. However I found some troubles; here is the main.c code
#include "stdio.h"
#include "stdlib.h"
#include "omp.h"
#include "time.h"
int main(){
    float msec_kernel;
    const int N = 1000000;
    int i, a[N + 1]; /* N + 1 elements: the loop below writes a[1]..a[N] */
    clock_t start = clock(), diff;
    #pragma omp parallel for private(i)
    for (i = 1; i <= N; i++){
        a[i] = 2 * i;
    }
    diff = clock() - start;
    msec_kernel = diff * 1000 / CLOCKS_PER_SEC;
    printf("Kernel Time: %e s\n",msec_kernel*1e-03);
    printf("a[N] = %d\n",a[N]);
    return 0;
}
My goal is to see how long it takes the PC to do such an operation using 1 and 2 CPUs; in order to compile the program I type the following line in the terminal:
gcc -fopenmp main.c -o main
And then I select the number of CPUs like so:
export OMP_NUM_THREADS=N
where N is either 1 or 2; however I don't get the right execution time; my results in fact are:
Kernel Time: 5.000000e-03 s
a[N] = 2000000
and
Kernel Time: 6.000000e-03 s
a[N] = 2000000
These correspond to N=1 and N=2 respectively. As you can see, when I use 2 CPUs it takes slightly more time than using just one! What am I doing wrong? How can I fix this problem?

First of all, using multiple cores doesn't automatically mean that you're going to get better performance.
OpenMP has to manage the data distribution among your cores, which is going to take time as well. Especially for very basic operations such as the single multiplication you are doing, the performance of a sequential (single-core) program will be better.
Second, by going through every element of your array only once and not doing anything else, you make no use of cache memory and most certainly not of the cache shared between CPUs.
So you should start reading some things about general algorithm performance. Making use of multiple cores through shared cache is, in my opinion, the essence.
Today's computers have come to a stage where the CPU is so much faster than a memory allocation, read or write. This means that when using multiple cores, you'll only see a benefit if you use things like shared cache, because the data distribution, the initialization of the threads and managing them will take time as well. To really see a performance speedup (an essential term in parallel computing) you should program an algorithm which has a heavy accent on computation, not on memory; this has to do with locality (another important term).
So if you wanna experience a big performance boost by using multiple cores test it on a matrix-matrix-multiplication on big matrices such as 10'000*10'000. And plot some graphs with inputsize(matrix-size) to time and matrix-size to gflops and compare the multicore with the sequential version.
Also make yourself comfortable with the complexity analysis (Big O notation).
Matrix-matrix-multiplication has a locality of O(n).
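For reference, a matrix-matrix multiplication to benchmark could look roughly like the sketch below. It assumes square n x n matrices stored row-major in flat arrays, and the name matmul is just an example; the i-k-j loop order is one common choice, not the only one.

```c
/* C = A * B for n x n row-major matrices. The rows of C are independent,
   so the outer loop is shared among the threads. */
void matmul(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++)
            C[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {       /* i-k-j order walks B and C row by row */
            double a = A[i * n + k];
            for (int j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

Each element of A and B is reused n times here, so there is far more computation per memory access than in the a[i] = 2 * i loop from the question.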
Hope this helps :-)
I suggest setting the numbers of cores/threads within the code itself either directly at the #pragma line #pragma omp parallel for num_threads(2) or using the omp_set_num_threads function omp_set_num_threads(2);
Further, when doing time/performance analysis it is really important to always run the program multiple times and then take the mean of all the runtimes or something like that. Running the respective programs only once will not give you a meaningful reading of the time used. Always call multiple times in a row, and don't forget to also vary the input data.
I suggest writing a test.c file, which takes your actual program function within a loop and then calculates the time per execution of the function:
int executiontimes = 20;
clock_t initial_time = clock();
for(int i = 0; i < executiontimes; i++){
    function_multiplication(values);
}
clock_t final_time = clock();
clock_t passed_time = final_time - initial_time;
clock_t time_per_exec = passed_time / executiontimes;
Improve this test algorithm, add some rand() for your values etc. seed them with srand() etc. If you have more questions on the subject or to my answer leave a comment and I'll try to explain further by adding more explanations.

The function clock() returns elapsed CPU time, which includes ticks from all cores. Since there is some overhead to using multiple threads, when you sum the execution time of all threads the total cpu time will always be longer than the serial time.
If you want the real time (wall clock time), try to use the OMP Runtime Library function omp_get_wtime() defined in omp.h. It is cross platform portable and should be the preferred way to do wall timing.
You can also use the POSIX functions defined in time.h:
struct timespec start, stop;
clock_gettime(CLOCK_REALTIME, &start);
// action
clock_gettime(CLOCK_REALTIME, &stop);
double elapsed_time = (stop.tv_sec - start.tv_sec) +
1e-9 * (stop.tv_nsec - start.tv_nsec);
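To see the difference between the two kinds of time side by side, the measurement can be wrapped in two small helpers, sketched below. wall_seconds and cpu_seconds are names invented for this example, and CLOCK_MONOTONIC is used here instead of CLOCK_REALTIME on the assumption that a clock which never jumps is preferable for measuring intervals.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime() under strict -std= modes */
#endif
#include <time.h>

/* Wall-clock seconds from a monotonic clock (unaffected by system clock changes). */
double wall_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + 1e-9 * (double)ts.tv_nsec;
}

/* CPU seconds consumed by the whole process, summed over all threads. */
double cpu_seconds(void)
{
    return (double)clock() / CLOCKS_PER_SEC;
}
```

Taking both readings before and after a parallel loop should show the CPU time growing with the number of busy threads while the wall time is what actually shrinks when the parallelisation works.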

Why doesn't this code scale linearly?

I wrote this SOR solver code. Don't bother too much what this algorithm does, it is not the concern here. But just for the sake of completeness: it may solve a linear system of equations, depending on how well conditioned the system is.
I run it with an ill-conditioned sparse matrix of 2097152 rows (that never converges), with at most 7 non-zero columns per row.
Translating: the outer do-while loop will perform 10000 iterations (the value I pass as max_iters), the middle for will perform 2097152 iterations, split in chunks of work_line, divided among the OpenMP threads. The innermost for loop will have 7 iterations, except in very few cases (less than 1%) where it can be less.
There is a data dependency among the threads in the values of the sol array. Each iteration of the middle for updates one element but reads up to 6 other elements of the array. Since SOR is not an exact algorithm, when reading, it can have any of the previous or the current value on that position (if you are familiar with solvers, this is a Gauss-Seidel that tolerates Jacobi behavior in some places for the sake of parallelism).
typedef struct{
    size_t size;
    unsigned int *col_buffer;
    unsigned int *row_jumper;
    real *elements;
} Mat;
int work_line;
// Assumes there are no null elements on main diagonal
unsigned int solve(const Mat* matrix, const real *rhs, real *sol, real sor_omega, unsigned int max_iters, real tolerance)
{
    real *coefs = matrix->elements;
    unsigned int *cols = matrix->col_buffer;
    unsigned int *rows = matrix->row_jumper;
    int size = matrix->size;
    real compl_omega = 1.0 - sor_omega;
    unsigned int count = 0;
    bool done;
    do {
        done = true;
        #pragma omp parallel shared(done)
        {
            bool tdone = true;
            #pragma omp for nowait schedule(dynamic, work_line)
            for(int i = 0; i < size; ++i) {
                real new_val = rhs[i];
                real diagonal;
                real residual;
                unsigned int end = rows[i+1];
                for(int j = rows[i]; j < end; ++j) {
                    unsigned int col = cols[j];
                    if(col != i) {
                        real tmp;
                        #pragma omp atomic read
                        tmp = sol[col];
                        new_val -= coefs[j] * tmp;
                    } else {
                        diagonal = coefs[j];
                    }
                }
                residual = fabs(new_val - diagonal * sol[i]);
                if(residual > tolerance) {
                    tdone = false;
                }
                new_val = sor_omega * new_val / diagonal + compl_omega * sol[i];
                #pragma omp atomic write
                sol[i] = new_val;
            }
            #pragma omp atomic update
            done &= tdone;
        }
    } while(++count < max_iters && !done);
    return count;
}
As you can see, there is no lock inside the parallel region, so, for what they always teach us, it is the kind of 100% parallel problem. That is not what I see in practice.
All my tests were run on an Intel(R) Xeon(R) CPU E5-2670 v2 @ 2.50GHz, 2 processors, 10 cores each, hyper-threading enabled, summing up to 40 logical cores.
On my first set runs, work_line was fixed on 2048, and the number of threads varied from 1 to 40 (40 runs in total). This is the graph with the execution time of each run (seconds x number of threads):
The surprise was the logarithmic curve, so I thought that since the work line was so large, the shared caches were not very well used, so I dug up this virtual file /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size that told me this processor's L1 cache synchronizes updates in groups of 64 bytes (8 doubles in the array sol). So I set the work_line to 8:
Then I thought 8 was too low to avoid NUMA stalls and set work_line to 16:
While running the above, I thought "Who am I to predict what work_line is good? Lets just see...", and scheduled to run every work_line from 8 to 2048, steps of 8 (i.e. every multiple of the cache line, from 1 to 256). The results for 20 and 40 threads (seconds x size of the split of the middle for loop, divided among the threads):
I believe the cases with low work_line suffer badly from cache synchronization, while bigger work_line offers no benefit beyond a certain number of threads (I assume because the memory pathway is the bottleneck). It is very sad that a problem that seems 100% parallel presents such bad behavior on a real machine. So, before I am convinced multi-core systems are a very well sold lie, I am asking you here first:
How can I make this code scale linearly to the number of cores? What am I missing? Is there something in the problem that makes it not as good as it seems at first?
Update
Following suggestions, I tested with both static and dynamic scheduling, but with the atomic read/write on the array sol removed. For reference, the blue and orange lines are the same as in the previous graph (just up to work_line = 248). The yellow and green lines are the new ones. From what I could see: static makes a significant difference for low work_line, but after 96 the benefits of dynamic outweigh its overhead, making it faster. The atomic operations make no difference at all.

The sparse matrix vector multiplication is memory bound (see here) and it could be shown with a simple roofline model. Memory bound problems benefit from higher memory bandwidth of multisocket NUMA systems but only if the data initialisation is done in such a way that the data is distributed among the two NUMA domains. I have some reasons to believe that you are loading the matrix in serial and therefore all its memory is allocated on a single NUMA node. In that case you won't benefit from the double memory bandwidth available on a dual-socket system and it really doesn't matter if you use schedule(dynamic) or schedule(static). What you could do is enable memory interleaving NUMA policy in order to have the memory allocation spread among both NUMA nodes. Thus each thread would end up with 50% local memory access and 50% remote memory access instead of having all threads on the second CPU being hit by 100% remote memory access. The easiest way to enable the policy is by using numactl:
$ OMP_NUM_THREADS=... OMP_PROC_BIND=1 numactl --interleave=all ./program ...
OMP_PROC_BIND=1 enables thread pinning and should improve the performance a bit.
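An alternative to interleaving, if the loading code can be changed, is to have the threads themselves touch the data first. The sketch below assumes Linux's default first-touch page placement, pinned threads, and that later loops over the array use the same schedule(static) split; none of that is taken from the question (which uses a dynamic schedule), and alloc_first_touch is a made-up name.

```c
#include <stdlib.h>

/* Allocate n doubles and zero them in parallel, so that each page is first
   touched by the thread that a schedule(static) loop will later assign to it. */
double *alloc_first_touch(size_t n)
{
    double *v = malloc(n * sizeof *v);
    if (v == NULL) return NULL;
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++)
        v[i] = 0.0;
    return v;
}
```

malloc is used rather than calloc on purpose, so that the pages are not touched before the parallel loop runs. The matrix arrays and the sol vector would then be filled in after this allocation.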
I would also like to point out that this:
done = true;
#pragma omp parallel shared(done)
{
    bool tdone = true;
    // ...
    #pragma omp atomic update
    done &= tdone;
}
is probably a not very efficient re-implementation of:
done = true;
#pragma omp parallel reduction(&:done)
{
    // ...
    if(residual > tolerance) {
        done = false;
    }
    // ...
}
There won't be a notable performance difference between the two implementations because of the amount of work done in the inner loop, but for the sake of portability and readability it is still not a good idea to reimplement existing OpenMP primitives.
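Pulled out into a standalone function, the reduction pattern looks roughly like this. all_converged is a hypothetical helper that takes precomputed residuals, and it uses the logical && reduction operator rather than the bitwise & shown above; for a flag that is only ever true or false the two give the same result.

```c
#include <stdbool.h>

/* Returns true when every residual is within tolerance. Each thread works on a
   private copy of done (initialised to true); the copies are combined with &&
   when the loop ends. */
bool all_converged(const double *residual, int n, double tolerance)
{
    bool done = true;
    #pragma omp parallel for reduction(&&:done)
    for (int i = 0; i < n; i++)
        if (residual[i] > tolerance)
            done = false;
    return done;
}
```

No atomic update and no separate per-thread flag are needed; the OpenMP runtime takes care of combining the per-thread values.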

Try running the IPCM (Intel Performance Counter Monitor). You can watch memory bandwidth, and see if it maxes out with more cores. My gut feeling is that you are memory bandwidth limited.
As a quick back of the envelope calculation, I find that uncached read bandwidth is about 10 GB/s on a Xeon. If your clock is 2.5 GHz, that's one 32 bit word per clock cycle. Your inner loop is basically just a multiply-add operation whose cycles you can count on one hand, plus a few cycles for the loop overhead. It doesn't surprise me that after 10 threads, you don't get any performance gain.

Your inner loop has an omp atomic read, and your middle loop has an omp atomic write to a location that could be the same one read by one of the reads. OpenMP is obligated to ensure that atomic writes and reads of the same location are serialized, so in fact it probably does need to introduce a lock, even though there isn't any explicit one.
It might even need to lock the whole sol array unless it can somehow figure out which reads might conflict with which writes, and really, OpenMP processors aren't necessarily all that smart.
No code scales absolutely linearly, but rest assured that there are many codes that do scale much closer to linearly than yours does.

I suspect you are having caching issues. When one thread updates a value in the sol array, it invalidates the caches on other CPUs that are storing that same cache line. This forces the caches to be updated, which then leads to the CPUs stalling.

Even if you don't have an explicit mutex lock in your code, you have one shared resource between your processes: the memory and its bus. You don't see this in your code because it is the hardware that takes care of handling all the different requests from the CPUs, but nevertheless, it is a shared resource.
So, whenever one of your processes writes to memory, that memory location will have to be reloaded from main memory by all other processes that use it, and they all have to use the same memory bus to do so. The memory bus saturates, and you have no more performance gain from additional CPU cores that only serve to worsen the situation.

OpenMP with 1 thread slower than sequential version

I have implemented knapsack using OpenMP (gcc version 4.6.3)
#define MAX(x,y) ((x)>(y) ? (x) : (y))
#define table(i,j) table[(i)*(C+1)+(j)]
for(i=1; i<=N; ++i) {
    #pragma omp parallel for
    for(j=1; j<=C; ++j) {
        if(weights[i]>j) {
            table(i,j) = table(i-1,j);
        }else {
            table(i,j) = MAX(profits[i]+table(i-1,j-weights[i]), table(i-1,j));
        }
    }
}
execution time for the sequential program = 1s
execution time for the openmp with 1 thread = 1.7s (overhead = 40%)
Used the same compiler optimization flags (-O3) in the both cases.
Can someone explain the reason behind this behavior.
Thanks.

Enabling OpenMP inhibits certain compiler optimisations, e.g. it could prevent loops from being vectorised or shared variables from being kept in registers. Therefore OpenMP-enabled code run with a single thread is usually slower than the serial version, and one has to utilise the available parallelism to offset this.
That being said, your code contains a parallel region nested inside the outer loop. This means that the overhead of entering and exiting the parallel region is multiplied N times. This only makes sense if N is relatively small and C is significantly larger (like orders of magnitude larger) than N, therefore the work being done inside the region greatly outweighs the OpenMP overhead.
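One common way to avoid entering the parallel region N times is to open it once around the outer loop and keep only the work-sharing construct inside. The sketch below shows that restructuring under some assumptions: the knapsack function, its signature and the flat table layout are made up for this example, and weights are assumed to be positive. Whether it recovers the lost time depends on the OpenMP runtime, since many implementations already keep their threads alive between regions.

```c
#include <stdlib.h>

#define MAX(x,y) ((x)>(y) ? (x) : (y))

/* 0/1 knapsack over items 1..N (index 0 of weights/profits is unused, as in
   the question). Returns the best profit for capacity C, or -1 on allocation failure. */
int knapsack(int N, int C, const int *weights, const int *profits)
{
    int *table = calloc((size_t)(N + 1) * (size_t)(C + 1), sizeof *table);
    if (table == NULL) return -1;
    int stride = C + 1;

    #pragma omp parallel            /* the thread team is created once */
    {
        for (int i = 1; i <= N; ++i) {      /* every thread runs the outer loop */
            #pragma omp for                 /* columns of row i are shared out */
            for (int j = 1; j <= C; ++j) {
                int skip = table[(i - 1) * stride + j];
                if (weights[i] > j) {
                    table[i * stride + j] = skip;
                } else {
                    int take = profits[i] + table[(i - 1) * stride + (j - weights[i])];
                    table[i * stride + j] = MAX(take, skip);
                }
            }
            /* implicit barrier here: row i is complete before row i + 1 starts */
        }
    }

    int best = table[N * stride + C];
    free(table);
    return best;
}
```

The barrier at the end of each omp for is still paid N times, so this only removes the region entry and exit cost, not the per-row synchronisation.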