The performance of openmp - c

I'm running some tests on the performance of OpenMP, but I find some strange results. Here is my test code:
#include <cstdio>
#include <ctime>
#include <omp.h>

void test()
{
    int a = 0;
    clock_t t1 = clock();
    int length = 50000;
    double *t3 = new double[length]();
    double *t4 = new double[length]();
    for (int i = 0; i < 8000; i++)
    {
        for (int j = 0; j < length; j++)
        {
            t3[j] = t3[j] + t4[j];
        }
    }
    clock_t t2 = clock();
    printf("Time = %d %d\n", t2 - t1, omp_get_thread_num());
    delete[] t3;
    delete[] t4;
}

int main()
{
    clock_t t1 = clock();
    printf("In parallel region:\n");
#pragma omp parallel for
    for (int j = 0; j < 8; j++)
    {
        test();
    }
    clock_t t2 = clock();
    printf("Total time = %d\n", t2 - t1);
    printf("In sequential region:\n");
    test();
    printf("\n");
}
When I set length = 50000, 100000, or 150000 respectively, the results are shown in the figure.
It is strange that the elapsed time does not grow linearly: the elapsed time at length = 150000 is almost 5 times that at length = 50000, even though the amount of computation grows linearly.
It is also strange that the elapsed time of the test function in the parallel region does not equal the elapsed time of the test function in the sequential region when length = 150000.
My CPU is an Intel Core i5-4590 (4 cores) and the platform is VS2013 on Windows 8.
I hope somebody can tell me the reason and how to solve this problem to improve the OpenMP performance. Thank you very much.

There is nothing strange here. Your code is memory bound and the slowdown when going from length=50000 to longer arrays is due to the data no longer being able to fit into the CPU last-level cache.
length=50000: data size is 4 threads x 2 arrays x 50000 elements x 8 bytes per element = 3.05 MiB < L3 cache size (6 MiB for i5-4590)
length=100000: data size is 6.10 MiB > L3 cache size
length=150000: data size is 9.16 MiB > L3 cache size
In the second case, the data is just slightly larger than the CPU cache, therefore the time difference is only a bit bigger than 2x. In the third case, half of the array data cannot fit into the cache and must be streamed to and from main memory.
When the function is called from the main thread only, the memory used is 1/4 of what is used in the parallel region and the arrays fit entirely in the L3 cache for all three different lengths.
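As a quick sanity check of those numbers, here is a minimal sketch (my own, not part of the original program) that prints the aggregate working set for each length and compares it with the 6 MiB L3 of the i5-4590:

#include <stdio.h>

int main(void) {
    const double l3_mib = 6.0;                 /* i5-4590 L3 size in MiB */
    const int threads = 4, arrays = 2;
    const int lengths[] = { 50000, 100000, 150000 };
    for (int i = 0; i < 3; i++) {
        /* threads x arrays x elements x 8 bytes per double, in MiB */
        double mib = (double)threads * arrays * lengths[i] * sizeof(double)
                     / (1024.0 * 1024.0);
        printf("length=%6d -> %.2f MiB (%s the L3)\n", lengths[i], mib,
               mib <= l3_mib ? "fits in" : "exceeds");
    }
    return 0;
}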
Check my answer to this question for more details.

Related

OpenMP: No Speedup in parallel workloads

I can't really figure this out with my fairly simple OpenMP-parallelized for loop. Running on the same input size, P=1 runs in ~50 seconds, but P=2 takes almost 300 seconds and P=4 runs in ~250 seconds.
Here's the parallelized loop:
double time = omp_get_wtime();
printf("Input Size: %d\n", n);
#pragma omp parallel for private(i) reduction(+:in)
for(i = 0; i < n; i++) {
    double x = (double)(rand() % 10000)/10000;
    double y = (double)(rand() % 10000)/10000;
    if(inCircle(x, y)) {
        in++;
    }
}
double ratio = (double)in/(double)n;
double est_pi = ratio * 4.0;
time = omp_get_wtime() - time;
Runtimes:
p=1, n=1073741824 - 52.764 seconds
p=2, n=1073741824 - 301.66 seconds
p=4, n=1073741824 - 274.784 seconds
p=8, n=1073741824 - 188.224 seconds
Running in an Ubuntu 20.04 VM with 8 cores of a Xeon 5650 and 16 GB of DDR3 ECC RAM, on top of a FreeNAS installation on a dual Xeon 5650 system with 70 GB of RAM.
Partial Solution:
The rand() function inside the loop causes the time to jump when running on multiple threads.
Since rand() uses state saved from the previous call to generate the next PRN, it can't run in multiple threads at the same time: multiple threads would need to read/write the PRNG state at the same time.
POSIX states that rand() need not be thread safe. This means your code could simply not work right. Or the C library might put in a mutex so that only one thread can call rand() at a time. This is what's happening, but it slows the code down considerably: the threads are almost entirely busy trying to get access to the rand() critical section, as nothing else they do takes any significant time.
To solve this, try using rand_r(), which does not use shared state, but instead is passed the seed value it should use for state.
Keep in mind that using the same seed for every thread will defeat the purpose of increasing the number of trials in your Monte Carlo simulation. Each thread would just use the exact same pseudo-random sequence. Try something like this:
unsigned int seed;
#pragma omp parallel private(seed)
{
    seed = omp_get_thread_num();
    #pragma omp for private(i) reduction(+:in)
    for(i = 0; i < n; i++) {
        double x = (double)(rand_r(&seed) % 10000)/10000;
        double y = (double)(rand_r(&seed) % 10000)/10000;
        if(inCircle(x, y)) {
            in++;
        }
    }
}
BTW, you might notice your estimate is off. x and y need to be evenly distributed in the range [0, 1], and they are not.
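As a sketch of that last point (my own illustration, keeping the rand_r() structure from above): dividing by RAND_MAX instead of taking a modulus gives x and y that are approximately uniform over [0, 1]:

#pragma omp parallel private(seed)
{
    seed = omp_get_thread_num();
    #pragma omp for private(i) reduction(+:in)
    for(i = 0; i < n; i++) {
        /* rand_r() returns a value in [0, RAND_MAX]; dividing by RAND_MAX
           maps it onto [0, 1] without the coarse, slightly biased % 10000 scaling. */
        double x = (double)rand_r(&seed) / RAND_MAX;
        double y = (double)rand_r(&seed) / RAND_MAX;
        if(inCircle(x, y)) {
            in++;
        }
    }
}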

False sharing in multi threads

The following code runs slower as I increase NTHREADS. Why does using more threads make the program run slower? Is there any way to fix it? Someone said it is about false sharing, but I do not really understand that concept.
The program basically calculates the sum from 1 to 100000000. The idea of using multiple threads is to separate the number list into several chunks and calculate the sum of each chunk in parallel to make the calculation faster.
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define LENGTH 100000000
#define NTHREADS 2
#define NREPEATS 10
#define CHUNCK (LENGTH / NTHREADS)

typedef struct {
    size_t id;
    long *array;
    long result;
} worker_args;

void *worker(void *args) {
    worker_args *wargs = (worker_args*) args;
    const size_t start = wargs->id * CHUNCK;
    const size_t end = wargs->id == NTHREADS - 1 ? LENGTH : (wargs->id + 1) * CHUNCK;
    for (size_t i = start; i < end; ++i) {
        wargs->result += wargs->array[i];
    }
    return NULL;
}

int main(void) {
    long *numbers = malloc(sizeof(long) * LENGTH);
    for (size_t i = 0; i < LENGTH; ++i) {
        numbers[i] = i + 1;
    }

    worker_args *args = malloc(sizeof(worker_args) * NTHREADS);
    for (size_t i = 0; i < NTHREADS; ++i) {
        args[i] = (worker_args) {
            .id = i,
            .array = numbers,
            .result = 0
        };
    }

    pthread_t thread_ids[NTHREADS];
    for (size_t i = 0; i < NTHREADS; ++i) {
        pthread_create(thread_ids + i, NULL, worker, args + i);
    }
    for (size_t i = 0; i < NTHREADS; ++i) {
        pthread_join(thread_ids[i], NULL);
    }

    long sum = 0;
    for (size_t i = 0; i < NTHREADS; ++i) {
        sum += args[i].result;
    }
    printf("Total sum is %ld\n", sum);   /* the NREPEATS repeat loop is omitted in the question */

    free(args);
    free(numbers);
}
Why does using more threads make the program run slower?
There is overhead in creating and joining threads. If the threads don't have much to do, this overhead may be more expensive than the actual work.
Your threads only compute a simple sum, which isn't that expensive. Also consider that going from e.g. 10 to 11 threads doesn't change the workload per thread much:
10 threads --> 10000000 sums per thread
11 threads --> 9090909 sums per thread
The overhead of creating an extra thread may exceed the "work saved" per thread.
On my PC the program runs in less than 100 milliseconds. Multi-threading isn't worth the trouble here.
You need a more processing-intensive task before multi-threading is worth doing.
Also notice that it seldom makes sense to create more threads than the number of cores (incl. hyper-threads) your computer has.
false sharing
yes, "false sharing" can impact the performance of a multi-threaded program but I doubt it's the real problem in your case.
"false sharing" is something that happens in (some) cache systems when two threads (or rather two cores) writes to two different variables that belongs to the same cache line. In such cases the two threads/cores competes to own the cache line (for writing) and consequently, they'll have to refresh the memory and the cache again and again. That's bad for performance.
As I said - I doubt that is your problem. A clever compiler will do your loop solely be using CPU registers and only write to memory at the end. You can check the disassemble of your code to see if that is the case.
You can avoid "false sharing" by increasing the sizeof of your struct so that each struct fits the size of a cache line on your system.

cpu cacheline and prefetch policy

I read this article http://igoro.com/archive/gallery-of-processor-cache-effects/. The article said that, because of the cache line delay, this code:
int[] arr = new int[64 * 1024 * 1024];
// Loop 1
for (int i = 0; i < arr.Length; i++) arr[i] *= 3;
// Loop 2
for (int i = 0; i < arr.Length; i += 16) arr[i] *= 3;
will have almost the same execution time, and I wrote some sample C code to test it. I ran the code on a Xeon(R) E3-1230 V2 with Ubuntu 64-bit and on an ARMv6-compatible processor rev 7 with Debian, and also ran it on a Core 2 T6600. None of the results are what the article said.
My code is as follows:
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sched.h>

/* LENGTH is defined at compile time; its value is not shown in the question. */

long int jobTime(struct timespec start, struct timespec stop) {
    long int seconds = stop.tv_sec - start.tv_sec;
    long int nsec = stop.tv_nsec - start.tv_nsec;
    return seconds * 1000 * 1000 * 1000 + nsec;
}

int main() {
    struct timespec start;
    struct timespec stop;
    int i = 0;
    struct sched_param param;
    int *arr = malloc(LENGTH * 4);

    printf("---------sieofint %d\n", sizeof(int));

    param.sched_priority = 0;
    sched_setscheduler(0, SCHED_FIFO, &param);

    //clock_gettime(CLOCK_MONOTONIC, &start);
    //for (i = 0; i < LENGTH; i++) arr[i] *= 5;
    //clock_gettime(CLOCK_MONOTONIC, &stop);
    //printf("step %d : time %ld\n", 1, jobTime(start, stop));

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < LENGTH; i += 2) arr[i] *= 5;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    printf("step %d : time %ld\n", 2, jobTime(start, stop));
}
Each time I choose one piece to compile and run (comment one and uncomment another).
compile with:
gcc -O0 -o cache cache.c -lrt
On Xeon I get this:
step 1 : 258791478
step 2 : 97875746
I want to know whether what the article said is correct. Or do the newest CPUs have more advanced prefetch policies?
Short answer (TL;DR): you're accessing uninitialized data; your first loop has to allocate new physical pages for the entire array within the timed region.
When I run your code and comment each of the sections in turn, I get almost the same timing for the two loops. However, I do get the same results you report when I uncomment both sections and run them one after the other. This makes me suspect you also did that, and suffered from a cold-start effect when comparing the first loop with the second. It's easy to check - just swap the order of the loops and see if the first is still slower.
To avoid this, either pick a LENGTH large enough (depending on your system) that you don't get any cache benefit from the first loop helping the second, or just add a single untimed traversal of the entire array.
Note that the second option wouldn't exactly prove what the blog wanted to say - that memory latency masks the execution latency, so it doesn't matter how many elements of a cache line you use, you're still bottlenecked by the memory access time (or more accurately - the bandwidth).
Also - benchmarking code compiled with -O0 is really bad practice.
Edit:
Here's what I'm getting (I removed the scheduling as it's not related).
This code:
for (i = 0; i < LENGTH; i++) arr[i] = 1;   // warmup!
clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i++) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 1, jobTime(start, stop));

clock_gettime(CLOCK_MONOTONIC, &start);
for (i = 0; i < LENGTH; i += 16) arr[i] *= 5;
clock_gettime(CLOCK_MONOTONIC, &stop);
printf("step %d : time %ld\n", 16, jobTime(start, stop));
Gives :
---------sieofint 4
step 1 : time 58862552
step 16 : time 50215446
Commenting out the warmup line gives the same advantage you reported for the second loop:
---------sieofint 4
step 1 : time 279772411
step 16 : time 50615420
Swapping the order of the loops (warmup still commented out) shows it's indeed not related to the step size but to the ordering:
---------sieofint 4
step 16 : time 250033980
step 1 : time 59168310
(gcc version 4.6.3, on Opteron 6272)
Now a note about what's going on here: in theory, you'd expect warmup to matter only when the array is small enough to sit in some cache - in this case the LENGTH you used is too big even for the L3 on most machines. However, you're forgetting the page map - you didn't just skip warming up the data, you avoided initializing it in the first place. This could never give you meaningful results in real life, but since this is a benchmark you didn't notice it; you're just multiplying junk data for the latency of it.
This means that each new page you access in the first loop doesn't only go to memory, it probably takes a page fault and has to call the OS to map a new physical page for it. This is a lengthy process, multiplied by the number of 4K pages you use - accumulating to a very long time. At this array size you can't even benefit from the TLBs (you have 16k different physical 4k pages, way more than most TLBs can support even with 2 levels), so it's mostly the cost of the fault flows. This can probably be measured with any profiling tool.
The second iteration over the same array won't have this effect and will be much faster - even though it still has to do a full page walk on each new page (that's done purely in HW), and then fetch the data from memory.
By the way, this is also the reason why, when you benchmark some behavior, you repeat the same thing multiple times (in this case it would have solved your problem if you had run over the array several times with the same stride and ignored the first few rounds).
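A minimal sketch of that repetition idea (my own illustration; it reuses arr, LENGTH, start, stop and jobTime() from the code above, and the round count is arbitrary):

#define ROUNDS 5

for (int r = 0; r < ROUNDS; r++) {
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < LENGTH; i += 16) arr[i] *= 5;
    clock_gettime(CLOCK_MONOTONIC, &stop);
    /* Round 0 pays for page faults and cold caches; the later rounds
       measure the steady state, so ignore the first one (or two). */
    printf("round %d : time %ld\n", r, jobTime(start, stop));
}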

analysis of cpu cache access time

I have the following program, which I wrote with the help of someone else on Stack Overflow to understand cache lines and CPU caches. The results of the calculation are posted below.
k      loop 2 time (ms)   loop 3 time (ms)
1      450.0              440.0
2      420.0              230.0
4      400.0              110.0
8      390.0               60.0
16     380.0               30.0
32     320.0               10.0
64     180.0               10.0
128     60.0                0.0
256     40.0               10.0
512     10.0                0.0
1024    10.0                0.0
I have plotted a graph using gnuplot which is posted below.
I have the following questions:
1. Is my timing calculation in milliseconds correct? 440 ms seems to be a lot of time.
2. From the graph of cache_access_1 (red line), can we conclude that the size of a cache line is 32 bits (and not 64 bits)?
3. Between the for loops in the code, is it a good idea to clear the cache? If yes, how do I do that programmatically?
4. As you can see, I have some 0.0 values in the results above. What does this indicate? Is the granularity of the measurement too coarse?
Kindly reply.
#include <stdio.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>
#include <stdlib.h>

#define MAX_SIZE (512*1024*1024)

int main()
{
    clock_t start, end;
    double cpu_time;
    int i = 0;
    int k = 0;
    int count = 0;

    /*
     * A MAX_SIZE array is too big for the stack. This is an unfortunate rough edge of the way the stack works.
     * It lives in a fixed-size buffer, set by the program executable's configuration according to the
     * operating system, but its actual size is seldom checked against the available space.
     */
    /*int arr[MAX_SIZE];*/
    int *arr = (int*)malloc(MAX_SIZE * sizeof(int));

    /*cpu clock ticks count start*/
    for (k = 0; k < 3; k++)
    {
        start = clock();
        count = 0;
        for (i = 0; i < MAX_SIZE; i++)
        {
            arr[i] += 3;
            /*count++;*/
        }
        /*cpu clock ticks count stop*/
        end = clock();
        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
        printf("cpu time for loop 1 (k : %4d) %.1f ms.\n", k, (cpu_time*1000));
    }
    printf("\n");

    for (k = 1; k <= 1024; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;
        for (i = 0; i < MAX_SIZE; i += k)
        {
            /*count++;*/
            arr[i] += 3;
        }
        /*cpu clock ticks count stop*/
        end = clock();
        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
        printf("cpu time for loop 2 (k : %4d) %.1f ms.\n", k, (cpu_time*1000));
    }
    printf("\n");

    /* Third loop, performing the same operations as loop 2,
       but only touching 16KB of memory */
    for (k = 1; k <= 1024; k <<= 1)
    {
        /*cpu clock ticks count start*/
        start = clock();
        count = 0;
        for (i = 0; i < MAX_SIZE; i += k)
        {
            count++;
            arr[i & 0xfff] += 3;
        }
        /*cpu clock ticks count stop*/
        end = clock();
        cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
        printf("cpu time for loop 3 (k : %4d) %.1f ms.\n", k, (cpu_time*1000));
    }
    return 0;
}
Since you are on Linux, I'll answer from that perspective. I will also write with an Intel (i.e., x86-64) architecture in mind.
440 ms is probably accurate. A better way to look at the results would be time per element or access. Note that increasing your k reduces the number of elements accessed. Now, cache access 2 shows a fairly steady result of 0.9ns / access. This time is roughly comparable to 1 - 3 cycles per access (depending on CPU's clock rate). So sizes 1 - 16 (maybe 32) are accurate.
No (although I will first assume you mean 32 versus 64 bytes). You should ask yourself: what does "cache line size" look like? If you access with a stride smaller than the cache line, then you will miss once and subsequently hit one or more times. If your stride is greater than or equal to the cache line size, every access will miss. At k=32 and above, the access time for cache access 1 is relatively constant at 20 ns per access. At k=1-16, the overall access time is constant, suggesting that there are approximately the same number of cache misses. So I would conclude that the cache line size is 64 bytes.
Yes, at least for the last loop that is only storing ~16KB. How? Either touch a lot of other data, like another GB array. Or call an instruction like x86's WBINVD, which writes to memory and then invalidates all cache contents; however, it requires you to be in kernel-mode.
As you noted, beyond size 32 the times hover around 10 ms, which is showing your timing granularity. You need to either increase the time required (so that a 10 ms granularity is sufficient) or switch to a different timing mechanism, which is what the comments are debating. I'm a fan of using the instruction rdtsc (read timestamp counter, i.e., cycle count), but this can be even more problematic than the suggestions above. Switching your code to rdtsc basically requires replacing clock(), clock_t, and CLOCKS_PER_SEC. However, you could still face clock drift if your thread migrates, but this is a fun test so I wouldn't concern myself with that issue.
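If you want to experiment with rdtsc, a minimal sketch for GCC/Clang on x86 (my own illustration, not from the original answer) would replace the clock() pair around one of the loops like this; it reports raw cycle counts, so pin the thread to one core to limit the migration issue mentioned above:

#include <x86intrin.h>   /* provides __rdtsc() on GCC/Clang for x86 */

unsigned long long c0 = __rdtsc();
for (i = 0; i < MAX_SIZE; i += k)
    arr[i] += 3;
unsigned long long cycles = __rdtsc() - c0;
printf("loop 2 (k : %4d) %llu cycles\n", k, cycles);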
More caveats: the trouble with consistent strides (like powers of 2) is that the processor likes to hide the cache miss penalty by prefetching. You can disable the prefetcher on many machines in the BIOS (see "Changing the Prefetcher for Intel Processors").
Page faults may also be impacting your results. You are allocating 500M ints or about 2GB of storage. Loop 1 tries to touch the memory so that the OS will allocate pages, but if you don't have this much available memory (not just total, as the OS, etc takes up some space) then your results will be skewed. Furthermore, the OS may start reclaiming some of the space so that you will always be page faulting on some of your accesses.
Related to the previous, the TLB is also going to have some impact on the results. The hardware keeps a small cache of mappings from virtual to physical address in a translation lookaside buffer (TLB). Each page of memory (4KB on Intel) needs a TLB entry. So your experiment is going to need 2GB / 4KB => ~500,000 entries. Most TLBs hold less than 1000 entries, so the measurements are also skewed by this miss. Fortunately, it is only once every 4KB or 1024 ints. It is possible that malloc is allocating "large" or "huge" pages for you, for more details - Huge Pages in Linux.
Another experiment would be to repeat the third loop, but change the mask that you are using, so that you can observe the size of each cache level (L1, L2, maybe L3, rarely L4). You may also find that different cache levels use different cacheline sizes.
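A rough sketch of that experiment (my own illustration, reusing arr, start, end and cpu_time from the question's code): fix the stride at 16 ints (64 bytes, one cache line per access) and grow the touched region from 16 KB upward; the time should jump each time the region overflows another cache level.

/* Probe cache level sizes: fixed 64-byte stride, growing region size. */
for (int p = 12; p <= 23; p++)               /* 16 KB ... 32 MB regions */
{
    unsigned int mask = (1u << p) - 1;       /* p = 12 reproduces the 0xfff mask */
    start = clock();
    for (i = 0; i < MAX_SIZE; i += 16)
        arr[i & mask] += 3;
    end = clock();
    cpu_time = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("region %6u KB : %.1f ms\n", (4u << p) / 1024, cpu_time * 1000);
}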

optimize MSE algorithm using openmp

I wanted to optimize the code below using OpenMP:
double val;
double m_y = 0.0f;
double m_u = 0.0f;
double m_v = 0.0f;

#define _MSE(m, t) \
    val = refData[t] - calData[t]; \
    m += val*val;

#pragma omp parallel
{
    #pragma omp for
    for( i=0; i<(width*height)/2; i++ ) {  //yuv422: 2 pixels at a time
        _MSE(m_u, 0);
        _MSE(m_y, 1);
        _MSE(m_v, 2);
        _MSE(m_y, 3);

        #pragma omp reduction(+:refData) reduction(+:calData)
        refData += 4;
        calData += 4;
        // int id = omp_get_thread_num();
        //printf("Thread %d performed %d iterations of the loop\n",id ,i);
    }
}
Any suggestions are welcome for optimizing the code above; currently I get wrong output.
I think the easiest thing you can do is allow it to split into 4 threads, and calculate the UYVY errors in each of those. Instead of making them separate values, make them an array:
double sqError[4] = {0};
const int numBytes = width * height * 2;

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem] += (double)(val*val);
    }
}
This way, each thread operates exclusively on one thing and there is no contention.
Maybe it's not the most advanced use of OMP, but you should see a speedup.
After your comment about performance hit, I did some experiments and found that indeed the performance was worse. I suspect this may be due to cache misses.
You said:
    performance hit this time with openMP : Time :0.040637 with serial Time :0.018670
So I reworked it using the reduction on each variable and using a single loop:
double e0 = 0, e1 = 0, e2 = 0, e3 = 0;

#pragma omp parallel for reduction(+:e0) reduction(+:e1) reduction(+:e2) reduction(+:e3)
for( int i = 0; i < numBytes; i += 4 ) {
    int val = refData[i] - calData[i];
    e0 += (double)(val*val);
    val = refData[i+1] - calData[i+1];
    e1 += (double)(val*val);
    val = refData[i+2] - calData[i+2];
    e2 += (double)(val*val);
    val = refData[i+3] - calData[i+3];
    e3 += (double)(val*val);
}
With my test case on a 4-core machine, I observed a little less than 4-fold improvement:
serial: 2025 ms
omp with 2 loops: 6850 ms
omp with reduction: 455 ms
[Edit] On the subject of why the first piece of code performed worse than the non-parallel version, Hristo Iliev said:
    Your first piece of code is a terrible example of what false sharing does in multithreaded codes. As sqError has only 4 elements of 8 bytes each, it fits in a single cache line (even in a half cache line on modern x86 CPUs). With 4 threads constantly writing to neighbouring elements, this would generate a massive amount of inter-core cache invalidation due to false sharing. One can get around this by using instead a structure like this: struct _error { double val; double pad[7]; } sqError[4]; Now each sqError[i].val will be in a separate cache line, hence no false sharing.
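Put as code, the padded variant Hristo describes would look roughly like this (64-byte cache lines assumed), applied to the first two-loop version:

/* Each .val now lives in its own 64-byte cache line, so the four threads
   no longer invalidate each other's line on every store. */
struct _error { double val; double pad[7]; } sqError[4] = {{0}};

#pragma omp parallel for
for( int elem = 0; elem < 4; elem++ ) {
    for( int i = elem; i < numBytes; i += 4 ) {
        int val = refData[i] - calData[i];
        sqError[elem].val += (double)(val*val);
    }
}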
The code looks like it's calculating the MSE but adding to the same shared sums from every thread. For the parallelism to work properly, you need to eliminate that sharing; one approach would be to preallocate an array (of width*height/2 elements, I imagine) just to store the per-iteration sums, and then add all the sums together at the end.
Also, test that this is actually faster!
