Optimising: why is OpenMP much slower than the sequential way? - C

I am a newbie in programming with OpenMP. I wrote a simple C program to multiply a matrix with a vector. Unfortunately, by comparing execution times I found that OpenMP is much slower than the sequential way.
Here is my code (the matrix is N*N int, the vector is N int, result is N long long):
#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size)
for(i=0; i<m_size; i++)
{
    for(j=0; j<m_size; j++)
    {
        result[i] += matrix[i][j] * vector[j];
    }
}
And this is the code for sequential way:
for (i=0; i<m_size; i++)
    for(j=0; j<m_size; j++)
        result[i] += matrix[i][j] * vector[j];
When I tried these two implementations with a 999x999 matrix and a 999-element vector, the execution times were:
Sequential: 5439 ms
Parallel: 11120 ms
I really cannot understand why OpenMP is so much slower than the sequential algorithm (over 2 times slower!). Can anyone help me solve this?

Your code partially suffers from so-called false sharing, typical for all cache-coherent systems. In short, many elements of the result[] array fit in the same cache line. When thread i writes to result[i] as a result of the += operator, the cache line holding that part of result[] becomes dirty. The cache coherency protocol then invalidates all copies of that cache line in the other cores and they have to refresh their copy from the upper-level cache or from the main memory. As result is an array of long long, one cache line (64 bytes on x86) holds 8 elements, and besides result[i] there are 7 other array elements in the same cache line. Therefore it is possible that two "neighbouring" threads will constantly fight for ownership of the cache line (assuming that each thread runs on a separate core).
To mitigate false sharing in your case, the easiest thing to do is to ensure that each thread gets an iteration block whose size is divisible by the number of elements in the cache line. For example, you can apply schedule(static,something*8), where something should be big enough that the iteration space is not fragmented into too many pieces, but at the same time small enough that each thread gets a block. E.g. for m_size equal to 999 and 4 threads you would apply the schedule(static,256) clause to the parallel for construct.
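Applied to the loop above, that might look like this (a minimal sketch; the chunk size of 256 assumes 4 threads and m_size = 999 as in the example):

// 256 iterations = 32 cache lines of result[] (8 long long per 64-byte line),
// so two threads only ever meet at a chunk boundary
#pragma omp parallel for private(i,j) shared(matrix,vector,result,m_size) schedule(static,256)
for(i=0; i<m_size; i++)
{
    for(j=0; j<m_size; j++)
    {
        result[i] += matrix[i][j] * vector[j];
    }
}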
Another partial reason for the code running slower might be that, when OpenMP is enabled, the compiler may become reluctant to apply some code optimisations when shared variables are being assigned to. OpenMP provides the so-called relaxed memory model, where the local memory view of a shared variable in each thread is allowed to differ, and the flush construct is provided in order to synchronise the views. But compilers usually treat shared variables as implicitly volatile if they cannot prove that other threads would not need to access desynchronised shared variables. Your case is one of those, since result[i] is only assigned to and the value of result[i] is never used by other threads. In the serial case the compiler would most likely create a temporary variable to hold the result of the inner loop and would only assign to result[i] once the inner loop has finished. In the parallel case it might decide that this would create a temporary desynchronised view of result[i] in the other threads and hence decide not to apply the optimisation. Just for the record, GCC 4.7.1 with -O3 -ftree-vectorize does the temporary-variable trick with both OpenMP enabled and not.
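In code, that temporary-variable rewrite might look like the following (a minimal sketch using the same variables as the question; the private accumulator also eases the false-sharing pressure, since result[i] is written only once per row):

#pragma omp parallel for private(j) shared(matrix,vector,result,m_size)
for(i=0; i<m_size; i++)
{
    long long sum = 0;                    /* private accumulator, can live in a register */
    for(j=0; j<m_size; j++)
        sum += matrix[i][j] * vector[j];
    result[i] = sum;                      /* single write to the shared array per row */
}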

Because when OpenMP distributes the work among threads there is a lot of administration/synchronisation going on to ensure the values in your shared matrix and vector are not corrupted somehow. Even though they are read-only: humans see that easily, your compiler may not.
Things to try out for pedagogic reasons:
0) What happens if matrix and vector are not shared?
1) Parallelize the inner "j-loop" first, keep the outer "i-loop" serial. See what happens.
2) Do not collect the sum in result[i], but in a variable temp and assign its contents to result[i] only after the inner loop is finished to avoid repeated index lookups. Don't forget to init temp to 0 before the inner loop starts.

I did this in reference to Hristo's comment. I tried using schedule(static, 256). For me, changing the default chunk size does not help; maybe it even makes things worse. I printed out the thread number and its index with and without setting the schedule, and it's clear that OpenMP already chooses the thread indices to be far from one another, so false sharing does not seem to be an issue. For me this code already gives a good boost with OpenMP.
#include "stdio.h"
#include <omp.h>
void loop_parallel(const int *matrix, const int ld, const int*vector, long long* result, const int m_size) {
#pragma omp parallel for schedule(static, 250)
//#pragma omp parallel for
for (int i=0;i<m_size;i++) {
//printf("%d %d\n", omp_get_thread_num(), i);
long long sum = 0;
for(int j=0;j<m_size;j++) {
sum += matrix[i*ld +j] * vector[j];
}
result[i] = sum;
}
}
void loop(const int *matrix, const int ld, const int*vector, long long* result, const int m_size) {
for (int i=0;i<m_size;i++) {
long long sum = 0;
for(int j=0;j<m_size;j++) {
sum += matrix[i*ld +j] * vector[j];
}
result[i] = sum;
}
}
int main() {
const int m_size = 1000;
int *matrix = new int[m_size*m_size];
int *vector = new int[m_size];
long long*result = new long long[m_size];
double dtime;
dtime = omp_get_wtime();
loop(matrix, m_size, vector, result, m_size);
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
dtime = omp_get_wtime();
loop_parallel(matrix, m_size, vector, result, m_size);
dtime = omp_get_wtime() - dtime;
printf("time %f\n", dtime);
}

Related

Reduce the number of executions by 3 times, but the execution efficiency is almost unchanged. In C

In C, I reduced the total number of loop executions by nearly 3 times, but when I measured the execution time, I found there is almost no improvement. All optimization levels have been tested, and the results are basically the same (including O0, O1, O2 and O3). I guess it's a problem with the compiler, but I want to know what causes this situation, and what to do to make the results meet expectations.
The code is as follow:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>

#define Len 10000000

// Two variables that count the number of loops
int count1 = 0;
int count2 = 0;

int main(int argc, const char * argv[]) {
    srandom((unsigned)time(NULL));
    // An array to increase the index,
    // the range of its elements is 1-256
    int rand_arr[128];
    for (int i = 0; i < 128; ++i)
        rand_arr[i] = random()%256 + 1;
    // A random text, the range of its elements is 0-127
    char *tex = malloc((sizeof *tex) * Len);
    for (int i = 0; i < Len; ++i)
        tex[i] = random()%128;
    // The first testing
    clock_t start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]])
        count1++;
    printf("No.1: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    // The second testing (x3)
    start = clock();
    for (int i = 0; i < Len; i += rand_arr[tex[i]]+256)
        count2++;
    printf("No.2: %lf s\n", ((double)(clock() - start)) / CLOCKS_PER_SEC);
    printf("count1: %d\n", count1);
    printf("count2: %d\n", count2);
    return 0;
}
The printed results (averages) are as follows:
No.1: 0.002213 s
No.2: 0.002209 s
count1: 72661
count2: 25417
The problem comes from the processor itself and not the compiler. This is a complex problem related to the behaviour of the CPU caches, CPU prefetching units and the random access pattern.
Both codes read the tex array based on an i value that cannot be easily predicted ahead of time by the processor because of the random increments stored in rand_arr. Because tex is relatively big, it will likely not be fully stored in the L1 cache (nor in an intermediate L2 cache, if any) but rather in the last-level cache (LLC) or even in RAM. As a result, tex needs to be reloaded in each loop from the LLC cache or the RAM. The latency of the LLC cache and the RAM is relatively high nowadays. The thing is that the second loop is harder to predict and less cache-friendly than the first, although the number of iterations is smaller!
On x86 CPUs, caches pack values in blocks of 64 bytes called cache lines. When a value is fetched from the main memory or another cache (typically due to a cache miss), a full cache line is fetched. Subsequent accesses to the same cache line are faster because the CPU does not need to fetch it again (as long as the cache line is not invalidated).
In the first loop, the average increment of i is 128 (because the mean of rand_arr is about 128). This means that the average stride between two fetched items from tex is 128. In the worst case, the stride is 256. In the second loop, the average stride is 256+128=384 and in the worst case it is 256+256=512. When the stride is smaller than 64, there is a high probability that the item will already have been fetched; this can happen in the first loop but never in the second. Moreover, the prefetching units can prefetch cache lines when several accesses are contiguous or close to each other. This enables the processor to fetch most items of the tex array ahead of time in the first loop. Meanwhile, in the second loop the prefetcher will likely fail to recognize any cache-line access pattern. The prefetching units will probably not prefetch anything (because it is too expensive to do so) and the result is many cache misses with a high latency that cannot be hidden, because the accesses are inherently sequential and unpredictable. If the prefetching units decide to prefetch all the cache lines, then the second loop should not be faster than the first (because the two loops are both bound by the memory hierarchy).
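To see the stride effect in isolation, you could time fixed-stride sweeps over a buffer of the same size. This is a rough sketch under the assumption that fixed strides of 128 and 384 bytes approximate the average increments of the two loops; it removes the randomness, so the prefetcher behaviour will not be identical to the original program:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define LEN 10000000

long sweep(const char *buf, int stride) {
    long sum = 0;
    for (long i = 0; i < LEN; i += stride)
        sum += buf[i];              // one byte read per stride, like tex[i] in the question
    return sum;
}

int main(void) {
    char *buf = malloc(LEN);
    for (long i = 0; i < LEN; ++i)
        buf[i] = (char)i;           // touch every page so timing is not dominated by page faults
    clock_t start = clock();
    long s1 = sweep(buf, 128);      // roughly the average stride of the first loop
    printf("stride 128: %lf s (sum %ld)\n", ((double)(clock() - start)) / CLOCKS_PER_SEC, s1);
    start = clock();
    long s2 = sweep(buf, 384);      // roughly the average stride of the second loop
    printf("stride 384: %lf s (sum %ld)\n", ((double)(clock() - start)) / CLOCKS_PER_SEC, s2);
    free(buf);
    return 0;
}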
Note that random and srandom are not standard functions.
Also, be aware that clock is not always precise on all platforms. Actually, it has a precision of 1 ms on my Windows machine (using GCC and MinGW). This can also be seen on some Linux systems.
There are a couple of things that could be causing this:
If you're timing your code, you should have:
startTime = clock();
CODE
endTime = clock();
Then print your results / do the analysis afterwards. You do some math on it and use the printf function, which is horrifically inefficient as far as timing goes. Also, the cast to double is not necessary and can add to your measured time, as double math is slower than int math. Stick to int where you can.
Odd for-loop code - the standard form for a for loop is:
for(int i = 0;i<length;i++)
Code
You have
for(int i = 0;i<length;code)
i++;
Which is odd as far as syntax goes, and may be affecting your timing
clock() - this may be affecting your timing. If clock() is returning a double I would suggest doing it a different way, with a function that returns int or unsigned int, as doubles will destroy your timing as stated above. If you're worried about this, I'd suggest testing it by way of the following:
startTime = clock();
for(int i = 0; i < 10000; i++)
    dummy = clock();
endTime = clock();
totalTime = (endTime - startTime)/10000;
for loop - this in itself can be the main source of your base timing (although it's unlikely, especially since it doesn't seem like you're doing any particularly complicated math). You can fix this by using the #pragma unroll compiler directive, which will basically copy and paste all iterations of your for loop into your code, removing its timing effects.

Reproducibility issue of an openMP output

I am going through OpenMP tutorials and, as I progressed, I wrote an OpenMP version of a code that calculates PI by using an integral.
I have written a serial version, so I know the serial counterpart is OK. Once the OpenMP version was completed, I noticed that every time I run it, it gives me a different answer. If I do several runs, I can see that the output is broadly around the right number, but still, I didn't expect several OpenMP runs to give different answers.
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void main()
{
    int nb=200, i, blob;
    float summ=0, dx, argg;

    dx = 1./nb;
    printf("\n dx------------: %f \n", dx);
    omp_set_num_threads(nb);

    #pragma omp parallel
    {
        blob = omp_get_num_threads();
        printf("\n we have now %d number of threads...\n", blob);
        int ID = omp_get_thread_num();
        i = ID;
        printf("\n i is now: %d \n", i);
        argg = (4./(1.+i*dx*i*dx))*dx;
        summ = summ + argg;
        printf("\t\t and summ is %f \n", summ);
    }
    printf("\ntotal summ after loop: %f\n", summ);
}
I compile this code on a RedHat using gcc -f mycode.c -fopenmp and when I run it, say 3 times, I get:
3.117
3.113
3.051
Could anyone help me understand why I get different results? Am I doing something wrong? The parallelism just splits the integration interval, but as the rectangles are calculated, shouldn't they sum up to the same value at the end?
the serial version gives me 3.13
(the fact I don't get 3.14 is normal because I used a very coarse sampling of the integral with just 200 divisions between 0 and 1)
I have also tried to add a barrier, but I still get different answers; they are closer to the serial version, but there is still a spread in the values and they are not identical...
I believe the problem lies in declaring int i and float argg outside the parallel loop.
What's happening is that all your 200 threads overwrite i and argg, so sometimes the argg of one thread is overwritten by the argg of another thread, resulting in the unpredictable error that you observe.
Here is a working code that always prints the same value (up to 6 decimals or so):
void main()
{
    int nb = 200, blob;
    float summ = 0, dx; // , argg;

    dx = 1. / nb;
    printf("\n dx------------: %f \n", dx);
    omp_set_num_threads(nb);

    #pragma omp parallel
    {
        blob = omp_get_num_threads();
        printf("\n we have now %d number of threads...\n", blob);
        int i = omp_get_thread_num();
        printf("\n i is now: %d \n", i);
        float argg = (4. / (1. + i * dx*i*dx))*dx;
        summ = summ + argg;
        printf("\t\t and summ is %f \n", summ);
    }
    printf("\ntotal summ after loop: %f\n", summ);
}
However, changing the last line to %.9f reveals that it's not in fact the exact same float number. This is due to numerical errors in floating-point additions: a+b+c does not guarantee the same result as a+c+b. You can try this in the example below:
First, add float* arr = new float[nb]; before the parallel loop AND arr[i] = argg; within the parallel loop (after argg is defined, of course). Then add the following after the parallel loop:
float testSum = 0;
for (int i = 0; i < nb; i++)
    testSum += arr[i];
printf("random sum: %.9f\n", testSum);

std::sort(arr, arr + nb);   // needs #include <algorithm>
testSum = 0;
for (int i = 0; i < nb; i++)
    testSum += arr[i];
printf("sorted sum: %.9f\n", testSum);

testSum = 0;
for (int i = nb-1; i >= 0; i--)
    testSum += arr[i];
printf("reversed sum: %.9f\n", testSum);
Most likely, the sorted sum and the reversed sum are slightly different, even though they are computed by adding up the exact same 200 numbers.
Another thing you might want to note is that you're very unlikely to find a processor that can actually run 200 threads in parallel. Most common processors can handle 4 to 32 threads, while specialised server processors can go up to 112 threads with the $15k Xeon Platinum 9282.
As such, we usually do the following:
We remove omp_set_num_threads(nb); to use the recommended number of threads
We remove int i = omp_get_thread_num(); and use the int i from the for loop instead
We rewrite the loop as a for loop:
#pragma omp parallel for
for (int i = 0; i < nb; i++)
{...}
The result should be identical, but you're now only using as many threads as are available on the actual hardware. This reduces context switching between threads and should improve the time performance of your code.
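Putting those suggestions together, the rewritten program might look something like this (a minimal sketch; I have also added a reduction clause on summ, because without it the concurrent updates to summ would still race, which is exactly what the next answer addresses):

#include <stdio.h>
#include <omp.h>

int main()
{
    const int nb = 200;
    float summ = 0;
    float dx = 1.f / nb;

    #pragma omp parallel for reduction(+:summ)   // each thread gets a private copy of summ,
    for (int i = 0; i < nb; i++)                 // combined once at the end of the loop
        summ += (4.f / (1.f + i*dx*i*dx)) * dx;

    printf("total summ after loop: %f\n", summ);
    return 0;
}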
The problem comes from the variables summ, argg and i. They belong to the global sequential scope and cannot be modified without precautions. You will have races between threads, and this may result in unexpected values in these variables. Races are completely nondeterministic, and that explains the different results you get. You may as well get the correct result or any incorrect result, depending on the temporal ordering of the reads and writes to these variables.
The proper way to deal with this problem :
For the variables argg and i: they are declared in the global scope, but they are used to perform temporary computations in the threads. You should either declare them in the parallel region to make them thread-private, or add private(argg,i) to the omp directive. Note that there is also a potential problem for blob, but its value is identical in all threads, so this does not modify the behaviour of the program.
For the variable summ the situation is different. This is really a global variable that accumulates values from the threads. It must remain global, but you must add the atomic OpenMP directive when modifying it. The complete read-modify-write operation on the variable then becomes indivisible, and this ensures a race-free modification.
Here is a modified version of your code that gives coherent results (but floats are not associative, so the last decimal may change).
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

void main()
{
    int nb=200, i, blob;
    float summ=0, dx, argg;

    dx = 1./nb;
    printf("\n dx------------: %f \n", dx);
    omp_set_num_threads(nb);

    #pragma omp parallel private(argg,i)
    {
        blob = omp_get_num_threads();
        printf("\n we have now %d number of threads...\n", blob);
        int ID = omp_get_thread_num();
        i = ID;
        printf("\n i is now: %d \n", i);
        argg = (4./(1.+i*dx*i*dx))*dx;
        #pragma omp atomic
        summ = summ + argg;
        printf("\t\t and summ is %f \n", summ);
    }
    printf("\ntotal summ after loop: %f\n", summ);
}
As already noted, this is not the best way to use threads. Creating and synchronizing threads is expensive, and it is rarely useful to have more threads than the number of cores.

Why should I use a reduction rather than an atomic variable?

Assume we want to count something in an OpenMP loop. Compare the reduction
int counter = 0;
#pragma omp for reduction( + : counter )
for (...) {
    ...
    counter++;
}
with the atomic increment
int counter = 0;
#pragma omp for
for (...) {
...
#pragma omp atomic
counter++
}
The atomic access provides the result immediately, while a reduction only assumes its correct value at the end of the loop. For instance, reductions do not allow this:
int t = counter;
if (t % 1000 == 0) {
    printf("%dk iterations\n", t/1000);
}
thus providing less functionality.
Why would I ever use a reduction instead of atomic access to a counter?
Short answer:
Performance
Long Answer:
Because an atomic variable comes with a price, and this price is synchronization.
In order to ensure that there are no race conditions, i.e. two threads modifying the same variable at the same moment, threads must synchronize, which effectively means that you lose parallelism, i.e. threads are serialized.
A reduction, on the other hand, is a general operation that can be carried out in parallel using parallel reduction algorithms.
Read this and this article for more info about parallel reduction algorithms.
Addendum: Getting a sense of how a parallel reduction works
Imagine a scenario where you have 4 threads and you want to reduce an 8-element array A. You can do this in 3 steps:
Step 0. Threads with index i<4 compute A[i] = A[i] + A[i+4].
Step 1. Threads with index i<2 compute A[i] = A[i] + A[i+2].
Step 2. The thread with index i<1 computes A[i] = A[i] + A[i+1].
At the end of this process you will have the result of your reduction in the first element of A, i.e. A[0].
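A minimal sketch of those 3 steps in OpenMP is below. It illustrates the idea only; the actual reduction clause is implemented by the runtime and typically uses per-thread partial sums rather than in-place halving of an array:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    int A[8] = {1, 2, 3, 4, 5, 6, 7, 8};

    /* halve the active range each step: 8 -> 4 -> 2 -> 1 (3 steps) */
    for (int half = 4; half >= 1; half /= 2) {
        #pragma omp parallel for num_threads(4)
        for (int i = 0; i < half; i++)
            A[i] = A[i] + A[i + half];    /* thread i combines its pair of elements */
    }

    printf("reduction result: %d\n", A[0]);   /* prints 36 */
    return 0;
}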
Performance is the key point.
Consider the following program
#include <stdio.h>
#include <omp.h>

#define N 1000000

int a[N], sum;

int main(){
    double begin, end;

    begin = omp_get_wtime();
    for(int i = 0; i < N; i++)
        sum += a[i];
    end = omp_get_wtime();
    printf("serial %g\t", end-begin);

    begin = omp_get_wtime();
    #pragma omp parallel for
    for(int i = 0; i < N; i++) {
        #pragma omp atomic
        sum += a[i];
    }
    end = omp_get_wtime();
    printf("atomic %g\t", end-begin);

    begin = omp_get_wtime();
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < N; i++)
        sum += a[i];
    end = omp_get_wtime();
    printf("reduction %g\n", end-begin);
}
When executed (gcc -O3 -fopenmp), it gives:
serial 0.00491182 atomic 0.0786559 reduction 0.001103
So, approximately: atomic = 20x serial = 80x reduction.
The 'reduction' exploits the parallelism properly: on a 4-core computer, we can get a 3-6x performance boost vs "serial".
Now, "atomic" is 20 times longer than "serial". Not only, as explained in the previous answer, the serialization of memory accesses disables parallelism, but all memory accesses are done by atomic operations. These operations require at least 20--50 cycles on modern computers and will dramatically slow down your performances if used intensively.

Effect of cache size on code

I want to study the effect of the cache size on code. For programs operating on large arrays, there can be a significant speed-up if the array fits in the cache.
How can I measure this?
I tried to run this C program:
#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

#define L1_CACHE_SIZE 32    // Kbytes, 8192 integers
#define L2_CACHE_SIZE 256   // Kbytes, 65536 integers
#define L3_CACHE_SIZE 4096  // Kbytes

#define ARRAYSIZE 32000
#define ITERATIONS 250

int arr[ARRAYSIZE];

/*************** TIME MEASUREMENTS ***************/
double microsecs() {
    struct timeval t;
    if (gettimeofday(&t, NULL) < 0)
        return 0.0;
    return (t.tv_usec + t.tv_sec * 1000000.0);
}

void init_array() {
    int i;
    for (i = 0; i < ARRAYSIZE; i++) {
        arr[i] = (rand() % 100);
    }
}

int operation() {
    int i, j;
    int sum = 0;
    for (j = 0; j < ITERATIONS; j++) {
        for (i = 0; i < ARRAYSIZE; i++) {
            sum =+ arr[i];
        }
    }
    return sum;
}

void main() {
    init_array();
    double t1 = microsecs();
    int result = operation();
    double t2 = microsecs();
    double t = t2 - t1;
    printf("CPU time %f milliseconds\n", t/1000);
    printf("Result: %d\n", result);
}
taking different values of ARRAYSIZE and ITERATIONS (keeping the product, and hence the number of instructions, constant) in order to check whether the program runs faster when the array fits in the cache, but I always get the same CPU time.
Can anyone tell me what I am doing wrong?
What you really want to do is build a "memory mountain." A memory mountain helps you visualize how memory accesses affect program performance. Specifically, it measures read throughput vs spatial locality and temporal locality. Good spatial locality means that consecutive memory accesses are near each other and good temporal locality means that a certain memory location is accessed multiple times in a short amount of program time. Here is a link that briefly mentions cache performance and memory mountains. The 3rd edition of the textbook mentioned in that link is a very good reference, specifically chapter 6, for learning about memory and cache performance. (In fact, I'm currently using that section as a reference as I answer this question.)
Another link shows a test function that you could use to measure cache performance, which I have copied here:
/* data is assumed to be a global int array with at least elems elements */
void test(int elems, int stride)
{
    int i, result = 0;
    volatile int sink;

    for (i = 0; i < elems; i += stride)
        result += data[i];
    sink = result;   /* keeps the compiler from optimising the loop away */
}
Stride determines the spatial locality - how far apart consecutive memory accesses are.
The idea is to estimate the number of cycles this function takes to run. To get throughput, you take (size / stride) / (cycles / MHz), where size is the size of the array in bytes, cycles is the measured cycle count of this function, and MHz is the clock speed of your processor. You should call the function once before taking any measurements to "warm up" your cache, then run it again and take the measurement.
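A rough sketch of such a driver is below; get_cycles() is a placeholder for whatever cycle counter you use (e.g. rdtsc on x86) and is an assumption here, not a real API:

/* returns estimated read throughput in MB/s for a given working-set size and stride */
double read_throughput(int size_bytes, int stride, double mhz)
{
    int elems = size_bytes / sizeof(int);

    test(elems, stride);                        /* warm-up run: load the data into the cache */

    unsigned long long start = get_cycles();    /* hypothetical cycle counter */
    test(elems, stride);                        /* timed run */
    unsigned long long cycles = get_cycles() - start;

    /* bytes actually read, divided by elapsed time in microseconds -> MB/s */
    return ((double)size_bytes / stride) / ((double)cycles / mhz);
}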
I found a GitHub repository that you could use to build a 3D memory mountain on your own machine. I encourage you to try it on multiple machines with different processors and compare differences.
There's a typo in your code. =+ instead of +=.
The arr array is linked into the BSS [uninitialized] section. The default value for the variables in this section is zero. All pages in this section are initially mapped R/O to a single zero page. This is Linux/Unix-centric, but probably applies to most modern OSes.
So, regardless of the array size, you're only fetching from a single page, which will get cached, so that's why you get the same results.
You'll need to break the "zero page mapping" by writing something to all of arr before doing your tests. That is, do something like memset first. This will cause the OS to create a linear page mapping for arr using its COW (copy-on-write) mechanism.
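For example, something along these lines before the timed section (a minimal sketch of what the answer suggests; writing any non-zero byte pattern is enough to force real pages to be allocated):

#include <string.h>

/* touch every byte of arr so the OS backs it with real, distinct pages
   instead of the shared zero page */
memset(arr, 1, sizeof(arr));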

How to declare arrays in omp pragma

I'm modifying an existing library from single-threaded to multi-threaded. I have code like that provided below. I can't understand how to declare the arrays x, y, array1, array2. Which of them should I declare as shared or threadprivate? Do I need to use flush? If yes, in which cases?
//global variables
static int array1[100000];
static int array2[100000];

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];

#pragma omp parallel for
for(i=0; i<100; i++)
{
    y[i] = i*i - 3*i - 10*random();
    x[i] = myfunc(i, y);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
    array1[array2[j]] += z[j] + j;
    return array1[j];
}
The problem I see in your code is in this line
array1[array2[j]] += z[j]+j;
This means that array1 can potentially be modified at whichever index j. And j, in the context of the function myfunc(), corresponds to the index i at the upper level. The trouble is that i is the index upon which the loop is parallelised; therefore, array1 can be modified concurrently at any moment by any thread.
The crucial question now is to know if array2 can have the same value for different indexes:
If you are sure that for whatever j1 != j2 you have array2[j1] != array2[j2], then your code is trivially parallelisable.
If there are values j1 != j2 for which you have array2[j1] == array2[j2], then you have dependencies across iterations for array1 and the code is no longer (simply and/or effectively) parallelisable.
So let's assume we are in the former case; then the OpenMP directives you already have in the code are sufficient:
i needs to be private but is implicitly already so as it is the index of the parallelised loop;
x and y should be shared (which they are by default), since their access index is the one that is distributed in parallel (namely i), so their parallel updates do not overlap;
array2 is only accessed in read mode, so it's a no-brainer shared (which it is by default again);
array1 is read and written, but due to our initial assumption, there are no possible collisions between threads, since their sets of indexes for accessing it are disjoint. Therefore, the default shared qualifier works just fine.
But now, if we are in the case where array2 allows for non-disjoint sets of indexes for accessing array1, we will have to preserve the ordering of these accesses / updates of array1. This can be done with the ordered clause / directive. And since we still want the parallelisation to be (somewhat) effective, we will have to add a schedule(static,1) clause to the parallel directive. For more details about this, please refer to this great answer. Your code would then look like this:
//global variables
static int array1[100000];
static int array2[100000];

//part of program code from one of the functions
int i;
int x[1000000];
int y[1000000];

#pragma omp parallel for schedule(static,1) ordered
for(i=0; i<100; i++)
{
    y[i] = i*i - 3*i - 10*random();
    x[i] = myfunc(i, y);
}

//additional function
int myfunc(j, z)
int j;
int z[];
{
    int tmp = z[j] + j;
    #pragma omp ordered
    array1[array2[j]] += tmp;
    return array1[j];
}
This would (I think) work and, in terms of parallelism, not be too bad (for a limited number of threads), but it has a big (enormous) flaw: it generates tons of false sharing while updating x and y. Therefore, it might be more advantageous to use per-thread copies of these and to only update the global arrays at the end. The central part of the code snippet would then look something like this (not tested at all):
int nbth;
#pragma omp parallel
{
    #pragma omp single
    nbth = omp_get_num_threads();
}

int *xm = malloc(1000000*nbth*sizeof(int));
int *ym = malloc(1000000*nbth*sizeof(int));

#pragma omp parallel
{
    int tid = omp_get_thread_num();
    int *xx = xm + 1000000*tid;
    int *yy = ym + 1000000*tid;

    #pragma omp for schedule(static,1) ordered
    for(i=0; i<100; i++)
    {
        yy[i] = i*i - 3*i - 10*random();
        xx[i] = myfunc(i, yy);
    }

    #pragma omp for
    for (i=0; i<100; i++)
    {
        int j;
        x[i] = 0;
        y[i] = 0;
        for (j=0; j<nbth; j++)
        {
            x[i] += xm[j*1000000 + i];
            y[i] += ym[j*1000000 + i];
        }
    }
}
free(xm);
free(ym);
This will avoid the false sharing, but will increase the number of memory accesses and the overhead of parallelisation. So it might not be very beneficial after all. You'll have to see for yourself in your actual code.
BTW, the fact that i only loops up to 100 looks suspicious to me when the corresponding arrays are declared to be 1000000 elements long. If 100 is truly the correct size for the loop, then the parallelisation probably isn't worth it anyway...
EDIT:
As Jim Cownie pointed out in a comment, I missed the call to random() as a source of dependency across iterations, preventing proper parallelisation. I'm not sure how relevant this is in the context of your actual code (I doubt you truly fill your y array with random data), but in case you do, you'll have to change this part in order to do it in parallel (otherwise, the serialisation needed to generate the random number series will just kill whatever gain comes from parallelisation). But generating uncorrelated pseudo-random series in parallel is not as simple as it sounds. You can use rand_r() instead of random() as a thread-safe alternative for the RNG and initialise its seed per thread to different values. However, you cannot be sure that one thread's series won't collide with another thread's too soon (with a thread starting to generate the very same series as another one after a while, messing up your expected asymptotic behaviour).
As I'm pretty sure you're not truly interested in that, I won't develop it any further (this is a whole question all by itself); I will just use the (not so good) rand_r() trick. If you want more details on a possible alternative for generating good parallel random series, just ask another question.
In the case where no problem comes from array2 (disjoint sets of indexes), the code would become:
// global variable
unsigned int seed;
#pragma omp threadprivate(seed)

// done just once somewhere
#pragma omp parallel
seed = omp_get_thread_num(); //or something else, but different for each thread

// then the parallelised loop
#pragma omp parallel for
for(i=0; i<100; i++)
{
    y[i] = i*i - 3*i - 10*rand_r(&seed);
    x[i] = myfunc(i, y);
}
Then the other case would have to use the same trick in addition to what has already been described. But again, keep in mind that this isn't good enough for serious RNG-based computation (like Monte-Carlo methods). It does the job if all you want is to generate some values for testing purposes, but it won't pass any serious statistical quality test.
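For completeness, a minimal sketch of that other case (non-disjoint indexes in array2), combining the ordered loop from above with the rand_r() trick; it assumes seed has been initialised per thread as shown above and that myfunc() contains the #pragma omp ordered update:

#pragma omp parallel for schedule(static,1) ordered
for(i=0; i<100; i++)
{
    y[i] = i*i - 3*i - 10*rand_r(&seed);   // seed is the threadprivate variable from above
    x[i] = myfunc(i, y);                   // the ordered update of array1 happens inside myfunc()
}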

Resources