This is a question from a previous year's exam.
Consider the following fragment of C code:
int i, array[1000000];
array[0] = 0;
for (i = 1; i < 1000000; i++)
    array[i] = array[i-1] + 3;
Can we simply run the 1,000,000 iterations of the array update statement in the for loop in parallel? If not, change the update statement so that it can run in parallel and still produce the same final contents of the array.
I understand that it is not possible to simply run the 1,000,000 iterations of the update statement in parallel. The only approaches that come to mind are recursion, which is not parallel, and using 1,000,000 threads, which is not a great idea.
So is there another way of getting this done in parallel with very few update statements? We may use Open MPI or OpenCL.
Edit: This is not a homework question, but I think it was given as homework in some school. This is from a past exam paper. I uploaded it here
The problem is that you cannot parallelize that loop as written, not even with just 2 threads, since each iteration depends on the previous one.
Your algorithm produces:
array[0] = 0;
array[1] = 3;
array[2] = 6;
...
So, you can rewrite the update statement in a way that each iteration does not depend on the previous one:
int i, array[1000000];
array[0] = 0;
for (i = 1; i < 1000000; i++)
    array[i] = 3*i;
In this way you have removed the data dependency and you can easily parallelize the loop (e.g. with OpenMP or MPI).
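With the dependency removed, a minimal OpenMP sketch might look like this, as a complete toy program (assuming compilation with -fopenmp; the array is made static here only to avoid a ~4 MB stack allocation):

#include <stdio.h>

#define N 1000000

static int array[N];   /* static only to avoid a ~4 MB stack allocation */

int main(void)
{
    array[0] = 0;
    /* no loop-carried dependency left, so the iterations can be
       split across threads (compile with -fopenmp) */
    #pragma omp parallel for
    for (int i = 1; i < N; i++)
        array[i] = 3 * i;

    printf("%d\n", array[N - 1]);   /* expected: 2999997 */
    return 0;
}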
I have a piece of code in C which generates an int matrix and assigns 0 to every field. Afterwards, when I run this:
#pragma omp parallel for
for (i = 0; i < 100; i++)
    for (j = 0; j < 100; j++)
        a[i][j] = a[i][j] + 1;
without OpenMP, I get, as expected, 1s in every field.
But when I run it in parallel, I get blotches of random values (0s and sometimes even 2s) every once in a while, despite (what I think is) a piece of code with no data dependency. Every time it's run, it produces a different result with different blotches of messy values. Am I missing something? I made sure it's the same code by writing it in serial first, then copying it over and adding the extra lines to make it parallel. Thanks in advance!
Your i and j variables aren't made private in the parallel pragma.
According to http://supercomputingblog.com/openmp/tutorial-parallel-for-loops-with-openmp/ this can cause the j variable to be shared across all parallel threads, meaning it gets incremented too many times and rows get skipped (causing 0's).
I suspect that with the right ordering this also causes increments to be lost (causing 2's, 3's and 4's), but I'm not sure off the top of my head what order that is.
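A minimal fix, as a sketch: make both loop variables private, either by declaring them inside the for statements (as below) or by adding private(j) to the pragma. This assumes a is the 100x100 int matrix from the question:

/* loop variables declared in the for statements are automatically
   private to each thread; equivalently, keep the outer declarations
   and write: #pragma omp parallel for private(j) */
#pragma omp parallel for
for (int i = 0; i < 100; i++)
    for (int j = 0; j < 100; j++)
        a[i][j] = a[i][j] + 1;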
I am trying to compare linear memory access to random memory access. I traverse an array in the order of its indices to measure the performance of linear access. To measure performance under random access, however, I want to traverse the array in a random order, i.e. arr[8], arr[17], arr[34], arr[2], ...
Can I use pointer chasing to achieve this while ensuring that no index is accessed twice? Is pointer chasing the best approach in this case?
If your goal is to show that sequential access is faster than non-sequential access, pointer chasing only the latter is not a good way to demonstrate that. You would be comparing access via a single pointer plus a simple offset against dereferencing one or more pointers before offsetting.
To use pointer chasing, you'd have to apply it to both cases. Here's an example:
int arr[n], i;
int *unshuffled[n];
int *shuffled[n];

for (i = 0; i < n; i++) {
    unshuffled[i] = arr + i;
}

/* I'll let you figure out how to randomize your indices */
shuffle(unshuffled, shuffled);

/* Do timing on these two loops */
for (i = 0; i < n; i++) {
    do_stuff(*unshuffled[i]);
}
for (i = 0; i < n; i++) {
    do_stuff(*shuffled[i]);
}
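The shuffle() above is left as an exercise; one possible sketch uses a Fisher-Yates shuffle, taking the element count as an extra argument (so the call above would become shuffle(unshuffled, shuffled, n)). This helper and its signature are illustrative, not part of the original answer, and rand() is assumed to be seeded elsewhere:

#include <stdlib.h>

/* copy the pointers, then Fisher-Yates shuffle the copy in place */
void shuffle(int *src[], int *dst[], int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
    for (int i = n - 1; i > 0; i--) {
        int j = rand() % (i + 1);   /* slightly biased; fine for a benchmark */
        int *tmp = dst[i];
        dst[i] = dst[j];
        dst[j] = tmp;
    }
}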
If you want to time the direct access better, though, you could construct a simple formula for advancing the index instead of randomizing the access completely:
for (i = 0; i < n; i++) {
    do_stuff(arr[i]);
}
for (i = 0; i < n; i++) {
    do_stuff(arr[i / 2 + (i % 2) * (n / 2)]);
}
This will only work properly for even n as shown, but it illustrates the idea. You could go so far as to compensate for the extra index arithmetic inside do_stuff.
Probably the most apples-to-apples test would be to literally access the indices you want, without loops or additional computations:
do_stuff(arr[0]);
do_stuff(arr[1]);
do_stuff(arr[2]);
...
do_stuff(arr[123]);
do_stuff(arr[17]);
do_stuff(arr[566]);
...
Since I'd imagine you'd want to test with large arrays, you can write a program to generate the actual test code for you, and possibly compile and run the result.
I can tell you that for arrays in C the access time is constant regardless of the index being accessed. There will be no difference between accessing them randomly or sequentially, other than the fact that the randomization itself introduces additional computation.
But, to really answer your question, you would probably be best off building some kind of lookup array, shuffling it a few times, and using that array to get the next index. Obviously, you would then be accessing two arrays, one sequentially and another randomly, thus making the exercise pretty much useless.
I'm trying to write code to solve for x in a system of linear equations Ax = B, so I used LU decomposition. Now that I have L and U correctly, I'm stuck on the forward substitution needed to get y in Ly = B.
I wrote some code in MATLAB that works perfectly, but I can't get the same results when rewriting the code in C. I was wondering if someone might know what I'm doing wrong; I'm not fully used to C.
Here's my code in MATLAB:
y(1,1) = B(1,1)/L(1,1);
for i = 2:n
    sum = 0;
    sum2 = 0;
    for k = 1:i-1
        sum = sum + L(i,k)*y(k,1);
    end
    y(i,1) = (B(i,1) - sum)/L(i,i);
end
where L is my lower triangle matrix, B is a vector of the same size, and n is 2498 in this case.
My C code is the following:
float sum = 0;
y_prev[0] = B[0] / (float)Low[0][0];
for (int i = 1; i < CONST; i++)
{
    for (int k = 0; k < i-1; k++)
    {
        sum = sum + Low[i][k] * y_prev[k];
    }
    y_prev[i] = (B[i] - sum) / (float)Low[i][i];
}
One difference between the codes comes from the way you've changed the for loop indices to work with the zero based indexing in C. (I can't run the MATLAB version, and don't have some of the context for the code, so there may be other differences.)
The variables i and k have values which are smaller by 1 in the C code. This is exactly what you want for the loop indices, but a problem arises when you use i to control the number of iterations in the inner loop over k. This is i-1 in both versions of the code, even though i has different values. For instance, in the first iteration of the outer loop the inner loop runs once in the MATLAB code but not at all in the C one.
A possible fix would be to rewrite the inner loop in the C code as
for (int k = 0; k < i; k++)
{
    sum = sum + Low[i][k] * y_prev[k];
}
A second difference is that you're resetting sum to zero for each outer iteration in the MATLAB code but not in the C code (the MATLAB code also has a sum2 which doesn't seem to be used?). This will cause differences in y_prev[i] for i > 0.
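Putting both fixes together, the C loop might look like this (keeping the original names; this is a sketch under the assumption that CONST is the n from the MATLAB version):

float sum;
y_prev[0] = B[0] / (float)Low[0][0];
for (int i = 1; i < CONST; i++)
{
    sum = 0;                        /* reset for every row, as in the MATLAB code */
    for (int k = 0; k < i; k++)     /* k < i, not k < i-1 */
    {
        sum = sum + Low[i][k] * y_prev[k];
    }
    y_prev[i] = (B[i] - sum) / (float)Low[i][i];
}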
I'm having trouble parallelizing a block of C code.
The block is something like this:
for(n = 0; n < N; n++)
{
    for(x = 0; x < X; x++)
    {
        var_1[x] = var_1[x] * 3 * var_2[x];
        var_2[x] = var_2[x] * 2;
    }
    for(y = 0; y < Y; y++)
    {
        var_3[y] = var_1[y] * var_2[y];
    }
}
That's not the actual code (it's for an assignment, so I can't post the source) but the problem is that the updates sit in a nested loop, and each outer iteration depends on the previous one.
Simply adding a #pragma omp for in front of the outer loop doesn't work, because each thread's chunk of iterations would start from the initial values of var_1, var_2 and var_3 rather than the values produced by the preceding iterations.
Please let me know if I can explain any better! I'm quite lost.
As HighPerformanceMark comments, the inner loop has no loop-carried dependency. That invites vectorisation first and parallelisation second, at least to me. More importantly, the cheapest computation is the one you never do in the first place: as written, var_3[] is never read in either loop, so you can remove the second inner loop entirely and compute var_3[] only once, after the last iteration of the outer loop. A sketch follows.
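Here is that restructuring as a sketch, assuming var_3 really is write-only inside the nest as in the fragment shown:

for (n = 0; n < N; n++)
{
    /* the x loop has no loop-carried dependency, so its iterations can
       be split across threads; the n loop itself stays sequential */
    #pragma omp parallel for
    for (x = 0; x < X; x++)
    {
        var_1[x] = var_1[x] * 3 * var_2[x];
        var_2[x] = var_2[x] * 2;
    }
}
/* var_3 is never read in the original nest, so compute it once at the end */
#pragma omp parallel for
for (y = 0; y < Y; y++)
{
    var_3[y] = var_1[y] * var_2[y];
}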
I suspect further performance gains are entirely academic. It seems to me that this algorithm will very quickly overflow (if used on integers), or lose precision (if used on floats), so the maximum number of iterations has to be small.
I have the following piece of C code:
double findIntraClustSimFullCoverage(cluster * pCluster)
{
    double sum = 0;
    register int i = 0, j = 0;
    double perElemSimilarity = 0;

    for (i = 0; i < 10000; i++)
    {
        perElemSimilarity = 0;
        for (j = 0; j < 10000; j++)
        {
            perElemSimilarity += arr[i][j];
        }
        perElemSimilarity /= pCluster->size;
        sum += perElemSimilarity;
    }
    return (sum / pCluster->size);
}
NOTE: arr is a matrix of size 10000 X 10000
This is a portion of a GA code, so this nested for loop runs many times.
It dominates the performance of the code, i.e. it takes a very long time to produce results.
I profiled the code using valgrind / kcachegrind.
This indicated that 70 % of the process execution time was spent in running this nested for loop.
The register variables i and j do not seem to actually be kept in registers (profiling with and without the "register" keyword indicated this).
I simply cannot find a way to optimize this nested for loop (it is very simple and straightforward).
Please help me in optimizing this portion of code.
I'm assuming that you change the arr matrix frequently, else you could just compute the sum (see Lucian's answer) once and remember it.
You can use a similar approach when you modify the matrix. Instead of completely re-computing the sum after the matrix has (likely) been changed, you can store a 'sum' value somewhere, and have every piece of code that updates the matrix update the stored sum appropriately. For instance, assuming you start with an array of all zeros:
double arr[10000][10000];
/* initialize it to all zeros */
double sum = 0;

// you want to set arr[27][53] to 82853
sum -= arr[27][53];
arr[27][53] = 82853;
sum += arr[27][53];

// you want to set arr[27][53] to 473
sum -= arr[27][53];
arr[27][53] = 473;
sum += arr[27][53];
You might want to completely re-calculate the sum from time to time to avoid accumulation of errors.
If you're sure that you have no option for algorithmic optimization, you'll have to rely on very low level optimizations to speed up your code. These are very platform/compiler specific so your mileage may vary.
It is probable that, at some point, the bottleneck of the operation is pulling the values of arr from memory. So make sure that your data is laid out in a linear, cache-friendly way: that is, consecutive elements arr[i][j] and arr[i][j+1] should sit next to each other in memory, sizeof(double) bytes apart.
You may also try to unroll your inner loop, in case your compiler does not already do it. Your code:
for (j = 0; j < 10000; j++)
{
    perElemSimilarity += arr[i][j];
}
would, for example, become:
for (j = 0; j < 10000; j += 10)
{
    perElemSimilarity += arr[i][j+0];
    perElemSimilarity += arr[i][j+1];
    perElemSimilarity += arr[i][j+2];
    perElemSimilarity += arr[i][j+3];
    perElemSimilarity += arr[i][j+4];
    perElemSimilarity += arr[i][j+5];
    perElemSimilarity += arr[i][j+6];
    perElemSimilarity += arr[i][j+7];
    perElemSimilarity += arr[i][j+8];
    perElemSimilarity += arr[i][j+9];
}
These are the basic ideas; it is difficult to say more without knowing your platform and compiler, or looking at the generated assembly code.
You might want to take a look at this presentation for more complete examples of optimization opportunities.
If you need even more performance, you could take a look at SIMD intrinsics for your platform, or try to use, say, OpenMP to distribute your computation across multiple threads.
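As an illustration of the SIMD route, here is a sketch of the row sum using SSE2 intrinsics (x86-specific; row_sum_sse2 is a made-up helper name, each row is assumed contiguous, and 10000 being even means there is no scalar tail to handle):

#include <emmintrin.h>   /* SSE2 */

/* sums one 10000-element row two doubles at a time */
static double row_sum_sse2(const double *row)
{
    __m128d acc = _mm_setzero_pd();
    for (int j = 0; j < 10000; j += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(&row[j]));

    double lanes[2];
    _mm_storeu_pd(lanes, acc);
    return lanes[0] + lanes[1];
}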
Another step would be to try OpenMP, with something along the following lines (untested):
#pragma omp parallel for private(j, perElemSimilarity) reduction(+:sum)
for (i = 0; i < 10000; i++)
{
    perElemSimilarity = 0;
    for (j = 0; j < 10000; j++)   /* j must be private too, hence the clause above */
    {
        perElemSimilarity += arr[i][j];
    }
    perElemSimilarity /= pCluster->size;
    sum += perElemSimilarity;
}
But note that even if you brought this portion of code down to 0% of your execution time (which is impossible), your GA would still take hours to run. Your performance bottleneck is elsewhere, now that this portion of code takes 'only' 22% of your running time.
I might be wrong here, but isn't the following equivalent:
for (i = 0; i < 10000; i++)
{
    for (j = 0; j < 10000; j++)
    {
        sum += arr[i][j];
    }
}
return (sum / (pCluster->size * pCluster->size));
The register keyword is an optimizer hint; if the optimizer doesn't think a register is well spent there, it won't use one.
Is the matrix well packed, i.e. is it a contiguous block of memory?
Is 'j' the minor index (i.e. are you going from one element to the next in memory), or are you jumping from one element to the one a whole row (10,000 elements) away?
Is arr fairly static? Is this called more than once on the same arr? The result of the inner loop only depends on the row/column that j traverses, so calculating it lazily and storing it for future reference will make a big difference
The way this problem is stated, there isn't much you can do. You are processing 10,000 x 10,000 double input values; that's 800 MB. Whatever you do is limited by the time it takes to read 800 MB of data.
On the other hand, are you also writing all 10,000 x 10,000 values each time this is called? If not, you could for example store the sum of each row together with a boolean per row indicating that its sum needs to be recalculated, set whenever you change an element in that row. Or you could even update the row's sum in place each time an array element is changed. A sketch of that bookkeeping follows.
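Here is a sketch of that approach, with made-up names (row_sum, row_dirty and set_elem are illustrative, not part of the original code; arr is assumed to be the global 10000 x 10000 matrix from the question):

double row_sum[10000];   /* cached sum of each row */
int    row_dirty[10000]; /* nonzero if row_sum[i] is stale */

/* call this instead of writing arr[i][j] directly */
void set_elem(int i, int j, double v)
{
    arr[i][j] = v;
    row_dirty[i] = 1;    /* cheap flag; alternatively update row_sum[i] in place */
}

double total_sum(void)
{
    double sum = 0;
    for (int i = 0; i < 10000; i++) {
        if (row_dirty[i]) {          /* recompute only stale rows */
            double s = 0;
            for (int j = 0; j < 10000; j++)
                s += arr[i][j];
            row_sum[i] = s;
            row_dirty[i] = 0;
        }
        sum += row_sum[i];
    }
    return sum;
}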