Currently I'm learning about parallel programming. I have the following loop that needs to be parallelized.
for(i=0; i<n/2; i++)
    a[i] = a[i+1] + a[2*i];
If I run this sequentially there is no problem, but if I want to run it in parallel, a data recurrence occurs. To avoid this I want to store the information to 'read' in a separate variable, e.g. b.
So then the code would be:
b = a;
#pragma omp parallel for private(i)
for(i=0; i<n/2; i++)
a[i] = b[i+1] + b[2*i];
But here comes the part where I begin to doubt. Probably the variable b will point to the same memory location as a, so the second code block will do exactly what the first code block does, including the recurrence I'm trying to avoid.
I tried something with *restrict {variable}. Unfortunately I can't really find the right documentation.
My question:
Do I avoid data recurrence by writing the code as follows?
int *restrict b;
int *restrict a;
b = a;
#pragma omp parallel for private(i)
for(i=0; i<n/2; i++)
a[i] = b[i+1] + b[2*i];
If not, what is a correct way to achieve this goal?
Thanks,
Ter
In your proposed code:
int *restrict b;
int *restrict a;
b = a;
the assignment of a to b violates the restrict requirement. That requires that a and b do not point to the same memory, yet they clearly do point to the same memory.
It is not safe.
You'd have to make a separately allocated copy of the array to be safe. You could do that with:
int *b = malloc(n * sizeof(*b));
…error check…;
memmove(b, a, n * sizeof(*b));
…revised loop using a and b…
free(b);
I always use memmove() because it is always correct, dealing with overlapping copies. In this case, it would be legitimate to use memcpy() because the space allocated for b will be separate from the space for a. The system would be broken if the newly allocated space for b overlapped with a at all, assuming the pointer a is valid. If there were an overlap, the trouble would be that a had already been allocated and freed, so a is a dangling pointer to released memory (and should not be used at all), and b was coincidentally allocated where the old a previously was. On the whole, it's not a problem worth worrying about. (Using memmove() doesn't help if a is a dangling pointer, but it is always safe when given valid pointers, even if the areas of memory overlap.)
Related
The following code is from pg. 93 of Parallel and High Performance Computing and is a single contiguous memory allocation for a 2D array:
double **malloc_2D(int nrows, int ncols) {
double **x = (double **)malloc(
nrows*sizeof(double*)
+ nrows*ncols*sizeof(double)); // L1
x[0] = (double *)x + nrows; // L2
for (int j = 1; j < nrows; j++) { // L3
x[j] = x[j-1] + ncols;
}
return x;
}
The book states that this improves memory allocation and cache efficiency. Is there any reason w.r.t efficiency to prefer the first code to something like the below code? It seems like the below code is more readable, and it's also easily usable with MPI (I only mention this because the book also covers MPI later).
double *malloc_2D(int nrows, int ncols) {
double *M = (double *)malloc(nrows * ncols * sizeof(double));
return M;
}
I include the below image to make sure that my mental model of the first code is correct. If it is not, please mention that in the answer. The image is the result of calling the first function to create a 5 x 2 matrix. Note that I just write the indices in the boxes in the below image for clarity, of course the values stored at these memory locations will not be 0 through 14. Also note that L# refers to lines in the first code.
The book states that this improves memory allocation and cache efficiency.
The book’s code improves efficiency relative to a too-often seen method of allocating pointers separately, as in:
double **x = malloc(nrows * sizeof *x);
for (size_t i = 0; i < nrows; ++i)
x[i] = malloc(ncols * sizeof *x[i]);
(Note that all methods should test the malloc result and handle allocation failures. This is elided for the discussion here.)
That method allocates each row separately (from other rows and from the pointers). The book’s method has the benefit that only one allocation is done and that the memory for the array is contiguous. Also, the relationships between elements in different rows are known, and that may allow programmers to take advantage of those relationships in designing algorithms that work well with cache and memory access.
Is there any reason w.r.t efficiency to prefer the first code to something like the below code?
Not for efficiency, no. Both the book’s method and the method above have the disadvantage that they generally require a pointer lookup for every array access (aside from the base pointer, x). Before the processor can get an element from the memory of a row, it has to get the address of the row from memory.
With the method you show, this additional lookup is unnecessary. Further, the processor and/or the compiler may be able to predict some things about the accesses. For example, with your method, the compiler may be able to see that M[(i+1)*ncols + j] is a different element from M[(i+2)*ncols + j], whereas with x[i+1][j] and x[i+2][j], it generally cannot know the two pointers x[i+1] and x[i+2] are different.
The book’s code is also defective. The number of bytes it allocates is nrows*sizeof(double*) + nrows*ncols*sizeof(double). Let’s say r is nrows, c is ncols, p is sizeof(double*) and d is sizeof(double). Then the code allocates rp + rcd bytes. Then the code sets x[0] to (double *)x + nrows. Because it casts to double *, the addition of nrows is done in units of the pointed-to type, double. So this adds rd bytes to the starting address. And, after that, it expects to have all the elements of the array, which is rcd bytes. So the code is using rd + rcd bytes even though it allocated rp + rcd. If p > d, some elements at the end of the array will be outside of the allocated memory. In current ordinary C implementations, the size of double * is less than or equal to the size of double, but this should not be relied on. Instead of setting x[0] to (double *)x + nrows, it should compute x plus the size of nrows elements of type double * plus enough padding to reach the alignment requirement of double, and it should include that padding in the allocation.
If we cannot use variable length arrays, then the array indexing can be provided by a macro, as by defining a macro that replaces x(i, j) with x[i*ncols+j], such as #define x(i, j) x[(i)*ncols + (j)].
I don't know how OpenMP works, but I presume calling a function with restricted pointer arguments inside a parallel for loop doesn't work if the objects could be shared by multiple threads? Take the following example of serial code meant to perform a weighted sum across matrix columns:
const int n = 10;
const double x[n][n] = {...}; // matrix, containing some numbers
const double w[n] = {...}; // weights, containing some numbers
// my weighted sum function
double mywsum(const double *restrict px, const double *restrict pw, const int n) {
double tmp = 0.0;
for(int i = 0; i < n; ++i) tmp += px[i] * pw[i];
return tmp;
}
double res[n]; // results vector
const double *pw = &w[0]; // creating pointer to w
// loop doing column-wise weighted sum
for(int j = 0; j < n; ++j) {
res[j] = mywsum(&x[0][j], pw, n);
}
Now I want to parallelize this loop using OpenMP, e.g.:
#pragma omp parallel for
for(int j = 0; j < n; ++j) {
res[j] = mywsum(&x[0][j], pw, n);
}
I believe the *restrict px could still be valid as the particular elements pointed to can only be accessed by one thread at a time, but the *restrict pw should cause problems as the elements of w are accessed concurrently by multiple threads, so the restrict clause should be removed here?
I presume calling a function with restricted pointer arguments inside a parallel for loop doesn't work if the objects could be shared by multiple threads?
The restrict keyword is totally independent of using multiple threads. It tells the compiler that the pointer targets an object that is not aliased, that is, not referenced by any other pointer in the function. It is meant to avoid aliasing in C. The fact that other threads can call the function is not a problem. In fact, if threads write to the same location, you have a much bigger problem: a race condition. If multiple threads read the same location, this is not a problem (with or without the restrict keyword). The compiler basically does not care about multi-threading when the function mywsum is compiled. It can ignore the effect of other threads since there are no locks, atomic operations or memory barriers.
I believe the *restrict px could still be valid as the particular elements pointed to can only be accessed by one thread at a time, but the *restrict pw should cause problems as the elements of w are accessed concurrently by multiple threads, so the restrict clause should be removed here?
It should be removed because it is not useful, not because it causes any issue.
The use of the restrict keyword is not very useful here since the compiler can easily see that there is no possible overlapping. Indeed, the only store done in the loop is the one to tmp, which is a local variable, and the input arguments cannot point to tmp because it is a local variable. In fact, compilers will keep tmp in a register if optimizations are enabled (so it does not even have an address in practice).
One should keep in mind that restrict is bound to the function scope where it is defined (i.e. in the function mywsum). Thus, inlining or the use of the function in a multithreaded context has no impact on the result with respect to the restrict keyword.
I think &x[0][j] is wrong because the loop of the function iterates over n items and the pointer starts at the j-th item. This means the loop accesses the item x[0][j+n-1], theoretically causing out-of-bounds accesses. In practice you will observe no error because 2D C arrays are flattened in memory and &x[0][n] should be equal to &x[1][0] in your case. But the result will certainly not be what you want.
I am trying to swap two arrays in C by pointing from array B to array A and then free up A so I am only left with array B with the contents of array A. Is this possible?
Code:
int *a = malloc(4*sizeof(int));
a[0] = 1;
a[1] = 2;
a[2] = 3;
a[3] = 4;
int *b = malloc(4*sizeof(int));
//Assign pointer from b to memory position of a so b[] = a[]?
Thank you in advance,
Wouter
#include <string.h>
#define ARRAY_LENGTH 4
// insert your code above here.. and as the other answer says you should
// always check malloc's return value to make sure it succeeded before
// continuing or you will seg fault. If it fails, you should gracefully
// handle the error
memcpy(b, a, sizeof(int) * ARRAY_LENGTH); // copies sizeof(int)*ARRAY_LENGTH bytes
// from the memory space pointed to by a
// to the memory space pointed to by b
free(a); // returns the memory pointed to by a to the heap
More information about memcpy can be found here. It's a highly optimized function for copying memory. If you only have 4 values, I doubt you'll see much performance difference between your own loop, memcpy, or simply manually assigning (unrolling the loop) each value, unless you're running this many thousands or millions of times.
And just a side note, as a general rule of thumb, you want to use malloc as sparingly as possible. The only times you should use it are if you don't know how much memory you'll need until runtime, or if you want the scope of the memory to persist outside of the function. Incorrectly managing memory is the source of many, many bugs that can be difficult to track down in large programs since they don't always manifest themselves at the same time in the same place the same way. Here, you don't show enough code for me to know exactly what you're doing. But you do know the size of the array ahead of time (4 ints), so unless you need these arrays outside of the function, just go ahead and put them in localized scope (on the stack in most systems):
int a[4];
int b[4];
// perform assignment to a
// perform copy to b
// do work on data
// ....
// now when the function ends, these arrays will automatically get destroyed, saving you the trouble
I'll just take you at your word that you have a good reason for copying the a array, as that's not evident from your code.
Finally, this was a dupe and neither of us should've answered it :)
How to copy one integer array to another
You can use a loop:
for (size_t i = 0; i < 4; i++)
b[i] = a[i];
Note: always check the return value of malloc() for failure.
If you create an array on the heap, with malloc(), you get a pointer which is like any other variable in that it may be assigned or copied. However there is only one memory buffer.
so
int *a = malloc(n * sizeof(int));
int *b = a;
means both a and b point to the same region of memory. That's called pointer aliasing, and it causes deep problems if you don't keep tight control of what you are doing. Generally if a points to the buffer, we should use a to identify it. A copy isn't useful to us.
If we have, as in your example
int *a = malloc(n * sizeof(int));
int *b = malloc(n * sizeof(int));
we have two buffers. So assigning b = a would make b point to a's buffer, and orphan the buffer that b is pointing to. Again, not what we want.
However if we do this
int *a = malloc(n * sizeof(int));
int *b = malloc(n * sizeof(int));
int *temp;
temp = a;
a = b;
b = temp;
We have swapped a and b. b now points to the buffer a pointed to previously, a now points to the buffer b pointed to previously.
That's fine. And occasionally it's sensible to do that.
With current C compilers, is it still true that using the array syntax (a[i]) is slower than using pointers (*(p+i))?
They are exactly equivalent. Array access is syntactic sugar for pointer math.
They should be the same. But:
for( i = 0; i < ...; ++ i ) ... array[i] ...
could be slower than:
for( p = array; *p; ++ p ) ... *p ...
because in the former case the compiler may need to compute *(array + i), while in the latter you just dereference (*p).
In trivial cases, however, the compiler should be able to optimize and generate the same machine code.
No; according to the C and C++ standards a[i] is by definition equivalent to *(a+i). This also implies that a[1] is equivalent to 1[a]. Try it :)
Hell No! a[i] is always equivalent to *(a+i).
a[i] = *(a + i) = *(i + a) = i[a];
On x86 or x86_64, using p[i] or *(p+i) (note: these two forms are identical in the C language) may be faster than incrementing p at each step of the loop, assuming you need to keep i around for some other purpose already. The x86 architecture has efficient addressing for base/offset usage like this, including scaling the offset by small powers of 2.
On the other hand, if you can eliminate i by incrementing p, you may reduce the number of registers needed in your loop, which could allow further optimization. Here are some quick thoughts on the relative cost in various cases on x86:
If your array a has static storage duration (static or extern global) in non-PIC-compiled code, the a[i] and *p (with no i) methods use the same number of registers (e.g. 0xdeadbeef(,%ecx,4) vs (%esi)).
If your array a is automatic (local stack variable) and not a variable-length array, the a[i] and *p methods use the same number of registers (e.g. 12(%esp,%ecx,4) vs (%esi), where %esp, the stack pointer, is already reserved anyway).
If your array a has static storage duration and your code is compiled as PIC, a[i] is probably significantly more expensive than the *p method, even if you have to keep i around for another purpose anyway.
If your array a is not an array variable in the current scope, but a pointer to an array passed to your function from somewhere else, then a[i] takes one more register than *p (e.g. (%eax,%ecx,4) vs (%esi)).
There may be some circumstances where something like
while (*s++ = *d++)
;
might be faster than
while (s[i] = d[i])
i++;
but even that will probably be optimized away by good compilers.
The compiler will translate them to pointer code anyway, it just makes it easier to use them.
Think of array operator as an inline function for the pointer equivalent.
They are equivalent. But algorithms that use the array syntax are typically written
int a[];
for (int n = 0; n < size; n++) { ... Do stuff with a[n]; }
which needs one more addition for each access to an element of a than
int a[size];
int *end = a + size;
for (int *i = a; i != end; ++i) { ... Do the same stuff with *i ; }
Some compilers might optimize the first version into the second though.
It's obvious that's not what he means, this should help.
// which is faster? *t or t[i]?
process_some_array(T *t, size_t n)
{
T *end = &t[n];
for (T *t = t; t < end; ++t)
// do stuff with *t
for (size_t i = 0; i < n; ++i)
// do stuff with t[i]
}
The answer is no, as long as your compiler does any optimization at all. For the general case you should not worry about the difference, as it is minuscule if present at all. Furthermore, it's generally easier for yourself and your fellows to parse and debug.
If this is not what you meant, I'm afraid it's a stupid question, see one of the other answers.
Pointers and arrays are both accessed through an address. An array behaves like a constant pointer, and a pointer holds the address of a value, so both access the value through an address. On that account they are exactly equivalent.
I am playing around with multidimensional array of unequal second dimension size.
Lets assume that I need the following data structure:
[&ptr0]->[0][1][2][3][4][5][6][7][8][9]
[&ptr1]->[0][1][2]
[&ptr2]->[0][1][2][3][4]
int main()
{
int *a[3];
int *b;
int i;
a[0] = (int *)malloc(10 * sizeof(int));
a[1] = (int *)malloc(2 * sizeof(int));
a[2] = (int *)malloc(4 * sizeof(int));
for(i=0; i<10; i++) a[0][i]=i;
for(i=0; i<2; i++) a[1][i]=i;
for(i=0; i<4; i++) a[2][i]=i;
}
I did some tests and it seems like I can store a value at a[1][3]. Does it mean that rows in my array are of equal size 10?
No. The address a[1][3] does not "officially" exist. It is memory that is not defined in your program, and accessing it results in undefined behavior.
It can lead to the following errors:
Segmentation fault (accessing restricted memory)
Overwriting memory already in use by another variable (another allocation)
Reading an uninitialized value (an unused memory address)
It is undefined behavior of your code. You are accessing something that you don't own. It may work, it may not, but it is always wrong.
No
There is lots of memory in your program used for I/O buffers, library data structures, the malloc system itself, command line arguments and environment, etc. (Some of those are on the stack.)
Yes, you can clobber things out of range.
Keep in mind that x[i] is the same thing as *(x + i). So, it's easy to calculate the address you referenced. It may overlay one of your data structures, it may overlay a part of your data structure that is a private field within malloc's mechanism, or it may overlay library data.