restrict pointers as function arguments in OpenMP for loops?

I don't know how OpenMP works, but I presume calling a function with restricted pointer arguments inside a parallel for loop doesn't work if the objects could be shared by multiple threads? Take the following example of serial code meant to perform a weighted sum across matrix columns:
const int n = 10;
const double x[n][n] = {...}; // matrix, containing some numbers
const double w[n] = {...};    // weights, containing some numbers

// my weighted sum function
double mywsum(const double *restrict px, const double *restrict pw, const int n) {
    double tmp = 0.0;
    for (int i = 0; i < n; ++i) tmp += px[i] * pw[i];
    return tmp;
}

double res[n];            // results vector
const double *pw = &w[0]; // creating pointer to w

// loop doing column-wise weighted sum
for (int j = 0; j < n; ++j) {
    res[j] = mywsum(&x[0][j], pw, n);
}
Now I want to parallelize this loop using OpenMP, e.g.:
#pragma omp parallel for
for(int j = 0; j < n; ++j) {
res[j] = mywsum(&x[0][j], pw, n);
}
I believe the *restrict px could still be valid as the particular elements pointed to can only be accessed by one thread at a time, but the *restrict pw should cause problems as the elements of w are accessed concurrently by multiple threads, so the restrict clause should be removed here?

I presume calling a function with restricted pointer arguments inside a parallel for loop doesn't work if the objects could be shared by multiple threads?
The restrict keyword is totally independent of using multiple threads. It tells the compiler that the pointer targets an object that is not aliased, that is, not referenced through any other pointer in the function. It is meant to avoid aliasing issues in C. The fact that other threads can call the function is not a problem. In fact, if threads write to the same location, you have a much bigger problem: a race condition. If multiple threads only read from the same location, this is not a problem (with or without the restrict keyword). The compiler basically does not care about multi-threading when the function mywsum is compiled. It can ignore the effect of other threads since there are no locks, atomic operations or memory barriers.
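For illustration, here is a minimal sketch (not from the original answer; compile with something like -fopenmp): several threads reading the same array through a restrict-qualified pointer is perfectly fine, because restrict only constrains aliasing within a single call.
#include <stdio.h>

static double sum(const double *restrict p, int n) {
    double s = 0.0;
    for (int i = 0; i < n; ++i)
        s += p[i];                 // reads only; no aliasing within this call
    return s;
}

int main(void) {
    double w[4] = {1.0, 2.0, 3.0, 4.0};
    double total[2];
    #pragma omp parallel for       // both iterations read the same w: no problem
    for (int t = 0; t < 2; ++t)
        total[t] = sum(w, 4);
    printf("%f %f\n", total[0], total[1]);
    return 0;
}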
I believe the *restrict px could still be valid as the particular elements pointed to can only be accessed by one thread at a time, but the *restrict pw should cause problems as the elements of w are accessed concurrently by multiple threads, so the restrict clause should be removed here?
It should be removed because it is not useful here, not because it causes any issue.
The use of the restrict keyword is not very useful here since the compiler can easily see that there is no possible overlap. Indeed, the only store done in the loop is the one to tmp, which is a local variable, and the input arguments cannot point to tmp because it is a local variable. In fact, compilers will keep tmp in a register if optimizations are enabled (so it does not even have an address in practice).
One should keep in mind that restrict is bound to the function scope where it is defined (i.e. in the function mywsum). Thus, inlining or the use of the function in a multithreaded context has no impact on the result with respect to the restrict keyword.
I think &x[0][j] is wrong, because the loop in the function iterates over n items and the pointer starts at the j-th item. This means the loop accesses the item x[0][j+n-1], theoretically causing out-of-bounds accesses. In practice you will observe no error because a 2D C array is flattened in memory, and &x[0][n] is equal to &x[1][0] in your case, but the result will certainly not be what you want.
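If the intent really is a column-wise weighted sum, one option is to pass the row length as an explicit stride instead of walking contiguous memory from &x[0][j]. A minimal sketch (mywsum_col and the ld parameter are my own names, not from the question):
/* Column-wise weighted sum, treating the matrix as flat row-major storage. */
double mywsum_col(const double *restrict px, const double *restrict pw,
                  const int n, const int ld) {
    double tmp = 0.0;
    for (int i = 0; i < n; ++i)
        tmp += px[i * ld] * pw[i];  /* px[i * ld] is x[i][j] when px = &x[0][j], ld = n */
    return tmp;
}

/* usage inside the loop: res[j] = mywsum_col(&x[0][j], pw, n, n); */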

Related

If I declare a variable inside a for loop in C, will it be created multiple times or not?

#include <stdio.h>

int main()
{
    for (int i = 0; i < 100; i++)
    {
        int count = 0;
        printf("%d ", ++count);
    }
    return 0;
}
output of the above program is: 1 1 1 1 1 1..........1
Please take a look at the code above. I declared variable "int count=0" inside the for loop.
As far as I know, the scope of the variable is the block, so the count variable will be alive until the for loop finishes executing.
Since "int count=0" executes 100 times, it should either create the variable 100 times or give an error (re-declaration of the count variable), but neither is happening — what may be the reason?
According to the output, the variable is initialized to zero every time.
Please help me to find the reason.
Such simple code can be visualised on http://www.pythontutor.com/c.html for easy understanding.
To answer your question: count gets destroyed when it goes out of scope, that is, at the closing } of the loop. On the next iteration, a variable of the same name is created and initialised to 0, and that is what the printf uses.
And if counting is your goal, print i instead of count.
The C standard describes the C language using an abstract model of a computer. In this model, count is created each time the body of the loop is executed, and it is destroyed when execution of the body ends. By “created” and “destroyed,” we mean that memory is reserved for it and is released, and that the initialization is performed with the reservation.
The C standard does not require compilers to implement this model slavishly. Most compilers will allocate a fixed amount of stack space when the routine starts, with space for count included in this fixed amount, and then count will use that same space in each iteration. Then, if we look at the assembly code generated, we will not see any reservation or release of memory; the stack will be grown and shrunk only once for the whole routine, not grown and shrunk in each loop iteration.
Thus, the answer is twofold:
In C’s abstract model of computing, a new lifetime of count begins and ends in each loop iteration.
In most actual implementations, memory is reserved just once for count, although implementations may also allocate and release memory in each iteration.
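A small experiment (an illustration added here, not part of the original answer) makes the second point visible: with common compilers, count typically occupies the same stack slot on every pass, so printing its address shows the same value each iteration even though a new lifetime begins each time in the abstract machine.
#include <stdio.h>

int main(void)
{
    for (int i = 0; i < 3; i++)
    {
        int count = 0;
        /* ++count prints 1 every time; &count is typically identical each pass */
        printf("iteration %d: count=%d at %p\n", i, ++count, (void *)&count);
    }
    return 0;
}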
However, even if you know your C implementation allocates stack space just once per routine when it can, you should generally think about programs in the C model in this regard. Consider this:
for (int i = 0; i < 100; ++i)
{
int count = 0;
// Do some things with count.
float x = 0;
// Do some things with x.
}
In this code, the compiler might allocate four bytes of stack space to use for both count and x, to be used for one of them at a time. The routine would grow the stack once, when it starts, including four bytes to use for count and x. In each iteration of the loop, it would use the memory first for count and then for x. This lets us see that the memory is first reserved for count, then released, then reserved for x, then released, and then that repeats in each iteration. The reservations and releases occur conceptually even though there are no instructions to grow and shrink the stack.
Another illuminating example is:
for (int i = 0; i < 100; ++i)
{
extern int baz(void);
int a[baz()], b[baz()];
extern void bar(void *, void *);
bar(a, b);
}
In this case, the compiler cannot reserve memory for a and b when the routine starts because it does not know how much memory it will need. In each iteration, it must call baz to find how much memory is needed for a and how much for b, and then it must allocate stack space (or other memory) for them. Further, since the sizes may vary from iteration to iteration, it is not possible for both a and b to start in the same place in each iteration—one of them must move to make way for the other. So this code lets us see that a new a and a new b must be created in each iteration.
int count=0 is executing 100 times, then it has to create the variable 100 times
No, it defines the variable count once, then assigns it the value 0 100 times.
Defining a variable in C does not involve any particular step or code to "create" it (unlike for example in C++, where simply defining a variable may default-construct it). Variable definitions just associate the name with an "entity" that represents the variable internally, and definitions are tied to the scope where they appear.
Assigning a variable is a statement which gets executed during the normal program flow. It usually has "observable effects", otherwise the compiler is allowed to optimize it out entirely.
OP's example can be rewritten in a completely equivalent form as follows.
for(int i=0;i<100;i++)
{
int count; // definition of variable count - defined once in this {} scope
count=0; // assignment of value 0 to count - executed once per iteration, 100 times total
printf("%d ",++count);
}
Eric has it correct. In much shorter form:
Typically compilers determine at compile time how much memory is needed by a function and the offsets in the stack to those variables. The actual memory allocations occur on each function call and memory release on the function return.
Further, when you have variables nested within {curly braces}, once execution leaves that brace set the compiler is free to reuse that memory for other variables in the function. There are two reasons I intentionally do this:
The variables are large but only needed for a short time, so why make stacks larger than needed? Especially if you need several large temporary structures or arrays at different times. The smaller the scope, the less chance of bugs.
If a variable only has a sane value for a limited amount of time, and would be dangerous or buggy to use out of that scope, add extra curly braces to limit the scope of access so improper use generates immediate compiler errors. Using unique names for each variable, even if the compiler doesn't insist on it, can help keep the debugger, and your mind, less confused.
Example:
void your_function(int a)
{
    { // limit scope of stack_1
        int stack_1 = 0;
        for ( int ii = 0; ii < a; ++ii ) { // really limit scope of ii
            stack_1 += some_calculation(ii, a);
        }
        printf("ii=%d\n", ii);           // scope error: ii no longer exists here
        printf("stack_1=%d\n", stack_1); // good
    } // done with stack_1
    {
        int limited_scope_1[10000];
        do_something(a, limited_scope_1);
    }
    {
        float limited_scope_2[10000];
        do_something_else(a, limited_scope_2);
    }
}
A compiler given code like:
void do_something(int, const int *);
...
for (int i = 0; i < 100; i++)
{
    int const j = (i & 1);
    do_something(i, &j);
}
could legitimately replace it with:
void do_something(int, const int *);
...
int const __compiler_generated_0 = 0;
int const __compiler_generated_1 = 1;
for (int i = 0; i < 100; i += 2)
{
    do_something(i, &__compiler_generated_0);
    do_something(i + 1, &__compiler_generated_1);
}
Although a compiler would typically allocate space on the stack once for j, when the function was entered, and then not reuse the storage during the loop (or even the function), meaning that j would have the same address on every iteration of the loop, there is no requirement that the address remain constant. While there typically wouldn't be an advantage to having the address vary between iterations, compilers are allowed to exploit such situations should they arise.

Why it is significantly slower to reference global variables in a for loop?

I was reading a book which says to use local variables to eliminate unnecessary memory references. For example, the code below is not very efficient:
int gsum; // global sum variable

void foo(int num) {
    for (int i = 0; i < num; i++) {
        gsum += i;
    }
}
It is more efficient to have the code below:
void foo(int num) {
    int fsum = 0;
    for (int i = 0; i < num; i++) {
        fsum += i;
    }
    gsum = fsum;
}
I know the second case uses a local variable which is stored in a register. That's why it is a little bit faster while, in the first case, gsum has to be retrieved from main memory too many times.
But I still have questions:
Q1- Isn't the gcc compiler smart enough to detect this and implicitly use a register to store the global variable, so that subsequent references use the register exactly as in the second case?
Q2- If, for some reason, the compiler is not able to optimize, then we still have the cache. Referencing a global variable from the cache is still very fast, but I see that some programs which use local variables are 10 times faster than ones that reference global variables. Why is this?
Q1: That register is probably going to be needed by other functions which will be called between subsequent calls to foo. That means gsum will need to be shuttled in and out of the register whenever this function is called.
Q2: It's possible that the page containing gsum will stay in cache for a while. However, depending on what else your computer is doing, that page may get written to swap space in order to make room in memory for other pages.
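As a sketch of why the compiler often cannot do the caching for you (my own illustration; bar() is a hypothetical external function): if the loop body contains a call the compiler cannot see into, it must assume the callee may read or write gsum, so the global has to be kept up to date in memory across every call. The manual rewrite with fsum is legal precisely because you, the programmer, know nothing else needs to observe gsum mid-loop.
int gsum;                  /* global, potentially visible to other translation units */
void bar(void);            /* hypothetical: defined elsewhere, opaque to the compiler */

void foo(int num) {
    for (int i = 0; i < num; i++) {
        gsum += i;         /* must be written back to memory here ... */
        bar();             /* ... because bar() might legally read or modify gsum */
    }
}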

Is this the correct use of 'restrict' in C?

Currently I'm learning about parallel programming. I have the following loop that needs to be parallelized.
for(i=0; i<n/2; i++)
    a[i] = a[i+1] + a[2*i];
If I run this sequentially there is no problem, but if I want to run this in parallel, a data recurrence occurs. To avoid this I want to store the information to 'read' in a separate variable, e.g. b.
So then the code would be:
b = a;
#pragma omp parallel for private(i)
for(i=0; i<n/2; i++)
a[i] = b[i+1] + b[2*i];
But here comes the part where I begin to doubt. The variable b will probably point to the same memory location as a, so the second code block will do exactly the same as the first one, including the recurrence I'm trying to avoid.
I tried something with *restrict {variable}. Unfortunately I can't really find the right documentation.
My question:
Do I avoid data recurrence by writing the code as follows?
int *restrict b;
int *restrict a;
b = a;
#pragma omp parallel for private(i)
for(i=0; i<n/2; i++)
a[i] = b[i+1] + b[2*i];
If not, what is a correct way to achieve this goal?
Thanks,
Ter
In your proposed code:
int *restrict b;
int *restrict a;
b = a;
the assignment of a to b violates the restrict requirement. That requires that a and b do not point to the same memory, yet they clearly do point to the same memory.
It is not safe.
You'd have to make a separately allocated copy of the array to be safe. You could do that with:
int *b = malloc(n * sizeof(*b));
…error check…;
memmove(b, a, n * sizeof(*b));
…revised loop using a and b…
free(b);
I always use memmove() because it is always correct, dealing with overlapping copies. In this case, it would be legitimate to use memcpy() because the space allocated for b will be separate from the space for a. The system would be broken if the newly allocated space for b overlaps with a at all, assuming the pointer to a is valid. If there was an overlap, the trouble would be that a was allocated and freed — so a is a dangling pointer pointing to released memory (and should not be being used at all), and b was coincidentally allocated where the old a was previously allocated. On the whole, it's not a problem worth worrying about. (Using memmove() doesn't help if a is a dangling pointer, but it is always safe if given valid pointers, even if the areas of memory overlap.)
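Putting it together, a minimal sketch (the wrapper name halve_update is my own) of the loop from the question with a separately allocated copy; the restrict qualifiers are now truthful because a and b really don't overlap:
#include <stdlib.h>
#include <string.h>

void halve_update(int *restrict a, int n) {
    int *restrict b = malloc(n * sizeof(*b));
    if (b == NULL)
        return;                           /* error handling elided, as in the answer */
    memmove(b, a, n * sizeof(*b));

    #pragma omp parallel for
    for (int i = 0; i < n / 2; i++)
        a[i] = b[i + 1] + b[2 * i];       /* all reads come from the copy */

    free(b);
}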

Passing parameters to a function to efficiently create array allocated on the stack

I have a function that needs external parameters and afterwards creates variables that are heavily used inside that function. E.g. the code could look like this:
void abc(const int dim);

void abc(const int dim) {
    double arr[dim] = { 0.0 };
    for (int i = 0; i != dim; ++i)
        arr[i] = i;
    // heavy usage of the arr
}

int main() {
    const int par = 5;
    abc(par);
    return 0;
}
But I am getting a compiler error, because the allocation on the stack needs compile-time constants. When I tried allocating manually on the stack with _malloca, the time performance of the code worsened (compared to the case when I declare the constant par inside the abc() function). And I don't want the array arr to be on the heap, because it is supposed to contain only small amount of values and it is going to get used quite often inside the function. Is there some way to combine the efficiency while keeping the possibility to pass the size parameter of an array to the function?
EDIT: I am using MSVC compiler and I received an error C2131: expression did not evaluate to a constant in VC 2017.
If you're using a modern C compiler that implements all of C99, or C11 with the variable-length array extension, this will work, with one little modification:
void abc(const int dim);

void abc(const int dim) {
    double arr[dim];
    for (int i = 0; i != dim; ++i)
        arr[i] = i;
    // heavy usage of the arr
}

int main(void) {
    const int par = 5;
    abc(par);
    return 0;
}
I.e. double arr[dim] would work - it doesn't have a compile-time constant size, but it is enough to know its size at runtime. However, such a VLA cannot be initialized.
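If you still want the zero-fill from the original code, a small sketch (my addition, not part of the original answer) is to do it explicitly, e.g. with memset:
#include <string.h>

void abc(const int dim) {
    double arr[dim];                 /* VLA: size known only at run time */
    memset(arr, 0, sizeof arr);      /* all-bits-zero is 0.0 for IEEE doubles */
    for (int i = 0; i != dim; ++i)
        arr[i] = i;
    // heavy usage of the arr
}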
Unfortunately, MSVC is not a modern C compiler in this respect: Microsoft has chosen not to implement VLAs (and I even suspect they're a big part of why VLAs were made optional in C11). So you'd need to define the array in main and pass a pointer to it to the function abc, or, if the size is globally constant, use an actual compile-time constant, i.e. a #define.
However, you're not showing the actual code that you're having performance problems with. It might very well be that the compiler can produce optimized output if it knows the number of iterations - if that is true, then the "globally defined size" might be the only way to get excellent performance.
Unfortunately the Microsoft Compiler does not support variable length arrays.
If the array is not too large, you could allocate it at the largest size that could possibly be needed and pass a pointer to that stack array, together with the dimension actually in use, to the function. This approach could also help limit the number of allocations.
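A minimal sketch of that first option (the bound DIM_MAX and the exact split of responsibilities are my own assumptions): the caller owns a fixed-size stack array, sized for the worst case, and passes it along with the live size.
#define DIM_MAX 64

void abc(double *arr, int dim) {
    for (int i = 0; i != dim; ++i)
        arr[i] = i;
    // heavy usage of arr[0..dim-1]
}

int main(void) {
    double buf[DIM_MAX] = { 0.0 };   /* compile-time size, so MSVC accepts it */
    const int par = 5;
    abc(buf, par);
    return 0;
}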
Another option is to implement a simple heap-allocated global pool for functions of this type to use. The pool would allocate a large contiguous chunk on the heap, and then you can get a pointer to your reservation in the pool. The benefit of this approach is that you will not have to worry about over-allocation on the stack causing a segmentation fault (which can happen with variable-length arrays).

How can I make a parameter vector local for each Kernel in Cuda

How can I make a parameter vector be treated as a local variable in each instance of the kernel in CUDA?
__global__ void kern(char *text, int N) {
    // I want a change such as text[0] = '0'; to affect only the current
    // instance of the kernel (this thread) and not the other threads
}
Thanks!
Every thread will receive the same input parameters, so in this case char *text is the same in every thread - that's a fundamental part of the programming model. Since the pointer points to global memory, if one thread changes data through the pointer (i.e. modifies the global memory) then the change affects all threads (ignoring hazards).
This is exactly the same as standard C, except now you have multiple threads accessing through the pointer. In other words, if you modify text[0] inside a standard C function then the changes are visible outside the function.
If I understand correctly, you're asking for every thread to have a local copy of the contents of text. Well the solution is exactly the same as for standard C if you don't want changes visible outside the function:
__global__ void kern(char* text, int N) {
    // Option 1: if you have an upper bound NMAX for N, use a fixed-size local array
    char localtext[NMAX];

    // Option 2: if you don't know the range of N, allocate per-thread memory on the
    // device heap instead (and free it before the kernel returns):
    // char *localtext = (char *)malloc(N * sizeof(char));

    // Copy from text (global memory) to localtext (this thread's private copy).
    // Since each thread reads the same index i at the same time, the load is
    // broadcast from the L1 cache.
    for (int i = 0; i < N; i++)
        localtext[i] = text[i];

    // ... work on localtext without affecting other threads ...
}
Note that I'm assuming you have sm_20 or later. Also note that while using malloc in device code is possible, you will pay a performance price.
