C and OpenMP: pointer to shared read-only data slows down execution

This is my situation: I have an array of pointers that point to arrays of some data... Let's say:
Data** array = malloc(100 * sizeof(Data*));
for(i = 0; i < 100; i++) array[i] = malloc(20 * sizeof(Data));
Inside a parallel region, I perform some operations that use that data. For instance:
#pragma omp parallel num_threads(4) firstprivate(array)
{
    function(array[0], array[omp_get_thread_num()]);
}
The first parameter is read-only and is the same across all threads...
The problem is that if I use a different block of data as the first parameter, i.e. array[omp_get_thread_num()+1], each function call takes 1 second. But when I use the same block of data, array[0], each function call takes 4 seconds.
My theory is that there is no way to know whether array[0] will be changed by the function, so each thread asks for a copy and invalidates the copies that the other threads have, and that would explain the delay...
I tried to make a local copy of array[0] like this:
#pragma omp parallel num_threads(4) firstprivate(array)
{
    Data* tempData = malloc(20 * sizeof(Data));
    memcpy(tempData, array[0], 20 * sizeof(Data));
    function(tempData, array[omp_get_thread_num()]);
}
But I get the same result... It's as if the thread doesn't 'release' the Data block so that other threads can use it...
I have to note that the first parameter is not always array[0] so I can't use firstprivate(array[0]) in the pragma line...
Questions are:
Am I doing something wrong?
Is there a way to 'release' a shared block of memory so other threads
could use it?
It was very difficult to make myself understood, so if you need further information, please let me know!
Thanks in advance... Javier
EDIT: I can't change the function declaration because it comes inside a library! (ACML)

I think you are right in your analysis that the compiler has no way to know that the pointed-to array didn't change behind its back. Actually, it knows that the data might change, since thread 0 receives the same array[0] also as a modifiable argument.
So it has to reload the values too often. First, you should declare your function something like
void function(Data const*restrict A, Data*restrict B);
This tells the compiler, first, that the values in A can't be changed, and second, that neither pointer can alias the other (or any other pointer), so it knows that the values in the arrays will only be changed by the function itself.
For thread number 0 the assertion above wouldn't be true: the arrays A and B are actually the same. So you'd best copy array[0] to a common temparray before you enter the #pragma omp parallel, and pass that same temparray as the first argument to every thread:
Data const* tempData = memcpy(malloc(20 * sizeof(Data)), array[0], 20 * sizeof(Data));
#pragma omp parallel num_threads(4)
function(tempData, array[omp_get_thread_num()]);
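Put together, a minimal sketch of that suggestion, reusing the Data type, function, and 4-thread setup from the question (allocation checks omitted for brevity):
Data *tempData = malloc(20 * sizeof *tempData);   /* shared, read-only copy */
memcpy(tempData, array[0], 20 * sizeof *tempData);

#pragma omp parallel num_threads(4)
{
    /* every thread reads tempData; each writes only its own block */
    function(tempData, array[omp_get_thread_num()]);
}
free(tempData);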

I think you are wrong in your analysis. If the data is not changed, imho it will not be synchronized between cores. There are two probable reasons for the slowdown.
Core #0 gets function(array[0], array[0]). You said that the first parameter is read-only, but the second is not. So core #0 will change the data in array[0], and the CPU will have to synchronize this data between the cores all the time.
The second possible reason is the small size of your arrays (20 elements). Core #1 gets a pointer to a 20-element array, and core #2 gets a pointer to an array that probably sits right after the first one in memory, so there is a high probability that they lie on the same cache line. The CPU does not track changes to each particular element - if it sees that elements on the same cache line have changed, it will synchronize the cache between the cores (this is the classic false-sharing problem). The solution is to make each allocation bigger, so that each array starts on its own cache line - cache lines are typically 64 bytes, so you pad to the line size, not to the whole cache size.
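A minimal sketch of that padding idea, assuming a 64-byte cache line and C11's aligned_alloc (Data is the type from the question):
#define CACHE_LINE 64   /* assumed cache-line size in bytes; check your CPU */

/* round each block up to a whole number of cache lines so that writes
   by one thread never dirty a line that another thread is reading */
size_t block = ((20 * sizeof(Data) + CACHE_LINE - 1) / CACHE_LINE) * CACHE_LINE;
for (i = 0; i < 100; i++)
    array[i] = aligned_alloc(CACHE_LINE, block); /* line-aligned; still use only 20 elements */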
My guess is that you have both problems, #1 and #2, in your code.

Allocating matrix performances

I have two scenarios; in both I allocate 78*2*sizeof(int) bytes of memory and initialize them to 0.
Are there any differences regarding performance?
Scenario A:
int ** v = calloc(2 , sizeof(int*));
for (i=0; i<2; ++i)
{
v[i] = calloc(78, sizeof(int));
}
Scenario B:
int ** v = calloc(78 , sizeof(int*));
for (i=0; i<78; ++i)
{
v[i] = calloc(2, sizeof(int));
}
I supposed that in performance terms it's better to use calloc if an initialized array is needed; let me know if I'm wrong.
First, discussing optimization abstractly has some difficulties because compilers are becoming increasingly better at optimization. (For some reason, compiler developers will not stop improving them.) We do not always know what machine code given source code will produce, especially when we write source code today and expect it to be used for many years to come. Optimization may consolidate multiple steps into one or may omit unnecessary steps (such as clearing memory with calloc instead of malloc immediately before the memory is completely overwritten in a for loop). There is a growing difference between what source code nominally says (“Do these specific steps in this specific order”) and what it technically says in the language abstraction (“Compute the same results as this source code in some optimized fashion”).
However, we can generally figure that writing source code without unnecessary steps is at least as good as writing source code with unnecessary steps. With that in mind, let’s consider the nominal steps in your scenarios.
In Scenario A, we tell the computer:
Allocate 2 int *, clear them, and put their address in v.
Twice, allocate 78 int, clear them, and put their addresses in the preceding int *.
In Scenario B, we tell the computer:
Allocate 78 int *, clear them, and put their address in v.
78 times, allocate two int, clear them, and put their addresses in the preceding int *.
We can easily see two things:
Both of these scenarios clear the memory for the int * and immediately fill it with other data. That is wasteful; there is no need to set memory to zero before setting it to something else. Just set it to something else. Use malloc for this, not calloc. malloc takes just one parameter for the size instead of two that are multiplied, so replace calloc(2, sizeof (int *)) with malloc(2 * sizeof (int *)). (Also, to tie the allocation to the pointer being assigned, use int **v = malloc(2 * sizeof *v); instead of repeating the type separately. A sketch of Scenario A rewritten this way follows this list.)
At the step where Scenario B does 78 things, Scenario A does two things, but the code is otherwise very similar, so Scenario A has fewer steps. If both would serve some purpose, then A is likely preferable.
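Here is a minimal sketch of Scenario A with that change applied, keeping the zeroing only for the int rows, where it is actually wanted:
int **v = malloc(2 * sizeof *v);      /* pointers: no need to zero them */
for (int i = 0; i < 2; ++i)
    v[i] = calloc(78, sizeof *v[i]);  /* rows: calloc zeroes the ints */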
However, both scenarios allude to another issue. Presumably, the so-called array will be used later in the program, likely in a form like v[i][j]. Using this as a value means:
Fetch the pointer v.
Calculate i elements beyond that.
Fetch the pointer at that location.
Calculate j elements beyond that.
Fetch the int at that location.
Let’s consider a different way to define v: int (*v)[78] = malloc(2 * sizeof *v);.
This says:
Allocate space for 2 arrays of 78 int and put their address in v.
Immediately we see that involves fewer steps than Scenario A or Scenario B. But also look at what it does to the steps for using v[i][j] as a value. Because v is a pointer to an array instead of a pointer to a pointer, the computer can calculate where the appropriate element is instead of having to load an address from memory:
Fetch the pointer v.
Calculate i•78 elements beyond that.
Calculate j elements beyond that.
Fetch the int at that location.
So this pointer-to-array version is one step fewer than the pointer-to-pointer version.
Further, the pointer-to-pointer version requires an additional fetch from memory for each use of v[i][j]. Fetches from memory can be expensive relative to in-processor operations like multiplying and adding, so it is a good step to eliminate. Having to fetch a pointer can prevent a processor from predicting where the next load from memory might be based on recent patterns of use. Additionally, the pointer-to-array version puts all the elements of the 2×78 array together in memory, which can benefit the cache performance. Processors are also designed for efficient use of consecutive memory. With the pointer-to-pointer version, the separate allocations typically wind up with at least some separation between the rows and may have a lot of separation, which can break the benefits of consecutive memory use.
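As a minimal, self-contained sketch of the pointer-to-array version described above (error handling kept to a bare minimum):
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* one allocation, contiguous storage, one fetch per element access */
    int (*v)[78] = malloc(2 * sizeof *v);
    if (!v) return 1;

    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 78; ++j)
            v[i][j] = 0;

    printf("%d\n", v[1][77]);
    free(v);    /* one free matches the one malloc */
    return 0;
}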

If I declare a variable inside a for loop in C, will it be created multiple times or not?

#include <stdio.h>
int main()
{
for(int i=0;i<100;i++)
{
int count=0;
printf("%d ",++count);
}
return 0;
}
output of the above program is: 1 1 1 1 1 1..........1
Please take a look at the code above. I declared variable "int count=0" inside the for loop.
With my knowledge, the scope of the variable is within the block, so the count variable is alive only while the loop body executes.
"int count=0" is executed 100 times, so it should either create the variable 100 times or give an error (re-declaration of the count variable), but neither is happening — what may be the reason?
According to the output, the variable is initialized to zero every time.
Please help me to find the reason.
Such simple code can be visualised on http://www.pythontutor.com/c.html for easy understanding.
To answer your question, count gets destroyed when it goes out of scope, that is, at the closing } of the loop. On the next iteration, a variable of the same name is created and initialised to 0, which is then used by the printf.
And if counting is your goal, print i instead of count.
The C standard describes the C language using an abstract model of a computer. In this model, count is created each time the body of the loop is executed, and it is destroyed when execution of the body ends. By “created” and “destroyed,” we mean that memory is reserved for it and is released, and that the initialization is performed with the reservation.
The C standard does not require compilers to implement this model slavishly. Most compilers will allocate a fixed amount of stack space when the routine starts, with space for count included in this fixed amount, and then count will use that same space in each iteration. Then, if we look at the assembly code generated, we will not see any reservation or release of memory; the stack will be grown and shrunk only once for the whole routine, not grown and shrunk in each loop iteration.
Thus, the answer is twofold:
In C’s abstract model of computing, a new lifetime of count begins and ends in each loop iteration.
In most actual implementations, memory is reserved just once for count, although implementations may also allocate and release memory in each iteration.
However, even if you know your C implementation allocates stack space just once per routine when it can, you should generally think about programs in the C model in this regard. Consider this:
for (int i = 0; i < 100; ++i)
{
int count = 0;
// Do some things with count.
float x = 0;
// Do some things with x.
}
In this code, the compiler might allocate four bytes of stack space to use for both count and x, to be used for one of them at a time. The routine would grow the stack once, when it starts, including four bytes to use for count and x. In each iteration of the loop, it would use the memory first for count and then for x. This lets us see that the memory is first reserved for count, then released, then reserved for x, then released, and then that repeats in each iteration. The reservations and releases occur conceptually even though there are no instructions to grow and shrink the stack.
Another illuminating example is:
for (int i = 0; i < 100; ++i)
{
extern int baz(void);
int a[baz()], b[baz()];
extern void bar(void *, void *);
bar(a, b);
}
In this case, the compiler cannot reserve memory for a and b when the routine starts because it does not know how much memory it will need. In each iteration, it must call baz to find how much memory is needed for a and how much for b, and then it must allocate stack space (or other memory) for them. Further, since the sizes may vary from iteration to iteration, it is not possible for both a and b to start in the same place in each iteration—one of them must move to make way for the other. So this code lets us see that a new a and a new b must be created in each iteration.
int count=0 is executing 100 times, then it has to create the variable 100 times
No, it defines the variable count once, then assigns it the value 0 on each of the 100 iterations.
Defining a variable in C does not involve any particular step or code to "create" it (unlike for example in C++, where simply defining a variable may default-construct it). Variable definitions just associate the name with an "entity" that represents the variable internally, and definitions are tied to the scope where they appear.
Assigning a variable is a statement which gets executed during the normal program flow. It usually has "observable effects", otherwise the compiler is allowed to optimize it out entirely.
OP's example can be rewritten in a completely equivalent form as follows.
for(int i=0;i<100;i++)
{
int count; // definition of variable count - defined once in this {} scope
count=0; // assignment of value 0 to count - executed once per iteration, 100 times total
printf("%d ",++count);
}
Eric has it correct. In much shorter form:
Typically, compilers determine at compile time how much memory a function needs and the offsets in the stack of its variables. The actual memory allocation occurs on each function call, and the memory is released on function return.
Further, when you have variables nested within {curly braces}, once execution leaves that brace set the compiler is free to reuse that memory for other variables in the function. There are two reasons I intentionally do this:
The variables are large but only needed for a short time, so why make the stack larger than needed? Especially if you need several large temporary structures or arrays at different times. The smaller the scope, the less chance of bugs.
If a variable only has a sane value for a limited amount of time, and would be dangerous or buggy to use out of that scope, add extra curly braces to limit the scope of access so improper use generates immediate compiler errors. Using unique names for each variable, even if the compiler doesn't insist on it, keeps the debugger, and your mind, less confused.
Example:
void your_function(int a)
{
    { // limit scope of stack_1
        int stack_1 = 0;
        for ( int ii = 0; ii < a; ++ii ) { // really limit scope of ii
            stack_1 += some_calculation(ii, a);
        }
        printf("ii=%d\n", ii); // scope error
        printf("stack_1=%d\n", stack_1); // good
    } // done with stack_1
    {
        int limited_scope_1[10000];
        do_something(a, limited_scope_1);
    }
    {
        float limited_scope_2[10000];
        do_something_else(a, limited_scope_2);
    }
}
A compiler given code like:
void doSomething(int, int const*);
...
for (int i=0; i<100; i++)
{
    int const j=(i & 1);
    doSomething(i, &j);
}
could legitimately replace it with:
void doSomething(int, int const*);
...
int const __compiler_generated_0 = 0;
int const __compiler_generated_1 = 1;
for (int i=0; i<100; i+=2)
{
    doSomething(i, &__compiler_generated_0);
    doSomething(i+1, &__compiler_generated_1);
}
Although a compiler would typically allocate space on the stack for j once, when the function was entered, and then not reuse the storage during the loop (or even the function), meaning that j would have the same address on every iteration of the loop, there is no requirement that the address remain constant. While there typically wouldn't be an advantage to having the address vary between iterations, compilers are allowed to exploit such situations should they arise.

Openmp segfault when passing private variable but not when variable is declared within parallel region

The title says it all. When I declare array x outside the parallel region and pass it as a private variable to the threads, I get a segfault.
When I declare the variable within the parallel region, everything works fine.
I am interested in passing the variable as private rather than declaring it, hence I need help to debug the issue.
Here is how it looks:
//Case1 - doesn't work (segfault)
x = (double *) malloc (solution * sizeof(double));
#pragma omp parallel for private(x)
for...
//Case2 - works
#pragma omp parallel for
for...
x = (double *) malloc (solution * sizeof(double));
I am using 72 threads and I've set the KMP_STACKSIZE to 1m as well as
ulimit -s unlimited
UPDATE
I still get the segfault despite John's suggestion. Here is the actual piece of code. I am actually using CPLEX optimisation library. I've also tried with memcpy for the private variable allocation.
#pragma omp parallel for private(i) shared(lp,env)
for(i=0;i<n;i++){
    CPXENVptr envi = env;
    CPXLPptr lpi = lp;
    CPXLpopt(envi,lpi); //this is where the segfault happens
}
Worth noting that the CPXLpopt command changes the size of both the envi and lpi variables.
Do you recommend any debugger for OpenMP?
Your assertions about one code working and the other not are approximate at best. In fact, neither code even compiles successfully as presented. It seems virtually certain that the misbehavior reported in one case depends also on how variable x is used in the parallel section.
With that said, if the only difference between the working and non-working code is the placement of the declaration of and assignment to x, as shown, then it is unsurprising that the version that assigns x outside the parallel region segfaults. The OpenMP specs describe the private xs in scope in each thread running in the parallel region of your Case 1 this way:
A new list item of the same type, with automatic storage duration, is allocated for the construct. [...] The new list item is initialized, or has an undefined initial value, as if it had been locally declared without an initializer.
(from OpenMP 4.5, 2.15.3.3; emphasis added)
That is, the local xs inside your parallel loop do not start with the value that the (separate) x outside the loop has. Their initial values are indeterminate (per C for an automatic object declared without an initializer). Using that initial value produces undefined behavior, which might very well manifest as a segmentation fault.
You could fix this by allowing x to be shared, and using a different private variable in the parallel section, initialized from the shared x. Something like this, for example:
x = (double *) malloc (solution * sizeof(double));
#pragma omp parallel for
for (double *y = x; ...
(x is shared and y is private by default). That serves the scenario where you want each thread to have a private pointer to the same shared space.
Note, however, that the memory to which x points is shared no matter what. If you want each thread to have its own, separate, dynamically-allocated space, then each one needs to allocate such space itself (and subsequently to free that space). Even in that case that allocated space is technically shared, but if the threads do not publish any pointers to their allocated spaces then other threads will be unable to access them.
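For that case, a minimal sketch, reusing solution and n from the question (error checking omitted):
#pragma omp parallel
{
    /* each thread allocates, uses, and frees its own scratch space */
    double *y = malloc(solution * sizeof *y);
    #pragma omp for
    for (int i = 0; i < n; i++) {
        /* ... work that writes only through this thread's y ... */
    }
    free(y);
}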

What is the effect of this code on memory

Can anyone tell me the effect that the below code has on memory?
My questions are:
In code 1, is the same memory location getting updated every time in the loop?
In code 2, is new memory allocated each time the variable is declared and assigned in the for loop?
Code 1:
int num;long result;
for(int i=0;i<500;i++){
num = i;
result = num * num;
}
Code 2:
for(int i=0;i<500;i++){
int num = i;
long result = num * num;
}
In both cases only one num and one result instance will be created.
The only difference is that in Code 2, num and result won't be accessible after the loop, and the memory used to hold them can be reused for other variables.
Important: Where you declare a local variable in your source code has very little impact on when the actual allocation and deallocation (for example push/pop on the stack) takes place. If anything, the variable might get optimized away entirely.
Where in your C code you declare your local variables most likely has no impact on performance whatsoever. Therefore, you should declare local variables where it gives the best possible readability.
In both cases, the compiler will deduce that num is completely superfluous and optimize it away. No memory will be allocated for it.
In code 2, the compiler will deduce that the local variable result isn't used anywhere in the program, and therefore it will most likely optimize away the whole of code 2 into nothing.
The machine code of code 1 will look something like this:
allocate room for i
allocate room for result
set i to 0
loop:
multiply i by i
store in result
if i is less than 500, jump to loop
Can anyone tell me the effect that the below code has on memory?
As others have mentioned, the answer to your question hugely depends on the compiler you are using.
[1] Assuming you intend to use result and num later in the loop for other computations as in:
for(int i=0; i<500; ++i){
int num = i;
long result = num * num;
// more operations using num and result, e.g. function calls, IO, etc.
}
, a modern compiler (gcc, llvm, msvc, icc) would be smart enough to optimise the two codes you provided into the same thing, so in both codes the same "memory location" would be updated on each iteration.
I put "memory location" in quotes, as the variables can be promoted to registers, which, while strictly speaking are still memory, are a lot faster to access and thus a preferable location for frequently used variables.
The only difference between your codes is the scope of your variables, but you probably already know that.
If, conversely to [1], you don't intend to use your variables later, the compiler would probably detect that and just skip the loop, not generating any machine code for it, as it is redundant.

Comparing a volatile array to a non-volatile array

Recently I needed to compare two uint arrays (one volatile and the other non-volatile) and the results were confusing; there must be something I misunderstood about volatile arrays.
I need to read an array from an input device and write it to a local variable before comparing this array to a global volatile array. If there is any difference, I need to copy the new one onto the global one and publish the new array to other platforms. The code is something like below:
#define ARRAYLENGTH 30
volatile uint8 myArray[ARRAYLENGTH];
void myFunc(void){
uint8 shadow_array[ARRAYLENGTH],change=0;
readInput(shadow_array);
for(int i=0;i<ARRAYLENGTH;i++){
if(myArray[i] != shadow_array[i]){
change = 1;
myArray[i] = shadow_array[i];
}
}
if(change){
char arrayStr[ARRAYLENGTH*4];
array2String(arrayStr,myArray);
publish(arrayStr);
}
}
However, this didn't work, and every time myFunc runs it comes out that a new message is published, mostly identical to the earlier message.
So I inserted a log line into code:
for(int i=0;i<ARRAYLENGTH;i++){
if(myArray[i] != shadow_array[i]){
change = 1;
log("old:%d,new:%d\r\n",myArray[i],shadow_array[i]);
myArray[i] = shadow_array[i];
}
}
Logs I got was as below:
old:0,new:0
old:8,new:8
old:87,new:87
...
Since solving the bug was time-critical, I solved the issue as below:
char arrayStr[ARRAYLENGTH*4];
char arrayStr1[ARRAYLENGTH*4];
array2String(arrayStr,myArray);
array2String(arrayStr1,shadow_array);
if(strCompare(arrayStr,arrayStr1)){
publish(arrayStr1);
}
}
But this approach is far from efficient. If anyone has a reasonable explanation, I would like to hear it.
Thank you.
[updated from comments:]
For the volatile part, global array has to be volatile, since other threads are accessing it.
If the global array is volatile, your tracing code could be inaccurate:
for(int i=0;i<ARRAYLENGTH;i++){
if(myArray[i] != shadow_array[i]){
change = 1;
log("old:%d,new:%d\r\n",myArray[i],shadow_array[i]);
myArray[i] = shadow_array[i];
}
}
The trouble is that the comparison line reads myArray[i] once, but the logging message reads it again, and since it is volatile, there's no guarantee that the two reads will give the same value. An accurate logging technique would be:
for (int i = 0; i < ARRAYLENGTH; i++)
{
uint8 value;
if ((value = myArray[i]) != shadow_array[i])
{
change = 1;
log("old:%d,new:%d\r\n", value, shadow_array[i]);
myArray[i] = shadow_array[i];
}
}
This copies the value used in the comparison and reports that. My gut feeling is that it is not going to show a difference, but in theory it could.
global array has to be volatile, since other threads are accessing it
As you "nicely" observe, declaring an array volatile is not the way to protect it against concurrent read/write access by different threads.
Use a mutex for this, for example by wrapping access to the "global array" in a function which locks and unlocks this mutex. Then only use this function to access the "global array".
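A minimal sketch of that wrapper approach, assuming POSIX threads and the uint8 typedef and ARRAYLENGTH from the question:
#include <pthread.h>
#include <string.h>

static uint8 myArray[ARRAYLENGTH];   /* no longer volatile */
static pthread_mutex_t myArrayLock = PTHREAD_MUTEX_INITIALIZER;

void readMyArray(uint8 *dst){        /* copy the shared array out under the lock */
    pthread_mutex_lock(&myArrayLock);
    memcpy(dst, myArray, sizeof myArray);
    pthread_mutex_unlock(&myArrayLock);
}

void writeMyArray(const uint8 *src){ /* overwrite the shared array under the lock */
    pthread_mutex_lock(&myArrayLock);
    memcpy(myArray, src, sizeof myArray);
    pthread_mutex_unlock(&myArrayLock);
}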
References:
Why is volatile not considered useful in multithreaded C or C++ programming?
https://www.kernel.org/doc/Documentation/volatile-considered-harmful.txt
Also, for printf()ing unsigned integers use the conversion specifier u, not d.
A variable (or array) should be declared volatile when it may change outside the current program execution flow. This may happen through concurrent threads or an ISR.
If there is, however, only one writer and all others are just readers, then the actual writing code may treat it as not being volatile (even though there is no way to tell the compiler to do so).
So if the comparison function is the only point in the project where the global array is actually changed (updated), then there is no problem with multiple reads. The code can be designed with the (external) knowledge that there will be no change by an external source, despite the volatile declaration.
The 'readers', however, do know that the variable (or the array content) may change and won't buffer their reads (e.g. by storing a read value in a register for further use); but the array content may still change while they are reading it, and the whole information might be inconsistent.
So the suggested use of a mutex is a good idea.
It does not, however, help with the original problem that the comparison loop fails, even though nobody is messing with the array from outside.
Also, I wonder why myArray is declared volatile if it is only used locally and the publishing is done by sending out a pointer to arrayStr (which is a pointer to a non-volatile char array).
There is no reason why myArray should be volatile. Actually, there is no reason for its existence at all:
Just read in the data, create a temporary string, and if it differs from the original one, replace the old string and publish it. It's maybe less efficient to always build the string, but it makes the code much shorter and apparently works.
static char arrayStr[ARRAYLENGTH*4] = {0};
char tempStr[ARRAYLENGTH*4];
array2String(tempStr, shadow_array);
if(strCompare(arrayStr, tempStr)){
    strCopy(arrayStr, tempStr);
    publish(arrayStr);
}
