How would you avoid False Sharing in a scenario like this? - c

In the code below I have parallelised using OpenMP's standard parallel for clause.
#pragma omp parallel for private(i, j, k, d_equ) shared(cells, tmp_cells, params)
for(i=0; i<some_large_value; i++)
{
    for(j=0; j<some_large_value; j++)
    {
        ....
        // Some operations performed over here which are using private variables
        ....
        // Accessing a shared array is causing False Sharing
        for(k=0; k<10; k++)
        {
            cells[i * width + j].speeds[k] = some_calculation(i, j, k, cells);
        }
    }
}
This has given me a significant improvement to runtime (~140s to ~40s), but there is still one area that really lags behind - the innermost loop marked above.
I know for certain the array above is causing False Sharing because if I make the change below, I see another huge leap in performance (~40s to ~13s).
for(k=0; k<10; k++)
{
    double somevalue = some_calculation(i, j);
}
In other words, as soon as I changed the write target to a private variable, there was a huge speedup.
Is there any way I can improve my runtime by avoiding False Sharing in the scenario I have just explained? I cannot seem to find many resources online that seem to help with this problem even though the problem itself is mentioned a lot.
I had an idea to create an overly large array (10x what is needed) so that enough margin space is kept between each element to make sure when it enters the cache line, no other thread will pick it up. However this failed to create the desired effect.
Is there any easy (or even hard if needs be) way of reducing or removing the False Sharing found in that loop?
Any form of insight or help will be greatly appreciated!
EDIT: Assume some_calculation() does the following:
(tmp_cells[ii*params.nx + jj].speeds[kk] + params.omega * (d_equ[kk] - tmp_cells[ii*params.nx + jj].speeds[kk]));
I cannot move this calculation out of my for loop because I rely on d_equ which is calculated for each iteration.

Before answering your question, I have to ask: is it really a false sharing situation when you use the whole cells array as the input of the function some_calculation()? It seems you are actually sharing the whole array. You may want to provide more info about this function.
If yes, go on with the following.
You've already shown that the private variable double somevalue will improve the performance. Why not just use this approach?
Instead of using a single double variable, you could define a private array private_speed[10] just before the for k loop, calculate the values into it inside the loop, and copy it back to cells after the loop with something like
memcpy(cells[i*width+j].speeds, private_speed, sizeof(...));
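A minimal sketch of that approach, assuming the loop from the question and that speeds holds 10 doubles (private_speed lives on each thread's stack, so no cache line is shared while the values are computed; the writes to cells still happen, but as one contiguous copy per cell rather than ten separate stores):

double private_speed[10]; /* thread-private scratch buffer */

for(k=0; k<10; k++)
{
    /* compute into the private buffer instead of the shared array */
    private_speed[k] = some_calculation(i, j, k, cells);
}

/* single bulk write back to the shared array for this (i, j) cell */
memcpy(cells[i * width + j].speeds, private_speed, sizeof(private_speed));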

Related

Can I use curly braces instead of #pragma region?

I'm developing code for an arduino based board, and I'm using VSCode, as I find it better than the Arduino IDE.
Now, in some parts of the code, I like to group certain statements together, to organise the code better. In C# (using Visual Studio) I would use #region NAME to do this. The C variant of this is #pragma region, however, I find this clutters the code, and isn't quite as clean as I would want it.
Instead I thought of using curly braces {}, to achieve something similar, but to my understanding the compiler uses them to declare scope right? So would using them like this:
char *data;
{
    free(data);
}
produce any odd behaviour? From what I've tried the compiler doesn't seem to mind, but maybe I just haven't tried enough cases.
So, I guess what I want to know is: Would using curly braces in this way be detrimental to general coding in C?
The compound statement forms a block scope.
So for example this code snippet
int x;
int y = 10;
x = y;
is not equivalent to
int x;
{
    int y = 10;
}
x = y;
In the last case the compiler will issue an error that the identifier y is not declared.
Also, using redundant braces can make code less readable and more confusing.
Arduino is not C, it is C++.
The best way of organizing the code in C++ is to use classes and structures.
In C#, #region is used to collapse and expand code in the editor when outlining, and it has no effect on code execution. The Arduino IDE does not have any of those fancy editor features.
A compound statement in C or C++ is something completely different from #region in C#. It creates a new scope that affects the compilation and execution of the code.
Using blocks to group statements and set them off is fine for suitable purposes. One situation I encountered from time to time was when a matrix needed separate processing for its first and last rows, due to border effects. Then the code could look like:
{
    int row = 0;
    // Code for first row.
}
for (int row = 1; row < N-1; ++row)
{
    // Code for middle rows.
}
{
    int row = N-1;
    // Code for last row.
}
Often the code was largely similar for the three cases, and using blocks for each of them made that similarity more visually apparent to the reader. At the same time, having the same level of indentation for each of the cases made the differences easier to see.
Similarly, blocks can organize sections of a function that are semi-repetitive but do not involve loops (and are not involved enough to deserve functions of their own, or that use so many variables that passing them parameters would be a mess).
Blocks are well defined by the C standard and will not produce any “odd behavior.”

Dealing with 2D grids with MPI

I'm using MPI to parallelise some simple C code, but I'm having a hard time fundamentally understanding it.
Everything was quite reasonable when I was first learning about MPI, but this has stumped me. If you create a 2D grid using two nested loops, i.e.
for(i=0; i<imax; i++) {
    for(j=0; j<jmax; j++) {
        grid[i][j] = ...
    }
}
How would you even go about parallelising this? I initially thought you would use a master thread to gather all of the data that each thread calculates, but apparently this scales badly.
I believe you need to call MPI_Gather(...) for all processes, but I don't really understand how you would then collect this data into one coherent 2D grid.
I should probably note that the ... in the loop is just where it calculates a small value for each grid point. It uses a red/black scheme, so there are no issues with race conditions, etc.
Thanks!
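For reference, a minimal sketch of the usual row-block decomposition with MPI_Gather, assuming imax divides evenly by the number of ranks, the grid is stored as a contiguous 1D array, and compute_point stands in for the real per-point calculation:

#include <mpi.h>
#include <stdlib.h>

/* hypothetical per-point computation standing in for the "..." above */
double compute_point(int i, int j) { return (double)(i + j); }

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int imax = 512, jmax = 512;   /* assume imax % size == 0 */
    const int rows = imax / size;       /* rows owned by this rank */

    double *local = malloc(rows * jmax * sizeof(double));
    for (int i = 0; i < rows; i++)
        for (int j = 0; j < jmax; j++)
            local[i * jmax + j] = compute_point(rank * rows + i, j);

    /* every rank contributes its block; rank 0 ends up with the full grid */
    double *grid = (rank == 0) ? malloc(imax * jmax * sizeof(double)) : NULL;
    MPI_Gather(local, rows * jmax, MPI_DOUBLE,
               grid, rows * jmax, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(local);
    if (rank == 0) free(grid);
    MPI_Finalize();
    return 0;
}

MPI_Gather tends to scale better than having every rank send its block to a master in a hand-written loop, because the library is free to implement the collective with a tree rather than a linear sequence of receives.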

Moving x,y position of all array objects every frame in actionscript 3?

I have my code setup so that I have a movieclip in my library with a class called "block" being duplicated multiple times and added into an array like this:
function makeblock(e:Event){
    newblock=new block;
    newblock.x=10;
    newblock.y=10;
    addChild(newblock);
    myarray[counter] = newblock; //adds a newblock object into array
    counter += 1;
}
Then I have a loop with a currently primitive way of handling my problem:
stage.addEventListener(Event.ENTER_FRAME, gameloop);
function gameloop(evt:Event):void {
    if (moveright==true){
        myarray[0].x += 5;
        myarray[1].x += 5;
        myarray[2].x += 5
        -(and so on)-
My question is how can I change x,y values every frame for new objects duplicated into the array, along with the previous ones that were added. Of course with a more elegant way than writing it out myself... array[0].x += 5, array[1], array[2], array[3] etc.
Ideally I would like this to go up to 500 or more objects in one array, so obviously I don't want to be writing it out individually. I also need the performance to stay consistent, so would using a for loop to run through the whole array and move each x += 5 even work? Anyway, if anyone has any ideas that'd be great!
If you have to move 100 objects, you have to move them. No alternatives.
But what you can really do to save performance, is optimize the solution itself. A few cents from me:
Of course a loop has to be applied in your case; managing 100+ assignments line by line is definitely not the right way to go, although you gain nothing performance-wise just by using a loop.
Try grouping the objects. As I see above, you seem to be moving all those objects by the same increment. Group them all into larger MovieClips (or Sprites) & move those instead.
Learn blitting & caching methods to save a lot on performance, or you will sooner or later reach a point where your logic cannot be twisted any further & performance will be a pain.
Also, as an extension of the previous step, do consider using sprite sheets if you have multiple states of the same object.
Finally, I would also caution you not to waste time on micro-optimizations & over-thinking them.
You can use some container sprite and add the blocks to that on creation:
// Some init place
var blockContainer:Sprite = new Sprite();
addChild(blockContainer);
Make the blocks:
function makeblock(e:Event){
    newblock=new block;
    newblock.x=10;
    newblock.y=10;
    // Add the block to the container
    blockContainer.addChild(newblock);
    myarray[counter] = newblock; //adds a newblock object into array
    counter += 1;
}
And the gameloop:
stage.addEventListener(Event.ENTER_FRAME, gameloop);
function gameloop(evt:Event):void {
    if (moveright==true){
        blockContainer.x += 5;
    }
    // etc...
}
This way you'll only have to move one object. Of course, this method only works as long as all the blocks need to move in the same direction. By the way, a for loop will work just as well - 500 iterations is nothing. The only real performance cost will be rendering, and that happens regardless of which method you choose, since the movement change has to be rendered either way; the only question is how you implement the movement for your own coding convenience.

Understanding this C function

I'm trying to understand how this function works. I have studied several algorithms that generate sudoku puzzles and found this one.
I tested the function and it does generate a valid 9x9 Latin square (sudoku) grid.
My problem is that I can't understand how the function works. I do know the struct is formed by two ints, p and b; p holds the number for the cell in the table. But after that I don't understand why it creates more arrays (tab and tab2) and how it checks for a Latin square, etc. In short, I'm completely lost.
I'm not asking for a line-by-line explanation; the general concept behind this function would help me a lot!
Thanks again <3
int sudoku(struct sudoku tabla[9][9], int x, int y)   /* needs <stdlib.h> for malloc() and rand() */
{
    int tab[9] = {1,1,1,1,1,1,1,1,1};
    int i, j;
    /* cross out the values already used in row x (columns before y) */
    for(i=0; i<y; ++i)
    {
        tab[tabla[x][i].p-1] = 0;
    }
    /* cross out the values already used in column y (rows before x) */
    for(i=0; i<x; ++i)
    {
        tab[tabla[i][y].p-1] = 0;
    }
    /* cross out the values already used in the 3x3 box containing (x, y) */
    for(i=(3*(x/3)); i<(3*(x/3)+3); ++i)
    {
        for(j=(3*(y/3)); j<y; ++j)
        {
            tab[tabla[i][j].p-1] = 0;
        }
    }
    /* n = how many values are still legal at (x, y) */
    int n = 0;
    for(i=0; i<9; ++i)
    {
        n = n + tab[i];
    }
    /* tab2 = the legal values themselves */
    int *tab2;
    tab2 = (int*)malloc(sizeof(int)*n);
    j = 0;
    for(i=0; i<9; ++i)
    {
        if(tab[i]==1)
        {
            tab2[j] = i+1;
            j++;
        }
    }
    /* coordinates of the next cell in the x-then-y order */
    int ny, nx;
    if(x==8)
    {
        ny = y+1;
        nx = 0;
    }
    else
    {
        ny = y;
        nx = x+1;
    }
    /* try the legal values in random order and recurse for the rest of the board */
    while(n>0)
    {
        int los = rand()%n;
        tabla[x][y].p = tab2[los];
        tab2[los] = tab2[n-1];
        n--;
        if(x==8 && y==8)
        {
            return 1;
        }
        if (sudoku(tabla, nx, ny)==1)
        {
            return 1;
        }
    }
    return 0;
}
EDIT
Great, I now understand the structure, thanks to lijie's answer. What I still don't understand is the part that tries out the values in random order. I don't understand how it checks whether the random value placement is valid without calling the part of the code that checks whether a move is legal. Also, after placing the random numbers, is it necessary to check whether the grid is valid again?
Basically, an invocation of the function fills in the positions at and "after" (x, y) in the table tabla; the function assumes that the positions "prior" to (x, y) are already filled, and returns whether a legal "filling in" of the values is possible.
The board is linearized via increasing x, then y.
The first part of the function finds the values that are legal at (x, y), and the second part tries out those values in a random order and attempts to fill out the rest of the board via a recursive call.
There isn't actually a point in having tab2, because tab can be reused for that purpose, and the function leaks memory (tab2 is never freed). But aside from these issues, it works.
Does this make sense to you?
EDIT
The only tricky area in the part that checks for legal numbers is the third loop (checking the 3x3 box). The condition for j is j < y because the values where j == y are already checked by the second loop.
EDIT2
I nitpick, but the part that counts n and fills tab2 with the legal values should really be
int n = 0;
for (i = 0; i < 9; ++i) if (tab[i]) tab[n++] = i+1;
hence omitting the need for tab2 (the later code can just use tab and n instead of tab2). The memory leak is thus eliminated.
EDIT
Note that the randomness is only applied to valid values (the order of trying the values is randomized, not the values themselves).
The code follows a standard exhaustive search pattern: try each possible candidate value, immediately returning if the search succeeds, and backtracking with failure if all the candidate values fail.
Try to solve a sudoku yourself, and you'll see that there is inherent recursion in finding a solution. So you have a function that calls itself until the whole board is solved.
As for the code, it can be simplified significantly, but it will be best if you try to write one yourself.
EDIT:
Here is one from java, maybe it will be similar to what you are trying to do.
A quick description of the principles - ignoring the example you posted. Hopefully with the idea, you can tie it to the example yourself.
The basic approach is something that was the basis of a lot of "Artificial Intelligence", at least as it was seen until about the end of the 80s. The most general solution to many puzzles is basically to try all possible solutions.
So, first you try all possible solutions with a 1 in the top-left corner, then all possible solutions with a 2 in the top-left corner and so on. You recurse to try the options for the second position, third position and so on. This is called exhaustive search - or "brute force".
Trouble is it takes pretty much forever - but you can short-cut a lot of pointless searching.
For example, having placed a 1 in the top-left corner, you recurse. You place a 1 in the next position and recurse again - but now you detect that you've violated two rules (two ones in a row, two ones in a 3x3 block) even without filling in the rest of the board. So you "backtrack" - ie exit the recursion to the previous level and advance to putting a 2 in that second position.
This avoids a lot of searching, and makes things practical. There are further optimisations, as well, if you keep track of the digits still unused in each row, column and block - think about the intersection of those sets.
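A minimal sketch of that bookkeeping, assuming a plain int grid[9][9] with 0 meaning an empty cell (the masks and helpers below are illustrative and not part of the code in the question):

#include <stdio.h>

/* Bit k (0..8) set in a mask means digit k+1 is still unused in that unit. */
static int row_mask[9], col_mask[9], box_mask[9];

static void init_masks(int grid[9][9])
{
    for (int i = 0; i < 9; i++)
        row_mask[i] = col_mask[i] = box_mask[i] = 0x1FF;   /* all 9 digits free */

    for (int r = 0; r < 9; r++)
        for (int c = 0; c < 9; c++)
            if (grid[r][c] != 0)
            {
                int bit = 1 << (grid[r][c] - 1);
                row_mask[r] &= ~bit;
                col_mask[c] &= ~bit;
                box_mask[3*(r/3) + c/3] &= ~bit;
            }
}

/* Digits legal at (r, c): the intersection of the three "still unused" sets. */
static int candidates(int r, int c)
{
    return row_mask[r] & col_mask[c] & box_mask[3*(r/3) + c/3];
}

int main(void)
{
    int grid[9][9] = {{0}};
    grid[0][0] = 5;                          /* put a 5 in the top-left corner */
    init_masks(grid);
    printf("%03x\n", candidates(0, 1));      /* bit 4 is clear: 5 is no longer legal in row 0 */
    return 0;
}

Placing a digit then just clears one bit in each of the three masks (and removing it sets the bit again), so the legality check becomes two ANDs instead of three loops.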
What I described above is actually a solution algorithm (if you allow for some cells already being filled in). Generating a random solved sudoku is the same thing, except that for each position you try the digits in random order. That still leaves the problem of deciding which cells to leave blank while ensuring the puzzle can still be solved, and (much harder) designing puzzles with a level-of-difficulty setting. But in a way, the basic approach to those problems is already here: you can test whether a particular set of left-blank cells is valid by running the solution algorithm and finding whether (and how many) solutions you get, so you can design a search for a valid set of cells to leave blank.
The level-of-difficulty thing is difficult because it depends on a human perception of difficulty. Hmmm - can I fit "difficult" in there again somewhere...
One approach - design a more sophisticated search algorithm which uses typical human rules-of-thumb in preference to recursive searching, and which judges difficulty as the deepest level of recursion needed. Some rules of thumb might also be judged more advanced than others, so that using them more counts towards difficulty. Obviously difficulty is subjective, so there's no one right answer to how precisely the scoring should be done.
That gives you a measure of difficulty for a particular puzzle. Designing a puzzle directly for a level of difficulty will be hard - but when trying different selections of cells to leave blank, you can try multiple options, keep track of all the difficulty scores, and at the end select the one that was nearest to your target difficulty level.

Is SIMD Worth It? Is there a better option?

I have some code that runs fairly well, but I would like to make it run better. The major problem I have with it is that it needs to have a nested for loop. The outer one is for iterations (which must happen serially), and the inner one is for each point particle under consideration. I know there's not much I can do about the outer one, but I'm wondering if there is a way of optimizing something like:
void collide(particle particles[], box boxes[],
             double boxShiftX, double boxShiftY) {/*{{{*/
    int i;
    double nX;
    double nY;
    int boxnum;
    for(i=0;i<PART_COUNT;i++) {
        boxnum = ((((int)(particles[i].sX+boxShiftX))/BOX_SIZE)%BWIDTH+
                  BWIDTH*((((int)(particles[i].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
        //copied and pasted the macro which is why it's kinda odd looking
        particles[i].vX -= boxes[boxnum].mX;
        particles[i].vY -= boxes[boxnum].mY;
        if(boxes[boxnum].rotDir == 1) {
            nX = particles[i].vX*Wxx+particles[i].vY*Wxy;
            nY = particles[i].vX*Wyx+particles[i].vY*Wyy;
        } else { //to make it randomly pick a rot. direction
            nX = particles[i].vX*Wxx-particles[i].vY*Wxy;
            nY = -particles[i].vX*Wyx+particles[i].vY*Wyy;
        }
        particles[i].vX = nX + boxes[boxnum].mX;
        particles[i].vY = nY + boxes[boxnum].mY;
    }
}/*}}}*/
I've looked at SIMD, though I can't find much about it, and I'm not entirely sure that the processing required to properly extract and pack the data would be worth the gain of doing half as many instructions, since apparently only two doubles can be used at a time.
I tried breaking it up into multiple threads with shm and pthread_barrier (to synchronize the different stages, of which the above code is one), but it just made it slower.
My current code does go pretty quickly; it's on the order of one second per 10M particle*iterations, and from what I can tell from gprof, 30% of my time is spent in that function alone (5000 calls; PART_COUNT=8192 particles took 1.8 seconds). I'm not worried about small, constant time things, it's just that 512K particles * 50K iterations * 1000 experiments took more than a week last time.
I guess my question is if there is any way of dealing with these long vectors that is more efficient than just looping through them. I feel like there should be, but I can't find it.
I'm not sure how much SIMD would benefit; the inner loop is pretty small and simple, so I'd guess (just by looking) that you're probably more memory-bound than anything else. With that in mind, I'd try rewriting the main part of the loop to not touch the particles array more than needed:
const double temp_vX = particles[i].vX - boxes[boxnum].mX;
const double temp_vY = particles[i].vY - boxes[boxnum].mY;
if(boxes[boxnum].rotDir == 1)
{
    nX = temp_vX*Wxx+temp_vY*Wxy;
    nY = temp_vX*Wyx+temp_vY*Wyy;
}
else
{
    //to make it randomly pick a rot. direction
    nX = temp_vX*Wxx-temp_vY*Wxy;
    nY = -temp_vX*Wyx+temp_vY*Wyy;
}
particles[i].vX = nX;
particles[i].vY = nY;
This has the small potential side effect of not doing the extra addition at the end.
Another potential speedup would be to use __restrict on the particle array, so that the compiler can better optimize the writes to the velocities. Also, if Wxx etc. are global variables, they may have to get reloaded each time through the loop instead of possibly stored in registers; using __restrict would help with that too.
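For example, the declaration might become something like the following (a sketch only; __restrict is the common compiler spelling of C99's restrict, and switching the parameters from T a[] to T * __restrict a does not change anything for callers):

void collide(particle * __restrict particles,
             const box  * __restrict boxes,
             double boxShiftX, double boxShiftY);

With that qualifier the compiler is allowed to assume that the stores to particles[i].vX/vY cannot alias the box data (or globals like Wxx), so those values can stay in registers across the loop instead of being reloaded after every write.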
Since you're accessing the particles in order, you can try prefetching (e.g. __builtin_prefetch on GCC) a few particles ahead to reduce cache misses. Prefetching on the boxes is a bit tougher since you're accessing them in an unpredictable order; you could try something like
int nextBoxnum = ((((int)(particles[i+1].sX+boxShiftX) /// etc...
// prefetch boxes[nextBoxnum]
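Fleshed out a little, that might look like the following with GCC's __builtin_prefetch (PREFETCH_AHEAD is a made-up tuning constant, and the right distance depends on the hardware):

#define PREFETCH_AHEAD 8   /* hypothetical; tune against real measurements */

for(i=0;i<PART_COUNT;i++) {
    /* particles are visited in order, so prefetching a fixed distance ahead is easy */
    if (i + PREFETCH_AHEAD < PART_COUNT)
        __builtin_prefetch(&particles[i + PREFETCH_AHEAD], 1, 1);   /* 1 = prefetch for write */

    /* boxes are visited in a data-dependent order, so compute the next index
       one iteration early and prefetch that box for reading */
    if (i + 1 < PART_COUNT) {
        int nextBoxnum = ((((int)(particles[i+1].sX+boxShiftX))/BOX_SIZE)%BWIDTH+
                          BWIDTH*((((int)(particles[i+1].sY+boxShiftY))/BOX_SIZE)%BHEIGHT));
        __builtin_prefetch(&boxes[nextBoxnum], 0, 1);
    }

    /* ... rest of the loop body as before ... */
}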
One last one that I just noticed - if box::rotDir is always +/- 1.0, then you can eliminate the comparison and branch in the inner loop like this:
const double rot = boxes[boxnum].rotDir; // always +/- 1.0
nX = particles[i].vX*Wxx + rot*particles[i].vY*Wxy;
nY = rot*particles[i].vX*Wyx + particles[i].vY*Wyy;
Naturally, the usual caveats of profiling before and after apply. But I think all of these might help, and can be done regardless of whether or not you switch to SIMD.
Just for the record, there's also libSIMDx86!
http://simdx86.sourceforge.net/Modules.html
(On compiling you may also try: gcc -O3 -msse2 or similar).
((int)(particles[i].sX+boxShiftX))/BOX_SIZE
That's expensive if sX is an int (can't tell). Truncate boxShiftX/Y to an int before entering the loop.
Do you have sufficient profiling to tell you where the time is spent within that function?
For instance, are you sure it's not your divs and mods in the boxnum calculation where the time is being spent? Sometimes compilers fail to spot possible shift/AND alternatives, even where a human (or at least, one who knew BOX_SIZE and BWIDTH/BHEIGHT, which I don't) might be able to.
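For instance, if BOX_SIZE, BWIDTH and BHEIGHT happened to be powers of two (an assumption - I don't know their values) and the shifted coordinates are non-negative, the divs and mods could be rewritten as shifts and masks, with BOX_SHIFT being a hypothetical log2(BOX_SIZE) constant:

/* only valid for power-of-two BOX_SIZE/BWIDTH/BHEIGHT and non-negative operands */
boxnum = ((((int)(particles[i].sX+boxShiftX)) >> BOX_SHIFT) & (BWIDTH-1)) +
         BWIDTH*((((int)(particles[i].sY+boxShiftY)) >> BOX_SHIFT) & (BHEIGHT-1));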
It would be a pity to spend lots of time on SIMDifying the wrong bit of the code...
The other thing which might be worth looking into is if the work can be coerced into something which could work with a library like IPP, which will make well-informed decisions about how best to use the processor.
Your algorithm has too many memory, integer and branch instructions to have enough independent flops to profit from SIMD. The pipeline will be constantly stalled.
Finding a more effective way to randomize would be at the top of the list. Then, try to work either in float or in int, but not both. Recast conditionals as arithmetic, or at least as a select operation. Only then does SIMD become a realistic proposition.
