I'm having trouble parallelizing a block of C code.
The block is something like this:
for(n = 0; n < N; n++)
{
for(x = 0; x < X; x++)
{
var_1[x] = var_1[x] * 3 * var_2[x];
var_2[x] = var_2[x] * 2;
}
for(y = 0; y < Y; y++)
{
var_3[y] = var_1[y] * var_2[y];
}
}
That's not the actual code (it's for an assignment so I can't post the source code) but the problem is that the variables lie in a nested loop, and each iteration is dependent upon the previous.
Simply adding a #pragma omp for in front of the outer loop doesn't work because each thread's work begins at the initial values of var_1, var_2 and var_3.
Please let me know if I can explain any better! I'm quite lost.
As HighPerformanceMark comments, the inner loop has no loop-carried dependency. This invites vectorisation first, parallelisation second, at least to me. More importantly, the cheapest computation is the one you never do in the first place. As it is written, var_3[] is never read in either loop. You can remove the second inner loop, and only compute var_3[] after the last iteration of the outer loop.
I suspect further performance gains are entirely academic. It seems to me that this algorithm will very quickly overflow (if used on integers), or lose precision (if used on floats), so the maximum number of iterations has to be small.
Related
I have an image of width * height pixels in which i want to loop through blocks of pixels, say block size of 10 * 10. How can i do this with minimum number of loops?
I have tried by first looping through each column, then through each row and took the starting x and y position from this two outer loops. Then the loop goes from start position of the block and loops till the block size and manipulates the pixels. This consumes four nested loops.
for (int i = 0; i < Width; i+=Block_Size) {
for (int j = 0; j < Height; j+=Block_Size) {
for (int x = i; x < i + Block_Size; x++) {
for (int y = j; y < j + Block_Size; y++) {
//Get pixel values within the block
}
}
}
}
How can i do this with minimum number of loops?
You can reduce the number of loops by completely unrolling as many loop levels as you like. For fixed raster dimensions, you could unroll them all, yielding a (probably lengthy) implementation with zero loops. For known Block_Size you can unroll one or both of the inner loops regardless of whether the overall dimensions are known, yielding as few as two loops remaining.
But why would you consider such a thing? The question seems to assume that there would be some kind of inherent advantage to reducing the depth of loop nest, but that's not necessarily true, and whatever effect there might be is likely to be small.
I'm inclined to guess that you've studied a bit of computational complexity theory, and taken away the idea that deep loop nests necessarily yield poorly-scaling performance, or even that deep loop nests have inherently poor performance, period. These are misconceptions, albeit relatively common ones, and they anyway look at the problem backwards.
The primary consideration in how the performance of your loop nest scales is how many times the body of the innermost loop,
//Get pixel values within the block
, is executed. You'll have roughly the same performance for any reasonable approach that causes it to be executed exactly once for every pixel in the raster, regardless of how many loops are involved. With that being the case, code clarity should be your goal, and your original four-loop nest is pretty clear.
It is possible to achieve this with three loops, but in order to do that you will need to store information about where each block of pixels starts and how many blocks of pixels there are in total!
Independent of that, both the width as well as the height of the image have to be multiples of your Block_Size.
Here is how it is possible with three loops:
int numberOfBlocks = x;
int pixelBlockStartingPoints[numberOfBlocks] = { startingPoint1, startingPoint2, ... };
for(int i = 0; i < numberOfBlocks; i++){
for(int j = pixelBlockStartingPoints[i]; j < pixelBlockStartingPoint[i] + Block_Size; j++){
for(int k = pixelBlockStartingPoints[i]; k < pixelBlockStartingPoint[i] + Block_Size; k++){
// Get Pixel-Data
}
}
}
i'm trying to parallelize this collapse loops with openMP, but this is what i got:
"smooth.c:47:6: error: not enough perfectly nested loops before ‘sum’ sum = 0;"
Somebody knows a good way to parallelize this? i'm stuck 2 days in this problem.
Here my loops:
long long int sum;
#pragma omp parallel for collapse(3) default(none) shared(DY, DX) private(dx, dy) reduction(+:sum)
for (y = 0; y < height; y++) {
for (x = 0; x < width; x++) {
sum = 0;
for (d = 0; d < 9; d++) {
dx = x + DX[d];
dy = y + DY[d];
if (dx >= 0 && dx < width && dy >= 0 && dy < height)
sum += image(dy, dx);
}
smooth(y, x) = sum / 9;
}
}
Full code:
https://github.com/fernandesbreno/smooth_
i'm trying to parallelize this collapse loops with openMP, but this is what i got: "smooth.c:47:6: error: not enough perfectly nested loops before ‘sum’ sum = 0;"
You cannot collapse three loop levels because the third level is not perfectly nested inside the second. There is
sum = 0;
before it and
smooth(y, x) = sum / 9;
after it in the middle loop. (I suppose smooth() is a macro, else the assignment doesn't make sense. Don't do that, though, because it's confusing.)
Consider how you would rewrite that loop nest into an equivalent single loop by hand, using your knowledge of the problem structure and details. I submit that it would be challenging to do so, and that the result would furthermore have unavoidable data dependencies. But if you managed to do it without introducing dependencies, then voila! You have a single flat loop to parallelize, no collapsing needed.
Your simplest way forward, however, would probably be to collapse only two levels instead of three. Moreover, you want to compare with not collapsing at all, as it's not at all clear that collapsing will yield an improvement vs. parallelizing only the outer loop, and collapsing might even be worse.
But if you must have OpenMP collapse all three levels of the nest, then you need to take the two lines I called out above, and lift them out of the loop nest. Possibly you could do that in part by getting rid of sum altogether and working directly with the result raster. Again, this is not necessarily going to produce an improvement.
Rust's for loops are a bit different than those in C-style languages. I am trying to figure out if I can achieve the same result below in a similar fashion in Rust. Note the condition where the i^2 < n.
for (int i = 2; i * i < n; i++)
{
// code goes here ...
}
You can always do a literal translation to a while loop.
let mut i = 2;
while i * i < n {
// code goes here
i += 1;
}
You can also always write a for loop over an infinite range and break out on an arbitrary condition:
for i in 2.. {
if i * i >= n { break }
// code goes here
}
For this specific problem, you could also use take_while, but I don't know if that is actually more readable than breaking out of the for loop. It would make more sense as part of a longer chain of "combinators".
for i in (2..).take_while(|i| i * i < n) {
// code goes here
}
The take_while suggestion from zwol's answer is the most idiomatic, and therefore usually the best choice. All of the information about the loop is kept together in a single expression instead of getting mixed into the body of the loop.
However, the fastest implementation is to precompute the square root of n (actually a weird sort of rounded-down square root). This lets you avoid doing a comparison every iteration, since you know this is always the final value of i.
let m = (n as f64 - 0.5).sqrt() as _;
for i in 2 ..= m {
// code goes here
}
As a side note, I tried to benchmark these different loops. The take_while was the slowest. The version I just suggested always reported 0 ns/iter, and I'm not sure if that's just due to some code being optimised to the point of not running at all, or if it really is too fast to measure. For most uses, the difference shouldn't be important though.
Update: I have learned more Rust since I wrote this answer. This structure is still useful for some rare situations (like when the logic inside the loop needs to conditionally mutate the counter variable), but usually you'll want to use a Range Expression like zwol said.
I like this form, since it keeps the increment at the top of the loop instead of the bottom:
let mut i = 2 - 1; // You need to subtract 1 from the initial value.
loop {
i+=1; if i*i >= n { break }
// code goes here...
}
I was wondering, why does one set of loops allow for better cache performance than another in spite of logically doing the same thing?
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
accum = 0.0;
for (k = 0; k < n; k++) {
accum += b[j][k] * a[k][i];
}
c[j][i] = accum;
}
}
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
val = b[j][k];
for (i = 0; i < n; i++) {
c[j][i] += val * a[k][i];
}
}
}
I believe the first one above delivers better cache performance, but why?
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
Just generally speaking, the most efficient loops through a matrix are going to cycle through the last dimension, not the first ("last" being c in m[a][b][c]).
For example, given a 2D matrix like an image which has its pixels represented in memory from top-left to bottom-right, the quickest way to sequentially iterate through it is going to be horizontally across each scanline, like so:
for (int y=0; y < h; ++y) {
for (int x=0; x < w; ++x)
// access pixel[y][x]
}
... not like this:
for (int x=0; x < w; ++x) {
for (int y=0; y < h; ++y)
// access pixel[y][x]
}
... due to spatial locality. It's because the computer grabs memory from slower, bigger regions of the hierarchy and moves it to faster, small regions in large, aligned chunks (ex: 64 byte cache lines, 4 kilobyte pages, and down to a little teeny 64-bit general-purpose register, e.g.). The first example accesses all the data from such a contiguous chunk immediately and prior to eviction.
harold on this site gave me a nice view on how to look at and explain this subject by suggesting not to focus so much on cache misses, but instead focusing on striving to use all the data in a cache prior to eviction. The second example fails to do that for all but the most trivially-small images by iterating through the image vertically with a large, scanline-sized stride rather than horizontally with a small, pixel-sized one.
Also, when we increase block size, but keep cache size and associativity constant, does it influence the miss rate? At a certain point increasing block size can cause a higher miss rate, right?
The answer here would be "yes", as an increase in block size would naturally equate to more compulsory misses (that would be more simply "misses" though rather than "miss rate") but also just more data to process which won't all necessarily fit into the fastest L1 cache. If we're accessing a large amount of data with a large stride, we end up getting a higher non-compulsory miss rate as a result of more data being evicted from the cache before we utilize it, only to then redundantly load it back into a faster cache.
There is also a case where, if the block size is small enough and aligned properly, all the data will just fit into a single cache line and it wouldn't matter so much how we sequentially access it.
Matrix Multiplication
Now your example is quite a bit more complex than this straightforward image example above, but the same concepts tend to apply.
Let's look at the first one:
for (i = 0; i < n; i++) {
for (j = 0; j < n; j++) {
accum = 0.0;
for (k = 0; k < n; k++)
accum += b[j][k] * a[k][i];
c[j][i] = accum;
}
}
If we look at the innermost k loop, we access b[j][k]. That's a fairly optimal access pattern: "horizontal" if we imagine a row-order memory layout. However, we also access a[k][i]. That's not so optimal, especially for a very large matrix, as it's accessing memory in a vertical pattern with a large stride and will tend to suffer from data being evicted from the fastest but smallest forms of memory before it is used, only to load that chunk of data again redundantly.
If we look at the second j loop, that's accessing c[j][i], again in a vertical fashion which is not so optimal.
Now let's have a glance at the second example:
for (j = 0; j < n; j++) {
for (k = 0; k < n; k++) {
val = b[j][k];
for (i = 0; i < n; i++)
c[j][i] += val * a[k][i];
}
}
If we look at the second k loop in this case, it's starting off accessing b[j][k] which is optimal (horizontal). Furthermore, it's explicitly memoizing the value to val, which might improve the odds of the compiler moving that to a register and keeping it there for the following loop (this relates to compiler concepts related to aliasing, however, rather than CPU cache).
In the innermost i loop, we're accessing c[j][i] which is also optimal (horizontal) along with a[k][i] which is also optimal (horizontal).
So this second version is likely to be more efficient in practice. Note that we can't absolutely say that, as aggressive optimizing compilers can do all sorts of magical things like rearranging and unrolling loops for you. Yet short of that, we should be able to say the second one has higher odds of being more efficient.
"What's a profiler?"
I just noticed this question in the comments. A profiler is a measuring tool that can give you a precise breakdown of where time is spent in your code, along with possibly further statistics like cache misses and branch mispredictions.
It's not only good for optimizing real-world production code and helping you more effectively prioritize your efforts to places that really matter, but it can also accelerate the learning process of understanding why inefficiencies exist through the process of chasing one hotspot after another.
Loop Tiling/Blocking
It's worth mentioning an advanced optimization technique which can be useful for large matrices -- loop tiling/blocking. It's beyond the scope of this subject but that one plays to temporal locality.
Deep C Optimization
Hopefully later you will be able to C these things clearly as a deep C explorer. While most optimization is best saved for hindsight with a profiler in hand, it's useful to know the basics of how the memory hierarchy works as you go deeper and deeper exploring the C.
This question already has answers here:
Why does the order of the loops affect performance when iterating over a 2D array?
(7 answers)
Closed 8 years ago.
I am given two functions for finding the product of two matrices:
void MultiplyMatrices_1(int **a, int **b, int **c, int n){
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
}
void MultiplyMatrices_2(int **a, int **b, int **c, int n){
for (int i = 0; i < n; i++)
for (int k = 0; k < n; k++)
for (int j = 0; j < n; j++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
}
I ran and profiled two executables using gprof, each with identical code except for this function. The second of these is significantly (about 5 times) faster for matrices of size 2048 x 2048. Any ideas as to why?
I believe that what you're looking at is the effects of locality of reference in the computer's memory hierarchy.
Typically, computer memory is segregated into different types that have different performance characteristics (this is often called the memory hierarchy). The fastest memory is in the processor's registers, which can (usually) be accessed and read in a single clock cycle. However, there are usually only a handful of these registers (usually no more than 1KB). The computer's main memory, on the other hand, is huge (say, 8GB), but is much slower to access. In order to improve performance, the computer is usually physically constructed to have several levels of caches in-between the processor and main memory. These caches are slower than registers but much faster than main memory, so if you do a memory access that looks something up in the cache it tends to be a lot faster than if you have to go to main memory (typically, between 5-25x faster). When accessing memory, the processor first checks the memory cache for that value before going back to main memory to read the value in. If you consistently access values in the cache, you will end up with much better performance than if you're skipping around memory, randomly accessing values.
Most programs are written in a way where if a single byte in memory is read into memory, the program later reads multiple different values from around that memory region as well. Consequently, these caches are typically designed so that when you read a single value from memory, a block of memory (usually somewhere between 1KB and 1MB) of values around that single value is also pulled into the cache. That way, if your program reads the nearby values, they're already in the cache and you don't have to go to main memory.
Now, one last detail - in C/C++, arrays are stored in row-major order, which means that all of the values in a single row of a matrix are stored next to each other. Thus in memory the array looks like the first row, then the second row, then the third row, etc.
Given this, let's look at your code. The first version looks like this:
for (int i = 0; i < n; i++)
for (int j = 0; j < n; j++)
for (int k = 0; k < n; k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, let's look at that innermost line of code. On each iteration, the value of k is changing increasing. This means that when running the innermost loop, each iteration of the loop is likely to have a cache miss when loading the value of b[k][j]. The reason for this is that because the matrix is stored in row-major order, each time you increment k, you're skipping over an entire row of the matrix and jumping much further into memory, possibly far past the values you've cached. However, you don't have a miss when looking up c[i][j] (since i and j are the same), nor will you probably miss a[i][k], because the values are in row-major order and if the value of a[i][k] is cached from the previous iteration, the value of a[i][k] read on this iteration is from an adjacent memory location. Consequently, on each iteration of the innermost loop, you are likely to have one cache miss.
But consider this second version:
for (int i = 0; i < n; i++)
for (int k = 0; k < n; k++)
for (int j = 0; j < n; j++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
Now, since you're increasing j on each iteration, let's think about how many cache misses you'll likely have on the innermost statement. Because the values are in row-major order, the value of c[i][j] is likely to be in-cache, because the value of c[i][j] from the previous iteration is likely cached as well and ready to be read. Similarly, b[k][j] is probably cached, and since i and k aren't changing, chances are a[i][k] is cached as well. This means that on each iteration of the inner loop, you're likely to have no cache misses.
Overall, this means that the second version of the code is unlikely to have cache misses on each iteration of the loop, while the first version almost certainly will. Consequently, the second loop is likely to be faster than the first, as you've seen.
Interestingly, many compilers are starting to have prototype support for detecting that the second version of the code is faster than the first. Some will try to automatically rewrite the code to maximize parallelism. If you have a copy of the Purple Dragon Book, Chapter 11 discusses how these compilers work.
Additionally, you can optimize the performance of this loop even further using more complex loops. A technique called blocking, for example, can be used to notably increase performance by splitting the array into subregions that can be held in cache longer, then using multiple operations on these blocks to compute the overall result.
Hope this helps!
This may well be the memory locality. When you reorder the loop, the memory that's needed in the inner-most loop is nearer and can be cached, while in the inefficient version you need to access memory from the entire data set.
The way to test this hypothesis is to run a cache debugger (like cachegrind) on the two pieces of code and see how many cache misses they incur.
Apart from locality of memory there is also compiler optimisation. A key one for vector and matrix operations is loop unrolling.
for (int k = 0; k < n; k++)
c[i][j] = c[i][j] + a[i][k]*b[k][j];
You can see in this inner loop i and j do not change. This means it can be rewritten as
for (int k = 0; k < n; k+=4) {
int * aik = &a[i][k];
c[i][j] +=
+ aik[0]*b[k][j]
+ aik[1]*b[k+1][j]
+ aik[2]*b[k+2][j]
+ aik[3]*b[k+3][j];
}
You can see there will be
four times fewer loops and accesses to c[i][j]
a[i][k] is being accessed continuously in memory
the memory accesses and multiplies can be pipelined (almost concurrently) in the CPU.
What if n is not a multiple of 4 or 6 or 8? (or whatever the compiler decides to unroll it to) The compiler handles this tidy up for you. ;)
To speed up this solution faster, you could try transposing the b matrix first. This is a little extra work and coding, but it means that accesses to b-transposed are also continuous in memory. (As you are swapping [k] with [j])
Another thing you can do to improve performance is to multi-thread the multiplication. This can improve performance by a factor of 3 on a 4 core CPU.
Lastly you might consider using float or double You might think int would be faster, however that is not always the case as floating point operations can be more heavily optimised (both in hardware and the compiler)
The second example has c[i][j] is changing on each iteration which makes it harder to optimise.
Probably the second one has to skip around in memory more to access the array elements. It might be something else, too -- you could check the compiled code to see what is actually happening.