Performance Optimization for Matrix Rotation - c

I'm now stuck on a performance optimization lab in the book "Computer Systems: A Programmer's Perspective", described as follows:
In an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
An example of matrix rotation (N is 3 instead of 32 for simplicity):
-------                  -------
|1|2|3|                  |3|6|9|
-------                  -------
|4|5|6|  after rotate is |2|5|8|
-------                  -------
|7|8|9|                  |1|4|7|
-------                  -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I came up with the idea of unrolling the inner loop. The results are:
Code Version      Speed Up
original          1x
unrolled by 2     1.33x
unrolled by 4     1.33x
unrolled by 8     1.55x
unrolled by 16    1.67x
unrolled by 32    1.61x
I also found a code snippet on pastebin.com that seems to solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
int stride = 32;
int count = dim >> 5;
src += dim - 1;
int a1 = count;
do {
int a2 = dim;
do {
int a3 = stride;
do {
*dst++ = *src;
src += dim;
} while(--a3);
src -= dim * stride + 1;
dst += dim - stride;
} while(--a2);
src += dim * (stride + 1);
dst -= dim * dim - stride;
} while(--a1);
}
After carefully reading the code, I think the main idea of this solution is to treat 32 rows as a data zone and rotate each zone separately. The speedup of this version is 1.85x, beating all of the loop-unrolled versions.
Here are the questions:
In the inner-loop-unrolled version, why does the gain shrink as the unrolling factor increases? In particular, going from 8 to 16 does not help as much as going from 4 to 8. Does the result have some relationship with the depth of the CPU pipeline? If so, could the diminishing gain reflect the pipeline length?
What is the probable reason for the speedup of the data-zone version? It does not seem essentially different from the original naive version.
EDIT:
My test environment is an Intel Centrino Duo and the version of gcc is 4.4.
Any advice will be highly appreciated!
Kind regards!

What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the centrino core2 duo has two processors, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (ie, when you run the task manager (if you're on windows, top if you're on linux), you'll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get a faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last point, think about how a McDonald's drive-through works (I think that's relatively widespread?). A car enters the drive-through, orders at one window, pays at a second window, and receives food at a third window. If a second car enters while the first is still ordering, then (assuming each operation in the drive-through takes one 'cycle' or time unit) 2 full orders will be done by the time 4 cycles have elapsed. If each car did all of its operations at one window, then the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles, for a total of 6 cycles. So pipelining decreases the total operation time.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 causing a decrease in performance may have to do with bandwidth to the processor from the cache (again a guess, can't know for sure without seeing your code exactly, as well as the machine code). If all the instructions can't fit into cache or into the registers, then there is some time necessary to prepare them all to run (ie, people have to get into their cars and get to the drive through in the first place). There will be some reduction in speed if they all get there all at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code is going straight to the pointers of the arrays and accessing them directly; basically, for each of the movements from src to dst, there are fewer operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.

The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a couple of reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set by the decrement automatically), whereas counting up means you have to explicitly compare to a max value on each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache, you should see even more dramatic differences between implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 matrix, even though the number of data copies is the same.

The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
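To make the loop-control argument concrete, here is a minimal sketch of the naive rotate with its inner loop unrolled by 4. The question does not show the asker's unrolled code or the `pixel` typedef, so both the function name `rotate_unroll4` and the three-field `pixel` layout are assumptions for illustration only.

```c
/* Assumed pixel layout; the question does not show the real typedef. */
typedef struct { unsigned short red, green, blue; } pixel;

#define RIDX(i,j,n) ((i)*(n)+(j))

/* Naive rotate with the inner loop unrolled by 4.
   Requires dim to be a multiple of 4 (the lab guarantees a multiple of 32). */
void rotate_unroll4(int dim, const pixel *src, pixel *dst)
{
    for (int i = 0; i < dim; i++) {
        for (int j = 0; j < dim; j += 4) {
            /* One loop test and one counter update now cover four copies. */
            dst[RIDX(dim-1-j,   i, dim)] = src[RIDX(i, j,   dim)];
            dst[RIDX(dim-1-j-1, i, dim)] = src[RIDX(i, j+1, dim)];
            dst[RIDX(dim-1-j-2, i, dim)] = src[RIDX(i, j+2, dim)];
            dst[RIDX(dim-1-j-3, i, dim)] = src[RIDX(i, j+3, dim)];
        }
    }
}
```

Going from 4 to 8 or 16 copies per iteration removes ever fewer loop tests relative to the work done, which is the diminishing return described above. Note that the memory access pattern is unchanged by unrolling.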
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.
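The same traversal idea can be written with plain indices instead of pointer arithmetic. The sketch below is not the pastebin code; it is an index-based version under the assumptions that `dim` is a multiple of 32 and that `pixel` has the layout shown (the question does not give the typedef). `BLOCK` and `rotate_blocked` are names chosen here for illustration.

```c
/* Assumed pixel layout; the question does not show the real typedef. */
typedef struct { unsigned short red, green, blue; } pixel;

#define RIDX(i,j,n) ((i)*(n)+(j))
#define BLOCK 32   /* band height; mirrors the stride of 32 in the pastebin code */

/* Rotate in bands of BLOCK source rows. Requires dim to be a multiple of BLOCK. */
void rotate_blocked(int dim, const pixel *src, pixel *dst)
{
    for (int ii = 0; ii < dim; ii += BLOCK)           /* one band of 32 source rows   */
        for (int j = 0; j < dim; j++)                 /* walk the band column by column */
            for (int i = ii; i < ii + BLOCK; i++)     /* 32 consecutive dst elements   */
                dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
```

The inner loop writes 32 adjacent destination elements and reads one element from each of 32 source rows. On the next value of `j` it reads the neighbouring elements of those same 32 rows, so the source lines fetched a moment ago are likely still in cache instead of having been evicted by a pass over the whole matrix.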

Related

Troubles with slow speeds in opencl

I am trying to use opencl for the first time, the goal is to calculate the argmin of each row in an array. Since the operation on each row is independent of the others, I thought this would be easy to put on the graphics card.
I seem to get worse performance using this code than when I just run the code on the CPU with an outer for loop. Any help would be appreciated.
Here is the code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
int argmin(global double *array, int end)
{
double minimum = array[0];
int index = 0; // initialised so that a valid index is returned when array[0] is the minimum
for (int j = 0; j < end; j++)
{
if (array[j] < minimum)
{
minimum = array[j];
index = j;
}
}
return index;
}
kernel void execute(global double *dist, global long *res, global double *min_dist)
{
int row_size = 0;
int i = get_global_id(0);
int row_index = i * row_size;
res[i] = argmin(&dist[row_index], row_size);
min_dist[i] = dist[res[i] + row_index];
}
The commenters make some valid points, but I'll try to be a little more constructive and organised:
1. Your data appears to consist of double precision floating point values. Depending on your GPU, this can be bad news in itself. Consumer grade GPUs typically are not optimised for working with doubles, often only achieving 1/32 or 1/16 the throughput compared to single-precision float operations. Many pro-grade GPUs (Quadro, Tesla, FirePro, some Radeon Pro cards) are fine with them though, achieving 1/2 or 1/4 throughput versus float. As you're only performing a trivial arithmetic operation (comparison), and there's a good chance your runtime is dominated by memory access, it could be fine on consumer hardware too.
2. I assume your row_size is not actually 0; it would help to know what the true (typical) value is, and whether it's fixed, variable by row, or variable per run but the same for each row. In any case, unless row_size is very small, the fact that you are running a serial for loop over it could be holding your code back.
3. How big is your work size? In other words, how many rows are in your array (give a typical range if it varies)? If it is very small, you will see little benefit from GPU parallelism: GPUs have a large number of processors and can schedule a few threads per processor. So your work items will need to number hundreds or better thousands to achieve decent hardware utilisation.
4. You are reading a very large array from (presumably) system memory and not performing any intensive operations on it. This means your bottleneck will typically be on the memory access side: for discrete GPUs, system memory access needs to go through PCIe, so the speed of that link will place an upper bound on your performance. Additionally, your memory access pattern is far from ideal for GPUs: you typically want work items to read adjacent memory cells at the same time, as the memory unit typically fetches 64 bytes or more at once.
Improvement suggestions:
Profiling. If at all possible, use your GPU vendor's profiling tools to determine your true bottlenecks. Otherwise we're just guessing.
For (4) - if at all possible, try not to move large amounts of data around too much. If you can generate your input arrays on the GPU, do so, so they never leave VRAM.
For (4) - Optimise your memory accesses. AMD, NVidia and Intel all have OpenCL GPU optimisation guides which explain how to do this. Essentially, re-structure your data layout, or your kernel such that adjacent work items read adjacent pieces of memory. You ideally want work item 0 to read array item 0, work item 1 to read array item 1, etc. You may need to use local memory to coordinate between work items. Another option is to read vector-sized chunks of data per work item. (e.g. each work-item reads a double8 at a time) Watch for alignment in this case though.
For (2) & (3) - Unless row_size is very small (and fixed), try to split your loop across multiple work items and coordinate using local memory (reduction algorithms) and atomic operations in global memory.
For (1): If you've optimised everything else and profiling is telling you that comparing doubles on consumer hardware is too slow, either check if you can generate the data as floats without loss of accuracy (this will also halve your memory bandwidth woes), or check if you can otherwise do better somehow, for example by treating the double as a long and manually unpacking and comparing the exponent and mantissa using integer operations.
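As a rough illustration of the reduction suggested for (2) and (3), here is a sequential C emulation of how one work-group might compute the argmin of a single row. It is not OpenCL kernel code: `LANES` stands in for a hypothetical work-group size, the two arrays stand in for local memory, and the function name is made up for this sketch.

```c
#include <stddef.h>

#define LANES 8   /* hypothetical work-group size; must be a power of two */

/* Sequential emulation of a work-group argmin over one row of n >= 1 values. */
size_t argmin_reduce(const double *row, size_t n)
{
    double best_val[LANES];   /* stands in for local memory */
    size_t best_idx[LANES];

    /* Phase 1: lane l scans elements l, l+LANES, l+2*LANES, ...
       so adjacent lanes touch adjacent memory at every step. */
    for (size_t l = 0; l < LANES; l++) {
        size_t first = (l < n) ? l : 0;   /* lanes past the end fall back to element 0 */
        best_val[l] = row[first];
        best_idx[l] = first;
        for (size_t k = l + LANES; k < n; k += LANES) {
            if (row[k] < best_val[l]) {
                best_val[l] = row[k];
                best_idx[l] = k;
            }
        }
    }

    /* Phase 2: tree reduction, halving the number of active lanes each step.
       Ties go to the lower index, matching a serial left-to-right scan. */
    for (size_t stride = LANES / 2; stride > 0; stride /= 2) {
        for (size_t l = 0; l < stride; l++) {
            size_t o = l + stride;
            if (best_val[o] < best_val[l] ||
                (best_val[o] == best_val[l] && best_idx[o] < best_idx[l])) {
                best_val[l] = best_val[o];
                best_idx[l] = best_idx[o];
            }
        }
    }
    return best_idx[0];
}
```

In a real kernel, phase 1 would run concurrently across work items and each step of phase 2 would be separated by a local-memory barrier; whether this beats one work item per row depends on row_size and the hardware, so profile before committing to it.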

Hit / Miss rate counting by array caching

I'm reading the Computer Systems book by Bryant & O'Hallaron. There is an exercise whose solution seems to be incorrect, so I'd like to make sure.
given
struct point {
int x;
int y; };
struct point array[32][32];
for(i = 31; i >= 0; i--) {
for(j = 31; j >= 0; j--) {
sum_x += array[j][i].x;
sum_y += array[j][i].y; }}
sizeof(int) = 4;
we have 4096 byte cache with block (line) size 32 byte.
The hit rate is asked.
My reasoning was, we have 4096/32 = 128 blocks, each block can store 4 points (2*4*4 = 32), therefore the cache can store 1/2 of the array, i.e. 512 points (total 32*32 = 1024). Since the code accesses array in column major order, access to each point is miss. So we have array[j][i].x is always miss, while array[j][i].y is hit. Finally miss rate = hit rate = 1/2.
Problem: The solution says the hit rate is 3/4 because the cache can store the whole array.
But according to my reasoning the cache can store only half of the points.
Did I miss something?
The array's top four rows occupy a part of the cache:
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|*ooooooooooooooooooooooooooooooo|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx|
|...
Above is a schematic of the array as an applied mathematician would write the array on paper. Each element consists of an (x,y) pair, a point.
The four rows labeled o in the diagram comprise 128 points, enough to fill 1024 bytes, which is only one quarter of the cache, but see: in your code, the variable i is
the major loop counter and also
the array's row index (as written on paper).
So, let's look at the diagram again. How do your nested loops step through the array as diagrammed?
Answer: apparently, your loops step rightward across the top row as diagrammed, with j (column) as the minor loop counter. However, as you have observed, the array is stored by columns. Therefore, when element [j][i] == [0][0] is loaded, an entire cache line is loaded with it. And what comprises that cache line? It's the four elements marked * in the diagram.
Therefore, while your inner loop iterates across the array's top row as diagrammed, the cache misses every time, fetching four elements each time. And then for the next three rows, it's all hits.
This isn't easy to think about. It's a fine problem, nor would I expect you to grasp my answer instantly, but if you carefully consider the sequence of loads as I have explained, it should (after a bit of pondering) begin to make sense.
With the given loop nesting, the hit rate is indeed 3/4.
FURTHER DISCUSSION
In comments, you have asked a good follow-up question:
Can you write an element (e.g. array[3][14].x) that would hit?
I can. The array[j][i] == array[10][5] would hit. (Both .x and .y would hit.)
I will explain. The array[j][i] == array[10][4] would miss, whereas array[10][5], array[10][6] and array[10][7] would eventually hit. Why eventually? This is significant. Although all four of the elements I have named are loaded by cache hardware at once, array[10][5] is not accessed by your code (that is, by the CPU) when array[10][4] is accessed. Rather, after array[10][4] is accessed, array[11][4] is next accessed by the program and CPU.
The program and CPU only get around to accessing array[10][5] rather later.
And, indeed, if you think about it, this makes sense, doesn't it, because that is part of what caches do: they load additional data now, quietly as part of a cache line, so that the CPU can quickly access the additional data later if it needs it.
APPENDIX: FORTRAN/BLAS/LAPACK MATRIX ORDERING
It is standard in numerical computing to store matrices by column rather than by row. This is called column-major storage. Unfortunately, unlike the earlier Fortran programming language, the C programming language was not originally designed for numerical computing, so, in C, to store arrays by column, one must write array[column][row] == array[j][i]—which notation of course reverses the way an applied mathematician with his or her pencil would write it.
This is an artifact of the C programming language. The artifact has no mathematical significance but, when programming in C, you must remember to type [j][i]. [Were you programming in the now mostly obsolete Fortran programming language, you would type (i, j), but this isn't Fortran.]
The reason column-major storage is standard has to do with the sequence in which the CPU performs scalar, floating-point multiplications and additions when, in mathematical/pencil terminology, a matrix [A] left-operates on a column vector x. The standard Basic Linear Algebra Subroutines (BLAS) library, used by LAPACK and others, works this way. You and I should work this way, too, not only because we are likely to need to interface with BLAS and/or LAPACK but because, numerically, it's smoother.
If you've transcribed the program correctly then you're correct, the 3/4 answer is wrong.
The 3/4 answer would be correct if the indexes in the innermost sum += ... statements were arranged so that the rightmost index varied the most quickly, i.e. as:
sum_x += array[i][j].x;
sum_y += array[i][j].y;
In that case the 1st, 5th, 9th ... iterations of the loop would miss, but the line loaded into the cache by each of those misses would cause the next three iterations to hit.
However, with the program as written, every iteration misses. Each cache line that is loaded from memory supplies data for only a single point, and then that line is always replaced before the data for any of the other three points in the line is accessed.
As an example (assuming for simplicity that the address of the first member array[0][0] is aligned with the start of the cache), the reference to array[31][31] in the first pass through the loop is a miss that causes line 127 of the cache to be loaded. Line 127 now contains the data for [31][28], [31][29], [31][30] and [31][31]. However, the fetch of array[15][31] causes line 127 to be overwritten before array[31][30] is referenced, so when [31][30]'s turn eventually arrives it is a miss too. And then a miss at [15][30] replaces the line before [31][29] is referenced.
IMO your 1/2 hit ratio is overgenerous because it counts the access to the .y coordinate as a hit. However, that's not what the original 3/4 answer does. If the fetch of the .y coordinate were counted as a hit then the original answer would have been 7/8. Instead it counts each complete point, or perhaps each loop iteration, as a hit or a miss. By that measure the hit rate for the program as written in your question is a nice round 0.
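The competing counts can be checked mechanically with a small cache simulation. This sketch assumes a direct-mapped cache (the replacement argument above relies on that) and that array[0][0] sits at address 0; `cache_access` and `count_hits` are names invented for the sketch.

```c
#define CACHE_BYTES 4096
#define LINE_BYTES  32
#define NUM_LINES   (CACHE_BYTES / LINE_BYTES)   /* 128 lines */
#define DIM         32
#define POINT_BYTES 8                            /* two 4-byte ints */

/* Simulate one access to a direct-mapped cache; returns 1 on a hit. */
static int cache_access(long tags[NUM_LINES], unsigned long addr)
{
    unsigned long block = addr / LINE_BYTES;
    unsigned long set = block % NUM_LINES;
    if (tags[set] == (long)block)
        return 1;
    tags[set] = (long)block;   /* miss: the line is replaced */
    return 0;
}

/* Counts hits over the 2048 accesses (.x and .y of 1024 points).
   as_written != 0 indexes array[j][i] as in the question;
   as_written == 0 indexes array[i][j] instead. */
int count_hits(int as_written)
{
    long tags[NUM_LINES];
    int hits = 0;
    for (int k = 0; k < NUM_LINES; k++)
        tags[k] = -1;                             /* cold cache */
    for (int i = DIM - 1; i >= 0; i--) {
        for (int j = DIM - 1; j >= 0; j--) {
            int row = as_written ? j : i;
            int col = as_written ? i : j;
            unsigned long base = (unsigned long)(row * DIM + col) * POINT_BYTES;
            hits += cache_access(tags, base);     /* .x */
            hits += cache_access(tags, base + 4); /* .y */
        }
    }
    return hits;
}
```

Under these assumptions `count_hits(1)` returns 1024 of 2048 accesses (every .x misses, every .y hits, so 1/2 per access and 0 per point), while `count_hits(0)` returns 1792 of 2048 (7/8 per access, 3/4 per point).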

What is 'differential timing' technique for doing benchmarks?

At around the 39-minute mark of "Writing Fast Code I" by Andrei Alexandrescu (link here to youtube)
there is a slide of how to use differential timing... can someone show me some basic code with this approach? It was only mentioned for a second, but I think that's an interesting idea.
Run baseline 2n times (t_2a)
vs. baseline n times + contender n times (t_ab).
Relative improvement = t_2a / (2 * t_ab - t_2a)
some overhead noises canceled
Alexandrescu's slide is rather trivial to pour into code:
// Assumes #include <chrono> and a local alias such as: using clock = std::chrono::steady_clock;
auto start = clock::now();
for( int i = 0; i < 2*n; i++ )
baseline();
auto t2a = clock::now() - start;
start = clock::now();
for( int i = 0; i < n; i++ )
baseline();
// *
for( int i = 0; i < n; i++ )
contender();
auto taplusb = clock::now() - start;
double r = std::chrono::duration<double>(t2a) / std::chrono::duration<double>(2 * taplusb - t2a); // relative speedup (floating-point division)
* Synchronization point which prevents optimization across the last two loops.
I would be more interested in the mathematical reasoning behind measuring the relative speed up this way as opposed to simply tBaseline / tContender as I've been doing for ever. He only vaguely hints at '...overhead noise (being) cancelled (out)', but doesn't explain it in detail.
If you keep watching until 41:40 or so, he mentions it again when warning about the pitfall of first run vs. subsequent (allocators warmed up, etc.)
The best solution for that is doing warm-up runs before the first timed region.
I think he's picturing that 2n baseline vs. n baseline + n contender in separate invocations of the benchmark program.
So instead of doing some warmup runs before the timed region, he's using the baseline as a controlled warmup inside the timed region. This might make it possible to just time the whole program, e.g. perf stat, instead of calling a time function inside the program. Depending on how much process startup overhead your OS has vs. how long you make your repeat loop.
Microbenchmarking is hard and there are many pitfalls. Notably benchmarking optimized code while still making sure there isn't optimization between iterations of your repeat loop. (Often it's useful to use inline asm "escape" macros to force the compiler to materialize a value in an integer register, and/or to forget about the value of a variable to defeat CSE. Sometimes it's sufficient to just add the final result of each iteration to a sum that you print at the end.)
This is the first I've heard of this differential idea. It doesn't sound more useful than normal warm-ups.
If anything it will make the contender look slightly worse than using the function under test for some warm-up runs before the timed region. Using the same function as the timed region will warm up branch-prediction for it. Or not because after inlining the warm-up vs. main versions will be at different addresses. The same pattern at different addresses may possibly still help a modern TAGE predictor but IDK.
Or if contender has any lookup tables, those will become hot in cache from the warmup.
In any case, warmups are essential, unless you make the repeat count long enough to dwarf the time it takes for the CPU to switch to max turbo and so on. And to page-fault in all the memory you touch.
If your calculated time/iteration doesn't stay constant with your repeat count, your microbenchmark is broken.
Take the rest of his advice with a grain of salt, too. Most of it is useful (e.g. prefer 32-bit integers even for local temporaries, not just for arrays for cache-footprint reasons), but the reasoning is wrong for some of it.
His explanation that an ALU can do 2x 32-bit adds or 1x 64-bit add only applies to SIMD: 4x 32-bit int in a vector for paddd or 2x 64-bit int in a vector for paddq. But x86 scalar add r32, r32 has the same throughput as add r64,r64. I don't think it was true even on Pentium 4 (Nocona) despite P4 having funky double-pumped ALUs with 0.5 cycle latency for add. At least before Prescott/Nocona which introduced 64-bit support.
Using 32-bit unsigned integers on x86-64 can stop the compiler from optimizing to pointer increments if it wants to. It has to maintain correctness in case of 32-bit wraparound of a variable before array indexing.
Using 16-bit or 8-bit locals to match the data in an array can sometimes help auto-vectorization, IIRC. Gcc/clang sometimes make really braindead code that unpacks to 32-bit and then re-packs down to 8-bit elements, when processing an array of int8_t or uint8_t. I forget if I've ever actually worked around that by using narrow locals, though. C default integer promotions bring most expressions back up to 32-bit.
Also, at https://youtu.be/vrfYLlR8X8k?t=3498, he claims that FP->int is expensive. That's never been true on x86-64: FP math uses SSE/SSE2 which has an instruction that does truncating conversion. FP->int used to be slow in the bad old days of x87 math, where you had to change the FP rounding mode, fistp, then change it back, to get C truncation semantics. But SSE includes cvttsd2si exactly for that common case.
He also says float is no faster than double. That's true for scalar (other than div/sqrt), but if your code can auto-vectorize then you get twice as much work done per instruction and the instructions have the same throughput. (Twice as many elements fit in a SIMD vector.)
How the math works:
It just cancels out the n * baseline time from both parts, effectively doing (2 * baseline) / (2*contender) = baseline/contender.
It assumes that the times add normally (not overlapping computation). t_2a = 2 * baseline, and 2 * t_ab = 2 * baseline + 2 * contender. Subtracting cancels the 2*baseline parts, leaving you with 2*contender.
The trick isn't in the math. If anything, this is more mathematically dangerous, because subtracting two large numbers amplifies error: if the n*baseline part actually takes different amounts of time in the two runs (because you didn't control that perfectly), then it doesn't cancel and contributes error to your estimate.
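A tiny C helper makes the arithmetic easy to try with made-up numbers. The function name and the timings in the note below are hypothetical; only the formula comes from the slide.

```c
/* Relative speedup computed from the two measured totals, as on the slide:
   t2a = time for 2n baseline runs,
   tab = time for n baseline runs followed by n contender runs. */
double differential_speedup(double t2a, double tab)
{
    return t2a / (2.0 * tab - t2a);
}
```

With n = 10, a baseline costing 2 time units per call and a contender costing 1, the totals are t2a = 40 and tab = 30, and `differential_speedup(40.0, 30.0)` gives 2.0, the same as baseline/contender. If the baseline half of the second run happens to take 22 units instead of 20, tab becomes 32 and the estimate drops to about 1.67, which is the non-cancelling error described above.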

most efficient way to get through array?

I have been wondering what the fastest algorithm is to get through every element of a large (let's say more than 10,000 elements) unsorted int array. My first thought was to go through it linearly and check one element at a time; then my mind wandered to recursion, and I wondered whether repeatedly cutting the array into parts and checking the elements in parallel would work.
The goal I'm trying to figure out is if a number (in this kind of array) will be a multiple of a seemingly "randomly" generated int. Then after this I will progress to try and find if a subset of the large array will equate to a multiple of this number as well. (But I will get to that part another day!)
What are all of your thoughts? Questions? Comments? Concerns?
You seem to be under the false impression that the bottleneck for running through an array sequentially is the CPU: it isn't, it is your memory bus. Modern platforms are very good at predicting sequential access and doing everything to streamline it; you can't do much more than that. Parallelizing will usually not help, since you only have one memory bus, which is the bottleneck. On the contrary, you are risking false sharing, so it could even get worse.
If for some reason you are really doing a lot of computation on each element of your array, the picture changes. Then, you can start to try some parallel stuff.
For an unsorted array, linear search is as good as you can do. Cutting the array each time and then searching the elements would not help you much, instead it may slow down your program as calling functions needs stack maintenance.
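For reference, the plain linear scan for the task in the question could look like the sketch below. The function name is made up here, and k is assumed to be positive.

```c
#include <stddef.h>

/* Returns the index of the first element that is a multiple of k,
   or -1 if there is none. Assumes k > 0. */
long first_multiple(const int *a, size_t n, int k)
{
    for (size_t idx = 0; idx < n; idx++)
        if (a[idx] % k == 0)
            return (long)idx;
    return -1;
}
```

Each element is read once, in address order, which is the access pattern hardware prefetchers handle best.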
The most efficient way to process every element of a contiguous array in a single thread is sequentially. So the simplest solution is the best. Enabling compiler optimisation is likely to have a significant effect on simple iterative code.
However, if you have multiple cores and very large arrays, greater efficiency may be achieved by splitting the work across separate threads. As suggested, using a library specifically aimed at parallel processing is likely to perform better and more deterministically than simply using the OS support for threading.
Another possibility is to offload the task to a GPU, but that is hardware specific and requires GPU library support such as CUDA.
All that said 10000 elements does not seem that many - how fast do you need it to go, and how long does it currently take? You need to be measuring this if performance is of specific interest.
If you want to perform some kind of task on every element of the array, then it's not going to be possible to do any better than visiting each element once; if you did manage to somehow perform the action on N/2 elements of an N-sized array, then the only possibility is that you didn't visit half of the elements. The best case scenario is visiting every element of the array no more than once.
You can approach the problem recursively, but it's not going to be any better than a simple linear method. If you use tail recursion (the recursive call is at the end of the function), then the compiler is probably going to turn it into a loop anyway. If it doesn't turn it into a loop, then you have to deal with the additional cost of pushing onto the call stack, and you have the possibility of stack overflows for very large arrays.
The cool modern way to do it is with parallel programming. However, don't be fooled by everyone suggesting libraries; even though the run time looks faster than a linear method, each element is still being visited once. Parallelism (see OpenMP, MPI, or GPU programming) cheats by dividing the work into different execution units, like different cores in your processor or different machines on a network. However, it's very possible that the overhead of adding the parallelism will incur a larger cost than the time you'll save by dividing the work, if the problem set isn't large enough.
I do recommend looking into OpenMP; with it, one line of code can automatically divide up a task to different execution units, without you needing to handle any kind of inter-thread communication or anything nasty.
The following program shows a simple way to implement the idea of parallelization for the case you describe - the timing benchmark shows that it doesn't provide any benefit (since the inner loop "doesn't do enough work" to justify the overhead of parallelization).
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#include <stdlib.h>
#define N 1000000
int main(void) {
    int ii, jj, kk;
    int *array;
    double t1, t2;
    int threads;
    // create an array of random numbers:
    array = malloc(N * sizeof *array);
    for(ii=0; ii<N; ii++) {
        array[ii]=rand();
    }
    for(threads = 1; threads < 5; threads++) {
        jj=0;
        omp_set_num_threads(threads);
        t1=omp_get_wtime();
        // perform loop 100 times for better timing accuracy
        for(kk=0; kk<100; kk++) {
            #pragma omp parallel for reduction(+:jj)
            for(ii=0; ii<N; ii++) {
                jj+=(array[ii]%6==0)?1:0;
            }
        }
        t2=omp_get_wtime();
        printf("jj is now %d\n", jj);
        printf("with %d threads, elapsed time = %.3f ms\n", threads, 1000*(t2-t1));
    }
    return 0;
}
Compile this with
gcc -Wall -fopenmp parallel.c -o parallel
and the output is
jj is now 16613400
with 1 threads, elapsed time = 467.238 ms
jj is now 16613400
with 2 threads, elapsed time = 248.232 ms
jj is now 16613400
with 3 threads, elapsed time = 314.938 ms
jj is now 16613400
with 4 threads, elapsed time = 251.708 ms
This shows that the answer is the same regardless of the number of threads used, but the elapsed time does change. Since I am running this on a six-year-old dual-core machine, you would not expect a speed-up beyond two threads, and indeed there is none; but there is a difference between 1 thread and 2.
My point was really to show how easy it is to implement a parallel loop for the task you envisage - but also to show that it's not really worth it (for me, on my hardware).
Whether it helps for your case depends on the amount of work going on inside your innermost loop, and the number of cores available. If you are limited by memory access speed, this doesn't help; but since the modulo operation is relatively slow, it's possible that you gain a small amount of speed from doing this - and more cores, and more complex calculations, will increase the performance gain.
Final point - the omp syntax is relatively straightforward to understand. The only thing that is strange is the reduction(+:jj) statement. This means "create individual copies of jj. When you are done, add them all together."
This is how we make sure the total count of numbers divisible by 6 is kept track of across the different threads.
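To make the "individual copies, then add them together" idea concrete, here is a minimal sketch that does the same bookkeeping by hand, without OpenMP. It runs sequentially: NUM_PARTS and count_div6_by_parts are hypothetical names standing in for the thread count and the parallel loop, and each entry of partial[] plays the role of one thread's private copy of jj.

```c
#include <stddef.h>

#define NUM_PARTS 4  /* assumption: stand-in for the number of threads */

/* Emulates what reduction(+:jj) arranges: each part accumulates into its
   own private counter, and the counters are summed once at the end. */
long count_div6_by_parts(const int *a, size_t n)
{
    long partial[NUM_PARTS] = {0};  /* one private counter per part, starting at 0 */

    for (int p = 0; p < NUM_PARTS; p++) {
        size_t begin = n * p / NUM_PARTS;        /* first index of this part */
        size_t end = n * (p + 1) / NUM_PARTS;    /* one past the last index */
        for (size_t i = begin; i < end; i++)
            partial[p] += (a[i] % 6 == 0);
    }

    long total = 0;                 /* the combining step of the reduction */
    for (int p = 0; p < NUM_PARTS; p++)
        total += partial[p];
    return total;
}
```

Because no part ever writes to another part's counter, nothing needs to be locked while counting; the only shared step is the final sum.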

Use two loop bodies or one (result identical)?

I have long wondered what is more efficient with regards to making better use of CPU caches (which are known to benefit from locality of reference) - two loops each iterating over the same mathematical set of numbers, each with a different body statement (e.g. a call to a function for each element of the set), or having one loop with a body that does the equivalent of two (or more) body statements. We assume identical application state after all the looping.
In my opinion, having two loops would introduce fewer cache misses and evictions because more instructions and data used by the loop fit in the cache. Am I right?
Assuming:
The cost of a call to f or g is negligible compared to the cost of the loop
f and g use most of the cache each by itself, and so the cache would be spilled when one is called after another (the case with a single-loop version)
Intel Core Duo CPU
C language source code
The GCC compiler, "no extra switches"
I would like answers that go beyond "premature optimization is evil", if possible.
An example of the two-loops version that I am advocating for:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}
To measure is to know.
I can see three variables (even in a seemingly simple chunk of code):
What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in L2 instruction cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: The inverse does not imply "have a single loop", because:
Do f() and g() operate on large amounts of data, according to i? Then, it'd be nice to know if they operate on the same set of data - again you have to consider whether operating on two different sets screws you up via cache misses.
If f() and g() are indeed as primitive as you first state (and I'm assuming that applies to code size as well as running time and code complexity), cache locality issues won't arise in little chunks of code like this. Your biggest concern would be some other process being scheduled with actual work to do and invalidating all the caches until it is your process's turn to run again.
A final thought: given that interference from other processes like the above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline and letting the compiler unroll the loop. That is because, for the instruction cache, faulting back to L2 is no big deal, and the probability that the single cache line that'd contain i, j, k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
Intuitively one loop is better: you increment i a million fewer times and all the other operation counts remain the same.
On the other hand, it completely depends on f and g. If both are large enough that the code or cacheable data of each nearly fills a critical cache, then swapping between f and g may completely swamp any single-loop benefit.
As you say: it depends.
Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that before you reach the end you will start to evict data so that the second time (second loop) you iterate over it some if not all will have to be read again.
If the two loops were joined so that each element/block is fetched for the first operation and then is already in cache for the second operation, then no matter how large the data is relative to the cache most if not all of the second operations will take their data from the cache.
Various things, such as the nature of the cache, or the loop code being evicted by data and then evicting data when it is fetched again, may cause some misses on the second operation. On a PC with an operating system, lots of evictions will occur as other programs get time slices. But assuming an ideal world, the first operation on index i of the data will fetch it from memory, and the second operation will grab it from cache.
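A minimal sketch of that difference, assuming the two operations are a sum and a sum of squares over the same array (the function names stats_two_loops and stats_fused are made up for this example):

```c
#include <stddef.h>

/* Two passes: if data[] is larger than the cache, the second loop has to
   fetch every element from memory again. */
void stats_two_loops(const double *data, size_t n, double *sum, double *sum_sq)
{
    double s = 0.0, q = 0.0;
    for (size_t i = 0; i < n; i++)
        s += data[i];
    for (size_t i = 0; i < n; i++)
        q += data[i] * data[i];
    *sum = s;
    *sum_sq = q;
}

/* One pass: data[i] is fetched once for the first operation and is still
   at hand (in a register or in cache) for the second. */
void stats_fused(const double *data, size_t n, double *sum, double *sum_sq)
{
    double s = 0.0, q = 0.0;
    for (size_t i = 0; i < n; i++) {
        double x = data[i];
        s += x;
        q += x * x;
    }
    *sum = s;
    *sum_sq = q;
}
```

Both versions produce the same results; only the memory traffic differs, and only when the array does not fit in cache.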
Tuning for a cache is difficult at best. I regularly demonstrate this even on an embedded system with no interrupts, a single task, and the same source code: execution time can vary dramatically simply by changing compiler optimization options or by changing compilers, whether different brands or different versions, such as gcc 2.x vs 3.x vs 4.x (gcc does not necessarily produce faster code with newer versions, by the way, and a compiler that is pretty good at a lot of targets is not really good at any one particular target). The same code with different compilers or options can change execution time by several times: 3 times faster, 10 times faster, etc. Once you get into testing with or without a cache, it gets even more interesting. Add a single nop in your startup code so that your whole program moves one instruction over in memory, and your cache lines now hit in different places. Repeat this with two nops, three nops, etc. With the same compiler and the same code, you can see differences of tens of percent, better and worse (for the tests I ran that day on that target with that compiler). That doesn't mean you can't tune for a cache; it just means that figuring out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that doesn't work anymore: you might get great results on your computer that day with that program and that compiler, but tomorrow on your computer, or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster. Maybe it had nothing to do with your code; your email program may have been downloading a lot of mail in the background during one test and not during the other.
Assuming I understood your question correctly I think the single loop is probably faster in general.
Breaking the loops into smaller chunks is a good idea. It can improve the cache-hit ratio quite a lot and can make a big difference to performance.
From your example:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
    j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
    k += g(i);
}
I would either fuse the two loops into one loop like this:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
    j += f(i);
    k += g(i);
}
Or, if this is not possible, do the optimization called loop tiling:
#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps */
                       /* the working-set below your first level cache size */
int j = 0, k = 0;
int i = 0;
int elements = 1000000;
do {
    int n = i + TILE_SIZE;          /* end of the current tile (exclusive) */
    if (n > elements) n = elements;
    // perform loop A on the current tile
    for (int a = i; a < n; a++)
    {
        j += f(a);
    }
    // perform loop B on the same tile
    for (int a = i; a < n; a++)
    {
        k += g(a);
    }
    i = n;                          /* advance to the start of the next tile */
} while (i != elements);
The trick with loop tiling is that, if the loops share an access pattern, the second loop body has the chance to re-use the data that has already been read into the cache by the first loop body. This won't happen if you execute loop A a million times, because the cache is not large enough to hold all this data.
Breaking the loop into smaller chunks and executing them one after another will help a lot here. The trick is to keep the working set of memory below the size of your first-level cache. I aim for half the size of the cache, so other threads that get executed in between don't mess up my cache so much.
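The "half the cache" rule of thumb translates into a tile size as in the sketch below. The 32 KiB figure is only an assumption for illustration (check the actual L1 data cache size of your CPU), and tile_elements is a hypothetical helper name.

```c
#include <stddef.h>

/* Assumption for illustration only: a 32 KiB first-level data cache. */
#define L1_DATA_CACHE_BYTES ((size_t)32 * 1024)

/* Number of elements per tile so that one tile's working set stays within
   half of the assumed L1 data cache. bytes_per_element is the total number
   of bytes both loop bodies touch per index. */
size_t tile_elements(size_t bytes_per_element)
{
    if (bytes_per_element == 0)
        return 1;                          /* avoid dividing by zero */
    size_t budget = L1_DATA_CACHE_BYTES / 2;
    size_t n = budget / bytes_per_element;
    return n > 0 ? n : 1;                  /* always process at least one element */
}
```

For example, with 8 bytes touched per index this gives 16384 / 8 = 2048 elements per tile. Treat the result as a starting point to measure from, not as a guaranteed optimum.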
If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.
But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question if it was true and applicable to other machines.
This seems like something the compiler could optimize for you, so instead of trying to figure it out yourself and make it fast, use whichever method makes your code clearer and more readable. If you really must know, time both methods for the input size and calculation type that your application uses (try the code you have now, but repeat your calculations many, many times and disable optimization).
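A minimal sketch of such a timing comparison follows. Everything in it is an assumption made for illustration: f and g are trivial stand-ins for the real functions, and best_time, two_loops and one_loop are hypothetical names. It repeats each run and keeps the fastest time, which reduces (but does not remove) noise from other processes.

```c
#include <time.h>

#define COUNT 1000000L

/* Stand-ins for the real f and g (assumptions for this sketch). */
static long f(long i) { return i % 7; }
static long g(long i) { return i % 11; }

/* Two-loop version from the question. */
long two_loops(void)
{
    long j = 0, k = 0;
    for (long i = 0; i < COUNT; i++)
        j += f(i);
    for (long i = 0; i < COUNT; i++)
        k += g(i);
    return j + k;
}

/* Single fused loop. */
long one_loop(void)
{
    long j = 0, k = 0;
    for (long i = 0; i < COUNT; i++) {
        j += f(i);
        k += g(i);
    }
    return j + k;
}

/* Runs fn `repeats` times and returns the smallest CPU time in seconds.
   The last result is stored through result_out so the work is observable.
   Returns -1.0 if repeats is not positive. */
double best_time(long (*fn)(void), int repeats, long *result_out)
{
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t start = clock();
        long result = fn();
        clock_t stop = clock();
        double elapsed = (double)(stop - start) / CLOCKS_PER_SEC;
        if (best < 0.0 || elapsed < best)
            best = elapsed;
        if (result_out)
            *result_out = result;
    }
    return best;
}
```

Calling best_time(two_loops, 20, &r1) and best_time(one_loop, 20, &r2) and comparing the two times gives a first impression. The numbers apply only to the compiler, flags and machine used for the test.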

Resources