I am trying to use opencl for the first time, the goal is to calculate the argmin of each row in an array. Since the operation on each row is independent of the others, I thought this would be easy to put on the graphics card.
I seem to get worse performance using this code than when i just run the code on the cpu with an outer forloop, any help would be appreciated.
Here is the code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
int argmin(global double *array, int end)
{
double minimum = array[0];
int index;
for (int j = 0; j < end; j++)
{
if (array[j] < minimum)
{
minimum = array[j];
index = j;
}
}
return index;
}
kernel void execute(global double *dist, global long *res, global double *min_dist)
{
int row_size = 0;
int i = get_global_id(0);
int row_index = i * row_size;
res[i] = argmin(&dist[row_index], row_size);
min_dist[i] = dist[res[i] + row_index];
}
The commenters make some valid points, but I'll try to be a little more constructive and organised:
Your data appears to consist of double precision floating point values. Depending on your GPU, this can be bad news in itself. Consumer grade GPUs typically are not optimised for working with doubles, often only achieving 1/32 or 1/16 the throughput compared to single-precision float operations. Many pro-grade GPUs (Quadro, Tesla, FirePro, some Radeon Pro cards) are fine with them though, achieving 1/2 or 1/4 throughput versus float. As you're only performing a trivial arithmetic operation (comparison), and there's a good chance your runtime is dominated by memory access, it could be fine on consumer hardware too.
I assume your row_size is not actually 0, it would help to know what the true (typical) value is, and whether it's fixed, variable by row, or variable per run but the same for each row. In any case, unless row_size is very small, the fact that you are running a serial for loop over it could be holding your code back.
How big is your work size? In other words, how many rows in your array (give a typical range if it varies)? If it is very small, you will see little benefit from GPU parallelism: GPUs have a large number of processors and can schedule a few threads per processor. So your work items will need to number hundreds or better thousands to achieve decent hardware utilisation.
You are reading a very large array from (presumably) system memory and not performing any intensive operations on it. This means your bottleneck will typically be on the memory access side - for discrete GPUs, system memory access needs to go through PCIe, so the speed of that link will place an upper bound on your performance. Additionally, your memory access pattern far from ideal for GPUs - you typically want work items to read adjacent memory cells at the same time as the memory unit typically fetches 64 bytes or more at once.
Improvement suggestions:
Profiling. If at all possible, use your GPU vendor's profiling tools to determine your true bottlenecks. Otherwise we're just guessing.
For (4) - if at all possible, try not to move large amounts of data around too much. If you can generate your input arrays on the GPU, do so, so they never leave VRAM.
For (4) - Optimise your memory accesses. AMD, NVidia and Intel all have OpenCL GPU optimisation guides which explain how to do this. Essentially, re-structure your data layout, or your kernel such that adjacent work items read adjacent pieces of memory. You ideally want work item 0 to read array item 0, work item 1 to read array item 1, etc. You may need to use local memory to coordinate between work items. Another option is to read vector-sized chunks of data per work item. (e.g. each work-item reads a double8 at a time) Watch for alignment in this case though.
For (2) & (3) - Unless row_size is very small (and fixed), try to split your loop across multiple work items and coordinate using local memory (reduction algorithms) and atomic operations in global memory.
For (1): If you've optimised everything else and profiling is telling you that comparing doubles on consumer hardware is too slow, either check if you can generate the data as floats without loss of accuracy (this will also halve your memory bandwidth woes), or check if you can otherwise do better somehow, for example by treating the double as a long and manually unpacking and comparing the exponent and mantissa using integer operations.
Related
Abstract:
I am looking for an elegant and fast way to "rearrange" the values in my ADC Buffer for further processing.
Introduction:
on an ARM Cortex M4 Processor I am using 3 ADCs to sample analog values, with DMA and "Double Buffer Technique". When I get a "half buffer complete Interrupt" the data in the 1D array are arranged like this:
Ch1S1, Ch2S1, Ch3S1, Ch1S2, Ch2S2, Ch3S2, Ch1S3 ..... Ch1Sn-1, Ch2Sn-1, Ch3Sn-1, Ch1Sn, Ch2Sn, Ch3Sn
Where Sn stands for Sample# and CHn for Channel Number.
As I do 2x Oversampling n equals 16, the channel count is 9 in reality, in the example above it is 3
Or written in an 2D-form
Ch1S1, Ch2S1, Ch3S1,
Ch1S2, Ch2S2, Ch3S2,
Ch1S3 ...
Ch1Sn-1, Ch2Sn-1, Ch3Sn-1,
Ch1Sn, Ch2Sn, Ch3Sn
Where the rows represent the n samples and the colums represent the channels ...
I am using CMSIS-DSP to calculate all the vector stuff, like shifting, scaling, multiplication, once I have "sorted out" the channels. This part is pretty fast.
Issue:
But the code I am using for "reshaping" the 1-D Buffer array to an accumulated value for each channel is pretty poor and slow:
for(i = 0; i < ADC_BUFFER_SZ; i++) {
for(j = 0; j < MEAS_ADC_CHANNELS; j++) {
if(i) *(ADC_acc + j) += *(ADC_DMABuffer + bP); // sum up all elements
else *(ADC_acc + j) = *(ADC_DMABuffer + bP); // initialize new on first run
bP++;
}
}
After this procedure I get a 1D array with one (accumulated) U32 value per Channel, but this code is pretty slow: ~4000 Clock cycles for 16 Samples per channel / 9 Channels or ~27 Clock cycles per sample. In order to archive higher Sample rates, this needs to be many times faster, than it is right now.
Question(s):
What I am looking for is: some elegant way, using the CMSIS-DPS functions to archive the same result as above, but much faster. My gut says that I am thinking in the wrong direction, that there must be a solution within the CMSIS-DSP lib, as I am most probably not the first guy who stumbles upon this topic and I most probably won't be the last. So I'm asking for a little push in the right direction, I as guess this could be a severe case of "work-blindness" ...
I was thinking about using the dot-product function "arm_dot_prod_q31" together with an array filled with ones for the accumulation task, because I could not find the CMSIS function which would simply sum up an 1D array? But this would not solve the "reshaping" issue, I still had to copy data around and create new buffers to prepare the vectors for the "arm_dot_prod_q31" call ...
Besides that it feels somehow awkward using a dot-product, where I just want to sum up array elements …
I also thought about transforming the ADC Buffer into a 16 x 9 or 9 x 16 Matrix, but then I could not find anything where I could easily (=fast & elegant) access rows or columns, which would leave me with another issue to solve, which would eventually require to create new buffers and copying data around, as I am missing a function where I could multiply a matrix with a vector ...
Maybe someone has a hint for me, that points me in the right direction?
Thanks a lot and cheers!
ARM is a risk device, so 27 cycles is roughly equal to 27 instructions, IIRC. You may find that you're going to need a higher clock rate to meet your timing requirements. What OS are you running? Do you have access to the cache controller? You may need to lock data buffers into the cache to get high enough performance. Also, keep your sums and raw data physically close in memory as your system will allow.
I am not convinced your perf issue is entirely the consequence of how you are stepping through your data array, but here's a more streamlined approach than what you are using:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS];
for (int idxRaw = 0, int idxSum = 0; idxRaw < ADC_BUFFER_SZ; idxRaw++)
{
sums[idxSum++] += raw[idxRaw];
if (idxSum == MEAS_ADC_CHANNELS) idxSum = 0;
}
Note that I have not tested the above code, nor even tried to compile it. The algorithm is simple enough, you should be able to get working quickly.
Writing pointer math in your code, will not make it any faster. The compiler will convert array notation to efficient pointer math for you. You definitely don't need two loops.
That said, I often use a pointer for iteration:
int raw[ADC_BUFFER_SZ];
int sums[MEAS_ADC_CHANNELS];
int *itRaw = raw;
int *itRawEnd = raw + ADC_BUFFER_SZ;
int *itSums = sums;
int *itSumsEnd = itSums + MEAS_ADC_CHANNELS;
while(itRaw != itEnd)
{
*itSums += *itRaw;
itRaw++;
itSums++;
if (itSums == itSumsEnd) itSums = sums;
}
But almost never, when I am working with a mathematician or scientist, which is often the case with measurement/metrological device development. It's easier to explain the array notation to non-C reviewers, than the iterator form.
Also, if I have an algorithm description that uses the phrase "for each...", I tend to prefer the for loop form, but when the description uses "while ...", then of course I will probably use the while... form, unless I can skip one or more variable assignment statements by rearranging it to a do..while. But I often stick as close as possible to the original description until after I've passed all the testing criteria, then do rearrangement of loops for code hygiene purposes. It's easier to argue with a domain expert that their math is wrong, when you can easily convince them that you implemented what they described.
Always get it right first, then measure and make the determination whether to further hone the code. Decades ago, some C compilers for embedded systems could do a better job of optimizing one kind of loop than another. We used to have to keep a warry eye on the machine code they generated, and often developed habits that avoided those worst case scenarios. That is uncommon today, and almost certainly not the case for you ARM tool chain. But you may have to look into how your compilers optimization features work and try something different.
Do try to avoid doing value math on the same line as your pointer math. It's just confusing:
*(p1 + offset1) += *(p2 + offset2); // Can and should be avoided.
*(p1++) = *(p2++); // reasonable, especially for experienced coders/reviewers.
p1[offset1] += p2[offset2]; // Okay. Doesn't mix math notation with pointer notation.
p1[offset1 + A*B/C] += p2...; // Very bad.
// But...
int offset1 += A*B/C; // Especially helpful when stepping in the debugger.
p1[offset1]... ; // Much better.
Hence the iterator form mentioned earlier. It may reduce the lines of code, but does not reduce the complexity and definitely does increase the odds of introducing a bug at some point.
A purist could argue that p1[x] is in fact pointer notation in C, but array notation has almost, if not completely universal binding rules across languages. Intentions are obvious, even to non programmers. While the examples above are pretty trivial and most C programmers would have no problems reading any of them, it's when the number of variables involved and the complexity of the math increases, that mixing your value math with pointer math quickly becomes problematic. You'll almost never do it for anything non-trivial, so for consistency's sake, just get in the habit of avoiding it all-together.
I am currently trying to speed up a simple matrix subtraction benchmark with OpenMP on the Maestro processor, which has a NUMA architecture and is based on the Tilera Tile64 processor. The Maestro board has 49 processors arranged in a two-dimensional array in a 7x7 configuration. Each core has its own L1 and L2 cache. A layout of the board can be seen here: http://i.imgur.com/naCWTuK.png
I am new to the idea of writing applications that are 'NUMA-aware', but the main consensus from what I've read is that data locality is a big part of maximizing performance. When parallelizing code among the cores, I should keep the data being used local to the thread doing the processing as possible.
For this matrix subtraction benchmark (C[i] = A[i] - B[i]), I thought it would be a good idea to allocate each thread its own private A, B, and C arrays with the size being the total work size divided by the number of threads. So for example if the total size of the arrays were 6000*6000 and I was trying to parallelize it across 20 threads, I would allocate private arrays with size (6000*6000)/20. Each thread would do this subtraction on its own private array and then I would gather the results back into a final array of the total size 6000*6000. For example (without the gathering of results from each thread into a final array):
int threads = 20;
int size = 6000;
uint8_t *C_final = malloc(sizeof(uint8_t)*(size*size));
#pragma omp parallel num_threads(threads) private(j)
{
uint8_t *A_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
uint8_t *B_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
uint8_t *C_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
for(j=0; j<((size*size)/threads); j++)
{
A_priv[j]=100;
B_priv[j]=omp_get_thread_num();
C_priv[j]=0;
}
for(j=0; j<((size*size)/threads); j++)
{
C_priv[j] = A_priv[j]-B_priv[j];
}
}
The initial values for the arrays are arbitrary, I just have omp_get_thread_num() in there so I get different values in C_priv from each thread. I'm currently experimenting with the User Dynamic Network that the board has that provides hardware to route packets between CPUs in order to accumulate all of the individual thread results into a final resulting array.
I have achieved speedup doing it this way along with pinning the threads with OMP_PROC_BIND=true but I'm worried that accumulating the individual results into a final array may cause overhead that would negate the speedup.
Is this a proper way to go about this type of problem? What type of techniques should I look into for getting speedup on a NUMA architecture for a problem like this that uses OpenMP?
Edit:
For clarification, this is what I originally tried and where I noticed a slower execution time than if I just ran the code serially:
int threads = 20;
int size = 6000;
uint8_t *A_priv = malloc(sizeof(uint8_t)*(size*size));
uint8_t *B_priv = malloc(sizeof(uint8_t)*(size*size));
uint8_t *C_priv = malloc(sizeof(uint8_t)*(size*size));
int i;
for(i=0; i<(size*size); i++)
{
A[i] = 10;
B[i] = 5;
C[i] = 0;
}
#pragma omp parallel for num_threads(threads)
for(i=0; i<(size*size); i++)
{
C[i] = A[i] - B[i];
}
After seeing that I was getting a slower execution time when using OpenMP, I tried looking into why that is the case. It seemed as though data locality was the issue. This assumption is based on what I have read up about NUMA architectures.
I am having a hard time trying to figure out how to alleviate the bottlenecks that are slowing it down. I found some help with similar questions like this: OpenMP: for schedule where it walks about allocating data to each thread so each thread works on its local data.
I just feel like something as simple as a matrix subtraction should not be difficult to get increased performance when using OpenMP. I'm not sure how to go about figuring out what exactly the bottleneck is and how to alleviate it.
On a quick search and scan of the TILE64 datasheet, it doesn't look like the architecture exposes performance counters like what you'd use on x86 via tools like oprofile, VTune or xperf. Without those, you'll have to devise some experiments of your own to iteratively narrow down on what portion of the code is hot and why - in the absence of microarchitectural docs along with tools to indicate how your code is exercising the hardware, a bit of a reverse engineering task.
Some ideas about where to start on that:
Do some scaling experiments. Is there a knee in the curve where going over a certain problem size or number of threads has a big effect on the overall performance? Does that number hint at some clear relationship with the size of a certain level in the memory hierarchy, or a dimension of the grid of processors, or similar?
Record execution times at a few points through the program. It would probably be useful to know, for example, at a high level how much time is spent on the mallocs vs. the first loop vs. the second.
"I have achieved speedup doing it this way along with pinning the threads with OMP_PROC_BIND=true but I'm worried that accumulating the individual results into a final array may cause overhead that would negate the speedup." - this worry is also empirically testable, especially if you're working on a large enough problem size that your timer accuracy as in (2) is not an issue for isolating time taken for the gather step vs. the part that's completely parallelizable.
Try a different operation - say, addition or element-wise division instead of subtraction and see if that changes the results. On many architectures different arithmetic operations have different latency and throughput. If you looked up and found that that was the case for the TILE64, making a change like this and instrumenting the runtime of your second example might tell you something useful about how much of the time spent over running it serially actually has to do with data locality issues vs. startup time or other overhead related to the OpenMP runtime that might have more to do in the overall results with its relationship to a small problem size than with the properly parallel part of the parallel implementation actually running slower.
You could examine generated assembly. The assumption that the compiler would do basically the same things in the examples you've posted seems reasonable, but doesn't necessarily hold as strongly as you would want it to when looking at odd performance. Maybe there's something about the code size or layout that changes with/without OpenMP or when moving from one parallel approach to another, like use of instruction cache, availability of reservation station or ROB entries (if the TILE64 has those things)...? Who knows, until you look.
I have been thinking and was wondering what the fastest algorithm is to get through every element of a (large - lets say more than say 10,000 sized) unsorted int array. My first thought was to go through the linear motion and check every element at a time - then my mind wandered to recursion and wondered if cutting the array into parallels each time and check the elements would be fine.
The goal I'm trying to figure out is if a number (in this kind of array) will be a multiple of a seemingly "randomly" generated int. Then after this I will progress to try and find if a subset of the large array will equate to a multiple of this number as well. (But I will get to that part another day!)
What are all of your thoughts? Questions? Comments? Concerns?
You seem under the false impression that the bottleneck for running through an array sequentially ist the CPU: it isn't, it is your memory bus. Modern platforms are very good in predicting sequential access and doing everything to streamline the access, you can't do much more than that. Parallelizing will usually not help, since you only have one memory bus, which is the bottleneck, in the contrary you are risking false sharing so it could even get worse.
If for some reason you are really doing a lot of computation on each element of your array, the picture changes. Then, you can start to try some parallel stuff.
For an unsorted array, linear search is as good as you can do. Cutting the array each time and then searching the elements would not help you much, instead it may slow down your program as calling functions needs stack maintenance.
The most efficient way to process every element of a contiguous array in a single thread is sequentially. So the simplest solution is the best. Enabling compiler optimisation is likely to have a significant effect on simple iterative code.
However if you have multiple cores, and very large arrays, greater efficiency may be achieved by separating the tasks into separate threads. As suggested a using a library specifically aimed at performing parallel processing is likely to perform better and more deterministically that simply using the OS support for threading.
Another possibility is to offload the task to a GPU, but that is hardware specific and requires GPU library support such as CUDA.
All that said 10000 elements does not seem that many - how fast do you need it to go, and how long does it currently take? You need to be measuring this if performance is of specific interest.
If you want to perform some kind of task on every element of the array, then it's not going to be possible to do any better than visiting each element once; if you did manage to somehow perform the action on N/2 elements of an N-sized array, then the only possibility is that you didn't visit half of the elements. The best case scenario is visiting every element of the array no more than once.
You can approach the problem recursively, but it's not going to be any better than a simple linear method. If you use tail recursion (the recursive call is at the end of the function), then the compiler is probably going to turn it into a loop anyway. If it doesn't turn it into a loop, then you have to deal with the additional cost of pushing onto the call stack, and you have the possibility of stack overflows for very large arrays.
The cool modern way to do it is with parallel programming. However, don't be fooled by everyone suggesting libraries; even though the run time looks faster than a linear method, each element is still being visited once. Parallelism (see OpenMP, MPI, or GPU programming) cheats by dividing the work into different execution units, like different cores in your processor or different machines on a network. However, it's very possible that the overhead of adding the parallelism will incur a larger cost than the time you'll save by dividing the work, if the problem set isn't large enough.
I do recommend looking into OpenMP; with it, one line of code can automatically divide up a task to different execution units, without you needing to handle any kind of inter-thread communication or anything nasty.
The following program shows a simple way to implement the idea of parallelization for the case you describe - the timing benchmark shows that it doesn't provide any benefit (since the inner loop "doesn't do enough work" to justify the overhead of parallelization).
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#include <stdlib.h>
#define N 1000000
int main(void) {
int ii,jj, kk;
int *array;
double t1, t2;
int threads;
// create an array of random numbers:
array = malloc(N * sizeof *array);
for(ii=0; ii<N; ii++) {
array[ii]=rand();
}
for(threads = 1; threads < 5; threads++) {
jj=0;
omp_set_num_threads(threads);
t1=omp_get_wtime();
// perform loop 100 times for better timing accuracy
for(kk=0; kk<100; kk++) {
#pragma omp parallel for reduction(+:jj)
for(ii=0; ii<N; ii++) {
jj+=(array[ii]%6==0)?1:0;
}
}
t2=omp_get_wtime();
printf("jj is now %d\n", jj);
printf("with %d threads, elapsed time = %.3f ms\n", threads, 1000*(t2-t1));
}
return 0;
}
Compile this with
gcc -Wall -fopenmp parallel.c -o parallel
and the output is
jj is now 16613400
with 1 threads, elapsed time = 467.238 ms
jj is now 16613400
with 2 threads, elapsed time = 248.232 ms
jj is now 16613400
with 3 threads, elapsed time = 314.938 ms
jj is now 16613400
with 4 threads, elapsed time = 251.708 ms
This shows that the answer is the same, regardless of the number of threads used; but the amount of time taken does change a little bit. Since I am doing this on a 6 year old dual core machine, you don't actually expect a speed-up with more than two threads, and indeed you don't see one; but there is a difference between 1 thread and 2.
My point was really to show how easy it is to implement a parallel loop for the task you envisage - but also to show that it's not really worth it (for me, on my hardware).
Whether it helps for your case depends on the amount of work going on inside your innermost loop, and the number of cores available. If you are limited by memory access speed, this doesn't help; but since the modulo operation is relatively slow, it's possible that you gain a small amount of speed from doing this - and more cores, and more complex calculations, will increase the performance gain.
Final point - the omp syntax is relatively straightforward to understand. The only thing that is strange is the reduction(+:jj) statement. This means "create individual copies of jj. When you are done, add them all together."
This is how we make sure the total count of numbers divisible by 6 is kept track of across the different threads.
A matrix is created in processor 0 and scattered to other processors. A matrix is a symmetric dense matrix. That's why it is initialized in processor 0.
A matrix is created in this way:
A=malloc(sizeof(double)*N*N);
for (i=0; i<N; i++)
for(j=0; j<N; j++)
A(i,j)=rand()%10; // The code will be changed.
A(i,j) is defined as:
#define A(i,j) A[i*N+j]
and N has to be 100,000 to test the algorithm.
The problem here is: if N=100,000 then the memory needed is approximately 76GB. What do you suggest to store the A matrix?
PS: Algorithm works very well when N<20.000 and the cluster is a distrubed memory system(2GB RAM per processor)
If you are doing this, as stated in comments, to do a scaling test, then Oli Charlesworth is completely right; anything you do is going to make this an apples-to-oranges comparison, because your node doesn't have 76GB to use. Which is fine; one of the big reasons to use MPI is to tackle problems that couldn't fit on one node. But by trying to shoehorn 76GB of data onto one processor, the comparison you're doing isn't going to make any sense. As mentioned by both Oli Charlesworth and caf, through various methods you can use disk instead of RAM, but then your 1 processor answer is going not going to be directly comparable to the fits-in-RAM numbers you get from larger number of nodes, so you're going to be going to a lot of work to get a number which won't actually mean anything.
If you want scaling results on this sort of problem, you either start with the lowest number of nodes that the problem does fit on, and take data at increasing numbers of processors, or you do weak scaling, rather than strong scaling tests -- you keep the work-per-processor constant while scaling up the number of processors, rather than the total work being constant.
Incidentally, however you do the measurements, you'll end up with better results if, as Oli Charlesworth suggests, you have each procesor generate its own data rather than have a serial bottleneck by having rank 0 do the generation of the matrix and then have all the processors receive their parts.
If you are programming on a POSIX system with sufficient virtual address space (which in practice will mean a 64 bit system), you can use mmap().
Either create an anonymous mapping of the required size (this will be swap-backed, which will mean you'll need at least 76GB of swap), or create a real file of the required size and map that.
The file-backed solution has the advantage that if your cluster has a shared file system, you don't need to explicitly transfer the matrix to each processor - you can simply msync() it after creating it, and then map the right region on each processor.
If you can switch to C++, you might look into STXXL, which is an STL implementation specifically designed for huge datasets, with transparent disk-backed support, etc.
I'm now trapped by a performance optimization lab in the book "Computer System from a Programmer's Perspective" described as following:
In a N*N matrix M, where N is multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
A example for matrix rotation(N is 3 instead of 32 for simplicity):
------- -------
|1|2|3| |3|6|9|
------- -------
|4|5|6| after rotate is |2|5|8|
------- -------
|7|8|9| |1|4|7|
------- -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I come up with an idea by inner-loop-unroll. The result is:
Code Version Speed Up
original 1x
unrolled by 2 1.33x
unrolled by 4 1.33x
unrolled by 8 1.55x
unrolled by 16 1.67x
unrolled by 32 1.61x
I also get a code snippet from pastebin.com that seems can solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
int stride = 32;
int count = dim >> 5;
src += dim - 1;
int a1 = count;
do {
int a2 = dim;
do {
int a3 = stride;
do {
*dst++ = *src;
src += dim;
} while(--a3);
src -= dim * stride + 1;
dst += dim - stride;
} while(--a2);
src += dim * (stride + 1);
dst -= dim * dim - stride;
} while(--a1);
}
After carefully read the code, I think main idea of this solution is treat 32 rows as a data zone, and perform the rotating operation respectively. Speed up of this version is 1.85x, overwhelming all the loop-unroll version.
Here are the questions:
In the inner-loop-unroll version, why does increment slow down if the unrolling factor increase, especially change the unrolling factor from 8 to 16, which does not effect the same when switch from 4 to 8? Does the result have some relationship with depth of the CPU pipeline? If the answer is yes, could the degrade of increment reflect pipeline length?
What is the probable reason for the optimization of data-zone version? It seems that there is no too much essential difference from the original naive version.
EDIT:
My test environment is Intel Centrino Duo architecture and the verion of gcc is 4.4
Any advice will be highly appreciated!
Kind regards!
What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the centrino core2 duo has two processors, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (ie, when you run the task manager (if you're on windows, top if you're on linux), you'll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get a faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last, think about how a McDonald's drive through works (I think that that's relatively widespread?). A car enters the drivethrough, orders at one window, pays at a second window, and receives food at a third window. If a second drive enters when the first is still ordering, then by the time both finish (assuming each operation in the drive through takes one 'cycle' or time unit), then 2 full operations will be done by the time 4 cycles have elapsed. If each car did all of their operations at one window, then the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles for ordering, paying and getting food, for a total of 6 cycles. So, operation time due to pipelining decreases.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 causing a decrease in performance may have to do with bandwidth to the processor from the cache (again a guess, can't know for sure without seeing your code exactly, as well as the machine code). If all the instructions can't fit into cache or into the registers, then there is some time necessary to prepare them all to run (ie, people have to get into their cars and get to the drive through in the first place). There will be some reduction in speed if they all get there all at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code is going straight to the pointers of the arrays and accessing them directly; basically, for each of the movements from src to dst, there are less operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.
The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a coupe reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set by the decrement automatically) you have to explicitly compare to a max value with each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache you should should see even more dramatic differences in implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 even though the number of data copies is the same.
The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.