I am currently trying to speed up a simple matrix subtraction benchmark with OpenMP on the Maestro processor, which has a NUMA architecture and is based on the Tilera Tile64 processor. The Maestro board has 49 processors arranged in a two-dimensional array in a 7x7 configuration. Each core has its own L1 and L2 cache. A layout of the board can be seen here: http://i.imgur.com/naCWTuK.png
I am new to the idea of writing applications that are 'NUMA-aware', but the main consensus from what I've read is that data locality is a big part of maximizing performance. When parallelizing code among the cores, I should keep the data being used as local as possible to the thread doing the processing.
For this matrix subtraction benchmark (C[i] = A[i] - B[i]), I thought it would be a good idea to allocate each thread its own private A, B, and C arrays with the size being the total work size divided by the number of threads. So for example if the total size of the arrays were 6000*6000 and I was trying to parallelize it across 20 threads, I would allocate private arrays with size (6000*6000)/20. Each thread would do this subtraction on its own private array and then I would gather the results back into a final array of the total size 6000*6000. For example (without the gathering of results from each thread into a final array):
int threads = 20;
int size = 6000;
int j;
uint8_t *C_final = malloc(sizeof(uint8_t)*(size*size));
#pragma omp parallel num_threads(threads) private(j)
{
    uint8_t *A_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
    uint8_t *B_priv = malloc(sizeof(uint8_t)*((size*size)/threads));
    uint8_t *C_priv = malloc(sizeof(uint8_t)*((size*size)/threads));

    for(j=0; j<((size*size)/threads); j++)
    {
        A_priv[j] = 100;
        B_priv[j] = omp_get_thread_num();
        C_priv[j] = 0;
    }

    for(j=0; j<((size*size)/threads); j++)
    {
        C_priv[j] = A_priv[j] - B_priv[j];
    }
}
The initial values for the arrays are arbitrary; I just use omp_get_thread_num() there so that each thread produces different values in C_priv. I'm currently experimenting with the User Dynamic Network that the board provides, which has hardware for routing packets between CPUs, in order to accumulate all of the individual thread results into a final resulting array.
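For reference, a plain shared-memory version of that gather (ignoring the User Dynamic Network) might look roughly like the sketch below: each thread simply memcpy's its private slice into C_final at an offset derived from its thread number. This is only an illustration of the approach described above, not the UDN-based version.
/* Sketch: per-thread subtraction followed by a plain shared-memory gather.
   Each thread owns a slice of (size*size)/threads elements. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <omp.h>

void subtract_and_gather(uint8_t *C_final, int size, int threads)
{
    size_t chunk = (size_t)size * size / threads;

    #pragma omp parallel num_threads(threads)
    {
        uint8_t *A_priv = malloc(chunk);
        uint8_t *B_priv = malloc(chunk);
        uint8_t *C_priv = malloc(chunk);
        size_t j, offset = (size_t)omp_get_thread_num() * chunk;

        for (j = 0; j < chunk; j++) {
            A_priv[j] = 100;
            B_priv[j] = (uint8_t)omp_get_thread_num();
            C_priv[j] = A_priv[j] - B_priv[j];
        }

        /* gather: copy this thread's result slice into the shared array */
        memcpy(C_final + offset, C_priv, chunk);

        free(A_priv); free(B_priv); free(C_priv);
    }
}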
I have achieved speedup doing it this way along with pinning the threads with OMP_PROC_BIND=true but I'm worried that accumulating the individual results into a final array may cause overhead that would negate the speedup.
Is this a proper way to go about this type of problem? What type of techniques should I look into for getting speedup on a NUMA architecture for a problem like this that uses OpenMP?
Edit:
For clarification, this is what I originally tried and where I noticed a slower execution time than if I just ran the code serially:
int threads = 20;
int size = 6000;
uint8_t *A = malloc(sizeof(uint8_t)*(size*size));
uint8_t *B = malloc(sizeof(uint8_t)*(size*size));
uint8_t *C = malloc(sizeof(uint8_t)*(size*size));
int i;
for(i=0; i<(size*size); i++)
{
    A[i] = 10;
    B[i] = 5;
    C[i] = 0;
}
#pragma omp parallel for num_threads(threads)
for(i=0; i<(size*size); i++)
{
    C[i] = A[i] - B[i];
}
After seeing that I was getting a slower execution time when using OpenMP, I tried looking into why that was the case. It seemed as though data locality was the issue; this assumption is based on what I have read about NUMA architectures.
I am having a hard time figuring out how to alleviate the bottlenecks that are slowing it down. I found some help in similar questions like this: OpenMP: for schedule, which talks about allocating data to each thread so that each thread works on its local data.
I just feel like something as simple as a matrix subtraction should not be difficult to get increased performance from when using OpenMP. I'm not sure how to go about figuring out what exactly the bottleneck is and how to alleviate it.
On a quick search and scan of the TILE64 datasheet, it doesn't look like the architecture exposes performance counters like those you'd use on x86 via tools such as oprofile, VTune or xperf. Without those, you'll have to devise experiments of your own to iteratively narrow down which portion of the code is hot and why - in the absence of microarchitectural docs and tools that show how your code is exercising the hardware, it's a bit of a reverse-engineering task.
Some ideas about where to start on that:
Do some scaling experiments. Is there a knee in the curve where going over a certain problem size or number of threads has a big effect on the overall performance? Does that number hint at some clear relationship with the size of a certain level in the memory hierarchy, or a dimension of the grid of processors, or similar?
Record execution times at a few points through the program. It would probably be useful to know, for example, at a high level how much time is spent on the mallocs vs. the first loop vs. the second (see the timing sketch after this list).
"I have achieved speedup doing it this way along with pinning the threads with OMP_PROC_BIND=true but I'm worried that accumulating the individual results into a final array may cause overhead that would negate the speedup." - this worry is also empirically testable, especially if you're working on a large enough problem size that your timer accuracy as in (2) is not an issue for isolating time taken for the gather step vs. the part that's completely parallelizable.
Try a different operation - say, addition or element-wise division instead of subtraction - and see if that changes the results. On many architectures, different arithmetic operations have different latency and throughput. If you looked it up and found that to be the case for the TILE64, instrumenting the runtime of your second example with such a change might tell you something useful: how much of the extra time relative to the serial run is really due to data locality, and how much is startup or other overhead in the OpenMP runtime - overhead whose weight in the overall results may have more to do with a small problem size than with the properly parallel part actually running slower.
You could examine generated assembly. The assumption that the compiler would do basically the same things in the examples you've posted seems reasonable, but doesn't necessarily hold as strongly as you would want it to when looking at odd performance. Maybe there's something about the code size or layout that changes with/without OpenMP or when moving from one parallel approach to another, like use of instruction cache, availability of reservation station or ROB entries (if the TILE64 has those things)...? Who knows, until you look.
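As a concrete starting point for the timing suggestion above, a minimal skeleton using omp_get_wtime() might look like the following sketch; the phase boundaries (alloc/init/subtract/gather) are just placeholders standing in for the corresponding parts of your program.
#include <stdio.h>
#include <omp.h>

int main(void)
{
    double t0 = omp_get_wtime();
    /* ... allocate A, B, C (and any per-thread buffers) here ... */
    double t_alloc = omp_get_wtime();
    /* ... initialization loop here ... */
    double t_init = omp_get_wtime();
    /* ... parallel subtraction loop here ... */
    double t_sub = omp_get_wtime();
    /* ... gather of per-thread results here ... */
    double t_gather = omp_get_wtime();

    printf("alloc:  %.3f ms\n", 1000.0 * (t_alloc  - t0));
    printf("init:   %.3f ms\n", 1000.0 * (t_init   - t_alloc));
    printf("sub:    %.3f ms\n", 1000.0 * (t_sub    - t_init));
    printf("gather: %.3f ms\n", 1000.0 * (t_gather - t_sub));
    return 0;
}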
I am trying to use OpenCL for the first time; the goal is to calculate the argmin of each row in an array. Since the operation on each row is independent of the others, I thought this would be easy to put on the graphics card.
I seem to get worse performance using this code than when I just run it on the CPU with an outer for loop; any help would be appreciated.
Here is the code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable

int argmin(global double *array, int end)
{
    double minimum = array[0];
    int index = 0;   /* initialise so the result is defined even if array[0] is the minimum */
    for (int j = 0; j < end; j++)
    {
        if (array[j] < minimum)
        {
            minimum = array[j];
            index = j;
        }
    }
    return index;
}

kernel void execute(global double *dist, global long *res, global double *min_dist)
{
    int row_size = 0;
    int i = get_global_id(0);
    int row_index = i * row_size;
    res[i] = argmin(&dist[row_index], row_size);
    min_dist[i] = dist[res[i] + row_index];
}
The commenters make some valid points, but I'll try to be a little more constructive and organised:
Your data appears to consist of double precision floating point values. Depending on your GPU, this can be bad news in itself. Consumer grade GPUs typically are not optimised for working with doubles, often only achieving 1/32 or 1/16 the throughput compared to single-precision float operations. Many pro-grade GPUs (Quadro, Tesla, FirePro, some Radeon Pro cards) are fine with them though, achieving 1/2 or 1/4 throughput versus float. As you're only performing a trivial arithmetic operation (comparison), and there's a good chance your runtime is dominated by memory access, it could be fine on consumer hardware too.
I assume your row_size is not actually 0, it would help to know what the true (typical) value is, and whether it's fixed, variable by row, or variable per run but the same for each row. In any case, unless row_size is very small, the fact that you are running a serial for loop over it could be holding your code back.
How big is your work size? In other words, how many rows are in your array (give a typical range if it varies)? If it is very small, you will see little benefit from GPU parallelism: GPUs have a large number of processors and can schedule a few threads per processor, so your work items will need to number in the hundreds or, better, thousands to achieve decent hardware utilisation.
You are reading a very large array from (presumably) system memory and not performing any intensive operations on it. This means your bottleneck will typically be on the memory access side - for discrete GPUs, system memory access needs to go through PCIe, so the speed of that link will place an upper bound on your performance. Additionally, your memory access pattern is far from ideal for GPUs - you typically want work items to read adjacent memory cells at the same time, as the memory unit typically fetches 64 bytes or more at once.
Improvement suggestions:
Profiling. If at all possible, use your GPU vendor's profiling tools to determine your true bottlenecks. Otherwise we're just guessing.
For (4) - if at all possible, try not to move large amounts of data around too much. If you can generate your input arrays on the GPU, do so, so they never leave VRAM.
For (4) - Optimise your memory accesses. AMD, NVidia and Intel all have OpenCL GPU optimisation guides which explain how to do this. Essentially, re-structure your data layout, or your kernel, such that adjacent work items read adjacent pieces of memory. You ideally want work item 0 to read array item 0, work item 1 to read array item 1, etc. You may need to use local memory to coordinate between work items. Another option is to read vector-sized chunks of data per work item (e.g. each work item reads a double8 at a time); watch out for alignment in this case though. A host-side sketch of the adjacent-access layout idea follows this list.
For (2) & (3) - Unless row_size is very small (and fixed), try to split your loop across multiple work items and coordinate using local memory (reduction algorithms) and atomic operations in global memory.
For (1): If you've optimised everything else and profiling is telling you that comparing doubles on consumer hardware is too slow, either check if you can generate the data as floats without loss of accuracy (this will also halve your memory bandwidth woes), or check if you can otherwise do better somehow, for example by treating the double as a long and manually unpacking and comparing the exponent and mantissa using integer operations.
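To make the layout suggestion in (4) concrete, here is a hedged host-side sketch that stores the distance matrix column-major before uploading it; num_rows and row_size are hypothetical names for your dimensions. The kernel would then index dist_cm[j * num_rows + i] at step j of its loop, so neighbouring work items read neighbouring addresses.
/* Hypothetical host-side re-layout: convert the row-major dist array to
   column-major, so that adjacent work items (one per row) read adjacent
   memory at each step j of their inner loop. */
#include <stddef.h>

void to_column_major(const double *dist, double *dist_cm,
                     size_t num_rows, size_t row_size)
{
    for (size_t i = 0; i < num_rows; i++)        /* i: row = one work item */
        for (size_t j = 0; j < row_size; j++)    /* j: element within the row */
            dist_cm[j * num_rows + i] = dist[i * row_size + j];
}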
I have been wondering what the fastest algorithm is to go through every element of a large (let's say more than 10,000 elements) unsorted int array. My first thought was a simple linear pass, checking one element at a time - then my mind wandered to recursion, and I wondered whether cutting the array into parallel pieces each time and checking the elements would be fine.
The goal I'm trying to figure out is whether a number in this kind of array will be a multiple of a seemingly "randomly" generated int. After that I will try to find whether a subset of the large array also equates to a multiple of this number. (But I will get to that part another day!)
What are all of your thoughts? Questions? Comments? Concerns?
You seem to be under the false impression that the bottleneck for running through an array sequentially is the CPU: it isn't, it is your memory bus. Modern platforms are very good at predicting sequential access and do everything to streamline it, so you can't do much better than that. Parallelizing will usually not help, since you only have one memory bus, which is the bottleneck - on the contrary, you risk false sharing, so it could even get worse.
If for some reason you are really doing a lot of computation on each element of your array, the picture changes. Then, you can start to try some parallel stuff.
For an unsorted array, linear search is as good as you can do. Cutting the array each time and then searching the pieces would not help much; instead it may slow down your program, since calling functions requires stack maintenance.
The most efficient way to process every element of a contiguous array in a single thread is sequentially. So the simplest solution is the best. Enabling compiler optimisation is likely to have a significant effect on simple iterative code.
However, if you have multiple cores and very large arrays, greater efficiency may be achieved by separating the tasks into separate threads. As suggested, using a library specifically aimed at parallel processing is likely to perform better and more deterministically than simply using the OS's threading support.
Another possibility is to offload the task to a GPU, but that is hardware specific and requires GPU library support such as CUDA.
All that said, 10,000 elements does not seem that many - how fast do you need it to go, and how long does it currently take? You need to measure this if performance is of specific interest.
If you want to perform some kind of task on every element of the array, then it's not going to be possible to do any better than visiting each element once; if you did manage to somehow perform the action on N/2 elements of an N-sized array, then the only possibility is that you didn't visit half of the elements. The best case scenario is visiting every element of the array no more than once.
You can approach the problem recursively, but it's not going to be any better than a simple linear method. If you use tail recursion (the recursive call is at the end of the function), then the compiler is probably going to turn it into a loop anyway. If it doesn't turn it into a loop, then you have to deal with the additional cost of pushing onto the call stack, and you have the possibility of stack overflows for very large arrays.
The cool modern way to do it is with parallel programming. However, don't be fooled by everyone suggesting libraries; even though the run time looks faster than a linear method, each element is still being visited once. Parallelism (see OpenMP, MPI, or GPU programming) cheats by dividing the work into different execution units, like different cores in your processor or different machines on a network. However, it's very possible that the overhead of adding the parallelism will incur a larger cost than the time you'll save by dividing the work, if the problem set isn't large enough.
I do recommend looking into OpenMP; with it, one line of code can automatically divide up a task to different execution units, without you needing to handle any kind of inter-thread communication or anything nasty.
The following program shows a simple way to implement the idea of parallelization for the case you describe - the timing benchmark shows that it doesn't provide any benefit (since the inner loop "doesn't do enough work" to justify the overhead of parallelization).
#include <stdio.h>
#include <time.h>
#include <math.h>
#include <omp.h>
#include <stdlib.h>
#define N 1000000
int main(void) {
int ii,jj, kk;
int *array;
double t1, t2;
int threads;
// create an array of random numbers:
array = malloc(N * sizeof *array);
for(ii=0; ii<N; ii++) {
array[ii]=rand();
}
for(threads = 1; threads < 5; threads++) {
jj=0;
omp_set_num_threads(threads);
t1=omp_get_wtime();
// perform loop 100 times for better timing accuracy
for(kk=0; kk<100; kk++) {
#pragma omp parallel for reduction(+:jj)
for(ii=0; ii<N; ii++) {
jj+=(array[ii]%6==0)?1:0;
}
}
t2=omp_get_wtime();
printf("jj is now %d\n", jj);
printf("with %d threads, elapsed time = %.3f ms\n", threads, 1000*(t2-t1));
}
return 0;
}
Compile this with
gcc -Wall -fopenmp parallel.c -o parallel
and the output is
jj is now 16613400
with 1 threads, elapsed time = 467.238 ms
jj is now 16613400
with 2 threads, elapsed time = 248.232 ms
jj is now 16613400
with 3 threads, elapsed time = 314.938 ms
jj is now 16613400
with 4 threads, elapsed time = 251.708 ms
This shows that the answer is the same, regardless of the number of threads used; but the amount of time taken does change a little bit. Since I am doing this on a 6 year old dual core machine, you don't actually expect a speed-up with more than two threads, and indeed you don't see one; but there is a difference between 1 thread and 2.
My point was really to show how easy it is to implement a parallel loop for the task you envisage - but also to show that it's not really worth it (for me, on my hardware).
Whether it helps for your case depends on the amount of work going on inside your innermost loop, and the number of cores available. If you are limited by memory access speed, this doesn't help; but since the modulo operation is relatively slow, it's possible that you gain a small amount of speed from doing this - and more cores, and more complex calculations, will increase the performance gain.
Final point - the omp syntax is relatively straightforward to understand. The only thing that is strange is the reduction(+:jj) statement. This means "create individual copies of jj. When you are done, add them all together."
This is how we make sure the total count of numbers divisible by 6 is kept track of across the different threads.
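As a rough illustration (not how the OpenMP runtime literally implements it), the reduction is morally equivalent to giving each thread its own partial count and combining the partials at the end:
#include <omp.h>

/* Count multiples of 6 the "manual" way; array and n correspond to the
   array and N used in the example above. */
int count_multiples_of_6(const int *array, int n)
{
    int jj_total = 0;
    #pragma omp parallel
    {
        int jj_local = 0;                 /* this thread's private copy of jj */
        #pragma omp for
        for (int ii = 0; ii < n; ii++)
            jj_local += (array[ii] % 6 == 0) ? 1 : 0;
        #pragma omp atomic
        jj_total += jj_local;             /* combine the private copies */
    }
    return jj_total;
}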
As a general question to those working on optimization and performance tuning of programs: how do you figure out whether your code is CPU bound or memory bound? I understand these concepts in general, but if I have, say, 'y' loads and stores and '2y' computations, how does one go about finding what the bottleneck is?
Also, can you figure out where exactly you are spending most of your time, and say that if you load 'x' amount of data into the cache on every loop iteration (if it's memory bound), then your code will run faster? Is there any precise way to determine this 'x', other than trial and error?
Are there any tools you would use, say on the IA-32 or IA-64 architecture? Does VTune help?
For example, I'm currently doing the following:
I have 26 8*8 matrices of complex doubles and I have to perform a MVM (matrix vector multiplication) with (~4000) vectors of length 8, for each of these 26 matrices. I use SSE to perform the complex multiplication.
/*Copy 26 matrices to temporary storage*/
for(int i=0;i<4000;i+=2){//Loop over the 4000 vectors
for(int k=0;k<26;k++){//Loop over the 26 matrices
/*
Perform MVM in blocks of '2' between kth matrix and
'i' and 'i+1' vector
*/
}
}
The 26 matrices take 26 KB (the L1 cache is 32 KB), and I have laid the vectors out in memory such that I have stride-1 accesses. Once I perform the MVM on a vector with the 26 matrices, I don't visit that vector again, so I don't think cache blocking will help. I have used vectorization, but I'm still stuck at 60% of peak performance.
I tried copying, say, 64 vectors into temporary storage for every iteration of the outer loop, thinking they would be in cache and help, but that only decreased performance. I tried using _mm_prefetch() in the following way: when I am done with about half the matrices, I prefetch the next 'i' and 'i+1' vectors, but that hasn't helped either.
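For reference, the kind of prefetch described above might be expressed as in the sketch below. This is only an illustration; vectors and vec_len are hypothetical names for the vector storage and its per-vector length, and the prefetch distance and hint would need tuning.
#include <stddef.h>
#include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

/* Hint the next pair of vectors (i+2, i+3) into cache while the current
   pair (i, i+1) is still being processed. Assumes the vectors are stored
   contiguously, vec_len doubles apart (hypothetical layout). */
static inline void prefetch_next_pair(const double *vectors, int i, int vec_len)
{
    _mm_prefetch((const char *)&vectors[(size_t)(i + 2) * vec_len], _MM_HINT_T0);
    _mm_prefetch((const char *)&vectors[(size_t)(i + 3) * vec_len], _MM_HINT_T0);
}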
I have done all this assuming it's memory bound, but I want to know for sure. Is there a way?
To my understanding, the best way is to profile your application/workload. Depending on the input data, the characteristics of the application/workload can vary significantly. That behaviour can, however, be quantified into a few phases [2, 3], and a histogram can broadly tell you the most frequent path of the workload to optimize. The question you are asking also requires benchmark programs (like SPEC2006, PARSEC, MediaBench, etc.) for an architecture and is difficult to answer in general terms (it is an active area of research in computer architecture). However, for specific cases a quantitative result can be stated for different memory hierarchies. You can use tools like:
Perf
OProfile
VTune
LIKWID
LTTng
and other monitoring and simulation tools to get profiling traces of the application. You can look at performance counters such as IPC and CPI (for CPU-boundedness), and at memory accesses, cache accesses, cache misses, and other memory counters (for memory-boundedness). Metrics like IPC and memory accesses per cycle (MPC) are often used to determine whether an application/workload is memory bound.
To specifically improve the matrix multiplication, I would suggest using an optimized implementation, as in LAPACK.
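For a matrix-vector product with complex doubles specifically, the relevant routine actually lives in BLAS (which LAPACK builds on). A hedged sketch of calling it through the CBLAS interface for one 8x8 matrix, assuming a row-major layout and an available CBLAS implementation (e.g. OpenBLAS):
/* y = A * x for one 8x8 complex-double matrix using an optimized BLAS.
   Link against a CBLAS implementation, e.g. -lopenblas. */
#include <complex.h>
#include <cblas.h>

void mvm_8x8(const double complex A[64],   /* 8x8 matrix, row-major */
             const double complex x[8],    /* input vector */
             double complex y[8])          /* output vector */
{
    const double complex alpha = 1.0, beta = 0.0;
    cblas_zgemv(CblasRowMajor, CblasNoTrans, 8, 8,
                &alpha, A, 8, x, 1, &beta, y, 1);
}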
I have long wondered what is more efficient with regard to making better use of CPU caches (which are known to benefit from locality of reference): two loops each iterating over the same mathematical set of numbers, each with a different body statement (e.g. a call to a function for each element of the set), or one loop whose body does the equivalent of the two (or more) body statements. We assume identical application state after all the looping.
In my opinion, having two loops would introduce fewer cache misses and evictions because more instructions and data used by the loop fit in the cache. Am I right?
Assuming:
The cost of an f and g call is negligible compared to the cost of the loop
f and g each use most of the cache by themselves, so the cache would be spilled when one is called after the other (the case with a single-loop version)
Intel Core Duo CPU
C language source code
The GCC compiler, "no extra switches"
I want answers beyond the "premature optimization is evil" kind, if possible.
An example of the two-loops version that I am advocating for:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
k += g(i);
}
To measure is to know.
I can see three variables (even in a seemingly simple chunk of code):
What do f() and g() do? Can one of them invalidate all of the instruction cache lines (effectively pushing the other one out)? Can that happen in L2 instruction cache too (unlikely)? Then keeping only one of them in it might be beneficial. Note: The inverse does not imply "have a single loop", because:
Do f() and g() operate on large amounts of data, according to i? Then, it'd be nice to know if they operate on the same set of data - again you have to consider whether operating on two different sets screws you up via cache misses.
If f() and g() are indeed as primitive as you first state - and I'm assuming that both in code size and in running time and complexity - cache locality issues won't arise in little chunks of code like this; your biggest concern would be some other process being scheduled with actual work to do, invalidating all the caches until it's your process's turn to run.
A final thought: given that processes like the above might be a rare occurrence in your system (and I'm using "rare" quite liberally), you could consider making both your functions inline and letting the compiler unroll the loop. That is because, for the instruction cache, falling back to L2 is no big deal, and the probability that the single cache line containing i, j and k would be invalidated in that loop doesn't look so horrible. However, if that's not the case, some more details would be useful.
Intuitively one loop is better: you increment i a million fewer times and all the other operation counts remain the same.
On the other hand, it completely depends on f and g. If both are sufficiently large that their code, or the cacheable data they use, nearly fills a critical cache, then swapping between f and g may completely swamp any single-loop benefit.
As you say: it depends.
Your question is not clear enough to give a remotely accurate answer, but I think I understand where you are headed. The data you are iterating over is large enough that before you reach the end you will start to evict data, so that the second time you iterate over it (the second loop), some if not all of it will have to be read again.
If the two loops were joined, so that each element/block is fetched for the first operation and is then already in cache for the second operation, then no matter how large the data is relative to the cache, most if not all of the second operations will take their data from the cache.
Various things - the nature of the cache, the loop code itself being evicted by data and then fetched again, evicting more data - may cause some misses on the second operation. On a PC with an operating system, lots of evictions will occur as other programs get time slices. But assuming an ideal world, the first operation on index i of the data will fetch it from memory, and the second operation will grab it from cache.
Tuning for a cache is difficult at best. I regularly demonstrate that even on an embedded system - no interrupts, single task, same source code - execution time/performance can vary dramatically simply by changing compiler optimization options, changing compilers, or changing brands or versions of compilers, gcc 2.x vs 3.x vs 4.x (gcc does not necessarily produce faster code with newer versions, by the way, and a compiler that is pretty good at a lot of targets is not really good at any one particular target). The same code with different compilers or options can change execution time by several times - 3 times faster, 10 times faster, etc.
Once you get into testing with or without a cache, it gets even more interesting. Add a single nop to your startup code so that your whole program moves one instruction over in memory, and your cache lines now hit in different places. Same compiler, same code. Repeat this with two nops, three nops, etc. Same compiler, same code, and you can see differences of tens of percent, better and worse (for the tests I ran that day on that target with that compiler).
That doesn't mean you can't tune for a cache; it just means that trying to figure out whether your tuning is helping or hurting can be difficult. The normal answer is just "time it and see", but that doesn't always work: you might get great results on your computer that day with that program and that compiler, but tomorrow on your computer, or any day on someone else's computer, you may be making things slower, not faster. You need to understand why this or that change made it faster - maybe it had nothing to do with your code; maybe your email program was downloading a lot of mail in the background during one test and not during the other.
Assuming I understood your question correctly, I think the single loop is probably faster in general.
Breaking the loops into smaller chunks is a good idea. It can improve the cache-hit ratio quite a lot and can make a lot of difference to the performance.
From your example:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
j += f(i);
}
for(int i = 0; i < 1000000; i++)
{
k += g(i);
}
I would either fuse the two loops into one loop like this:
int j = 0, k = 0;
for(int i = 0; i < 1000000; i++)
{
j += f(i);
k += g(i);
}
Or, if this is not possible, do the optimization called loop tiling:
#define TILE_SIZE 1000 /* or whatever you like - pick a number that keeps */
                       /* the working-set below your first level cache size */
int i = 0;
int elements = 1000000;
do {
    int n = i + TILE_SIZE;
    if (n > elements) n = elements;

    // perform loop A on this tile
    for (int a = i; a < n; a++)
    {
        j += f(a);
    }

    // perform loop B on the same tile
    for (int a = i; a < n; a++)
    {
        k += g(a);
    }

    i = n;
} while (i < elements);
The trick with loop tiling is that if the loops share an access pattern, the second loop body gets a chance to re-use the data that has already been read into the cache by the first loop body. This wouldn't happen if you executed loop A a million times first, because the cache is not large enough to hold all that data.
Breaking the loop into smaller chunks and executing them one after another helps a lot here. The trick is to keep the working set of memory below the size of your first-level cache. I aim for half the size of the cache, so other threads that get executed in between don't mess up my cache so much.
If I came across the two-loop version in code, with no explanatory comments, I would wonder why the programmer did it that way, and probably consider the technique to be of dubious quality, whereas a one-loop version would not be surprising, commented or not.
But if I came across the two-loop version along with a comment like "I'm using two loops because it runs X% faster in the cache on CPU Y", at least I'd no longer be puzzled by the code, although I'd still question if it was true and applicable to other machines.
This seems like something the compiler could optimize for you, so instead of trying to figure it out yourself and make it fast, use whichever method makes your code clearer and more readable. If you really must know, time both methods for the input size and calculation type that your application uses (try the code you have now, but repeat your calculations many, many times and disable optimization).
I'm currently stuck on a performance optimization lab in the book "Computer Systems: A Programmer's Perspective", described as follows:
In an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
An example of matrix rotation (N is 3 instead of 32 for simplicity):
------- -------
|1|2|3| |3|6|9|
------- -------
|4|5|6| after rotate is |2|5|8|
------- -------
|7|8|9| |1|4|7|
------- -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I came up with an idea based on inner-loop unrolling. The results are:
Code Version Speed Up
original 1x
unrolled by 2 1.33x
unrolled by 4 1.33x
unrolled by 8 1.55x
unrolled by 16 1.67x
unrolled by 32 1.61x
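For reference, the "unrolled by 2" variant in the table is essentially the naive code with two copies of the assignment per inner iteration; a sketch (dim is a multiple of 32, so it is always even, and RIDX/pixel are as defined in the lab):
void rotate_unroll2(int dim, pixel *src, pixel *dst)
{
    int i, j;
    for (i = 0; i < dim; i++)
        for (j = 0; j < dim; j += 2) {
            /* two destination writes per inner iteration */
            dst[RIDX(dim-1-j,     i, dim)] = src[RIDX(i, j,   dim)];
            dst[RIDX(dim-1-(j+1), i, dim)] = src[RIDX(i, j+1, dim)];
        }
}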
I also found a code snippet on pastebin.com that seems to solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
int stride = 32;
int count = dim >> 5;
src += dim - 1;
int a1 = count;
do {
int a2 = dim;
do {
int a3 = stride;
do {
*dst++ = *src;
src += dim;
} while(--a3);
src -= dim * stride + 1;
dst += dim - stride;
} while(--a2);
src += dim * (stride + 1);
dst -= dim * dim - stride;
} while(--a1);
}
After carefully reading the code, I think the main idea of this solution is to treat each group of 32 rows as a data zone and perform the rotation on each zone separately. The speed-up of this version is 1.85x, beating all of the loop-unrolled versions.
Here are the questions:
In the inner-loop-unrolled version, why does the incremental gain shrink as the unrolling factor increases - for example, why does going from 8 to 16 not help as much as going from 4 to 8? Does the result have some relationship with the depth of the CPU pipeline? If so, could the diminishing gains reflect the pipeline length?
What is the probable reason for the speed-up of the data-zone version? It seems there is not much essential difference from the original naive version.
EDIT:
My test environment is an Intel Centrino Duo architecture and the version of gcc is 4.4.
Any advice will be highly appreciated!
Kind regards!
What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the Centrino Core 2 Duo has two processors, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (i.e., when you run the task manager on Windows, or top on Linux, you see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 pipeline stages, so if you can keep the pipeline full, you'll get faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last point, think about how a McDonald's drive-through works (I think that's relatively widespread?). A car enters the drive-through, orders at one window, pays at a second window, and receives food at a third window. If a second car enters while the first is still ordering, then by the time both finish (assuming each operation in the drive-through takes one 'cycle' or time unit), 2 full operations will be done after 4 cycles have elapsed. If each car did all of its operations at one window, the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles, for a total of 6 cycles. So, with pipelining, the total operation time decreases.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
Going to 32 causing a decrease in performance may have to do with bandwidth from the cache to the processor (again a guess; I can't know for sure without seeing your exact code as well as the machine code). If all the instructions can't fit into the cache or into the registers, then some time is needed to prepare them all to run (i.e., people have to get into their cars and get to the drive-through in the first place). There will be some reduction in speed if they all arrive at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code goes straight to the pointers of the arrays and accesses them directly; basically, for each of the movements from src to dst, fewer operations are involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.
The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a couple of reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set automatically by the decrement), whereas counting up you have to explicitly compare against a maximum value on each iteration.
2) There is no math in the inner loop. You are doing a bunch of math in your inner loop. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom-up from the inside. This means taking the innermost loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum needed to get to the next item; the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment", this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this, like subtractions and multiplications, is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache, you should see even more dramatic differences between implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 matrix, even though the number of data copies is the same.
The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.
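The same blocking idea, expressed with the lab's RIDX indexing instead of raw pointer arithmetic, might look like the sketch below: process the source 32 rows at a time, so that for each column the 32 destination writes land in consecutive memory and stay cache-resident.
#define BLOCK 32   /* matches the stride of 32 in the pastebin version */

/* Sketch of a cache-blocked rotate: for each band of 32 source rows, walk
   the columns; the 32 destination writes per column are contiguous. */
void rotate_blocked(int dim, pixel *src, pixel *dst)
{
    int ib, i, j;
    for (ib = 0; ib < dim; ib += BLOCK)
        for (j = 0; j < dim; j++)
            for (i = ib; i < ib + BLOCK; i++)
                dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}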