I have several 4-dimensional arrays each having different sizes:
array_one(1:2,1:xm,1:ym,1:zm)
where current_step = 1 and previous_step = 2.
In a long loop, with many other operations, I need to shift the current_step values to the previous_step like:
array_one(previous_step,:,:,:) = array_one(current_step,:,:,:)
I know I can do that in a DO loop, but that is perhaps not the most efficient way. Since I have at least 24 such arrays, each with different sizes (i.e. different xm, ym, zm), I would need a separate DO loop for each of them, which could make the code slower.
What is the efficient way for such shifting?
Copy methods
I ran a simple benchmark on my system with 8 different methods to copy the arrays. There were two basic forms of copy I tested:
do k=1,nx
do j=1,nx
do i=1,nx
array(2,i,j,k) = array(1,i,j,k)
end do
end do
end do
and
array(2,:,:,:) = array(1,:,:,:)
For each of these I also tested with the t index as the last array index, e.g.:
array(i,j,k,2) = array(i,j,k,1)
and
array(:,:,:,2) = array(:,:,:,1)
Finally, I tested each of these 4 copies both serially (as shown) and with OpenMP directives, e.g.
!$omp parallel do shared(array) private(i,j,k)
...
!$omp end parallel do
for the do loop copy and with
!$omp parallel workshare shared(array)
...
!$omp end parallel workshare
for the array slice copy.
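For reference, here is a sketch of the two OpenMP variants with the elided bodies filled in from the copies shown above (same array and index names):
!$omp parallel do shared(array) private(i,j,k)
do k=1,nx
do j=1,nx
do i=1,nx
array(2,i,j,k) = array(1,i,j,k)
end do
end do
end do
!$omp end parallel do
and
!$omp parallel workshare shared(array)
array(2,:,:,:) = array(1,:,:,:)
!$omp end parallel workshare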
Each copy was performed 100 times for arrays sized from 100x100x100x2 up to 1000x1000x1000x2 in increments of 100 (ni = nj = nk for all tested arrays).
The compiler and compile flags
I tested with gfortran 4.9.1, and compiled my testcase with
gfortran -march=native -fopenmp -O3 -o arraycopy arraycopy.f90
My CPU is an Intel i7 990X (6 cores with HT enabled), and -march=native will target the highest instruction set supported by the chip. OpenMP will spawn 12 threads.
The OS is Linux 3.12.13.
Results
Average time per copy is on the y-axis and the array dimension is on the x-axis (e.g. 500 is a 500x500x500x2 or 2x500x500x500 array). The red lines are the do loop copy (dashed is the variation with t index last). The green lines are the array slice copy (dashed is the variation with t index last). For both serial copies the variations with t index first were faster (I did not investigate why) and the array notation copy is faster than the loop. The blue lines are the openmp copies with t index first. The black lines are the openmp copies with the t index last. The performance for the parallel do and parallel workshare constructs were equivalent.
Discussion
Run your own benchmarks on your own systems with your typical compile flags. The results here are specific to my system, including the optimization flags, SIMD instructions and OpenMP with 12 threads. They will vary for systems with fewer cores or CPUs with lesser or greater instruction sets (e.g. a CPU with AVX2 should perform better). These results are also influenced by cache locality, RAM and bus speeds, and how my OS scheduler handles hyperthreading.
Based on these results, on my system I would use array slice notation for serial copies, and for best performance I would use OpenMP.
In short, when a program issues a memory read operation, say of A(i), it will not read only A(i); it will read something like A(i-2), A(i-1), A(i), A(i+1), A(i+2). These values are then stored in the CPU cache, which is a much faster memory. That is, the CPU reads a chunk of memory and puts it into the cache for later use. This optimization is based on the fact that it is very likely that your next operation will use some of these surrounding values. If that's the case, the CPU won't need to fetch the memory again, which is a very expensive operation (roughly 100 times more expensive than a floating point operation); instead it just needs to look up the value in the cache. This is called data locality.
In Fortran, multidimensional arrays are stored in column-major order. For instance, let's say you have the following 2x2 matrix:
A(1,1)=a11, A(1,2)=a12, A(2,1)=a21, A(2,2)=a22.
The matrix A(1:2,1:2) is stored linearly in memory in this order: a11, a21, a12, a22 (in contrast, in a row-major language like C, the order would be a11, a12, a21, a22). You can deduce what the order is for higher dimensions.
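A quick way to see this order in code is with reshape, which fills an array in exactly this storage order; a small sketch:
integer :: A(2,2)
A = reshape([11, 21, 12, 22], [2, 2])
! Memory order is 11, 21, 12, 22, so A(1,1)=11, A(2,1)=21, A(1,2)=12, A(2,2)=22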
In short, Fortran arrays are stored linearly in memory with the leftmost index varying fastest. If you want to exploit data locality, you need to traverse the array in that order, i.e. with the leftmost index changing in the innermost loop.
Short answer: I think you should change your structure to (1:xm,1:ym,1:zm,1:2), and if you are going to loop through the array, do it this way:
do h = 1, 2
do i = 1, zm
do j = 1, ym
do k = 1, xm
A(k,j,i,h) = ...something...
end do
end do
end do
end do
Also, the difference between doing A(:) = B(:) and the equivalent do loop is that A(:) = B(:) behaves like a forall statement, in which the right-hand side is evaluated in full before any element is assigned:
forall(i = 1:n)
A(i) = B(i)
end forall
More on this here: http://en.wikipedia.org/wiki/Fortran_95_language_features#The_FORALL_Statement_and_Construct
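Applied to the original question, once the step index is moved to the last dimension as recommended above, the shift becomes a single contiguous slice assignment. A minimal sketch, reusing the names from the question:
! Step index last: each step is a contiguous block in memory
real, dimension(1:xm,1:ym,1:zm,1:2) :: array_one
integer, parameter :: current_step = 1, previous_step = 2
array_one(:,:,:,previous_step) = array_one(:,:,:,current_step)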
Related
This is a follow-up to this question.
I have an array D(:,:,:) of size NxMxM. Typically, for the problem that I am considering now, it is M=400 and N=600000 (I reshaped the array in order to give the biggest size to the first entry).
Therefore, for each value l of the first entry, D(l,:,:) is an MxM matrix in a certain basis. I need to perform a change of components of this matrix using a basis set vec(:,:), of size MxM, so as to obtain the matrices D_mod(l,:,:).
I think that the easiest way to do it is with:
D_mod=0.0d0
do b=1,M
do a=1,M
do nu=1,M
do mu=1,M
D_mod(:,mu,nu)=D_mod(:,mu,nu)+vec(mu,a)*vec(nu,b)*D(:,a,b)
end do
end do
end do
end do
Is there a way to improve the speed of this calculation (also using LAPACK/BLAS libraries)?
I was considering this approach: reshaping D into an N x M^2 matrix D_re; computing the tensor product vec(:,:) x vec(:,:) and reshaping it to obtain an M^2 x M^2 matrix vecsq_re(:,:) (this motivates this question); and finally computing the matrix product of these two matrices with zgemm. However, I am not sure this is a good strategy.
EDIT
I am sorry, I wrote the question too fast and too late. The size can be up to 600000, yes, but I usually adopt strategies to reduce it by a factor of 10 at least. The code is supposed to run on nodes with 100 GB of memory.
As @IanBush has said, your D array is enormous, and you're likely to need some kind of high-memory machine or cluster of machines to process it all at once. However, you probably don't need to process it all at once.
Before we get to that, let's first imagine you don't have any memory issues. For the operation you have described, D looks like an array of N matrices, each of size M*M. You say you have "reshaped the array in order to give the biggest size to the first entry", but for this problem this is the exact opposite of what you want. Fortran is a column-major language, so iterating across the first index of an array is much faster than iterating across the last. In practice, this means that an example triple-loop like
do i=1,N
do j=1,M
do k=1,M
D(i,j,k) = D(i,j,k) +1
enddo
enddo
enddo
will run much slower[1] than the re-ordered triple-loop
do k=1,M
do j=1,M
do i=1,N
D(i,j,k) = D(i,j,k) +1
enddo
enddo
enddo
and so you can immediately speed everything up by transposing D and D_mod from N*M*M arrays to M*M*N arrays and rearranging your loops. You can also speed everything up by replacing your manually-written matrix multiplication with matmul (or BLAS/LAPACK), to give
do i=1,N
D_mod(:,:,i) = matmul(matmul(vec , D(:,:,i)),transpose(vec))
enddo
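If you want to use BLAS instead of matmul, a hedged sketch of the same per-matrix operation with zgemm could look like the following. This assumes, as the mention of zgemm suggests, that the data is double complex; tmp is an illustrative work array, not something from the original code.
complex(kind(1d0)), dimension(M,M) :: tmp
do i = 1, N
! tmp = vec * D(:,:,i)
call zgemm('N', 'N', M, M, M, (1d0,0d0), vec, M, D(:,:,i), M, (0d0,0d0), tmp, M)
! D_mod(:,:,i) = tmp * transpose(vec)
call zgemm('N', 'T', M, M, M, (1d0,0d0), tmp, M, vec, M, (0d0,0d0), D_mod(:,:,i), M)
enddo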
Now that you're doing matrix multiplication one matrix at a time, you can also find a solution for your memory usage problems: instead of loading everything into memory and trying to do everything at once, just load one D matrix at a time into an M*M array, calculate the corresponding entry of D_mod, and write it out to disk before loading the next matrix.
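A minimal sketch of that out-of-core approach, assuming the matrices are stored consecutively in unformatted stream files; the file names, unit handling and the double complex kind are illustrative assumptions:
complex(kind(1d0)), dimension(M,M) :: Dmat, Dmat_mod
integer :: l, iu_in, iu_out
! One M*M matrix is read, transformed and written back out at a time
open(newunit=iu_in, file='D.bin', access='stream', form='unformatted', status='old')
open(newunit=iu_out, file='D_mod.bin', access='stream', form='unformatted', status='replace')
do l = 1, N
read(iu_in) Dmat
Dmat_mod = matmul(matmul(vec, Dmat), transpose(vec))
write(iu_out) Dmat_mod
enddo
close(iu_in)
close(iu_out)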
[1] if your compiler doesn't just optimise the loop order.
I am trying to use OpenCL for the first time; the goal is to calculate the argmin of each row in an array. Since the operation on each row is independent of the others, I thought this would be easy to put on the graphics card.
I seem to get worse performance using this code than when I just run it on the CPU with an outer for loop; any help would be appreciated.
Here is the code:
#pragma OPENCL EXTENSION cl_khr_fp64 : enable
int argmin(global double *array, int end)
{
double minimum = array[0];
int index = 0;
for (int j = 0; j < end; j++)
{
if (array[j] < minimum)
{
minimum = array[j];
index = j;
}
}
return index;
}
kernel void execute(global double *dist, global long *res, global double *min_dist)
{
int row_size = 0;
int i = get_global_id(0);
int row_index = i * row_size;
res[i] = argmin(&dist[row_index], row_size);
min_dist[i] = dist[res[i] + row_index];
}
The commenters make some valid points, but I'll try to be a little more constructive and organised:
1) Your data appears to consist of double precision floating point values. Depending on your GPU, this can be bad news in itself. Consumer grade GPUs typically are not optimised for working with doubles, often only achieving 1/32 or 1/16 the throughput compared to single-precision float operations. Many pro-grade GPUs (Quadro, Tesla, FirePro, some Radeon Pro cards) are fine with them though, achieving 1/2 or 1/4 throughput versus float. As you're only performing a trivial arithmetic operation (comparison), and there's a good chance your runtime is dominated by memory access, it could be fine on consumer hardware too.
2) I assume your row_size is not actually 0; it would help to know what the true (typical) value is, and whether it's fixed, variable by row, or variable per run but the same for each row. In any case, unless row_size is very small, the fact that you are running a serial for loop over it could be holding your code back.
3) How big is your work size? In other words, how many rows are in your array (give a typical range if it varies)? If it is very small, you will see little benefit from GPU parallelism: GPUs have a large number of processors and can schedule a few threads per processor, so your work items will need to number in the hundreds or, better, thousands to achieve decent hardware utilisation.
4) You are reading a very large array from (presumably) system memory and not performing any intensive operations on it. This means your bottleneck will typically be on the memory access side: for discrete GPUs, system memory access needs to go through PCIe, so the speed of that link will place an upper bound on your performance. Additionally, your memory access pattern is far from ideal for GPUs; you typically want work items to read adjacent memory cells at the same time, as the memory unit typically fetches 64 bytes or more at once.
Improvement suggestions:
Profiling. If at all possible, use your GPU vendor's profiling tools to determine your true bottlenecks. Otherwise we're just guessing.
For (4) - if at all possible, try not to move large amounts of data around too much. If you can generate your input arrays on the GPU, do so, so they never leave VRAM.
For (4) - Optimise your memory accesses. AMD, Nvidia and Intel all have OpenCL GPU optimisation guides which explain how to do this. Essentially, restructure your data layout or your kernel such that adjacent work items read adjacent pieces of memory: ideally you want work item 0 to read array item 0, work item 1 to read array item 1, etc. You may need to use local memory to coordinate between work items. Another option is to read vector-sized chunks of data per work item (e.g. each work item reads a double8 at a time); watch out for alignment in this case though.
For (2) & (3) - Unless row_size is very small (and fixed), try to split your loop across multiple work items and coordinate using local memory (reduction algorithms) and atomic operations in global memory.
For (1): If you've optimised everything else and profiling is telling you that comparing doubles on consumer hardware is too slow, either check if you can generate the data as floats without loss of accuracy (this will also halve your memory bandwidth woes), or check if you can otherwise do better somehow, for example by treating the double as a long and manually unpacking and comparing the exponent and mantissa using integer operations.
After reading several different articles and not finding an answer I am going to introduce the problem and then ask the question.
I have a section of code that can be reduced down to a series of loops that look like the following.
#pragma omp parallel for simd
for(int i = 0; i < a*b*c; i++)
{
array1[i] += array2[i] * array3[i];
}
Now, most examples of SIMD use that I have encountered have a, b and c fixed at compile time, allowing for the optimisation to take place. However, my code requires that the values of a, b and c are determined at run time.
Let's say that, for the computer I am using, the vector register can fit 4 values, and that the value of a*b*c is 127. My understanding of how this compiles is that the compiler will vectorise everything that is wholly divisible by 4, then serialise the rest (please correct this if I am wrong). However, that is when the compiler has full knowledge of the problem. If I now allow a run-time choice of a, b and c and end up with the value 127, how would vectorisation proceed? Naively, I would assume that the code behind the scenes is intelligent enough to understand this might happen, has both a serial and a vector code path, and calls the most suitable one. However, as this is an assumption, I would appreciate someone more knowledgeable on the subject enlightening me further, as I don't want accidental overflows, or non-processing of data, due to a misunderstanding.
On the off chance this matters, I am using OpenMP 4.0 with the gcc C compiler, although I am hoping this will not change your answer, as I will always attempt to use the latest OpenMP version and unfortunately may need to routinely change compiler.
Typically, a compiler will unroll beyond the SIMD length. For optimum results, particularly with gcc, you would specify this unroll factor, e.g. --param max-unroll-times=2 (if you don't expect much longer loops). With a SIMD length of 4, the loop would then consume 8 iterations at a time, leaving a remainder. gcc would build a remainder loop, somewhat like Duff's device, which might have 15 iterations, and would calculate where to jump in at run time. The Intel compiler handles a vectorized remainder loop in a different way: supposing you have 2 SIMD widths available, the remainder loop would use the shorter width without unrolling, so that the serial part is as short as possible. When compiling for the general case of unaligned data, there is a remainder loop at both ends, with the one at the beginning limited to the length required for alignment of the stored values. With the combination omp parallel for simd, the situation gets more complicated: normally, the loop chunks must vary in size, and one might argue that the interior chunks should be set up for alignment, with the end chunks smaller (not normally done).
I have been working with Fortran for quite a long time now, but I have a question to which I can't find a satisfying answer.
If I have two arrays and I want to copy one into the other:
real,dimension(0:100,0:100) :: array1,array2
...
do i=0,100
do j=0,100
array1(i,j) = array2(i,j)
enddo
enddo
But I also noticed that it works just as well if I do it like this:
real,dimension(0:100,0:100) :: array1,array2
...
array1 = array2
And there is a huge difference in computational time! (The second one is much faster!)
If I do it without a loop, can there be a problem? I don't know, maybe I'm not copying the content, just the memory reference?
Does it change anything if I do another mathematical step like:
array1 = array2*5
Could there be a problem on a different architecture (e.g. a cluster server) or with a different compiler (gfortran, ifort)?
I have to perform various computational steps on huge amounts of data so the computational time is an issue.
Everything that @Alexander_Vogt said, but also:
do i=0,100
do j=0,100
array1(i,j) = array2(i,j)
enddo
enddo
will always be slower than
do j=0,100
do i=0,100
array1(i,j) = array2(i,j)
enddo
enddo
(Unless the compiler notices it and reorders the loops.)
In Fortran, the first index is the fastest changing. That means that in the second loop, the compiler can load several elements of the array into the lower-level caches in one big swoop and operate on them.
If you have multidimensional loops, always have the innermost loop loop over the first index, and so on. (If possible in any way.)
Fortran is very capable of performing vector operations. Both
array1 = array2
and
array1 = array2*5
are valid operations.
This notation allows the compiler to efficiently parallelize and/or optimize the code, as no dependence on the order of the operations exists.
However, these constructs are equivalent to the explicit loops, and it depends on the compiler which one will be faster.
Whether the memory will be copied or not depends on what further is done with the arrays and whether the compiler can optimize that. If there is no performance gain, it is safe to assume the array will be copied.
I'm currently stuck on a performance optimization lab in the book "Computer Systems: A Programmer's Perspective", described as follows:
In an N*N matrix M, where N is a multiple of 32, the rotate operation can be represented as:
Transpose: interchange elements M(i,j) and M(j,i)
Exchange rows: Row i is exchanged with row N-1-i
An example of matrix rotation (N is 3 instead of 32 for simplicity):
-------                   -------
|1|2|3|                   |3|6|9|
-------                   -------
|4|5|6|  after rotate is  |2|5|8|
-------                   -------
|7|8|9|                   |1|4|7|
-------                   -------
A naive implementation is:
#define RIDX(i,j,n) ((i)*(n)+(j))
void naive_rotate(int dim, pixel *src, pixel *dst)
{
int i, j;
for (i = 0; i < dim; i++)
for (j = 0; j < dim; j++)
dst[RIDX(dim-1-j, i, dim)] = src[RIDX(i, j, dim)];
}
I came up with an idea based on unrolling the inner loop. The results are:
Code Version     Speed Up
original         1x
unrolled by 2    1.33x
unrolled by 4    1.33x
unrolled by 8    1.55x
unrolled by 16   1.67x
unrolled by 32   1.61x
I also got a code snippet from pastebin.com that seems to solve this problem:
void rotate(int dim, pixel *src, pixel *dst)
{
int stride = 32;
int count = dim >> 5;
src += dim - 1;
int a1 = count;
do {
int a2 = dim;
do {
int a3 = stride;
do {
*dst++ = *src;
src += dim;
} while(--a3);
src -= dim * stride + 1;
dst += dim - stride;
} while(--a2);
src += dim * (stride + 1);
dst -= dim * dim - stride;
} while(--a1);
}
After carefully reading the code, I think the main idea of this solution is to treat every 32 rows as a data zone and perform the rotation zone by zone. The speed-up of this version is 1.85x, beating all the loop-unrolled versions.
Here are the questions:
In the inner-loop-unrolled version, why does the incremental speed-up shrink as the unrolling factor increases; in particular, why does going from 8 to 16 not help as much as going from 4 to 8? Does the result have some relationship with the depth of the CPU pipeline? If the answer is yes, could the diminishing gains reflect the pipeline length?
What is the probable reason for the speed-up of the data-zone version? It seems that there is not much essential difference from the original naive version.
EDIT:
My test environment is an Intel Centrino Duo architecture and the version of gcc is 4.4.
Any advice will be highly appreciated!
Kind regards!
What kind of processor are you testing this on? I dimly remember that unrolling loops helps when the processor can handle multiple operations at once, but only up to the maximum number of parallel executions. So if your processor can only handle 8 simultaneous instructions, then unrolling to 16 won't help. But someone with knowledge of more recent processor design will have to pipe up/correct me.
EDIT: According to this PDF, the Centrino Core 2 Duo has two cores, each of which is capable of 4 simultaneous instructions. It's generally not so simple, though. Unless your compiler is optimizing across both cores (i.e., when you run the task manager if you're on Windows, or top if you're on Linux, you'll see that CPU usage is maxed out), your process will be running on one core at a time. The processor also features 14 stages of execution, so if you can keep the pipeline full, you'll get a faster execution.
Continuing along the theoretical route, then, you get a speed improvement of 33% with a single unroll because you're starting to take advantage of simultaneous instruction execution. Going to 4 unrolls doesn't really help, because you're now still within that 4-simultaneous-instruction limit. Going to 8 unrolls helps because the processor can now fill the pipeline more completely, so more instructions will get executed per clock cycle.
For this last point, think about how a McDonald's drive-through works (I think that's relatively widespread?). A car enters the drive-through, orders at one window, pays at a second window, and receives food at a third window. If a second car enters while the first is still ordering, then by the time both finish (assuming each operation in the drive-through takes one 'cycle' or time unit), 2 full operations will be done after 4 cycles have elapsed. If each car did all of its operations at one window, then the first car would take 3 cycles for ordering, paying, and getting food, and then the second car would also take 3 cycles, for a total of 6 cycles. So, operation time decreases due to pipelining.
Of course, you have to keep the pipeline full to get the largest speed improvement. 14 stages is a lot of stages, so going to 16 unrolls will give you some improvement still because more operations can be in the pipeline.
The decrease in performance when going to 32 unrolls may have to do with bandwidth from the cache to the processor (again a guess; I can't know for sure without seeing your code exactly, as well as the machine code). If all the instructions can't fit into cache or into the registers, then some time is necessary to prepare them all to run (i.e., people have to get into their cars and get to the drive-through in the first place). There will be some reduction in speed if they all get there at once, and some shuffling of the line has to be done to make the operation proceed.
Note that each movement from src to dst is not free or a single operation. You have the lookups into the arrays, and that costs time.
As for why the second version works so quickly, I'm going to hazard a guess that it has to do with the [] operator. Every time that gets called, you're doing some lookups into both the src and dst arrays, resolving pointers to locations, and then retrieving the memory. The other code goes straight to the pointers of the arrays and accesses them directly; basically, for each of the movements from src to dst, there are fewer operations involved in the move, because the lookups have been handled explicitly through pointer placement. If you use [], these steps are followed:
do any math inside the []
take a pointer to that location (startOfArray + [] in memory)
return the result of that location in memory
If you walk along with a pointer, you just do the math to do the walk (typically just an addition, no multiplication) and then return the result, because you've already done the second step.
If I'm right, then you might get better results with the second code by unrolling its inner loop as well, so that multiple operations can be pipelined simultaneously.
The first part of the question I'm not sure about. My initial thought was some sort of cache problem, but you're only accessing each item once.
The other code could be faster for a couple of reasons.
1) The loops count down instead of up. Comparing a loop counter to zero costs nothing on most architectures (a flag is set by the decrement automatically), whereas counting up means you have to explicitly compare against a maximum value on each iteration.
2) There is no math in its inner loop, whereas you are doing a bunch of math in yours. I see 2 subtractions in the main code and a multiply in the macro (which is used twice). There is also the implicit addition of the resulting indexes to the base address of the array, which is avoided by the use of pointers (good addressing modes on x86 should eliminate this penalty too).
When writing optimized code, you always construct it bottom up from the inside. This means taking the inner-most loop and reducing its content to nearly zero. In this case, moving data is unavoidable. Incrementing a pointer is the bare minimum to get to the next item, the other pointer needs to add an offset to get to its next item. So at a minimum we have 4 operations: load, store, increment, add. If an architecture supported "move with post-increment" this would be 2 instructions total. On Intel I suspect it's 3 or 4 instructions. Anything more than this like subtractions and multiplication is going to add significant code.
Looking at the assembly code of each version should offer much insight.
If you run this repeatedly on a small matrix (32x32) that fits completely in cache, you should see even more dramatic differences between implementations. Running on a 1024x1024 matrix will be much slower than doing 1024 rotations of a single 32x32 matrix, even though the number of data copies is the same.
The main purpose of loop unrolling is to reduce the time spent on the loop control (test for completion, incrementing counters, etc...). This is a case of diminishing returns though, since as the loop is unrolled more and more, the time spent on loop control becomes less and less significant. Like mmr said, loop unrolling may also help the compiler to execute things in parallel, but only up to a point.
The "data-zone" algorithm appears to be a version of a cache efficient matrix transpose algorithm. The problem with computing a transpose the naive way is that it results in a lot of cache misses. For the source array, you are accessing the memory along each row, so it is accessed in a linear manner, element-by-element. However, this requires that you access the destination array along the columns, meaning you are jumping dim elements each time you access an element. Basically, for each row of the input, you are traversing the memory of the entire destination matrix. Since the whole matrix probably won't fit in the cache, memory has to be loaded and unloaded from the cache very often.
The "data-zone" algorithm takes the matrix that you are accessing by column and only performs the transpose for 32 rows at a time, so the amount of memory you are traversing is 32xstride, which should hopefully fit completely into the cache. Basically the aim is to work on sub-sections that fit in the cache and reduce the amount of jumping around in memory.