Understand whether a code sample is CPU bound or memory bound

As a general question to those working on optimization and performance tuning of programs: how do you figure out whether your code is CPU bound or memory bound? I understand these concepts in general, but if I have, say, 'y' loads and stores and '2y' computations, how do I find out what the bottleneck is?
Also, can you figure out where exactly you are spending most of your time, and say that if you load 'x' amount of data into cache in every loop iteration (if it's memory bound), then your code will run faster? Is there any precise way to determine this 'x', other than trial and error?
Are there any tools you would use, say on the IA-32 or IA-64 architecture? Does VTune help?
For example, I'm currently doing the following:
I have 26 8*8 matrices of complex doubles and I have to perform a MVM (matrix vector multiplication) with (~4000) vectors of length 8, for each of these 26 matrices. I use SSE to perform the complex multiplication.
/* Copy the 26 matrices to temporary storage */
for (int i = 0; i < 4000; i += 2) {   // Loop over the 4000 vectors
    for (int k = 0; k < 26; k++) {    // Loop over the 26 matrices
        /*
         * Perform the MVM in blocks of 2 between the k-th matrix and
         * the i-th and (i+1)-th vectors
         */
    }
}
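(For reference, the basic SSE3 pattern for a single complex-double multiply, used inside the block above, is roughly the following sketch; this is just the standard pattern with illustrative names, not my exact kernel, which works on whole 8x8 blocks:)

#include <pmmintrin.h>  /* SSE3 */

/* Returns a*b for one complex double stored as [re, im] */
static inline __m128d cmul(__m128d a, __m128d b)
{
    __m128d re = _mm_movedup_pd(a);        /* [a.re, a.re]               */
    __m128d im = _mm_unpackhi_pd(a, a);    /* [a.im, a.im]               */
    __m128d t1 = _mm_mul_pd(re, b);        /* [a.re*b.re, a.re*b.im]     */
    __m128d bs = _mm_shuffle_pd(b, b, 1);  /* [b.im, b.re]               */
    __m128d t2 = _mm_mul_pd(im, bs);       /* [a.im*b.im, a.im*b.re]     */
    return _mm_addsub_pd(t1, t2);          /* [re*re-im*im, re*im+im*re] */
}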
The 26 matrices take 26 KB (the L1 cache is 32 KB) and I have laid the vectors out in memory so that I have stride-1 accesses. Once I perform the MVM on a vector with the 26 matrices, I don't visit that vector again, so I don't think cache blocking will help. I have used vectorization, but I'm still stuck at 60% of peak performance.
I tried copying, say, 64 vectors into temporary storage for every iteration of the outer loop, thinking they'd be in cache and help, but it only decreased performance. I also tried using _mm_prefetch() in the following way: when I am done with about half the matrices, I prefetch the next 'i' and 'i+1' vectors, but that hasn't helped either.
I have done all this assuming it's memory bound, but I want to know for sure. Is there a way?

To my understanding, the best way is to profile your application/workload. The characteristics of an application/workload can vary significantly with the input data. This behavior can, however, be quantified into a few phases [Ref 2, 3], and a histogram can broadly tell you the most frequent paths of the workload that are worth optimizing. The question you are asking also requires benchmark programs (like SPEC2006, PARSEC, MediaBench, etc.) for a given architecture and is difficult to answer in general terms (it is an active area of research in computer architecture). However, for specific cases a quantitative answer can be given for different memory hierarchies. You can use tools like:
Perf
OProfile
VTune
LIKWID
LTTng
and other monitoring and simulation tools to get profiling traces of the application. You can look at performance counters like IPC and CPI (for the CPU-bound side), and at memory accesses, cache accesses, cache misses, and other memory counters (for the memory-bound side). In particular, IPC together with memory accesses per cycle (MPC) is often used to determine whether an application/workload is memory bound.
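As a rough, hand-rolled complement to those counters, you can also time the kernel yourself and compare the achieved FLOP rate and memory traffic against your machine's peaks (a minimal sketch; the flop and byte counts per call are placeholders you have to work out for your own kernel):

#include <stdio.h>
#include <time.h>

void kernel(void);                   /* the loop nest you want to classify */

#define FLOPS_PER_CALL 1.0e9         /* placeholder: your flop count       */
#define BYTES_PER_CALL 1.0e9         /* placeholder: your memory traffic   */

int main(void)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    kernel();
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("GFLOP/s: %.2f  GB/s: %.2f\n",
           FLOPS_PER_CALL / s * 1e-9, BYTES_PER_CALL / s * 1e-9);
    /* If GB/s is close to the memory bandwidth peak while GFLOP/s is far
       from the compute peak, the kernel is likely memory bound, and
       vice versa. */
    return 0;
}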
To specifically improve the matrix multiplication, I would suggest using an optimized implementation, as in LAPACK.

Related

Counting FLOPs and size of data and check whether function is memory-bound or cpu-bound

I am going to analyze and optimize some C code, and therefore I first have to check whether the functions I want to optimize are memory bound or CPU bound. In general I know how to do this, but I have some questions about counting floating-point operations and analyzing the size of the data that is used. Look at the following for loop, which I want to analyze. The values of the array are doubles (that is, 8 bytes each):
for (int j = 0; j < N; j++) {
    for (int i = 1; i < Nt; i++) {
        matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i] * sigma;
    }
}
1) How many floating-point operations do you count? I thought about 3*(Nt-1)*N... but do I have to count the operations inside the array subscripts, too (matrix[j*Nt+i], which would be 2 more operations per access)?
2) How much data is transferred? 2*(Nt-1)*N*8 bytes or 3*(Nt-1)*N*8 bytes? I mean, every entry of the matrix has to be loaded. After the calculation, the new value is stored to that index of the array (so that is one load and one store). But this value is used in the next calculation. Is another load needed for that, or is this value (matrix[j*Nt+i-1]) already available without a load?
Thanks a lot!
With this type of code, the direct sort of analysis you are proposing can be almost completely misleading. The only meaningful information about the performance of the code comes from actually measuring how fast it runs in practice (benchmarking).
This is because modern compilers and processors are very clever about optimizing code like this, and it will end up executing in a way which is nothing like your straightforward analysis. The compiler will optimize the code, rearranging the individual operations. The processor will itself try to execute the individual sub-operations in parallel and/or pipelined, so that for example computation is occurring while data is being fetched from memory.
It's useful to think about algorithmic complexity, to distinguish between O(n) and O(n²) and so on, but constant factors (like the 2x or 3x you are asking about) are completely moot, because in practice they vary with lots of details.
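A minimal benchmark of the loop in question might look like the following (a sketch; the N, Nt, mu, and sigma values here are arbitrary placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    int N = 1000, Nt = 1000;             /* placeholder sizes */
    double mu = 0.5, sigma = 0.5;
    double *matrix = malloc(sizeof(double) * N * Nt);
    for (int i = 0; i < N * Nt; i++)
        matrix[i] = 1.0;

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int j = 0; j < N; j++)
        for (int i = 1; i < Nt; i++)
            matrix[j*Nt+i] = matrix[j*Nt+i-1] * mu + matrix[j*Nt+i] * sigma;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double s = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
    printf("%.2f ns per element\n", s * 1e9 / ((double)N * (Nt - 1)));
    printf("checksum: %f\n", matrix[N*Nt - 1]);  /* keep the loop from being optimized away */
    free(matrix);
    return 0;
}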

Memory management for Gauss elimination

The matrix A is created on processor 0 and scattered to the other processors. It is a symmetric dense matrix; that's why it is initialized on processor 0.
The matrix A is created in this way:
A = malloc(sizeof(double) * N * N);
for (i = 0; i < N; i++)
    for (j = 0; j < N; j++)
        A(i,j) = rand() % 10;   // The code will be changed.
A(i,j) is defined as:
#define A(i,j) A[i*N+j]
and N has to be 100,000 to test the algorithm.
The problem here is: if N=100,000 then the memory needed is approximately 76GB. What do you suggest to store the A matrix?
PS: The algorithm works very well when N < 20,000, and the cluster is a distributed-memory system (2 GB RAM per processor).
If you are doing this, as stated in the comments, to do a scaling test, then Oli Charlesworth is completely right; anything you do is going to make this an apples-to-oranges comparison, because your node doesn't have 76 GB to use. Which is fine; one of the big reasons to use MPI is to tackle problems that couldn't fit on one node. But by trying to shoehorn 76 GB of data onto one processor, the comparison you're doing isn't going to make any sense. As mentioned by both Oli Charlesworth and caf, you can use disk instead of RAM through various methods, but then your one-processor answer is not going to be directly comparable to the fits-in-RAM numbers you get from larger numbers of nodes, so you'd be going to a lot of work to get a number which won't actually mean anything.
If you want scaling results on this sort of problem, you either start with the lowest number of nodes that the problem does fit on and take data at increasing numbers of processors, or you do weak scaling rather than strong scaling tests -- you keep the work per processor constant while scaling up the number of processors, rather than keeping the total work constant.
Incidentally, however you do the measurements, you'll end up with better results if, as Oli Charlesworth suggests, you have each processor generate its own data, rather than creating a serial bottleneck by having rank 0 generate the matrix and then having all the processors receive their parts.
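A sketch of that per-rank generation, assuming a simple row-block distribution and that N divides evenly by the number of ranks, might look like this:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int N = 100000;
    int rows_local = N / size;            /* assumes N % size == 0 */

    /* Each rank allocates and fills only its own block of rows. */
    double *A_local = malloc(sizeof(double) * rows_local * N);
    srand(rank + 1);                      /* per-rank seed */
    for (int i = 0; i < rows_local; i++)
        for (int j = 0; j < N; j++)
            A_local[i * (size_t)N + j] = rand() % 10;

    /* ... the elimination then proceeds on the distributed blocks ... */

    free(A_local);
    MPI_Finalize();
    return 0;
}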
If you are programming on a POSIX system with sufficient virtual address space (which in practice will mean a 64 bit system), you can use mmap().
Either create an anonymous mapping of the required size (this will be swap-backed, which will mean you'll need at least 76GB of swap), or create a real file of the required size and map that.
The file-backed solution has the advantage that if your cluster has a shared file system, you don't need to explicitly transfer the matrix to each processor - you can simply msync() it after creating it, and then map the right region on each processor.
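A minimal sketch of the file-backed variant on a 64-bit POSIX system might look like this (the file name A.bin is just an example):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define N 100000UL

int main(void)
{
    size_t bytes = N * N * sizeof(double);          /* ~76 GB */

    /* Create a backing file of the required size. */
    int fd = open("A.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0 || ftruncate(fd, (off_t)bytes) != 0) {
        perror("backing file");
        return 1;
    }

    /* Map the whole matrix; pages are paged in from disk on demand. */
    double *A = mmap(NULL, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (A == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    A[0] = 1.0;                  /* use A like an ordinary array       */
    msync(A, bytes, MS_SYNC);    /* flush dirty pages back to the file */
    munmap(A, bytes);
    close(fd);
    return 0;
}

On a shared file system, each process could then open the same file and mmap() just the region containing its own rows.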
If you can switch to C++, you might look into STXXL, which is an STL implementation specifically designed for huge datasets, with transparent disk-backed support, etc.

Should I use a GPU?

How can I know if my serial code will run faster if I use a GPU? I know it depends on a lot of things... e.g., whether the code can be parallelized in a SIMD fashion and all that... but what considerations should I take into account to be "sure" that I will gain speed? Should the algorithm be embarrassingly parallel? Should I not bother trying the GPU if parts of the algorithm cannot be parallelized? Should I take into consideration how much memory is required for a sample input?
What are the "specs" of a serial code that would make it run faster on a GPU? Can a complex algorithm gain speed on a GPU?
I don't want to waste time coding my algorithm for the GPU unless I am 100% sure that speed will be gained... that is my problem.
I think that my algorithm could be parallelized on a GPU... would it be worth trying?
It depends upon two factors:
1) The speedup of having many cores performing the floating point operations
This is dependent upon the inherent parallelization of the operations you are performing, the number of cores on your GPU, and the differences in clock rates between your CPU and GPU.
2) The overhead of transferring the data back and forth between main memory and GPU memory.
This is mainly dependent upon the "memory bandwidth" of your particular GPU, and is greatly reduced by the Sandy Bridge architecture where the CPU and GPU are on the same die. With older architectures, some operations such as matrix multiplication where the inner dimensions are small get no improvement. This is because it takes longer to transfer the inner vectors back and forth across the system bus than it does to dot product the vectors on the CPU.
Unfortunately these two factors are tough to estimate and there is no way to "know" without trying it. If you currently use BLAS for your SIMD operations, it is fairly simple to substitute in CUBLAS which has the same API except it sends the operations over to the GPU to perform.
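As an illustration of what that substitution looks like, here is a hedged sketch (not production code; it assumes column-major NxN matrices, omits error checking, and links against -lcublas -lcudart):

/* Hypothetical replacement of a host dgemm call with cuBLAS. */
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gpu_dgemm(int n, const double *A, const double *B, double *C)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* Copy the operands to the GPU -- this transfer is the overhead
       discussed above and can dominate for small matrices. */
    cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
    cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);

    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}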
When looking for a parallel solution, you should typically ask yourself the following questions:
How much data do you have?
How much floating-point computation do you have?
How complicated is your algorithm, i.e., how many conditions and branches does it have? Is there any data locality?
What kind of speedup is required?
Is it a real-time computation or not?
Do alternative algorithms exist (even if they are not the most efficient serial algorithms)?
What kind of software/hardware do you have access to?
Depending on the answers, you may want to use GPGPU, cluster computing, distributed computing, or a combination of GPU and cluster/distributed machines.
If you can share any information about your algorithm and the size of your data, it will be easier to comment.
Regular C code can be converted to CUDA remarkably easily. If the heavy hitters in your algorithm's profile can be parallelized, try it and see if it helps.

general question about solving problems with parallelisation

I have a general question about programming parallel algorithms in C. Let's assume that our task is to implement some matrix algorithms with MPI and/or OpenMP. There are some situations, like false sharing in OpenMP, or in MPI where the communication depends on the matrix dimension (columns cyclically distributed among processes), which cause problems. Would it be a good and common approach to solve these situations by, for example, transposing the matrix, because this would reduce the necessary communication or even avoid the false-sharing problem? Afterwards you would undo the transposition. Of course, this assumes the transposition would lead to a much better speedup.
I don't think this would be very cunning, and more of a lazy way to do it. But I'm curious to read some opinions about it.
Let's start with the first question: can it make sense to transpose? The answer is, it depends, and you can estimate whether it will improve things or not.
The transposition/retransposition will impose a one-time memory bandwidth cost of 2*(going through memory the fast way) + 2*(going through memory the slow way), where those memory operations are literally memory operations in the multicore case, or network communications in the distributed-memory case. You're going to be reading the matrix the fast way and writing it to memory the slow way. (You can make this, essentially, 4*(going through memory the fast way) by reading the matrix one cache-sized block at a time, transposing in cache, and writing out in order.)
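A blocked transpose of that sort might look something like this (a sketch; BLOCK is a tuning parameter chosen so a pair of tiles fits comfortably in L1):

#include <stddef.h>

#define BLOCK 32   /* 32x32 doubles = 8 KB per tile */

/* Out-of-place blocked transpose: read A tile by tile in row order
   (the fast direction) and write the transposed tile into B. */
void transpose_blocked(int n, const double *A, double *B)
{
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int i = ii; i < ii + BLOCK && i < n; i++)
                for (int j = jj; j < jj + BLOCK && j < n; j++)
                    B[(size_t)j * n + i] = A[(size_t)i * n + j];
}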
Whether or not that's a win depends on how many times you'll be accessing the array. If you would have been hitting the entire non-transposed array four times with memory access in the "wrong" direction, then you will clearly win by doing the two transposes. If you'd only be going through the non-transposed array once in the wrong direction, then you almost certainly won't win by doing the transposition.
As to the larger question, @AlexandreC is absolutely right here -- trying to implement your own linear algebra routines is madness. Take a look at, e.g., How To Write Fast Numerical Code, figure 3; there can be factors of 40 in performance between naive and highly-tuned (say) GEMM operations. These things are highly memory-bandwidth limited, and in parallel that means network limited. By far the best option is to use existing tools.
For multicore linear algebra, existing libraries include
ATLAS
PLASMA
FLAME
For MPI implementations, there are
BLACS
ScaLAPACK
or complete solver environments like
PETSc
Trilinos
I don't know that you'd throw the transpose away the second that you completed the operation, but yes this is a valid mechanism to increase parallelism.
I am not an expert; I've only read a little bit about this topic, and even that was for SIMD architectures, so please take my opinion lightly... but I think the usual mechanism is to lay your structures out in memory to match the machine (so you'd transpose a large matrix to line up better with your vectors and increase the dependency distance in your loops), and then you also build an indexing structure of pointers around that so that you can quickly access individual elements in the transpose differently. This gets more difficult to do as your input changes more dynamically.
I don't think this would be very cunning, and more of a lazy way to do it.
Lazy solutions are usually better than "cunning" ones, because they tend to be more simple and straightforward. They're therefore easier to implement, document, understand and maintain. Indeed, laziness is arguably one of the greatest virtues a programmer can have. As long as the program produces correct results at acceptable speeds, nobody should care how elegantly you solved the problem (including you).

What is the maximum theoretical speed-up due to SSE for a simple binary subtraction?

I'm trying to figure out whether my code's inner loop is hitting a hardware design barrier or a barrier caused by a lack of understanding on my part. There's a bit more to it, but the simplest question I can come up with is the following:
If I have the following code:
float px[32768], py[32768], pz[32768];
float xref, yref, zref, deltax, deltay, deltaz;
int i, j;
initialize_with_random(px);
initialize_with_random(py);
initialize_with_random(pz);
for (i = 0; i < 32768 - 1; i++) {
    xref = px[i];
    yref = py[i];
    zref = pz[i];
    for (j = 0; j < 32768 - 1; j++) {
        deltax = xref - px[j];
        deltay = yref - py[j];
        deltaz = zref - pz[j];
    }
}
What type of maximum theoretical speed-up would I be able to see by going to SSE instructions in a situation where I have complete control over the code (assembly, intrinsics, whatever) but no control over the runtime environment other than the architecture (i.e., it's a multi-user environment, so I can't do anything about how the OS kernel assigns time to my particular process)?
Right now I'm seeing a speed-up of 3x with my code, when I would have thought SSE would give me much more vector depth than the 3x speed-up indicates (presumably the 3x speed-up tells me I have a 4x maximum theoretical throughput). (I've tried things such as making deltax/deltay/deltaz arrays in case the compiler wasn't smart enough to auto-promote them, but I still see only a 3x speed-up.) I'm using the Intel C compiler with the appropriate compiler flags for vectorization, but no intrinsics, obviously.
It depends on the CPU. But the theoretical max won't get above 4x. I don't know of a CPU which can execute more than one SSE instruction per clock cycle, which means that it can at most compute 4 values per cycle.
Most CPUs can do at least one floating-point scalar instruction per cycle, so in this case you'd see a theoretical max of a 4x speedup.
But you'll have to look up the specific instruction throughput for the CPU you're running on.
A practical speedup of 3x is pretty good though.
I think you'd probably have to interleave the inner loop somehow. The 3-component vector is getting done at once, but that's only 3 operations at once. To get to 4, you'd do 3 components from the first vector and 1 from the next, then 2 and 2, and so on. If you established some kind of queue that loads and processes the data 4 components at a time, and then separates it afterwards, that might work.
Edit: You could unroll the inner loop to do 4 vectors per iteration (assuming the array size is always a multiple of 4). That would accomplish what I said above.
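A sketch of what that unrolled, 4-wide inner loop might look like with SSE intrinsics (leaving out the remainder elements and assuming the arrays and scalars from the question):

#include <xmmintrin.h>

__m128 vxref = _mm_set1_ps(xref);
__m128 vyref = _mm_set1_ps(yref);
__m128 vzref = _mm_set1_ps(zref);

for (j = 0; j + 4 <= 32768 - 1; j += 4) {
    /* One subtraction instruction now covers four j positions per coordinate. */
    __m128 dx = _mm_sub_ps(vxref, _mm_loadu_ps(&px[j]));
    __m128 dy = _mm_sub_ps(vyref, _mm_loadu_ps(&py[j]));
    __m128 dz = _mm_sub_ps(vzref, _mm_loadu_ps(&pz[j]));
    /* ... use dx/dy/dz here ... */
}
/* A short scalar loop would handle the last few elements. */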
Consider: how wide is a float? How wide is an SSE register? The ratio should give you some kind of reasonable upper bound.
It's also worth noting that out-of-order pipelines play havoc with getting good estimates of speedup.
You should consider loop tiling - the way you are accessing values in the inner loop is probably causing a lot of thrashing in the L1 data cache. It's not too bad, because everything probably still fits in the L2 at 384 KB, but there is easily an order of magnitude difference between an L1 cache hit and an L2 cache hit, so this could make a big difference for you.
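A tiled version of the loops above might look roughly like this (a sketch; TILE is a tuning parameter, and the deltax/deltay/deltaz declarations are assumed from the question):

#define TILE 2048   /* 3 arrays * 2048 floats * 4 bytes = 24 KB, fits in a 32 KB L1 */

for (int jj = 0; jj < 32768 - 1; jj += TILE) {
    int jend = (jj + TILE < 32768 - 1) ? jj + TILE : 32768 - 1;
    for (int i = 0; i < 32768 - 1; i++) {
        float xref = px[i], yref = py[i], zref = pz[i];
        /* The px/py/pz elements in [jj, jend) stay resident in L1 while
           the whole i loop runs over them. */
        for (int j = jj; j < jend; j++) {
            deltax = xref - px[j];
            deltay = yref - py[j];
            deltaz = zref - pz[j];
        }
    }
}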

Resources