Should I use a GPU?

How can I know if my serial code will run faster if I use a GPU? I know it depends on a lot of things, i.e. whether the code can be parallelized in a SIMD fashion and so on, but what considerations should I take into account to be "sure" that I will gain speed? Should the algorithm be embarrassingly parallel? Should I therefore not bother trying the GPU if parts of the algorithm cannot be parallelized? Should I take into consideration how much memory is required for a sample input?
What are the "specs" of a serial code that would make it run faster on a GPU? Can a complex algorithm gain speed on a GPU?
I don't want to waste time trying to code my algorithm for the GPU unless I am 100% sure that speed will be gained... that is my problem.
I think that my algorithm could be parallelized on the GPU... would it be worth trying?

It depends upon two factors:
1) The speedup of having many cores performing the floating point operations
This depends on the inherent parallelism of the operations you are performing, the number of cores on your GPU, and the difference in clock rates between your CPU and GPU.
2) The overhead of transferring the data back and forth between main memory and GPU memory.
This is mainly dependent upon the "memory bandwidth" of your particular GPU, and is greatly reduced by the Sandy Bridge architecture where the CPU and GPU are on the same die. With older architectures, some operations such as matrix multiplication where the inner dimensions are small get no improvement. This is because it takes longer to transfer the inner vectors back and forth across the system bus than it does to dot product the vectors on the CPU.
Unfortunately these two factors are tough to estimate and there is no way to "know" without trying it. If you currently use BLAS for your SIMD operations, it is fairly simple to substitute in CUBLAS which has the same API except it sends the operations over to the GPU to perform.
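For instance, here is a minimal sketch of what such a substitution might look like for a single SGEMM call using the standard cuBLAS v2 API (the function name gpu_sgemm and the array arguments are placeholders, and error checking is omitted; matrices are column-major as in BLAS):

#include <cublas_v2.h>
#include <cuda_runtime.h>

// Sketch: replacing a CPU sgemm call with cuBLAS. A, B, C are host arrays
// of sizes m*k, k*n and m*n.
void gpu_sgemm(const float *A, const float *B, float *C, int m, int n, int k)
{
    float *dA, *dB, *dC;
    cudaMalloc(&dA, sizeof(float) * m * k);
    cudaMalloc(&dB, sizeof(float) * k * n);
    cudaMalloc(&dC, sizeof(float) * m * n);

    // The transfer overhead (factor 2 in the answer above) happens here.
    cudaMemcpy(dA, A, sizeof(float) * m * k, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, sizeof(float) * k * n, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // Same semantics as the BLAS call sgemm('N','N',m,n,k,...), column-major.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
                &alpha, dA, m, dB, k, &beta, dC, m);
    cublasDestroy(handle);

    cudaMemcpy(C, dC, sizeof(float) * m * n, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}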

When looking for a parallel solution you should typically ask yourself these questions:
How much data do you have?
How much floating-point computation do you have?
How complicated is your algorithm, i.e. how many conditions and branches does it have? Is there any data locality?
What kind of speedup is required?
Is it real-time computation or not?
Do alternative algorithms exist (even if they are not the most efficient serial algorithms)?
What kind of software/hardware do you have access to?
Depending on the answers, you may want to use GPGPU, cluster computing, distributed computing, or a combination of GPU and cluster/distributed machines.
If you could share any information on your algorithm and the size of your data, it would be easier to comment.

Regular C code can be converted to CUDA remarkably easily. If the heavy hitters in your algorithm's profile can be parallelized, try it and see if it helps.
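As a rough illustration of how direct the conversion can be (a sketch, not a tuned implementation; the SAXPY loop and all names here are generic placeholders):

// Serial version:
// for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one loop iteration per thread
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// Launch with enough threads to cover n elements (d_x, d_y already on the GPU):
// saxpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);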


How can I know the clock cycles, overall performance, etc. of a program?

I have 3 different algorithms which all calculate the same thing.
My goal is to compare all three algorithms: clock cycles, "how intensive it is for the processor", time needed to get the final result, overall performance, etc.
How can I see/get/analyze all of this information?
I am programming in MATLAB and in C in Code Composer Studio for an embedded system.
EDIT: memory usage/management would be useful as well, especially for the embedded system.
First, you can compare the sizes of your output files. Most of the time, the bigger one is slower.
Getting the exact clock cycles is not easy. You must know how many clock cycles each assembly instruction needs and add them up for your code.
If you are running it directly on your hardware, you can toggle a port at the start and end points and measure the timing externally. (Keep in mind that interrupts may occur and slow you down.)
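A rough sketch of that port-toggle approach in C (GPIO_PORT, GPIO_PIN and the register address are hypothetical and depend entirely on your target; algorithm_under_test stands in for one of your three algorithms):

#include <stdint.h>

#define GPIO_PORT (*(volatile uint32_t *)0x40004C00u)  /* hypothetical register address */
#define GPIO_PIN  (1u << 3)                            /* hypothetical pin              */

void algorithm_under_test(void);   /* one of the three algorithms, defined elsewhere */

void time_with_scope(void)
{
    GPIO_PORT |= GPIO_PIN;    /* pin high: start of measurement */
    algorithm_under_test();   /* the code you want to time      */
    GPIO_PORT &= ~GPIO_PIN;   /* pin low: end of measurement    */
    /* Measure the pulse width with an oscilloscope or logic analyzer.   */
    /* Disable interrupts around this block if they would distort it.    */
}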
For the MATLAB part, you should use the timeit function to evaluate performance. You can also use profile to inspect which, if any, parts of the code are causing bottlenecks.

Avoiding CUDA thread divergence for MISD type operation

As part of a bigger code, I have a CUDA RK4 solver that integrates a large number of ODEs (Can be 1000+) in parallel. One step of this operation is calculating 'xdot', which is different for each equation (or data element). As of now, I have a switch-case branching setup to calculate the value for each data element in the kernel. All the different threads use the same 3-6 data elements to calculate their output, but in a different way. For example, for thread 1, it could be
xdot = data[0]*data[0] + data[1];
while for thread 2 it could be,
xdot = -2*data[0] + data[2];
and so on.
So if I have a hundred data elements, the execution path is different for each of them.
Is there any way to avoid/decrease the thread-divergence penalty in such a scenario?
Would running only one thread per block be of any help ?
Running one thread per block simply idles 31 of the 32 threads in the single warp you launch and wastes a lot of cycles and opportunities to hide latency. I would never recommend it, no matter how much branch divergence penalty your code incurred.
Your application sounds pretty orthogonal to the basic CUDA programming paradigm and there really isn't going to be much you can do to avoid branch divergence penalties. One approach which could slightly improve things would be to perform some prior analysis of the expressions for each equation and group those with common arithmetic terms together. Recent hardware can run a number of kernels simultaneously, so it might be profitable to group calculations sharing like terms into different kernels and launch them simultaneously, rather than as a single large kernel. CUDA supports C++ templating, and that can be a good way of generating a lot of kernel code from a relatively narrow base and making a lot of the logic statically evaluable, which can help the compiler. But don't expect miracles: your problem is probably better suited to a different architecture than the GPU (Intel's Xeon Phi, for example).
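For example, here is a very simplified sketch of the templating idea (rhs, xdot_kernel, the coefficient array and the stream names are all placeholders, not your actual code): equations are grouped by algebraic form, the form index becomes a compile-time parameter, and the per-thread differences are reduced to data (coefficients) rather than control flow.

template <int FORM>
__device__ float rhs(const float *data, float coeff)
{
    // FORM is a compile-time constant, so this branch is resolved by the compiler.
    if (FORM == 0) return coeff * data[0] * data[0] + data[1];
    if (FORM == 1) return coeff * data[0] + data[2];
    /* ... one case per equation family ... */
    return 0.0f;
}

template <int FORM>
__global__ void xdot_kernel(const float *data, const float *coeff, float *xdot, int n)
{
    // All n equations handled by this launch share the same algebraic form,
    // so there is no divergence inside the warp; they differ only in coeff[i].
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        xdot[i] = rhs<FORM>(data, coeff[i]);
}

// Host side: one launch per form, possibly on separate streams so that
// recent hardware can overlap them:
// xdot_kernel<0><<<blocks0, threads, 0, stream0>>>(d_data, d_coeff0, d_xdot0, n0);
// xdot_kernel<1><<<blocks1, threads, 0, stream1>>>(d_data, d_coeff1, d_xdot1, n1);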

Function to purposefully have a high CPU load?

I'm making a program which controls other processes (starting and stopping them).
One of the criteria is that the load on the computer is under a certain value.
So I need a function or small program which will cause the load to be very high for testing purposes. Can you think of anything?
Thanks.
I can think of this one:
for(;;);
If you want to actually generate peak load on a CPU, you typically want a modest-size (so that the working set fits entirely in cache) trivially parallelizable task, that someone has hand-optimized to use the vector unit on a processor. Common choices are things like FFTs, matrix multiplications, and basic operations on mathematical vectors.
These almost always generate much higher power and compute load than do more number-theoretic tasks like primality testing because they are essentially branch-free (other than loops), and are extremely homogeneous, so they can be designed to use the full compute bandwidth of a machine essentially all the time.
The exact function that should be used to generate a true maximum load varies quite a bit with the exact details of the processor micro-architecture (different machines have different load/store bandwidth in proportion to the number and width of multiply and add functional units), but numerical software libraries (and signal processing libraries) are great things to start with. See if any that have been hand tuned for your platform are available.
If you need to control how long the CPU burst will be, you can use something like the Sieve of Eratosthenes (an algorithm to find all primes up to a certain number) and supply a smallish integer (10000) for short bursts, or a big integer (100000000) for long bursts; see the sketch below.
If you want to take I/O into account for the load, you can write to a file for each test in the sieve.
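A minimal sketch of that idea in plain C (busy_sieve is just an illustrative name; the limits are the rough orders of magnitude suggested above):

#include <stdlib.h>

/* Burn CPU for a controllable amount of time by sieving primes up to `limit`.
   Roughly: 10000 for a short burst, 100000000 for a long one. */
unsigned long busy_sieve(size_t limit)
{
    char *is_composite = calloc(limit + 1, 1);
    unsigned long count = 0;
    for (size_t i = 2; i <= limit; ++i) {
        if (!is_composite[i]) {
            ++count;                                  /* found a prime */
            for (size_t j = 2 * i; j <= limit; j += i)
                is_composite[j] = 1;                  /* mark multiples */
        }
    }
    free(is_composite);
    return count;   /* return something so the work is not optimized away */
}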

Can we benchmark how fast CUDA or OpenCL is compared to CPU performance?

How much faster can an algorithm on CUDA or OpenCL code run compared to a general single processor core? (considering the algorithm is written and optimized for both the CPU and GPU target).
I know it depends on both the graphics card and the CPU, but say, one of NVIDIA's fastest GPUs and (a single core of) an Intel i7 processor?
And I know it also depends on the type of algorithm.
I do not need a strict answer, but examples from experience, like: an image manipulation algorithm using double-precision floating point and 10 operations per pixel took 5 minutes before and now runs in x seconds using this hardware.
Your question is overly broad, and very difficult to answer. Moreover, only a small percentage of algorithms (the ones that work without much shared state) are feasible on GPUs.
But I do want to urge you to be critical about claims. I'm in image processing, and have read many an article on the subject, but quite often in the GPU case the time to upload input data to the GPU and download the results back to main memory is not included in the calculation of the speedup factor (a measurement sketch that includes them follows below).
While there are a few cases where this doesn't matter (both are small, or there is a second-stage calculation that further reduces the result in size), usually one does have to transfer the results and the initial data.
I've seen this turn a claimed plus into a negative, because the upload/download time alone was longer than the main CPU would require to do the calculation.
Pretty much the same thing applies to combining results of different GPU cards.
Update: Newer GPUs seem to be able to upload/download and calculate at the same time using ping-pong buffers. But the advice to check the border conditions thoroughly still stands. There is a lot of spin out there.
Update 2: Quite often, using a GPU that is shared with video output for this is not optimal. Consider e.g. adding a low-budget card for video, and using the onboard video for GPGPU tasks.
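As an illustration of where those transfers belong in a fair measurement, here is a minimal CUDA sketch (my_kernel, timed_gpu_run and the sizes are placeholder names, not anything from the answer above): the upload and download sit inside the timed region, so the result can be compared directly against the CPU time for the same input.

#include <cuda_runtime.h>

// Stand-in for the real computation.
__global__ void my_kernel(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

float timed_gpu_run(const float *h_in, float *h_out, int n)
{
    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);   // upload is counted
    my_kernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost); // download is counted
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms;   // compare against the CPU time for the same input
}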
I think that this video introduction to OpenCL gives a good answer to your question in the first or second episode (I do not remember). I think it was at the end of the first episode...
In general it depends on how well you can "parallelize" the problem. The problem size itself is also a factor, because it costs time to copy the data to the graphics card.
It depends very much on the algorithm and how efficient the implementation can be.
Overall, it's fair to say that GPUs are better at computation than CPUs. Thus, an upper bound is to divide the theoretical GFLOPS rating of a top-end GPU by that of a top-end CPU. You can do a similar computation for theoretical memory bandwidth.
For example, 1581.1 GFLOPS for a GTX 580 vs. 107.55 GFLOPS for an i7 980XE. Note that the rating for the GTX 580 is for single precision. I believe you need to cut that down by a factor of 4 for Fermi-class non-Tesla cards to get the double-precision rating. So in this instance, you might expect roughly 4x.
Caveats on why you might do better (or see results which claim far bigger speedups):
GPUs have better memory bandwidth than CPUs once the data is on the card. Sometimes, memory-bound algorithms can do well on the GPU.
Clever use of caches (texture memory etc.) can let you do better than the advertised bandwidth.
Like Marco says, the transfer time didn't get included. I personally always include such time in my work and thus have found that the biggest speedups I've seen to be in iterative algorithms where all the data fits on the GPU (I've gotten over 300x on a midrange CPU to midrange GPU here personally).
Apples-to-oranges comparisons. Comparing a top-end GPU vs. a low-end CPU is inherently unfair. The rebuttal is that a high-end CPU costs much more than a high-end GPU. Once you go to a GFLOPS/$ or GFLOPS/Watt comparison, it can look much more favorable to the GPU.
__kernel void vecAdd(__global float* results)
{
    // The body is deliberately left almost empty: the point is how cheaply
    // the GPU can spawn millions of threads, not the work they do.
    int id = get_global_id(0);
}
This kernel can spawn 16M threads on a new $60 R7-240 GPU in 10 milliseconds.
That is equivalent to 16 thread creations or context switches every 10 nanoseconds. What does a $140 FX-8150 8-core CPU manage? About 1 thread per 50 nanoseconds per core.
Every instruction added to this kernel is a win for the GPU, until it introduces branching.
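If you want to check the thread-spawn figure on your own hardware, here is a rough CUDA analogue of the OpenCL kernel above (a sketch; the grid size and the trivial kernel body are my own choices), timed with CUDA events:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void nearly_empty(float *results)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    results[id] = (float)id;   // one trivial instruction per thread
}

int main()
{
    const int n = 16 * 1024 * 1024;               // "16M threads"
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    nearly_empty<<<n / 512, 512>>>(d);            // n is a multiple of 512, no bounds check needed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("%d threads in %.3f ms\n", n, ms);     // compare with CPU thread-creation cost

    cudaFree(d);
    return 0;
}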
Your question is, in general, hard to answer; there are simply many different variables that make it hard to give answers that are either accurate or fair.
Notably, you are comparing both 1) choice of algorithm 2) relative performance of hardware 3) compiler optimisation ability 4) choice of implementation languages and 5) efficiency of algorithm implementation, all at the same time...
Note that, for example, different algorithms may be preferable on GPU vs CPU; and data transfers to and from GPU need to be accounted for in timings, too.
AMD has a case study (several, actually) in OpenCL performance for OpenCL code executing on the CPU and on the GPU. Here is one with performance results for sparse matrix vector multiply.
I've seen figures ranging from 2x to 400x. I also know that mid-range GPUs cannot compete with high-end CPUs in double-precision computation: MKL on an 8-core Xeon will be faster than CULA or CUBLAS on a $300 GPU.
OpenCL is anecdotally much slower than CUDA.
A new benchmark suite called SHOC (Scalable Heterogeneous Computing) from Oak Ridge National Lab and Georgia Tech has both OpenCL and CUDA implementations of many important kernels. You can download the suite from http://bit.ly/shocmarx. Enjoy.

What is the limit of optimization using SIMD?

I need to optimize some C code, which does lots of physics computations, using SIMD extensions on the SPE of the Cell Processor. Each vector operator can process 4 floats at the same time. So ideally I would expect a 4x speedup in the most optimistic case.
Do you think the use of vector operators could give bigger speedups?
Thanks
The best optimization occurs in rethinking the algorithm. Eliminate unnecessary steps. Find a more direct way of accomplishing the same result. Compute the solution in a domain more relevant to the problem.
For example, if the vector array is a list of n points that all lie on the same line, then it is sufficient to transform the end points only and interpolate the intermediate points.
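As a toy illustration of that example (the point type, transform2d and transform_segment are made-up names for this sketch, and it assumes the transform is affine so that lines map to lines):

typedef struct { float x, y; } point;

/* Stand-in for an expensive per-point transform (here a simple affine map). */
static point transform2d(point p)
{
    point q = { 2.0f * p.x + 1.0f, 2.0f * p.y - 3.0f };
    return q;
}

/* Fill out[0..n-1] with the transformed points of a segment whose
   untransformed end points are a and b. Only two transforms are needed;
   the interior points are linear interpolations, which is valid because
   an affine transform maps a line to a line and preserves ratios along it. */
void transform_segment(point a, point b, point *out, int n)
{
    point ta = transform2d(a);
    point tb = transform2d(b);
    for (int i = 0; i < n; ++i) {
        float t = (n > 1) ? (float)i / (float)(n - 1) : 0.0f;
        out[i].x = ta.x + t * (tb.x - ta.x);
        out[i].y = ta.y + t * (tb.y - ta.y);
    }
}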
It CAN give bigger speedups than 4x over straight floating point, as the SIMD instructions could be less exact (not so much as to cause too many problems, though) and so take fewer cycles to execute. It really depends.
The best plan is to learn as much as possible about the processor you are optimising for. You may find it can give you far better than 4x improvements. You may find it can't. We can't say, though, without knowing more about the algorithm you are optimising and what CPU you are targeting.
On their own, no. But if the process of re-writing your algorithms to support them also happens to improve, say, cache locality or branching behaviour, then you could find unrelated speed-ups. However, this is true of any re-write...
This is entirely possible.
You can do more clever instruction-level micro optimizations than a compiler, if you know what you're doing.
Most SIMD instruction sets offer several powerful operations that don't have any equivalent in normal scalar FPU/ALU code (e.g. PAVG/PMIN etc. in SSE2). Even if these don't fit your problem exactly, you can often combine these instructions to great effect.
Not sure about Cell, but most SIMD instruction sets have features to optimize memory access, for example to prefetch data into cache. I've had very good results with these.
Now this isn't Cell or PPC at all, but a simple image convolution filter of mine got a 20x speedup (C vs. SSE2) on Atom, which is higher than the degree of parallelism (16 pixels at a time).
It depends on the architecture. For the moment I assume x86 (aka SSE).
You can easily get a factor of four on tight loops. Just replace your existing math with SSE instructions and you're done.
You can even get a little more than that, because if you use SSE you do the math in registers which are usually not used by the compiler. This frees up general-purpose registers for other tasks such as loop control and address calculation. In short, the code that surrounds the SSE instructions will be more compact and execute faster.
And then there is the option of hinting to the memory controller how you want to access the memory, e.g. whether you want to store data in a way that bypasses the cache or not. For bandwidth-hungry algorithms that may give you some extra speed on top of that.
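A sketch of the kind of tight loop meant here, using SSE intrinsics in plain C (my own example, not from the answer): it assumes n is a multiple of 4 and the arrays are 16-byte aligned, and uses _mm_stream_ps as the cache-bypassing store hint mentioned above.

#include <xmmintrin.h>

/* out[i] = a[i] * b[i] + c, four floats per iteration. */
void madd4(const float *a, const float *b, float *out, float c, int n)
{
    __m128 vc = _mm_set1_ps(c);
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(a + i);      /* aligned loads */
        __m128 vb = _mm_load_ps(b + i);
        __m128 vr = _mm_add_ps(_mm_mul_ps(va, vb), vc);
        _mm_stream_ps(out + i, vr);          /* non-temporal store: bypass the cache */
    }
    _mm_sfence();                            /* make the streamed stores visible */
}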
