I recently wrote some software that performs a lot of calculations.
The calculations are done in levels, and within each level they are independent of one another. That is, logically I can run them in parallel, as none of them relies on the result of the others.
My question is: Is there a general C library for performing parallel math (matrix) operations on the GPU that works on all platforms (Windows/Unix/etc.)?
When I say general, I mean a library that would work with any modern GPU.
C is a single-threaded language. To assign your calculations to run in parallel on a GPU, for something like, oh I don't know, Bitcoin mining, you should consider using the C++ AMP library. I'm not too sure about the whole compile-to-run-on-a-GPU-core thing, but I know that this will certainly perform your calculations in parallel.
... offloading Bitcoin miners? ;)
Well, if you can compile your source directly on the host machine, then you can branch on the platform with the preprocessor:
#ifdef _WIN32
/* Windows-specific code */
#endif
That will solve that. Going back to your parallel computing problem: the CUDA API (cuda.h) will do your calculations on the GPU, but it will most likely require some serious rejigging if, as you say, you have already written the calculation part of your project.
Following my former post about comparing the time required to do a simple array-addition job (C[i] = A[i] + B[i]) on different devices, I improved the code a little to repeat the process for different array lengths and report the time required:
The x axis is the array length on a log2 scale and the y axis is the time on a log10 scale. As can be seen, somewhere between 2^13 and 2^14 the GPUs become faster than the CPU. I guess this is because the memory allocation becomes negligible in comparison to the calculation. (GPI1 in the plot is a typo; I meant GPU1.)
Now, hoping that my C-OpenCL code is correct, I can estimate the time required to do an array addition on each device: f1(n) for the CPU, f2(n) for the first GPU, and f3(n) for the second GPU. If I have an array job of length n, I should theoretically be able to divide it into three parts, n1+n2+n3=n, chosen so that f1(n1)=f2(n2)=f3(n3), and distribute them across the three devices on my system to get the fastest possible calculation. I think I could do it using, let's say, OpenMP or some other multithreading method, and use the cores of my CPU to host three different OpenCL tasks. That's not what I'd like to do, because:
It is a waste of resources. Two of the cores would just be hosting when they could be used for the calculation.
It makes the code more complicated.
I'm not sure how to do it. I'm currently using the Apple Clang compiler with -framework OpenCL to compile the code, but for OpenMP I would have to use the GNU compiler. I don't know how to use both OpenMP and OpenCL with one of these compilers.
Now I'm wondering whether there is any way to do this distribution without multithreading. For example, one of the CPU cores could assign the subtasks to the three devices one after another and then collect the results in the same (or a different) order and concatenate them. It would probably take a little experimenting to tune the timing of the subtask assignments, but I guess it should be possible.
I'm a total beginner to OpenCL, so I would appreciate it if you could help me work out whether this is possible and how to do it. If there are already some examples doing this, please let me know. Thanks in advance.
P.S. I have also posted this question here and here on Reddit.
The problem as stated implicitly tells you that the solution should be concurrent (asynchronous): you need to be collecting results from the three devices at the same time. Otherwise you will run the process first on device A, then on device B, and then on device C, in which case it would be better to just run a single process on the fastest device. If you plan to learn to exploit OpenCL programming efficiently (on multicore CPUs or GPUs), you should get comfortable with asynchronous (and indeed multithreaded) programming.
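For illustration, here is a minimal sketch in C of what the asker's single-host-thread idea looks like with non-blocking OpenCL calls. The helper name run_chunks and the arrays off/len are hypothetical, and it assumes the context, the three command queues, the kernels (with arguments already set), and the buffers were all created during the usual setup:

#include <stddef.h>
#ifdef __APPLE__
#include <OpenCL/opencl.h>
#else
#include <CL/cl.h>
#endif

/* Hypothetical helper: one host thread drives all three devices at once. */
void run_chunks(cl_command_queue q[3], cl_kernel k[3], cl_mem out[3],
                float *c_host, const size_t off[3], const size_t len[3])
{
    cl_event done[3];
    for (int d = 0; d < 3; d++) {
        clEnqueueNDRangeKernel(q[d], k[d], 1, NULL, &len[d], NULL,
                               0, NULL, NULL);
        clEnqueueReadBuffer(q[d], out[d], CL_FALSE, 0,  /* CL_FALSE = async */
                            len[d] * sizeof(float), c_host + off[d],
                            0, NULL, &done[d]);
        clFlush(q[d]);  /* start this device before touching the next one */
    }
    clWaitForEvents(3, done);  /* all three chunks are now in c_host */
}

Because the reads are enqueued non-blocking and only waited on once at the end, all three devices compute their chunks concurrently even though a single host thread issued the commands, so no OpenMP hosting threads are needed.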
I've never used MPI before or taken a formal course on parallel programming. I'm an applied-math student working on a large project that consists of a series of for loops. In each for loop, the iterations are completely independent of the others, so I'm pretty sure this can easily be parallelized.
I tried using OpenMP first, but I get an error that a file can't be found (I'm a Mac user), and I'm not really sure how to fix that.
So is this a simple task to do in MPI? For some reason Google comes up short on answers here.
If the iterations are independent, then OpenMP is the simplest way to go. On Mac OS X, you need to install gcc to compile with OpenMP, because the clang compiler does not support OpenMP (for now). You can easily install a precompiled version of gcc from here:
http://hpc.sourceforge.net
You can also use MPI, of course, but that is going to be much more difficult for you (installation aside), given that you are not trained in it. If you want to stick with MPI, parallelizing a loop of independent iterations requires computing, for each process, its initial and final iteration, managing data structures appropriately (remember that in MPI memory is not shared among processes), etc.
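To give a flavor of the MPI route, here is a minimal C sketch of that block-splitting; the iteration count n and the empty loop body are placeholders:

#include <mpi.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 100000;                     /* total iterations (placeholder) */
    int chunk = (n + size - 1) / size;  /* iterations per process, rounded up */
    int first = rank * chunk;
    int last  = first + chunk < n ? first + chunk : n;

    for (int i = first; i < last; i++) {
        /* each process works only on its own block of independent iterations */
    }

    MPI_Finalize();
    return 0;
}

Each process computes only its own [first, last) block; you would launch it with something like mpirun -np 4 ./a.out.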
If you need OpenMP on Mac OS X, you can try the unofficial clang with OpenMP support. The source code is available here: http://clang-omp.github.com. You also need an OpenMP runtime library, which is available here: http://openmp.llvm.org/.
I'm writing a C program that looks like this:
for (int i = 0; i < n; i++) {
    char name[32];
    snprintf(name, sizeof(name), "%d.txt", i);  /* makes a new file for i: i.txt */
    FILE *f = fopen(name, "w");
    double result = 0.0;
    /* ... runs a long and intensive computation producing result ... */
    fprintf(f, "%f\n", result);                 /* writes to i.txt */
    fclose(f);                                  /* closes i.txt */
}
where n is some large number. Obviously, several iterations could be run in parallel at once, as they do not depend on each other. My processor has eight cores, so it seems that I would want my program to distribute the iterations across the eight cores so that it runs as quickly as possible. This means I would want the program to be multithreaded.
My question is whether I would need to multithread this process manually in C, or whether the multithreading is done automatically at some level. I'm compiling with GCC, and I know that GCC optimizes well, but I don't know whether it automatically multithreads. I also don't know whether my OS (Debian Linux) will automatically distribute the program across all eight cores of my processor. If not, is there a way I can get it to multithread without having to actually write multithreading code in C?
You can multithread it manually if you want, or you can use libraries and tools that take over as much or as little of the responsibility as you like. OpenMP is definitely worth looking at.
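For example, a minimal OpenMP sketch of the loop above might look like this; heavy_computation is a hypothetical stand-in for the long calculation, and you compile with gcc -fopenmp:

#include <stdio.h>

/* hypothetical stand-in for the long, intensive computation */
double heavy_computation(int i) { return (double)i; }

void run_all(int n)
{
    /* each iteration touches only its own file, so they can run in any order */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        char name[32];
        snprintf(name, sizeof(name), "%d.txt", i);
        FILE *f = fopen(name, "w");
        fprintf(f, "%f\n", heavy_computation(i));
        fclose(f);
    }
}

By default OpenMP splits the iterations across however many cores the machine has, eight in your case, and the single pragma is the only change to the original loop.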
The simple answer is no: GCC does no automatic multithreading in itself. Multithreading is defined in layers strictly above C, so GCC doing that would be kind of a layering violation.
That being said, there are lots of libraries that make the job easier (not that manually multithreading your particular case is a lot of work to begin with), and some of them integrate more tightly at the language level. OpenMP has been mentioned by others.
You may also consider the new additions to the C11 standard regarding multithreaded applications (the optional <threads.h> header).
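A minimal sketch with C11 threads, assuming your libc ships the optional <threads.h> (glibc 2.28 and later does); the chunk-per-thread split is illustrative:

#include <stdio.h>
#include <threads.h>

/* each thread handles one chunk of the independent iterations */
int work(void *arg)
{
    int id = *(int *)arg;
    printf("chunk %d running in its own thread\n", id);
    return 0;
}

int main(void)
{
    enum { NCORES = 8 };
    thrd_t t[NCORES];
    int id[NCORES];
    for (int i = 0; i < NCORES; i++) {
        id[i] = i;
        thrd_create(&t[i], work, &id[i]);  /* one thread per core */
    }
    for (int i = 0; i < NCORES; i++)
        thrd_join(t[i], NULL);             /* wait for all chunks to finish */
    return 0;
}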
Has anyone taken advantage of the automatic vectorization that gcc can do? In the real world (as opposed to example code)? Does it take restructuring of existing code to take advantage? Are there a significant number of cases in any production code that can be vectorized this way?
I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given algorithms that can be vectorized (and were, after I manually rewrote them using SSE intrinsics).
Part of this is conservatism: especially when faced with possible pointer aliasing, it can be very difficult for a C/C++ compiler to 'prove' to itself that a vectorization would be safe, even if you as the programmer know that it is. Most compilers (sensibly) prefer not to optimize code rather than risk miscompiling it. This is one area where higher-level languages have a real advantage over C, at least in theory (I say in theory because I'm not actually aware of any automatically vectorizing ML or Haskell compilers).
Another part of it is simply analytical limitations: most research in vectorization, as I understand it, relates to optimizing classical numerical problems (fluid dynamics, say), which were the bread and butter of most vector machines until a few years ago (when, between CUDA/OpenCL, AltiVec/SSE, and the STI Cell, vector programming in various forms became widely available in commercial systems).
It's fairly unlikely that code written with a scalar processor in mind will be easy for a compiler to vectorize. Happily, many of the things you can do to make it easier for a compiler to see how to vectorize, like loop tiling and partial loop unrolling, also (tend to) help performance on modern processors even if the compiler doesn't figure out how to vectorize.
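As an illustration of the aliasing point, this is the kind of loop GCC's vectorizer does handle, assuming you compile with -O3 (or -O2 -ftree-vectorize):

/* restrict promises the arrays don't overlap, which removes the
   aliasing question that usually stops the vectorizer cold */
void add(float * restrict c, const float * restrict a,
         const float * restrict b, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];   /* compiles to packed SSE/AVX adds */
}

Drop the restrict qualifiers and the compiler has to assume c might overlap a or b, and it will often emit scalar code or a runtime overlap check instead.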
It is hard to use in most business logic, but it gives speedups when you are processing volumes of data in the same way.
A good example is sound/video processing, where you apply the same operation to every sample/pixel.
I have used VisualDSP for this, and you had to check the results after compiling to see whether the vectorization was really used where it should be.
Vectorized instructions are not limited to Cell processors; most modern workstation-class CPUs have them (PPC, x86 since the Pentium 3, SPARC, etc.). When used well for floating-point operations, they can help quite a lot with very computation-intensive tasks (filters, etc.). In my experience, automatic vectorization does not work so well.
You may have noticed that pretty much no one actually knows how to make good use of GCC's automatic vectorization. If you search around the web for people's comments, it always comes down to the same idea: GCC allows you to enable automatic vectorization, but it extremely rarely makes actual use of it, so if you want to use SIMD acceleration (e.g. MMX, SSE, AVX, NEON, AltiVec), you basically have to figure out how to write it using compiler intrinsics or assembly-language code.
But the problem with intrinsics is that you effectively need to understand the assembly-language side of it and then also learn the intrinsics way of describing what you want. This is likely to result in much less efficient code than if you wrote it in assembly (by a factor of 10x, say), because the compiler is still going to have trouble making good use of your intrinsic instructions!
For example, you might be using SIMD intrinsics so that many operations can be performed in parallel at the same time, but your compiler will probably generate assembly code that transfers the data between the SIMD registers and the normal CPU registers and back, effectively making your SIMD code run at a similar speed to (or even slower than) normal code!
So basically:
If you want up to 100% speedups (2x speed), then either buy the official Intel/ARM compilers or convert some of your code to use SIMD C/C++ intrinsics.
If you want 1000% speedups (10x speed), then write it in assembly code using SIMD instructions by hand. Or, if available on your hardware, use GPU acceleration instead, such as OpenCL or Nvidia's CUDA SDK, since they can provide speedups in the GPU similar to what SIMD provides in the CPU.
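For reference, here is what the intrinsics route looks like for a simple case: a sketch that adds two float arrays four elements at a time with SSE. It assumes n is a multiple of 4; a real version would also handle the leftover tail:

#include <xmmintrin.h>   /* SSE intrinsics */

void add4(float *c, const float *a, const float *b, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);           /* load 4 floats, unaligned */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));  /* 4 additions in one go */
    }
}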
I have a lot of nice MATLAB code that runs too slowly and would be a pain to rewrite in C. The MATLAB compiler for C does not seem to help much, if at all. Should it be speeding execution up more? Am I screwed?
If you are using the MATLAB Compiler (on a recent version of MATLAB), then you will almost certainly not see any speedups at all. This is because all the compiler actually does is give you a way of packaging up your code so that it can be distributed to people who don't have MATLAB. It doesn't convert your code to anything faster (such as machine code or C); it merely wraps it in C so you can call it.
It does this by getting your code to run on the MATLAB Compiler Runtime (MCR), which is essentially the MATLAB computational kernel; your code is still being interpreted. Because of the penalty incurred by having to invoke the MCR, you may find that compiled code runs more slowly than if you simply ran it in MATLAB.
Put another way: you might say that the compiler doesn't actually compile, in the traditional sense of the word at least.
Older versions of the compiler worked differently, and speedups could occur in certain situations. For MathWorks' take on this, go to
http://www.mathworks.com/support/solutions/data/1-1ARNS.html
In my experience, slow MATLAB code usually comes from not vectorizing your code (i.e., writing for-loops instead of just multiplying arrays (simple example)).
If you are doing file I/O, watch out for code that reads the data in one piece at a time. Look in the help files for the vectorized version of fscanf.
Don't forget that MATLAB includes a profiler, too!
I'll echo what dwj said: if your MATLAB code is slow, this is probably because it is not sufficiently vectorized. If you're doing explicit loops when you could be doing operations on whole arrays, that's the culprit.
This applies equally to all array-oriented dynamic languages: Perl Data Language, Numeric Python, MATLAB/Octave, etc. It's even true, to some extent, of compiled C and Fortran code: specially designed vectorization libraries generally use carefully hand-coded inner loops and SIMD instructions (e.g. MMX, SSE, AltiVec).
First, I second all the above comments about profiling and vectorizing.
For a historical perspective...
Older versions of Matlab allowed the user to convert m-files to MEX functions by pre-parsing the m-code and converting it to a set of Matlab library calls. These calls had all the error checking that the interpreter did, but old versions of the interpreter and/or on-the-fly parser were slow, so compiling the m-file would sometimes help. It usually helped when you had loops, because Matlab was smart enough to inline some of that in C. If you have one of those versions of Matlab, you can try telling the mex script to save the .c file, and you can see exactly what it's doing.
In more recent versions (probably 2006a and later, but I don't remember exactly), MathWorks started using a just-in-time compiler in the interpreter. In effect, this JIT does automatically what compiling to a MEX function used to do, so doing it explicitly offline doesn't help at all. In each version since then, they've also put a lot of effort into making the interpreter much faster. I believe newer versions of Matlab don't even let you compile m-files to MEX files, because it doesn't make sense any more.
The MATLAB compiler wraps up your m-code and dispatches it to the MATLAB runtime, so the performance you see in MATLAB should be the performance you see with the compiler.
Per the other answers, vectorizing your code is helpful. But the MATLAB JIT is pretty good these days, and lots of things perform roughly as well vectorized or not. That's not to say there aren't performance benefits to be gained from vectorization; it's just not the magic bullet it once was. The only way to really tell is to use the profiler to find out where your code has bottlenecks. Often there are a few places where you can do local refactoring to really improve the performance of your code.
There are a couple of hardware-related approaches you can take to performance as well. First, much of the linear-algebra subsystem is multithreaded; make sure you have enabled that in your preferences if you are working on a multi-core or multi-processor platform. Second, you may be able to use the Parallel Computing Toolbox to take more advantage of multiple processors. Finally, if you are a Simulink user, you may be able to use emlmex to compile m-code into C. This is particularly effective for fixed-point work.
Have you tried profiling your code? You don't need to vectorize ALL your code, just the functions that dominate running time. The MATLAB profiler will give you some hints on where your code is spending the most time.
There are many other tips in the Tips for Improving Performance section of the MathWorks manual; it is worth reading up on them.
mcc won't speed up your code at all; it's not really a compiler.
Before you give up, you need to run the profiler and figure out where all your time is going (Tools->Open Profiler). Also, judicious use of "tic" and "toc" can help. Don't optimize your code until you know where the time is going (don't try to guess).
Keep in mind that in matlab:
bit-level operations are really slow
file I/O is slow
loops are generally slow, but vectorizing is fast (if you don't know the vector syntax, learn it)
core operations are really fast (e.g. matrix multiply, fft)
if you think you can do something faster in C/Fortran/etc, you can write a MEX file
there are commercial solutions to convert matlab to C (google "matlab to c") and they work
You could port your code to Embedded MATLAB and then use Real-Time Workshop to translate it to C.
Embedded MATLAB is a subset of MATLAB. It does not support cell arrays, graphics, matrices of dynamic size, or some matrix addressing modes. Porting to Embedded MATLAB may take considerable effort.
Real-Time Workshop is at the core of the code-generation products. It emits generic C, or can optimize for a range of embedded platforms. Most interesting to you is perhaps the xPC Target, which treats general-purpose hardware as an embedded target.
I would vote for profiling, and then looking at what the bottlenecks are.
If the bottleneck is matrix math, you're probably not going to do any better... EXCEPT one big gotcha is array allocation. e.g. if you have a loop:
s = [];
for i = 1:50000
    s(i) = 3;
end
This has to keep resizing the array; it's much faster to preallocate the array (start with zeros or NaN) and fill it in from there:
s = zeros(50000,1);
for i = 1:50000
    s(i) = 3;
end
If the bottleneck is repeated executions of a lot of function calls, that's a tough one.
If the bottleneck is stuff that MATLAB doesn't do quickly (certain types of parsing, XML, things like that), then I would use Java, since MATLAB already runs on a JVM and it interfaces really easily with arbitrary JAR files. I looked at interfacing with C/C++, and it's REALLY ugly. Microsoft COM is OK (on Windows only), but after learning Java I don't think I'll ever go back to that.
As others have noted, slow Matlab code is often the result of insufficient vectorization.
However, sometimes even perfectly vectorized code is slow. Then, you have several more options:
See if there are any libraries / toolboxes you can use. These were usually written to be very optimized.
Profile your code, find the tight spots and rewrite those in plain C. Connecting C code (as DLLs for instance) to Matlab is easy and is covered in the documentation.
By 'Matlab compiler' you probably mean the command mcc, which speeds the code up a little by circumventing the Matlab interpreter. What would speed the Matlab code up significantly (by a factor of 50-200) is the use of actual C code compiled with the mex command.
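For instance, a minimal MEX gateway in C might look like this; the file/function name times2 and its doubling behavior are purely illustrative:

#include "mex.h"

/* times2.c: y = times2(x) returns 2*x elementwise (illustrative example) */
void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    if (nrhs != 1 || !mxIsDouble(prhs[0]))
        mexErrMsgTxt("expected one double array");
    mwSize n = mxGetNumberOfElements(prhs[0]);
    plhs[0] = mxCreateDoubleMatrix(1, n, mxREAL);
    double *x = mxGetPr(prhs[0]);
    double *y = mxGetPr(plhs[0]);
    for (mwSize i = 0; i < n; i++)
        y[i] = 2.0 * x[i];   /* the hot loop you would hand-optimize in C */
}

Compile it from the Matlab prompt with mex times2.c and call it like any other function: y = times2(x).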