Pure C OpenCL vs Python OpenCL performance

I am looking for a performance comparison between the Python wrapper for OpenCL (PyOpenCL) and pure C OpenCL. Performance can be measured in terms of time, memory, etc.
- Are there any benchmarks available?
- What should be the expectation about the time performance differences?
- What kinds of tasks (parallel, of course...) would show a difference?

It is likely that PyOpenCL is your best choice. I would choose to use C only in very specific situations (a super-critical need for speed/low-latency on the host). For most casual parallel programs, it is fine for the host side to have plenty of slack, because all the real work gets done on the device.
You can consider PyOpenCL and OpenCL to have identical performance on the device.
Maybe use C if you are, like... designing a self-driving car, and every millisecond/amp matters. But even in that situation, it is likely that Python could be used effectively.
The best way to figure out whether your specific program is slowed down is to time your code. For PyOpenCL that means host-side wall-clock timing with:
import time
and device-side event profiling by creating the command queue with:
cl.command_queue_properties.PROFILING_ENABLE
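A minimal sketch of both approaches, assuming a trivial vector-add kernel (the kernel and array sizes here are made up for illustration):

    import time
    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    src = """
    __kernel void add(__global const float *a,
                      __global const float *b,
                      __global float *out)
    {
        int i = get_global_id(0);
        out[i] = a[i] + b[i];
    }
    """
    prg = cl.Program(ctx, src).build()

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.empty_like(a)

    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    b_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, out.nbytes)

    t0 = time.time()                           # host-side wall clock
    evt = prg.add(queue, (n,), None, a_buf, b_buf, out_buf)
    evt.wait()
    t1 = time.time()

    # Device-side timing from the profiling event (reported in nanoseconds).
    kernel_s = (evt.profile.end - evt.profile.start) * 1e-9
    print("wall clock: %.6f s, kernel only: %.6f s" % (t1 - t0, kernel_s))

    cl.enqueue_copy(queue, out, out_buf)       # sanity check of the result
    print("max error:", np.max(np.abs(out - (a + b))))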
Many smart companies and individuals choose to code first in Python, because they can build a flexible, working prototype quickly. If they end up needing more host performance later, it is relatively easy to port Python to C.
Hope that helps!

OpenCL compiles programs, so-called "kernels", which are then sent to the device for execution. This means the main cost to measure is the OpenCL implementation's API I/O, i.e. the host-side overhead of driving the device. You therefore can't rely on memory/CPU measurements alone, as the real OpenCL work will use the same amounts of those either way.
AFAIK, no benchmarks are available, but it is not hard to write one if you need it (matrix multiplication is the usual hello-world example).
OpenCL is not the kind of API you call on every CPU cycle. Its field of use is really big data processing: one big input, a lot of processing operations, and one output (no matter how small or big). No one says OpenCL can't be used for workloads with many I/O calls and only small amounts of calculation, but then the implementation's API overhead isn't worth it.
The expectation should be that the API I/O cost is roughly the same in both cases relative to overall application performance.
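As a rough illustration of measuring that API/transfer cost in isolation, here is a small PyOpenCL sketch (the array size is arbitrary) that times a single host-to-device copy using PyOpenCL's event profiling:

    import numpy as np
    import pyopencl as cl

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(
        ctx, properties=cl.command_queue_properties.PROFILING_ENABLE)

    host_data = np.random.rand(1 << 24).astype(np.float32)
    dev_buf = cl.Buffer(ctx, cl.mem_flags.READ_WRITE, host_data.nbytes)

    # Time just the host -> device transfer via its profiling event.
    evt = cl.enqueue_copy(queue, dev_buf, host_data)
    evt.wait()
    transfer_s = (evt.profile.end - evt.profile.start) * 1e-9
    print("transfer: %.4f s (%.1f GB/s)"
          % (transfer_s, host_data.nbytes / transfer_s / 1e9))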

There is a benchmark here: https://github.com/bennylp/saxpy-benchmark, comparing PyOpenCL against OpenCL as well as other frameworks/methods such as CUDA, plain C++, Numpy, R, Octave, and even TensorFlow (disclaimer: I'm the author).
According to the benchmark results, the performance difference between OpenCL and PyOpenCL varies quite wildly. The PyOpenCL GPU target is almost 7x slower than OpenCL, but the PyOpenCL CPU target is actually more than 2x faster than OpenCL!

Related

Do processors have optimizations and architecture preferences targeted firstly or mainly to C/C++ languages?

I have read the article "C Is Not a Low-level Language", which contains this paragraph:
Unfortunately, simple translation providing fast code is not true for C. In spite of the heroic efforts that processor architects invest in trying to design chips that can run C code fast, the levels of performance expected by C programmers are achieved only as a result of incredibly complex compiler transforms. The Clang compiler, including the relevant parts of LLVM, is around 2 million lines of code. Even just counting the analysis and transform passes required to make C run quickly adds up to almost 200,000 lines (excluding comments and blank lines).
What does the bolded sentence mean? Does it mean that manufacturers design processors with some optimizations and architecture decisions targeted firstly or even specifically at C (C++) code? Or does it just mean that they are trying to design processors that execute any code faster, including code written in C?
If such preferences for C exist, what are they?
A couple of thoughts of my own:
- a branch prediction algorithm tuned to patterns that occur mainly in C code;
- instructions which are useful in C but not in other languages (otherwise other languages/compilers would use them too).
I know about language-specific processors like Jazelle and Lisp machines (for Java and Lisp respectively), but similar technology can't be applied to C, because there is no bytecode.
Processors don't necessarily have optimizations targeted at C, but they do provide features to make C (and other procedural languages in general) map more cleanly to the platform.
Take cache-coherency in a multi-threaded environment as an example. From a C perspective, a global variable shared by two threads should look the same to both threads. If one thread writes to it the other should be able to see those modifications. But in a multi-core CPU with independent caches, that takes extra effort to support. Core 1 has to be able to detect that core 2 is accessing an address it has modified in cache and flush that out to memory (or somehow share it directly to core 2's cache).
That's essentially the thesis of that entire article. C's abstract machine model doesn't necessarily map cleanly to real modern high-performance processors like it did to the (by comparison extremely simple) PDP-11, and CPUs and compilers have to take great pains to paper over those differences.
The "heroic efforts" of the processor architects is largely referring to the design of cache and memory subsystems on the CPUs.
For a very long time now, the instruction execution circuits inside CPUs have been far, far quicker than the electronics that looks after fetching/writing data from/to memory, largely because the technology we have for RAM chips hasn't really got better. Where the cores have sped up, the memory hasn't, and so the cache and memory subsystem has had to become ever more elaborate in order to pre-fetch data and move it towards the execution circuits ahead of time. Needless to say, this doesn't always pan out well.
It's also partly because of the physical distance between the CPU and the RAM chips. Though only a few inches (if that) of track on a motherboard, that distance is significant; the speed of a signal down the track is roughly 1 ns every 8 inches. For signals clocked in the GHz range (a cycle is 1 ns or less), a short track is a long way. This is partly why Apple have gone down the route of putting RAM onto the same package as the CPU in their home-grown M1 silicon.
Back to caches - the likes of Intel (and AMD, ARM) have strived to make CPUs that have good, general purpose performance, so that they run pretty much any code well. Modern compilers help a lot - if they know what the cache in the CPU is likely to do in any particular circumstance, the compilers can arrange code to fit in with what the hardware is likely to do.
A reasonable question then is: is that effective? Well, yes and no. Yes, because compiled code does run quite well; but no, for a couple of reasons. The first is that the ultimate performance for any given algorithm is rarely reached by the compiler/CPU, and the second is that all this complexity makes it nigh on impossible for a good programmer to do their own optimisation.
Some CPUs help out the hero-programmer here. PowerPC (at least some variants) has instructions where the programmer can give the cache system a hint that the programme will shortly need data from such-and-such a location in RAM. The CPU uses that instruction to pre-load the L1 cache with that data, so that when the program actually starts to perform operations on data at that address it's already in cache.
The IBM Cell processor took this to a whole new level. The SPE math cores (there were 8 of them) had no cache, and no way of addressing data in CPU RAM at all. What there was instead was 256K of static RAM per core into which all code and data had to fit, and a way for code to push code and data in and out of that static RAM very quickly (256 Gbyte/sec at the time, which was very, very quick). The developer was completely on their own; they had to write code to load code and data into a core, get that executed, and then write more code to get the results out to wherever. This was actually pretty liberating; instead of having a cache and memory subsystem trying to automatically deliver data to the execution cores, getting in the way or (worse) just hiding inefficiencies from you, one had the freedom to break down an algorithm into core-sized lumps knowing that if it fitted, it'd be very quick, or knowing for sure that it didn't fit.
Miles Budnek's answer addresses the issues that arise from multi-core CPUs with cache coherency in a Symmetric Multi-Processing (SMP) environment. It's even harder for the cache designer to get it right if there are multiple cores that might very well start tampering with a value. The difficulties involved have led to vulnerabilities like Meltdown and Spectre.
SMP could be said to be an "optimisation" put into CPUs by designers to aid the C (or other) developer in transitioning code from single to multiple threads. It's an attractive thought: in the same way that a single-threaded programme can see all of its data merely by addressing it, why not extend the same visibility of data to all threads in the programme?
Turns out that this is what makes it very difficult to design modern CPUs. However the reasons why the industry went this way are plain enough - the smallest possible delta between single and multicore CPUs was going to be the least troublesome for the existing software community to adopt. That's perfectly reasonable.
But it is running out of steam, fast. A better approach (if the goal is the outright pursuit of performance) would be to go back to the old Transputer architectures from Inmos in the 1980s and early 1990s. In such architectures, data held by one core could only be processed by another if the software was written to explicitly transfer the data. Sounds familiar? Yes - the Cell processor was a bit like that.
Interestingly, languages such as Rust, Go and Erlang have all implemented Communicating Sequential Processes (CSP) as a parallel processing paradigm. The irony is that, these days, CSP has to be implemented on top of an SMP environment, which is itself an artificial construct brought about by the interconnect between CPUs, cores and memory (e.g. QPI, HyperTransport). Basically, if the whole software world got fully comfortable with CSP, then CPU designers wouldn't have to design cache coherency into their multi-core CPUs. Rust in particular is very well suited, as it already has a strong concept of data ownership in its syntax (which could be leveraged to shovel data around between cores automatically).
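As a rough, hedged illustration of the CSP style (in Python rather than any of the languages above, using the standard multiprocessing module): nothing is shared between the workers, data is explicitly sent over channels.

    from multiprocessing import Process, Queue

    def worker(inbox: Queue, outbox: Queue) -> None:
        # The worker only ever sees data that was explicitly sent to it;
        # nothing is shared, so no cache-coherency reasoning is needed here.
        for chunk in iter(inbox.get, None):      # None is the stop sentinel
            outbox.put(sum(x * x for x in chunk))

    if __name__ == "__main__":
        inbox, outbox = Queue(), Queue()
        p = Process(target=worker, args=(inbox, outbox))
        p.start()

        chunks = [list(range(i, i + 1000)) for i in range(0, 10000, 1000)]
        for chunk in chunks:
            inbox.put(chunk)                     # explicit transfer to the worker
        inbox.put(None)

        total = sum(outbox.get() for _ in chunks)
        p.join()
        print("sum of squares:", total)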
The article referred to by the OP seems to me to have it in for C for some reason. There were so many points in it I felt triggered by, but I don't want to go addressing each one point by point. Maybe there is some bias or special interest that has not been declared. As a C programmer, with a particular interest in writing high performance programs, I thought I'd give my two cents on some of the issues raised. Hopefully, this might be of interest to others in the industry with or without a programming background.
From my point of view, the strengths of C are mainly as follows:
- C allows you to do things you just can't do in 'higher level' languages.
- A well-written C program (see weakness no. 1) is hard to beat on performance by a program written in another language on the same hardware.
- C is comfortable handling binary data, allowing for memory conservation.
- C is well established, with lots of libraries and programmers.
- Objects in memory can be made easy to process from anywhere in the program by using pointers, so the data itself doesn't need to be passed around.
- Multi-threaded and multi-process programs are quite easy to implement.
- It has read-write shared memory between threads (and between processes, with some fancy low-level stuff?).
- Assembly can be inlined where needed (though it's not C then, I know!).
... and the main weaknesses:
- Utilising SIMD capabilities is not possible in standard C, and difficult to implement in a portable way with intrinsics.
- It takes a lot of code to do simple things for which there are no library functions.
- Buffer overflow potential is easily missed, even for experienced programmers.
- C pointers can be confusing.
The C programming language has a special place in the evolution of programming languages, and I, for one, would welcome a replacement that is a better fit for what is possible with modern hardware, provided it doesn't tie the hands of the programmer and offers better security and performance. From the article:
'A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.'
Such things already exist: GPUs! Modern CPUs are much more like GPUs than they used to be, now that core counts can be 100+. I have used OpenCL C to write programs with amazing computational performance, but they can't do everything well. Some applications cannot be efficiently parallelised, if at all. OpenCL C program performance can become terrible when there is even a small amount of branching. Also, it is so much easier to exhaust your memory bandwidth and fast cache when running many threads, so it might be judged not worth the added complexity over a good single-threaded implementation.
In OpenCL C, the programmer has somewhat more control over where data is stored in memory, which can definitely aid performance (a small sketch after the list below shows these memory spaces in use). Maybe it's a costly mistake to try to make programming languages too hardware-independent. Might it be better to review some (LLVM-like) intermediate standard, as in OpenCL C, where one can define 'private', 'local' and 'constant' memory objects to get performance improvements over 'global' memory objects? Such a standard wouldn't need to be tied to an instruction set. As a programmer, I welcome fast CPU instructions, but it would be nice if they could be much more easily utilised in portable code AND compilable to portable binaries. Maybe this is something compiler writers could look into, along with using SIMD vector registers rather than memory for pushing and popping. As I see it, there are four levels of portability:
1. Hardware-independent source code to run on any hardware conforming to the intermediate standard. The burden is on the compiler to create binaries that will run correctly and efficiently on any hardware conforming to the intermediate standard.
2. Hardware-independent source code to run on any hardware conforming to the intermediate standard. The burden is on the host compiler to create binaries that will run on the host's hardware configuration conforming to the intermediate standard, but may not run on other hardware conforming to the same.
3. Hardware-dependent source code, where the logical execution path through the source depends on the architecture of the hardware on which it is run. Programs need to 'query' the hardware configuration.
4. Hardware-specific source code.
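To make the memory-space point above concrete, here is a small, hedged PyOpenCL sketch (the kernel and sizes are made up for illustration) in which each work-group stages its slice of a __global buffer in fast __local memory before reducing it:

    import numpy as np
    import pyopencl as cl

    kernel_src = """
    __kernel void partial_sums(__global const float *in,
                               __global float *out,
                               __local float *scratch)
    {
        int gid = get_global_id(0);
        int lid = get_local_id(0);
        int group_size = get_local_size(0);

        scratch[lid] = in[gid];
        barrier(CLK_LOCAL_MEM_FENCE);

        // Tree reduction within the work-group using fast local memory.
        for (int offset = group_size / 2; offset > 0; offset /= 2) {
            if (lid < offset)
                scratch[lid] += scratch[lid + offset];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)
            out[get_group_id(0)] = scratch[0];
    }
    """

    ctx = cl.create_some_context()
    queue = cl.CommandQueue(ctx)
    prg = cl.Program(ctx, kernel_src).build()

    n, group = 1 << 20, 256                  # n must be a multiple of group
    a = np.random.rand(n).astype(np.float32)
    mf = cl.mem_flags
    a_buf = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
    partial = np.empty(n // group, dtype=np.float32)
    out_buf = cl.Buffer(ctx, mf.WRITE_ONLY, partial.nbytes)

    prg.partial_sums(queue, (n,), (group,), a_buf, out_buf,
                     cl.LocalMemory(group * 4))   # 4 bytes per float
    cl.enqueue_copy(queue, partial, out_buf)

    # Small float32 rounding differences from the host sum are expected.
    print("device sum:", partial.sum(), "host sum:", a.sum())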
In a fantasy world where one can just imagine new standards, hardware, and programming languages, one could choose which level of portability to aim for. I think that C was supposed to be hardware-independent, but it isn't really if you want to get the best performance out of your hardware. OpenCL C tries as well, but doesn't quite make it, though with run-time kernel compilation it does a pretty good job. The host program has the same issues as any other, though. I don't think there are any 'Level 1' portable languages currently.
Sorry my response is a bit rambling. It's unfortunate that it's difficult to have an objective, constructive discussion about the pros and cons of different ideas about future changes in software and hardware. Personally, I think FPGAs have huge potential but are still a long way from where they would need to be to go mainstream. Any new computing language will probably become out of date when hardware changes occur and software trends change. It's remarkable that C still occupies such a prominent place. In another 10 or 20 years' time, C will probably still be going strong. How many other modern languages will still be commonplace then?

Can the announced Tegra K1 be a contender against x86 and x64 chips in supercomputing applications?

To clarify, can this RISC-based processor (the Tegra K1) be used without significant changes to today's supercomputer programs, and perhaps be a game changer because of its power, size, cost, and energy usage? I know it's going up against some x64 or x86 processors. Can the code used for current supercomputers be easily converted to code that will run well on these mobile chips? Thanks.
Can the code used for current supercomputers be easily converted to code that will run well on these Mobile chips?
It depends on what you call "supercomputer code". Usually supercomputers run high-level functional code (usually fully compiled code like C++, sometimes VM-dependent code like Java) on top of lower-level code and technologies such as OpenCL or CUDA for accelerators, or MPICH for communication between nodes.
All these technologies have ARM implementations, so the real task is to make the functional code ARM-compatible. This is usually straightforward, as code written in a high-level language is mostly hardware-independent. So the short answer is: yes.
However, what may be more complicated is to scale this code to these new processors.
Tegra K1 is nothing like the GPUs embedded in supercomputers. It has far less memory, runs slightly slower and has only 192 cores.
Its price and power consumption make it possible, however, to build supercomputers with hundreds of them inside.
So code which has been written for traditional supercomputers (a few high-performance GPUs embedded) will not reach the peak performance of 'new' supercomputers (built with a lot of cheap and weak GPUs). There will be a price to pay to port existing code to these new architectures.
For modern supercomputing needs, you'd need to ask whether a processor performs well for the energy it consumes. Intel's current architecture, along with GPUs, fulfils those needs, and the Tegra architecture does not perform as well as Intel processors in terms of power-performance.
The question is: should it? Intel keeps proving that ARM is inferior, and the only factor in favour of using RISC-based processors is their price, which I highly doubt is a concern when building a supercomputer.

MPI IO very slow. What could be the cause?

I have just converted a program to make use of MPI calls for use on multiple nodes but I am having a problem getting IO to work well with MPI calls.
I am using standard MPI-2 IO methods like MPI_File_open and MPI_File_write to write my final results to a file (a sketch of the pattern follows the notes below). On my laptop I see a slight speedup (0.2 s -> 0.1 s), but on the university's supercomputer my file-writing speed becomes abysmal (0.2 s -> 90 s!).
I can't understand why performance would be so bad on the supercomputer but improved on my laptop. Is there something I am overlooking which would heavily contribute to the slow speed?
Some Notes:
- The file system on my laptop is ext4 and the one used by the university is NFS.
- I am using Open MPI 1.4.4 on the supercomputer and Open MPI 1.4.5 on my laptop.
- I have to change the process's view multiple times using MPI_File_set_view due to a requirement set in the guidelines, which I don't think I can get around.
- I have tried using the asynchronous version of write, MPI_File_iwrite, but this actually gives worse results.
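For reference, a minimal sketch of the write pattern described above, expressed with mpi4py rather than the original C (the file name, block size, and view layout here are made up for illustration):

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Each rank produces a block of results and writes it to its own
    # region of a shared file.
    block = np.full(1000, rank, dtype=np.float64)

    fh = MPI.File.Open(comm, "results.dat",
                       MPI.MODE_CREATE | MPI.MODE_WRONLY)
    # Set this rank's view so its writes land in its own slice of the file.
    fh.Set_view(rank * block.nbytes, MPI.DOUBLE)
    fh.Write_all(block)   # collective write; Write(block) would be independent
    fh.Close()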

pyCUDA vs C performance differences?

I'm new to CUDA programming and I was wondering how the performance of pyCUDA compares to programs implemented in plain C.
Will the performance be roughly the same? Are there any bottlenecks that I should be aware of?
EDIT:
I obviously tried to google this issue first, and was surprised not to find any information; i.e. I would have expected the pyCUDA people to have this question answered in their FAQ.
If you're using CUDA -- whether directly through C or with pyCUDA -- all the heavy numerical work you're doing is done in kernels that execute on the gpu and are written in CUDA C (directly by you, or indirectly with elementwise kernels). So there should be no real difference in performance in those parts of your code.
Now, the initialization of arrays, and any post-work analysis, will be done in python (probably with numpy) if you use pyCUDA, and that generally will be significantly slower than doing it directly in a compiled language (though if you've built your numpy/scipy in such a way that it links directly to high-performance libraries, then those calls at least would perform the same in either language). But hopefully, your initialization and finalization are small fractions of the total amount of work you have to do, so that even if there is significant overhead there, it still hopefully won't have a huge impact on overall runtime.
And in fact, even if it turns out that the Python parts of the computation do hurt your application's performance, starting out doing your development in pyCUDA may still be an excellent way to get started, as the development is significantly easier, and you can always re-implement the parts of the code that are too slow in Python in straight C, and call those from Python, gaining some of the best of both worlds.
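As a rough sketch of that workflow (the vector-add kernel, sizes, and block configuration here are made up for illustration; timing uses CUDA events via PyCUDA's driver API and includes the host transfers done by drv.In/drv.Out):

    import numpy as np
    import pycuda.autoinit          # creates a context on the default device
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    # The kernel itself is plain CUDA C, exactly as it would be in a C program.
    mod = SourceModule("""
    __global__ void add(float *out, const float *a, const float *b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = a[i] + b[i];
    }
    """)
    add = mod.get_function("add")

    n = 1 << 20
    a = np.random.rand(n).astype(np.float32)
    b = np.random.rand(n).astype(np.float32)
    out = np.empty_like(a)

    start, end = drv.Event(), drv.Event()
    start.record()
    add(drv.Out(out), drv.In(a), drv.In(b), np.int32(n),
        block=(256, 1, 1), grid=((n + 255) // 256, 1))
    end.record()
    end.synchronize()

    # time_till() reports milliseconds; this covers transfers plus kernel.
    print("GPU time (transfers + kernel): %.3f ms" % start.time_till(end))
    print("max error:", np.max(np.abs(out - (a + b))))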
If you're wondering about performance differences by using pyCUDA in different ways, see SimpleSpeedTest.py included in the pyCUDA Wiki examples. It benchmarks the same task completed by a CUDA C kernel encapsulated in pyCUDA, and by several abstractions created by pyCUDA's designer. There's a performance difference.
I've been using pyCUDA for a little while and I like prototyping with it because it speeds up the process of turning an idea into working code.
With pyCUDA you will be writing the CUDA kernels using C++, and it's CUDA, so there shouldn't be a difference in performance of running that code. But there will be a difference in the performance of the code you write in Python to setup or use the results of the pyCUDA kernel vs the one you write in C.
I was looking for an answer to the original question in this post, and I see the problem is deeper than I thought.
In my experience, I compared CUDA kernels and cuFFT calls written in C with the same written in PyCUDA. Surprisingly, I found that, on my computer, the performance of summing, multiplying or computing FFTs varies between implementations. For example, I got almost the same cuFFT performance for vector sizes up to 2^23 elements. However, summing and multiplying complex vectors showed some problems. The speed-up obtained in C/CUDA was ~6x for N = 2^17, whilst in PyCUDA it was only ~3x. It also depends on the way the summation is performed. Using SourceModule and wrapping raw CUDA code, I found that my kernel for complex128 vectors was limited to a lower N (<= 2^16) than the one I could use with gpuarray (<= 2^24).
In conclusion, it's worth testing and comparing both sides of the problem to evaluate whether it is worth spending time writing C/CUDA code, or gaining readability and paying the cost of lower performance.
Make sure you're using -O3 optimizations there, and use nvprof/nvvp to profile your kernels if you're using PyCUDA and you want to get high performance. If you want to use CUDA from Python, PyCUDA is probably THE choice, because interfacing C++/CUDA code via Python is just hell otherwise: you have to write a hell of a lot of ugly wrappers. And for numpy integration even more hardcore wrapper code would be necessary.

Timing Kernel Executions on CUDA

I've used code from the CUDA C Best Practices guide to implement an execution timer. However, there is something strange and I don't know if it's an anomaly or if it's normal: I get different readouts each time I run my CUDA app.
Could these readings be related to my design, or is this something I should expect?
I'm not running any graphic intensive applications on my machine, other than Windows 7.
Well, it depends how big the differences are. One thing that can cause anomalies is the OS scheduler. It may just happen that the scheduler is giving some extra timeslices to kernel functions (because graphics API calls have error checking involved), which shows up as more execution time. If the differences are very large I would say check your code, but if they're on the order of milliseconds I wouldn't worry about it; +/- 10 ms is the usual timeslicing quantum in most OSs (Windows probably included).
Also Aero is kind of intensive so that may be adding to the discrepancies you are seeing.
I've used code from CUDA C Best Practices to implement an execution timer.
Yeah, well, that's not a "best practice" in my experience.
I suggest using the nvprof profiler instead for your device-side code and CUDA Runtime API calls (it also works relatively well, I think, for your own host-side code). It'll take a bit of hassle to set up and to figure out which options you want to use, but it's worth it.
