OpenCL 1.2 compiling kernel binary using LLVM - c

Say I have the OpenCL kernel,
/* Header to make Clang compatible with OpenCL */
/* Test kernel */
__kernel void test(long K, const global float *A, global float *b)
{
    for (long i = 0; i < K; i++)
        for (long j = 0; j < K; j++)
            b[i] = 1.5f * A[K * i + j];
}
I'm trying to figure out how to compile this to a binary which can be loaded into OpenCL using the clCreateProgramWithBinary command.
I'm on a Mac (Intel GPU), and thus I'm limited to OpenCL 1.2. I've tried a number of different variations on the command,
clang -cc1 -triple spir test.cl -O3 -emit-llvm-bc -o test.bc -cl-std=cl1.2
but the binary always fails when I try to build the program. I'm at my wits' end with this; it's all so confusing and poorly documented.
The performance of the above test function can, in regular C, be significantly improved by applying the standard LLVM compiler optimization flag -O3. My understanding is that this optimization flag somehow takes advantage of the contiguous memory access pattern of the inner loop to improve performance. I'd be more than happy to listen to anyone who wants to fill in the details on this.
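(For concreteness, the regular-C version I'm referring to is essentially the sketch below; adding -Rpass=loop-vectorize to the -O3 build is one way to have Clang report whether it actually vectorized the inner loop. The file and function names are just illustrative.)
/* test_c.c - plain C analogue of the kernel loop */
void test_c(long K, const float *A, float *b)
{
    for (long i = 0; i < K; i++)
        for (long j = 0; j < K; j++)
            b[i] = 1.5f * A[K * i + j];
}
clang -O3 -Rpass=loop-vectorize -c test_c.c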
I'm also wondering how I can first convert to SPIR code, and then convert that to a buildable binary. Eventually I would like to find a way to apply the -O3 compiler optimizations to my kernel, even if I have to manually modify the SPIR (as difficult as that will be).
I've also gotten the SPIRV-LLVM-Translator tool working (as far as I can tell), and ran,
./llvm-spirv test.bc -o test.spv
and this binary fails to load at the clCreateProgramWithBinary step; I can't even get to the build step.
Possibly SPIR-V doesn't work with OpenCL 1.2 at all, and it would have to be loaded with clCreateProgramWithIL, which unfortunately doesn't exist in OpenCL 1.2. It's difficult to say for sure why it doesn't work.
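For reference, a rough sketch of the host-side path I'm describing (context and device setup plus most error handling omitted; the file name and build options are just placeholders):
#include <stdio.h>
#include <stdlib.h>
#include <OpenCL/opencl.h>   /* <CL/cl.h> on non-Apple platforms */

/* Load a pre-compiled binary from disk and build it for one device. */
cl_program load_binary(cl_context context, cl_device_id device, const char *path)
{
    FILE *f = fopen(path, "rb");
    fseek(f, 0, SEEK_END);
    size_t length = (size_t)ftell(f);
    rewind(f);

    unsigned char *binary = malloc(length);
    fread(binary, 1, length, f);
    fclose(f);

    cl_int status, err;
    cl_program program = clCreateProgramWithBinary(
        context, 1, &device, &length,
        (const unsigned char **)&binary, &status, &err);

    /* For a SPIR binary, the cl_khr_spir extension defines "-x spir" as a
       build option; without that extension I don't see how the driver
       would accept it. */
    err = clBuildProgram(program, 1, &device, "-x spir", NULL, NULL);
    if (err != CL_SUCCESS) {
        /* This is where it fails for me; clGetProgramBuildInfo with
           CL_PROGRAM_BUILD_LOG gives the compiler output. */
    }

    free(binary);
    return program;
}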
Please see my previous question here for some more context on this problem.

I don't believe there's any standardised bitcode file format that's available across implementations, at least at the OpenCL 1.x level.
As you're talking specifically about macOS, have you investigated Apple's openclc compiler? This is also what Xcode invokes when you compile a .cl file as part of a target. The compiler is located in /System/Library/Frameworks/OpenCL.framework/Libraries/openclc; it does have comprehensive --help output but that's not a great source for examples on how to use it.
Instead, I recommend you try the OpenCL-in-Xcode tutorial, and inspect the build commands it ends up running:
https://developer.apple.com/library/archive/documentation/Performance/Conceptual/OpenCL_MacProgGuide/XCodeHelloWorld/XCodeHelloWorld.html
You'll find it produces bitcode files (.bc) for 4 "architectures": i386, x86_64, "gpu_64", and "gpu_32". It also auto-generates some C code which loads this code by calling gclBuildProgramBinaryAPPLE().
I don't know if you can untangle it further than that, but you certainly can ship GPU-independent bitcode using this compiler.
I should point out that OpenCL is deprecated on macOS, so if that's the only platform you're targeting, you really should go for Metal Compute instead. It has much better tooling and will be actively supported for longer. For cross-platform projects it might still make sense to use OpenCL even on macOS, although for shipping kernel binaries instead of source, it's likely you'll have to use platform-specific code for loading those anyway.

Related

Clang or GCC equivalent of _PGOPTI_Prof_Dump_All() from ICC

The Intel C(++) Compiler has very useful functions to help with profile-guided optimisation.
_PGOPTI_Prof_Reset_All();
/* code */
_PGOPTI_Prof_Dump_All();
https://software.intel.com/en-us/node/512800
This is particularly useful for profiling shared libraries which one would use with ctypes in Python.
I've been trying to figure out if either Clang or GCC has similar functionality – apparently not.
Profile-guided optimization works differently in GCC; it is enabled with compiler switches. See this question for PGO with GCC.
PGO only recently arrived in Clang and is available starting with version 3.5. The Clang user manual gives an overview of how to use it.
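Roughly, the two workflows look like this (a sketch; file names are placeholders, and the Clang flags are the instrumentation-based ones described in that manual):
gcc -O2 -fprofile-generate prog.c -o prog
./prog
gcc -O2 -fprofile-use prog.c -o prog
and for Clang:
clang -O2 -fprofile-instr-generate prog.c -o prog
./prog
llvm-profdata merge -output=prog.profdata default.profraw
clang -O2 -fprofile-instr-use=prog.profdata prog.c -o prog
The instrumented run writes .gcda files (GCC) or, typically, default.profraw (Clang) into the working directory, which the second compile then consumes.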
It turns out that both have an internal, not properly documented function named __gcov_flush which does this; it is only explained in the source:
/* Called before fork or exec - write out profile information
gathered so far and reset it to zero. This avoids duplication or
loss of the profile information gathered so far. */
It's not quite as convenient as the Intel equivalent though and requires some gymnastics to make it work.
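If you want to try it, the usual trick is to declare it yourself, since no public header exposes it (a sketch; this leans on an undocumented internal of libgcov / the corresponding Clang profile runtime, so treat it as fragile):
/* No header declares __gcov_flush; the symbol comes from the profiling
   runtime that gets linked in when you build with --coverage or
   -fprofile-arcs. */
extern void __gcov_flush(void);

void dump_profile_now(void)
{
    __gcov_flush();   /* write out the counters gathered so far and reset them */
}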

LLVM IR limitations

I am looking to generate LLVM IR from C code and was wondering how well IR generation works for the functions in stdio.h, string.h and stdlib.h, and generally the standard memory functions such as malloc and calloc. I have not been able to find most of the common functions in http://llvm.org/docs/LangRef.html, so I was wondering about the limitations of this representation and whether I might be required to add my own intrinsics just to deal with the standard/most popular C functions.
I am also looking to change the code at runtime, so I was wondering which kind of approach will give me the most flexibility, e.g. manipulating the code at the AST level instead.
Thanks
Emitting LLVM IR from C is exactly what the industrial-strength compiler Clang does. I suggest running Clang on small snippets of C code with -emit-llvm (details in this document: http://clang.llvm.org/get_started.html) and observing the resulting IR.
You can even do this in your browser: http://ellcc.org/demo/index.cgi
That will allow you to see how builtins like memcpy are handled and resolve any other similar doubts.
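For instance, a small snippet like this (a sketch; the exact IR you get varies by Clang version and target):
#include <stdlib.h>
#include <string.h>

/* Standard library calls become plain "call" instructions to external
   symbols, with matching "declare" lines in the module - they are not
   LLVM intrinsics (although memcpy may be lowered to llvm.memcpy.*). */
char *duplicate(const char *src, size_t n)
{
    char *dst = malloc(n);
    if (dst)
        memcpy(dst, src, n);
    return dst;
}
compiled with clang -S -emit-llvm duplicate.c -o duplicate.ll, shows @malloc declared as an ordinary external function rather than anything defined in the LangRef.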
Note that neither LLVM nor Clang ships with a full C library, but they can be used to compile an existing one. newlib is a popular portable C library designed specifically to be built for new platforms. PNaCl, for example, uses it to build C/C++ code into portable executables - it compiles newlib together with the user's code into a single LLVM IR module.

Atomic Operations in C on Linux

I am trying to port some code I wrote from Mac OS X to Linux and am struggling to find a suitable replacement for the OSX only OSAtomic.h. I found the gcc __sync* family, but I am not sure it will be compatible with the older compiler/kernel I have. I need the code to run on GCC v4.1.2 and kernel 2.6.18.
The particular operations I need are:
Increment
Decrement
Compare and Swap
What is weird is that running locate stdatomic.h on the Linux machine finds the header file (in a C++ directory), whereas running the same command on my OS X machine (gcc v4.6.3) returns nothing. What do I have to install to get the stdatomic library, and will it work with gcc v4.1.2?
As a side note, I can't use any third party libraries.
Well, there is nothing to stop you from using the OSAtomic operations on other platforms. The sources for the OSAtomic operations for ARM, x86 and PPC are part of Apple's libc, which is open source. Just make sure you are not using OSSpinLock, as that is specific to Mac OS X, but it can easily be replaced by Linux futexes.
See these:
http://opensource.apple.com/source/Libc/Libc-594.1.4/i386/sys/OSAtomic.s
http://opensource.apple.com/source/Libc/Libc-594.1.4/ppc/sys/OSAtomic.s
http://opensource.apple.com/source/Libc/Libc-594.1.4/arm/sys/OSAtomic.s
Alternatively, you can use the __sync_* family of builtins, which should work on most platforms and which I believe is described here: http://gcc.gnu.org/wiki/Atomic
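A minimal sketch of your three operations using those builtins (the variable is just illustrative; these builtins are available in GCC 4.1.2):
#include <stdio.h>

int counter = 0;

int main(void)
{
    int old;

    __sync_fetch_and_add(&counter, 1);   /* atomic increment */
    __sync_fetch_and_sub(&counter, 1);   /* atomic decrement */

    /* compare and swap: if counter is 0, set it to 42;
       returns the value that was there beforehand. */
    old = __sync_val_compare_and_swap(&counter, 0, 42);

    printf("old = %d, counter = %d\n", old, counter);
    return 0;
}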
The OpenPA project provides a portable library of atomic operations under an MIT-style license. This is one I have used before and it is pretty straightforward. The code for your operations would look like
#include "opa_primitives.h"
OPA_int_t my_atomic_int = OPA_INT_T_INITIALIZER(0);
/* increment */
OPA_incr_int(&my_atomic_int);
/* decrement */
OPA_decr_int(&my_atomic_int);
/* compare and swap */
old = OPA_cas_int(&my_atomic_int, expected, new);
It also contains fine-grained memory barriers (i.e. read, write, and read/write) instead of just a full memory fence.
The main header file has a comment showing the operations that are available in the library.
GCC atomic intrinsics have been available since GCC 4.0.1.
There is nothing stopping you from building GCC 4.7 or Clang with GCC 4.1.2 and then getting all the newer features, such as C11 atomics.
There are many locations you can find BSD licensed assembler implementations of atomics as a last resort.

Do I really need libgcc?

I've been using GCC 4.6.2 on Mac OS X 10.6. I use the -static-libgcc option when I compile, otherwise my binaries look for libgcc on the system and I'm not sure anything over GCC 4.2 is supported on OS X. This works fine, but why do I even need libgcc? I read up on it and the GNU docs say it contains "arithmetic operations that the target processor cannot perform directly." How do I know what these operations are? And why are they so complex that I need to include this library? Why can't GCC just optimize the code directly instead of having to resort to these library functions? I'm a little confused. Any insight into this would be appreciated!
Yes, you do need it .... probably. If you don't need it then statically linking it is harmless. You can tell if you need it by using the -t link trace option (I think).
There are various things that you can't do in one instruction (typically things like 64-bit operations on 32-bit architectures). These things can be done, but if they take a non-trivial number of instructions then it's more space-efficient to have them all in one place.
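As a concrete sketch of that (assuming a 32-bit x86 target, e.g. gcc -m32; the exact helper name is an implementation detail), a plain 64-bit division usually turns into a call to a libgcc routine such as __udivdi3 rather than inline instructions:
#include <stdint.h>

/* On a 32-bit target there is no single instruction for 64-bit division,
   so GCC typically emits a call into libgcc (commonly __udivdi3). */
uint64_t div64(uint64_t a, uint64_t b)
{
    return a / b;
}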
When you disable optimization with -O0 (which is actually the default anyway), GCC pretty much always uses the libgcc routines.
When you enable speed optimization then GCC may choose to insert the instruction sequence directly into the code (if it knows how). You may find that it ends up using none of the libgcc versions - it will certainly use fewer libgcc calls.
When you enable size optimizations then GCC may prefer the function call, or may not - it depends on what the GCC developers think is the best speed/size trade-off in each case. Note that even when you tell it to optimize for speed, the compiler may judge that some functions are unlikely to be used, and optimize those for size - even more so if you use PGO.
Basically, you can think of it in the same way as memcpy or the math-library functions: the compiler will inline functions it judges to be beneficial, and call library functions otherwise. The compiler can "inline" standard functions and libgcc functions without looking at the library definition, of course - it just "knows" what they do.
Whether to use static or dynamic libgcc is an interesting trade-off. On the one hand, a dynamic (shared) library will use less memory across your whole system, and is more likely to be cached, etc. On the other hand, a static libgcc has a lower call overhead.
The most important thing though is compatibility. Obviously the libgcc library has to be present for your program to run, but it also has to be a compatible version. You're ok on a Linux distro with a stable GCC version, but otherwise static linking is safer.
I hope that answers your questions.

Performance of compiled code by compiled compiler

If I want to achieve better performance from, let's say for example, MySQLdb, I can compile it myself and get better performance, because it's not compiled for a generic i386, i486 or whatever, but for my own CPU. Further, I can choose the compile options and so on...
Now, I was wondering if this is also true for non-regular software, such as a compiler.
Here comes the first part:
Will compiling a compiler like GCC result in better performance?
and the second part:
Will the code compiled by my own compiled compiler perform better?
(Yes, I know, I can compile my compiler and benchmark it... but maybe ... someone already knows the answer, and will share it with us =)
In answer to your first question, almost certainly yes. Binary versions of gcc will be the "lowest common denominator", and if you compile gcc yourself with flags more appropriate to your system, it will most likely be faster.
As to your second question, no.
The output of the compiler will be the same regardless of how you've optimised it (unless it's buggy, of course).
In other words, even if you totally stuffed up your compiler flags when compiling gcc, to the point where your particular compiled version of gcc takes a week and a half to compile "Hello World", the actual "Hello World" executable should be identical to the one produced by the "lowest common denominator" gcc (if you use the same flags).
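One rough way to convince yourself of this (the paths are placeholders for your two gcc builds; both must be the same gcc version and given identical flags) is to compare the generated assembly rather than the final executables, which sidesteps incidental differences such as embedded timestamps:
/path/to/prebuilt/gcc -O2 -S hello.c -o hello_prebuilt.s
/path/to/selfbuilt/gcc -O2 -S hello.c -o hello_selfbuilt.s
diff hello_prebuilt.s hello_selfbuilt.s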
(1) It is possible. If you introduce a new optimization to your compiler and re-compile it with this optimization included, it is possible that the re-compiled code will perform better.
(2) No!!!! A compiler cannot change the logic of the code! In your case, the logic of the code is the native code produced at the end. So whether compiler A_1 is compiled using compiler A_2 or B has no effect on the native code produced by A_1 [here A_1 and A_2 are the same compiler; the index is just for clarity].
a. Well, you can compile the compiler for your system, and maybe it will run faster, like any program. (I think it's usually not worth it, but do whatever you want.)
b. No. Even if you compile the compiler on your computer, its behavior should not change, and so the code that it generates also doesn't change.
Will compiling a compiler like GCC result in better performance?
A program compiled specifically for the target platform it is used on will usually perform better than a program compiled for a generic platform. Why is this? Knowledge about the hardware can help the compiler align data to be cache friendly and choose an instruction ordering that plays well with a CPU's pipelining.
The most benefit is usually achieved by leveraging specific instruction sets such as SSE (in its various versions).
On the other hand, you should ask yourself whether a program like GCC is really CPU bound (much more likely it will be IO bound) and whether tuning its CPU performance provides any measurable benefit.
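(For what it's worth, the usual way to ask GCC for that kind of target-specific code is something like the line below; -march=native simply means "use whatever instruction sets this machine supports".)
gcc -O2 -march=native -o myprog myprog.c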
Will the code compiled by my own compiled compiler perform better
Hopefully not! Allowing a compiler to optimize a program should never change its behavior. No matter how you compiled your GCC, it should compile code to the same binaries as a generic binary distribution of GCC would.
If code compiled for a specific platform is faster than code compiled for a generic platform, why don't we all ship code instead of binaries? Guess what, some Linux distros actually follow this philosophy, such as Gentoo. And while you're at it, make sure to build statically linked binaries; disk space is so cheap nowadays, and it gives you at least another 0.001% of performance.
Alright, that was a bit sarcastic. The reason people distribute generic binaries is pretty obvious: it's generic, the lowest common denominator, and it will work everywhere. That's a big bonus in terms of flexibility and user friendliness. I remember once compiling Gnome for my Gentoo box; it took a day or two! (But it must have been so much faster ;-) )
On the other hand, there are occasions where you want the best performance possible, and then it makes sense to build and optimize for a specific architecture.
GCC uses a three-stage bootstrap when building from source. Basically, it compiles the source three times to ensure the build tools and the compiler are built successfully. This bootstrapping is used for validation purposes. However, it is possible to use stage 1 as a benchmark for optimizing the later stages: you build GCC with make profiledbootstrap to use this profile-based optimization.
This profile-based build process increases the performance of GCC itself, but not of the software compiled with it, as the other answers point out.
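For reference, the build sequence looks roughly like this (paths and the install prefix are placeholders):
mkdir build && cd build
../gcc-src/configure --prefix=/opt/gcc
make profiledbootstrap
make install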
