What are the standard benchmarks for comparing the optimizers of various C compilers?
I'm particularly interested in benchmarks for ARM (or those that can be ported to ARM).
https://en.wikipedia.org/wiki/SPECint is mostly written in C, and is the industry standard benchmark for real hardware, computer-architecture theoretical research (e.g. a larger ROB or some cache difference in a simulated CPU), and for compiler developers to test proposed patches that change code-gen.
The C parts of SPECfp (https://en.wikipedia.org/wiki/SPECfp) are also good choices. And for a compiler back-end optimizer, the choice of front-end language isn't very significant, so the Fortran programs are fine too.
Related: "Tricks of a Spec master" is a paper that covers the different benchmarks, possibly originally from a conference talk:
"In this lightning round talk, I will cover at a high level the performance characteristics of these benchmarks in terms of optimizations that GCC does. For example, some benchmarks are classic floating point applications and benefit from SIMD (single instruction, multiple data) instructions, while other benchmarks don't."
Wikipedia is out of date. SPECint/fp 2017 was a long time coming, but it was released in 2017 and is a significant improvement over 2006; e.g. it retires benchmarks that had been trivialized by clever compiler optimizations like loop inversion. (Over the years, some compilers added what amounts to pattern recognition to optimize the hot loop in libquantum, but they can't always do that in general for other loops even when it would be safe. Apparently that loop can also be easily auto-parallelized.)
For testing a compiler, you might actually want code that aggressive optimization can find major simplifications in, so SPECcpu 2006 is a good choice. Just be aware of the issues with libquantum.
https://www.anandtech.com/show/10353/investigating-cavium-thunderx-48-arm-cores/12 describes gcc as a compiler that "does not try to "break" benchmarks (libquantum...)". But compilers like ICC and SunCC, which CPU vendors use or have used for SPEC submissions for their own hardware (Intel x86, and Sun UltraSPARC and later x86), are as aggressive as possible on SPEC benchmarks.
SPEC result submissions are required to include compiler version and options used (and OS tuning options), so you can hopefully replicate them.
Related
I have read the article "C Is Not a Low-level Language", which contains this paragraph:
"Unfortunately, simple translation providing fast code is not true for C. In spite of the heroic efforts that processor architects invest in trying to design chips that can run C code fast, the levels of performance expected by C programmers are achieved only as a result of incredibly complex compiler transforms. The Clang compiler, including the relevant parts of LLVM, is around 2 million lines of code. Even just counting the analysis and transform passes required to make C run quickly adds up to almost 200,000 lines (excluding comments and blank lines)."
What does the bolded sentence (about the heroic efforts of processor architects) mean? Does it mean that manufacturers design processors with some optimizations and architectural decisions targeted primarily, or even specifically, at C (and C++) code? Or does it just mean that they try to design processors that execute any code faster, including code written in C?
If such preferences for C exist, what are they?
A couple of my own thoughts:
a branch prediction algorithm tuned to patterns that occur mainly in C code.
instructions which are useful in C but aren't useful in other languages; otherwise other languages (compilers) would use them too.
I know about language-specific processors like Jazelle and Lisp machines, for Java and Lisp respectively, but similar technologies can't be applied to C, because there is no bytecode.
Processors don't necessarily have optimizations targeted at C, but they do provide features to make C (and other procedural languages in general) map more cleanly to the platform.
Take cache-coherency in a multi-threaded environment as an example. From a C perspective, a global variable shared by two threads should look the same to both threads. If one thread writes to it the other should be able to see those modifications. But in a multi-core CPU with independent caches, that takes extra effort to support. Core 1 has to be able to detect that core 2 is accessing an address it has modified in cache and flush that out to memory (or somehow share it directly to core 2's cache).
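As a minimal sketch of that scenario (using C11 atomics and POSIX threads; the names here are just for illustration), one thread stores to a shared variable and another spins until the cache-coherency machinery makes the store visible:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Illustrative only: a flag shared by two threads.  On a multi-core
       CPU the producer core's store has to be propagated to the consumer
       core by the cache-coherency protocol (e.g. MESI). */
    static atomic_int ready = 0;

    static void *producer(void *arg)
    {
        (void)arg;
        atomic_store_explicit(&ready, 1, memory_order_release);  /* write on one core */
        return NULL;
    }

    static void *consumer(void *arg)
    {
        (void)arg;
        while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
            ;                               /* spin until the other core's write is visible */
        puts("saw the update");
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, consumer, NULL);
        pthread_create(&t2, NULL, producer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }

(Build with -pthread. Without the atomic qualifier the compiler would also be free to hoist the load out of the loop; that is a separate, software-level issue, but it bites in exactly this situation.)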
That's essentially the thesis of that entire article. C's abstract machine model doesn't necessarily map cleanly to real modern high-performance processors like it did to the (by comparison extremely simple) PDP-11, and CPUs and compilers have to take great pains to paper over those differences.
The "heroic efforts" of the processor architects is largely referring to the design of cache and memory subsystems on the CPUs.
For a very long time now, the instruction execution circuits inside CPUs have been far, far quicker than the electronics that look after fetching/writing data from/to memory, largely because the technology we have for RAM chips hasn't really got better. Where the cores have sped up, the memory hasn't, and so the cache and memory subsystem has to get ever more elaborate in order to pre-fetch data and move it towards the execution circuits ahead of time. Needless to say, this doesn't always pan out well.
It's also partly because of the physical distance between the CPU and RAM chips. Though only a few inches (if that) of track on a motherboard, that distance is significant; the speed of a signal down the track is about 1ns every 8 inches. For signals clocked in the GHz range (one cycle is a nanosecond or less), a short track is a long way. This is partly why Apple have gone down the route of putting RAM onto the same package as the CPU in their home-grown M1 silicon.
Back to caches - the likes of Intel (and AMD, and ARM) have striven to make CPUs with good general-purpose performance, so that they run pretty much any code well. Modern compilers help a lot: if they know what the cache in the CPU is likely to do in any particular circumstance, they can arrange the code to fit in with that behaviour.
A reasonable question then is: is that effective? Well, yes and no. Yes, because compiled code does run quite well; but no, for a couple of reasons. The first is that the ultimate performance of any given algorithm is rarely reached by the compiler / CPU, and the second is that all this complexity makes it nigh on impossible for a good programmer to do their own optimisation.
Some CPUs help out the hero-programmer here. PowerPC (at least some variants) has instructions where the programmer can give the cache system a hint that the programme will shortly need data from such-and-such a location in RAM. The CPU uses that instruction to pre-load the L1 cache with that data, so that when the program actually starts to perform operations on data at that address it's already in cache.
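With GCC and Clang such a hint can be expressed from C via the __builtin_prefetch extension; which instruction it lowers to (PowerPC dcbt, ARM PLD/PRFM, x86 PREFETCH) depends on the target. A rough sketch, with the look-ahead distance of 16 chosen arbitrarily:

    #include <stddef.h>

    /* Sum an array while hinting the cache a few iterations ahead.
       __builtin_prefetch(addr, rw, locality) is a GCC/Clang extension;
       whether it helps at all depends on the access pattern and on how
       good the hardware prefetcher already is. */
    long sum_with_prefetch(const long *data, size_t n)
    {
        long total = 0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&data[i + 16], 0 /* read */, 1 /* low temporal locality */);
            total += data[i];
        }
        return total;
    }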
The IBM Cell processor took this to a whole new level. The SPE math cores (there were 8 of them) had no cache, and no way of addressing data in CPU RAM at all. What there was instead was 256K of static RAM per core into which all code and data had to fit, and a way for code to push code and data in and out of that static RAM very quickly (256Gbyte/sec at the time, which was very, very quick). The developer was completely on their own; they had to write code to load code and data into a core, get it executed, and then write more code to get the results out to wherever. This was actually pretty liberating; instead of having a cache and memory subsystem trying to automatically deliver data to execution cores, getting in the way or (worse) just hiding inefficiencies from you, one had the freedom to break an algorithm down into core-sized lumps, knowing that if it fitted it'd be very quick, or knowing for sure that it didn't fit.
Miles Budnek's answer addresses the issues that arise from multi-core CPUs with cache coherency in a Symmetric Multi-Processing (SMP) environment. It's even harder for the cache designer to get it right if there are multiple cores that might very well start tampering with a value. The difficulties involved have led to vulnerabilities like Meltdown and Spectre.
SMP could be said to be an "optimisation" put into CPUs by designers to aid the C (or other) developer in transitioning code from single-threaded to multi-threaded. It's an attractive thought: just as a single-threaded programme can see all of its data merely by addressing it, why not extend the same visibility of data to all threads in the programme?
Turns out that this is what makes it very difficult to design modern CPUs. However the reasons why the industry went this way are plain enough - the smallest possible delta between single and multicore CPUs was going to be the least troublesome for the existing software community to adopt. That's perfectly reasonable.
But it is running out of steam, fast. A better approach (if the goal is the outright pursuit of performance) would be to go back to the old Transputer architectures from Inmos in the 1980s and early 1990s. In such architectures, data held by one core could only be processed by another if the software was written to explicitly transfer the data. Sounds familiar? Yes - the Cell processor was a bit like that.
Interestingly, languages such as Rust, Go and Erlang have all implemented Communicating Sequential Processes (CSP) as a parallel processing paradigm. The irony is that, these days, CSP has to be implemented on top of an SMP environment, which is itself an artificial construct brought about by the interconnect between CPUs, cores and memory (e.g. QPI, HyperTransport). Basically, if the whole software world got fully comfortable with CSP, then CPU designers wouldn't have to design cache-coherency into their multi-core CPUs. Rust in particular is very well suited, as it already has a strong concept of data ownership in its syntax (which could be leveraged to shovel data around between cores automatically).
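To make the CSP idea concrete, here is a minimal, hypothetical one-slot channel in C, built (ironically, as noted above) out of exactly the shared-memory primitives an SMP machine provides - a mutex and a condition variable:

    #include <pthread.h>

    /* A one-slot, blocking channel: send waits until the slot is empty,
       recv waits until it is full.  Ownership of the value passes from
       sender to receiver, CSP-style. */
    typedef struct {
        pthread_mutex_t lock;
        pthread_cond_t  cond;
        int             value;
        int             full;    /* 1 if a value is waiting to be received */
    } channel;

    void channel_init(channel *c)
    {
        pthread_mutex_init(&c->lock, NULL);
        pthread_cond_init(&c->cond, NULL);
        c->full = 0;
    }

    void channel_send(channel *c, int v)
    {
        pthread_mutex_lock(&c->lock);
        while (c->full)                          /* wait for the receiver to drain the slot */
            pthread_cond_wait(&c->cond, &c->lock);
        c->value = v;
        c->full = 1;
        pthread_cond_broadcast(&c->cond);
        pthread_mutex_unlock(&c->lock);
    }

    int channel_recv(channel *c)
    {
        pthread_mutex_lock(&c->lock);
        while (!c->full)                         /* wait for a value to arrive */
            pthread_cond_wait(&c->cond, &c->lock);
        int v = c->value;
        c->full = 0;
        pthread_cond_broadcast(&c->cond);
        pthread_mutex_unlock(&c->lock);
        return v;
    }

A real channel would buffer more than one value and handle shutdown, but the hand-off of ownership is the essence of the style; on a hypothetical non-coherent machine, the send/recv pair is where the explicit data transfer would go.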
The article referred to by the OP seems to me to have it in for C for some reason. There were so many points in it I felt triggered by, but I don't want to go addressing each one point by point. Maybe there is some bias or special interest that has not been declared. As a C programmer, with a particular interest in writing high performance programs, I thought I'd give my two cents on some of the issues raised. Hopefully, this might be of interest to others in the industry with or without a programming background.
From my point of view, the strengths of C are mainly as follows....
C allows you to do things you just can't do in 'higher level' languages.
A well-written C program (but see weakness no. 1) is hard to beat on performance, on the same hardware, by a program written in another language.
C is comfortable handling binary data allowing for memory conservation.
C is well established with lots of libraries and programmers.
Objects in memory can be made easy to process from anywhere in the program by using pointers so the data itself doesn't need to be passed around.
Multi-threaded and multi-process programs are quite easy to implement.
It has read-write shared memory between threads (and between processes, with some fancy low-level stuff - see the sketch after this list).
Assembly can be inlined where needed (though it's not C then I know!).
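On the shared-memory-between-processes point above, the "fancy low-level stuff" is commonly POSIX shm_open plus mmap. A minimal sketch (the object name and size are arbitrary, error handling omitted; link with -lrt on older glibc):

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/wait.h>
    #include <unistd.h>

    /* Parent writes into a named shared-memory object; the forked child
       reads the same mapping.  Unrelated processes could also shm_open
       the same name. */
    int main(void)
    {
        const char *name = "/demo_shm";               /* illustrative name */
        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        ftruncate(fd, 4096);
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

        if (fork() == 0) {                            /* child */
            sleep(1);                                 /* crude ordering, fine for a sketch */
            printf("child sees: %s\n", p);
            return 0;
        }

        strcpy(p, "hello from the parent process");   /* parent writes */
        wait(NULL);
        munmap(p, 4096);
        close(fd);
        shm_unlink(name);                             /* remove the named object */
        return 0;
    }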
... and main weaknesses...
Utilising SIMD capabilities is not possible in standard C, and difficult to implement in a portable way with intrinsics.
It takes a lot of code to do simple things for which there are no library functions.
Buffer overflow potential is easily missed, even for experienced programmers.
C pointers can be confusing.
The C programming language has a special place in the evolution of programming languages, and I, for one, would welcome a replacement that is a better fit for what is possible with modern hardware, provided it doesn't tie the hands of the programmer and offers better security and performance. From the article:
'A processor designed purely for speed, not for a compromise between speed and C support, would likely support large numbers of threads, have wide vector units, and have a much simpler memory model. Running C code on such a system would be problematic, so, given the large amount of legacy C code in the world, it would not likely be a commercial success.'
Such things exist already: GPUs! Modern CPUs are much more like GPUs than they used to be, now that core counts can be 100+. I have used OpenCL C to write programs with amazing computational performance, but they can't do everything well. Some applications cannot be efficiently parallelised, if at all. OpenCL C program performance can become terrible when there is even a small amount of branching. Also, it is much easier to exhaust your memory bandwidth and fast cache when running many threads, so it might be judged not worth the added complexity over a good single-threaded implementation.
In OpenCL C, the programmer has somewhat more control over where data is stored in memory, which can definitely aid performance. Maybe it's a costly mistake to try to make programming languages too hardware-independent. Might it be better to have some (LLVM-like) intermediate standard, as in OpenCL C, where one can define 'private', 'local' and 'constant' memory objects to get performance improvements over 'global' memory objects? Such a standard wouldn't need to be tied to an instruction set. As a programmer, I welcome fast CPU instructions, but it would be nice if they could be much more easily utilised in portable code AND compilable to portable binaries. Maybe this is something compiler writers could look into, along with using SIMD vector registers rather than memory for pushing and popping. As I see it, there are four levels of portability.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the compiler to create binaries that will run correctly and efficiently on any hardware conforming to the intermediate standard.
Hardware independent source code to run on any hardware conforming to the intermediate standard. The burden is on the host compiler to create binaries that will run on the host's hardware configuration conforming to the intermediate standard, but may not run on other hardware conforming to the same.
Hardware dependent source code where the logical execution path through the source depends on the architecture of the hardware on which it is run. Programs need to 'query' the hardware configuration.
Hardware specific source code.
In a fantasy world where one can just imagine new standards, hardware, and programming languages, one could choose which level of portability to aim for. I think that C was supposed to be hardware-independent, but it isn't really if you want to get the best performance out of your hardware. OpenCL C tries also, but doesn't quite make it, though with run-time kernel compilation it does a pretty good job. The host program has the same issues as any other, though. I don't think there are any 'Level 1' portable languages currently.
Sorry my response is a bit rambling. It's unfortunate that it's difficult to have an objective constructive discussion about the pros and cons of different ideas about future changes in software and hardware. Personally, I think FPGAs have huge potential but are still a long way from where they would need to be to go mainstream. Any new computing language will probably become out of date when hardware changes occur and software trends change. It's remarkable that C still occupies such a prominent space. In another 10 or 20 years time, C will probably still be going strong. How many other modern languages will still be commonplace then?
There is an abundance of IDEs and toolchains for the Arm Cortex architecture for C/C++.
Lately, faced with a hard speed optimization issue on my STM Cortex-M3, I started wondering if there are indeed performance differences (in terms of output code execution speed), between different vendors, as some of them claim.
More specifically, between the GNU C compiler and commercial ones.
Did someone do a comparison between different compilers in this sense?
Practically speaking, binaries generated by commercial IDEs are more optimized and smaller in code size than those generated by GCC. The difference is there, but it is not so big, and it may even shrink to almost nothing with a little bit of optimization effort. I personally don't think you will find any clear benchmark of commercial toolchains versus GCC-based ones; speed and size really depend on so many factors.
https://web.archive.org/web/20170227190422/http://hilbert-space.de/?p=22
That site, which is quite dated, shows that hand-written asm would give a much greater improvement than the intrinsics. I am wondering whether this still holds true now, in 2012.
So has the compilation optimization improved for intrinsics using gnu cross compiler?
My experience is that the intrinsics haven't really been worth the trouble. It's too easy for the compiler to inject extra register unload/load steps between your intrinsics. The effort to get it to stop doing that is more complicated than just writing the stuff in raw NEON. I've seen this kind of stuff in pretty recent compilers (including clang 3.1).
At this level, I find you really need to control exactly what's happening. You can have all kinds of stalls if you do things in just barely the wrong order. Doing it in intrinsics feels like surgery with welder's gloves on. If the code is so performance critical that I need intrinsics at all, then intrinsics aren't good enough. Maybe others have different experiences here.
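For readers who haven't seen them, this is roughly what NEON intrinsics look like - a small, hypothetical example (arm_neon.h, compiled for an ARM target with NEON enabled). Register allocation and instruction scheduling are left to the compiler, which is exactly where the extra unload/load steps mentioned above can creep in:

    #include <arm_neon.h>

    /* Add two float arrays four lanes at a time with NEON intrinsics,
       with a scalar loop for the tail. */
    void add_f32(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            float32x4_t va = vld1q_f32(a + i);           /* load 4 floats */
            float32x4_t vb = vld1q_f32(b + i);
            vst1q_f32(dst + i, vaddq_f32(va, vb));       /* add and store */
        }
        for (; i < n; i++)
            dst[i] = a[i] + b[i];
    }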
I've had to use NEON intrinsics in several projects for portability. The truth is that GCC doesn't generate good code from NEON intrinsics. This is not a weakness of using intrinsics, but of the GCC tools. The ARM compiler from Microsoft produces great code from NEON intrinsics and there is no need to use assembly language in that case. Portability and practicality will dictate which you should use. If you can handle writing assembly language then write asm. For my personal projects I prefer to write time-critical code in ASM so that I don't have to worry about a buggy/inferior compiler messing up my code.
Update: The Apple LLVM compiler falls in between GCC (worst) and Microsoft (best). It doesn't do great with instruction interleaving nor optimal register usage, but at least it generates reasonable code (unlike GCC in some situations).
Update2: The Apple LLVM compiler for ARMv8 has been improved dramatically. It now does a great job generating ARMv8 code from C and intrinsics.
So this question is four years old, now, and still shows up in search results...
In 2016 things are much better.
A lot of simple code that I've transcribed from assembly to intrinsics is now optimised better by the compilers than by me, because I'm too lazy to do the pipeline work (for how many different pipelines now?), while the compilers just need me to pass the right -mtune=.
For complex code where register allocation can get tight, GCC and Clang can both still produce code slower than handwritten code by a factor of two... or three(ish). It's mostly down to register spills, so you should know from the structure of your code whether that's a risk.
But they both sometimes have disappointing accidents. I'd say that right now that's worth the risk (although I'm paid to take risk), and if you do get hit by something then file a bug. That way things will keep on getting better.
By now you even get auto-vectorization for the plain C code and the intrinsics are handled properly:
https://godbolt.org/z/AGHupq
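As a rough illustration (not the code behind that Godbolt link), a plain C loop like the one below is typically auto-vectorized into NEON loads/multiply-adds/stores by recent GCC and Clang at -O3 for AArch64; inspect the generated assembly for your own compiler version and flags:

    /* Plain C, no intrinsics.  The restrict qualifiers tell the compiler
       the arrays don't overlap, which makes vectorization easier. */
    void scale_add(float *restrict dst, const float *restrict src, float k, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = k * src[i] + dst[i];
    }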
Can you please give me some comparison between C compilers especially with respect to optimization?
Actually there aren't many free compilers around. gcc is "the" free compiler and probably one of the best when it comes to optimisation, even when compared to proprietary compilers.
Some independent benchmarks are linked from here:
http://gcc.gnu.org/benchmarks/
I believe Intel allows you to use its ICC compilers under Linux for non-commercial development for free. ICC beats gcc and Visual Studio hands down when it comes to code generation for x86 and x86-64 (i.e. it typically generates faster code, and can do a decent job of auto-vectorization (SIMD) in some cases).
This is a hard question to answer since you did not tell us what platform you are using, neither the hardware nor the OS...
But joemoe is right, gcc tends to excel in this field.
(As a side note: on some platforms there are commercial compilers that are better, but since you gain so much more than just the compiler, gcc is hard to beat...)
The Windows SDK is a free download. It includes current versions of the Visual C++ compilers. These compilers do a very good job of optimisation.
First Question
From a C programmer's point of view, what are the differences between Intel Core processors and their AMD equivalents ?
Related Second Question
I think that there are some instructions that differentiate the Intel Core from other processors, and vice versa. How important are those instructions? Are they taken into account by compilers? Would performance be better if there were a special Intel compiler only for the Core family?
If you are programming user-level code and most driver code, there aren't many differences (one exception is the availability of certain instruction sets - which may differ for different processors, see below). If you are writing kernel code dealing with CPU-specific features (profiling using internal counters, memory management, power management, virtualization), the architectures differ in implementation, sometimes greatly.
Most compilers do not automatically take advantage of SSE instructions. However, most do provide SSE-based intrinsics, which will allow you to write SSE-aware code. The subset of all SSE levels available differs for each processor architecture and maker.
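As a hypothetical illustration of what such intrinsics look like (xmmintrin.h; production code would normally check CPUID at run time or rely on a baseline such as x86-64's guaranteed SSE2):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Add two float arrays four at a time using SSE, with a scalar tail.
       Unaligned loads/stores are used so no alignment assumptions are made. */
    void add_ps(float *dst, const float *a, const float *b, int n)
    {
        int i;
        for (i = 0; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)
            dst[i] = a[i] + b[i];
    }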
See this page for instruction listings. Follow the links to see which architectures the specific instructions are supported on. Also, read the Intel and AMD architecture development manuals for exact details about support and implementation of any and all instruction sets.
First Question: From a C programmer's point of view, what are the differences between Intel Core processors and their AMD equivalents?
The most significant differences are likely to show up only in highly specialized code that makes use of new generation instructions, such as vector maths, parallelization, SSE.
Would performances be better if there was some special Intel compiler only for the Core family ?
Not sure if you are aware of it, but there's a compiler specifically for Intel cores: icc. It's generally considered to be the best compiler from an optimization point of view.
You might want to check out its wikipedia article.
According to the Intel Core Wikipedia article, there were notable improvements to the SSE, SSE2, and SSE3 instructions. These instructions are SIMD (single instruction, multiple data), meaning that they are designed to apply a single arithmetic operation to a vector of values. They are certainly important, and have been made use of by compilers such as GCC for quite a while.
Of course, recent AMD processors have adopted the newest Intel instructions, and vice-versa. This is an ongoing trend.