ARM Cortex toolchain speed optimization - c

There is an abundance of IDEs and toolchains for the Arm Cortex architecture for C/C++.
Lately, faced with a hard speed optimization issue on my STM Cortex-M3, I started wondering if there are indeed performance differences (in terms of output code execution speed), between different vendors, as some of them claim.
More specifically, between the GNU C compiler and commercial ones.
Did someone do a comparison between different compilers in this sense?

Practically speaking, binaries generated by commercial IDEs are more optimized and smaller in code size than ones generated by GCC. The difference is there, but it is not big, and with a little optimization effort it may shrink to almost nothing. I personally don't think you will find any clear benchmark of commercial toolchains versus GCC-based ones; speed and size really do depend on many factors.
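Since the result depends on so many factors, one practical way to compare toolchains is to time your own hot loop under each of them. What follows is only a rough sketch: it assumes a hosted C environment where `clock()` is available (on a Cortex-M3 you would more likely read a hardware cycle counter instead), and the names `checksum` and `time_kernel` are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical kernel under test: a simple rolling checksum. */
uint32_t checksum(const uint8_t *buf, size_t len)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < len; i++)
        acc = acc * 31u + buf[i];
    return acc;
}

/* Runs the kernel `reps` times and returns the elapsed CPU time in seconds.
   The first byte of the buffer is changed on every repetition and the result
   is added to a volatile sink, which makes it harder for the optimizer to
   hoist the call out of the loop or drop it as dead code. */
double time_kernel(uint8_t *buf, size_t len, unsigned reps,
                   volatile uint32_t *sink)
{
    assert(buf != NULL && len > 0 && sink != NULL);
    clock_t start = clock();
    for (unsigned r = 0; r < reps; r++) {
        buf[0] = (uint8_t)r;
        *sink += checksum(buf, len);
    }
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Build the same file with each compiler at its own best optimization settings and compare the returned times over a large `reps`; a single short run says very little.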

Related

Benchmarks for C compiler optimization

What are the standard benchmarks for comparing the optimizers of various C compilers?
I'm particularly interested in benchmarks for ARM (or those that can be ported to ARM).
https://en.wikipedia.org/wiki/SPECint is mostly written in C, and is the industry standard benchmark for real hardware, computer-architecture theoretical research (e.g. a larger ROB or some cache difference in a simulated CPU), and for compiler developers to test proposed patches that change code-gen.
The C parts of SPECfp (https://en.wikipedia.org/wiki/SPECfp) are also good choices. Or for a compiler back-end optimizer, the choice of front-end language isn't very significant. The Fortran programs are fine too.
Related: Tricks of a Spec master is a paper that covers the different benchmarks. Maybe originally from a conference.
"In this lightning round talk, I will cover at a high level the performance characteristics of these benchmarks in terms of optimizations that GCC does. For example, some benchmarks are classic floating point applications and benefit from SIMD (single instruction multiple data) instructions, while other benchmarks don't."
Wikipedia is out of date. SPECint/fp 2017 was a long time coming, but it was released in 2017 and is a significant improvement over 2006. For example, some of the 2006 benchmarks had been trivialized by clever compiler optimizations like loop inversion. (Some compilers over the years have added what is basically pattern recognition to optimize the loop in libquantum, but they can't always do that in general for other loops even when it would be safe. Apparently it can also be easily auto-parallelized.)
For testing a compiler, you might actually want code that aggressive optimization can find major simplifications in, so SPECcpu 2006 is a good choice. Just be aware of the issues with libquantum.
https://www.anandtech.com/show/10353/investigating-cavium-thunderx-48-arm-cores/12 describes gcc as a compiler that "does not try to "break" benchmarks (libquantum...)". But compilers like ICC and SunCC that CPU vendors use / used for SPEC submissions for their own hardware (Intel x86 and Sun UltraSPARC and later x86) are as aggressive as possible on SPEC benchmarks.
SPEC result submissions are required to include compiler version and options used (and OS tuning options), so you can hopefully replicate them.

Is code easily portable between Cortex A5 and Cortex A9 made by two different companies?

Can code written for a Cortex A5 built by one company be ported to a Cortex A9 made by another company without too much difficulty?
I want to write some bare metal C code that runs on Atmel's SAMA5D4 (Cortex A5) that takes video from a CMOS camera with a parallel interface and encodes it to H.264. That chip can hardware encode at 720p.
Later, I may want to build a similar setup that can encode at 1080p, so I would want to upgrade to a more expensive chip, NXP i.MX 6Solo (Cortex A9).
So I want to know if I would encounter major headaches or if it would be rather easy to port later. My gut tells me it should be easy but I thought I'd better ask the experts first. If it's a huge headache though I may start with the more expensive chip first.
I'm new to this and not at all experienced with ARM chips or even much C but am willing to learn :-)
As captured in the comments, this task can be made easier if the code is initially written to attempt to clearly abstract the platform specific detail from the application code. This is not as simple as simply replacing the boot.s and isn't something that you can really claim to have done until you've tested the porting.
Much of the architectural behaviour between the two processors will be unchanged, and the C-compiler ought to be able to take advantage of micro-architectural optimisations. This optimisation may not be the best that you could achieve with some manual effort.
Where you are likely to see hard problems is any points in your code that are sensitive to memory ordering or potentially interactions between code and exceptions. The Cortex-A9 is significantly more out-of-order than the Cortex-A5, and the migration may expose bugs in your code. Libraries ought to be stable now, but there is still a risk to be aware of. Anticipating this sort of problem is quite hard and if you are writing the majority of the code yourself you probably need to build in some contingency for the porting task. Once the code is stable on A9, issues of this sort are less likely to show up on either A5 (to give a lower cost production option), or more recent high performance cores.
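A typical spot where this kind of ordering bug hides is a "data ready" flag shared between two contexts (two cores, or a DMA/interrupt path and the main loop). The following is a minimal sketch of making the ordering explicit with C11 `<stdatomic.h>`, assuming your toolchain supports C11 atomics; `publish_frame`, `try_consume_frame` and the variables are hypothetical names, not taken from any vendor library.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

static uint32_t frame_len;        /* payload written by the producer   */
static atomic_bool frame_ready;   /* flag that publishes the payload   */

/* Producer: write the payload first, then set the flag with release
   ordering so the payload write cannot become visible after the flag. */
void publish_frame(uint32_t len)
{
    frame_len = len;
    atomic_store_explicit(&frame_ready, true, memory_order_release);
}

/* Consumer: read the flag with acquire ordering, so the payload read
   below cannot be satisfied before the flag has been observed as set. */
bool try_consume_frame(uint32_t *len_out)
{
    if (!atomic_load_explicit(&frame_ready, memory_order_acquire))
        return false;
    *len_out = frame_len;
    /* Release the slot only after the payload has been read. */
    atomic_store_explicit(&frame_ready, false, memory_order_release);
    return true;
}
```

Code that uses a plain `volatile` flag here may appear to work on one core and then fail on a more aggressively reordering one, which is the kind of latent bug described above. A real producer would also have to check that the slot is free before overwriting it.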
If I cut and paste a chapter of my math textbook into my biology textbook, will that make sense? They are both written in the English language.
No, that makes no sense. Assuming you are sticking to common ARM instructions for the code (English), the code isn't going to work from one chip (math book) to another (biology). The majority of the difference is in the vendor's logic, which is outside the ARM core; there is no reason whatsoever to assume that two vendors have the same peripherals at the same addresses that work exactly the same bit for bit, gate for gate.
So in general bare metal will NOT work and does NOT work like this. A very high-level "printf this or that" C program, sure, because you have many layers of abstraction including the target; it doesn't even have to be ARM to ARM. That said, it is certainly possible for you to make, or if you are very lucky find, a hardware abstraction layer that hides the differences between the chips. Above that layer you can ideally write that portion of the project once and port it. As for ARM vs ARM, the differences should be handled by the compiler, and again it doesn't even have to be ARM to ARM; it could be ARM to MIPS. Any assembly language you may have, or any core-specific accesses/instructions, would need to be checked against the two technical reference manuals to ensure they are compatible. Probably not at the Cortex-A level, but for Cortex-Ms there are some core-specific address space items that can affect high-level language code; for something like this to work you would have to hide that in the abstraction layer.
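To make the abstraction-layer idea concrete, here is a rough sketch of what such a layer could look like in C: the application calls one generic function and each chip supplies its own backend. Everything here is hypothetical (the `video_hal` struct, the function names, the stub backends); the resolution limits just mirror the 720p and 1080p figures from the question, and real backends would program the vendor's encoder registers instead of returning 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Chip-independent interface that the application codes against. */
typedef struct {
    const char *name;
    uint32_t max_width;
    uint32_t max_height;
    int (*encoder_start)(uint32_t width, uint32_t height);
} video_hal;

/* Stub backends: placeholders for the vendor-specific register setup. */
static int sama5d4_encoder_start(uint32_t width, uint32_t height)
{
    (void)width; (void)height;
    return 0;
}

static int imx6_encoder_start(uint32_t width, uint32_t height)
{
    (void)width; (void)height;
    return 0;
}

const video_hal hal_sama5d4  = { "SAMA5D4",    1280u,  720u, sama5d4_encoder_start };
const video_hal hal_imx6solo = { "i.MX 6Solo", 1920u, 1080u, imx6_encoder_start };

/* Application-facing call: validates the request against the selected
   chip's limits, then hands off to that chip's backend.
   Returns 0 on success, -1 on an invalid HAL or unsupported resolution. */
int hal_start_encoding(const video_hal *hal, uint32_t width, uint32_t height)
{
    if (hal == NULL || hal->encoder_start == NULL)
        return -1;
    if (width > hal->max_width || height > hal->max_height)
        return -1;
    return hal->encoder_start(width, height);
}
```

With this shape, porting means writing a new backend and swapping which `video_hal` the application is handed; the application code itself should not need to change. Whether it really doesn't is something you only find out by doing the port.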
Generally NO. ARM is the underlying core; the chip differences have nothing to do with ARM, so it's like cutting and pasting a chapter from a mystery novel you are writing in English into a biography you are also writing in English and hoping that chapter makes sense in the latter book.

Can the announced Tegra K1 be a contender against x86 and x64 chips in supercomputing applications?

To clarify, can this RISC-based processor (the Tegra K1) be used without significant changes to today's supercomputer programs, and perhaps be a game changer because of its power, size, cost, and energy usage? I know it's going up against some x64 or x86 processors. Can the code used for current supercomputers be easily converted to code that will run well on these mobile chips? Thanks.
Can the code used for current supercomputers be easily converted to code that will run well on these Mobile chips?
It depends on what you call "supercomputer code". Usually supercomputers run high-level functional code (usually fully compiled code like C++, sometimes VM-dependent code like Java) on top of other low-level code and technologies such as OpenCL or CUDA for accelerators or MPICH for communication between nodes.
All these technologies have ARM implementations, so the real task is to make the functional code ARM-compatible. This is usually straightforward, as code written in a high-level language is mostly hardware-independent. So the short answer is: yes.
However, what may be more complicated is to scale this code to these new processors.
Tegra K1 is nothing like the GPUs embedded in supercomputers. It has far less memory, runs slightly slower and has only 192 cores.
Its price and power consumption make it possible, however, to build supercomputers with hundreds of them inside.
So code which has been written for traditional supercomputers (a few high-performance GPUs embedded) will not reach the peak performance of "new" supercomputers (built with a lot of cheap and weak GPUs). There will be a price to pay to port existing code to these new architectures.
For modern supercomputing needs, you'd need to ask whether a processor performs well for the energy it consumes. Intel's current architecture along with GPUs fulfills those needs, and the Tegra architecture does not perform as well as Intel processors in terms of performance per watt.
The question is: should it? Intel keeps proving that ARM is inferior, and the only factor speaking for RISC-based processors is their price, which I highly doubt is a concern when building a supercomputer.

Arm Neon Intrinsics vs hand assembly

https://web.archive.org/web/20170227190422/http://hilbert-space.de/?p=22
This site, which is quite dated, shows that hand-written asm gives a much greater improvement than the intrinsics. I am wondering if this is still true now in 2012.
So has the compilation optimization improved for intrinsics using gnu cross compiler?
My experience is that the intrinsics haven't really been worth the trouble. It's too easy for the compiler to inject extra register unload/load steps between your intrinsics. The effort to get it to stop doing that is more complicated than just writing the stuff in raw NEON. I've seen this kind of stuff in pretty recent compilers (including clang 3.1).
At this level, I find you really need to control exactly what's happening. You can have all kinds of stalls if you do things in just barely the wrong order. Doing it in intrinsics feels like surgery with welder's gloves on. If the code is so performance critical that I need intrinsics at all, then intrinsics aren't good enough. Maybe others have different experiences here.
I've had to use NEON intrinsics in several projects for portability. The truth is that GCC doesn't generate good code from NEON intrinsics. This is not a weakness of using intrinsics, but of the GCC tools. The ARM compiler from Microsoft produces great code from NEON intrinsics and there is no need to use assembly language in that case. Portability and practicality will dictate which you should use. If you can handle writing assembly language then write asm. For my personal projects I prefer to write time-critical code in ASM so that I don't have to worry about a buggy/inferior compiler messing up my code.
Update: The Apple LLVM compiler falls in between GCC (worst) and Microsoft (best). It doesn't do great with instruction interleaving nor optimal register usage, but at least it generates reasonable code (unlike GCC in some situations).
Update2: The Apple LLVM compiler for ARMv8 has been improved dramatically. It now does a great job generating ARMv8 code from C and intrinsics.
So this question is four years old, now, and still shows up in search results...
In 2016 things are much better.
A lot of simple code that I've transcribed from assembly to intrinsics is now optimised better by the compilers than by me because I'm too lazy to do the pipeline work (for how many different pipelines now?), while the compilers just need me to pass the right -mtune=.
For complex code where register allocation can get tight, GCC and Clang both can still produce slower than handwritten code by a factor of two... or three(ish). It's mostly on register spills, so you should know from the structure of your code whether that's a risk.
But they both sometimes have disappointing accidents. I'd say that right now that's worth the risk (although I'm paid to take risk), and if you do get hit by something then file a bug. That way things will keep on getting better.
By now you even get auto-vectorization for the plain C code and the intrinsics are handled properly:
https://godbolt.org/z/AGHupq
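For readers who want to try this comparison on their own compiler, here is a small sketch (not the code behind the link above) of the same loop written twice: once as plain C that the compiler may auto-vectorize, and once with NEON intrinsics. The function names are made up; `vld1q_f32`, `vaddq_f32` and `vst1q_f32` are standard `arm_neon.h` intrinsics. The `__ARM_NEON` guard is only there so the file also builds on non-ARM hosts, where the second function falls back to the scalar loop.

```c
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* Plain C: `restrict` tells the compiler the arrays do not overlap,
   which is usually needed before it will auto-vectorize the loop. */
void add_f32_plain(float *restrict dst, const float *restrict a,
                   const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = a[i] + b[i];
}

/* Intrinsics: four floats per iteration, then a scalar tail for the
   remaining 0..3 elements (or for everything when NEON is absent). */
void add_f32_neon(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    for (; i < n; i++)
        dst[i] = a[i] + b[i];
}
```

Compiling both with optimization enabled and diffing the generated assembly is a quick way to see whether your particular compiler version treats the intrinsics any worse than its own vectorized output.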

Which free C compiler gives options for greater optimizations?

Can you please give me some comparison between C compilers especially with respect to optimization?
Actually there aren't many free compilers around. gcc is "the" free compiler and probably one of the best when it comes to optimisation, even when compared to proprietary compilers.
Some independent benchmarks are linked from here:
http://gcc.gnu.org/benchmarks/
I believe Intel allows you to use its ICC compilers under Linux for non-commercial development for free. ICC beats gcc and Visual Studio hands down when it comes to code generation for x86 and x86-64 (i.e. it typically generates faster code, and can do a decent job of auto-vectorization (SIMD) in some cases).
This is a hard question to answer since you did not tell us what platform you are using, neither hardware nor OS.
But joemoe is right, gcc tends to excel in this field.
(As a side note: on some platforms there are commercial compilers that are better, but since you gain so much more than just the compiler, gcc is hard to beat.)
The Windows SDK is a free download. It includes current versions of the Visual C++ compilers. These compilers do a very good job of optimisation.
