SSE-optimized code performs similarly to the plain version (C)

I wanted to take my first steps with Intel's SSE, so I followed the guide published here, with the difference that instead of developing for Windows and C++ I am doing it for Linux and C (therefore I don't use _aligned_malloc but posix_memalign).
I also implemented one compute-intensive method without using the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time, with the SSE version usually being slightly slower.
Is that normal? Could it be that GCC already optimizes with SSE (even with the -O0 option)? I also tried the -mfpmath=387 option, but it made no difference.
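For reference, a minimal sketch of the kind of code the question describes - posix_memalign for 16-byte-aligned buffers plus SSE intrinsics. The function names are illustrative and the array length is assumed to be a multiple of 4:

    #define _POSIX_C_SOURCE 200112L
    #include <xmmintrin.h>  /* SSE intrinsics */
    #include <stdlib.h>

    /* Illustrative SSE kernel: add two float arrays four elements at a time.
     * Assumes n is a multiple of 4 and all pointers are 16-byte aligned. */
    static void add_sse(const float *a, const float *b, float *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 4) {
            __m128 va = _mm_load_ps(a + i);
            __m128 vb = _mm_load_ps(b + i);
            _mm_store_ps(out + i, _mm_add_ps(va, vb));
        }
    }

    int main(void)
    {
        size_t n = 1 << 20;
        float *a, *b, *out;
        /* posix_memalign instead of _aligned_malloc on Linux */
        if (posix_memalign((void **)&a,   16, n * sizeof *a) ||
            posix_memalign((void **)&b,   16, n * sizeof *b) ||
            posix_memalign((void **)&out, 16, n * sizeof *out))
            return 1;
        /* ... fill a and b, then time add_sse() against a plain scalar loop ... */
        free(a); free(b); free(out);
        return 0;
    }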

For floating-point operations you may not see a huge benefit with SSE. Most modern x86 CPUs have two FPUs, so for double precision SIMD may be only about the same speed as scalar code, and single precision might give you 2x over scalar on a good day. For integer operations though, e.g. image or audio processing at 8 or 16 bits, you can still get substantial benefits from SSE.
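For instance, a sketch of the 8-bit case (SSE2 saturating adds, sixteen elements per instruction; the function name and the multiple-of-16 length are assumptions for brevity):

    #include <emmintrin.h>  /* SSE2 integer intrinsics */
    #include <stdint.h>
    #include <stddef.h>

    /* Saturating add of two 8-bit buffers, 16 elements per iteration.
     * Uses unaligned loads/stores, so no alignment requirement;
     * assumes n is a multiple of 16. */
    void add_pixels(const uint8_t *a, const uint8_t *b, uint8_t *out, size_t n)
    {
        for (size_t i = 0; i < n; i += 16) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(out + i), _mm_adds_epu8(va, vb));
        }
    }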

GCC has a very good built-in code vectorizer (enabled at -O3, or wherever -ftree-vectorize is turned on - not at -O0), which means it will use SIMD anywhere it can to speed up scalar code (it will also optimize SIMD code a bit, where possible).
It's pretty easy to confirm whether that is indeed what's happening here: just disassemble the output (or have GCC emit commented asm files).
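For example (the file name and loop are purely illustrative), something along these lines shows whether packed instructions were emitted:

    /* vec.c - a loop simple enough for GCC's vectorizer.
     *
     * Inspect what the compiler actually emitted, e.g.:
     *   gcc -O3 -S -fverbose-asm vec.c        (commented assembly in vec.s)
     *   gcc -O3 -c vec.c && objdump -d vec.o  (disassemble the object file)
     * Packed instructions such as addps/paddd indicate vectorization;
     * scalar-only addss/mulss means the loop stayed scalar. */
    void scale(float *x, float s, int n)
    {
        for (int i = 0; i < n; ++i)
            x[i] *= s;
    }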

Related

Is there a version of the standard math library which uses VEX instructions?

I have this large library with a mix of regular C++, a lot of SSE intrinsics and a few insignificant pieces of assembly. I have reached the point where I would like to target the AVX instruction set.
To do that, I would like to build the whole thing with gcc's -mavx or MSVC's /arch:AVX so I can add AVX intrinsics wherever they are needed and not have to worry about AVX state transitions internally.
The only problem I've found with that is the standard C math functions: sin(), exp(), etc. Their implementation on Linux systems uses SSE instructions without the VEX prefix. I have not checked, but I expect a similar problem on Windows.
The code makes a fair number of calls to math functions. Some quick benchmarking reveals that a simple call to sin() gets either slightly (~10%) slower or much (3x) slower depending on the exact CPU and how it handles the AVX transitions (Skylake vs. older).
Adding a VZEROUPPER before the call helps pre-Skylake CPUs a lot but actually makes the code a little slower on Skylake. It seems like the proper solution would be a VEX-encoded version of the math functions.
So my question is this: Is there a reasonably efficient math library which can be compiled to use VEX encoded instructions? How do others deal with this problem?
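For what it's worth, a minimal sketch of the VZEROUPPER workaround mentioned in the question (compiled with -mavx; the function name is illustrative, and compilers may already insert VZEROUPPER around calls on their own):

    #include <immintrin.h>
    #include <math.h>

    /* Some 256-bit AVX work followed by a call into the (SSE-encoded) libm.
     * _mm256_zeroupper() emits VZEROUPPER, clearing the upper YMM halves so
     * the non-VEX sin() call does not pay an AVX->SSE transition penalty on
     * pre-Skylake CPUs. */
    double avx_then_sin(__m256d v, double x)
    {
        __m256d t = _mm256_mul_pd(v, v);                      /* placeholder AVX work */
        double lo = _mm_cvtsd_f64(_mm256_castpd256_pd128(t)); /* lowest lane */
        _mm256_zeroupper();   /* emits VZEROUPPER before the non-VEX call */
        return sin(x + lo);   /* plain SSE-encoded libm call */
    }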

How to enable the DIV instruction in ASM output of C compiler

I am using vbcc compiler to translate my C code into Motorola 68000 ASM.
For whatever reason, every time I use division (integer only, not floats) in the code, the compiler only inserts the following stub into the ASM output (which gets regenerated on every recompile):
public __ldivs
jsr __ldivs
I explicitly searched for all variations of DIVS/DIVU, but every single time there is just that stub above. The code itself works (I debugged it on the target device), so the final code does contain the DIV instruction, just not in the intermediate output.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
However, I can't do that if I don't see the resulting ASM code. Any ideas how to enable it? The compiler manual does not specify anything like that, so there must be some other - probably common - higher principle in play.
From the vbcc compiler system manual by Volker Barthelmann:
4.1 Additional options
This backend provides the following additional options:
-cpu=n Generate code for cpu n (e.g. -cpu=68020), default: 68000.
...
4.5 CPUs
The values of -cpu=n have those effects:
...
n>=68020
32bit multiplication/division/modulo is done with the mul?.l, div?.l and
div?l.l instructions.
The original 68000 CPU didn't have support for 32-bit divides, only 16-bit division, so by default vbcc doesn't generate 32-bit divide instructions.
Basically, your question doesn't even belong here. You're asking about the workings of your compiler, not the 68K CPU family.
Since this is the most expensive instruction and it's in an inner loop, I really gotta experiment with tweaking the code to get the max performance of it.
Then you are already tilting at windmills. Choosing an obscure C compiler while at the same time desiring top performance are conflicting goals.
If you really need MC68000 code compatibility, the choice of C is questionable. Since the 68000 has no cache at all, the store/load orgies that simple C compilers tend to produce en masse have a huge performance impact. The impact lessens considerably on the higher members of the family and may become invisible on the superscalar pipelined ones (erm, one: the 68060).
Switch to 68020 code model if target platform permits, and switch compiler if you're not satisfied with your current one.
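For illustration, a trivial function to experiment with; per the manual excerpt above, building it with -cpu=68020 should make vbcc emit an inline div?.l here instead of the jsr __ldivs stub (the exact driver invocation and output depend on your vbcc setup):

    /* div32.c - 32-bit signed division in an inner loop.
     * Per the vbcc manual quoted above, compile with -cpu=68020 (if the
     * target permits) to get an inline divs.l instead of jsr __ldivs. */
    long scale_all(long *values, long divisor, int n)
    {
        long sum = 0;
        int i;
        for (i = 0; i < n; ++i)
            sum += values[i] / divisor;   /* the expensive instruction */
        return sum;
    }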

Compiler -march flag benchmark?

Does the -march flag in compilers (for example gcc) really matter?
Would it be faster if I compiled all my programs and my kernel using -march=my_architecture instead of -march=i686?
Yes it does, though the differences are only sometimes relevant. They can be quite big however if your code can be vectorized to use SSE or other extended instruction sets which are available on one architecture but not on the other. And of course the difference between 32 and 64 bit can (but need not always) be noticeable (that's -m64 if you consider it a type of -march parameter).
As anecdotal evidence: a few years back I ran into a funny bug in GCC where a particular piece of code running on a Pentium 4 would be about 2 times slower when compiled with -march=pentium4 than when compiled with -march=pentium2.
So: often there is no difference, sometimes there is, and sometimes it's the other way around from what you expect. As always: measure before you decide to use any optimization that goes beyond the "safe" range (e.g. using your actual exact CPU model instead of a more generic one).
There is no guarantee that any code you compile with -march will be faster or slower compared to the other version. It really depends on the kind of code, and the actual result can only be obtained by measurement. E.g., if your code has a lot of potential for vectorization, then the results might differ with and without -march. On the other hand, sometimes the compiler does a poor job during vectorization, and that might result in slower code when compiled for a specific architecture.
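A minimal way to see (and then measure) the difference, assuming GCC; the file name and loop are only illustrative:

    /* saxpy.c - a loop whose generated code typically changes with -march.
     * On a 32-bit x86 toolchain, for example:
     *   gcc -O3 -march=i686   -S saxpy.c -o saxpy_generic.s
     *   gcc -O3 -march=native -S saxpy.c -o saxpy_native.s
     * (on x86-64, compare e.g. -march=x86-64 against -march=native instead).
     * Diff the two .s files, then benchmark: -march=native may enable wider
     * SIMD, but whether it is actually faster has to be measured, as the
     * answers above point out. */
    void saxpy(float a, const float *x, float *y, int n)
    {
        for (int i = 0; i < n; ++i)
            y[i] += a * x[i];
    }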

Will it be fine if we use the -msse and -msse2 options of gcc on RTOS devices?

As far as I know, the -msse and -msse2 options of gcc improve performance by making arithmetic operations faster. I have also read somewhere that they use more resources, such as registers and cache memory.
What about the performance if we run an executable generated with these options on an RTOS device (like a VxWorks board)?
The OS must support SSE(2) instructions for your application to work correctly. It would seem, from googling, that VxWorks supports this (and it's not really that hard; all it takes is that the OS has a 512-byte save area per task that uses SSE/SSE2 - given the right circumstances it can be allocated on demand, but it's often easier to just allocate it for all tasks). Saving/restoring the SSE registers is done "on demand", that is, only when a task other than the last one to use SSE executes SSE instructions does the OS need to save the registers. The OS uses a special interrupt (trap) to detect that a new task is trying to use SSE instructions.
So, as long as the processor supports it, you should be fine.
I may not be able to directly answer your question, but here are a couple things I do know that may be of use:
SSE, SSE2, etc. must be supported/implemented by the processor for them to have any effect in the first place.
There are specific functions you can call that use these extended instructions for mathematical operations. These functions operate on wider data types or perform an operation on a set efficiently.
Enabling the options in GCC may use the previous APIs/builtins automatically. This is the part I am unsure about.
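In case it helps, GCC offers a runtime check for the CPU side of this. A minimal sketch using __builtin_cpu_supports (available in GCC 4.8 and later on x86); note it says nothing about whether the RTOS saves the SSE state per task, which is the point made above:

    #include <stdio.h>

    /* Runtime CPU check using GCC's __builtin_cpu_supports (GCC 4.8+, x86).
     * Only checks the processor; OS-level support for saving/restoring the
     * SSE registers per task is a separate question. */
    int main(void)
    {
        if (__builtin_cpu_supports("sse2"))
            puts("CPU supports SSE2");
        else
            puts("CPU does not support SSE2");
        return 0;
    }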

Practical use of automatic vectorization?

Has anyone taken advantage of the automatic vectorization that gcc can do? In the real world (as opposed to example code)? Does it take restructuring of existing code to take advantage? Are there a significant number of cases in any production code that can be vectorized this way?
I have yet to see either GCC or Intel C++ automatically vectorize anything but very simple loops, even when given the code of algorithms that can (and were, after I manually rewrote them using SSE intrinsics) be vectorized.
Part of this is being conservative - especially when faced with possible pointer aliasing, it can be very difficult for a C/C++ compiler to 'prove' to itself that a vectorization would be safe, even if you as the programmer know that it is. Most compilers (sensibly) prefer to not optimize code rather than risking miscompiling it. This is one area where higher level languages have a real advantage over C, at least in theory (I say in theory since I'm not actually aware of any automatically vectorizing ML or Haskell compilers).
Another part of it is simply analytical limitations - most research in vectorization, as I understand it, is related to optimizing classical numerical problems (fluid dynamics, say), which were the bread and butter of most vector machines until a few years ago (when, between CUDA/OpenCL, AltiVec/SSE, and the STI Cell, vector programming in various forms became widely available in commercial systems).
It's fairly unlikely that code written with a scalar processor in mind will be easy for a compiler to vectorize. Happily, many things you can do to make it easier for a compiler to vectorize, like loop tiling and partial loop unrolling, also (tend to) help performance on modern processors even if the compiler doesn't figure out how to vectorize.
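As a concrete illustration of the aliasing point above: a loop like the one below often only vectorizes once the compiler is told the pointers cannot alias (names are illustrative; the vectorizer's report can be checked with e.g. -fopt-info-vec on reasonably recent GCC):

    /* Without restrict, GCC must assume dst may overlap a or b and will often
     * refuse to vectorize; with restrict (C99) the loop is a straightforward
     * candidate at -O3. */
    void add_arrays(float *restrict dst,
                    const float *restrict a,
                    const float *restrict b,
                    int n)
    {
        for (int i = 0; i < n; ++i)
            dst[i] = a[i] + b[i];
    }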
It is hard to use in typical business logic, but it gives speedups when you are processing volumes of data in the same way.
Good example is sound/video processing where you apply the same operation to every sample/pixel.
I have used VisualDSP for this, and you had to check the generated code after compiling to see whether vectorization was really used where it should be.
Vectorized instructions are not limited to Cell processors - most modern workstation-class CPUs have them (PPC, x86 since the Pentium 3, SPARC, etc.). When used well for floating-point operations, they can help quite a lot with very compute-intensive tasks (filters, etc.). In my experience, automatic vectorization does not work so well.
You may have noticed that pretty much no one actually knows how to make good use of GCC's automatic vectorization. If you search around the web for people's comments, it always comes down to the idea that GCC allows you to enable automatic vectorization, but it extremely rarely makes actual use of it, and so if you want to use SIMD acceleration (e.g. MMX, SSE, AVX, NEON, AltiVec), then you basically have to figure out how to write it using compiler intrinsics or assembly language.
But the problem with intrinsics is that you effectively need to understand the assembly-language side of it and then also learn the intrinsics way of describing what you want, which is likely to result in much less efficient code than if you wrote it directly in assembly (by a factor of 10x, say), because the compiler can still have trouble making good use of your intrinsic instructions!
For example, you might be using SIMD intrinsics so that many operations can be performed in parallel at the same time, but your compiler may generate assembly code that transfers the data between the SIMD registers and the normal CPU registers and back, effectively making your SIMD code run at a similar speed to (or even slower than) normal code!
So basically:
- If you want up to 100% speedups (2x speed), then either buy the official Intel/ARM compilers or convert some of your code to use SIMD C/C++ intrinsics.
- If you want 1000% speedups (10x speed), then write it in assembly code using SIMD instructions by hand. Or, if available on your hardware, use GPU acceleration instead, such as OpenCL or Nvidia's CUDA SDK, since they can provide similar speedups on the GPU as SIMD does on the CPU.
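To make the register-shuffling point above concrete, here is a hedged sketch of intrinsics code that keeps its working data in an XMM register for the whole loop and only moves back to scalar registers once, at the end (the array length is assumed to be a multiple of 4):

    #include <xmmintrin.h>  /* SSE */
    #include <stddef.h>

    /* Horizontal sum of a float array. The accumulator stays in a SIMD
     * register across iterations; data is extracted to scalar registers
     * only once, after the loop, instead of on every iteration. */
    float sum_sse(const float *x, size_t n)
    {
        __m128 acc = _mm_setzero_ps();
        for (size_t i = 0; i < n; i += 4)
            acc = _mm_add_ps(acc, _mm_loadu_ps(x + i));

        /* one horizontal reduction at the very end */
        float lanes[4];
        _mm_storeu_ps(lanes, acc);
        return lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }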

Resources