SIMD SSE2 instructions in assembly - c

I'm currently rewriting a program that used 64-bit words to use 128-bit words, using Intel's SIMD SSE2 intrinsics. My new program, which uses the SIMD intrinsics, is about 60% slower than the original, when I had expected it to be around twice as fast. When I looked at the assembly code for each of them, they were very similar and about the same length. However, the object code (compiled file) was 60% longer.
I also ran callgrind on the two programs, which told me how many instruction reads there were per line. I found that the SIMD version of my program often had fewer instruction reads for the same action than the original version. Ideally that is what should happen, but it doesn't add up, because the SIMD version takes longer to run.
My question:
Do the SSE2 intrinsics convert into more assembly instructions? Do the SSE2 instructions take longer to run? Or is there some other reason that my new program is so slow?
Additional notes: I am programming in C, on Linux Mint, and compiling with gcc -O3 -march=native.
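For context, here is a minimal sketch of the kind of rewrite being described, with the caveat that the XOR kernel and the function names are illustrative placeholders, not the asker's actual code: a scalar loop over 64-bit words next to the equivalent SSE2 loop over 128-bit words.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void xor_words_u64(uint64_t *dst, const uint64_t *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] ^= src[i];                 // one 64-bit word per iteration
}

void xor_words_sse2(uint64_t *dst, const uint64_t *src, size_t n)
{
    // n assumed even; unaligned loads/stores used for simplicity
    for (size_t i = 0; i < n; i += 2) {
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_xor_si128(a, b));  // two 64-bit words per iteration
    }
}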

Related

Why doesn’t Clang use vcnt for __builtin_popcountll on AArch32?

The simple test,
unsigned f(unsigned long long x) {
    return __builtin_popcountll(x);
}
when compiled with clang --target=arm-none-linux-eabi -mfpu=neon -mfloat-abi=softfp -mcpu=cortex-a15 -Os,⁎ results in the compiler emitting the numerous instructions required to implement the classic popcount for the low and high words of x in parallel, then adding the results.
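For reference, the "classic popcount" being expanded is essentially the standard SWAR bit-twiddling routine, something along these lines (a sketch of the idea, not the exact sequence Clang emits):

#include <stdint.h>

static unsigned popcount64_swar(uint64_t x)
{
    x = x - ((x >> 1) & 0x5555555555555555ULL);                           // 2-bit partial sums
    x = (x & 0x3333333333333333ULL) + ((x >> 2) & 0x3333333333333333ULL); // 4-bit partial sums
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0FULL;                           // per-byte sums
    return (unsigned)((x * 0x0101010101010101ULL) >> 56);                 // add all bytes, keep the top one
}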
It seems to me from skimming the architecture manuals that NEON code similar to that generated for
#include <arm_neon.h>
unsigned f(unsigned long long x) {
    uint8x8_t v = vcnt_u8(vcreate_u8(x));
    return vget_lane_u64(vpaddl_u32(vpaddl_u16(vpaddl_u8(v))), 0);
}
should have been beneficial in terms of size at least, even if not necessarily a performance improvement.
Why doesn’t Clang† do that? Am I just giving it the wrong options? Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on the A15, that it wouldn’t be worth it? (This is what a comment on a related question seems to suggest, but very briefly.) Is Clang codegen for AArch32 lacking care and attention, seeing as almost every modern mobile device uses AArch64? (That seems far-fetched, but GCC, for example, is known to occasionally have bad codegen on non-prominent architectures such as PowerPC or MIPS.)
⁎ Clang options could be wrong or redundant, adjust as necessary.
† GCC doesn’t seem to do that in my experiments, either, just emitting a call to __popcountdi2, but that suggests I might simply be calling it wrong.
Are the ARM-to-NEON-to-ARM transitions so spectacularly slow, even on
the A15, that it wouldn’t be worth it?
Well, you asked exactly the right question.
In short: yes, it is. Moving data between NEON and the ARM CPU (and vice versa) is slow, and in most cases the penalty outweighs the performance gain from using 'fast' NEON instructions.
In detail: NEON is an optional co-processor in ARMv7-based chips.
The ARM CPU and NEON work in parallel and, one might say, 'independently' of each other.
Interaction between the CPU and the NEON co-processor is organised via a FIFO. The CPU places a NEON instruction in the FIFO, and the NEON co-processor fetches and executes it.
The delay comes at the point where the CPU and NEON need to synchronise with each other. Synchronisation means accessing the same memory region or transferring data between registers.
So the whole process of using vcnt would be something like:
The ARM CPU places vcnt into the NEON FIFO
Data is moved from a CPU register into a NEON register
NEON fetches vcnt from the FIFO
NEON executes vcnt
Data is moved from the NEON register back to a CPU register
And all that time the CPU is simply waiting while NEON does its work.
Due to NEON pipelining, the delay might be up to 20 cycles (if I remember that number correctly).
Note: "up to 20 cycles" is not a fixed figure, since if the ARM CPU has other instructions that do not depend on the result of the NEON computation, it can execute them in the meantime.
Conclusion: as a rule of thumb it's not worth it, unless you manually optimise the code to reduce or eliminate those sync delays.
PS: That's true for ARMv7. On ARMv8 the NEON extension is part of the core, so this issue doesn't apply there.
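To make the conclusion concrete, here is a hedged sketch (my own function, assuming the buffer length is a multiple of 16) of how that sync cost gets amortised in practice on ARMv7: do a whole buffer's worth of work in NEON and move the result back to the ARM core only once.

#include <arm_neon.h>
#include <stddef.h>
#include <stdint.h>

uint64_t popcount_buffer(const uint8_t *buf, size_t len)
{
    uint64x2_t total = vdupq_n_u64(0);            // accumulator stays in a NEON register
    for (size_t i = 0; i < len; i += 16) {        // len assumed to be a multiple of 16
        uint8x16_t v   = vld1q_u8(buf + i);       // 16 bytes per iteration
        uint8x16_t cnt = vcntq_u8(v);             // per-byte popcount
        // widen 8 -> 16 -> 32 -> 64 bits and fold into the running total
        total = vaddq_u64(total, vpaddlq_u32(vpaddlq_u16(vpaddlq_u8(cnt))));
    }
    // only one NEON -> ARM transfer, at the very end
    return vgetq_lane_u64(total, 0) + vgetq_lane_u64(total, 1);
}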

(Edited) When should one use inline assembly in C (outside of optimization)? [closed]

Note: Edited to make the question non-opinion-based
Assumptions
We are in user mode (not in the kernel)
The OS being used is either a modern version of Linux or a modern version of Windows, running on an x86 CPU.
Other than optimization, is there a specific example where using inline assembly in a C program is needed? (If applicable, provide the inline assembly.)
To be clear, I mean injecting assembly language code through the use of the keywords __asm__ (in the case of GCC) or __asm (in the case of VC++).
(Most of this was written for the original version of the question. It was edited after).
You mean purely for performance reasons, so excluding using special instructions in an OS kernel?
What you really ultimately want is machine code that executes efficiently. And the ability to modify some text files and recompile to get different machine code. You can usually get both of those things without needing inline asm, therefore:
https://gcc.gnu.org/wiki/DontUseInlineAsm
GNU C inline assembly is hard to use correctly, but if you do use it correctly, it has very low overhead. Still, it blocks many important optimizations like constant-propagation.
See https://stackoverflow.com/tags/inline-assembly/info for guides on how to use it efficiently / safely. (e.g. use constraints instead of stupid mov instructions as the first or last instruction in the asm template.)
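As a small illustration of the constraint advice (a sketch assuming x86-64 and GNU C; the rotate wrapper is my own example, not from the linked guide): the constraints let the compiler allocate registers itself, so no explicit mov into or out of fixed registers is needed.

#include <stdint.h>

static inline uint64_t rotl64(uint64_t x, unsigned n)
{
    __asm__("rolq %%cl, %0"      // rotate left by CL
            : "+r"(x)            // x: any register, read-write
            : "c"(n));           // n: forced into ECX/CL by the "c" constraint
    return x;
}

That said, the pure-C idiom (x << n) | (x >> (-n & 63)) is recognized by GCC and Clang and compiled to a single rotate instruction anyway, which is exactly the "you can usually avoid inline asm" point above.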
Pretty much always inappropriate, unless you know exactly what you're doing and can't hand-hold the compiler to make asm that's quite as good with pure C or intrinsics. Manual vectorization with intrinsics certainly still has its place; compilers are still terrible at some things, like auto-vectorizing complex shuffles. GCC/Clang won't auto-vectorize at all for search loops like a pure C implementation of memchr, or any loop where the trip-count isn't known before the first iteration.
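For instance, a search loop of this shape (my own illustration, not code from the answer) stays scalar even at -O3, because the trip count depends on the data:

#include <stddef.h>

const char *find_byte(const char *p, size_t n, char c)
{
    for (size_t i = 0; i < n; i++)
        if (p[i] == c)         // early exit: iteration count is data-dependent
            return p + i;
    return NULL;               // a real memchr beats this with hand-vectorized SIMD
}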
And of course performance on current microarchitectures has to trump maintainability and optimizing differently for future CPUs. If it's ever appropriate, only for small hot loops where your program spends a lot of time, and typically CPU-bound. If memory-bound then there's usually not much to gain.
Over large scales, compilers are excellent (especially with link-time optimization). Humans can't compete on that scale, not while keeping code maintainable. The only place humans can still compete is in the small scale where you can afford the time to think about every single instruction in a loop that will run many iterations over the course of a program.
The more widely-used and performance-sensitive your code is (e.g. a video encoder like x264 or x265), the more reason there is to consider hand-tuned asm for anything. Saving a few cycles over millions of computers running your code every day starts to add up to being worth considering the maintenance / testing / portability downsides.
The one notable exception is ARM SIMD (NEON), where compilers are often still bad. I think especially for 32-bit ARM (where each 128-bit q0..q15 register is aliased by two 64-bit d0..d31 registers, so you can avoid shuffling by accessing the two halves as separate registers). Compilers don't model this well, and can easily shoot themselves in the foot when compiling intrinsics that you'd expect to be able to compile efficiently. Compilers are good at producing efficient asm from SIMD intrinsics for x86 (SSE/AVX) and PowerPC (AltiVec), but for some unknown reason are bad at optimizing ARM NEON intrinsics and often make sub-optimal asm.
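To illustrate the aliasing point (a sketch using standard NEON intrinsics; whether a given compiler actually exploits the aliasing is exactly the problem being described): on AArch32 the two halves of a q register are themselves d registers, so splitting a 128-bit vector should cost no shuffle instruction at all.

#include <arm_neon.h>

uint16x8_t sum_halves(uint8x16_t v)
{
    uint8x8_t lo = vget_low_u8(v);   // low half is just the even d register: no instruction needed
    uint8x8_t hi = vget_high_u8(v);  // high half is the odd d register: no instruction needed
    return vaddl_u8(lo, hi);         // widening add of the two halves
}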
Some compilers are not bad, e.g. apparently Apple clang/LLVM for AArch64 does ok more often than it used to. But still, see Arm Neon Intrinsics vs hand assembly - Jake Lee found the intrinsics version of his 4x4 float matmul was 3x slower than his hand-written version using clang, in Dec 2017. Jake is an ARM optimization expert so I'm inclined to believe that's fairly realistic.
or __asm (in the case of VC++)
MSVC-style asm is usually only useful for writing whole loops because having to take inputs via memory operands destroys (some of) the benefit. So amortizing that overhead over a whole loop helps.
For wrapping single instructions, introducing extra store-forwarding latency is just dumb, and there are MSVC intrinsics for almost everything you can't easily express in pure C. See What is the difference between 'asm', '__asm' and '__asm__'? for examples with a single instruction: you get much worse asm from using MSVC inline asm than you would for pure C or an intrinsic if you look at the big picture (including compiler-generated asm outside your asm block).
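A hedged sketch of that contrast (32-bit MSVC assumed, since x64 MSVC drops inline asm entirely; the bit-scan example is mine, not taken from the linked question, and both versions assume x != 0):

#include <intrin.h>

unsigned bsr_via_asm(unsigned x)
{
    unsigned result;
    __asm {
        bsr eax, x       // x is read from its stack slot (a memory operand)
        mov result, eax  // and the result travels back through memory too
    }
    return result;       // extra store/reload latency wrapped around one instruction
}

unsigned bsr_via_intrinsic(unsigned x)
{
    unsigned long idx;
    _BitScanReverse(&idx, x);   // the compiler can keep x and idx in registers
    return (unsigned)idx;
}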
C++ code for testing the Collatz conjecture faster than hand-written assembly - why? shows a concrete example where hand-written asm is faster on current CPUs than anything I was able to get GCC or clang to emit by tweaking C source. They apparently don't know how to optimize for lower-latency LEA when it's part of a loop-carried dependency chain.
(The original question there was a great example of why you shouldn't write by hand in asm unless you know exactly what you're doing and use optimized compiler output as a starting point. But my answer shows that for a long-running hot tight loop, there are significant gains that compilers are missing with just micro-optimizations, even leaving aside algorithmic improvements.)
If you're considering asm, always benchmark it against the best you can get the compiler to emit. Working on a hand-written asm version may give you ideas that you can apply to your C to hand-hold compilers into making better asm. Then you can get the benefit without actually including any non-portable inline asm in your code.

Is there a version of the standard math library which uses VEX instructions?

I have this large library with a mix of regular C++, a lot of SSE intrinsics and a few insignificant pieces of assembly. I have reached the point where I would like to target the AVX instruction set.
To do that, I would like to build the whole thing with gcc's -mavx or MSVC's /arch:AVX so I can add AVX intrinsics wherever they are needed and not have to worry about AVX state transitions internally.
The only problem I've found with that is the standard C math functions: sin(), exp(), etc. Their implementation on Linux systems uses SSE instructions without the VEX prefix. I have not checked, but I expect a similar problem on Windows.
The code makes a fair number of calls to math functions. Some quick benchmarking reveals that a simple call to sin() gets either slightly (~10%) slower or much (3x) slower, depending on the exact CPU and how it handles the AVX transitions (Skylake vs. older).
Adding a VZEROUPPER before the call helps pre-Skylake CPUs a lot, but actually makes the code a little slower on Skylake. It seems like the proper solution would be a VEX-encoded version of the math functions.
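For illustration, the VZEROUPPER workaround mentioned above can be written with an intrinsic rather than raw asm (a sketch; the surrounding AVX code and the function name are placeholders, and as noted the benefit depends on the CPU):

#include <immintrin.h>
#include <math.h>

double avx_then_sin(__m256d v)
{
    double x = _mm256_cvtsd_f64(v);  // take the low element of the AVX vector
    _mm256_zeroupper();              // clear the dirty upper-YMM state before non-VEX code
    return sin(x);                   // libm's non-VEX SSE code now runs without a transition penalty (pre-Skylake)
}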
So my question is this: Is there a reasonably efficient math library which can be compiled to use VEX encoded instructions? How do others deal with this problem?

Compiler -march flag benchmark?

Does the -march flag in compilers (for example, GCC) really matter?
Would it be faster if I compiled all my programs and the kernel using -march=my_architecture instead of -march=i686?
Yes it does, though the differences are only sometimes relevant. They can be quite big however if your code can be vectorized to use SSE or other extended instruction sets which are available on one architecture but not on the other. And of course the difference between 32 and 64 bit can (but need not always) be noticeable (that's -m64 if you consider it a type of -march parameter).
As anecdotal evidence, a few years back I ran into a funny bug in GCC where a particular piece of code running on a Pentium 4 would be about 2 times slower when compiled with -march=pentium4 than when compiled with -march=pentium2.
So: often there is no difference, and sometimes there is, sometimes it's the other way around than you expect. As always: measure before you decide to use any optimizations that go beyond the "safe" range (e.g. using your actual exact CPU model instead of a more generic one).
There is no guarantee that code compiled with -march will be faster or slower than the other version. It really depends on the kind of code, and the actual answer can only be obtained by measurement. For example, if your code has a lot of potential for vectorization, then results may differ with and without -march. On the other hand, sometimes the compiler does a poor job during vectorization, which can result in slower code when compiled for a specific architecture.
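As a toy illustration of the vectorization point (my own example, not from the answers above): with gcc -O3 -march=native a loop like this can be compiled with whatever SIMD width the host CPU supports, while a generic -march=i686 or baseline x86-64 build is restricted to older instructions.

void add_arrays(float *restrict a, const float *restrict b, int n)
{
    for (int i = 0; i < n; i++)
        a[i] += b[i];          // simple, countable loop: a good auto-vectorization candidate
}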

SSE optimized code performs similar to plain version

I wanted to take my first steps with Intel's SSE so I followed the guide published here, with the difference that instead of developing for Windows and C++ I make it for Linux and C (therefore I don't use any _aligned_malloc but posix_memalign).
I also implemented one compute-intensive method without making use of the SSE extensions. Surprisingly, when I run the program, both pieces of code (the one with SSE and the one without) take similar amounts of time to run, with the SSE version usually being slightly slower.
Is that normal? Could it be that GCC already optimizes with SSE (even when using the -O0 option)? I also tried the -mfpmath=387 option, but no luck, it's still the same.
For floating point operations you may not see a huge benefit with SSE. Most modern x86 CPUs have two FPUs so double precision may only be about the same speed for SIMD vs scalar, and single precision might give you 2x for SIMD over scalar on a good day. For integer operations though, e.g. image or audio processing at 8 or 16 bits, you can still get substantial benefits with SSE.
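To make the integer case concrete (a sketch of my own, assuming SSE2 and a length that is a multiple of 16): a single instruction performs sixteen 8-bit saturating additions per iteration, the kind of win that is hard to match with scalar code.

#include <emmintrin.h>
#include <stddef.h>
#include <stdint.h>

void add_saturate_u8(uint8_t *dst, const uint8_t *src, size_t n)
{
    for (size_t i = 0; i < n; i += 16) {   // n assumed to be a multiple of 16
        __m128i a = _mm_loadu_si128((const __m128i *)(dst + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(src + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_adds_epu8(a, b));  // 16 saturating byte adds at once
    }
}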
GCC has a very good built-in vectorizer (which kicks in at -O3, or at lower optimization levels with -ftree-vectorize), so it will use SIMD anywhere it can to speed up scalar code (and it will also optimize SIMD code a bit, where possible).
It's pretty easy to confirm that this is indeed what's happening here: just disassemble the output (or have GCC emit annotated asm files, e.g. with gcc -S -fverbose-asm).

Resources