gcc intrinsic vs inline assembly: which is better?

If I want to expose a single machine-specific instruction to the programmer, there are two ways I can do so:
Define a new builtin / intrinsic
Expose the same via inline assembly, asm() [as it's a single arithmetic-type instruction, I believe there is no need for asm volatile()]
I have read that builtins allow the compiler to take care of type checking, register allocation, "other optimizations", etc. But the compiler needs to do this even in the case of asm(), right? So what precisely is the performance benefit of using an intrinsic over asm() for a single instruction?
How does the equation change if there are multiple machine instructions involved ?
The "portability" argument in favor of intrinsic is understandable, but I am curious to understand the performance advantage, if any, of one over the other.

I think it depends a lot on what you're doing. Modifying GCC, and requiring a modified GCC to build your program unless/until your GCC patch makes it upstream, is a lot more of a headache than just using inline asm.
If the instruction you want to use has an abstract meaning not tied to a particular instruction set architecture, adding the builtin/intrinsic, so that the same code automatically works on all targets (with a fallback to a more complex multi-instruction implementation on targets that lack the instruction), is probably the "right" choice, but it might still not be practical.
If the instruction is something very ISA-specific, obscure, not-performance-critical, etc. (I'm thinking of loading a special hardware register, cpu mode register, getting model info, etc. but I'm sure you can think of other examples) then just using inline asm is almost certainly the right solution.
Even if you do think a builtin is the "right" solution for your problem, but need to take the inline asm approach for practical reasons, you can still abstract it with a macro or static inline function in such a way that it's easy to replace all uses with an intrinsic later (or with a fallback implementation on targets without the instruction).
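For example, here is a minimal sketch of that abstraction, using a 32-bit byte swap as a stand-in for the single instruction (my_bswap32 is a made-up name; a real codebase might later replace the body with __builtin_bswap32):

#include <stdint.h>

/* Call sites use my_bswap32() and never see the asm; the body can
   later be swapped for a builtin without touching any callers. */
static inline uint32_t my_bswap32(uint32_t x)
{
#if defined(__x86_64__) || defined(__i386__)
    asm ("bswap %0" : "+r" (x));   /* single-instruction path */
    return x;
#else
    /* Portable multi-instruction fallback with the same semantics. */
    return (x >> 24) | ((x >> 8) & 0x0000FF00u)
         | ((x << 8) & 0x00FF0000u) | (x << 24);
#endif
}

Callers just write my_bswap32(v); whether that compiles to one instruction or the shift/mask fallback is an implementation detail hidden behind the function.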

Related

How to use x86intrin.h

In one of my applications, I need to efficiently de-interleave bits in a long stream of data. Ideally, I would like to use the BMI2 pext_u32() and/or pext_u64() x86_64 intrinsics when available. I scoured the internet for documentation on x86intrin.h (GCC) but couldn't find much on the subject, so I am asking the gurus on StackOverflow to help me out.
Where can I find documentation about how to work with functions in x86intrin.h?
Does gcc's implementation of pext_*() already have fallback code behind it, or do I need to write the fallback code myself (for conditional compilation)?
Is it possible to write a binary that automatically falls back to an alternate implementation if a target does not support the intrinsic? If so, how does one do so?
Is there a known programming pattern that will be recognized by GCC and automatically converted to pext_*() when compiling with optimization enabled and with -mbmi2?
Intel publishes the Intrinsics Guide, which also applies to GCC. You will have to write your own fallback code if you use these intrinsics.
You can achieve automatic switching of implementations by using IFUNC resolvers, but for non-library code, using conditionals or function pointers is probably simpler.
Looking at the gcc/config/i386/i386.md and gcc/config/i386/i386.c files, I don't see anything in GCC 8 which would automatically select the pext instruction without intrinsics in the source code.
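For illustration, a minimal sketch of the function-pointer approach mentioned above (the generic path is the classic Morton-decode bit trick; assumes GCC with __builtin_cpu_supports available):

#include <stdint.h>
#include <x86intrin.h>

/* BMI2 path; the target attribute lets us use _pext_u32 here without
   compiling the whole file with -mbmi2. */
__attribute__((target("bmi2")))
static uint32_t deinterleave_bmi2(uint32_t x)
{
    return _pext_u32(x, 0x55555555u);  /* gather the even-position bits */
}

/* Portable fallback producing the same result. */
static uint32_t deinterleave_generic(uint32_t x)
{
    x &= 0x55555555u;
    x = (x ^ (x >> 1)) & 0x33333333u;
    x = (x ^ (x >> 2)) & 0x0F0F0F0Fu;
    x = (x ^ (x >> 4)) & 0x00FF00FFu;
    x = (x ^ (x >> 8)) & 0x0000FFFFu;
    return x;
}

/* Resolve once at startup, then call through the pointer. */
static uint32_t (*deinterleave)(uint32_t) = deinterleave_generic;

__attribute__((constructor))
static void pick_deinterleave(void)
{
    if (__builtin_cpu_supports("bmi2"))
        deinterleave = deinterleave_bmi2;
}

An IFUNC resolver achieves the same effect without the pointer indirection at call sites, at the cost of being glibc-specific.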
The design philosophy of Intel's intrinsics is that you can only use them in functions that will run only on CPUs with the required extensions. Checking for support around every instruction would add far too much overhead, and then there would have to be a fallback (there isn't one).
Intel intrinsics are not like GNU C __builtin_popcountll (which does use a fallback if compiled without -mpopcnt), but note that you can enable target options on a per-function basis with attributes.
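A small sketch of that per-function mechanism (function names are illustrative):

/* Compiled without -mpopcnt: this call goes to the libgcc fallback. */
int popcount_portable(unsigned long long x)
{
    return __builtin_popcountll(x);
}

/* The target attribute enables popcnt for just this function, so the
   builtin compiles to the single popcnt instruction here. */
__attribute__((target("popcnt")))
int popcount_hw(unsigned long long x)
{
    return __builtin_popcountll(x);
}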

What's the purpose of using assembly language inside a C program?

What's the purpose of using assembly language inside a C program? Compilers are able to generate assembly language already. In what cases would it be better to write assembly than C? Is performance a consideration?
In addition to what everyone said: not all CPU features are exposed to C. Sometimes, especially in driver and operating system programming, one needs to explicitly work with special registers and/or commands that are not otherwise available.
Also vector extensions.
That was especially true before the advent of compiler intrinsics. Those alleviate the need for inline assembly somewhat.
One more use case for inline assembly has to do with interfacing C with reflected languages. Specifically, assembly is all but necessary if you need to call a function when its prototype is not known at compile time. In other words, when the quantity and datatypes of that function's arguments are but runtime variables. C variadic functions and the stdarg machinery won't help you in this case - they would help you parse a stack frame, but not build one. In assembly, on the other hand, it's quite doable.
This is not an OS/driver scenario. There are at least two technologies out there - Java's JNI and COM Automation - where this is a must. In case of Automation, I'm talking about the way the COM runtime is marshaling dual interfaces using their type libraries.
I can think of a very crude C alternative to assembly for that, but it'd be ugly as sin. Slightly less ugly in C++ with templates.
Yet another use case: crash/run-time error reporting. For postmortem debugging, you'd want to capture as much of program state at the point of crash as possible (i. e. all the CPU registers), and assembly is a much better vehicle for that than C. Postmortem debugging of crashing native code usually involves staring at the assembly anyway.
Yet another use case - code that is intended for execution in another process without that process' co-operation or knowledge. This is often referred to as "shellcode", but it doesn't have to be shell related. Code like that needs to be very carefully written, and it can't rely on the conveniences of a high level language (like the run time library, or having a data section) that are normally taken for granted. When one is after injecting a significant piece of functionality into a target process, they usually end up loading a dynamic library, but the initial trampoline code that loads the library and passes control to it tends to be in assembly.
I've been only covering cases where assembly is necessary. Hand-optimizing for performance is covered in other answers.
There are a few, although not many, cases where hand-optimized assembly language can be made to run more efficiently than assembly language generated by C compilers from C source code. Also, for developers used to assembly language, some things can just seem easier to write in assembler.
For these cases, many C compilers allow inline assembly.
However, this is becoming increasingly rare as C compilers get better and better at producing efficient code, and most platforms restrict the kind of low-level software that most often benefits from being written in assembler.
In general, it is performance, but performance of a very specific kind. For example, the SIMD parallel instructions of a processor might not be generated by the compiler. By utilizing processor-specific data formats and then issuing processor-specific parallel instructions (e.g. ARM NEON or Intel SSE), you can get very fast performance on graphics or signal-processing problems. Even then, some compilers allow these to be expressed in C using intrinsic functions, as in the sketch below.
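A minimal sketch of that intrinsic-function style (SSE, with an illustrative function name; leftover elements are handled by a scalar tail):

#include <stddef.h>
#include <xmmintrin.h>  /* SSE */

/* dst[i] = a[i] + b[i], four floats per SSE instruction. */
void add_arrays(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
    for (; i < n; i++)   /* scalar tail for the remaining elements */
        dst[i] = a[i] + b[i];
}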
While it used to be common to use assembly language inserts to hand-optimize critical functions, those days are largely done. Modern compilers are very good and modern processors have very complicated timing requirements so hand optimized code is often less optimal than expected.
There are various reasons to write inline assembly in C. We can roughly categorize the reasons into necessary and unnecessary.
The unnecessary reasons might be:
platform compatibility
performance concerns
code optimization
etc.
I consider the above unnecessary because they can sometimes be discarded or implemented in pure C. For platform compatibility, for example, you could implement a separate version for each platform, though inline assembly might reduce the effort. We are not going to dwell on the unnecessary reasons here.
The necessary reasons might be:
something the standard libraries are insufficient to do
an instruction set not supported by the compiler
object code generated incorrectly
writing stack-sensitive code
etc.
These reasons are considered necessary because they are all but impossible in pure C. For example, in old versions of DOS, software interrupt INT 21h was not reentrant. If you wanted to write a virtual driver that made full use of INT 21h, it was impossible to do with compiler support alone: you needed to hook the original INT 21h handler and make it reentrant. However, compiled code wraps every call with a prologue/epilogue, so you could never get around such restrictions without crashing the code. You can try all kinds of tricks in the pure C language with libraries, but even if you found one that worked, it would only mean you had discovered a particular order in which the compiler generates machine code; in other words, you would be trying to make the compiler compile your code to exact machine code. So why not just write the inline assembly directly?
This example covers all of the necessary reasons above except the unsupported instruction set, but that one is easy enough to imagine.
In fact, there are more reasons to write inline assembly, but by now you have some idea of them.
Just as a curiosity, I'm adding here a concrete example of something not-so-low-level that you can only do in assembly. I read this in an assembly book from my university days, where it was used to show an inherent limitation of C/C++ and how to overcome it with assembly.
The problem is: how do I invoke a function when the exact number of parameters is only known at runtime? In fact, in C/C++ you can easily define a function that takes a variable number of arguments, like printf. But when it comes to calling that function, the compiler wants to know exactly how many parameters will be passed. You may pass more parameters than required; that won't do any harm. But what if the number grows unexpectedly to 100 or 1000 parameters that must be picked out of an array?
The solution of course is using assembly, where you can dynamically create a stack frame of the proper size, copy the parameters on the stack, invoke the function, and finally reset the stack.
In practice, this would hardly ever be a limitation (except if the library you're using is really, really badly designed). People who use assembly in C have much better reasons to do so, as others have pointed out in their answers. Still, I think it may be an interesting fact to know.
I would rather think of it as a way to write very platform-specific code; optimization, though still common, is practiced less nowadays. Knowledge and use of assembly in C is also practiced by hats of all colors.

What does #pragma intrinsic mean?

I just want to know what #pragma intrinsic(_m_prefetchw) means.
As far as I am aware, that looks like someone was intending to modify some MSVC++ specific setting. However, that setting is not a valid option for the intrinsic pragma. _m_prefetchw on the other hand is a 3D Now! intrinsic function.
Like all compiler intrinsic functions, it exposes (possibly) faster assembly instructions supported by the underlying hardware to your C or C++ application in a manner
A. more consistent with optimizers, and
B. more consistent with the language, when compared with using inline assembly.
On MSVC on x86_64/x64/amd64 systems, inline assembly is not supported, so one must use such intrinsics to access whizzbang features of the underlying hardware.
Finally, it should be noted that _m_prefetchw is a 3D Now! intrinsic, and 3D Now! is only supported on AMD hardware. It's probably not something you want to use for new code (i.e. you should use SSE instead, which works on both Intel and AMD hardware, and has more features to boot).
The meaning of "#pragma intrinsic" (note spelling), as with all "#pragma" directives, varies from one compiler to another. Generally, it indicates that a particular thing that looks syntactically like a call to an external function should be replaced with some inline code. In some cases, this may greatly improve performance, especially if the compiler can determine constant values for some or all of the arguments (in the latter situation, the compiler may be able to compute the value of the function and replace it with a constant).
Generally, having functions processed as intrinsics won't pose any particular problem. The biggest danger is that if a user defines, in one module, a function with the same name as one of the compiler's intrinsic functions and attempts to call that function from another module, the compiler might instead replace the function call with its expected instruction sequence. To prevent this, some compilers don't enable intrinsic functions by default (since doing so would cause the above incompatibility with some standard-conforming programs) but provide #pragma directives to enable them. Compilers may also use command-line options to enable intrinsics (since the standard allows anything there), or may define some functions like __memcpy() as intrinsic and, within string.h, use a #define directive to convert memcpy into __memcpy (since programs that #include string.h are not allowed to use memcpy for any other purpose).
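For instance, a minimal MSVC-style sketch of the pragma in action (assumes the MSVC toolchain; the function name is illustrative):

#include <string.h>

/* Ask MSVC to expand these library calls inline instead of emitting
   calls into the C runtime. */
#pragma intrinsic(memcpy, memset)

void clear_buffer(char *buf, size_t n)
{
    memset(buf, 0, n);  /* may be expanded to an inline instruction sequence */
}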
In C, it depends on whether the implementation recognizes (and defines) it.
If the implementation does not recognize the "intrinsic" preprocessing token, the pragma is ignored.
If the implementation recognizes it, whatever is defined will happen (and if another implementation defines it differently, a different thing happens on the other implementation).
So, check the documentation for the implementation you're talking about (edit: and don't use it if you expect to compile your source on different implementations).
I couldn't find any reference to "#pragma intrinsic" in man gcc, on my system.
The intrinsic pragma tells the compiler that a function has known behavior. The compiler may call the function and not replace the function call with inline instructions, if it will result in better performance.
Source: http://msdn.microsoft.com/en-us/library/tzkfha43(VS.80).aspx

inline a function inside another inline function in C

I currently have inline functions calling another inline function (a simple 4-line getAbs() function). However, I discovered by looking at the assembler code that the "big" inline functions are properly inlined, but the compiler uses a bl jump to call the getAbs() function.
Is it not possible to inline a function in another inline function? By the way, this is embedded code, we are not using the standard libraries.
Edit : The compiler is WindRiver, and I already checked that inlining would be beneficial (4 instructions instead of +-40).
Depending on what compiler you are using you may be able to encourage the compiler to be less reluctant to inline, e.g. with gcc you can use __attribute__ ((always_inline)), with Intel ICC you can use icc -inline-level=1 -inline-forceinline, and with Apple's gcc you can use gcc -obey-inline.
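For instance, with GCC the question's getAbs() helper could be forced inline like this (the body is a guess, since the question didn't show it):

/* always_inline makes GCC inline this even when its heuristics
   would otherwise decline. */
static inline int getAbs(int x) __attribute__((always_inline));
static inline int getAbs(int x)
{
    return x < 0 ? -x : x;
}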
The inline keyword is a suggestion to the compiler, nothing more. It's free to take that suggestion on board, totally ignore it, or even lie to you and tell you that it's inlining while it really isn't.
The only way to force code to be inline is to, well, write it inline. But even then, the compiler may decide it knows better and shift it out to another function. It has a lot of leeway in generating executable code for your particular source, provided it doesn't change the semantics of it.
Modern compilers are more than capable of generating better code than most developers would hand-craft in assembly. I think the inline keyword should go the same path as the register keyword.
If you've seen the output of gcc at its insane optimisation level, you'll understand why. It has produced code that I wouldn't have dreamed possible, and that took me a long time to understand.
As an aside, check this out to see what optimisations gcc actually has, including a great many containing the text "inline" or "inlining".
@gramm: There are quite a few scenarios in which inline isn't necessarily to your benefit. Most compilers use some very advanced heuristics to determine when to inline. When discussing inlining, the simplest advice is: trust your compiler to produce the fastest code.
I recently had a very similar problem, and reading this post gave me a wacky idea: why not have a simple pre-compilation code parser (a simple regex should do the job) that parses out the function call and pastes the source code inline? Use tags such as /*inline*/ ... /*end_of_inline*/ so that you can keep normal IDE features (if you use, or might use, an IDE).
Include this in your build process; that way you keep the readability advantage while removing the compiler's assumption that you are only as good a developer as most and don't understand when to inline.
Nonetheless, before trying this you should probably go through the compiler's command-line options.
I would suggest that if your getAbs() function (sounds like absolute value but you really should be showing us code with the question...) is 4 lines long, then you have much bigger optimizations to worry about than whether the code gets inlined or not.

How to get `gcc` to generate `bts` instruction for x86-64 from standard C?

Inspired by a recent question, I'd like to know if anyone knows how to get gcc to generate the x86-64 bts instruction (bit test and set) on the Linux x86-64 platforms, without resorting to inline assembly or to nonstandard compiler intrinsics.
Related questions:
Why doesn't gcc do this for a simple |= operation where the right-hand side has exactly 1 bit set?
How to get bts using compiler intrinsics or the asm directive
Portability is more important to me than bts, so I won't use an asm directive, and if there's another solution, I prefer not to use compiler intrinsics.
EDIT: The C source language does not support atomic operations, so I'm not particularly interested in getting atomic test-and-set (even though that's the original reason for test-and-set to exist in the first place). If I want something atomic I know I have no chance of doing it with standard C source: it has to be an intrinsic, a library function, or inline assembly. (I have implemented atomic operations in compilers that support multiple threads.)
As the first answer to the first linked question says, consider how much it matters in the grand scheme of things. The only places where you test bits are:
Low-level drivers. However, if you are writing one you probably know assembly; the code is sufficiently tied to the system anyway, and most of the delay is in I/O.
Testing for flags. That usually happens either at initialisation (one time only, at the beginning) or around some shared computation (which takes much more time).
The overall impact on the performance of applications and macrobenchmarks is likely to be minimal even if microbenchmarks show an improvement.
On the Edit part: using bts alone does not guarantee atomicity of the operation. All it guarantees is that the operation is atomic on the current core (as is an or done on memory). On multi-processor systems (uncommon) or multi-core systems (very common), you still have to synchronize with the other processors.
As synchronization is much more expensive, I believe the difference between:
asm("lock bts %0, %1" : "+m" (*array) : "r" (bit));
and
asm("lock or %0, %1" : "+m" (*array) : "r" (1 << bit));
is minimal. And the second form:
Can set several flags at once
Has the nice __sync_fetch_and_or (array, 1 << bit) form (which works on gcc and the Intel compiler, as far as I remember); see the sketch below.
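A minimal sketch of that second form (assuming a bit string stored as an array of 32-bit words; the function name is illustrative):

/* Atomically set bit `bit` in a word array using the GCC builtin;
   on x86 this compiles to a lock-prefixed read-modify-write. */
static void set_bit_atomic(unsigned *array, unsigned bit)
{
    __sync_fetch_and_or(&array[bit / 32], 1u << (bit % 32));
}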
I use the gcc atomic builtins such as __sync_lock_test_and_set( http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html ). Changing the -march flag will directly affect what is generated. I'm using it with i686 right now, but http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options shows all the possibilities.
I realize it's not exactly what you are asking for, but I found those two web pages very useful when I was looking for mechanisms like that.
I believe (but am not certain) that neither the C++ nor the C standard has any mechanism for these types of synchronization yet. Support for higher-level synchronization mechanisms is in various states of standardization, but I don't think even those would give you access to the kind of primitive you're after.
Are you programming lock-free datastructures where locks are insufficient?
You probably want to just go ahead and use gcc's non-standard extensions and/or operating system or library provided synchronization primitives. I would bet there's a library that might provide the type of portability you're looking for if you're concerned about using compiler intrinsics. (Though really, I think most people just bite the bullet and use gcc-specific code when they need it. Not ideal, but the standards haven't really been keeping up.)
