What is the difference between the __sync and __atomic intrinsics of gcc - c

I'm writing a toy operating system (so I cannot use any library, including the standard one), compiled with gcc, and I want to use atomics for some of the synchronization code. After some searching, I found that gcc has two sets of builtins for atomic operations, __sync_* and __atomic_*, but I could not find any information on the difference between the two.
What is the difference between these two besides the latter has a parameter for memory ordering? Is the __sync_ version equivalent to __atomic_ version with the sequential ordering? Is the __sync_ version deprecated in favor of the __atomic_ one?

Disclaimer: I have not used these primitives before. The following answer is based on my reading of the documentation and previous experience with concurrency.
Is the __sync_ version deprecated in favor of the __atomic_ one?
Yes, you should use __atomic and let the compiler fall back to __sync when necessary.
Is the __sync_ version equivalent to __atomic_ version with the sequential ordering?
No, the exact ordering guarantees are specified in the documentation for __sync. If you use __atomic, and the compiler chooses to fall back to __sync, then it will add code to meet the requested ordering guarantees.
From the documentation for __atomic:
Target architectures are encouraged to provide their own patterns for each of these built-in functions. If no target is provided, the original non-memory model set of ‘__sync’ atomic built-in functions are utilized, along with any required synchronization fences surrounding it in order to achieve the proper behavior. Execution in this case is subject to the same restrictions as those built-in functions.
A final word of caution: not all the __sync or __atomic operations can be implemented inline. The compiler may implement them as a call to an external function that is (presumably) implemented in the standard library. If you don't have access to the standard library, then you'll have to implement the missing functions yourself. Here is the relevant quote from the documentation:
If there is no pattern or mechanism to provide a lock free instruction sequence, a call is made to an external routine with the same parameters to be resolved at run time.
These primitives are a low-level mechanism, and you should understand what the compiler can and cannot do.
For an example of what code the compiler generates inline, see the related question: Atomic operations and code generation for gcc
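To make the difference in the two interfaces concrete, here is a minimal sketch that puts them side by side. The helper names (counter_add_sync, counter_add_atomic, counter_add_relaxed) are made up for this example, and it assumes a GCC recent enough to provide the __atomic builtins (4.7 or later) on a target that can expand them inline.

```c
/* Hypothetical helpers for illustration; none of these names come from GCC. */

/* Legacy builtin: takes no memory-order argument. The GCC documentation
 * describes these as a full barrier in most cases. */
static inline int counter_add_sync(int *p, int n)
{
    return __sync_fetch_and_add(p, n);   /* returns the value before the add */
}

/* Newer builtin with the strongest ordering requested explicitly. */
static inline int counter_add_atomic(int *p, int n)
{
    return __atomic_fetch_add(p, n, __ATOMIC_SEQ_CST);
}

/* Same operation, but only atomicity is requested, no ordering. */
static inline int counter_add_relaxed(int *p, int n)
{
    return __atomic_fetch_add(p, n, __ATOMIC_RELAXED);
}
```

None of these need a header, so they remain usable in a freestanding build as long as the compiler can expand them inline for the target.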

Related

Using GCC __sync extensions for a portable C library

I am developing a C library on OS X (10.10.x which happens to ship with GCC 4.2.x). This library is intended to be maximally portable and not specific to OS X.
I would like the end users to have the least headaches in building from source. So while the project is coded to std=c11 to get some of the benefits of the most modern C, it seems that optional features such as atomics are not supported by this version of GCC.
I am assuming GNU-Linux and various BSD end users to have either (a) a later version of GCC, or (b) the chops to install the latest and greatest.
Is it a good decision to rely on the __sync extensions of GCC for the required CAS (etc.) semantics?
I think you need to take a step back and first define all your use cases. The merits of __sync vs C11 atomics aside, better to define your needs first (i.e. __sync/atomics are solutions not needs).
The Linux kernel is one of the heaviest, most sophisticated users of locking, atomics, etc. and C11 atomics aren't powerful enough for it. See https://lwn.net/Articles/586838/
For example, you might be far better off wrapping things in pthread_mutex_lock / pthread_mutex_unlock pairs. Declaring a struct as C11 atomic does not guarantee atomic access to the whole struct, only parts of it. So, if you needed the following to be atomic:
glob.x = 5;
glob.y = 7;
glob.z = 9;
You would be better wrapping this in the pthread_mutex_* pairing. For comparison, inside the Linux kernel, this would be spin locks or RCU. In fact, you might use RCU as well. Note that doing:
CAS(glob.x,5)
CAS(glob.y,7)
CAS(glob.z,9)
is not the same as the mutex pairing if you want an all or nothing update.
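A minimal sketch of the all-or-nothing version, assuming POSIX threads are available. The struct layout and the names point3, glob_lock, glob_set and glob_get are invented for this example. A reader that also takes the lock can never observe a half-updated glob.

```c
#include <pthread.h>

/* Hypothetical layout for the glob object discussed above. */
struct point3 { int x, y, z; };

static struct point3 glob;
static pthread_mutex_t glob_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer: all three fields change inside one critical section. */
static void glob_set(int x, int y, int z)
{
    pthread_mutex_lock(&glob_lock);
    glob.x = x;
    glob.y = y;
    glob.z = z;
    pthread_mutex_unlock(&glob_lock);
}

/* Reader: copies a consistent snapshot under the same lock. */
static struct point3 glob_get(void)
{
    struct point3 copy;
    pthread_mutex_lock(&glob_lock);
    copy = glob;
    pthread_mutex_unlock(&glob_lock);
    return copy;
}
```

With three separate CAS operations, a reader could run between the first and the second and see the new x alongside the old y and z.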
I'd wrap your implementation in some thin layer. For example, the best way might be __sync on one arch [say BSD] and atomics on another. By abstracting this into a .h file with macros/inlines, you can write "common code" without lots of #ifdef's everywhere.
I wrote a ring queue struct/object. Its updater could use CAS [I wrote my own inline asm for this], pthread_mutex_*, kernel spin locks, etc. Actual choice of which was controlled by one or two #ifdef's inside my_ring_queue.h
Another advantage to abstraction: You can change your mind farther down the road. Suppose you did an early pick of __sync or atomics. You code this up in 200 places in 30 files. Then, comes the "big oops" where you realize this was the wrong choice. Lots of editing ensues. So, never put a naked [say] __sync_val_compare_and_swap in any of your .c files. Put it in once in my_atomics.h as something like #define MY_CAS_VAL(...) __sync_val_compare_and_swap(...) and use MY_CAS_VAL
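A rough sketch of what such a my_atomics.h could look like. The helper my_cas_val_long is a made-up name, the wrapper is fixed to long for brevity, and the check on __ATOMIC_SEQ_CST assumes a GCC-compatible compiler, which predefines that macro when the __atomic builtins exist.

```c
/* my_atomics.h (sketch): the only file that names a compiler builtin. */
#ifndef MY_ATOMICS_H
#define MY_ATOMICS_H

/* Returns the value *ptr held before the operation, like
 * __sync_val_compare_and_swap does. */
static inline long my_cas_val_long(long *ptr, long expected, long desired)
{
#if defined(__ATOMIC_SEQ_CST)
    /* On failure the builtin stores the current value into 'expected';
     * on success 'expected' already equals the old value. */
    __atomic_compare_exchange_n(ptr, &expected, desired, 0,
                                __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    return expected;
#else
    return __sync_val_compare_and_swap(ptr, expected, desired);
#endif
}

#define MY_CAS_VAL(ptr, expected, desired) \
    my_cas_val_long((ptr), (expected), (desired))

#endif /* MY_ATOMICS_H */
```

Callers only ever write MY_CAS_VAL(&v, 5, 7), so switching the underlying primitive later means editing this one header.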
You might also be able to reduce the number of places that need interthread locking by using thread local storage for certain things like subpool allocs/frees.
You may also want to use a mixture of CAS and lock pairings. Some specific uses fare better with low-level CAS and others would be more efficient with mutex pairs. Again, it helps if you can define your needs first.
Also, consider the final disaster scenario: The compiler doesn't support atomics and __sync is not available [or does not work] for the arch you're compiling to. What then?
In that case, note that all __sync operations can be implemented using pthread_mutex pairings. That's your disaster fallback.
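As a sketch of that fallback, here is a compare-and-swap built on a single global mutex. The names fallback_lock and fallback_cas_val are hypothetical, and the function is only atomic with respect to other code that goes through the same lock.

```c
#include <pthread.h>

/* One global lock guards every emulated atomic operation. */
static pthread_mutex_t fallback_lock = PTHREAD_MUTEX_INITIALIZER;

/* Emulates the semantics of __sync_val_compare_and_swap for long:
 * stores 'desired' only if *ptr equals 'expected', and returns the
 * value *ptr held before the call. */
static long fallback_cas_val(long *ptr, long expected, long desired)
{
    long old;

    pthread_mutex_lock(&fallback_lock);
    old = *ptr;
    if (old == expected)
        *ptr = desired;
    pthread_mutex_unlock(&fallback_lock);

    return old;
}
```

A single lock is the simplest choice; a table of locks indexed by address would reduce contention at the cost of more code.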

How to create atomic section in c [duplicate]

Are there functions for performing atomic operations (like increment / decrement of an integer) etc supported by C Run time library or any other utility libraries?
If yes, what all operations can be made atomic using such functions?
Will it be more beneficial to use such functions than the normal synchronization primitives like mutex etc?
OS : Windows, Linux, Solaris & VxWorks
Prior to C11
The C library doesn't have any.
On Linux, gcc provides some -- look for __sync_fetch_and_add, __sync_fetch_and_sub, and so on.
In the case of Windows, look for InterlockedIncrement, InterlockedDecrement, InterlockedExchange, and so on. If you use gcc on Windows, I'd guess it also has the same built-ins as it does on Linux (though I haven't verified that).
On Solaris, it'll depend. Presumably if you use gcc, it'll probably (again) have the same built-ins it does under Linux. Otherwise, there are libraries floating around, but nothing really standardized.
C11
C11 added a (reasonably) complete set of atomic operations and atomic types. The operations include things like atomic_fetch_add and atomic_fetch_sub (and *_explicit versions of same that let you specify the ordering model you need, where the default ones always use memory_order_seq_cst). There are also fence functions, such as atomic_thread_fence and atomic_signal_fence.
The types correspond to each of the normal integer types--for example, atomic_int_least8_t corresponding to int_least8_t and atomic_uint_least64_t corresponding to uint_least64_t. Those are typedef names defined in <stdatomic.h>. To avoid conflicts with any existing names, you can omit the header and use the _Atomic keyword, which is in the implementor's namespace, directly (e.g., _Atomic uint_least32_t instead of atomic_uint_least32_t).
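A small sketch of the C11 interface, assuming a compiler and library that ship <stdatomic.h>. The counter and the function names hit, unhit_relaxed and current are invented for this example.

```c
#include <stdatomic.h>

/* Static storage duration, so the counter starts at zero. */
static atomic_int hits;

/* Default form: behaves as if memory_order_seq_cst were passed. */
static int hit(void)
{
    return atomic_fetch_add(&hits, 1);   /* returns the previous value */
}

/* _explicit form: the caller chooses the ordering. */
static int unhit_relaxed(void)
{
    return atomic_fetch_sub_explicit(&hits, 1, memory_order_relaxed);
}

static int current(void)
{
    return atomic_load(&hits);
}
```

Both fetch operations return the value held before the update, which is the same convention the GCC fetch-and-op builtins follow.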
'Beneficial' is situational, and performance always depends on circumstances. You may expect something wonderful to happen when you switch out a mutex for something like this, but you may get no benefit (if the code path is not exercised that often) or make things worse (if you accidentally create a 'spin-lock').
Across all supported platforms, you can use GLib's atomic operations. On platforms which have atomic operations built-in (e.g. assembly instructions), glib will use them. On other platforms, it will fall back to using mutexes.
I think that atomic operations can give you a speed boost, even if mutexes are implemented using them. With the mutex, you will have at least two atomic ops (lock & unlock), plus the actual operation. If the atomic op is available, it's a single operation.
Not sure what you mean by the C runtime library. Neither the language proper nor the standard library provides you with any means to do this. You'd need to use an OS-specific library/API. Also, don't be fooled by sig_atomic_t: it is not what it seems at first glance and is useful only in the context of signal handlers.
On Windows, there are InterlockedExchange and the like. For Linux, you can take glibc's atomic macros - they're portable (see i486 atomic.h). I don't know a solution for the other operating systems.
In general, you can use the xchg instruction on x86 for atomic operations (works on Dual Core CPUs, too).
As to your second question, no, I don't think that using atomic operations will be faster than using mutexes. For instance, the pthreads library already implements mutexes with atomic operations, which is very fast.

Is there a way to test whether thread safe functions are available in the C standard library?

In regards to the thread safe functions in newer versions of the C standard library, is there a cross-platform way to tell if these are available via pre-processor definition? I am referring to functions such as localtime_r().
If there is not a standard way, what is the reliable way in GCC? [EDIT] Or posix systems with unistd.h?
There is no standard way to test that, which means there is no way to test it across all platforms. Tools like autoconf will create a tiny C program that calls this function and then try to compile and link it. If this works, the function apparently exists; if not, it may not exist (or the compiler options are wrong and the appropriate CFLAGS need to be set).
So you have basically 6 options:
Require them to exist. Your code can only work on platforms where they exist; period. If they don't exist, compilation will fail, but that is not your problem, since the platform violates your minimum requirements.
Avoid using them. If you use the non-thread safe ones, maybe protected by a global lock (e.g. a mutex), it doesn't matter if they exist or not. Of course your code will then only work on platforms with POSIX mutexes, however, if a platform has no POSIX mutexes, it won't have POSIX threads either and if it has no POSIX threads (and I guess you are probably using POSIX threads w/o supporting any alternative), why would you have to worry about thread-safety in the first place?
Decide at runtime. Depending on the platform, either do a "weak link", so you can test at runtime if the function was found or not (a pointer to the function will point to NULL if it wasn't) or alternatively resolve the symbol dynamically using something like dlsym() (which is also not really portable, but widely supported in the Linux/UNIX world). However, in that case you need a fallback if the function is not found at runtime.
Use a tool like autoconf, some other tool with similar functionality, or your own configuration script to determine this prior to start of compilation (and maybe set preprocessor macros depending on result). In that case you will also need a fallback solution.
Limit usage to well known platforms. Whether this function is available on a certain platform is usually known (and once it is available, it won't go away in the future). Most platforms expose preprocessor macros to test what kind of platform that is and sometimes even which version. E.g. if you know that GNU/Linux, Android, Free/Open/NetBSD, Solaris, iOS and MacOS X all offer this function, test if you are compiling for one of these platforms and if yes, use it. If the code is compiled for another platform (or if you cannot determine what platform that is), it may or may not offer this function, but since you cannot say for sure, better be safe and use the fallback.
Let the user decide. Either always use the fallback, unless the user has signaled support or do it the other way round (which makes probably more sense), always assume it is there and in case compilation fails, offer a way the user can force your code into "compatibility mode", by somehow specifying that thread-safe-functions are not available (e.g. by setting an environment variable or by using a different make target). Of course this is the least convenient method for the (poor) user.
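Several of the options above end in the same place: a preprocessor macro selects either the thread-safe function or a fallback. A minimal sketch, assuming the build system (or the user) defines HAVE_LOCALTIME_R when localtime_r() is available. That macro name follows the autoconf convention, while my_localtime and localtime_lock are names invented here.

```c
#include <time.h>

#ifdef HAVE_LOCALTIME_R

/* The platform provides the re-entrant variant: just forward to it. */
static struct tm *my_localtime(const time_t *t, struct tm *out)
{
    return localtime_r(t, out);
}

#else

#include <pthread.h>

/* Fallback: serialize calls to localtime() and copy the result out of
 * its shared static buffer while still holding the lock. */
static pthread_mutex_t localtime_lock = PTHREAD_MUTEX_INITIALIZER;

static struct tm *my_localtime(const time_t *t, struct tm *out)
{
    struct tm *r;

    pthread_mutex_lock(&localtime_lock);
    r = localtime(t);
    if (r != NULL)
        *out = *r;
    pthread_mutex_unlock(&localtime_lock);

    return r != NULL ? out : NULL;
}

#endif
```

The fallback is only safe if every caller in the program goes through my_localtime; a direct call to localtime() elsewhere would bypass the lock.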

Cool GCC built-ins

I've heard of a lot of cool GCC extensions and built-in functions over the years, but I always wind up forgetting about them before thinking of using them.
What are some cool GCC extensions and built-ins, and some real-life examples of how to put them to use?
GCC provides many features as compiler extensions. Off the top of my head, the ones I use most frequently are:
Statement Expressions
Designated Initializers
There are many more documented on the GCC website here.
Caveat:
However, using any form of compiler extensions renders your code non-portable across other compilers so do use them at that risk.
If you want real-life examples of how useful gcc extensions can be then GCC hacks in the Linux kernel is an interesting choice since if it is being used in the Linux kernel then it is probably a good indication it has some real-world impact. As noted before, using extensions does make your code non-portable but clang does make an effort to support gcc extensions which may mitigate some of the impact.
One extension that is not covered but is used a lot in the Linux kernel is statement expressions; also see Are compound statements (blocks) surrounded by parens expressions in ANSI C?.
The article covers the following features:
Type discovery using typeof
Range extension which includes both Case Ranges and Designated Initializers
Zero-length arrays are flexible array members but with some additions
Determining call address using __builtin_return_address
Constant detection using __builtin_constant_p
Function Attributes
Branch prediction hints using __builtin_expect
Pre-fetching using __builtin_prefetch
Variable attributes
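A few of these extensions fit into one short sketch. It requires GCC or a compatible compiler such as clang, and the names MAX_ONCE, classify and checked_div are invented for this example. The likely/unlikely macros are written in the style the Linux kernel uses.

```c
/* Statement expression + __typeof__: each argument is evaluated once,
 * so MAX_ONCE(i++, 0) increments i a single time. */
#define MAX_ONCE(a, b) ({        \
    __typeof__(a) a_ = (a);      \
    __typeof__(b) b_ = (b);      \
    a_ > b_ ? a_ : b_;           \
})

/* Branch prediction hints. */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Case ranges: 1 for ASCII digits, 2 for lowercase letters, else 0. */
static int classify(int c)
{
    switch (c) {
    case '0' ... '9': return 1;
    case 'a' ... 'z': return 2;
    default:          return 0;
    }
}

/* The zero-divisor path is marked as the unexpected one. */
static int checked_div(int num, int den)
{
    if (unlikely(den == 0))
        return 0;
    return num / den;
}
```

The hints affect only how the compiler lays out the code, not its result, so checked_div behaves identically with or without them.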
I recently stumbled over quite a lot of them that are really helpful to emulate the new C11 standard. Actually many of the new features are already there, but with different syntax.
alignment attributes
thread local variables
noreturn attribute to functions
atomic operations (through their __sync_... builtins)
type generic programming
I've written some of that and how to use that with the C11 interfaces in my blog.
Two features that are not covered in functionality by C11 that are really nice, and that I'd very much like to see in future versions of the standard
statement expressions (already mentioned by Als)
__typeof__

What does #pragma intrinsic mean?

Just want to know what does #pragma intrinsic(_m_prefetchw) mean ?
As far as I am aware, that looks like someone was intending to modify some MSVC++ specific setting. However, that setting is not a valid option for the intrinsic pragma. _m_prefetchw on the other hand is a 3D Now! intrinsic function.
Like all compiler intrinsic functions, it exposes (possibly) faster assembly instructions supported by the underlying hardware to your C or C++ application in a manner
A. more consistent with optimizers, and
B. more consistent with the language, when compared with using inline assembly.
On MSVC on x86_64/x64/amd64 systems, inline assembly is not supported, so one must use such intrinsics to access whizzbang features of the underlying hardware.
Finally, it should be noted that _m_prefetchw is a 3D Now! intrinsic, and 3D Now! is only supported on AMD hardware. It's probably not something you want to use for new code (i.e. you should use SSE instead, which works on both Intel and AMD hardware, and has more features to boot).
The meaning of "#pragma intrinsic" (note spelling), as with all "#pragma" directives, varies from one compiler to another. Generally, it indicates that a particular thing that looks syntactically like a call to an external function should be replaced with some inline code. In some cases, this may greatly improve performance, especially if the compiler can determine constant values for some or all of the arguments (in the latter situation, the compiler may be able to compute the value of the function and replace it with a constant).
Generally, having functions processed as intrinsic won't pose any particular problem. The biggest danger is that if a user defines in one module a function with the same name as one of the compiler's intrinsic functions, and attempts to call that function from another module, the compiler might instead replace the function call with its expected instruction sequence. To prevent this, some compilers don't enable intrinsic functions by default (since doing so would cause the above incompatibility with some standard-conforming programs) but provide #pragma directives to enable them. Compilers may also use command-line options to enable intrinsics (since the standard allows anything there), or may define some functions like __memcpy() as intrinsic, and within string.h, use a #define directive to convert memcpy into __memcpy (since programs that #include string.h are not allowed to use memcpy for any other purpose).
In C, it depends on whether the implementation recognizes (and defines) it.
If the implementation does not recognize the "intrinsic" preprocessing token, the pragma is ignored.
If the implementation recognizes it, whatever is defined will happen (and if another implementation defines it differently, a different thing happens on the other implementation).
So, check the documentation for the implementation you're talking about (edit: and don't use it if you expect to compile your source on different implementations).
I couldn't find any reference to "#pragma intrinsic" in man gcc, on my system.
The intrinsic pragma tells the compiler that a function has known behavior. The compiler may call the function and not replace the function call with inline instructions, if it will result in better performance.
Source: http://msdn.microsoft.com/en-us/library/tzkfha43(VS.80).aspx
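For illustration, a small sketch of how the pragma is typically used with MSVC, guarded so that other compilers never see it. It assumes memcpy and strlen are among the functions MSVC offers in intrinsic form, and copy_greeting is a name invented for this example.

```c
#include <string.h>

#ifdef _MSC_VER
/* MSVC-specific: ask for the intrinsic (inline) forms of these functions.
 * Other compilers skip this block entirely. */
#pragma intrinsic(memcpy, strlen)
#endif

/* Copies a fixed greeting into dst and returns its length,
 * or 0 if the buffer is too small. */
static size_t copy_greeting(char *dst, size_t cap)
{
    static const char msg[] = "hello";

    if (cap < sizeof msg)
        return 0;
    memcpy(dst, msg, sizeof msg);   /* includes the terminating '\0' */
    return strlen(dst);
}
```

The code behaves the same with or without the pragma; only the generated instructions may differ.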
