About atomicity guarantee in C - c

On x86 machines, instructions like inc, addl are not atomic and under SMP environment it is not safe to use them without lock prefix. But under UP environment it is safe since inc, addl and other simple instructions won't be interrupted.
My problem is that, given a C-level statement like
x = x + 1;
Is there any guarantees that C compiler will always use UP-safe instructions like
incl %eax
but not those thread-unsafe instructions(like implementing the C statement in several instructions which may be interrupted by a context switch) even in a UP environment?

No.
You can use "volatile", which prevents the compiler from holding x in a temporary register, and for most targets this will actually have the intended effect. But it isn't guaranteed.
To be on the safe side you should either use some inline asm, or if you need to stay portable, encapsulate the increment with mutexes.

There is absolutely no guarantee that "x - x + 1" will compile to interrupt-safe instructions on any platform, including x86. It may well be that it is safe for a specific compiler and specific processor architecture but it's not mandated in the standards at all and the standard is the only guarantee you get.
You can't consider anything to be safe based on what you think it will compile down to. Even if a specific compiler/architecture states that it is, relying on it is very bad since it reduces portability. Other compilers, architectures or even later versions on the same compiler and architecture can break your code quite easily.
It's quite feasible that x = x + 1 could compile to an arbitrary sequence such as:
load r0,[x] ; load memory into reg 0
incr r0 ; increment reg 0
stor [x],r0 ; store reg 0 back to memory
on a CPU that has no memory-increment instructions. Or it may be smart and compile it into:
lock ; disable task switching (interrupts)
load r0,[x] ; load memory into reg 0
incr r0 ; increment reg 0
stor [x],r0 ; store reg 0 back to memory
unlock ; enable task switching (interrupts)
where lock disables and unlock enables interrupts. But, even then, this may not be thread-safe in an architecture that has more than one of these CPUs sharing memory (the lock may only disable interrupts for one CPU), as you've already stated.
The language itself (or libraries for it, if it's not built into the language) will provide thread-safe constructs and you should use those rather than depend on your understanding (or possibly misunderstanding) of what machine code will be generated.
Things like Java synchronized and pthread_mutex_lock() (available to C under some OS') are what you want to look into.

If you use GLib, they have macros for int and pointer atomic operations.
http://library.gnome.org/devel/glib/stable/glib-Atomic-Operations.html

In recent versions of GCC there are __sync_xxx intrinsics to do exactly what you want.
Instead of writing:
x += 1;
write this:
__sync_fetch_and_add(&x, 1);
And gcc will make sure this will be compiled into an atomic opcode. This is supported on most important archs now.
http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html
It stems originally from the recommendations from Intel for C on ia64, but now found it ways to gcc on a lot of other archs, too. So it's even a bit portable.

Is there any guarantees that C compiler will always use UP-safe instructions
Not in the C standard. But your compiler/standard library may provide you with special types or certain guarantees.
This gcc doc may be along the lines of what you need.

I believe you need to resort to SMP targeted libraries or else roll your own inline assembler code.

A C compiler may implement a statement like x = x + 1 in several instructions.
You can use the register keyword to hint the compiler to use a register instead of memory, but the compiler is free to ignore it.
I suggest to use OS lock specific routines like the InterlockedIncrement Function on Windows.

Worrying about just x86 is horribly unportable coding anyway. This is one of those seemingly small coding tasks that turns out to be a project in itself. Find an existing library project that solves this sort of problem for a wide range of platforms, and use it. GLib seems to be one, from what kaizer.se says.

Related

Is it possible to generate ansi C functions with type information for a moving GC implementation?

I am wondering what methods there are to add typing information to generated C methods. I'm transpiling a higher-level programming language to C and I'd like to add a moving garbage collector. However to do that I need the method variables to have typing information, otherwise I could modify a primitive value that looks like a pointer.
An obvious approach would be to encapsulate all (primitive and non-primitive) variables in a struct that has an extra (enum) variable for typing information, however this would cause memory and performance overhead, the transpiled code is namely meant for embedded platforms. If I were to accept the memory overhead the obvious option would be to use a heap handle for all objects and then I'd be able to freely move heap blocks. However I'm wondering if there's a more efficient better approach.
I've come up with a potential solution, namely to predeclare and group variables based whether they're primitives or not (I can do that in the transpiler), and add an offset variable to each method at the end (I need to be able to find it accurately when scanning the stack area), that tells me where the non-primitive variables begin and where they end, so I can only scan those. This means that each method will use an additional 16/32-bit (depending on arch) of memory, however this should still be more memory efficient than the heap handle approach.
Example:
void my_func() {
int i = 5;
int z = 3;
bool b = false;
void* person;
void* person_info = ...;
.... // logic
volatile int offset = 0x034;
}
My aim is for something that works universally across GCC compilers, thus my concerns are:
Can the compiler reorder the variables from how they're declared in
the source code?
Can I force the compiler to put some data in the
method's stack frame (using volatile)?
Can I find the offset accurately when scanning the stack?
I'd like to avoid assembly so this approach can work (by default) across multiple platforms, however I'm open for methods even if they involve assembly (if they're reliable).
Typing information could be somehow encoded in the C function name; this is done by C++ and other implementations and called name mangling.
Actually, you could decide, since all your C code is generated, to adopt a different convention: generate long C identifiers which are practically unique and sort-of random program-wide, such as tiziw_7oa7eIzzcxv03TmmZ and keep their typing information elsewhere (e.g. some database). On Linux, such an approach is friendly to both libbacktrace and dlsym(3) + dladdr(3) (and of course nm(1) or readelf(1) or gdb(1)), so used in both bismon and RefPerSys projects.
Typing information is practically tied to calling conventions and ABIs. For example, the x86-64 ABI for Linux mandates different processor registers for passing floating points or pointers.
Read the Garbage Collection handbook or at least P.Wilson Uniprocessor Garbage Collection Techniques survey. You could decide to use tagged integers instead of boxing them, and you could decide to have a conservative GC (e.g. Boehm's GC) instead of a precise one. In my old GCC MELT project I generated C or C++ code for a generational copying GC. Similar techniques are used both in Bismon and in RefPerSys.
Since you are transpiling to C, consider also alternatives, such as libgccjit or LLVM. Look into libjit and asmjit.
Study also the implementation of other transpilers (compilers to C), including Chicken/Scheme and Bigloo.
Can the GCC compiler reorder the variables from how they're declared in the source code?
Of course yes, depending upon the optimizations you are asking. Some variables won't even exist in the binary (e.g. those staying in registers).
Can I force the compiler to put some data in the method's stack frame (using volatile)?
Better generate a single struct variable containing all your language variables, and leave optimizations to the compiler. You will be surprised (see this draft report).
Can I find the offset accurately when scanning the stack?
This is the most difficult, and depends a lot of compiler optimizations (e.g. if you run gcc with -O1 or -O3 on the generated C code; in some cases a recent GCC -e.g GCC 9 or GCC 10 on x86-64 for Linux- is capable of tail-call optimizations; check by compiling using gcc -O3 -S -fverbose-asm then looking into the produced assembler code). If you accept some small target processor and compiler specific tricks, this is doable. Study the implementation of the Ocaml compiler.
Send me (to basile#starynkevitch.net) an email for discussion. Please mention the URL of your question in it.
If you want to have an efficient generational copying GC with multi-threading, things become extremely tricky. The question is then how many years of development can you afford spending.
If you have exceptions in your language, take also a great care. You could with great caution generate calls to longjmp.
See of course this answer of mine.
With transpiling techniques, the evil is in the details
On Linux (specifically!) see also my manydl.c program. It demonstrates that on a Linux x86-64 laptop you could generate, in practice, hundred of thousands of dlopen(3)-ed plugins. Read then How to write shared libraries
Study also the implementation of SBCL and of GNU Prolog, at least for inspiration.
PS. The dream of a totally architecture-neutral and operating-system independent transpiler is an illusion.

Shall I use register class variables in modern C programs?

In C++, the keyword register was removed in its latest standard ISO/IEC 14882:2017 (C++17).
But also in C, I see a lot, that more and more coders tend to not use or like to declare an object with the register class qualifier because its purposed benefit shall be almost useless, like in #user253751´s answer:
register does not cause the compiler to store a value in a register. register does absolutely nothing. Only extremely old compilers used register to know which variables to store in registers. New compilers do it automatically. Even 20-year-old compilers do it automatically.
Is the use of register class variables and with that the use of the keyword register deprecated?
Shall I use register class variables in my modern programs? Or is this behavior redundant and deprecated?
There is no benefit to using register. Modern compilers substantially ignore it — they can handle register allocation better than you can. The only thing it prevents is taking the address of the variable, which is not a significant benefit.
None of my own code uses register any more. The code I work on loses register when I get to work on a file — but it takes time to get through 17,000+ files (and I only change a file when I have an external reason to change it — but it can be a flimsy reason).
As #JonathanLeffler stated it is ignored in most cases.
Some compilers have a special extension syntax if you want to keep the variable in the particular register.
gcc Global or local variable can be placed in the particular register. This option is not available for all platforms. I know that AVR & ARM ports implement it.
example:
register int x asm ("10");
int foo(int y)
{
x = bar(x);
x = bar1(x);
return x*x;
}
https://godbolt.org/z/qwAZ8x
More information: https://gcc.gnu.org/onlinedocs/gcc-6.1.0/gcc/Explicit-Register-Variables.html#Explicit-Register-Variables
But to be honest I was never using it in my programming life (30y+)
It's effectively deprecated and offers no real benefit.
C is a product of the early 1970s, and the register keyword served as a hint to the compiler that a) this particular object was going to be used a lot, so b) you might want to store it somewhere other than main memory - IOW, a register or some other "fast" memory.
It may have made a difference then - now, it's pretty much ignored. The only measurable effect is that it prevents you from taking the address of that object.
First of all, this feature is NOT deprecated because: "register" in this context (global or local register variables) is a GNU extension which are not deprecated.
In your example, R10 (or the register that GCC internally assigns REGNO(reg) = 10), is a global register. "global" here means, that all code in your application must agree on that usage. This is usually not the case for code from libraries like libc, libm or libgcc because they are not compiled with -ffixed-10. Moreover, global registers might conflict with the ABI. avr-gcc for example might pass values in R10. In avr-gcc, R2...R9 are not used by the ABI and not by code from libgcc (except for 64-bit double).
In some hard real-time app with avr-gcc I used global regs in a (premature) optimization, just to notice that the performance gain was miniscule.
Local register variables, however, are very handy when it comes to integrating non-ABI functions for example assembly functions that don't comply to the GCC ABI, without the need for assembly wrappers.

Using GCC __sync extensions for a portable C library

I am developing a C library on OS X (10.10.x which happens to ship with GCC 4.2.x). This library is intended to be maximally portable and not specific to OS X.
I would like the end users to have the least headaches in building from source. So while the project is coded to std=c11 to get some of the benefits of the most modern C, it seems optional matter such as atomics are not supported by this version of GCC.
I am assuming GNU-Linux and various BSD end users to have either (a) a later version of GCC, or (b) the chops to install the latest and greatest.
Is it a good decision to rely on the __sync extensions of GCC for the required CAS (etc.) semantics?
I think you need to take a step back and first define all your use cases. The merits of __sync vs C11 atomics aside, better to define your needs first (i.e. __sync/atomics are solutions not needs).
The Linux kernel is one of the heaviest, most sophisticated users of locking, atomics, etc. and C11 atomics aren't powerful enough for it. See https://lwn.net/Articles/586838/
For example, you might be far better off wrapping things in pthread_mutex_lock / pthread_mutex_unlock pairs. Declaring a struct as C11 atomic does not guarantee atomic access to the whole struct, only parts of it. So, if you needed the following to be atomic:
glob.x = 5;
glob.y = 7;
glob.z = 9;
You would be better wrapping this in the pthread_mutex_* pairing. For comparison, inside the Linux kernel, this would be spin locks or RCU. In fact, you might use RCU as well. Note that doing:
CAS(glob.x,5)
CAS(glob.y,7)
CAS(glob.z,9)
is not the same as the mutex pairing if you want an all or nothing update.
I'd wrap your implementation in some thin layer. For example, the best way might be __sync on one arch [say BSD] and atomics on another. By abstracting this into a .h file with macros/inlines, you can write "common code" without lots of #ifdef's everywhere.
I wrote a ring queue struct/object. Its updater could use CAS [I wrote my own inline asm for this], pthread_mutex_*, kernel spin locks, etc. Actual choice of which was controlled by one or two #ifdef's inside my_ring_queue.h
Another advantage to abstraction: You can change your mind farther down the road. Suppose you did an early pick of __sync or atomics. You code this up in 200 places in 30 files. Then, comes the "big oops" where you realize this was the wrong choice. Lots of editing ensues. So, never put a naked [say] __sync_val_compare_and_swap in any of your .c files. Put it in once in my_atomics.h as something like #define MY_CAS_VAL(...) __sync_val_compare_and_swap(...) and use MY_CAS_VAL
You might also be able to reduce the number of places that need interthread locking by using thread local storage for certain things like subpool allocs/frees.
You may also want to use a mixture of CAS and lock pairings. Some specific uses fair better with low level CAS and others would be more efficient with mutex pairs. Again, it helps if you can define your needs first.
Also, consider the final disaster scenario: The compiler doesn't support atomics and __sync is not available [or does not work] for the arch you're compiling to. What then?
In that case, note that all __sync operations can be implemented using pthread_mutex pairings. That's your disaster fallback.

How to create atomic section in c [duplicate]

Are there functions for performing atomic operations (like increment / decrement of an integer) etc supported by C Run time library or any other utility libraries?
If yes, what all operations can be made atomic using such functions?
Will it be more beneficial to use such functions than the normal synchronization primitives like mutex etc?
OS : Windows, Linux, Solaris & VxWorks
Prior to C11
The C library doesn't have any.
On Linux, gcc provides some -- look for __sync_fetch_and_add, __sync_fetch_and_sub, and so on.
In the case of Windows, look for InterlockedIncrement, InterlockedDecrement``, InterlockedExchange`, and so on. If you use gcc on Windows, I'd guess it also has the same built-ins as it does on Linux (though I haven't verified that).
On Solaris, it'll depend. Presumably if you use gcc, it'll probably (again) have the same built-ins it does under Linux. Otherwise, there are libraries floating around, but nothing really standardized.
C11
C11 added a (reasonably) complete set of atomic operations and atomic types. The operations include things like atomic_fetch_add and atomic_fetch_sum (and *_explicit versions of same that let you specify the ordering model you need, where the default ones always use memory_order_seq_cst). There are also fence functions, such as atomic_thread_fence and atomic_signal_fence.
The types correspond to each of the normal integer types--for example, atomic_int8_t corresponding to int8_t and atomic_uint_least64_t corrsponding to uint_least64_t. Those are typedef names defined in <stdatomic.h>. To avoid conflicts with any existing names, you can omit the header; the compiler itself uses names in the implementor's namespace (e.g., _Atomic_uint_least32_t instead of atomic_uint_least32_t).
'Beneficial' is situational. Always, performance depends on circumstances. You may expect something wonderful to happen when you switch out a mutex for something like this, but you may get no benefit (if it's not that popular of a case) or make things worse (if you accidently create a 'spin-lock').
Across all supported platforms, you can use use GLib's atomic operations. On platforms which have atomic operations built-in (e.g. assembly instructions), glib will use them. On other platforms, it will fall back to using mutexes.
I think that atomic operations can give you a speed boost, even if mutexes are implemented using them. With the mutex, you will have at least two atomic ops (lock & unlock), plus the actual operation. If the atomic op is available, it's a single operation.
Not sure what you mean by the C runtime library. The language proper, or the standard library does not provide you with any means to do this. You'd need to use a OS specific library/API. Also, don't be fooled by sig_atomic_t -- they are not what it seems at first glance and are useful only in the context of signal handlers.
On Windows, there are InterlockedExchange and the like. For Linux, you can take glibc's atomic macros - they're portable (see i486 atomic.h). I don't know a solution for the other operating systems.
In general, you can use the xchg instruction on x86 for atomic operations (works on Dual Core CPUs, too).
As to your second question, no, I don't think that using atomic operations will be faster than using mutexes. For instance, the pthreads library already implements mutexes with atomic operations, which is very fast.

How to get `gcc` to generate `bts` instruction for x86-64 from standard C?

Inspired by a recent question, I'd like to know if anyone knows how to get gcc to generate the x86-64 bts instruction (bit test and set) on the Linux x86-64 platforms, without resorting to inline assembly or to nonstandard compiler intrinsics.
Related questions:
Why doesn't gcc do this for a simple |= operation were the right-hand side has exactly 1 bit set?
How to get bts using compiler intrinsics or the asm directive
Portability is more important to me than bts, so I won't use and asm directive, and if there's another solution, I prefer not to use compiler instrinsics.
EDIT: The C source language does not support atomic operations, so I'm not particularly interested in getting atomic test-and-set (even though that's the original reason for test-and-set to exist in the first place). If I want something atomic I know I have no chance of doing it with standard C source: it has to be an intrinsic, a library function, or inline assembly. (I have implemented atomic operations in compilers that support multiple threads.)
It is in the first answer for the first link - how much does it matter in grand scheme of things. The only part when you test bits are:
Low level drivers. However if you are writing one you probably know ASM, it is sufficient tided to the system and probably most delays are on I/O
Testing for flags. It is usually either on initialisation (one time only at the beginning) or on some shared computation (which takes much more time).
The overall impact on performance of applications and macrobenchmarks is likely to be minimal even if microbenchmarks shows an improvement.
To the Edit part - using bts alone does not guarantee the atomic of the operation. All it guarantee is that it will be atomic on this core (so is or done on memory). On multi-processor units (uncommon) or multi-core units (very common) you still have to synchronize with other processors.
As synchronization is much more expensive I belive that difference between:
asm("lock bts %0, %1" : "+m" (*array) : "r" (bit));
and
asm("lock or %0, %1" : "+m" (*array) : "r" (1 << bit));
is minimal. And the second form:
Can set several flag at once
Have nice __sync_fetch_and_or (array, 1 << bit) form (working on gcc and intel compiler as far as I remember).
I use the gcc atomic builtins such as __sync_lock_test_and_set( http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/Atomic-Builtins.html ). Changing the -march flag will directly affect what is generated. I'm using it with i686 right now, but http://gcc.gnu.org/onlinedocs/gcc-4.1.2/gcc/i386-and-x86_002d64-Options.html#i386-and-x86_002d64-Options shows all the possibilities.
I realize it's not exactly what you are asking for, but I found those two web pages very useful when I was looking for mechanisms like that.
I believe (but am not certain) that neither the C++ or C standards have any mechanisms for these types of synchronization mechanisms yet. Support for higher level synchronization mechanisms are in various states of standardization, but I don't even think one of those would allow you the access of the type of primitive you're after.
Are you programming lock-free datastructures where locks are insufficient?
You probably want to just go ahead and use gcc's non-standard extensions and/or operating system or library provided synchronization primitives. I would bet there's a library that might provide the type of portability you're looking for if you're concerned about using compiler intrinsics. (Though really, I think most people just bite the bullet and use gcc-specific code when they need it. Not ideal, but the standards haven't really been keeping up.)

Resources