Using __thread in c99

I would like to define a few variables as thread-specific using the __thread storage class. But three questions make me hesitate:
Is it really standard in c99? Or more to the point, how good is the compiler support?
Will the variables be initialised in every thread?
Do non-multi threaded programs treat them as plain-old-globals?

To answer your specific questions:
No, it is not part of C99. You will not find it mentioned anywhere in the n1256.pdf (C99+TC1/2/3) or the original C99 standard.
Yes, __thread variables start out with their initialized value in every new thread.
From a standpoint of program behavior, thread-local storage class variables behave pretty much the same as plain globals in non-multi-threaded programs. However, they do incur a bit more runtime cost (memory and startup time), and there can be issues with limits on the size and number of thread-local variables. All this is rather complicated and varies depending on whether your program is static- or dynamic-linked and whether the variables reside in the main program or a shared library...
Outside of implementing C/POSIX (e.g. errno, etc.), thread-local storage class is actually not very useful, in my opinion. It's pretty much a crutch for avoiding cleanly passing around the necessary state in the form of a context pointer or similar. You might think it could be useful for getting around broken interfaces like qsort that don't take a context pointer, but unfortunately there is no guarantee that qsort will call the comparison function in the same thread that called qsort. It might break the job down and run it in multiple threads. Same goes for most other interfaces where this sort of workaround would be possible.
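A minimal sketch of the per-thread initialization behavior described above, assuming GCC or Clang (where __thread is an extension; C11 later standardized the same idea as _Thread_local). The names tls_counter, worker and tls_demo are hypothetical and only serve the illustration.

```c
#include <pthread.h>
#include <stddef.h>

/* __thread is a compiler extension here, not ISO C99. */
static __thread int tls_counter = 42;

/* Worker: records the value it initially sees, then changes its own copy. */
static void *worker(void *arg)
{
    int *seen = arg;
    *seen = tls_counter;  /* a new thread starts from the initializer */
    tls_counter = 7;      /* modifies only this thread's copy */
    return NULL;
}

/* Hypothetical helper: returns 0 on success, -1 if a pthread call fails. */
int tls_demo(int *seen_in_worker, int *seen_in_caller)
{
    pthread_t t;

    tls_counter = 100;    /* change the calling thread's copy first */
    if (pthread_create(&t, NULL, worker, seen_in_worker) != 0)
        return -1;
    if (pthread_join(t, NULL) != 0)
        return -1;
    *seen_in_caller = tls_counter;  /* untouched by the worker */
    return 0;
}
```

After tls_demo returns, the worker should have seen the initializer value and the caller should still see its own modified value, since each thread has a separate copy.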

You probably want to read this:
http://www.akkadia.org/drepper/tls.pdf
1) MSVC doesn't support C99. GCC does and other compilers attempt GCC compatibility.
Edit: A breakdown of compiler support for __thread is available here:
http://chtekk.longitekk.com/index.php?/archives/2011/02/C8.html
2) Only C++ supports an initializer and it must be constant.
3) Non-multi-threaded applications are single-threaded applications.

Related

Do I need to write explicit memory barrier for multithreaded C code?

I'm writing some code on Linux using pthread multithreading library and I'm currently wondering if following code is safe when compiled with -Ofast -lto -pthread.
// shared global
long shared_event_count = 0;
// ...
pthread_mutex_lock(mutex);
while (shared_event_count <= *last_seen_event_count)
    pthread_cond_wait(cond, mutex);
*last_seen_event_count = shared_event_count;
pthread_mutex_unlock(mutex);
Are the calls to pthread_* functions enough or should I also include a memory barrier to make sure that the change to global variable shared_event_count is actually updated during the loop? Without a memory barrier the compiler would be free to optimize the variable as a register integer only, right? Of course, I could declare the shared integer as volatile, which would prevent keeping the contents of that variable in a register only during the loop, but if I used the variable multiple times within the loop, it could make sense to check the fresh status for the loop condition only, because that could allow for more compiler optimizations.
From testing the above code as-is, it appears that the generated code actually sees the changes made by another thread. However, is there any spec or documentation that actually guarantees this?
The common solution seems to be "don't optimize multithreaded code too aggressively" but that seems like a poor man's workaround instead of really fixing the issue. I'd rather write correct code and let the compiler optimize as much as possible within the specs (any code that gets broken by optimizations is in reality using e.g. undefined behavior of C standard as assumed stable behavior, except for some rare cases where compiler actually outputs invalid code but that seems to be very very rare these days).
I'd much prefer writing the code that works with any optimizing compiler – as such, it should only use features specified in the C standard and the pthread library documentation.
I found an interesting article at https://www.alibabacloud.com/blog/597460 which contains a trick like this:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
This was actually used first in Linux kernel and it triggered a compiler bug in old GCC versions: https://lwn.net/Articles/624126/
As such, let's assume that the compiler is actually following the spec and doesn't contain a bug but implements every possible optimization known to man allowed by the specs. Is the above code safe with that assumption?
Also, does pthread_mutex_lock() include memory barrier by the spec or could compiler re-order the statements around it?

The compiler will not reorder memory accesses across pthread_mutex_lock() (this is an oversimplification and not strictly true, see below).
First I’ll justify this by talking about how compilers work, then I’ll justify this by looking at the spec, and then I’ll justify this by talking about convention.
I don’t think I can give you a perfect justification from the spec. In general, I would not expect a spec to give you a perfect justification—it’s turtles all the way down (do you have a spec for how to interpret the spec?), and the spec is designed to be read and understood by actual humans who understand the relevant background concepts.
How This Works
How this works—the compiler, by default, assumes that a function it doesn’t know can access any global variable. So it must emit the store to shared_event_count before the call to pthread_mutex_lock()—as far as the compiler knows, pthread_mutex_lock() reads the value of shared_event_count.
Inside pthread_mutex_lock is a memory fence for the CPU, if necessary.
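For reference, here is a minimal self-contained sketch of the pattern from the question, with the notifying side added. The names post_event, wait_for_event, poster and wait_demo are hypothetical; the question only shows the waiting side, and it passes the mutex and condition variable as pointers rather than using file-scope objects as assumed here.

```c
#include <pthread.h>
#include <stddef.h>

static pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static long shared_event_count = 0;  /* only accessed with mutex held */

/* Notifier: the store happens while the mutex is held. */
void post_event(void)
{
    pthread_mutex_lock(&mutex);
    shared_event_count++;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&mutex);
}

/* Waiter: same loop as in the question. */
void wait_for_event(long *last_seen_event_count)
{
    pthread_mutex_lock(&mutex);
    while (shared_event_count <= *last_seen_event_count)
        pthread_cond_wait(&cond, &mutex);
    *last_seen_event_count = shared_event_count;
    pthread_mutex_unlock(&mutex);
}

static void *poster(void *arg)
{
    (void)arg;
    post_event();
    return NULL;
}

/* Hypothetical driver: returns the count seen by the waiter, or -1 on error. */
long wait_demo(void)
{
    long last_seen = 0;
    pthread_t t;

    if (pthread_create(&t, NULL, poster, NULL) != 0)
        return -1;
    wait_for_event(&last_seen);
    pthread_join(t, NULL);
    return last_seen;
}
```

No volatile qualifier or explicit barrier appears anywhere: every access to shared_event_count happens between a lock and an unlock of the same mutex.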
Justification
From n1548:
In the abstract machine, all expressions are evaluated as specified by the semantics. An actual implementation need not evaluate part of an expression if it can deduce that its value is not used and that no needed side effects are produced (including any caused by calling a function or accessing a volatile object).
Yes, there’s LTO. LTO can do some very surprising things. However, the fact is that writing to shared_event_count does have side effects and those side effects do affect the behavior of pthread_mutex_lock() and pthread_mutex_unlock().
The POSIX spec states that pthread_mutex_lock() provides synchronization. I could not find an explanation in the POSIX spec of what synchronization is, so this may have to suffice.
POSIX 4.12
Applications shall ensure that access to any memory location by more than one thread of control (threads or processes) is restricted such that no thread of control can read or modify a memory location while another thread of control may be modifying it. Such access is restricted using functions that synchronize thread execution and also synchronize memory with respect to other threads. The following functions synchronize memory with respect to other threads:
Yes, in theory the store to shared_event_count could be moved or eliminated—but the compiler would have to somehow prove that this transformation is legal. There are various ways you could imagine this happening. For example, the compiler might be configured to do “whole program optimization”, and it may observe that shared_event_count is never read by your program—at which point, it’s a dead store and can be eliminated by the compiler.
Convention
This is how pthread_mutex_lock() has been used since the dawn of time. If compilers did this optimization, pretty much everyone’s code would break.
Volatile
I would generally not use this macro in ordinary code:
#define ACCESS_ONCE(x) (*(volatile typeof(x) *)&(x))
This is a weird thing to do, useful in weird situations. Ordinary multithreaded code is not a sufficiently weird situation to use tricks like this. Generally, you want to either use locks or atomics to read or write shared values in a multithreaded program. ACCESS_ONCE does not use a lock and it does not use atomics—so, what purpose would you use it for? Why wouldn’t you use atomic_store() or atomic_load()?
In other words, it is unclear what you would be trying to do with volatile. The volatile keyword is easily the most abused keyword in C. It is rarely useful except when writing to memory-mapped IO registers.
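If a lock-free counter is what you are after, a sketch using C11 atomics might look like the following. This assumes a compiler with <stdatomic.h> support; the names event_count, publish_event and read_event_count are hypothetical.

```c
#include <stdatomic.h>

/* Static storage, so the counter starts at zero. */
static atomic_long event_count;

/* Writer side: atomically increment the shared counter. */
void publish_event(void)
{
    atomic_fetch_add(&event_count, 1);
}

/* Reader side: atomically read the current value. */
long read_event_count(void)
{
    return atomic_load(&event_count);
}
```

Unlike the ACCESS_ONCE trick, these operations are defined by the C11 memory model, so the compiler and the CPU both know the variable is shared. Note that this does not replace the condition variable in the question; it only covers reading and writing the value itself.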
Conclusion
The code is fine. Don’t use volatile.

Efficiency issues when using C99 and C11.

The other day I was converting a program written with C99 standard into C11. Basically the motive was to use the code with MSVC but it was written in Linux and was mostly compiled with default GCC behaviour. During the code conversion, I found out that you cannot declare variables of a function after any statement, i.e. you must declare them at the top of the function.
But my question is that wouldn't it be against the efficient programming rule that variables should be declared near their use so that it maximizes the cache hits? For example, in a large function of say 200 LOC, I want to use some big static lookup array at nearly the end of the function. Wouldn't declaring and initializing it just before the usage cause more cache hits? Or am I simply missing some basic point of the C11 language standard?

You seem to be confused about which version of the standard you are compiling your program against. AFAIK, MSVC doesn't support any of the more recent C standards.
But to come to the core of your question, no, this is not an efficiency issue. The compiler is allowed to reorder statements to its liking, as long as the observable behavior of the program doesn't change. Thus a modern compiler will typically touch a new variable as late as possible before its first use.

Where the variable declaration appears has no effect on cache behavior. Just having a declaration doesn't touch memory.
You may need to separate out initialization into a separate assignment, however, in order to make sure you don't have an initializer causing a memory access at (near) the beginning of the function.
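A small sketch of what this looks like in practice, assuming a compiler that requires declarations at the top of a block. The function name sum_of_squares_up_to and the table are hypothetical.

```c
/* Sums the squares 0^2 .. n^2, with n clamped to 9; returns -1 for n < 0. */
int sum_of_squares_up_to(int n)
{
    int total;  /* declaration only: nothing is read or written here */
    int i;

    if (n < 0)
        return -1;
    if (n > 9)
        n = 9;

    total = 0;  /* initialization kept separate, next to the first use */
    {
        /* A static table is set up before the program runs, not when
           control reaches this line, so its position in the function
           does not change when memory is touched. */
        static const int squares[10] = {0, 1, 4, 9, 16, 25, 36, 49, 64, 81};

        for (i = 0; i <= n; i++)
            total += squares[i];
    }
    return total;
}
```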

Declare variable as locally as possible

I'm new to Linux kernel development.
One thing that bothers me is the way variables are declared and initialized.
I'm under the impression that the code uses the variable declaration placement rules of C89/ANSI C (variables are declared at the beginning of a block), while C99 relaxes the rule.
My background is C++, and there is a lot of advice from "very clever people" to declare variables as locally as possible, preferably declaring and initializing in the same statement:
Google C++ Style Guide
C++ Coding Standards: 101 Rules, Guidelines, and Best Practices - item 18
A good discussion about it here.
What is the accepted way to initialize variables in Linux kernel?

I couldn't find a relevant passage in the Linux kernel coding style. So, follow the convention used in existing code -- declare variables at beginning of block -- or run the risk of your code seeming out-of-place.
Reasons why variables at beginning of block is a Good Thing:
the target architecture may not have a C99 compiler
... can't think of more reasons

You should always try to declare variables as locally as possible. If you're using C++ or C99, that would usually be right before the first use.
In older C, doing that doesn't fall under "possible", and there the place to declare those variables would usually be the beginning of the current block.
(I say 'usually' because of some cases with functions and loops where it's better to make them a bit more global...)
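As an illustration of the C99 style, here is a sketch where each variable is declared at its first use. The function count_positive is a hypothetical example, not taken from any particular code base.

```c
/* C99: loop counter and temporary are declared where they are first needed. */
int count_positive(const int *values, int n)
{
    int count = 0;

    for (int i = 0; i < n; i++) {   /* i exists only inside the loop */
        const int v = values[i];    /* v exists only inside this iteration */
        if (v > 0)
            count++;
    }
    return count;
}
```

In C89 both i and v would have to be declared at the top of an enclosing block.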

In most normal cases, declare them in the beginning of the function where you are using them. There are exceptions, but they are rare.
If your function is short enough, the declaration is not far away from the first use anyway. If your function is longer than that, it's a good sign your function is too long.
The reason many C++ based coding standards recommend declaring close to use is that C++ data types can be much "fatter" (e.g. think of a class with multiple inheritance etc.) and so take up a lot more space. If you define such an instance at the beginning of a function but use it only much later (and maybe not at all) you are wasting a lot of RAM. This tends to be much less of an issue in C with its native-type-only data types.

There is an oblique reference in the Coding Style document. It says:
Another measure of the function is the number of local variables. They
shouldn't exceed 5-10, or you're doing something wrong. Re-think the
function, and split it into smaller pieces. A human brain can
generally easily keep track of about 7 different things, anything more
and it gets confused. You know you're brilliant, but maybe you'd like
to understand what you did 2 weeks from now.
So while C99 style in-place initialisers are handy in certain cases the first thing you should probably be asking yourself is why it's hard to have them all at the top of the function. This doesn't prevent you from declaring stuff inside block markers, for example for in-loop calculations.

In older C it is possible to declare them locally by creating a block inside the function. Blocks can be added even without ifs/for/while:
int foo(void)
{
    int a;
    int b;
    ....
    a = 5 + b;
    {
        int c;
        ....
    }
}
Although it doesn't look very neat, it still is possible, even in older C.

I can't speak to why they have done things one way in the Linux kernel, but in the systems we develop, we tend to not use C99-specific features in the core code. Individual applications tend to have stuff written for C99, because they will typically be deployed to one known platform, and the gcc C99 implementation is known good.
But the core code has to be deployable on whatever platform the customer demands (within reason). We have supplied systems on AIX, Solaris, Informix, Linux, Tru-64, OpenVMS(!) and the presence of C99 compliant compilers isn't always guaranteed.
The Linux kernel needs to be substantially more portable again - and particularly down to small footprint embedded systems. I guess the feature just isn't important enough to override these sorts of considerations.

Is ARPACK thread-safe?

Is it safe to use the ARPACK eigensolver from different threads at the same time from a program written in C? Or, if ARPACK itself is not thread-safe, is there an API-compatible thread-safe implementation out there? A quick Google search didn't turn up anything useful, but given the fact that ARPACK is used heavily in large scientific calculations, I'd find it highly surprising to be the first one who needs a thread-safe sparse eigensolver.
I'm not too familiar with Fortran, so I translated the ARPACK source code to C using f2c, and it seems that there are quite a few static variables. Basically, all the local variables in the translated routines seem to be static, implying that the library itself is not thread-safe.

Fortran 77 does not support recursion, and hence a standard conforming compiler can allocate all variables in the data section of the program; in principle, neither a stack nor a heap is needed [1].
It might be that this is what f2c is doing, and if so, it might be that it's the f2c step that makes the program non thread-safe, rather than the program itself. Of course, as others have mentioned, check out for COMMON blocks as well. EDIT: Also, check for explicit SAVE directives. SAVE means that the value of the variable should be retained between subsequent invocations of the procedure, similar to static in C. Now, allocating all procedure local data in the data section makes all variables implicitly SAVE, and unfortunately, there is a lot of old code that assumes this even though it's not guaranteed by the Fortran standard. Such code, obviously, is not thread-safe. Wrt. ARPACK specifically, I can't promise anything but ARPACK is generally well regarded and widely used so I'd be surprised if it suffered from these kinds of dusty-deck problems.
Most modern Fortran compilers do use stack allocation. You might have better luck compiling ARPACK with, say, gfortran and the -frecursive option.
EDIT:
[1] Not because it's more efficient, but because Fortran was originally designed before stacks and heaps were invented, and for some reason the standards committee wanted to retain the option to implement Fortran on hardware with neither stack nor heap support all the way up to Fortran 90. Actually, I'd guess that stacks are more efficient on today's heavily cache-dependent hardware than accessing procedure-local data that is spread all over the data section.

I have converted ARPACK to C using f2c. Whenever you use f2c and you care about thread-safety you must use the -a switch. This makes local variables have automatic storage, i.e. be stack based locals rather than statics which is the default.
Even so, ARPACK itself is decidedly not threadsafe. It uses a lot of common blocks (i.e. global variables) to preserve state between different calls to its functions. If memory serves, it uses a reverse communication interface which tends to lead developers to using global variables. And of course ARPACK probably was written long before multi-threading was common.
I ended up re-working the converted C code to systematically remove all the global variables. I created a handful of C structs and gradually moved the global variables into these structs. Finally I passed pointers to these structs to each function that needed access to those variables. Although I could just have converted each global into a parameter wherever it was needed it was much cleaner to keep them all together, contained in structs.
Essentially the idea is to convert global variables into local variables.
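A rough sketch of that refactoring, using a made-up iterative routine rather than actual ARPACK code. The struct solver_state and the functions solver_init and solver_step are hypothetical names for illustration only.

```c
/* State that used to live in globals (COMMON blocks in the Fortran source)
   is gathered into one struct owned by the caller. */
struct solver_state {
    int iterations;
    double residual;
};

void solver_init(struct solver_state *s, double initial_residual)
{
    s->iterations = 0;
    s->residual = initial_residual;
}

/* One step of a toy iteration: halves the residual.
   Returns 1 while the residual is still above tol, 0 once it is not. */
int solver_step(struct solver_state *s, double tol)
{
    s->iterations++;
    s->residual *= 0.5;
    return s->residual > tol;
}
```

Because every function receives its state through a pointer, two threads can each run their own solver_state without interfering with one another.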

ARPACK uses BLAS, right? Then those libraries need to be thread safe too.
I believe your idea to check with f2c might not be a bullet proof way of telling if the Fortran code is thread safe, I would guess it also depends on the Fortran compiler and libraries.

I don't know what strategy f2c uses in translating Fortran. Since ARPACK is written in FORTRAN 77, the first thing to do is check for the presence of COMMON blocks. These are global variables, and if used, the code is most likely not thread safe. The ARPACK webpage, http://www.caam.rice.edu/software/ARPACK/, says that there is a parallel version -- it seems likely that that version is threadsafe.

is local static variable provided by embedded compilers?

I'm working on a C library that should ideally also work on embedded systems,
but I'm not very deep into embedded development, so my question is:
are most embedded compilers able to cope with local static variables (which I would then just assume in further development)
OR
is there a #define which i can use for a #ifdef to create a global variable in case of
thx

They should, as local static variables are part of the C standard.
Of course, there is nothing preventing them from creating a C-like language that does not have all the features. But since that would be non-standard, then the way to identify that a feature is lacking would be non-standard as well.
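For clarity, this is the kind of construct being asked about. It is a minimal sketch; next_id is a hypothetical function name.

```c
/* A local static variable keeps its value between calls,
   but is only visible inside this function. */
unsigned next_id(void)
{
    static unsigned counter = 0;  /* initialized once, before first use */

    return ++counter;
}
```

Any conforming C compiler, embedded or not, has to accept this. Note that the function is not reentrant, which may matter if it is called from interrupt handlers or multiple threads.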

Since static variables are part of the standard, you should be safe.
The problem with support is probably not to be found with your compiler (most of which handle the standard pretty well), but with whatever code you have to set up your runtime environment. Make sure that when you're loading the code that you properly unpack the executable, read-only data, read-write data, and zero-init sections of the executable before jumping into the C code.

Local static variables are part of the C standard, so yes.
\pedantic{
If your code is well organized, with separate files (compilation units) for different subsystems, you might do better to have a static variable with file scope. This will make it easier to factor the code that uses it into separate functions. If the code that uses the variable is complicated, this will permit you to split it into smaller static functions, which are easier to read, understand and debug.
}
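A sketch of the file-scope alternative, with hypothetical names (total_bytes, account, record_packet, bytes_seen). The variable is invisible outside this compilation unit, yet several functions in the file can share it.

```c
/* File scope with internal linkage: private to this .c file,
   zero-initialized at program start. */
static unsigned long total_bytes;

/* Small static helper that can see the variable without it being passed in. */
static void account(unsigned long n)
{
    total_bytes += n;
}

void record_packet(unsigned long payload, unsigned long header)
{
    account(header);
    account(payload);
}

unsigned long bytes_seen(void)
{
    return total_bytes;
}
```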

Yes. local statics are really not much different than globals once the compiler is done chewing on your source code. I could think up exotic processors where globals would be an issue, but I doubt you will encounter many.
The truly interesting thing about globals on embedded processors is that you often have the option of having the compiler allocate them in ROM, EEPROMs, etc.
