I have a C extension that is called from my multithreaded Python application. I use a static variable i somewhere in a C function, and I have a few i++ statements later on that can be run from different Python threads (that variable is only used in my C code though, I don't yield it to Python).
For some reason I haven't hit any race condition so far, but I wonder if it's just luck...
I don't have any thread-related C code (no Py_BEGIN_ALLOW_THREADS or anything).
I know that the GIL only guarantees that single bytecode instructions are atomic and thread-safe, so statements such as i += 1 in Python are not thread-safe.
But I don't know about an i++ statement in a C extension. Any help?
Python will not release the GIL while you are running C code (unless you either tell it to or cause the execution of Python code - see the warning note at the bottom!). It only releases the GIL between bytecode instructions (never in the middle of one), and from the interpreter's point of view, running a C function is part of executing the CALL_FUNCTION bytecode.* (Unfortunately I can't find a reference for this paragraph currently, but I'm almost certain it's right.)
Therefore, unless you do something specific, your C code will be the only thread running, and any operation you do in it should be thread-safe.
If you specifically want to release the GIL - for example because you're doing a long calculation which doesn't interfere with Python, reading from a file, or sleeping while waiting for something else to happen - then the easiest way is to do Py_BEGIN_ALLOW_THREADS then Py_END_ALLOW_THREADS when you want to get it back. During this block you cannot use most Python API functions and it's your responsibility to ensure thread safety in C. The easiest way to do this is to only use local variables and not read or write any global state.
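For illustration, here is a minimal sketch of that pattern (the function slow_sum and its behaviour are my invention, not something from the question):

/* Hypothetical extension function: releases the GIL during a long,
   Python-free loop, then reacquires it before touching Python objects. */
static PyObject *slow_sum(PyObject *self, PyObject *args) {
    long n;
    long long total = 0;
    if (!PyArg_ParseTuple(args, "l", &n))
        return NULL;
    Py_BEGIN_ALLOW_THREADS   /* GIL released: touch only local variables here */
    for (long i = 0; i < n; i++)
        total += i;
    Py_END_ALLOW_THREADS     /* GIL reacquired: the Python API is safe again */
    return PyLong_FromLongLong(total);
}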
If you've already got a C thread running without the GIL (thread A) then simply holding the GIL in thread B does not guarantee that thread A won't modify C global variables. To be safe you need to ensure that you never modify global state without some kind of locking mechanism (either the Python GIL or a C mechanism) in all your C functions.
Additional thought
* One place where the GIL can be released in C code is if the C code calls something that causes Python code to be executed. This might be through using PyObject_Call. A less obvious place would be if Py_DECREF caused a destructor to be executed. You'd have the GIL back by the time your C code resumed, but you could no longer guarantee that global objects were unchanged. This obviously doesn't affect simple C like x++.
Belated Edit:
It should be emphasised that it's really, really, really easy to cause the execution of Python code. For this reason you shouldn't use the GIL in place of a mutex or an actual locking mechanism. You should only consider it for operations that are really atomic (i.e. a single C API call) or that work entirely on non-Python C objects. You won't lose the GIL unexpectedly while executing C code, but a lot of C API calls may release the GIL, do something else, and then regain the GIL before returning to your C code.
The purpose of the GIL is to make sure that the Python internals don't get corrupted. The GIL will continue to serve this purpose within an extension module. However, race conditions that leave valid Python objects arranged in ways you don't expect are still entirely possible. For example:
PySequence_SetItem(some_list, 0, some_item);
PyObject* item = PySequence_GetItem(some_list, 0);
assert(item == some_item); // may not be true
// the destructor of the previous contents of item 0 may have released the GIL
Let's say thread 1 (running on core 0) updates a global variable, and the updated value is cached in core 0's L1 cache (not flushed to main memory). Then thread 2 starts to execute on core 3 and tries to read the global variable; since it doesn't have the cached value, it reads it from main memory, so thread 2 is reading an outdated value.
I know that in C you can use volatile to force the compiler not to keep the value in CPU registers, which means a volatile variable gets its value from cache or main memory. In my scenario above, even if I declare the global variable volatile, the latest value will still be cached in the L1 cache, and main memory will still hold an old value, which is what thread 2 will read. So how can we fix this issue?
Or maybe my understanding is wrong: does volatile make the variable be updated in main memory directly, so that every time you read or write a volatile variable, you read/write it from/to main memory directly?
To some extent, the people noting that the premise of your question is flawed have a reasonable answer. In general, this happens rarely if at all, and is usually indistinguishable from a race condition.
However, yes, it can happen. See for example memory barriers, which are a great example of how such a condition (albeit due to out-of-order execution etc.) can occur.
That being said, what you're looking for to make sure the specific occurrence you've noted cannot happen is called a "cache flush". This can be genuinely important on ARM/ARM64 processors, where separate data and instruction caches exist, but it's also a good habit to get into for data that is passed between threads this way. You can also check out the __builtin___clear_cache C compiler builtin, which performs a similar task. Hopefully one of these will help you get to the bottom of your problem.
However, most likely you're not running into a caching issue, and a race condition is far more likely to be arising. If memory barriers/cache flushes don't fix your issue, audit your code very carefully for raciness.
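If it does turn out to be an ordering/visibility problem, here is a minimal sketch of publishing data safely with C11 <stdatomic.h> (my choice of mechanism; the names payload, ready, publish and consume are invented). The release store guarantees the payload write is visible to any thread whose acquire load sees the flag:

#include <stdatomic.h>

int payload;              /* ordinary shared data */
atomic_int ready;         /* publication flag, initially 0 */

/* writer thread */
void publish(int v) {
    payload = v;
    /* release: everything written above is visible before ready becomes 1 */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

/* reader thread */
int consume(void) {
    /* acquire: once we see ready == 1, the payload write is visible too */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* spin */
    return payload;
}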
How to make sure a multithreaded C program reads the latest value from main memory?
You probably want to use a thread library like POSIX threads. Read some pthreads tutorial, see pthreads(7), and use pthread_create(3), pthread_mutex_init, pthread_mutex_lock, pthread condition variables, and so on.
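For instance, a minimal sketch of protecting a shared variable with a pthread mutex (the counter and function names are illustrative):

#include <pthread.h>

static int counter;  /* shared state */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void increment(void) {
    pthread_mutex_lock(&lock);    /* POSIX guarantees lock/unlock also act */
    counter++;                    /* as memory barriers, so the new value  */
    pthread_mutex_unlock(&lock);  /* is visible to the next locker         */
}

int read_counter(void) {
    pthread_mutex_lock(&lock);
    int v = counter;
    pthread_mutex_unlock(&lock);
    return v;
}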
Read also the documentation of GNU libc and of your C compiler (e.g. GCC, to be used as gcc -Wall -Wextra -g) and of your debugger (e.g. GDB).
Be prepared to fight against Heisenbugs.
In general you cannot prove statically that your C program doesn't have race conditions. See Rice's theorem. You could use tools like Frama-C or the Clang static analyzer, or write your own GCC plugin, or improve or extend Bismon, described in this draft report.
You could be interested in CompCert.
You cannot be sure that your program reads the "latest" value from memory (unless you add some assembly code).
Read about cache coherence protocols.
First of all, this is definitely about C; no C++ solutions are requested.
Target:
Return to the caller function (A) across multiple stack frames.
I have some solutions, but none of them feels like the best option.
The easiest one in the sense of implementation is longjmp/setjmp, but I am not sure whether it destroys auto variables, because, as the wiki says, no normal stack unwinding takes place when longjmp is performed.
Here is a short description of the program flow:
function A calls a file-processing function, which results in many internal and recursive invocations. At some point, the file reader meets EOF, so the job of file processing is done and control should be returned to function A.
Comparing each read character against EOF or '\0'? No, thanks.
UPD: I can avoid dynamic allocations in the call chain between setjmp and longjmp.
Since I am not sure about auto variables, I do not know what will happen in sequential calls to the file processing (there is more than one file).
So:
1) What about the 'no stack unwinding' done by longjmp? How dangerous is that if I still have all the data holders (pointers) available?
2) Are there other neat and effective ways to go back to the A frame?
I don't know what you read, but setjmp/longjmp is exactly the tool intended for this task.
longjmp re-establishes the "stack" exactly (well, sort of) as it was at the call to setjmp; all modifications to the "stack" made between the two are lost, including all auto variables that were defined in the meantime. This re-establishment of the stack is brute force; in C there is no concept of destructors, and this is perhaps what is meant by "no stack unwinding".
I put "stack" in quotes since this is not a term that the C standard uses; it only talks about state, and allows that state to be organized however it pleases the implementation.
Now the only information that you are able to keep from the time between setjmp and longjmp is:
the value that you pass to longjmp
the value of modified volatile objects that you defined before setjmp
So in the branch where you come back from longjmp, you have to use this (and only this) information to clean up your mess: close files, free objects that you malloc'ed, etc.
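A minimal sketch of the whole pattern, assuming a recursive reader that bails out on EOF (the names process_file and read_recursive are mine; error handling is omitted):

#include <setjmp.h>
#include <stdio.h>

static jmp_buf eof_jump;

static void read_recursive(FILE *f) {
    int c = fgetc(f);
    if (c == EOF)
        longjmp(eof_jump, 1);   /* bail out across all recursive frames */
    /* ... process c, possibly recursing ... */
    read_recursive(f);
}

void process_file(const char *path) {
    FILE *volatile f = fopen(path, "r");  /* volatile: reliable after longjmp */
    if (!f)
        return;
    if (setjmp(eof_jump) == 0)
        read_recursive(f);      /* normal path */
    /* longjmp lands here: clean up using the saved handle */
    fclose(f);
}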
I'm looking for a way to call a C function in a different stack, i.e. save the current stack pointer, set the stack pointer to a different location, call the function and restore the old stack pointer when it returns.
The purpose of this is a lightweight threading system for a programming language. Threads will operate on very small stacks, check when more stack is needed and dynamically resize it. This is so that thousands of threads can be allocated without wasting a lot of memory. When calling in to C code it is not safe to use a tiny stack, since the C code does not know about checking and resizing, so I want to use a big pthread stack which is used only for calling C (shared between lightweight threads on the same pthread).
Now I could write assembly code stubs which will work fine, but I wondered if there is a better way to do this, such as a gcc extension or a library which already implements it. If not, then I guess I'll have my head buried in ABI and assembly language manuals ;-) I only ask this out of laziness and not wanting to reinvent the wheel.
Assuming you're using POSIX threads and on a POSIX system, you can achieve this with signals. Set up an alternate signal handling stack (sigaltstack) and designate one special real-time signal to have its handler run on the alternate signal stack. Then raise the signal to switch to the stack, and have the signal handler read the data for what function to call, and what argument to pass it, from thread-local data.
Note that this approach is fairly expensive (multiple system calls to change stacks), but should be 100% portable to POSIX systems. Since it's slow, you might want to make arch-specific call-on-alt-stack functions written in assembly, and only use my general solution as a fallback for archs where you haven't written an assembly version.
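A rough sketch of that approach, assuming POSIX and GCC/Clang (the name call_on_alt_stack and the thread-local slots are invented; error checking omitted):

#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <string.h>

/* Thread-local slot describing the call the handler should perform. */
static __thread void (*alt_fn)(void *);
static __thread void *alt_arg;

static void handler(int sig) {
    (void)sig;
    alt_fn(alt_arg);                /* runs on the alternate stack */
}

/* Call fn(arg) on a dedicated big stack via a real-time signal. */
void call_on_alt_stack(void (*fn)(void *), void *arg,
                       void *stack, size_t stack_size) {
    stack_t ss = { .ss_sp = stack, .ss_size = stack_size, .ss_flags = 0 };
    stack_t old;
    struct sigaction sa;

    sigaltstack(&ss, &old);
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = handler;
    sa.sa_flags = SA_ONSTACK;       /* run the handler on the alternate stack */
    sigaction(SIGRTMIN, &sa, NULL);

    alt_fn = fn;
    alt_arg = arg;
    raise(SIGRTMIN);                /* delivered synchronously to this thread */

    sigaltstack(&old, NULL);        /* restore the previous alt-stack setting */
}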
In a codebase I reviewed, I found the following idiom.
void notify(struct actor_t act) {
    write(act.pipe, "M", 1);
}

// thread A sending data to thread B
void send(byte *data) {
    global.data = data;
    notify(threadB);
}

// in thread B event loop
read(this.sock, &cmd, 1);
switch (cmd) {
case 'M': use_data(global.data); break;
...
}
"Hold it", I said to the author, a senior member of my team, "there's no memory barrier here! You don't guarantee that global.data will be flushed from the cache to main memory. If thread A and thread B will run in two different processors - this scheme might fail".
The senior programmer grinned and explained slowly, as if explaining to his five-year-old boy how to tie his shoelaces: "Listen, young boy, we've seen many thread-related bugs here, in high-load testing and in real clients", he paused to scratch his longish beard, "but we've never had a bug with this idiom".
"But, it says in the book..."
"Quiet!", he hushed me promptly, "Maybe theoretically, it's not guaranteed, but in practice, the fact you used a function call is effectively a memory barrier. The compiler will not reorder the instruction global.data = data, since it can't know if anyone using it in the function call, and the x86 architecture will ensure that the other CPUs will see this piece of global data by the time thread B reads the command from the pipe. Rest assured, we have ample real world problems to worry about. We don't need to invest extra effort in bogus theoretical problems.
"Rest assured my boy, in time you'll understand to separate the real problem from the I-need-to-get-a-PhD non-problems."
Is he correct? Is that really a non-issue in practice (say x86, x64 and ARM)?
It's against everything I learned, but he does have a long beard and a really smart look!
Extra points if you can show me a piece of code proving him wrong!
Memory barriers aren't just to prevent instruction reordering. Even if instructions aren't reordered, it can still cause problems with cache coherence. As for the reordering - it depends on your compiler and settings. ICC is particularly aggressive with reordering. MSVC with whole-program optimization can be, too.
If your shared data variable is declared volatile, even though it's not in the spec, most compilers will generate a memory barrier around reads and writes of the variable and prevent reordering. This is not the correct way of using volatile, nor what it was meant for.
(If I had any votes left, I'd +1 your question for the narration.)
In practice, a function call is a compiler barrier, meaning that the compiler will not move global memory accesses past the call. A caveat to this is functions the compiler knows something about, e.g. builtins, inlined functions (keep IPO in mind!), etc.
So a processor memory barrier (in addition to a compiler barrier) is in theory needed to make this work. However, since you're calling read and write which are syscalls that change the global state, I'm quite sure that the kernel issues memory barriers somewhere in the implementation of those. There is no such guarantee though, so in theory you need the barriers.
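If you want the guarantee without relying on the kernel's barriers, here is a sketch of the same idiom with explicit ordering (C11 <stdatomic.h>; I've made the shared pointer itself atomic, standing in for the plain global.data; notify, threadB, byte and use_data are the question's names):

#include <stdatomic.h>

_Atomic(byte *) shared_data;   /* replaces the plain global.data */

/* thread A */
void send(byte *data) {
    /* release: the pointed-to data is published before the notification */
    atomic_store_explicit(&shared_data, data, memory_order_release);
    notify(threadB);
}

/* thread B, after reading 'M' from the pipe */
void on_message(void) {
    /* acquire: pairs with the release store above */
    byte *d = atomic_load_explicit(&shared_data, memory_order_acquire);
    use_data(d);
}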
The basic rule is: the compiler must make the global state appear to be exactly as you coded it, but if it can prove that a given function doesn't use global variables then it can implement the algorithm any way it chooses.
The upshot is that traditional compilers always treated functions in another compilation unit as a memory barrier because they couldn't see inside those functions. Increasingly, modern compilers are growing "whole program" or "link time" optimization strategies which break down these barriers and will cause poorly written code to fail, even though it's been working fine for years.
If the function in question is in a shared library then the compiler won't be able to see inside it, but if the function is one defined by the C standard then it doesn't need to -- it already knows what the function does -- so you have to be careful of those as well. Note that a compiler will not recognise a kernel call for what it is, but the very act of inserting something that the compiler can't recognise (inline assembler, or a function call to an assembler file) will create a memory barrier in itself.
In your case, notify will either be a black box the compiler can't see inside (a library function) or else it will contain a recognisable memory barrier, so you are most likely safe.
In practice, you have to write very bad code to fall over this.
In practice, he's correct and a memory barrier is implied in this specific case.
But the point is that if its presence is "debatable", the code is already too complex and unclear.
Really guys, use a mutex or other proper constructs. It's the only safe way to deal with threads and to write maintainable code.
And maybe you'll see other errors, such as the code being unpredictable if send() is called more than once.
In C I have a pointer that is declared volatile and initialized null.
void* volatile pvoid;
Thread 1 is occasionally reading the pointer value to check if it is non-null. Thread 1 will not set the value of the pointer.
Thread 2 will set the value of a pointer just once.
I believe I can get away without using a mutex or condition variable.
Is there any reason thread 1 will read a corrupted value or thread 2 will write a corrupted value?
To make it thread-safe, you have to use atomic reads and writes to the variable; it being volatile is not safe in all timing situations. Under Win32 there are the Interlocked functions; under Linux you can build it yourself with assembly if you do not want to use the heavyweight mutexes and condition variables.
If you are not against the GPL then http://www.threadingbuildingblocks.org and its atomic<> template seem promising. The library is cross-platform.
In the case where the value fits in a single register, such as a memory-aligned pointer, this is safe. In other cases, where it might take more than one instruction to read or write the value, the reading thread could get corrupted data. If you are not sure whether the read and write will take a single instruction in all usage scenarios, use atomic reads and writes.
It depends on your compiler, architecture and operating system. POSIX (since this question was tagged pthreads, I'm assuming we're not talking about Windows or some other threading model) and C don't give enough constraints to have a portable answer to this question.
The safe assumption is of course to protect the access to the pointer with a mutex. However based on your description of the problem I wonder if pthread_once wouldn't be a better way to go. Granted there's not enough information in the question to say one way or the other.
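For what it's worth, a minimal sketch of the pthread_once approach (the names get_shared and make_resource are invented):

#include <pthread.h>

static pthread_once_t once = PTHREAD_ONCE_INIT;
static void *shared_ptr;

static void init_shared(void) {
    shared_ptr = make_resource();  /* hypothetical one-time initializer */
}

void *get_shared(void) {
    /* runs init_shared exactly once; all callers safely see its result */
    pthread_once(&once, init_shared);
    return shared_ptr;
}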
Unfortunately, you cannot portably make any assumptions about what is atomic in pure C.
GCC, however, does provide some atomic built-in functions that take care of using the proper instructions for many architectures for you. See Chapter 5.47 of the GCC manual for more information.
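For example, a minimal sketch with GCC's __atomic builtins (the newer forms that supersede the older __sync builtins; the function names are mine):

/* thread 2 publishes the pointer once; thread 1 polls for it */
static void *pvoid;  /* no volatile needed once accesses are atomic */

void set_pointer(void *p) {
    __atomic_store_n(&pvoid, p, __ATOMIC_RELEASE);
}

void *poll_pointer(void) {
    /* returns NULL until the pointer has been published */
    return __atomic_load_n(&pvoid, __ATOMIC_ACQUIRE);
}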
Well, this seems fine. The only problem will happen in the following case.
Let thread A be your checking thread and B the modifying one.
Checking for equality is not technically atomic: first the value is copied into a register, then checked. Let's assume thread A has copied the value into a register and B then changes the variable. When control goes back to A, its comparison uses the stale register value, so its answer reflects the variable's value at the moment it was read, not its current value. This seems harmless in this program but might cause problems.
Use a mutex - it's simple enough, and you can be sure you don't have synchronization errors!
On most platforms, where a pointer value can be read/written in a single instruction, it is either set or it isn't set yet. It can't be interrupted in the middle and contain a corrupted value. A mutex isn't needed on that kind of platform.