I need to use an atomic variable in C as this variable is accessed across different threads. Don't want a race condition.
My code is running on CentOS. What are my options?
C11 atomic primitives
http://en.cppreference.com/w/c/language/atomic
_Atomic const int * p1; // p1 is a pointer to an atomic const int
const atomic_int * p2; // same
const _Atomic(int) * p3; // same
<threads.h> support was added in glibc 2.28. Tested on Ubuntu 18.04 (glibc 2.27) by compiling glibc from source (see: Multiple glibc libraries on a single host); later also tested on Ubuntu 20.04, glibc 2.31.
Example adapted from: https://en.cppreference.com/w/c/language/atomic
main.c
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>
atomic_int acnt;
int cnt;
int f(void* thr_data)
{
    (void)thr_data;
    for(int n = 0; n < 1000; ++n) {
        ++cnt;
        ++acnt;
        // for this example, relaxed memory order is sufficient, e.g.
        // atomic_fetch_add_explicit(&acnt, 1, memory_order_relaxed);
    }
    return 0;
}
int main(void)
{
    thrd_t thr[10];
    for(int n = 0; n < 10; ++n)
        thrd_create(&thr[n], f, NULL);
    for(int n = 0; n < 10; ++n)
        thrd_join(thr[n], NULL);
    printf("The atomic counter is %d\n", acnt);
    printf("The non-atomic counter is %d\n", cnt);
}
Compile and run:
gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -pedantic -o main.out main.c -pthread
./main.out
Possible output:
The atomic counter is 10000
The non-atomic counter is 8644
The non-atomic counter is very likely to be smaller than the atomic one because of the racy, unsynchronized access to the non-atomic variable across threads.
Disassembly analysis at: How do I start threads in plain C?
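Since <threads.h> needs glibc 2.28 or later, which an older CentOS install will not have, the same counter can also be written against POSIX threads plus <stdatomic.h> only. The sketch below is a minimal adaptation of the example above, assuming a C11 compiler and linking with -pthread; it is illustrative rather than a drop-in replacement.
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

atomic_int acnt;

static void *f(void *arg)
{
    (void)arg;
    for (int n = 0; n < 1000; ++n)
        atomic_fetch_add_explicit(&acnt, 1, memory_order_relaxed);
    return NULL;
}

int main(void)
{
    pthread_t thr[10];
    for (int n = 0; n < 10; ++n)
        pthread_create(&thr[n], NULL, f, NULL);
    for (int n = 0; n < 10; ++n)
        pthread_join(thr[n], NULL);
    printf("The atomic counter is %d\n", atomic_load(&acnt));
    return 0;
}
Compile and run (file name is just an example):
gcc -std=c11 -Wall -Wextra -pedantic -o main_pthread.out main_pthread.c -pthread
./main_pthread.out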
If you are using GCC on your CentOS platform, then you can use the __atomic built-in functions.
Of particular interest might be this function:
— Built-in Function: bool __atomic_always_lock_free (size_t size, void *ptr)
This built-in function returns true if objects of size bytes always generate lock free atomic instructions for the target architecture. size must resolve to a compile-time constant and the result also resolves to a compile-time constant.
ptr is an optional pointer to the object that may be used to determine alignment. A value of 0 indicates typical alignment should be used. The compiler may also ignore this parameter.
if (__atomic_always_lock_free (sizeof (long long), 0))
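Here is a minimal sketch of a shared counter written against the GCC __atomic builtins; the builtin names and the __ATOMIC_SEQ_CST constant are the documented GCC ones, while the pthread scaffolding around them is just for illustration.
#include <pthread.h>
#include <stdio.h>

static int counter;   /* plain int, accessed only through __atomic builtins */

static void *worker(void *arg)
{
    (void)arg;
    for (int n = 0; n < 1000; ++n)
        __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);
    return NULL;
}

int main(void)
{
    /* Compile-time check that lock-free instructions exist for this size. */
    if (!__atomic_always_lock_free(sizeof counter, 0))
        fprintf(stderr, "warning: int is not always lock free here\n");

    pthread_t thr[10];
    for (int n = 0; n < 10; ++n)
        pthread_create(&thr[n], NULL, worker, NULL);
    for (int n = 0; n < 10; ++n)
        pthread_join(thr[n], NULL);

    printf("counter = %d\n", __atomic_load_n(&counter, __ATOMIC_SEQ_CST));
    return 0;
}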
I am going to toss in my two cents in case someone benefits. Atomic operations are a major problem in Linux. I used gatomic.h at one time only to find it gone. I see all kinds of different atomic options of either questionable reliability or availability -- and I see things changing all the time. They can be complex, with tests needed by O/S level, processor, whatever. You can use a mutex -- not only complex but dreadfully slow.
Although perhaps not ideal in threads, this works great for atomic operations on shared memory variables. It is simple and it works on every O/S and processor and configuration known to man (or woman), dead reliable, easy to code, and will always work.
Any code can be made atomic with a simple primitive -- a semaphore. It is something that is true/false, 1/0, yes/no, locked/unlocked -- binary.
Once you establish the semaphore:
set semaphore //must be atomic
do all the code you like which will be atomic as the semaphore will block for you
release semaphore //must be atomic
Relatively straightforward, except for the "must be atomic" lines.
It turns out that you can easily assign your semaphores a number (I use a define so they have a name, like "#define OPEN_SEM 1" and "#define CLASS_SEM 2", and so forth).
Find out your largest number, and when your program initializes, open a file in some directory (I use one just for this purpose), creating it if it is not there:
if (ablockfd < 0) {   //ablockfd is static in case you want to
                      //call it over and over
    char *get_sy_path();
    char lockname[100];
    strcpy(lockname, get_sy_path());
    strcat(lockname, "/metlock");
    ablockfd = open(lockname, O_RDWR | O_CREAT, 0644);   //create it if not there
    //error code if ablockfd bad
}
Now to gain a semaphore:
Now use your semaphore number to "lock" a "record" in your file of length one byte. Note -- the file will never actually occupy disk space and no disk operation occurs.
//sem_id is passed in and is set from OPEN_SEM or CLASS_SEM or whatever you call your semaphores.
lseek(ablockfd, sem_id, SEEK_SET);   //seek to the byte in the file for
                                     //your semaphore number
result = lockf(ablockfd, F_LOCK, 1);
if (result != -1) {
    //got the semaphore
} else {
    //failed
}
To test if the semaphore is held:
result = lockf(ablockfd, F_TEST, 1); //after same lseek
To release the semaphore:
result = lockf(ablockfd, F_ULOCK, 1); //after same lseek
And all the other things you can do with lockf -- blocking/non-blocking, etc.
Note -- this is WAY faster than a mutex, it goes away if the process dies (a good thing), it is simple to code, and I know of no operating system or processor, with any number of cores, that cannot atomically lock a record ... so it is simple code that just works. The file never really occupies disk space (no data bytes, just a directory entry), and there seems to be no practical limit to how many semaphores you may have. I have used this for years on machines with no easy atomic solutions.
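To make the recipe concrete, here is a minimal self-contained sketch of take/release helpers built on lseek() plus lockf(). The lock-file path and the semaphore numbers are placeholders, and unlike the snippet above it creates the lock file on first use. Because the lock is tied to the file descriptor, the kernel drops it automatically if the process dies, which is the property the answer relies on.
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define OPEN_SEM  1            /* example semaphore numbers (placeholders) */
#define CLASS_SEM 2

static int ablockfd = -1;

static int sem_open_file(const char *path)   /* call once at startup */
{
    if (ablockfd < 0)
        ablockfd = open(path, O_RDWR | O_CREAT, 0644);
    return ablockfd;
}

static int sem_take(int sem_id)              /* blocks until acquired */
{
    if (lseek(ablockfd, sem_id, SEEK_SET) == (off_t)-1)
        return -1;
    return lockf(ablockfd, F_LOCK, 1);
}

static int sem_release(int sem_id)
{
    if (lseek(ablockfd, sem_id, SEEK_SET) == (off_t)-1)
        return -1;
    return lockf(ablockfd, F_ULOCK, 1);
}

int main(void)
{
    if (sem_open_file("/tmp/metlock") < 0) {
        perror("open");
        return 1;
    }
    sem_take(OPEN_SEM);
    /* ... critical section protected across processes ... */
    sem_release(OPEN_SEM);
    return 0;
}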
I'm trying to write a function that copies a function (and ends up modifying its assembly) and returns it. This works fine for one level of indirection, but at two I get a segfault.
Here is a minimum (not)working example:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#define BODY_SIZE 100
int f(void) { return 42; }
int (*G(void))(void) { return f; }
int (*(*H(void))(void))(void) { return G; }
int (*g(void))(void) {
    void *r = mmap(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy(r, f, BODY_SIZE);
    return r;
}
int (*(*h(void))(void))(void) {
    void *r = mmap(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy(r, g, BODY_SIZE);
    return r;
}
int main() {
    printf("%d\n", f());
    printf("%d\n", G()());
    printf("%d\n", g()());
    printf("%d\n", H()()());
    printf("%d\n", h()()()); // This one fails - why?
    return 0;
}
I can memcpy into an mmap'ed area once to create a valid function that can be called (g()()). But if I try to apply it again (h()()()) it segfaults. I have confirmed that it correctly creates the copied version of g, but when I execute that version I get a segfault.
Is there some reason why I can't execute code in one mmap'ed area from another mmap'ed area? From exploratory gdb-ing with x/i checks, it seems like I can call down successfully, but when I return, the function I came from has been erased and replaced with 0s.
How can I get this behaviour to work? Is it even possible?
BIG EDIT:
Many have asked for my rationale, as I am obviously doing an XY problem here. That is true and intentional. You see, a little under a month ago this question was posted on the code golf stack exchange. It also got itself a nice bounty for a C/Assembly solution. I gave some idle thought to the problem and realized that by copying a function's body while stubbing out an address with some unique value, I could search its memory for that value and replace it with a valid address, thus allowing me to effectively create lambda functions that take a single pointer as an argument. Using this I could get single currying working, but I need the more general currying. Thus my current partial solution is linked here. This is the full code that exhibits the segfault I am trying to avoid. While this is pretty much the definition of a bad idea, I find it entertaining and would like to know whether my approach is viable. The only thing I'm missing is the ability to run a function created from a function, but I can't get that to work.
The code is using relative calls to invoke mmap and memcpy so the copied code ends up calling an invalid location.
You can invoke them through a pointer, e.g.:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#define BODY_SIZE 100
void* (*mmap_ptr)(void *addr, size_t length, int prot, int flags,
int fd, off_t offset) = mmap;
void* (*memcpy_ptr)(void *dest, const void *src, size_t n) = memcpy;
int f(void) { return 42; }
int (*G(void))(void) { return f; }
int (*(*H(void))(void))(void) { return G; }
int (*g(void))(void) {
    void *r = mmap_ptr(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy_ptr(r, f, BODY_SIZE);
    return r;
}
int (*(*h(void))(void))(void) {
    void *r = mmap_ptr(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy_ptr(r, g, BODY_SIZE);
    return r;
}
int main() {
    printf("%d\n", f());
    printf("%d\n", G()());
    printf("%d\n", g()());
    printf("%d\n", H()()());
    printf("%d\n", h()()()); // This one fails - why?
    return 0;
}
I'm trying to write a function that copies a function
I think that is pragmatically not the right approach, unless you know the machine code for your platform very well (and then you would not be asking the question). Be aware of position-independent code (relevant because in general mmap(2) is subject to ASLR, which adds some "randomness" to the addresses). BTW, genuine self-modifying machine code (i.e. changing some bytes of existing valid machine code) is cache- and branch-predictor-unfriendly today and should be avoided in practice.
I suggest two related approaches (choose one of them).
Generate some temporary C file (see also this), e.g. in /tmp/generated.c, then fork a compilation of it into a plugin using gcc -Wall -g -O -fPIC /tmp/generated.c -shared -o /tmp/generated.so, then dlopen(3) (for dynamic loading) that /tmp/generated.so shared-object plugin (and probably use dlsym(3) to find function pointers in it); a rough sketch of this approach follows the next item. For more about shared objects, read Drepper's How To Write Shared Libraries paper. Today, you can dlopen many hundreds of thousands of such shared libraries (see my manydl.c example), and C compilers (like recent GCC) are fast enough to compile a few thousand lines of code in a time compatible with interaction (e.g. less than a tenth of a second). Generating C code is a widely used practice. In practice you would represent some AST in memory of the generated C code before emitting it.
Use some JIT compilation library, such as GCCJIT, or LLVM, or libjit, or asmjit, etc.... which would generate a function in memory, do the required relocations, and give you some pointer to it.
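As a rough illustration of the first option, the sketch below writes a trivial function to /tmp/generated.c, shells out to gcc to build /tmp/generated.so, and loads it with dlopen(3)/dlsym(3). The file names and the generated function are placeholders, error handling is minimal, and the host program has to be linked with -ldl.
#include <dlfcn.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    /* 1. Emit some C code (a real program would build this from an AST). */
    FILE *src = fopen("/tmp/generated.c", "w");
    if (!src) { perror("fopen"); return 1; }
    fprintf(src, "int generated_add(int a, int b) { return a + b; }\n");
    fclose(src);

    /* 2. Compile it into a shared object. */
    if (system("gcc -Wall -g -O -fPIC -shared /tmp/generated.c -o /tmp/generated.so") != 0)
        return 1;

    /* 3. Load it and look up the symbol. */
    void *handle = dlopen("/tmp/generated.so", RTLD_NOW);
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    int (*generated_add)(int, int) =
        (int (*)(int, int))dlsym(handle, "generated_add");
    if (!generated_add) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    printf("generated_add(2, 3) = %d\n", generated_add(2, 3));
    dlclose(handle);
    return 0;
}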
BTW, instead of coding in C, you might consider using some homoiconic language implementation (such as SBCL for Common Lisp, which compiles to machine code at every REPL interaction, or any dynamically constructed S-expr program representation).
The notions of closures and of callbacks are worthwhile to know. Read SICP and perhaps Lisp In Small Pieces (and of course the Dragon Book, for general compiler culture).
this question was posted on code golf.SE
I updated the 8086 16-bit code-golf answer on the sum-of-args currying question to include commented disassembly.
You might be able to use the same idea in 32-bit code with a stack-args calling convention to make a modified copy of a machine code function that tacks on a push imm32. It wouldn't be fixed-size anymore, though, so you'd need to update the function size in the copied machine code.
In normal calling conventions, the first arg is pushed last, so you can't just append another push imm32 before a fixed-size call target / leave / ret trailer. If writing a pure asm answer, you could use an alternate calling convention where args are pushed in the other order. Or you could have a fixed-size intro, then an ever-growing sequence of push imm32 + call / leave / ret.
The currying function itself could use a register-arg calling convention, even if you want the target function to use i386 System V for example (stack args).
You'd definitely want to simplify by not supporting args wider than 32 bit, so no structs by value, and no double. (Of course you could chain multiple calls to the currying function to build up a larger arg.)
Given the way the new code-golf challenge is written, I guess you'd compare the total number of curried args against the number of args the target "input" function takes.
I don't think there's any chance you can make this work in pure C with just memcpy; you have to modify the machine code.
I am trying to understand the MPI function MPI_Fetch_and_op() through a small example, and I ran into a strange behaviour I would like to understand.
In the example the process with rank 0 is waiting till the processes 1..4 have each incremented the value of result by one before carrying on.
With the default value 0 for assert in MPI_Win_lock_all(), I sometimes (about 1 run in 10) get an infinite loop: the value of result[0] on the MASTER only ever reaches 3. The terminal output then looks like the following snippet:
result: 3
result: 3
result: 3
...
According to the documentation the function MPI_Fetch_and_op is atomic.
This operation is atomic with respect to other "accumulate"
operations.
First Question:
Why is it not updating the value of result[0] to 4?
If I change the value of assert to MPI_MODE_NOCHECK it seems to work
Second Question:
Why is it working with MPI_MODE_NOCHECK?
According to the documentation I thought this means the mutual exclusion has to be organized in a different way. Can someone explain the passage from the documentation of MPI_Win_lock_all()?
MPI_MODE_NOCHECK
No other process holds, or will attempt to acquire a conflicting lock, while the caller holds the window lock. This is useful when
mutual exclusion is achieved by other means, but the coherence
operations that may be attached to the lock and unlock calls are still
required.
Thanks in advance!
Example program:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#define MASTER 0
int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);
    printf("Hello from %d\n", r);
    int result[1] = {0};
    //int assert = MPI_MODE_NOCHECK;
    int assert = 0;
    int one = 1;
    MPI_Win win_res;
    MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0], &win_res);
    MPI_Win_lock_all(assert, win_res);
    if (r == MASTER) {
        result[0] = 0;
        do {
            MPI_Fetch_and_op(&result, &result , MPI_INT, r, 0, MPI_NO_OP, win_res);
            printf("result: %d\n", result[0]);
        } while (result[0] != 4);
        printf("Master is done!\n");
    } else {
        MPI_Fetch_and_op(&one, &result, MPI_INT, 0, 0, MPI_SUM, win_res);
    }
    MPI_Win_unlock_all(win_res);
    MPI_Win_free(&win_res);
    MPI_Finalize();
    return 0;
}
Compiled with the following Makefile:
MPICC = mpicc
CFLAGS = -g -std=c99 -Wall -Wpedantic -Wextra
all: fetch_and
fetch_and: main.c
$(MPICC) $(CFLAGS) -o $@ main.c
clean:
rm fetch_and
run: all
mpirun -np 5 ./fetch_and
Your code works for me, unchanged. But that may be coincidence. There are many problems with your code. Let me point out what I see:
You hard-coded the number of processes in the test result[0] != 4
You hard-coded the master value into MPI_Fetch_and_op(&one, &result, MPI_INT, 0
Passing the same address as update and result seems dangerous to me: MPI_Fetch_and_op(&result, &result
And my compiler complains about the first parameter since it is in effect an int** (actually int (*)[1])
I'm not sure why you don't get the same complaint on the second parameter,
but I'm not happy about that second parameter anyway, since the fetch operation writes into memory that you designated to be the window buffer. I guess the lack of coherence here saves you.
You initialize the window with result[0] = 0; but I don't think that is coherent with the window so again, you may just be lucky.
I would think that MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0] would also be some sort of memory corruption since result is an output here, but it is a statically allocated array.
Similarly, Win_free tries to deallocate the memory buffer, but that was, as already remarked, a static buffer, so again: memory corruption.
Your use of Win_lock_all is not appropriate: it means that one process locks the window on all targets. Without any competing locks!! You are locking the window on only one process, but from all possible origins. I'd use an ordinary lock.
Finally, RMA calls are non-blocking. Normally, consistency is ensured by a Win_fence or Win_unlock. But because you are using a long-lived lock, you need to follow the Fetch_and_op with an MPI_Win_flush_local.
Ok, so that's a dozen cases of, eh, less than ideal programming. Still, in my setup it works. (Sometimes. Sometimes it also hangs.) So you may want to clean up your code a little. Your logic is correct, but your actual implementation is not.
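Pulling those remarks together, one possible cleaned-up version is sketched below. It is only one way to arrange things (a plain exclusive MPI_Win_lock around each operation would also work), and it is untested on your setup: it uses the buffer handed back by MPI_Win_allocate, separate origin and result buffers, the real process count instead of a hard-coded 4, and an MPI_Win_flush_local after every MPI_Fetch_and_op.
#include <mpi.h>
#include <stdio.h>

#define MASTER 0

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);

    int *win_buf;                          /* memory owned by the window */
    MPI_Win win_res;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL, comm,
                     &win_buf, &win_res);

    MPI_Win_lock_all(0, win_res);
    *win_buf = 0;                          /* initialise my part of the window */
    MPI_Win_sync(win_res);                 /* make the store visible to RMA    */
    MPI_Barrier(comm);                     /* nobody starts before init is done */

    if (r == MASTER) {
        int snapshot = 0;
        const int dummy = 0;               /* unused origin buffer for MPI_NO_OP */
        while (snapshot != p - 1) {        /* p-1 workers, one increment each */
            MPI_Fetch_and_op(&dummy, &snapshot, MPI_INT,
                             MASTER, 0, MPI_NO_OP, win_res);
            MPI_Win_flush_local(MASTER, win_res);   /* snapshot now usable */
        }
        printf("Master is done: saw %d increments\n", snapshot);
    } else {
        const int one = 1;
        int old;                           /* separate result buffer */
        MPI_Fetch_and_op(&one, &old, MPI_INT, MASTER, 0, MPI_SUM, win_res);
        MPI_Win_flush_local(MASTER, win_res);
    }

    MPI_Win_unlock_all(win_res);
    MPI_Win_free(&win_res);                /* also frees win_buf */
    MPI_Finalize();
    return 0;
}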
Is there a simple but sure way to measure the relative differences in performance between two algorithm implementations in C programs? More specifically, I want to compare the performance of implementation A vs. implementation B. I'm thinking of a scheme like this:
In a unit test program:
start timer
call function
stop timer
get difference between start and stop time
Run the scheme above for a pair of functions A and B, then get a percentage difference in execution time to determine which is faster.
Upon doing some research, I came across this question about using a monotonic clock on OSX in C, which apparently can give me at least nanosecond precision. To be clear, I understand that precise, controlled measurements are hard to perform, like what's discussed in "With O(N) known and system clock known, can we calculate the execution time of the code?", but I assume that should be irrelevant in this case because I only want a relative measurement.
Everything considered, is this a sufficient and valid approach towards the kind of analysis I want to perform? Are there any details or considerations I might be missing?
The main modification I would make to the timing scheme you outline is to ensure that the same timing code is used for both functions, assuming they have an identical interface, by passing a function pointer to skeletal code.
As an example, I have some code that times some functions that validate whether a given number is prime. The control function is:
static void test_primality_tester(const char *tag, int seed, int (*prime)(unsigned), int count)
{
    srand(seed);
    Clock clk;
    int nprimes = 0;
    clk_init(&clk);
    clk_start(&clk);
    for (int i = 0; i < count; i++)
    {
        if (prime(rand()))
            nprimes++;
    }
    clk_stop(&clk);
    char buffer[32];
    printf("%9s: %d primes found (out of %d) in %s s\n", tag, nprimes,
           count, clk_elapsed_us(&clk, buffer, sizeof(buffer)));
}
I'm well aware of "srand() — why call it only once?", but the point of calling srand() once each time this function is called is to ensure that the tests process the same sequence of random numbers. On macOS, RAND_MAX is 0x7FFFFFFF.
The type Clock contains analogues to two struct timespec structures, for the start and stop times. The clk_init() function initializes the structure; clk_start() records the start time in the structure; clk_stop() records the stop time in the structure; and clk_elapsed_us() calculates the elapsed time between the start and stop times in microseconds. The package is written to provide me with cross-platform portability (at the cost of some headaches in determining which is the best sub-second timing routine available at compile time).
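Purely as an illustration of that interface, a minimal stand-in for the Clock type based on clock_gettime(CLOCK_MONOTONIC) could look like the following; it keeps the same four function names but it is not the implementation from the repository mentioned below.
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() with -std=c99 */
#include <stdio.h>
#include <time.h>

typedef struct Clock { struct timespec start, stop; } Clock;

static void clk_init(Clock *clk)  { clk->start.tv_sec = clk->stop.tv_sec = 0;
                                    clk->start.tv_nsec = clk->stop.tv_nsec = 0; }
static void clk_start(Clock *clk) { clock_gettime(CLOCK_MONOTONIC, &clk->start); }
static void clk_stop(Clock *clk)  { clock_gettime(CLOCK_MONOTONIC, &clk->stop); }

/* Elapsed time formatted as seconds with microsecond resolution. */
static char *clk_elapsed_us(Clock *clk, char *buffer, size_t buflen)
{
    long sec  = clk->stop.tv_sec  - clk->start.tv_sec;
    long nsec = clk->stop.tv_nsec - clk->start.tv_nsec;
    if (nsec < 0) { nsec += 1000000000L; sec -= 1; }
    snprintf(buffer, buflen, "%ld.%06ld", sec, nsec / 1000);
    return buffer;
}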
You can find my code for timers on Github in the repository https://github.com/jleffler/soq, in the src/libsoq directory — files timer.h and timer.c. The code has not yet caught up with macOS Sierra having clock_gettime(), though it could be compiled to use it with -DHAVE_CLOCK_GETTIME as a command-line compiler option.
This code was called from a function one_test():
static void one_test(int seed)
{
    printf("Seed: %d\n", seed);
    enum { COUNT = 10000000 };
    test_primality_tester("IsPrime1", seed, IsPrime1, COUNT);
    test_primality_tester("IsPrime2", seed, IsPrime2, COUNT);
    test_primality_tester("IsPrime3", seed, IsPrime3, COUNT);
    test_primality_tester("isprime1", seed, isprime1, COUNT);
    test_primality_tester("isprime2", seed, isprime2, COUNT);
    test_primality_tester("isprime3", seed, isprime3, COUNT);
}
And the main program can take one or a series of seeds, or uses the current time as a seed:
int main(int argc, char **argv)
{
    if (argc > 1)
    {
        for (int i = 1; i < argc; i++)
            one_test(atoi(argv[i]));
    }
    else
        one_test(time(0));
    return(0);
}
Consider the following snippet of C code:
int flag = 0;
/* Assume that the functions lock_helper, unlock_helper implement enter/leave on
 * a global mutex and thread_start_helper simply runs the function in separate
 * operating-system threads */
void worker1()
{
    /* Long-running job here */
    lock_helper();
    if (!flag)
        flag = 1;
    unlock_helper();
}
void worker2()
{
    /* Another long-running job here */
    lock_helper();
    if (!flag)
        flag = 2;
    unlock_helper();
}
int main(int argc, char **argv)
{
    thread_start_helper(&worker1);
    thread_start_helper(&worker2);
    do
    {
        /* doing something */
    } while (!flag);
    /* do something with 'flag' */
}
Questions:
Is it possible that 'flag' will always be 0 for the main thread (and it
becomes stuck in the do/while loop) due to some compiler optimization?
Will the 'volatile' modifier make any difference?
If the answer is 'depends on a feature provided by the compiler', is there any
way I can check for this 'feature' with a configuration script at
compile-time?
The code is likely to work as is, but is somewhat fragile. For one thing, it depends on the reads and writes to flag being atomic on the processor being used (and that flag's alignment is sufficient).
I would recommend either using a read lock to read the value of flag, or using functionality of whatever threading library you are using to make flag properly atomic.
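With a C11 compiler, the simplest form of that advice is to make the flag itself atomic. The sketch below only shows the shape of the change, not the poster's full program, and the helper names are placeholders of mine.
#include <stdatomic.h>

atomic_int flag = 0;   /* use ATOMIC_VAR_INIT(0) with a strict C11 compiler */

void worker1_flag_update(void)   /* replaces the locked section in worker1 */
{
    int expected = 0;
    /* only the first worker to get here wins; worker2 would pass 2 instead of 1 */
    atomic_compare_exchange_strong(&flag, &expected, 1);
}

int wait_for_flag(void)          /* main's polling loop */
{
    int v;
    while ((v = atomic_load(&flag)) == 0)
        ;  /* doing something */
    return v;
}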
Since you can assume that the load of an aligned int is an atomic operation, the only danger with your code is the optimizer: your compiler is allowed to optimize away all but the first read of flag within main(), i.e. to convert your code into
int main(int argc, char **argv)
{
    thread_start_helper(&worker1);
    thread_start_helper(&worker2);
    /* doing something */
    if (!flag) {
        while (1) { /* doing something */ }
    }
    /* This point is unreachable, and the following can be optimized away entirely. */
    /* do something with 'flag' */
}
There are two ways you can make sure that this does not happen: 1. make flag volatile, which is a bad idea because it includes quite a bit of unwanted overhead, and 2. introduce the necessary memory barriers. Due to the atomicity of reading an int and the fact that you only want to interprete the value of flag after it has changed, you should be able to get away with just a compiler barrier before the loop condition like this:
int main(int argc, char **argv)
{
    thread_start_helper(&worker1);
    thread_start_helper(&worker2);
    do
    {
        /* doing something */
        barrier();
    } while (!flag);
    /* do something with 'flag' */
}
The barrier() used here is very lightweight; it is the cheapest of all barriers available.
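barrier() is not a standard C function; with GCC and Clang it is conventionally defined as an empty asm statement with a memory clobber, which stops the compiler from caching flag in a register across it while emitting no machine instruction:
/* compiler-only barrier: no CPU fence, just forbids the compiler from
   reordering or caching memory accesses across this point (GCC/Clang) */
#define barrier() __asm__ __volatile__("" ::: "memory")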
This is not enough if you want to analyze any other data that is written before flag is raised, because you might still load stale data from memory (because the CPU decided to prefetch the value). For a comprehensive discussion of memory fences, their necessity, and their use, see https://www.kernel.org/doc/Documentation/memory-barriers.txt
Finally, you should be aware, that the other writer thread may modify flag at any time after the do{}while() loop exits. So, you should immediately copy its value to a shadow variable like this:
int myFlagCopy;
do
{
    /* doing something */
    barrier();
} while (!(myFlagCopy = flag));
/* do something with 'myFlagCopy' */
It is possible that the while loop is executed before the threads have run... you have to wait for the threads to execute first, using pthread_join().
nftw wants a parameter for the number of file handles to use, and doesn't seem to have a way to say 'as many as possible'. Specifying 255 seems to work on Linux, but fails on BSD. Apparently OPEN_MAX is the recommended solution on BSD, but I can't use that as it doesn't work on Linux.
Is there a portable equivalent of OPEN_MAX that will work on both Linux and BSD?
Alternatively, is there a portable number, some number large enough to not slow things down, that is portable for practical purposes (ideally specified in POSIX, or at least that will work on every Unix-like system with significant market share)?
Advanced Programming in the Unix Environment, 2nd Ed., gives us the following code, which should work everywhere. Though it is pretty clever, I think it is a little unfortunate that it doesn't also check the rlimits of the process, since rlimits can further constrain how many open files a process may use. That aside, here's the code from The Master:
#ifdef OPEN_MAX
static long openmax = OPEN_MAX;
#else
static long openmax = 0;
#endif
/*
 * If OPEN_MAX is indeterminate, we're not
 * guaranteed that this is adequate.
 */
#define OPEN_MAX_GUESS 256
long
open_max(void)
{
    if (openmax == 0) {      /* first time through */
        errno = 0;
        if ((openmax = sysconf(_SC_OPEN_MAX)) < 0) {
            if (errno == 0)
                openmax = OPEN_MAX_GUESS;   /* it's indeterminate */
            else
                err_sys("sysconf error for _SC_OPEN_MAX");
        }
    }
    return(openmax);
}
(err_sys() is provided in the apue.h header with the sources -- should be easy to code a replacement for your routine.)
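To address the rlimit caveat mentioned above, the sysconf() result can be clamped against the soft RLIMIT_NOFILE limit; a small sketch (mine, not from APUE) follows.
#include <sys/resource.h>
#include <unistd.h>

/* Smallest of sysconf(_SC_OPEN_MAX) and the process's soft RLIMIT_NOFILE. */
static long open_max_with_rlimit(void)
{
    long openmax = sysconf(_SC_OPEN_MAX);
    if (openmax < 0)
        openmax = 256;               /* indeterminate: fall back to a guess */

    struct rlimit rl;
    if (getrlimit(RLIMIT_NOFILE, &rl) == 0 &&
        rl.rlim_cur != RLIM_INFINITY && (long)rl.rlim_cur < openmax)
        openmax = (long)rl.rlim_cur;

    return openmax;
}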
See getdtablesize. It has a conformance note:
SVr4, 4.4BSD (the getdtablesize() function first appeared in 4.2BSD). It is not specified in POSIX.1-2001; portable applications should employ sysconf(_SC_OPEN_MAX) instead of this call.