Consistency of MPI_Fetch_and_op - c

I am trying to understand the MPI function MPI_Fetch_and_op() through a small example and ran into some strange behaviour I would like to understand.
In the example, the process with rank 0 waits until processes 1..4 have each incremented the value of result by one before carrying on.
With the default value 0 for assert in MPI_Win_lock_all(), I sometimes (about 1 run in 10) get an infinite loop: the value of result[0] on the MASTER only ever reaches 3. The terminal output looks like the following snippet:
result: 3
result: 3
result: 3
...
According to the documentation, the function MPI_Fetch_and_op is atomic:
This operation is atomic with respect to other "accumulate" operations.
First Question:
Why is it not updating the value of result[0] to 4?
If I change the value of assert to MPI_MODE_NOCHECK, it seems to work.
Second Question:
Why does it work with MPI_MODE_NOCHECK?
According to the documentation, I thought this means that mutual exclusion has to be organized in a different way. Can someone explain this passage from the documentation of MPI_Win_lock_all()?
MPI_MODE_NOCHECK
No other process holds, or will attempt to acquire a conflicting lock, while the caller holds the window lock. This is useful when
mutual exclusion is achieved by other means, but the coherence
operations that may be attached to the lock and unlock calls are still
required.
Thanks in advance!
Example program:
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MASTER 0

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;
    int r, p;
    MPI_Comm_rank(comm, &r);
    MPI_Comm_size(comm, &p);
    printf("Hello from %d\n", r);

    int result[1] = {0};
    //int assert = MPI_MODE_NOCHECK;
    int assert = 0;
    int one = 1;

    MPI_Win win_res;
    MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0], &win_res);
    MPI_Win_lock_all(assert, win_res);

    if (r == MASTER) {
        result[0] = 0;
        do {
            MPI_Fetch_and_op(&result, &result, MPI_INT, r, 0, MPI_NO_OP, win_res);
            printf("result: %d\n", result[0]);
        } while (result[0] != 4);
        printf("Master is done!\n");
    } else {
        MPI_Fetch_and_op(&one, &result, MPI_INT, 0, 0, MPI_SUM, win_res);
    }

    MPI_Win_unlock_all(win_res);
    MPI_Win_free(&win_res);
    MPI_Finalize();
    return 0;
}
Compiled with the following Makefile:
MPICC = mpicc
CFLAGS = -g -std=c99 -Wall -Wpedantic -Wextra

all: fetch_and

fetch_and: main.c
	$(MPICC) $(CFLAGS) -o $@ main.c

clean:
	rm fetch_and

run: all
	mpirun -np 5 ./fetch_and

Your code works for me, unchanged. But that may be coincidence. There are many problems with your code. Let me point out what I see:
You hard-coded the number of processes in the test result[0] != 4
You hard-coded the master value into MPI_Fetch_and_op(&one, &result, MPI_INT, 0
Passing the same address as update and result seems dangerous to me: MPI_Fetch_and_op(&result, &result
And my compiler complains about the first parameter since it is in effect an int** (actually int (*)[1])
I'm not sure why you don't get the same complaint on the second parameter, but I'm not happy about that second parameter anyway, since the fetch operation writes into memory that you designated to be the window buffer. I guess the lack of coherence here saves you.
You initialize the window with result[0] = 0; but I don't think that is coherent with the window so again, you may just be lucky.
I would think that MPI_Win_allocate(1 * sizeof(MPI_INT), sizeof(MPI_INT), MPI_INFO_NULL, comm, &result[0] would also be some sort of memory corruption since result is an output here, but it is a statically allocated array.
Similarly, Win_free tries to deallocate the memory buffer, but that was, as already remarked, a static buffer, so again: memory corruption.
Your use of Win_lock_all is not appropriate: it means that one process locks the window on all targets. Without any competing locks!! You are locking the window on only one process, but from all possible origins. I'd use an ordinary lock.
Finally, RMA calls are non-blocking. Normally, consistency is achieved by a Win_fence or Win_unlock. But because you are using a long-lived lock, you need to follow the Fetch_and_op with an MPI_Win_flush_local.
Ok, so that's a dozen cases of, eh, less than ideal programming. Still, in my setup it works. (Sometimes. Sometimes it also hangs.) So you may want to clean up your code a little. Your logic is correct, but your actual implementation is not.
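For what it's worth, here is a minimal sketch of how the same wait-for-workers logic could look with the points above addressed: the window buffer comes from MPI_Win_allocate itself, an ordinary per-epoch lock replaces the long-lived lock_all, the worker count is not hard-coded, and each fetch is completed before its result is read. The names win_buf, snapshot and dummy are mine, not the poster's.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    int r, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &r);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    int *win_buf;                       /* window memory allocated by MPI itself */
    MPI_Win win;
    MPI_Win_allocate(sizeof(int), sizeof(int), MPI_INFO_NULL,
                     MPI_COMM_WORLD, &win_buf, &win);

    if (r == 0) {                       /* initialise the counter inside an access epoch */
        MPI_Win_lock(MPI_LOCK_EXCLUSIVE, 0, 0, win);
        win_buf[0] = 0;
        MPI_Win_unlock(0, win);
    }
    MPI_Barrier(MPI_COMM_WORLD);        /* everyone waits for the initialisation */

    int one = 1, dummy = 0, snapshot = 0;
    if (r == 0) {
        do {                            /* poll the counter atomically */
            MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
            MPI_Fetch_and_op(&dummy, &snapshot, MPI_INT, 0, 0, MPI_NO_OP, win);
            MPI_Win_unlock(0, win);     /* completes the fetch */
        } while (snapshot != p - 1);    /* p - 1 workers instead of a hard-coded 4 */
        printf("Master is done: %d\n", snapshot);
    } else {
        MPI_Win_lock(MPI_LOCK_SHARED, 0, 0, win);
        MPI_Fetch_and_op(&one, &snapshot, MPI_INT, 0, 0, MPI_SUM, win);
        MPI_Win_unlock(0, win);
    }

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}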

Related

Application exit with code -8 before even tries to divide by 0?

I have code like this:
#include <stdio.h>
#include <stdint.h>

int main() {
    uint8_t result = 0;
    uint8_t a = 0;
    printf("Trying to divide by %d...\n", a);
    result = (1/a != 0);
    printf("%d\n", result);
    return 0;
}
I have tried this in an online compiler here: https://www.mycompiler.io/new/c
And this is the result:
[Execution complete with exit code -8]
I expected "Trying to divide by 0..." to be printed out, and then the application should exit.
But the application exits before it prints anything.
Is this normal C behavior or some glitch of the online compiler?
I can see two possible explanations here. (The exit code -8 itself most likely just means the process was killed by signal 8, SIGFPE, which is what an integer division by zero typically raises on x86.)
First, it's important to know that undefined behavior can "travel back in time": there is no guarantee that the program will run fine up to the problematic line. This may happen because the compiler might decide it is better to reorder things like this:
uint8_t a = 0;
uint8_t result = (1/a != 0);
printf("Trying to divide by %d...\n", a);
The compiler is allowed to do this, because it's allowed to assume that division by zero will never happen.
Second, it could be the case that the output needs to be flushed. Whenever you want to make sure that a printout gets visible on screen before doing something else, use fflush(stdout). It might make a difference if you do this:
printf("Trying to divide by %d...\n", a);
fflush(stdout);
result = (1/a != 0);
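For reference, here is the question's program with the flush added, assembled into one runnable file. Note that if the reordering explanation applies, the message may still not appear, but the buffering explanation is then ruled out:

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint8_t result = 0;
    uint8_t a = 0;
    printf("Trying to divide by %d...\n", a);
    fflush(stdout);          /* make sure the message reaches the terminal first */
    result = (1/a != 0);     /* still undefined behavior; typically raises SIGFPE */
    printf("%d\n", result);
    return 0;
}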

How to understand the type of storage of a pointer

I have this task as homework:
Given a void** ptr_addr write a function that return 0 if the type of storage of *ptr_addr is static or automatic and return 1 if the type of storage of *ptr_addr is dynamic.
The language of the code must be C.
The problem is that theoretically I know what the task is about, but I don't know how to check this condition in code.
Thanks for the help!
Normally I don't do homework, but in cases like this I may make an exception.
Bear in mind that what I'm about to present is horrible code. Also it doesn't meet your requirements as stated — you'll have to adapt it for that. Also it may not meet your instructor's expectations: for an instructor demented enough to be assigning this task, I can't begin to guess his (her? its?) expectations. You may get dinged for using the technique I've presented, or for presenting someone else's work. Also I'm going to get dinged for presenting this code here on Stack Overflow, because no, it's nothing like portable or guaranteed to do anything, let alone to work. I have no idea whether it'll work on your system.
Nevertheless, and may God help me, I tested it, and it does "work" on a modern Debian Linux system.
#include <unistd.h>

extern etext, edata, end;

char *
mcat(void *p)
{
    int dummy;
    if(p < &etext)
        return "text";
    else if(p < &edata)
        return "data";
    else if(p < &end)
        return "bss";
    else if(p < sbrk(0))
        return "heap";
    else if(p > &dummy)
        return "stack";
    else
        return "?";
}
You'll get a good number of warnings if you compile this, which could theoretically be silenced using some explicit casts, but I think the warnings are actually pretty appropriate, given the nefariousness of this code.
How it works: on at least some Unix-like systems, etext, edata, and end are magic symbols corresponding to the ends of the program's text, initialized data, and uninitialized data segments, respectively. sbrk(0) gives you a pointer to the top of the heap that a traditional implementation of malloc is using. And &dummy is a good approximation of the bottom of the stack.
Test program:
#include <stdio.h>
#include <stdlib.h>

int g = 2;
int g2;

int main()
{
    int l;
    static int s = 3;
    static int s2;
    int *p = malloc(sizeof(int));

    printf("g: %s\n", mcat(&g));
    printf("g2: %s\n", mcat(&g2));
    printf("main: %s\n", mcat(main));
    printf("l: %s\n", mcat(&l));
    printf("s: %s\n", mcat(&s));
    printf("s2: %s\n", mcat(&s2));
    printf("p: %s\n", mcat(p));
}
On my test system this prints
g: data
g2: bss
main: text
l: stack
s: data
s2: bss
p: heap
I'd like to post a different approach to solve the problem:
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/wait.h>

// this function returns 1 if ptr has been allocated by malloc/calloc/realloc, otherwise 0
int is_pointer_heap(void* ptr) {
    pid_t p = fork();
    if (p == 0) {
        (void) realloc(ptr, 1);
        exit(0);
    }
    int status;
    (void) waitpid(p, &status, 0);
    return (status == 0) ? 1 : 0;
}
I wrote this (bad) code very quickly (and there's a lot of room for improvement), but I tested it and it seems to work.
EXPLANATION: realloc() will crash your process if the argument passed to it is not a malloc/calloc/realloc-allocated pointer. Here we create a new child process, we let the child process call realloc(); if the child process crashes, we return 0, otherwise we return 1.
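As a quick way to exercise it, something like the following could be used. The variable names are mine, and, like the function itself, this behaviour is in no way guaranteed by the standard:

#include <stdio.h>
#include <stdlib.h>

int is_pointer_heap(void* ptr);   /* the function from above */

int main(void)
{
    int stack_var;
    static int static_var;
    int *heap_ptr = malloc(sizeof *heap_ptr);

    printf("stack:  %d\n", is_pointer_heap(&stack_var));   /* expected: 0 */
    printf("static: %d\n", is_pointer_heap(&static_var));  /* expected: 0 */
    printf("heap:   %d\n", is_pointer_heap(heap_ptr));     /* expected: 1 */

    free(heap_ptr);
    return 0;
}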

Executing code in mmap to produce executable code segfaults

I'm trying to write a function that copies a function (and ends up modifying its assembly) and returns it. This works fine for one level of indirection, but at two I get a segfault.
Here is a minimum (not)working example:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BODY_SIZE 100

int f(void) { return 42; }
int (*G(void))(void) { return f; }
int (*(*H(void))(void))(void) { return G; }

int (*g(void))(void) {
    void *r = mmap(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy(r, f, BODY_SIZE);
    return r;
}

int (*(*h(void))(void))(void) {
    void *r = mmap(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy(r, g, BODY_SIZE);
    return r;
}

int main() {
    printf("%d\n", f());
    printf("%d\n", G()());
    printf("%d\n", g()());
    printf("%d\n", H()()());
    printf("%d\n", h()()()); // This one fails - why?
    return 0;
}
I can memcpy into an mmap'ed area once to create a valid function that can be called (g()()). But if I try to apply the trick again (h()()()) it segfaults. I have confirmed that h correctly creates the copied version of g, but when I execute that copy I get a segfault.
Is there some reason why I can't execute code in one mmap'ed area from another mmap'ed area? From exploratory gdb-ing with x/i checks it seems like I can call down successfully, but when I return, the function I came from has been erased and replaced with 0s.
How can I get this behaviour to work? Is it even possible?
BIG EDIT:
Many have asked for my rationale, as I am obviously doing an XY problem here. That is true and intentional. You see, a little under a month ago this question was posted on the Code Golf Stack Exchange. It also got itself a nice bounty for a C/Assembly solution. I gave some idle thought to the problem and realized that by copying a function's body while stubbing out an address with some unique value, I could search its memory for that value and replace it with a valid address, thus allowing me to effectively create lambda functions that take a single pointer as an argument. Using this I could get single currying working, but I need the more general currying. Thus my current partial solution is linked here. This is the full code that exhibits the segfault I am trying to avoid. While this is pretty much the definition of a bad idea, I find it entertaining and would like to know whether my approach is viable or not. The only thing I'm missing is the ability to run a function created from a function, but I can't get that to work.
The code is using relative calls to invoke mmap and memcpy so the copied code ends up calling an invalid location.
You can invoke them through a pointer, e.g.:
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define BODY_SIZE 100

void* (*mmap_ptr)(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset) = mmap;
void* (*memcpy_ptr)(void *dest, const void *src, size_t n) = memcpy;

int f(void) { return 42; }
int (*G(void))(void) { return f; }
int (*(*H(void))(void))(void) { return G; }

int (*g(void))(void) {
    void *r = mmap_ptr(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy_ptr(r, f, BODY_SIZE);
    return r;
}

int (*(*h(void))(void))(void) {
    void *r = mmap_ptr(0, BODY_SIZE, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
    memcpy_ptr(r, g, BODY_SIZE);
    return r;
}

int main() {
    printf("%d\n", f());
    printf("%d\n", G()());
    printf("%d\n", g()());
    printf("%d\n", H()()());
    printf("%d\n", h()()()); // This one fails - why?
    return 0;
}
I'm trying to write a function that copies a function
I think that is pragmatically not the right approach, unless you know machine code for your platform very well (and then you would not ask the question). Be aware of position-independent code (relevant because in general mmap(2) uses ASLR and gives some "randomness" in the addresses). BTW, genuine self-modifying machine code (i.e. changing some bytes of some existing valid machine code) is today cache- and branch-predictor-unfriendly and should be avoided in practice.
I suggest two related approaches (choose one of them).
Generate some temporary C file (see also this), e.g. /tmp/generated.c, then fork a compilation of it into a plugin using gcc -Wall -g -O -fPIC /tmp/generated.c -shared -o /tmp/generated.so, then dlopen(3) (for dynamic loading) that /tmp/generated.so shared-object plugin (and probably use dlsym(3) to find function pointers in it). For more about shared objects, read Drepper's How To Write Shared Libraries paper. Today you can dlopen many hundreds of thousands of such shared libraries (see my manydl.c example), and C compilers (like recent GCC) are fast enough to compile a few thousand lines of code in a time compatible with interaction (e.g. less than a tenth of a second). Generating C code is a widely used practice. In practice you would represent some AST of the generated C code in memory before emitting it. (A minimal sketch of this approach follows after the next item.)
Use some JIT compilation library, such as GCCJIT, or LLVM, or libjit, or asmjit, etc.... which would generate a function in memory, do the required relocations, and give you some pointer to it.
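Here is the promised minimal sketch of the first (generate-and-dlopen) approach, under the assumption that writing and loading /tmp/generated.so is acceptable; the generated function name plus_const is made up for illustration, and error handling is kept short (link with -ldl on older glibc):

#include <stdio.h>
#include <stdlib.h>
#include <dlfcn.h>

int main(void)
{
    /* 1. emit some C code */
    FILE *src = fopen("/tmp/generated.c", "w");
    if (!src) { perror("fopen"); return 1; }
    fprintf(src, "int plus_const(int x) { return x + 17; }\n");
    fclose(src);

    /* 2. fork a compilation into a shared object */
    if (system("gcc -Wall -g -O -fPIC /tmp/generated.c -shared -o /tmp/generated.so") != 0)
        return 1;

    /* 3. load the plugin and look up the symbol */
    void *handle = dlopen("/tmp/generated.so", RTLD_NOW);
    if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }
    int (*plus_const)(int) = (int (*)(int)) dlsym(handle, "plus_const");
    if (!plus_const) { fprintf(stderr, "%s\n", dlerror()); return 1; }

    printf("plus_const(25) = %d\n", plus_const(25));  /* prints 42 */
    dlclose(handle);
    return 0;
}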
BTW, instead of coding in C, you might consider using some homoiconic language implementation (such as SBCL for Common Lisp, which compiles to machine code at every REPL interaction, or any dynamically constructed S-expr program representation).
The notions of closures and of callbacks are worthwhile to know. Read SICP and perhaps Lisp In Small Pieces (and of course the Dragon Book, for general compiler culture).
this question was posted on code golf.SE
I updated the 8086 16-bit code-golf answer on the sum-of-args currying question to include commented disassembly.
You might be able to use the same idea in 32-bit code with a stack-args calling convention to make a modified copy of a machine code function that tacks on a push imm32. It wouldn't be fixed-size anymore, though, so you'd need to update the function size in the copied machine code.
In normal calling conventions, the first arg is pushed last, so you can't just append another push imm32 before a fixed-size call target / leave / ret trailer. If writing a pure asm answer, you could use an alternate calling convention where args are pushed in the other order. Or you could have a fixed-size intro, then an ever-growing sequence of push imm32 + call / leave / ret.
The currying function itself could use a register-arg calling convention, even if you want the target function to use i386 System V for example (stack args).
You'd definitely want to simplify by not supporting args wider than 32 bit, so no structs by value, and no double. (Of course you could chain multiple calls to the currying function to build up a larger arg.)
Given the way the new code-golf challenge is written, I guess you'd compare the total number of curried args against the number of args the target "input" function takes.
I don't think there's any chance you can make this work in pure C with just memcpy; you have to modify the machine code.

How to use atomic variables in C?

I need to use an atomic variable in C, as this variable is accessed across different threads, and I don't want a race condition.
My code is running on CentOS. What are my options?
C11 atomic primitives
http://en.cppreference.com/w/c/language/atomic
_Atomic const int * p1; // p is a pointer to an atomic const int
const atomic_int * p2; // same
const _Atomic(int) * p3; // same
C11 threads (threads.h, used in the example below) were added in glibc 2.28. Tested in Ubuntu 18.04 (glibc 2.27) by compiling glibc from source: Multiple glibc libraries on a single host. Later also tested on Ubuntu 20.04 (glibc 2.31).
Example adapted from: https://en.cppreference.com/w/c/language/atomic
main.c
#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>

atomic_int acnt;
int cnt;

int f(void* thr_data)
{
    (void)thr_data;
    for(int n = 0; n < 1000; ++n) {
        ++cnt;
        ++acnt;
        // for this example, relaxed memory order is sufficient, e.g.
        // atomic_fetch_add_explicit(&acnt, 1, memory_order_relaxed);
    }
    return 0;
}

int main(void)
{
    thrd_t thr[10];
    for(int n = 0; n < 10; ++n)
        thrd_create(&thr[n], f, NULL);
    for(int n = 0; n < 10; ++n)
        thrd_join(thr[n], NULL);
    printf("The atomic counter is %u\n", acnt);
    printf("The non-atomic counter is %u\n", cnt);
}
Compile and run:
gcc -ggdb3 -O0 -std=c11 -Wall -Wextra -pedantic -o main.out main.c -pthread
./main.out
Possible output:
The atomic counter is 10000
The non-atomic counter is 8644
The non-atomic counter is very likely to be smaller than the atomic one due to racy access across threads to the non-atomic variable.
Disassembly analysis at: How do I start threads in plain C?
If you are using GCC on your CentOS platform, then you can use the __atomic built-in functions.
Of particular interest might be this function:
— Built-in Function: bool __atomic_always_lock_free (size_t size, void *ptr)
This built-in function returns true if objects of size bytes always generate lock free atomic instructions for the target architecture. size must resolve to a compile-time constant and the result also resolves to a compile-time constant.
ptr is an optional pointer to the object that may be used to determine alignment. A value of 0 indicates typical alignment should be used. The compiler may also ignore this parameter.
if (__atomic_always_lock_free (sizeof (long long), 0))
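To make that concrete, here is a rough sketch of the __atomic built-ins in use, incrementing a shared counter from a few POSIX threads (the names counter and worker are mine, not from the question); compile with -pthread:

#include <stdio.h>
#include <pthread.h>

static long counter = 0;

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; ++i)
        __atomic_fetch_add(&counter, 1, __ATOMIC_SEQ_CST);   /* atomic increment */
    return NULL;
}

int main(void)
{
    pthread_t t[4];
    for (int i = 0; i < 4; ++i)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < 4; ++i)
        pthread_join(t[i], NULL);

    long v = __atomic_load_n(&counter, __ATOMIC_SEQ_CST);    /* atomic read */
    printf("counter = %ld\n", v);                            /* expect 400000 */
    return 0;
}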
I am going to toss in my two cents in case someone benefits. Atomic operations are a major problem in Linux. I used gatomic.h at one time, only to find it gone. I see all kinds of different atomic options of either questionable reliability or availability, and I see things changing all the time. They can be complex, with tests needed by O/S level, processor, whatever. You can use a mutex -- not only complex but dreadfully slow.
Although perhaps not ideal for threads, this works great for atomic operations on shared-memory variables. It is simple, it works on every O/S and processor and configuration known to man (or woman), it is dead reliable, it is easy to code, and it will always work.
Any code can be made atomic with a simple primitive -- a semaphore. It is something that is true/false, 1/0, yes/no, locked/unlocked -- binary.
Once you establish the semaphore:
set semaphore //must be atomic
do all the code you like which will be atomic as the semaphore will block for you
release semaphore //must be atomic
Relatively straightforward, except for the "must be atomic" lines.
It turns out that you can easily assign your semaphores a number (I use a define so they have a name, like "#define OPEN_SEM 1" and "#define CLASS_SEM 2" and so forth).
Find out your largest number, and when your program initializes, open a file in some directory (I use one just for this purpose). If it's not there, create it:
if (ablockfd < 0) {   //ablockfd is static in case you want to
                      //call it over and over
    char *get_sy_path();
    char lockname[100];
    strcpy(lockname, get_sy_path());
    strcat(lockname, "/metlock");
    ablockfd = open(lockname, O_RDWR);
    //error code if ablockfd bad
}
Now to gain a semaphore:
Now use your semaphore number to "lock" a "record" in your file of length one byte. Note -- the file will never actually occupy disk space and no disk operation occurs.
//sem_id is passed in and is set from OPEN_SEM or CLASS_SEM or whatever you call your semaphores.
lseek(ablockfd, sem_id, SEEK_SET);   //seeks to the bytes in file of
                                     //your semaphore number
result = lockf(ablockfd, F_LOCK, 1);
if (result != -1) {
    //got the semaphore
} else {
    //failed
}
To test if the semaphore is held:
result = lockf(ablockfd, F_TEST, 1); //after same lseek
To release the semaphore:
result = lockf(ablockfd, F_ULOCK, 1); //after same lseek
And all the other things you can do with lockf -- blocking/non-blocking, etc.
Note -- this is WAY faster than a mutex, it goes away if the process dies (a good thing), it is simple to code, and I know of no operating system, with any processor, with any number of cores, that cannot atomically lock a record... so it is simple code that just works. The file never really exists (no bytes, just a directory entry), and there seems to be no practical limit to how many semaphores you may have. I have used this for years on machines with no easy atomic solutions.
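Pulling the fragments above together into one self-contained sketch (the lock-file path /tmp/metlock and the wrapper names sem_lock/sem_unlock are mine, not part of the original description):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

#define OPEN_SEM  1
#define CLASS_SEM 2

static int ablockfd = -1;

static int sem_init_file(const char *path)
{
    ablockfd = open(path, O_RDWR | O_CREAT, 0666);  /* create the lock file if missing */
    return (ablockfd < 0) ? -1 : 0;
}

static int sem_lock(int sem_id)          /* blocks until the byte is locked */
{
    lseek(ablockfd, sem_id, SEEK_SET);
    return lockf(ablockfd, F_LOCK, 1);
}

static int sem_unlock(int sem_id)
{
    lseek(ablockfd, sem_id, SEEK_SET);
    return lockf(ablockfd, F_ULOCK, 1);
}

int main(void)
{
    if (sem_init_file("/tmp/metlock") != 0) { perror("open"); return 1; }

    sem_lock(OPEN_SEM);
    /* ... critical section protected by OPEN_SEM ... */
    sem_unlock(OPEN_SEM);
    return 0;
}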

MPI simple data transfer program

I'm supposed to send an integer from one processor to another, and it is to be done on a shell server at my university...
First I created my solution code, which looks like this (at least I think so...):
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int currentRank = -1;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &currentRank);

    if(currentRank == 0) {
        int numberToSend = 1;
        MPI_Send(&numberToSend, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    else if(currentRank == 1) {
        int recivedNumber;
        MPI_Recv(&recivedNumber, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Recived number = %d\n", recivedNumber);
    }

    MPI_Finalize();
    return 0;
}
Then I should create some name.pbs file... and run it. And I can't understand how to specify the number of processors... I tried the following:
#PBS -l nodes=2:ppn=2
#PBS -N cnt
#PBS -j oe
mpiexec ~/mpi1
But later on I still have no idea what to do with this in PuTTY. The qstat command seems to do nothing... only with qstat -Q or -q does it show me some 'statistics', but there are 0 values everywhere... it's my first program in MPI and I really don't understand it at all...
And when I try to run my program I get:
164900#halite:~$ ./transfer1
Fatal error in MPI_Send: Invalid rank, error stack:
MPI_Send(174): MPI_Send(buf=0x7fffd28ec640, count=1, MPI_INT, dest=1, tag=0, MPI_COMM_WORLD) failed
MPI_Send(99).: Invalid rank has value 1 but must be nonnegative and less than 1
Can anyone explain to me how to run this on the server?
The example code works fine here, tested with OpenMPI and GCC.
The problem is that when you run the code you need to specify the number of cores to your mpirun instance. You may have correctly allocated them using Torque or whatever scheduler you are using, but you are running the compiled code as if it were serial; you need to run it with MPI. Here's an example with the associated output:
mpirun -np 2 ./example
Recived number = 1
With a different scheduler (Hydra, PBS) or a different MPI version you need to follow the same pattern as above and specify the number of cores to your mpirun command.
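In the .pbs job script that means passing the process count to mpirun (or mpiexec) as well. A minimal sketch, assuming the compiled executable is ~/mpi1 and the scheduler grants the 2x2 slots requested above (a Torque-aware MPI may also pick the count up from $PBS_NODEFILE automatically):

#PBS -l nodes=2:ppn=2
#PBS -N cnt
#PBS -j oe
cd $PBS_O_WORKDIR
mpirun -np 4 ~/mpi1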
