Why is my function static variable never different despite being incremented? - c

I am writing a callback function in C. It is intended to initialise an I2C sensor, and it is called at the conclusion of each (split-phase) configuration step; after the 9th call, the device is almost ready to use.
The basic idea of the function is this:
void callback(void)
{
    static uint8_t calls = 0;
    if (++calls == 9) {
        // Finalise device setup (literally a single line of code)
    }
}
My problem is that the above if statement is never being entered, despite the function being called 9 times.
The (dis)assembly code for my function seems sane (with the exception of the subi r24, 0xFF trick for the increment, despite the availability of an inc instruction):
00000a4c <callback>:
a4c: 80 91 9e 02 lds r24, 0x029E
a50: 8f 5f subi r24, 0xFF ; 255
a52: 80 93 9e 02 sts 0x029E, r24
a56: 89 30 cpi r24, 0x09 ; 9
a58: 09 f0 breq .+2 ; 0xa5c <callback+0x10>
a5a: 08 95 ret
a5c: 2e e1 ldi r18, 0x1E ; 30
a5e: 35 e0 ldi r19, 0x05 ; 5
a60: 41 e0 ldi r20, 0x01 ; 1
a62: 60 e0 ldi r22, 0x00 ; 0
a64: 84 e7 ldi r24, 0x74 ; 116
a66: 0c 94 c7 02 jmp 0x58e ; 0x58e <twi_set_register>
I am writing the code for an Atmel AVR chip, and thus compiling with avr-gcc. I have no meaningful code debugging capabilities (I don't have access to a JTAG programmer, and the function is asynchronous/split-phase in any case; USART printing is too slow).
However, I have access to a logic analyser, and have been able to determine a number of things by placing while (1) ; statements inside the code:
the function is called - if I place an infinite loop at the start of the function, the microcontroller hangs
the function should be called 9 times - the trigger for the function is an I2C communication, and in the previous step it hangs immediately after the first communication; I can observe 9 complete and valid I2C communications
calls is incremented within the function - if I add if (calls == 0) { while (1) ; } after the increment, it does not hang
calls is never non-zero at the start of the function - if I add if (calls) { while(1) ; } before the increment, it does not hang
I'm completely at a loss for ideas.
Does anyone have any suggestions as to what could cause this, or even for new debugging steps I could take?

I ended up finding the cause of the error; another subsystem was breaking as a side-effect of the first call to the callback function, meaning that no other calls succeeded.
This explains the behaviours I saw:
it hung the first time because it was actually being called
it didn't hang the second time (or any future time) because it was only being called once
the I2C transactions I was observing were occurring, but their callback mechanism was not operating correctly, due to the other subsystem (tasks) breaking
I was able to work this out by using a few GPIO pins as debugging toggles, and thus tracking how the call was progressing through the TWI interface.
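The toggle trick, roughly, was the following (only a sketch: it assumes an ATmega-style part where writing a 1 to PINx toggles the output pin, and two spare pins PB0/PB1; adjust the port and pin names to your board):
#include <avr/io.h>

/* PB0 toggles on every callback entry, PB1 only on the finalise branch;
   both can be watched on the logic analyser next to the I2C lines. */
static inline void debug_pins_init(void)
{
    DDRB |= _BV(PB0) | _BV(PB1);   /* configure both pins as outputs */
}

void callback(void)
{
    static uint8_t calls = 0;

    PINB = _BV(PB0);               /* toggle: marks every entry */

    if (++calls == 9) {
        PINB = _BV(PB1);           /* toggle: marks the finalise branch */
        // Finalise device setup
    }
}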
Thanks for the help guys. This isn't really an answer to the original question as posed, but it is solved, so that's something :)

From what you say I can only think of three possibilities: 1) your assumption that the function is called on every I2C communication is incorrect; 2) your program has a bug (maybe an out-of-bounds write) in some unrelated function which corrupts the variable calls; or 3) two or more threads are calling your function simultaneously and calls is being incremented differently than you expect. Use > instead of ==; if that solves the problem, then you are running in a multi-threaded environment and didn't know it.
You need an accurate way to know the exact value of calls. If you don't have a debugger and don't have the means to output text either, the only thing you have left to play with is time. I don't know your compiler, but I am sure it provides some useful delay functions, so what I would do is delay before the increment for 10+calls seconds, and after the increment for another 10+calls seconds, for example:
sleep(1000 * (10 + calls));   /* assume sleep() here is a millisecond delay (e.g. _delay_ms on AVR), so this is 10+calls seconds */
++calls;
sleep(1000 * (10 + calls));
if (calls > 8) {
    // your actual code
}
I would (chronometer in hand) expect a delay of (10+0) + (10+1) = 21 seconds on the first call, 23 seconds on the second call, 25 on the third, and so on. That way I could be sure the value of calls started at 0 and was then progressively incremented up to 9.
Also, you must test for what you expect not for what you don't expect, so instead of doing this:
++calls;
if (calls==0) while (1);
do this:
++calls;
if (calls==1) while (1);
that way, if your program hangs, you can be sure the value of calls is exactly 1, and not just anything different from zero. If you count one valid I2C communication and your program hangs, then the transition from 0 to 1 happened correctly, so change the hang statement accordingly:
++calls;
if (calls==2) while (1);
Again, if you count 2 valid I2C communications before your program hangs that means that the transition from 1 to 2 was correct, and so on.
Another suggestion, try this:
uint8_t Times(void){
    static uint8_t calls = 0;
    return ++calls;
}
void callback(void){
    if (Times() > 8) {
        // Your actual code
    }
}
And this:
void callback(void){
    static uint8_t calls = 0;
    if (calls++ > 7) {
        // some code.
    }
}
Hope this helps.

Related

single-reader single-writer fixed-size ringbuf, without lock or atomic variable, is it always safe for any CPU arch?

I was asked to wrap a blocking call into a non-blocking one during an interview today.
So we (the interviewer and I) decided to achieve that by adding a background thread inside the non-blocking API.
Here is the code I wrote:
30 #define ARRAY_SIZE(x) (sizeof(x)/sizeof(x[0]))
31
32 struct SensorReading records[60*10] = {{0}};
33 size_t head = 0;
34
35
36 void * worker_thread(void *arg) {
37     while (1) {
38         size_t idx = (head + 1) % ARRAY_SIZE(records);
39         records[idx] = read_next_sample();
40         head = idx;
41     }
42 }
43
44 float get_most_recent_lux() {
45     static pthread_t worker = -1;
46     if (-1 == worker) {
47         struct SensorReading r = read_next_sample(); // This is the blocking call
48         records[0] = r;
49         if (-1 == pthread_create(&worker, NULL, worker_thread, NULL)) {
50             // error handling
51         }
52         return r.lux;
53     }
54     return records[head].lux;
55 }
Let me explain a little bit here:
read_next_sample() is the blocking call provided;
line 44 get_most_recent_lux() is the wrapped non-blocking API I need to provide.
Internally, it starts a thread which executes the worker_thread() function defined in line 36.
worker_thread() keeps calling the blocking call and writes data to a ringbuf.
So the reader can read the most recent record from the ring buffer.
Also please note:
The programming language used here is C, not C++.
This is a single-reader single-writer case.
This is different from the producer-consumer problem, since the wrapped API get_most_recent_lux() should always return the most recent data.
Since this is a single reader single writer case, I believe:
No lock required here.
No atomic values required here.
(so head in line 33 is not declared as an atomic variable,
and I use a plain assignment (head = idx) in line 40).
Question: Is my above statement correct?
The interviewer keeps telling me that my statement is not correct for all CPU arch, so he believes mutex or atomic variable is required here.
But I don't think so.
I believe that, indeed, the single-line C assignment (head = idx) can be translated into multiple assembly instructions, but only the last instruction stores the updated value to memory. So:
before the last instruction has executed, the updated value has not reached memory yet, so the reader will always read the old head value.
after the last instruction has executed, the reader will always read the updated head value.
in both cases it is safe; no corruption will happen.
There are no other possibilities. Within a time window where only one write can happen (say, a change from 1 to 2), the reader can only read either 1 or 2; it will never read any other value, like 0, 3, or 1.5.
Agree?
I really can not believe there is any CPU arch that the code won't work.
If there is, please educate me.
Thanks very much.
You don't need any atomic RMWs or seq_cst, but you do need _Atomic to do release-store and acquire-load to/from head.
That only happens for free on x86 (and SPARC), not other ISAs, and is still unsafe vs. compile-time reordering even when targeting x86. head = idx; could become visible to the other core before the updates to records[idx], letting it read the stale values.
(Well, the records[head].lux load part will actually work on most ISAs like mo_consume, since ISAs other than DEC Alpha guarantee dependency ordering for loads.)
I think there are some other similar Q&As on SO about trying to use non-atomic variables for inter-thread communication. There's little point, just use atomic_store_explicit with memory_order_release - it will compile the same on x86 as a non-atomic store, but with compile time ordering guarantees. So you're not gaining efficiency by avoiding stdatomic.h, if you only use acquire and release. (Except for loads - if you want actual dependency-ordering without a barrier on weakly-ordered ISAs, you have to use relaxed under controlled conditions on weakly-ordered ISAs, because consume is currently semi-deprecated and promotes to acquire in current compilers.) See When to use volatile with multi threading? for more about why hand-rolled atomics work and why they have no advantage.
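A minimal sketch of that release/acquire version (SensorReading, records, ARRAY_SIZE and read_next_sample are the question's names, re-declared here only so the sketch stands alone; the one-time thread creation is left out):
#include <stdatomic.h>
#include <stddef.h>

#define ARRAY_SIZE(x) (sizeof(x)/sizeof((x)[0]))
struct SensorReading { float lux; };
struct SensorReading read_next_sample(void);             /* the blocking call */

static struct SensorReading records[60*10];
static _Atomic size_t head = 0;                           /* shared index */

void * worker_thread(void *arg) {
    size_t idx = 0;                                       /* writer-private copy */
    (void)arg;
    for (;;) {
        idx = (idx + 1) % ARRAY_SIZE(records);
        records[idx] = read_next_sample();
        /* release: the record written above becomes visible before the new head */
        atomic_store_explicit(&head, idx, memory_order_release);
    }
}

float get_most_recent_lux(void) {
    /* acquire: pairs with the writer's release store */
    size_t h = atomic_load_explicit(&head, memory_order_acquire);
    return records[h].lux;
}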
Also, note that you have no protection against the queue becoming full and overwriting values that haven't been read yet. SPSC queues like this typically have the consumer side update a read index that the writer can check.
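For completeness, the usual SPSC guard against that looks roughly like this (hypothetical names, not the question's code): the consumer publishes read_idx, and the producer refuses to clobber slots that haven't been consumed yet.
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define QSIZE 64
struct Sample { float lux; };
static struct Sample queue[QSIZE];
static _Atomic size_t write_idx = 0, read_idx = 0;

bool try_push(struct Sample s) {                     /* producer only */
    size_t w = atomic_load_explicit(&write_idx, memory_order_relaxed);
    size_t r = atomic_load_explicit(&read_idx, memory_order_acquire);
    if (w - r == QSIZE)                              /* full: don't overwrite unread data */
        return false;
    queue[w % QSIZE] = s;
    atomic_store_explicit(&write_idx, w + 1, memory_order_release);
    return true;
}

bool try_pop(struct Sample *out) {                   /* consumer only */
    size_t r = atomic_load_explicit(&read_idx, memory_order_relaxed);
    size_t w = atomic_load_explicit(&write_idx, memory_order_acquire);
    if (r == w)                                      /* empty */
        return false;
    *out = queue[r % QSIZE];
    atomic_store_explicit(&read_idx, r + 1, memory_order_release);
    return true;
}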

Dynamic C code execution: memory references

tl;dr: I'm trying to execute some code dynamically from another snippet, but I am stuck on handling memory references (e.g. mov $0x40200b,%rdi): can I patch my code, or the snippet-running code, so that 0x40200b is resolved correctly (as offset 0x200b from the beginning of the code)?
To generate the code to be executed dynamically I start from a (kernel) object and I resolve the references using ld.
#!/usr/bin/python
import os, subprocess

if os.geteuid() != 0:
    print('Run this as root')
    exit(-1)

with open("/proc/kallsyms", "r") as f:
    out = f.read()

sym = subprocess.Popen(['nm', 'ebbchar.ko', '-u', '--demangle', '-fposix'], stdout=subprocess.PIPE)
v = ''
for sym in sym.stdout:
    s = " " + sym.split()[0] + "\n"
    off = out.find(s)
    v += "--defsym " + s.strip() + "=0x" + out[off-18:off-2] + " "
print(v)
os.system("ld ebbchar.ko " + v + "-o ebbchar.bin")
I then transmit the code to be executed through an mmapped file:
int fd = open(argv[1], O_RDWR | O_SYNC);
address1 = mmap(NULL, page_size, PROT_WRITE|PROT_READ , MAP_SHARED, fd, 0);
int in=open(argv[2],O_RDONLY);
sz= read(in, buf+8,BUFFER_SIZE-8);
uint64_t entrypoint=atol(argv[3]);
*((uint64_t*)buf)=entrypoint;
write(fd, buf, min(sz+8, (size_t) BUFFER_SIZE));
I execute the code dynamically with this code:
struct mmap_info *info;
copy_from_user((void*)(&info->offset), buf, 8);
copy_from_user(info->data, buf+8, sz-8);
unsigned long (*func)(void) = (void*) (info->data + info->offset);
int ret = func();
This approach works for code that doesn't access memory, such as "\x55\x48\x89\xe5\xc7\x45\xf8\x02\x00\x00\x00\xc7\x45\xfc\x03\x00\x00\x00\x8b\x55\xf8\x8b\x45\xfc\x01\xd0\x5d\xc3", but I have problems when memory is involved.
See example below.
Let's assume I want to execute the function vm_close dynamically. objdump -d -S returns:
0000000000401017 <vm_close>:
{
401017: e8 e4 07 40 81 callq ffffffff81801800 <__fentry__>
printk(KERN_INFO "vm_close");
40101c: 48 c7 c7 0b 20 40 00 mov $0x40200b,%rdi
401023: e9 b6 63 ce 80 jmpq ffffffff810e73de <printk>
At execution, my function pointer points to the right code:
(gdb) x/12x $rip
0xffffc90000c0601c: 0x48 0xc7 0xc7 0x0b 0x20 0x40 0x00 0xe9
0xffffc90000c06024: 0xb6 0x63 0xce 0x80
(gdb) x/2i $rip
=> 0xffffc90000c0601c: mov $0x40200b,%rdi
0xffffc90000c06023: jmpq 0xffffc8ff818ec3de
BUT, this code will fail since:
1) In my context, $0x40200b refers to the absolute address 0x40200b, and not to offset 0x200b from the beginning of the code.
2) I don't understand why, but the address displayed there is actually different from the correct one (0xffffc8ff818ec3de != ffffffff810e73de), so it won't point at my symbol and will crash.
Is there a way to solve my 2 issues?
Also, I had trouble finding good documentation related to my issue (low-level memory resolution); if you could point me to some, that would really help.
Edit: since I run the code in the kernel, I cannot simply compile it with -fPIC or -fpie; gcc does not allow it (cc1: error: code model kernel does not support PIC mode).
Edit 24/09:
According to Peter Cordes' comment, I recompiled it adding -mcmodel=small -fpie -mno-red-zone -mno-sse to the Makefile (/lib/modules/$(uname -r)fixed/build/Makefile).
This is better than in the original version since the generated code before linking is now:
0000000000000018 <vm_close>:
{
18: ff 15 00 00 00 00 callq *0x0(%rip) # 1e <vm_close+0x6>
printk(KERN_INFO "vm_close");
1e: 48 8d 3d 00 00 00 00 lea 0x0(%rip),%rdi # 25 <vm_close+0xd>
25: e8 00 00 00 00 callq 2a <vm_close+0x12>
}
2a: c3 retq
So thanks to RIP-relative addressing, after linking I can successfully access my variable embedded within the buffer:
40101e: 48 8d 3d e6 0f 00 00 lea 0xfe6(%rip),%rdi # 40200b
Still, one problem remains:
The symbol I want to access (printk) and my executable buffer live in very distant parts of the address space, for example:
printk=0xffffffff810e73de:
Executable_buffer=0xffffc9000099d000
But in my callq to printk, I only have 32 bits in which to encode the address to call, as an offset from $rip, since there is no .got section in the kernel. This means that printk has to be located within [$rip - 2 GiB, $rip + 2 GiB], and that is not the case here.
Do I have a way to reach the printk address even though it is located more than 2 GiB away from my buffer (I tried mcmodel=medium but saw no difference in the generated code), for instance by modifying gcc options so that the binary actually has a .got section?
Or is there a reliable way to force my executable (and potentially too-large-for-kmalloc) buffer to be allocated in the [0xffffffff00000000; 0xffffffffffffffff] range? (I currently use __vmalloc(BUFFER_SIZE, GFP_KERNEL, PAGE_KERNEL_EXEC);)
Edit 27/09:
I succeeded in allocating my buffer in the [0xffffffff00000000; 0xffffffffffffffff] range by using the non-exported __vmalloc_node_range function as a (dirty) hack.
IMPORTED(__vmalloc_node_range)(BUFFER_SIZE, MODULE_ALIGN,
MODULES_VADDR + get_module_load_offset(),
MODULES_END, GFP_KERNEL,
PAGE_KERNEL_EXEC, 0, NUMA_NO_NODE,
__builtin_return_address(0));
Then, when I know the address of my executable buffer and the addresses of the kernel symbols (by parsing /proc/kallsyms), I can patch my binary using ld's --defsym symbol=relative_address option, where relative_address = symbol_address - buffer_offset.
Despite being extremely dirty, this approach actually works.
But I need to relink my binary each time I execute it since the buffer may (and will) be allocated at a different address. To solve this issue, I think the best way would be to build my executable as a real position independent executable so that I can just patch the global offset table and not fully relink the module.
But with the options provided there I got a rip-relative address but no got/plt. So I'd like to find a way to build my module as a proper PIE.
This post is getting huge, messy and we are deviating from the original question. Thus, I opened a new simplified post there. If I get interesting answers, I'll edit this post to explain them.
Note: for the sake of simplicity, safety checks are not shown here.
Note 2: I am perfectly aware that my PoC is very unusual and can be a bad practice but I'd like to do it anyway.

What is "Dummy for Loop" in C?

One of my friends asked me this question while we were travelling back home, and he had heard about it from his teacher. First he told me it is something like an infinite loop with no body; hence I posted this question here asking whether an infinite loop is called a dummy for loop.
I haven't seen anything called a "dummy for" in any of the books I have read, nor on the internet. Later he told me that for loops with an empty body are known as dummy loops, like this one:
for (int i = 0; i < 10; ++i);
Such loops are quite helpful in some cases. I was more eager to know whether this term really exists in the software development field, or whether it's just a name given by an individual.
Sorry for the delay in editing the details.
It is an infinite loop with no body. The purpose would be to make sure the program doesn't end, yet does nothing. It is also known as an idle loop.
That doesn't mean nothing will ever happen. There could still be interrupts occurring, which make the cpu jump out of the loop, and return to the loop after the interrupt service routine has finished.
Often this technique is used in embedded programming, where inputs and timers trigger interrupts, in which all of the activity occurs. When no interrupt is being serviced, we want to ensure the program doesn't just run off the end of the code.
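A minimal sketch of that pattern, assuming an ATmega328-style AVR and avr-libc (register names differ on other parts): all of the activity lives in the timer ISR, and main() parks the CPU in the empty idle loop.
#include <avr/io.h>
#include <avr/interrupt.h>
#include <stdint.h>

volatile uint8_t ticks = 0;

ISR(TIMER0_OVF_vect)          /* all of the real work happens here */
{
    ticks++;
}

int main(void)
{
    TCCR0B |= _BV(CS02);      /* start Timer0 with a /256 prescaler */
    TIMSK0 |= _BV(TOIE0);     /* enable the Timer0 overflow interrupt */
    sei();                    /* enable interrupts globally */

    for (;;)                  /* the idle ("dummy") loop: empty body */
        ;
}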
Infinite loop: this keeps running the for loop forever.
Just to mention one use of it: there might be a case where the user keeps inputting some data and, when a particular match is found, you need to break out of it.
Similar to
while(1)
{
}
Example:
while(1)
{
    printf("Enter a number\n");
    scanf("%d", &a);
    flag = 0;
    switch(a)
    {
        case 1:
            printf("hi\n");
            break;
        case 2:
            flag = 1;
            break;
    }
    if (flag)
        break;
}
printf("Out of while loop\n");
The loop you described, for( ; ; );, is an infinite loop, as it has nothing defined in it.
A for(initialization; condition; increment/decrement); followed directly by a semicolon is called a dummy for loop.
It works like this:
a lone ';' is a null statement, i.e. a statement that does nothing.
When we add a semicolon right after the for loop, it acts as the body of the for loop, so the loop body does nothing.
Now if there are expressions in the for loop, say for (i = 0; i < 10; i++); then it will run 10 times and do nothing each time.
If there are no expressions, for( ; ; );, then it is an infinite loop that does nothing.
It is called "dummy" because its body, the lone semicolon, does nothing.
Given dummyloop.c
int main(int args, char ** pargs)
{
    for(;;);
}
gcc -g -c dummyloop.c
objdump -d -M intel -S dummyloop.o
dummyloop.o:     file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <main>:
int main(int args, char ** pargs)
{
0: 55 push rbp
1: 48 89 e5 mov rbp,rsp
4: 89 7d fc mov DWORD PTR [rbp-0x4],edi
7: 48 89 75 f0 mov QWORD PTR [rbp-0x10],rsi
for(;;);
b: eb fe jmp b <main+0xb>
The jmp b at offset b is the infinite loop.
An actively spinning infinite loop like this is usually something we would like to avoid.

Process Coredumped but does not look like illegal reference in a multithreaded program

Backtrace of the coredump:
#0 0x0000000000416228 in add_to_epoll (struct_fd=0x18d32760, lno=7901) at lbi.c:7092
#1 0x0000000000418b54 in connect_fc (struct_fd=0x18d32760, type=2) at lbi.c:7901
#2 0x0000000000418660 in poll_fc (arg=0x0) at lbi.c:7686
#3 0x00000030926064a7 in start_thread () from /lib64/libpthread.so.0
#4 0x0000003091ed3c2d in clone () from /lib64/libc.so.6
Code Snippet:
#define unExp(x) __builtin_expect((x),0)
...
7087 int add_to_epoll( struct fdStruct * struct_fd, int lno)
7088 {
7089 struct epoll_event ev;
7090 ev.events = EPOLLIN | EPOLLET | EPOLLPRI | EPOLLERR ;
7091 ev.data.fd = struct_fd->fd;
7092 if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd, EPOLL_CTL_ADD, struct_fd->fd,&ev) == -1))
7093 {
7094 perror("client FD ADD to epoll error:");
7095 return -1;
7096 }
7097 else
7098 {
...
7109 }
7110 return 1;
7111 }
Disassembly of the offending line. I am not good at interpreting assembly code but have tried my best:
if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd, EPOLL_CTL_ADD, struct_fd->fd,&ev) == -1))
416210: 48 8b 45 d8 mov 0xffffffffffffffd8(%rbp),%rax // Storing struct_fd->fd
416214: 8b 10 mov (%rax),%edx // to EDX
416216: 48 8b 45 d8 mov 0xffffffffffffffd8(%rbp),%rax // Storing struct_fd->Hdr->info->epollfd
41621a: 48 8b 80 e8 01 00 00 mov 0x1e8(%rax),%rax // to EDI which failed
416221: 48 8b 80 58 01 00 00 mov 0x158(%rax),%rax // while trying to offset members of the structure
416228: 8b 78 5c mov 0x5c(%rax),%edi // <--- failed here since Reg AX is 0x0
41622b: 48 8d 4d e0 lea 0xffffffffffffffe0(%rbp),%rcx
41622f: be 01 00 00 00 mov $0x1,%esi
416234: e8 b7 e1 fe ff callq 4043f0 <epoll_ctl#plt>
416239: 83 f8 ff cmp $0xffffffffffffffff,%eax
41623c: 0f 94 c0 sete %al
41623f: 0f b6 c0 movzbl %al,%eax
416242: 48 85 c0 test %rax,%rax
416245: 74 5e je 4162a5 <add_to_epoll+0xc9>
Printing out Registers and struct member values:
(gdb) i r $rax
rax 0x0 0
(gdb) p struct_fd
$3 = (struct fdStruct *) 0x18d32760
(gdb) p struct_fd->Hdr
$4 = (StHdr *) 0x3b990f30
(gdb) p struct_fd->Hdr->info
$5 = (struct Info *) 0x3b95b410 // Strangely, this is NOT NULL. Inconsistent with assembly dump.
(gdb) p ev
$6 = {events = 2147483659, data = {ptr = 0x573dc648000003d6, fd = 982, u32 = 982, u64= 6286398667419026390}}
Please let me know if my disassembly interpretation is OK. And if it is, I would like to understand why gdb is not showing NULL when it prints out the structure members.
Or, if my analysis is not right, I would like to know the actual reason for the core dump. Please let me know if you need more info.
Thanks
---- The following part has been added later ----
The proxy is a multithreaded program. Doing more digging, I came to know that when the problem occurs, the following two threads are running in parallel, and when I prevent the two functions from running in parallel the problem never occurs. But the thing is, I cannot explain how this behaviour leads to the original problematic scene:
Thread 1:
------------------------------------------------------------
int new_connection() {
...
struct_fd->Hdr->info=NULL; /* (line 1) */
...
<some code>
...
struct_fd->Hdr->info=Golbal_InFo_Ptr; /* (line 2) */ // This is a malloced memory, once allocated never freed
...
...
}
------------------------------------------------------------
Thread 2 executing add_to_epoll():
------------------------------------------------------------
int add_to_epoll( struct fdStruct * struct_fd, int lno)
{
...
if (unExp(epoll_ctl(struct_fd->Hdr->info->epollfd,...) /* (line 3) */
...
}
------------------------------------------------------------
In the above snippets if execution is done in the order,
LIne 1,
Line 3,
Line 2,
the scene can occur. What I expect is that whenever an illegal reference is encountered (Line 3) the process should dump immediately, before Line 2 gets a chance to run and make the pointer non-NULL again.
This behaviour is consistent: till now I have got around 12 core dumps of the same problem, all showing exactly the same thing.
It is clear that struct_fd->Hdr->info is NULL, as Per Johansson already answered.
However, GDB thinks that it is not. How could that be?
One common way this happens is when you change the layout of struct fdStruct, struct StHdr (or both), and you neglect to rebuild all objects that use these definitions.
The disassembly shows that offsetof(struct fdStruct, Hdr) == 0x1e8 and offsetof(struct StHdr, info) == 0x158. See what GDB prints for the following:
(gdb) print/x (char*)&struct_fd->Hdr - (char*)struct_fd
(gdb) print/x (char*)&struct_fd->Hdr->info - (char*)struct_fd->Hdr
I bet it would print something other than 0x1e8 and 0x158.
If that's the case, make clean && make may fix the problem.
Update:
(gdb) print/x (char*)&struct_fd->Hdr - (char*)struct_fd
$1 = 0x1e8
(gdb) print/x (char*)&struct_fd->Hdr->info - (char*)struct_fd->Hdr
$3 = 0x158
This proves that GDB's idea of how objects are laid out in memory matches compiled code.
We still don't know whether GDB's idea of the value of struct_fd matches reality. What do these commands print?
(gdb) print struct_fd
(gdb) x/gx $rbp-40
They should produce the same value (0x18d32760). Assuming they do, the only other explanation I can think of is that you have multiple threads accessing struct_fd, and the other thread overwrites the value that used to be NULL with the new value.
I just noticed your update to the question ;-)
What I expect is that whenever an illegal reference is encountered (Line 3) the process should dump immediately, before Line 2 gets a chance to run and make the pointer non-NULL again.
Your expectation is incorrect: on any modern CPU, you have multiple cores, and your threads are executing simultaneously. That is, you have this code (time goes down along Y axis):
char *p;  // global

Time    CPU0        CPU1
  0                 p = NULL
  1     if (*p)     p = malloc(1)
  2                 *p = 'a';
 ...
At T1, CPU0 traps into the OS, but CPU1 continues. Eventually, the OS processes hardware trap, and dumps memory state at that time. On CPU1, hundreds of instructions may have executed after T1. The clocks between CPU0 and CPU1 aren't even synchronized, they don't necessarily go in lock-step.
Moral of the story: don't access global variables from multiple threads without proper locking.
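A minimal sketch of that advice, with hypothetical stand-in types so it compiles on its own (the point is only the mutex around the pointer update and around the read that dereferences it):
#include <pthread.h>

struct Info { int epollfd; };                       /* stand-in for the real struct Info */
struct StHdr { struct Info *info; };                /* stand-in for the real StHdr */
extern struct Info *Golbal_InFo_Ptr;                /* the question's global */

static pthread_mutex_t info_lock = PTHREAD_MUTEX_INITIALIZER;

/* Writer side (Thread 1): update the pointer under the lock. */
void publish_info(struct StHdr *hdr)
{
    pthread_mutex_lock(&info_lock);
    hdr->info = Golbal_InFo_Ptr;
    pthread_mutex_unlock(&info_lock);
}

/* Reader side (Thread 2): take the same lock before dereferencing. */
int get_epollfd(struct StHdr *hdr)
{
    pthread_mutex_lock(&info_lock);
    int fd = hdr->info ? hdr->info->epollfd : -1;
    pthread_mutex_unlock(&info_lock);
    return fd;
}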
The C line shown as part of the disassembly does not match the one in the original code. But clearly
struct_fd->Hdr->info
is NULL. gdb shouldn't have a problem printing that, but it does sometimes get confused when the code is compiled with -O2 or higher.

Why does __sync_add_and_fetch work for a 64 bit variable on a 32 bit system?

Consider the following condensed code:
/* Compile: gcc -pthread -m32 -ansi x.c */
#include <stdio.h>
#include <inttypes.h>
#include <pthread.h>
static volatile uint64_t v = 0;
void *func (void *x) {
    __sync_add_and_fetch (&v, 1);
    return x;
}
int main (void) {
    pthread_t t;
    pthread_create (&t, NULL, func, NULL);
    pthread_join (t, NULL);
    printf ("v = %"PRIu64"\n", v);
    return 0;
}
I have a uint64_t variable that I want to increment atomically, because the variable is a counter in a multi-threaded program.
To achieve the atomicity I use GCC's atomic builtins.
If I compile for an amd64 system (-m64) the produced assembler code is easy to understand.
By using a lock addq, the processor guarantees the increment to be atomic.
400660: f0 48 83 05 d7 09 20 lock addq $0x1,0x2009d7(%rip)
But the same C code produces a very complicated ASM code on an ia32 system (-m32):
804855a: a1 28 a0 04 08 mov 0x804a028,%eax
804855f: 8b 15 2c a0 04 08 mov 0x804a02c,%edx
8048565: 89 c1 mov %eax,%ecx
8048567: 89 d3 mov %edx,%ebx
8048569: 83 c1 01 add $0x1,%ecx
804856c: 83 d3 00 adc $0x0,%ebx
804856f: 89 ce mov %ecx,%esi
8048571: 89 d9 mov %ebx,%ecx
8048573: 89 f3 mov %esi,%ebx
8048575: f0 0f c7 0d 28 a0 04 lock cmpxchg8b 0x804a028
804857c: 08
804857d: 75 e6 jne 8048565 <func+0x15>
Here is what I don't understand:
lock cmpxchg8b does guarantee that the changed variable is only written if the expected value still resides in the target address. The compare-and-swap is guaranteed to happen atomically.
But what guarantees that the reading of the variable in 0x804855a and 0x804855f to be atomic?
Probably it does not matter if there was a "dirty read", but could someone please outline a short proof that there is no problem?
Further: why does the generated code jump back to 0x8048565 and not 0x804855a? I am positive that this is only correct if other writers, too, only increment the variable. Is this an implied requirement of the __sync_add_and_fetch function?
The initial read with 2 separate mov instructions is not atomic, but it's not in the loop. #interjay's answer explains why this is fine.
Fun fact: the read done by cmpxchg8b would be atomic even without a lock prefix. (But this code does use a lock prefix to make the entire RMW operation atomic, rather than separate atomic load and atomic store.)
It's guaranteed to be atomic due to it being aligned correctly (and it fits within one cache line) and because Intel made the spec this way, see the Intel Architecture manual Vol 1, 4.4.1:

A word or doubleword operand that crosses a 4-byte boundary or a quadword operand that crosses an 8-byte boundary is considered unaligned and requires two separate memory bus cycles for access.

Vol 3A 8.1.1:

The Pentium processor (and newer processors since) guarantees that the following additional memory operations will always be carried out atomically:
• Reading or writing a quadword aligned on a 64-bit boundary
• 16-bit accesses to uncached memory locations that fit within a 32-bit data bus

The P6 family processors (and newer processors since) guarantee that the following additional memory operation will always be carried out atomically:
• Unaligned 16-, 32-, and 64-bit accesses to cached memory that fit within a cache line
Thus by being aligned, it can be read in 1 cycle, and it fits into one cache line making cmpxchg8b's read atomic.
If the data had been misaligned, the lock prefix would still make it atomic, but the performance cost would be very high because a simple cache-lock (delaying response to MESI Invalidate requests for that one cache line) would no longer be sufficient.
The code jumps back to 0x8048565 (after the mov loads, including the copy and add-1) because v has already been loaded; there is no need to load it again as CMPXCHG8B will set EAX:EDX to the value in the destination if it fails:
CMPXCHG8B description from the Intel ISA manual Vol. 2A:
Compare EDX:EAX with m64. If equal, set ZF and load ECX:EBX into m64.
Else, clear ZF and load m64 into EDX:EAX.
Thus the code needs only to increment the newly returned value and try again.
If we look at this in C code it becomes easier:
value = dest;  // non-atomic but usually won't tear
while (!CAS8B(&dest, value, value + 1))
{
    value = dest;  // atomic; part of lock cmpxchg8b
}
The value = dest is actually from the same read that cmpxchg8b used for the compare part. There isn't a separate reload inside the loop.
In fact, C11 atomic_compare_exchange_weak / _strong has this behaviour built-in: it updates the "expected" operand.
So does gcc's modern builtin __atomic_compare_exchange_n (type *ptr, type *expected, type desired, bool weak, int success_memorder, int failure_memorder) - it takes the expected value by reference.
With GCC's older obsolete __sync builtins, __sync_val_compare_and_swap returns the old val (instead of a boolean swapped / didn't-swap result for __sync_bool_compare_and_swap)
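For reference, here is roughly what that retry pattern looks like when written by hand with C11 atomics. This is only a sketch (it assumes a target where 64-bit atomics are lock-free; for a plain increment you would normally just call atomic_fetch_add, which on ia32 typically compiles to the same kind of lock cmpxchg8b loop):
#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t v;

void increment(void)
{
    uint64_t expected = atomic_load_explicit(&v, memory_order_relaxed);
    /* On failure, atomic_compare_exchange_weak writes the value it found
       into `expected`, so there is no separate reload inside the loop. */
    while (!atomic_compare_exchange_weak(&v, &expected, expected + 1)) {
        /* retry with the freshly returned value */
    }
}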
The reading of the variable in 0x804855a and 0x804855f does not need to be atomic. Using the compare-and-swap instruction to increment looks like this in pseudocode:
oldValue = *dest;  // non-atomic: tearing between the halves is unlikely but possible
do {
    newValue = oldValue + 1;
} while (!compare_and_swap(dest, &oldValue, newValue));
Since the compare-and-swap checks that *dest == oldValue before swapping, it will act as a safeguard - so that if the value in oldValue is incorrect, the loop will be tried again, so there's no problem if the non-atomic read resulted in an incorrect value.
The 64-bit access to *dest done by lock cmpxchg8b is atomic (as part of an atomic RMW of *dest). Any tearing in loading the 2 halves separately will be caught here. Or if a write from another core happened after the initial read, before lock cmpxchg8b: this is possible even with single-register-width cmpxchg-retry loops. (e.g. to implement atomic fetch_mul or an atomic float, or other RMW operations that x86's lock prefix doesn't let us do directly.)
Your second question was why the line oldValue = *dest is not inside the loop. This is because the compare_and_swap function will always replace the value of oldValue with the actual value of *dest. So it will essentially perform the line oldValue = *dest for you, and there's no point in doing it again. In the case of the cmpxchg8b instruction, it will put the contents of the memory operand in edx:eax when the comparison fails.
The pseudocode for compare_and_swap is:
bool compare_and_swap (int *dest, int *oldVal, int newVal)
{
    do atomically {
        if ( *oldVal == *dest ) {
            *dest = newVal;
            return true;
        } else {
            *oldVal = *dest;
            return false;
        }
    }
}
By the way, in your code you need to ensure that v is aligned to 64 bits - otherwise it could be split between two cache lines and the cmpxchg8b instruction will not be performed atomically. You can use GCC's __attribute__((aligned(8))) for this.
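For instance (GCC attribute syntax, as a one-line sketch of that suggestion):
#include <stdint.h>

/* Keep the 64-bit counter 8-byte aligned so cmpxchg8b never straddles a cache line. */
static volatile uint64_t v __attribute__((aligned(8))) = 0;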

Resources