Can this be atomically executed?

I would like to know whether it is possible to ensure that the line marked "// line" below is executed atomically, given that it can be executed from both the ISR and the main context. I'm working on an ARM9 (LPC313x) and using RealView 4 (armcc).
foo() {
..
stack_var = ++volatile_var; // line
..
}
I'm looking for any routine like _atomic_ for C166, direct assembly code, etc. I would prefer not to have to disable the interrupts.
Thank you very much.

No, I don't think you can ever expect ++volatile_var to be atomic, even without the assignment. Use a proper atomic primitive for that. If your compiler doesn't provide such an extension, you can easily find short inline assembler for it on the web. I think the instructions for atomic exchange on ARM are called ldrex and strex.
Edit: it seems that the specific processor type that is asked for in the question does not implement these instructions.
Edit: The following should work with gcc, for another compiler one probably has to adapt the __asm__ parts.
inline
size_t arm_ldrex(size_t volatile*ptr) {
size_t ret;
__asm__ volatile ("ldrex %0,[%1]\t@ load exclusive\n"
: "=&r" (ret)
: "r" (ptr)
: "cc", "memory"
);
return ret;
}
inline
_Bool arm_strex(size_t volatile*ptr, size_t val) {
size_t error;
__asm__ volatile ("strex %0,%1,[%2]\t@ store exclusive\n"
: "=&r" (error)
: "r" (val), "r" (ptr)
: "cc", "memory"
);
return !error;
}
inline
size_t atomic_add_fetch(size_t volatile *object, size_t operand) {
for (;;) {
size_t oldval = arm_ldrex(object);
size_t newval = oldval + operand;
if (arm_strex(object, newval)) return newval;
}
}

From a quick look, the C166 _atomic_ macro seems to utilize an instruction that effectively masks interrupts for the duration of a specified number of instructions.
There is nothing directly corresponding to that in the ARM architecture.
You could of course use the swp instruction (or __swp intrinsic in the RealView toolchain) to implement a lock around the critical section. ldrex/strex mentioned in another answer do not exist in ARM architecture version 5, which includes the ARM9 processors.
Both are documented at http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/CJAHDCHB.html and http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/Chdbbbai.html.
A simplistic lock implementation around this (using the RealView toolchain) would be:
{
/* Loop until lock acquired */
while (__swp(LOCKED, &lockvar) == LOCKED);
..
/* Critical section */
..
lockvar = UNLOCKED;
}
However, this will lead to deadlock in the ISR context when the Main thread is holding the lock.
I think masking interrupts around the operation is likely to be the least hairy solution, although if your Main context is executing in User mode it will require a system call to implement.
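As a rough sketch of the interrupt-masking approach, the increment can be wrapped in a save/mask/restore pair. The helpers irq_save() and irq_restore() are hypothetical names, not RealView APIs; here they are host-side stubs acting on a simulated mask flag so that the sketch compiles anywhere. On the real target they would be replaced by the toolchain's interrupt intrinsics or a few lines of CPSR-manipulating assembly.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated IRQ mask; on real hardware this state lives in the CPSR. */
static bool irq_masked = false;

/* Hypothetical helper: mask IRQs and return the previous mask state. */
static bool irq_save(void) {
    bool was_masked = irq_masked;
    irq_masked = true;
    return was_masked;
}

/* Hypothetical helper: restore the mask state returned by irq_save(). */
static void irq_restore(bool was_masked) {
    irq_masked = was_masked;
}

static volatile uint32_t volatile_var;

/* Increment the shared counter and return the new value. While IRQs are
 * masked, nothing can interrupt the read-modify-write. */
uint32_t increment_shared(void) {
    bool was_masked = irq_save();
    uint32_t stack_var = ++volatile_var;
    irq_restore(was_masked);
    return stack_var;
}
```

Restoring the saved state, rather than unconditionally re-enabling interrupts, keeps the function safe to call from the ISR, where interrupts are already masked.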

Related

Making data reads/writes atomic in C11 GCC using <stdatomic.h>?

I have learned from SO threads here and here, among others, that it is not safe to assume that reads/writes of data in multithreaded applications are atomic at the OS/hardware level, and corruption of data may result. I would like to know the simplest way of making reads and writes of int variables atomic, using the <stdatomic.h> C11 library with the GCC compiler on Linux.
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
For C11 atomics you don't even have to use functions. If your implementation (= compiler) supports atomics you can just add an atomic specifier to a variable declaration and then subsequently all operations on that are atomic:
_Atomic(int) toto = 65;
...
toto += 2; // is an atomic read-modify-write operation
...
if (toto == 67) // is an atomic read of toto
Atomics have their price (they need far more computing resources than plain accesses), but as long as you use them sparingly they are the perfect tool for synchronizing threads.
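Applied to the messageBox from the question, this might look like the following minimal sketch. The array size BOX_COUNT and the helper names post_message() and message_is() are made up for illustration.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define BOX_COUNT 4   /* illustrative size, not from the question */

static _Atomic int messageBox[BOX_COUNT];

/* Atomic write; equivalent to atomic_store(&messageBox[i], value). */
void post_message(int i, int value) {
    messageBox[i] = value;
}

/* Atomic read; equivalent to atomic_load(&messageBox[i]) == value. */
bool message_is(int i, int value) {
    return messageBox[i] == value;
}
```

Plain assignment and comparison on an _Atomic object use sequentially consistent ordering, the same default as atomic_store() and atomic_load().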
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
You almost never have to do anything. In almost every case, the data which your threads share (or communicate with) are protected from concurrent access via such things as mutexes, semaphores and the like. The implementation of the base operations ensure the synchronization of memory.
The reason for these atomics is to help you construct safer race conditions in your code. There are a number of hazards with them; including:
ai += 7;
would use an atomic protocol if ai were suitably defined. Trying to decipher race conditions is not aided by obscuring the implementation.
There is also a highly machine dependent portion to them. The line above, for example, could fail [1] on some platforms, but how is that failure communicated back to the program? It is not [2].
Only one operation has the option of dealing with failure; atomic_compare_exchange_(weak|strong). Weak just tries once, and lets the program choose how and whether to retry. Strong retries endlessly. It isn't enough to just try once -- spurious failures due to interrupts can occur -- but endless retries on a non-spurious failure is no good either.
Arguably, for robust programs or widely applicable libraries, the only bit of it you should use is atomic_compare_exchange_weak().
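One way to read that advice is a retry loop with an upper bound, so that the caller learns about persistent failure instead of spinning forever. The function name bounded_atomic_add() and the max_tries parameter are inventions for this sketch, not part of stdatomic.h.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Try to add `operand` to *obj, giving up after max_tries failed
 * compare-exchange attempts. Returns true if the add was applied. */
bool bounded_atomic_add(atomic_int *obj, int operand, int max_tries) {
    int expected = atomic_load(obj);
    for (int attempt = 0; attempt < max_tries; attempt++) {
        /* On failure, `expected` is refreshed with the current value. */
        if (atomic_compare_exchange_weak(obj, &expected, expected + operand))
            return true;
    }
    return false;
}
```

What to do when the function returns false (back off, report an error, fall back to a lock) is left to the caller, which is the point of using the weak variant.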
[1] Load-linked, store-conditional (ll-sc) is a common means for making atomic transactions on asynchronous bus architectures. The load-linked sets a little flag on a cache line, which will be cleared if any other bus agent attempts to modify that cache line. Store-conditional stores a value iff the little flag is set in the cache, and clears the flag; iff the flag is cleared, Store-conditional signals an error, so an appropriate retry operation can be attempted. From these two operations, you can construct any atomic operation you like on a completely asynchronous bus architecture.
ll-sc can have subtle dependencies on the caching attributes of the location. Permissible cache attributes are platform dependent, as is which operations may be performed between the ll and sc.
If you put an ll-sc operation on a poorly cached access, and blindly retry, your program will lock up. This isn't just speculation; I had to debug one of these on an ARMv7-based "safe" system.
[2]:
#include <stdatomic.h>
int f(atomic_int *x) {
return (*x)++;
}
f:
dmb ish
.L2:
ldrex r3, [r0]
adds r2, r3, #1
strex r1, r2, [r0]
cmp r1, #0
bne .L2 /* note the retry loop */
dmb ish
mov r0, r3
bx lr
The most portable way is to use one of the C11 atomic variables. You can also use a spinlock built on an atomic operation to guard non-atomic variables. Here is a simple pthread producer/consumer example to play with; modify as desired. Notice that cnt_non and cnt_vol can be corrupted.
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
atomic_uint cnt_atomic;
int cnt_non;
volatile int cnt_vol;
typedef atomic_uint lock_t;
lock_t lockholder = 0;
#define LOCK_C 0x01
#define LOCK_P 0x02
int cnt_lock; /* not atomic on purpose to test spinlock */
atomic_int lock_held_c, lock_held_p;
void
lock(lock_t *bitarrayp, uint32_t desired)
{
uint32_t expected = 0; /* lock is not held */
/* the value in expected is updated if it does not match
* the value in bitarrayp. If the comparison fails then compare
* the returned value with the lock bits and update the appropriate
* counter.
*/
do {
if (expected & LOCK_P) lock_held_p++;
if (expected & LOCK_C) lock_held_c++;
expected = 0;
} while(!atomic_compare_exchange_weak(bitarrayp, &expected, desired));
}
void
unlock(lock_t *bitarrayp)
{
*bitarrayp = 0;
}
void*
fn_c(void *thr_data)
{
(void)thr_data;
for (int i=0; i<40000; i++) {
cnt_atomic++;
cnt_non++;
cnt_vol++;
/* lock, increment, unlock */
lock(&lockholder, LOCK_C);
cnt_lock++;
unlock(&lockholder);
}
return NULL;
}
void*
fn_p(void *thr_data)
{
(void)thr_data;
for (int i=0; i<30000; i++) {
cnt_atomic++;
cnt_non++;
cnt_vol++;
/* lock, increment, unlock */
lock(&lockholder, LOCK_P);
cnt_lock++;
unlock(&lockholder);
}
return NULL;
}
void
drv_pc(void)
{
pthread_t thr[2];
pthread_create(&thr[0], NULL, fn_c, NULL);
pthread_create(&thr[1], NULL, fn_p, NULL);
for(int n = 0; n < 2; ++n)
pthread_join(thr[n], NULL);
printf("cnt_a=%u, cnt_non=%d cnt_vol=%d\n", cnt_atomic, cnt_non, cnt_vol);
printf("lock %d held_c=%d held_p=%d\n", cnt_lock, lock_held_c, lock_held_p);
}
that it is not safe to assume that reads/writes of data in
multithreaded applications are atomic at the OS/hardware level, and
corruption of data may result
Actually, non-composite operations on types like int are atomic on all reasonable architectures. What you read is simply a hoax.
(An increment is a composite operation: it has a read, a calculation, and a write component. Each component is atomic but the whole composite operation is not.)
But atomicity at the hardware level isn't the issue. The high-level language you use simply doesn't support that kind of manipulation on regular types. You need to use atomic types to even have the right to manipulate objects in a way that makes the question of atomicity relevant: when you are potentially modifying an object that is in use in another thread.
(Or volatile types. But don't use volatile. Use atomics.)

reading a 64 bit volatile variable on cortex-m3

I have a 64 bit integer variable on a 32 bit Cortex-M3 ARM controller (STM32L1), which can be modified asynchronously by an interrupt handler.
volatile uint64_t v;
void some_interrupt_handler() {
v = v + something;
}
Obviously, I need a way to access it in a way that prevents getting inconsistent, halfway updated values.
Here is the first attempt
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
uint64_t y;
__disable_irq();
y = *x;
__enable_irq();
return y;
}
The CMSIS inline functions __disable_irq() and __enable_irq() have an unfortunate side effect, forcing a memory barrier on the compiler, so I've tried to come up with something more fine-grained
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
uint64_t y;
asm ( "cpsid i\n"
"ldrd %[value], %[addr]\n"
"cpsie i\n"
: [value]"=r"(y) : [addr]"m"(*x));
return y;
}
It still disables interrupts, which is not desirable, so I'm wondering if there's a way of doing it without resorting to cpsid. The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, Third Edition by Joseph Yiu says:
If an interrupt request arrives when the processor is executing a
multiple cycle instruction, such as an integer divide, the instruction
could be abandoned and restarted after the interrupt handler
completes. This behavior also applies to load double-word (LDRD) and
store double-word (STRD) instructions.
Does it mean that I'll be fine by simply writing this?
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
uint64_t y;
asm ( "ldrd %[value], %[addr]\n"
: [value]"=&r"(y) : [addr]"m"(*x));
return y;
}
(Using "=&r" to work around ARM errata 602117)
Is there some library or builtin function that does the same portably? I've tried atomic_load() in stdatomic.h, but it fails with undefined reference to '__atomic_load_8'.
Yes, using a simple ldrd is safe in this application since it will be restarted (not resumed) if interrupted, hence it will appear atomic from the interrupt handler's point of view.
This holds more generally for all load instructions except those that are exception-continuable, which are a very restricted subset:
• only ldm, pop, vldm, and vpop can be continuable
• an instruction inside an it-block is never continuable
• an ldm/pop whose first loaded register is also the base register (e.g. ldm r0, { r0, r1 }) is never continuable
This gives plenty of options for atomically reading a multi-word variable that's modified by an interrupt handler on the same core. If the data you wish to read is not a contiguous array of words then you can do something like:
1: ldrex %[val0], [%[ptr]] // can also be byte/halfword
... more loads here ...
strex %[retry], %[val0], [%[ptr]]
cbz %[retry], 2f
b 1b
2:
It doesn't really matter which word (or byte/halfword) you use for the ldrex/strex since an exception will perform an implicit clrex.
The other direction, writing a variable that's read by an interrupt handler is a lot harder. I'm not 100% sure but I think the only stores that are guaranteed to appear atomic to an interrupt handler are those that are "single-copy atomic", i.e. single byte, aligned halfword, and aligned word. Anything bigger would require disabling interrupts or using some clever lock-free structure.
Atomicity is not guaranteed for LDRD according to the ARMv7-M Architecture Reference Manual (section A3.5.1):
The only ARMv7-M explicit accesses made by the ARM processor which exhibit single-copy atomicity are:
• All byte transactions
• All halfword transactions to 16-bit aligned locations
• All word transactions to 32-bit aligned locations
LDM, LDC, LDRD, STM, STC, STRD, PUSH and POP operations are seen to be a sequence of 32-bit
transactions aligned to 32 bits. Each of these 32-bit transactions are guaranteed to exhibit single-copy
atomicity. Sub-sequences of two or more 32-bit transactions from the sequence also do not exhibit
single-copy atomicity
What you can do is use a byte to indicate to the ISR you're reading it.
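volatile uint8_t flag;
volatile uint64_t doubleword;
uint64_t non_isr(void) {
    uint64_t foo;
    do {
        flag = 1;          /* tell the ISR that a read is in progress */
        foo = doubleword;  /* may be torn if the ISR fires here */
    } while (flag > 1);    /* the ISR ran during the read: retry */
    flag = 0;
    return foo;
}
void isr(void) {
    if (flag == 1)
        flag++;            /* tell the reader its copy may be torn */
    doubleword = foo;      /* foo here stands for the ISR's new value */
}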
Source (login required):
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403e.b/index.html
Login not required:
http://www.telecom.uff.br/~marcos/uP/ARMv7_Ref.pdf
I was also trying to use a 64-bit (2 x 32-bit) system_tick, but on an STM32L4xx (ARM Cortex-M4). I found that when I used just "volatile uint64_t system_tick", the compiler emitted an LDRD instruction, which may have been enough, since being interrupted after reading the first word is supposed to cause both words to be read again.
I asked the tech at IAR software support and he responded that I should use C11 atomics;
#include "stdatomic.h"
#ifdef __STDC_NO_ATOMICS__
static_assert(__STDC_NO_ATOMICS__ != 1);
#endif
volatile atomic_uint_fast64_t system_tick;
/**
* \brief Increment system_tick
* \retval none
*/
void HAL_IncTick(void)
{
system_tick++;
}
/**
* \brief Read 64-bit system_tick
* \retval system_tick
*/
uint64_t HAL_GetSystemTick(void)
{
return system_tick;
}
/**
* \brief Read 32 least significant bits of system_tick
* \retval (uint32_t) system_tick
*/
uint32_t HAL_GetTick(void)
{
return (uint32_t)system_tick;
}
But what I found was a colossal amount of code was added to make the read "atomic".
Way back in the day of 8-bit micro-controllers, the trick was to read the high byte, read the low byte, then read the high byte until the high byte was the same twice - proving that there was no rollover created by the ISR. So if you are against disabling IRQ, reading system_tick, then enabling IRQ, try this trick:
/**
* \brief Read 64-bit system_tick
* \retval system_tick
*/
uint64_t HAL_GetSystemTick(void)
{
uint64_t tick;
do {
tick = system_tick;
} while ((uint32_t)(system_tick >> 32) != (uint32_t)(tick >> 32));
return tick;
}
The idea is that if the most significant word has not changed between the two reads, then the whole 64-bit system_tick must be valid. If HAL_IncTick() did anything more than a simple increment, this assumption would not hold.
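For comparison, here is a sketch of the classic high/low/high form of the same trick with the read order spelled out explicitly, rather than left to however the compiler splits a 64-bit load. The counter is kept as two 32-bit halves; the names tick_hi, tick_lo, tick_increment() and tick_read() are made up for this example, and it assumes the only writer is an ISR that performs a simple increment.

```c
#include <stdint.h>

/* Counter kept as two 32-bit halves; each half is read with a single
 * aligned word access. The ISR is assumed only ever to increment it. */
static volatile uint32_t tick_hi;
static volatile uint32_t tick_lo;

/* What the tick ISR would do: increment with carry into the high word. */
void tick_increment(void) {
    if (++tick_lo == 0)
        ++tick_hi;
}

/* Read high, low, then high again; retry if the high word changed,
 * because the low word may then belong to a different high word. */
uint64_t tick_read(void) {
    uint32_t hi, lo;
    do {
        hi = tick_hi;
        lo = tick_lo;
    } while (hi != tick_hi);
    return ((uint64_t)hi << 32) | lo;
}
```

If the low word rolls over between the two reads of the high word, the comparison fails and the loop reads both halves again.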

Mixing C and assembly and its impact on registers

Consider the following C and (ARM) assembly snippet, which is to be compiled with GCC:
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 q12, #0.0\n\t"
: [data_addr] "+r" (data_addr)
: : "q0", "q12");
for(int n=0; n<10; ++n){
__asm__ __volatile__ (
"vadd.f32 q12, q12, q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr)
:: "q0", "q12");
}
In this example, I am initialising some SIMD registers outside the loop and then having C handle the loop logic, with those initialised registers being used inside the loop.
This works in some test code, but I'm concerned about the risk of the compiler clobbering the registers between snippets. Is there any way of ensuring this doesn't happen? Can I infer any assurances about the type of registers that are going to be used in a snippet (in this case, that no SIMD registers will be clobbered)?
In general, there's not a way to do this in gcc; clobbers only guarantee that registers will be preserved around the asm call. If you need to ensure that the registers are saved between two asm sections, you will need to store them to memory in the first, and reload in the second.
Edit: After much fiddling around I've come to the conclusion this is much harder to solve in general using the strategy described below than I initially thought.
The problem is that, particularly when all the registers are used, there is nothing to stop the first register stash from overwriting another. Whether there is some trick to play with using direct memory writes that can be optimised away I don't know, but initial tests would suggest the compiler might still choose to clobber not-yet-stashed registers
For the time being, and until I have more information, I'm unmarking this answer as correct, and it should be treated as probably wrong in the general case. My conclusion is that such local protection of registers needs better support in the compiler to be useful.
This absolutely is possible to do reliably. Drawing on the comments by @PeterCordes as well as the docs and a couple of useful bug reports (gcc 41538 and 37188) I came up with the following solution.
The point that makes it valid is the use of temporary variables to make sure the registers are maintained (logically, if the loop clobbers them, then they will be reloaded). In practice, the temporary variables are optimised away which is clear from inspection of the resultant asm.
// d0 and d1 map to the first and second values of q0, so we use
// q0 to reduce the number of tmp variables we pass around (instead
// of using one for each of d0 and d1).
register float32x4_t data __asm__ ("q0");
register float32x4_t output __asm__ ("q12");
float32x4_t tmp_data;
float32x4_t tmp_output;
__asm__ __volatile__ (
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
"vmov.f32 %q[output], #0.0\n\t"
: [data_addr] "+r" (data_addr),
[output] "=&w" (output),
"=&w" (data) // we still need to constrain data (q0) as written to.
::);
// Stash the register values
tmp_data = data;
tmp_output = output;
for(int n=0; n<10; ++n){
// Make sure the registers are loaded correctly
output = tmp_output;
data = tmp_data;
__asm__ __volatile__ (
"vadd.f32 %[output], %[output], q0\n\t"
"vldmia.64 %[data_addr]!, {d0-d1}\n\t"
: [data_addr] "+r" (data_addr),
[output] "+w" (output),
"+w" (data) // again, data (q0) was written to in the vldmia op.
::);
// Remember to stash the registers again before continuing
tmp_data = data;
tmp_output = output;
}
It's necessary to instruct the compiler that q0 is written to in the last line of each asm output constraint block, so it doesn't think it can reorder the stashing and reloading of the data register resulting in the asm block getting invalid values.

gcc asm too many memory references

I am trying to read the time from the CMOS using asm, but I get this error:
/tmp/ccyx8l5L.s:1236: Error: too many memory references for 'mov'
/tmp/ccyx8l5L.s:1240: Error: too many memory references for 'out'
/tmp/ccyx8l5L.s:1244: Error: too many memory references for 'in'
/tmp/ccyx8l5L.s:1252: Error: too many memory references for 'mov'
and here is the code:
for (index = 0; index < 128; index++) {
asm("cli");
asm("mov al, index"); /* Move index address*/
asm("out 0x70,al"); /* Copy address to CMOS register*/
/* some kind of real delay here is probably best */
asm("in al,0x71"); /* Fetch 1 byte to al*/
asm("sti"); /* Enable interrupts*/
asm("mov tvalue,al");
array[index] = tvalue;
}
I am using gcc to compile
gcc uses AT&T syntax.
Compile with -masm=intel
As Janycz points out, your code uses Intel syntax for the assembler, while gcc (by default) expects AT&T. If you are using gcc, how about something like this:
int main()
{
unsigned char array[128];
for (unsigned char index = 0; index < 128; index++) {
asm("cli\n\t"
"out %%al,$0x70\n\t"
/* some kind of real delay here is probably best */
"in $0x71, %%al\n\t" /* Fetch 1 byte to al*/
"sti" /* Enable interrupts*/
: "=a" (array[index]) : "a" (index) );
}
}
This minimizes the amount of asm you have to write (4 lines vs your 6) and allows the compiler to perform more optimizations. See the docs for gcc's inline asm to understand how all this works.
I haven't tried running this, since it won't run under protected-mode operating systems (like Windows): in and out are privileged instructions and can't be used by user-mode applications. I assume you are running this on DOS or something?

Calculating CPU frequency in C with RDTSC always returns 0

The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first.
Any ideas why this is happening and how can I fix it?
Okay, since the other answer wasn't helpful, I'll try to explain in more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do the 64-bit carry arithmetic yourself, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm; you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>
uint64_t rdtsc() {
uint64_t ret;
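#if defined(__x86_64__)
/* "=A" means rax OR rdx in 64-bit mode, so use "=a" and clobber rdx. */
asm volatile ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax"
: "=a"(ret)
: /* no input */
: "rdx", "cc"
);
#else
/* In 32-bit mode "=A" is the edx:eax pair that rdtsc writes. */
asm volatile ("rdtsc"
: "=A"(ret)
);
#endif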
#endif
return ret;
}
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.
As for VMWare, take a look at the time keeping spec (PDF Link), as well as this thread. TSC instructions are (depending on the guest OS):
1. Passed directly to the real hardware (PV guest)
2. Count cycles while the VM is executing on the host processor (Windows / etc)
Note the "while the VM is executing on the host processor" in #2. The same would go for Xen as well, if I recall correctly. In essence, you can expect the code to work as expected on a paravirtualized guest. If emulated, it's entirely unreasonable to expect hardware-like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and @Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.
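To tie this back to the original goal, the sketch below counts ticks across a short sleep and divides by the elapsed wall-clock time. It uses the __rdtsc() intrinsic on x86; the CLOCK_MONOTONIC fallback for other architectures, and the names read_ticks() and ticks_per_second(), are additions for this example only. Note that on modern x86 CPUs the TSC ticks at a fixed reference rate, so this estimates the TSC rate rather than the current core clock.

```c
#define _POSIX_C_SOURCE 200809L
#include <stdint.h>
#include <time.h>
#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
#endif

/* Read an increasing tick counter: the TSC on x86, or nanoseconds from
 * CLOCK_MONOTONIC elsewhere (a fallback added for this sketch). */
static uint64_t read_ticks(void) {
#if defined(__x86_64__) || defined(__i386__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Estimate ticks per second by counting ticks across a sleep of
 * sleep_ms milliseconds, timed with CLOCK_MONOTONIC. */
double ticks_per_second(long sleep_ms) {
    struct timespec t0, t1;
    struct timespec req = { sleep_ms / 1000, (sleep_ms % 1000) * 1000000L };
    clock_gettime(CLOCK_MONOTONIC, &t0);
    uint64_t c0 = read_ticks();
    nanosleep(&req, NULL);
    uint64_t c1 = read_ticks();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (double)(t1.tv_sec - t0.tv_sec)
                   + (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
    return (double)(c1 - c0) / elapsed;
}
```

Dividing by the measured elapsed time, rather than assuming the sleep lasted exactly as long as requested, keeps the estimate from being skewed when the sleep overshoots.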
hmmm I'm not positive but I suspect the problem may be inside this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm suspicious if you can safely carry out such huge multiplications in an "unsigned"... isn't that often a 32-bit number? ...just the fact that you couldn't safely multiply by 2^32 and had to append it as an extra "* 4" added to the 2^30 at the end already hints at this possibility... you might need to convert each sub-component hi and lo to a double (instead of a single one at the very end) and do the multiplication using the two doubles
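The concern about the width of the arithmetic can be sidestepped by doing it in a 64-bit unsigned integer instead of a double. A minimal sketch follows; the function names combine_counter() and elapsed_cycles() are made up for this example.

```c
#include <stdint.h>

/* Combine the EDX:EAX halves returned by rdtsc into one 64-bit count. */
uint64_t combine_counter(uint32_t hi, uint32_t lo) {
    return ((uint64_t)hi << 32) | lo;
}

/* Elapsed cycles between two readings; the unsigned 64-bit subtraction
 * takes care of the borrow between the halves. */
uint64_t elapsed_cycles(uint32_t start_hi, uint32_t start_lo,
                        uint32_t end_hi, uint32_t end_lo) {
    return combine_counter(end_hi, end_lo) - combine_counter(start_hi, start_lo);
}
```

Casting hi to uint64_t before shifting is the important step: shifting a 32-bit value by 32 bits is undefined in C.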

Resources