reading a 64 bit volatile variable on cortex-m3 - arm

I have a 64 bit integer variable on a 32 bit Cortex-M3 ARM controller (STM32L1), which can be modified asynchronously by an interrupt handler.
volatile uint64_t v;
void some_interrupt_handler() {
v = v + something;
}
Obviously, I need a way to access it in a way that prevents getting inconsistent, halfway updated values.
Here is the first attempt
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
uint64_t y;
__disable_irq();
y = *x;
__enable_irq();
return y;
}
The CMSIS inline functions __disable_irq() and __enable_irq() have an unfortunate side effect, forcing a memory barrier on the compiler, so I've tried to come up with something more fine-grained
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
uint64_t y;
asm ( "cpsid i\n"
"ldrd %[value], %[addr]\n"
"cpsie i\n"
: [value]"=r"(y) : [addr]"m"(*x));
return y;
}
It still disables interrupts, which is not desirable, so I'm wondering if there's a way doing it without resorting to cpsid. The Definitive Guide to
ARM Cortex-M3 and Cortex-M4 Processors, Third Edition by Joseph Yiu says
If an interrupt request arrives when the processor is executing a
multiple cycle instruction, such as an integer divide, the instruction
could be abandoned and restarted after the interrupt handler
completes. This behavior also applies to load double-word (LDRD) and
store double-word (STRD) instructions.
Does it mean that I'll be fine by simply writing this?
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
uint64_t y;
asm ( "ldrd %[value], %[addr]\n"
: [value]"=&r"(y) : [addr]"m"(*x));
return y;
}
(Using "=&r" to work around ARM errata 602117)
Is there some library or builtin function that does the same portably? I've tried atomic_load() in stdatomic.h, but it fails with undefined reference to '__atomic_load_8'.

Yes, using a simple ldrd is safe in this application since it will be restarted (not resumed) if interrupted, hence it will appear atomic from the interrupt handler's point of view.
This holds more generally for all load instructions except those that are exception-continuable, which are a very restricted subset:
only ldm, pop, vldm, and vpop can be continuable
an instruction inside an it-block is never continuable
an ldm/pop whose first loaded register is also the base register (e.g. ldm r0, { r0, r1 }) is never continuable
This gives plenty of options for atomically reading a multi-word variable that's modified by an interrupt handler on the same core. If the data you wish to read is not a contiguous array of words then you can do something like:
1: ldrex %[val0], [%[ptr]] // can also be byte/halfword
... more loads here ...
strex %[retry], %[val0], [%[ptr]]
cbz %[retry], 2f
b 1b
2:
It doesn't really matter which word (or byte/halfword) you use for the ldrex/strex since an exception will perform an implicit clrex.
The other direction, writing a variable that's read by an interrupt handler is a lot harder. I'm not 100% sure but I think the only stores that are guaranteed to appear atomic to an interrupt handler are those that are "single-copy atomic", i.e. single byte, aligned halfword, and aligned word. Anything bigger would require disabling interrupts or using some clever lock-free structure.

Atomicity is not guaranteed on LDRD according to the ARMv7m reference manual. (A3.5.1)
The only ARMv7-M explicit accesses made by the ARM processor which exhibit single-copy atomicity are:
• All byte transactions
• All halfword transactions to 16-bit aligned locations
• All word transactions to 32-bit aligned locations
LDM, LDC, LDRD, STM, STC, STRD, PUSH and POP operations are seen to be a sequence of 32-bit
transactions aligned to 32 bits. Each of these 32-bit transactions are guaranteed to exhibit single-copy
atomicity. Sub-sequences of two or more 32-bit transactions from the sequence also do not exhibit
single-copy atomicity
What you can do is use a byte to indicate to the ISR you're reading it.
non_isr(){
do{
flag = 1
foo = doubleword
while(flag > 1)
flag = 0
}
isr(){
if(flag == 1)
flag++;
doubleword = foo
}
Source (login required):
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403e.b/index.html
Login not required:
http://www.telecom.uff.br/~marcos/uP/ARMv7_Ref.pdf

I was also trying to use a 64-bit (2 x 32-bit) system_tick, but on an STM32L4xx (ARM cortex M3). I found that when I tried to use just "volatile uint64_t system_tick", compiler injected assembly instruction LDRD, which may have been enough, since getting interrupted after reading the first word is supposed to cause both words to be read again.
I asked the tech at IAR software support and he responded that I should use C11 atomics;
#include "stdatomic.h"
#ifdef __STDC_NO_ATOMICS__
static_assert(__STDC_NO_ATOMICS__ != 1);
#endif
volatile atomic_uint_fast64_t system_tick;
/**
* \brief Increment system_timer
* \retval none
*/
void HAL_IncTick(void)
{
system_tick++;
}
/**
* \brief Read 64-bit system_tick
* \retval system_tick
*/
uint64_t HAL_GetSystemTick(void)
{
return system_tick;
}
/**
* \brief Read 32 least significant bits of system_tick
* \retval (uint64_t) system_tick
*/
uint32_t HAL_GetTick(void)
{
return (uint32_t)system_tick;
}
But what I found was a colossal amount of code was added to make the read "atomic".
Way back in the day of 8-bit micro-controllers, the trick was to read the high byte, read the low byte, then read the high byte until the high byte was the same twice - proving that there was no rollover created by the ISR. So if you are against disabling IRQ, reading system_tick, then enabling IRQ, try this trick:
/**
* \brief Read 64-bit system_tick
* \retval system_tick
*/
uint64_t HAL_GetSystemTick(void)
{
uint64_t tick;
do {
tick = system_tick;
} while ((uint32_t)(system_tick >> 32) != (uint32_t)(tick >> 32));
return tick;
}
The idea is that if the most significant word does not roll over, then then whole 64-bit system_timer must be valid. If HAL_IncTick() did anything more than a simple increment, this assertion would not be possible.

Related

Making data reads/writes atomic in C11 GCC using <stdatomic.h>?

I have learned from SO threads here and here, among others, that it is not safe to assume that reads/writes of data in multithreaded applications are atomic at the OS/hardware level, and corruption of data may result. I would like to know the simplest way of making reads and writes of int variables atomic, using the <stdatomic.h> C11 library with the GCC compiler on Linux.
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
For C11 atomics you don't even have to use functions. If your implementation (= compiler) supports atomics you can just add an atomic specifier to a variable declaration and then subsequently all operations on that are atomic:
_Atomic(int) toto = 65;
...
toto += 2; // is an atomic read-modify-write operation
...
if (toto == 67) // is an atomic read of toto
Atomics have their price (they need much more computing resources) but as long as you use them scarcely they are the perfect tool to synchronize threads.
If I currently have an int assignment in a thread: messageBox[i] = 2, how do I make this assignment atomic? Equally for a reading test, like if (messageBox[i] == 2).
You almost never have to do anything. In almost every case, the data which your threads share (or communicate with) are protected from concurrent access via such things as mutexes, semaphores and the like. The implementation of the base operations ensure the synchronization of memory.
The reason for these atomics is to help you construct safer race conditions in your code. There are a number of hazards with them; including:
ai += 7;
would use an atomic protocol if ai were suitably defined. Trying to decipher race conditions is not aided by obscuring the implementation.
There is also a highly machine dependent portion to them. The line above, for example, could fail [1] on some platforms, but how is that failure communicated back to the program? It is not [2].
Only one operation has the option of dealing with failure; atomic_compare_exchange_(weak|strong). Weak just tries once, and lets the program choose how and whether to retry. Strong retries endlessly. It isn't enough to just try once -- spurious failures due to interrupts can occur -- but endless retries on a non-spurious failure is no good either.
Arguably, for robust programs or widely applicable libraries, the only bit of you should use is atomic_compare_exchange_weak().
[1] Load-linked, store-conditional (ll-sc) is a common means for making atomic transactions on asynchronous bus architectures. The load-linked sets a little flag on a cache line, which will be cleared if any other bus agent attempts to modify that cache line. Store-conditional stores a value iff the little flag is set in the cache, and clears the flag; iff the flag is cleared, Store-conditional signals an error, so an appropriate retry operation can be attempted. From these two operations, you can construct any atomic operation you like on a completely asynchronous bus architecture.
ll-sc can have subtle dependencies on the caching attributes of the location. Permissible cache attributes are platform dependent, as is which operations may be performed between the ll and sc.
If you put an ll-sc operation on a poorly cached access, and blindly retry, your program will lock up. This isn't just speculation; I had to debug one of these on an ARMv7-based "safe" system.
[2]:
#include <stdatomic.h>
int f(atomic_int *x) {
return (*x)++;
}
f:
dmb ish
.L2:
ldrex r3, [r0]
adds r2, r3, #1
strex r1, r2, [r0]
cmp r1, #0
bne .L2 /* note the retry loop */
dmb ish
mov r0, r3
bx lr
The most portable way is to use one of the C11 atomic variables. You can also use a spinlock atomic operation to guard non-atomic variables. Here is a simple pthread produce/consumer example to play with, modify as desired. Notice that the cnt_non and cnt_vol can be corrupted.
atomic_uint cnt_atomic;
int cnt_non;
volatile int cnt_vol;
typedef atomic_uint lock_t;
lock_t lockholder = 0;
#define LOCK_C 0x01
#define LOCK_P 0x02
int cnt_lock; /* not atomic on purpose to test spinlock */
atomic_int lock_held_c, lock_held_p;
void
lock(lock_t *bitarrayp, uint32_t desired)
{
uint32_t expected = 0; /* lock is not held */
/* the value in expected is updated if it does not match
* the value in bitarrayp. If the comparison fails then compare
* the returned value with the lock bits and update the appropriate
* counter.
*/
do {
if (expected & LOCK_P) lock_held_p++;
if (expected & LOCK_C) lock_held_c++;
expected = 0;
} while(!atomic_compare_exchange_weak(bitarrayp, &expected, desired));
}
void
unlock(lock_t *bitarrayp)
{
*bitarrayp = 0;
}
void*
fn_c(void *thr_data)
{
(void)thr_data;
for (int i=0; i<40000; i++) {
cnt_atomic++;
cnt_non++;
cnt_vol++;
/* lock, increment, unlock */
lock(&lockholder, LOCK_C);
cnt_lock++;
unlock(&lockholder);
}
return NULL;
}
void*
fn_p(void *thr_data)
{
(void)thr_data;
for (int i=0; i<30000; i++) {
cnt_atomic++;
cnt_non++;
cnt_vol++;
/* lock, increment, unlock */
lock(&lockholder, LOCK_P);
cnt_lock++;
unlock(&lockholder);
}
return NULL;
}
void
drv_pc(void)
{
pthread_t thr[2];
pthread_create(&thr[0], NULL, fn_c, NULL);
pthread_create(&thr[1], NULL, fn_p, NULL);
for(int n = 0; n < 2; ++n)
pthread_join(thr[n], NULL);
printf("cnt_a=%d, cnt_non=%d cnt_vol=%d\n", cnt_atomic, cnt_non, cnt_vol);
printf("lock %d held_c=%d held_p=%d\n", cnt_lock, lock_held_c, lock_held_p);
}
that it is not safe to assume that reads/writes of data in
multithreaded applications are atomic at the OS/hardware level, and
corruption of data may result
Actually non composite operations on types like int are atomic on all reasonable architecture. What you read is simply a hoax.
(An increment is a composite operation: it has a read, a calculation, and a write component. Each component is atomic but the whole composite operation is not.)
But atomicity at the hardware level isn't the issue. The high level language you use simply doesn't support that kind of manipulations on regular types. You need to use atomic types to even have the right to manipulate objects in such a way that the question of atomicity is relevant: when you are potentially modifying an object in use in another thread.
(Or volatile types. But don't use volatile. Use atomics.)

How to do computations with addresses at compile/linking time?

I wrote some code for initializing the IDT, which stores 32-bit addresses in two non-adjacent 16-bit halves. The IDT can be stored anywhere, and you tell the CPU where by running the LIDT instruction.
This is the code for initializing the table:
void idt_init(void) {
/* Unfortunately, we can't write this as loops. The first option,
* initializing the IDT with the addresses, here looping over it, and
* reinitializing the descriptors didn't work because assigning a
* a uintptr_t (from (uintptr_t) handler_func) to a descr (a.k.a.
* uint64_t), according to the compiler, "isn't computable at load
* time."
* The second option, storing the addresses as a local array, simply is
* inefficient (took 0.020ms more when profiling with the "time" command
* line program!).
* The third option, storing the addresses as a static local array,
* consumes too much space (the array will probably never be used again
* during the whole kernel runtime).
* But IF my argument against the third option will be invalidated in
* the future, THEN it's the best option I think. */
/* Initialize descriptors of exception handlers. */
idt[EX_DE_VEC] = idt_trap(ex_de);
idt[EX_DB_VEC] = idt_trap(ex_db);
idt[EX_NMI_VEC] = idt_trap(ex_nmi);
idt[EX_BP_VEC] = idt_trap(ex_bp);
idt[EX_OF_VEC] = idt_trap(ex_of);
idt[EX_BR_VEC] = idt_trap(ex_br);
idt[EX_UD_VEC] = idt_trap(ex_ud);
idt[EX_NM_VEC] = idt_trap(ex_nm);
idt[EX_DF_VEC] = idt_trap(ex_df);
idt[9] = idt_trap(ex_res); /* unused Coprocessor Segment Overrun */
idt[EX_TS_VEC] = idt_trap(ex_ts);
idt[EX_NP_VEC] = idt_trap(ex_np);
idt[EX_SS_VEC] = idt_trap(ex_ss);
idt[EX_GP_VEC] = idt_trap(ex_gp);
idt[EX_PF_VEC] = idt_trap(ex_pf);
idt[15] = idt_trap(ex_res);
idt[EX_MF_VEC] = idt_trap(ex_mf);
idt[EX_AC_VEC] = idt_trap(ex_ac);
idt[EX_MC_VEC] = idt_trap(ex_mc);
idt[EX_XM_VEC] = idt_trap(ex_xm);
idt[EX_VE_VEC] = idt_trap(ex_ve);
/* Initialize descriptors of reserved exceptions.
* Thankfully we compile with -std=c11, so declarations within
* for-loops are possible! */
for (size_t i = 21; i < 32; ++i)
idt[i] = idt_trap(ex_res);
/* Initialize descriptors of hardware interrupt handlers (ISRs). */
idt[INT_8253_VEC] = idt_int(int_8253);
idt[INT_8042_VEC] = idt_int(int_8042);
idt[INT_CASC_VEC] = idt_int(int_casc);
idt[INT_SERIAL2_VEC] = idt_int(int_serial2);
idt[INT_SERIAL1_VEC] = idt_int(int_serial1);
idt[INT_PARALL2_VEC] = idt_int(int_parall2);
idt[INT_FLOPPY_VEC] = idt_int(int_floppy);
idt[INT_PARALL1_VEC] = idt_int(int_parall1);
idt[INT_RTC_VEC] = idt_int(int_rtc);
idt[INT_ACPI_VEC] = idt_int(int_acpi);
idt[INT_OPEN2_VEC] = idt_int(int_open2);
idt[INT_OPEN1_VEC] = idt_int(int_open1);
idt[INT_MOUSE_VEC] = idt_int(int_mouse);
idt[INT_FPU_VEC] = idt_int(int_fpu);
idt[INT_PRIM_ATA_VEC] = idt_int(int_prim_ata);
idt[INT_SEC_ATA_VEC] = idt_int(int_sec_ata);
for (size_t i = 0x30; i < IDT_SIZE; ++i)
idt[i] = idt_trap(ex_res);
}
The macros idt_trap and idt_int, and are defined as follows:
#define idt_entry(off, type, priv) \
((descr) (uintptr_t) (off) & 0xffff) | ((descr) (KERN_CODE & 0xff) << \
0x10) | ((descr) ((type) & 0x0f) << 0x28) | ((descr) ((priv) & \
0x03) << 0x2d) | (descr) 0x800000000000 | \
((descr) ((uintptr_t) (off) & 0xffff0000) << 0x30)
#define idt_int(off) idt_entry(off, 0x0e, 0x00)
#define idt_trap(off) idt_entry(off, 0x0f, 0x00)
idt is an array of uint64_t, so these macros are implicitly cast to that type. uintptr_t is the type guaranteed to be capable of holding pointer values as integers and on 32-bit systems usually 32 bits wide. (A 64-bit IDT has 16-byte entries; this code is for 32-bit).
I get the warning that the initializer element is not constant due to the address modification in play.
It is absolutely sure that the address is known at linking time.
Is there anything I can do to make this work? Making the idt array automatic would work but this would require the whole kernel to run in the context of one function and this would be some bad hassle, I think.
I could make this work by some additional work at runtime (as Linux 0.01 also does) but it just annoys me that something technically feasible at linking time is actually infeasible.
Related: Solution needed for building a static IDT and GDT at assemble/compile/link time - a linker script for ld can shift and mask to break up link-time-constant addresses. No earlier step has the final addresses, and relocation entries are limited in what they can represent in a .o.
The main problem is that function addresses are link-time constants, not strictly compile time constants. The compiler can't just get 32b binary integers and stick that into the data segment in two separate pieces. Instead, it has to use the object file format to indicate to the linker where it should fill in the final value (+ offset) of which symbol when linking is done. The common cases are as an immediate operand to an instruction, a displacement in an effective address, or a value in the data section. (But in all those cases it's still just filling in 32-bit absolute address so all 3 use the same ELF relocation type. There's a different relocation for relative displacements for jump / call offsets.)
It would have been possible for ELF to have been designed to store a symbol reference to be substituted at link time with a complex function of an address (or at least high / low halves like on MIPS for lui $t0, %hi(symbol) / ori $t0, $t0, %lo(symbol) to build address constants from two 16-bit immediates). But in fact the only function allowed is addition/subtraction, for use in things like mov eax, [ext_symbol + 16].
It is of course possible for your OS kernel binary to have a static IDT with fully resolved addresses at build time, so all you need to do at runtime is execute a single lidt instruction. However, the standard
build toolchain is an obstacle. You probably can't achieve this without post-processing your executable.
e.g. you could write it this way, to produce a table with the full padding in the final binary, so the data can be shuffled in-place:
#include <stdint.h>
#define PACKED __attribute__((packed))
// Note, this is the 32-bit format. 64-bit is larger
typedef union idt_entry {
// we will postprocess the linker output to have this format
// (or convert at runtime)
struct PACKED runtime { // from OSdev wiki
uint16_t offset_1; // offset bits 0..15
uint16_t selector; // a code segment selector in GDT or LDT
uint8_t zero; // unused, set to 0
uint8_t type_attr; // type and attributes, see below
uint16_t offset_2; // offset bits 16..31
} rt;
// linker output will be in this format
struct PACKED compiletime {
void *ptr; // offset bits 0..31
uint8_t zero;
uint8_t type_attr;
uint16_t selector; // to be swapped with the high16 of ptr
} ct;
} idt_entry;
// #define idt_ct_entry(off, type, priv) { .ptr = off, .type_attr = type, .selector = priv }
#define idt_ct_trap(off) { .ct = { .ptr = off, .type_attr = 0x0f, .selector = 0x00 } }
// generate an entry in compile-time format
extern void ex_de(); // these are the raw interrupt handlers, written in ASM
extern void ex_db(); // they have to save/restore *all* registers, and end with iret, rather than the usual C ABI.
// it might be easier to use asm macros to create this static data,
// just so it can be in the same file and you don't need cross-file prototypes / declarations
// (but all the same limitations about link-time constants apply)
static idt_entry idt[] = {
idt_ct_trap(ex_de),
idt_ct_trap(ex_db),
// ...
};
// having this static probably takes less space than instructions to write it on the fly
// but not much more. It would be easy to make a lidt function that took a struct pointer.
static const struct PACKED idt_ptr {
uint16_t len; // encoded as bytes - 1, so 0xffff means 65536
void *ptr;
} idt_ptr = { sizeof(idt) - 1, idt };
/****** functions *********/
// inline
void load_static_idt(void) {
asm volatile ("lidt %0"
: // no outputs
: "m" (idt_ptr));
// memory operand, instead of writing the addressing mode ourself, allows a RIP-relative addressing mode in 64bit mode
// also allows it to work with -masm=intel or not.
}
// Do this once at at run-time
// **OR** run this to pre-process the binary, after link time, as part of your build
void idt_convert_to_runtime(void) {
#ifdef DEBUG
static char already_done = 0; // make sure this only runs once
if (already_done)
error;
already_done = 1;
#endif
const int count = sizeof idt / sizeof idt[0];
for (int i=0 ; i<count ; i++) {
uint16_t tmp1 = idt[i].rt.selector;
uint16_t tmp2 = idt[i].rt.offset_2;
idt[i].rt.offset_2 = tmp1;
idt[i].rt.selector = tmp2;
// or do this swap in fewer insns with SSE or MMX pshufw, but using vector instructions before setting up the IDT may be insane.
}
}
This does compile. See a diff of the -m32 and -m64 asm output on the Godbolt compiler explorer. Look at the layout in the data section (note that .value is a synonym for .short, and is 16 bits.) (But note that the IDT table format is different for 64-bit mode.)
I think I have the size calculation correct (bytes - 1), as documented in http://wiki.osdev.org/Interrupt_Descriptor_Table. Minimum value 100h bytes long (encoded as 0x99). See also https://en.wikibooks.org/wiki/X86_Assembly/Global_Descriptor_Table. (lgdt size/pointer works the same way, although the table itself has a different format.)
The other option, instead of having the IDT static in the data section, is to have it in the bss section, with the data stored as immediate constants in a function that will initialize it (or in an array read by that function).
Either way, that function (and its data) can be in a .init section whose memory you re-use after it's done. (Linux does this to reclaim memory from code and data that's only needed once, at startup.) This would give you the optimal tradeoff of small binary size (since 32b addresses are smaller than 64b IDT entries), and no runtime memory wasted on code to set up the IDT. A small loop that runs once at startup is negligible CPU time. (The version on Godbolt fully unrolls because I only have 2 entries, and it embeds the address into each instruction as a 32-bit immediate, even with -Os. With a large enough table (just copy/paste to duplicate a line) you get a compact loop even at -O3. The threshold is lower for -Os.)
Without memory-reuse haxx, probably a tight loop to rewrite 64b entries in place is the way to go. Doing it at build time would be even better, but then you'd need a custom tool to run the tranformation on the kernel binary.
Having the data stored in immediates sounds good in theory, but the code for each entry would probably total more than 64b, because it couldn't loop. The code to split an address into two would have to be fully unrolled (or placed in a function and called). Even if you had a loop to store all the same-for-multiple-entries stuff, each pointer would need a mov r32, imm32 to get the address in a register, then mov word [idt+i + 0], ax / shr eax, 16 / mov word [idt+i + 6], ax. That's a lot of machine-code bytes.
One way would be to use an intermediate jump table that is located at a fixed address. You could initialize the idt with addresses of the locations in this table (which will be compile time constant). The locations in the jump table will contain jump instructions to the actual isr routines.
The dispatch to an isr will be indirect as follows:
trap -> jump to intermediate address in the idt -> jump to isr
One way to create a jump table at a fixed address is as follows.
Step 1: Put jump table in a section
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can also be asm("call ...") depending on need
asm("jmp isr1");
asm("jmp isr2");
}
Step 2: Instruct the linker to locate the section at a fixed address
SECTIONS
{
.so.idt 0x600000 :
{
*(.si.idt)
}
}
Put this in the linker script right after the .text section. This will make sure that the executable code in the table will go into a executable memory region.
You can instruct the linker to use your script as follows using the --script option in the Makefile.
LDFLAGS += -Wl,--script=my_script.lds
The following macro gives you the address of the location which contains the jump (or call) instruction to the corresponding isr.
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
You will then initialize the idt as follows using modified macros.
// your real idt will be initialized as follows
#define idt_entry(addr, type, priv) \
( \
((descr) (uintptr_t) (addr) & 0xffff) | \
((descr) (KERN_CODE & 0xff) << 0x10) | \
((descr) ((type) & 0x0f) << 0x28) | \
((descr) ((priv) & 0x03) << 0x2d) | \
((descr) 0x1 << 0x2F) | \
((descr) ((uintptr_t) (addr) & 0xffff0000) << 0x30) \
)
#define idt_int(off) idt_entry(JUMP_ADDR(off), 0x0e, 0x00)
#define idt_trap(off) idt_entry(JUMP_ADDR(off), 0x0f, 0x00)
descr idt[] =
{
...
idt_trap(ex_de),
...
idt_int(int_casc),
...
};
A demo working example is below, which shows dispatch to a isr with a non-fixed address from a instruction at a fixed address.
#include <stdio.h>
// dummy isrs for demo
void isr0(void) {
printf("==== isr0\n");
}
void isr1(void) {
printf("==== isr1\n");
}
void isr2(void) {
printf("==== isr2\n");
}
// this is a jump table at a fixed address
void jump(void) __attribute__((section(".si.idt")));
void jump(void) {
asm("jmp isr0"); // can be asm("call ...")
asm("jmp isr1");
asm("jmp isr2");
}
// initialize the idt at compile time with const values
// you can find a cleaner way to generate offsets
#define JUMP_ADDR(off) ((char*)0x600000 + 4 + (off * 5))
// dummy idt for demo
// see below for the real idt
char* idt[] =
{
JUMP_ADDR(0),
JUMP_ADDR(1),
JUMP_ADDR(2),
};
int main(int argc, char* argv[]) {
int trap;
char* addr = idt[trap = argc - 1];
printf("==== idt[%d]=%p\n", trap, addr);
asm("jmp *%0\n" : :"m"(addr));
}

does read / write atomicity hold on multi-computer memory?

I'm working in a multi-computer environment where 2 computers will access the same memory over a 32bit PCI bus.
The first computer will only write to the 32bit int.
*int_pointer = number;
The second computer will only read from the 32bit int.
number = *int_pointer;
Both OS's / CPU's are 32bit architecture.
computer with PCI is Intel Based.
computer on the PCI card is power PC.
The case I am worried about is if the write only computer is changing the variable at the same time the read computer is reading it, causing invalid data for the read computer.
I'd like to know if the atomicity of reading / writing to the same position in memory is preserved over multiple computers.
If so, would the following prevent the race condition:
number = *int_pointer;
while(*int_pointer != number) {
number = *int_pointer;
}
I can guarantee that writes will occur every 16ms*, and reads will occur randomly.
*times will drift since both computers have different timers.
There isn't enough information to answer this definitively.
If either of the CPUs decide to cache the memory at *int_pointer, this will always fail and your extra code will not fix any race condition.
Assuming both CPUs know that memory at *int_pointer is uncacheable, AND that location *int_pointer is aligned on a 4-byte/32-bit boundary AND the CPU on the PCI card has a 32 bit memory interface, AND the pointer is declared as a pointer to volatile AND both your C compilers implement volatile correctly, THEN it is likely that the updates will be atomic.
If any of the above conditions are not met, the result will be unpredictable and your "race detection" code is not likely to work.
Edited to explain why volatile is needed:
Here is your race condition code compiled to MIPS assembler with -O4 and no volatile qualifier. (I used MIPS because the generated code is easier to read than x86 code):
int nonvol(int *ip) {
int number = *ip;
while (*ip != number) {
number = *ip;
}
return number;
}
Disassembly output:
00000000 <nonvol>:
0: 8c820000 lw v0,0(a0)
4: 03e00008 jr ra
The compiler has optimized the while loop away since it knows that *ip cannot change.
Here's what happens with volatile and the same compiler options:
int vol(volatile int *ip) {
int number = *ip;
while (*ip != number) {
number = *ip;
}
return number;
}
Disassembly output:
00000008 <vol>:
8: 8c820000 lw v0,0(a0)
c: 8c830000 lw v1,0(a0)
10: 1443fffd bne v0,v1,8 <vol>
14: 00000000 nop
18: 03e00008 jr ra
Now the while loop is not optimized away because using volatile has told the compiler that *ip can change at any time.
To answer my own question:
There is no race condition. The memory controller on the external PCI device will handle all memory read/write operations. Since all data carriers are at least 32bits wide there will not be a race condition within those 32bits.
Since the transfers use a DMA controller the interactions between memory and processors is well behaved.
See Direct Memory Access -PCI

Can this be atomically executed?

I would like to know whether it is possible to ensure line is atomically executed, given that it could be executed by both the ISR and Main context. I'm working on an ARM9 (LPC313x) and using RealView 4 (armcc).
foo() {
..
stack_var = ++volatile_var; // line
..
}
I'm looking for any routine like _atomic_ for C166, direct assembly code, etc. I would prefer not to have to disable the interrupts.
Thank you very much.
No, I don't think that you ever can expect ++volatile_var to be atomic, even if you don't have the assignment. Use a proper atomic primitive for that. If your compiler doesn't provide such an extension you easily find short inline assembler for that on the web. The assembler instructions are call ldrex and strex for atomic exchange on arm, I think.
Edit: it seems that the specific processor type that is asked for in the question does not implement these instructions.
Edit: The following should work with gcc, for another compiler one probably has to adapt the __asm__ parts.
inline
size_t arm_ldrex(size_t volatile*ptr) {
size_t ret;
__asm__ volatile ("ldrex %0,[%1]\t# load exclusive\n"
: "=&r" (ret)
: "r" (ptr)
: "cc", "memory"
);
return ret;
}
inline
_Bool arm_strex(size_t volatile*ptr, size_t val) {
size_t error;
__asm__ volatile ("strex %0,%1,[%2]\t# store exclusive\n"
: "=&r" (error)
: "r" (val), "r" (ptr)
: "cc", "memory"
);
return !error;
}
inline
size_t atomic_add_fetch(size_t volatile *object, size_t operand) {
for (;;) {
size_t oldval = arm_ldrex(object);
size_t newval = oldval + operand;
if (arm_strex(object, newval)) return newval;
}
}
From a quick look, the C166 _atomic_ macro seems to utilize an instruction that effectively masks interrupts for the duration of a specified number of instructions.
There is nothing directly corresponding to that in the ARM architecture.
You could of course use the swp instruction (or __swp intrinsic in the RealView toolchain) to implement a lock around the critical section. ldrex/strex mentioned in another answer do not exist in ARM architecture version 5, which includes the ARM9 processors.
http://infocenter.arm.com/help/topic/com.arm.doc.dui0491c/CJAHDCHB.html and http://infocenter.arm.com/help/topic/com.arm.doc.dui0489c/Chdbbbai.html respectively.
A simplistic lock implementation around this (using the RealView toolchain) would be:
{
/* Loop until lock acquired */
while (__swp(LOCKED, &lockvar) == LOCKED);
..
/* Critical section */
..
lockvar = UNLOCKED;
}
However, this will lead to deadlock in the ISR context when the Main thread is holding the lock.
I think masking interrupts around the operation is likely to be the least hairy solution, although if your Main context is executing in User mode it will require a system call to implement.

Calculating CPU frequency in C with RDTSC always returns 0

The following piece of code was given to us from our instructor so we could measure some algorithms performance:
#include <stdio.h>
#include <unistd.h>
static unsigned cyc_hi = 0, cyc_lo = 0;
static void access_counter(unsigned *hi, unsigned *lo) {
asm("rdtsc; movl %%edx,%0; movl %%eax,%1"
: "=r" (*hi), "=r" (*lo)
: /* No input */
: "%edx", "%eax");
}
void start_counter() {
access_counter(&cyc_hi, &cyc_lo);
}
double get_counter() {
unsigned ncyc_hi, ncyc_lo, hi, lo, borrow;
double result;
access_counter(&ncyc_hi, &ncyc_lo);
lo = ncyc_lo - cyc_lo;
borrow = lo > ncyc_lo;
hi = ncyc_hi - cyc_hi - borrow;
result = (double) hi * (1 << 30) * 4 + lo;
return result;
}
However, I need this code to be portable to machines with different CPU frequencies. For that, I'm trying to calculate the CPU frequency of the machine where the code is being run like this:
int main(void)
{
double c1, c2;
start_counter();
c1 = get_counter();
sleep(1);
c2 = get_counter();
printf("CPU Frequency: %.1f MHz\n", (c2-c1)/1E6);
printf("CPU Frequency: %.1f GHz\n", (c2-c1)/1E9);
return 0;
}
The problem is that the result is always 0 and I can't understand why. I'm running Linux (Arch) as guest on VMware.
On a friend's machine (MacBook) it is working to some extent; I mean, the result is bigger than 0 but it's variable because the CPU frequency is not fixed (we tried to fix it but for some reason we are not able to do it). He has a different machine which is running Linux (Ubuntu) as host and it also reports 0. This rules out the problem being on the virtual machine, which I thought it was the issue at first.
Any ideas why this is happening and how can I fix it?
Okay, since the other answer wasn't helpful, I'll try to explain on more detail. The problem is that a modern CPU can execute instructions out of order. Your code starts out as something like:
rdtsc
push 1
call sleep
rdtsc
Modern CPUs do not necessarily execute instructions in their original order though. Despite your original order, the CPU is (mostly) free to execute that just like:
rdtsc
rdtsc
push 1
call sleep
In this case, it's clear why the difference between the two rdtscs would be (at least very close to) 0. To prevent that, you need to execute an instruction that the CPU will never rearrange to execute out of order. The most common instruction to use for that is CPUID. The other answer I linked should (if memory serves) start roughly from there, about the steps necessary to use CPUID correctly/effectively for this task.
Of course, it's possible that Tim Post was right, and you're also seeing problems because of a virtual machine. Nonetheless, as it stands right now, there's no guarantee that your code will work correctly even on real hardware.
Edit: as to why the code would work: well, first of all, the fact that instructions can be executed out of order doesn't guarantee that they will be. Second, it's possible that (at least some implementations of) sleep contain serializing instructions that prevent rdtsc from being rearranged around it, while others don't (or may contain them, but only execute them under specific (but unspecified) circumstances).
What you're left with is behavior that could change with almost any re-compilation, or even just between one run and the next. It could produce extremely accurate results dozens of times in a row, then fail for some (almost) completely unexplainable reason (e.g., something that happened in some other process entirely).
I can't say for certain what exactly is wrong with your code, but you're doing quite a bit of unnecessary work for such a simple instruction. I recommend you simplify your rdtsc code substantially. You don't need to do 64-bit math carries your self, and you don't need to store the result of that operation as a double. You don't need to use separate outputs in your inline asm, you can tell GCC to use eax and edx.
Here is a greatly simplified version of this code:
#include <stdint.h>
uint64_t rdtsc() {
uint64_t ret;
# if __WORDSIZE == 64
asm ("rdtsc; shl $32, %%rdx; or %%rdx, %%rax;"
: "=A"(ret)
: /* no input */
: "%edx"
);
#else
asm ("rdtsc"
: "=A"(ret)
);
#endif
return ret;
}
Also you should consider printing out the values you're getting out of this so you can see if you're getting out 0s, or something else.
As for VMWare, take a look at the time keeping spec (PDF Link), as well as this thread. TSC instructions are (depending on the guest OS):
Passed directly to the real hardware (PV guest)
Count cycles while the VM is executing on the host processor (Windows / etc)
Note, in #2 the while the VM is executing on the host processor. The same phenomenon would go for Xen, as well, if I recall correctly. In essence, you can expect that the code should work as expected on a paravirtualized guest. If emulated, its entirely unreasonable to expect hardware like consistency.
You forgot to use volatile in your asm statement, so you're telling the compiler that the asm statement produces the same output every time, like a pure function. (volatile is only implicit for asm statements with no outputs.)
This explains why you're getting exactly zero: the compiler optimized end-start to 0 at compile time, through CSE (common-subexpression elimination).
See my answer on Get CPU cycle count? for the __rdtsc() intrinsic, and #Mysticial's answer there has working GNU C inline asm, which I'll quote here:
// prefer using the __rdtsc() intrinsic instead of inline asm at all.
uint64_t rdtsc(){
unsigned int lo,hi;
__asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
return ((uint64_t)hi << 32) | lo;
}
This works correctly and efficiently for 32 and 64-bit code.
hmmm I'm not positive but I suspect the problem may be inside this line:
result = (double) hi * (1 << 30) * 4 + lo;
I'm suspicious if you can safely carry out such huge multiplications in an "unsigned"... isn't that often a 32-bit number? ...just the fact that you couldn't safely multiply by 2^32 and had to append it as an extra "* 4" added to the 2^30 at the end already hints at this possibility... you might need to convert each sub-component hi and lo to a double (instead of a single one at the very end) and do the multiplication using the two doubles

Resources