Does read/write atomicity hold on multi-computer memory? (C)

I'm working in a multi-computer environment where two computers will access the same memory over a 32-bit PCI bus.
The first computer will only write to the 32-bit int:
*int_pointer = number;
The second computer will only read from the 32-bit int:
number = *int_pointer;
Both OSes and CPUs are 32-bit architectures.
The computer hosting the PCI bus is Intel-based.
The computer on the PCI card is a PowerPC.
The case I am worried about is the write-only computer changing the variable at the same time the read-only computer is reading it, leaving the reader with invalid data.
I'd like to know whether the atomicity of reading/writing the same memory location is preserved across multiple computers.
If so, would the following prevent the race condition:
number = *int_pointer;
while (*int_pointer != number) {
    number = *int_pointer;
}
I can guarantee that writes will occur every 16 ms*, and reads will occur randomly.
*Times will drift since the two computers have different timers.

There isn't enough information to answer this definitively.
If either of the CPUs decides to cache the memory at *int_pointer, this will always fail and your extra code will not fix the race condition.
Assuming all of the following hold, then it is likely that the updates will be atomic:
- both CPUs know that the memory at *int_pointer is uncacheable,
- the location *int_pointer is aligned on a 4-byte/32-bit boundary,
- the CPU on the PCI card has a 32-bit memory interface,
- the pointer is declared as a pointer to volatile, and
- both your C compilers implement volatile correctly.
If any of the above conditions are not met, the result will be unpredictable and your "race detection" code is not likely to work.
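A minimal sketch of the access pattern those conditions imply: volatile-qualified, aligned 32-bit accesses on both sides. A plain static variable stands in for the memory-mapped PCI window so the sketch is self-contained; on real hardware `int_pointer` would be set to the uncacheable, 4-byte-aligned bus address instead.

```c
#include <stdint.h>

/* Stand-in for the memory-mapped PCI window; on real hardware this would be
 * the (uncacheable, 4-byte-aligned) bus address instead of a local variable. */
static uint32_t shared_word;

static volatile uint32_t *const int_pointer = &shared_word;

/* Writer side (the computer that only writes). */
void writer_side(uint32_t number)
{
    *int_pointer = number;   /* single aligned 32-bit store */
}

/* Reader side (the computer that only reads). */
uint32_t reader_side(void)
{
    return *int_pointer;     /* single aligned 32-bit load */
}
```

The volatile qualifier matters on both sides; the next section shows what the compiler does without it.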
Edited to explain why volatile is needed:
Here is your race condition code compiled to MIPS assembler with -O4 and no volatile qualifier. (I used MIPS because the generated code is easier to read than x86 code):
int nonvol(int *ip) {
    int number = *ip;
    while (*ip != number) {
        number = *ip;
    }
    return number;
}
Disassembly output:
00000000 <nonvol>:
0: 8c820000 lw v0,0(a0)
4: 03e00008 jr ra
The compiler has optimized the while loop away since it knows that *ip cannot change.
Here's what happens with volatile and the same compiler options:
int vol(volatile int *ip) {
    int number = *ip;
    while (*ip != number) {
        number = *ip;
    }
    return number;
}
Disassembly output:
00000008 <vol>:
8: 8c820000 lw v0,0(a0)
c: 8c830000 lw v1,0(a0)
10: 1443fffd bne v0,v1,8 <vol>
14: 00000000 nop
18: 03e00008 jr ra
Now the while loop is not optimized away because using volatile has told the compiler that *ip can change at any time.

To answer my own question:
There is no race condition. The memory controller on the external PCI device handles all memory read/write operations. Since all data paths are at least 32 bits wide, there will not be a race condition within those 32 bits.
Since the transfers use a DMA controller, the interactions between memory and the processors are well behaved.
See Direct Memory Access - PCI.

Related

static const vs const declaration performance difference on uC

Let's say I have a lookup table, an array of 256 elements, defined and declared in a header named lut.h. The array will be accessed multiple times during the lifetime of the program.
From my understanding, if it's defined and declared as static, it will remain in memory until the program is done, i.e. if it is a task running on a uC, the array is in memory the entire time.
Whereas without static, it will be loaded into memory when accessed.
In lut.h:
static const float array[256] = {1.342, 14.21, 42.312, ...};
vs.
const float array[256] = {1.342, 14.21, 42.312, ...};
Considering the uC has limited spiflash and psram, what would be the most performance-oriented approach?
You have some misconceptions here, since an MCU is not a PC. Everything in memory on an MCU will persist for as long as the MCU has power. Programs do not end or return resources to a hosting OS.
"Tasks" on an MCU mean that you have an RTOS. Tasks use their own stacks, which is a topic of its own, quite unrelated to your question. It is normal for all tasks in an RTOS to execute forever, rather than being allocated/deallocated at run-time like processes on a PC.
static versus automatic at local scope does mean different RAM use, but not necessarily more or less memory use. Local variables get pushed/popped on the stack as the program executes; static ones sit at their designated addresses.
Where as without static, it will be loaded into memory when accessed.
Only if the array you are loading into is declared locally. That is:
void func (void)
{
    int my_local_array[] = {1,2,3};
    ...
}
Here my_local_array will load its values from flash to RAM only during execution of that function. This means two things:
The actual copy from flash to RAM is slow. First of all, copying something is always slow, regardless of the situation. But in the specific case of copying from flash to RAM, it might be extra slow, depending on the MCU.
It will be extra slow on high-end MCUs with flash wait states that fail to utilize a data cache for the copy. It will be extra slow on unusual Harvard-architecture MCUs that can't address data directly. Etc.
So naturally, if you do this copy each time the function is called, instead of just once, your program will become much slower.
Large local objects lead to a need for a larger stack. The stack must be large enough to deal with the worst-case scenario, so if you have large local objects, the stack size will need to be set much higher to prevent stack overflows. This can actually lead to less effective memory use.
So it isn't trivial to tell whether you save or lose memory by making an object local.
General good practice in embedded systems programming is to not allocate large objects on the stack, as they make stack handling much more involved and increase the potential for stack overflow. Such objects should be declared static, at file scope - particularly if speed is important.
static const float array vs const float array
Another misconception here. Making something const in an MCU system, while at the same time placing it at file scope ("global"), most likely means that the variable will end up in flash ROM, not in RAM - regardless of static.
This is usually preferred, since RAM is generally a more valuable resource than flash. The role static plays here is merely good program design: it limits access to the variable to the local translation unit, rather than cluttering up the global namespace.
In lut.h
You should never define variables in header files.
It is bad from a program-design point of view, as you expose the variable all over the place ("spaghetti programming"), and it is bad from a linker point of view if multiple source files include the same header file - which is extremely likely.
Correctly designed programs place the variable in the .c file and limit access by declaring it static. Access from the outside, if needed, is done through setters/getters.
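A sketch of that layout, with illustrative names: the table lives in lut.c as static const and callers go through a getter whose prototype would sit in lut.h.

```c
/* lut.c - hypothetical file layout; names and values are illustrative. */
#include <stddef.h>

static const float lut[256] = { 1.342f, 14.21f, 42.312f /* ... */ };

/* Prototype in lut.h would be: float lut_get(size_t i); */
float lut_get(size_t i)
{
    /* Bounds-check the index; return 0.0f for out-of-range requests. */
    return (i < sizeof lut / sizeof lut[0]) ? lut[i] : 0.0f;
}
```

This keeps exactly one copy of the table in the program and exposes only the getter to other translation units.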
The uC has limited spiflash
What is "spiflash"? An external serial flash memory accessed through SPI? Then none of this makes sense, since such flash memory isn't memory-mapped and the compiler typically can't utilize it. Access to such memories has to be carried out manually by your application.
If your arrays are defined at file scope (you mentioned lut.h) and both have const qualifiers, they will not be loaded into RAM¹. The static keyword only limits the scope of the array; it doesn't change its lifetime in any way. If you check the assembly for your code, you will see that both arrays look exactly the same when compiled:
static const int static_array[] = { 1, 2, 3 };
const int extern_array[] = { 1, 2, 3 };

extern void do_something(const int * a);

int main(void)
{
    do_something(static_array);
    do_something(extern_array);
    return 0;
}
Resulting assembly:
main:
sub rsp, 8
mov edi, OFFSET FLAT:static_array
call do_something
mov edi, OFFSET FLAT:extern_array
call do_something
xor eax, eax
add rsp, 8
ret
extern_array:
.long 1
.long 2
.long 3
static_array:
.long 1
.long 2
.long 3
On the other hand, if you declare the arrays inside a function, then the array will be copied to temporary storage (the stack) for the duration of the function, unless you add the static qualifier:
extern void do_something(const int * a);

int main(void)
{
    static const int static_local_array[] = { 1, 2, 3 };
    const int local_array[] = { 1, 2, 3 };
    do_something(static_local_array);
    do_something(local_array);
    return 0;
}
Resulting assembly:
main:
sub rsp, 24
mov edi, OFFSET FLAT:static_local_array
movabs rax, 8589934593
mov QWORD PTR [rsp+4], rax
mov DWORD PTR [rsp+12], 3
call do_something
lea rdi, [rsp+4]
call do_something
xor eax, eax
add rsp, 24
ret
static_local_array:
.long 1
.long 2
.long 3
¹ More precisely, it depends on the compiler. Some compilers will need additional custom attributes to define exactly where you want to store the data. Some compilers will try to place the array into RAM when there is enough spare space, to allow faster reading.
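As the footnote says, exact placement is toolchain-specific. With a GCC-style compiler it might look like the following; the section name is made up for illustration and would have to match an output section in the real linker script:

```c
/* Hypothetical GCC-style placement attribute; ".lut_flash" is an illustrative
 * section name, not a standard one, and must exist in the linker script on a
 * real MCU build. */
static const int placed_array[] __attribute__((section(".lut_flash"))) = { 1, 2, 3 };

int placed_first(void)
{
    return placed_array[0];
}
```

Other toolchains (IAR, Keil) use their own pragmas or keywords for the same purpose.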

Code execution exploit Cortex M4

For testing the MPU and playing around with exploits, I want to execute code from a local buffer on my STM32F4 dev board.
int main(void)
{
    uint16_t func[] = { 0x0301f103, 0x0301f103, 0x0301f103 };
    MPU->CTRL = 0;
    unsigned int address = (void*)&func+1;
    asm volatile(
        "mov r4,%0\n"
        "ldr pc, [r4]\n"
        :
        : "r"(address)
    );
    while(1);
}
In main, I first turn off the MPU. My instructions are stored in func. In the asm part I load the address (0x2001ffe8, +1 for Thumb) into the program counter register. When stepping through the code with GDB, the correct value is stored in R4 and then transferred to the PC register. But then I end up in the HardFault handler.
Edit:
The stack looks like this:
0x2001ffe8: 0x0301f103 0x0301f103 0x0301f103 0x2001ffe9
The instructions are correct in memory. The Definitive Guide to Cortex says region 0x20000000–0x3FFFFFFF is the SRAM and "this region is executable, so you can copy program code here and execute it".
You are assigning 32-bit values to a 16-bit array.
Your instructions don't terminate; they continue on into whatever is found in RAM, so that will crash.
You are not loading the address of the array into the program counter; you are loading the first item in the array into the program counter. That will crash - you created a level of indirection.
Look at the BX instruction for this rather than ldr pc.
You did not declare the array as static, so the array can be optimized out as dead and unused; this can also cause it to crash.
The compiler should also complain that you are assigning a void* to an unsigned variable, so a typecast is wanted there.
As a habit I recommend address |= 1 rather than address += 1; in this case either will function.
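The last point can be illustrated with plain arithmetic (a toy model, not ARM-specific code): OR-ing in the Thumb bit is idempotent, while adding 1 corrupts an address whose bit 0 is already set.

```c
#include <stdint.h>

/* Toy model of the Thumb-bit advice: OR-ing in bit 0 is idempotent, while
 * adding 1 gives the wrong result if bit 0 is already set. */
uint32_t set_thumb_or(uint32_t addr)  { return addr | 1u; }
uint32_t set_thumb_add(uint32_t addr) { return addr + 1u; }
```

For an even-aligned address the two agree, which is why either works in the question's specific case.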

Program is not returning expected pc registry address for buffer overflow

Using the following program, I am trying to achieve a buffer overflow:
#include <stdio.h>
#include <string.h>

void vuln(){
    char buff[16];
    scanf("%s",buff);
    printf("You entered: %s\n",buff);
}

void secret(){
    printf("You shouldn't be here.");
}

int main(){
    vuln();
    return 0;
}
When entering AAAABBBBCCCCDDDDEEEEFFFFGGGGHHHHIIIIJJJJKKKKLLLLMMMM as the input, the crash log reports that the R15 (PC) register contains the value 0x46464644.
My question is: why isn't the address 0x46464646? How does it end up as three F's followed by a D? The expected result should be 0x46464646, because that is the data that should overwrite the return address.
ARM instructions are aligned, on either a 2-byte or a 4-byte boundary depending on whether they are Thumb or ARM instructions. This means that the LSb of legal branch targets is always zero. So ARM took advantage of this and used the LSb of a branch address (for some branch instructions) to denote whether to switch between ARM and Thumb.
It is most likely that you overwrote the stored LR with 0x46464645 or 0x46464646, and the processor then discarded your LSb (or LS 2 bits), potentially using it to select the mode when executing code at the destination.
Try changing the last of your 'E' characters or the first of your 'F's to something with recognizable upper 6 bits to test this hypothesis (presuming your ARM is running little-endian). It's most likely the first 'F', considering ARM's alignment requirements.
Edit: why are people down-voting this question? It is both interesting and well-asked.
Here's a reference to the behavior on branching that mentions the mode switch: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/Cihfddaf.html
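The behaviour described above can be modelled as clearing the low bit of the overwritten return address before it lands in the PC (a sketch of the explanation, not actual hardware behaviour):

```c
#include <stdint.h>

/* Model of the interworking-branch behaviour: the LSb of the branch target
 * selects the instruction set and is cleared before the value reaches PC. */
uint32_t pc_after_branch(uint32_t target)
{
    return target & ~1u;
}
```

So an overwrite of 0x46464645 ('EFFF' in little-endian input order) would show up in the crash log as PC = 0x46464644, matching the question.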

Reading a 64-bit volatile variable on Cortex-M3

I have a 64 bit integer variable on a 32 bit Cortex-M3 ARM controller (STM32L1), which can be modified asynchronously by an interrupt handler.
volatile uint64_t v;

void some_interrupt_handler() {
    v = v + something;
}
Obviously, I need a way to access it that prevents getting inconsistent, half-updated values.
Here is the first attempt:
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
    uint64_t y;
    __disable_irq();
    y = *x;
    __enable_irq();
    return y;
}
The CMSIS inline functions __disable_irq() and __enable_irq() have an unfortunate side effect: they force a compiler memory barrier. So I've tried to come up with something more fine-grained:
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
    uint64_t y;
    asm ( "cpsid i\n"
          "ldrd %[value], %[addr]\n"
          "cpsie i\n"
          : [value]"=r"(y) : [addr]"m"(*x));
    return y;
}
It still disables interrupts, which is not desirable, so I'm wondering if there's a way to do it without resorting to cpsid. The Definitive Guide to ARM Cortex-M3 and Cortex-M4 Processors, Third Edition by Joseph Yiu says:
If an interrupt request arrives when the processor is executing a
multiple cycle instruction, such as an integer divide, the instruction
could be abandoned and restarted after the interrupt handler
completes. This behavior also applies to load double-word (LDRD) and
store double-word (STRD) instructions.
Does it mean that I'll be fine by simply writing this?
static inline uint64_t read_volatile_uint64(volatile uint64_t *x) {
    uint64_t y;
    asm ( "ldrd %[value], %[addr]\n"
          : [value]"=&r"(y) : [addr]"m"(*x));
    return y;
}
(Using "=&r" to work around ARM errata 602117)
Is there some library or builtin function that does the same portably? I've tried atomic_load() in stdatomic.h, but it fails with undefined reference to '__atomic_load_8'.
Yes, using a simple ldrd is safe in this application since it will be restarted (not resumed) if interrupted, hence it will appear atomic from the interrupt handler's point of view.
This holds more generally for all load instructions except those that are exception-continuable, which are a very restricted subset:
only ldm, pop, vldm, and vpop can be continuable
an instruction inside an it-block is never continuable
an ldm/pop whose first loaded register is also the base register (e.g. ldm r0, { r0, r1 }) is never continuable
This gives plenty of options for atomically reading a multi-word variable that's modified by an interrupt handler on the same core. If the data you wish to read is not a contiguous array of words then you can do something like:
1: ldrex %[val0], [%[ptr]] // can also be byte/halfword
... more loads here ...
strex %[retry], %[val0], [%[ptr]]
cbz %[retry], 2f
b 1b
2:
It doesn't really matter which word (or byte/halfword) you use for the ldrex/strex since an exception will perform an implicit clrex.
The other direction, writing a variable that's read by an interrupt handler is a lot harder. I'm not 100% sure but I think the only stores that are guaranteed to appear atomic to an interrupt handler are those that are "single-copy atomic", i.e. single byte, aligned halfword, and aligned word. Anything bigger would require disabling interrupts or using some clever lock-free structure.
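One lock-free structure that works for the harder direction - a single main-context writer publishing a double word to an ISR reader on the same core - is a double buffer flipped by an aligned word store, which is single-copy atomic per the quote below. A sketch with illustrative names:

```c
#include <stdint.h>

static uint64_t buf[2];           /* two copies of the double word */
static volatile uint32_t active;  /* index of the last complete copy;
                                   * an aligned word store is single-copy atomic */

/* Main-context writer: fill the inactive copy, then publish it with a
 * single atomic word store. */
void publish(uint64_t v)
{
    uint32_t next = active ^ 1u;
    buf[next] = v;     /* the ISR never reads this copy while it is inactive */
    active = next;     /* atomic publish */
}

/* ISR-side reader: on a single core the handler runs atomically with respect
 * to the main context, so buf[active] is always a complete copy. */
uint64_t read_from_isr(void)
{
    return buf[active];
}
```

This relies on there being exactly one writer and on the reader running in interrupt context on the same core; it would not be safe between two cores without further ordering guarantees.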
Atomicity is not guaranteed for LDRD, according to the ARMv7-M Architecture Reference Manual (A3.5.1):
The only ARMv7-M explicit accesses made by the ARM processor which exhibit single-copy atomicity are:
• All byte transactions
• All halfword transactions to 16-bit aligned locations
• All word transactions to 32-bit aligned locations
LDM, LDC, LDRD, STM, STC, STRD, PUSH and POP operations are seen to be a sequence of 32-bit
transactions aligned to 32 bits. Each of these 32-bit transactions are guaranteed to exhibit single-copy
atomicity. Sub-sequences of two or more 32-bit transactions from the sequence also do not exhibit
single-copy atomicity
What you can do is use a byte flag to indicate to the ISR that you're reading it:
non_isr() {
    do {
        flag = 1;
        foo = doubleword;
    } while (flag > 1);
    flag = 0;
}

isr() {
    if (flag == 1)
        flag++;
    doubleword = foo;
}
Source (login required):
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0403e.b/index.html
Login not required:
http://www.telecom.uff.br/~marcos/uP/ARMv7_Ref.pdf
I was also trying to use a 64-bit (2 × 32-bit) system tick, but on an STM32L4xx (Arm Cortex-M4). I found that when I tried to use just volatile uint64_t system_tick, the compiler emitted the LDRD instruction, which may have been enough, since getting interrupted after reading the first word is supposed to cause both words to be read again.
I asked the tech at IAR software support, and he responded that I should use C11 atomics:
#include "stdatomic.h"
#ifdef __STDC_NO_ATOMICS__
static_assert(__STDC_NO_ATOMICS__ != 1);
#endif

volatile atomic_uint_fast64_t system_tick;

/**
 * \brief Increment system_tick
 * \retval none
 */
void HAL_IncTick(void)
{
    system_tick++;
}

/**
 * \brief Read 64-bit system_tick
 * \retval system_tick
 */
uint64_t HAL_GetSystemTick(void)
{
    return system_tick;
}

/**
 * \brief Read 32 least significant bits of system_tick
 * \retval (uint32_t) system_tick
 */
uint32_t HAL_GetTick(void)
{
    return (uint32_t)system_tick;
}
But what I found was that a colossal amount of code was added to make the read "atomic".
Back in the day of 8-bit microcontrollers, the trick was to read the high byte, read the low byte, then read the high byte again until the high byte was the same twice - proving that there was no rollover caused by the ISR. So if you are against disabling the IRQ around reading system_tick, try this trick:
/**
 * \brief Read 64-bit system_tick
 * \retval system_tick
 */
uint64_t HAL_GetSystemTick(void)
{
    uint64_t tick;
    do {
        tick = system_tick;
    } while ((uint32_t)(system_tick >> 32) != (uint32_t)(tick >> 32));
    return tick;
}
The idea is that if the most significant word did not roll over, then the whole 64-bit system_tick must be valid. If HAL_IncTick() did anything more than a simple increment, this assumption would not hold.

How can I deal with a given situation related to a hardware change

I am maintaining production code for an FPGA device. Earlier, the registers on the FPGA were 32 bits wide, and reads/writes to these registers worked fine. But the hardware changed, and so did the FPGA device; with the latest version of the FPGA we have trouble reading and writing the FPGA registers. After some R&D we learned that the FPGA registers are no longer 32 bits - they are now 31-bit registers, and the FPGA vendor has confirmed the same.
So a small code change is needed as well. Earlier we checked whether register addresses were 4-byte aligned (because the registers were 32 bits); in the current scenario we also have to check that addresses are valid 31-bit addresses. For that, we are going to check
if the most significant bit of the address is set (which means it is not a valid 31-bit address).
I guess we are ok here.
Now the second scenario is a bit tricky for me.
If a read/write of multiple registers is going to go over the 0x7fff-fffc boundary (which is the maximum address in the 31-bit scheme), then the request has to be handled carefully.
Reading and writing multiple registers takes a length argument, which is simply the number of registers to be read or written.
For example, if the read starts at 0x7fff-fff8 and the length of the read is 5, then actually we can only read 2 registers (0x7fff-fff8 and 0x7fff-fffc).
Now, could somebody suggest some pseudocode to handle this scenario?
Something like the following:
while (length > 1)
{
    if (!(address << (length * 31) <= 0x7fff-fffc))
    {
        length--;
    }
}
I know it is not good enough, but something along the same lines that I can use.
EDIT
I have come up with a piece of code which may fulfill my requirement:
int count;
Index_addr = addr;
while (Index_addr <= 0x7ffffffc)
{
    /* Want to move to the next register address; each register is 31 bits
       wide and registers are at consecutive locations like 0x0, 0x4, 0x8 etc. */
    Index_addr = addr << 1; // Guess I am doing wrong here, would anyone correct it.
    count++;
}
length = count;
The root problem seems to be that the program is not properly treating the FPGA registers.
Data encapsulation would help, and, instead of treating the 31-bit FPGA registers as memory locations, they should be abstracted.
The FPGA should be treated as a vector (a one-dimensional array) of registers.
The vector of N FPGA registers should be addressable by a register index in the range 0x0000 through N-1.
The FPGA registers are memory mapped at base addr.
So the memory address = 4 * FPGA register index + base addr.
Access to the FPGA registers should be encapsulated by read and write procedures:
int read_fpga_reg(int reg_index, uint32_t *reg_valp)
{
    if (reg_index < 0 || reg_index >= MAX_REG_INDEX)
        return -1; /* error return */
    /* Note the parentheses: << binds more loosely than + in C, so
     * reg_index << 2 + fpga_base_addr would parse as
     * reg_index << (2 + fpga_base_addr). */
    *reg_valp = *(uint32_t *)((reg_index << 2) + fpga_base_addr);
    return 0;
}
As long as MAX_REG_INDEX and fpga_base_addr are properly defined, then this code will never generate an invalid memory access.
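The index-to-address arithmetic can be checked in isolation on a host build (names are illustrative; registers are assumed 4 bytes apart):

```c
#include <stdint.h>

/* Map a register index to its memory-mapped address; registers are 4 bytes
 * apart, so the index is shifted left by 2 before adding the base. */
uintptr_t fpga_reg_address(uint32_t reg_index, uintptr_t base_addr)
{
    return base_addr + ((uintptr_t)reg_index << 2);
}
```

Keeping the shift in parentheses avoids the C precedence trap where `reg_index << 2 + base` parses as `reg_index << (2 + base)`.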
I'm not absolutely sure I'm interpreting the given scenario correctly. But here's a shot at it:
// Assuming "address" starts 4-byte aligned and is an unsigned 32-bit integer
uint32_t address;

while ( length > 0 ) // length is the number of registers to access
{
    // READ 4-byte value at "address"
    // Mask the read value with 0x7FFFFFFF since there are 31 valid bits
    // 32 bits (4 bytes) have been read
    if ( (--length > 0) && (address < 0x7ffffffc) )
        address += 4;
}
