I'm using Linaro g++ for ARM AArch64 to compile a simple .cpp file:
int main()
{
char *helloMain = "main module (crm.c)";
long faculty, num = 12;
int stop,mainLoop = 1;
char word[80] = "";
}
After running objdump on the generated ELF file, I got its assembly:
0000000000001270 <main>:
int main()
{
1270: d101c3ff sub sp, sp, #0x70
char *helloMain = "main module (crm.c)";
1274: 90000020 adrp x0, 5000 <_malloc_trim_r+0x160>
1278: 9111c000 add x0, x0, #0x470
127c: f90003e0 str x0, [sp]
long faculty, num = 12;
1280: d2800180 movz x0, #0xc
1284: f90007e0 str x0, [sp,#8]
int stop,mainLoop = 1;
1288: 52800020 movz w0, #0x1
128c: b90013e0 str w0, [sp,#16]
char word[80] = "";
1290: 910063e0 add x0, sp, #0x18
1294: 90000021 adrp x1, 5000 <_malloc_trim_r+0x160>
1298: 91122021 add x1, x1, #0x488
129c: 39400021 ldrb w1, [x1]
12a0: 39000001 strb w1, [x0]
12a4: 91000400 add x0, x0, #0x1
12a8: a9007c1f stp xzr, xzr, [x0]
12ac: a9017c1f stp xzr, xzr, [x0,#16]
12b0: a9027c1f stp xzr, xzr, [x0,#32]
12b4: a9037c1f stp xzr, xzr, [x0,#48]
12b8: f900201f str xzr, [x0,#64]
12bc: b900481f str wzr, [x0,#72]
12c0: 7900981f strh wzr, [x0,#76]
12c4: 3901381f strb wzr, [x0,#78]
}
12c8: 52800000 movz w0, #0x0
12cc: 9101c3ff add sp, sp, #0x70
12d0: d65f03c0 ret
Before executing this code on an ARMv8 board, sp is initialized to an address aligned to 0x1000.
Executing this code raised an alignment fault exception at
12a8: a9007c1f stp xzr, xzr, [x0]
I noticed that x0 had been incremented by 0x1, so it was only byte-aligned when the stp instruction executed.
Why didn't g++ align it to 0x10 to avoid the alignment fault exception?
The g++ version is:
gcc 4.8.1 20130506 (prerelease) (crosstool-NG linaro-1.13.1-4.8-2013.05 - Linaro GCC 2013.05)
From the manual:
-munaligned-access
-mno-unaligned-access
Enables (or disables) reading and writing of 16- and 32- bit values from addresses that are not 16- or 32- bit aligned.
By default unaligned access is disabled for all pre-ARMv6 and all
ARMv6-M architectures, and enabled for all other architectures. If
unaligned access is not enabled then words in packed data structures
will be accessed a byte at a time.
The ARM attribute Tag_CPU_unaligned_access will be set in the
generated object file to either true or false, depending upon the
setting of this option. If unaligned access is enabled then the
preprocessor symbol __ARM_FEATURE_UNALIGNED will also be defined.
AArch64/ARMv8 supports unaligned access out of the box, so GCC assumes it's available. If that is not the case, you may have to disable it explicitly with the above switch. It's also possible that the "prerelease" version you're using is not quite finished and various bugs/issues are present.
EDIT
As mentioned in the comments, the corresponding AArch64 options are:
-mstrict-align
-mno-strict-align
Avoid or allow generating memory accesses that may not be aligned on a natural object boundary as described in the architecture specification.
By the way, the code behaves like this because GCC interpreted the assignment literally:
Copy the string "" (so just a single zero byte) to the start of the buffer.
Fill the rest of the buffer with zeroes.
I suspect that if you enable optimizations, the unaligned access will be gone.
Or, if you use char word[80] = {0}, it should do the zeroing in one go.
After some study of the ARMv8 architecture, I gained a deeper understanding of the data abort exception I hit.
Why did this align fault exception occur?
As @IgorSkochinsky mentioned, AArch64/ARMv8 supports unaligned access. But I'm working in a simple bare-metal environment where the MMU isn't enabled, so memory is treated as Device memory, and Device memory doesn't support unaligned access. Once the MMU is enabled, this exception is gone.
How do I force GCC to produce an ELF file free of unaligned accesses?
From the manual, -mno-unaligned-access should be enough, but for my GCC version:
gcc 4.8.1 20130506 (prerelease) (crosstool-NG
linaro-1.13.1-4.8-2013.05 - Linaro GCC 2013.05)
GCC reports that there is no such option (it is an AArch32 option). In my case, the AArch64 option -mstrict-align solved the problem.
Related
I would like to place libraries at different addresses and link them to my main program dynamically rather than statically. In my opinion this can improve my FOTA updates by decreasing the number of bytes to be sent.
I thought about changing this in my linker script, but I don't know how to jump to/call the library at a different address.
It looks complicated but it is actually quite simple.
Rename the .text segments in your library linker script (the names have to be distinct from your main app ones)
Change the load addresses of those segments to the ones you need
Link statically with your app
Save your binary file from the generated .elf using objcopy and selecting the segments only from your app.
Enjoy
Of course the functions have to be at those addresses before you start the new app (for example, be part of the bootloader, or be placed there before the app is run).
But remember: if you recompile the library and place it in the flash, you need to relink the app.
Another way is to have a table of function pointers at a known address, and then just call through the pointer.
#include <stdio.h>  /* for FILE */

typedef struct
{
    int (*_myprintf)(const char *fmt, ...);
    FILE *(*_myfopen)(const char *pathname, const char *mode);
    /* etc etc */
} myfuncs;

#define fptr_start 0x8005000 // symbol defined in the linker script.
myfuncs *fptrs = (myfuncs *)fptr_start;

#define myprintf(fmt, ...) fptrs->_myprintf(fmt, __VA_ARGS__)
void foo(int x)
{
myprintf("%d\n", x);
}
Look up bx and blx in the ARM Architecture Reference Manual that relates to your core (either the ARMv6-M or ARMv7-M one will do, since these instructions didn't change, and ARMv8-M uses instructions from 6-M and/or 7-M).
My guess is you are overcomplicating things; when you look at this at a system level, it may not be worth the risk/effort to take this path when other less costly solutions will work. But you will want to treat this like any other loadable module, like the ones used on your computer today. Whether the thing being downloaded has functions in it that need to be called, or the thing being downloaded needs to patch in functions from a library in flash, it works the same way: you need a table of addresses in some form, as complicated or as simple as you desire, depending on your design.
If there is .bss or .data involved, then you may need to do even more work. Ideally you don't need to be position independent if you design for well-known addresses; if you don't do that, you will need to be position independent and patch up addresses accordingly, making the toolchain do most of the work for you (by isolating the addresses to be changed into a set of tables that can be found).
simple example
vectors.s
.globl _start
_start:
.word one
.word two
.word three
so.c
unsigned int one ( unsigned int x )
{
return(x+1);
}
unsigned int two ( unsigned int x, unsigned int y )
{
return(x+y+2);
}
unsigned int three ( unsigned int x, unsigned int y, unsigned int z )
{
return(z+y+z+3);
}
vectors.ld
MEMORY
{
ram : ORIGIN = 0x20002000, LENGTH = 0x2000
}
SECTIONS
{
.text : { *(.text*) } > ram
}
build
arm-none-eabi-as --warn --fatal-warnings -mcpu=cortex-m0 vectors.s -o vectors.o
arm-none-eabi-gcc -Wall -O2 -ffreestanding -mcpu=cortex-m0 -mthumb -c so.c -o so.o
arm-none-eabi-ld -nostdlib -nostartfiles -T vectors.ld vectors.o so.o -o so.elf
arm-none-eabi-objdump -D so.elf > so.list
arm-none-eabi-objcopy -O binary so.elf so.bin
examine
Disassembly of section .text:
20002000 <_start>:
20002000: 2000200d andcs r2, r0, sp
20002004: 20002011 andcs r2, r0, r1, lsl r0
20002008: 20002019 andcs r2, r0, r9, lsl r0
2000200c <one>:
2000200c: 3001 adds r0, #1
2000200e: 4770 bx lr
20002010 <two>:
20002010: 3102 adds r1, #2
20002012: 1808 adds r0, r1, r0
20002014: 4770 bx lr
20002016: 46c0 nop ; (mov r8, r8)
20002018 <three>:
20002018: 3103 adds r1, #3
2000201a: 0052 lsls r2, r2, #1
2000201c: 1888 adds r0, r1, r2
2000201e: 4770 bx lr
Then you can use this either by doing the function-pointer thing or via a helper function,
using blx
.globl bounce
bounce:
push {r3,lr}
ldr r0,[r0]
blx r0
pop {r3,pc}
but that is unnecessary; you can just use bx:
.globl bounce
bounce:
ldr r0,[r0]
bx r0
...
unsigned int bounce ( unsigned int, unsigned int, unsigned int, unsigned int );
...
ret=bounce(0x20002000,17,0,0); // one(17);
ret=bounce(0x20002004,11,13,0); // two(11,13);
ret=bounce(0x20002008,1,2,3); // three(1,2,3);
...
Yes, this example was designed to work. You also have to know the calling convention, and if you use more than r0-r3 for parameters, more work has to be done. Or you can have one bounce function per library function, or a set of them, or whatever.
You have to be really careful with the C function-pointer approach, because it can fail on the Cortex-M, and you can't ensure from C that it will work; how the code gets implemented depends on the rest of the function and some other factors (there are other SO questions and answers on this topic). It has to do with having the lsbit set for Thumb. You can see that, the way I did it here, I didn't need to mess with that. I could have ORed in a 1 in the bounce function for paranoia's sake; I would not use ADD, because if the linker has already set the bit, ORing in a 1 still works, while ADDing 1 would crash.
Short answer: look up bx and blx; those are the key to branching/calling with respect to your question. You should have a copy of the ARM technical reference manual for the core you are using and a copy of the ARM architectural reference manual for the architecture used by that core when doing this kind of work, as well as the datasheet and reference manual for the part in question. (The ARM ARM contains the instruction set information for that architecture.)
In the executable I'm writing I have 2 implementations of the same function, one for armhf (fast) and one for armel (slow). At runtime I'd like to detect the CPU type, and call the armhf implementation if armhf was detected. How do I detect the CPU? I need something like this in C code:
int is_cpu_armhf(void) {
...
}
The code may contain inline assembly, but preferably it shouldn't contain a call to a library function or a system call, because it should work with multiple libraries and multiple operating systems.
I've found https://github.com/pytorch/cpuinfo/tree/master/src/arm , but it doesn't seem to use any inline assembly; it relies on the operating system to get the CPU information.
... I have two implementations of the same function, one for armhf (fast) and one for armel (slow). At runtime I'd like to detect the CPU type, and call the armhf implementation if armhf was detected. How do I detect the CPU? I need something like this in C code ...
As @Ruslan noted, the CPU feature registers are mostly privileged on ARM. If you are root, then you can read the feature mask from a system register with MRS. The most recent kernels fake a cpuid for ARM, but it is only available on those recent kernels.
At runtime you may be able to parse /proc/cpuinfo on Linux for cpu arch and features. You may also be able to call getauxval and read the bits from the auxiliary vector.
What I have found that works best is:
Try to read getauxval for arch and feature
Use a SIGILL probe if getauxval fails
The SIGILL probe is expensive. You set up a SIGILL handler and try the ARMv5 or ARMv7 instruction; if you catch a SIGILL, you know the instruction is not available.
SIGILL probes are used by Crypto++ and OpenSSL. For example, movw and movt were added in ARMv7. Here is the Crypto++ code that probes for ARMv7 using the movw and movt instructions; OpenSSL performs a similar probe in crypto/armcap.c.
bool CPU_ProbeARMv7()
{
volatile bool result = true;
volatile SigHandler oldHandler = signal(SIGILL, SigIllHandler);
if (oldHandler == SIG_ERR)
return false;
volatile sigset_t oldMask;
if (sigprocmask(0, NULLPTR, (sigset_t*)&oldMask))
return false;
if (setjmp(s_jmpSIGILL))
result = false;
else
{
unsigned int a;
asm volatile (
#if defined(__thumb__)
".inst.n 0xf241, 0x2034 \n\t" // movw r0, 0x1234
".inst.n 0xf2c1, 0x2034 \n\t" // movt r0, 0x1234
"mov %0, r0 \n\t" // mov [a], r0
#else
".inst 0xe3010234 \n\t" // movw r0, 0x1234
".inst 0xe3410234 \n\t" // movt r0, 0x1234
"mov %0, r0 \n\t" // mov [a], r0
#endif
: "=r" (a) : : "r0");
result = (a == 0x12341234);
}
sigprocmask(SIG_SETMASK, (sigset_t*)&oldMask, NULLPTR);
signal(SIGILL, oldHandler);
return result;
}
The volatiles are required in the probes. Also see What sense do these clobbered variable warnings make?
On Android you should use android_getCpuFamily() and android_getCpuFeatures() instead of getauxval.
The ARM folks say you should NOT parse /proc/cpuinfo. Also see ARM Blog and Runtime Detection of CPU Features on an armv8-a CPU. (Non-paywall version here).
DO NOT perform SIGILL based feature probes on iOS devices. Apple devices trash memory. For Apple devices use something like How to get device make and model on iOS?.
You also need to enable code paths based on compiler options. That is a whole 'nother can of worms. For that problem see Detect ARM NEON availability in the preprocessor?
For some additional source code to examine, see cpu.cpp in Crypto++. It is the place where Crypto++ does things like call getauxval, android_getCpuFamily() and android_getCpuFeatures().
The Crypto++ SIGILL probes occur in specific source files, since a source file usually needs a compiler option to enable an arch, like -march=armv7-a and -mfpu=neon for ARM. That's why ARMv7 and NEON are detected in neon_simd.cpp. (There are similar files for i686 and x86_64, Altivec, PowerPC, and AArch64.)
Here is what getauxval and android_getCpuFamily() look like in Crypto++. CPU_QueryARMv7 is tried first; if CPU_QueryARMv7 fails, then a SIGILL feature probe is used.
inline bool CPU_QueryARMv7()
{
#if defined(__ANDROID__) && defined(__arm__)
if (((android_getCpuFamily() & ANDROID_CPU_FAMILY_ARM) != 0) &&
((android_getCpuFeatures() & ANDROID_CPU_ARM_FEATURE_ARMv7) != 0))
return true;
#elif defined(__linux__) && defined(__arm__)
if ((getauxval(AT_HWCAP) & HWCAP_ARMv7) != 0 ||
(getauxval(AT_HWCAP) & HWCAP_NEON) != 0)
return true;
#elif defined(__APPLE__) && defined(__arm__)
// Apple hardware is ARMv7 or above.
return true;
#endif
return false;
}
The ARM instructions for movw and movt were disassembled from the following source code:
int a;
asm volatile("movw %0,%1 \n"
"movt %0,%1 \n"
: "=r"(a) : "i"(0x1234));
00000010 <_Z5test2v>: // ARM
10: e3010234 movw r0, #4660 ; 0x1234
14: e3410234 movt r0, #4660 ; 0x1234
18: e12fff1e bx lr
0000001c <_Z5test3v>: // Thumb
1c: f241 2034 movw r0, #4660 ; 0x1234
20: f2c1 2034 movt r0, #4660 ; 0x1234
24: e12fff1e bx lr
Here is what reading an ID register with MRS looks like. It is very similar to getting the CPUID bitmask on x86. The code below can be used to get the crypto features on AArch64, but it requires root privileges: the registers are readable at Exception Level 1 (EL1) and above, while user space runs at EL0. Attempting to run the code from userland results in a SIGILL and termination.
#if defined(__arm64__) || defined(__aarch64__)
uint64_t caps = 0; // Read ID_AA64ISAR0_EL1
__asm __volatile("mrs %0, " "id_aa64isar0_el1" : "=r" (caps));
#elif defined(__arm__) || defined(__aarch32__)
uint32_t caps = 0; // Read ID_ISAR5_EL1
__asm __volatile("mrs %0, " "id_isar5_el1" : "=r" (caps));
#endif
The benefit of emitting the instruction words yourself is that the source file does not need arch options when compiling:
unsigned int a;
asm volatile (
#if defined(__thumb__)
".inst.n 0xf241, 0x2034 \n\t" // movw r0, 0x1234
".inst.n 0xf2c1, 0x2034 \n\t" // movt r0, 0x1234
"mov %0, r0 \n\t" // mov [a], r0
#else
".inst 0xe3010234 \n\t" // movw r0, 0x1234
".inst 0xe3410234 \n\t" // movt r0, 0x1234
"mov %0, r0 \n\t" // mov [a], r0
#endif
: "=r" (a) : : "r0");
You can compile the above code without arch options:
gcc cpu-test.c -o cpu-test.o
If you were to use movw and movt:
int a;
asm volatile("movw %0,%1 \n"
"movt %0,%1 \n"
: "=r"(a) : "i"(0x1234));
then your compiler would need to support ARMv7, and you would need to use the arch option:
gcc -march=armv7 cpu-test.c -o cpu-test.o
And GCC could use ARMv7 throughout the source file, which could cause a SIGILL outside your protected code.
I've experienced Clang using the wrong instruction set on x86; see Crypto++ Issue 751. GCC will surely follow. In the Clang case, I needed to compile a source file with -march=avx so I could use AVX intrinsics. Clang generated AVX code outside my protected block, and it crashed on an old Core 2 Duo machine. (The unsafe code Clang generated was the initialization of a std::string.)
In the case of ARM the problem is that you need -march=armv7 to enable the ISA with movw and movt, and the compiler then thinks it can use that ISA, too. It is a design bug in the compiler: the user's arch and the compiler's arch are conflated. In reality, because of the compiler design, you need a user arch and a separate compiler arch.
I have a project on the ARMv5TE platform, and I have to rewrite some functions in assembly to use the enhanced DSP instructions.
I use int64_t a lot for accumulators, but I have no idea how to pass one to the ARM instruction SMULL (http://www.keil.com/support/man/docs/armasm/armasm_dom1361289902800.htm).
How can I pass the lower or higher 32 bits of a 64-bit variable to a 32-bit register? (I know I could use an intermediate int32_t variable, but it does not look good.)
I know that the compiler would do it for me, but I'm just writing a small function as an example.
int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
int64_t tmp_acc;
asm("SMULL %0, %1, %2, %3"
: "=r"(tmp_acc), "=r"(tmp_acc) // no idea how to pass tmp_acc;
: "r"(x), "r"(y)
);
return tmp_acc + acc;
}
You don't need and shouldn't use inline asm for this. The compiler can do even better than smull, and use smlal to multiply-accumulate with one instruction:
int64_t accum(int64_t acc, int32_t x, int32_t y) {
return acc + x * (int64_t)y;
}
which compiles (with gcc8.2 -O3 -mcpu=arm10e on the Godbolt compiler explorer) to this asm: (ARM10E is an ARMv5 microarchitecture I picked from Wikipedia's list)
accum:
smlal r0, r1, r3, r2 #, y, x
bx lr #
As a bonus, this pure C also compiles efficiently for AArch64.
https://gcc.gnu.org/wiki/DontUseInlineAsm
If you insist on shooting yourself in the foot and using inline asm:
In the general case with other instructions, there might be situations where you'd want this.
First, beware that smull output registers aren't allowed to overlap the first input register, so you have to tell the compiler about this. An early-clobber constraint on the output operand(s) will do the trick of telling the compiler it can't have inputs in those registers. I don't see a clean way to tell the compiler that the 2nd input can be in the same register as an output.
This restriction is lifted in ARMv6 and later (see this Keil documentation) "Rn must be different from RdLo and RdHi in architectures before ARMv6", but for ARMv5 compatibility you need to make sure the compiler doesn't violate this when filling in your inline-asm template.
Optimizing compilers can optimize away a shift/OR that combines 32-bit C variables into a 64-bit C variable, when targeting a 32-bit platform. They already store 64-bit variables as a pair of registers, and in normal cases can figure out there's no actual work to be done in the asm.
So you can specify a 64-bit input or output as a pair of 32-bit variables.
#include <stdint.h>
int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
uint32_t prod_lo, prod_hi;
asm("SMULL %0, %1, %2, %3"
: "=&r" (prod_lo), "=&r"(prod_hi) // early clobber for pre-ARMv6
: "r"(x), "r"(y)
);
int64_t prod = ((int64_t)prod_hi) << 32;
prod |= prod_lo; // + here won't optimize away, but | does, with gcc
return acc + prod;
}
Unfortunately the early-clobber means we need 6 total registers, but the ARM calling convention only has 6 call-clobbered registers (r0..r3, lr, and ip (aka r12)). And one of them is LR, which has the return address so we can't lose its value. Probably not a big deal when inlined into a regular function that already saves/restores several registers.
Again from Godbolt:
# gcc -O3 output with early-clobber, valid even before ARMv6
testFunc:
str lr, [sp, #-4]! #, Save return address (link register)
SMULL ip, lr, r2, r3 # prod_lo, prod_hi, x, y
adds r0, ip, r0 #, prod, acc
adc r1, lr, r1 #, prod, acc
ldr pc, [sp], #4 # return by popping the return address into PC
# gcc -O3 output without early-clobber (&) on output constraints:
# valid only for ARMv6 and later
testFunc:
SMULL r3, r2, r2, r3 # prod_lo, prod_hi, x, y
adds r0, r3, r0 #, prod, acc
adc r1, r2, r1 #, prod, acc
bx lr #
Or you can use a "=r"(prod64) constraint and use modifiers to select which half of %0 you get. Unfortunately, gcc and clang emit less efficient asm for some reason, saving more registers (and maintaining 8-byte stack alignment). 2 instead of 1 for gcc, 4 instead of 2 for clang.
// using an int64_t directly with inline asm, using %Q0 and %R0 constraints
// Q is the low half, R is the high half.
int64_t testFunc2(int64_t acc, int32_t x, int32_t y)
{
int64_t prod; // gcc and clang seem to want more free registers this way
asm("SMULL %Q0, %R0, %1, %2"
: "=&r" (prod) // early clobber for pre-ARMv6
: "r"(x), "r"(y)
);
return acc + prod;
}
again compiled with gcc -O3 -mcpu=arm10e. (clang saves/restores 4 registers)
# gcc -O3 with the early-clobber so it's safe on ARMv5
testFunc2:
push {r4, r5} #
SMULL r4, r5, r2, r3 # prod, x, y
adds r0, r4, r0 #, prod, acc
adc r1, r5, r1 #, prod, acc
pop {r4, r5} #
bx lr #
So for some reason it seems to be more efficient to manually handle the halves of a 64-bit integer with current gcc and clang. This is obviously a missed optimization bug.
I am wondering whether I need to use an atomic type or volatile (or nothing special) for an interrupt counter:
uint32_t uptime = 0;
// interrupt each 1 ms
ISR()
{
// this is the only location which writes to uptime
++uptime;
}
void some_func()
{
uint32_t now = uptime;
}
I myself would think that volatile should be enough to guarantee error-free operation and consistency (an incrementing value until overflow).
But it has come to my mind that maybe a mov instruction could be interrupted mid-operation while moving/setting individual bits; is that possible on x86_64 and/or ARMv7-M?
For example, the mov instruction would begin to execute, set 16 bits, then be pre-empted; the ISR would run, increasing uptime by one (and maybe changing all bits), and then the mov instruction would continue. I cannot find any material that assures me this cannot happen.
Would this also be the same on armv7-m?
Would using sig_atomic_t be the correct solution to always have an error-free and consistent result or would it be "overkill"?
For example, the ARMv7-M architecture specifies:
In ARMv7-M, the single-copy atomic processor accesses are:
• All byte accesses.
• All halfword accesses to halfword-aligned locations.
• All word accesses to word-aligned locations.
would an assert that &uptime % 8 == 0 be sufficient to guarantee this?
Use volatile. Your compiler does not know about interrupts. It may assume that the ISR() function is never called (do you have a call to ISR() anywhere in your code?). That means uptime never increments, which means uptime is always zero, which means uint32_t now = uptime; may be safely optimized to uint32_t now = 0;. Use volatile uint32_t uptime; that way the optimizer will not optimize uptime away.
Word size. A uint32_t variable has 4 bytes, so on a 32-bit processor it takes one instruction to fetch its value, but on an 8-bit processor it takes at least 4 instructions (in general). On a 32-bit processor you don't need to disable interrupts before loading the value of uptime, because the interrupt routine will start executing either before or after the current instruction has completed; the processor can't branch to an interrupt routine mid-instruction. On an 8-bit processor we need to disable interrupts before reading from uptime, like:
DisableInterrupts();
uint32_t now = uptime;
EnableInterrupts();
C11 atomic types. I have never seen real embedded code that uses them; I'm still waiting, and I see volatile everywhere. This is compiler dependent, because the compiler implements the atomic types and the atomic_* functions. Are you 100% sure that when reading from an atomic variable your compiler will disable the ISR() interrupt? Inspect the assembly generated from the atomic_* calls and you will know for sure. This was a good read. I would expect the C11 atomic types to work for concurrency between multiple threads, which can switch execution context at any time. Using them between interrupt and normal context may lock up your CPU, because once you are in an IRQ you get back to normal execution only after servicing that IRQ: some_func() takes a mutex to read uptime, then the IRQ fires and checks in a loop whether the mutex is free, and this results in an endless loop.
See for example the HAL_GetTick() implementation, from here, with the __weak macro removed and the __IO macro substituted by volatile (those macros are defined in a CMSIS file):
static volatile uint32_t uwTick;
void HAL_IncTick(void)
{
uwTick++;
}
uint32_t HAL_GetTick(void)
{
return uwTick;
}
Typically HAL_IncTick() is called from systick interrupt each 1ms.
You have to read the documentation for each separate core and/or chip. x86 is a completely separate thing from ARM, and within both families each instance may vary from any other instance; each can be, and should be expected to be, a completely new design. It might not be, but from time to time it is.
Things to watch out for as noted in the comments.
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
++uptime;
}
void some_func ( void )
{
uint32_t now = uptime;
}
On my machine with the tool I am using today:
Disassembly of section .text:
00000000 <ISR>:
0: e59f200c ldr r2, [pc, #12] ; 14 <ISR+0x14>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <some_func>:
18: e12fff1e bx lr
Disassembly of section .bss:
00000000 <uptime>:
0: 00000000 andeq r0, r0, r0
This could vary, but if you find one tool on one machine one day that builds a problem, then you have to assume it is a problem. So far we are actually okay: because some_func is dead code, the read is optimized out.
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
++uptime;
}
uint32_t some_func ( void )
{
uint32_t now = uptime;
return(now);
}
fixed
00000000 <ISR>:
0: e59f200c ldr r2, [pc, #12] ; 14 <ISR+0x14>
4: e5923000 ldr r3, [r2]
8: e2833001 add r3, r3, #1
c: e5823000 str r3, [r2]
10: e12fff1e bx lr
14: 00000000 andeq r0, r0, r0
00000018 <some_func>:
18: e59f3004 ldr r3, [pc, #4] ; 24 <some_func+0xc>
1c: e5930000 ldr r0, [r3]
20: e12fff1e bx lr
24: 00000000 andeq r0, r0, r0
Because cores like MIPS and ARM tend to raise data aborts for unaligned accesses by default, we might assume the tool will not generate an unaligned address for such a clean definition. But packed structs are another story: there you have told the compiler to generate an unaligned access, and it will... If you want to feel safe, remember that a "word" on ARM is 32 bits, so you can assert that the address of the variable ANDed with 3 is zero.
On x86 one would also assume that a clean definition like that results in an aligned variable, but x86 doesn't fault on unaligned data by default, and as a result compilers are a bit more free there. I'll focus on ARM, as I think that is your question.
Now if I do this:
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
if(uptime)
{
uptime=uptime+1;
}
else
{
uptime=uptime+5;
}
}
uint32_t some_func ( void )
{
uint32_t now = uptime;
return(now);
}
00000000 <ISR>:
0: e59f2014 ldr r2, [pc, #20] ; 1c <ISR+0x1c>
4: e5923000 ldr r3, [r2]
8: e3530000 cmp r3, #0
c: 03a03005 moveq r3, #5
10: 12833001 addne r3, r3, #1
14: e5823000 str r3, [r2]
18: e12fff1e bx lr
1c: 00000000 andeq r0, r0, r0
and adding volatile
00000000 <ISR>:
0: e59f3018 ldr r3, [pc, #24] ; 20 <ISR+0x20>
4: e5932000 ldr r2, [r3]
8: e3520000 cmp r2, #0
c: e5932000 ldr r2, [r3]
10: 12822001 addne r2, r2, #1
14: 02822005 addeq r2, r2, #5
18: e5832000 str r2, [r3]
1c: e12fff1e bx lr
20: 00000000 andeq r0, r0, r0
The two reads result in two loads. Now, there would be a problem here if the read-modify-write could get interrupted, but we assume that since this is an ISR it can't. If you read a 7, add 1, then write an 8, and you were interrupted after the read by something that also modifies uptime, that modification has a limited life: it happens, say a 5 is written, and then this ISR writes an 8 on top of it.
If a read-modify-write were in the interruptible code, then the ISR could get in there, and it probably wouldn't work the way you wanted. That is two readers and two writers. You want one party responsible for writing a shared resource and the others read-only; otherwise you need a lot more machinery that is not built into the language.
Note on an arm machine:
typedef int __sig_atomic_t;
...
typedef __sig_atomic_t sig_atomic_t;
so
typedef unsigned int uint32_t;
typedef int sig_atomic_t;
volatile sig_atomic_t uptime = 0;
void ISR ( void )
{
if(uptime)
{
uptime=uptime+1;
}
else
{
uptime=uptime+5;
}
}
uint32_t some_func ( void )
{
uint32_t now = uptime;
return(now);
}
Isn't going to change the result, at least not on that system with that define. You need to examine other C libraries and/or sandbox headers to see what they define, or whether (as happens often if you are not careful) the wrong headers are used: the x86_64 headers used to build ARM programs with the cross compiler. I have seen both gcc and llvm make host-vs-target mistakes.
Going back to a concern which, based on your comments, you appear to already understand:
typedef unsigned int uint32_t;
uint32_t uptime = 0;
void ISR ( void )
{
if(uptime)
{
uptime=uptime+1;
}
else
{
uptime=uptime+5;
}
}
void some_func ( void )
{
while(uptime&1) continue;
}
This was pointed out in the comments: even though you have one writer and one reader, there is still a problem.
00000020 <some_func>:
20: e59f3018 ldr r3, [pc, #24] ; 40 <some_func+0x20>
24: e5933000 ldr r3, [r3]
28: e2033001 and r3, r3, #1
2c: e3530000 cmp r3, #0
30: 012fff1e bxeq lr
34: e3530000 cmp r3, #0
38: 012fff1e bxeq lr
3c: eafffffa b 2c <some_func+0xc>
40: 00000000 andeq r0, r0, r0
It never goes back to read the variable from memory, and unless someone corrupts the register in an event handler, this can be an infinite loop.
make uptime volatile:
00000024 <some_func>:
24: e59f200c ldr r2, [pc, #12] ; 38 <some_func+0x14>
28: e5923000 ldr r3, [r2]
2c: e3130001 tst r3, #1
30: 012fff1e bxeq lr
34: eafffffb b 28 <some_func+0x4>
38: 00000000 andeq r0, r0, r0
Now the reader performs a real load every time.
The same issue appears here: not in a loop, no volatile.
00000020 <some_func>:
20: e59f302c ldr r3, [pc, #44] ; 54 <some_func+0x34>
24: e5930000 ldr r0, [r3]
28: e3500005 cmp r0, #5
2c: 0a000004 beq 44 <some_func+0x24>
30: e3500004 cmp r0, #4
34: 0a000004 beq 4c <some_func+0x2c>
38: e3500001 cmp r0, #1
3c: 03a00006 moveq r0, #6
40: e12fff1e bx lr
44: e3a00003 mov r0, #3
48: e12fff1e bx lr
4c: e3a00007 mov r0, #7
50: e12fff1e bx lr
54: 00000000 andeq r0, r0, r0
uptime can have changed between the tests; volatile fixes this.
So volatile is not the universal solution. Having each variable used for one-way communication is ideal; if you need to communicate the other way, use a separate variable, with one writer and one or more readers per variable.
You have done the right thing and consulted the documentation for your chip/core:
So if the variable is aligned (in this case to a 32-bit word) AND the compiler chooses the right instruction, then an interrupt won't interrupt the transaction. If it is an LDM/STM, though, you should read the documentation (push and pop are also LDM/STM pseudo-instructions): in some cores/architectures those can be interrupted and restarted, and the ARM documentation warns us about those situations.
short answer: add volatile, and make it so there is only one writer per variable. and keep the variable aligned. (and read the docs each time you change chips/cores, and periodically disassemble to check the compiler is doing what you asked it to do.) it doesn't matter if it is the same core type (another cortex-m3) from the same vendor, a different vendor, or some completely different core/chip (avr, msp430, pic, x86, mips, etc): start from zero, get the docs, read them, and check the compiler output.
TL;DR: Use volatile if an aligned uint32_t is naturally atomic (it is on x86 and ARM). See Why is integer assignment on a naturally aligned variable atomic on x86?. Your code will technically have C11 undefined behaviour, but real implementations will do what you want with volatile.
Or use C11 stdatomic.h with memory_order_relaxed if you want to tell the compiler exactly what you mean. It will compile to the same asm as volatile on x86 and ARM if you use it correctly.
(But if you actually need it to run efficiently on single-core CPUs where load/store of an aligned uint32_t isn't atomic "for free", e.g. with only 8-bit registers, you might rather disable interrupts instead of having stdatomic fall back to using a lock to serialize reads and writes of your counter.)
Whole instructions are always atomic with respect to interrupts on the same core, on all CPU architectures. Partially-completed instructions are either completed or discarded (without committing their stores) before servicing an interrupt.
For a single core, CPUs always preserve the illusion of running instructions one at a time, in program order. This includes interrupts only happening on the boundaries between instructions. See @supercat's single-core answer on Can num++ be atomic for 'int num'?. If the machine has 32-bit registers, you can safely assume that a volatile uint32_t will be loaded or stored with a single instruction. As @old_timer points out, beware of unaligned packed-struct members on ARM, but unless you manually do that with __attribute__((packed)) or something, the normal ABIs on x86 and ARM ensure natural alignment.
Multiple bus transactions from a single instruction for unaligned operands or narrow busses only matters for concurrent read+write, either from another core or a non-CPU hardware device. (e.g. if you're storing to device memory).
Some long-running x86 instructions like rep movs or vpgatherdd have well-defined ways to partially complete on exceptions or interrupts: they update registers so that re-running the instruction does the right thing. But other than that, an instruction has either run or it hasn't, even a "complex" instruction like a memory-destination add that does a read/modify/write. IDK if anyone's ever proposed a CPU that could suspend/resume multi-step instructions across interrupts instead of cancelling them, but x86 and ARM are definitely not like that. There are lots of weird ideas in computer-architecture research papers, but it seems unlikely to be worth keeping all the microarchitectural state needed to resume in the middle of a partially-executed instruction instead of just re-decoding it after returning from an interrupt.
This is why AVX2 / AVX512 gathers always need a gather mask even when you want to gather all the elements, and why they destroy the mask (so you have to reset it to all-ones again before the next gather).
In your case, you only need the store (and load outside the ISR) to be atomic. You don't need the whole ++uptime to be atomic. You can express this with C11 stdatomic like this:
#include <stdint.h>
#include <stdatomic.h>
_Atomic uint32_t uptime = 0;
// interrupt each 1 ms
void ISR()
{
    // this is the only location which writes to uptime
    uint32_t tmp = atomic_load_explicit(&uptime, memory_order_relaxed);
    // the load doesn't even need to be atomic, but relaxed atomic is as cheap as volatile on machines with wide-enough loads
    atomic_store_explicit(&uptime, tmp+1, memory_order_relaxed);
    // some x86 compilers may fail to optimize to add dword [uptime],1
    // but uptime+=1 would compile to LOCK ADD (an atomic increment), which you don't want.
}

// MODIFIED: return the load result
uint32_t some_func()
{
    // this does need to be an atomic load
    // you typically get that by default with volatile, too
    uint32_t now = atomic_load_explicit(&uptime, memory_order_relaxed);
    return now;
}
volatile uint32_t compiles to the exact same asm on x86 and ARM. I put the code on the Godbolt compiler explorer. This is what clang6.0 -O3 does for x86-64. (With -mtune=bdver2, it uses inc instead of add, but it knows that memory-destination inc is one of the few cases where inc is still worse than add on Intel :)
ISR: # #ISR
add dword ptr [rip + uptime], 1
ret
some_func: # #some_func
mov eax, dword ptr [rip + uptime]
ret
inc_volatile: // void func(){ volatile_var++; }
add dword ptr [rip + volatile_var], 1
ret
gcc uses separate load/store instructions for both volatile and _Atomic, unfortunately.
# gcc8.1 -O3
mov eax, DWORD PTR uptime[rip]
add eax, 1
mov DWORD PTR uptime[rip], eax
At least that means there's no downside to using _Atomic or volatile _Atomic on either gcc or clang.
Plain uint32_t without either qualifier is not a real option, at least not for the read side. You probably don't want the compiler to hoist get_time() out of a loop and use the same time for every iteration. In cases where you do want that, you could copy it to a local. That could result in extra work for no benefit if the compiler doesn't keep it in a register, though (e.g. across function calls it's easiest for the compiler to just reload from static storage). On ARM, though, copying to a local may actually help because then it can reference it relative to the stack pointer instead of needing to keep a static address in another register, or regenerate the address. (x86 can load from static addresses with a single large instruction, thanks to its variable-length instruction set.)
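The copy-to-a-local idea can be sketched like this (my illustration, assuming a hypothetical loop that deliberately wants one snapshot of the time):

```c
#include <stdint.h>

volatile uint32_t uptime;

/* copy the volatile global to a local once, so the loop body reuses a
   single load instead of re-reading memory on every iteration */
uint32_t elapsed_sum(void)
{
    uint32_t start = uptime;   /* exactly one volatile load */
    uint32_t sum = 0;
    for (int i = 0; i < 4; ++i)
        sum += start;          /* 'start' can live in a register */
    return sum;
}
```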
If you want any stronger memory-ordering, you can use atomic_signal_fence(memory_order_release); or whatever (signal_fence not thread_fence) to tell the compiler you only care about ordering wrt. code running asynchronously on the same CPU ("in the same thread" like a signal handler), so it will only have to block compile-time reordering, not emit any memory-barrier instructions like ARM dmb.
e.g. in the ISR:
uint32_t tmp = atomic_load_explicit(&idx, memory_order_relaxed);
tmp++;
shared_buf[tmp] = 2;  // non-atomic
// Then do a release-store of the index
atomic_signal_fence(memory_order_release);
atomic_store_explicit(&idx, tmp, memory_order_relaxed);
Then it's safe for a reader to load idx, run atomic_signal_fence(memory_order_acquire);, and read from shared_buf[tmp] even if shared_buf is not _Atomic. (Assuming you took care of wraparound issues and so on.)
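A sketch of the matching reader side (self-contained here with small dummy definitions of shared_buf and idx, which are assumptions for illustration):

```c
#include <stdint.h>
#include <stdatomic.h>

uint32_t shared_buf[16];    /* plain (non-_Atomic) payload buffer */
_Atomic uint32_t idx;

/* relaxed load of the index, then an acquire signal fence before
   touching the non-atomic buffer; the fence only blocks compile-time
   reordering, no barrier instruction is emitted */
uint32_t read_latest(void)
{
    uint32_t tmp = atomic_load_explicit(&idx, memory_order_relaxed);
    atomic_signal_fence(memory_order_acquire);
    return shared_buf[tmp];
}
```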
volatile primarily affects where the value may live. Without it, the compiler is free to cache the value in a CPU register; with volatile, every read and write must be a real memory access. (If the registers are busy with other work the value ends up in memory anyway, but volatile makes that guaranteed rather than incidental.) This is the main rule.
Now let's look at the architecture. A single native CPU instruction on a native-width type is atomic. But many operations are split into several steps, for example when a value is copied from memory to memory, and a CPU interrupt can arrive between those steps. Don't worry, that is normal: if the value has not yet been stored into the destination variable, you can regard it as a not-yet-committed operation.
The problem arises when you use words wider than the CPU implements, for example a 32-bit value on a 16- or 8-bit processor. Reading and writing the value is then split into several steps, so part of the value can have been stored while the rest has not, and you can read a torn, corrupted value.
In that scenario disabling interrupts is not always a good approach, because it can take a long time; locking can cost just as much.
But you can build a structure whose first field is the data and whose second field is a counter of the architecture's native width. When reading, first fetch the counter, then the value, then the counter a second time; if the two counter reads differ, repeat the process.
Of course this doesn't guarantee everything will always be correct, but it typically saves a lot of CPU cycles. For example, with an additional 16-bit verification counter there are 65536 possible values, so the reading process would have to stay frozen across 65536 missed interrupts before the check could be fooled into accepting a corrupted counter or other stored value.
Of course, if you use a 32-bit value on a 32-bit architecture, none of this is needed; plain loads and stores are already atomic, unless the architecture does not perform such operations atomically :)
example code:
volatile struct
{
    uint32_t value;    // the important value
    uint32_t watchdog; // guards value; platform dependent, usually at least 32 bits
} SecuredCounter;

void ISR(void)
{
    // this is the only location which writes to SecuredCounter
    ++SecuredCounter.value;
    ++SecuredCounter.watchdog;
}

uint32_t Read_uptime(void)
{
    while (1) {
        uint32_t secure1 = SecuredCounter.watchdog; // read the counter first
        uint32_t value   = SecuredCounter.value;    // read the value
        uint32_t secure2 = SecuredCounter.watchdog; // read the counter again; should match the first
        if (secure1 == secure2) return value;       // the snapshot is consistent
    }
}

void some_func(void)
{
    uint32_t now = Read_uptime();
}
A different approach is to keep two identical counters and increment both in a single function. In the read function you copy both values to local variables and compare them. If they are identical, the value is good and you return one of them; if they differ, you repeat the read. Don't worry: if the values differ, your read function was merely interrupted. There is very little chance that happens again on the retry, and no chance of a permanently stalled loop.
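The two-copies approach described above can be sketched as follows (names are mine, for illustration):

```c
#include <stdint.h>

/* the ISR bumps both counters; the reader retries until both copies match */
volatile uint32_t uptime_a, uptime_b;

void ISR(void)
{
    ++uptime_a;
    ++uptime_b;
}

uint32_t read_uptime(void)
{
    for (;;) {
        uint32_t a = uptime_a;
        uint32_t b = uptime_b;
        if (a == b)
            return a;   /* no interrupt landed between the two reads */
    }
}
```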
Here is my assembly code for A9,
ldr x1, = 0x400020 // Const value may be address also
ldr w0, = 0x200018 // Const value may be address also
str w0, [x1]
Is the C statement below the expected equivalent?
*((u32 *)0x400020) = 0x200018;
When I cross-checked it with the compiler, it gave a different result: mov and movk instead of ldr. How do I create an ldr in C?
When I cross-checked it with the compiler, it gave a different result: mov and movk
It sounds to me like you compiled the C code with a compiler targetting AArch32, but the assembly code you've shown looks like it was written for AArch64.
Here's what I get when I compile with ARM64 GCC 5.4 and optimization level O3 (comments added by me):
mov x0, 32 # x0 = 0x20
mov w1, 24 # w1 = 0x18
movk x0, 0x40, lsl 16 # x0[31:16] = 0x40
movk w1, 0x20, lsl 16 # w1[31:16] = 0x20
str w1, [x0]
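What the mov/movk pair computes can be sketched in C (a hypothetical helper of my own, modelling the two instructions): mov sets the low 16 bits, and movk ... lsl 16 keeps them while patching in bits [31:16].

```c
#include <stdint.h>

/* model of: mov wN, #lo  followed by  movk wN, #hi, lsl 16 */
uint32_t mov_movk(uint16_t lo, uint16_t hi)
{
    uint32_t w = lo;                               /* mov  wN, #lo          */
    w = (w & 0x0000FFFFu) | ((uint32_t)hi << 16);  /* movk wN, #hi, lsl 16  */
    return w;
}
```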
How to create ldr in c?
I can't see any good reason why you'd want the compiler to generate an LDR in this case.
LDR reg,=value is a pseudo-instruction that allows you to load immediates that cannot be encoded directly in the instruction word. The assembler achieves this by placing the value (e.g. 0x200018) in a literal pool, and then replacing ldr w0, =0x200018 with a PC-relative load from that literal pool (i.e. something like ldr w0,[pc,#offset_to_value]). Accessing memory is slow, so the compiler generated another sequence of instructions for you that achieves the same thing in a more efficient manner.
Pseudo-instructions are mainly a convenience for humans writing assembly code, making the code easier for them or their colleagues to read/write/maintain. Unlike a human being, a compiler doesn't get fatigued by repeating the same task over and over, and therefore doesn't have as much need for conveniences like that.
TL;DR: The compiler will generate what it thinks is the best (according to the current optimization level) instruction sequence. Also, that particular form of LDR is a pseudo-instruction, so you might not be able to get a compiler to generate it even if you disable all optimizations.