For an STM32F7, which has hardware instructions for double-precision floating point, I want to convert a uint64_t to a double.
In order to test that, I used the following code:
volatile static uint64_t m_testU64 = 45uLL * 0xFFFFFFFFuLL;
volatile static double m_testD;
#ifndef DO_NOT_USE_UL2D
m_testD = (double)m_testU64;
#else
double t = (double)(uint32_t)(m_testU64 >> 32u);  /* high 32 bits */
t *= 4294967296.0;                                /* scale by 2^32 */
t += (double)(uint32_t)(m_testU64 & 0xFFFFFFFFu); /* add the low 32 bits */
m_testD = t;
#endif
By default (if DO_NOT_USE_UL2D is not defined) the compiler (gcc or clang) calls the function __aeabi_ul2d(), which is fairly complex in terms of the number of executed instructions. See the assembly code here: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/arm/ieee754-df.S#L537
For my particular example, it takes 20 instructions without entering most of the branches.
And if DO_NOT_USE_UL2D is defined, the compiler generates the following assembly code:
movw r0, #1728 ; 0x6c0
vldr d2, [pc, #112] ; 0x303fa0
movt r0, #8192 ; 0x2000
vldr s0, [r0, #4]
ldr r1, [r0, #0]
vcvt.f64.u32 d0, s0
vldr s2, [r0]
vcvt.f64.u32 d1, s2
ldr r1, [r0, #4]
vfma.f64 d1, d0, d2
vstr d1, [r0, #8]
The code is simpler, at only 11 instructions.
So here are the questions (if DO_NOT_USE_UL2D is defined):
Is my code (in C) correct?
Is my code slower than the __aeabi_ul2d() function? (Not really important, but I am a bit curious.)
I have to do this, since I am not allowed to use functions from libgcc (there are very good reasons for that...).
Be aware that the main purpose of this question is not performance; I am really curious about the implementation in libgcc, and I really want to know whether there is something wrong in my code.
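For reference, here is a small host-side sanity check of my own (not part of the original test) that compares the split conversion against the compiler's direct cast, including values near the 53-bit mantissa boundary where rounding is most delicate:

#include <stdint.h>
#include <stdio.h>

// Manual uint64_t -> double conversion via two 32-bit halves,
// mirroring the DO_NOT_USE_UL2D branch above.
static double u64_to_double(uint64_t v)
{
    double t = (double)(uint32_t)(v >> 32u);
    t *= 4294967296.0;                         /* 2^32: exact, power of two */
    t += (double)(uint32_t)(v & 0xFFFFFFFFu);  /* single rounding step */
    return t;
}

int main(void)
{
    const uint64_t tests[] = {
        45uLL * 0xFFFFFFFFuLL,    // the value from the question
        (1uLL << 53) + 1u,        // just above the 53-bit mantissa limit
        0xFFFFFFFFFFFFFFFFuLL,    // all bits set
    };
    for (unsigned i = 0; i < sizeof tests / sizeof tests[0]; i++) {
        double a = (double)tests[i];           // direct cast (__aeabi_ul2d on the target)
        double b = u64_to_double(tests[i]);
        printf("%llu -> %s\n", (unsigned long long)tests[i],
               a == b ? "match" : "MISMATCH");
    }
    return 0;
}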
I'm trying to implement some "OSEK services" on an ARM7TDMI-S using GCC for ARM. Unfortunately, turning up the optimization level results in "wrong" code generation. The main thing I don't understand is that the compiler seems to ignore the procedure call standard, e.g. passing parameters to a function by moving them into registers r0-r3. I understand that function calls can be inlined, but the parameters still need to be in the registers to perform the system call.
Consider the following code to demonstrate my problem:
unsigned SysCall(unsigned param)
{
    volatile unsigned ret_val;
    __asm __volatile
    (
        "swi 0        \n\t" /* perform the system call */
        "mov %[v], r0 \n\t" /* move the result into ret_val */
        : [v] "=r" (ret_val)
        :: "r0"
    );
    return ret_val; /* return the result */
}

int main()
{
    unsigned retCode;
    retCode = SysCall(5); // expect retCode to be 6 when returning back to usermode
}
I wrote the Top-Level software interrupt handler in assembly as follows:
.type SWIHandler, %function
.global SWIHandler
SWIHandler:
    stmfd sp!, {r0-r2, lr}    # save regs
    ldr r0, [lr, #-4]         # load the sysCall instruction and extract the sysCall number
    bic r0, #0xff000000
    ldr r3, =DispatchTable    # load dispatch table
    ldr r3, [r3, r0, LSL #2]  # load sysCall address into r3
    ldmia sp, {r0-r2}         # load parameters into r0-r2
    mov lr, pc
    bx r3
    stmia sp, {r0-r2}         # store the result back on the stack
    ldr lr, [sp, #12]         # restore return address
    ldmfd sp!, {r0-r2, lr}    # load result into registers
    movs pc, lr               # back to the next instruction after swi 0
The dispatch table looks like this:
DispatchTable:
.word activateTaskService
.word getTaskStateService
The SystemCall function looks like this:
unsigned activateTaskService(unsigned tID)
{
    return tID + 1; /* only for demonstration */
}
Running without optimization, everything works fine and the parameters are in the registers as expected. See the following code, compiled with -O0:
00000424 <main>:
424: e92d4800 push {fp, lr}
428: e28db004 add fp, sp, #4
42c: e24dd008 sub sp, sp, #8
430: e3a00005 mov r0, #5 #move param into r0
434: ebffffe1 bl 3c0 <SysCall>
000003c0 <SysCall>:
3c0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
3c4: e28db000 add fp, sp, #0
3c8: e24dd014 sub sp, sp, #20
3cc: e50b0010 str r0, [fp, #-16]
3d0: ef000000 svc 0x00000000
3d4: e1a02000 mov r2, r0
3d8: e50b2008 str r2, [fp, #-8]
3dc: e51b3008 ldr r3, [fp, #-8]
3e0: e1a00003 mov r0, r3
3e4: e24bd000 sub sp, fp, #0
3e8: e49db004 pop {fp} ; (ldr fp, [sp], #4)
3ec: e12fff1e bx lr
Compiling the same code with -O3 results in the following assembly code:
00000778 <main>:
778: e24dd008 sub sp, sp, #8
77c: ef000000 svc 0x00000000 #Inline SystemCall without passing params into r0
780: e1a02000 mov r2, r0
784: e3a00000 mov r0, #0
788: e58d2004 str r2, [sp, #4]
78c: e59d3004 ldr r3, [sp, #4]
790: e28dd008 add sp, sp, #8
794: e12fff1e bx lr
Notice how the SystemCall gets inlined without assigning the value 5 to r0.
My first approach is to move those values manually into the registers by adapting the function SysCall from above as follows:
unsigned SysCall(volatile unsigned p1)
{
    volatile unsigned ret_val;
    __asm __volatile
    (
        "mov r0, %[p1] \n\t"
        "swi 0         \n\t"
        "mov %[v], r0  \n\t"
        : [v] "=r" (ret_val)
        : [p1] "r" (p1)
        : "r0"
    );
    return ret_val;
}
It seems to work in this minimal example, but I'm not sure whether this is the best possible practice. Why does the compiler think it can omit the parameters when inlining the function? Does anybody have suggestions on whether this approach is okay, or on what should be done differently?
Thank you in advance.
A function call in C source code does not instruct the compiler to call the function according to the ABI. It instructs the compiler to call the function according to the model in the C standard, which means the compiler must pass the arguments to the function in a way of its choosing and execute the function in a way that has the same observable effects as defined in the C standard.
Those observable effects do not include setting any processor registers. When a C compiler inlines a function, it is not required to set any particular processor registers. If it calls a function using an ABI for external calls, then it would have to set registers. Inline calls do not need to obey the ABI.
So merely putting your system request inside a function built of C source code does not guarantee that any registers will be set.
For ARM, what you should do is define register variables assigned to the required register(s) and use those as input and output to the assembly instructions:
unsigned SysCall(unsigned param)
{
    register unsigned Parameter __asm__("r0") = param;
    register unsigned Result    __asm__("r0");
    __asm__ volatile
    (
        "swi 0"
        : "=r" (Result)
        : "r" (Parameter)
        : // "memory" // if any inputs are pointers
    );
    return Result;
}
(This is a major kludge by GCC; it is ugly, and the documentation is poor. But see also https://stackoverflow.com/tags/inline-assembly/info for some links. GCC for some ISAs has convenient specific-register constraints you can use instead of r, but not for ARM.) The register variables do not need to be volatile; the compiler knows they will be used as input and output for the assembly instructions.
The asm statement itself should be volatile if it has side effects other than producing a return value. (e.g. getpid() doesn't need to be volatile.)
A non-volatile asm statement with outputs can be optimized away if the output is unused, or hoisted out of loops if it's used with the same input (like a pure function call). This is almost never what you want for a system call.
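For instance (a made-up illustration, not from the question's code), a pure computation wrapped in asm can safely omit volatile, and the compiler may then delete or merge it like any other pure expression:

static inline unsigned add_one(unsigned x)
{
    unsigned out;
    // No volatile: if 'out' is unused the asm is removed entirely,
    // and two calls with the same 'x' can be merged (CSE).
    __asm__("add %0, %1, #1" : "=r"(out) : "r"(x));
    return out;
}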
You also need a "memory" clobber if any of the inputs are pointers to memory that the kernel will read or modify. See How can I indicate that the memory *pointed* to by an inline ASM argument may be used? for more details (and a way to use a dummy memory input or output to avoid a "memory" clobber.)
A "memory" clobber on mmap/munmap or other system calls that affect what memory means would also be wise; you don't want the compiler to decide to do a store after munmap instead of before.
I am compiling this code for a Cortex M7 using GCC:
// copy manually
void write_test_plain(uint8_t *ptr, uint32_t value)
{
    *ptr++ = (uint8_t)(value);
    *ptr++ = (uint8_t)(value >> 8);
    *ptr++ = (uint8_t)(value >> 16);
    *ptr++ = (uint8_t)(value >> 24);
}

// copy using memcpy
void write_test_memcpy(uint8_t *ptr, uint32_t value)
{
    void *px = (void *)&value;
    memcpy(ptr, px, 4);
}
int main(void)
{
    extern uint8_t data[];
    extern uint32_t value;
    // I added some offsets to data to make sure the
    // compiler cannot assume it's aligned in memory
    write_test_plain(data + 2, value);
    __asm volatile("" : : : "memory"); // just to split the inlined calls
    write_test_memcpy(data + 5, value);
    ... do something with data ...
}
And I get the following Thumb2 assembly with -O2:
// write_test_plain(data + 2, value);
800031c: 2478 movs r4, #120 ; 0x78
800031e: 2056 movs r0, #86 ; 0x56
8000320: 2134 movs r1, #52 ; 0x34
8000322: 2212 movs r2, #18 ; 0x12
8000324: 759c strb r4, [r3, #22]
8000326: 75d8 strb r0, [r3, #23]
8000328: 7619 strb r1, [r3, #24]
800032a: 765a strb r2, [r3, #25]
// write_test_memcpy(data + 5, value);
800032c: 4ac4 ldr r2, [pc, #784] ; (8000640 <main+0x3a0>)
800032e: 923b str r2, [sp, #236] ; 0xec
8000330: 983b ldr r0, [sp, #236] ; 0xec
8000332: f8c3 0019 str.w r0, [r3, #25]
Can someone explain how the memcpy version works? It looks like an inlined 32-bit store to the destination address, but isn't that a problem, since data + 5 is almost certainly not aligned to a 4-byte boundary?
Or is this perhaps an optimization that relies on some undefined behavior in my source?
For Cortex-M processors, unaligned loads and stores of half-words and words are usually allowed, and most compilers exploit this when generating code unless they are instructed not to. If you want to prevent gcc from assuming that unaligned accesses are OK, you can use the -mno-unaligned-access compiler flag.
If you specify this flag, gcc will no longer inline the call to memcpy, and write_test_memcpy looks like:
write_test_memcpy(unsigned char*, unsigned long):
push {lr}
sub sp, sp, #12
movs r2, #4
add r3, sp, #8
str r1, [r3, #-4]!
mov r1, r3
bl memcpy
add sp, sp, #12
ldr pc, [sp], #4
The Cortex-M7, M4, M3, and M33 support unaligned accesses; the Cortex-M0, M0+, and M23 do not.
However, you can disable support for unaligned accesses on the Cortex-M7 by setting the UNALIGN_TRP bit in the Configuration and Control Register (CCR); any unaligned access will then generate a UsageFault.
From the compiler's perspective, the default is to generate code that performs unaligned accesses, unless you disable this with the compile flag -mno-unaligned-access.
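As a sketch (assuming the CMSIS headers, which provide SCB, the UNALIGN_TRP mask, and the barrier intrinsics), enabling the trap looks like this:

#include "core_cm7.h"   // CMSIS core header, normally pulled in via the device header

void enable_unaligned_trap(void)
{
    SCB->CCR |= SCB_CCR_UNALIGN_TRP_Msk;  // trap on any unaligned access
    __DSB();                              // ensure the write has taken effect
    __ISB();                              // before executing further code
}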
Here is a C source code example:
register int a asm("r8");
register int b asm("r9");
int main() {
    int c;
    a = 2;
    b = 3;
    c = a + b;
    return c;
}
And this is the assembly generated using an ARM gcc cross-compiler:
$ arm-linux-gnueabi-gcc -c global_reg_var_test.c -Wa,-a,-ad
...
mov r8, #2
mov r9, #3
mov r2, r8
mov r3, r9
add r3, r2, r3
...
When using -frename-registers, the behaviour was the same. (Updated: I had previously said this was with -O3.)
So the question is: why does gcc add the 3rd and 4th MOVs instead of a single add r3, r8, r9?
Context: I need to optimize code for a simulated in-order CPU (the gem5 ARM MinorCPU), which doesn't rename registers.
I took the real example (posted in the comments) and put it on the Godbolt compiler explorer. The main inefficiency in calc() is that src1 and src2 are globals that it has to load from memory, instead of args passed in registers.
I didn't look at main, just calc.
register int sum asm ("r4");
register int r asm ("r5");
register int c asm ("r6");
register int k asm ("r7");
register int temp1 asm ("r8"); // really? you're using two global register vars for scratch temporaries? Just let the compiler do its job.
register int temp2 asm ("r9");
register long n asm ("r10");
int *src1, *src2, *dst;
void calc() {
    temp1 = r*n;
    temp2 = k*n;
    temp1 = temp1 + k;
    temp2 = temp2 + c;
    // you get bad code for this because src1 and src2 are globals, not args passed in regs
    sum = sum + src1[temp1] * src2[temp2];
}
# gcc 4.8.2 -O3 -Wall -Wextra -Wa,-a,-ad -fverbose-asm
mla r0, r10, r7, r6 # temp2.9, n, k, c ## tmp = k*n + c
movw r3, #:lower16:.LANCHOR0 # tmp136,
mla r8, r10, r5, r7 # temp1, n, r, k ## temp1 = r*n + k
movt r3, #:upper16:.LANCHOR0 # tmp136,
ldmia r3, {r1, r2} # tmp136,, ## load both pointers, since they're stored adjacently in memory
mov r9, r0 # temp2, temp2.9 ## This insn is wasted: the first MLA should have had this as the dest
ldr r3, [r1, r8, lsl #2] # *_22, *_22
ldr r2, [r2, r9, lsl #2] # *_28, *_28
mla r4, r2, r3, r4 # sum, *_28, *_22, sum
bx lr #
For some reason, one of the integer multiply-accumulate (mla) instructions uses r8 (temp1) as the destination, but the other one writes to r0 (a scratch reg), and only later moves the result to r9 (temp2).
The sum += src1[temp1] * src2[temp2] is done with an mla that reads and writes r4 (sum).
Why do you need temp1 and temp2 to be globals? That's just going to stop the optimizer from doing aggressive optimizations that don't calculate exactly the same temporaries that the C source does. Fortunately the C memory model is weak enough that it should be able to reorder assignments to them, although this might actually be why it didn't MLA into temp2 directly, since it decided to do that calculation first. (Hmm, does the memory model even apply? Other threads can't see our registers at all, so those globals are all effectively thread-local. It should allow relaxed ordering for assignments to globals. Signal handlers can see these globals, and could run at any point. gcc isn't following strict source order, since in the source both multiplies happen before either add.)
Godbolt doesn't have a newer ARM gcc version, so I can't easily test a newer gcc. A newer gcc might do a better job with this.
BTW, I tried a version of the function using local variables for temporaries, and didn't actually get better results. Probably because there are still so many register globals that gcc couldn't pick convenient regs for the temporaries.
// same register globals, except for temp1 and temp2.
void calc_local_tmp() {
    int t1 = r*n + k;
    sum += src1[t1] * src2[k*n + c];
}
push {lr} # gcc decides to push to get a tmp reg
movw r3, #:lower16:.LANCHOR0 # tmp131,
mla lr, r10, r5, r7 # tmp133, n.1, r, k.2
movt r3, #:upper16:.LANCHOR0 # tmp131,
mla ip, r7, r10, r6 # tmp137, k.2, n.1, c
ldr r2, [r3] # src1, src1
ldr r0, [r3, #4] # src2, src2
ldr r1, [r2, lr, lsl #2] # *_10, *_10
ldr r3, [r0, ip, lsl #2] # *_20, *_20
mla r4, r3, r1, r4 # sum, *_20, *_10, sum
ldr pc, [sp], #4 #
Compiling with -fcall-used-r8 -fcall-used-r9 didn't help; gcc makes the same code that pushes lr to get an extra temporary. It fails to use ldmia (load-multiple) because it makes a sub-optimal choice of which temporary to put in which reg. (&src1 in r0 would let it load src1 and src2 into r2 and r3.)
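For comparison, here is a sketch of mine (not tested against that gcc version) that keeps the scalar register globals but passes the two pointers as arguments, so per the AAPCS they arrive in r0 and r1 instead of being loaded from memory:

// Uses the register globals (sum, r, c, k, n) declared above.
void calc_ptr_args(const int *src1, const int *src2)
{
    int t1 = r * n + k;
    int t2 = k * n + c;
    sum += src1[t1] * src2[t2];
}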
I need to reduce the code bloat for the Cortex-M0 microprocessor.
At startup the ROM data has to be copied to the RAM data once. Therefore I have this piece of code:
unsigned int *s, *d, *e;   // note: file-scope pointers (this matters below)

void __startup( void ){
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;
    // copy .data section from flash to ram
    s = &__data_init_start;
    d = &__data_start;
    e = &__data_end;
    while( d != e ){
        *d++ = *s++;
    }
}
The assembly code that is generated by the compiler looks like this:
ldr r1, .L10+8
ldr r2, .L10+12
sub r0, r1, r2
lsr r3, r0, #2
add r3, r3, #1
lsl r1, r3, #2
mov r3, #0
.L4:
add r3, r3, #4
cmp r3, r1
beq .L9
.L5:
ldr r4, .L10+16
add r0, r2, r3
add r4, r3, r4
sub r4, r4, #4
ldr r4, [r4]
sub r0, r0, #4
str r4, [r0]
b .L4
How can I optimize this code so the code size is at minimum?
The compiler (or you!) does not realize that the number of bytes to copy is simply end - start. There is some unnecessary shuffling of data going on: the two adds and the sub in the loop. Also, the compiler makes sure at run time that the number of bytes to copy is a multiple of 4. An obvious optimization is to ensure that in advance! Below I assume it is (if not, the bne will never fall through and will happily keep on copying, trampling all over your memory).
Using my decade-old ARM assembler knowledge (yes, that is a major disclaimer) and post-increment addressing, I think the following short snippet is what the loop can be condensed to. From 18 instructions down to 8, not too bad. If it works.
    ldr r1, =__data_init_start
    ldr r2, =__data_start
    ldr r3, =__data_end
    sub r4, r3, r2          ; byte count to copy
.L1:
    ldr r3, [r1], #4        ; safe to re-use r3 here
    str r3, [r2], #4
    subs r4, r4, #4
    bne .L1
It may be that the platform guarantees that writing through an unsigned int * can change an unsigned int * value (i.e. it doesn't take advantage of the type-mismatch aliasing rules).
Then the code is inefficient because e is a global variable, and the generated code must take into account that writing to *d may change the value of e.
Making at least e a local should solve this problem (most compilers know that aliasing a local whose address is never taken is not possible from a C point of view).
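A sketch of that fix applied to the question's snippet (the pointers become locals whose addresses are never taken, so the stores through d cannot alias them):

void __startup( void ){
    extern unsigned int __data_init_start;
    extern unsigned int __data_start;
    extern unsigned int __data_end;
    // copy .data section from flash to ram
    unsigned int *s = &__data_init_start;
    unsigned int *d = &__data_start;
    unsigned int *e = &__data_end;
    while( d != e ){
        *d++ = *s++;
    }
}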
What does __asm__ __volatile__ () basically do, and what is the significance of "memory" for the ARM architecture?
asm volatile("" ::: "memory");
creates a compiler-level memory barrier, forcing the optimizer not to reorder memory accesses across the barrier.
For example, if you need to access some addresses in a specific order (probably because that memory area is actually backed by a device rather than by ordinary memory), you need to be able to tell this to the compiler; otherwise it may just optimize your steps for the sake of efficiency.
Assume in this scenario you must increment a value at an address, read something, and increment another value at an adjacent address.
int c(int *d, int *e) {
    int r;
    d[0] += 1;
    r = e[0];
    d[1] += 1;
    return r;
}
The problem is that the compiler (gcc in this case) can rearrange your memory accesses to get better performance if you ask for it (-O), probably leading to a sequence of instructions like below:
00000000 <c>:
0: 4603 mov r3, r0
2: c805 ldmia r0, {r0, r2}
4: 3001 adds r0, #1
6: 3201 adds r2, #1
8: 6018 str r0, [r3, #0]
a: 6808 ldr r0, [r1, #0]
c: 605a str r2, [r3, #4]
e: 4770 bx lr
Above, the values for d[0] and d[1] are loaded at the same time. Let's assume this is something you want to avoid; then you need to tell the compiler not to reorder memory accesses, and the way to do that is asm volatile("" ::: "memory").
int c(int *d, int *e) {
    int r;
    d[0] += 1;
    r = e[0];
    asm volatile("" ::: "memory");
    d[1] += 1;
    return r;
}
So you'll get your instruction sequence as you want it to be:
00000000 <c>:
0: 6802 ldr r2, [r0, #0]
2: 4603 mov r3, r0
4: 3201 adds r2, #1
6: 6002 str r2, [r0, #0]
8: 6808 ldr r0, [r1, #0]
a: 685a ldr r2, [r3, #4]
c: 3201 adds r2, #1
e: 605a str r2, [r3, #4]
10: 4770 bx lr
12: bf00 nop
It should be noted that this is only a compile-time memory barrier, there to keep the compiler from reordering memory accesses; it emits no extra hardware-level instructions to flush memory or wait for loads or stores to complete. CPUs can still reorder memory accesses if they have the architectural capability and the memory addresses are of Normal type, rather than Strongly-ordered or Device (ref).
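If ordering must also hold at the hardware level (between cores, or against a DMA master), a real barrier instruction is needed in addition to the compiler barrier. A minimal sketch for ARMv7 (dmb does not exist on older cores such as the ARM7TDMI):

static inline void full_memory_barrier(void)
{
    // DMB orders memory accesses in hardware; the "memory" clobber
    // additionally acts as a compiler-level barrier.
    __asm__ volatile("dmb" ::: "memory");
}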
This sequence is a compiler memory access scheduling barrier, as noted in the article referenced by Udo. This one is GCC specific - other compilers have other ways of describing them, some of them with more explicit (and less esoteric) statements.
__asm__ is a gcc extension permitting assembly language statements to be nested within your C code; it is used here for its ability to specify side effects that prevent the compiler from performing certain types of optimisation (which in this case might end up generating incorrect code).
__volatile__ is required to ensure that the asm statement itself is not reordered with respect to any other volatile accesses (a guarantee in the C language).
"memory" is an instruction to GCC that (sort of) says that the inline asm sequence has side effects on global memory, and hence that effects not just on local variables need to be taken into account.
The meaning is explained here:
http://en.wikipedia.org/wiki/Memory_ordering
Basically, it implies that the assembly code will be executed where you expect it. It tells the compiler not to reorder instructions around it: what is coded before this piece of code will be executed before it, and what is coded after will be executed after it.
static inline unsigned long arch_local_irq_save(void)
{
    unsigned long flags;

    asm volatile(
        "   mrs %0, cpsr    @ arch_local_irq_save\n"
        "   cpsid i"        /* disable IRQs */
        : "=r" (flags) : : "memory", "cc");
    return flags;
}
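For completeness, the matching restore function follows the same pattern (roughly as it appears in the Linux ARM sources); the "memory" clobber again keeps memory accesses from being moved across the point where the saved state is written back:

static inline void arch_local_irq_restore(unsigned long flags)
{
    asm volatile(
        "   msr cpsr_c, %0  @ arch_local_irq_restore"
        :
        : "r" (flags)
        : "memory", "cc");
}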