Is vldr/vstr atomic?

Are ARM's vldr/vstr atomic on SMP? For example, one thread is doing
vldr d0, mem0
vstr d0, mem1
while the other is doing
vmov d0, r0, r1
vstr d0, mem0
Would thread one see a consistent memory state, with both r0 and r1 visible, or not?

Related

gcc arm optimizes away parameters before System Call

I'm trying to implement some "OSEK-Services" on an arm7tdmi-s using gcc arm. Unfortunately, turning up the optimization level results in "wrong" code generation. The main thing I don't understand is that the compiler seems to ignore the procedure call standard, e.g. passing parameters to a function by moving them into registers r0-r3. I understand that function calls can be inlined, but the parameters still need to be in the registers to perform the system call.
Consider the following code to demonstrate my problem:
unsigned SysCall(unsigned param)
{
    volatile unsigned ret_val;
    __asm __volatile
    (
        "swi 0 \n\t"          /* perform SystemCall */
        "mov %[v], r0 \n\t"   /* move the result into ret_val */
        : [v]"=r"(ret_val)
        :: "r0"
    );
    return ret_val; /* return the result */
}
int main()
{
    unsigned retCode;
    retCode = SysCall(5); // expect retCode to be 6 when returning back to usermode
}
I wrote the Top-Level software interrupt handler in assembly as follows:
.type SWIHandler, %function
.global SWIHandler
SWIHandler:
stmfd sp! , {r0-r2, lr} @ save regs
ldr r0 , [lr, #-4] @ load sysCall instruction and extract sysCall number
bic r0 , #0xff000000
ldr r3 , =DispatchTable @ load dispatchTable
ldr r3 , [r3, r0, LSL #2] @ load sysCall address into r3
ldmia sp, {r0-r2} @ load parameters into r0-r2
mov lr, pc
bx r3
stmia sp ,{r0-r2} @ store the result back on the stack
ldr lr, [sp, #12] @ restore return address
ldmfd sp! , {r0-r2, lr} @ load result into register
movs pc , lr @ back to next instruction after swi 0
The dispatch table looks like this:
DispatchTable:
.word activateTaskService
.word getTaskStateService
The SystemCall function looks like this:
unsigned activateTaskService(unsigned tID)
{
return tID + 1; /* only for demonstration */
}
Without optimization everything works fine, and the parameters are in the registers as expected.
See the following code compiled with -O0:
00000424 <main>:
424: e92d4800 push {fp, lr}
428: e28db004 add fp, sp, #4
42c: e24dd008 sub sp, sp, #8
430: e3a00005 mov r0, #5 #move param into r0
434: ebffffe1 bl 3c0 <SysCall>
000003c0 <SysCall>:
3c0: e52db004 push {fp} ; (str fp, [sp, #-4]!)
3c4: e28db000 add fp, sp, #0
3c8: e24dd014 sub sp, sp, #20
3cc: e50b0010 str r0, [fp, #-16]
3d0: ef000000 svc 0x00000000
3d4: e1a02000 mov r2, r0
3d8: e50b2008 str r2, [fp, #-8]
3dc: e51b3008 ldr r3, [fp, #-8]
3e0: e1a00003 mov r0, r3
3e4: e24bd000 sub sp, fp, #0
3e8: e49db004 pop {fp} ; (ldr fp, [sp], #4)
3ec: e12fff1e bx lr
Compiling the same code with -O3 results in the following assembly code:
00000778 <main>:
778: e24dd008 sub sp, sp, #8
77c: ef000000 svc 0x00000000 #Inline SystemCall without passing params into r0
780: e1a02000 mov r2, r0
784: e3a00000 mov r0, #0
788: e58d2004 str r2, [sp, #4]
78c: e59d3004 ldr r3, [sp, #4]
790: e28dd008 add sp, sp, #8
794: e12fff1e bx lr
Notice how the system call gets inlined without assigning the value 5 to r0.
My first approach is to move those values manually into the registers by adapting the function SysCall from above as follows:
unsigned SysCall(volatile unsigned p1)
{
    volatile unsigned ret_val;
    __asm __volatile
    (
        "mov r0, %[p1] \n\t"
        "swi 0 \n\t"
        "mov %[v], r0 \n\t"
        : [v]"=r"(ret_val)
        : [p1]"r"(p1)
        : "r0"
    );
    return ret_val;
}
It seems to work in this minimal example, but I'm not sure whether this is the best practice. Why does the compiler think it can omit the parameters when inlining the function? Does anybody have suggestions on whether this approach is okay or what should be done differently?
Thank you in advance

A function call in C source code does not instruct the compiler to call the function according to the ABI. It instructs the compiler to call the function according to the model in the C standard, which means the compiler must pass the arguments to the function in a way of its choosing and execute the function in a way that has the same observable effects as defined in the C standard.
Those observable effects do not include setting any processor registers. When a C compiler inlines a function, it is not required to set any particular processor registers. If it calls a function using an ABI for external calls, then it would have to set registers. Inline calls do not need to obey the ABI.
So merely putting your system request inside a function built of C source code does not guarantee that any registers will be set.
For ARM, what you should do is define register variables assigned to the required register(s) and use those as input and output to the assembly instructions:
unsigned SysCall(unsigned param)
{
    register unsigned Parameter __asm__("r0") = param;
    register unsigned Result __asm__("r0");
    __asm__ volatile
    (
        "swi 0"
        : "=r" (Result)
        : "r" (Parameter)
        : // "memory" // if any inputs are pointers
    );
    return Result;
}
(This is a major kludge by GCC; it is ugly, and the documentation is poor. But see also https://stackoverflow.com/tags/inline-assembly/info for some links. GCC for some ISAs has convenient specific-register constraints you can use instead of r, but not for ARM.) The register variables do not need to be volatile; the compiler knows they will be used as input and output for the assembly instructions.
The asm statement itself should be volatile if it has side effects other than producing a return value. (e.g. getpid() doesn't need to be volatile.)
A non-volatile asm statement with outputs can be optimized away if the output is unused, or hoisted out of loops if it's used with the same input (like a pure function call). This is almost never what you want for a system call.
You also need a "memory" clobber if any of the inputs are pointers to memory that the kernel will read or modify. See "How can I indicate that the memory *pointed* to by an inline ASM argument may be used?" for more details (and a way to use a dummy memory input or output to avoid a "memory" clobber).
A "memory" clobber on mmap/munmap or other system calls that affect what memory means would also be wise; you don't want the compiler to decide to do a store after munmap instead of before.

Conversion from uint64_t to double

For an STM32F7, which includes instructions for double-precision floating point, I want to convert a uint64_t to double.
To test that, I used the following code:
volatile static uint64_t m_testU64 = 45uLL * 0xFFFFFFFFuLL;
volatile static double m_testD;
#ifndef DO_NOT_USE_UL2D
m_testD = (double)m_testU64;
#else
double t = (double)(uint32_t)(m_testU64 >> 32u);
t *= 4294967296.0;
t += (double)(uint32_t)(m_testU64 & 0xFFFFFFFFu);
m_testD = t;
#endif
By default (if DO_NOT_USE_UL2D is not defined) the compiler (gcc or clang) calls the function __aeabi_ul2d(), which is fairly complex in terms of the number of executed instructions. See the assembly code here: https://github.com/gcc-mirror/gcc/blob/master/libgcc/config/arm/ieee754-df.S#L537
For my particular example, it takes 20 instructions without entering most of the branches.
If DO_NOT_USE_UL2D is defined, the compiler generates the following assembly code:
movw r0, #1728 ; 0x6c0
vldr d2, [pc, #112] ; 0x303fa0
movt r0, #8192 ; 0x2000
vldr s0, [r0, #4]
ldr r1, [r0, #0]
vcvt.f64.u32 d0, s0
vldr s2, [r0]
vcvt.f64.u32 d1, s2
ldr r1, [r0, #4]
vfma.f64 d1, d0, d2
vstr d1, [r0, #8]
The code is simpler, and it is only 11 instructions.
So here are the questions (if DO_NOT_USE_UL2D is defined):
Is my code (in C) correct?
Is my code slower than the __aeabi_ul2d() function (not really important, but I am a bit curious)?
I have to do that, since I am not allowed to use functions from libgcc (there are very good reasons for that...).
Be aware that the main purpose of this question is not performance; I am really curious about the implementation in libgcc, and I really want to know if there is something wrong in my code.
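One way to gain some confidence in the split conversion is to compare it against the compiler's own cast on a host machine. The sketch below is a spot check, not a proof: u64_to_double_split and count_mismatches are hypothetical names, the sample values are arbitrary, and it assumes IEEE 754 binary64 arithmetic evaluated in double precision with round-to-nearest. Under those assumptions the high half times 2^32 is exact, so only the final addition rounds, whether or not the compiler fuses it into an FMA.

```c
#include <stddef.h>
#include <stdint.h>

/* Split conversion from the question: convert each 32-bit half separately. */
static double u64_to_double_split(uint64_t x)
{
    double t = (double)(uint32_t)(x >> 32);   /* exact: fits in 32 bits */
    t *= 4294967296.0;                        /* exact: scaling by 2^32 */
    t += (double)(uint32_t)(x & 0xFFFFFFFFu); /* the only rounding step */
    return t;
}

/* Hypothetical helper: count samples where the split result differs from the cast. */
static int count_mismatches(void)
{
    static const uint64_t samples[] = {
        0u, 1u, 0xFFFFFFFFu, 0x100000000ull,
        45ull * 0xFFFFFFFFull,          /* value used in the question */
        (1ull << 53) + 1u,              /* smallest value that is not exactly representable */
        0x7FFFFFFFFFFFFFFFull, 0xFFFFFFFFFFFFFFFFull,
        0xFFFFFFFFFFFFF800ull, 0xFFFFFFFFFFFFFBFFull, 0xFFFFFFFFFFFFFC00ull
    };
    int mismatches = 0;
    for (size_t i = 0; i < sizeof samples / sizeof samples[0]; i++) {
        if (u64_to_double_split(samples[i]) != (double)samples[i])
            mismatches++;
    }
    return mismatches;
}
```

A host build that reports zero mismatches says nothing about speed on the STM32F7, and a target with a different rounding mode or evaluation precision could behave differently.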

ARM assembly: can’t find a register in class ‘GENERAL_REGS’ while reloading ‘asm’

I am trying to implement a function which multiplies a 32-bit operand by a 256-bit operand in ARM assembly on an ARM Cortex-A8. The problem is that I am running out of registers, and I have no idea how I can reduce the number of registers used here. Here is my function:
typedef struct UN_256fe{
uint32_t uint32[8];
}UN_256fe;
typedef struct UN_288bite{
uint32_t uint32[9];
}UN_288bite;
void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res){
asm (
"umull r3, r4, %9, %10;\n\t"
"mov %0, r3; \n\t"/*res->uint32[0] = r3*/
"umull r3, r5, %9, %11;\n\t"
"adds r6, r3, r4; \n\t"/*res->uint32[1] = r3 + r4*/
"mov %1, r6; \n\t"
"umull r3, r4, %9, %12;\n\t"
"adcs r6, r5, r3; \n\t"
"mov %2, r6; \n\t"/*res->uint32[2] = r6*/
"umull r3, r5, %9, %13;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %3, r6; \n\t"/*res->uint32[3] = r6*/
"umull r3, r4, %9, %14;\n\t"
"adcs r6, r3, r5; \n\t"
"mov %4, r6; \n\t"/*res->uint32[4] = r6*/
"umull r3, r5, %9, %15;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %5, r6; \n\t"/*res->uint32[5] = r6*/
"umull r3, r4, %9, %16;\n\t"
"adcs r6, r3, r5; \n\t"
"mov %6, r6; \n\t"/*res->uint32[6] = r6*/
"umull r3, r5, %9, %17;\n\t"
"adcs r6, r3, r4; \n\t"
"mov %7, r6; \n\t"/*res->uint32[7] = r6*/
"adc r6, r5, #0 ; \n\t"
"mov %8, r6; \n\t"/*res->uint32[8] = r6*/
: "=r"(res->uint32[8]), "=r"(res->uint32[7]), "=r"(res->uint32[6]), "=r"(res->uint32[5]), "=r"(res->uint32[4]),
"=r"(res->uint32[3]), "=r"(res->uint32[2]), "=r"(res->uint32[1]), "=r"(res->uint32[0])
: "r"(A), "r"(B->uint32[7]), "r"(B->uint32[6]), "r"(B->uint32[5]),
"r"(B->uint32[4]), "r"(B->uint32[3]), "r"(B->uint32[2]), "r"(B->uint32[1]), "r"(B->uint32[0]), "r"(temp)
: "r3", "r4", "r5", "r6", "cc", "memory");
}
EDIT-1: I updated my clobber list based on the first comment, but I still get the same error

A simple solution is to break this up and not use clobbers. Declare variables such as 'tmp1', etc. Try not to use any mov statements; let the compiler do this if it has to. The compiler will use an algorithm to figure out the best 'flow' of information. If you use clobbers, it cannot reuse those registers. The way it is now, you make it load all the memory before the assembler executes. This is bad, as you want the memory accesses and the CPU ALU work to pipeline.
void multiply32x256(uint32_t A, UN_256fe* B, UN_288bite* res)
{
    uint32_t mulhi1, mullo1;
    uint32_t mulhi2, mullo2;
    uint32_t tmp;

    asm("umull %0, %1, %2, %3;\n\t"
        : "=r" (mullo1), "=r" (mulhi1)
        : "r"(A), "r"(B->uint32[7])
    );
    res->uint32[8] = mullo1; /* was 'mov %0, r3' */
    /* "=&r" (early clobber): umull writes %0 and %1 before adds reads %5 */
    asm volatile("umull %0, %1, %3, %4;\n\t"
        "adds %2, %0, %5; \n\t" /* res->uint32[1] = r3 + r4 (new low + previous high) */
        : "=&r" (mullo2), "=&r" (mulhi2), "=r" (tmp)
        : "r"(A), "r"(B->uint32[6]), "r"(mulhi1)
        : "cc"
    );
    res->uint32[7] = tmp; /* was 'mov %1, r6' */
    /* ... etc */
}
The whole purpose of the 'gcc inline assembler' is not to code assembler directly in a 'C' file. It is to use the register allocation logic of the compiler AND do something that cannot be easily done in 'C'. In your case, that is the carry logic.
By not making it one huge 'asm' clause, the compiler can schedule the loads from memory as it needs new registers. It will also pipeline your 'UMULL' ALU activity with the load/store unit.
You should only use clobber if an instruction implicitly clobbers a specific register. You may also use something like,
register int *p1 asm ("r0");
and use that as an output. However, I don't know of any ARM instructions like this, apart from those that alter the stack (which your code doesn't use) and, of course, those that set the carry flag.
GCC knows that memory changes if it is listed as an input/output, so you don't need a memory clobber. In fact it is detrimental, as the memory clobber is a compiler memory barrier and this will cause memory to be written when the compiler might be able to schedule that for later.
The moral is: use gcc inline assembler to work with the compiler. If you code in assembler and you have huge routines, the register use can become complex and confusing. Typical assembler coders will keep only one thing in a register per routine, but that is not always the best use of registers. The compiler will shuffle the data around in a fairly smart way that is difficult to beat (and not very satisfying to hand code, IMO) when the code size gets larger.
You might want to look at the GMP library which has lots of ways to efficiently tackle some of the same issues it looks like your code has.
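Before tuning any inline assembly, it can help to have a plain C version to compare results against. The sketch below is not the code from the question or the answer: multiply32x256_ref is a hypothetical name, and it assumes uint32[0] is the least significant word, which may differ from the ordering used in the question. Whether a given compiler turns the 32x32 to 64-bit multiply into umull should be checked in the generated code.

```c
#include <stdint.h>

typedef struct UN_256fe   { uint32_t uint32[8]; } UN_256fe;
typedef struct UN_288bite { uint32_t uint32[9]; } UN_288bite;

/* Reference 32 x 256 -> 288-bit multiply in portable C.
 * Assumption: uint32[0] holds the least significant word. */
static void multiply32x256_ref(uint32_t A, const UN_256fe *B, UN_288bite *res)
{
    uint32_t carry = 0;
    for (int i = 0; i < 8; i++) {
        /* A 32x32 product plus a 32-bit carry always fits in 64 bits. */
        uint64_t t = (uint64_t)A * B->uint32[i] + carry;
        res->uint32[i] = (uint32_t)t;   /* low word of this column */
        carry = (uint32_t)(t >> 32);    /* high word feeds the next column */
    }
    res->uint32[8] = carry;             /* final carry is the top word */
}
```

Keeping the carry in an ordinary variable lets the compiler decide how to allocate registers and how to use the flags, which is the same division of labour the answer argues for.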

Why does my SWI instruction hang? (BeagleBone Black, ARM Cortex-A8 cpu)

I'm starting to write a toy OS for the BeagleBone Black, which uses an ARM Cortex-A8-based TI Sitara AM3359 SoC and the U-Boot bootloader. I've got a simple standalone hello world app writing to UART0 that I can load through U-Boot so far, and now I'm trying to move on to interrupt handlers, but I can't get SWI to do anything but hang the device.
According to the AM335x TRM (starting on page 4099, if you're interested), the interrupt vector table is mapped in ROM at 0x20000. The ROM SWI handler branches to RAM at 0x4030ce08, which branches to the address stored at 0x4030ce28. (Initially, this is a unique dead loop at 0x20084.)
My code sets up all the ARM processor modes' SP to their own areas at the top of RAM, and enables interrupts in the CPSR, then executes an SWI instruction, which always hangs. (Perhaps jumping to some dead-loop instruction?) I've looked at a bunch of samples, and read whatever documentation I could find, and I don't see what I'm missing.
Currently my only interaction with the board is over a serial connection on UART0 with my Linux box. U-Boot initializes UART0, and allows loading of the binary over the serial connection.
Here's the relevant assembly:
.arm
.section ".text.boot"
.equ usr_mode, 0x10
.equ fiq_mode, 0x11
.equ irq_mode, 0x12
.equ svc_mode, 0x13
.equ abt_mode, 0x17
.equ und_mode, 0x1b
.equ sys_mode, 0x1f
.equ swi_vector, 0x4030ce28
.equ rom_swi_b_addr, 0x20008
.equ rom_swi_addr, 0x20028
.equ ram_swi_b_addr, 0x4030CE08
.equ ram_swi_addr, 0x4030CE28
.macro setup_mode mode, stackpointer
mrs r0, cpsr
mov r1, r0
and r1, r1, #0x1f
bic r0, r0, #0x1f
orr r0, r0, #\mode
msr cpsr_csfx, r0
ldr sp, =\stackpointer
bic r0, r0, #0x1f
orr r0, r0, r1
msr cpsr_csfx, r0
.endm
.macro disable_interrupts
mrs r0, cpsr
orr r0, r0, #0x80
msr cpsr_c, r0
.endm
.macro enable_interrupts
mrs r0, cpsr
bic r0, r0, #0x80
msr cpsr_c, r0
.endm
.global _start
_start:
// Initial SP
ldr r3, =_C_STACK_TOP
mov sp, r3
// Set up all the modes' stacks
setup_mode fiq_mode, _FIQ_STACK_TOP
setup_mode irq_mode, _IRQ_STACK_TOP
setup_mode svc_mode, _SVC_STACK_TOP
setup_mode abt_mode, _ABT_STACK_TOP
setup_mode und_mode, _UND_STACK_TOP
setup_mode sys_mode, _C_STACK_TOP
// Clear out BSS
ldr r0, =_bss_start
ldr r1, =_bss_end
mov r5, #0
mov r6, #0
mov r7, #0
mov r8, #0
b _clear_bss_check$
_clear_bss$:
stmia r0!, {r5-r8}
_clear_bss_check$:
cmp r0, r1
blo _clear_bss$
// Load our SWI handler's address into
// the vector table
ldr r0, =_swi_handler
ldr r1, =swi_vector
str r0, [r1]
// Debug-print out these SWI addresses
ldr r0, =rom_swi_b_addr
bl print_mem
ldr r0, =rom_swi_addr
bl print_mem
ldr r0, =ram_swi_b_addr
bl print_mem
ldr r0, =ram_swi_addr
bl print_mem
enable_interrupts
swi_call$:
swi #0xCC
bl kernel_main
b _reset
.global _swi_handler
_swi_handler:
// Get the SWI parameter into r0
ldr r0, [lr, #-4]
bic r0, r0, #0xff000000
// Save lr onto the stack
stmfd sp!, {lr}
bl print_uint32
ldmfd sp!, {pc}
Those debugging prints produce the expected values:
00020008: e59ff018
00020028: 4030ce08
4030ce08: e59ff018
4030ce28: 80200164
(According to objdump, 0x80200164 is indeed _swi_handler. 0xe59ff018 is the instruction "ldr pc, [pc, #0x18]"; because pc reads as the instruction's address plus 8, it loads from that address plus 0x20.)
What am I missing? It seems like this should work.
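The chain of indirections in the debug output can be checked by decoding the "ldr pc, [pc, #imm]" words by hand. The helper below is a hypothetical host-side sketch (the name ldr_pc_literal_address is made up here); it assumes ARM state, where pc reads as the instruction's address plus 8, and it only recognises the immediate-offset form without writeback.

```c
#include <stdint.h>

/* Returns the address that an ARM-state "ldr pc, [pc, #+/-imm12]" loads from,
 * or 0 if the word is not that exact form. */
static uint32_t ldr_pc_literal_address(uint32_t insn_addr, uint32_t insn)
{
    /* Match LDR (immediate) with P=1, B=0, W=0, L=1, Rn=pc, Rd=pc.
     * The condition field, the U bit and imm12 are ignored by the mask. */
    if ((insn & 0x0F7FF000u) != 0x051FF000u)
        return 0;
    uint32_t imm12 = insn & 0xFFFu;
    uint32_t base = insn_addr + 8u;              /* pc reads as address + 8 */
    return (insn & 0x00800000u) ? base + imm12   /* U=1: add the offset */
                                : base - imm12;  /* U=0: subtract it */
}
```

With the values printed above, the word 0xe59ff018 at 0x20008 resolves to 0x20028, and the same word at 0x4030ce08 resolves to 0x4030ce28, which matches the two slots the question dumps.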

The firmware on the board changes the ARM execution mode and the locations of the vector tables associated with the various modes. In my own case (a bare-metal code snippet executed at Privilege Level 1 and launched by the BBB's U-Boot), the active vector table is at address 0x9f74b000.
In general, you might use something like the following function to locate the active vector table:
static inline unsigned int *get_vectors_address(void)
{
    unsigned int v;

    /* read SCTLR */
    __asm__ __volatile__("mrc p15, 0, %0, c1, c0, 0\n"
                         : "=r" (v) : : );
    if (v & (1<<13))
        return (unsigned int *) 0xffff0000;
    /* read VBAR */
    __asm__ __volatile__("mrc p15, 0, %0, c12, c0, 0\n"
                         : "=r" (v) : : );
    return (unsigned int *) v;
}
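The decision this function makes can also be exercised off-target by separating the register reads from the logic. In the sketch below, vector_base_from is a hypothetical helper and the SCTLR/VBAR values passed to it are made-up inputs; on hardware they would come from the two mrc reads. It assumes a core that implements VBAR (the Security Extensions, as on the Cortex-A8).

```c
#include <stdint.h>

#define SCTLR_V (1u << 13)  /* SCTLR.V: 1 selects the high vectors */

/* Hypothetical helper: pick the vector base from already-read register values. */
static uint32_t vector_base_from(uint32_t sctlr, uint32_t vbar)
{
    if (sctlr & SCTLR_V)
        return 0xFFFF0000u;      /* high vectors, VBAR is not consulted */
    return vbar & ~0x1Fu;        /* VBAR bits [4:0] are reserved */
}
```

For example, with SCTLR.V clear and VBAR reading 0x9f74b000, the helper returns 0x9f74b000, the address reported in this answer.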

Change
ldr r0, [lr, #-4]
bic r0, r0, #0xff000000
stmfd sp!, {lr}
bl print_uint32
ldmfd sp!, {pc}
to
stmfd sp!, {r0-r3, r12, lr}
ldr r0, [lr, #-4]
bic r0, r0, #0xff000000
bl print_uint32
ldmfd sp!, {r0-r3, r12, pc}^
PS: You neither restore SPSR into the CPSR of the interrupted task, nor preserve the registers that are not banked across the CPU mode switch, so they get scratched.
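The ldr/bic pair in the handler recovers the 24-bit immediate of the swi instruction that trapped. The same masking in C, as a host-side sketch: swi_number is a hypothetical name, and it assumes the caller was in ARM state (a Thumb svc is 16 bits wide with an 8-bit immediate, so it would need a halfword load from [lr, #-2] instead).

```c
#include <stdint.h>

/* Extract the 24-bit immediate from an ARM-state SWI/SVC instruction word. */
static uint32_t swi_number(uint32_t insn)
{
    return insn & 0x00FFFFFFu;   /* same effect as: bic r0, r0, #0xff000000 */
}
```

For the "swi #0xCC" in the question, the instruction word is 0xef0000cc and the helper yields 0xCC.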
