Yes, my specific MCU does have an FPU.
The code is compiled with the -mfloat-abi=soft flag; otherwise the float variable never gets passed in R0.
The FPU gets enabled via SCB->CPACR |= ((3UL << (10 * 2)) | (3UL << (11 * 2)));
The assembly function:
sqrt_func:
VLDR.32 S0, [R0] <-- hardfault
VSQRT.F32 S0, S0
VSTR.32 S0, [R0]
BX LR
The C code calling said function:
extern float sqrt_func(float s);
float x = sqrt_func(1000000.0f);
But after stepping through, the MCU hard faults at VLDR.32 S0, [R0] with the CFSR showing
CFSR
->BFARVALID
->PRECISERR
I can see that the float is being passed correctly, because R0 holds its hex value at the moment of the hard fault:
R0
->0x49742400
S0 never gets loaded with anything.
I can't figure out why this is hard faulting, anyone have any ideas? I am trying to manually calculate the square root using the FPU.
Also what's weird is d13-d15 and s0-s31 registers are showing "0xx-2" but that's probably a quirk of the debugger not being able to pull the registers once it hardfaults.
OK, I'm just a dumbo: I thought VLDR and VSTR operated differently for some reason, but they address memory exactly like LDR and STR. The float's value was being passed in R0, but VLDR was trying to load from the address held in R0 (0x49742400, which is my float's bit pattern in hex). That is evidently not a valid address, which explains the precise bus fault (PRECISERR with BFARVALID set).
Instead, you have to use VMOV.32 to copy the register contents over:
sqrt_func:
VMOV.32 S0, R0
VSQRT.F32 S0, S0
VMOV.32 R0, S0
BX LR
And now it works.
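To see where the 0x49742400 in R0 comes from: with -mfloat-abi=soft the float argument arrives as its raw bit pattern in a core register, and VMOV just copies those bits into S0. Here is a minimal host-side sketch of that; the helper name float_bits is made up for illustration, and it assumes float is a 32-bit IEEE-754 type.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper (not from any library): return the raw bits of a
   float, i.e. the value a soft-float caller would place in R0.
   Assumes float is a 32-bit IEEE-754 type. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* guard the 32-bit assumption */
    memcpy(&bits, &f, sizeof bits);    /* well-defined way to reinterpret */
    return bits;
}
```

Calling float_bits(1000000.0f) gives 0x49742400, the same value the debugger showed in R0, so VLDR was using the float's bits as a memory address.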
I don't know if this question is vague or lacks enough information, but I was wondering: if I want to convert the line a = a * b * c, written in C, to LC-3, how can I do it? Assume that a, b and c are local variables and that the offset of a is 0, b is -1, and c is -2.
I know that I can start off like this:
LDR R0, R5, #0; to load A
LDR R1, R5, #-1; to load B
Is there a limitation to the registers I can use? Can I use R2 to load C?
Edit:
LDR R0, R5, #0; LOAD A
LDR R1, R5, #-1; LOAD B
LDR R2, R5, #-2; LOAD C
AND R3, R3, #0; Sum = 0
LOOP ADD R3, R3, R1; Sum = sum + B
ADD R0, R0, #-1; A = A-1
STR R0, R5, #0; SAVE TO A (a = a*b)
BRp LOOP
ADD R4, R4, R2; Sum1 = sum1 + C
ADD R2, R2, #-1; C = C-1
BRp LOOP
STR R0, R5, #0; SAVE TO A (a = a*c = a*b*c)
If you're writing a whole program, which is often the case with LC-3, the only physical limit is the instruction set. LC-3 has eight general-purpose registers, R0 through R7, and within that limit you can use them as you like, so yes, R2 is fine for loading C.
Your coursework/assignment may impose some environmental requirements, such as local variables and parameters being accessible from a frame pointer, e.g. R5, and having the stack pointer in R6. If so, then those should probably be left alone, but in a pinch you could save them, and later restore them.
If you're writing just a function that is going to be called, then you'll need to follow the calling convention. Decode the signature of the function you're implementing according to the parameter-passing approach. If you want to use R7 (e.g. as a scratch register, or because you want to call another function), be aware that on entry to your function it holds the return address, which is needed to return to the caller; you can save it on the stack or in global storage for later retrieval.
The calling convention in use should also tell you which registers are call-clobbered vs. call-preserved. Within a function, call-clobbered registers can be used without fuss, but call-preserved registers must be saved before use and restored to their original values before returning to the function's caller.
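As for the multiplication itself: LC-3 has no multiply instruction, so a = a * b * c is usually built from repeated addition, with one loop per multiplication and a separate label and counter for each loop. A rough C sketch of that loop structure follows. The function names are made up, the register comments are only one possible mapping, and it assumes non-negative operands (a BRp-style loop can't count a negative value down).

```c
#include <assert.h>

/* Multiply by repeated addition: add `value` to a running sum `count` times.
   Assumes count >= 0. */
int mul_by_adding(int value, int count)
{
    int sum = 0;              /* like AND R3, R3, #0 */
    assert(count >= 0);
    while (count > 0) {       /* skip the loop entirely when count is 0 */
        sum += value;         /* like ADD R3, R3, R1 */
        count -= 1;           /* like ADD R0, R0, #-1 followed by BRp */
    }
    return sum;
}

/* a * b * c as two separate loops; the result is stored to a only once,
   after both loops have finished. */
int mul_abc(int a, int b, int c)
{
    int ab = mul_by_adding(b, a);   /* first loop:  a * b       */
    return mul_by_adding(ab, c);    /* second loop: (a * b) * c */
}
```

The point to carry over to the assembly is that the counter check happens before the first addition, so a zero operand gives zero rather than running the loop body once.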
I have a project on an ARMv5TE platform, and I have to rewrite some functions in assembly to use the enhanced DSP instructions.
I use a lot of int64_t accumulators, but I have no idea how to pass one to the ARM instruction SMULL (http://www.keil.com/support/man/docs/armasm/armasm_dom1361289902800.htm).
How can I pass the lower or upper 32 bits of a 64-bit variable to a 32-bit register? (I know that I can use an intermediate int32_t variable, but that does not look good.)
I know that the compiler would do this for me; I just wrote this small function as an example.
int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
int64_t tmp_acc;
asm("SMULL %0, %1, %2, %3"
: "=r"(tmp_acc), "=r"(tmp_acc) // no idea how to pass tmp_acc;
: "r"(x), "r"(y)
);
return tmp_acc + acc;
}
You don't need and shouldn't use inline asm for this. The compiler can do even better than smull, and use smlal to multiply-accumulate with one instruction:
int64_t accum(int64_t acc, int32_t x, int32_t y) {
return acc + x * (int64_t)y;
}
which compiles (with gcc8.2 -O3 -mcpu=arm10e on the Godbolt compiler explorer) to this asm: (ARM10E is an ARMv5 microarchitecture I picked from Wikipedia's list)
accum:
smlal r0, r1, r3, r2 #, y, x
bx lr #
As a bonus, this pure C also compiles efficiently for AArch64.
https://gcc.gnu.org/wiki/DontUseInlineAsm
If you insist on shooting yourself in the foot and using inline asm:
Or in the general case with other instructions, there might be a case where you'd want this.
First, beware that smull output registers aren't allowed to overlap the first input register, so you have to tell the compiler about this. An early-clobber constraint on the output operand(s) will do the trick of telling the compiler it can't have inputs in those registers. I don't see a clean way to tell the compiler that the 2nd input can be in the same register as an output.
This restriction is lifted in ARMv6 and later (see this Keil documentation) "Rn must be different from RdLo and RdHi in architectures before ARMv6", but for ARMv5 compatibility you need to make sure the compiler doesn't violate this when filling in your inline-asm template.
Optimizing compilers can optimize away a shift/OR that combines 32-bit C variables into a 64-bit C variable, when targeting a 32-bit platform. They already store 64-bit variables as a pair of registers, and in normal cases can figure out there's no actual work to be done in the asm.
So you can specify a 64-bit input or output as a pair of 32-bit variables.
#include <stdint.h>
int64_t testFunc(int64_t acc, int32_t x, int32_t y)
{
uint32_t prod_lo, prod_hi;
asm("SMULL %0, %1, %2, %3"
: "=&r" (prod_lo), "=&r"(prod_hi) // early clobber for pre-ARMv6
: "r"(x), "r"(y)
);
int64_t prod = ((int64_t)prod_hi) << 32;
prod |= prod_lo; // + here won't optimize away, but | does, with gcc
return acc + prod;
}
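For checking the split/recombine logic on a host without any inline asm, here is a plain C model of what SMULL produces and how the two halves go back together. The function names are made up for illustration. Unlike the code above, it shifts as unsigned and converts at the end, which sidesteps overflowing a signed left shift; the final conversion to int64_t is implementation-defined for large values but wraps as expected with gcc and clang.

```c
#include <stdint.h>

/* Low 32 bits of the signed 32x32 -> 64 product (what SMULL puts in RdLo). */
uint32_t smull_lo(int32_t x, int32_t y)
{
    return (uint32_t)((uint64_t)((int64_t)x * (int64_t)y));
}

/* High 32 bits of the same product (what SMULL puts in RdHi). */
uint32_t smull_hi(int32_t x, int32_t y)
{
    return (uint32_t)((uint64_t)((int64_t)x * (int64_t)y) >> 32);
}

/* Recombine the halves into one 64-bit value. */
int64_t combine_halves(uint32_t lo, uint32_t hi)
{
    return (int64_t)(((uint64_t)hi << 32) | lo);
}
```

For example, -3 * 5 splits into hi = 0xFFFFFFFF and lo = 0xFFFFFFF1, and combine_halves turns that back into -15.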
Unfortunately the early-clobber means we need 6 total registers, but the ARM calling convention only has 6 call-clobbered registers (r0..r3, lr, and ip (aka r12)). And one of them is LR, which has the return address so we can't lose its value. Probably not a big deal when inlined into a regular function that already saves/restores several registers.
Again from Godbolt:
# gcc -O3 output with early-clobber, valid even before ARMv6
testFunc:
str lr, [sp, #-4]! #, Save return address (link register)
SMULL ip, lr, r2, r3 # prod_lo, prod_hi, x, y
adds r0, ip, r0 #, prod, acc
adc r1, lr, r1 #, prod, acc
ldr pc, [sp], #4 # return by popping the return address into PC
# gcc -O3 output without early-clobber (&) on output constraints:
# valid only for ARMv6 and later
testFunc:
SMULL r3, r2, r2, r3 # prod_lo, prod_hi, x, y
adds r0, r3, r0 #, prod, acc
adc r1, r2, r1 #, prod, acc
bx lr #
Or you can use a "=r"(prod64) constraint and use modifiers to select which half of %0 you get. Unfortunately, gcc and clang emit less efficient asm for some reason, saving more registers (and maintaining 8-byte stack alignment). 2 instead of 1 for gcc, 4 instead of 2 for clang.
// using an int64_t directly with inline asm, using the %Q0 and %R0 operand modifiers
// Q is the low half, R is the high half.
int64_t testFunc2(int64_t acc, int32_t x, int32_t y)
{
int64_t prod; // gcc and clang seem to want more free registers this way
asm("SMULL %Q0, %R0, %1, %2"
: "=&r" (prod) // early clobber for pre-ARMv6
: "r"(x), "r"(y)
);
return acc + prod;
}
again compiled with gcc -O3 -mcpu=arm10e. (clang saves/restores 4 registers)
# gcc -O3 with the early-clobber so it's safe on ARMv5
testFunc2:
push {r4, r5} #
SMULL r4, r5, r2, r3 # prod, x, y
adds r0, r4, r0 #, prod, acc
adc r1, r5, r1 #, prod, acc
pop {r4, r5} #
bx lr #
So for some reason it seems to be more efficient to manually handle the halves of a 64-bit integer with current gcc and clang. This is obviously a missed optimization bug.
I'd like to run a simple inline assembly experiment in uVision on an STM32F, using the code below.
But I get error messages when I compile it.
unsigned int bar(unsigned int r0)
{
unsigned int r1;
unsigned int r4 = 1234;
__asm
{
MOVS r0,#0
LDR r1,[r0] ; Get initial MSP value
MOV SP, r1
LDR r1,[r0, #4] ; Get initial PC value
BX r1
}
return(r1);
}
I get the following error messages when I compile it:
*** Using Compiler 'V5.06 update 5 (build 528)', folder: 'C:\Keil_v5\ARM\ARMCC\Bin'
Build target 'STM32F429_439xx'
compiling main.c...
../main.c(79): error: #3061: unrecognized instruction opcode
LDR r1,[r0] ; Get initial MSP value
../main.c(80): error: #20: identifier "SP" is undefined
MOV SP, r1
../main.c(81): error: #3061: unrecognized instruction opcode
LDR r1,[r0, #4] ; Get initial PC value
../main.c(82): error: #1084: This instruction not permitted in inline assembler
BX r1
../main.c(71): warning: #177-D: variable "r4" was declared but never referenced
unsigned int r4 = 1234;
../main.c(82): error: #114: label "r1" was referenced but not defined
BX r1
../main.c: 1 warning, 5 errors
"STM32F429_439xx\STM32F429_439xx.axf" - 5 Error(s), 1 Warning(s).
Target not created.
Build Time Elapsed: 00:00:01
What am I supposed to do to resolve this problem?
I have a nagging feeling that what you're trying to do here isn't complete, and causing a soft reset might be better. However:
http://www.keil.com/support/man/docs/armcc/armcc_chr1359124249383.htm
The inline assembler provides no direct access to the physical
registers of an ARM processor. If an ARM register name is used as an
operand in an inline assembler instruction it becomes a reference to a
variable of the same name, and not the physical ARM register.
...
No variables are declared for the sp (r13), lr (r14), and pc (r15)
registers, and they cannot be read or directly modified in inline
assembly code.
However, CMSIS provides the following:
https://www.keil.com/pack/doc/CMSIS/Core/html/group__Core__Register__gr.html#gab898559392ba027814e5bbb5a98b38d2
__STATIC_INLINE uint32_t __get_MSP(void)
{
register uint32_t __regMainStackPointer __ASM("msp");
return(__regMainStackPointer);
}
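The two loads in the question read the first two words of the vector table: the initial MSP at offset 0 and the initial PC at offset 4. A minimal sketch of that part in plain C is below; it is only an illustration, the type and function names are made up, and the addresses in the usage note are invented example values.

```c
#include <stdint.h>

/* First two words of a Cortex-M vector table. */
typedef struct {
    uint32_t initial_msp;    /* word at offset 0: initial main stack pointer */
    uint32_t reset_handler;  /* word at offset 4: initial PC (reset vector)  */
} boot_vectors_t;

/* Hypothetical helper: read both words from a vector table image. */
boot_vectors_t read_boot_vectors(const uint32_t *table)
{
    boot_vectors_t v;
    v.initial_msp   = table[0];
    v.reset_handler = table[1];
    return v;
}

/* On the target, the remaining steps would be roughly (untested sketch,
 * relies on the CMSIS __set_MSP accessor instead of "MOV SP, r1"):
 *     __set_MSP(v.initial_msp);
 *     ((void (*)(void))v.reset_handler)();
 */
```

With a table image whose first two words are 0x20030000 and 0x08000199, the helper returns those as initial_msp and reset_handler respectively.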
Here is my assembly code for an A9:
ldr x1, = 0x400020 // Const value may be address also
ldr w0, = 0x200018 // Const value may be address also
str w0, [x1]
Is the line below the expected C equivalent?
*((u32 *)0x400020) = 0x200018;
When I cross-checked this with the compiler, it gave a different result, using mov and movs instead of ldr. How do I create an ldr in C?
When I cross-checked this with the compiler, it gave a different result, using mov and movs
It sounds to me like you compiled the C code with a compiler targeting AArch32, but the assembly code you've shown looks like it was written for AArch64.
Here's what I get when I compile with ARM64 GCC 5.4 and optimization level O3 (comments added by me):
mov x0, 32 # x0 = 0x20
mov w1, 24 # w1 = 0x18
movk x0, 0x40, lsl 16 # x0[31:16] = 0x40
movk w1, 0x20, lsl 16 # w1[31:16] = 0x20
str w1, [x0]
How do I create an ldr in C?
I can't see any good reason why you'd want the compiler to generate an LDR in this case.
LDR reg,=value is a pseudo-instruction that allows you to load immediates that cannot be encoded directly in the instruction word. The assembler achieves this by placing the value (e.g. 0x200018) in a literal pool, and then replacing ldr w0, =0x200018 with a PC-relative load from that literal pool (i.e. something like ldr w0,[pc,#offset_to_value]). Accessing memory is slow, so the compiler generated another sequence of instructions for you that achieves the same thing in a more efficient manner.
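To make the mov/movk sequence concrete, here is a small C model of what movk does to a register: it replaces one 16-bit field and leaves the rest alone. This is only an illustrative sketch; the function name and signature are mine, and it models the arithmetic, not the instruction encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Model of MOVK: overwrite the 16-bit field at `shift` with imm16 and
   keep all other bits of reg unchanged. */
uint64_t movk(uint64_t reg, uint16_t imm16, unsigned shift)
{
    uint64_t mask;
    assert(shift == 0 || shift == 16 || shift == 32 || shift == 48);
    mask = (uint64_t)0xFFFF << shift;
    return (reg & ~mask) | ((uint64_t)imm16 << shift);
}
```

Starting from mov x0, 32, movk(32, 0x40, 16) yields 0x400020, and starting from mov w1, 24, movk(24, 0x20, 16) yields 0x200018, so two register-only instructions build each constant without touching memory.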
Pseudo-instructions are mainly a convenience for humans writing assembly code, making the code easier for them or their colleagues to read/write/maintain. Unlike a human being, a compiler doesn't get fatigued by repeating the same task over and over, and therefore doesn't have as much need for conveniences like that.
TL;DR: The compiler will generate what it thinks is the best (according to the current optimization level) instruction sequence. Also, that particular form of LDR is a pseudo-instruction, so you might not be able to get a compiler to generate it even if you disable all optimizations.
Environment: GCC 4.7.3 (arm-none-eabi-gcc) for an ARM Cortex-M4F. Bare-metal (actually MQX RTOS, but that's irrelevant here). The CPU is in Thumb state.
Here's a disassembler listing of some code I'm looking at:
//.label flash_command
// ...
while(!(FTFE_FSTAT & FTFE_FSTAT_CCIF_MASK)) {}
// Compiles to:
12: bf00 nop
14: f04f 0300 mov.w r3, #0
18: f2c4 0302 movt r3, #16386 ; 0x4002
1c: 781b ldrb r3, [r3, #0]
1e: b2db uxtb r3, r3
20: b2db uxtb r3, r3
22: b25b sxtb r3, r3
24: 2b00 cmp r3, #0
26: daf5 bge.n 14 <flash_command+0x14>
The constants (after expanding macros, etc.) are:
address of FTFE_FSTAT is 0x40020000u
FTFE_FSTAT_CCIF_MASK is 0x80u
This is compiled with NO optimization (-O0), so GCC shouldn't be doing anything fancy... and yet, I don't get this code. Post-answer edit: Never assume this. My problem was getting a false sense of security from turning off optimization.
I've read that "uxtb r3,r3" is a common way of truncating a 32-bit value. Why would you want to truncate it twice and then sign-extend? And how in the world is this equivalent to the bit-masking operation in the C-code?
What am I missing here?
Edit: Types of the thing involved:
So the actual macro expansion of FTFE_FSTAT comes down to
((((FTFE_MemMapPtr)0x40020000u))->FSTAT)
where the struct is defined as
/** FTFE - Peripheral register structure */
typedef struct FTFE_MemMap {
uint8_t FSTAT; /**< Flash Status Register, offset: 0x0 */
uint8_t FCNFG; /**< Flash Configuration Register, offset: 0x1 */
//... a bunch of other uint_8
} volatile *FTFE_MemMapPtr;
The two uxtb instructions are the compiler being stupid; they should be optimized out if you turn on optimization. The sxtb is the compiler being brilliant, using a trick that you wouldn't expect in unoptimized code.
The first uxtb is due to the fact that you loaded a byte from memory. The compiler is zeroing the other 24 bits of register r3, so that the byte value fills the entire register.
The second uxtb is due to the fact that you're ANDing with an 8-bit value. The compiler realizes that the upper 24-bits of the result will always be zero, so it's using uxtb to clear the upper 24-bits.
Neither of the uxtb instructions does anything useful, because the sxtb instruction overwrites the upper 24 bits of r3 anyways. The optimizer should realize that and remove them when you compile with optimizations enabled.
The sxtb instruction takes the one bit you care about (0x80) and moves it into the sign bit of register r3. That way, if bit 0x80 is set, r3 becomes a negative number, so the compiler can compare with 0 to determine whether the bit was set. If the bit was not set, r3 is non-negative and the bge instruction branches back to the top of the while loop.
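The equivalence between the mask test and the sign test can be checked exhaustively on a host. Below is a small sketch; the function names are made up for illustration. Converting a uint8_t above 127 to int8_t is implementation-defined in C, but gcc and clang wrap it the same way sxtb does.

```c
#include <assert.h>
#include <stdint.h>

/* What the C source asks for: test bit 7 with a mask. */
int ccif_set_mask(uint8_t fstat)
{
    return (fstat & 0x80u) != 0;
}

/* What the generated code does: reinterpret the byte as signed (sxtb)
   and test the sign (cmp r3, #0 followed by bge). */
int ccif_set_sign(uint8_t fstat)
{
    return (int8_t)fstat < 0;
}

/* Self-check: both forms must agree for every possible byte value. */
void check_forms_agree(void)
{
    unsigned v;
    for (v = 0; v < 256; ++v)
        assert(ccif_set_mask((uint8_t)v) == ccif_set_sign((uint8_t)v));
}
```

Only the values 0x80 through 0xFF have bit 7 set, and those are exactly the bytes that become negative when viewed as a signed 8-bit value.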