Confused about an ARM instruction

I can't figure out what this ARM instruction does:
strd.w r0, r1, [r2]
I know that it is a store instruction which stores something at *r2 but I'm not entirely sure what. Why are there two source registers (r0 and r1) and what does the d.w suffix mean?

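This instruction stores the 64-bit contents of two 32-bit registers into memory. The 8-byte chunk is stored starting at the address held in r2. The first four bytes come from r0, the second four bytes from r1. The d in strd stands for doubleword; the .w suffix only selects the 32-bit (wide) Thumb-2 encoding and does not change what is stored.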
Roughly equivalent C would be:
uint32_t *ptr = (uint32_t *) r2;
*(ptr) = r0;
*(ptr+1) = r1; // 'ptr+1' adds four bytes to the memory position
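To connect this to 64-bit values, here is a minimal sketch in portable C. The helper name `store_u64_as_pair` is made up for illustration. It assumes the common little-endian ARM convention where the low half of a 64-bit value sits in the first register (r0) and the high half in the second (r1).

```c
#include <stdint.h>

/* Hypothetical helper (not an ARM API): models `strd r0, r1, [r2]`
 * for a 64-bit value whose low half is in r0 and high half in r1. */
void store_u64_as_pair(uint32_t *base, uint64_t value)
{
    uint32_t r0 = (uint32_t)(value & 0xFFFFFFFFu); /* low 32 bits  */
    uint32_t r1 = (uint32_t)(value >> 32);         /* high 32 bits */

    base[0] = r0; /* stored at the address in r2     */
    base[1] = r1; /* stored at the address in r2 + 4 */
}
```

On a little-endian target, reading those eight bytes back as one `uint64_t` would give the original value.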

Related

Compare two 64 bit variables on 32 bit microcontroller

I have the following issue: I have two 64-bit variables that have to be compared as quickly as possible, but my microcontroller is only 32-bit.
My thoughts are that it is necessary to divide 64 bit variable into two 32 bit variables, like this
uint64_t var = 0xAAFFFFFFABCDELL;
hiPart = (uint32_t)((var & 0xFFFFFFFF00000000LL) >> 32);
loPart = (uint32_t)(var & 0xFFFFFFFFLL);
and then to compare the hiParts and loParts, but I am sure that this approach is slow and that there is a much better solution.

The first rule should be: write your program so that it is readable to a human.
When in doubt, don't assume anything; measure it. Let's see what godbolt gives us.
#include <stdint.h>
#include <stdbool.h>
bool foo(uint64_t a, uint64_t b) {
return a == b;
}
bool foo2(uint64_t a, uint64_t b) {
uint32_t ahiPart = (uint32_t)((a & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t aloPart = (uint32_t)(a & 0xFFFFFFFFULL);
uint32_t bhiPart = (uint32_t)((b & 0xFFFFFFFF00000000ULL) >> 32);
uint32_t bloPart = (uint32_t)(b & 0xFFFFFFFFULL);
return ahiPart == bhiPart && aloPart == bloPart;
}
foo:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
foo2:
eor r1, r1, r3
eor r0, r0, r2
orr r0, r0, r1
rsbs r1, r0, #0
adc r0, r0, r1
bx lr
As you can see, both functions compile to exactly the same assembly code. You decide which one is less error-prone and easier to read.

There was a time, some years ago, when you needed tricks to outsmart the compiler. But in 99.999% of cases the compiler will be smarter than you.
Also, your variables are unsigned, so use ULL instead of LL.

The fastest way is to let the compiler do it. Most compilers are much better than humans at micro-optimization.
uint64_t var = …, other_var = …;
if (var == other_var) …
There aren't many ways to go about it. Under the hood, the compiler will arrange to load the upper 32 bits and the lower 32 bits of each variable into registers, then compare the two registers that contain the upper 32 bits and the two registers that contain the lower 32 bits. The assembly code might look something like this:
load 32 bits from &var into r0
load 32 bits from &other_var into r1
if r0 != r1: goto different
load 32 bits from &var + 4 into r2
load 32 bits from &other_var + 4 into r3
if r2 != r3: goto different
// code for if-equal
different:
// code for if-not-equal
Here are some things the compiler knows better than you:
Which registers to use, based on the needs of the surrounding code.
Whether to reuse the same registers to compare the upper and lower parts, or to use different registers.
Whether to process one part and then the other (as above), or to load one variable then the other. The best order depends on the pressure on registers and on the memory access times and pipelining of the particular processor model.

If you work with a union you could compare Hi and Lo Part without any extra calculations:
typedef union
{
struct
{
uint32_t loPart;
uint32_t hiPart;
};
uint64_t complete;
}uint64T;
uint64T var = { .complete = 0xAAFFFFFFABCDEULL };
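A minimal sketch of how such a union could be used for the comparison follows. The function name `equal_by_parts` and the struct member name `parts` are made up for illustration (the struct is named here so the code also compiles as C99). Note that `loPart` only holds the low half on a little-endian target; the equality test itself works either way because both halves are compared.

```c
#include <stdbool.h>
#include <stdint.h>

typedef union {
    struct {
        uint32_t loPart; /* low half on little-endian targets only */
        uint32_t hiPart;
    } parts;
    uint64_t complete;
} uint64T;

/* Hypothetical helper: compares two 64-bit values half by half. */
bool equal_by_parts(uint64_t a, uint64_t b)
{
    uint64T ua = { .complete = a };
    uint64T ub = { .complete = b };

    return ua.parts.hiPart == ub.parts.hiPart
        && ua.parts.loPart == ub.parts.loPart;
}
```

As the other answers point out, a plain `a == b` is likely to compile to the same instructions, so this is mainly useful when the halves are needed separately anyway.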

Is the Raspberry Pi 3 memory byte addressable or word addressable, and if it is word addressable, what is the word size?

I am new to the Raspberry Pi 3 and have a question about its memory architecture:
Is the Raspberry Pi 3's memory byte addressable or word addressable, and if it is word addressable, what is the word size in bytes?

The Raspberry Pi has nothing to do with this unless you are talking about the GPU; it is just another ARM. How many word-addressable instruction sets do you know of among mainstream processors like ARM, that is, processors capable of running an operating system (Linux, BSD, etc.)?
ARM's definition of the size of a word is in the ARM documentation.
void fun ( unsigned char *p, unsigned char x, unsigned int z )
{
unsigned int ra;
for(ra=0;ra<z;ra++) p[ra]=x;
}
00000000 <fun>:
0: e3520000 cmp r2, #0
4: 012fff1e bxeq lr
8: e0802002 add r2, r0, r2
c: e4c01001 strb r1, [r0], #1
10: e1500002 cmp r0, r2
14: 1afffffc bne c <fun+0xc>
18: e12fff1e bx lr
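And there is your answer: is the compiler using word addressing to fill the array, or byte addressing? The loop stores with strb and advances the address by 1 on each iteration, so every byte has its own address.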
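The same point can be checked from C without reading disassembly. This is a small sketch with made-up function names; it assumes a mainstream flat-address platform where converting a pointer to `uintptr_t` yields the numeric address.

```c
#include <stddef.h>
#include <stdint.h>

/* Numeric address distance between two adjacent unsigned char elements. */
size_t byte_step_u8(void)
{
    unsigned char a[2];
    return (size_t)((uintptr_t)&a[1] - (uintptr_t)&a[0]);
}

/* Numeric address distance between two adjacent 32-bit elements. */
size_t byte_step_u32(void)
{
    uint32_t a[2];
    return (size_t)((uintptr_t)&a[1] - (uintptr_t)&a[0]);
}
```

On a byte-addressable machine the first distance is 1 and the second is 4: addresses count bytes, and a 32-bit word spans four of them.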

Debugging Hard Fault on ARM Cortex-M0+ (using CMSIS DSP library)

I'm using the CMSIS DSP library on a Cortex-M0+.
Some functions, such as sqrt and FFT, are resulting in hard faults.
The arm_sqrt_f32 function calls sqrtf:
arm_sqrt_f32(
float32_t in,
float32_t * pOut)
[...]
*pOut = sqrtf(in);
part of the generated code:
0x00003914: bl 0x49e8 <sqrtf>
0x00003918: adds r2, r0, #0
0x0000391a: ldr r3, [r7, #0]
0x0000391c: str r2, [r3, #0]
The hard fault happens on the str instruction at address 0x0000391c. When at this line, the registers are:
$r1 0x0
$r2 0x40000000
$r3 0x0
$r4 0x0
$r5 0x200017fc
$r6 0x0
$r7 0x200017e0
$r8 0xfff7ffff
$r9 0xefbffffe
$r10 0xff7fffff
$r11 0x0
$r12 0x0
the SP register is 0x200017e0, an address containing 0.
I can't figure out why I'm getting this hard fault. What should I do?
Thanks!

Let's look at exactly what your str instruction is doing.
Your instruction is str r2, [r3, #0], which translates to (if I'm not mistaken):
store r2 at the address in r3, offset by #0
Looking at those register values, you are trying to put 0x40000000 into location 0x0 offset by 0, so still 0x0. It is the equivalent of a segmentation fault: you are trying to write to memory that is not available to you, which causes the hard fault.
Seeing how that code is generated, I'm assuming you are giving it a faulty pOut pointer.
Make sure you aren't passing an uninitialized or null pointer. Rather than declaring float32_t *pOut and passing it as is, declare a variable such as float32_t out; and call arm_sqrt_f32(foo, &out), since the pointer argument expects the address of valid, writable memory.
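The calling pattern can be sketched with a stand-in function. `halve_f32` below is hypothetical and only mirrors the parameter shape of arm_sqrt_f32 (an input value plus a pointer that receives the result); it halves the input so the sketch needs no library. The null check is an addition for illustration, not something the CMSIS function is known to do.

```c
#include <stddef.h>

/* Hypothetical stand-in with the same parameter shape as arm_sqrt_f32. */
int halve_f32(float in, float *pOut)
{
    if (pOut == NULL) {
        return -1;     /* refuse to write through a null pointer */
    }
    *pOut = in * 0.5f; /* same kind of store that faulted in the question */
    return 0;
}
```

A correct call passes the address of an existing variable, for example `float out; halve_f32(4.0f, &out);`. Passing a pointer that was never set to point at valid memory reproduces the situation in the question, where r3 held 0x0.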

If the Cortex-M0 fault mechanism is the same as the Cortex-M3/4/7 fault mechanism, then the following page provides detailed information on how to decode the fault stack, giving you the address of the faulting instruction, as well as the register values at the time.
http://www.freertos.org/Debugging-Hard-Faults-On-Cortex-M-Microcontrollers.html

ARM printing elements of an Array leads to Seg Fault

I am trying to print all the elements of an array using the write() system call. I haven't used write a whole lot before but from what I understand, the parameters I am passing to write seem to be correct. I know that write takes:
The file descriptor of the file (1 means standard output).
The buffer from where data is to be written into the file.
The number of bytes to be read from the buffer.
Here is my code:
mov r3, #0 /*Offset (also acts as LVC)*/
print: mov r0, #1 /*Indicates standard output*/
ldr r4, =array /*Set r4 to the address of array*/
ldr r5, [r3,r4] /*Add offset to array address*/
ldr r1, [r6] /*Element of array to write*/
mov r2, #1 /*Write 1 byte*/
bl write
add r3, r3, #1 /*Increase offset each iteration*/
cmp r3, #41
blt print
Does this look correct? Is it likely that my problem is elsewhere in my program?

No. You want to pass the address where the data to write are in r1, not the value itself.
Therefore r1 should be set to just <address-of-array> + <index>, i.e.:
ldr r4, =array /*Set r4 to the address of array*/
add r1, r3, r4 /*Add offset to point to array item */
It crashed for you because you tried to read from memory at an invalid address: the value of the array item. You were reading a word (ldr r5, [r3,r4]), not a byte, from the array at index r3, then trying to read another word (not a byte) from that address.
It is not relevant in this case, but just for reference, you would use ldrb to read a single byte.
Also, the address above might be invalid in two ways: it may fall outside any mapped region, or it may be improperly aligned. In its strict-alignment configurations, the ARM architecture disallows reading a word, i.e. a 32-bit value, from an address that is not aligned to 32 bits (4 bytes). For r3 == 1 in the second iteration, this would apply (assuming the array starts on a 32-bit boundary).
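For comparison, the same loop can be sketched in C against the POSIX `write` call. The function name `write_bytes` is made up for illustration; the key line is the one that passes the address of the element rather than its value.

```c
#include <stddef.h>
#include <unistd.h>

/* Writes n bytes from buf to fd one byte at a time, like the loop in the
 * question. Returns the number of bytes written, or -1 on error. */
long write_bytes(int fd, const char *buf, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Pass the ADDRESS of element i (array + index), not its value. */
        if (write(fd, &buf[i], 1) != 1) {
            return -1;
        }
    }
    return (long)n;
}
```

Calling `write_bytes(1, array, 41)` would correspond to the loop in the question, with 1 as the file descriptor for standard output.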

Aligning a Stack pointer 8 byte from 4 byte in ARM assembly

How do I align a stack pointer to 8 bytes when it is currently 4-byte aligned on ARM? As I understand it, a stack pointer is 4-byte aligned if it points to an address like 0x4, 0x8, 0xC, 0x10 and so on.
So aligning a stack pointer to 8 bytes means it should point to addresses like 0x8, 0x10, 0x18, 0x20 and so on.
Now, how do I turn a 4-byte aligned stack pointer into an 8-byte aligned one?

Don't try to align sp manually; instead, push one more register to get alignment. For example, instead of
push {r3, r4, lr}
add one more register to the list to get 8-byte alignment easily:
push {r1, r3, r4, lr}
This may feel like an extra memory access, but in general caches work with wider bit vectors than the native word size.
Also note that you don't need to get stack alignment right if you are not making or receiving external calls. If you have a self-contained assembly routine that neither calls out to external code nor is called back from it, you can live with a misaligned stack as long as it doesn't affect your own loads.

To move a pointer up to the nearest 8 byte boundary, but leave it unmodified if it's already a multiple of 8 (pseudo-code - you'll need to add some casts if doing this in C):
p = (p + 7) & ~7;
or similarly to move it down to the nearest 8 byte boundary:
p = p & ~7;
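With the casts filled in, the two expressions can be sketched in C as follows. The function names are made up for illustration, and `uintptr_t` is used so the arithmetic is done on an integer rather than on a pointer.

```c
#include <stdint.h>

/* Round up to the next multiple of 8; unchanged if already a multiple. */
uintptr_t align_up_8(uintptr_t p)
{
    return (p + 7u) & ~(uintptr_t)7u;
}

/* Round down to the previous multiple of 8; unchanged if already one. */
uintptr_t align_down_8(uintptr_t p)
{
    return p & ~(uintptr_t)7u;
}
```

For a pointer, this would be used as `(void *)align_up_8((uintptr_t)ptr)`. For a downward-growing stack, rounding down is the direction that stays inside memory the stack already owns.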

Because the stack grows downward,
bic sp, sp, #7
should suffice. With EABI, you can use r12 or r0-r3 to (re)store the previous value.
All this should be done only in assembly; within C you can rely on a correctly aligned stack pointer, and trying to change it there will probably crash your program.
Compilers take care of correct alignment; misaligned stacks can occur on interrupt entry. Some CPUs (e.g. Cortex-M3) have a configuration bit (STKALIGN) that makes the core enter IRQs with 8-byte stack alignment.

If you are writing leaf functions (no subroutine calls), don't bother.
You are perfectly fine with a 4-byte aligned SP, since the 8-byte requirement exists because of the ldrd and strd instructions, which need the address to be a multiple of eight.
Therefore, if the function you are writing doesn't call any subroutine unknown to you, there really is no need for it. (ldrd and strd are rarely used anyway.)
The SP is already 8-byte aligned when your function is called from a higher-level language anyway.
If you want the SP to stay 8-byte aligned, either don't touch it or preserve only an even number of registers.

If you are in a situation where you don't control the value of SP that you got, and you want to align SP on an 8-byte boundary (for example, to call a subroutine), then the following sequence does that without using any other registers:
; Check if SP is aligned on 8 bytes boundary.
tst sp, #0x7
; If SP is aligned on 8 bytes boundary, then we skip a word on the stack
; and then save SP. This consumes 8 bytes on the stack but keeps SP
; aligned on 8 bytes boundary.
streq sp, [sp, #-8]!
; If SP is aligned on 4 bytes boundary, then we save SP. This consumes
; 4 bytes on the stack and also aligns SP on 8 bytes boundary.
strne sp, [sp, #-4]!
; Here, SP is aligned on 8 bytes boundary, and the previous value of SP
; is stored on the top of the stack.
; For example, let's call some subroutine...
blx lr
; In order to restore the original value of SP, just load the value
; at the top of the stack.
ldr sp, [sp]
Notice that the code above assumes that:
You're running in a 32-bit ARM mode (e.g., ARMv5, ARMv6, ARMv7, AArch32, ...).
SP is aligned on at least a 4-byte boundary, which is usually the case as the stack is viewed as an array of words.
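The address arithmetic of the streq/strne pair can be modelled in a few lines of C. This is only a sketch of the arithmetic under the two assumptions listed above, with a made-up function name; it does not model the store itself or any processor behaviour.

```c
#include <stdint.h>

/* Hypothetical model: returns the value SP would have after the
 * conditional store, given a 4-byte aligned input SP. */
uint32_t sp_after_save(uint32_t sp)
{
    if ((sp & 0x7u) == 0) {
        return sp - 8u; /* streq sp, [sp, #-8]! : SP was 8-byte aligned */
    }
    return sp - 4u;     /* strne sp, [sp, #-4]! : SP was only 4-byte aligned */
}
```

Either way the result is a multiple of 8, and the word at the new top of the stack is where the sequence keeps the previous SP.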