ARM mode: loading a 32-bit constant, general case

I'm looking for a generic way to load a 32-bit constant in ARM mode.
Unfortunately I can use neither "ldr rX, =const" (due to external problems) nor movw/movt (my target is an ARMv6K).
This is my attempt:
mov rX, 0
orr rX, (const&0x000000FF)
orr rX, (const&0x0000FF00)
orr rX, (const&0x00FF0000)
orr rX, (const&0xFF000000)
Is my code correct? Can you suggest a better way? Thank you.

The ARM and GNU assemblers both allow the syntax:
ldr rX,=0x12345678
This allocates the data word 0x12345678 at a location within PC-relative addressing range (if possible) and encodes the instruction as a PC-relative load, basically:
ldr r0,my_data
...
my_data: .word 0x12345678
Your other alternative is one instruction less than what you outlined:
mov rX,0x00000078
orr rX,rX,0x00005600
orr rX,rX,0x00340000
orr rX,rX,0x12000000
At least with the GNU assembler (I don't know about ARM's), if you use the ldr rX,=number feature and the number can be encoded with a single mov, it will emit that single mov instead of a literal load.
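Whether a constant fits in a single mov depends on the ARM-mode immediate format: an 8-bit value rotated right by an even amount. The following Python sketch models that check and falls back to the mov/orr byte sequence shown above. The function names (rol32, encode_arm_immediate, load_constant) are made up for illustration, and the output is just text, not something a particular assembler is guaranteed to accept verbatim.

```python
def rol32(value, amount):
    """Rotate a 32-bit value left by `amount` bits."""
    amount %= 32
    value &= 0xFFFFFFFF
    return ((value << amount) | (value >> (32 - amount))) & 0xFFFFFFFF


def encode_arm_immediate(value):
    """Return (rot4, imm8) such that value == imm8 ROR (2 * rot4), else None."""
    for rot4 in range(16):
        # Undo the right rotation by rotating left by the same amount.
        imm8 = rol32(value, 2 * rot4)
        if imm8 < 0x100:
            return rot4, imm8
    return None


def load_constant(value, reg="r0"):
    """Return assembly lines (as strings) that build `value` in `reg`."""
    value &= 0xFFFFFFFF
    if encode_arm_immediate(value) is not None:
        return [f"mov {reg}, #0x{value:08x}"]
    # Otherwise split into byte-aligned chunks; each one is a valid immediate.
    chunks = [value & (0xFF << shift) for shift in (0, 8, 16, 24)]
    chunks = [c for c in chunks if c]
    lines = [f"mov {reg}, #0x{chunks[0]:08x}"]
    lines += [f"orr {reg}, {reg}, #0x{c:08x}" for c in chunks[1:]]
    return lines


if __name__ == "__main__":
    print("\n".join(load_constant(0x12345678)))
```

Zero bytes are skipped, so a constant such as 0x12005600 needs only two instructions in this model.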

Related

Encoding and decoding ARM instructions to/ from binary

Task 1: Write the corresponding ARM assembly representation for the following instructions:
11101001_000111000001000000010000
11100100_110100111000000000011001
10010010_111110100100000011111101
11100001_000000010010000011111010
00010001_101011101011011111001100
Task 2: Write the instruction code for the following instructions:
STMFA R13!, {R1, R3, R5-R11}
LDR R11, [R3, R5, LSL #2]
MOVMI R6, #1536
LDR R1, [R0, #4]!
EORS R3, R5, R10, RRX
I have zero experience with this material and the professor has left us students out to dry. I've found the various methods for decoding these instructions, but I still have three major doubts.
I don't have any idea how to get started on decoding binary to ARM instructions, which is the first part of the homework.
I can't find some of these suffixes. For example, in EORS, what is the S? Is it the set-condition bit? Is that bit set to 1 when there is an S on the instruction?
I don't know what to make of having multiple registers in one instruction line. Example:
EORS R3,R5,R10,RRx
I don't understand what's going on there with so many registers.
Any nudge in the right direction is greatly appreciated. Also I have searched the ARM manual, they're not very helpful for someone with no understanding of what they're looking for. They do have the majority of instructions for coding and decoding but have little explanation for the things I asked above.
If you have the ARMv7-A/R architecture manual (DDI0406C) there is a good table-based decode/disassembly description in chapter A5. You start at table A5.1 and, depending on the value of different bits in the instruction word, it refers you to more and more specific tables leading to the instruction.
As an example, consider the following instruction:
0001 0101 1001 1111 0000 0000 0000 1000
According to the first table it is a load/store word and unsigned byte instruction, since the condition is not 1111 and op1 is 010. The encoding of this is further expanded in A5.3.
From this section we see that A=0, op1=11001, Rn=1111 (PC), and B=0. This implies that the instruction is LDR (literal). Checking the page describing this instruction, and remembering that cond=0001 (NE), Rt=0000 and imm12=000000001000, we see that the instruction is LDRNE R0, [PC, #8].
To do the reverse procedure you look up the instruction in the alphabetical list of instructions and follow the pattern.
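Both directions can be sketched in Python for the immediate-offset load/store format used in the example. This is a minimal illustration that covers only that one format; the function names are hypothetical, and the P/U/B/W/L labels follow the traditional ARM names for bits 24 down to 20.

```python
def decode_load_store_imm(word):
    """Split a 32-bit ARM load/store (immediate offset) word into its fields."""
    return {
        "cond": (word >> 28) & 0xF,   # condition code
        "op": (word >> 25) & 0x7,     # 0b010 for this instruction class
        "P": (word >> 24) & 1,        # 1 = pre-indexed / offset addressing
        "U": (word >> 23) & 1,        # 1 = add the offset
        "B": (word >> 22) & 1,        # 1 = byte access
        "W": (word >> 21) & 1,        # 1 = write back the address
        "L": (word >> 20) & 1,        # 1 = load, 0 = store
        "Rn": (word >> 16) & 0xF,     # base register
        "Rt": (word >> 12) & 0xF,     # transfer register
        "imm12": word & 0xFFF,        # unsigned offset
    }


def encode_load_store_imm(cond, P, U, B, W, L, Rn, Rt, imm12):
    """Build the 32-bit word for the same format from its fields."""
    return ((cond << 28) | (0b010 << 25) | (P << 24) | (U << 23) | (B << 22)
            | (W << 21) | (L << 20) | (Rn << 16) | (Rt << 12) | (imm12 & 0xFFF))


if __name__ == "__main__":
    # The example word from the text: cond=NE, Rn=PC, Rt=R0, offset 8.
    print(decode_load_store_imm(0b0001_0101_1001_1111_0000_0000_0000_1000))
    # LDR R1, [R0, #4]! : always (1110), pre-indexed, add, word, write-back, load.
    print(hex(encode_load_store_imm(0xE, 1, 1, 0, 1, 1, 0, 1, 4)))
```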
Here is a walk-through using a different part of the ARM Architecture Reference Manual (not the Cortex-M ones, ARMv6-M or ARMv7-M; you want the ARMv5 one or the ARMv7-AR one). I'm going to look at a Thumb instruction, but the ARM instructions work the same way and are a couple of chapters earlier.
The chapter is titled Thumb instruction set, and shortly into it is a table that shows the Thumb instruction set encoding (or you can just search for that). One of the entries is called Add/subtract register: there are a lot of hardcoded ones and zeros up front, then bit 9 is opc, then come the rm, rn and rd bits.
ARM toolchains are easy to come by for Windows, Mac and Linux, or you can easily build one from sources (you just need binutils). Assembling this
.thumb
add r1,r2,r3
add r1,r2,r4
add r1,r2,r5
add r1,r2,r6
add r1,r2,r7
then disassembling gives
00000000 <.text>:
0: 18d1 adds r1, r2, r3
2: 1911 adds r1, r2, r4
4: 1951 adds r1, r2, r5
6: 1991 adds r1, r2, r6
8: 19d1 adds r1, r2, r7
In that chart in the ARM ARM, the add/subtract register format starts with the hardcoded bits 000110. The instructions above start with 0x18 or 0x19, which both start with those 6 bits (00011000 or 00011001). In the alphabetical list of Thumb instructions we look for the add instructions and find the three-register one. In this case it has 7 fixed bits, 0001100, which match the bits we are decoding, so we are on the right track. The last 9 bits are three sets of three.
0x1951 is 0001100101010001, or 0001100 101 010 001; the last nine bits represent r5, r2, and r1. The syntax part of the instruction description shows add rd, rn, rm but the machine code has rm, rn, rd, so we take the machine code, rearrange it per the syntax, and get add r1,r2,r5, which matches. The s here is unfortunately confusing: this Thumb instruction doesn't have an S bit (there wasn't room), so it always updates the flags. With this toolchain, and the way I used it, the assembly source needs add without the s, yet the disassembly shows adds. When using ARM instructions the letter s works as expected.
Just repeat this with the ARM instructions in either direction. The immediate values are going to be the most challenging part, although that depends on the specific instruction and its immediate encoding.
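The field split described above can be checked mechanically. This Python sketch decodes only the 16-bit Thumb three-register add (fixed bits 0001100); the function name is an illustrative assumption and anything else returns None.

```python
def decode_thumb_add_reg(halfword):
    """Decode a 16-bit Thumb 'add rd, rn, rm' encoding, or return None."""
    if (halfword >> 9) != 0b0001100:
        return None  # not the three-register add
    rm = (halfword >> 6) & 0x7  # bits 8..6
    rn = (halfword >> 3) & 0x7  # bits 5..3
    rd = halfword & 0x7         # bits 2..0
    # Machine code order is rm, rn, rd; assembly syntax order is rd, rn, rm.
    return f"adds r{rd}, r{rn}, r{rm}"


if __name__ == "__main__":
    for code in (0x18D1, 0x1911, 0x1951, 0x1991, 0x19D1):
        print(f"{code:04x} {decode_thumb_add_reg(code)}")
```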

Signed Halfword division by constant in ARM

I was trying to divide a result by 9 in ARM, much as in the question "ARM DIVISION HOW TO DO IT?", except for a couple of things:
I'm trying to divide a 16-bit number (halfword)
It is signed
I have the following implementation at the moment to divide [r8] and place the result into [r1], but the result differs from the C++ implementation when the 16th bit is set and works otherwise:
LDR r7, =0x1C72 ; 2**16 *(1/9) +1
MUL r9, r8, r7
LSR r9, #16
STRH r9, [r1], #2
Please let me know if you understand why. (PS: I also tried with SMULBB, but it wasn't any better.)
Not sure if anyone cares, but I have found a sort of solution. After looking at the results, I noticed that when the 16th bit is set, ARM division with my technique yielded a number one less than C++.
Hence the modification which makes it work:
TST r8, #32768
SMULBB r8, r8, r7
ASR r8, #16
ADDNE r8, #1
The other problem I have now is that the division occurs after nine additions. When the result of those additions is outside the halfword range, C++ still outputs the correct result, whereas the ARM result seems to get saturated in a way.
I'm going to have to modify the code to widen the halfwords to full words, and hence will have to change the multiplication to 32 bit.
The code above should work as long as your starting value is in the signed halfword range.
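The claim about the signed halfword range can be checked exhaustively with a Python model of the corrected sequence. This is a sketch under stated assumptions: Python's >> on a negative integer floors, which is taken here to stand in for ASR, and c_div9 models the truncate-toward-zero division that C++ performs. Both function names are illustrative.

```python
RECIPROCAL = 0x1C72  # 2**16 // 9 + 1, the constant loaded into r7


def div9_halfword(x):
    """Model of SMULBB / ASR #16 / conditional ADD for a signed 16-bit x."""
    assert -32768 <= x <= 32767, "input must fit in a signed halfword"
    q = (x * RECIPROCAL) >> 16  # signed multiply, then arithmetic shift
    if x < 0:                   # TST r8, #32768 ... ADDNE r8, #1
        q += 1
    return q


def c_div9(x):
    """Division by 9 that truncates toward zero, as in C and C++."""
    return -((-x) // 9) if x < 0 else x // 9


if __name__ == "__main__":
    mismatches = [x for x in range(-32768, 32768) if div9_halfword(x) != c_div9(x)]
    print("mismatches:", len(mismatches))
```

Dropping the `if x < 0` correction in the model reproduces the off-by-one on negative inputs described above.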

ARM Instruction Set - Changing the CPSR (S bit)

I was wondering why ARM instructions do not set the CPSR by default (like x86), so that the S bit must be used in these cases. Do instructions that don't change the CPSR offer better performance? For example, does an ADD instruction offer better performance than ADDS? Or what is the real deal?
It is for performance, or perhaps was. If you always change the flags, then you have a hard time using one flag result across multiple instructions without a branch, which messes with your pipeline.
if(a==0)
{
b=b+1;
c=0;
}
else
{
b=0;
c=c+1;
}
traditionally you have to literally implement that with branches (pseudocode not real asm)
cmp a,0
bne notzero
add b,b,1
mov c,0
b waszero
notzero:
mov b,0
add c,c,1
waszero:
so you suffer a branch no matter what
but with conditional execution
cmp a,0
addeq b,b,1
moveq c,0
addne c,c,1
movne b,0
With no branches you simply rip through the code. The only way this can work is if 1) you have an option per instruction to conditionally execute based on the flags and 2) instructions that modify the flags have an option not to modify them.
Depending on the processor family/architecture, the add and maybe even the mov will modify the flags, so you have to have both the conditional execution AND the option not to set flags. That is why ARM has an adds and an add.
I think they got rid of all that with the 64-bit architecture, so perhaps, as interesting and cool as it was, it wasn't used enough or wasn't worth it, or they just needed those four bits to keep all/some instructions to 32 bits.
I was wondering why ARM instructions do not set the CPSR by default (like x86), so that the S bit must be used in these cases.
It is a choice and it depends on context. The extra flexibility is only limited by a programmer's imagination.
Do instructions that don't change the CPSR offer better performance? For example, does an ADD instruction offer better performance than ADDS?
Most likely never (see Note 1). That is, an instruction that doesn't set the CPSR does not execute faster (in fewer clocks) on the majority of ARM CPUs and instructions.
Or what is the real deal?
Consider some 'C' code,
int i, sum;
char *p = array; /* passed in */
for(i = 0, sum = 0; i < 10 ; i++)
sum += array[i];
return sum;
This can translate to,
mov r2, r0 ; get "array" to R2
mov r1, #10 ; counter (reverse direction)
mov r0, #0 ; sum = 0
1:
subs r1, r1, #1 ; set conditions
ldrb r3, [r2], #1 ; load array[i], post-increment the pointer
add r0, r0, r3 ; does not affect conditions.
bne 1b
bx lr
In this case, the loop body is simple. However, if there are no conditionals within the loop, a compiler (or assembly programmer) may schedule the loop decrement wherever they like and still have the conditions tested much later. This can be more important with more complex logic and where the CPU may stall due to data dependencies. It can also be important with conditional execution.
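The separation between setting the flags and testing them can be modeled in a few lines of Python. This is only a sketch: the function name sum_bytes and the explicit z variable standing in for the Z flag are illustrative assumptions, and it expects count to be at least 1, like the assembly loop.

```python
def sum_bytes(data, count=10):
    """Model of the loop: the decrement sets Z early, the branch tests it late."""
    pointer = 0
    counter = count
    total = 0
    while True:
        counter -= 1
        z = (counter == 0)      # subs: the only step that updates the flag
        total += data[pointer]  # load and add: leave the flag untouched
        pointer += 1            # post-increment of the pointer
        if z:                   # bne falls through once Z is set
            break
    return total


if __name__ == "__main__":
    print(sum_bytes(bytes(range(10))))
```

If the add also updated the flag, the branch would test the sum rather than the counter, which is the situation the optional S avoids.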
The optional 'S' is more a feature of many instructions than a single instruction.
Note 1: Someone could always make an ARM CPU where this is not the case; you would have to look at the data sheets. I don't know of any CPU that takes more time to set conditions.

How to determine the endian mode the processor is running in?

How do I determine the endian mode the ARM processor is running in, using only assembly language?
I can easily see the Thumb/ARM state by reading bit 5 of the CPSR, but I don't know if there is a corresponding bit in the CPSR or elsewhere for endianness.
;silly example trying to execute ARM code when I may be in Thumb mode....
MRS R0,CPSR
ANDS R0,#0x20
BNE ThumbModeIsActive
B ARMModeIsActive
I've got access to the ARM7TDMI data sheet, but this document does not tell me how to read the current state.
What assembly code do I use to determine the endianness?
Let's assume I'm using an ARM9 processor.
There is no CPSR bit for endianness in ARMv4 (ARM7TDMI) or ARMv5 (ARM9), so you need to use other means.
If your core implements system coprocessor 15, then you can check the bit 7 of the register 1:
MRC p15, 0, r0, c1, c0 ; CP15 register 1
TST r0, #0x80 ; check bit 7 (B)
BNE big_endian
B little_endian
However, the documentation (ARM DDI 0100E) seems to hint that this bit is only valid for systems where the endianness is configurable at runtime. If it's set by a pin, the bit may be wrong. And, of course, on most (all?) ARM7 cores, CP15 is not present.
There is a platform-independent way of checking the endianness which does not require any hardware bits. It goes something like this:
LDR R0, checkbytes
LDR R1, =0x12345678 ; too wide for a CMP immediate, so load it first
CMP R0, R1
BEQ big_endian
B little_endian
checkbytes
DCB 0x12, 0x34, 0x56, 0x78
Depending on the current endianness, the load will produce either 0x12345678 or 0x78563412.
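The two possible results are easy to confirm off-target. The Python sketch below models the word load with int.from_bytes; the names CHECK_BYTES, load_word and detect_endianness are illustrative only, and it says nothing about which mode a real core is in.

```python
CHECK_BYTES = bytes([0x12, 0x34, 0x56, 0x78])  # same layout as the byte table


def load_word(memory, byteorder):
    """Model a 32-bit load from `memory` in the given byte order."""
    return int.from_bytes(memory[:4], byteorder)


def detect_endianness(loaded_word):
    """Classify the value the core would see after loading CHECK_BYTES."""
    return "big" if loaded_word == 0x12345678 else "little"


if __name__ == "__main__":
    for order in ("big", "little"):
        word = load_word(CHECK_BYTES, order)
        print(order, hex(word), detect_endianness(word))
```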
ARMv6 and later versions let you check CPSR bit E (9) for endianness.
Before ARMv6, coprocessor 15 register c1 bit 7 should tell you which endianness the core is using.
In both cases 1 is big-endian while 0 is little-endian.

Why unsigned types are more efficient in arm cpu?

I'm reading an ARM manual and came across this suggestion, but the reason is not mentioned.
Why are unsigned types faster?
Prior to ARMv4, ARM had no native support for loading halfwords and signed bytes. To load a signed byte you had to LDRB then sign extend the value (LSL it up then ASR it back down). This is painful so char is unsigned by default.
In ARMv4 instructions were added to handle halfwords and signed values. These new instructions had to be squeezed into the available instruction space. Limits on the space available meant that they could not be made as flexible as the original instructions, which are able to do various address computations when loading the value.
So you may find that LDRSB, for example, is unable to combine a fetch from memory with an address computation whereas LDRB could. This can cost cycles. Sometimes we can rework short-heavy code to operate on pairs of ints to avoid this.
There's more info on my site here: http://www.davespace.co.uk/arm/efficient-c-for-arm/memaccess.html
I think it's just that the instruction set for ARM CPUs is optimized for unsigned. Some operations can be done with one instruction for unsigned types but will need multiple instructions if the type is signed. I think that's why, when compiling for ARM, most (all?) C and C++ compilers make char default to unsigned rather than the more usual signed.
The only advantages of unsigned types I can think of are that division and modulo implementations may be slightly faster, and you can do tests like if (unsigned_value < limit) rather than if (signed_value >= 0 && signed_value < limit).
I suspect your manual may be out of date. Any ARM in use today will have v4 or later of the instruction set, and I'm pretty sure that no instructions are faster or slower depending on signedness.
On older ARMs, I believe that signed multiplication could be slower; I think that early termination only looked for all zeros in the top bits, not all ones, so multiplications involving negative numbers would always take the maximum time. Although this depended on the value, not on whether the type was signed or unsigned. On at least ARMv4 and later, early termination works for negative values.
Also, I think very early ARMs couldn't load a single byte, only a word. So you'd need two instructions to load an unsigned byte, and three to load a signed one:
ldr r0, [r1]
and r0, r0, #0xff
versus
ldr r0, [r1]
mov r0, r0, asl #24
mov r0, r0, asr #24 ; but this could maybe be combined with later instructions
versus (these days) ldrb r0, [r1] or ldrsb r0, [r1] to do a single-byte load.
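The difference between the two multi-instruction sequences can be modeled with 32-bit arithmetic in Python. This is a sketch, assuming the byte of interest sits in the low 8 bits of the loaded word (little-endian, aligned address); the helper names are illustrative.

```python
MASK32 = 0xFFFFFFFF


def lsl(value, amount):
    """Logical shift left within 32 bits."""
    return (value << amount) & MASK32


def asr(value, amount):
    """Arithmetic shift right within 32 bits (the sign bit is replicated)."""
    value &= MASK32
    if value & 0x80000000:
        value -= 1 << 32
    return (value >> amount) & MASK32


def load_unsigned_byte(word):
    """ldr + and #0xff: one extra instruction after the load."""
    return word & 0xFF


def load_signed_byte(word):
    """ldr + asl #24 + asr #24: two extra instructions after the load."""
    return asr(lsl(word, 24), 24)


if __name__ == "__main__":
    print(hex(load_unsigned_byte(0x123456F0)))
    print(hex(load_signed_byte(0x123456F0)))
```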
On a modern processor, it's very unlikely that using unsigned types will have a measurable impact on performance. Use whichever type makes most sense, then look at the code in detail once you've identified any performance bottlenecks.
