Is the shift operation a separate instruction in the Thumb ISA? - arm

ARM instructions may use the barrel shifter on the second source operand (see the assembly listed below); the shift is part of the data-processing instruction, saving one instruction for the shifting. Can a Thumb instruction use the barrel shifter in data-processing instructions, or must the shift be done as an independent instruction? I am asking because Thumb may not have sufficient space in the instruction encoding for the barrel shifter.
mov r0, r1, LSL #1

That example's not great, since it's an alternate form of the canonical lsl r0, r1, #1, which does have a 16-bit Thumb encoding (albeit with flag-setting restrictions).
An alternative ARM instruction such as add r0, r0, r1, lsl #1 would indeed have to be done as two Thumb instructions because as you say there just isn't room to squeeze both operations into 16 bits (hence also why you're limited to r0-r7 so registers can be encoded in 3 bits rather than 4).
Thumb-2, on the other hand, generally does have 32-bit encodings for immediate shifts of operands, so on ARMv6T2 and later architectures you can encode add r0, r0, r1, lsl #1 as a single instruction.
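For example, with a unified-syntax GNU assembler targeting ARMv6T2 or later, the shifted operand assembles as a single instruction (a sketch; the directives are gas-specific):

```asm
.syntax unified
.thumb
add r0, r0, r1, lsl #1   @ one 32-bit Thumb-2 instruction on ARMv6T2+
```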
The register-shifted register form (e.g. add r0, r0, r1, lsl r2), however, isn't available even in Thumb-2 - you'd have to do that in 2 Thumb instructions like so:
lsl r1, r2
add r0, r1
Note that unlike the ARM instruction, this sequence changes the value in r1. If you wanted to preserve r1 as well, you'd need an intermediate register and an extra mov instruction (or a Thumb-2 3-register lsl); failing that, the last resort would be to bx to ARM code.
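A minimal sketch of the r1-preserving variant just mentioned, assuming r3 is free to serve as the intermediate register:

```asm
movs r3, r1      @ copy r1 so its original value survives
lsls r3, r2      @ r3 = r1 << r2 (16-bit Thumb forms update the flags)
add  r0, r3      @ r0 += r1 << r2; r1 is unchanged
```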

Related

What would be a reason to use ADDS instruction instead of the ADD instruction in ARM assembly?

My course notes always use ADDS and SUBS in their ARM code snippets, instead of the ADD and SUB as I would expect. Here's one such snippet for example:
__asm void my_capitalize(char *str)
{
cap_loop
LDRB r1, [r0] // Load byte into r1 from memory pointed to by r0 (str pointer)
CMP r1, #'a'-1 // compare it with the character before 'a'
BLS cap_skip // If byte is lower or same, then skip this byte
CMP r1, #'z' // Compare it with the 'z' character
BHI cap_skip // If it is higher, then skip this byte
SUBS r1,#32 // Else subtract out difference to capitalize it
STRB r1, [r0] // Store the capitalized byte back in memory
cap_skip
ADDS r0, r0, #1 // Increment str pointer
CMP r1, #0 // Was the byte 0?
BNE cap_loop // If not, repeat the loop
BX lr // Else return from subroutine
}
This simple code, for example, converts all lowercase English letters in a string to uppercase. What I do not understand in this code is why they are not using the ADD and SUB commands instead of the ADDS and SUBS currently being used. The ADDS and SUBS commands, afaik, update the APSR flags NZCV for later use. However, as you can see in the above snippet, the updated values are not being utilized. Is there any other utility to these commands then?
Arithmetic instructions (ADD, SUB, etc.) don't modify the status flags, unlike comparison instructions (CMP, TEQ), which update the condition flags by default. However, adding the S to the arithmetic instructions (ADDS, SUBS, etc.) will update the condition flags according to the result of the operation. That is the only point of using the S on the arithmetic instructions, so if the condition flags are not going to be checked, there is no reason to use ADDS instead of ADD.
There are more codes that can be appended to the instruction (link) to achieve different purposes, such as CC (the condition flag C=0), hence:
ADDCC: do the operation if the carry status bit is set to 0.
ADDCCS: do the operation if the carry status bit is set to 0 and afterwards, update the status flags (if C=1, the status flags are not overwritten).
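A short ARM-state sketch of the conditional suffixes described above, using pre-UAL syntax to match the answer's spelling (r0-r2 are arbitrary example registers):

```asm
cmp   r0, r1        @ updates the condition flags, including C
addcc r2, r2, #1    @ executes only if C == 0 (unsigned lower)
```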
From the cycles point of view, there is no difference between updating the conditional flags or not. Considering an ARMv6-M as example, ADDS and ADD will take 1 cycle.
Discarding the use of ADD might look like a lazy choice, since ADD is quite useful in some cases. Going further, these two examples:
SUBS r0, r0, #1
ADDS r0, r0, #2
BNE go_wherever
and
SUBS r0, r0, #1
ADD r0, r0, #2
BNE go_wherever
may yield different behaviours: in the first, the BNE tests the flags set by the ADDS, while in the second, the ADD leaves the flags from the SUBS intact, so the BNE tests the result of the subtraction.
As old_timer has pointed out, the UAL (Unified Assembly Language) becomes quite relevant on this topic. In the unified language, the preferred syntax is ADDS instead of ADD (link). So the OP's code is absolutely fine (even recommended) if the purpose is to be assembled for Thumb and/or ARM (using UAL).
ADD without the flag update is not available on some cortex-ms. If you look at the arm documentation for the instruction set (always a good idea when doing assembly language), for general purpose use cases it is not available until a thumb2 extension on armv7-m (cortex-m3, cortex-m4, cortex-m7). The cortex-m0 and cortex-m0+, and generally wide-compatibility code (which would use armv4t or armv6-m), don't have an add-without-flags option. So perhaps that is why.
The other reason may be to get the 16-bit instruction rather than the 32-bit one, but that is a slippery slope as it gets even more into assemblers and their syntax (syntax is defined by the assembler, the program that processes assembly language, not the target). For example, non-syntax-unified gas:
.thumb
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
The disassembler knows reality but the assembler doesn't:
so.s: Assembler messages:
so.s:2: Error: instruction not supported in Thumb16 mode -- `adds r1,r2,r3'
but
.syntax unified
.thumb
adds r1,r2,r3
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
2: eb02 0103 add.w r1, r2, r3
So not slippery in this case, but with the unified syntax you start to get into blahw, blah.w, blah type syntax and have to spin back around to check that the instructions you wanted are being generated. Non-unified has its own games as well, and of course all of this is assembler-specific.
I suspect they were either going with the only choice they had, or were using the smaller and more compatible instruction; especially if this were a class or a text, the more compatible the better.

How does a compiler/assembler make sense of processor core registers?

My question is specific to arm cortex M3 micro-controllers. Every peripheral on the micro controller is memory mapped and those memory addresses are used in processing.
For Eg.,: GPIOA->ODR = 0;
This will write a 0 at address 0x4001080C.
This address is defined in the device specific file of the micro controller.
Now, the cortex M3 has processor core registers R0-R12 (general purpose). I want to know, do these registers also have some address like other peripherals?
So, if I have instruction: MOV R0, #10;
will R0 be translated to some address when assembled? Do core registers have special numeric addresses exclusive for core peripherals. Is address of R0 defined in any file (I couldn't find any) like that of GPIOA? Or is it that register R0 and other core registers are referred to as R0 and their respective names only so that the assembler sees "R0" and generates the opcode from it?
I have this confusion because some 8 bit controllers also have addresses for general purpose registers.
Registers like R0-R12, SP, PC, etc. are registers inside the CPU core and are not mapped into the global address space; they can only be accessed directly from assembly.
Direct access to the core registers from higher-level languages like C is likewise not possible, because they are not addressable. These registers are used for internal processing and are transparent to the programmer.
But registers like GPIOA->ODR are mapped into the global address space, so each such register has its own address.
General-purpose registers are meant for general-purpose operations in the CPU, much like the temporary variables we use in any programming language. Relating this to your question: the CPU requires a few reserved storage locations to do its basic operations, so there is no point in exposing them to the outside world. This is how ARM-based processors work.
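A short sketch of the contrast, using the GPIOA->ODR address from the question (0x4001080C); note that r0 and r1 never appear as addresses, only as bit fields inside the opcodes:

```asm
ldr  r0, =0x4001080C  @ GPIOA->ODR is memory-mapped: a real bus address
movs r1, #0
str  r1, [r0]         @ core registers r0/r1 are named in the opcode, not addressed
```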
You happened to have picked an instruction that is really easy to see...
.thumb
mov r0,#10
mov r1,#10
mov r2,#10
mov r3,#10
mov r4,#10
mov r5,#10
mov r6,#10
mov r7,#10
assemble then disassemble to see the machine code
Disassembly of section .text:
00000000 <.text>:
0: 200a movs r0, #10
2: 210a movs r1, #10
4: 220a movs r2, #10
6: 230a movs r3, #10
8: 240a movs r4, #10
a: 250a movs r5, #10
c: 260a movs r6, #10
e: 270a movs r7, #10
There will be three or four bits, depending on the instruction and instruction set (arm vs thumb, and then thumb2 extensions), that specify the register. In this case those bits happen to line up nicely with the hex representation of the machine code, so we can see the 0 through 7. For a cortex-m3, many of the thumb instructions are limited to r0-r7 (implying a 3-bit field within the instruction), with one or two instructions to move between the lower and upper registers; thumb2 extensions allow more access to the full r0-r15 (and thus have a 4-bit field in the instruction). You should get the armv7m architectural reference manual, which is the one associated with the cortex-m3 (after you get the cortex-m3 technical reference manual and see that it uses the armv7m architecture). You can also get the oldest armv5 architectural reference manual, as it has the oldest description of the thumb instruction set, which is the one instruction set compatible across all arm cores. armv6m covers the cortex-m0, which has a lot fewer thumb2 extensions than armv7m, which covers the cortex-m3; the m4 and m7 have tons more thumb2 extensions.
another example that takes only a second to try
.thumb
mov r0,r0
mov r1,r1
mov r2,r2
mov r3,r3
mov r4,r4
mov r5,r5
mov r6,r6
mov r7,r7
mov r0,r0
mov r1,r0
mov r2,r0
mov r3,r0
mov r4,r0
mov r5,r0
mov r6,r0
mov r7,r0
Disassembly of section .text:
00000000 <.text>:
0: 1c00 adds r0, r0, #0
2: 1c09 adds r1, r1, #0
4: 1c12 adds r2, r2, #0
6: 1c1b adds r3, r3, #0
8: 1c24 adds r4, r4, #0
a: 1c2d adds r5, r5, #0
c: 1c36 adds r6, r6, #0
e: 1c3f adds r7, r7, #0
10: 1c00 adds r0, r0, #0
12: 1c01 adds r1, r0, #0
14: 1c02 adds r2, r0, #0
16: 1c03 adds r3, r0, #0
18: 1c04 adds r4, r0, #0
1a: 1c05 adds r5, r0, #0
1c: 1c06 adds r6, r0, #0
1e: 1c07 adds r7, r0, #0
Note that the bits didn't line up as nicely with the hex values as before; doesn't matter, look at the binary to see the three bits changing from instruction to instruction.
In this case the assembler chose to use an add instead of a mov:
Notes:
Encoding: This instruction is encoded as ADD Rd, Rn, #0.
and
Notes
Operand restriction: If a low register is specified for <Rd> and <Rm> (H1==0 and H2==0), the result is UNPREDICTABLE.
All this plus a zillion more things you learn when you read the documentation: http://infocenter.arm.com. On the left, arm architecture, then reference manuals; you may have to sacrifice an email address. You can also google arm architectural reference manual and you may get lucky...

What's the meaning of W suffix for thumb-2 instruction?

There is a w suffix for a thumb-2 instruction, as below. How does it change the semantics of the instruction compared to the form without it? The search results are very noisy and I didn't find the answer.
addw r0, r1, #0
Simply enough, W means "wide". It is the 32-bit version of the instruction, whereas most Thumb instructions are 16 bits wide. The wide instructions often have bigger immediates or can address more registers.
Edit: Some of the comments seem confused about the difference between addw and add.w. The only essential difference is how the immediate is encoded.
add.w: imm32 = ThumbExpandImm(i:imm3:imm8);
addw: imm32 = ZeroExtend(i:imm3:imm8, 32);
I see ADDW in Cortex-M3 TRM Table 2-5
Data operations with large immediate
ADDW and SUBW have a 12-bit immediate. This means they can replace many literal loads from memory.
It is also mentioned in Quick Reference
add wide T2 ADD Rd, Rn, #<imm12>
It looks like the assembler recognizes an immediate constant that fits in 12 bits and chooses the appropriate encoding.
In the context where you see it, it is an ordinary "add".
Different encodings of an instruction should have distinguishing syntaxes so when you disassemble a binary you should notice which encoding was used. This also helps when you assemble back a disassembled binary, resulting binary should be the one you start with.
In your case using addw instead of add doesn't change the semantic of instruction as it is an add operation. However it certainly forces assembler to produce Encoding T4 (32-bit) of add instruction, as that's dictated by the specification.
In summary: when assembling you can use just the add mnemonic, the assembler will choose the right encoding, and you can see the result in the object dump.
int f1(int i) {
asm volatile("add r0, #0");
asm volatile("add r0, #257");
}
00000000 <f1>:
0: f100 0000 add.w r0, r0, #0
4: f200 1001 addw r0, r0, #257 ; 0x101
8: 4770 bx lr
a: bf00 nop

ARM NEON: load data from addresses contained in NEON registers (Q / D registers)

I'm working on ARM NEON assembly code that consists of two parts. The first part calculates various memory addresses, starting from a base address with some computed offsets added (the results are very distant memory addresses). The second part has to load data from the addresses computed in the first part and use it. Both parts are highly parallelizable and use only NEON parallelism.
What I need is to find the best way to combine the two parts: load data using the addresses output from the first phase.
What I've tried and seems to work is the simplest solution:
//q8 & q9 have 8 computed addresses
VMOV.32 r0, d16[0] //move addresses to standard registers
VMOV.32 r1, d16[1]
VMOV.32 r2, d17[0]
VMOV.32 r3, d17[1]
VLD1.8 d28[0], [r0] //load uchar (deinterleaving in d28 and d29)
VLD1.8 d29[0], [r1] //otherwise do not interleave and use VUZP
VLD1.8 d28[1], [r2]
VLD1.8 d29[1], [r3]
VMOV.32 r0, d18[0]
VMOV.32 r1, d18[1]
VMOV.32 r2, d19[0]
VMOV.32 r3, d19[1]
VLD1.8 d28[2], [r0]
VLD1.8 d29[2], [r1]
VLD1.8 d28[3], [r2]
VLD1.8 d29[3], [r3]
...
//data loaded in d28 and d29
In this example I've used four R registers (I could use fewer or more), and I'm de-interleaving data into d28 and d29, simulating a standard VLD2.8 working on an array.
As this problem (computing addresses in NEON and loading from those addresses) comes up often for me, is there a better way?
What you did might work, but you shouldn't do that.
While ARM->NEON transfers are nimble, NEON->ARM transfers aren't: they cause pipeline stalls, wasting about 14 cycles each time one is initiated.
In your case, 28 cycles are wasted for nothing. And I'm sure doing the math with ARM would take much less.
Stick to ARM. When dealing with multiple 32-bit data items like addresses, ARMv7 benefits heavily from its dual (sometimes triple) issuing capability (except for multiplications).
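A sketch of computing one address on the ARM side instead, per the advice above; r4 as base and r5 as index are assumptions for illustration:

```asm
add  r0, r4, r5, lsl #2   @ address = base + index*4, via the barrel shifter
ldrb r1, [r0]             @ load the byte directly, no NEON->ARM stall
```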

ARM v7 ORRS mnemonic without operand2

What does:
ORRS R1, R3 do?
Is it just R1 |= R3, possibly setting the N- or Z flags?
This might be blatantly obvious, but I haven't found any documentation that describes the ORR mnemonic without operand2.
It's a Thumb instruction. In a 16-bit Thumb opcode you can only fit the two operands; you don't get the extra operand2.
As you guessed, it does R1 |= R3. In unified syntax, the S indicates the flags are updated: the 16-bit encoding sets the flags when used outside an If-Then block (what Thumb has instead of full ARM conditional execution) and leaves them alone inside one, so you write ORRS outside an IT block and ORR inside it. Both generate the same opcode and differ only by context.
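A unified-syntax sketch of the two spellings (gas directives assumed):

```asm
.syntax unified
.thumb
orrs  r1, r3     @ outside an IT block: flags updated, 16-bit opcode
it    eq
orreq r1, r3     @ inside an IT block: conditional, flags not updated
```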
In ARM assembly, instructions where the destination register (Rd) is the same as the first source register (Rn) can omit Rd. So these are the same and will assemble to exactly the same instruction:
orrs r1, r1, r3
orrs r1, r3
add r1, r1, #5
add r1, #5
and so on.
It may be in the armv7m trm or its own document, but they have a unified assembly language (UAL), so
add r0,r3
will assemble for both arm and thumb. For thumb it is used as is; for arm the above equates to add r0,r0,r3. For thumb you don't get the three-register option, but the functionality is implied: r0 = r0 + r3.
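A sketch of the same operation spelled for each state (gas syntax assumed):

```asm
.syntax unified
.thumb
add r0, r3         @ 16-bit Thumb two-operand form: r0 = r0 + r3
.arm
add r0, r0, r3     @ equivalent ARM three-operand form
```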
