What's the meaning of W suffix for thumb-2 instruction? - arm

There is a w suffix for thumb-2 instruction as below, how does it change the semantic of the instruction without it? The search result is very noisy and I didn't get the answer.
addw r0, r1, #0

Simply enough, W means "wide". It is the 32-bit version of the instruction, whereas most Thumb instructions are 16 bits wide. The wide instructions often have bigger immediates or can address more registers.
Edit: Some of the comments seem confused about the difference between addw and add.w. The only essential difference is how the immediate is encoded.
add.w: imm32 = ThumbExpandImm(i:imm3:imm8);
addw: imm32 = ZeroExtend(i:imm3:imm8, 32);

I see ADDW in Cortex-M3 TRM Table 2-5
Data operations with large immediate
ADDW and SUBW have a 12-bit immediate. This means they can replace many from memory literal loads.
It is also mentioned in Quick Reference
add wide T2 ADD Rd, Rn, #<imm12>
Looks like the assembler would recognize the immediate constant <= 12 bits, and do the needful.
In the context where you see it, it is an ordinary "add".

Different encodings of an instruction should have distinguishing syntaxes so when you disassemble a binary you should notice which encoding was used. This also helps when you assemble back a disassembled binary, resulting binary should be the one you start with.
In your case using addw instead of add doesn't change the semantic of instruction as it is an add operation. However it certainly forces assembler to produce Encoding T4 (32-bit) of add instruction, as that's dictated by the specification.
Summary when assembling you can use just add mnemonic and assembler would choose the right encoding and you can see that in the object dump.
int f1(int i) {
asm volatile("add r0, #0");
asm volatile("add r0, #257");
}
00000000 <f1>:
0: f100 0000 add.w r0, r0, #0
4: f200 1001 addw r0, r0, #257 ; 0x101
8: 4770 bx lr
a: bf00 nop

Related

What would be a reason to use ADDS instruction instead of the ADD instruction in ARM assembly?

My course notes always use ADDS and SUBS in their ARM code snippets, instead of the ADD and SUB as I would expect. Here's one such snippet for example:
__asm void my_capitalize(char *str)
{
cap_loop
LDRB r1, [r0] // Load byte into r1 from memory pointed to by r0 (str pointer)
CMP r1, #'a'-1 // compare it with the character before 'a'
BLS cap_skip // If byte is lower or same, then skip this byte
CMP r1, #'z' // Compare it with the 'z' character
BHI cap_skip // If it is higher, then skip this byte
SUBS r1,#32 // Else subtract out difference to capitalize it
STRB r1, [r0] // Store the capitalized byte back in memory
cap_skip
ADDS r0, r0, #1 // Increment str pointer
CMP r1, #0 // Was the byte 0?
BNE cap_loop // If not, repeat the loop
BX lr // Else return from subroutine
}
This simple code for example converts all lowercase English in a string to uppercase. What I do not understand in this code is why they are not using ADD and SUB commands instead of ADDS and SUBS currently being used. The ADDS and SUBS command, afaik, update the APSR flags NZCV, for later use. However, as you can see in the above snippet, the updated values are not being utilized. Is there any other utility of this command then?
Arithmetic instructions (ADD, SUB, etc) don't modify the status flag, unlike comparison instructions (CMP,TEQ) which update the condition flags by default. However, adding the S to the arithmetic instructions(ADDS, SUBS, etc) will update the condition flags according to the result of the operation. That is the only point of using the S for the arithmetic instructions, so if the cf are not going to be checked, there is no reason to use ADDS instead of ADD.
There are more codes to append to the instruction (link), in order to achieve different purposes, such as CC (the conditional flag C=0), hence:
ADDCC: do the operation if the carry status bit is set to 0.
ADDCCS: do the operation if the carry status bit is set to 0 and afterwards, update the status flags (if C=1, the status flags are not overwritten).
From the cycles point of view, there is no difference between updating the conditional flags or not. Considering an ARMv6-M as example, ADDS and ADD will take 1 cycle.
Discard the use of ADD might look like a lazy choice, since ADD is quite useful for some cases. Going further, consider these examples:
SUBS r0, r0, #1
ADDS r0, r0, #2
BNE go_wherever
and
SUBS r0, r0, #1
ADD r0, r0, #2
BNE go_wherever
may yield different behaviours.
As old_timer has pointed out, the UAL becomes quite relevant on this topic. Talking about the unified language, the preferred syntax is ADDS, instead of ADD (link). So the OP's code is absolutely fine (even recommended) if the purpose is to be assembled for Thumb and/or ARM (using UAL).
ADD without the flag update is not available on some cortex-ms. If you look at the arm documentation for the instruction set (always a good idea when doing assembly language) for general purpose use cases that is not available until a thumb2 extension on armv7-m (cortex-m3, cortex-m4, cortex-m7). The cortex-m0 and cortex-m0+ and generally wide compatibility code (which would use armv4t or armv6-m) doesn't have an add without flags option. So perhaps that is why.
The other reason may be to get the 16-bit instruction not the 32, but but that is a slippery slope as it gets even more into assemblers and their syntax (syntax is defined by the assembler, the program that processes assembly language, not the target). For example not syntax unified gas:
.thumb
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
The disassembler knows reality but the assembler doesn't:
so.s: Assembler messages:
so.s:2: Error: instruction not supported in Thumb16 mode -- `adds r1,r2,r3'
but
.syntax unified
.thumb
adds r1,r2,r3
add r1,r2,r3
Disassembly of section .text:
00000000 <.text>:
0: 18d1 adds r1, r2, r3
2: eb02 0103 add.w r1, r2, r3
So not slippery in this case, but with the unified syntax you start to get into blahw, blah.w, blah, type syntax and have to spin back around to check to see that the instructions you wanted are being generated. Non-unified has its own games as well, and of course all of this is assembler-specific.
I suspect they were either going with the only choice they had, or were using the smaller and more compatible instruction, especially if this were a class or text, the more compatible the better.

arm disassembly - ADR or SUB?

I started to build a arm disassembler.
I have the binary "48 00 4F E2"
Ida:
ROM:00000040 48 00 4F E2 ADR R0, sub_0
Qemu:
e24f0048 sub r0, pc, #72
I do not think it's an BE/LE problem because the commands that came before and after would look the same.
What happens ?
so if you just try it...
.word 0x48004FE2
.word 0xe24f0048
gives
0: 48004fe2 stmdami r0, {r1, r5, r6, r7, r8, r9, r10, r11, lr}
4: e24f0048 sub r0, pc, #72 ; 0x48
but since you are writing a disassembler (very good exercise to learn an instruction set BTW, keep going)...the first thing you notice is the condition code. Neither of the results you saw have "mi" associated with it so neither assuming they are both disassembling arm and arm mode, did not see that 4 as the top nibble. the 0xe is ALways so not noted.
you also see in your arm documentation that adr starts with
cccc0010010x1111 or cccc0010100x1111 0xX24F 0xX28F
and it is a sub without the Rn being 1111, but one disassembler chose to honor ADR which is a pseudo instruction, the other just decoded it as a sub. It may also matter what architecture you specified. In the newer arm arm ADR shows that it is supported by armv4t on, but the older arm arm (armv4 and armv5 and some armv6) shows no ADR instruction. It just shows the sub.
which starts as
cccc00I0010SNNNN
where cccc is the condition code I can be a zero in some cases or a 1 in this case, S is the save flags or not and NNNN is Rn.
sub(cond)(s) Rd,Rn,shifter operand
you have 0xE24F so that is
sub r0,pc,shifter_operand.
I is a 1 so that is a 32 bit immediate. 0x48 with a rotate of 0x0 which is a decimal 72 (hex 0x48).
what is at 0x40-0x48+8 (address 0x0000)? is that the label sub_0?
It is looking like they both correctly disassembled the instruction.
If you are truly making a disassembler though then you shouldnt have needed to ask this question as you had everything you needed in front of you already.

Why is SP (apparently) stored on exception entry on Cortex-M3?

I am using a TI LM3S811 (a older Cortex-M3) with the SysTick interrupt to trigger at 10Hz. This is the body of the ISR:
void SysTick_Handler(void)
{
__asm__ volatile("sub r4, r4, #32\r\n");
}
This produces the following assembly with -O0 and -fomit-frame-pointer with gcc-4.9.3. The STKALIGN bit is 0, so stacks are 4-byte aligned.
00000138 <SysTick_Handler>:
138: 4668 mov r0, sp
13a: f020 0107 bic.w r1, r0, #7
13e: 468d mov sp, r1
140: b401 push {r0}
142: f1ad 0420 sub.w r4, r4, #32
146: f85d 0b04 ldr.w r0, [sp], #4
14a: 4685 mov sp, r0
14c: 4770 bx lr
14e: bf00 nop
I don't understand what's going on with r0 in the listing above. Specifically:
1) It seems like we're clearing the lower 3 bits of SP and storing it on the stack. Is that to maintain 8-byte alignment? Or is it something else?
2) Is the exception exit procedure is equally confusing. From my limited understanding of the ARM assembly, it does something like this:
SP = SP + 4; R0 = SP;
Followed by storing it back to SP. Which seems to undo the manipulations until this stage.
3) Why is there a nop instruction after the unconditional branch (at 0x14E)?
The ARM Procedure Calling Standard and C ABI expect an 8 byte (64 bit) alignment of the stack. As an interrupt might occur after pushing/poping a single word, it is not guaranteed the stack is correctly aligned on interrupt entry.
The STKALIGN bit, if set (the default) enforces the hardware to align the stack automatically by conditionally pushing an extra (dummy) word onto the stack.
The interrupt attribute on a function tells gcc, OTOH the stack might be missaligned, so it adds this pre-/postamble which enforces the alignment.
So, both actually do the same; one in hardware, one in software. If you can live with a word-aligned stack only, you should remove the interrupt attribute from the function declarations and clear the STKALIGN bit.
Make sure such a "missaligned" stack is no problem (I would not expect any, as this is a pure 32 bit CPU). OTOH, you should leave it as-is, unless you really need to safe that extra conditional(!) clock and word (very unlikely).
Warning: According to the ARM Architecture Reference Manual, setting STKALIGN == 0 is deprecated. Briefly: do not set this bit to 0!
Since you're using -O0, you should expect lots of redundant and useless code. The general way in which a compiler works is to generate code with the full generality of everything that might be used anywhere in the program, and then rely on the optimizer to get rid of things that are unneeded.
Yes this is doing 8byte alignment. Its also allocating a stack frame to hold local variables even though you have none.
The exit is the reverse, deallocating the stack frame.
The nop at the end is to maintain 4-byte alignment in the code, as you might want to link with non-thumb code at some point.
If you enable optimization, it will eliminate the stack frame (as its unneeded) and the code will become much simpler.

Is shift operation running in separated instruction in thumb ISA?

ARM instructions may utilize barrel shifter in its second source operand (see below assembly listed), which is part of the data process instruction so save one instruction to do shifting. I am wondering could thumb instruction utilize barrel shift in DP instructions? Or should it separate the shift operation into an independent instruction? I am asking this since thumb may not has sufficient space in the instruction to code barrel shifter.
mov r0, r1, LSL #1
That example's not great, since it's an alternate form of the canonical lsl r0, r1, #1, which does have a 16-bit Thumb encoding (albeit with flag-setting restrictions).
An alternative ARM instruction such as add r0, r0, r1, lsl #1 would indeed have to be done as two Thumb instructions because as you say there just isn't room to squeeze both operations into 16 bits (hence also why you're limited to r0-r7 so registers can be encoded in 3 bits rather than 4).
Thumb-2, on the other hand, generally does have 32-bit encodings for immediate shifts of operands, so on ARMv6T2 and later architectures you can encode add r0, r0, r1, lsl #1 as a single instruction.
The register-shifted register form, however, (e.g. add r0, r0, r1, lsl r2) isn't available even in Thumb-2 - you'd have to do that in 2 Thumb instructions like so:
lsl r1, r2
add r0, r1
Note that unlike the ARM instruction this sequence changes the value in r1 - if you wanted to preserve that as well you'd need an intermediate register and an extra mov instruction (or a Thumb-2 3-register lsl) - failing that the last resort would be to bx to ARM code.

ARM v7 ORRS mnemonic without operand2

What does:
ORRS R1, R3 do?
Is it just R1 |= R3, possibly setting the N- or Z flags?
This might be blatantly obvious, but I haven't found any documentation that describes the ORR mnemonic without operand2.
Feedback is greatly appreciated.
-b.
It's a Thumb instruction. In a 16-bit Thumb opcode you can only fit the two operands, you don't get the extra operand2.
As you guessed, it does R1 |= R3. The S flag's presence indicates this instruction is part of an If-Then block (what Thumb has instead of proper ARM conditional execution); ORR and ORRS generate the same opcode and differ only by context.
In ARM assembly, instructions where the destination register (Rd) is the same as the source register (Rn) can omit Rd. So these are the same and will assemble to exactly the same instruction.
orrs r1, r1, r3
orrs r1, r3
add r1, r1, #5
add r1, #5
and so on.
It may be in the armv7m trm or its own trm, but they have a, I dont remember the term unified, assembly language, so
add r0,r3
will assemble for both arm and thumb, for thumb it is as is for arm the above equates to add r0,r0,r3. For thumb you dont get the three register option, but the functionality is implied r0=r0+r3.

Resources