I'm just getting started with the ARM architecture on my Nucleo STM32F303RE, and I'm trying to understand how the instructions are encoded.
I have a simple LED-blinking program running, and the first few disassembled application instructions are:
08000188: push {lr}
0800018a: sub sp, #12
235 __initialize_hardware_early ();
0800018c: bl 0x80005b8 <__initialize_hardware_early>
These instructions resolve to the following in the hex file (displayed oddly in Eclipse -- each 32-bit word is shown in MSB order, but Eclipse doesn't seem to know it... but that's for another topic):
address 0x08000188: B083B500 FA14F000
Using the ARM Architecture Ref Manual, I've confirmed the first 2 instructions, push (0xB500) and sub (0xB083). But I can't make any sense out of the "bl" instruction.
The hex instruction is 0xFA14F000. The Ref Manual says it breaks down like this:
31..28 27 26 25 24 23............0
cond   1  0  1  L  signed_immed_24
The first "F" (0xF......) makes sense: all conditions are set (ALways).
The "A" doesn't make sense though, since the L bit should be set (1011). Shouldn't it be 0xFB......?
And the signed_immed_24 doesn't make sense, either. The ref manual says:
- start with 0x14F000
- sign extend to 30 bits (signed 2's-complement), giving 0x0014F000
- shift left to form 32-bit value, giving 0x0053C000
- add to the PC, which is the current instruction + 8, giving 0x0800018c + 8 + 0x0053C000, or 0x0853C194.
So I get a branch address of 0x0853C194, but the disassembly shows 0x080005B8.
What am I missing?
Thanks!
-Eric
bl is two separate 16-bit instructions. The ARMv5 (and older) ARM ARM does a better job of documenting them.
15..13 12..11 10..........0
1 1 1   H H    offset_11
From the ARM ARM
The first Thumb instruction has H == 10 and supplies the high part of
the branch offset. This instruction sets up for the subroutine call
and is shared between the BL and BLX forms.
The second Thumb instruction has H == 11 (for BL) or H == 01 (for
BLX). It supplies the low part of the branch offset and causes the
subroutine call to take place.
0xFA14 0xF000
0xF000 is the first instruction; its upper offset bits are all zero.
0xFA14 is the second instruction; its offset is 0x214.
If starting at 0x0800018C, then it is 0x0800018C + 4 + (0x214 << 1) = 0x080005B8. The 4 is because the PC is two (16-bit) instructions ahead of the current instruction, and the offset is in units of 16-bit halfwords.
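If you want to check that arithmetic mechanically, here is a minimal sketch in C; the halfwords and address are the ones from the disassembly above, and the field extraction follows the ARMv5 description (the 11-bit high part is sign-extended and shifted up by 12, the low part by 1):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t first  = 0xF000;          /* H == 10: high part of the offset */
    uint16_t second = 0xFA14;          /* H == 11: low part, makes the call */
    uint32_t pc     = 0x0800018C + 4;  /* PC is two halfwords ahead */

    /* sign-extend the 11-bit high field and place it at bits [22:12] */
    int32_t hi = first & 0x7FF;
    if (hi & 0x400)
        hi -= 0x800;

    uint32_t target = pc + ((uint32_t)hi << 12) + (uint32_t)((second & 0x7FF) << 1);
    printf("0x%08X\n", target);        /* prints 0x080005B8 */
    return 0;
}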
I guess the ARMv7-M ARM ARM covers it as well, but it is harder to read, and apparently features were added. They do not affect this particular branch-link, though.
The ARMv5 ARM ARM does a better job of describing what happens as well. You can certainly take these two separate instructions and move them apart:
.byte 0x00,0xF0
nop
nop
nop
nop
nop
.byte 0x14,0xFA
and it will branch to the same offset (relative to the second instruction). Maybe they broke that in some cores, but I know that in some (after ARMv5) it works.
I'm kinda new to ARM and I am trying to understand how instructions are interpreted/executed:
From what I know, on ARM it's quite simple, since every instruction takes up 4 bytes and everything is aligned on 4 bytes as well.
The problem comes with Thumb-2, where instructions can be either 16 or 32 bits long. I've read that to determine whether the current instruction is 16 or 32 bits long, the processor reads a word (32 bits) and evaluates bits [15:11] of the first halfword. If those bits are 0b11101/0b11110/0b11111, then that halfword is the first halfword of a 32-bit instruction; otherwise it's a 16-bit instruction (I don't quite get why those specific bits determine that). So an example should be:
0x4000 16-bit
0x4002 32-bit
0x4006 16-bit
0x4008 16-bit
0x400a 32-bit
Then the processor should grab from 0x4000 to 0x4004, evaluate the first halfword (0x4000 to 0x4002), and if the instruction is 16-bit, just jump to the next halfword and repeat the process; but if the halfword indicates a 32-bit instruction, skip the next halfword and execute the whole 32-bit instruction?
Also, I'm confused about where the PC points in Thumb-2. Is it still two instructions ahead?
Most of us don't/won't know exactly how it is implemented in the logic (and there are various cores, so each could be different). But what used to be undefined instructions became Thumb-2 extensions: a couple dozen in ARMv6-M, then around 150 new ones in ARMv7-M.
Think of the processor as fetching 16-bit instructions, where sometimes it runs across a variable-length one. It is just like other variable-length processors: the x86 will look at the first byte, and based on that it may or may not need to look at the next byte, and so on, until it has resolved the whole instruction. Same here: the processor looks at a halfword and determines whether it has everything it needs; if not, it grabs the next halfword for the rest of the information.
0x4000 16-bit
0x4002 32-bit
0x4006 16-bit
0x4008 16-bit
0x400a 32-bit
The processor grabs 0x4000, sees it has what it needs, and executes. It grabs 0x4002, sees it needs another halfword, grabs 0x4004, and executes. It grabs 0x4006, has what it needs, executes. It grabs 0x4008, has what it needs, executes. It grabs 0x400A, sees it needs another halfword, grabs 0x400C, and executes.
Those bit patterns were formerly undefined instructions; now they are part of the definition of a variable-length instruction. Just as instructions that start with 0b010000 are data processing instructions, and you have to look at other bits to determine whether it is an add or an xor, these bit patterns define Thumb-2 extensions, and other bits in those two halfwords define what the full instruction is.
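In code, the check the question describes amounts to something like the sketch below (my own helper, not from the manual; the three bit patterns are the ones listed above):

#include <stdint.h>
#include <stdbool.h>

/* A halfword is the first half of a 32-bit Thumb-2 instruction exactly
   when bits [15:11] are 0b11101, 0b11110, or 0b11111. */
bool is_32bit_first_halfword(uint16_t hw)
{
    uint16_t top5 = hw >> 11;
    return top5 == 0x1D || top5 == 0x1E || top5 == 0x1F;
}

A decoder loop would call this on each halfword and consume one more halfword whenever it returns true, exactly as in the walkthrough above.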
Why these bit patterns? You can think of it as arbitrary if you want. For every instruction set, someone (or some group) sat down and decided what bit patterns were going to mean what; it is no different here. There was room in the instruction set space with certain patterns, so those were used. It is not uncommon to add instructions later in the life of a processor family; take x86 for example, plus many others. For an 8-bitter like the x86 or 6502 or whatever, you can either consume an unused 8-bit instruction/opcode as your next new instruction, or you can take that formerly unused byte/opcode and expand it into many more: that byte now means "look at the next byte", and the next byte could select up to 256 new instructions, or it could simply supplement the first byte by specifying registers or operations, etc. No different here: down the road ARM extended the Thumb instruction set. Some percentage of the instruction is consumed indicating that this is a variable-length instruction, but of those 32 bits there still remain quite a few to allow for a larger instruction with more options (though it loses the one-to-one relationship between Thumb and ARM instructions: all original Thumb instructions, unlike the Thumb-2 extensions, map directly onto a full-sized ARM instruction).
Each core is different; they don't all fetch a word at a time, and Thumb-2 extensions don't have to be aligned, so a whole Thumb-2 instruction won't necessarily fit in an aligned word fetch on the processors that do word fetches. Think of the (pre)fetcher and the decoder as two separate things, since they are. Functionally, the decoder takes 16 bits at a time in Thumb mode. How is it specifically implemented? I don't know. Do they wait for two halfwords to be ready before decoding the first? I don't know. Is every implementation the same? I don't know; I would expect not. As far as fetching goes, they are not all the same, as you can see in the ARM documentation, and for at least one core, if not more, the chip vendor can choose the fetch width at compile time.
If you are coming from, say, a MIPS-based textbook and trying to understand other processors, this can be confusing. Understand that those textbooks and terms are for building understanding and vocabulary: real pipelines do not generally have that textbook shape, and processors do not generally fetch whole instructions one at a time (the x86 does not fetch one byte at a time; it fetches MANY instructions at a time). RISC-V has an even worse problem than ARM and MIPS: you can have 16-bit compressed instructions, 32-bit instructions, and 64-bit instructions, and the 32-bit instructions (and the 64-bit ones) do not have to be aligned on RISC-V, so fetching 32 bits at a time doesn't get you a whole instruction. The fetcher is separate from the decoder; once enough has arrived, the decoder can complete.
I want to say that in Thumb the PC is two halfwords ahead (whether or not the instruction is a Thumb-2 extension), so PC+4. It should be easy to figure out, though:
Disassembly of section .text:
00000000 <hello-0xe>:
0: e005 b.n e <hello>
2: bf00 nop
4: bf00 nop
6: f000 b802 b.w e <hello>
a: bf00 nop
c: bf00 nop
0000000e <hello>:
e: bf00 nop
Yes, so it is two Thumb-sized halfwords ahead (PC+4) in both cases. It would be significantly more complicated if it were two instructions ahead, which is how it used to be described to make it easy to remember. If it were literally two instructions ahead, it would sometimes be PC+4, sometimes PC+6, and sometimes PC+8, and the logic would have to decode two instructions just to know how the PC was offset for the first of the two. So sticking with PC+4, as it has always been for Thumb mode, is the sane way to do it.
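As a quick cross-check, here is a small sketch in C that recomputes the b.n target from the listing above using PC+4 (the address and encoding come from the disassembly; the field split follows the 16-bit B encoding, 11100 followed by an 11-bit halfword offset):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint16_t insn = 0xE005;   /* the b.n at address 0 */
    uint32_t addr = 0x0;

    int32_t imm11 = insn & 0x7FF;
    if (imm11 & 0x400)
        imm11 -= 0x800;       /* sign-extend the 11-bit offset */

    uint32_t target = addr + 4 + ((uint32_t)imm11 << 1);
    printf("0x%X\n", target); /* prints 0xE, matching <hello> */
    return 0;
}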
I'm working on an STM32L475 microcontroller, which has a Cortex-M4 processor running the ARM/Thumb instruction sets. I see (from objdump) that there are beq.n and bne.n instructions in the binary of an ARM program (I added the -mthumb flag when compiling). However, I can't find these branch instructions in the latest ARMv7-M manual.
Can anyone tell me the reason? And which instructions in the manual are equivalent to these two branch instructions?
beq and bne are conditional branches; in other words, they are conditional versions of the unconditional branch b. eq and ne are two different condition codes; they are described in section A7.3. beq means branch if equal and bne means branch if not equal.
The b branch instruction has two different encodings in Thumb mode. The encoding you're seeing is probably encoding T1 described in section A7.7.12:
B<c> <label>
15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
1 1 0 1 [-<cond>--] [--------imm8---------]
In this encoding, the condition code (like eq or ne) is encoded directly into the instruction, in bits 8-11. The disassembly from objdump displays the condition code in place of <c> above. So using the condition code table in section A7.3, you would encode beq as 11010000[imm8].
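As an illustration (my own helper, not something from the manual), assembling encoding T1 by hand could look like this:

#include <stdint.h>

/* Encoding T1 of B<c>: 1101 cond imm8, where imm8 is the signed byte
   offset from the PC (= instruction address + 4) divided by 2. */
uint16_t encode_b_t1(uint8_t cond, int32_t byte_offset)
{
    return (uint16_t)(0xD000 | ((cond & 0xF) << 8) | ((byte_offset >> 1) & 0xFF));
}

For example, encode_b_t1(0x0, 8) gives 0xD004, a beq that lands 8 bytes past the PC.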
Before I learned a bit of assembly, I had heard that you had to "program directly on the hardware" and "do everything from scratch." For example, to write a character without an operating system, I thought I would have to know how my monitor worked and draw the character pixel by pixel.
So I got interested and learned a little. And I saw it was not so "close to the metal." Now I'd like someone to explain to me how this works, and whether it is possible to go deeper and really control all the hardware.
Here is code that prints a character:
[BITS 16]                    ; real-mode code
[ORG 0x7C00]                 ; the BIOS loads the boot sector at this address
MOV AL, 65                   ; ASCII code for 'A'
CALL PrintCharacter
JMP $                        ; hang forever
PrintCharacter:
MOV AH, 0x0E                 ; BIOS teletype-output function
MOV BH, 0x00                 ; page number 0
MOV BL, 0x07                 ; attribute: light grey on black
INT 0x10                     ; BIOS video-services interrupt
RET
TIMES 510 - ($ - $$) db 0    ; pad the sector to 510 bytes
DW 0xAA55                    ; boot signature
Below assembler there is machine code.
However, machine code instructions have a 1:1 relation to assembly instructions, so there is nothing that can be done in machine code which cannot be done in assembler.
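For instance, the first instruction of the boot sector above, MOV AL, 65, assembles to exactly two bytes (shown here as a C byte array for concreteness):

/* B0 is the opcode for MOV AL, imm8; 0x41 is 65 */
static const unsigned char mov_al_65[] = { 0xB0, 0x41 };

The assembly mnemonic and those bytes say exactly the same thing; the assembler is just a mechanical translator between the two.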
In the early times of computing, there were computers where you had to enter the machine code directly. The MITS Altair 680b is one example of such a computer.
It had a lot of front panel switches which allowed you to modify the content of the RAM without (!) using the CPU: The CPU was stopped when the front panel switches were in use. You had to translate assembler code into binary code and load the program into the RAM this way. Then you started the CPU.
Later, the KIM-1 computer (said to be the first affordable hobbyist computer) was released. This computer allowed entering the machine code as hexadecimal digits, but in contrast to the MITS computer, a program running in the background (that is, running on the CPU) was responsible for writing the data entered on the keyboard into the RAM.
In theory it is still possible to enter Windows programs in hexadecimal code (using a hex editor) if you want to. However, this brings no benefit compared to assembler code!
I am working on a software-based implementation of an ARM processor in C.
Given an ARM data processing instruction:
instruction = 0xE3A01808
1110 00 1 1101 0 0000 0001 1000 00001000
cond    I opc  S Rn   Rd   rot  imm8
Which translates to: MOV r1, #8, shifted by 8 bits.
How do I check whether the 8-bit shift is a right or a left shift?
With ARM 12-bit modified immediate constants, there is no shift, in any direction - it's a rotation, specifically <7:0> rotated right by 2*<11:8>. Thus the encoding 0x808 represents 8 ROR (2*8), meaning 0xE3A01808 disassembles to mov r1, #0x80000.
(Note that the canonical encoding of a modified immediate constant is the one with the smallest rotation, so mov r1, #0x80000 would assemble to 0xE3A01702, i.e. 2 ROR 14, rather than 8 ROR 16 [1].)
As for implementing bitwise rotation in C, there are either compiler intrinsics or the standard shift-part-in-each-direction idiom x>>n | x<<(32-n).
[1] To get a specific encoding, UAL assembly allows an immediate syntax with the constant and rotation specified separately, i.e. mov r1, #8, 16. For full detail, this is all spelled out in the ARM ARM (section A5.2.4 in the v7 issue C I have here) - essentially, the choice of encodings permits a little funny business with flags in certain situations.
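Putting the decode rule into code, a minimal sketch (helper names are mine; the guard against a zero rotation avoids the undefined 32-bit shift in the idiom above):

#include <stdio.h>
#include <stdint.h>

/* Decode an ARM modified immediate: <7:0> rotated right by 2*<11:8>. */
uint32_t decode_imm12(uint32_t imm12)
{
    uint32_t imm8 = imm12 & 0xFF;
    uint32_t rot  = ((imm12 >> 8) & 0xF) * 2;
    if (rot == 0)
        return imm8;  /* x << (32-0) would be undefined behavior */
    return (imm8 >> rot) | (imm8 << (32 - rot));
}

int main(void)
{
    printf("0x%X\n", decode_imm12(0x808));  /* prints 0x80000: 8 ROR 16 */
    return 0;
}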
I'm not sure if this is what you're referring to, but here's some documentation that seems relevant:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0040d/ch05s05s01.html
The 'WITHDRAWN' pasted over the docs doesn't inspire much confidence.
It would seem to suggest a rotate right, which matches how I remember the barrel shifter being used on ARM more generally (there's a ROR operation, for instance; see http://www.davespace.co.uk/arm/introduction-to-arm/barrel-shifter.html).
In this compiler output, I'm trying to understand how machine-code encoding of the nopw instruction works:
00000000004004d0 <main>:
4004d0: eb fe jmp 4004d0 <main>
4004d2: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
4004d9: 1f 84 00 00 00 00 00
There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of the bytes at 4004d2-4004e0? From looking at the opcode list, it seems that the 66 .. codes are multi-byte expansions. I feel I can probably get a better answer here than I would by grokking the opcode list for a few hours.
That asm output is from the following (insane) code in C, which optimizes down to a simple infinite loop:
long i = 0;

main() {
    recurse();
}

recurse() {
    i++;
    recurse();
}
When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it actually loops in main() without ever calling the recurse() function.
editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.
The 0x66 bytes are an "operand-size override" prefix. Having more than one of these is equivalent to having one.
The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise - which is why it shows up in the assembly mnemonic).
0x0f 0x1f is a 2-byte opcode for a NOP that takes a ModRM byte.
0x84 is a ModRM byte which in this case codes for an addressing mode that uses 5 more bytes.
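Putting that breakdown together, the 14 padding bytes decode like this (annotation mine, following the byte meanings above):

/* the 14 bytes at 0x4004d2, one long NOP */
static const unsigned char nopw[14] = {
    0x66, 0x66, 0x66, 0x66, 0x66,  /* operand-size override prefixes */
    0x2e,                          /* CS segment override, a no-op here */
    0x0f, 0x1f,                    /* two-byte NOP opcode taking a ModRM byte */
    0x84,                          /* ModRM: mod=10 reg=000 rm=100 -> SIB + disp32 */
    0x00,                          /* SIB: base=rax, index=rax, scale=1 */
    0x00, 0x00, 0x00, 0x00,        /* disp32 = 0 */
};

That matches the disassembly %cs:0x0(%rax,%rax,1) byte for byte.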
Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than five more prefix bytes.
On the AMD K8 decoders, from Agner Fog's microarch pdf:
Each of the instruction decoders can handle three prefixes per clock
cycle. This means that three instructions with three prefixes each can
be decoded in the same clock cycle. An instruction with 4 - 6 prefixes
takes an extra clock cycle to decode.
Essentially, those bytes are one long NOP instruction that will never get executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is
-falign-functions=16. For NOPs that will be executed, the optimal choice of long-NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.
You can find the details on the instruction encoding in AMD's tech ref documents:
http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
Mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).
The assembler (not the compiler) pads code up to the next alignment boundary with the longest NOP instruction it can find that fits. This is what you're seeing.
I would guess this is just the branch-delay instruction.
I believe that the nopw is junk - i is never read in your program, and there is thus no need to increment it.