Long mode (64 bit) relative call with a 64 bit immediate value - c

Is it possible? Intel documentation says opcode E8 can be used with a relative displacement value.
E8 cd CALL rel32
"Call near, relative,
displacement relative to next instruction. 32-bit displacement sign extended to 64-bits in 64-bit mode."
Does it mean only 32 bit displacements are allowed? I am quite unclear on the wording here.

Yes. It means that the opcode is followed by a 32-bit displacement. If you want longer, you can compute it yourself with an lea and an indirect call.

Related

Have problem understanding this assembly code from CS:APP [duplicate]

I am reading about x86-64 (and assembly in general) through the book "computer systems a programmer's perspective"(3rd edition). The author, in compliance with other sources from the web, states that idivq takes one operand only - just as this one claims. But then, the author, some chapters later, gives an example with the instruction idivq $9, %rcx.
Two operands? I first thought this was a mistake but it happens a lot in the book from there.
Also, the dividend should be given from the quantity in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits) - so if this is defined in the architecture then it does not seem possible that the second operand could be a specified dividend.
Here is an example of an exercise (too lazy to write it all down - so a picture is the way to go). It claims that GCC emits idivq $9, %rcx when compiling a short C function.
That's a mistake. Only imul has immediate and 2-register forms.
mul, div, or idiv still only exist in the one-operand form introduced with 8086, using RDX:RAX as the implicit double-width operand for output (and input for division).
Or EDX:EAX, DX:AX, or AH:AL, depending on operand-size of course. Consult an ISA reference like Intel's manual, not this book! https://www.felixcloutier.com/x86/idiv
Also see When and why do we sign extend and use cdq with mul/div? and Why should EDX be 0 before using the DIV instruction?
x86-64's only hardware division instructions are idiv and div. 64-bit mode removed aam, which does 8-bit division by an immediate. (Dividing in Assembler x86 and Displaying Time in Assembly has an example of using aam in 16-bit mode).
Of course for division by constants idiv and div (and aam) are very inefficient. Use shifts for powers of 2, or a multiplicative inverse otherwise, unless you're optimizing for code-size instead of performance.
CS:APP 3e Global Edition apparently has multiple serious x86-64 instruction-set mistakes like this in practice problems, claiming that GCC emits impossible instructions. Not just typos or subtle mistakes, but misleading nonsense that's very obviously wrong to people familiar with the x86-64 instruction set. It's not just a syntax mistake, it's trying to use instructions that aren't encodeable (no syntax can exist to express them, other than a macro that expands to multiple instructions. Defining idivq as a pseudo-instruction using a macro would be pretty weird).
e.g. I correctly guessed missing part of a function, but gcc generated assembly code doesn't match the answer is another one where it suggests that (%rbx, %rdi, %rsi) and (%rsi, %rsi, 9) are valid addressing modes! The scale factor is actually a 2-bit shift count so these are total garbage and a sign of a serious lack of knowledge by the authors about the ISA they're teaching, not a typo.
Their code won't assemble with any AT&T syntax assembler.
Also What does this x86-64 addq instruction mean, which only have one operand? (From CSAPP book 3rd Edition) is another example, where they have a nonsensical addq %eax instead of inc %rdx, and a mismatched operand-size in a mov store.
It seems that they're just making stuff up and claiming it was emitted by GCC. IDK if they start with real GCC output and edit it into what they think is a better example, or actually write it by hand from scratch without testing it.
GCC's actual output would have used multiplication by a magic constant (fixed-point multiplicative inverse) to divide by 9 (even at -O0, but this is clearly not debug-mode code. They could have used -Os).
Presumably they didn't want to talk about Why does GCC use multiplication by a strange number in implementing integer division? and replaced that block of code with their made-up instruction. From context you can probably figure out where they expect the output to go; perhaps they mean rcx /= 9.
These errors are from 3rd-party practice problems in the Global Edition
From the publisher's web site (https://csapp.cs.cmu.edu/3e/errata.html)
Note on the Global Edition: Unfortunately, the publisher arranged for the generation of a different set of practice and homework problems in the global edition. The person doing this didn't do a very good job, and so these problems and their solutions have many errors. We have not created an errata for this edition.
So CS:APP 3e is probably a good textbook, as long as you get the North American edition, or ignore the practice / homework problems. This explains the huge disconnect between the textbook's reputation and wide use vs. the serious and obvious (to people familiar with x86-64 asm) errors like this one that go beyond sloppy into don't-know-the-language territory.
How a hypothetical idiv reg, reg or idiv $imm, reg would be designed
Also, the dividend should be given from the quantity in registers %rdx (high-order 64 bits) and %rax (low-order 64 bits) - so if this is defined in the architecture then it does not seem possible that the second operand could be a specified dividend.
If Intel or AMD had introduced a new convenient forms for div or idiv, they would have designed it to use a single-width dividend because that's how compilers always use it.
Most languages are like C and implicitly promote both operands for + - * / to the same type and produce a result of that width. Of course if the inputs are known to be narrow that can be optimized away. (e.g. using one imul r32 to implement a * (int64_t)b).
But div and idiv fault if the quotient overflows so it's not safe to use a single 32-bit idiv when compiling int32_t q = (int64_t)a / (int32_t)b.
Compilers always use xor edx,edx before DIV or cdq or cqo before IDIV to actually do n / n => n-bit division.
Real full-width division using a dividend that isn't just zero- or sign-extended is only done by hand with intrinsics or asm (because gcc/clang and other compilers don't know when the optimization is safe), or in gcc helper functions that do e.g. 64-bit / 64-bit division in 32-bit code. (Or 128-bit division in 64-bit code).
So what would be most helpful is a div/idiv that avoids the extra instruction to set up RDX, too, as well as minimizing the number of implicit register operands. (Like imul r32, r/m32 and imul r32, r/m32, imm do: making the common case of non-widening multiplication more convenient with no implicit registers. That's Intel-syntax like the manuals, destination first)
The simplest way would be a 2-operand instruction that did dst /= src. Or maybe replaced both operands with quotient and remainder. Using a VEX encoding for 3 operands like BMI1 andn, you could maybe have
idivx remainder_dst, dividend, divisor. With the 2nd operand also an output for the quotient. Or you could have the remainder written to RDX with a non-destructive destination for the quotient.
Or more likely to optimize for the simple case where only the quotient is needed, idivx quot, dividend, divisor and not store the remainder anywhere. You can always use regular idiv when you want the quotient.
BMI2 mulx uses an implicit rdx input operand because its purpose is to allow multiple dep chains of add-with-carry for extended-precision multiply. So it still has to produce 2 outputs. But this hypothetical new form of idiv would exist to save code-size and uops around normal uses of idiv that aren't widening. So 386 imul reg, reg/mem is the point of comparison, not BMI2 mulx.
IDK if it would make sense to introduce an immediate form of idivx as well; you'd only use it for code-size reasons. Multiplicative inverses are more efficient division by constants so there's very little real-world use-case for such an instruction.
I think your book has made a mistake.
idivq only has one operand. If I try to assemble this snippet:
idivq $9, %rcx
I get this error:
test.s: Assembler messages:
test.s:1: Error: operand type mismatch for `idiv'
This works:
idivq %rcx
but you probably already know that.
It may also be a macro (unlikely, but possible. credit to #HansPassant for this).
Perhaps you should contact the book's author so that they can add an entry to the errata.
Interestingly, gas seems to allow the following:
mov $20, %rax
mov $0, %rdx
mov $5, %rcx
idivq %rcx, %rax
ret
This is still performing the one operand division under the hood, but it LOOKS like two-operand form. As long as the first operand is a register and the second operand is specifically %rax, this works. However, in general idivq seems to require the one operand form.

decode ARM BL instruction

I'm just getting started with the ARM architecture on my Nucleo STM32F303RE, and I'm trying to understand how the instructions are encoded.
I have running a simple LED-blinking program, and the first few disassembled application instructions are:
08000188: push {lr}
0800018a: sub sp, #12
235 __initialize_hardware_early ();
0800018c: bl 0x80005b8 <__initialize_hardware_early>
These instructions resolve to the following in the hex file (displayed weird in Eclipse -- each 32-bit word is in MSB order, but Eclipse doesn't seem to know it... but that's for another topic):
address 0x08000188: B083B500 FA14F000
Using the ARM Architecture Ref Manual, I've confirmed the first 2 instructions, push (0xB500) and sub (0xB083). But I can't make any sense out of the "bl" instruction.
The hex instruction is 0xFA14F000. The Ref Manual says it breaks down like this:
31.28 27 26 25 24 23............0
cond 1 0 1 L signed_immed_24
The first "F" (0xF......) makes sense: all conditions are set (ALways).
The "A" doesn't make sense though, since the L bit should be set (1011). Shouldn't it be 0xFB......?
And the signed_immed_24 doesn't make sense, either. The ref manual says:
- start with 0x14F000
- sign extend to 30 bits (signed 2's-complement), giving 0x0014F000
- shift left to form 32-bit value, giving 0x0053C000
- add to the PC, which is the current instruction + 8, giving 0x0800018c + 8 + 0x0053C000, or 0x0853C194.
So I get a branch address of 0x0853C194, but the disassembly shows 0x080005B8.
What am I missing?
Thanks!
-Eric
bl is two, separate, 16 bit instructions. The armv5 (and older) ARM ARM does a better job of documenting them.
111HHoffset11
From the ARM ARM
The first Thumb instruction has H == 10 and supplies the high part of
the branch offset. This instruction sets up for the subroutine call
and is shared between the BL and BLX forms.
The second Thumb instruction has H == 11 (for BL) or H == 01 (for
BLX). It supplies the low part of the branch offset and causes the
subroutine call to take place.
0xFA14 0xF000
0xF000 is the first instruction upper offset is zeros
0xFA14 is the second instruction offset is 0x214
If starting at 0x0800018c then it is 0x0800018C + 4 + (0x0000214<<1) = 0x080005B8. The 4 is the two instructions head for the current PC. And the offset is units of (16 bit) instructions.
I guess the armv7-m ARM ARM covers it as well, but is harder to read, and apparently features were added. But they do not affect you with this branch link.
The ARMv5 ARM ARM does a better job of describing what happens as well. you can certaily take these two separate instructions and move them apart
.byte 0x00,0xF0
nop
nop
nop
nop
nop
.byte 0x14,0xFA
and it will branch to the same offset (relative to the second instruction). Maybe the broke that in some cores, but I know in some (after armv5) it works.

Why is PC still 32 bits in Thumb mode on ARM processor?

As Thumb mode can reference 16-bit of the general registers (r1-r14), why PC (r15) is still 32-bit? Will b Label in this mode update all 32 bits of PC, or just lower 16 bits?
Despite the question being based on an entirely incorrect premise, there is a cheeky sort-of-answer to the title question taken literally - being in Thumb mode means you're on a v4T or later architecture which means you must have a full 32 bits of PC, rather than the pre-v3 architectures where it was only 26 bits.
I'm not sure where you got the idea that Thumb operates on 16 bits of a register - the restriction with must Thumb instructions is that they can only access the "low registers" r0-r7, due to limited encoding space. There are 3 different "32-bit"s at play in ARM - the register width, the address space (which wasn't always the case as mentioned above), and the size of the fixed-width instruction encoding. Thumb only changes the third one. One result of this size reduction is that most Thumb encodings only have 3 bits per operand to encode register numbers - only a handful of instructions can spare the extra bits to encode the "high registers" r8-r15.
In operation, Thumb instructions are no different to the equivalent subset of ARM instructions - Thumb is just an alternative fetch/decode stage on the front of the same pipeline, after all.
thumbv1 instructions are 16 bits in size but what does that have to do with the size of the registers? (Nothing) the registers including r15 are 32 bits all the time. Branches with immediate values are simply offsets not absolute addresses.
this is all clearly documented in the ARM architectural reference manuals infocenter.arm.com

In ARM Processor, How to determine if a shift is right or left shift?

I am working on a software-based implementation of ARM processor in C.
Given an ARM data processing instruction:
instruction = 0xE3A01808; 1110 0 0 1 1101 0 0000 0001 1000 00001000
Which translates to: MOV r0,#8; shifted by 8 bits.
How to check whether the 8 bit shift is right or left shift?
With ARM 12-bit modified immediate constants, there is no shift, in any direction - it's a rotation, specifically, <7:0> rotated right by 2*<11:8>. Thus the encoding 0x808 represents 8 ROR (2*8), meaning 0xE3A01808 disassembles to mov, r1, #0x80000.
(Note that the canonical encoding of a modified immediate constant is the one with the smallest rotation, so mov, r1, #0x80000 would assemble to 0xE3A01702, i.e. 2 ROR 14, rather than 8 ROR 161).
As for implementing bitwise rotation in C, to solve that there's either compiler intrinsics or the standard shift-part-in-each-direction idiom x>>n | x<<(32-n).
[1] To get a specific encoding, UAL assembly allows an immediate syntax with the constant and rotation specified separately, i.e. mov r1, #8, 16. For full detail, this is all spelled out in the ARM ARM (section A5.2.4 in the v7 issue C I have here) - essentially, the choice of encodings permits a little funny business with flags in certain situations.
I'm not sure if this is what you're referring to, but here's some documentation that seems relevant:
http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0040d/ch05s05s01.html
The 'WITHDRAWN' pasted over the docs doesn't inspire much confidence.
It would seem to suggest a rotate right. Which plays with how I remember the barrel shifter being addressed on arm more generally (There's a ROR operation for instance, see http://www.davespace.co.uk/arm/introduction-to-arm/barrel-shifter.html)

AMD64 -- nopw assembly instruction?

In this compiler output, I'm trying to understand how machine-code encoding of the nopw instruction works:
00000000004004d0 <main>:
4004d0: eb fe jmp 4004d0 <main>
4004d2: 66 66 66 66 66 2e 0f nopw %cs:0x0(%rax,%rax,1)
4004d9: 1f 84 00 00 00 00 00
There is some discussion about "nopw" at http://john.freml.in/amd64-nopl. Can anybody explain the meaning of 4004d2-4004e0? From looking at the opcode list, it seems that 66 .. codes are multi-byte expansions. I feel I could probably get a better answer to this here than I would unless I tried to grok the opcode list for a few hours.
That asm output is from the following (insane) code in C, which optimizes down to a simple infinite loop:
long i = 0;
main() {
recurse();
}
recurse() {
i++;
recurse();
}
When compiled with gcc -O2, the compiler recognizes the infinite recursion and turns it into an infinite loop; it does this so well, in fact, that it actually loops in the main() without calling the recurse() function.
editor's note: padding functions with NOPs isn't specific to infinite loops. Here's a set of functions with a range of lengths of NOPs, on the Godbolt compiler explorer.
The 0x66 bytes are an "Operand-Size Override" prefix. Having more than one of these is equivalent to having one.
The 0x2e is a 'null prefix' in 64-bit mode (it's a CS: segment override otherwise - which is why it shows up in the assembly mnemonic).
0x0f 0x1f is a 2 byte opcode for a NOP that takes a ModRM byte
0x84 is ModRM byte which in this case codes for an addressing mode that uses 5 more bytes.
Some CPUs are slow to decode instructions with many prefixes (e.g. more than three), so a ModRM byte that specifies a SIB + disp32 is a much better way to use up an extra 5 bytes than five more prefix bytes.
AMD K8 decoders in Agner Fog's microarch pdf:
Each of the instruction decoders can handle three prefixes per clock
cycle. This means that three instructions with three prefixes each can
be decoded in the same clock cycle. An instruction with 4 - 6 prefixes
takes an extra clock cycle to decode.
Essentially, those bytes are one long NOP instruction that will never get executed anyway. It's in there to ensure that the next function is aligned on a 16-byte boundary, because the compiler emitted a .p2align 4 directive, so the assembler padded with a NOP. gcc's default for x86 is
-falign-functions=16. For NOPs that will be executed, the optimal choice of long-NOP depends on the microarchitecture. For a microarchitecture that chokes on many prefixes, like Intel Silvermont or AMD K8, two NOPs with 3 prefixes each might have decoded faster.
The blog article the question linked to ( http://john.freml.in/amd64-nopl ) explains why the compiler uses a complicated single NOP instruction instead of a bunch of single-byte 0x90 NOP instructions.
You can find the details on the instruction encoding in AMD's tech ref documents:
http://developer.amd.com/documentation/guides/pages/default.aspx#manuals
Mainly in the "AMD64 Architecture Programmer's Manual Volume 3: General Purpose and System Instructions". I'm sure Intel's technical references for the x64 architecture will have the same information (and might even be more understandable).
The assembler (not the compiler) pads code up to the next alignment boundary with the longest NOP instruction it can find that fits. This is what you're seeing.
I would guess this is just the branch-delay instruction.
I belive that the nopw is junk - i is never read in your program, and there are thus no need to increment it.

Resources