How to push byte 'AL' to C function correctly? - c

I'm having some issues at the moment. I'm not sure whether this is actually the problem with my program, but it's one thing I'm not 100% sure about, so I'm going to use it as a learning opportunity.
I have this instruction: in al, 0x60 to read the scancode from the keyboard.
I'm trying to send this scancode to a function written in C. The C function declaration looks like: void cFunction(unsigned int scancode).
So basically, here is what I'm doing:
in al, 0x60
movzx EAX, AL
push EAX
call cFunction
The goal is to get a value like this into the C function: 0x10, which would mean the Q was pressed, 0x11 the W was pressed, 0x12 is E, and so on...
Questions:
Is what I'm doing passing the right value to the function or not?
Is the result going to be different if I were to push only AX instead of EAX?
I only need the byte AL, but obviously I cannot push AL, so I've been zero-extending it into EAX. So, let's say Q was pressed and I compare the result like if(scancode == 0x10): would that work correctly regardless of whether EAX or AX was pushed? Or do I only need to get the value of AL into scancode? If not, how can I go about getting AL to the function?

The answer depends on what calling convention your C compiler uses. If it is standard cdecl, then yes, generally you're doing it right.
Some notes:
It is better to use C data types with an exact size in bytes, like uint32_t, than int, whose size isn't fixed. These data types are defined in stdint.h.
If you want to use AX instead of EAX, you must define your function as
void cFunction(uint16_t scancode).
Since AL is part of AX (and EAX), it is better to just zero AX (or EAX) before reading the key scancode than to extend it with MOVZX after reading. The typical way to do it in assembly is to XOR the register with itself:
XOR EAX, EAX
Moving zero into the register
MOV EAX, 0
is also correct, but XOR is shorter and generally at least as fast (and using it to zero a register is something of a tradition by now).
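Putting these notes together, here is a minimal sketch of the whole cdecl call (an assumption: NASM syntax, with the C side declared as void cFunction(uint32_t scancode)):
xor eax, eax ; zero the full register up front
in al, 0x60 ; the scancode lands in the low byte of a clean EAX
push eax ; cdecl passes arguments on the stack
call cFunction ; the symbol may need a leading underscore on some toolchains
add esp, 4 ; cdecl: the CALLER removes the argument
Note the add esp, 4: under cdecl the caller is responsible for popping the arguments it pushed.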

So there are a few things to talk about here. Let's cover them one at a time.
First, I'm assuming you're running either without an OS (i.e., making your own OS) or running in an OS like MS-DOS that doesn't get in the way of I/O. As Olaf rightly pointed out, in al, 0x60 isn't going to work on a modern, protected-mode OS; nor will it work on a USB keyboard (unless you're running in an emulator or virtual machine that is pretending to provide a classic PS/2 keyboard).
Second, I'm assuming you're programming on a 32-bit CPU. The C ABI (application binary interface) is different for a 16-bit CPU, a 32-bit CPU, and a 64-bit CPU, so the code will read differently depending on which CPU you're using.
Third, port 60h is a weird beast, and it's been a long time since I wrote a driver for it. A lot of the time, the values you read aren't the values you expect; there are the E0h extended codes; and there's the behavior of the Pause key. Writing a bug-free keyboard driver is a lot harder than it looks. But let's ignore all that for this question.
So let's assume you have no OS to get in the way, and a 32-bit CPU, and only the most basic keystrokes. How would you pass data from the keyboard to a C function? Pretty much the way you did:
in al, 0x60
movzx eax, al
push eax
call cFunction
Why is this correct?
Well, the first line loads an 8-bit register from the keyboard port; and it must be al, because that's the only 8-bit register that in can write to.
The 32-bit C ABI expects a function's parameters to be pushed onto the stack in reverse order. So in order to call the C function, there must be a push instruction before it. However, in 32-bit mode, push instructions normally push a full 32 bits (a 16-bit push is possible with an operand-size prefix, but it isn't what you want here), so you push eax, ebx, esi, edi, and so on: you can't push al directly. (Even if you could, technically, using direct stack writes, the result would be misaligned, because the 32-bit ABI expects every stack slot to be aligned to a 4-byte boundary.) So the only way to push the value is first to promote it from 8 bits to 32 bits, and movzx does that nicely.
There are, for what it's worth, other ways to do it. You could clear eax before the in:
xor eax, eax
in al, 0x60
push eax
call cFunction
This solution is a bit worse than the original for performance: it pays the cost of a partial-register stall. The processor internally doesn't actually keep al as a part of eax, but rather as a separate register, and any attempt to mix the different-sized sub-pieces of a register forces the processor to stall before it can proceed. When you push eax here, the processor notices that al was mutated by the previous instruction, and stalls briefly to merge al's bits back into eax, so that al still looks like it was a part of eax.
It's worth pointing out that if you're in classic 16-bit mode (8086, or '286 protected mode), the calling sequence is slightly different. movzx doesn't exist before the 386, so you zero the high byte some other way:
in al, 0x60
mov ah, 0
push ax
call cFunction
In this case, int is 16-bit-sized, so doing everything as 16 bits is correct. In 64-bit mode, the situation changes more fundamentally: the mainstream 64-bit ABIs don't pass the first few integer arguments on the stack at all. Under the x86-64 System V ABI, the first integer argument goes in edi/rdi:
in al, 0x60
movzx edi, al
call cFunction
Writing edi automatically zeroes the upper 32 bits of rdi, so no separate extension to 64 bits is needed. If a function takes so many integer arguments that some spill to the stack (more than six under System V), each stack argument occupies an 8-byte slot: you push a full 64-bit register even for an int parameter, and the C function simply reads the low 32 bits of that slot.
So there you have it. Various ways of interacting with the C ABI to get your port data into your function, depending on your CPU and environment.

Related

CSAPP: Why use subq followed by movq when we have pushq already [duplicate]

I believe push/pop instructions will result in more compact code, and may even run slightly faster. This requires disabling stack frames as well, though.
To check this, I would need to either rewrite a large enough program in assembly by hand (to compare them), or install and study a few other compilers (to see if they have an option for this, and to compare the results).
Here is the forum topic about this and similar problems.
In short, I want to understand which code is better. Code like this:
sub esp, c
mov [esp+8],eax
mov [esp+4],ecx
mov [esp],edx
...
add esp, c
or code like this:
push eax
push ecx
push edx
...
add esp, c
What compiler can produce the second kind of code? They usually produce some variation of the first one.
You're right: push is a minor missed optimization with all 4 major x86 compilers. There's some code-size, and thus indirectly performance, to be had. Or maybe more directly a small amount of performance in some cases, e.g. saving a sub rsp instruction.
But if you're not careful, you can make things slower with extra stack-sync uops by mixing push with [rsp+x] addressing modes. pop doesn't sound useful, just push. As the forum thread you linked suggests, you only use this for the initial store of locals; later reloads and stores should use normal addressing modes like [rsp+8]. We're not talking about trying to avoid mov loads/stores entirely, and we still want random access to the stack slots where we spilled local variables from registers!
Modern code generators avoid using PUSH. It is inefficient on today's processors because it modifies the stack pointer, that gums-up a super-scalar core. (Hans Passant)
This was true 15 years ago, but compilers are once again using push when optimizing for speed, not just code-size. Compilers already use push/pop for saving/restoring call-preserved registers they want to use, like rbx, and for pushing stack args (mostly in 32-bit mode; in 64-bit mode most args fit in registers). Both of these things could be done with mov, but compilers use push because it's more efficient than sub rsp,8 / mov [rsp], rbx. gcc has tuning options to avoid push/pop for these cases, enabled for -mtune=pentium3 and -mtune=pentium, and similar old CPUs, but not for modern CPUs.
Intel since Pentium-M and AMD since Bulldozer(?) have a "stack engine" that tracks the changes to RSP with zero latency and no ALU uops, for PUSH/POP/CALL/RET. Lots of real code was still using push/pop, so CPU designers added hardware to make it efficient. Now we can use them (carefully!) when tuning for performance. See Agner Fog's microarchitecture guide and instruction tables, and his asm optimization manual. They're excellent. (And other links in the x86 tag wiki.)
It's not perfect; reading RSP directly (when the offset from the value in the out-of-order core is nonzero) does cause a stack-sync uop to be inserted on Intel CPUs. e.g. push rax / mov [rsp-8], rdi is 3 total fused-domain uops: 2 stores and one stack-sync.
On function entry, the "stack engine" is already in a non-zero-offset state (from the call in the parent), so using some push instructions before the first direct reference to RSP costs no extra uops at all. (Unless we were tailcalled from another function with jmp, and that function didn't pop anything right before jmp.)
It's kind of funny that compilers have been using dummy push/pop instructions just to adjust the stack by 8 bytes for a while now, because it's so cheap and compact (if you're doing it once, not 10 times to allocate 80 bytes), but aren't taking advantage of it to store useful data. The stack is almost always hot in cache, and modern CPUs have excellent store / load bandwidth to L1d.
int extfunc(int *,int *);
void foo() {
    int a=1, b=2;
    extfunc(&a, &b);
}
compiles with clang6.0 -O3 -march=haswell on the Godbolt compiler explorer. See that link for all the rest of the code, and many different missed optimizations and silly code-gen (see my comments in the C source pointing out some of them):
# compiled for the x86-64 System V calling convention:
# integer args in rdi, rsi (,rdx, rcx, r8, r9)
push rax # clang / ICC ALREADY use push instead of sub rsp,8
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1 # 6 bytes: opcode + modrm + imm32
mov rsi, rsp # special case for lea rsi, [rsp + 0]
mov dword ptr [rsi], 2
call extfunc(int*, int*)
pop rax # and POP instead of add rsp,8
ret
And very similar code with gcc, ICC, and MSVC, sometimes with the instructions in a different order, or gcc reserving an extra 16B of stack space for no reason. (MSVC reserves more space because it's targeting the Windows x64 calling convention which reserves shadow space instead of having a red-zone).
clang saves code-size by using the LEA results for store addresses instead of repeating RSP-relative addresses (SIB+disp8). ICC and clang put the variables at the bottom of the space they reserved, so one of the addressing modes avoids a disp8. (With 3 variables, reserving 24 bytes instead of 8 was necessary, and clang didn't take advantage then.) gcc and MSVC miss this optimization.
But anyway, more optimal would be:
push 2 # only 2 bytes
lea rdi, [rsp + 4]
mov dword ptr [rdi], 1
mov rsi, rsp # special case for lea rsi, [rsp + 0]
call extfunc(int*, int*)
# ... later accesses would use [rsp] and [rsp+4] if needed, not pop
pop rax # alternative to add rsp,8
ret
The push is an 8-byte store, and we overlap half of it. This is not a problem: CPUs can store-forward the unmodified low half efficiently even after storing the high half. Overlapping stores in general are not a problem; in fact, glibc's well-commented memcpy implementation uses two (potentially) overlapping loads + stores for small copies (up to the size of 2x xmm registers at least), loading everything and then storing everything without caring about whether or not there's overlap.
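To visualize the overlap in the sequence above, here is the resulting layout (a comment-only sketch of the stores shown above):
; push 2 is an 8-byte store of 0x0000000000000002:
; [rsp] = 2 (low dword) -> becomes int b
; [rsp+4] = 0 (high dword) -> then overwritten by mov dword ptr [rdi], 1 -> int a
; so rsi = rsp = &b and rdi = rsp+4 = &a, matching extfunc(&a, &b)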
Note that in 64-bit mode, 32-bit push is not available. So we still have to reference rsp directly for the upper half of the qword. But if our variables were uint64_t, or we didn't care about making them contiguous, we could just use push.
We have to reference RSP explicitly in this case to get pointers to the locals for passing to another function, so there's no getting around the extra stack-sync uop on Intel CPUs. In other cases maybe you just need to spill some function args for use after a call. (Although normally compilers will push rbx and mov rbx,rdi to save an arg in a call-preserved register, instead of spilling/reloading the arg itself, to shorten the critical path.)
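The pattern mentioned in that parenthetical looks something like this (a sketch; other_func is a hypothetical callee):
push rbx ; save the call-preserved register (at function entry this also re-aligns the stack)
mov rbx, rdi ; stash the incoming arg in a call-preserved register
call other_func
mov rdi, rbx ; the arg is still available after the call, no reload from memory
pop rbx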
I chose 2x 4-byte args so we could reach a 16-byte alignment boundary with 1 push, so we can optimize away the sub rsp, ## (or dummy push) entirely.
I could have used mov rax, 0x0000000200000001 / push rax, but 10-byte mov r64, imm64 takes 2 entries in the uop cache, and a lot of code-size.
gcc7 does know how to merge two adjacent stores, but chooses not to do that for mov in this case. If both constants had needed 32-bit immediates, it would have made sense. But if the values weren't actually constant at all, and came from registers, this wouldn't work while push / mov [rsp+4] would. (It wouldn't be worth merging values in a register with SHL + SHLD or whatever other instructions to turn 2 stores into 1.)
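A sketch of that push / mov combination, assuming two runtime values arrive in edi and esi and should end up adjacent on the stack:
push rdi ; qword store; its low dword holds edi's value
mov dword ptr [rsp+4], esi ; overwrite the high half with the second value
; (this mix does cost a stack-sync uop on Intel, as discussed above)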
If you need to reserve space for more than one 8-byte chunk, and don't have anything useful to store there yet, definitely use sub instead of multiple dummy PUSHes after the last useful PUSH. But if you have useful stuff to store, push imm8 or push imm32, or push reg are good.
We can see more evidence of compilers using "canned" sequences with ICC output: it uses lea rdi, [rsp] in the arg setup for the call. It seems they didn't think to look for the special case of the address of a local being pointed to directly by a register, with no offset, allowing mov instead of lea. (mov is definitely not worse, and better on some CPUs.)
An interesting example of not making locals contiguous is a version of the above with 3 args, int a=1, b=2, c=3;. To maintain 16B alignment, we now need to offset RSP by 8 + 16*1 = 24 bytes, so we could do
bar3:
push 3
push 2 # don't interleave mov in here; extra stack-sync uops
push 1
mov rdi, rsp
lea rsi, [rsp+8]
lea rdx, [rdi+16] # relative to RDI to save a byte with probably no extra latency even if MOV isn't zero latency, at least not on the critical path
call extfunc3(int*,int*,int*)
add rsp, 24
ret
This is significantly smaller code-size than compiler-generated code, because mov [rsp+16], 2 has to use the mov r/m32, imm32 encoding, using a 4-byte immediate because there's no sign_extended_imm8 form of mov.
push imm8 is extremely compact, 2 bytes. mov dword ptr [rsp+8], 1 is 8 bytes: opcode + modrm + SIB + disp8 + imm32. (RSP as a base register always needs a SIB byte; the ModRM encoding with base=RSP is the escape code for a SIB byte existing. Using RBP as a frame pointer allows more compact addressing of locals (by 1 byte per insn), but takes 3 extra instructions to set up / tear down, and ties up a register. But it avoids further access to RSP, avoiding stack-sync uops. It could actually be a win sometimes.)
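To make the size difference concrete, these are the machine-code bytes for the two stores (standard x86-64 encodings):
push 1 ; 6A 01 (2 bytes: opcode + imm8)
mov dword ptr [rsp+8], 1 ; C7 44 24 08 01 00 00 00 (8 bytes: opcode + modrm + SIB + disp8 + imm32)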
One downside to leaving gaps between your locals is that it may defeat load or store merging opportunities later. If you (the compiler) need to copy 2 locals somewhere, you may be able to do it with a single qword load/store if they're adjacent. Compilers don't consider all the future tradeoffs for the function when deciding how to arrange locals on the stack, as far as I know. We want compilers to run quickly, and that means not always back-tracking to consider every possibility for rearranging locals, or various other things. If looking for an optimization would take quadratic time, or multiply the time taken for other steps by a significant constant, it had better be an important optimization. (IDK how hard it might be to implement a search for opportunities to use push, especially if you keep it simple and don't spend time optimizing the stack layout for it.)
However, assuming there are other locals which will be used later, we can allocate them in the gaps between any we spill early. So the space doesn't have to be wasted, we can simply come along later and use mov [rsp+12], eax to store between two 32-bit values we pushed.
A tiny array of long, with non-constant contents
int ext_longarr(long *);
void longarr_arg(long a, long b, long c) {
long arr[] = {a,b,c};
ext_longarr(arr);
}
gcc/clang/ICC/MSVC follow their normal pattern, and use mov stores:
longarr_arg(long, long, long): # #longarr_arg(long, long, long)
sub rsp, 24
mov rax, rsp # this is clang being silly
mov qword ptr [rax], rdi # it could have used [rsp] for the first store at least,
mov qword ptr [rax + 8], rsi # so it didn't need 2 reg,reg MOVs to avoid clobbering RDI before storing it.
mov qword ptr [rax + 16], rdx
mov rdi, rax
call ext_longarr(long*)
add rsp, 24
ret
But it could have stored an array of the args like this:
longarr_arg_handtuned:
push rdx
push rsi
push rdi # leave stack 16B-aligned
mov rdi, rsp
call ext_longarr(long*)
add rsp, 24
ret
With more args, we start to get more noticeable benefits especially in code-size when more of the total function is spent storing to the stack. This is a very synthetic example that does nearly nothing else. I could have used volatile int a = 1;, but some compilers treat that extra-specially.
Reasons for not building stack frames gradually
(probably wrong) Stack unwinding for exceptions, and debug formats, I think don't support arbitrary playing around with the stack pointer. So at least before making any call instructions, a function is supposed to have offset RSP as much as it's going to for all future function calls in this function.
But that can't be right, because alloca and C99 variable-length arrays would violate that. There may be some kind of toolchain reason outside the compiler itself for not looking for this kind of optimization.
This gcc mailing list post about disabling -maccumulate-outgoing-args for tune=default (in 2014) was interesting. It pointed out that more push/pop led to larger unwind info (.eh_frame section), but that's metadata that's normally never read (if no exceptions), so larger total binary but smaller / faster code. Related: this shows what -maccumulate-outgoing-args does for gcc code-gen.
Obviously the examples I chose were trivial, where we're pushing the input parameters unmodified. More interesting would be when we calculate some things in registers from the args (and data they point to, and globals, etc.) before having a value we want to spill.
If you have to spill/reload anything between function entry and later pushes, you're creating extra stack-sync uops on Intel. On AMD, it could still be a win to do push rbx / blah blah / mov [rsp-32], eax (spill to the red zone) / blah blah / push rcx / imul ecx, [rsp-24], 12345 (reload the earlier spill from what's still the red zone, with a different offset).
Mixing push and [rsp] addressing modes is less efficient (on Intel CPUs because of stack-sync uops), so compilers would have to carefully weigh the tradeoffs to make sure they're not making things slower. sub / mov is well-known to work well on all CPUs, even though it can be costly in code-size, especially for small constants.
"It's hard to keep track of the offsets" is a totally bogus argument. It's a computer; re-calculating offsets from a changing reference is something it has to do anyway when using push to put function args on the stack. I think compilers could run into problems (i.e. need more special-case checks and code, making them compile slower) if they had more than 128B of locals, so you couldn't always mov store below RSP (into what's still the red-zone) before moving RSP down with future push instructions.
Compilers already consider multiple tradeoffs, but currently growing the stack frame gradually isn't one of the things they consider. push wasn't as efficient before Pentium-M introduced the stack engine, so efficient push even being available is a somewhat recent change as far as redesigning how compilers think about stack-layout choices goes.
Having a mostly-fixed recipe for prologues and for accessing locals is certainly simpler.
This requires disabling stack frames as well though.
It doesn't, actually. Simple stack frame initialisation can use either enter or push ebp \ mov ebp, esp \ sub esp, x (or instead of the sub, a lea esp, [ebp - x] can be used). Instead of or additionally to these, values can be pushed onto the stack to initialise the variables, or just pushing any random register to move the stack pointer without initialising to any certain value.
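Concretely, a 32-bit sketch (NASM syntax assumed) of reserving and initialising locals by pushing, instead of sub esp, 8 followed by two movs:
push ebp
mov ebp, esp
push dword 1 ; [ebp-4]: first local, reserved AND initialised in one instruction
push dword 2 ; [ebp-8]: second local
...
mov esp, ebp ; tear-down (or use leave)
pop ebp
ret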
Here's an example (for 16-bit 8086 real/V 86 Mode) from one of my projects: https://bitbucket.org/ecm/symsnip/src/ce8591f72993fa6040296f168c15f3ad42193c14/binsrch.asm#lines-1465
save_slice_farpointer:
[...]
.main:
[...]
lframe near
lpar word, segment
lpar word, offset
lpar word, index
lenter
lvar word, orig_cx
push cx
mov cx, SYMMAIN_index_size
lvar word, index_size
push cx
lvar dword, start_pointer
push word [sym_storage.main.start + 2]
push word [sym_storage.main.start]
The lenter macro sets up (in this case) only push bp \ mov bp, sp and then lvar sets up numeric defs for offsets (from bp) to variables in the stack frame. Instead of subtracting from sp, I initialise the variables by pushing into their respective stack slots (which also reserves the stack space needed).

What is the lea instruction doing in this piece of code? [duplicate]

I was trying to understand how address computation instructions work, especially the leaq command. Then I got confused when I saw examples using leaq to do arithmetic computation. For example, the following C code:
long m12(long x) {
    return x*12;
}
In assembly,
leaq (%rdi, %rdi, 2), %rax
salq $2, %rax
If my understanding is right, leaq should move whatever address (%rdi, %rdi, 2) evaluates to (which should be 2*%rdi + %rdi) into %rax. What confuses me is this: since the value x is stored in %rdi, which is just a memory address, why does multiplying %rdi by 3 and then left-shifting this memory address by 2 equal x times 12? Isn't it that when we multiply %rdi by 3, we jump to another memory address which doesn't hold the value x?
lea (see Intel's instruction-set manual entry) is a shift-and-add instruction that uses memory-operand syntax and machine encoding. This explains the name, but it's not the only thing it's good for. It never actually accesses memory, so it's like using & in C.
See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
In C, it's like uintptr_t foo = (uintptr_t) &arr[idx]. Note the & to give you arr + idx (scaling by the object size of arr, since this is C, not asm). In C, this would be abuse of the language syntax and types, but in x86 assembly pointers and integers are the same thing. Everything is just bytes, and it's up to the program to put instructions in the right order to get useful results.
Effective address is a technical term in x86: it means the "offset" part of a seg:off logical address, especially when a base_reg + index*scale + displacement calculation was needed. e.g. the rax + (rcx<<2) in a %gs:(%rax,%rcx,4) addressing mode. (But EA still applies to %rdi for stosb, or the absolute displacement for movabs load/store, or other cases without a ModRM addr mode). Its use in this context doesn't mean it must be a valid / useful memory address, it's telling you that the calculation doesn't involve the segment base so it's not calculating a linear address. (Adding the seg base would make it unusable for actual address math in a non-flat memory model.)
The original designer / architect of 8086's instruction set (Stephen Morse) might or might not have had pointer math in mind as the main use-case, but modern compilers think of it as just another option for doing arithmetic on pointers / integers, and so should humans.
(Note that 16-bit addressing modes don't include shifts, just [BP|BX] + [SI|DI] + disp8/disp16, so LEA wasn't as useful for non-pointer math before 386. See this Q&A for more about 32/64-bit addressing modes, although that answer uses Intel syntax like [rax + rdi*4] instead of the AT&T syntax used in this question. x86 machine code is the same regardless of what syntax you use to create it.)
Maybe the 8086 architects did simply want to expose the address-calculation hardware for arbitrary uses because they could do it without using a lot of extra transistors. The decoder already has to be able to decode addressing modes, and other parts of the CPU have to be able to do address calculations. Putting the result in a register instead of using it with a segment-register value for memory access doesn't take many extra transistors. Ross Ridge confirms that LEA on the original 8086 reuses the CPU's effective-address decoding and calculation hardware.
Note that most modern CPUs run LEA on the same ALUs as normal add and shift instructions. They have dedicated AGUs (address-generation units), but only use them for actual memory operands. In-order Atom is one exception; LEA runs earlier in the pipeline than the ALUs: inputs have to be ready sooner, but outputs are also ready sooner. Out-of-order execution CPUs (all modern x86) don't want LEA to interfere with actual loads/stores so they run it on an ALU.
lea has good latency and throughput, but not as good throughput as add or mov r32, imm32 on most CPUs, so only use lea when you can save an instruction with it instead of using add. (See Agner Fog's x86 microarch guide and asm optimization manual, and https://uops.info/.)
Ice Lake improved on that for Intel, now able to run LEA on all four ALU ports.
Rules for which kinds of LEA are "complex", running on fewer of the ports that can handle it, vary by microarchitecture. e.g. 3-component (two + operations) is the slower case on SnB-family, having a scaled index is the lower-throughput case on Ice Lake. Alder Lake E-cores (Gracemont) are 4/clock, but 1/clock when there's an index at all, and 2-cycle latency when there's an index and displacement (whether or not there's a base reg). Zen is slower when there's a scaled index or 3 components. (2c latency and 2/clock down from 1c and 4/clock).
The internal implementation is irrelevant, but it's a safe bet that decoding the operands to LEA shares transistors with decoding addressing modes for any other instruction. (So there is hardware reuse / sharing even on modern CPUs that don't execute lea on an AGU.) Any other way of exposing a multi-input shift-and-add instruction would have taken a special encoding for the operands.
So 386 got a shift-and-add ALU instruction for "free" when it extended the addressing modes to include scaled-index, and being able to use any register in an addressing mode made LEA much easier to use for non-pointers, too.
x86-64 got cheap access to the program counter (instead of needing to read what call pushed) "for free" via LEA because it added the RIP-relative addressing mode, making access to static data significantly cheaper in x86-64 position-independent code than in 32-bit PIC. (RIP-relative does need special support in the ALUs that handle LEA, as well as the separate AGUs that handle actual load/store addresses. But no new instruction was needed.)
It's just as good for arbitrary arithmetic as for pointers, so it's a mistake to think of it as being intended for pointers these days. It's not an "abuse" or "trick" to use it for non-pointers, because everything's an integer in assembly language. It has lower throughput than add, but it's cheap enough to use almost all the time when it saves even one instruction. But it can save up to three instructions:
;; Intel syntax.
lea eax, [rdi + rsi*4 - 8] ; 3 cycle latency on Intel SnB-family
; 2-component LEA is only 1c latency
;;; without LEA:
mov eax, esi ; maybe 0 cycle latency, otherwise 1
shl eax, 2 ; 1 cycle latency
add eax, edi ; 1 cycle latency
sub eax, 8 ; 1 cycle latency
On some AMD CPUs, even a complex LEA is only 2-cycle latency, but the 4-instruction sequence would be 4 cycles of latency from esi being ready to the final eax being ready. Either way, this saves 3 uops for the front-end to decode and issue, and they take up space in the reorder buffer all the way until retirement.
lea has several major benefits, especially in 32/64-bit code where addressing modes can use any register and can shift:
non-destructive: output in a register that isn't one of the inputs. It's sometimes useful as just a copy-and-add like lea 1(%rdi), %eax or lea (%rdx, %rbp), %ecx.
can do 3 or 4 operations in one instruction (see above).
Math without modifying EFLAGS can be handy after a test before a cmovcc (see the sketch after this list). Or maybe in an add-with-carry loop on CPUs with partial-flag stalls.
x86-64: position independent code can use a RIP-relative LEA to get a pointer to static data.
7-byte lea foo(%rip), %rdi is slightly larger and slower than mov $foo, %edi (5 bytes), so prefer mov r32, imm32 in position-dependent code on OSes where symbols are in the low 32 bits of virtual address space, like Linux. You may need to disable the default PIE setting in gcc to use this.
In 32-bit code, mov edi, OFFSET symbol is similarly shorter and faster than lea edi, [symbol]. (Leave out the OFFSET in NASM syntax.) RIP-relative isn't available and addresses fit in a 32-bit immediate, so there's no reason to consider lea instead of mov r32, imm32 if you need to get static symbol addresses into registers.
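Here is a sketch of the flags-preservation bullet above (Intel syntax; the register choices are arbitrary):
test edi, edi ; set flags once
lea eax, [rsi + 4] ; arithmetic that does not touch EFLAGS
cmovz eax, edx ; still consumes the flags from the test
Computing esi + 4 with add instead would have clobbered ZF before the cmovz could read it.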
Other than RIP-relative LEA in x86-64 mode, all of these apply equally to calculating pointers vs. calculating non-pointer integer add / shifts.
See also the x86 tag wiki for assembly guides / manuals, and performance info.
Operand-size vs. address-size for x86-64 lea
See also Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?. 64-bit address size and 32-bit operand size is the most compact encoding (no extra prefixes), so prefer lea (%rdx, %rbp), %ecx when possible instead of 64-bit lea (%rdx, %rbp), %rcx or 32-bit lea (%edx, %ebp), %ecx.
x86-64 lea (%edx, %ebp), %ecx is always a waste of an address-size prefix vs. lea (%rdx, %rbp), %ecx, but 64-bit address / operand size is obviously required for doing 64-bit math. (Agner Fog's objconv disassembler even warns about useless address-size prefixes on LEA with a 32-bit operand-size.)
Except maybe on Ryzen, where Agner Fog reports that 32-bit operand size lea in 64-bit mode has an extra cycle of latency. I don't know if overriding the address-size to 32-bit can speed up LEA in 64-bit mode if you need it to truncate to 32-bit.
This question is a near-duplicate of the very-highly-voted What's the purpose of the LEA instruction?, but most of the answers explain it in terms of address calculation on actual pointer data. That's only one use.
leaq doesn't have to operate on memory addresses: it computes an address but doesn't actually read from the result, so until a mov or the like tries to use it, it's just an esoteric way to add one number to 1, 2, 4 or 8 times another number (or the same number, in this case). It's frequently "abused"† for mathematical purposes, as you see. 2*%rdi + %rdi is just 3 * %rdi, so it's computing x * 3 without involving the multiplier unit on the CPU.
Similarly, left shifting, for integers, doubles the value for every bit shifted (every zero added to the right), thanks to the way binary numbers work (the same way in decimal numbers, adding zeroes on the right multiplies by 10).
So this is abusing the leaq instruction to accomplish multiplication by 3, then shifting the result to achieve a further multiplication by 4, for a final result of multiplying by 12 without ever actually using a multiply instruction (which the compiler presumably believes would run more slowly, and for all I know it could be right; second-guessing the compiler is usually a losing game).
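A quick worked example, taking x = 5:
; leaq (%rdi, %rdi, 2), %rax -> rax = 5 + 5*2 = 15 (x*3)
; salq $2, %rax -> rax = 15 << 2 = 60 (x*3*4 = x*12)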
†: To be clear, it's not abuse in the sense of misuse, just using it in a way that doesn't clearly align with the implied purpose you'd expect from its name. It's 100% okay to use it this way.
LEA is for calculating an address, but it doesn't dereference the memory address.
It's much more readable in Intel syntax:
m12(long):
lea rax, [rdi+rdi*2]
sal rax, 2
ret
So the first line is equivalent to rax = rdi*3
Then the left shift is to multiply rax by 4, which results in rdi*3*4 = rdi*12

what is this assembly code in C?

I have some assembly code below that I don't really understand. My thoughts are that it is meaningless. Unfortunately I can't provide any more instruction information. What would the output in C be?
0x1000: iretd
0x1001: cli
0x1002: in eax, dx
0x1003: inc byte ptr [rdi]
0x1005: add byte ptr [rax], al
0x1007: add dword ptr [rbx], eax
0x1009: add byte ptr [rax], al
0x100b: add byte ptr [rdx], 0
0x100e: add byte ptr [rax], al
Thanks
The first four bytes (if I did reconstruct them correctly) form the 32-bit value 0xFEEDFACF.
Putting that into google led me to:
https://gist.github.com/softboysxp/1084476#file-gistfile1-asm-L15
%define MH_MAGIC_64 0xfeedfacf
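Reassembling the first few instructions of the listing shows how that magic falls out (standard one-byte opcodes):
; 0x1000: CF = iretd
; 0x1001: FA = cli
; 0x1002: ED = in eax, dx
; 0x1003: FE 07 = inc byte ptr [rdi]
; first dword, read little-endian: CF FA ED FE -> 0xFEEDFACF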
Aren't you accidentally disassembling a Mach-O x64 executable from Mac OS X as raw machine code, instead of reading the file's metadata correctly and disassembling only the code section?
P.S. In questions like this one, please also include the source machine-code data, so experienced people can check the disassembly by targeting a different platform, like 32-bit x86 or 16-bit real-mode code, or a completely different CPU. That may help in case you mistakenly treated the machine code with the wrong target platform's disassembly. I had to first assemble your disassembly to see the raw bytes.
iretd is "return from interrupt", cli is "clear interrupt flag" which means disable all maskable interrupts. The C language does not understand the concept of an interrupt, so it is unlikely that this was compiled from C. In fact, this isn't a single complete fragment of code.
Also, add byte ptr [rdx], 0 is adding 0 to a value, which doesn't make sense to me unless it is the result of an unoptimised compilation, or of disassembling something that isn't code.

How to index arrays properly in x86 assembly

I am trying to make sure that I understand the SI and DI registers. My background in assembly language is somewhat limited to 6502, so bear with me here.
I have a quick example of how I would go about using SI as a simple counter. I am a bit concerned that I might be misusing this register though.
mov si, 0 ; set si to 0
mov cx, 5 ; set cx to 5 as we will count down to 1
do:
mov ah, 02h ; setup 02h DOS character output interrupt
mov dl, [table + si] ; grab our table with the si offset
add dl, '0' ; convert to ascii integer
int 21h ; call DOS service
inc si ; increment si
loop do ; repeat until cx = 0
ret
table: db 1,2,3,4,5
---
OUTPUT:> 12345
Is this the right way to use SI? I know in 6502 assembly, you can use the X and Y registers to offset arrays / tables. However, in my studies of x86, I am starting to realize how much more there is to work with. Such as how CX is automatically decremented in the 'loop' instruction.
I am hoping that moving forward, I will be able to save resources by writing efficient code.
Thank you in advance for your input.
This use of SI is perfectly fine. SI has the benefit of being a preserved register in most x86 calling conventions. Also, historically, SI was one of the few registers that you could use as an index in a memory operand; on a modern x86 CPU, nearly any general-purpose register will do.
SI still gets some special treatment with the lods instruction.
Your program actually works fine. After adding org $100 at the beginning, I managed to assemble it with FASM and run it in DOSBox.
On the 6502 you have two index registers (X and Y) that you can use in different ways (direct, indirect, indirect indexed, indexed indirect, ...).
On the x86 you have 4 registers that can be used as pointer registers: BX, BP, SI and DI (in 32-bit mode you can use nearly all registers).
BX and DI can be combined (example: [BX+DI+10]).
BP is typically used for storing the old stack pointer when entering a function (when using a C compiler). However, there is no misuse of registers (unless you use the stack pointer for something different) when you program in assembler. You cannot do anything wrong!
But be careful: On the x86 (in 16-bit mode) you also have to care about the segment registers - this is what the 6502 does not have!
These registers are needed because you can only address 64 KiB using a 16-bit register, but the 8086 has a 1 MiB address space. To solve this, an address is composed of a 16-bit segment and a 16-bit offset, so an address is effectively not 16 but 32 bits long. The exact meaning of the first 16 bits depends on the operating mode of the CPU.
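In real mode, for example, the segment value is shifted left by 4 bits (multiplied by 16) and added to the offset. A worked example:
; physical = segment * 16 + offset
; DS = 0x1234, SI = 0x0010:
; 0x1234 * 0x10 + 0x0010 = 0x12340 + 0x0010 = 0x12350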
The following segment registers are present:
CS: CS:IP is the instruction pointer
SS: SS:SP is the stack pointer; used for SP and BP pointer operations by default
DS: Used for all other pointer operations (all but SP and BP) by default
ES: Additional register
FS, GS: Additional registers since 80386
You can overwrite the default segment register to be used:
MOV AX,ES:[SI+100] ; Load from ES:SI+100 instead of DS:SI+100
String operations (like movsb) always access DS:SI and ES:DI (you cannot change the segment register for such operations).
That's an alright use of SI. But you could use several other registers in its place (although beware that, unlike 32-bit x86, 16-bit x86 code limits the set of registers that can be used for indexing; the ModR/M byte structure governs this).
You might want to consider doing an add si, table before the loop and mov dl, [si] inside it. It makes the loop slightly easier for the human to read, because there's one less variable in play.
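Applied to the program above, that suggestion would look like this (a sketch; same DOS services, with table defined as before):
mov si, table ; point SI at the table once (same effect as mov si, 0 / add si, table)
mov cx, 5
do:
mov ah, 02h ; DOS character output service
mov dl, [si] ; dereference SI directly; no displacement needed
add dl, '0' ; convert to ascii
int 21h
inc si
loop do
ret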

Resources