C Pointer to EFLAGS using NASM - c

For a task at my school, I need to write a C program that does a 16-bit addition using an assembly routine. Besides the result, the EFLAGS shall also be returned.
Here is my C Program:
int add(int a, int b, unsigned short* flags);//function must be used this way
int main(void){
unsigned short* flags = NULL;
printf("%d\n", add(30000, 36000, flags)); // printing just the result
return 0;
}
For the moment the Program just prints out the result and not the flags because I am unable to get them.
Here is the assembly program for which I use NASM:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
mov esp,ebp
pop ebp
ret
Now this all works smoothly. But I have no idea how to get the pointer which must be at [ebp+16] pointing to the flag register. The professor said we will have to use the pushfd command.
My problem just lies in the assembly code. I will modify the C Program to give out the flags after I get the solution for the flags.

Normally you'd just use a debugger to look at flags, instead of writing all the code to get them into a C variable for a debug-print. Especially since decent debuggers decode the condition flags symbolically for you, instead of or as well as showing a hex value.
You don't have to know or care which bit in FLAGS is CF and which is ZF. (This information isn't relevant for writing real programs, either. I don't have it memorized, I just know which flags are tested by different conditions like jae or jl. Of course, it's good to understand that FLAGS are just data that you can copy around, save/restore, or even modify if you want)
Your function args and return value are int, which is 32-bit in the System V 32-bit x86 ABI you're using. (links to ABI docs in the x86 tag wiki). Writing a function that only looks at the low 16 bits of its input, and leaves high garbage in the high 16 bits of the output is a bug. The int return value in the prototype tells the compiler that all 32 bits of EAX are part of the return value.
As Michael points out, you seem to be saying that your assignment requires using a 16-bit ADD. That will produce carry, overflow, and other conditions with different inputs than if you looked at the full 32 bits. (BTW, this article explains carry vs. overflow very well.)
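To see which inputs set which flags for a 16-bit ADD, the rules can be modeled in plain C. This is only a sketch: `Flags16` and `add16_flags` are hypothetical names that are not part of the assignment, and only CF, ZF, SF and OF are modeled.

```c
#include <stdint.h>

/* Hypothetical helper: models the CF, ZF, SF and OF results of a 16-bit ADD. */
typedef struct {
    uint16_t result; /* low 16 bits of the sum */
    int cf;          /* carry out of bit 15 (unsigned overflow) */
    int zf;          /* result is zero */
    int sf;          /* bit 15 of the result */
    int of;          /* signed overflow */
} Flags16;

Flags16 add16_flags(uint16_t a, uint16_t b)
{
    Flags16 f;
    uint32_t wide = (uint32_t)a + (uint32_t)b; /* cannot overflow 32 bits */

    f.result = (uint16_t)wide;
    f.cf = (int)((wide >> 16) & 1);
    f.zf = (f.result == 0);
    f.sf = (f.result >> 15) & 1;
    /* Signed overflow: both inputs have the same sign bit and the
       sign bit of the result differs from it. */
    f.of = (((a ^ f.result) & (b ^ f.result)) >> 15) & 1;
    return f;
}
```

For the inputs in the question, `add16_flags(30000, 36000)` yields 464 with CF set and OF clear: 0x7530 + 0x8CA0 wraps past 0xFFFF, but the two inputs have different sign bits, so there is no signed overflow.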
Here's what I'd do. Note the 16-bit operand size for the ADD.
global add
section .text
add:
push ebp
mov ebp,esp ; stack frames are optional, you can address things relative to ESP
mov eax, [ebp+8] ; first arg: No need to avoid loading the full 32 bits; the next insn doesn't care about the high garbage.
add ax, [ebp+12] ; low 16 bits of second arg. Operand-size implied by AX
cwde ; sign-extend AX into EAX
mov ecx, [ebp+16] ; the pointer arg
pushf ; the simple straightforward way
pop edx
mov [ecx], dx ; Store the low 16 of what we popped. Writing word [ecx] is optional, because dx implies 16-bit operand-size
; be careful not to do a 32-bit store here, because that would write outside the caller's object.
; mov esp,ebp ; redundant: ESP is still pointing at the place we pushed EBP, since the push is balanced by an equal-size pop
pop ebp
ret
CWDE (the 16->32 form of the 8086 CBW instruction) is not to be confused with CWD (the AX -> DX:AX 8086 instruction). If you're not using AX, then MOVSX / MOVZX are a good way to do this.
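In C terms, the difference between sign-extending and zero-extending the 16-bit result looks roughly like this. The function names are made up for illustration, and the narrowing cast relies on the two's-complement wraparound that mainstream compilers implement.

```c
#include <stdint.h>

/* What CWDE / MOVSX do: treat the low 16 bits as signed, then widen.
   The upper 16 bits of the input ("high garbage") are ignored. */
int32_t sign_extend16(uint32_t eax)
{
    return (int32_t)(int16_t)(uint16_t)eax;
}

/* What MOVZX does: keep the low 16 bits and clear the upper 16. */
uint32_t zero_extend16(uint32_t eax)
{
    return (uint32_t)(uint16_t)eax;
}
```

With the low half equal to 0x8000, `sign_extend16` gives -32768 while `zero_extend16` gives 32768; for 0x01D0 both give 464.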
The fun way: instead of using the default operand size and doing 32-bit push and pop, we can do a 16-bit pop directly into the destination memory address. That would leave the stack unbalanced, so we could either uncomment the mov esp, ebp again, or use a 16-bit pushf (with an operand-size prefix, which according to the docs makes it only push the low 16 FLAGS, not the 32-bit EFLAGS.)
; What I'd *really* do: maximum efficiency if I had to use the 32-bit ABI with args on the stack, instead of args in registers
global add
section .text
add:
mov eax, [esp+4] ; first arg, first thing above the return address
add ax, [esp+8] ; second arg
cwde ; sign-extend AX into EAX
mov ecx, [esp+12] ; the pointer
pushfw ; push the low 16 of FLAGS
pop word [ecx] ; pop into memory pointed to by unsigned short* flags
ret
Both PUSHFW and POP WORD will assemble with an operand-size prefix. Here is the output from objdump -Mintel, which uses slightly different syntax from NASM:
4000c0: 66 9c pushfw
4000c2: 66 8f 01 pop WORD PTR [ecx]
PUSHFW is the same as o16 PUSHF. In NASM, o16 applies the operand-size prefix.
If you only needed the low 8 flags (not including OF), you could use LAHF to load FLAGS into AH and store that.
PUSHFing directly into the destination is not something I'd recommend. Temporarily pointing the stack at some random address is not safe in general. Programs with signal handlers will use the space below the stack asynchronously. This is why you have to reserve stack space before using it with sub esp, 32 or whatever, even if you're not going to make a function call that would overwrite it by pushing more stuff on the stack. The only exception is when you have a red-zone.
Your C caller:
You're passing a NULL pointer, so of course your asm function segfaults. Pass the address of a local to give the function somewhere to store to.
int add(int a, int b, unsigned short* flags);
int main(void) {
unsigned short flags;
int result = add(30000, 36000, &flags);
printf("%d %#hx\n", result, flags);
return 0;
}
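If you would rather print flag names than a raw hex word, a small decoder is enough. This is a sketch: `describe_flags` is a hypothetical helper, and the masks are the architectural bit positions of the status flags in the low 16 bits of EFLAGS (CF bit 0, PF bit 2, AF bit 4, ZF bit 6, SF bit 7, OF bit 11).

```c
#include <stdio.h>

/* Status-flag masks in the low 16 bits of EFLAGS. */
static const struct {
    unsigned short mask;
    const char *name;
} flag_names[] = {
    { 0x0001, "CF" }, { 0x0004, "PF" }, { 0x0010, "AF" },
    { 0x0040, "ZF" }, { 0x0080, "SF" }, { 0x0800, "OF" },
};

/* Hypothetical helper: writes the names of the set status flags,
   separated by spaces, into buf and returns buf. */
char *describe_flags(unsigned short flags, char *buf, size_t size)
{
    size_t used = 0;

    if (size == 0)
        return buf;
    buf[0] = '\0';
    for (size_t i = 0; i < sizeof flag_names / sizeof flag_names[0]; i++) {
        if (flags & flag_names[i].mask) {
            int n = snprintf(buf + used, size - used, "%s%s",
                             used ? " " : "", flag_names[i].name);
            if (n < 0 || (size_t)n >= size - used)
                break; /* out of room: stop with what fits */
            used += (size_t)n;
        }
    }
    return buf;
}
```

In the caller above you could then print `describe_flags(flags, buf, sizeof buf)` with `%s` next to the hex value. Bits that are not status flags, such as the interrupt flag, are ignored by this sketch.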

This is just a simple approach. I didn't test it, but you should get the idea...
Just set ESP to the pointer value, increment it by 2 (even for 32-bit arch) and PUSHF like this:
global add
section .text
add:
push ebp
mov ebp,esp
mov ax,[ebp+8]
mov bx,[ebp+12]
add ax,bx
; --- here comes the mod
mov esp, [ebp+16] ; this will set ESP to the pointers address "unsigned short* flags"
lea esp, [esp+2] ; adjust ESP to address above target
db 66h ; operand size prefix for 16-bit PUSHF (alternatively 'db 0x66', depending on your assembler)
pushf ; this will save the lower 16 bits of EFLAGS to the WORD that [EBP+16] points to, at [ESP+2-2]
; --- here ends the mod
mov esp,ebp
pop ebp
ret
This should work because PUSHF decrements ESP by 2 and then saves the value to WORD PTR [ESP]. Therefore ESP has to be set 2 bytes above the pointer address first. Setting ESP to the appropriate value lets you designate the direct target of PUSHF.

Related

Causes and benefits of this improvement on gcc version >= 4.9.0 vs gcc version < 4.9?

I have recently exploited a dangerous program and found something interesting about the difference between versions of gcc on x86-64 architecture.
Note:
Wrongful usage of gets is not the issue here.
If we replace gets with any other functions, the problem doesn't change.
This is the source code I use:
#include <stdio.h>
int main()
{
char buf[16];
gets(buf);
return 0;
}
I use gcc.godbolt.org to disassemble the program with the flags -m32 -fno-stack-protector -z execstack -g.
In the disassembled code, gcc versions >= 4.9.0 produce:
lea ecx, [esp+4] # begin of main
and esp, -16
push DWORD PTR [ecx-4] # push esp
push ebp
mov ebp, esp
/* between these comment is not related to the question
push ecx
sub esp, 20
sub esp, 12
lea eax, [ebp-24]
push eax
call gets
add esp, 16
mov eax, 0
*/
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret # end of main
But gcc versions < 4.9.0 produce just:
push ebp # begin of main
mov ebp, esp
/* between these comment is not related to the question
and esp, -16
sub esp, 32
lea eax, [esp+16]
mov DWORD PTR [esp], eax
call gets
mov eax, 0
*/
leave
ret # end of main
My question is: What causes this difference in the disassembled code, and what are its benefits? Is there a name for this technique?
I can't say for sure without the actual values in:
and esp, 0xXX # XX is a number
but this looks a lot like extra code to align the stack to a larger value than the ABI requires.
Edit: The value is -16, which is 32-bit 0xFFFFFFF0 or 64-bit 0xFFFFFFFFFFFFFFF0 so this is indeed stack alignment to 16 bytes, likely meant for use of SSE instructions. As mentioned in comments, there is more code in the >= 4.9.0 version because it also aligns the frame pointer instead of only the stack pointer.
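As a quick illustration of what `and esp, -16` computes, here is the same arithmetic in C. It is a sketch only, and `align_down16` is a made-up name.

```c
#include <stdint.h>

/* Round an address down to a multiple of 16, like `and esp, -16`. */
uint32_t align_down16(uint32_t esp)
{
    return esp & (uint32_t)-16; /* -16 is 0xFFFFFFF0 as a 32-bit value */
}
```

For a stack pointer that is already 4-byte aligned, `esp - align_down16(esp)` can only be 0, 4, 8 or 12, which is how far the instruction lowers the stack.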
The i386 ABI, used for 32-bit programs, requires that a process, immediately after being loaded, has its stack aligned on 32-bit values:
%esp Performing its usual job, the stack pointer holds the address of the
bottom of the stack, which is guaranteed to be word aligned.
Compare this with the x86_64 ABI [1] used for 64-bit programs:
%rsp The stack pointer holds the address of the byte with lowest address which
is part of the stack. It is guaranteed to be 16-byte aligned at process entry
The opportunity that AMD's new 64-bit technology gave to rewrite the old i386 ABI allowed a number of optimizations that had been lacking due to backward compatibility, among these a bigger (stricter?) stack alignment.
I won't dwell on the benefits of stack alignment but it suffices to say that if a 4-byte alignment was good, so is a 16-byte one.
So much that it is worth spending some instructions aligning the stack.
That's what GCC 4.9.0+ does, it aligns the stack at 16-bytes.
That explains the and esp, -16 but not the other instructions.
Aligning the stack with and esp, -16 is the fastest way to do it when the compiler only knows that the stack is 4-byte aligned (since esp MOD 16 can be 0, 4, 8 or 12).
However it is a destructive method, the compiler loses the original esp value.
But now comes the chicken-and-egg problem: if we save the original esp on the stack before aligning the stack, we lose it because we don't know how far the stack pointer is lowered by the alignment. If we save it after the alignment, well, we can't. We lost it in the alignment.
So the only possible solution is to save it in a register, align the stack and then save said register on the stack.
;Save the stack pointer in ECX, actually is ESP+4 but still does
lea ecx, [esp+4] #ECX = ESP+4
;Align the stack
and esp, -16 #This lowers ESP by 0, 4, 8 or 12
;IGNORE THIS FOR NOW
push DWORD PTR [ecx-4]
;Usual prolog
push ebp
mov ebp, esp
;Save the original ESP (before alignment), actually is ESP+4 but OK
push ecx
GCC saves esp+4 in ecx. I don't know why [2], but this value still does the trick.
The only mystery left is the push DWORD PTR [ecx-4].
But it turns out to be a simple mystery: for debugging purposes GCC pushes the return address just before the old frame pointer (before push ebp); this is where 32-bit tools expect it to be.
Since ecx=esp_o+4, where esp_o is the original stack pointer pre-alignment, [ecx-4] = [esp_o] = return address.
Note that three dwords (12 bytes) have now been pushed since the alignment, so esp is at 4 modulo 16; the local variable area must therefore be of size 16*k+4 to have the stack aligned at 16 bytes again.
In your example k is 1 and the area is of 20 bytes in size.
The subsequent sub esp, 12 is to align the stack for the gets function (the requirement is to have the stack aligned at the function call).
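The bookkeeping above can be checked by tracing ESP modulo 16 through the instructions from the question. This is a sketch; `esp_mod16_at_call` is a hypothetical name and the instruction sequence is copied from the GCC >= 4.9 listing.

```c
#include <stdint.h>

/* Traces the GCC >= 4.9 prologue from the question and returns ESP
   modulo 16 at the moment `call gets` executes. */
uint32_t esp_mod16_at_call(uint32_t esp)
{
    esp &= (uint32_t)-16; /* and  esp, -16          -> 0 mod 16  */
    esp -= 4;             /* push DWORD PTR [ecx-4] -> 12 mod 16 */
    esp -= 4;             /* push ebp               -> 8 mod 16  */
    esp -= 4;             /* push ecx               -> 4 mod 16  */
    esp -= 20;            /* sub  esp, 20           -> 0 mod 16  */
    esp -= 12;            /* sub  esp, 12           -> 4 mod 16  */
    esp -= 4;             /* push eax               -> 0 mod 16  */
    return esp % 16;
}
```

Whatever 4-byte-aligned value ESP starts with, the function returns 0, so the stack is 16-byte aligned at the call.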
Finally, the code
mov ebp, esp
mov ecx, DWORD PTR [ebp-4] # ecx = saved esp
leave
lea esp, [ecx-4]
ret
The first instruction is a copy-paste error.
One could check it out or simply reason that
if it were there the [ebp-4] would be below the stack pointer (and there is no red zone for the i386 ABI).
The rest is just undoing what was done in the prolog:
;Get the original stack pointer
mov ecx, DWORD PTR [ebp-4] ;ecx = esp_o+4
;Standard epilog
leave ;mov esp, ebp / pop ebp
;The stack pointer points to the copied return address
;Restore the original stack pointer
lea esp, [ecx-4] ;esp = esp_o
ret
GCC has to first get the original stack pointer (+4) saved on the stack, then restore the old frame pointer (ebp) and finally, restore the original stack pointer.
The return address is on the top of the stack when lea esp, [ecx-4] is executed, so in theory GCC could just return but it has to restore the original esp because main is not the first function to be executed in a C program, so it cannot leave the stack unbalanced.
[1] This is not the latest version, but the quoted text went unchanged in the successive editions.
[2] This has been discussed here on SO, but I can't remember if it was in a comment or in an answer.

Why does this 16-bit DOS example from a book crash when I call it from C (compiled with visual studio?)

My OS is Windows 7 64-bit.
Here is my code.
first.c :
#include <stdio.h>
extern long second(int, int);
void main()
{
int val1, val2;
long result;
scanf("%d %d", &val1, &val2);
result = second(val1, val2);
printf("%ld", result);
}
second.asm :
.model small
.code
public _second
_second proc near
push bp
mov bp,sp
mov ax,[bp+4]
mov bx,[bp+6]
add ax,bx
pop bp
ret
_second endp
end
It compiled OK, but the line "mov ax,[bp+4]" fails with the error "0xC0000005: Access violation reading location 0x00000004."
What's wrong?
You're assembling code in 16-bit mode and linking it into a 32-bit program which is executed in 32-bit mode. The machine code that makes up your second function ends up getting interpreted differently than you expected. This is the code that is actually executed:
_second:
00407800: 55 push ebp
00407801: 8B EC mov ebp,esp
00407803: 8B 46 04 mov eax,dword ptr [esi+4]
00407806: 8B 5E 06 mov ebx,dword ptr [esi+6]
00407809: 03 C3 add eax,ebx
0040780B: 5D pop ebp
0040780C: C3 ret
Instead of using 16-bit registers, the code uses 32-bit registers. Instead of using the BP register as a base when addressing the arguments on the stack, it uses ESI as a base. Since ESI is not initialized to anything in the function, it holds whatever random value it happened to have before the call (e.g. 0). Wherever that points isn't a valid readable address, so accessing it causes a crash.
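The reason BP turns into ESI is that the same ModRM byte (0x46 in `8B 46 04`) is decoded through a different table depending on the address size. A rough sketch in C, where `modrm_base` is a hypothetical helper that only covers the register-plus-displacement forms:

```c
#include <stdint.h>

/* Base selected by the r/m field when mod is 01 or 10 (register plus
   displacement). In 32-bit mode r/m = 100 means a SIB byte follows,
   so there is no single base register for that entry. */
static const char *const rm16[8] = {
    "bx+si", "bx+di", "bp+si", "bp+di", "si", "di", "bp", "bx"
};
static const char *const rm32[8] = {
    "eax", "ecx", "edx", "ebx", "(SIB)", "ebp", "esi", "edi"
};

/* Hypothetical helper: names the base of a [base+disp] ModRM operand
   for 16-bit or 32-bit addressing. */
const char *modrm_base(uint8_t modrm, int addr_size_bits)
{
    unsigned rm = modrm & 7; /* the low three bits select the base */
    return addr_size_bits == 16 ? rm16[rm] : rm32[rm];
}
```

`modrm_base(0x46, 16)` gives "bp", which is what the 16-bit assembler intended, while `modrm_base(0x46, 32)` gives "esi", which is what the CPU actually used.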
Your problem is that you've taken assembly code meant to be used with a 16-bit compiler for a 16-bit operating system (e.g. MS-DOS) and used it with a 32-bit compiler for Windows. You can't blindly cut and paste code examples like that. Here's a 32-bit version of your assembly code:
.MODEL FLAT
.CODE
PUBLIC _second
_second PROC
push ebp
mov ebp, esp
mov eax, [ebp+8]
mov edx, [ebp+12]
add eax, edx
pop ebp
ret
_second ENDP
END
The .MODEL FLAT directive tells the assembler you're generating 32-bit code. I've changed the code to use 32-bit registers, and adjusted the frame pointer (EBP) relative offsets to reflect the fact that stack slots in 32-bit mode are 4 bytes long. I also changed the code to use EDX instead of EBX because in the 32-bit C calling convention the EBX register needs to be preserved by the function, while EDX (like BX in the 16-bit C calling convention) doesn't.
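The offset change from [bp+4] / [bp+6] to [ebp+8] / [ebp+12] follows from the slot size. As a sketch (the function name is made up, and it assumes a near call with one stack slot per argument):

```c
/* Offset of the i-th (0-based) stack argument from the frame pointer
   after `push bp / mov bp, sp`. slot_size is 2 in 16-bit code and 4 in
   32-bit code. */
int arg_offset(int i, int slot_size)
{
    /* Skip the saved frame pointer and the return address first. */
    return 2 * slot_size + i * slot_size;
}
```

`arg_offset(0, 2)` and `arg_offset(1, 2)` give 4 and 6; `arg_offset(0, 4)` and `arg_offset(1, 4)` give 8 and 12.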
SP and BP are probably 0 in this specific case. Note however that SP and BP are the lowest 16-bit quarters of RSP and RBP respectively, so the stack pointer isn't really 0.
Another way to pass parameters from .c to .asm is to use the "fastcall" convention, which lets you pass two parameters in registers CX and DX (actually it's ECX and EDX, but you are using 16-bit registers in your code). Below is a short example tested in VS 2013; it sends two ints (2, 5) to the asm function, and the function returns the sum of those values (7):
first.cpp
#include "stdafx.h"
extern "C" int __fastcall second(int,int); // ◄■■ KEYWORDS "C" AND __FASTCALL.
int _tmain(int argc, _TCHAR* argv[])
{
short int result = second(2,5); // ◄■■ "RESULT" = 7.
return 0;
}
second.asm
.model small
.code
public @second@8 ; ◄■■ NOTICE THE @ AND THE 8.
@second@8 proc near ; ◄■■ NOTICE THE @ AND THE 8.
mov ax,cx ; ◄■■ AX = 2.
add ax,dx ; ◄■■ AX + 5 (RETURN VALUE).
ret
@second@8 endp ; ◄■■ NOTICE THE @ AND THE 8.
end

Copying to and Displaying an Array

Hello Everyone!
I'm a newbie at NASM and I just started out recently. I currently have a program that reserves an array and is supposed to copy and display the contents of a string from the command line arguments into that array.
Now, I am not sure if I am copying the string correctly as every time I try to display this, I keep getting a segmentation error!
This is my code for copying the array:
example:
%include "asm_io.inc"
section .bss
X: resb 50 ;;This is our array
~some code~
mov eax, dword [ebp+12] ; eax holds address of 1st arg
add eax, 4 ; eax holds address of 2nd arg
mov ebx, dword [eax] ; ebx holds 2nd arg, which is pointer to string
mov ecx, dword 0
;Where our 2nd argument is a string eg "abcdefg" i.e ebx = "abcdefg"
copyarray:
mov al, [ebx] ;; get a character
inc ebx
mov [X + ecx], al
inc ecx
cmp al, 0
jz done
jmp copyarray
My question is whether this is the correct method of copying the array and how can I display the contents of the array after?
Thank you!
The loop looks OK, but clunky. If your program is crashing, use a debugger. See the x86 tag wiki for links and a quick intro to gdb for asm.
I think you're getting argv[1] loaded correctly. (Note that this is the first command-line arg, though. argv[0] is the command name.) https://en.wikibooks.org/wiki/X86_Disassembly/Functions_and_Stack_Frames says ebp+12 is the usual spot for the 2nd arg to 32-bit functions that bother to set up stack frames.
Michael Petch commented on Simon's deleted answer that the asm_io library has print_int, print_string, print_char, and print_nl routines, among a few others. So presumably you pass a pointer to your buffer to one of those functions and call it a day. Or you could call sys_write(2) directly with an int 0x80 instruction, since you don't need to do any string formatting and you already have the length.
Instead of incrementing separately for two arrays, you could use the same index for both, with an indexed addressing mode for the load.
;; main (int argc ([esp+4]), char **argv ([esp+8]))
... some code you didn't show that I guess pushes some more stuff on the stack
mov eax, dword [ebp+12] ; load argv
;; eax + 4 is &argv[1], the address of the 1st cmdline arg (not counting the command name)
mov esi, dword [eax + 4] ; esi holds 2nd arg, which is pointer to string
xor ecx, ecx
copystring:
mov al, [esi + ecx]
mov [X + ecx], al
inc ecx
test al, al
jnz copystring
I changed the comments to say "cmdline arg", to distinguish between those and "function arguments".
When it doesn't cost any extra instructions, use esi for source pointers, edi for dest pointers, for readability.
Check the ABI for which registers you can use without saving/restoring (eax, ecx, and edx at least. That might be all for 32bit x86.). Other registers have to be saved/restored if you want to use them. At least, if you're making functions that follow the usual ABI. In asm you can do what you like, as long as you don't tell a C compiler to call non-standard functions.
Also note the improvement in the end of the loop. A single jnz to loop is more efficient than jz break / jmp.
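For comparison, the same single-index, test-at-the-bottom loop in C looks roughly like this. `copy_string` is a hypothetical name, and the capacity check is an addition of this sketch; the asm loops above have none, so a long argument would overrun the 50-byte buffer.

```c
#include <stddef.h>

/* Copies src, including its terminating 0, into dst of capacity cap.
   Returns the number of bytes stored (terminator included), or 0 if
   the string does not fit. */
size_t copy_string(char *dst, size_t cap, const char *src)
{
    size_t i = 0;
    char c;

    do {
        if (i == cap)
            return 0;       /* would overrun dst */
        c = src[i];         /* mov al, [esi + ecx] */
        dst[i] = c;         /* mov [X + ecx], al   */
        i++;                /* inc ecx             */
    } while (c != '\0');    /* test al, al / jnz   */

    return i;
}
```

The returned count is also the length you would hand to a write call, minus one for the terminator.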
This should run at one cycle per byte on Intel, because test/jnz macro-fuse into one uop. The load is one uop, and the store micro-fuses into one uop. inc is also one uop. Intel CPUs since Core2 are 4-wide: 4 uops issued per clock.
Your original loop runs at half that speed. Since it's 6 uops, it takes 2 clock cycles to issue an iteration.
Another hacky way to do this would be to get the offset between X and ebx into another register, so one of the effective addresses could use a one-register addressing mode, even if the dest wasn't a static array.
mov [ebx + ecx], al (where ecx = X - start_of_src_buf and ebx is the incrementing source pointer). But ideally you'd make the store the one that used a one-register addressing mode, unless the load was a memory operand to an ALU instruction that could micro-fuse it. Where the dest is a static buffer, this address-difference hack isn't useful at all.
You can't use rep string instructions (like rep movsb) to implement strcpy for implicit-length strings (C null-terminated, rather than with a separately-stored length). Well, you could, but only by scanning the source twice: once to find the length, and again to memcpy.
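The two-pass approach corresponds to this C sketch (`copy_two_pass` is a hypothetical name; the caller must make sure dst is large enough):

```c
#include <string.h>

/* Two-pass copy: find the length, then do a fixed-length block copy.
   The memcpy step is the part a `rep movsb` could implement. */
size_t copy_two_pass(char *dst, const char *src)
{
    size_t n = strlen(src) + 1; /* pass 1: length including the 0 */
    memcpy(dst, src, n);        /* pass 2: block copy             */
    return n;
}
```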
To go faster than one byte per clock, you'd have to use vector instructions to test for the null byte at any of 16 positions in parallel. Search for an optimized strcpy implementation for an example. It would probably use pcmpeqb against a vector register of all zeros.

What is happening in this disassembled code, and what would it look like in C?

I've disassembled this c code (using ida), and ran across this bit of code. I believe the second line is an array, as well as the 5th line, but I'm not sure why it uses a sign extend or a zero extend.
I need to convert the code to C, and I'm not sure why the sign/zero extend is used, or what C code would cause that.
mov ecx, [ebp+var_58]
mov dl, byte ptr [ebp+ecx*2+var_28]
mov [ebp+var_59], dl
mov eax, [ebp+var_58]
movsx ecx, [ebp+eax*2+var_20]
movzx edx, [ebp+var_59]
or edx, ecx
mov [ebp+var_59], dl
unsigned integer types will be zero-extended, while signed types will be sign-extended.
I kinda want to downvote this as too trivial. It's not like there's anything going on that the instruction reference manual doesn't cover. I guess it's different from asking for an explanation of a really simple C program because the trick here is understanding why one might string this sequence of instructions together, rather than just what each one does individually. Being familiar with the idioms used by non-optimizing compilers (store and reload from RAM after every statement) helps.
I'm guessing this is a snippet from inside a function that makes a stack frame, so the var_NN names are offsets from ebp (negative ones, in the usual IDA naming) where local variables are spilled when they're not live in registers.
mov ecx, [ebp+var_58] ; load var58 into ecx
mov dl, byte ptr [ebp+ecx*2+var_28] ; load a byte from var28[2*var58]
mov [ebp+var_59], dl ; store it to var59
mov eax, [ebp+var_58] ; load var58 again for some reason? can var59 alias var58?
; otherwise we still have the value in ecx, right?
; Or is this non-optimizing compiler output that's really annoying to read?
movsx ecx, [ebp+eax*2+var_20] ; load var20[var58*2]
movzx edx, [ebp+var_59] ; load var59 again
or edx, ecx ; edx = var59|var20[var58*2]
mov [ebp+var_59], dl ; spill var59 back to memory
I guess the default operand size for movsx/movzx is byte-to-dword. word-to-dword also exists, and I'm surprised your disassembler didn't disambiguate with a byte ptr on the memory operand. I'm inferring that it's a byte load because the preceding store to that address was byte-wide.
movsx is used when loading signed data that's smaller than 32b. C's integer-promotion rules dictate that operations on integer types smaller than int are automatically promoted to int (or unsigned int if int can't represent all values. e.g. if unsigned short and unsigned int are the same size).
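As for what the C might have looked like, here is one guess. The `*2` scaling combined with byte-sized loads suggests elements that are 2 bytes wide, and a `[][2]` array is only one of several layouts that would produce it. The function name and the parameter types are assumptions; the variable names just mirror the disassembly.

```c
/* Hypothetical reconstruction of the snippet, not the original source. */
unsigned char combine(unsigned char var28[][2], signed char var20[][2],
                      int var58)
{
    unsigned char var59 = var28[var58][0]; /* plain byte load and store */

    /* The element of var20 is promoted to int with sign extension
       (movsx), var59 with zero extension (movzx). Only the low byte of
       the OR is stored back. */
    var59 = (unsigned char)(var59 | var20[var58][0]);
    return var59;
}
```

The sign extension never affects the stored byte, which is why an optimizing compiler can drop it, but a non-optimizing compiler emits it anyway because the promotion rules call for it.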
8-bit or 32-bit operand sizes are available without operand-size prefix bytes. Since only Intel P6/SnB-family CPUs track partial-register dependencies, sign-extending to a full register width on loads can make for faster code (avoiding false dependencies on the previous contents of the register on AMD and Silvermont). So sign or zero extending (as appropriate for the data type) on loads is often the best way to handle narrow memory locations.
Looking at the output of non-optimizing compilers is not usually worth bothering with.
If the code had been generated by a proper optimizing compiler, it would probably be more like
mov ecx, [ebp+var_58] ; var58 is live in ecx
mov al, byte ptr [ebp+ecx*2+var_28] ; var59 = var28[2*var58]
or al, [ebp+ecx*2+var_20] ; var59 |= var20[var58*2]
mov [ebp+var_59], al ; spill var59 to memory
Much easier to read, IMO, without the noise of constantly storing/reloading. You can see when a value is used multiple times without having to notice that a load was from an address that was just stored to.
If a false dependency on the upper 24 bits of eax was causing a problem, we could use movzx or movsx loads into two registers, and do an or r32, r32 like the original, but then still store the low 8. (Using a 32bit or with a memory operand would do a 4B load, not a 1B load, which could cross a cache line or even a page and segfault.)

Dive into the assembly

Function in c:
PHPAPI char *php_pcre_replace(char *regex, int regex_len,
char *subject, int subject_len,
zval *replace_val, int is_callable_replace,
int *result_len, int limit, int *replace_count TSRMLS_DC)
{
pcre_cache_entry *pce; /* Compiled regular expression */
/* Compile regex or get it from cache. */
if ((pce = pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC)) == NULL) {
return NULL;
}
....
}
Its assembly:
php5ts!php_pcre_replace:
1015db70 8b442408 mov eax,dword ptr [esp+8]
1015db74 8b4c2404 mov ecx,dword ptr [esp+4]
1015db78 56 push esi
1015db79 8b74242c mov esi,dword ptr [esp+2Ch]
1015db7d 56 push esi
1015db7e 50 push eax
1015db7f 51 push ecx
1015db80 e8cbeaffff call php5ts!pcre_get_compiled_regex_cache (1015c650)
1015db85 83c40c add esp,0Ch
1015db88 85c0 test eax,eax
1015db8a 7502 jne php5ts!php_pcre_replace+0x1e (1015db8e)
php5ts!php_pcre_replace+0x1c:
1015db8c 5e pop esi
1015db8d c3 ret
The c function call pcre_get_compiled_regex_cache(regex, regex_len TSRMLS_CC) corresponds to 1015db7d~1015db80 which pushes the 3 parameters to the stack and call it.
But my doubt is: among so many registers, how does the compiler decide to use eax, ecx and esi (this one is special, as it's saved before being used; why?) as the intermediates to carry the values to the stack?
There must be some hidden indication in C that tells the compiler to do it this way, right?
No, there is no hidden indication.
This is a typical strategy for generating 80x86 instructions used by many compiler implementations, C and otherwise. For example, the 1980s Intel Fortran-77 compiler, when optimization was turned on, did the same thing.
That it uses eax and ecx preferentially is probably an artifact of avoiding use of esi and edi, since those registers cannot directly be used to load byte operands.
Why not ebx and edx? Well, those are preferred by many code generators for holding intermediate pointers in evaluating complex structure evaluation, which is to say, there isn't much reason at all. The compiler just looked for two available registers to use and overwrote them to buffer the values.
Why not reuse eax like this?:
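push esi
mov eax,dword ptr [esp+2Ch]
push eax
mov eax,dword ptr [esp+10h] ; regex_len: [esp+8] at entry, shifted by two pushes
push eax
mov eax,dword ptr [esp+10h] ; regex: [esp+4] at entry, shifted by three pushes
push eax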
Because that causes pipeline stalls waiting for eax to complete previous memory cycles, in 80x86s since the 80586 (maybe 80486—it's too long ago to be sure off the top of my head).
The x86 architecture is a strange beast. Each register, though promoted as being "general purpose" by Intel, has its quirks (cx/ecx is tied to the loop instruction for example, and eax:edx is tied to the multiply instruction). That combined with the peculiar ways to optimize execution to avoid cache misses and pipeline stalls often leads to inscrutable generated code by a code generator which factors all that in.

Resources