usage of fxch - assembly code - AT&T syntax - c

I'm trying to understand some assembly code with AT&T syntax.
Here is a snippet:
"mov %eax, %ebx; "\
"mov %eax, %ecx;"\
"fxch %st(1);"\
This is what I understood from it.
the mov copies (Am I correct?, or does it move?) the data from the source register to the destination register
In line one: we copy the data from registry eax to ebx.
Similarly, we copy the data from registry eax to ecx.
However, what I failed to understand is the following.
How does fxch work? Here is a link that gives an example.
fxch st(2)
fsqrt
fxch st(2)
It says that this above code takes the sqrt of st(2).
Correct me if I am wrong.
It swaps the top of the stack with st(2) and then takes the sqrt of what?
I don't understand that clearly.
Can you please help me out? How does that work in my case and in the above case?

mov instructions indeed copy a value and fsqrt takes the square root of the top of the stack and replaces the top of the stack with its result. So the given code sequence effectively takes the square root of st(2) and puts it back at the same place.
In answer to your question below. The two mov instructions copy the value in register %eax to %ebx and %ecx. So if you add another mov %eax,%edx, then this value (from %eax) is also copied to %edx.
Note that this holds for AT&T assembly. In Intel assembly the values are copied the other way around. In that case %eax was, quite uselessly, changed repeatedly to contain the value of the other registers.
The fxch st(1) exchanges the top of the stack, which is st(0) with the element just below the top st(1). Similarly st(2) is just below st(1). Contrary to the integer registers, the floating point registers on the x86 are organized in a stack, reducing the instruction length of operations on those floating point registers as they always work on the top element(s) of the stack. This comes with the overhead of having to use fxch instructions to put the right values on the top of the stack.
The integer registers %eax, %ebx etc. are distinct from the floating point stack/registers st(0), st(1) etc. So the mov instructions are not related to the fxch instructions. The order of these instructions could be changed without effecting the result.

Related

"movl" instruction in assembly

I am learning assembly and reading "Computer Systems: A programmer's perspective". In Practice Problem 3.3, it says movl %eax,%rdx will generate an error. The answer keys says movl %eax,%dx Destination operand incorrect size. I am not sure if this is a typo or not, but my question is: is movl %eax,%rdx a legal instruction? I think it is moving the 32 bits in %eax with zero extension to %rdx, which will not be generated as movzql since
an instruction generating a 4-byte value with a register as the destination will fill the upper 4 bytes with zeros` (from the book).
I tried to write some C code to generate it, but I always get movslq %eax, %rdx(GCC 4.8.5 -Og). I am completely confused.
The GNU assembler doesn't accept movl %eax,%rdx. It also doesn't make sense for the encoding, since mov must have a single operand size (using a prefix byte if needed), not two different sized operands.
The effect you want is achieved by movl %eax, %edx since writes to a 32-bit register always zero-extend into the corresponding 64-bit register. See Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?.
movzlq %eax, %rdx might make logical sense, but it's not supported since it would be redundant.

What is the lea instruction doing in this piece of code? [duplicate]

I was trying to understand how Address Computation Instruction works, especially with leaq command. Then I get confused when I see examples using leaq to do arithmetic computation. For example, the following C code,
long m12(long x) {
return x*12;
}
In assembly,
leaq (%rdi, %rdi, 2), %rax
salq $2, $rax
If my understanding is right, leaq should move whatever address (%rdi, %rdi, 2), which should be 2*%rdi+%rdi, evaluate to into %rax. What I get confused is since value x is stored in %rdi, which is just memory address, why does times %rdi by 3 then left shift this memory address by 2 is equal to x times 12? Isn't that when we times %rdi by 3, we jump to another memory address which does not hold value x?
lea (see Intel's instruction-set manual entry) is a shift-and-add instruction that uses memory-operand syntax and machine encoding. This explains the name, but it's not the only thing it's good for. It never actually accesses memory, so it's like using & in C.
See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
In C, it's like uintptr_t foo = (uintptr_t) &arr[idx]. Note the & to give you arr + idx (scaling for the object size of arr since this is C not asm). In C, this would be abuse of the language syntax and types, but in x86 assembly pointers and integers are the same thing. Everything is just bytes, and it's up to the program put instructions in the right order to get useful results.
Effective address is a technical term in x86: it means the "offset" part of a seg:off logical address, especially when a base_reg + index*scale + displacement calculation was needed. e.g. the rax + (rcx<<2) in a %gs:(%rax,%rcx,4) addressing mode. (But EA still applies to %rdi for stosb, or the absolute displacement for movabs load/store, or other cases without a ModRM addr mode). Its use in this context doesn't mean it must be a valid / useful memory address, it's telling you that the calculation doesn't involve the segment base so it's not calculating a linear address. (Adding the seg base would make it unusable for actual address math in a non-flat memory model.)
The original designer / architect of 8086's instruction set (Stephen Morse) might or might not have had pointer math in mind as the main use-case, but modern compilers think of it as just another option for doing arithmetic on pointers / integers, and so should humans.
(Note that 16-bit addressing modes don't include shifts, just [BP|BX] + [SI|DI] + disp8/disp16, so LEA wasn't as useful for non-pointer math before 386. See this Q&A for more about 32/64-bit addressing modes, although that answer uses Intel syntax like [rax + rdi*4] instead of the AT&T syntax used in this question. x86 machine code is the same regardless of what syntax you use to create it.)
Maybe the 8086 architects did simply want to expose the address-calculation hardware for arbitrary uses because they could do it without using a lot of extra transistors. The decoder already has to be able to decode addressing modes, and other parts of the CPU have to be able to do address calculations. Putting the result in a register instead of using it with a segment-register value for memory access doesn't take many extra transistors. Ross Ridge confirms that LEA on original 8086 reuses the CPUs effective-address decoding and calculation hardware.
Note that most modern CPUs run LEA on the same ALUs as normal add and shift instructions. They have dedicated AGUs (address-generation units), but only use them for actual memory operands. In-order Atom is one exception; LEA runs earlier in the pipeline than the ALUs: inputs have to be ready sooner, but outputs are also ready sooner. Out-of-order execution CPUs (all modern x86) don't want LEA to interfere with actual loads/stores so they run it on an ALU.
lea has good latency and throughput, but not as good throughput as add or mov r32, imm32 on most CPUs, so only use lea when you can save an instructions with it instead of add. (See Agner Fog's x86 microarch guide and asm optimization manual and https://uops.info/)
Ice Lake improved on that for Intel, now able to run LEA on all four ALU ports.
Rules for which kinds of LEA are "complex", running on fewer of the ports that can handle it, vary by microarchitecture. e.g. 3-component (two + operations) is the slower case on SnB-family, having a scaled index is the lower-throughput case on Ice Lake. Alder Lake E-cores (Gracemont) are 4/clock, but 1/clock when there's an index at all, and 2-cycle latency when there's an index and displacement (whether or not there's a base reg). Zen is slower when there's a scaled index or 3 components. (2c latency and 2/clock down from 1c and 4/clock).
The internal implementation is irrelevant, but it's a safe bet that decoding the operands to LEA shares transistors with decoding addressing modes for any other instruction. (So there is hardware reuse / sharing even on modern CPUs that don't execute lea on an AGU.) Any other way of exposing a multi-input shift-and-add instruction would have taken a special encoding for the operands.
So 386 got a shift-and-add ALU instruction for "free" when it extended the addressing modes to include scaled-index, and being able to use any register in an addressing mode made LEA much easier to use for non-pointers, too.
x86-64 got cheap access to the program counter (instead of needing to read what call pushed) "for free" via LEA because it added the RIP-relative addressing mode, making access to static data significantly cheaper in x86-64 position-independent code than in 32-bit PIC. (RIP-relative does need special support in the ALUs that handle LEA, as well as the separate AGUs that handle actual load/store addresses. But no new instruction was needed.)
It's just as good for arbitrary arithmetic as for pointers, so it's a mistake to think of it as being intended for pointers these days. It's not an "abuse" or "trick" to use it for non-pointers, because everything's an integer in assembly language. It has lower throughput than add, but it's cheap enough to use almost all the time when it saves even one instruction. But it can save up to three instructions:
;; Intel syntax.
lea eax, [rdi + rsi*4 - 8] ; 3 cycle latency on Intel SnB-family
; 2-component LEA is only 1c latency
;;; without LEA:
mov eax, esi ; maybe 0 cycle latency, otherwise 1
shl eax, 2 ; 1 cycle latency
add eax, edi ; 1 cycle latency
sub eax, 8 ; 1 cycle latency
On some AMD CPUs, even a complex LEA is only 2 cycle latency, but the 4-instruction sequence would be 4 cycle latency from esi being ready to the final eax being ready. Either way, this saves 3 uops for the front-end to decode and issue, and that take up space in the reorder buffer all the way until retirement.
lea has several major benefits, especially in 32/64-bit code where addressing modes can use any register and can shift:
non-destructive: output in a register that isn't one of the inputs. It's sometimes useful as just a copy-and-add like lea 1(%rdi), %eax or lea (%rdx, %rbp), %ecx.
can do 3 or 4 operations in one instruction (see above).
Math without modifying EFLAGS, can be handy after a test before a cmovcc. Or maybe in an add-with-carry loop on CPUs with partial-flag stalls.
x86-64: position independent code can use a RIP-relative LEA to get a pointer to static data.
7-byte lea foo(%rip), %rdi is slightly larger and slower than mov $foo, %edi (5 bytes), so prefer mov r32, imm32 in position-dependent code on OSes where symbols are in the low 32 bits of virtual address space, like Linux. You may need to disable the default PIE setting in gcc to use this.
In 32-bit code, mov edi, OFFSET symbol is similarly shorter and faster than lea edi, [symbol]. (Leave out the OFFSET in NASM syntax.) RIP-relative isn't available and addresses fit in a 32-bit immediate, so there's no reason to consider lea instead of mov r32, imm32 if you need to get static symbol addresses into registers.
Other than RIP-relative LEA in x86-64 mode, all of these apply equally to calculating pointers vs. calculating non-pointer integer add / shifts.
See also the x86 <!--> tag wiki for assembly guides / manuals, and performance info.
Operand-size vs. address-size for x86-64 lea
See also Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?. 64-bit address size and 32-bit operand size is the most compact encoding (no extra prefixes), so prefer lea (%rdx, %rbp), %ecx when possible instead of 64-bit lea (%rdx, %rbp), %rcx or 32-bit lea (%edx, %ebp), %ecx.
x86-64 lea (%edx, %ebp), %ecx is always a waste of an address-size prefix vs. lea (%rdx, %rbp), %ecx, but 64-bit address / operand size is obviously required for doing 64-bit math. (Agner Fog's objconv disassembler even warns about useless address-size prefixes on LEA with a 32-bit operand-size.)
Except maybe on Ryzen, where Agner Fog reports that 32-bit operand size lea in 64-bit mode has an extra cycle of latency. I don't know if overriding the address-size to 32-bit can speed up LEA in 64-bit mode if you need it to truncate to 32-bit.
This question is a near-duplicate of the very-highly-voted What's the purpose of the LEA instruction?, but most of the answers explain it in terms of address calculation on actual pointer data. That's only one use.
leaq doesn't have to operate on memory addresses, and it computes an address, it doesn't actually read from the result, so until a mov or the like tries to use it, it's just an esoteric way to add one number, plus 1, 2, 4 or 8 times another number (or the same number in this case). It's frequently "abused"† for mathematical purposes, as you see. 2*%rdi+%rdi is just 3 * %rdi, so it's computing x * 3 without involving the multiplier unit on the CPU.
Similarly, left shifting, for integers, doubles the value for every bit shifted (every zero added to the right), thanks to the way binary numbers work (the same way in decimal numbers, adding zeroes on the right multiplies by 10).
So this is abusing the leaq instruction to accomplish multiplication by 3, then shifting the result to achieve a further multiplication by 4, for a final result of multiplying by 12 without ever actually using a multiply instruction (which it presumably believes would run more slowly, and for all I know it could be right; second-guessing the compiler is usually a losing game).
†: To be clear, it's not abuse in the sense of misuse, just using it in a way that doesn't clearly align with the implied purpose you'd expect from its name. It's 100% okay to use it this way.
LEA is for calculating the address. It doesn't dereference the memory address
It should be much more readable in Intel syntax
m12(long):
lea rax, [rdi+rdi*2]
sal rax, 2
ret
So the first line is equivalent to rax = rdi*3
Then the left shift is to multiply rax by 4, which results in rdi*3*4 = rdi*12

Using LEA on values that aren't addresses / pointers?

I was trying to understand how Address Computation Instruction works, especially with leaq command. Then I get confused when I see examples using leaq to do arithmetic computation. For example, the following C code,
long m12(long x) {
return x*12;
}
In assembly,
leaq (%rdi, %rdi, 2), %rax
salq $2, $rax
If my understanding is right, leaq should move whatever address (%rdi, %rdi, 2), which should be 2*%rdi+%rdi, evaluate to into %rax. What I get confused is since value x is stored in %rdi, which is just memory address, why does times %rdi by 3 then left shift this memory address by 2 is equal to x times 12? Isn't that when we times %rdi by 3, we jump to another memory address which does not hold value x?
lea (see Intel's instruction-set manual entry) is a shift-and-add instruction that uses memory-operand syntax and machine encoding. This explains the name, but it's not the only thing it's good for. It never actually accesses memory, so it's like using & in C.
See for example How to multiply a register by 37 using only 2 consecutive leal instructions in x86?
In C, it's like uintptr_t foo = (uintptr_t) &arr[idx]. Note the & to give you arr + idx (scaling for the object size of arr since this is C not asm). In C, this would be abuse of the language syntax and types, but in x86 assembly pointers and integers are the same thing. Everything is just bytes, and it's up to the program put instructions in the right order to get useful results.
Effective address is a technical term in x86: it means the "offset" part of a seg:off logical address, especially when a base_reg + index*scale + displacement calculation was needed. e.g. the rax + (rcx<<2) in a %gs:(%rax,%rcx,4) addressing mode. (But EA still applies to %rdi for stosb, or the absolute displacement for movabs load/store, or other cases without a ModRM addr mode). Its use in this context doesn't mean it must be a valid / useful memory address, it's telling you that the calculation doesn't involve the segment base so it's not calculating a linear address. (Adding the seg base would make it unusable for actual address math in a non-flat memory model.)
The original designer / architect of 8086's instruction set (Stephen Morse) might or might not have had pointer math in mind as the main use-case, but modern compilers think of it as just another option for doing arithmetic on pointers / integers, and so should humans.
(Note that 16-bit addressing modes don't include shifts, just [BP|BX] + [SI|DI] + disp8/disp16, so LEA wasn't as useful for non-pointer math before 386. See this Q&A for more about 32/64-bit addressing modes, although that answer uses Intel syntax like [rax + rdi*4] instead of the AT&T syntax used in this question. x86 machine code is the same regardless of what syntax you use to create it.)
Maybe the 8086 architects did simply want to expose the address-calculation hardware for arbitrary uses because they could do it without using a lot of extra transistors. The decoder already has to be able to decode addressing modes, and other parts of the CPU have to be able to do address calculations. Putting the result in a register instead of using it with a segment-register value for memory access doesn't take many extra transistors. Ross Ridge confirms that LEA on original 8086 reuses the CPUs effective-address decoding and calculation hardware.
Note that most modern CPUs run LEA on the same ALUs as normal add and shift instructions. They have dedicated AGUs (address-generation units), but only use them for actual memory operands. In-order Atom is one exception; LEA runs earlier in the pipeline than the ALUs: inputs have to be ready sooner, but outputs are also ready sooner. Out-of-order execution CPUs (all modern x86) don't want LEA to interfere with actual loads/stores so they run it on an ALU.
lea has good latency and throughput, but not as good throughput as add or mov r32, imm32 on most CPUs, so only use lea when you can save an instructions with it instead of add. (See Agner Fog's x86 microarch guide and asm optimization manual and https://uops.info/)
Ice Lake improved on that for Intel, now able to run LEA on all four ALU ports.
Rules for which kinds of LEA are "complex", running on fewer of the ports that can handle it, vary by microarchitecture. e.g. 3-component (two + operations) is the slower case on SnB-family, having a scaled index is the lower-throughput case on Ice Lake. Alder Lake E-cores (Gracemont) are 4/clock, but 1/clock when there's an index at all, and 2-cycle latency when there's an index and displacement (whether or not there's a base reg). Zen is slower when there's a scaled index or 3 components. (2c latency and 2/clock down from 1c and 4/clock).
The internal implementation is irrelevant, but it's a safe bet that decoding the operands to LEA shares transistors with decoding addressing modes for any other instruction. (So there is hardware reuse / sharing even on modern CPUs that don't execute lea on an AGU.) Any other way of exposing a multi-input shift-and-add instruction would have taken a special encoding for the operands.
So 386 got a shift-and-add ALU instruction for "free" when it extended the addressing modes to include scaled-index, and being able to use any register in an addressing mode made LEA much easier to use for non-pointers, too.
x86-64 got cheap access to the program counter (instead of needing to read what call pushed) "for free" via LEA because it added the RIP-relative addressing mode, making access to static data significantly cheaper in x86-64 position-independent code than in 32-bit PIC. (RIP-relative does need special support in the ALUs that handle LEA, as well as the separate AGUs that handle actual load/store addresses. But no new instruction was needed.)
It's just as good for arbitrary arithmetic as for pointers, so it's a mistake to think of it as being intended for pointers these days. It's not an "abuse" or "trick" to use it for non-pointers, because everything's an integer in assembly language. It has lower throughput than add, but it's cheap enough to use almost all the time when it saves even one instruction. But it can save up to three instructions:
;; Intel syntax.
lea eax, [rdi + rsi*4 - 8] ; 3 cycle latency on Intel SnB-family
; 2-component LEA is only 1c latency
;;; without LEA:
mov eax, esi ; maybe 0 cycle latency, otherwise 1
shl eax, 2 ; 1 cycle latency
add eax, edi ; 1 cycle latency
sub eax, 8 ; 1 cycle latency
On some AMD CPUs, even a complex LEA is only 2 cycle latency, but the 4-instruction sequence would be 4 cycle latency from esi being ready to the final eax being ready. Either way, this saves 3 uops for the front-end to decode and issue, and that take up space in the reorder buffer all the way until retirement.
lea has several major benefits, especially in 32/64-bit code where addressing modes can use any register and can shift:
non-destructive: output in a register that isn't one of the inputs. It's sometimes useful as just a copy-and-add like lea 1(%rdi), %eax or lea (%rdx, %rbp), %ecx.
can do 3 or 4 operations in one instruction (see above).
Math without modifying EFLAGS, can be handy after a test before a cmovcc. Or maybe in an add-with-carry loop on CPUs with partial-flag stalls.
x86-64: position independent code can use a RIP-relative LEA to get a pointer to static data.
7-byte lea foo(%rip), %rdi is slightly larger and slower than mov $foo, %edi (5 bytes), so prefer mov r32, imm32 in position-dependent code on OSes where symbols are in the low 32 bits of virtual address space, like Linux. You may need to disable the default PIE setting in gcc to use this.
In 32-bit code, mov edi, OFFSET symbol is similarly shorter and faster than lea edi, [symbol]. (Leave out the OFFSET in NASM syntax.) RIP-relative isn't available and addresses fit in a 32-bit immediate, so there's no reason to consider lea instead of mov r32, imm32 if you need to get static symbol addresses into registers.
Other than RIP-relative LEA in x86-64 mode, all of these apply equally to calculating pointers vs. calculating non-pointer integer add / shifts.
See also the x86 <!--> tag wiki for assembly guides / manuals, and performance info.
Operand-size vs. address-size for x86-64 lea
See also Which 2's complement integer operations can be used without zeroing high bits in the inputs, if only the low part of the result is wanted?. 64-bit address size and 32-bit operand size is the most compact encoding (no extra prefixes), so prefer lea (%rdx, %rbp), %ecx when possible instead of 64-bit lea (%rdx, %rbp), %rcx or 32-bit lea (%edx, %ebp), %ecx.
x86-64 lea (%edx, %ebp), %ecx is always a waste of an address-size prefix vs. lea (%rdx, %rbp), %ecx, but 64-bit address / operand size is obviously required for doing 64-bit math. (Agner Fog's objconv disassembler even warns about useless address-size prefixes on LEA with a 32-bit operand-size.)
Except maybe on Ryzen, where Agner Fog reports that 32-bit operand size lea in 64-bit mode has an extra cycle of latency. I don't know if overriding the address-size to 32-bit can speed up LEA in 64-bit mode if you need it to truncate to 32-bit.
This question is a near-duplicate of the very-highly-voted What's the purpose of the LEA instruction?, but most of the answers explain it in terms of address calculation on actual pointer data. That's only one use.
leaq doesn't have to operate on memory addresses, and it computes an address, it doesn't actually read from the result, so until a mov or the like tries to use it, it's just an esoteric way to add one number, plus 1, 2, 4 or 8 times another number (or the same number in this case). It's frequently "abused"† for mathematical purposes, as you see. 2*%rdi+%rdi is just 3 * %rdi, so it's computing x * 3 without involving the multiplier unit on the CPU.
Similarly, left shifting, for integers, doubles the value for every bit shifted (every zero added to the right), thanks to the way binary numbers work (the same way in decimal numbers, adding zeroes on the right multiplies by 10).
So this is abusing the leaq instruction to accomplish multiplication by 3, then shifting the result to achieve a further multiplication by 4, for a final result of multiplying by 12 without ever actually using a multiply instruction (which it presumably believes would run more slowly, and for all I know it could be right; second-guessing the compiler is usually a losing game).
†: To be clear, it's not abuse in the sense of misuse, just using it in a way that doesn't clearly align with the implied purpose you'd expect from its name. It's 100% okay to use it this way.
LEA is for calculating the address. It doesn't dereference the memory address
It should be much more readable in Intel syntax
m12(long):
lea rax, [rdi+rdi*2]
sal rax, 2
ret
So the first line is equivalent to rax = rdi*3
Then the left shift is to multiply rax by 4, which results in rdi*3*4 = rdi*12

Assembly: push vs movl

I have some C code that I compiled with gcc:
int main() {
int x = 1;
printf("%d\n",x);
return 0;
}
I've run it through gdb 7.9.1 and come up with this assembler code for main:
0x0000000100000f40 <+0>: push %rbp # save original frame pointer
0x0000000100000f41 <+1>: mov %rsp,%rbp # stack pointer is new frame pointer
0x0000000100000f44 <+4>: sub $0x10,%rsp # make room for vars
0x0000000100000f48 <+8>: lea 0x47(%rip),%rdi # 0x100000f96
0x0000000100000f4f <+15>: movl $0x0,-0x4(%rbp) # put 0 on the stack
0x0000000100000f56 <+22>: movl $0x1,-0x8(%rbp) # put 1 on the stack
0x0000000100000f5d <+29>: mov -0x8(%rbp),%esi
0x0000000100000f60 <+32>: mov $0x0,%al
0x0000000100000f62 <+34>: callq 0x100000f74
0x0000000100000f67 <+39>: xor %esi,%esi # set %esi to 0
0x0000000100000f69 <+41>: mov %eax,-0xc(%rbp)
0x0000000100000f6c <+44>: mov %esi,%eax
0x0000000100000f6e <+46>: add $0x10,%rsp # move stack pointer to original location
0x0000000100000f72 <+50>: pop %rbp # reclaim original frame pointer
0x0000000100000f73 <+51>: retq
As I understand it, push %rbb pushes the frame pointer onto the stack, so we can retrieve it later with pop %rbp. Then, sub $0x10,%rsp clears 10 bytes of room on the stack so we can put stuff on it.
Later interactions with the stack move variables directly into the stack via memory addressing, rather than pushing them onto the stack:
movl $0x0, -0x4(%rbp)
movl $0x1, -0x8(%rbp)
Why does the compiler use movl rather than push to get this information onto the stack?
Does referencing the register after the memory address also put that value into that register?
It is very common for modern compilers to move the stack pointer once at the beginning of a function, and move it back at the end. This allows for more efficient indexing because it can treat the memory space as a memory mapped region rather than a simple stack. For example, values which are suddenly found to be of no use (perhaps due to an optimized shortcutted operator) can be ignored, rather than forcing one to pop them off the stack.
Perhaps in simpler days, there was a performance reason to use push. With modern processors, there is no advantage, so there's no reason to make special cases in the compiler to use push/pop when possible. It's not like compiler-written assembly code is readable!
While Cort is correct, there is another important reason for this practice of apparently allocating space on the stack. According to the ABI, function calls must find the stack 16 byte aligned. Rather than fiddling with the stack every single time a call needs to be made from a function, it is generally easier and more efficient to adjust the stack for proper alignment first and then modify the values that might otherwise have been pushed onto it.
So, the stack is absolutely adjusted for local variable space, but it is also adjusted to provide correct stack alignment for calls into the standard library.
I'm not an authority on assemblers or compilers but I've played around with MASM back in the day and did spend a whole bunch of time with WinDbg while debugging production C++ issues.
I think the answer to your question is because it's easier.
push/pop instructions write to and read from the stack but they also modify the stack as they are processed. C/C++ compiler uses stack for all its local variables. It does it by shifting stack pointer by the exact number of bytes that is needed to hold all local variables and it does so right when you enter the function.
After that reading and writing all those variables can be done from anywhere in the function and also as many times as you want by simply using the mov instructions. If you look at pure assembly, you might question why create a hole in the stack just to copy two values into that space using mov when you could have done two push instructions.
But look at it from compiler author perspective. The process of entering a function, and allocating stack for local variables is coded separately and is completely decoupled from the process of reading/writing those variables.

How do I use GDB Debugger to look at __asm__ content?

I'm trying to understand what is happening in this code, specifically within __asm__. How do I step through the assembly code, so I can print each variable and what not?
Specifically, I am trying to step through this to figure out what does the 8() mean and to see how does it know that it is going into array at index 2.
/* a[2] = 99 in assembly */
__asm__("\n\
movl $_a, %eax\n\
movl $99, 8(%eax)\n\
");
The stepi command steps through the assembly one instruction at a time. There is also a nexti for stepping over function calls. These commands do not adhere to the 'type only the unique prefix of a command is enough' rule that works for most commands -- partially because they the next and step commands are entirely prefixes of these commands and partially because these are not used too often and when they are they are typically used by someone who knows that they really want to use them.
info registers displays a lot of the register contents.
You'll also want to view the disassembly with the disassemble command.
More info on all of these commands is available with the help command, for instance:
(gdb) help info registers
tells you that info registers displays the integer registers and their contents, but it also tells you that if you supply a register name it will limit output to that register's value:
(gdb) info registers rax
rax 0x0 0
(rax is the x86_64 version of eax)
The first column is the register name, the second is the hex value, and the third is the integer value.
There is useful help for the disassemble command as well.
Remember that gdb has tab completion for many commands, and this can be used for more than just simple commands, though many times it offers you bad suggestions -- it's sometimes helpful, though.
Including a label within your inline assembly will allow you to easily make a break point at the beginning of it.
I was never any good at AT&T syntax, but I'm pretty sure the 8(%eax) part means "the address 8 bytes after the address stored in EAX", that is, it's the offset relative to the address stored in the register.
Approximate equivalent in Intel syntax would be something like this (off the top of my head, so it's entirely possible that there is some minor mistake here...)
mov eax, a
mov DWORD PTR [eax+8], 99
movl $_a, %eax // load the memory address of a into %eax
movl $99, 8(%eax) // jump 8 bytes and store number 99 (which is a[2])
It seems to me that a is an int array (int has 4 bytes in most platforms). So by increment 4 bytes you'll be accessing the next item of the array. Other examples of assigning values to this array would be:
movl $10, (%eax) // store number 10 on the the first position: a[0]
movl $20, 4(%eax) // jump 4 bytes from the address loaded in %eax
// and store number 20 on the next position (a[1])

Resources