I'm writing a compiler for a class, and no one in the class could figure out exactly why we couldn't do the straightforward thing:
cmpq %r13, %r10
movq $0, %r10
cmovne $1, %r10
My best guess is that since cmovXX doesn't explicitly specify the size of its operands the way movq or movl do, $1 doesn't know how big to be and therefore throws a type-mismatch tantrum.
My question is, how does one force an integer constant to be a quadword? $1q didn't work, so I'm out of guesses.
Thanks!
Not really. cmov is simply not available with an immediate operand (neither Intel nor AMD created such an encoding of this particular instruction). It operates only on registers and memory locations.
Forcing a particular operand size in AT&T syntax is done by appending one of the size suffixes to the instruction's mnemonic - just the way you have done it - not by decorating the immediate.
The only instruction in the x86-64 instruction set that can accept a quadword (64-bit) immediate is the mov instruction with a 64-bit register. However, doing movq $0, %rax will give you the ordinary encoding with a 32-bit immediate. In order to force the assembler to emit a 64-bit immediate, you have to use movabs $0, %rax.
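A common workaround is to materialize the constant in a spare register first and then use a register-to-register cmov. A sketch, assuming %r11 happens to be free here:

# workaround sketch: put the immediate in a scratch register first
cmpq %r13, %r10
movq $0, %r10
movq $1, %r11        # %r11 used as a scratch register (assumption)
cmovne %r11, %r10    # register source is allowed, so this assembles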
I am learning assembly and reading "Computer Systems: A Programmer's Perspective". In Practice Problem 3.3, it says movl %eax,%rdx will generate an error. The answer key says movl %eax,%dx - "Destination operand incorrect size". I am not sure if this is a typo or not, but my question is: is movl %eax,%rdx a legal instruction? I think it is moving the 32 bits in %eax with zero extension to %rdx, which will not be generated as movzql since
"an instruction generating a 4-byte value with a register as the destination will fill the upper 4 bytes with zeros" (from the book).
I tried to write some C code to generate it, but I always get movslq %eax, %rdx (GCC 4.8.5, -Og). I am completely confused.
The GNU assembler doesn't accept movl %eax,%rdx. It also doesn't make sense for the encoding, since mov must have a single operand size (using a prefix byte if needed), not two different sized operands.
The effect you want is achieved by movl %eax, %edx since writes to a 32-bit register always zero-extend into the corresponding 64-bit register. See Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?.
movzlq %eax, %rdx might make logical sense, but it's not supported since it would be redundant.
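A minimal sketch of what that looks like in practice (the register choice is just for illustration):

# writing a 32-bit register zero-extends into the full 64-bit register,
# so this has the effect the hypothetical movzlq would have had:
movl %eax, %edx    # EDX = EAX, and the upper 32 bits of RDX become 0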
I am looking at some old code from a school project, and in trying to compile it on my laptop I ran into some problems. It was originally written for an old 32 bit version of gcc. Anyway I was trying to convert some of the assembly over to 64 bit compatible code and hit a few snags.
Here is the original code:
pusha
pushl %ds
pushl %es
pushl %fs
pushl %gs
pushl %ss
pusha is not valid in 64 bit mode. So what would be the proper way to do this in x86_64 assembly while in 64 bit mode?
There has got to be a reason why pusha is not valid in 64 bit mode, so I have a feeling manually pushing all the registers may not be a good idea.
AMD needed some opcode room for the REX prefixes and some other new instructions when they developed the 64-bit x86 extensions, so they changed the meaning of some existing opcodes.
Several of those instructions were simply short forms of existing instructions or were otherwise not necessary. PUSHA was one of the victims. It's not clear why they banned PUSHA, though; it doesn't seem to overlap any new instruction opcodes. Perhaps they reserved the PUSHA and POPA opcodes for future use, since the instructions are completely redundant, wouldn't be any faster, and don't occur frequently enough in code to matter.
The order of PUSHA was the order of the register encoding: eax, ecx, edx, ebx, esp, ebp, esi, edi. Note that it redundantly pushed esp - you already need to know esp to find the data it pushed!
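For reference, this is roughly what that expands to in 32-bit code (a sketch; the real PUSHA pushes the value esp had before the instruction started, so a literal manual expansion of that slot would need a temporary):

# rough expansion of 32-bit PUSHA
pushl %eax
pushl %ecx
pushl %edx
pushl %ebx
pushl %esp    # PUSHA itself stores the pre-PUSHA value of ESP here
pushl %ebp
pushl %esi
pushl %edi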
If you are converting code to 64-bit, the PUSHA code is no good anyway; you need to update it to push the new registers r8 through r15. You also need to save and restore a much larger SSE state, xmm8 through xmm15 - assuming you are going to clobber them.
If the interrupt handler code is simply a stub that forwards to C code, you don't need to save all of the registers. You can assume that the C compiler will preserve rbx, rbp, rsi, rdi, and r12 through r15, so you should only need to save and restore rax, rcx, rdx, and r8 through r11. (Note: on Linux or other System V ABI platforms, the compiler preserves rbx, rbp, and r12-r15, and you should expect rsi and rdi to be clobbered.)
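As a rough illustration, here is a minimal sketch of such a stub for the System V case; the C handler name is hypothetical, and error codes, stack alignment, and FPU/SSE state are ignored:

# sketch only: save the registers a System V C compiler may clobber,
# call a hypothetical C handler, restore them, and return
isr_stub:
push %rax
push %rcx
push %rdx
push %rsi
push %rdi
push %r8
push %r9
push %r10
push %r11
call my_c_handler    # hypothetical C function
pop %r11
pop %r10
pop %r9
pop %r8
pop %rdi
pop %rsi
pop %rdx
pop %rcx
pop %rax
iretq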
The segment registers hold no value in long mode (if the interrupted thread is running in 32-bit compatibility mode you must preserve the segment registers, thanks ughoavgfhw). Actually, they got rid of most of the segmentation in long mode, but FS is still reserved for operating systems to use as a base address for thread local data. The register value itself doesn't matter, the base of FS and GS are set through MSRs 0xC0000100 and 0xC0000101. Assuming you won't be using FS you don't need to worry about it, just remember that any thread local data accessed by the C code could be using any random thread's TLS. Be careful of that because C runtime libraries use TLS for some functionality (example: strtok typically uses TLS).
Loading a value into FS or GS (even in user mode) will overwrite the FSBASE or GSBASE MSR. Since some operating systems use GS as "processor local" storage (they need a way to have a pointer to a structure for each CPU), they need to keep it somewhere that won't get clobbered by loading GS in user mode.
To solve this problem, there are two MSRs reserved for the GSBASE register: one active and one hidden. In kernel mode, the kernel's GSBASE is held in the usual GSBASE MSR and the user mode base is in the other (hidden) GSBASE MSR. When context switching from kernel mode to a user mode context, and when saving a user mode context and entering kernel mode, the context switch code must execute the SWAPGS instruction, which swaps the values of the visible and hidden GSBASE MSRs. Since the kernel's GSBASE is safely hidden in the other MSR in user mode, the user mode code can't clobber the kernel's GSBASE by loading a value into GS. When the CPU reenters kernel mode, the context save code will execute SWAPGS and restore the kernel's GSBASE.
Learn from existing code that does this kind of thing. For example:
Linux (search for SAVE_ARGS_IRQ): entry_64.S
OpenSolaris (search for INTR_PUSH): privregs.h
FreeBSD (search for IDT_VEC): exception.S (similar is vector.S in NetBSD)
In fact, "manually pushing" the regs is the only way on AMD64 since PUSHA doesn't exist there. AMD64 isn't unique in this aspect - most non-x86 CPUs do require register-by-register saves/restores as well at some point.
But if you inspect the referenced source code closely, you'll find that not all interrupt handlers need to save/restore the entire register set, so there is room for optimization.
pusha is not valid in 64-bit mode because it is redundant. Pushing each register individually is exactly the thing to do.
Hi, it might not be the correct way to do it, but one can create macros like:
.macro pushaq
push %rax
push %rcx
push %rdx
push %rbx
push %rbp
push %rsi
push %rdi
.endm # pushaq
and
.macro popaq
pop %rdi
pop %rsi
pop %rbp
pop %rbx
pop %rdx
pop %rcx
pop %rax
.endm # popaq
and optionally add the other r8-r15 registers if one needs to; a sketch of that extended version follows below.
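A sketch of the extended version (the macro names are just illustrative, and rsp, rflags, and SSE state are still not saved):

.macro pushaq64
push %rax
push %rcx
push %rdx
push %rbx
push %rbp
push %rsi
push %rdi
push %r8
push %r9
push %r10
push %r11
push %r12
push %r13
push %r14
push %r15
.endm # pushaq64

.macro popaq64
pop %r15
pop %r14
pop %r13
pop %r12
pop %r11
pop %r10
pop %r9
pop %r8
pop %rdi
pop %rsi
pop %rbp
pop %rbx
pop %rdx
pop %rcx
pop %rax
.endm # popaq64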
So I'm trying to learn a little bit of assembly, because I need it for Computer Architecture class. I wrote a few programs, like printing the Fibonacci sequence.
I noticed that whenever I write a function I use these 3 lines (as I learned from comparing the assembly code generated by gcc with its C equivalent):
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
I have 2 questions about it:
First of all, why do I need to use %rbp? Isn't it simpler to use %rsp, as its contents are moved to %rbp on the 2nd line?
Why do I have to subtract anything from %rsp? I mean, it's not always 16; when I was printf-ing 7 or 8 variables, I would subtract 24 or 28 instead.
I use Manjaro 64 bit on a Virtual Machine (4 GB RAM), Intel 64 bit processor
rbp is the frame pointer on x86_64. In your generated code, it gets a snapshot of the stack pointer (rsp) so that when adjustments are made to rsp (i.e. reserving space for local variables or pushing values on to the stack), local variables and function parameters are still accessible from a constant offset from rbp.
A lot of compilers offer frame pointer omission as an optimization option; this will make the generated assembly code access variables relative to rsp instead and free up rbp as another general purpose register for use in functions.
In the case of GCC, which I'm guessing you're using from the AT&T assembler syntax, that switch is -fomit-frame-pointer. Try compiling your code with that switch and see what assembly code you get. You will probably notice that when accessing values relative to rsp instead of rbp, the offset from the pointer varies throughout the function.
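A small illustration of the difference (the offsets are made up for the example):

# with a frame pointer: a local sits at a fixed offset from %rbp
movl -4(%rbp), %eax     # load a local variable

# without a frame pointer: the same local is addressed relative to %rsp,
# and the offset changes whenever %rsp moves (e.g. after a push)
movl 12(%rsp), %eax     # load the local (offset valid only here)
push %rdi
movl 20(%rsp), %eax     # same local, different offset after the push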
Linux uses the System V ABI for the x86-64 (AMD64) architecture; see System V ABI at OSDev Wiki for details.
This means the stack grows down; smaller addresses are "higher up" in the stack. Typical C functions are compiled to
pushq %rbp        # Save address of previous stack frame
movq %rsp, %rbp   # Address of current stack frame
subq $16, %rsp    # Reserve 16 bytes for local variables
# ... function body ...
movq %rbp, %rsp   # \ equivalent to the
popq %rbp         # / 'leave' instruction
ret
The amount of memory reserved for the local variables is always a multiple of 16 bytes, to keep the stack aligned to 16 bytes. If no stack space is needed for local variables, there is no subq $16, %rsp or similar instruction.
(Note that the return address and the previous %rbp pushed to the stack are both 8 bytes in size, 16 bytes in total.)
While %rbp points to the current stack frame, %rsp points to the top of the stack. Because the compiler knows the difference between %rbp and %rsp at any point within the function, it is free to use either one as the base for the local variables.
A stack frame is just the local function's playground: the region of stack the current function uses.
Current versions of GCC disable the stack frame whenever optimizations are used. This makes sense, because for programs written in C, the stack frames are most useful for debugging, but not much else. (You can use e.g. -O2 -fno-omit-frame-pointer to keep stack frames while enabling optimizations otherwise, however.)
Although the same ABI applies to all binaries, no matter what language they are written in, certain other languages do need stack frames for "unwinding" (for example, to "throw exceptions" to an ancestor caller of the current function); i.e. to "unwind" stack frames so that one or more functions can be aborted and control passed to some ancestor function, without leaving unneeded stuff on the stack.
When stack frames are omitted (-fomit-frame-pointer for GCC), the function implementation essentially changes to
subq $8, %rsp     # Re-align the stack, and
                  # reserve memory for local variables
# ... function body ...
addq $8, %rsp
ret
Because there is no stack frame (%rbp is used for other purposes, and its value is never pushed to stack), each function call pushes only the return address to the stack, which is an 8-byte quantity, so we need to subtract 8 from %rsp to keep it a multiple of 16. (In general, the value subtracted from and added to %rsp is an odd multiple of 8.)
Function parameters are typically passed in registers. See the ABI link at the beginning of this answer for details, but in short, integral types and pointers are passed in registers %rdi, %rsi, %rdx, %rcx, %r8, and %r9, with floating-point arguments in the %xmm0 to %xmm7 registers.
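For example, a leaf function taking two integer arguments could look like this sketch, following the register convention above (the function name is made up):

# int add2(int a, int b): a arrives in %edi, b in %esi,
# and the result is returned in %eax
add2:
movl %edi, %eax
addl %esi, %eax
ret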
In some cases you'll see rep ret instead of ret. Don't be confused: rep ret means the exact same thing as ret; the rep prefix, although normally used with string instructions (repeated instructions), does nothing when applied to the ret instruction. It's just that certain AMD processors' branch predictors don't like jumping to a ret instruction, and the recommended workaround is to use a rep ret there instead.
Finally, I've omitted the red zone above the top of the stack (the 128 bytes at addresses less than %rsp). This is because it is not really useful for typical functions: In the normal have-stack-frame case, you'll want your local stuff to be within the stack frame, to make debugging possible. In the omit-stack-frame case, stack alignment requirements already mean we need to subtract 8 from %rsp, so including the memory needed by the local variables in that subtraction costs nothing.
I'm trying to understand some assembly code with AT&T syntax.
Here is a snippet:
"mov %eax, %ebx; "\
"mov %eax, %ecx;"\
"fxch %st(1);"\
This is what I understood from it.
the mov copies (am I correct, or does it move?) the data from the source register to the destination register
In line one: we copy the data from register eax to ebx.
Similarly, we copy the data from register eax to ecx.
However, what I failed to understand is the following.
How does fxch work? Here is a link that gives an example.
fxch st(2)
fsqrt
fxch st(2)
It says that this above code takes the sqrt of st(2).
Correct me if I am wrong.
It swaps the top of the stack with st(2) and then takes the sqrt of what?
I don't understand that clearly.
Can you please help me out? How does that work in my case and in the above case?
mov instructions indeed copy a value and fsqrt takes the square root of the top of the stack and replaces the top of the stack with its result. So the given code sequence effectively takes the square root of st(2) and puts it back at the same place.
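Annotated step by step, in AT&T syntax, with A, B, C standing for whatever happens to be in st(0), st(1), st(2):

# stack before: st(0)=A, st(1)=B, st(2)=C
fxch %st(2)    # swap st(0) and st(2): st(0)=C, st(1)=B, st(2)=A
fsqrt          # replace st(0) with its square root: st(0)=sqrt(C)
fxch %st(2)    # swap back: st(0)=A, st(1)=B, st(2)=sqrt(C)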
In answer to your question below. The two mov instructions copy the value in register %eax to %ebx and %ecx. So if you add another mov %eax,%edx, then this value (from %eax) is also copied to %edx.
Note that this holds for AT&T assembly. In Intel assembly the values are copied the other way around; in that case, %eax would, quite uselessly, be changed repeatedly to contain the value of the other registers.
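For instance, the same copy written both ways:

mov %eax, %ebx    # AT&T syntax: source first, so EBX = EAX
# the Intel-syntax spelling of the same operation is: mov ebx, eax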
The fxch st(1) exchanges the top of the stack, st(0), with the element just below the top, st(1). Similarly, st(2) is just below st(1). Contrary to the integer registers, the floating point registers on the x86 are organized as a stack, reducing the instruction length of operations on those floating point registers as they always work on the top element(s) of the stack. This comes with the overhead of having to use fxch instructions to put the right values on the top of the stack.
The integer registers %eax, %ebx etc. are distinct from the floating point stack/registers st(0), st(1) etc. So the mov instructions are not related to the fxch instructions. The order of these instructions could be changed without affecting the result.
Consider these two functions using SSE:
#include <xmmintrin.h>

int ftrunc1(float f) {
    return _mm_cvttss_si32(_mm_set1_ps(f));
}

int ftrunc2(float f) {
    return _mm_cvttss_si32(_mm_set_ss(f));
}
Both are exactly the same in behaviour for any input. But the assembler output is different:
ftrunc1:
pushl %ebp
movl %esp, %ebp
cvttss2si 8(%ebp), %eax
leave
ret
ftrunc2:
pushl %ebp
movl %esp, %ebp
movss 8(%ebp), %xmm0
cvttss2si %xmm0, %eax
leave
ret
That is, ftrunc2 uses one extra movss instruction!
Is this normal? Does it matter? Should _mm_set1_ps always be preferred over _mm_set_ss when you only need to set the bottom element?
Compiler used was GCC 4.5.2 with -O3 -msse.
_mm_set_ss maps directly to an assembly instruction (movss). But _mm_set1_ps does not.
From what I've seen on GCC, MSVC, and ICC:
SSE intrinsics that map one-to-one to an assembly instruction are generally treated "as-is" - as a black box. So the compiler will only perform optimizations that apply to the instruction as a whole; it will not attempt any optimizations that require dataflow/dependency analysis on the individual vector elements.
The _mm_set1_ps and _mm_set_ps intrinsics do not map to a single instruction and have special case handling by most compilers. From what I've seen, all three of the compilers I've listed above do attempt to perform dataflow analysis optimizations on the individual elements.
When you put it all together, the second example leaves the movss because the compiler doesn't realize that the top 3 elements don't matter. (It makes no attempt to "open up" the _mm_set_ss intrinsic.)
You're running into a quirk of the peephole optimizer. For some reason, in the first case it figures out that it can fold the mov into the cvttss2si, and in the second case it fails. The question is, does it matter? The extra move instruction is almost free -- it takes up an extra 4 bytes in the instruction stream and an extra decode slot, but both sequences require the same number of execution slots and the same number of load/store slots (which is what usually matters). The only potential sticking point is the 4 extra bytes of ifetch -- but since ftrunc1 uses 10 bytes and ftrunc2 uses 14, both will fit in a single cache line, so you won't see any difference.
For minimizing that overhead, I'd be far more concerned about the unneeded %ebp cruft (are you compiling with -fno-omit-frame-pointer? I thought -O3 included -fomit-frame-pointer by default). You'll do even better by inlining this function, which will likely completely change what the peephole optimizer sees, and so may make it work better in either case (or even reverse the cases where it works better) -- there's no way to tell without compiling larger programs and looking at the assembly code.
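For instance, with the frame pointer omitted you would plausibly get something like this for ftrunc1 (a sketch, not verified output from that compiler version):

ftrunc1:
cvttss2si 4(%esp), %eax    # argument is at 4(%esp) once the %ebp setup is gone
ret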
Bottom line, there's unlikely to be any measurable speed difference between the two...