I'm currently solving problem 3.3 from the 3rd edition of Computer Systems: A Programmer's Perspective, and I'm having a hard time understanding what these errors mean...
movb $0xF, (%ebx) gives an error because ebx can't be used as address register
movl %rax, (%rsp) and
movb %si, 8(%rbp) give an error saying that there's a mismatch between instruction suffix and register I.D.
movl %eax, %rdx gives an error saying that destination operand incorrect size
Why can't we use ebx as an address register? Is it because it's a 32-bit register? Would the line work if it was movb $0xF, (%rbx) instead, since rbx is a 64-bit register?
For the error regarding the mismatch between instruction suffix and register I.D., does this error appear because it should've been movq %rax, (%rsp) and movew %si, 8(%rbp) instead of movl %rax, (%rsp) and movb %si, 8(%rbp)?
And lastly, for the error regarding "destination operand incorrect size", is this because the destination register was 64-bit instead of 32-bit? So if the line of code was movl %eax, %edx instead, the error wouldn't have occurred?
any enlightenment would be appreciated.
this is for x86-64
movb $0xF, (%ebx) gives an error because ebx can't be used as address register
It's true that ebx can't be used as an address register (for x86-64), but rbx can. ebx is the lower 32bits of rbx. The whole point of 64bit code is that addresses can be 64bits, so trying to reference memory by using a 32bit register makes little sense.
movl %rax, (%rsp) and movb %si, 8(%rbp) give an error saying that
there's a mismatch between instruction suffix and register I.D.
Yes, because you are using movl, the 'l' means long, which (in this context) means 32bits. However, rax is a 64bit register. If you want to write 64bits out of rax, you should use movq. If you want to write 32bits, you should use eax.
movl %eax, %rdx gives an error saying that destination operand incorrect size
You are trying to move a 32-bit value into a 64-bit register. There are instructions that do this conversion for you (see movslq, or cltq for the EAX-to-RAX case), but movl isn't one of them.
movb $0xF, (%ebx) assembles just fine (with a 0x67 address-size prefix), and executes correctly if the address in ebx is valid.
It might be a bug (and e.g. lead to a segfault from truncating a pointer), or sub-optimal, but if your book makes any stronger claim than that (like that it won't assemble) then your book contains an error.
The only reason you'd ever use that instead of movb $0xF, (%rbx) is if the upper bytes of %rbx potentially held garbage, e.g. in the x32 ABI (ILP32 in long mode), or if you're a dumb compiler that always uses address-size prefixes when targeting 32-bit-pointer mode even when addresses are known to be safely zero-extended.
32-bit address size is actually useful for the x32 ABI for the more common case where an index register holds high garbage, e.g. movl $0x12345, (%edi, %esi,4).
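As a rough C model of what a 32-bit address size does in 64-bit mode, the sketch below forms an address from only the low 32 bits of the base and index, and wraps the result to 32 bits. The function name effective_address32 is made up for illustration; it models the arithmetic only, not a real memory access.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of base + index*scale with a 32-bit address size:
   the high halves of both registers are ignored and the sum wraps to
   32 bits before being zero-extended to a 64-bit address. */
uint64_t effective_address32(uint64_t base, uint64_t index, unsigned scale) {
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    uint32_t ea = (uint32_t)base + (uint32_t)index * scale;
    return (uint64_t)ea;
}
```

With garbage in the upper halves, for example base 0xDEADBEEF00001000 and index 0xFFFFFFFF00000004 with scale 4, the model yields 0x1010, which is why this addressing form is safe for x32-style 32-bit pointers.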
gcc -mx32 could easily emit a movb $0xF, (%ebx) instruction in real life. (Note that -mx32 (32-bit pointers in long mode) is different from -m32 (i386 ABI))
int ext(); // can't inline
void foo(char *p) {
ext(); // clobbers arg-passing registers
*p = 0xf; // so gcc needs to save the arg for after the call
}
Compiles with gcc7.3 -mx32 -O3 on the Godbolt compiler explorer into
foo(char*):
pushq %rbx # rbx is gcc's first choice of call-preserved reg.
movq %rdi, %rbx # stupid gcc copies the whole 64 bits when only the low 32 are useful
call ext()
movb $15, (%ebx) # $15 = $0xF
popq %rbx
ret
mov %edi, %ebx would have been better; IDK why gcc wants to copy the whole 64-bit register when it's treating pointers as 32-bit values. The x32 ABI unfortunately never really caught on on x86, so I guess nobody's put in the time to get gcc to generate great code for it.
AArch64 also has an ILP32 ABI to save memory / cache-footprint on pointer data, so maybe gcc will get better at 32-bit pointers in 64-bit mode in general (benefiting x86-64 as well) if any work for AArch64 ILP32 improves the common cross-architecture parts of this.
so if the line of code was movl %eax, %edx instead, the error wouldn't have occurred?
Right, that would zero-extend EAX into RDX. If you wanted to sign-extend EAX into RDX, use movslq %eax, %rdx (aka Intel-syntax movsxd)
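The difference between the two extensions can be sketched in C. The helper names are invented for this example, and the signed conversion assumes the usual two's-complement behaviour of gcc and clang.

```c
#include <stdint.h>

/* Like movl %eax, %edx: writing a 32-bit register clears the upper half. */
uint64_t zero_extend32(uint32_t x) {
    return (uint64_t)x;
}

/* Like movslq %eax, %rdx: the sign bit (bit 31) is copied into bits 32..63.
   Assumes the uint32_t -> int32_t conversion wraps, as on gcc/clang. */
uint64_t sign_extend32(uint32_t x) {
    return (uint64_t)(int64_t)(int32_t)x;
}
```

For the input 0x80000000 the first helper returns 0x0000000080000000 and the second returns 0xFFFFFFFF80000000.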
(Almost) all x86 instructions require all their operands to be the same size. (In terms of operand-size; many instructions have a form with an 8-bit or 32-bit immediate that's sign extended to 64-bit or whatever the instruction's operand-size is. e.g. add $1, %eax will use the 3-byte add imm8, r/m32 form.)
Exceptions include shl %cl, %eax, and movzx/movsx.
In AT&T syntax, the sizes of registers have to match the operand-size suffix, if you use one. If you don't, the registers imply an operand-size. e.g. mov %eax, %edx is the same as movl.
Memory + immediate instructions with no register source or destination need an explicit size: add $1, (%rdx) won't assemble because the operand-size is ambiguous, but add %eax, (%rdx) is an addl (32-bit operand-size).
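Why the assembler refuses to guess here can be illustrated with a C sketch: the same "add 1 to memory" gives different results depending on the operand size. The function names are hypothetical, and the byte version assumes a little-endian machine such as x86-64.

```c
#include <stdint.h>
#include <string.h>

/* Like addb $1, (%rdx): only the lowest-addressed byte changes,
   and a carry out of that byte is not propagated. */
uint32_t add_one_byte(uint32_t mem) {
    unsigned char bytes[4];
    memcpy(bytes, &mem, sizeof mem);
    bytes[0] = (unsigned char)(bytes[0] + 1);
    memcpy(&mem, bytes, sizeof mem);
    return mem;
}

/* Like addl $1, (%rdx): all four bytes take part, so carries propagate. */
uint32_t add_one_dword(uint32_t mem) {
    return mem + 1;
}
```

Starting from 0x000000FF, the byte-sized add leaves 0x00000000 in memory while the dword-sized add leaves 0x00000100.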
movew %si, 8(%rbp)
No, movw %si, 8(%rbp) would work though :P But note that if you've made a traditional stack frame with push %rbp / mov %rsp, %rbp on function entry, that store to 8(%rbp) will overwrite the low 16 bits of your return address on the stack.
But there's no requirement in x86-64 code for Windows or Linux that you have %rbp pointing there, or holding a valid pointer at all. It's just a call-preserved register like %rbx that you can use for whatever you want as long as you restore the caller's value before returning.
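The partial overwrite of a return address mentioned above can be modelled in C. This is only a sketch: store16_into_slot is a made-up name, the "slot" stands in for the 8 bytes at 8(%rbp), and a little-endian machine is assumed.

```c
#include <stdint.h>
#include <string.h>

/* Models movw %si, 8(%rbp) when that stack slot holds a 64-bit value:
   a 2-byte store replaces only the two lowest-addressed bytes, which on
   a little-endian machine are the low 16 bits of the value. */
uint64_t store16_into_slot(uint64_t slot, uint16_t value) {
    memcpy(&slot, &value, sizeof value);
    return slot;
}
```

A slot holding 0x0000000000401136 becomes 0x000000000040BEEF after storing 0xBEEF: the upper six bytes survive, so the result is a plausible-looking but wrong address.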
Related
I have written an assembly program to display the factorial of a number using AT&T syntax, but it's not working. Here is my code:
.text
.globl _start
_start:
movq $5,%rcx
movq $5,%rax
Repeat: #function to calculate factorial
decq %rcx
cmp $0,%rcx
je print
imul %rcx,%rax
cmp $1,%rcx
jne Repeat
# Now result of factorial stored in rax
print:
xorq %rsi, %rsi
# function to print integer result digit by digit by pushing in
#stack
loop:
movq $0, %rdx
movq $10, %rbx
divq %rbx
addq $48, %rdx
pushq %rdx
incq %rsi
cmpq $0, %rax
jz next
jmp loop
next:
cmpq $0, %rsi
jz bye
popq %rcx
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $4, %rsp
jmp next
bye:
movq $1,%rax
movq $0, %rbx
int $0x80
.data
num : .byte 5
This program prints nothing. I also used gdb to step through it: it works fine until the loop label, but once it reaches next, seemingly random values start appearing in various registers. Please help me debug it so that it prints the factorial.
As @ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
Here's how to print an integer in x86-64 Linux, the simple and somewhat-efficient (see footnote 1) way, using the same repeated division / modulo by 10.
System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.
But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
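The buffering idea can be sketched in C before looking at any assembly: fill a small local buffer from the end, then hand the whole string to one write() call. The name print_uint64_c and the fd parameter are choices made for this example, and a POSIX system is assumed.

```c
#include <stdint.h>
#include <unistd.h>   /* write(), ssize_t */

/* Builds "<digits>\n" backwards in a local buffer and emits it with a
   single write(). Returns whatever write() returns. */
ssize_t print_uint64_c(int fd, uint64_t value) {
    char buf[21];                      /* at most 20 digits plus the newline */
    char *end = buf + sizeof buf;
    char *p = end;
    *--p = '\n';
    do {                               /* runs at least once, so 0 prints "0" */
        *--p = (char)('0' + value % 10);
        value /= 10;
    } while (value != 0);
    return write(fd, p, (size_t)(end - p));
}
```

Calling print_uint64_c(1, 0) writes the two bytes "0\n" to standard output; a C compiler will normally keep buf on the stack, which matches the role the red zone plays in hand-written assembly.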
Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
// building with gcc foo.S will use CPP before GAS so we can use headers
#include <asm/unistd.h> // This is a standard Linux / glibc header file
// includes unistd_64.h or unistd_32.h depending on current mode
// Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.
.p2align 4
.globl print_uint64 # void print_uint64(uint64_t value)
print_uint64:
lea -1(%rsp), %rsi # We use the 128B red-zone as a buffer to hold the string
# a 64-bit integer is at most 20 digits long in base 10, so it fits.
movb $'\n', (%rsi) # store the trailing newline byte. (Right below the return address).
# If you need a null-terminated string, leave an extra byte of room and store '\n\0'. Or push $'\n'
mov $10, %ecx # same as mov $10, %rcx but 2 bytes shorter
# note that newline (\n) has ASCII code 10, so we could actually have stored the newline with movb %cl, (%rsi) to save code size.
mov %rdi, %rax # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit: # do{
xor %edx, %edx
div %rcx # rax = rdx:rax / 10. rdx = remainder
# store digits in MSD-first printing order, working backwards from the end of the string
add $'0', %edx # integer to ASCII. %dl would work, too, since we know this is 0-9
dec %rsi
mov %dl, (%rsi) # *--p = (value%10) + '0';
test %rax, %rax
jnz .Ltoascii_digit # } while(value != 0)
# If we used a loop-counter to print a fixed number of digits, we would get leading zeros
# The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0
# Then print the whole string with one system call
mov $__NR_write, %eax # call number from asm/unistd_64.h
mov $1, %edi # fd=1
# %rsi = start of the buffer
mov %rsp, %rdx
sub %rsi, %rdx # length = one_past_end - start
syscall # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
# rax = return value (or -errno)
# rcx and r11 = garbage (destroyed by syscall/sysret)
# all other registers = unmodified (saved/restored by the kernel)
# we don't need to restore any registers, and we didn't modify RSP.
ret
To test this function, I put this in the same file to call it and exit:
.p2align 4
.globl _start
_start:
mov $10120123425329922, %rdi
# mov $0, %edi # Yes, it does work with input = 0
call print_uint64
xor %edi, %edi
mov $__NR_exit, %eax
syscall # sys_exit(0)
I built this into a static binary (with no libc):
$ gcc -Wall -static -nostdlib print-integer.S && ./a.out
10120123425329922
$ strace ./a.out > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18) = 18
exit(0) = ?
+++ exited with 0 +++
$ file ./a.out
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped
Footnote 1: See Why does GCC use multiplication by a strange number in implementing integer division? for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.
It uses div like you do (because that's smaller than using a fast multiplicative inverse). It uses loop for the outer loop (over multiple integers for extended precision), again for code-size at the cost of speed.
It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.
Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.
See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):
void itoa_end(unsigned long val, char *p_end) {
const unsigned base = 10;
do {
*--p_end = (val % base) + '0';
val /= base;
} while(val);
// write(1, p_end, orig-current);
}
I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).
It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
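A possible shape for the divide-by-100 idea, written as C so a compiler can pick the multiplicative inverses. This is an untuned sketch rather than a measured implementation, and utoa_end_by100 is a name invented here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Writes the decimal digits of val backwards from p_end, two digits per
   division of the full-width value, and returns a pointer to the first
   digit. The caller must provide at least 20 bytes before p_end. */
char *utoa_end_by100(uint64_t val, char *p_end) {
    assert(p_end != NULL);
    while (val >= 100) {
        unsigned rem = (unsigned)(val % 100);   /* 0..99, independent of the next step */
        val /= 100;
        *--p_end = (char)('0' + rem % 10);      /* low digit of the pair */
        *--p_end = (char)('0' + rem / 10);      /* high digit of the pair */
    }
    if (val >= 10) {                            /* two digits left */
        *--p_end = (char)('0' + val % 10);
        val /= 10;
    }
    *--p_end = (char)('0' + val);               /* final (or only) digit */
    return p_end;
}
```

The long dependency chain is now val /= 100, and splitting each 0-99 remainder into two characters is short work that can overlap with the next division.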
Related:
NASM version of this answer, for x86-64 or i386 Linux How do I print an integer in Assembly Level Programming without printf from the c library?
How to convert a binary integer number to a hex string? - Base 16 is a power of 2, conversion is much simpler and doesn't require div.
Several things:
0) I guess this is 64b linux environment, but you should have stated so (if it is not, some of my points will be invalid)
1) int 0x80 is 32b call, but you are using 64b registers, so you should use syscall (and different arguments)
2) int 0x80, eax=4 requires the ecx to contain address of memory, where the content is stored, while you give it the ASCII character in ecx = illegal memory access (the first call should return error, i.e. eax is negative value). Or using strace <your binary> should reveal the wrong arguments + error returned.
3) why addq $4, %rsp? Makes no sense to me, you are damaging rsp, so the next pop rcx will pop wrong value, and in the end you will run way "up" into the stack.
... maybe some more, I didn't debug it, this list is just by reading the source (so I may be even wrong about something, although that would be rare).
BTW your code is working. It just doesn't do what you expected. But it works fine, precisely as the CPU is designed and precisely as you wrote it. Whether that achieves what you wanted, or makes sense, is a different topic, but don't blame the HW or assembler.
... I can do a quick guess how the routine may be fixed (just partial hack-fix, still needs rewrite for syscall under 64b linux):
next:
cmpq $0, %rsi
jz bye
movq %rsp,%rcx # make rcx point to the stack memory (with the stored char)
# this will work if you are lucky enough that rsp fits into 32b
# if it is beyond the 4GiB logical address, then you have bad luck (syscall needed)
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $8, %rsp # now rsp += 8 is needed, because there's no POP
jmp next
Again, I didn't try this myself and am just writing it from memory, so let me know how it changes the situation.
I'm still uncertain how registers are being used by the assembler
say I have a program:
int main(int rdi, int rsi, int rdx) {
rdx = rdi;
return 0;
}
Would this in assembly be translated into:
movq %rdx, %rdi
ret rax;
I'm new to AT&T and have hard time predicting when a certain register will be used.
Looking at this chart from Computer Systems - A programmer's perspective, third edition, R.E. Bryant and D. R. O'Hallaron:
[image: chart not reproduced]
Is it certain in which register arguments and variables are stored?
Only at entry and exit of a function.
There is no guarantee as to what registers will be used within a function, even for variables which are parameters to the function. Compilers can (and often will) move variables around between registers to optimize register/stack usage, especially on register-starved architectures like x86.
In this case, a simple assignment operation like rdx = rdi may not compile to any assembly code at all, because the compiler will simply recognize that both values can now be found in the register %rdi. Even for a more complex operation like rdx = rdi + 1, the compiler has the freedom to store the value in any register, not specifically in %rdx. (It may even store the value back to %rdi, e.g. inc %rdi, if it recognizes that the original value is never used afterwards.)
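A small C example of the point above, assuming the x86-64 System V calling convention. The parameter names only echo the question; they do not tie a variable to a register, and the function name is hypothetical.

```c
/* a arrives in %rdi, b in %rsi, c in %rdx, and the result leaves in %rax.
   What happens in between is up to the compiler. */
long copy_first(long a, long b, long c) {
    (void)b;       /* unused */
    c = a;         /* no write to %rdx is required for this assignment */
    return c;      /* an optimizing compiler can emit just: mov %rdi, %rax ; ret */
}
```

The assignment c = a typically disappears entirely: the compiler only has to make sure the returned value ends up in %rax.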
No, it would be translated into:
mov %rdi, %rdx # move %rdi into %rdx
xor %eax, %eax # zero return value
ret # return
Of course, it's more than likely that rdx = rdi (and therefore mov %rdi, %rdx) will be removed by the compiler, because rdx is not used again.
Credit to @Jester for finding this out before me.
I viewed the disassembly of my C code and found that a pointer to a function actually points to a jmp instruction rather than to the real start of the function in memory (it doesn't point to the push ebp instruction that marks the start of the function's frame).
I have the followed function (that does basically nothing, it's just an example):
int func2(int a, int b)
{
return 1;
}
I tried to print the address of the function- printf("%p", &func2);
I looked at the disassembly of my code and found that the printed address is the address of the jmp instruction in the assembly code. I would like to get the address that represents the start of the function's frame. Is there any way to calculate it from the given address of the jmp instruction?
Moreover, I have the bytes that represents the jmp instruction.
011A11EF E9 CC 08 00 00 jmp func2 (011A1AC0h)
How can I get the address that represents the start of function's frame in memory (011A1AC0h in that case), only from the address of the jmp instruction and from the bytes that represents the jmp instruction itself? I read some information about that, and I found out that it is relative jmp, which means that I need to add the value that jmp holds to the address of the jmp instruction itself. Not sure if that's a good direction for the solution, and if it is, how can I get the value that jmp holds?
0xE9 is the Intel 64 and IA-32 opcode for a jmp instruction with a rel32 offset. The next four bytes contain the offset. Your disassembler shows them as “CC 08 00 00”, but this is reversed (little-endian byte order); the offset is 0x000008CC, which is 2252 in decimal. The offset is a signed 32-bit value that is added to the EIP register to obtain the address of the jump target. The EIP contains the address of the next instruction to be executed.
So, in this specific case, take the address of the byte just beyond the jump instruction and add the 32-bit offset.
However:
I count 11 forms of jmp instruction in Intel 64 and IA-32 manual. Who knows what the compiler may use when you make a slight change to source or compiler switches and recompile? You would need to be prepared to decode any form of the jmp instruction, or perhaps other instructions the compiler might use.
Intel has some legacy segment features in its architecture. The code segment on your system might be one big thing so you do not have to worry about that, but I cannot provide assurance.
Your compiler might have used this jmp instruction as a convenient way to create a value for the pointer rather than using the routine’s entry point (the proper term for the instruction where function execution normally begins, not frame) because it makes the linker do the relocation work instead of requiring the compiler to insert instructions to do that work at run-time (specifically, at the time the function address must be evaluated so it can be assigned to the pointer). This is somewhat of a guess, but the compiler might do something else next time. You are treading significantly outside normal computing.
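For the one specific case described above (a 5-byte E9 rel32 jump in 32-bit code), the calculation can be sketched in C. The function name is invented for this example, and it deliberately handles no other jmp encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Decodes a 5-byte "E9 rel32" near jump and returns its target address.
   insn_addr is the address of the E9 opcode byte itself. */
uint32_t jmp_rel32_target(uint32_t insn_addr, const uint8_t insn[5]) {
    assert(insn[0] == 0xE9);                  /* only this form is handled */
    uint32_t rel = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;   /* little-endian displacement */
    /* Target = address of the next instruction + displacement. Unsigned
       wrap-around makes negative displacements come out right as well. */
    return insn_addr + 5u + rel;
}
```

With the bytes from the question (E9 CC 08 00 00 at 0x011A11EF) this gives 0x011A11EF + 5 + 0x8CC = 0x011A1AC0, matching the address the disassembler shows for func2.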
I'm not sure I understand your question, but take this sample:
#include <stdio.h>
int foo(int x)
{
return x+1;
}
int main(int argc, char** argv)
{
printf("foo = %p\n", foo);
return 0;
}
Which produces the following disassembly:
foo(int):
pushq %rbp
movq %rsp, %rbp
movl %edi, -4(%rbp)
movl -4(%rbp), %eax
addl $1, %eax
popq %rbp
ret
.LC0:
.string "foo = %p\n"
main:
pushq %rbp
movq %rsp, %rbp
subq $16, %rsp
movl %edi, -4(%rbp)
movq %rsi, -16(%rbp)
movl $foo(int), %esi # pass the label argument (2) to printf
movl $.LC0, %edi # pass the format argument (1) to printf
movl $0, %eax
call printf
movl $0, %eax
leave
ret
As you can see, only the label is passed to printf. This label is resolved to an address by the assembler and linker.
Also note that it will be hard to get an absolute address from a running binary: ASLR (Address Space Layout Randomization) will choose a random base address for the binary. The offsets inside the binary still hold, hence the relative calls.
On X86 machines E9 is the opcode for JMP rel16/32. So the cpu is going to use the value 0x000008CC as jump offset. The base address is the address of the instruction following the JMP instruction.
I'm trying to figure out how to convert this x86 assembly code to Y86 form:
Given the c program:
int sum(int x) {
if (x == 0 || x ==1) {
return 1;
} else {
return x + sum(x-1);
}
}
The following x86-64 assembly code is generated:
sum:
cmpl $1, %edi
ja .L8
movl $1, %eax
ret
.L8:
pushq %rbx
movl %edi, %ebx
leal -1(%rdi), %edi
call sum
addl %ebx, %eax
popq %rbx
ret
How can I convert this to Y86-64 assembly code that does the same thing?
Thank you!
In this case, you can convert by replacing each instruction with a short sequence of y86 instructions which does exactly the same thing.
y86 is Turing complete, but very crippled, so in general you can't always easily convert. Some single x86 instructions might need an entire loop or very long function to implement, but that's not the case for any of your instructions. Each of them can be transliterated to one or a few y86 instructions. (Some might need a scratch register; I forget if y86 has compare with immediate or only mov-immediate to register.)
Your code doesn't have any multiplies, shifts, or bsf, or floating-point, or anything else that y86 doesn't have (and would need a loop to emulate).
Look up each x86 instruction in the instruction-set reference manual (like this online version, or this older one where not having AVX/AVX2 instructions means less to wade through. See also the x86 tag wiki for links to Intel and AMD's PDF manuals.) Look at the Operation section where pseudo-code describes the exact effect of the instruction on the architectural state. That's the behaviour you want to implement using y86 instructions.
As an example, I forget if y86 has push / pop, but if not you can always manipulate rsp directly and load/store. e.g. irmovq $8, %r10 ; subq %r10, %rsp ; rmmovq %rbx, (%rsp) does the job of push (except that it clobbers flags and a scratch register, where x86's push doesn't).
I was looking at the Compiler output for a C program, just for academic purposes and happened to get the following output.
.file "test.c"
.section .rodata
.LC0:
.string "Hello World"
.text
.globl main
.type main, @function
main:
.LFB0:
.cfi_startproc
pushq %rbp
.cfi_def_cfa_offset 16
.cfi_offset 6, -16
movq %rsp, %rbp
.cfi_def_cfa_register 6
movl $.LC0, %edi
movl $0, %eax
call printf
movl $0, %eax
popq %rbp
.cfi_def_cfa 7, 8
ret
.cfi_endproc
.LFE0:
.size main, .-main
.ident "GCC: (Ubuntu 4.8.2-19ubuntu1) 4.8.2"
.section .note.GNU-stack,"",@progbits
I understand the parts where the base pointer and stack pointer operations take place, and the other operations. What I wanted to know is the purpose of
movl $.LC0, %edi
How does loading the address of the text "Hello World" from the data block into the destination register serve the purpose? We could have just loaded that address in the accumulator and let printf handle it. I am not used to programming in assembly, but I can make out what the program is doing. Am I missing something obvious here?
Google searches showed that these registers are used for string operations, but none said why.
First of all, your call on printf may be passing arguments by registers and not by stack because it was optimised in that way, or because its attributes during compilation were set to __fastcall (MSVC) or __attribute__((fastcall)).
The %esi and %edi registers are used in string operations because of their meaning to string instructions, such as cmps, lods, movs, scas, stos, outs or ins. These instructions use the destination and source registers for quick sequential access to a string of bytes/words/doublewords. They can be used in loops for simple operations that are performed repeatedly over memory, and in combination with the repeat prefixes they can shorten execution time by removing the need for pointer manipulation and limit checking.
A very good example of this is the movs instruction (which also has the forms movsb, movsw and movsd). If you wanted to write a simple string copy procedure without string instructions, you would write something like this:
; IN: EAX=source&, EBX=dest&, ECX=count
; OUT: nothing
copy:
.loop:
cmp ecx, 0
jz .end
dec ecx
mov dl, byte [eax+ecx] ; use dl as scratch: writing al would corrupt the source pointer in eax
mov byte [ebx+ecx], dl
jmp .loop
.end:
ret
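The movsb instruction copies the byte at [esi] to [edi] and then increments esi and edi (when the direction flag is clear). On its own it does not touch ecx; only the rep prefix adds the count-and-decrement behaviour. With this in mind you can write something similar to this:
; IN: ESI=source&, EDI=dest&, ECX=count
; OUT: nothing
copy:
.loop:
jecxz .end
movsb
dec ecx
jmp .loop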
.end:
ret
Using the rep prefix, which repeats movsb and decrements ecx until it reaches zero, you can shorten the whole operation further:
; IN: ESI=source&, EDI=dest&, ECX=count
; OUT: nothing
copy:
rep movsb
ret
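For readers more comfortable with C, here is a rough model of what rep movsb does when the direction flag is clear. It is a sketch of the semantics only; copy_bytes is a made-up name and real code would simply call memcpy.

```c
#include <stddef.h>

/* Rough C model of "rep movsb" with DF = 0:
   src plays the role of ESI, dst of EDI and count of ECX. */
void copy_bytes(const unsigned char *src, unsigned char *dst, size_t count) {
    while (count != 0) {   /* rep: stop once the counter reaches zero */
        *dst++ = *src++;   /* movsb: copy one byte, advance both pointers */
        count--;           /* rep: decrement the counter */
    }
}
```

A count of zero copies nothing, just as rep movsb does nothing when ECX is 0 on entry.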
I am going to say yes and no to user35443's answer.
I wanted to know what is the use of putting
movl $.LC0, %edi
Since you are using 64bit Linux (from the use of rbp), in 64 bit land, parameters are passed in registers. rdi contains the first parameter, rsi the second, rdx 3rd, rcx 4th, r8 5th, r9 the 6th parameter; any more parameters are passed on the stack.
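To make the register assignment concrete, here is a C function with seven integer parameters, annotated with where each one arrives under the x86-64 System V convention used on Linux. The function sum7 is purely illustrative.

```c
/* Integer arguments under the x86-64 System V calling convention. */
long sum7(long a,   /* %rdi */
          long b,   /* %rsi */
          long c,   /* %rdx */
          long d,   /* %rcx */
          long e,   /* %r8  */
          long f,   /* %r9  */
          long g)   /* seventh integer argument: passed on the stack */
{
    return a + b + c + d + e + f + g;   /* result is returned in %rax */
}
```

This is why the address of the format string has to be in %rdi (here %edi, since the address fits in 32 bits) before call printf: it is the first argument.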
we could have just loaded that address in the accumulator and let
printf handle it
No! When using Assembly, it is up to you to read and understand the ABI for the OS you are using and follow it to the T! If you were using Windows, the first parameter would be in rcx instead. It has nothing to do with the source nor destination.
The "accumulator" is actually an input to printf, and to all varargs functions really: %al holds (an upper bound on) the number of xmm registers used to pass floating-point arguments. Since no floats are passed in your example code, eax is set to 0.