Assembler / GAS / Linux x86_64 - error while reading a file - gnu-assembler

I am writing a simple program in assembler on Linux x86_64 (GAS syntax). I have to read a number that is encoded in binary and saved in a text file. My text file "data.txt" is in the same directory as my source file, and below is the most important fragment of my code:
SYS_WRITE = 4
EXIT_SUCCESS = 0
SYS_READ = 3
SYS_OPEN = 5
.data
BIN_LEN = 24
.comm BIN, BIN_LEN
BIN: .space BIN_LEN, 0
.text
PATH: .ascii "data.txt\0"
.global _start
_start:
mov $SYS_OPEN, %eax # open
mov $PATH, %ebx # path
mov $0, %ecx # read only
mov $0666, %edx # mode
int $0x80 # call (open file)
mov $SYS_READ, %eax # reading
mov $3, %ebx # descriptor
mov $BIN, %ecx # buffer
mov $BIN_LEN, %edx # buffer size
int $0x80 # call (read line from file)
After calling the second syscall, the %eax register should contain the number of read bytes.
In my file "data.txt" I have "10101", but when I debug my program with gdb, it shows that there is -11 in %eax, so some kind of error occurred. But I am sure that "10101" was loaded into the buffer (BIN), because when I display the buffer contents, the number from the file is there. I need the number of bytes read for the rest of the algorithm. I have no idea why %eax contains an error code instead of the number of bytes loaded into the buffer. I wonder if it may be connected with making the syscall with 32-bit registers, but in all other cases it works properly.
Please, help me.

I entered your code and compiled it on my x64 running fedora 20 using the as and ld 32 bit options to assemble and link it, and it ran perfectly, placing 0x18 into the %eax reg after syscall. If you solved the problem I would like to know what caused it and how you fixed it.
cheers
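For comparison with the assembly in the question, here is a minimal C sketch of the same open/read sequence. It uses the descriptor that open() actually returned instead of assuming it is 3, and treats a negative result as an error. The helper name read_prefix is made up for this illustration, and the thread does not establish that the hard-coded descriptor is what caused the -11.

```c
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: read up to len bytes from the start of path.
 * Returns the number of bytes read, or -1 if open() or read() failed. */
ssize_t read_prefix(const char *path, char *buf, size_t len)
{
    int fd = open(path, O_RDONLY);   /* a mode argument only matters with O_CREAT */
    if (fd < 0)
        return -1;
    ssize_t n = read(fd, buf, len);  /* use the returned fd, not a guessed 3 */
    close(fd);
    return n;
}
```

In the assembly version, the equivalent step would be to copy %eax into %ebx after the first int $0x80 and to check it for a negative value before issuing the read.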

Related

For GNU Assembly x64 AT&T syntax: How to add 2 quad numbers? [duplicate]

I have written an assembly program in AT&T syntax to display the factorial of a number, but it's not working. Here is my code:
.text
.globl _start
_start:
movq $5,%rcx
movq $5,%rax
Repeat: #function to calculate factorial
decq %rcx
cmp $0,%rcx
je print
imul %rcx,%rax
cmp $1,%rcx
jne Repeat
# Now result of factorial stored in rax
print:
xorq %rsi, %rsi
# function to print integer result digit by digit by pushing in
#stack
loop:
movq $0, %rdx
movq $10, %rbx
divq %rbx
addq $48, %rdx
pushq %rdx
incq %rsi
cmpq $0, %rax
jz next
jmp loop
next:
cmpq $0, %rsi
jz bye
popq %rcx
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $4, %rsp
jmp next
bye:
movq $1,%rax
movq $0, %rbx
int $0x80
.data
num : .byte 5
This program prints nothing. I also stepped through it with gdb: it works fine up to the loop label, but once it reaches next, seemingly random values start appearing in various registers. Please help me debug it so that it prints the factorial.
As @ped7g points out, you're doing several things wrong: using the int 0x80 32-bit ABI in 64-bit code, and passing character values instead of pointers to the write() system call.
Here's how to print an integer in x86-64 Linux, the simple and somewhat-efficient (see Footnote 1) way, using the same repeated division / modulo by 10.
System calls are expensive (probably thousands of cycles for write(1, buf, 1)), and doing a syscall inside the loop steps on registers so it's inconvenient and clunky as well as inefficient. We should write the characters into a small buffer, in printing order (most-significant digit at the lowest address), and make a single write() system call on that.
But then we need a buffer. The maximum length of a 64-bit integer is only 20 decimal digits, so we can just use some stack space. In x86-64 Linux, we can use stack space below RSP (up to 128B) without "reserving" it by modifying RSP. This is called the red-zone. If you wanted to pass the buffer to another function instead of a syscall, you would have to reserve space with sub $24, %rsp or something.
Instead of hard-coding system-call numbers, using GAS makes it easy to use the constants defined in .h files. Note the mov $__NR_write, %eax near the end of the function. The x86-64 SystemV ABI passes system-call arguments in similar registers to the function-calling convention. (So it's totally different from the 32-bit int 0x80 ABI, which you shouldn't use in 64-bit code.)
// building with gcc foo.S will use CPP before GAS so we can use headers
#include <asm/unistd.h> // This is a standard Linux / glibc header file
// includes unistd_64.h or unistd_32.h depending on current mode
// Contains only #define constants (no C prototypes) so we can include it from asm without syntax errors.
.p2align 4
.globl print_uint64 # void print_uint64(uint64_t value)
print_uint64:
lea -1(%rsp), %rsi # We use the 128B red-zone as a buffer to hold the string
# a 64-bit integer is at most 20 digits long in base 10, so it fits.
movb $'\n', (%rsi) # store the trailing newline byte. (Right below the return address).
# If you need a null-terminated string, leave an extra byte of room and store '\n\0'. Or push $'\n'
mov $10, %ecx # same as mov $10, %rcx but 2 bytes shorter
# note that newline (\n) has ASCII code 10, so we could actually have stored the newline with movb %cl, (%rsi) to save code size.
mov %rdi, %rax # function arg arrives in RDI; we need it in RAX for div
.Ltoascii_digit: # do{
xor %edx, %edx
div %rcx # rax = rdx:rax / 10. rdx = remainder
# store digits in MSD-first printing order, working backwards from the end of the string
add $'0', %edx # integer to ASCII. %dl would work, too, since we know this is 0-9
dec %rsi
mov %dl, (%rsi) # *--p = (value%10) + '0';
test %rax, %rax
jnz .Ltoascii_digit # } while(value != 0)
# If we used a loop-counter to print a fixed number of digits, we would get leading zeros
# The do{}while() loop structure means the loop runs at least once, so we get "0\n" for input=0
# Then print the whole string with one system call
mov $__NR_write, %eax # call number from asm/unistd_64.h
mov $1, %edi # fd=1
# %rsi = start of the buffer
mov %rsp, %rdx
sub %rsi, %rdx # length = one_past_end - start
syscall # write(fd=1 /*rdi*/, buf /*rsi*/, length /*rdx*/); 64-bit ABI
# rax = return value (or -errno)
# rcx and r11 = garbage (destroyed by syscall/sysret)
# all other registers = unmodified (saved/restored by the kernel)
# we don't need to restore any registers, and we didn't modify RSP.
ret
To test this function, I put this in the same file to call it and exit:
.p2align 4
.globl _start
_start:
mov $10120123425329922, %rdi
# mov $0, %edi # Yes, it does work with input = 0
call print_uint64
xor %edi, %edi
mov $__NR_exit, %eax
syscall # sys_exit(0)
I built this into a static binary (with no libc):
$ gcc -Wall -static -nostdlib print-integer.S && ./a.out
10120123425329922
$ strace ./a.out > /dev/null
execve("./a.out", ["./a.out"], 0x7fffcb097340 /* 51 vars */) = 0
write(1, "10120123425329922\n", 18) = 18
exit(0) = ?
+++ exited with 0 +++
$ file ./a.out
./a.out: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, BuildID[sha1]=69b865d1e535d5b174004ce08736e78fade37d84, not stripped
Footnote 1: See "Why does GCC use multiplication by a strange number in implementing integer division?" for avoiding div r64 for division by 10, because that's very slow (21 to 83 cycles on Intel Skylake). A multiplicative inverse would make this function actually efficient, not just "somewhat". (But of course there'd still be room for optimizations...)
Related: Linux x86-32 extended-precision loop that prints 9 decimal digits from each 32-bit "limb": see .toascii_digit: in my Extreme Fibonacci code-golf answer. It's optimized for code-size (even at the expense of speed), but well-commented.
It uses div like you do (because that's smaller than using a fast multiplicative inverse). It uses loop for the outer loop (over multiple integers for extended precision), again for code-size at the cost of speed.
It uses the 32-bit int 0x80 ABI, and prints into a buffer that was holding the "old" Fibonacci value, not the current.
Another way to get efficient asm is from a C compiler. For just the loop over digits, look at what gcc or clang produce for this C source (which is basically what the asm is doing). The Godbolt Compiler explorer makes it easy to try with different options and different compiler versions.
See gcc7.2 -O3 asm output which is nearly a drop-in replacement for the loop in print_uint64 (because I chose the args to go in the same registers):
void itoa_end(unsigned long val, char *p_end) {
const unsigned base = 10;
do {
*--p_end = (val % base) + '0';
val /= base;
} while(val);
// write(1, p_end, orig-current);
}
I tested performance on a Skylake i7-6700k by commenting out the syscall instruction and putting a repeat loop around the function call. The version with mul %rcx / shr $3, %rdx is about 5 times faster than the version with div %rcx for storing a long number-string (10120123425329922) into a buffer. The div version ran at 0.25 instructions per clock, while the mul version ran at 2.65 instructions per clock (although requiring many more instructions).
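As a rough illustration of the mul / shr $3 version mentioned above, here is a C sketch of division by 10 through a multiplicative inverse. The function names div10 and itoa_end_mul are invented for this example, and unsigned __int128 is a GCC/Clang extension that is assumed to be available; a compiler would normally generate this for you from a plain val / 10.

```c
#include <stdint.h>

/* Quotient of val / 10 without a div instruction: take the high 64 bits of
 * val * 0xCCCCCCCCCCCCCCCD (that constant is ceil(2^67 / 10)), then shift
 * right by 3. unsigned __int128 is a GCC/Clang extension. */
static inline uint64_t div10(uint64_t val)
{
    uint64_t hi = (uint64_t)(((unsigned __int128)val * 0xCCCCCCCCCCCCCCCDULL) >> 64);
    return hi >> 3;
}

/* Same digit loop as itoa_end, using div10 for the quotient.
 * Returns a pointer to the first digit written. */
char *itoa_end_mul(uint64_t val, char *p_end)
{
    do {
        uint64_t q = div10(val);
        *--p_end = (char)('0' + (val - q * 10));  /* remainder = val - 10*q */
        val = q;
    } while (val);
    return p_end;
}
```

The caller passes a pointer one past the end of its buffer, exactly as with itoa_end, and gets back the start of the digit string.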
It might be worth unrolling by 2, and doing a divide by 100 and splitting up the remainder of that into 2 digits. That would give a lot better instruction-level parallelism, in case the simpler version bottlenecks on mul + shr latency. The chain of multiply/shift operations that brings val to zero would be half as long, with more work in each short independent dependency chain to handle a 0-99 remainder.
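The divide-by-100 idea can be sketched in C as follows. This is only one possible shape for it, with the invented name itoa_end_by100; it has not been benchmarked, so whether it actually beats the one-digit loop is an open question.

```c
#include <stdint.h>

/* Sketch of the "divide by 100" idea: peel off two digits per iteration.
 * Returns a pointer to the first digit written. */
char *itoa_end_by100(uint64_t val, char *p_end)
{
    while (val >= 100) {
        unsigned rem = (unsigned)(val % 100);  /* 0..99, split independently of the next val */
        val /= 100;
        *--p_end = (char)('0' + rem % 10);     /* low digit of the pair */
        *--p_end = (char)('0' + rem / 10);     /* high digit of the pair */
    }
    /* one or two digits remain; avoid emitting a leading zero */
    if (val >= 10) {
        *--p_end = (char)('0' + val % 10);
        val /= 10;
    }
    *--p_end = (char)('0' + val);
    return p_end;
}
```

The chain of val /= 100 steps is the loop-carried dependency; splitting rem into two digits hangs off it as independent work.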
Related:
NASM version of this answer, for x86-64 or i386 Linux: How do I print an integer in Assembly Level Programming without printf from the c library?
How to convert a binary integer number to a hex string? - Base 16 is a power of 2, conversion is much simpler and doesn't require div.
Several things:
0) I guess this is 64b linux environment, but you should have stated so (if it is not, some of my points will be invalid)
1) int 0x80 is 32b call, but you are using 64b registers, so you should use syscall (and different arguments)
2) int 0x80, eax=4 requires the ecx to contain address of memory, where the content is stored, while you give it the ASCII character in ecx = illegal memory access (the first call should return error, i.e. eax is negative value). Or using strace <your binary> should reveal the wrong arguments + error returned.
3) why addq $4, %rsp? Makes no sense to me, you are damaging rsp, so the next pop rcx will pop wrong value, and in the end you will run way "up" into the stack.
... maybe some more, I didn't debug it, this list is just by reading the source (so I may be even wrong about something, although that would be rare).
BTW your code is working. It just doesn't do what you expected. But it works fine, precisely as the CPU is designed and precisely as you wrote it in the code. Whether that achieves what you wanted, or makes sense, is a different topic, but don't blame the HW or assembler.
... I can do a quick guess how the routine may be fixed (just partial hack-fix, still needs rewrite for syscall under 64b linux):
next:
cmpq $0, %rsi
jz bye
movq %rsp,%rcx # make ecx point to stack memory (with stored char)
# this will work if you are lucky enough that rsp fits into 32b
# if it is beyond 4GiB logical address, then you have bad luck (syscall needed)
decq %rsi
movq $4, %rax
movq $1, %rbx
movq $1, %rdx
int $0x80
addq $8, %rsp # now rsp += 8 is needed, because there's no POP
jmp next
jmp next
Again didn't try myself, just writing it from head, so let me know how it changed situation.
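Point 2 of the list (write wants an address, not a character value) may be easier to see in C. The following is only a sketch with a made-up helper name, put_digit; it mirrors what the hack-fix does by pointing %rcx at the stack slot that holds the digit.

```c
#include <unistd.h>

/* Hypothetical helper: print one decimal digit (0-9) to fd.
 * Returns what write() returned: 1 on success, -1 on error. */
ssize_t put_digit(int fd, unsigned digit)
{
    char c = (char)('0' + digit);
    return write(fd, &c, 1);   /* pass &c (an address), not c (the character code) */
}
```

Passing c itself as the second argument would hand the kernel a small integer such as 0x35 as a pointer, which is the mistake in the original loop.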

segmentation fault with .text .data and main (main in .data section)

I'm just trying to load the value of myarray[0] to eax:
.text
.data
# define an array of 3 words
array_words: .word 1, 2, 3
.globl main
main:
# assign array_words[0] to eax
mov $0, %edi
lea array_words(,%edi,4), %eax
But when I run this, I keep getting seg fault.
Could someone please point out what I did wrong here?
It seems the label main is in the .data section.
It leads to a segmentation fault on systems that don't allow executing code in the .data section. (Most modern systems map .data with read + write but not exec permission.)
Program code should be in the .text section. (Read + exec)
Surprisingly, on GNU/Linux systems, hand-written asm often results in an executable .data unless you're careful to avoid that, so this is often not the real problem. See "Why data and stack segments are executable?". But putting code in .text where it belongs can make some debugging tools work better.
Also, you need to ret from main or call exit (or make an _exit system call) so execution doesn't fall off the end of main into whatever bytes come next. See "What happens if there is no exit system call in an assembly program?".
You need to properly terminate your program, e.g. on Linux x86_64 by calling the sys_exit system call:
...
main:
# assign array_words[0] to eax
mov $0, %edi
lea array_words(,%edi,4), %eax
mov $60, %rax # System-call "sys_exit"
mov $0, %rdi # exit code 0
syscall
Otherwise program execution continues with whatever memory contents follow your last instruction, which are most likely invalid instructions (or even invalid memory locations).

Where Aleph one's shell code is changing itself?

I am reading "Smashing The Stack For Fun And Profit" by Aleph one,
and reached this spot:
jmp 0x2a # 2 bytes
popl %esi # 1 byte
movl %esi,0x8(%esi) # 3 bytes
movb $0x0,0x7(%esi) # 4 bytes
movl $0x0,0xc(%esi) # 7 bytes
movl $0xb,%eax # 5 bytes
movl %esi,%ebx # 2 bytes
leal 0x8(%esi),%ecx # 3 bytes
leal 0xc(%esi),%edx # 3 bytes
int $0x80 # 2 bytes
movl $0x1, %eax # 5 bytes
movl $0x0, %ebx # 5 bytes
int $0x80 # 2 bytes
call -0x2f # 5 bytes
.string \"/bin/sh\" # 8 bytes
------------------------------------------------------------------------------
Looks good. To make sure it works correctly we must compile it and run it.
**But there is a problem. Our code modifies itself**, but most operating system
mark code pages read-only.
My question is: where (and how) does this code modify itself? [I don't know assembly that well]
Thanks!
The first instruction jumps to the call at the end of the code which calls back to the second instruction that pops the return address placed on the stack by the call. Thus esi points to the string at the end. As you can see, the next 3 instructions write to memory relative to esi, setting up the argument pointer and zero terminating the string and the argument list. This is what the self modification refers to. It's slightly misleading because it isn't modifying code, just data. During standalone testing that data is part of the .text section which is typically read only, but can be made writable easily. Note that during actual usage this would be in the stack which is writable, but not executable so you'd have a different problem then.
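To make the three stores concrete, here is a small C model of the 16 bytes that start at the address in %esi. It is a sketch only: patch_block is an invented name, base stands in for the 32-bit address that popl leaves in %esi, and the code just reproduces the memory layout without calling execve.

```c
#include <stdint.h>
#include <string.h>

/* Model of the 16-byte block that starts at the address held in %esi.
 * base stands in for that 32-bit address (an assumption for this sketch). */
void patch_block(unsigned char block[16], uint32_t base)
{
    uint32_t zero = 0;
    memcpy(block + 0x8, &base, 4);   /* movl %esi,0x8(%esi): argv[0] = address of the string */
    block[0x7] = 0;                  /* movb $0x0,0x7(%esi): terminate "/bin/sh" */
    memcpy(block + 0xc, &zero, 4);   /* movl $0x0,0xc(%esi): argv[1] = NULL */
}
```

After these writes, offset 0 holds the terminated path, offset 8 holds a one-entry argument array followed by its NULL terminator at offset 0xc, which is what the following movl/leal instructions pass in %ebx, %ecx and %edx. All of the modified bytes lie in or after the string, not in the instructions.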

gcc compiling .c with .s file - .bss confusion (bug?)

Using gcc 4.6.3 under Ubuntu 12.04 on an IA32 architecture, I ran into an issue relating to compiling C files with assembly files using storage on the .bss segment with both .comm and .lcomm directives.
Between a .comm and a .lcomm buffer, the assembly file foo.s uses close to the maximum space gas lets me allocate in this segment (foo calculates prime factorization of long longs). With an assembly file bar.s handling i/o and such, everything compiles and links fine (and fast), and works well.
When I then use a C file bar.c to handle i/o, gcc does not terminate - or at least not in less than 5 minutes. The .bss request is close to my small notebook memory, but as the .bss segment does not get compile-time initialized, and as it works with bar.s, I don't see why this happens.
When I reduce the .bss size requested in foo.s, gcc compiles and links fine, and everything executes as it should. Also, as expected, the file size of the executable created in each case using
gcc bar.c foo.s -Wall
does not depend on the size in .bss requested (I compiled varying sizes which were all much smaller than the original, failing size). The executable is very small (maybe 10k) in all cases - in fact, of identical size - except, obviously, the original case which does not successfully compile and gets hung up.
Is this a gcc bug? Is there a command line option to use to prevent this from happening? Or what is going on?
Edit: per a comment, here is the part with the .bss segment allocation:
# Sieve of Eratosthenes
# Create list of prime numbers smaller than n
#
# Note: - no input error (range) check
# - n <= 500,000,000 (could be changed) - in assembly
# compiling it with gcc: trouble. make n <= 50,000,000
# Returns: pointer to array of ints of prime numbers
# (0 sentinel at end)
#
# Registers: %esi: sentinel value (n+1)
# %edx: n
# %ecx: counting variable (2 - n)
# %ebx: pointer into array of primes
# (position next to be added)
# %eax: inner pointer to A. tmp array
.section .bss
# .lcomm tmp_Arr, 2000000008 # 500,000,000 plus sentinel & padding
# .comm prime_Arr, 500000008 # asymptotically, primes aren't dense
.lcomm tmp_Arr, 200000008 # 50,000,000 plus sentinel & padding
.comm prime_Arr, 50000008 # asymptotically, primes aren't dense
.section .text
.globl sieve
.type sieve, @function
sieve:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %edx
pushl %esi
pushl %ebx
# create Eratosthenes tmp array
movl $0, %ecx
loop_sieve_Tmp_:
movl %ecx, tmp_Arr(, %ecx, 4)
addl $1, %ecx
cmp %ecx, %edx
jge loop_sieve_Tmp_
# initialize registers used in algorithm
movl $2, %ecx # outer loop counting var
movl %ecx, %eax # inner loop counting var
xor %ebx, %ebx # pointer to prime array
movl %edx, %esi
incl %esi # sentinel (or placeholder for 'not prime')
loop_sieve_Outer_:
movl %ecx, prime_Arr(, %ebx, 4) # record prime
incl %ebx
loop_sieve_Inner_:
addl %ecx, %eax
movl %esi, tmp_Arr(, %eax, 4)
cmp %eax, %edx
jge loop_sieve_Inner_
find_Next_: # find minimum in Erist. tmp array
addl $1, %ecx
cmp %ecx, %edx
jl lbl_sieve_done_
cmp tmp_Arr(, %ecx, 4), %esi
je find_Next_
movl %ecx, %eax
jmp loop_sieve_Outer_
lbl_sieve_done_:
movl $0, prime_Arr(, %ebx, 4) # sentinel
movl $prime_Arr, %eax
popl %ebx
popl %esi
movl %ebp, %esp
popl %ebp
ret
# end sieve
I've replicated your problem with gcc 4.7.2 on Debian. It doesn't hang for me, but it does take a substantial length of time (10 seconds).
It appears that the linker actually allocates and zeros "comm" memory during its processing. If the machine is sufficiently memory-limited (as it seems yours is), this will cause the linker to thrash the swap even though the final executable will be tiny. For me the memory allocation was around 2.3 GB.
I tried a couple of variations (.space, .zero, .org) and they all seem to give the same effect.
With an up-to-date version of the compiler (4.9) this no longer happens.

Can some one write assembly code for the c program above that converts into machine code that is less than 100 bytes?

I want to overflow the array buffer[100] and I will be passing python script on bash shell on FreeBSD. I need machine code to pass as a string to overflow that buffer buffer[100] and make the program print its hostname to stdout.
Here is the code in C that I tried and gives the host name on the console. :
#include <stdio.h>
#include <unistd.h>
int main()
{
char buff[256];
gethostname(buff, sizeof(buff));
printf("%s", buff);
return 0;
}
Here is the assembly that I got from gcc, but it is longer than I need: when I look at the machine code of the text section of the C program, it is longer than 100 bytes, and I need machine code for the C program above that is less than 100 bytes.
.type main, @function
main:
pushl %ebp # saving the base pointer
movl %esp, %ebp # taking a snapshot of the stack pointer
subl $264, %esp
addl $-8, %esp
pushl $256
leal -256(%ebp), %eax
pushl %eax
call gethostname
addl $16, %esp
addl $-8, %esp
leal -256(%ebp), %eax
pushl %eax
pushl $.LC0
call printf
addl $16, %esp
xorl %eax, %eax
jmp .L6
.p2align 2, 0x90
.L6:
leave
ret
.Lfe1:
.size main, .Lfe1-main
.ident "GCC: (GNU) c 2.95.4 20020320 [FreeBSD]"
A person has already done it on another computer and he has given me the ready made machine code which is 37 bytes and he is passing it in the format below to the buffer using perl script. I tried his code and it works but he doesn't tell me how to do it.
“\x41\xc1\x30\x58\x6e\x61\x6d\x65\x23\x23\xc3\xbc\xa3\x83\xf4\x69\x36\xw3\xde\x4f\x2f\x5f\x2f\x39\x33\x60\x24\x32\xb4\xab\x21\xc1\x80\x24\xe0\xdb\xd0”
I know that he did it on a different machine, so I cannot get the same code, but since we are both using exactly the same C function, the size of the machine code should be almost the same if not exactly the same. His machine code is 37 bytes, which he will pass on the shell to overflow the gets() function in a binary file on FreeBSD 2.95 to print the hostname on stdout. I want to do the same thing, and I have tried his machine code and it works, but he will not tell me how he got this machine code. So what I am actually concerned about is the procedure for getting that code.
OK I tried the methods suggested in the posts here but just for the function gethostname() I got a 130 character of machine code. It did not include the printf() machine code. As I need to print the hostname to console so that should also be included but that will make the machine code longer. I have to fit the code in an array of 100 bytes so the code should be less than 100 bytes.
Can some one write assembly code for the c program above that converts into machine code that is less than 100 bytes?
To get the machine code, you need to compile the program then disassemble. Using gcc for example do something like this:
gcc -o hello hello.c
objdump -D hello
The dump will show the machine code in bytes and the disassembly of that machine code.
A simple, related example. You have to understand the difference between an object file and an executable file, but this should still demonstrate what I mean:
unsigned int myfun ( unsigned int x )
{
return(x+5);
}
gcc -O2 -c -o hello.o hello.c
objdump -D hello.o
Disassembly of section .text:
00000000 <myfun>:
0: e2800005 add r0, r0, #5
4: e12fff1e bx lr
FreeBSD is an operating system, not a compiler or assembler.
You want to assemble the assembly source into machine code, so you should use an assembler.
You can typically use GCC, since it's smart enough to know that for a filename ending in .s, it should run the assembler.
If you already have the code in an object file, you can use objdump to read out the code segment of the file.
The 37 bytes posted are completely junk.
If run under any version of Windows (Windows 2000 or later), I believe the "outsb" and "insd" instructions (in a userland program) will cause a fault, because userland programs are not allowed to do port-level I/O directly.
Since machine code does not end in a "vacuum", I added some \x90 bytes (NOP) after the posted code. That merely affects the operand of the last rcl instruction, which in the given code is cut off; so the posted code is not only rubbish, it also ends prematurely.
But microprocessors have no intelligence of their own, so they will (try to) execute whatever junk code you feed them. The code starts with "inc ecx", a pointless move since we do not know what value ecx had before. Also, "shl dword ptr [eax],$58" is a "good" way to randomly corrupt memory (since the value of eax is also unknown).
And one of the bytes is NOT even a valid byte (each should be written as two hexadecimal digits). The invalid "byte" is \xw3.
I replaced that invalid byte with \x90 (a NOP, if it is at the start of an instruction), and got:
00451B51 41 inc ecx
00451B52 C13058 shl dword ptr [eax],$58
00451B55 6E outsb
00451B56 61 popad
00451B57 6D insd
00451B58 652323 and esp,gs:[ebx]
00451B5B C3 ret
// code below is NEVER executed, since the line above does a RET.
00451B5C BCA383F469 mov esp,$69f483a3
00451B61 3690 nop // 36, w3 ????
00451B63 DE4F2F fimul word ptr [edi+$2f]
00451B66 5F pop edi
00451B67 2F das
00451B68 3933 cmp [ebx],esi
00451B6A 60 pushad
00451B6B 2432 and al,$32
00451B6D B4AB mov ah,$ab
00451B6F 21C1 and ecx,eax
00451B71 8024E0DB and byte ptr [eax],$db
00451B75 D09090909090 rcl [eax-$6f6f6f70],1
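The malformed \xw3 escape in the posted string can also be found mechanically. Below is a small C sketch of such a check; first_bad_escape is a hypothetical helper written for this illustration, and it expects the string without the surrounding quotes.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical checker: returns the offset of the first malformed \xHH
 * escape in s, or -1 if the whole string consists of well-formed escapes. */
long first_bad_escape(const char *s)
{
    for (size_t i = 0; s[i] != '\0'; i += 4) {
        if (s[i] != '\\' || s[i + 1] != 'x' ||
            !isxdigit((unsigned char)s[i + 2]) ||
            !isxdigit((unsigned char)s[i + 3]))
            return (long)i;
    }
    return -1;
}
```

Run on the 37 escapes from the question, it reports offset 68, which is the 18th escape, \xw3.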
You get a nice hexdump of the text section of your object file with objdump -s -j .text.
Edited some more details:
You need to find out what the address of the function in your object code is. This is what objdump -t is for. In this case I am looking for the function main in a program "hello".
> objdump -t hello|grep main
> 0000000000400410 g F .text 000000000000002f main
Now I create a hexdump with objdump -s -j .text hello:
400410 4881ec08 010000be 00010000 31c04889 H...........1.H.
400420 e7e8daff ffff4889 e6bff405 400031c0 ......H.....@.1.
400430 e8abffff ff31c048 81c40801 0000c390 .....1.H........
400440 31ed4989 d15e4889 e24883e4 f0505449 1.I..^H..H...PTI
400450 c7c0e005 400048c7 c1500540 0048c7c7 ....@.H..P.@.H..
...
The first column holds the addresses. It starts with 400410, the address of the main function, but this may not always be the case. The following 4 columns are 16 bytes of machine code in hex, and the last column shows the same 16 bytes as ASCII. Because a lot of bytes have no printable ASCII representation, there are a lot of dots. You need to use the 4 hexadecimal columns: \x48 \x81 \xec....
I have done this on a Linux system, but for FreeBSD you can do exactly the same; only the resulting machine code will be different.
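Rewriting the hex columns as \x escapes by hand is error-prone, so here is a hedged C sketch of that one step. The name hex_to_escapes is made up for this example; it takes only the hex text of a dump row (without the address and ASCII columns) and skips the spaces between columns.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical converter: turns hex text such as "4881ec08 010000be" into
 * the escape string for the same bytes. Spaces are skipped. Returns the
 * number of bytes converted, or -1 on a non-hex character, an odd number
 * of digits, or an output buffer that is too small. */
long hex_to_escapes(const char *hex, char *out, size_t outsize)
{
    size_t o = 0;
    long count = 0;
    while (*hex != '\0') {
        if (*hex == ' ') { hex++; continue; }
        if (!isxdigit((unsigned char)hex[0]) || !isxdigit((unsigned char)hex[1]))
            return -1;
        if (o + 5 > outsize)   /* 4 characters for \xHH plus the final NUL */
            return -1;
        out[o++] = '\\';
        out[o++] = 'x';
        out[o++] = hex[0];
        out[o++] = hex[1];
        hex += 2;
        count++;
    }
    if (o >= outsize)
        return -1;
    out[o] = '\0';
    return count;
}
```

For the first two columns of the dump above, "4881ec08 010000be", it produces \x48\x81\xec\x08\x01\x00\x00\xbe and returns 8.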
