I have the following code:
.global _launchProgram
_launchProgram:
push bp
mov bp, sp
push cs
mov bx, [bp + 4]
mov cs, bx
mov es, bx
eseg
call #0x0
pop bx
mov cs, bx
pop bp
ret
In this code I am trying to jump to another piece of code and execute it. The code is called from C as shown below:
launchProgram(segment) //Here segment is an integer which holds the
//memory segment where I have loaded my code
In this function I set the cs register equal to the segment variable and use call 0x0 to jump to the start of that segment. But when I assemble it using:
as86 launchProgram.asm -o launchProgram.o
I get the following error:
00010 000C E8 0000 call #0x0
***** relocation impossible.................................^
Why am I getting this error?
Your call #0x0 seems to specify an IP (Instruction Pointer)-relative call in as86 (an offset relative to the next instruction). Was that intentional? as86 might be complaining because it expected a label or a symbol instead, which the linker would be able to resolve (relocate) if needed.
The as86 man page has the following:
The 'near and 'far' do not allow multi-segment programming, all 'far'
operations are specified explicitly through the use of the instructions: jmpi,
jmpf, callf, retf, etc. The 'Near' operator can be used to force the use of
80386 16bit conditional branches. The 'Dword' and 'word' operators can control
the size of operands on far jumps and calls.
The code assembles if I use callf 0x12345678,0x1234 instead, which generates the following instructions:
$ as86 a.asm -o a.o
$ objdump -D -b binary -mi386 -Maddr16,data16,intel a.o
...
3b: 8e cb mov cs,bx
3d: 8e c3 mov es,bx
3f: 26 66 9a 78 56 34 12 es call 0x1234:0x12345678
46: 34 12
48: 5b pop bx
...
(-b binary is needed since it's raw code, -mi386 selects the instruction set, and -Maddr16,data16,intel selects Intel syntax and 16-bit code, which seems to be what as86 generates by default.)
The second operand to callf seems to be the segment selector part of the address (having a single operand to callf causes as86 to complain). My x86-fu is too weak to say if the segment override on the call actually makes sense there. You'd want callf #0x0,#0x0 in your code, of course.
If you want to "trick" as86 into generating a relative call that's identical to what you're trying to do (not sure if this makes sense -- you might get random bits from whatever IP happens to be), then you could do the following:
eseg
call zero_offset
zero_offset: pop bx
The output is
53: 26 e8 00 00 es call 0x57
, where the 00 00 part shows that the offset is 0.
I don't think setting cs before the call is a good idea; the called procedure doesn't know how to return. You have to execute a far call,
call segment:offset. This pushes the values of the cs and ip registers on the stack for the return. For your code, something like: call cs:0x00
Also, is eseg an x86 instruction?
See this link
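To make the return problem concrete, here is a minimal C sketch of what a real-mode far call and far return do to CS, IP and the stack. The struct cpu16 and the helpers far_call and far_ret are hypothetical names used only for illustration, and the stack is modelled as a small word array rather than real SS:SP memory.

```c
#include <stdint.h>

/* Toy model of the registers involved; all names are illustrative. */
struct cpu16 {
    uint16_t cs, ip;
    uint16_t stack[16];   /* word-addressed toy stack */
    int sp;               /* index of the top-of-stack word; 16 means empty */
};

static void push16(struct cpu16 *c, uint16_t v) { c->stack[--c->sp] = v; }
static uint16_t pop16(struct cpu16 *c) { return c->stack[c->sp++]; }

/* Far call: save CS, then the offset of the next instruction,
   then load the new CS:IP. */
void far_call(struct cpu16 *c, uint16_t insn_len, uint16_t seg, uint16_t off)
{
    push16(c, c->cs);
    push16(c, (uint16_t)(c->ip + insn_len));
    c->cs = seg;
    c->ip = off;
}

/* Far return: restore IP first, then CS, undoing far_call. */
void far_ret(struct cpu16 *c)
{
    c->ip = pop16(c);
    c->cs = pop16(c);
}
```

A near call would push only IP, so a callee reached that way has no saved CS to restore. That is why changing the segment and then using a near call leaves the callee without a way back.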
I have this code in C:
int main(void)
{
int a = 1 + 2;
return 0;
}
When I objdump -x86-asm-syntax=intel -d a.out which is compiled with -O0 flag with GCC 9.3.0_1, I get:
0000000100000f9e _main:
100000f9e: 55 push rbp
100000f9f: 48 89 e5 mov rbp, rsp
100000fa2: c7 45 fc 03 00 00 00 mov dword ptr [rbp - 4], 3
100000fa9: b8 00 00 00 00 mov eax, 0
100000fae: 5d pop rbp
100000faf: c3 ret
and with -O1 flag:
0000000100000fc2 _main:
100000fc2: b8 00 00 00 00 mov eax, 0
100000fc7: c3 ret
which removes the unused variable a and stack managing altogether.
However, when I use Apple clang version 11.0.3 with -O0 and -O1, I get
0000000100000fa0 _main:
100000fa0: 55 push rbp
100000fa1: 48 89 e5 mov rbp, rsp
100000fa4: 31 c0 xor eax, eax
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
100000fad: c7 45 f8 03 00 00 00 mov dword ptr [rbp - 8], 3
100000fb4: 5d pop rbp
100000fb5: c3 ret
and
0000000100000fb0 _main:
100000fb0: 55 push rbp
100000fb1: 48 89 e5 mov rbp, rsp
100000fb4: 31 c0 xor eax, eax
100000fb6: 5d pop rbp
100000fb7: c3 ret
respectively.
I never get the stack managing part stripped off as in GCC.
Why does (Apple) Clang keep unnecessary push and pop?
This may or may not be a separate question, but with the following code:
int main(void)
{
// return 0;
}
GCC produces the same assembly with or without the return 0;.
However, Clang -O0 leaves this extra
100000fa6: c7 45 fc 00 00 00 00 mov dword ptr [rbp - 4], 0
when there is return 0;.
Why does Clang keep these (probably) redundant instructions?
I suspect you were trying to see the addition happen.
int main(void)
{
int a = 1 + 2;
return 0;
}
but with optimization say -O2, your dead code went away
00000000 <main>:
0: 2000 movs r0, #0
2: 4770 bx lr
The variable a is local; it never leaves the function and does not rely on anything outside of the function (globals, input variables, return values from called functions, etc). So it has no functional purpose: it is dead code that doesn't do anything, so an optimizer is free to remove it, and did.
So I assume you went to use no or less optimization and then saw it was too verbose.
00000000 <main>:
0: cf 93 push r28
2: df 93 push r29
4: 00 d0 rcall .+0 ; 0x6 <main+0x6>
6: cd b7 in r28, 0x3d ; 61
8: de b7 in r29, 0x3e ; 62
a: 83 e0 ldi r24, 0x03 ; 3
c: 90 e0 ldi r25, 0x00 ; 0
e: 9a 83 std Y+2, r25 ; 0x02
10: 89 83 std Y+1, r24 ; 0x01
12: 80 e0 ldi r24, 0x00 ; 0
14: 90 e0 ldi r25, 0x00 ; 0
16: 0f 90 pop r0
18: 0f 90 pop r0
1a: df 91 pop r29
1c: cf 91 pop r28
1e: 08 95 ret
If you want to see the addition happen instead, first off don't use main(): it has baggage, and the baggage varies among toolchains. So try something else:
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
now the addition relies on external items so the compiler cannot optimize any of this away.
00000000 <_fun>:
0: 1d80 0002 mov 2(sp), r0
4: 6d80 0004 add 4(sp), r0
8: 0087 rts pc
If we want to figure out which one is a and which one is b then.
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+(b<<1));
}
00000000 <_fun>:
0: 1d80 0004 mov 4(sp), r0
4: 0cc0 asl r0
6: 6d80 0002 add 2(sp), r0
a: 0087 rts pc
Want to see an immediate value
unsigned int fun ( unsigned int a )
{
return(a+0x321);
}
00000000 <fun>:
0: 8b 44 24 04 mov eax,DWORD PTR [esp+0x4]
4: 05 21 03 00 00 add eax,0x321
9: c3 ret
you can figure out what the compilers return address convention is, etc.
But you will hit some limits trying to get the compiler to do things for you to learn asm. Likewise, you can easily take the code generated by these compilations (using -save-temps or -S, or disassemble and type it in (I prefer the latter)), but you can only get so far on your operating system in high level/C callable functions. Eventually you will want to do something bare-metal (on a simulator at first) to get maximum freedom and to try instructions you can't normally try, or try them in a way that is hard or that you don't quite understand yet how to use within the confines of an operating system in a function call. (Please do not use inline assembly until down the road, or never; use real assembly and ideally the assembler, not the compiler, to assemble it. Down the road, then try those things.)
One of the compilers was built to use, or defaults to using, a frame pointer, so you need to tell it to omit it with -fomit-frame-pointer. Note that one or both of these can be built to default to not having a frame pointer.
../gcc-$GCCVER/configure --target=$TARGET --prefix=$PREFIX --without-headers --with-newlib --with-gnu-as --with-gnu-ld --enable-languages='c' --enable-frame-pointer=no
(Don't assume gcc nor clang/llvm have a "standard" build as they are both customizable and the binary you downloaded has someone's opinion of the standard build)
You are using main(), this has the return 0 or not thing and it can/will carry other baggage. Depends on the compiler and settings. Using something not main gives you the freedom to pick your inputs and outputs without it warning that you didn't conform to the short list of choices for main().
For gcc, -O0 is ideally no optimization, although sometimes you see some. -O3 is max, give me all you got. -O2 is historically where folks live, if for no other reason than "I did it because everyone else is doing it". -O1 is no man's land for gnu: it has some items not in -O0 but not a lot of the good ones in -O2, so it depends heavily on your code as to whether or not you landed in one/some of the optimizations associated with -O1. These numbered optimization levels, if your compiler even has a -O option, are just pre-defined lists: 0 means this list, 1 means that list, and so on.
There is no reason to expect any two compilers or the same compiler with different options to produce the same code from the same sources. If two competing compilers were able to do that most if not all of the time something very fishy is going on...Likewise no reason to expect the list of optimizations each compiler supports, what each optimization does, etc, to match much less the -O1 list to match between them and so on.
There is no reason to assume that any two compilers or versions conform to the same calling convention for the same target, it is much more common now and further for the processor vendor to create a recommended calling convention and then the competing compilers to often conform to that because why not, everyone else is doing it, or even better, whew I don't have to figure one out myself, if this one fails I can blame them.
There are a lot of implementation defined areas in C in particular, less so in C++ but still...So your expectations of what come out and comparing compilers to each other may differ for this reason as well. Just because one compiler implements some code in some way doesn't mean that is how that language works sometimes it is how that compiler author(s) interpreted the language spec or had wiggle room.
Even with full optimizations enabled, everything that compiler has to offer, there is no reason to assume that a compiler can outperform a human. It's an algorithm with limits programmed by a human; it cannot outperform us. With experience it is not hard to examine the output of a compiler, sometimes for simple functions but often for larger functions, and find missed optimizations, or other things that could have been done "better" for some opinion of "better". And sometimes you find the compiler just left something in that you think it should have removed, and sometimes you are right.
There is education as shown above in using a compiler to start to learn assembly language, and even with decades of experience and dabbling with dozens of assembly languages/instruction sets, if there is a debugged compiler available I will very often start with disassembling simple functions to start learning that new instruction set, then look those up then start to get a feel from what I find there for how to use it.
Very often starting with this one first:
unsigned int fun ( unsigned int a )
{
return(a+5);
}
or
unsigned int fun ( unsigned int a, unsigned int b )
{
return(a+b);
}
And going from there. Likewise when writing a disassembler or a simulator for fun to learn the instruction set I often rely on an existing assembler since it is often the documentation for a processor is lacking, the first assembler and compiler for that processor are very often done with direct access to the silicon folks and then those that follow can also use existing tools as well as documentation to figure things out.
So you are on a good path to start learning assembly language I have strong opinions on which ones to or not to start with to improve the experience and chances of success, but I have been in too many battles on Stack Overflow this week, I'll let that go. You can see that I chose an array of instruction sets in this answer. And even if you don't know them you can probably figure out what the code is doing. "standard" installs of llvm provide the ability to output assembly language for several instruction sets from the same source code. The gnu approach is you pick the target (family) when you compile the toolchain and that compiled toolchain is limited to that target/family but you can easily install several gnu toolchains on your computer at the same time be they variations on defaults/settings for the same target or different targets. A number of these are apt gettable without having to learn to build the tools, arm, avr, msp430, x86 and perhaps some others.
I cannot speak to the why does it not return zero from main when you didn't actually have any return code. See comments by others and read up on the specs for that language. (or ask that as a separate question, or see if it was already answered).
Now you said Apple clang not sure what that reference was to I know that Apple has put a lot of work into llvm in general. Or maybe you are on a mac or in an Apple supplied/suggested development environment, but check Wikipedia and others, clang had a lot of corporate help not just Apple, so not sure what the reference was there. If you are on an Apple computer then the apt gettable isn't going to make sense, but there are still lots of pre-built gnu (and llvm) based toolchains you can download and install rather than attempt to build the toolchain from sources (which isn't difficult BTW).
Say I want to convert a certain function into hex
void func(char* string) {
puts(string);
}
1139: 55 push %rbp
113a: 48 89 e5 mov %rsp,%rbp
113d: 48 83 ec 10 sub $0x10,%rsp
1141: 48 89 7d f8 mov %rdi,-0x8(%rbp)
1145: 48 8b 45 f8 mov -0x8(%rbp),%rax
1149: 48 89 c7 mov %rax,%rdi
114c: e8 df fe ff ff callq 1030 <puts@plt>
1151: 90 nop
1152: c9 leaveq
1153: c3 retq
This is what I got on x86_64: \x55\x48\x89\xe5\x48\x83\xec\x10\x48\x89\x7d\xf8\x48\x8b\x45\xf8\x48\x89\xc7\xe8\xdf\xfe\xff\xff\x90\xc9\xc3
I want to encrypt it and use it in this program, with a decryptor at the start to decrypt these instructions at run time so it can't be analyzed statically.
Converting the above function into hex and creating a function pointer for it doesn't run and ends with SIGSEGV at push %rbp.
My aim is to make this code print Hi.
int main() {
char* decrypted = decrypt(hexcode);
void (*func)(char*) = (void (*)(char*)) decrypted;
func("HI");
}
My questions are:
How do I convert a function into hex properly?
How do I then run this hex code from main as shown above?
If you want to execute a binary blob; then you need to do something like this:
void *p = mmap(0, blob_size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANON, -1, 0);
read(blob_file, p, blob_size);
mprotect(p, blob_size, PROT_READ | PROT_EXEC);
void (*UndefinedBehaviour)(char *x) = p;
UndefinedBehaviour("HI");
This allocates some memory, copies a blob into it, changes the memory to be PROT_EXEC, then invokes the blob at its beginning. You need to add some error checking, and depending upon what sort of system you are on, it may be running malware monitors to prevent you from doing this.
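A minimal sketch of the same sequence with the error checking filled in might look like the following. The function name run_blob and its out parameter are hypothetical, it assumes a POSIX system with mmap and mprotect, and it assumes the blob is a self-contained function that takes no arguments and returns an int (unlike the puts-calling code in the question).

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc in strict modes */
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/* Copies len bytes of machine code into a fresh mapping, makes it
   executable, calls it, and stores its return value in *out.
   Returns 0 on success, -1 if any step fails. */
int run_blob(const unsigned char *code, size_t len, int *out)
{
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    memcpy(p, code, len);

    /* Drop write permission before executing (W^X). */
    if (mprotect(p, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(p, len);
        return -1;
    }

    /* Converting an object pointer to a function pointer is not strictly
       ISO C, but POSIX systems support it. */
    int (*fn)(void) = (int (*)(void))p;
    *out = fn();

    munmap(p, len);
    return 0;
}
```

On x86-64, for instance, passing the six bytes b8 2a 00 00 00 c3 (mov eax, 42; ret) should leave 42 in *out. A hardened system may refuse the mprotect call, in which case run_blob returns -1 instead of crashing.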
Answer for 1.: It is nearly impossible to do it automatically, because there is no simple way to determine the length of a function's code; it depends on the CPU, compiler optimizations, etc. The only way is "manual" analysis of the disassembled binary.
You can't for those instructions because they're not fully position-independent and self-contained.
e8 df fe ff ff is a call rel32 (with a little-endian relative displacement as the call target). It only works if that displacement reaches the puts@plt stub, and that only happens in the executable you're disassembling, where this code appears at a fixed distance from the PLT. (So the executable itself is position-independent when relocated as a whole, but taking the machine code for one function and trying to run it from some other address will break.)
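The arithmetic can be checked with a small C sketch. The helper name call_rel32_target is made up for this illustration; the addresses and bytes come from the disassembly in the question.

```c
#include <stdint.h>

/* Target of a 5-byte "E8 rel32" call whose first byte sits at insn_addr. */
uint64_t call_rel32_target(uint64_t insn_addr, const uint8_t insn[5])
{
    /* Assemble the little-endian displacement... */
    uint32_t raw = (uint32_t)insn[1]
                 | (uint32_t)insn[2] << 8
                 | (uint32_t)insn[3] << 16
                 | (uint32_t)insn[4] << 24;
    /* ...and reinterpret it as signed (two's complement on mainstream ABIs). */
    int32_t rel = (int32_t)raw;

    /* The displacement is relative to the end of the call instruction. */
    return insn_addr + 5 + (uint64_t)(int64_t)rel;
}
```

With the bytes e8 df fe ff ff at 0x114c, the displacement is -289 and the target is 0x1151 - 289 = 0x1030, which is the PLT stub. Run the same bytes from any other address and the target moves with them, so it no longer lands on the stub.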
In theory you could fixup the call target using a function pointer to puts in some code that included this machine code in an array, but if you're trying to make shellcode you can't depend on the "target" process helping you that way.
Instead you should use system calls directly via the syscall instruction, for example Linux syscall with RAX=1=__NR_write is write. (Not via their libc wrapper functions like write(), that would have exactly the same problem as puts).
Then you can refer to How to get c code to execute hex bytecode? for how to put machine code in a C array, make sure that's in an executable page (e.g. gcc -z execstack or mprotect or mmap), and cast that to a function pointer + call it like you're doing here.
ends with SIGSEGV at push %rbp
Yup, code-fetch from a page without EXEC permission will do that. gcc -z execstack is an easy way to fix that, or mmap like other answers suggest, at which point execution will get as far as the call -289 and fault or run bad instructions.
Can someone explain why this code:
#include <stdio.h>
int main()
{
return 0;
}
when compiled with tcc using tcc code.c produces this asm:
00401000 |. 55 PUSH EBP
00401001 |. 89E5 MOV EBP,ESP
00401003 |. 81EC 00000000 SUB ESP,0
00401009 |. 90 NOP
0040100A |. B8 00000000 MOV EAX,0
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
00401015 |. C3 RETN
I guess that
00401009 |. 90 NOP
is maybe there for some memory alignment, but what about
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
I mean why would compiler insert this near jump that jumps to the next instruction, LEAVE would execute anyway?
I'm on 64-bit Windows generating 32-bit executable using TCC 0.9.26.
Superfluous JMP before the Function Epilogue
The JMP at the bottom that goes to the next instruction was fixed in a commit. Version 0.9.27 of TCC resolves this issue:
When 'return' is the last statement of the top-level block
(very common and often recommended case) jump is not needed.
As for the reason it existed in the first place: the idea is that each function has a possible common exit point. If there is a block of code with a return in it at the bottom, the JMP goes to a common exit point where stack cleanup is done and the ret is executed. Originally the code generator also emitted the JMP instruction erroneously at the end of the function too if it appeared just before the final } (closing brace). The fix checks to see if there is a return statement followed by a closing brace at the top level of the function. If there is, the JMP is omitted.
An example of code that has a return at a lower scope before a closing brace:
int main(int argc, char *argv[])
{
if (argc == 3) {
argc++;
return argc;
}
argc += 3;
return argc;
}
The generated code looks like:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
401009: 90 nop
40100a: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40100d: 83 f8 03 cmp eax,0x3
401010: 0f 85 11 00 00 00 jne 0x401027
401016: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
401019: 89 c1 mov ecx,eax
40101b: 40 inc eax
40101c: 89 45 08 mov DWORD PTR [ebp+0x8],eax
40101f: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` inside the if statement
401022: e9 11 00 00 00 jmp 0x401038
401027: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40102a: 83 c0 03 add eax,0x3
40102d: 89 45 08 mov DWORD PTR [ebp+0x8],eax
401030: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` at end of the function
401033: e9 00 00 00 00 jmp 0x401038
; Common function exit point
401038: c9 leave
401039: c3 ret
In versions prior to 0.9.27 the return argc inside the if statement would jump to a common exit point (function epilogue). As well, the return argc at the bottom of the function also jumps to the same common exit point of the function. The problem is that the common exit point for the function happens to be right after the top level return argc, so the side effect is an extra JMP that happens to be to the next instruction.
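The same control flow can be written out in C with explicit gotos. This is only an illustration of the shape of the generated code, not TCC's own source; the function and label names are invented.

```c
/* C rendering of the "every return jumps to one exit" scheme. */
int returns_via_common_exit(int argc)
{
    int result;

    if (argc == 3) {
        argc++;
        result = argc;
        goto function_exit;   /* useful jump: skips the rest of the body */
    }

    argc += 3;
    result = argc;
    goto function_exit;       /* redundant jump: the target is the next statement */

function_exit:
    return result;            /* common epilogue */
}
```

The second goto corresponds to the jmp 0x401038 located at 0x401033 in the listing: it is emitted by the same rule as the first one, but it falls straight through.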
NOP after Function Prologue
The NOP isn't for alignment. Because of the way Windows implements guard pages for the stack (Programs that are in Portable Executable format) TCC has two types of prologues. If the local stack space required < 4096 (smaller than a single page) then you see this kind of code generated:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
The sub esp,0 isn't optimized out. It is the amount of stack space needed for local variables (in this case 0). If you add some local variables you will see the 0x0 in the SUB instruction changes to coincide with the amount of stack space needed for local variables. This prologue requires 9 bytes. There is another prologue to handle the case where the stack space needed is >= 4096 bytes. If you add an array of 4096 bytes with something like:
char somearray[4096];
and look at the resulting instruction you will see the function prologue change to a 10 byte prologue:
401000: b8 00 10 00 00 mov eax,0x1000
401005: e8 d6 00 00 00 call 0x4010e0
TCC's code generator assumes that the function prologue is always 10 bytes when targeting WinPE. This is primarily because TCC is a single pass compiler. The compiler doesn't know how much stack space a function will use until after the function is processed. To get around not knowing this ahead of time, TCC pre-allocates 10 bytes for the prologue to fit the largest method. Anything shorter is padded to 10 bytes.
In the case where stack space needed < 4096 bytes the instructions used total 9 bytes. The NOP is used to pad the prologue to 10 bytes. For the case where >= 4096 bytes are needed, the number of bytes is passed in EAX and the function __chkstk is called to allocate the required stack space instead.
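As a rough sketch of that choice, the following C function writes one of the two 10-byte prologues into a buffer, using the byte encodings shown in the listings above. The names emit_prologue, PROLOGUE_SIZE and chkstk_rel32 are hypothetical; this is not TCC's actual code generator.

```c
#include <stddef.h>
#include <stdint.h>

enum { PROLOGUE_SIZE = 10 };

/* Store v little-endian at p; returns the number of bytes written. */
static size_t put_le32(uint8_t *p, uint32_t v)
{
    for (int i = 0; i < 4; i++)
        p[i] = (uint8_t)(v >> (8 * i));
    return 4;
}

/* Fills out[] with exactly PROLOGUE_SIZE bytes. chkstk_rel32 is the
   call displacement to the stack-probe helper (assumed to be known). */
void emit_prologue(uint8_t out[PROLOGUE_SIZE], uint32_t stack_bytes,
                   uint32_t chkstk_rel32)
{
    size_t n = 0;

    if (stack_bytes < 4096) {
        out[n++] = 0x55;                    /* push ebp        */
        out[n++] = 0x89; out[n++] = 0xe5;   /* mov ebp, esp    */
        out[n++] = 0x81; out[n++] = 0xec;   /* sub esp, imm32  */
        n += put_le32(out + n, stack_bytes);
        out[n++] = 0x90;                    /* nop: pad 9 -> 10 bytes */
    } else {
        out[n++] = 0xb8;                    /* mov eax, imm32  */
        n += put_le32(out + n, stack_bytes);
        out[n++] = 0xe8;                    /* call rel32      */
        n += put_le32(out + n, chkstk_rel32);
    }
}
```

Both branches produce the same length, which is what lets a single-pass generator reserve the space before it knows which one it will need.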
TCC is not an optimizing compiler, at least not really. Every single instruction it emitted for main is sub-optimal or not needed at all, except the ret. IDK why you thought the JMP was the only instruction that might not make sense for performance.
This is by design: TCC stands for Tiny C Compiler. The compiler itself is designed to be simple, so it intentionally doesn't include code to look for many kinds of optimizations. Notice the sub esp, 0: this useless instruction clearly comes from filling in a function-prologue template, and TCC doesn't even look for the special case where the offset is 0 bytes. Other functions need stack space for locals, or to align the stack before any child function calls, but this main() doesn't. TCC doesn't care, and blindly emits sub esp,0 to reserve 0 bytes.
(In fact, TCC is truly one pass, laying out machine code as it goes through the C statement by statement. It uses the imm32 encoding for sub so it will have room to fill in the right number (upon reaching the end of the function) even if it turns out the function uses more than 255 bytes of stack space. So instead of constructing a list of instructions in memory to finish assembling later, it just remembers one spot to fill in a uint32_t. That's why it can't omit the sub when it turns out not to be needed.)
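The "remember one spot and fill it in later" idea can be sketched in a few lines of C. The struct code_buf and both function names are invented for this example and are not taken from TCC.

```c
#include <stddef.h>
#include <stdint.h>

/* Small fixed-size output buffer; enough for an illustration. */
struct code_buf {
    uint8_t bytes[64];
    size_t len;
};

static void emit8(struct code_buf *b, uint8_t v) { b->bytes[b->len++] = v; }

/* Emit "sub esp, imm32" with a zero placeholder and return the offset
   of the imm32 so it can be patched once the frame size is known. */
size_t emit_sub_esp_placeholder(struct code_buf *b)
{
    emit8(b, 0x81);
    emit8(b, 0xec);
    size_t patch_at = b->len;
    for (int i = 0; i < 4; i++)
        emit8(b, 0x00);
    return patch_at;
}

/* Overwrite the placeholder in place (little-endian). The instruction
   keeps its length, so nothing after it has to move. */
void patch_imm32(struct code_buf *b, size_t patch_at, uint32_t value)
{
    for (int i = 0; i < 4; i++)
        b->bytes[patch_at + i] = (uint8_t)(value >> (8 * i));
}
```

Because later code has already been laid out after the placeholder, the generator can change the four immediate bytes but cannot delete the instruction, even when the final value is 0.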
Most of the work in creating a good optimizing compiler that anyone will use in practice is the optimizer. Even parsing modern C++ is peanuts compared to reliably emitting efficient asm (which not even gcc / clang / icc can do all the time, even without considering autovectorization). Just generating working but inefficient asm is easy compared to optimizing; most of gcc's codebase is optimization, not parsing. See Basile's answer on Why are there so few C compilers?
The JMP (as you can see from @MichaelPetch's answer) has a similar explanation: TCC (until recently) didn't optimize the case where a function only has one return path, and doesn't need to JMP to a common epilogue.
There's even a NOP in the middle of the function. It's obviously a waste of code bytes and decode / issue front-end bandwidth and out-of-order window size. (Sometimes executing a NOP outside a loop or something is worth it to align the top of a loop which is branched to repeatedly, but a NOP in the middle of a basic block is basically never worth it, so that's not why TCC put it there. And if a NOP did help, you could probably do even better by reordering instructions or choosing larger instructions to do the same thing without a NOP. Even proper optimizing compilers like gcc/clang/icc don't try to predict this kind of subtle front-end effect.)
@MichaelPetch points out that TCC always wants its function prologue to be 10 bytes, because it's a single-pass compiler (and it doesn't know how much space it needs for locals until the end of the function, when it comes back and fills in the imm32). But Windows targets need stack probes when modifying ESP / RSP by more than a whole page (4096 bytes), and the alternate prologue for that case is 10 bytes, instead of 9 for the normal one without the NOP. So this is another tradeoff favouring compilation speed over good asm.
An optimizing compiler would xor-zero EAX (because that's smaller and at least as fast as mov eax,0), and leave out all the other instructions. Xor-zeroing is one of the most well-known / common / basic x86 peephole optimizations, and has several advantages other than code-size on some modern x86 microarchitectures.
main:
xor eax,eax
ret
Some optimizing compilers might still make a stack frame with EBP, but tearing it down with pop ebp would be strictly better than leave on all CPUs, for this special case where ESP = EBP so the mov esp,ebp part of leave isn't needed. pop ebp is still 1 byte, but it's also a single-uop instruction on modern CPUs, unlike leave which is 2 or 3 on modern CPUs. (http://agner.org/optimize/, and see also other performance optimization links in the x86 tag wiki.) This is what gcc does. It's a fairly common situation; if you push some other registers after making a stack frame, you have to point ESP at the right place before pop ebx or whatever. (Or use mov to restore them.)
The benchmarks TCC cares about are compilation speed, not quality (speed or size) of the resulting code. For example, the TCC web site has a benchmark in lines/sec and MB/sec (of C source) vs. gcc3.2 -O0, where it's ~9x faster on a P4.
However, TCC is not totally braindead: it will apparently do some inlining, and as Michael's answer points out, a recent patch does leave out the JMP (but still not the useless sub esp, 0).
I am writing a bootloader as follows:
bits 16
[org 0x7c00]
KERN_OFFSET equ 0x1000
mov [BOOTDISK], dl
mov dl, 0x0 ;0 is for floppy-disk
mov ah, 0x2 ;Read function for the interrupt
mov al, 0x15 ;Read 0x15 (21) sectors containing the kernel
mov ch, 0x0 ;Use cylinder 0
mov cl, 0x2 ;Start from the second sector which contains kernel
mov dh, 0x0 ;Read head 0
mov bx, KERN_OFFSET
int 0x13
jc disk_error
cmp al, 0x15
jne disk_error
jmp KERN_OFFSET:0x0
jmp $
disk_error:
jmp $
BOOTDISK: db 0
times 510-($-$$) db 0
dw 0xaa55
The kernel is a simple C program which prints "e" on the VGA display (seen on QEmu):
void main()
{
extern void put_in_mem();
char c = 'e';
put_in_mem(c, 0xA0);
}
I am using this code in 16 bit (real mode) in QEmu so I am using the compiler bcc for this code using:
bcc -ansi -c -o kernel.o kernel.c
I have the following questions:
1. When I try to disassemble this code, using
objdump -D -b binary -mi386 kernel.o
I get an output like this (only initial portion of output):
kernel.o: file format binary
Disassembly of section .data:
00000000 <.data>:
0: a3 86 01 00 2a mov %eax,0x2a000186
5: 3e 00 00 add %al,%ds:(%eax)
8: 00 22 add %ah,(%edx)
a: 00 00 add %al,(%eax)
c: 00 19 add %bl,(%ecx)
e: 00 00 add %al,(%eax)
10: 00 55 55 add %dl,0x55(%ebp)
13: 55 push %ebp
14: 55 push %ebp
15: 00 00 add %al,(%eax)
17: 00 02 add %al,(%edx)
19: 22 00 and (%eax),%al
This output does not seem to correspond to the kernel.c file I made. For example I could not see where 'e' is stored as ASCII 0x65 or where is the call to put_in_mem made. Is something wrong with the way I am disassembling the code?
To make the object file of the kernel for QEmu I used the following command:
ld86 -o kernel -d kernel.o put_in_mem.o
Here put_in_mem.o is the object file created after assembling the put_in_mem.asm file which contains the definition of the function put_in_mem() used in kernel.c.
Then floppy image for QEmu is made using:
cat boot.o kernel > floppy_img
But when I try to look at the address 0x10000 (using GDB), where the kernel was supposed to be present after loading (using the boot.asm program), it was not present.
Why is this happening?
Further, in ld command we used -Ttext option to specify the load address of the binary, should we use some similar option here with ld86?
Your kernel.o is in an object file format not understood by objdump so it tries to disassemble everything in it, including headers and whatnot. Try to disassemble the linked output kernel instead. Also objdump might not understand 16 bit code. Better try objdump86 if you have that available.
As to why it's not present: you are looking at the wrong place. You are loading it to offset 0x1000 (3 zeroes) but you are looking at 0x10000 (4 zeroes). Also note that you don't set up ES which is bad practice. Maybe you intended to set ES to 0x1000 and BX to 0x0000 and then you would find your kernel at 0x10000 physical address.
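The two addresses can be compared with the real-mode address formula, sketched here in C. The helper name real_mode_phys is made up, and the first case assumes ES happened to be 0 when int 0x13 ran, since the bootloader never sets it.

```c
#include <stdint.h>

/* Physical address of segment:offset in real mode: segment * 16 + offset. */
uint32_t real_mode_phys(uint16_t segment, uint16_t offset)
{
    return ((uint32_t)segment << 4) + offset;
}
```

With ES = 0 and BX = 0x1000 the disk read lands at physical 0x01000, while jmp 0x1000:0x0 transfers control to physical 0x10000. Setting ES = 0x1000 and BX = 0 before the read would make the two agree.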
The -Ttext doesn't influence loading, it only specifies where the code expects to find itself.
I'm familiar with data alignment and performance but I'm rather new to aligning code. I started programming in x86-64 assembly recently with NASM and have been comparing performance using code alignment. As far as I can tell NASM inserts nop instructions to achieve code alignment.
Here is a function I have been testing on an Ivy Bridge system:
void triad(float *x, float *y, float *z, int n, int repeat) {
float k = 3.14159f;
for(int r=0; r<repeat; r++) {
for(int i=0; i<n; i++) {
z[i] = x[i] + k*y[i];
}
}
}
The assembly I'm using for this is below. If I don't specify the alignment my performance compared to the peak is only about 90%. However, when I align the code before the loop as well as both inner loops to 16 bytes the performance jumps to 96%. So clearly the code alignment in this case makes a difference.
But here is the strangest part. If I align the innermost loop to 32 bytes it makes no difference in the performance of this function, however, in another version of this function using intrinsics in a separate object file I link in its performance jumps from 90% to 95%!
I did an object dump (using objdump -d -M intel) of the version aligned to 16 bytes (I posted the result to the end of this question) and 32 bytes and they are identical! It turns out that the inner most loop is aligned to 32 bytes anyway in both object files. But there must be some difference.
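A small C sketch shows why the two listings come out the same. The helper align_padding is a hypothetical name; the offsets are taken from the objdump output at the end of this question, where the code before .L2 ends at 0x36.

```c
#include <stdint.h>

/* Padding bytes needed so that addr becomes a multiple of align.
   align must be a power of two. */
uint64_t align_padding(uint64_t addr, uint64_t align)
{
    return (align - (addr & (align - 1))) & (align - 1);
}
```

From offset 0x36, both align 16 and align 32 need 10 bytes of padding and put .L2 at 0x40, so the assembler emits the same ten NOPs in either case.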
I did a hex dump of each object file and there is one byte in the object files that differs. The object file aligned to 16 bytes has a byte with 0x10 and the object file aligned to 32 bytes has a byte with 0x20. What exactly is going on? Why does code alignment in one object file affect the performance of a function in another object file? How do I know what is the optimal value to align my code to?
My only guess is that when the code is relocated by the loader that the 32 byte aligned object file affects the other object file using intrinsics. You can find the code to test all this at Obtaining peak bandwidth on Haswell in the L1 cache: only getting 62%
The NASM code I am using:
global triad_avx_asm_repeat
;RDI x, RSI y, RDX z, RCX n, R8 repeat
pi: dd 3.14159
align 16
section .text
triad_avx_asm_repeat:
shl rcx, 2
add rdi, rcx
add rsi, rcx
add rdx, rcx
vbroadcastss ymm2, [rel pi]
;neg rcx
align 16
.L1:
mov rax, rcx
neg rax
align 16
.L2:
vmulps ymm1, ymm2, [rdi+rax]
vaddps ymm1, ymm1, [rsi+rax]
vmovaps [rdx+rax], ymm1
add rax, 32
jne .L2
sub r8d, 1
jnz .L1
vzeroupper
ret
Result from objdump -d -M intel test16.o. The disassembly is identical if I change align 16 to align 32 in the assembly above just before .L2. However, the object files still differ by one byte.
test16.o: file format elf64-x86-64
Disassembly of section .text:
0000000000000000 <pi>:
0: d0 0f ror BYTE PTR [rdi],1
2: 49 rex.WB
3: 40 90 rex xchg eax,eax
5: 90 nop
6: 90 nop
7: 90 nop
8: 90 nop
9: 90 nop
a: 90 nop
b: 90 nop
c: 90 nop
d: 90 nop
e: 90 nop
f: 90 nop
0000000000000010 <triad_avx_asm_repeat>:
10: 48 c1 e1 02 shl rcx,0x2
14: 48 01 cf add rdi,rcx
17: 48 01 ce add rsi,rcx
1a: 48 01 ca add rdx,rcx
1d: c4 e2 7d 18 15 da ff vbroadcastss ymm2,DWORD PTR [rip+0xffffffffffffffda] # 0 <pi>
24: ff ff
26: 90 nop
27: 90 nop
28: 90 nop
29: 90 nop
2a: 90 nop
2b: 90 nop
2c: 90 nop
2d: 90 nop
2e: 90 nop
2f: 90 nop
0000000000000030 <triad_avx_asm_repeat.L1>:
30: 48 89 c8 mov rax,rcx
33: 48 f7 d8 neg rax
36: 90 nop
37: 90 nop
38: 90 nop
39: 90 nop
3a: 90 nop
3b: 90 nop
3c: 90 nop
3d: 90 nop
3e: 90 nop
3f: 90 nop
0000000000000040 <triad_avx_asm_repeat.L2>:
40: c5 ec 59 0c 07 vmulps ymm1,ymm2,YMMWORD PTR [rdi+rax*1]
45: c5 f4 58 0c 06 vaddps ymm1,ymm1,YMMWORD PTR [rsi+rax*1]
4a: c5 fc 29 0c 02 vmovaps YMMWORD PTR [rdx+rax*1],ymm1
4f: 48 83 c0 20 add rax,0x20
53: 75 eb jne 40 <triad_avx_asm_repeat.L2>
55: 41 83 e8 01 sub r8d,0x1
59: 75 d5 jne 30 <triad_avx_asm_repeat.L1>
5b: c5 f8 77 vzeroupper
5e: c3 ret
5f: 90 nop
Ahhh, code alignment...
Some basics of code alignment..
Most Intel architectures fetch 16B worth of instructions per clock.
The branch predictor has a larger window and looks at typically double that, per clock. The idea is to get ahead of the instructions fetched.
How your code is aligned will dictate which instructions you have available to decode and predict at any given clock (simple code locality argument).
Most modern Intel architectures cache instructions at various levels (either at the macro instruction level before decoding, or at the micro instruction level after decoding). This eliminates the effects of code alignment, as long as you are executing out of the micro/macro cache.
Also, most modern Intel architectures have some form of loop stream detector that detects loops and, again, executes them out of some cache that bypasses the front-end fetch mechanism.
Some Intel architectures are finicky about what they can cache and what they can't. There are often dependencies on the number of instructions/uops/alignment/branches/etc. Alignment may, in some cases, affect what's cached and what's not, and you can create cases where padding can prevent or cause a loop to get cached.
To make things even more complicated, the addresses of instructions are also used by the branch predictor. They are used in several ways, including (1) as a lookup into a branch prediction buffer to predict branches, (2) as a key/value to maintain some form of global state of branch behavior for prediction purposes, (3) as a key for determining indirect branch targets, etc. Therefore, alignment can actually have a pretty huge impact on branch prediction in some cases, due to aliasing or other poor prediction.
Some architectures use instruction addresses to determine when to prefetch data, and code alignment can interfere with that, if just the right conditions exist.
Aligning loops is not always a good thing to do, depending on how the code is laid out (especially if there's control flow in the loop).
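To make the fetch-window point above a bit more concrete, here is a rough sketch that counts how many aligned 16-byte blocks a loop body touches depending on where it starts. The 16-byte window comes from the first bullet; the helper name `fetch_blocks` is made up for illustration, and the 21-byte length is the size of the `.L2` loop body in the question's disassembly (0x40 through 0x54). Real front ends are far more complicated than this, so treat it only as a locality illustration.

```python
def fetch_blocks(start, length, window=16):
    """Number of aligned `window`-byte blocks covered by [start, start + length)."""
    first = start // window
    last = (start + length - 1) // window
    return last - first + 1

LOOP_BYTES = 21  # .L2 loop body in the question: 0x40..0x54 inclusive

# Try the loop at a few hypothetical start addresses.
for start in (0x40, 0x48, 0x4C):
    print(hex(start), fetch_blocks(start, LOOP_BYTES))
```

In this model the loop fits in two blocks when it starts at 0x40 or 0x48, but spills into a third when it starts at 0x4C.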
Having said all that blah blah, your issue could be one of any of these. It's important to look at the disassembly of not just the object, but the executable. You want to see what the final addresses are after everything is linked. Making changes in one object, could affect the alignment/addresses of instructions in another object after linking.
In some cases, it's nearly impossible to align your code in such a way as to maximize performance, simply because so many low-level architectural behaviors are hard to control and predict (that doesn't necessarily mean this is always the case). In some cases, your best bet is to have some default alignment strategy (say, align all entries on 16B boundaries, and outer loops the same) so that you minimize how much your performance varies from change to change. As a general strategy, aligning function entries is good. Aligning loops that are relatively small is good, as long as you're not adding nops in your execution path.
Beyond that, I'd need more info/data to pinpoint your exact problem, but thought some of this may help.. Good luck :)
The confusing nature of the effect (the assembled code doesn't change!) you are seeing is due to section alignment. When using the ALIGN macro in NASM, it actually has two separate effects:
1. Add 0 or more nop instructions so that the next instruction is aligned to the specified power-of-two boundary.
2. Issue an implicit SECTALIGN macro call, which sets the section alignment directive to the alignment amount [1].
The first point is the commonly understood behavior for align. It aligns the loop relatively within the section in the output file.
The second part is also needed however: imagine your loop was aligned to a 32 byte boundary in the assembled section, but then the runtime loader put your section, in memory, at an address aligned only to 8 bytes: this would make the in-file alignment quite pointless. To fix this, most executable formats allow each section to specify an alignment requirement, and the runtime loader/linker will be sure to load the section at a memory address which respects the requirement.
That's what the hidden SECTALIGN macro does - it ensures that your ALIGN macro works.
For your file, there is no difference in the assembled code between ALIGN 16 and ALIGN 32 because the next 16-byte boundary happens to also be the next 32-byte boundary (of course, every other 16-byte boundary is a 32-byte one, so that happens about half the time). The implicit SECTALIGN call is still different though, and that's the one byte difference you see in your hexdump. The 0x20 is decimal 32, and the 0x10 is decimal 16.
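As a sketch of those two effects: `align_effects` below is a hypothetical helper (not part of NASM) that returns the number of padding bytes and the resulting section alignment, assuming the section alignment is only ever raised, never lowered. The offset 0x36 is where the padding before `.L2` begins in the question's disassembly.

```python
def align_effects(offset, boundary, section_align=1):
    """Model of NASM's ALIGN: returns (padding bytes, new section alignment).

    Assumes `boundary` is a power of two and that the section alignment
    is only ever raised, never lowered.
    """
    padding = -offset % boundary
    return padding, max(section_align, boundary)

# Offset 0x36 is just after `neg rax` in the question's listing.
print(align_effects(0x36, 16))  # (10, 16)
print(align_effects(0x36, 32))  # (10, 32)
```

The padding is the same in both cases, so the instruction bytes come out identical; only the second value, which ends up in the section header, differs.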
You can verify this with objdump -h <binary>. Here's an example on a binary I aligned to 32 bytes:
objdump -h loop-test.o
loop-test.o: file format elf64-x86-64
Sections:
Idx Name Size VMA LMA File off Algn
0 .text 0000d18a 0000000000000000 0000000000000000 00000180 2**5
CONTENTS, ALLOC, LOAD, READONLY, CODE
The 2**5 in the Algn column is the 32-byte alignment. With 16-byte alignment this changes to 2**4.
Now it should be clear what happens - aligning the first function in your example changes the section alignment, but not the assembly. When you link your program together, the linker merges the various .text sections and picks the highest alignment.
At runtime, this causes the code to be aligned to a 32-byte boundary - but this doesn't affect the first function, because it isn't alignment-sensitive. Since the linker has merged your object files into one section, the larger alignment of 32 changes the alignment of every function (and instruction) in the section, including your other method, and so it changes the performance of your other function, which is alignment-sensitive.
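A simplified model of that merge, as a sketch rather than a description of how any particular linker is implemented: the merged section starts at the next address that satisfies the largest input alignment, and each input section is then placed at the next address satisfying its own alignment. The function `layout`, the section names, the sizes, and the start address are all made up for illustration.

```python
def layout(sections, start):
    """Place (name, size, align) input sections in order; return {name: address}.

    Simplified model: the merged section is aligned to the largest input
    alignment, and each input section to its own (powers of two assumed).
    """
    out_align = max(align for _, _, align in sections)
    addr = start + (-start % out_align)
    addresses = {}
    for name, size, align in sections:
        addr += -addr % align  # pad up to this input section's alignment
        addresses[name] = addr
        addr += size
    return addresses

START = 0x400490  # hypothetical address: a multiple of 16 but not of 32

before = layout([("asm", 0x60, 16), ("intrinsics", 0x80, 16)], START)
after = layout([("asm", 0x60, 32), ("intrinsics", 0x80, 16)], START)
print(hex(before["intrinsics"]), hex(after["intrinsics"]))
```

Under these assumptions, raising only the asm object's alignment from 16 to 32 moves the intrinsics code from 0x4004f0 to 0x400500, even though nothing in that second object changed.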
[1] To be precise, SECTALIGN only changes the section alignment if the current section alignment is less than the specified amount - so the final section alignment will be the same as the largest SECTALIGN directive in the section.