I have been trying to follow this tutorial (https://paraschetal.in/writing-your-own-shellcode) on how to write your own shellcode. 99% of it makes sense to me, but there are two lingering doubts in my mind. They relate to writing shellcode in general.
Firstly, I think I understand why we want to avoid the null byte but how does using the following avoid null bytes?
xor eax, eax
Doesn't eax now contain exactly null bytes? Or does it contain 0s? When we XOR something with itself, it returns False, correct?
Secondly, the tutorial says:
Finally, we’ll load the syscall number(11 or 0xb) to the eax
register. However if we use eax in our instruction, the resulting
shellcode will contain some NULL(\x00) bytes and we don’t want that.
Our eax register already is NULL. So we’ll just load the syscall
number to the al register instead of the entire eax register.
mov byte al, 0x0b
Now I do understand what is going on here, the number 11 (for execve) is being loaded into the first 8 bits of eax register (which is al). But the rest of eax still contains null bytes so what exactly is achieved here?
Please note, I've come here as a last resort after spending most of the day trying to understand this, so please take it easy on me :)
Exploits usually attack C code, so the shellcode often has to be delivered inside a NUL-terminated string. If the shellcode contains NUL bytes, the C code being exploited may stop copying at the first zero byte and drop the rest of the code.
This concerns only the machine code. If you need to call the system call with number 0xb, then naturally you need to be able to produce the number 0xb in the EAX register, but you can only use those forms of machine code that do not contain zero bytes in the machine code itself.
xor eax, eax
XORs each bit of eax with itself, which always yields 0, so it zeroes the register. Apart from its effect on the flags, it is functionally equivalent to
mov eax, 0
except that the latter will have the 0 coded as zero bytes in the machine code.
The machine code for
xor eax, eax
mov byte al, 0x0b
is
31 c0 b0 0b
As you can see, there are no embedded zero bytes in it. The machine code for
mov eax, 0xb
is
b8 0b 00 00 00
Both these programs are functionally equivalent in that they set the value of EAX register to 0xb.
If the latter shell code is handled as a null-terminated string by a C program, the rest of it after the b8 0b 00 could be discarded by the program and be replaced by other bytes in the memory, essentially making the shellcode not work.
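To make the truncation concrete, here is a minimal Python sketch (Python is used only for illustration; `c_string_copy` is a hypothetical stand-in for a strcpy-style copy in the vulnerable program, not part of any real exploit). It copies each of the two encodings above the way a C string function would, stopping at the first zero byte:

```python
def c_string_copy(payload: bytes) -> bytes:
    """Hypothetical stand-in for strcpy: keep bytes up to the first NUL."""
    end = payload.find(b"\x00")
    return payload if end == -1 else payload[:end]


# Encodings taken from the answer above.
MOV_EAX_IMM32 = bytes.fromhex("b8 0b 00 00 00")   # mov eax, 0xb
XOR_THEN_MOV_AL = bytes.fromhex("31 c0 b0 0b")    # xor eax, eax ; mov al, 0x0b

if __name__ == "__main__":
    for name, code in [("mov eax, 0xb", MOV_EAX_IMM32),
                       ("xor eax, eax; mov al, 0x0b", XOR_THEN_MOV_AL)]:
        survived = c_string_copy(code)
        print(f"{name}: sent {code.hex()}, survived {survived.hex()}")
```

Only the first two bytes of the mov eax, 0xb encoding survive such a copy, so the immediate is cut short and whatever happens to follow in memory gets decoded instead. The xor/mov al sequence passes through unchanged.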
The instruction mov eax, 0 assembles to
b8 00 00 00 00
which contains NUL bytes. However, the instruction xor eax, eax assembles to
31 c0
which is free of NUL bytes, making it suitable for shell code.
The same applies to mov al, 0x0b. If you do mov eax, 0x0b, the encoding is
b8 0b 00 00 00
which contains NUL bytes. However, mov al, 0x0b encodes to
b0 0b
avoiding NUL bytes.
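On the second doubt in the question: a register holding zero is not the problem, only zero bytes in the instruction encoding are. The following Python sketch models the register value (the helper names are made up for illustration) and suggests why the xor has to come first: mov al only replaces the low 8 bits, so any stale upper bits would otherwise remain.

```python
MASK32 = 0xFFFFFFFF


def xor_self(eax: int) -> int:
    """xor eax, eax: every bit XORed with itself gives 0."""
    return (eax ^ eax) & MASK32


def mov_al(eax: int, imm8: int) -> int:
    """mov al, imm8: replace only the low 8 bits, keep the upper 24."""
    return (eax & 0xFFFFFF00) | (imm8 & 0xFF)


def mov_eax(imm32: int) -> int:
    """mov eax, imm32: replace all 32 bits."""
    return imm32 & MASK32


if __name__ == "__main__":
    stale = 0xDEADBEEF  # arbitrary leftover value, chosen for illustration
    print(hex(mov_al(xor_self(stale), 0x0B)))  # same value as mov eax, 0xb
    print(hex(mov_al(stale, 0x0B)))            # upper bytes still stale
```

With the xor, the register ends up holding exactly 0xb, the same value mov eax, 0xb would produce. Without it, the syscall number would depend on whatever was in the upper bytes.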
Let's say we work on architecture x86_64 and let's say we have the following string, "123456". In ASCII characters, it becomes 31 32 33 34 35 36 00.
Now, which assembly instructions should I use to move the entire (even if fragmented) content of this string somewhere in a way that %rdi stores the address of that string (points to that)?
Because I am not simply able to move the hex representation of the string into a register, like one can do with unsigned values, how do I do it?
There are a couple of ways to do so.
If you want to move the entire string to another offset first, you would have to do so with a loop.
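mov rbx, 0 ; rbx = byte index into the string
copy_loop: ; renamed: "loop" is also an instruction mnemonic, which some assemblers reject as a label
mov al, [string+rbx] ; load one byte from the source
mov [copyoffset+rbx], al ; store it at the destination
inc rbx
cmp al, 0x0 ; stop once the terminating NUL has been copied
jne copy_loop
... Insert other code here
Then you can use the lea instruction described below to load the address of the copy into rdi.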
If you just want to load the address of the string and don't care about moving it you can just use lea
lea rdi, [stringoffset]
Edit: Changed rax to al so we only move one byte at a time
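As a rough model of what the copy loop does, here is a Python sketch. The `memory` bytearray and the offsets 0 and 16 are assumptions made up for the example; they stand in for the process's memory and for the string and copyoffset labels.

```python
def copy_c_string(memory: bytearray, src: int, dst: int) -> int:
    """Copy a NUL-terminated string byte by byte; return the bytes copied."""
    index = 0                          # plays the role of rbx
    while True:
        byte = memory[src + index]     # mov al, [string+rbx]
        memory[dst + index] = byte     # mov [copyoffset+rbx], al
        index += 1                     # inc rbx
        if byte == 0:                  # cmp al, 0x0 / jne
            return index


memory = bytearray(32)
memory[0:7] = b"123456\x00"            # 31 32 33 34 35 36 00
copied = copy_c_string(memory, src=0, dst=16)

if __name__ == "__main__":
    print(copied, memory[16:23].hex())
```

Because the comparison happens after the store, the terminating NUL is copied too, so the destination is itself a valid C string whose address can then be loaded into rdi.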
I have disassembled a C program with Radare2. Inside this program there are many calls to scanf like the following:
0x000011fe 488d4594 lea rax, [var_6ch]
0x00001202 4889c6 mov rsi, rax
0x00001205 488d3df35603. lea rdi, [0x000368ff] ; "%d" ; const char *format
0x0000120c b800000000 mov eax, 0
0x00001211 e86afeffff call sym.imp.__isoc99_scanf ; int scanf(const char *format)
0x00001216 8b4594 mov eax, dword [var_6ch]
0x00001219 83f801 cmp eax, 1 ; rsi ; "ELF\x02\x01\x01"
0x0000121c 740a je 0x1228
Here scanf has the address of the string "%d" passed to it by the line lea rdi, [0x000368ff]. I'm assuming 0x000368ff is the location of "%d" in the executable file, because if I restart Radare2 in debugging mode (r2 -d ./exec) then lea rdi, [0x000368ff] is replaced by lea rdi, [someMemoryAddress].
If lea rdi, [0x000368ff] is what's hard-coded in the file, then how does the instruction change to the actual memory address when run?
Radare is tricking you: what you see is not the real instruction but a simplified rendering of it.
The real instruction is:
0x00001205 488d3df3560300 lea rdi, qword [rip + 0x356f3]
0x0000120c b800000000 mov eax, 0
This is a typical position independent lea. The string to use is stored in your binary at the offset 0x000368ff, but since the executable is position independent, the real address needs to be calculated at runtime. Since the next instruction is at offset 0x0000120c, you know that, no matter where the binary is loaded in memory, the address you want will be rip + (0x000368ff - 0x0000120c) = rip + 0x356f3, which is what you see above.
When doing static analysis, since Radare does not know the base address of the binary in memory, it simply calculates 0x0000120c + 0x356f3 = 0x000368ff. This makes reverse engineering easier, but can be confusing since the real instruction is different.
As an example, the following program:
#include <stdio.h>
int main(void) {
    puts("Hello world!");
    return 0;
}
When compiled produces:
6b4: 48 8d 3d 99 00 00 00 lea rdi,[rip+0x99]
6bb: e8 a0 fe ff ff call 560 <puts@plt>
So rip + 0x99 = 0x6bb + 0x99 = 0x754, and if we take a look at offset 0x754 in the binary with hd:
$ hd -s 0x754 -n 16 a.out
00000754 48 65 6c 6c 6f 20 77 6f 72 6c 64 21 00 00 00 00 |Hello world!....|
00000764
The full instruction is
48 8d 3d f3 56 03 00
This instruction is literally
lea rdi, [rip + 0x000356f3]
with a rip relative addressing mode. The instruction pointer rip has the value 0x0000120c when the instruction is executed, thus rdi receives the desired value 0x000368ff.
If this is not the real address, it is possible that your program is a position-independent executable (PIE) which is subject to relocation. Since the address is encoded using a rip-relative addressing mode, no relocation is needed and the address is correct, regardless of where the binary is loaded.
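The arithmetic in both answers can be checked with a short Python sketch. The function name and the load base below are assumptions chosen for illustration; the instruction bytes and file offsets come from the listings above.

```python
def rip_relative_target(insn_addr: int, insn: bytes) -> int:
    """Target of a 7-byte `lea r64, [rip + disp32]` (REX.W 8D /r, mod=00, rm=101)."""
    if (len(insn) != 7 or (insn[0] & 0xF8) != 0x48
            or insn[1] != 0x8D or (insn[2] & 0xC7) != 0x05):
        raise ValueError("not a rip-relative lea in the expected form")
    disp32 = int.from_bytes(insn[3:7], "little", signed=True)
    next_rip = insn_addr + len(insn)   # rip points at the following instruction
    return next_rip + disp32


SCANF_LEA = bytes.fromhex("48 8d 3d f3 56 03 00")   # at file offset 0x1205
PUTS_LEA = bytes.fromhex("48 8d 3d 99 00 00 00")    # at file offset 0x6b4
LOAD_BASE = 0x555555554000                          # hypothetical runtime base

if __name__ == "__main__":
    print(hex(rip_relative_target(0x1205, SCANF_LEA)))              # static view
    print(hex(rip_relative_target(LOAD_BASE + 0x1205, SCANF_LEA)))  # runtime view
```

The encoded displacement is the same in both calls. Only rip differs, which is why the instruction bytes never need patching when the binary is loaded at a different base.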
Can someone explain why this code:
#include <stdio.h>
int main()
{
return 0;
}
when compiled with tcc using tcc code.c produces this asm:
00401000 |. 55 PUSH EBP
00401001 |. 89E5 MOV EBP,ESP
00401003 |. 81EC 00000000 SUB ESP,0
00401009 |. 90 NOP
0040100A |. B8 00000000 MOV EAX,0
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
00401015 |. C3 RETN
I guess that
00401009 |. 90 NOP
is maybe there for some memory alignment, but what about
0040100F |. E9 00000000 JMP fmt_vuln1.00401014
00401014 |. C9 LEAVE
I mean, why would the compiler insert a near jump that just jumps to the next instruction? LEAVE would execute anyway.
I'm on 64-bit Windows generating 32-bit executable using TCC 0.9.26.
Superfluous JMP before the Function Epilogue
The JMP at the bottom that goes to the next instruction was fixed in a commit. Version 0.9.27 of TCC resolves this issue:
When 'return' is the last statement of the top-level block
(very common and often recommended case) jump is not needed.
As for the reason it existed in the first place: the idea is that each function has a common exit point. If a block of code contains a return, the JMP goes to the common exit point, where stack cleanup is done and the ret is executed. Originally, the code generator also emitted this JMP, needlessly, when the return appeared just before the function's final } (closing brace). The fix checks whether a return statement is followed by the closing brace at the top level of the function. If it is, the JMP is omitted.
An example of code that has a return at a lower scope before a closing brace:
int main(int argc, char *argv[])
{
if (argc == 3) {
argc++;
return argc;
}
argc += 3;
return argc;
}
The generated code looks like:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
401009: 90 nop
40100a: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40100d: 83 f8 03 cmp eax,0x3
401010: 0f 85 11 00 00 00 jne 0x401027
401016: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
401019: 89 c1 mov ecx,eax
40101b: 40 inc eax
40101c: 89 45 08 mov DWORD PTR [ebp+0x8],eax
40101f: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` inside the if statement
401022: e9 11 00 00 00 jmp 0x401038
401027: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
40102a: 83 c0 03 add eax,0x3
40102d: 89 45 08 mov DWORD PTR [ebp+0x8],eax
401030: 8b 45 08 mov eax,DWORD PTR [ebp+0x8]
; Jump to common function exit point. This is the `return argc` at end of the function
401033: e9 00 00 00 00 jmp 0x401038
; Common function exit point
401038: c9 leave
401039: c3 ret
In versions prior to 0.9.27, the return argc inside the if statement jumps to a common exit point (the function epilogue). The return argc at the bottom of the function also jumps to that same exit point. The problem is that the common exit point happens to sit right after the top-level return argc, so the side effect is an extra JMP to the very next instruction.
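To see from the bytes alone that the final JMP is a no-op, the rel32 displacement can be decoded by hand. The Python sketch below (helper names are illustrative) handles only the two forms that appear in the listing, jmp rel32 (E9) and jcc rel32 (0F 8x):

```python
def rel32_branch_target(insn_addr: int, insn: bytes) -> int:
    """Target of `jmp rel32` (E9) or `jcc rel32` (0F 80..8F)."""
    if insn[0] == 0xE9:
        opcode_len = 1
    elif insn[0] == 0x0F and 0x80 <= insn[1] <= 0x8F:
        opcode_len = 2
    else:
        raise ValueError("not a rel32 branch handled by this sketch")
    rel32 = int.from_bytes(insn[opcode_len:opcode_len + 4], "little", signed=True)
    # The displacement is relative to the address of the next instruction.
    return insn_addr + opcode_len + 4 + rel32


def is_redundant_jump(insn_addr: int, insn: bytes) -> bool:
    """True if the branch lands on the instruction that follows it anyway."""
    return rel32_branch_target(insn_addr, insn) == insn_addr + len(insn)


if __name__ == "__main__":
    print(is_redundant_jump(0x401022, bytes.fromhex("e911000000")))  # inner return
    print(is_redundant_jump(0x401033, bytes.fromhex("e900000000")))  # final return
```

A displacement of zero means "continue with the next instruction", which is exactly the case the 0.9.27 fix stops emitting.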
NOP after Function Prologue
The NOP isn't for alignment. Because of the way Windows implements guard pages for the stack (for programs in Portable Executable format), TCC has two types of prologues. If the local stack space required is < 4096 bytes (smaller than a single page), you see this kind of code generated:
401000: 55 push ebp
401001: 89 e5 mov ebp,esp
401003: 81 ec 00 00 00 00 sub esp,0x0
The sub esp,0 isn't optimized out. It is the amount of stack space needed for local variables (in this case 0). If you add some local variables you will see the 0x0 in the SUB instruction changes to coincide with the amount of stack space needed for local variables. This prologue requires 9 bytes. There is another prologue to handle the case where the stack space needed is >= 4096 bytes. If you add an array of 4096 bytes with something like:
char somearray[4096];
and look at the resulting instruction you will see the function prologue change to a 10 byte prologue:
401000: b8 00 10 00 00 mov eax,0x1000
401005: e8 d6 00 00 00 call 0x4010e0
TCC's code generator assumes that the function prologue is always 10 bytes when targeting WinPE. This is primarily because TCC is a single pass compiler. The compiler doesn't know how much stack space a function will use until after the function is processed. To get around not knowing this ahead of time, TCC pre-allocates 10 bytes for the prologue to fit the largest method. Anything shorter is padded to 10 bytes.
In the case where stack space needed < 4096 bytes the instructions used total 9 bytes. The NOP is used to pad the prologue to 10 bytes. For the case where >= 4096 bytes are needed, the number of bytes is passed in EAX and the function __chkstk is called to allocate the required stack space instead.
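The reserve-then-fill scheme described above can be sketched in Python. This is an illustration of the idea only, not TCC's actual code: `build_prologue`, `emit_function` and the `chkstk_rel32` parameter are names invented here, and the call displacement is simply passed in rather than computed.

```python
import struct

PROLOGUE_SIZE = 10   # bytes reserved up front, per the description above
PAGE_SIZE = 4096


def build_prologue(stack_size: int, chkstk_rel32: int = 0) -> bytes:
    """Return a prologue that is always exactly PROLOGUE_SIZE bytes long."""
    if stack_size < PAGE_SIZE:
        code = b"\x55"                                        # push ebp
        code += b"\x89\xe5"                                   # mov ebp, esp
        code += b"\x81\xec" + struct.pack("<I", stack_size)   # sub esp, imm32
        code += b"\x90" * (PROLOGUE_SIZE - len(code))         # pad with NOP
    else:
        code = b"\xb8" + struct.pack("<I", stack_size)        # mov eax, imm32
        code += b"\xe8" + struct.pack("<i", chkstk_rel32)     # call __chkstk
    if len(code) != PROLOGUE_SIZE:
        raise AssertionError("prologue templates must fill the reserved space")
    return code


def emit_function(body: bytes, stack_size: int, chkstk_rel32: int = 0) -> bytes:
    """Single-pass style: reserve the prologue, emit the body, then fill it in."""
    code = bytearray(PROLOGUE_SIZE) + body    # stack size not yet known here
    code[:PROLOGUE_SIZE] = build_prologue(stack_size, chkstk_rel32)
    return bytes(code)


if __name__ == "__main__":
    body = bytes.fromhex("b800000000 e900000000 c9 c3")  # body from the question
    print(emit_function(body, stack_size=0).hex())
```

With a stack size of 0, the output reproduces the bytes from the question's disassembly, including the single 90 that pads the 9-byte prologue to 10.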
TCC is not an optimizing compiler, at least not really. Every single instruction it emitted for main is sub-optimal or not needed at all, except the ret. IDK why you thought the JMP was the only instruction that might not make sense for performance.
This is by design: TCC stands for Tiny C Compiler. The compiler itself is designed to be simple, so it intentionally doesn't include code to look for many kinds of optimizations. Notice the sub esp, 0: this useless instruction clearly comes from filling in a function-prologue template, and TCC doesn't even look for the special case where the offset is 0 bytes. Other functions need stack space for locals, or to align the stack before any child function calls, but this main() doesn't. TCC doesn't care, and blindly emits sub esp,0 to reserve 0 bytes.
(In fact, TCC is truly one pass, laying out machine code as it goes through the C source statement by statement. It uses the imm32 encoding for sub so it will have room to fill in the right number (upon reaching the end of the function) even if it turns out the function uses more than 255 bytes of stack space. So instead of constructing a list of instructions in memory to finish assembling later, it just remembers one spot to fill in a uint32_t. That's why it can't omit the sub when it turns out not to be needed.)
Most of the work in creating a good optimizing compiler that anyone will use in practice is the optimizer. Even parsing modern C++ is peanuts compared to reliably emitting efficient asm (which not even gcc / clang / icc can do all the time, even without considering autovectorization). Just generating working but inefficient asm is easy compared to optimizing; most of gcc's codebase is optimization, not parsing. See Basile's answer on Why are there so few C compilers?
The JMP (as you can see from @MichaelPetch's answer) has a similar explanation: TCC (until recently) didn't optimize the case where a function only has one return path, and doesn't need to JMP to a common epilogue.
There's even a NOP in the middle of the function. It's obviously a waste of code bytes and decode / issue front-end bandwidth and out-of-order window size. (Sometimes executing a NOP outside a loop or something is worth it to align the top of a loop which is branched to repeatedly, but a NOP in the middle of a basic block is basically never worth it, so that's not why TCC put it there. And if a NOP did help, you could probably do even better by reordering instructions or choosing larger instructions to do the same thing without a NOP. Even proper optimizing compilers like gcc/clang/icc don't try to predict this kind of subtle front-end effect.)
@MichaelPetch points out that TCC always wants its function prologue to be 10 bytes, because it's a single-pass compiler (and it doesn't know how much space it needs for locals until the end of the function, when it comes back and fills in the imm32). But Windows targets need stack probes when modifying ESP / RSP by more than a whole page (4096 bytes), and the alternate prologue for that case is 10 bytes, instead of 9 for the normal one without the NOP. So this is another tradeoff favouring compilation speed over good asm.
An optimizing compiler would xor-zero EAX (because that's smaller and at least as fast as mov eax,0), and leave out all the other instructions. Xor-zeroing is one of the most well-known / common / basic x86 peephole optimizations, and has several advantages other than code-size on some modern x86 microarchitectures.
main:
xor eax,eax
ret
Some optimizing compilers might still make a stack frame with EBP, but tearing it down with pop ebp would be strictly better than leave on all CPUs, for this special case where ESP = EBP so the mov esp,ebp part of leave isn't needed. pop ebp is still 1 byte, but it's also a single-uop instruction on modern CPUs, unlike leave which is 2 or 3 on modern CPUs. (http://agner.org/optimize/, and see also other performance optimization links in the x86 tag wiki.) This is what gcc does. It's a fairly common situation; if you push some other registers after making a stack frame, you have to point ESP at the right place before pop ebx or whatever. (Or use mov to restore them.)
The benchmarks TCC cares about are compilation speed, not quality (speed or size) of the resulting code. For example, the TCC web site has a benchmark in lines/sec and MB/sec (of C source) vs. gcc3.2 -O0, where it's ~9x faster on a P4.
However, TCC is not totally braindead: it will apparently do some inlining, and as Michael's answer points out, a recent patch does leave out the JMP (but still not the useless sub esp, 0).
Hey guys, I'm not sure if I'm going about all this the right way. I need to calculate the first 12 numbers of the Fibonacci sequence, which I'm pretty sure it's already doing. But now I need to display the hexadecimal contents of (Fibonacci) in my program using dumpMem. I should be getting a printout of: 01 01 02 03 05 08 0D 15 22 37 59 90
But I'm only getting: 01 01 00 00 00 00 00 00 00 00 00 00
Any tips or help is much much appreciated.
INCLUDE Irvine32.inc
.data
reg DWORD -1,1,0 ; Initializes a DOUBLEWORD array, giving it the values of -1, 1, and 0
array DWORD 48 DUP(?)
Fibonacci BYTE 1, 1, 10 DUP (?)
.code
main PROC
mov array, 1
mov esi,OFFSET array ; or should this be Fibonacci?
mov ecx,12
add esi, 4
L1:
mov edx, [reg]
mov ebx, [reg+4]
mov [reg+8], edx
add [reg+8], ebx ; Adds the value of the EBX and 'temp(8)' together and stores it as temp(8)
mov eax, [reg+8] ; Moves the value of 'temp(8)' into the EAX register
mov [esi], eax ; Moves the value of EAX into the offset of array
mov [reg], ebx ; Moves the value of the EBX register to 'temp(0)'
mov [reg+4], eax ; Moves the value of the EAX register to 'temp(4)
add esi, 4
; call DumpRegs
call WriteInt
loop L1
;mov ebx, offset array
;mov ecx, 12
;L2:
;mov eax, [esi]
;add esi, 4
;call WriteInt
;loop L2
;Below will show hexadecimal contents of string target-----------------
mov esi, OFFSET Fibonacci ; offset the variables
mov ebx,1 ; byte format
mov ecx, SIZEOF Fibonacci ; counter
call dumpMem
exit
main ENDP
END main
It seems to me that the problem here is with computing the Fibonacci sequence. Your code for that leaves me somewhat...puzzled. You have a bunch of "stuff" there, that seems to have nothing to do with computing Fibonacci numbers (e.g., reg), and others that could, but it seems you don't really know what you're trying to do with them.
Looking at your loop to compute the sequence, the first thing that practically jumps out at me is that you're using memory a lot. One of the first (and most important) things when you're writing assembly language is to maximize your use of registers and minimize your use of memory.
As a hint, I think if you read anything from memory in the course of computing the sequence, you're probably making a mistake. You should be able to do all the computation in registers, so the only memory references will be writing results. Since you're (apparently) producing only byte-sized results, you should need only one array of the proper number of bytes to hold the results (i.e., one byte per number you're going to generate).
I'm tempted to write a little routine showing how neatly this can be adapted to assembly language, but I suppose I probably shouldn't do that...
Your call to dumpMem is correct, but your program is not storing the results of your calculations at the correct location: the region you call "Fibonacci" remains initialized to 1, 1, and ten zeros. You need to make sure that your loop starts writing at the offset of Fibonacci plus 2, and moves ten times in one-byte increments (ten, not twelve, because you provided the two initial items in the initializer).
I'm sorry, I cannot be any more specific, as any question containing the word "Fibonacci" inevitably turns out to be someone's homework :-)
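Without giving away the assembly, the target output can be checked with a few lines of Python. This is only a sketch of the shape suggested above: two running values (the "registers") and one byte stored per result. The function names are made up for the example.

```python
def fibonacci_bytes(count: int) -> bytes:
    """First `count` Fibonacci numbers, one byte each (values must fit in 0..255)."""
    a, b = 1, 1                # the only state needed: two "registers"
    out = bytearray()
    for _ in range(count):
        out.append(a)          # the only "memory write": store one result byte
        a, b = b, a + b
    return bytes(out)


def dump_hex(data: bytes) -> str:
    """Format bytes the way a byte-wise memory dump would show them."""
    return " ".join(f"{byte:02X}" for byte in data)


if __name__ == "__main__":
    print(dump_hex(fibonacci_bytes(12)))
```

The twelfth value is 144 (0x90), which still fits in a single byte, so a BYTE array of 12 entries is enough to hold the whole sequence.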
LEA EAX, [EAX]
I encountered this instruction in a binary compiled with the Microsoft C compiler. It clearly can't change the value of EAX. Then why is it there?
It is a NOP.
The following are typically used as NOPs. They all do the same thing, but they result in machine code of different lengths. Depending on the alignment requirement, one of them is chosen:
xchg eax, eax = 90
mov eax, eax = 89 C0
lea eax, [eax + 0x00] = 8D 40 00
From this article:
This trick is used by MSVC++ compiler
to emit the NOP instructions of
different length (for padding before
jump targets). For example, MSVC++
generates the following code if it
needs 4-byte and 6-byte padding:
8d6424 00 lea [ebx+00],ebx ; 4-byte padding
8d9b 00000000 lea [esp+00000000],esp ; 6-byte padding
The first line is marked as "npad 4"
in assembly listings generated by the
compiler, and the second is "npad 6".
The registers (ebx, esp) can be chosen
from the rarely used ones to avoid
false dependencies in the code.
So this is just a kind of NOP, appearing right before targets of jmp instructions in order to align them.
Interestingly, you can identify the compiler from the characteristic nature of such instructions.
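For anyone who wants to check such padding bytes by hand, here is a small Python sketch that decodes the ModRM/SIB bytes of a 32-bit lea with a plain base register and an optional displacement. It is deliberately incomplete (no index registers, no prefixes), and the function names are made up for the example.

```python
REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]


def decode_lea(code: bytes):
    """Decode `lea r32, [base + disp]` in 32-bit mode.

    Returns (dest, base, disp, length). Index registers and the
    displacement-only forms are rejected with ValueError.
    """
    if code[0] != 0x8D:
        raise ValueError("not a lea opcode")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    pos = 2
    if mod == 3:
        raise ValueError("lea needs a memory operand")
    if rm == 4:                                  # a SIB byte follows
        sib = code[pos]
        pos += 1
        if (sib >> 3) & 7 != 4:
            raise ValueError("index registers are not handled in this sketch")
        base = sib & 7
    else:
        base = rm
    if mod == 0 and base == 5:
        raise ValueError("displacement-only form is not handled in this sketch")
    size = {0: 0, 1: 1, 2: 4}[mod]               # no disp, disp8, disp32
    disp = int.from_bytes(code[pos:pos + size], "little", signed=True) if size else 0
    return REGS[reg], REGS[base], disp, pos + size


def is_nop_lea(code: bytes) -> bool:
    """True if the lea just copies a register onto itself."""
    dest, base, disp, _ = decode_lea(code)
    return dest == base and disp == 0


if __name__ == "__main__":
    for hex_bytes in ("8d00", "8d4000", "8d642400", "8d9b00000000"):
        print(hex_bytes, decode_lea(bytes.fromhex(hex_bytes)))
```

By this decoding, 8d 64 24 00 is the 4-byte form operating on esp and 8d 9b 00 00 00 00 is the 6-byte form operating on ebx, which suggests the register names in the quoted listing are attached to the opposite lines.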
LEA EAX, [EAX]
Indeed doesn't change the value of EAX. As far as I understand, it's identical in function to:
MOV EAX, EAX
Did you see it in optimized code, or unoptimized code?