I tried to write a char[] to stdout using inline NASM-style assembly (note that .intel_syntax and .att_syntax are added so it can be compiled with gcc),
but it doesn't write anything to stdout.
I am using Linux 16 (x86) on a virtual machine.
Is it because of char c[]? (I read that with this way of compiling we cannot use memory variables, but what should I do instead?)
#include<stdio.h>
char c[] = "abcd";
int main(){
asm (".intel_syntax noprefix");
// write c into stdout
asm("mov EAX,4"); // 4: write
asm("mov EBX,1"); // 1: stdout
asm("mov ECX,c");
asm("mov EDX,5");
asm("int 0x80");
// exit program
asm("mov EAX,1");
asm("int 0x80");
asm (".att_syntax noprefix");
}
The output is nothing.
The GNU assembler (which is what gcc uses) does not use NASM syntax. In .intel_syntax noprefix mode it instead uses a variant of Microsoft's MASM syntax, where no brackets are needed to dereference a variable. Since you want to load the address of the c variable rather than its value, you need the offset keyword:
mov ecx, offset c
I strongly recommend avoiding inline assembly as much as possible while learning assembly. Using inline assembly in gcc requires good knowledge of how exactly the whole mechanism works, and writing arbitrary instructions usually leads to wrong code. Even your simple code is already fundamentally broken: it overwrites registers without telling the compiler, so it would stop working as soon as the surrounding code became complicated enough for the compiler to use those registers itself.
Instead, put your assembly in a separate file and link it in. This sidesteps all issues you have with inline assembly and allows you to use NASM as you wanted. For example, try something like this:
main.c
char c[] = "abcd";
/* the function you define in print_c.asm */
extern void print_c();
int main() {
print_c(); /* call your assembly function */
}
print_c.asm
; pull in c defined in main.c
extern c
section .text
global print_c
print_c:
; write c to stdout
mov eax, 4
mov ebx, 1
mov ecx, c
mov edx, 5
int 0x80
; exit program
mov eax, 1
int 0x80
Then assemble, compile, and link with:
nasm -felf print_c.asm
cc -m32 -o print_c print_c.o main.c
Recently I read the code of one public C library and found below function definition:
void* block_alloc(void** block, size_t* len, size_t type_size)
{
return malloc(type_size);
(void)block;
(void)len;
}
I wonder whether it will arrive at the statements after return. If not, what's the purpose of these 2 statements that convert some data to void ?
As Basil notes, the (void) statements are likely intended to silence compiler warnings about the unused parameters. However, you can move the (void) statements before the return to make them less confusing, with the same effect.
In fact, there's yet another way to achieve the same effect without any extra statements. Many compilers already support it today, although it is not officially part of the C standard before C2X (C23):
void* block_alloc(void**, size_t*, size_t type_size)
{
return malloc(type_size);
}
If you don't name the parameters, typical compilers don't expect you to use them.
First, these statements appearing in the block after a return will never be executed.
Check by reading some C standard like n1570.
Second, on some compilers (perhaps GCC 10 invoked as gcc -Wall -Wextra) the useless statements might avoid some warnings.
In my opinion, coding these statements before the return won't change the machine code emitted by an optimizing compiler (use gcc -Wall -Wextra -O2 -fverbose-asm -S to check and emit the assembler code) and makes the C source code more understandable.
GCC provides, as an extension, the variable __attribute__ named unused.
Perhaps in your software your block_alloc is assigned to some important function pointer (whose signature is required).
It is used to silence the warnings. Some coding standards require every parameter to be used in the function body, and their static analyzers will not pass the code otherwise.
Placing it after the return prevents code from being generated for it in some circumstances, for example when the parameter is volatile and the cast would otherwise force a read:
int foo(volatile unsigned x)
{
(void)x;
return 0;
}
int foo1(volatile unsigned x)
{
return 0;
(void)x;
}
foo:
mov DWORD PTR [rsp-4], edi
mov eax, DWORD PTR [rsp-4]
xor eax, eax
ret
foo1:
mov DWORD PTR [rsp-4], edi
xor eax, eax
ret
I came across a minimal HTTP server that is written without libc: https://github.com/Francesco149/nolibc-httpd
I can see that basic string handling functions are defined, leading to the write syscall:
#define fprint(fd, s) write(fd, s, strlen(s))
#define fprintn(fd, s, n) write(fd, s, n)
#define fprintl(fd, s) fprintn(fd, s, sizeof(s) - 1)
#define fprintln(fd, s) fprintl(fd, s "\n")
#define print(s) fprint(1, s)
#define printn(s, n) fprintn(1, s, n)
#define printl(s) fprintl(1, s)
#define println(s) fprintln(1, s)
And the basic syscalls are declared in the C file:
size_t read(int fd, void *buf, size_t nbyte);
ssize_t write(int fd, const void *buf, size_t nbyte);
int open(const char *path, int flags);
int close(int fd);
int socket(int domain, int type, int protocol);
int accept(int socket, sockaddr_in_t *restrict address,
socklen_t *restrict address_len);
int shutdown(int socket, int how);
int bind(int socket, const sockaddr_in_t *address, socklen_t address_len);
int listen(int socket, int backlog);
int setsockopt(int socket, int level, int option_name, const void *option_value,
socklen_t option_len);
int fork();
void exit(int status);
So I guess the magic happens in start.S, which contains _start and a special way of encoding syscalls by creating global labels which fall through and accumulating values in r9 to save bytes:
.intel_syntax noprefix
/* functions: rdi, rsi, rdx, rcx, r8, r9 */
/* syscalls: rdi, rsi, rdx, r10, r8, r9 */
/* ^^^ */
/* stack grows from a high address to a low address */
#define c(x, n) \
.global x; \
x:; \
add r9,n
c(exit, 3) /* 60 */
c(fork, 3) /* 57 */
c(setsockopt, 4) /* 54 */
c(listen, 1) /* 50 */
c(bind, 1) /* 49 */
c(shutdown, 5) /* 48 */
c(accept, 2) /* 43 */
c(socket, 38) /* 41 */
c(close, 1) /* 03 */
c(open, 1) /* 02 */
c(write, 1) /* 01 */
.global read /* 00 */
read:
mov r10,rcx
mov rax,r9
xor r9,r9
syscall
ret
.global _start
_start:
xor rbp,rbp
xor r9,r9
pop rdi /* argc */
mov rsi,rsp /* argv */
call main
call exit
Is this understanding correct? GCC use the symbols defined in start.S for the syscalls, then the program starts in _start and calls main from the C file?
Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?
(I cloned the repo and tweaked the .c and .S to compile better with clang -Oz: 992 bytes, down from the original 1208 with gcc. See the WIP-clang-tuning branch in my fork, until I get around to cleaning that up and sending a pull request. With clang, inline asm for the syscalls does save size overall, especially once main has no calls and no rets. IDK if I want to hand-golf the whole .asm after regenerating from compiler output; there are certainly chunks of it where significant savings are possible, e.g. using lodsb in loops.)
It looks like they need r9 to be 0 before a call to any of these labels, either with a register global var or maybe gcc -ffixed-r9 to tell GCC to keep its hands off that register permanently. Otherwise GCC would have left whatever garbage in r9, just like other registers.
Their functions are declared with normal prototypes, not 6 args with dummy 0 args to get every call site to actually zero r9, so that's not how they're doing it.
special way of encoding syscalls
I wouldn't describe that as "encoding syscalls". Maybe "defining syscall wrapper functions". They're defining their own wrapper function for each syscall, in an optimized way that falls through into one common handler at the bottom. In the C compiler's asm output, you'll still see call write.
(It might have been more compact for the final binary to use inline asm to let the compiler inline a syscall instruction with the args in the right registers, instead of making it look like a normal function that clobbers all the call-clobbered registers. Especially if compiled with clang -Oz which would use 3-byte push 2 / pop rax instead of 5-byte mov eax, 2 to set up the call number. push imm8/pop/syscall is the same size as call rel32.)
Yes, you can define functions in hand-written asm with .global foo / foo:. You could look at this as one large function with multiple entry points for different syscalls. In asm, execution always passes to the next instruction, regardless of labels, unless you use a jump/call/ret instruction. The CPU doesn't know about labels.
So it's just like a C switch(){} statement without break; between case: labels, or like C labels you can jump to with goto. Except of course in asm you can do this at global scope, while in C you can only goto within a function. And in asm you can call instead of just goto (jmp).
static long callnum = 0; // r9 = 0 before a call to any of these
...
socket:
callnum += 38;
close:
callnum++; // can use inc instead of add 1
open: // missed optimization in their asm
callnum++;
write:
callnum++;
read:
tmp=callnum;
callnum=0;
retval = syscall(tmp, args);
Or if you recast this as a chain of tailcalls, where we can omit even the jmp foo and instead just fall through: C like this truly could compile to the hand-written asm, if you had a smart enough compiler. (And you could solve the arg-type
register long callnum asm("r9"); // GCC extension
long open(args...) {
callnum++;
return write(args...);
}
long write(args...) {
callnum++;
return read(args...); // tailcall
}
long read(args...){
tmp=callnum;
callnum=0; // reset callnum for next call
return syscall(tmp, args...);
}
args... are the arg-passing registers (RDI, RSI, RDX, RCX, R8) which they simply leave unmodified. R9 is the last arg-passing register for x86-64 System V, but they didn't use any syscalls that take 6 args. setsockopt takes 5 args so they couldn't skip the mov r10, rcx. But they were able to use r9 for something else, instead of needing it to pass the 6th arg.
That's amusing that they're trying so hard to save bytes at the expense of performance, but still use xor rbp,rbp instead of xor ebp,ebp. Unless they build with gcc -Wa,-Os start.S, GAS won't optimize away the REX prefix for you. (Does GCC optimize assembly source file?)
They could save another byte with xchg rax, r9 (2 bytes including REX) instead of mov rax, r9 (REX + opcode + modrm). (Code golf.SE tips for x86 machine code)
I'd also have used xchg eax, r9d because I know Linux system call numbers fit in 32 bits, although it wouldn't save code size because a REX prefix is still needed to encode the r9d register number. Also, in the cases where they only need to add 1, inc r9d is only 3 bytes, vs. add r9d, 1 being 4 bytes (REX + opcode + modrm + imm8). (The no-modrm short-form encoding of inc is only available in 32-bit mode; in 64-bit mode it's repurposed as a REX prefix.)
mov rsi,rsp could also save a byte as push rsp / pop rsi (1 byte each) instead of 3-byte REX + mov. That would make room for returning main's return value with xchg edi, eax before call exit.
But since they're not using libc, they could inline that exit, or put the syscalls below _start so they can just fall into it, because exit happens to be the highest-numbered syscall! Or at least jmp exit since they don't need stack alignment, and jmp rel8 is more compact than call rel32.
Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?
No, that's fully stand-alone incorporating the start.S code (at the ?_017: label), and maybe hand-tweaked compiler output. Perhaps from hand-tweaking disassembly of a linked executable, hence not having nice label names even for the part from the hand-written asm. (Specifically, from Agner Fog's objconv, which uses that format for labels in its NASM-syntax disassembly.)
(Ruslan also pointed out stuff like jnz after cmp, instead of jne which has the more appropriate semantic meaning for humans, so another sign of it being compiler output, not hand-written.)
I don't know how they arranged to get the compiler not to touch r9. It seems just luck. The readme indicates that just compiling the .c and .S works for them, with their GCC version.
As far as the ELF headers, see the comment at the top of the file, which links A Whirlwind Tutorial on Creating Really Teensy ELF Executables for Linux - you'd assemble this with nasm -fbin and the output is a complete ELF binary, ready to run. Not a .o that you need to link + strip, so you get to account for every single byte in the file.
You're pretty much correct about what's going on. Very interesting; I've never seen something like this before. As you said, execution falls through from whichever label was called, and r9 keeps accumulating until it reaches read, whose syscall number is 0. This is why the ordering is clever. Assuming r9 is 0 on entry (the code at the read label zeroes r9 again before making the syscall), read needs no addition because r9 already holds its syscall number. write's syscall number is 1, so it only needs 1 added, which is what the macro call does. open's syscall number is 2, so 1 is added at the open label, another 1 at the write label, and the resulting number is moved into rax at the read label. And so on. Parameter registers like rdi, rsi, and rdx are not touched, so each label basically acts like a normal function.
Also how does the separate httpd.asm custom binary work? Just hand-optimized assembly combining the C source and start assembly?
I'm assuming you're talking about this file. Not sure exactly what's going on here, but it looks like an ELF file is manually being created, probably to reduce size further.
I am currently working on 'Pentester Academy's x86_64 Assembly Language and Shellcoding on Linux' course (www.pentesteracademy.com/course?id=7). I have one simple question that I can't quite figure out: what is the exact difference between running an assembly program that has been assembled and linked with NASM and ld vs. running the same disassembled program in the classic shellcode.c program (written below). Why use one method over the other?
As an example, when following the first method, I use the commands :
nasm -f elf64 -o execve_stack.o execve_stack.asm
ld -o execve_stack execve_stack.o
./execve_stack
When using the second method, I insert the disassembled shellcode in the shellcode.c program:
#include <stdio.h>
#include <string.h>
unsigned char code[] = \
"\x48\x31\xc0\x50\x48\x89\xe2\x48\xbb\x2f\x62\x69\x6e\x2f\x2f\x73\x68\x53\x48\x89\xe7\x50\x57\x48\x89\xe6\xb0\x3b\x0f\x05";
int main(void) {
printf("Shellcode length: %d\n", (int)strlen((const char *)code));
int (*ret)() = (int(*)())code;
ret();
return 0;
}
... and use the commands:
gcc -fno-stack-protector -z execstack -o shellcode shellcode.c
./shellcode
I have analyzed both programs in GDB and found that addresses stored in certain registers differ. I have also read the answer to the following question (C code explanation), which helped me understand the way the shellcode.c program works. Having said that, I still don't fully understand the exact way in which these two methods differ.
There is no theoretical difference between the two methods. In both you end up executing a bunch of assembly instructions on the processor.
The shellcode.c program is there to just demonstrate what would happen if you run the assembly defined as an array of bytes in the unsigned char code[] variable.
Why use one method over the other?
I think you are missing the purpose of shellcode and the reasoning behind the shellcode.c program: it shows what happens when an arbitrary sequence of bytes that you control is executed on the processor.
A shellcode is a small piece of assembly code that is used to exploit a software vulnerability. An attacker usually injects a shellcode into software by taking advantage of common programming errors such as buffer overflows and then tries to make the software execute that injected shellcode.
A good article showing a step-by-step tutorial on how to generate a shell by performing shellcode injection using buffer overflows can be found here.
Here is what a classic shellcode, \x83\xec\x48\x31\xc0\x31\xd2\x50\x68\x2f\x2f\x73\x68\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\xb0\x0b\xcd\x80, looks like in assembler:
sub esp, 72
xor eax, eax
xor edx, edx
push eax
push 0x68732f2f ; "hs//" (/ is doubled because you need to push 4 bytes on the stack)
push 0x6e69622f ; "nib/"
mov ebx, esp ; EBX = address of string "/bin//sh"
push eax
push ebx
mov ecx, esp
mov al, 0xb ; EAX = 11 (which is the ID of the sys_execve Linux system call)
int 0x80
In an x86 environment, this performs an execve system call with the "/bin//sh" string (equivalent to "/bin/sh") as the path parameter.
I was reading some questions and answers on here and kept coming across this suggestion, but I noticed that no one ever explained exactly what you need to do, on Windows, using Intel and the GCC compiler. Commented below is exactly what I am trying to do.
#include <stdio.h>
int main()
{
int x = 1;
int y = 2;
//assembly code begin
/*
push x into stack; < Need Help
x=y; < With This
pop stack into y; < Please
*/
//assembly code end
printf("x=%d,y=%d",x,y);
getchar();
return 0;
}
You can't just push/pop safely from inline asm if it's going to be portable to systems with a red zone. That includes every non-Windows x86-64 platform. (There's no way to tell gcc you want to clobber it.) Well, you could add rsp, -128 first to skip past the red zone before pushing/popping anything, then restore it later. But then you can't use "m" constraints, because the compiler might use RSP-relative addressing with offsets that assume RSP hasn't been modified.
But really this is a ridiculous thing to be doing in inline asm.
Here's how you use inline-asm to swap two C variables:
#include <stdio.h>
int main()
{
int x = 1;
int y = 2;
asm("" // no actual instructions.
: "=r"(y), "=r"(x) // request both outputs in the compiler's choice of register
: "0"(x), "1"(y) // matching constraints: request each input in the same register as the other output
);
// apparently "=m" doesn't compile: you can't use a matching constraint on a memory operand
printf("x=%d,y=%d\n",x,y);
// getchar(); // Set up your terminal not to close after the program exits if you want similar behaviour: don't embed it into your programs
return 0;
}
gcc -O3 output (targeting the x86-64 System V ABI, not Windows) from the Godbolt compiler explorer:
.section .rodata
.LC0:
.string "x=%d,y=%d"
.section .text
main:
sub rsp, 8
mov edi, OFFSET FLAT:.LC0
xor eax, eax
mov edx, 1
mov esi, 2
#APP
# 8 "/tmp/gcc-explorer-compiler116814-16347-5i3lz1/example.cpp" 1
# I used "\n" instead of just "" so we could see exactly where our inline-asm code ended up.
# 0 "" 2
#NO_APP
call printf
xor eax, eax
add rsp, 8
ret
C variables are a high level concept; it doesn't cost anything to decide that the same registers now logically hold different named variables, instead of swapping the register contents without changing the varname->register mapping.
When hand-writing asm, use comments to keep track of the current logical meaning of different registers, or parts of a vector register.
The inline-asm didn't lead to any extra instructions outside the inline-asm block either, so it's perfectly efficient in this case. Still, the compiler can't see through it, and doesn't know that the values are still 1 and 2, so further constant-propagation would be defeated. https://gcc.gnu.org/wiki/DontUseInlineAsm
#include <stdio.h>
int main()
{
int x=1;
int y=2;
printf("x::%d,y::%d\n",x,y);
__asm__( "movl %1, %%eax;"
"movl %%eax, %0;"
:"=r"(y)
:"r"(x)
:"%eax"
);
printf("x::%d,y::%d\n",x,y);
return 0;
}
/* Load x to eax
Load eax to y */
This copies the value of x into y by way of EAX; an exchange can be built the same way. Please note that the "%eax" clobber tells GCC that the register is overwritten, so it will not keep a live value there. For educational purposes this is okay, but I find it more suitable to leave micro-optimizations to the compiler.
You can use extended inline assembly. It is a compiler feature which allows you to write assembly instructions within your C code. A good reference for inline gcc assembly is available here.
The following code swaps the values of x and y using push and pop instructions.
(Compiled and tested using gcc on x86_64.)
This is only safe if compiled with -mno-red-zone, or if you subtract 128 from RSP before pushing anything. It will happen to work without problems in some functions: testing with one set of surrounding code is not sufficient to verify the correctness of something you did with GNU C inline asm.
#include <stdio.h>
int main()
{
int x = 1;
int y = 2;
asm volatile (
"pushq %%rax\n" /* Push x into the stack */
"movq %%rbx, %%rax\n" /* Copy y into x */
"popq %%rbx\n" /* Pop x into y */
: "=b"(y), "=a"(x) /* OUTPUT values */
: "a"(x), "b"(y) /* INPUT values */
: /*No need for the clobber list, since the compiler knows
which registers have been modified */
);
printf("x=%d,y=%d",x,y);
getchar();
return 0;
}
Result: x=2,y=1, as you expected.
The Intel compiler works in a similar way; I think you just have to change the keyword asm to __asm__. You can find info about inline assembly for the Intel compiler here.
Is there a C library for assembling a x86/x64 assembly string to opcodes?
Example code:
/* size_t assemble(char *string, int asm_flavor, char *out, size_t max_size); */
unsigned char bytes[32];
size_t size = assemble("xor eax, eax\n"
"inc eax\n"
"ret",
asm_x64, &bytes, 32);
for(int i = 0; i < size; i++) {
printf("%02x ", bytes[i]);
}
/* Output for 32-bit code: 31 C0 40 C3 (in 64-bit mode, 0x40 is a REX prefix and inc eax assembles to FF C0) */
I have looked at asmpure, however it needs modifications to run on non-windows machines.
I actually need both an assembler and a disassembler; is there a library which provides both?
There is a library that is seemingly a ghost; its existence is widely unknown:
XED (X86 Encoder Decoder)
Intel wrote it: https://software.intel.com/sites/landingpage/pintool/docs/71313/Xed/html/
It can be downloaded with Pin: https://software.intel.com/en-us/articles/pintool-downloads
Sure - you can use llvm. Strictly speaking, it's C++, but there are C interfaces. It will handle both the assembling and disassembling you're trying to do, too.
Here you go:
http://www.gnu.org/software/lightning/manual/lightning.html
GNU Lightning is a C library which is designed to do exactly what you want. It uses a portable assembly language, though, rather than an x86-specific one. The portable assembly is compiled at run time to machine-specific code in a very straightforward manner.
As an added bonus, it is much smaller and simpler to start using than LLVM (which is rather big and cumbersome).
You might want libyasm (the backend YASM uses). You can use the frontends as examples (most particularly, YASM's driver).
I'm using fasm.dll: http://board.flatassembler.net/topic.php?t=6239
Don't forget to write "use32" at the beginning of code if it's not in PE format.
Keystone seems like a great choice now, however it didn't exist when I asked this question.
Write the assembly into its own file, and then call it from your C program using extern. You have to do a little bit of makefile trickery, but otherwise it's not so bad.
Your assembly code has to follow C conventions, so it should look like
global _myfunc
_myfunc: push ebp ; create new stack frame for procedure
mov ebp,esp ;
sub esp,0x40 ; 64 bytes of local stack space
mov ebx,[ebp+8] ; first parameter to function
; some more code
leave ; return to C program's frame
ret ; exit
To get at the contents of C variables, or to declare variables which C can access, you need only declare the names as GLOBAL or EXTERN. (Again, the names require leading underscores on targets such as Win32; ELF targets such as Linux use the C names unchanged.) Thus, a C variable declared as int i can be accessed from assembler as
extern _i
mov eax,[_i]
And to declare your own integer variable which C programs can access as extern int j, you do this (making sure you are assembling in the _DATA segment, if necessary):
global _j
_j dd 0
Your C code should look like
extern void myfunc(int a); /* _myfunc in the assembly file above */
int main(void)
{
int a = 1; /* example argument */
myfunc(a);
return 0;
}
Compile the files, then link them using
gcc mycfile.o myasmfile.o