Getting printf in assembly with only system calls? - c

I am looking to understand the printf() statement at the assembly level. However most of the assembly programs do something like call an external print function whose dependency is met by some other object file that the linker adds on. I would like to know what is inside that print function in terms of system calls and very basic assembly code. I want a piece of assembly code where the only external calls are the system calls, for printf. I'm thinking of something like a de assembled object file. Where can I get something like that??

I would suggest instead to stay first at the C level, and study the source code of some existing C standard library free software implementation on Linux. Look into the source code of musl-libc or of GNU libc (a.k.a. glibc). You'll understand that several intermediate (usually internal) functions are useful between printf and the basic system calls (listed in syscalls(2) ...). Use also strace(1) on a sample C program doing printf (e.g. the usual hello-world example).
In particular, musl-libc has a very readable stdio/printf.c implementation, but you'll need to follow several other C functions there before reaching the write(2) syscall. Notice that some buffering is involved. See also setvbuf(3) & fflush(3). Several answers (e.g. this and that one) explain the chain between functions like printf and system calls (up to kernel code).
I want a piece of assembly code where the only external calls are the system calls, for printf
If you want exactly that, you might start from musl-libc's stdio/printf.c, add any additional source file from musl-libc till you have no more external undefined symbols, and compile all of them with gcc -flto -O2 and perhaps also -S, you probably will finish with a significant part of musl-libc in object (or assembly) form (because printf may call malloc and many other functions!)... I'm not sure it is worth the pain.
You could also statically link your libc (e.g. libc.a). Then the linker will link only the static library members needed by printf (and any other function you are calling).
To be picky, system calls are not actually external calls (your libc write function is actually a tiny wrapper around the raw system call). You could make them using SYSENTER machine instructions (but using vdso(7) is preferable: more portable, and perhaps quicker), and you don't even need a valid stack pointer (on x86_64) to make a system call.
You can write Linux user-level programs without even using the libc; the bones implementation of Scheme is such a program (and you'll find others).

The function printf() is in the standard C library, so it is linked into your program and not copied into it. Dynamically linked libraries save memory because you don't have the exact same code copied in resident memory for every program that uses it.
Think about what printf() does. Interpreting the formatted string and generating the correct output is fairly complex. The series of functions that printf() belongs to also buffers the output. You probably don't really want to re-implement all of this in assembly. The standard C library is omnipresent, and probably available for you.
Maybe you're looking for write(2), which is the system call for unbuffered writes of just bytes to a file descriptor. You'd have to generate the string to print beforehand and format it yourself. (See also open(2) for opening files.)
To disassemble a binary, you can use objdump:
objdump -d binary
where binary is some compiled binary. This gives opcodes and human readable instructions. You probably want to redirect to a file and read elsewhere.
You can disassemble the standard C binary on your system and try to interpret it if you want (strongly not recommended). The problem is that it will be far too complex to understand. Things like printf() were written in C, then compiled and assembled. You can't (within a reasonable number of decades) restore the high level structure from the assembly of a compiled (non-trivial) program. If you really want to try this, good luck.
An easier thing to do is to look at the C source code for printf() itself. The real work is actually done in vfprintf() which is in stdio-common/vfprintf.c of the GNU C library source code.

Related

How do I get a full assembly code from C file?

I'm currently trying to figure out the way to produce equivalent assembly code from corresponding C source file.
I've been using the C language for several years, but have little experience with assembly language.
I was able to output the assembly code using the -S option in gcc. However, the resulting assembly code contained call instructions which in turn make a jump to another function like _exp. This is not what I wanted, I needed a fully functional assembly code in a single file, with no dependency to other code.
Is it possible to achieve what I'm looking for?
To better describe the problem, I'm showing you my code here:
#include <math.h>
float sigmoid(float i){
return 1/(1+exp(-i));
}
The platform I am working on is Windows 10 64-bit, the compiler I'm using is cl.exe from MSbuild.
My initial objective was to see, at a lowest level possible, how computers calculate mathematical functions. The level where I decided to observe the calculation process is assembly code, and the mathematical function I've chosen was sigmoid defined as above.
_exp is the standard math library function double exp(double); apparently you're on a platform that prepends a leading underscore to C symbol names.
Given a .s that calls some library functions, build it the same way you would a .c file that calls library functions:
gcc foo.S -o foo -lm
You'll get a dynamic executable by default.
But if you really want all the code in one file with no external dependencies, you can link your .c into a static executable and disassemble that.
gcc -O3 -march=native foo.c -o foo -static -lm
objdump -drwC -Mintel foo > foo.s
There's no guarantee that the _exp implementation in libm.a (static library) is identical to the one you'd get in libm.so or libm.dll or whatever, because it's a different file. This is especially true for a function like memcpy where dynamic-linker tricks are often used to select an optimal version (for your CPU) at run-time.
It is not possible in general, there are exceptions sure, I could craft one so that means other folks can too, but it isnt an interesting program.
Normally your C program, your main() entry point is only a percentage of the code. There is a bootstrap that contains the actual entry point for the operating system to launch your program, this does some things that prepare your virtual memory space so that your program can run. Zeros .bss and other such things. that is often and or should be written in assembly language (otherwise you get a chicken and egg problem) but not an assembly language file you will see unless you go find the sources for the C library, you will often get an object as part of the toolchain along with other compiler libraries, etc.
Then if you make any C calls or create code that results in a compiler library call (perform a divide on a platform that doesnt support divide, perform floating point on a platform that doesnt have floating point, etc) that is another object that came from some other C or assembly that is part of the library or compiler sources and is not something you will see during the compile/assemble/link (the chain in toolchain) process.
So except for specifically crafted trivial programs or specifically crafted tools for this purpose (for specific likely baremetal platforms), you will not see your whole program turn into one big assembly source file before it gets assembled then linked.
If not baremetal then there is of course the operating system layer which you certainly would not get to see as part of your source code, ultimately the C library calls that need the system will have a place where they do that, all compiled to object/lib before you use them, and the assembly sources for the operating system side is part of some other source and build process somewhere else.

Are system calls directly send to the kernel?

I have a couple of assumptions, most likely some of them will be incorrect. Please correct me where they are wrong.
We could categorize the functions in a program written in C as follows:
Functions that are sent to dynamically loaded libraries:
These are sent to the library that translates them in to multiple standard C-functions
The library passes them on to libc where they are translated into multiple system calls.
Libc passes those on to the kernel where they are executed and the returns are sent back to libc.
Libc will collect the returs, group them by c-function and use them to create 1 return for each c-function. These returns will be send back to the dynamically loaded library.
This library will collect all returns and use them to create 1 return that is send back to the original program.
Functions that are either defined in the code or part of statically compiled libraries: Everything is the same as the category above but:
They program already does the translation into standard C functions where they are known or into functions calling dynamically loaded libraries in the other case.
The standard c functions are send to libc, the others to the dynamically loaded libraries (where they will be handled as above).
The creation of 1 final return based on the returns from both types of functions will be done by the program
Functions that are standard C functions: They will just be sent to libc which will communicate with the kernel in the same way as above and 1 return will be sent to the program
Functions that are system calls: They are NOT sent directly to the kernel but have to pass to libc although it doesn't do any extra work.
Security checks (permissions, writing to unallocated mem, ...) are always done by the kernel, although libc and libraries above might also check it first.
All POSIX-compliant systems follow these rules
It might not be the same on Linux and on some other POSIX system (like FreeBSD).
On Linux, the ABI defines how a system call is done. Read about Linux kernel interfaces. The system calls are listed in syscalls(2) (but see also /usr/include/asm*/unistd.h ...). Read also vdso(7). The assembler HowTo explains more details, but for 32 bits i686 only.
Most Linux libc are free software, you can study their source code. IMHO the source code of musl-libc is very readable.
To simplify a tiny bit, most system calls (e.g. write(2)) are small C functions in the libc which:
call the kernel using SYSENTER machine instruction (and take care of passing the system call number and its arguments with the kernel convention, which is not the usual C ABI). What the kernel considers as a system call is only that machine instruction (and conventions about it).
handle the failure case, by passing it to errno(3) and returning -1.
(IIRC, on failure, the carry -or perhaps the overflow- flag bit is set when the kernel returns from SYSENTER; but I could be wrong in the details)
handle the success case, by returning a result.
You could invoke system calls without libc, with some assembler code. This is unusual, but has been done (e.g. in BusyBox or in Bones).
So the libc code for write is doing some tiny extra work (passing arguments, handling failure & errno and success cases).
Some few system calls (probably getpid & clock_gettime) avoid the overhead of the SYSENTER machine instruction (and user-mode -> kernel-mode switch) thanks to vDSO.
No you can't categorize things like that. When you program in C (but that makes no difference in almost all other languages), there is only functions and whatever is the real status of these, you call them exactly the same way. This is defined by ABI (how to pass parameters, get returned values, etc) and enforced by the compiler/linker. Of course some functions are just stubs. For example stubs to shared libraries functions (stubs may be need to load the library, dynamic link to the real function, etc) or system calls (this is more technical and differs from kernel to kernel). But from the viewpoint of your program everything is the same (this is why it is hard to understand difference between fread and read at the beginning: you call them the same way, they make almost the same job, what's the difference?).
POSIX doesn't say a single word about kernels... It just lists the C (and formerly ADA) API of a set of functions with minimal semantic (plus some command, tools, etc). Implementation of these is totally free.

Dynamically change the running code by writing into the __FILE__?

I got to know of a way to print the source code of a running code in C using the __FILE__ macro. As such I can seek the location and use putchar() to alter the contents of the file.
Is it possible to dynamically change the running code using this method?
Is it possible to dynamically change the running code using this method ?
No, because once a program is compiled it no longer depends on the source file.
If you want learn how to alter the behavior of an process that is already running from within the process itself, you need to learn about assembly for the architecture you're using, the executable file format on your system, and the process API on your system, at the very least.
As most other answers are explaining, in practical terms, most C implementations are compilers. So the executable that is running has only an indirect (and delayed) relation with the source code, because the source code had to be processed by the compiler to produce that executable.
Remember that a programming language is (not a software but...) a specification, written in some report. Read n1570, draft specification of C11. Most implementations of C are command-line compilers (e.g. GCC & Clang/LLVM in the free software realm), even if you might find interpreters.
However, with some operating systems (notably POSIX ones, such as MacOSX and Linux), you could dynamically load some plugin. Or you could create, in some other way (such as JIT compilation libraries like libgccjit or LLVM or libjit or GNU lightning), a fresh function and dynamically get a pointer to it (and that is not stricto sensu conforming to the C standard, where a function pointer should point to some existing function of your program).
On Linux, you might generate (at runtime of your own program, linked with -rdynamic to have its names usable from plugins, and with -ldl library to get the dynamic loader) some C code in some temporary source file e.g. /tmp/gencode.c, run a compilation (using e.g. system(3) or popen(3)) of that emitted code as a /tmp/gencode.so plugin thru a command like e.g. gcc -O1 -g -Wall -fPIC -shared /tmp/gencode.c -o /tmp/gencode.so, then dynamically load that plugin using dlopen(3), find function pointers (from some conventional name) in that loaded plugin with dlsym(3), and call indirectly that function pointer. My manydl.c program shows that is possible for many hundred thousands of generated C files and loaded plugins. I'm using similar tricks in my GCC MELT. See also this and that. Notice that you don't really "self-modify" C code, you more broadly generate additional C code, compile it (as some plugin, etc...), and then load it -as an extension or plugin- then use it.
(for pragmatical reasons including ease of debugging, I don't recommend overwriting some existing C file, but just emitting new C code in some fresh temporary .c file -from some internal AST-like representation- that you would later feed to the compiler)
Is it possible to dynamically change the running code?
In general (at least on Linux and most POSIX systems), the machine code sits in a read-only code segment of the virtual address space so you cannot change or overwrite it; but you can use indirection thru function pointers (in your C code) to call newly loaded code (e.g. from dlopen-ed plugins).
However, you might also read about homoiconic languages, metaprogramming, multi-staged programming, and try to use Common Lisp (e.g. using its SBCL implementation, which compile to machine code at every REPL interaction and at every eval). I also recommend reading SICP (an excellent and freely available introduction to programming, with some chapters related to metaprogramming approaches)
PS. Dynamic loading of plugins is also possible in Windows -which I don't know- with LoadLibrary, but with a very different (and incompatible) model. Read Levine's linkers and loaders.
A computer doesn't understand the code as we do. It compiles or interprets it and loads into memory. Our modification of code is just changing the file. One needs to compile it and link it with other libraries and load it into memory.
ptrace() is a syscall used to inject code into a running program. You can probably look into that and achieve whatever you are trying to do.
Inject hello world in a running program. I have tried and tested this sometime before.

How to print something without using std lib functions?

In the C language, when printing something on the screen, we usually use printf, puts and so on. Which are all defined in the or other header documents.
Is there any way to print something on screen without using such functions? That is to say, how is printf realised?
Eventually the C function printf will result in a sys_write system call, directly or by going through write (see man 2 write). The actual implementation depends on the compiler and the standard libraries.
Printing to screen requires access to framebuffer (hardware) and userspace programs are not allowed to have a direct access to it. So what they do is make a system call and kernel performs the required function for them. printf -> write system call -> kernel writes the data into framebuffer and then control is given back to user program.
Even if you don't want to use printf or puts (they are implemented in hosted libc) still you have to use write system call to tell the kernel on which device you want to write the buffer.
The standard headers are not, necessarily, a library containing functions written in C code.
They are functions with C "interfase", however it's very probably that they contain explicit machine code, adapted, in each case, to the target system.
The standard headers provide, in this way, ways of doing special process that it would not be possible to achieve in strict C code.
In the specific case of printf(), the situation is even more clear, because if none header is #include-d, then there is not any mechanism through the use of the C syntax only that performs an Input/Output operation.
library ncurses can help you, but if you want to use a low level function use write() and if you want to do kernel programming you have to use printk()

Some general C questions

I am trying to fully understand the process pro writing code in some language to execution by OS. In my case, the language would be C and the OS would be Windows. So far, I read many different articles, but I am not sure, whether I understand the process right, and I would like to ask you if you know some good articles on some subjects I couldnĀ“t find.
So, what I think I know about C (and basically other languages):
C compiler itself handles only data types, basic math operations, pointers operations, and work with functions. By work with functions I mean how to pass argument to it, and how to get output from function. During compilation, function call is replaced by passing arguments to stack, and than if function is not inline, its call is replaced by some symbol for linker. Linker than find the function definition, and replace the symbol to jump adress to that function (and of course than jump back to program).
If the above is generally true and I get it right, where to final .exe file actually linker saves the functions? After the main() function? And what creates the .exe header? Compiler or Linker?
Now, additional capabilities of C, today known as C standart library is set of functions and the declarations of them, that other programmers wrote to extend and simplify use of C language. But these functions like printf() were (or could be?) written in different language, or assembler. And there comes my next question, can be, for example printf() function be written in pure C without use of assembler?
I know this is quite big question, but I just mostly want to know, wheather I am right or not. And trust me, I read a lots of articles on the web, and I would not ask you, If I could find these infromation together on one place, in one article. Insted I must piece by piece gather informations, so I am not sure if I am right. Thanks.
I think that you're exposed to some information that is less relevant as a beginning C programmer and that might be confusing you - part of the goal of using a higher level language like this is to not have to initially think about how this process works. Over time, however, it is important to understand the process. I think you generally have the right understanding of it.
The C compiler merely takes C code and generates object files that contain machine language. Most of the object file is taken by the content of the functions. A simple function call in C, for example, would be represented in the compiled form as low level operators to push things into the stack, change the instruction pointer, etc.
The C library and any other libraries you would use are already available in this compiled form.
The linker is the thing that combines all the relevant object files, resolves all the dependencies (e.g., one object file calling a function in the standard library), and then creates the executable.
As for the language libraries are written in: Think of every function as a black box. As long as the black box has a standard interface (the C calling convention; that is, it takes arguments in a certain way, returns values in a certain way, etc.), how it is written internally doesn't matter. Most typically, the functions would be written in C or directly in assembly. By the time they make it into an object file (or as a compiled library), it doesn't really matter how they were initially created, what matters is that they are now in the compiled machine form.
The format of an executable depends on the operating system, but much of the body of the executable in windows is very similar to that of the object files. Imagine as if someone merged together all the object files and then added some glue. The glue does loading related stuff and then invokes the main(). When I was a kid, for example, people got a kick out of "changing the glue" to add another function before the main() that would display a splash screen with their name.
One thing to note, though is that regardless of the language you use, eventually you have to make use of operating system services. For example, to display stuff on the screen, to manage processes, etc. Most operating systems have an API that is also callable in a similar way, but its contents are not included in your EXE. For example, when you run your browser, it is an executable, but at some point there is a call to the Windows API to create a window or to load a font. If this was part of your EXE, your EXE would be huge. So even in your executable, there are "missing references". Usually, these are addressed at load time or run time, depending on the operating system.
I am a new user and this system does not allow me to post more than one link. To get around that restriction, I have posted some idea at my blog http://zhinkaas.blogspot.com/2010/04/how-does-c-program-work.html. It took me some time to get all links, but in totality, those should get you started.
The compiler is responsible for translating all your functions written in C into assembly, which it saves in the object file (DLL or EXE, for example). So, if you write a .c file that has a main function and a few other function, the compiler will translate all of those into assembly and save them together in the EXE file. Then, when you run the file, the loader (which is part of the OS) knows to start running the main function first. Otherwise, the main function is just like any other function for the compiler.
The linker is responsible for resolving any references between functions and variables in one object file with the references in other files. For example, if you call printf(), since you do not define the function printf() yourself, the linker is responsible for making sure that the call to printf() goes to the right system library where printf() is defined. This is done at compile-time.
printf() is indeed be written in pure C. What it does is call a system call in the OS which knows how to actually send characters to the standard output (like a window terminal). When you call printf() in your program, at compile time, the linker is responsible for linking your call to the printf() function in the standard C libraries. When the function is passed at run-time, printf() formats the arguments properly and then calls the appropriate OS system call to actually display the characters.

Resources