printf and memcpy linkage to standard C library - c

It is my understanding that if I call printf in a program, by default (if the program isn't statically compiled) it makes a call to printf in the standard C library. However, if I were to call say memcpy, I'd hope the code would be inlined, as a function call is very expensive if memcpy is only copying a few bytes. If you're inlining sometimes and calling out others, the behaviour of your program after a libc upgrade is implementation dependent.
What actually occurs in both of these cases and generally?

First of all, the function is never truly "inlined": that term applies to functions you've written that are visible in the same compilation unit.
"If you're inlining sometimes and calling out others, the behaviour of your program after a libc upgrade is implementation dependent."
This is not the case. The memcpy might be "inlined" at compile time. Once compiled that way, your libc version makes no difference to it.
In GCC, memcpy is recognized as a builtin. That means that if GCC decides to, it will replace the call to memcpy with a suitable inline implementation. On x86, this will usually be a rep movsb or similar instruction, depending on the size of the copy and on whether that size is constant.
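As a minimal sketch of the kind of call this builtin handling targets: the helper names load_u32 and store_u32 below are made up for illustration. Each copies a fixed four bytes, and with optimization enabled GCC and Clang will usually compile such a call to a single load or store rather than a call into libc, though that depends on the compiler, target and flags.

```c
#include <stdint.h>
#include <string.h>

/* Read a 32-bit value from a possibly unaligned buffer.
   The size is a compile-time constant, so an optimizing compiler
   is free to expand memcpy in place instead of calling libc. */
uint32_t load_u32(const unsigned char *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Write a 32-bit value to a possibly unaligned buffer. */
void store_u32(unsigned char *p, uint32_t v)
{
    memcpy(p, &v, sizeof v);
}
```

Compiling this with gcc -O2 -S and searching the output for a call to memcpy is one way to check what your own compiler decided.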

An implementation is allowed by the C standard to behave "as if" the actual standard library function were called. This is indeed a common optimization: small memcpy calls can be unrolled/inlined, and much more.
You're right that in some cases you could upgrade your libc and not see any change in function calls which were optimized out.

It's going to depend on a lot of things, but here's how you can find out. GNU Binutils comes with a utility, objdump, that gives all sorts of details on what's in a binary.
On my system (an ARM Chromebook), compiling test.c:
#include <stdio.h>
int main(void) {
printf("Hello, world!\n");
}
with gcc test.c -o test and then running objdump -R test gives
test: file format elf32-littlearm
DYNAMIC RELOCATION RECORDS
OFFSET TYPE VALUE
000105e4 R_ARM_GLOB_DAT __gmon_start__
000105d4 R_ARM_JUMP_SLOT puts
000105d8 R_ARM_JUMP_SLOT __libc_start_main
000105dc R_ARM_JUMP_SLOT __gmon_start__
000105e0 R_ARM_JUMP_SLOT abort
These are the dynamic relocation entries in the file: all the symbols that will be linked in from libraries external to the binary. Here it seems that the call to printf has been replaced entirely, since it only prints a constant string ending in a newline, and thus puts is sufficient. If we modify this to
printf("Hello world #%d\n", 1);
then we get the expected
000105e0 R_ARM_JUMP_SLOT printf
To get memcpy to be explicitly linked to, we have to prevent gcc from using its own builtin version with -fno-builtin-memcpy.
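Another way to make a reference to the library's memcpy show up, without changing compiler flags, is to call it through a function pointer. The following is only a sketch: libc_memcpy and copy_via_libc are hypothetical names, and the exact relocation type you see in objdump -R will depend on the toolchain and on how the program is linked.

```c
#include <stddef.h>
#include <string.h>

/* Taking memcpy's address forces a reference to the real library symbol.
   Because the pointer is volatile, the compiler has to load it for every
   call and cannot fold the indirect call back into its builtin. */
static void *(*volatile libc_memcpy)(void *, const void *, size_t) = memcpy;

/* Hypothetical helper: copies n bytes through the library function. */
void copy_via_libc(void *dst, const void *src, size_t n)
{
    libc_memcpy(dst, src, n);
}
```

In a dynamically linked build of a program containing this, a relocation against memcpy should appear alongside the entries for puts or printf.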

You can always attempt to drive the compiler behavior. For instance, with gcc:
gcc -fno-inline -fno-builtin-inline -fno-inline-functions -fno-builtin...
You should check the different results with nm, or by looking directly at the calls in the generated assembly.

Related

Compile c program with access to stdlib functions, but without _start and all of the libc init functions using gcc

So essentially I want to compile a c program statically with gcc, and I want it to be able to link c stdlib functions, but I want it to start at main, and not include the _start function as well as the libc init stuff that happens before main. Normally when you want to compile a program without _start, you run gcc with the -nostdlib flag, but I want to also be able to include code from stdlib, just not the libc init. Is there any way to do this?
I know that this could cause a lot of problems, but for my use case I'm not actually running the c program itself so it makes sense to do this.
Thanks in advance
The option -nostdlib tells the linker to not use the startup files (ie. the code that is executed before the main).
-nostdlib
Do not use the standard system startup files or libraries when linking.
No startup files and only the libraries you specify are
passed to the linker, and options specifying linkage of the system
libraries, such as -static-libgcc or -shared-libgcc, are ignored.
The compiler may generate calls to memcmp, memset, memcpy and memmove.
These entries are usually resolved by entries in libc. These
entry points should be supplied through some other mechanism when this
option is specified.
It is frequent to use this option in low-level bare-metal programming in order to control exactly what is going on.
You can still use the functions of your libc by using -lc. However, keep in mind that some of the libc functions depend on the startup code. For example, in some implementations printf requires dynamic memory allocation, and the heap is initialized by the startup code.
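If you decide not to pull in libc at all, the quoted documentation says memcpy and memset have to come from somewhere else. A minimal, unoptimized sketch of two of them follows. The my_ prefix is only there so the example can sit next to a real libc; in an actual -nostdlib build the functions would need the standard names.

```c
#include <stddef.h>

/* Byte-by-byte copy; the regions must not overlap. */
void *my_memcpy(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Fill n bytes with the low 8 bits of c. */
void *my_memset(void *dst, int c, size_t n)
{
    unsigned char *d = dst;
    while (n--)
        *d++ = (unsigned char)c;
    return dst;
}
```

One thing to watch for: at higher optimization levels GCC can recognize loops like these and turn them back into calls to memcpy or memset, which is a problem when the loop is the body of that very function. Building such files with -fno-tree-loop-distribute-patterns is the usual way around that.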

Where does GCC find printf ? My code worked without any #include

I am a C beginner so I tried to hack around the stuff.
I read stdio.h and I found this line:
extern int printf (const char *__restrict __format, ...);
So I wrote this code and i have no idea why it works.
code:
extern int printf (const char *__restrict __format, ...);
main()
{
printf("Hello, World!\n");
}
output:
sh-5.1$ ./a.out
Hello, World!
sh-5.1$
Where did GCC find the function printf? It also works with other compilers.
I am a beginner in C and I find this very strange.
gcc will link your program, by default, with the c library libc which implements printf:
$ ldd ./a.out
linux-vdso.so.1 (0x00007ffd5d7d3000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fdf2d307000)
/lib64/ld-linux-x86-64.so.2 (0x00007fdf2d4f0000)
$ nm -D /lib/x86_64-linux-gnu/libc.so.6 | grep ' printf' | head -1
0000000000056cf0 T printf@@GLIBC_2.2.5
If you build your program with -nolibc you have to satisfy a few symbols on your own (see Compiling without libc):
$ gcc -nolibc ./1.c
/usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/10/../../../x86_64-linux-gnu/Scrt1.o: in function `_start':
(.text+0x12): undefined reference to `__libc_csu_fini'
/usr/bin/ld: (.text+0x19): undefined reference to `__libc_csu_init'
/usr/bin/ld: (.text+0x26): undefined reference to `__libc_start_main'
/usr/bin/ld: /tmp/user/1000/ccCFGFhf.o: in function `main':
1.c:(.text+0xc): undefined reference to `puts'
collect2: error: ld returned 1 exit status
You need to understand the difference between the compile and link phases of program compilation.
In the compilation phase you describe to the compiler the various things you intend to call that may be in this file, in other files or in libraries. This is done using function declarations.
int woodle(char*);
for example. This is what header files are full of.
If the function is in the same file then the compiler will work out how to call it while it compiles that file. But for other functions it leaves a note in the generated code that says
please wire up the woodle function here so I can call it.
This note is usually called an import, and there are tools you can use to look at the imports in an object file; the name depends on the platform and toolset.
The linker's job is to find those imports and resolve them. It will look at object files passed on the command line, at libraries included on the command line, and also at the standard libraries that the C standard says should be available to all programs.
In your printf case the linker found printf in the C standard library, which the linker includes automatically.
BTW, the linker looks for 'exports' from objects and libraries, and there are tools to look at those too. The linker's job is to match each 'import' to an 'export'.
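To make the split concrete, here is a small sketch that includes no header at all and declares two library functions by hand. The helper name magnitude_of is made up. The compiler only sees the declarations, and the definitions of atoi and abs are supplied by the C library at link time. The hand-written declarations have to match the real prototypes exactly.

```c
/* No #include here: these declarations are all the compiler needs.
   The definitions are found later, by the linker, in the C library. */
int atoi(const char *);
int abs(int);

/* Hypothetical helper: parse a decimal string and return its magnitude. */
int magnitude_of(const char *text)
{
    return abs(atoi(text));
}
```

This is what a header does for you; writing the declarations yourself just makes it easier to get them wrong.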
First, realize what the gcc program is. Technically, it is not a compiler, but a compiler driver. A compiler driver is responsible for driving the various other tools which perform compilation-related tasks. Some of the tools are found in PATH, whereas others are in internal compiler directories.
There are various ways to check what the driver is doing. I won't go into much detail about how I made the rest of this post, but briefly:
strace -f -e %process gcc is a Linux-specific way of showing all the programs executed (elsewhere in this answer, I assume Linux when specifying details but it doesn't matter)
gcc -v will dump out various information, but you have to learn what parts actually matter for whatever you are doing.
there exists a "specs" file that controls some of the argument-related stuff the driver does
Now for the actual data:
Here's the tree of processes that gcc might execute:
gcc, the "driver" (input various, output various. Some arguments are handled by the driver itself, but most are passed to the various subprocesses)
(these are repeated for every input file. If -pipe is passed, temporary files are omitted and processes are run in parallel; if --save-temps is passed, intermediate files are preserved):
cc1 -E -lang-asm, the "preprocessor" for assembly code (input .S, output .s - yes, case matters. Only relevant if you're trying to compile separate ASM files that need preprocessing)
cc1 -E, the "preprocessor" for C code (input .c; output .i. Only a separate process if -fno-integrated-cpp is passed, which is rare. Note that the cpp program in PATH is never called, even though it is provided by GCC - rather, it calls this. If -E is passed, the driver stops after this)
cc1, the "compiler" proper (input (usually) .c or (rarely) .i; output .s. If -S is passed, the driver stops after this; if -fsyntax-only is passed, this stage doesn't even complete)
(For other languages, replace cc1 with cc1plus, cc1d, cc1obj, f951, gnat1, etc. Note that the different drivers like g++, gdc, etc. only affect what extra libraries are linked by default)
as, the "assembler" (input .s; output .o. This is looked up in PATH; it is shipped as part of Binutils, not GCC. If -c is passed, the driver stops here)
collect2, the "linker" wrapper (supposedly this has something to do with constructors, and potentially calls ld twice, but in practice I've never seen it. Just think of it as forwarding all its arguments to ld, even if you have constructors normally)
ld, the "linker" proper (input .o or others (assumed to be libraries); output executable or shared library. Like as, this is actually part of Binutils, not GCC, so it is looked up in PATH)
The driver has a lot of logic, so it is important that you use it. Notably, you should never call as or ld yourself, since that will omit arguments that rely on the driver's sense of "exact current platform".
Now, getting to your specific question:
Ignoring irrelevant arguments and simplifying paths, the ld call ends up looking like:
ld -o foo Scrt1.o crti.o crtbeginS.o foo.o -lgcc -lgcc_s -lc -lgcc -lgcc_s crtendS.o crtn.o
The various "crt" loose object files are a mixture of parts of GLIBC and GCC, needed to support the C runtime (note that there are others as well; which are linked depends on arguments). The gcc and gcc_s libraries are needed to run code on the platform at all; they are repeated because they rely on the c library which also relies on them.
Since -lc is passed by default (regardless of language), the printf symbol can be resolved. Notably, -lm, -lrt, -lpthread and others are not passed by default, so other symbols from different parts of the C library will not be resolved unless you pass them manually.
All of this is completely independent of what headers are included.
That your program compiles without a header present means that the compiler settings were lenient. You should still get a warning though. The reason that your program links is that the C standard library, which contains the code of the function printf, is linked automatically. Almost every C program needs it because input and output, or generally interaction with peripherals, which that library handles, are the general means of generating a "side effect", an effect outside the program. The opposite is so uncommon that one must make the wish to not link with it explicit.
So why does your compiler accept a call to a function which has not been declared?
C emerged at a time when programs were much smaller and software development as an engineering discipline didn't formally exist:
Four years later [i.e., in 1978], as a still-junior faculty member, I tried to get my colleagues [...] to create an undergraduate computer-science degree. A senior mechanical engineer of forbidding mien snorted surely not: Harvard had never offered a degree in automotive science, why would we create one in computer science? I waited until I had tenure before trying again (and succeeding) in 1982. -Harry R. Lewis
That was about 10 years after Dennis Ritchie had started to develop this versatile new programming language, the successor to B. The problems involved in creating and maintaining large programs back then were simply not as pressing and not as well-understood as they are, perhaps, today.
Among the many things that help us today, at least in most compiled languages, is strong typing. Every identifier we use is declared with a static type. But the importance and benefits of that were not that obvious in the 1970s, and early C permitted mixing and matching integers and pointers at will. It's all numbers, right? And a function is just a name for a jump address, right? The user will know what to put on the stack, and the function will read it off the stack — I really don't see a problem here ;-). This attitude brought us functions like printf().
After this stage-setting we are slowly getting to the point. Because a function is just a jump address, no function declaration needed to be present in order to call one. The assumed parameters were what you presented, and the presumed return type defaulted to int, which was often correct or at least didn't hurt. And for a long time C kept this backward compatibility. I think the C99 standard forbade the use of undeclared identifiers, and the standard drafts for C11 and C21 both say:
An identifier is a primary expression, provided it has been declared as designating an object (in which case it is an lvalue) or a function (in which case it is a function designator)91
Footnote 91 says "Thus, an undeclared identifier is a violation of the syntax." (All emphasis by me.)
All compilers I tried compile it anyway (with a warning), perhaps because some ancient code that still gets compiled frequently depends on it.

Why am I able to link without including ctype.h

Without #include<ctype.h>, the following program outputs 1 and 0.
With the include, it outputs 1 and 1.
I am using TDM-GCC 4.9.2 64-bit. I wonder what the implementation of isdigit is in the first case, and why it is able to link.
#include<stdio.h>
//#include<ctype.h>
int main()
{
printf("%d %d\n",isdigit(48),isdigit(48.4));
return 0;
}
By default GCC 4.9 uses the C90 standard (with GNU extensions), which allows implicit declarations. An implicitly declared function is treated as int isdigit() with no information about its parameters, so the compiler cannot convert the arguments for you: 48 is passed as an int, but 48.4 is passed as a double. That does not match what the library function expects, which means that when it is called at run-time it receives the wrong argument and you have undefined behavior.
When you include the <ctype.h> header file, there is a correct prototype, and the compiler then knows that isdigit takes an int argument and can convert the double literal 48.4 to the integer 48 for the call.
As for why it's linking: while these functions may be implemented as macros, that's not a requirement. What is a requirement is that those functions, at least in the C11 standard (I don't have any older version available at the moment), have to be aware of the current locale, which makes implementing them as macros much harder than implementing them as normal library functions. And as the standard library is always linked (unless you tell GCC otherwise), the functions will be available.
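For comparison, a short sketch of calling isdigit with the header in place. The function name count_digits is invented for this example. Note that isdigit is only specified to return nonzero for a digit, not necessarily 1, so the result is tested for truth rather than printed or compared with 1.

```c
#include <ctype.h>
#include <stddef.h>

/* Count decimal digits in a string.
   The cast to unsigned char keeps the argument inside the range
   for which isdigit() is defined, even if plain char is signed. */
size_t count_digits(const char *s)
{
    size_t n = 0;
    for (; *s != '\0'; ++s)
        if (isdigit((unsigned char)*s))
            ++n;
    return n;
}
```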
First of all, #include statements don't have anything to do with linking. Remember that anything with a # in front of it in C is meant for the preprocessor, not the compiler or the linker.
That said, the function still has to be linked from somewhere, doesn't it?
Let's go through the steps one at a time.
$ gcc -c -Werror --std=c99 st.c
st.c: In function ‘main’:
st.c:5:22: error: implicit declaration of function ‘isdigit’ [-Werror=implicit-function-declaration]
printf("%d %d\n",isdigit(48),isdigit(48.4));
^
cc1: all warnings being treated as errors
Well, as you can see, gcc's lint (static analyzer) is in action!
Whatever, we will proceed to ignore it...
$ gcc -c --std=c99 st.c
st.c: In function ‘main’:
st.c:5:22: warning: implicit declaration of function ‘isdigit’ [-Wimplicit-function-declaration]
printf("%d %d\n",isdigit(48),isdigit(48.4));
This time it is only a warning. Now we have an object file in the current directory. Let's inspect it...
$ nm st.o
U isdigit
0000000000000000 T main
U printf
As you can see, both printf and isdigit are listed as undefined. So the code has to come from somewhere, doesn't it?
Let's proceed to link it...
$ gcc st.o
$ nm a.out | grep 'printf\|isdigit'
U isdigit@@GLIBC_2.2.5
U printf@@GLIBC_2.2.5
Well, as you can see, the situation has mildly improved: isdigit and printf are no longer the helpless loners they were in st.o. You can see that both functions are provided by GLIBC_2.2.5. But where is that GLIBC?
Well let's examine the final executable a bit more...
$ ldd a.out
linux-vdso.so.1 => (0x00007ffe58d70000)
libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fb66f299000)
/lib64/ld-linux-x86-64.so.2 (0x000055b26631d000)
AHA... there is that libc. So it turns out that, though you have not given any instruction, the linker is linking with 3 libraries by default, one of which is the libc that contains both printf and isdigit.
You can see the default behaviour of the linker with:
$ gcc -dumpspecs
*link:
%{!r:--build-id} %{!static:--eh-frame-hdr} %{!mandroid|tno-android-ld:%{m16|m32|mx32:;:-m elf_x86_64} %{m16|m32:-m elf_i386} %{mx32:-m elf32_x86_64} --hash-style=gnu --as-needed %{shared:-shared} %{!shared: %{!static: %{rdynamic:-export-dynamic} %{m16|m32:-dynamic-linker %{muclibc:/lib/ld-uClibc.so.0;:%{mbionic:/system/bin/linker;:/lib/ld-linux.so.2}}} %{m16|m32|mx32:;:-dynamic-linker %{muclibc:/lib/ld64-uClibc.so.0;:%{mbionic:/system/bin/linker64;:/lib64/ld-linux-x86-64.so.2}}} %{mx32:-dynamic-linker %{muclibc:/lib/ldx32-uClibc.so.0;:%{mbionic:/system/bin/linkerx32;:/libx32/ld-linux-x32.so.2}}}} %{static:-static}};:%{m16|m32|mx32:;:-m elf_x86_64} %{m16|m32:-m elf_i386} %{mx32:-m elf32_x86_64} --hash-style=gnu --as-needed %{shared:-shared} %{!shared: %{!static: %{rdynamic:-export-dynamic} %{m16|m32:-dynamic-linker %{muclibc:/lib/ld-uClibc.so.0;:%{mbionic:/system/bin/linker;:/lib/ld-linux.so.2}}} %{m16|m32|mx32:;:-dynamic-linker %{muclibc:/lib/ld64-uClibc.so.0;:%{mbionic:/system/bin/linker64;:/lib64/ld-linux-x86-64.so.2}}} %{mx32:-dynamic-linker %{muclibc:/lib/ldx32-uClibc.so.0;:%{mbionic:/system/bin/linkerx32;:/libx32/ld-linux-x32.so.2}}}} %{static:-static}} %{shared: -Bsymbolic}}
What are the other two libraries?
Well, remember that when you dug into a.out, both printf and isdigit were still shown as U, which means undefined. In other words, there was no memory address associated with these symbols.
In reality this is where the magic lies. These libraries are actually loaded at runtime, not at link time as on older systems.
How is it implemented? The jargon for it is lazy binding. When the process calls such a function for the first time, the call goes through a small stub in the procedure linkage table (PLT), which hands control to the dynamic loader (the ld-linux-x86-64.so.2 entry in the ldd output). The loader looks up the symbol in the loaded libraries, records its address in the global offset table (GOT), and jumps to it; later calls go straight to the function. The kernel is not involved in this lookup.
There are multiple theoretical reasons why doing things lazily is better than doing them beforehand. I guess that's a whole new topic, which we will discuss at some other time.

Is a main() required for a C program?

Well the title says it all. Is a main() function absolutely essential for a C program?
I am asking this because I was looking at the Linux kernel code, and I didn't see a main() function.
No, the ISO C standard states that a main function is only required for a hosted environment (such as one with an underlying OS).
For a freestanding environment like an embedded system (or an operating system itself), it's implementation defined. From C99 5.1.2:
Two execution environments are defined: freestanding and hosted. In both cases, program startup occurs when a designated C function is called by the execution environment.
In a freestanding environment (in which C program execution may take place without any benefit of an operating system), the name and type of the function called at program startup are implementation-defined.
As to how Linux itself starts, the start point for the Linux kernel is start_kernel though, for a more complete picture of the entire boot process, you should start here.
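A program can ask the compiler which of the two environments it is being built for through the standard macro __STDC_HOSTED__. The function name is_hosted below is just an illustration.

```c
/* __STDC_HOSTED__ is 1 for a hosted implementation and 0 for a
   freestanding one (for example when GCC is run with -ffreestanding). */
int is_hosted(void)
{
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__ == 1
    return 1;
#else
    return 0;
#endif
}
```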
Well, no, but ...
C99 specifies that main() is called in the hosted environment "at program startup", however, you don't have to use the C runtime support. Your operating system executes image files and starts a program at an address provided by the linker.
If you are willing to write your program to conform to the operating system's requirements rather than C99's, you can do it without main(). The more modern (and complex) the system, though, the more trouble you will have with the C library making assumptions that the standard runtime startup is used.
Here is an example for Linux...
$ cat > nomain.S
.text
_start:
call iamnotmain
movl $0xfc, %eax
xorl %ebx, %ebx
int $0x80
.globl _start
$ cat > demo.c
#include <unistd.h>
void iamnotmain(void) {
static char s[] = "hello, world\n";
write(1, s, sizeof s);
}
$ as -o nomain.o nomain.S
$ cc -c demo.c
$ ld -static nomain.o demo.o -lc
$ ./a.out
hello, world
It's arguably not "a C99 program" now, though, just a "Linux program" with an object module written in C.
The main() function is called by an object file included with the libc. Since the kernel doesn't link against the libc it has its own entry point, written in assembler.
Paxdiablo's answer covers two of the cases where you won't encounter a main. Let me add a couple more:
Many plug-in systems for other programs (like, say, browsers or text editors or the like) have no main().
Windows programs written in C have no main(). (They have a WinMain() instead.)
The operating system's loader has to call a single entry point; with the GNU compiler, the entry point is defined in the crt0.o linked object file, whose source is the assembler file crt0.s. It invokes main() after performing various run-time start-up tasks (such as establishing a stack and static initialisation). So when building an executable that links the default crt0.o, you must have a main(); otherwise you will get a linker error, since main() is an unresolved symbol in crt0.o.
It would be possible (if somewhat perverse and unnecessary) to modify crt0.s to call a different entry point. Just make sure that you make such an object file specific to your project rather than modifying the default version, or you will break every build on that machine.
The OS itself has its own C runtime start-up (which will be called from the bootloader) so can call any entry point it wishes. I have not looked at the Linux source, but imagine that it has its own crt0.s that will call whatever the C code entry point is.
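One way to observe start-up work happening before main() is the constructor function attribute, which is a GCC and Clang extension rather than standard C. A minimal sketch, with made-up names:

```c
/* Set by early_init(), which the start-up code runs before main(). */
static int startup_ran;

/* GCC/Clang extension: functions marked constructor are called by the
   C runtime's initialization code before main() is entered. */
__attribute__((constructor))
static void early_init(void)
{
    startup_ran = 1;
}

/* Returns 1 once the constructor has run. */
int startup_has_run(void)
{
    return startup_ran;
}
```

Calling startup_has_run() as the first statement of main() should already return 1 when the normal start-up files are linked. If you replace the start-up code, nothing calls such constructors unless you arrange for it yourself.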
main is called by glibc, which is part of the application (ring 3), not the kernel (ring 0).
A driver has a different entry point; for example, a Windows driver based on WDM starts from DriverEntry.
In machine language things get executed sequentially: what comes first is executed first. So the default is for the compiler to place a call to your main function, to fit the C standard.
Your program works like a library, which is a collection of compiled functions. The main difference between a library and a standard executable is that for the second one the compiler generates assembly code which calls one of the functions in your program.
But you could write assembly code which calls an arbitrary function of your C program (the same way calls to library functions work, actually), and this would work the same way other executables do. The catch is that you cannot do it in plain standard C; you have to resort to assembly or even some other compiler-specific tricks.
This was intended as a general and superficial explanation, there are some technical differences I avoided on purpose as they don't seem relevant.

Compiling without libc

I want to compile my C-code without the (g)libc. How can I deactivate it and which functions depend on it?
I tried -nostdlib but it doesn't help: The code is compilable and runs, but I can still find the name of the libc in the hexdump of my executable.
If you compile your code with -nostdlib, you won't be able to call any C library functions (of course), but you also don't get the regular C bootstrap code. In particular, the real entry point of a program on Linux is not main(), but rather a function called _start(). The standard libraries normally provide a version of this that runs some initialization code, then calls main().
Try compiling this with gcc -nostdlib -m32:
// Tell the compiler incoming stack alignment is not RSP%16==8 or ESP%16==12
__attribute__((force_align_arg_pointer))
void _start() {
/* main body of program: call main(), etc */
/* exit system call */
asm("movl $1,%eax;"
"xorl %ebx,%ebx;"
"int $0x80"
);
__builtin_unreachable(); // tell the compiler to make sure side effects are done before the asm statement
}
The _start() function should always end with a call to exit (or other non-returning system call such as exec). The above example invokes the system call directly with inline assembly since the usual exit() is not available.
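The same idea carries over to other system calls. As a rough sketch, here is a write wrapper that issues the system call directly on x86-64 Linux, using the 64-bit syscall instruction instead of int $0x80. The names raw_syscall3 and raw_write are invented for this example, and the #else branch is only a stand-in so the file still builds on other platforms; it does go through libc.

```c
#include <stddef.h>

#if defined(__linux__) && defined(__x86_64__)
/* x86-64 Linux: system call number in rax, arguments in rdi, rsi, rdx.
   The syscall instruction clobbers rcx and r11. */
static long raw_syscall3(long nr, long a, long b, long c)
{
    long ret;
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a), "S"(b), "d"(c)
                      : "rcx", "r11", "memory");
    return ret;
}

/* write is system call number 1 on x86-64 Linux.
   On failure the raw call returns a negative errno value. */
long raw_write(int fd, const void *buf, size_t len)
{
    return raw_syscall3(1, fd, (long)buf, (long)len);
}
#else
/* Stand-in for other platforms in this sketch: go through libc. */
#include <unistd.h>

long raw_write(int fd, const void *buf, size_t len)
{
    return (long)write(fd, buf, len);
}
#endif
```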
The simplest way is to compile the C code to object files (gcc -c to get some *.o files) and then link them directly with the linker (ld). You will have to link your object files with a few extra object files such as /usr/lib/crt1.o in order to get a working executable (between the entry point, as seen by the kernel, and the main() function, there is a bit of work to do). To know what to link with, try linking with the glibc, using gcc -v: this should show you what normally comes into the executable.
You will find that gcc generates code which may have some dependencies on a few hidden functions. Most of them are in libgcc.a. There may also be hidden calls to memcpy(), memmove(), memset() and memcmp(), which are in the libc, so you may have to provide your own versions (which is not hard, at least as long as you are not too picky about performance).
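To give an idea of how little is needed, here is a simple, performance-agnostic sketch of memmove and memcmp. They carry a my_ prefix only so that the example can be compiled alongside a normal libc; a libc-free build would use the standard names.

```c
#include <stddef.h>

/* Copy n bytes, handling overlapping regions by choosing the direction. */
void *my_memmove(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    if (d < s) {
        /* Destination is lower: copy forwards. */
        while (n--)
            *d++ = *s++;
    } else if (d > s) {
        /* Destination is higher: copy backwards from the last byte. */
        while (n--)
            d[n] = s[n];
    }
    return dst;
}

/* Compare n bytes as unsigned char; the result is <0, 0 or >0. */
int my_memcmp(const void *a, const void *b, size_t n)
{
    const unsigned char *p = a;
    const unsigned char *q = b;
    for (; n != 0; --n, ++p, ++q) {
        if (*p != *q)
            return *p < *q ? -1 : 1;
    }
    return 0;
}
```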
Things might get clearer at times if you look at the produced assembly (use the -S flag).

Resources