Step by step C compilation using GCC? - c

I am trying to perform the four steps that transform C source into an executable with GCC. The first three steps work as expected, but the last one gives me problems. I have two files, writeByte.h and writeByte.c, which contain the following:
// writeByte.h
// USED GCC COMMANDS BY ORDER:
// 1 - "gcc writeByte.c -o pre-processed.i -E"
// 2 - "gcc pre-processed.i -o assembled.s -S"
// 3 - "gcc assembled.s -o compiled.o -c"
// 4 - ???
void writeByte(char* addr, char val);
and
// writeByte.c
#include "writeByte.h"
void writeByte(char* addr, char val) { *addr = val; }
Supposedly, to link the file I have to execute gcc compiled.o -o executable, but it reports an undefined reference to main at .text+0x20, so I don't know how to proceed.

TL;DR: Define a main() function, and your program will work.
From the C specification C11 (ISO/IEC 9899:201x / N1548)
5.1.2 Execution environments
Two execution environments are defined: freestanding and hosted.
[…]
5.1.2.1 Freestanding environment
In a freestanding environment (in which C program execution may take place without any benefit of an operating system), the name and type of the function called at program startup are implementation-defined.
[…]
5.1.2.2 Hosted environment
[…]
5.1.2.2.1 Program startup
The function called at program startup is named main. […]
Furthermore:
J2 Undefined behavior
The behavior is undefined in the following circumstances:
[…]
A program in a hosted environment does not define a function named main using one of the specified forms (5.1.2.2.1).
Hosted Environment
This is most likely your case.
This applies to typical operating systems, such as Linux, Unix, Mac OS X, Windows, Amiga OS, and many more. In that case, your environment typically is the hosted environment.
Given that you run your C compiler and linker without any options that select or influence the environment, and that the target is such a typical operating system, the compiler and linker will assume a hosted environment when you finally link. As described above, this means that you need to provide a main() function so that the hosted environment knows where to start your C program. Because you did not provide one, the case listed under J.2 Undefined behavior applies; in practice the startup code that gets linked in refers to main, so the linker reports an undefined reference and refuses to complete its job.
Note: Under the hood, these operating systems actually provide their own custom interface, see below.
Solution: Provide a main() function.
Freestanding Environment
Freestanding environments typically occur when developing firmware or operating system yourself. In that case, the entry point is defined by the CPU. Most CPUs will either start execution at a pre-defined address, or at a configurable address read from a vector table specified at a pre-defined address.
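As a rough illustration of the vector table case, here is a minimal sketch in C. The table layout (reset handler in slot 0) and the names vector_table, reset_handler and firmware_main are assumptions made up for this example; a real CPU defines its own layout, and the table would have to be placed at the address the CPU expects, for instance through a linker script.

```c
/* Set by the made-up firmware entry so the effect is observable. */
static volatile int firmware_started = 0;

/* Hypothetical firmware entry: nothing here needs to be called "main". */
static void firmware_main(void)
{
    firmware_started = 1;   /* real firmware would enter its main loop here */
}

/* Reset handler: the code the CPU jumps to after reset.
 * On real hardware it would set up memory first and never return. */
static void reset_handler(void)
{
    firmware_main();
}

typedef void (*handler_t)(void);

/* Assumed layout: slot 0 holds the reset handler. A real toolchain would
 * pin this array to a fixed address; that part is omitted here. */
static const handler_t vector_table[] = {
    reset_handler,
};
```

The point is only that the CPU, not the C language, decides which function runs first; the name of that function is up to the implementation.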
Custom Environments
Besides that, there are custom environments. The two most common custom environments are:
The typical OS
In order to be able to do more than specified by the C specification, operating systems define their own environment. This environment is typically an extension of the hosted environment, and will use the entry point specified by the linker. Typically, that entry point is actually another function, often called _start, which is provided by the startup code of the default C library, such as glibc. The OS transfers control to _start, and _start then actually calls main.
So, instead of providing main, it would be possible to provide _start, or an equivalent, instead.
You could write your own _start function or entry point.
However, that risks that your program would unnecessarily be less portable, and that you have to deal with operating system issues that a hosted environment is hiding from you.
It is therefore not recommended for "normal" programs by "normal" developers.
DLL environments
When programs are supposed to run as plugins for other programs, those other programs define a custom environment.
Typically, that custom environment is realized as DLL (Dynamic Link Library).
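A minimal sketch of what such a plugin might look like in C follows. The struct plugin_info contract and the names plugin_entry and my_init are hypothetical; every host program defines its own interface, so treat this only as an illustration of a C module with no main().

```c
/* Hypothetical contract published by the host program. */
struct plugin_info {
    const char *name;
    int (*init)(void);      /* called by the host after loading; 0 = success */
};

/* Plugin-specific setup would go here. */
static int my_init(void)
{
    return 0;
}

static const struct plugin_info info = { "example-plugin", my_init };

/* The one symbol the host looks up (for example with dlsym or
 * GetProcAddress). The plugin has no main(); this is its entry point. */
const struct plugin_info *plugin_entry(void)
{
    return &info;
}
```

The host program owns main() and decides when plugin_entry is called, which is why the plugin itself links fine as a shared library without one.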
Embedded environments
For several embedded systems, the toolchains (compiler etc.) come with libraries which provide a custom environment for that system. These environments have features which rank somewhere between a freestanding environment and a hosted environment, and the entry points depend on the corresponding toolchain. To avoid confusion that stems from unexpected entry point names, toolchains usually use main, start, _start, Start or _Start as names for the entry point.

Related

Which mechanism knows the entry point of a program is main()

How does an application program know its entry point is the main() function?
I know an application doesn't know its entry point is main() -- it is directed to main() function by means of the language specification whatever it is.
At that point, where is the specification actually declared? For example in C, entry point shall be main() function. Who provides this mechanism to the program? An operating system or compiler?
I came to the question after disassembling a canonical simple "Hello World" example in Visual Studio.
In this code there are only a few lines and a function main().
But after disassembling it, there are lots of definitions and macros in the memory space, and main() is not the only declaration and definition.
A screenshot of part of the disassembly was attached. I also know there is a strict rule in the language definition that exactly one main() function must be defined.
To summarize my question: I wonder which mechanism directs or sets main() function as an entry point of an application program.
The application does not know that main() is the entry point. Firstly, we assume C not C++ here despite your picture.
For C, the "C" entry point is main(). But you can't just start execution there, because C makes assumptions (more than that, rules): for example, .data needs to be initialized and .bss zeroed.
unsigned x = 1;
unsigned int y;
We expect that when main() is hit, x = 1. Most folks also assume that y = 0 at that time, and it is in fact specified: objects with static storage duration and no initializer start out as zero. I wouldn't lean on that assumption everywhere, but anyway.
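To make that expectation concrete, here is a small sketch. The helper name startup_state_ok is made up for this example; it simply reports whether the two globals hold the values that the startup code (or the loader) is supposed to have arranged before any ordinary C function runs.

```c
unsigned x = 1;      /* has an initializer: typically placed in .data */
unsigned int y;      /* no initializer: typically placed in .bss      */

/* Returns 1 if the globals hold the values expected at program start.
 * Only meaningful if nothing has modified x or y yet. */
int startup_state_ok(void)
{
    return x == 1 && y == 0;
}
```

If the bootstrap skipped the .data copy or the .bss zeroing, this check is exactly the kind of thing that would fail on a bare-metal target.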
We also need a stack pointer and need to deal with argc/argv. For C++, other things have to be done as well, and sometimes even for C, depending on the implementation.
The APPLICATION does not generally know any of this. You are likely working with a C library, and that library is (or should be) responsible for the bootstrap code that precedes main() as well as a linker script for the linker, since bootstrap and linker script are intimately related. One could argue, based on some implementations, that the C library is separable from the toolchain: with GNU you can choose from different C libraries, and those have different bootstraps and linker scripts. But I am sure there are many that are intimately related. There is also a relationship between the library and the operating system, as so many C library calls end up in one or more system calls.
When you design an operating system that supports runtime-loadable applications, part of the design is a file format that the operating system's loader supports, plus the features the loader wants to support and how they overlap with the file format. It is not uncommon for the OS to define the file format, but with ELF and others (no doubt not created accidentally or independently) a new OS has the opportunity to use an existing container like ELF. The OS design and its loader determine a lot of things, and the C library that mates up with all of that has to follow all of those rules; if the library is integrated into the compiler, then the compiler has to play along as well.
It is not the application that knows; it is part of the system design, and the application is simply a slave to all of that. When you compile on that platform for that platform, all of these rules and relationships are in play. You are putting in a very small part of the puzzle; the rest is already in place: what file formats are supported, what information is required per format, and what rules the compiler/library solution must follow. The system design dictates whether .data and .bss are set up by the loader or by the application, and by "application" I mean the bootstrap, not the user's portion of the program. You can't bootstrap C in C, because that C would need a bootstrap, and if that bootstrap were in C it would need a bootstrap too, and so on.
int main ( void )
{
return 0;
}
There are a lot of things going on in the background when you compile that program, not just the few instructions that might be needed to implement that code.
Compile that program on Windows, Linux and Mac, on different versions of each, with different compilers or C libraries for each, and different versions of those, and so on. What you should expect to see, even for the same target ISA or even the same computer, is that some percentage of the combinations MIGHT choose the same few instructions for the function, while what is wrapped around it may be similar but not the same. There would be no reason to be surprised if some of the implementations are very different from each other.
And this is all for full-blown operating systems that load programs into RAM and run them; for embedded things, don't be surprised if the differences are even bigger. Within a full-blown OS you would expect to see an MMU, and the application gets a perhaps zero-based address space for .text, .data and .bss at a minimum, so all the solutions might have a favorite place or favorite number of sections in the same order in the binary, but the size of each may be specific to the implementation. The order and size might vary by C library version, compiler version, etc.
The magic is in the system design, and that is not magic, that is design. main() cannot be entered directly while still having various parts of the language work, like .data and .bss init. The stack pointer can be set up before the entry, but how and where .data and .bss are located is application specific, so it can't be handled by a simple branch to main from the OS.
The linker for your toolchain can be told in various ways where the entry point is: it could be assumed or dictated for that tool/target, or given by a command line option, a linker script option, some special symbol you put on a label, or whatever the designers choose. main is assumed to be the C entry point, although that doesn't actually mean it is; there might be some C code that precedes it, but in general there is some amount of asm (you can't bootstrap C with C) and then one or more steps to main().

Possible drawbacks of overriding the entry point of a main program

So I was trying to set my own custom name for main in my C program, and I found this answer.
You can specify an entry point to your program using the -e flag to ld.
That means you can override the entry point if you like, but you may not want to do that for a C program you intend to run normally on your machine, since _start might do all kinds of OS-specific stuff that's required before your program runs.
What would be the (possible) drawbacks of not calling _start from crt0.o and writing my own that simply does whatever I want it to?
The entry point usually does stuff like
Prepare arguments, call main and handle its exit
Call global constructors before main and destructors after
Populate global variables like environ and the like
Initialize the C runtime, e.g. timezone, stdio streams and such
Maybe configure x87 to use 80-bit floating point
Inflate and zero .bss if your loader doesn't
Whatever else is necessary for hosted C programs to run on your platform
These things are tightly coupled to your C implementation, so usually you provide your own _start only when you are targeting a freestanding environment.
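The C-level part of those steps can be sketched roughly as follows. Everything here is hypothetical: struct startup_image, sketch_start and the demo_* helpers are names invented for this example, and a real _start begins in assembly, gets its section addresses from the linker script, and ends by passing the return value to the operating system's exit call.

```c
#include <stddef.h>
#include <string.h>

typedef void (*ctor_fn)(void);
typedef int (*entry_fn)(int argc, char **argv);

/* Hypothetical description of what the loader or linker script supplies. */
struct startup_image {
    unsigned char *bss;        /* start of the zero-initialized region */
    size_t         bss_size;
    const ctor_fn *ctors;      /* constructors to run before the entry */
    size_t         ctor_count;
};

/* C-level part of a made-up startup routine. Returns the exit status
 * that a real _start would hand to the OS. */
int sketch_start(const struct startup_image *img, entry_fn entry,
                 int argc, char **argv)
{
    size_t i;

    memset(img->bss, 0, img->bss_size);     /* zero .bss if the loader did not */

    for (i = 0; i < img->ctor_count; i++)   /* run global constructors in order */
        img->ctors[i]();

    return entry(argc, argv);               /* hand over to the "main" function */
}

/* Example wiring with made-up pieces. */
static unsigned char demo_bss[4] = { 1, 2, 3, 4 };   /* pretend garbage */
static int demo_ctor_ran = 0;

static void demo_ctor(void)
{
    demo_ctor_ran = 1;
}

/* Stand-in for main(): succeeds only if the startup steps ran first. */
static int demo_entry(int argc, char **argv)
{
    (void)argv;
    return (argc == 1 && demo_ctor_ran && demo_bss[0] == 0) ? 0 : 1;
}

int run_demo(void)
{
    static const ctor_fn ctors[] = { demo_ctor };
    static char arg0[] = "demo";
    char *argv[] = { arg0, NULL };
    struct startup_image img = { demo_bss, sizeof demo_bss, ctors, 1 };

    return sketch_start(&img, demo_entry, 1, argv);
}
```

Writing your own _start means taking over every one of these steps yourself, which is why it mostly makes sense when there is no hosted C library to rely on.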

Compiler check to ensure I'm running in bare-metal and not in a hosted environment

If I'm compiling a C program for bare-metal, I know I can insert things like
#if defined(__linux__)
#error "You're not using a cross-compiler."
#endif
But, I don't want to check for every operating system. Is there a single check to see if I'm in a hosted environment?
If you want to determine that you are building with -ffreestanding then make your code check for the __STDC_HOSTED__ macro. It will be set to 1 for normal code and set to 0 for a freestanding build.
See the GCC info pages or the docs. The relevant quote is
By default, it acts as the compiler for a hosted
implementation, defining '__STDC_HOSTED__' as '1' and presuming that
when the names of ISO C functions are used, they have the semantics
defined in the standard. To make it act as a conforming freestanding
implementation for a freestanding environment, use the option
'-ffreestanding'; it then defines '__STDC_HOSTED__' to '0' and does not
make assumptions about the meanings of function names from the standard
library, with exceptions noted below.
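A minimal sketch of such a check is shown here. REQUIRE_FREESTANDING is a hypothetical project macro introduced only for this example (it would be passed as -DREQUIRE_FREESTANDING in bare-metal builds), and built_hosted is a made-up helper. Note that the macro reflects whether -ffreestanding was used, not whether the compiler is a cross-compiler.

```c
/* REQUIRE_FREESTANDING is a hypothetical project macro: define it in
 * builds that must be bare-metal so that a hosted build fails early. */
#if defined(REQUIRE_FREESTANDING) && defined(__STDC_HOSTED__) && __STDC_HOSTED__ == 1
#error "Hosted build detected; compile with -ffreestanding."
#endif

/* Returns 1 when compiled as a hosted implementation, 0 when freestanding. */
int built_hosted(void)
{
#if defined(__STDC_HOSTED__) && __STDC_HOSTED__ == 1
    return 1;
#else
    return 0;
#endif
}
```

Without the extra guard macro, the same #error would also fire in ordinary hosted builds of the same source, so whether to keep the guard depends on whether the file is ever built for a host.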

Visibility, Fortran common variables, runtime loading of shared libraries

Environment: Intel Linux, Red Hat 5.
Compiler: gcc 3.4.6
(old stuff, legacy environment with serious infrastructure, sorry)
I have multiple versions of a particular shared library (call it something like "shared_lib.so") derived from Fortran which contains a COMMON block and various computations with references to variables in that COMMON.
I need to be able to (from C code elsewhere in the end-product executable) use dlclose() and dlopen() to switch between versions of this library (within which all versions of the COMMON contents are identical) while running. In some cases the same COMMON also appears in code which is part of a static library (call it "static_lib.a") that is also linked into the executable, and is separately maintained from my project but which has functionality which interacts with that in my shared library.
I appear to be seeing that multiple instances of the COMMON wind up in the executable, and (more importantly) that there is no linkage between the values of variables in the instance from the static library, and the values of the “same” variables in the instance from a shared library pulled in with dlopen().
What I need, in summary, is (within the overall executable) for a dlopen()-loaded shared_lib.so to be able to set/use variable XYZ in COMMON ABC, and for code in static_lib.a to set/use XYZ, and have it in effect be the same instance of XYZ, or at least for the two to be kept in synch. Is this possible?
My compilation commands for sources in shared_lib.so are of the form:
g77 -c -g -m32 -fPIC -o shared_src.o shared_src.f
My command for building shared_lib.so is of the form:
gcc -g -m32 -fPIC -shared -o shared_lib.so *.o
My command for building the executable is of the form:
gcc -g -m32 -rdynamic -o exec exec.o static_lib.a shared_lib.so -lm -ldl -lg2c
My need is to do something from the C code of the form:
handle1 = dlopen ("shared_lib.so", RTLD_NOLOAD);
dlclose (handle1);
handle2 = dlopen ("shared_lib2.so", RTLD_NOW | RTLD_GLOBAL);
...
The initial startup configuration does appear to function correctly with respect to the needed variables, but the results of subsequent dlclose() and dlopen() sequences do not. Perhaps the underlying issue is that dlopen() lacks some intelligence that gcc possesses when it is linking.
Short answer
Did you (or can you) recompile the executable with -fPIC? I found that it was necessary to compile both the shared library AND the executable with -fPIC to get the COMMON blocks to be recognized properly.
Long answer
I ran into a slightly similar problem recently with COMMON blocks shared between an executable and a FORTRAN shared library. However, I'm using Intel compilers NOT the GNU compilers. The executable is mixed C/C++ and FORTRAN.
The existing (working) Windows version of the code works by sharing the common blocks between executable and DLL through DLLEXPORT/DLLIMPORT ATTRIBUTE directives. According to the Intel compiler documentation, these attribute directives are not recognized in Linux. Indeed, the Linux Intel compiler just produces warnings for these directives.
The main changes in converting the code from Windows to Linux were replacing the Windows LoadLibrary and GetProcAddress with Linux's dlopen and dlsym routines, respectively, using #ifdef sections. The shared library was compiled using -fpic and linked with -shared.
While the shared library was compiled with -fpic, the executable was NOT. When running the code compiled in this manner, variables passed to the shared library through subroutine calls were passed properly, however, the COMMON block variables were not set correctly (or were uninitialized).
In desperation, I finally tried compiling the executable itself with the -fpic compiler option, and then the COMMON blocks were recognized properly in the shared library.
This isn't really an answer, but might help you.
Here's what the 2008 standard says about COMMON:
5.7.2.4 Common association
1 Within a program, the common block storage sequences of all nonzero-sized common blocks with the same
name have the same first storage unit, and the common block storage
sequences of all zero-sized common blocks with the same name are
storage associated with one another. Within a program, the common
block storage sequences of all nonzero-sized blank common blocks have
the same first storage unit and the storage sequences of all
zero-sized blank common blocks are associated with one another and
with the first storage unit of any nonzero-sized blank common blocks.
This results in the association of objects in different scoping units.
Use or host association may cause these associated objects to be
accessible in the same scoping unit.
In short, COMMON sections with the same name in the same program occupy the same storage.
A program is defined as follows.
2.2.2 Program
1 A program shall consist of exactly one main program, any number (including zero) of other kinds of program units, any
number (including zero) of external procedures, and any number
(including zero) of other entities defined by means other than
Fortran. The main program shall be defined by a Fortran main-program
program-unit or by means other than Fortran, but not both.
The standard doesn't say anything about static vs dynamic linking and it doesn't restrict the previous statements to static linking. Therefore, it seems the dynamically loaded library should share the COMMON block with the main program (which I'm not sure is even technically possible) and thus the GNU implementation is incorrect.
On the other hand, the standard also doesn't say anything about being able to load libraries dynamically. Program units "defined by means other than Fortran" should include C libraries, but that doesn't tell us how these program units are connected to the main program. Fortran, in general, is not a very dynamic language.
Of course, you can work around all this by simply not using COMMON blocks. If a procedure needs to read/write some data, just pass it as a parameter with intent in/out. You can also group data together in a derived type and pass it around together as a unit. Nowadays (Fortran 2003+), you can even use object oriented programming, so there is really no need for global variables anymore.

Is a main() required for a C program?

Well the title says it all. Is a main() function absolutely essential for a C program?
I am asking this because I was looking at the Linux kernel code, and I didn't see a main() function.
No, the ISO C standard states that a main function is only required for a hosted environment (such as one with an underlying OS).
For a freestanding environment like an embedded system (or an operating system itself), it's implementation defined. From C99 5.1.2:
Two execution environments are defined: freestanding and hosted. In both cases, program startup occurs when a designated C function is called by the execution environment.
In a freestanding environment (in which C program execution may take place without any benefit of an operating system), the name and type of the function called at program startup are implementation-defined.
As to how Linux itself starts, the start point for the Linux kernel is start_kernel though, for a more complete picture of the entire boot process, you should start here.
Well, no, but ...
C99 specifies that main() is called in the hosted environment "at program startup", however, you don't have to use the C runtime support. Your operating system executes image files and starts a program at an address provided by the linker.
If you are willing to write your program to conform to the operating system's requirements rather than C99's, you can do it without main(). The more modern (and complex) the system, though, the more trouble you will have with the C library making assumptions that the standard runtime startup is used.
Here is an example for Linux...
$ cat > nomain.S
.text
_start:
call iamnotmain
movl $0xfc, %eax
xorl %ebx, %ebx
int $0x80
.globl _start
$ cat > demo.c
#include <unistd.h>
void iamnotmain(void) {
static char s[] = "hello, world\n";
write(1, s, sizeof s);
}
$ as -o nomain.o nomain.S
$ cc -c demo.c
$ ld -static nomain.o demo.o -lc
$ ./a.out
hello, world
It's arguably not "a C99 program" now, though, just a "Linux program" with an object module written in C.
The main() function is called by an object file included with the libc. Since the kernel doesn't link against the libc it has its own entry point, written in assembler.
Paxdiablo's answer covers two of the cases where you won't encounter a main. Let me add a couple more:
Many plug-in systems for other programs (like, say, browsers or text editors or the like) have no main().
Windows programs written in C have no main(). (They have a WinMain() instead.)
The operating system's loader has to call a single entry point. With the GNU compiler, the entry point is defined in the crt0.o linked object file, whose source is the assembler file crt0.s; it invokes main() after performing various run-time start-up tasks (such as establishing a stack and static initialisation). So when building an executable that links the default crt0.o, you must have a main(); otherwise you will get a linker error, since main() is an unresolved symbol in crt0.o.
It would be possible (if somewhat perverse and unnecessary) to modify crt0.s to call a different entry point. Just make sure that you make such an object file specific to your project rather than modifying the default version, or you will break every build on that machine.
The OS itself has its own C runtime start-up (which will be called from the bootloader) so can call any entry point it wishes. I have not looked at the Linux source, but imagine that it has its own crt0.s that will call whatever the C code entry point is.
main is called by glibc, which is part of the application (ring 3), not the kernel (ring 0).
A driver has another entry point; for example, a Windows driver based on WDM starts from DriverEntry.
In machine language things get executed sequentially: what comes first is executed first. So, the default is for the compiler to place a call to your main function to fit the C standard.
Your program works like a library, which is a collection of compiled functions. The main difference between a library and a standard executable is that for the second one the compiler generates assembly code which calls one of the functions in your program.
But you could write assembly code which calls an arbitrary C function in your program (the same way calls to library functions work, actually), and this would work the same way other executables do. The thing is, you cannot do it in plain standard C; you have to resort to assembly or even some other compiler-specific tricks.
This was intended as a general and superficial explanation, there are some technical differences I avoided on purpose as they don't seem relevant.

Resources