"C or gcc" is like "Chicken or the egg" ? :( [duplicate] - c

This question already has answers here:
How are gcc/g++ bootstrapped?
Closed 10 years ago.
I would like to know how gcc is compiled since, as we all know, it is written in C.
Did they use some other compiler to come up with gcc?
If so, can I use that same compiler to compile my C program?

There is no chicken and egg here. gcc is compiled with the compiler you are using.
That compiler was first compiled with a previous version of the same compiler. Then it can compile itself as well.
The real chicken-and-egg problem was solved in the 1950s, when someone had to write the world's first compiler. After that, you can use one compiler to compile the next one.
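In fact, gcc's own build automates this: it compiles itself three times and compares the last two stages, so a miscompiling compiler gets caught. Roughly (the prefix and paths below are only illustrative):

mkdir build && cd build
../gcc-src/configure --prefix=/opt/gcc
make bootstrap     # stage 1: built with the system compiler
                   # stage 2: gcc rebuilt with the stage-1 compiler
                   # stage 3: gcc rebuilt with stage 2 and compared against stage 2
make install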

There are two basic ways to build a new compiler:
If you're writing a new compiler for an established language like C, use an existing compiler from a different vendor to build your new compiler. For example, you could use the C compiler shipped with HP-UX to build gcc.
If you're writing a compiler for a new language, start by implementing a very simple compiler in a different language (the first C compiler was written in PDP-11 assembler). This initial compiler will only recognize a small subset of the target language; basically enough to do some file I/O and some simple statements. Write a new compiler in the target language subset and build it with your first compiler. Now write a slightly more capable compiler that can recognize a larger subset of the target language, and build it with the second compiler. Repeat the process until you have a compiler capable of recognizing the full target language.
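To make the subset idea concrete, here is a toy sketch of my own (not code from any real compiler): a "compiler" for a language of single-digit numbers joined by '+', which emits x86-64 assembly whose main returns the computed value. A real first-stage compiler is only a few steps up from this.

/* toy.c - a deliberately tiny "compiler": reads an expression such as
   1+2+3 on stdin and prints x86-64 assembly that returns the result.
   Build with: gcc toy.c -o toy */
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int c = getchar();
    if (!isdigit(c)) {
        fprintf(stderr, "expected a digit\n");
        return EXIT_FAILURE;
    }
    printf(".globl main\nmain:\n");
    printf("    mov $%d, %%eax\n", c - '0');
    while ((c = getchar()) != EOF && c != '\n') {
        if (c != '+' || !isdigit(c = getchar())) {
            fprintf(stderr, "expected '+' followed by a digit\n");
            return EXIT_FAILURE;
        }
        printf("    add $%d, %%eax\n", c - '0');
    }
    printf("    ret\n");
    return EXIT_SUCCESS;
}

echo 1+2+3 | ./toy > out.s && gcc out.s -o out && ./out; echo $?    # prints 6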

They did not use some other compiler. You can write a C program that doesn't use glibc simply by telling the compiler not to link it. Something like this:
gcc main.c -nostdlib
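Keep in mind that with -nostdlib there is also no C runtime, so you have to provide the entry point yourself. A minimal sketch for Linux on x86-64 (the syscall number and register constraints below are specific to that ABI):

/* nolibc.c - build with: gcc -nostdlib -static nolibc.c -o nolibc */
void _start(void)
{
    long status = 42;                     /* becomes the exit status     */
    __asm__ volatile ("syscall"
                      :
                      : "a" (60),         /* rax = 60: the exit syscall  */
                        "D" (status)      /* rdi = exit status           */
                      : "rcx", "r11", "memory");
    __builtin_unreachable();
}

./nolibc; echo $?    # prints 42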

This is an interesting question. I think you are wondering what language a compiler for a new language is written in, aren't you? Well, if we had only assembly language (for instance, x86), the only way to write a C compiler would be in assembly. Later, we could write a better, more powerful compiler in C using our assembly-written compiler, and so on...
And the question arises: how did the early programmers write the first assembler? My father told me: by manually entering the 1's and 0's! :-)

Related

What are Vectors and < > in C?

I was looking at the source code for gcc (out of curiosity), and I noticed a data structure that I've never seen in C before.
At lines 80 and 129 (and many other places) in the parser, they seem to be using vectors.
80: vec<tree> incomplete_record_decls;
129: ridpointers = ggc_cleared_vec_alloc<tree> ((int) RID_MAX);
I've never encountered this data type in C, nor these: < >. Are they native to C?
Does anyone know what they are and how they are used?
Despite the .c filename, this code is not valid C; it is C++, using that language's template feature. If you inspect the gcc build process, you will find that this file is actually compiled with a C++ compiler.
https://gcc.gnu.org/codingconventions.html
The directories gcc, libcpp and fixincludes may use C++03. They may also use the long long type if the host C++ compiler supports it. These directories should use reasonably portable parts of C++03, so that it is possible to build GCC with C++ compilers other than GCC itself. If testing reveals that reasonably recent versions of non-GCC C++ compilers cannot compile GCC, then GCC code should be adjusted accordingly. (Avoiding unusual language constructs helps immensely.) Furthermore, these directories should also be compatible with C++11.
Keep in mind that although compilers will usually by default infer a source file's language from its filename, this default can always be overridden. It is entirely possible to have C++ code in a .c file, or C code in a .bas file for that matter; you just may have to tell the compiler some other way what language is in use.
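With gcc, for example, the -x option selects the language explicitly, regardless of the extension (the file name here is just for illustration):

g++ -x c++ -c some-parser.c -o some-parser.o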
I expect that gcc chose this file naming convention because this code was originally written in C and later converted to C++, and they found it too much of a pain to change all the filenames. It would mean a lot of work to update all the makefiles, etc. It may have been less of a pain to just change which compiler was used, and to explain the convention to all the developers. Of course, in general it is better programming practice to name your files in the standard way, but apparently the gcc developers felt it was not the best course of action in this case.
GCC has moved from C to C++ since GCC 4.8
GCC now uses C++ as its implementation language. This means that to build GCC from sources, you will need a C++ compiler that understands C++ 2003. For more details on the rationale and specific changes, please refer to the C++ conversion page.
GCC 4.8 Release Series - Changes, New Features, and Fixes
The work actually began long before that, with the creation of the gcc-in-cxx branch. The developers first tried to compile the existing source code with a C++ compiler, so there weren't any name changes. I guess they didn't bother to rename the files later, when the two branches were merged and there was officially just one C++ codebase.
You can read GCC's move to C++ for more historical information

How can a C compiler be written in C? [duplicate]

This question already has answers here:
Writing a compiler in its own language
(14 answers)
Closed 9 years ago.
This question may stem from a misunderstanding of compilers on my part, but here goes...
One can find the following statement in the preface to the first edition of K&R (page xi):
The operating system, the C compiler, and essentially all UNIX applications programs (including all of the software used to prepare this book) are written in C.
(my emphasis)
Here's what I don't understand: doesn't that C compiler have to be compiled itself before it can compile any C code? And if that C compiler is written in C, wouldn't compiling it require an already existing C compiler?!
The only way out of this infinite-regression conundrum (or chicken-and-egg problem) is that the C compiler written in C that K&R are referring to was actually compiled with an already existing C compiler that was written in a language other than C. The C compiler written in C then superseded the latter.
Or am I completely off?
It's called Bootstrapping, quoting from Wikipedia:
If one needs a compiler for language X to obtain a compiler for language X (which is written in language X), how did the first compiler get written? Possible methods to solving this chicken or the egg problem include:
Implementing an interpreter or compiler for language X in language Y. Niklaus Wirth reported that he wrote the first Pascal compiler in Fortran.
Another interpreter or compiler for X has already been written in another language Y; this is how Scheme is often bootstrapped.
Earlier versions of the compiler were written in a subset of X for which there existed some other compiler; this is how some supersets of Java, Haskell, and the initial Free Pascal compiler are bootstrapped.
The compiler for X is cross compiled from another architecture where there exists a compiler for X; this is how compilers for C are usually ported to other platforms. Also this is the method used for Free Pascal after the initial bootstrap.
Writing the compiler in X; then hand-compiling it from source (most likely in a non-optimized way) and running that on the code to get an optimized compiler. Donald Knuth used this for his WEB literate programming system.
And if you are interested, here is Dennis Ritchie's first C compiler source.
Usually, a first compiler is written in another language (directly in PDP-11 assembler in this case, or in C for most "modern" languages). Then, this first compiler is used to build a compiler written in the language itself.
You can read this page about the history of the C language. You will see that it is also strongly linked to the UNIX system.
See the Chicken and Egg section of the Wikipedia page on bootstrapping; it lists the same methods quoted in the answer above.
It's perfectly ordinary for a compiler to be written in the language it compiles. One way to achieve this would be to write a complete compiler for language L in some other language, and then to write a new compiler for L in L. A more interesting approach would be to write a minimal compiler for a subset of L in some other language, and then use this minimal subset to improve the compiler, making it less minimal and increasing the available subset of L. In this way, a complete compiler can be built.

Trying to grasp C bytecode... does/can GNU/gcc produce C bytecode like Clang/LLVM?

Recently I was told to look at how C functions are compiled into LLVM bytecode, and then how the LLVM bytecode is translated into x86 ASM. As a regular GNU/gcc user, I have some questions about this. To put it mildly.
Does GNU/gcc compile to bytecode, too? Can it? I was under the impression that gcc compiles directly into ASM. If not, is there a way to view the bytecode intermediary as there is with the clang command?
~$ clang ~/prog_name.c -S -emit-llvm -o - <== will show bytecode for prog_name.c.
Also, I find bytecode to be rather byzantine. By contrast, it makes assembly language seem like light reading. In other words: I have little idea what it is saying.
Does anyone have any advice or references for vaguely deciphering the information that the bytecode gives? Currently I compare and contrast with actual ASM, so to say it is slow going is a compliment.
Perhaps this is all comically naive, but I find it quite challenging to break through the surface of this.
Perhaps try taking a look at the language reference.
As far as I know, GCC does have an IR as well, known as GIMPLE (another reference here).
If you mean that you would rather analyze the assembly output instead of the IR, you can take a look at this question which describes how to output an assembly file.
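If you do want to look at GCC's GIMPLE the way you look at Clang's IR, something along these lines should work (the exact dump file name varies between GCC versions):

~$ gcc -c ~/prog_name.c -fdump-tree-gimple    # writes a dump such as prog_name.c.005t.gimple in the current directory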

Bootstrapping A compiler [duplicate]

I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Use the compiler C1 to write compiler C2 for language X in language X
Now C2 is a fully self-hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than me.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this case, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if its features aren't very unusual) and write a compiler to bytecode, either in the bytecode, or in your desired language using another intermediate - such as a parser toolkit which outputs the AST as XML, then compile the XML to bytecode using XSLT (or another pattern matching language and tree-based representation). It doesn't remove the dependency on another language, but could mean that more of the bootstrapping work ends up in the final system.
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I suppose Lisp could have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function could be implemented on an IBM 704 in machine code, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
OCaml is a language which has both a bytecode compiler (i.e. a compiler producing OCaml bytecode, run by an interpreter) and a native compiler (to x86-64 or ARM, etc... assembler). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it, it first uses the bytecode (of a previous version of the compiler) to compile itself. Later, the freshly compiled bytecode is able to compile the native compiler. So the OCaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.
The rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...

How does C code call assembly code (e.g. optimized strlen)?

I always read things about how certain functions within the C programming language are optimized by being written in assembly. Let me apologize if that sentence sounds a little misguided.
So, I'll put it clearly: How is it that when you call some functions like strlen on UNIX/C systems, the actual function you're calling is written in assembly? Can you write assembly right into C programs somehow or is it an external call situation? Is it part of the C standard to be able to do this, or is it an operating system specific thing?
The C standard dictates what each library function must do rather than how it is implemented.
Almost all known implementations of C are compiled into machine language. It is up to the implementers of the C compiler/library how they choose to implement functions like strlen. They could choose to implement it in C and compile it to an object, or they could choose to write it in assembly and assemble it to an object. Or they could implement it some other way. It doesn't matter so long as you get the right effect and result when you call strlen.
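For instance, a perfectly legitimate (if unoptimized) choice is a plain C definition; the sketch below uses the name my_strlen only to avoid the reserved library name:

#include <stddef.h>

size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p)                 /* walk until the terminating '\0' */
        ++p;
    return (size_t)(p - s);
}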
Now, as it happens, many C toolsets do allow you to write inline assembly, but that is absolutely not part of the standard. Any such facilities have to be included as extensions to the C standard.
In the end, compiled programs and programs written in assembly are all machine language, so they can call each other. The way this is done is by having the assembly code follow the same calling conventions (how a call is set up, how parameters are passed, and so on) as the code generated from C. An overview of popular calling conventions for x86 processors can be found here.
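As a concrete sketch (hypothetical file names; x86-64 Linux with the System V convention assumed, where the first argument arrives in rdi and the result is returned in rax), an assembly file and a C caller can simply be linked together:

# my_strlen.s
        .globl my_strlen
my_strlen:
        xor  %rax, %rax            # length = 0
.Lloop: cmpb $0, (%rdi,%rax)       # terminating '\0' reached?
        je   .Ldone
        inc  %rax
        jmp  .Lloop
.Ldone: ret                        # result in %rax, per the ABI

/* main.c */
#include <stdio.h>

size_t my_strlen(const char *s);   /* defined in my_strlen.s */

int main(void)
{
    printf("%zu\n", my_strlen("hello"));   /* prints 5 */
    return 0;
}

gcc main.c my_strlen.s -o demo && ./demo    # prints 5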
Many (most?) C compilers do happen to support inline assembly, though it's not part of the standard. That said, there's no strict need for a compiler to support any such thing.
First, recognize that assembly is mostly just human (semi-)readable machine code, and that C ends up as machine code anyway.
"Calling" a C function just generates a set of instructions that prepare registers, the stack, and/or some other machine-dependent mechanism according to some established calling convention, and then jumps to the start of the called function.
A block of assembly code can conform to the appropriate calling convention, and thus generate a blob of machine code that another blob of machine code that was originally written in C is able to call. The reverse is, of course, also possible.
The details of the calling convention, the assembly process, and the linking process (to link the assembly-generated object file with the C-generated object file) may all vary wildly between platforms, compilers, and linkers. A good assembly tutorial for your platform of choice will probably cover such details.
I happen to like the x86-centric PC Assembly Tutorial, which specifically addresses interfacing assembly and C code.
When C code is compiled by gcc, it's first compiled to assembler instructions, which are then again compiled to a binary, machine-executable file. You can see the generated assembler instructions by specifying -S, as in gcc file.c -S.
Assembler code simply skips that first C-to-assembler stage and is from then on indistinguishable from code compiled from C.
One way to do it is to use inline assembler. That means you can write assembler code directly into your C code. The specific syntax is compiler-specific. For example, see GCC syntax and MS Visual C++ syntax.
You can write inline assembly in your C code. The syntax for this is highly compiler-specific, but the asm keyword is usually used. Look into inline assembly for more information.
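For example, with GCC's extended asm syntax on x86 (other compilers use a different syntax entirely), a small sketch might look like:

#include <stdio.h>

int main(void)
{
    int a = 2, b = 3, sum;
    __asm__ ("addl %2, %0"
             : "=r" (sum)            /* %0: output, any general register        */
             : "0" (a), "r" (b));    /* %1: a, in the same register as %0; %2: b */
    printf("%d\n", sum);             /* prints 5 */
    return 0;
}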
