How can a C compiler be written in C? [duplicate] - c

This question already has answers here:
Writing a compiler in its own language
(14 answers)
Closed 9 years ago.
This question may stem from a misunderstanding of compilers on my part, but here goes...
One can find the following statement in the preface to the first edition of K&R (page xi):
The operating system, the C compiler, and essentially all UNIX applications programs (including all of the software used to prepare this book) are written in C.
(my emphasis)
Here's what I don't understand: doesn't that C compiler have to be compiled itself before it can compile any C code? And if that C compiler is written in C, wouldn't compiling it require an already existing C compiler?!
The only way out of this infinite-regression conundrum (or chicken-and-egg problem) is that the C compiler written in C that K&R are referring to was actually compiled with an already existing C compiler that was written in a language other than C. The C compiler written in C then superseded the latter.
Or am I completely off?

It's called Bootstrapping, quoting from Wikipedia:
If one needs a compiler for language X to obtain a compiler for language X (which is written in language X), how did the first compiler get written? Possible methods to solving this chicken or the egg problem include:
Implementing an interpreter or compiler for language X in language Y. Niklaus Wirth reported that he wrote the first Pascal compiler in Fortran.
Another interpreter or compiler for X has already been written in another language Y; this is how Scheme is often bootstrapped.
Earlier versions of the compiler were written in a subset of X for which there existed some other compiler; this is how some supersets of Java, Haskell, and the initial Free Pascal compiler are bootstrapped.
The compiler for X is cross compiled from another architecture where there exists a compiler for X; this is how compilers for C are usually ported to other platforms. Also this is the method used for Free Pascal after the initial bootstrap.
Writing the compiler in X; then hand-compiling it from source (most likely in a non-optimized way) and running that on the code to get an optimized compiler. Donald Knuth used this for his WEB literate programming system.
And if you are interested, here is Dennis Ritchie's first C compiler source.

Usually, a first compiler is written in another language (directly in PDP-11 assembler in this case, or in C for most of the "modern" languages). Then, this first compiler is used to compile a compiler written in the language itself.
You can read this page about the history of the C language. You will see that it is also strongly linked to the UNIX system.


It's perfectly ordinary for a compiler to be written in the language it compiles. One way to achieve this would be to write a complete compiler for language L in some other language, and then to write a new compiler for L in L. A more interesting approach would be to write a minimal compiler for a subset of L in some other language, and then use this minimal compiler to improve the compiler itself, making it less minimal and increasing the available subset of L at each step. In this way, a complete compiler can be built.

Related

"C or gcc" is like "Chicken or the egg" ? :( [duplicate]

Possible Duplicate:
How are gcc/g++ bootstrapped?
I would like to know how gcc is compiled since, as we all know, it is written in C.
Did they use some other compiler to come up with gcc?
If so, can I use the same compiler to compile my C program?
There is no chicken and egg here. glibc is compiled with the compiler you are using.
That compiler was first compiled with a previous version of the same compiler. Then it can compile itself as well.
The real chicken-and-egg problem was solved in the 1950s when someone had to write the world's first compiler. After that, you can use one compiler to compile the next one.
There are two basic ways to build a new compiler:
If you're writing a new compiler for an established language like C, use an existing compiler from a different vendor to build your new compiler. For example, you could use the C compiler shipped with HP-UX to build gcc.
If you're writing a compiler for a new language, start by implementing a very simple compiler in a different language (the first C compiler was written in PDP-11 assembler). This initial compiler will only recognize a small subset of the target language; basically enough to do some file I/O and some simple statements. Write a new compiler in the target language subset and build it with your first compiler. Now write a slightly more capable compiler that can recognize a larger subset of the target language, and build it with the second compiler. Repeat the process until you have a compiler capable of recognizing the full target language.
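To make the "very simple compiler for a small subset" idea a bit more concrete, here is a minimal sketch in C. The toy language (integer literals, `+`, `*`, parentheses), the name `compile_expr`, and the stack-machine mnemonics `PUSH`, `ADD`, `MUL` are all made up for this example; it is not meant to reflect how the first C compiler actually worked.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stage-0 compiler for a toy expression language.
 * Grammar (an assumption for this sketch):
 *   expr   := term   ('+' term)*
 *   term   := factor ('*' factor)*
 *   factor := NUMBER | '(' expr ')'
 * Output: text for an imaginary stack machine (PUSH n / ADD / MUL). */

typedef struct {
    const char *src;   /* remaining input */
    char *out;         /* output buffer */
    size_t cap;        /* capacity of out */
    size_t len;        /* bytes used in out, excluding the terminator */
    int ok;            /* cleared on syntax error or full buffer */
} Compiler;

static void emit(Compiler *c, const char *text)
{
    size_t n = strlen(text);
    if (c->len + n + 1 > c->cap) { c->ok = 0; return; }
    memcpy(c->out + c->len, text, n + 1);
    c->len += n;
}

static void skip_spaces(Compiler *c)
{
    while (isspace((unsigned char)*c->src)) c->src++;
}

static void parse_expr(Compiler *c);

static void parse_factor(Compiler *c)
{
    skip_spaces(c);
    if (isdigit((unsigned char)*c->src)) {
        char line[32];
        long value = 0;   /* no overflow check, to keep the sketch short */
        while (isdigit((unsigned char)*c->src))
            value = value * 10 + (*c->src++ - '0');
        snprintf(line, sizeof line, "PUSH %ld\n", value);
        emit(c, line);
    } else if (*c->src == '(') {
        c->src++;
        parse_expr(c);
        skip_spaces(c);
        if (*c->src == ')') c->src++; else c->ok = 0;
    } else {
        c->ok = 0;
    }
}

static void parse_term(Compiler *c)
{
    parse_factor(c);
    skip_spaces(c);
    while (c->ok && *c->src == '*') {
        c->src++;
        parse_factor(c);
        emit(c, "MUL\n");
        skip_spaces(c);
    }
}

static void parse_expr(Compiler *c)
{
    parse_term(c);
    skip_spaces(c);
    while (c->ok && *c->src == '+') {
        c->src++;
        parse_term(c);
        emit(c, "ADD\n");
        skip_spaces(c);
    }
}

/* Returns 1 on success, 0 on a syntax error or if out is too small. */
int compile_expr(const char *src, char *out, size_t cap)
{
    Compiler c = { src, out, cap, 0, 1 };
    if (cap == 0) return 0;
    out[0] = '\0';
    parse_expr(&c);
    skip_spaces(&c);
    return c.ok && *c.src == '\0';
}
```

Calling `compile_expr("1 + 2 * 3", buf, sizeof buf)` fills `buf` with one instruction per line. In a real bootstrap, the next step would be to rewrite this translator in the toy language itself once that language is expressive enough, which this sketch does not attempt.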
They did not use some other compiler. You can write a C program that doesn't use glibc by simply telling the compiler not to use it. So something like this:
gcc main.c -nostdlib
This is an interesting question. I think you are wondering what language the compiler for a new language is written in, aren't you? Well, if we had only assembly language (for instance, x86), the only way to write a C compiler would be in assembly language. Later, we could use our assembly-written compiler to build a better, more powerful compiler written in C, and so on...
And the question arises: how did the early programmers write the first assembler? My father told me: by manually entering the 1's and 0's! :-)

How does C work? [duplicate]

Possible Duplicate:
How was the first compiler written?
I'm asking this as a single question because, essentially, what I'm trying to ask is how all of this is implemented at the bottom. Here goes:
How was the first C compiler generated? Since the C compiler is written in C itself, how was the first source of the C compiler compiled?
Is C written in ASM? How are languages actually designed? Before we had high-level languages, the only way to build something was through ASM; even if C is derived from earlier languages, how were they designed? (My guess is ASM.)
I'm getting confused as to how C works down at the bottom. What I'm trying to say is that, at the bottom, everything is implemented on the processor by opcodes. My understanding was that C programs are "essentially" translated to system calls, which are implemented by the kernel.
But then how are syscalls implemented? (Do they directly correspond to opcodes, or is there another layer of abstraction?)
How was the first C compiler generated? Since the C compiler is written in C itself, how was the first source of the C compiler compiled?
Bootstrapping.

Compilers that compile `generic made up language X` into portable C

I'm looking for two things. The first is terminology.
What do we call compilers that compile one language into another?
Secondly, are there any compilers that compile generic made up language X into portable C code?
I'm just throwing the idea out there, but I was thinking: what if we created our own front-end for our own language of choice, but instead of going the whole way, the compiler emitted portable C code? This way, we could add new language features but still be very compatible with existing C code.
Now maybe there's a huge flaw in this approach (except that you need to build it) but do people do this?
People absolutely do this. In fact, the original implementation of C++ was a program called Cfront that translated C++ into C code, to then be compiled with a C compiler.
With the prevalence today of intermediate "bytecode" languages such as JVM, CLR, and LLVM, translating languages to C source code is now much less common. It's much more powerful and less annoying to generate bytecode directly, rather than to generate textual source code. These bytecode (or "bitcode" in the case of LLVM) languages are lower level than textual programming languages, but still higher level than raw machine code that is tied to a specific CPU or CPU family.
I would call this sort of program a "translator", but that's just me. "Compiler" would work just fine too.
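As a rough illustration of what "emitting C" can look like, here is a minimal sketch of a translator for a made-up language with two statement forms, `let` and `print`. The language, the function name `translate_line`, and the mapping to C are assumptions for this example only; Cfront and similar tools are far more involved.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical source-to-source translation of one statement of a
 * made-up language into C. Two statement forms are assumed:
 *   let NAME = EXPR   ->  long NAME = EXPR;
 *   print EXPR        ->  printf("%ld\n", (long)(EXPR));
 * EXPR is passed through untouched, so it must already be valid C.
 * Returns 1 on success, 0 if the line is not recognized or does not fit. */
int translate_line(const char *line, char *out, size_t cap)
{
    int n;
    if (strncmp(line, "let ", 4) == 0)
        n = snprintf(out, cap, "long %s;", line + 4);
    else if (strncmp(line, "print ", 6) == 0)
        n = snprintf(out, cap, "printf(\"%%ld\\n\", (long)(%s));", line + 6);
    else
        return 0;
    return n >= 0 && (size_t)n < cap;
}
```

A driver would wrap the translated lines in `#include <stdio.h>` and `int main(void) { ... }` and hand the result to any C compiler. Passing expressions through verbatim is the shortcut that keeps this sketch small; a real front-end would parse and check them first.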

Bootstrapping A compiler [duplicate]

I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first.
To me, neither of these seems to actually be bootstrapping a language, in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y.
Write a compiler C2 for language X in language X, and compile it with C1.
Now C2 is a fully self-hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than me.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this case, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing its compiler in WEB itself, and then hand-compiling that to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if its features aren't very unusual) and write a compiler to bytecode, either in the bytecode itself or in your desired language using another intermediate step. For example, a parser toolkit could output the AST as XML, which is then compiled to bytecode using XSLT (or another pattern-matching language and tree-based representation). This doesn't remove the dependency on another language, but it could mean that more of the bootstrapping work ends up in the final system.
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should think Lisp could have done it.
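For a sense of scale, a bytecode machine can start out very small. The following is a minimal sketch of a stack-based interpreter in C; the opcodes `OP_PUSH`, `OP_ADD`, `OP_MUL`, `OP_HALT` and the function `vm_run` are invented for this example and do not correspond to any existing virtual machine.

```c
#include <stddef.h>

/* Opcodes of a made-up stack machine (assumptions for this sketch). */
enum { OP_PUSH, OP_ADD, OP_MUL, OP_HALT };

/* Runs code[0..len) and stores the top of the stack in *result.
 * Returns 1 on success, 0 on stack underflow or overflow, an unknown
 * opcode, or a program that runs off the end without OP_HALT. */
int vm_run(const int *code, size_t len, long *result)
{
    long stack[64];
    size_t sp = 0;   /* number of values currently on the stack */
    size_t pc = 0;   /* index of the next instruction */

    while (pc < len) {
        switch (code[pc++]) {
        case OP_PUSH:                 /* operand follows the opcode */
            if (pc >= len || sp == 64) return 0;
            stack[sp++] = code[pc++];
            break;
        case OP_ADD:
            if (sp < 2) return 0;
            stack[sp - 2] += stack[sp - 1];
            sp--;
            break;
        case OP_MUL:
            if (sp < 2) return 0;
            stack[sp - 2] *= stack[sp - 1];
            sp--;
            break;
        case OP_HALT:
            if (sp < 1) return 0;
            *result = stack[sp - 1];
            return 1;
        default:
            return 0;
        }
    }
    return 0;
}
```

With a program such as `{ OP_PUSH, 1, OP_PUSH, 2, OP_PUSH, 3, OP_MUL, OP_ADD, OP_HALT }`, `vm_run` computes 1 + 2 * 3. A compiler for the new language then only has to target these few opcodes rather than a real CPU.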
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function could be implemented on an IBM 704 in machine code, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
OCaml has both a bytecode compiler (whose output runs on a bytecode interpreter) and a native compiler (to x86-64, ARM, etc. assembler). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode form (file boot/ocamlc) of the compiler. So a build first uses that bytecode (from a previous version of the compiler) to compile the compiler itself. The freshly compiled bytecode compiler is then able to compile the native compiler.
The rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...

Is C open source?

Does C (or any other low-level language, for that matter) even have source, or is the compiler the part that "does all the work", including parsing? If so, couldn't different compilers have different C dialects? Where does the stdlib factor into this? I would really like to know how this works.
The C language is not a piece of software but a defined standard, so one wouldn't say that it's open-source, but rather that it's an open standard.
There are a gazillion different compilers for C however, and many of those are indeed open-source. The most notable example is GCC's C compiler, which is all under the GNU General Public License (GPL), an open-source license.
There are more options. Watcom is open-source, for instance. There is no shortage of open-source C compilers, but without a doubt the most widespread one, at least in the non-Windows world, is GCC.
For Windows, your best bet is probably Watcom or GCC by using Cygwin or MinGW.
C is a standard which specifies how C compilers should generate programs.
C itself doesn't have any source code, just like a musical note doesn't have any plastic.
Some C compilers, such as GCC, are open source.
C is just a language, and a standardised one at that, too. It pretty much is the compiler that "does all the work". Different compilers did have different dialects; before the ANSI C standard (C89), you had things like Borland C and other competing compilers that implemented the C language in their own fantastic ways.
stdlib is just an agreed-upon collection of standard libraries that are required to be present in any ANSI C implementation.
To add on to the other great answers:
Regarding different dialects -- there are some additional features added to C that are compiler specific. You can pass the command-line flag -std=... to gcc to specify the C standard that you want to use; each has slight variations/additions to syntax, and the most common is probably c99.
Each compiler tends to implement a few different extras. For example, typeof() is not in the C standard, so compilers do not have to implement it, but it is nevertheless useful and most compilers provide it. Here is a list of gcc C extensions.
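As a small example of such an extension in use, here is a sketch of a type-generic swap macro built on `__typeof__`, the spelling that GCC and Clang accept even in strict `-std=c99` mode. The macro name `SWAP` is made up for this example, and the code assumes a compiler that provides this extension (C23 has since standardized a `typeof` operator).

```c
/* SWAP exchanges two lvalues of the same type.
 * __typeof__ is a GCC/Clang extension, not part of ISO C99, so this
 * will not compile on a compiler that lacks it. */
#define SWAP(a, b)                        \
    do {                                  \
        __typeof__(a) swap_tmp = (a);     \
        (a) = (b);                        \
        (b) = swap_tmp;                   \
    } while (0)
```

After `int x = 1, y = 2; SWAP(x, y);`, `x` holds 2 and `y` holds 1, and the same macro works unchanged for `double` or pointer variables. Without `typeof`, the temporary's type would have to be passed as an extra macro argument.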
The stdlib is a set of functions specified in the C standard. Much like compilers, stdlib can have different implementations. The GNU implementation is open source, as is gcc, but there are other compilers and could be other implementations of stdlib that are closed source.
The compiler determines all the mappings from C to assembly, etc. As far as ownership goes, no one really owns C; ANSI/ISO determine the standards.
GCC's C compiler is written in C. So we know there is at least one C compiler written in C.
GNU's stdlib (glibc) is also written in C (stdio.h, stdlib.h). But it also has some parts written in assembly language.
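To illustrate that library functions can themselves be ordinary C, here is a portable sketch of a `strlen`-like function. The name `my_strlen` is made up and this is not glibc's code; real implementations often replace simple loops like this with hand-tuned assembly for speed.

```c
#include <stddef.h>

/* Portable sketch of what strlen does: count the bytes before the
 * terminating '\0'. Not taken from any real C library. */
size_t my_strlen(const char *s)
{
    const char *p = s;
    while (*p != '\0')
        p++;
    return (size_t)(p - s);
}
```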
A really good question. There is a way to define a language standard (not the implementation!) in the form of "source code", in a strict and unambiguous language. Unfortunately, all of the old languages, including C, are poorly defined. But it is still possible to translate those definitions into a source code form.
Another approach is to define a language via its operational semantics, often in the form of a simple (and inefficient) reference implementation.
Helgi Hrafn Gunnarsson has written the main answer but I thought it would be worth noting that you can effectively end up with dialects too.
The compilers should do the same thing with regard to whichever standard they support (which these days should be pretty much all the same version), but there are grey areas. One example is the way compilers handle 'undefined' behaviour. If the C specification says that the behaviour is undefined for a specific case, then the compiler can do pretty much what it wants.
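Signed integer overflow is one well-known case the C standard leaves undefined, so portable code usually checks before it adds. The following is a minimal sketch of that pattern; the function name `checked_add` is made up for this example.

```c
#include <limits.h>

/* Adds a and b without ever executing an overflowing signed addition.
 * Returns 1 and stores the sum in *sum, or returns 0 (leaving *sum
 * untouched) if the mathematical result would not fit in an int. */
int checked_add(int a, int b, int *sum)
{
    if ((b > 0 && a > INT_MAX - b) || (b < 0 && a < INT_MIN - b))
        return 0;
    *sum = a + b;
    return 1;
}
```

Writing `a + b` directly and testing the result afterwards is not a reliable alternative, because once the overflow has happened the compiler is allowed to assume it never did.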
There are also examples of functions added to the libraries (and new libraries added) by the compiler makers to support specific platform traits, create a competitive advantage or simply to make life easier. The cynical might suggest that some of these are added to help lock people into a specific compiler too.
I would say that C as a language is not open source.
As pointed out by many, you can download GNU licensed compilers and libraries for free, but if you wanted to write your own C compiler, you would need to follow the ISO C standards, and ISO charge hard cash for the specification of the C language, which at the time of posting this is $178.
So really the answer depends on what elements you are interested in being free and open source.
I'm not sure what your definitions of "open source" are.
For the standardization process, it is possible for anyone to participate, but if you want to be able to vote then you will need to pay to join your national body (for instance, ANSI for the USA, BSI for the UK, AFNOR for France etc.). As a rule most standards body memberships are paid by corporations. That said, the process is fairly open. You can access discussion papers on the standards web site.
The standards themselves are not free either. The ISO PDF store currently sells the C standard for 198 Swiss francs. Draft copies of the standard can be found easily for free.
There are plenty of open source implementations of both compilers and libraries.
