How to create a C compiler without a native C compiler

It's a simple question: if a C compiler is needed to compile the C compiler, how was the first one made? Maybe directly with assembly code? Perhaps the kernel provides a basic tool for converting C to assembly, and an escalating infrastructure is built on top of that? It may also be a silly question, but I'm really interested in how an operating system is designed from scratch (not by me) to interact with the CPU and memory.

Bootstrapping
This comes from the phrase, "pulling yourself up by your bootstraps", if you want to look it up. Bootstrapping is a straightforward process, but it is a lot of work.
1. Write a C compiler in another language.
2. Write a C compiler in C.
3. Use the compiler from step #1 to compile the compiler from step #2.
Some other compilers have to be bootstrapped, not just those for C. For example, GHC must be used to compile itself.
Note that bootstrapping is only necessary if both of the following are true:
You are inventing a new language, so there are no existing compilers.
You want to write the compiler in the language you are compiling.
This has nothing to do with operating systems. If you are designing an operating system, you have enough work ahead of you already. If your new operating system does not have a C compiler yet, you can cross-compile the compiler from a different operating system. This is much less work — maybe a few hours or days, instead of months or years.

It sounds like this is a task for tcc (http://bellard.org/tcc/), the Tiny C Compiler.
It's the smallest C compiler I know of; it can compile a bigger one once you've ported it, and there are numerous guides on how to port it to custom systems.

Related

How could one possibly bootstrap a C compiler (from source)?

I was looking into compiler bootstrapping, and I looked at how Golang implements bootstrapping from source, i.e., by building the last version of Golang implemented in C and using the generated executable to compile newer Go releases. This made me curious as to how the same could be done with C. Can you construct a C compiler on a computer with literally nothing present on it? If not, then how can I trust that the binary of the compiler I use doesn't automatically fill the binaries it compiles with spyware?
Related question, since the first C compiler was written in B and B was written in BCPL, what was BCPL written in?
Can you construct a C compiler on a computer with literally nothing present on it?
The main issue is how (in 2021) you would write a program for that computer at all, and how you would input it.
In the 1970s, computers (like IBM 360 mainframes) had many mechanical switches for entering some initial program. In the 1960s, they had even more, e.g. the IBM 1620.
Today, how would you input that initial program? Did you consider using an Arduino? Even oscilloscopes today contain microprocessors with programs....
Some hobbyists have, a few years ago, designed (and spent a lot of money on) computers made of mechanical relays. These are probably thousands of times slower than the cheapest laptop you could buy (or the microcontroller inside your computer mouse; your mouse contains some software too).
You could also buy many discrete transistors (e.g. thousands of 2N2222) and make a computer by soldering them.
Even a cheap motherboard (e.g. an MSI A320M A-PRO) today ships with a firmware program called UEFI or BIOS, which is rumored to be mostly written in C (several tens of thousands of statements).
In some ways, computer chips themselves are "software", coded in VHDL, SystemC, etc.
However, you can in principle still bootstrap a C compiler in 2021.
Here is a hypothetical tale...
Imagine you have today a laptop running a small Linux distribution on some isolated island (à la Robinson Crusoe), without any Internet connection - but with books (including Modern C and some book about x86-64 assembly and instruction set architecture and many other books in paper form), pencils, papers, food and a lot of time to spend. Imagine that system does not have any C compiler (e.g. because you just removed by mistake the gcc package from some Debian distribution), but just GNU binutils (that is, the linker ld and the assembler gas), some editor in binary form (e.g. GNU emacs or vim), GNU bash and GNU make as binary packages. We assume you are motivated enough to spend months in writing a C compiler. We also assume you have access to man pages in some paper form (notably elf(5) and ld(1)...). We have to assume you can inspect a file in binary form with od(1) and less(1).
Then you could design on paper a subset µC of the C language in EBNF notation. With months of effort, you could write a small assembly program, directly doing syscalls(2) (see the Linux Assembly HowTo), that interprets that µC language (since writing an interpreter is easier than writing a compiler; read for example the Dragon Book, Queinnec's Lisp In Small Pieces, and Scott's Programming Language Pragmatics).
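As a very rough illustration of the kind of thing such a µC interpreter has to do, here is a minimal sketch, in ordinary C with a made-up grammar subset, of a recursive-descent evaluator for integer expressions. A real µC interpreter would of course be written in assembly and would also handle statements, variables and syscalls:

    /* Toy evaluator for a tiny expression subset, standing in for the core
     * of the hypothetical µC interpreter described above.
     * Grammar (made up for this sketch):
     *   expr   := term   (('+' | '-') term)*
     *   term   := factor (('*' | '/') factor)*
     *   factor := NUMBER | '(' expr ')'
     */
    #include <ctype.h>
    #include <stdio.h>
    #include <stdlib.h>

    static const char *src;                 /* current position in the input */

    static void skip_ws(void) { while (isspace((unsigned char)*src)) src++; }

    static long expr(void);                 /* forward declaration */

    static long factor(void)
    {
        skip_ws();
        if (*src == '(') {                  /* parenthesised sub-expression */
            src++;
            long v = expr();
            skip_ws();
            if (*src == ')') src++;
            return v;
        }
        char *end;                          /* decimal literal */
        long v = strtol(src, &end, 10);
        src = end;
        return v;
    }

    static long term(void)
    {
        long v = factor();
        for (;;) {
            skip_ws();
            if      (*src == '*') { src++; v *= factor(); }
            else if (*src == '/') { src++; v /= factor(); }
            else return v;
        }
    }

    static long expr(void)
    {
        long v = term();
        for (;;) {
            skip_ws();
            if      (*src == '+') { src++; v += term(); }
            else if (*src == '-') { src++; v -= term(); }
            else return v;
        }
    }

    int main(void)
    {
        src = "2 + 3 * (4 - 1)";
        printf("%ld\n", expr());            /* prints 11 */
        return 0;
    }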
Once you have your tiny µC interpreter, you can write a naive µC compiler in µC (after all, Fabrice Bellard was able to write his Tiny C Compiler).
Once you have debugged that µC compiler, you can extend it to accept all the syntax and semantics of C.
Once you have a full C compiler, you could improve it to optimize better, maybe extend it to accept a small subset of C++, and you might also write a static C code analyzer inspired by Frama-C.
PS. Bootstrapping can be generalized a lot - see Pitrat's blog on bootstrapping artificial intelligence (Jacques Pitrat, born in 1934, died in October 2019) and the RefPerSys project.
As Some programmer dude stated in a comment, since C is a portable programming language, you can use a compiler on a different platform to build a cross-compiler that, running on that host platform, produces executables for the target platform.
You then use that cross-compiler on the host platform to compile the same C compiler for the target platform, so that the result is an executable that runs on the target.
Then you copy that compiler binary onto the target machine, and from then on it is self-hosting.
Naturally at some point in early history someone really had to write something in assembler or machine code somewhere. Today, it is no longer a necessity but a "life choice".
As for the "how can I trust that the binary of the compiler I use doesn't automatically fill the binaries it compiles with spyware?" question: that problem has been solved. You can use two independent compilers to build the cross-compiler from the same source base, then build the target executable with each of the resulting cross-compilers; both should produce bitwise-identical results. Then you know that the result is either free of spyware, or that the two independent compilers you started with would infect the resulting executable with exactly the same spyware, which is exceedingly unlikely.
You can write a really feeble C compiler in assembly or machine code, then bootstrap from there.
Before programming languages existed you just wrote machine code. That was simply how it was done.
Later came assembler, which is like "easy mode" machine code, and from there evolved high-level languages like Fortran and BCPL. These were decoupled from the machine architecture by having a proper compiler to do the translation.
Today you'd probably write something in C and go from there, anything compiled is suitable, though "compiled" is a loose definition now that LLVM exists and you can just bang out LLVM IR code instead of actual machine code. Rust started in OCaml and is now "self-hosted" on top of LLVM, for example.

Relationship of the C compiler and the C standard library

I have been doing a lot of reading lately about how glibc functions wrap system calls in Linux. I am wondering, however, about the relationship between glibc and the GNU C Compiler.
Let's say, for example, that I wanted to write my own C standard library implementation, a new library called "newglibc", and I change things just slightly: for example, I add more checks and actions before and after the system calls. Would I have to write a new compiler? Or would I be able to use the same GNU gcc compiler?
If the compiler is completely separate from the library, then would someone be able to, THEORETICALLY, use gcc on a Windows system if they could turn it into a .exe and provide the standard C library that Windows provides?
Thank you
The Linux kernel, the GNU C Library ("glibc"), and the GNU Compiler Collection (gcc) are three separate development projects. They are often used all together, but they don't have to be. The ones with "GNU" in their name are officially part of the GNU Project; Linux isn't.
The C standard does not make a distinction between the "compiler" and the "library"; it's all one "implementation" to the committee. It is largely a historical accident that GCC is a separate development project from glibc—but a motivated one: back in the day, each commercial Unix variant shipped with its own C library and compiler, and they were terrible, 90% bugs by volume was typical. GNU got its start providing a less terrible replacement for the compiler (and the shell utilities, which were also terrible).
Replacing the compiler on a traditional commercial Unix is a lot easier than replacing the C library, because the C library isn't just the functions defined in clause 7 of the C standard; as you have noticed, it also provides the lowest-level interface to the kernel, and often that wasn't very well documented. glibc did at one time at least sort-of support a bunch of these Unixes, but nowadays it can only be used with Linux and an experimental kernel called the Hurd. By contrast, GCC supports dozens of different CPUs and kernels, and Linux supports dozens of different CPUs.
If you write your own C library and/or kernel, it is relatively easy to write a "back end" so that GCC can generate code for them as a cross-compiler, and somewhat more difficult to port GCC to run in that environment. You may also need to write a back end for the assembler and linker, which are yet a fourth project ("GNU Binutils"). Porting glibc to a new CPU running Linux is a large but straightforward task; porting glibc to a new operating system is hard, especially if that OS is not Unix-ish. (Windows is decidedly not Unix-ish, so much so that when Microsoft wanted to make it easier to run programs written for Unix under Windows, the path of least resistance was to bolt an in-house clone of the Linux kernel onto the side of the NT kernel. I am not making this up.)
If you write your own C compiler, you will have to make it conform to the expectations of the library and kernel that it is generating code for. A lot of that is documented in the "ABI" specification for the environment you're working in, but not all, unfortunately.
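To make the compiler/library split concrete, here is a minimal sketch of what a "newglibc"-style wrapper around a system call could look like. The name my_write is made up, this is not glibc's actual implementation, and it assumes Linux; the point is that the compiler only ever emits an ordinary function call, while the extra checks before and after the system call live entirely in the library:

    /* Hypothetical "newglibc"-style wrapper around the write(2) system call.
     * The compiler only needs an ordinary function declaration; the extra
     * checks live entirely in the library.  Linux assumed. */
    #define _GNU_SOURCE
    #include <errno.h>
    #include <stddef.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    ssize_t my_write(int fd, const void *buf, size_t count)
    {
        /* Extra check "before the system call". */
        if (buf == NULL && count != 0) {
            errno = EFAULT;
            return -1;
        }

        long ret;
        do {
            ret = syscall(SYS_write, fd, buf, count);  /* raw kernel entry */
        } while (ret < 0 && errno == EINTR);           /* extra action "after": retry on interrupt */

        return (ssize_t)ret;
    }

    int main(void)
    {
        /* Compiled with the ordinary gcc; no compiler change is needed. */
        my_write(1, "hello from newglibc\n", 20);
        return 0;
    }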
If that doesn't clarify, please let us know what is still unclear.

Why was C not made a platform independent language?

I recently read the Dragon Book on compiler design. It mentions that the compiler has intermediate code generation as one of its phases, which produces machine-independent code. Then why was C not developed as a platform-independent language like Java?
What the Dragon Book is describing is the following process:
1. Compile the source code into an intermediate, machine-independent byte code format.
2. Perform optimizations and analyses on that IR.
3. Translate the IR to the target platform's actual machine code.
The upside of this is that if you want to support additional systems, you just need to add a new code generator for step 3 without having to touch steps 1 and 2.
All common C compilers work this way. So if your question is "Why don't C compilers do what the Dragon Book describes?", the answer is: "They do".
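As a rough illustration of what such a machine-independent intermediate representation can look like, here is a made-up three-address-code sketch; it is not GCC's actual GIMPLE or RTL, just an illustrative data structure:

    /* A toy three-address-code IR in the spirit of the Dragon Book's
     * machine-independent intermediate representation (illustrative only). */
    #include <stdio.h>

    typedef enum { OP_LOADCONST, OP_MUL, OP_ADD } Op;

    typedef struct {
        Op  op;
        int dst;     /* destination virtual register */
        int src1;    /* first operand: virtual register, or constant for LOADCONST */
        int src2;    /* second operand: virtual register (unused for LOADCONST) */
    } Insn;

    int main(void)
    {
        /* IR for:  a = b + c * 4;   with b in r1, c in r2, a in r5 */
        Insn ir[] = {
            { OP_LOADCONST, 3, 4, 0 },   /* r3 <- 4        */
            { OP_MUL,       4, 2, 3 },   /* r4 <- r2 * r3  */
            { OP_ADD,       5, 1, 4 },   /* r5 <- r1 + r4  */
        };

        /* A target-specific back end (step 3 above) would walk this array
         * and emit real machine instructions; here we just print the IR. */
        static const char *name[] = { "loadconst", "mul", "add" };
        for (int i = 0; i < (int)(sizeof ir / sizeof ir[0]); i++)
            printf("r%d <- %s %d, %d\n",
                   ir[i].dst, name[ir[i].op], ir[i].src1, ir[i].src2);
        return 0;
    }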
Now you mentioned Java. What a Java compiler does is the following:
1. Compile the Java code into Java byte code. As far as the Java compiler is concerned, this is not an intermediate format, but the actual target language.
2. The end.
Now to run this byte code you need a JVM, which interprets the byte code and/or JIT-compiles it. The optimizations and analyses usually happen during JIT-compilation. This is not the process described in the Dragon Book.
From the language implementers' point of view, this doesn't change the effort of supporting a new target system very much. You no longer have to change the compiler, but instead you have to change the JVM: Instead of having to add a new backend to the javac compiler, you instead add a new backend to the JIT-compiler. The effort remains basically the same.
The major difference is for the Java programmers: instead of compiling the program for every target platform and distributing packages for each platform, you can now compile the code once and give the resulting package to everyone. The people running your code now need to install a JVM to be able to use the package, so you have basically moved the effort from the programmer to the end user, but installing a JVM is something you need to do only once (not for every Java program you want to run).
So instead of "write once, compile everywhere", you now have "compile once, run everywhere".
So why didn't C do the same thing that Java does? Performance. Interpreting byte code is slow (compared to running compiled code) and JIT-compilation leads to increased start-up time.
C was initially designed for a particular use case, which involved a specific machine. Although it was loosely based on the language BCPL, which was implemented by way of a platform-independent virtual machine, the goal for C was to be able to write low-level code, such as an operating system, which meant that it needed to be able to take advantage of specific features of the target machine, particularly its ability to directly address individual bytes. By contrast, BCPL's underlying architecture is resolutely word-oriented.
The fact that Bell Labs was able to rapidly reimplement the Unix operating system in their new language (C) certainly contributed to its popularity. (At least, that's why I initially learned it.) To allow for a wider dissemination of the language, a version of the compiler was written more closely following the architecture outlined in the Dragon Book, with an initial generation of virtual machine code which is then used to produce code for a target machine. This Portable C Compiler was for many years a reference implementation, and continues to be available.
Other languages contemporary with C, notably Pascal, also used the tactic of targeting a platform-independent virtual machine, and it was once common to refer to virtual machine code as "P-Code" because that's what Niklaus Wirth's Pascal project called their target architecture.
Although GCC does not use a virtual machine as such, it does start by generating a low-level, machine-independent internal representation, simplifying the task of porting the compiler to new architectures. And of course the Clang compiler produces LLVM (low-level virtual machine) code, which can be translated into various concrete machine codes, or interpreted directly.
C was originally designed and written as a "Write-Once, Compile-Anywhere" language, which was as close as they could get at the time to a Universal Language.
Processors and Architectures were so radically different, and resources were so small that the idea of a Universal Virtual Machine (like Java has) was just impossible.
The idea that a single code-base could be run through a compiler, and then you have the same software on any target platform was pretty incredible.
The short answer: Because it was not feasible at that time.
The long answer: the Java platform is a language plus a virtual machine. Java code compiles to something called bytecode, and then the virtual machine takes this bytecode (it is similar to assembly language) and translates it at runtime into the relevant machine instructions understood by the local machine.
Every architecture has its own instruction set, meaning that an ARM architecture will not be able to understand code compiled for the x86 architecture, for example.
In C, the code is compiled directly to machine instructions, and these instructions are then executed by the local machine.
To get behaviour like Java's, you would need some kind of interpreter that reads C and translates it to machine code at runtime. That is no cheap task and was way too much for the computers of the time (C was invented in 1972). Of course, another way this could be implemented is to have the user compile your program before using it, which could be nice but would probably involve making your source code visible to the client, which is unwanted.
Hopefully that clarifies things a bit.
Aside from leaving a number of things implementation-defined (in practice this is largely platform/ABI-defined, but strictly speaking doesn't have to be), C is mostly a platform-independent language. Indeed there are implementations of C (such as emscripten) that produce output in a form that can run on any machine platform with the right runtime environment for it. If software written in C makes assumptions about the implementation-defined (or worse, undefined) aspects of the language, then it might fail to work on some implementations/machines, but quite often the cause is more a matter of API/environment/library assumptions (like assuming POSIX, or Windows, or glibcisms) than making nonportable assumptions about the language itself.
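To make that concrete, here is a small example; it is valid, portable C, but its output is implementation-defined and therefore differs across compilers and ABIs:

    /* The language itself is portable; what varies between implementations is
     * behaviour the standard leaves implementation-defined. */
    #include <limits.h>
    #include <stdio.h>

    int main(void)
    {
        /* sizeof(long) is 8 on most 64-bit Unix ABIs (LP64) but 4 on
         * 64-bit Windows (LLP64): implementation-defined, not a bug. */
        printf("sizeof(long) = %zu\n", sizeof(long));

        /* The signedness of plain 'char' is also implementation-defined,
         * so CHAR_MIN is 0 on some platforms and -128 on others. */
        printf("CHAR_MIN = %d\n", CHAR_MIN);

        /* A program that assumes sizeof(long) == sizeof(void *) works on
         * LP64 but silently breaks on LLP64: a portability bug in the
         * program, not in the language. */
        return 0;
    }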

Why are compilers written in C/C++ instead of CoffeeScript (JavaScript, Node.js)?

I am exposed to C because of embedded system programming, and I think it's one wonderful language in this field. However, why is it used to write compilers? If the reason gcc is implemented in C/C++ is that there weren't many good languages at the time, there's no excuse for clang taking the same path (using C/C++).
Is it for performance reasons? Mostly interpreted languages are a bit slower compared with compiled languages, but I guess the difference is almost negligible in CoffeeScript (JavaScript), because of Node.js.
From the perspective of developers, I suppose it's much easier to write a compiler using high-level languages. Unfortunately, most compilers out there are written in C/C++. Is it just because of legacy code?
Response to comments:
Bootstrapping is just one way to illustrate that a language is powerful enough to write a compiler. It shouldn't be the dominant reason why we choose the language in which to implement the compiler.
I agree with the guess given below, that "most compiler developers would answer because most of compiler related tools (bison, yacc) emit C code". However, neither GCC nor Clang uses a generated parser; they each implemented one themselves. This front-end work is independent of the target architecture, so it should not be where C/C++'s strength lies.
There is more or less consensus that performance is one key factor. Indeed, even for GCC and Clang, building a reasonably sized C project (the Linux kernel) takes a lot of time. Is that because of the front end or the back end? I have to admit that I don't have much experience with compiler back ends, as we finished our compiler course by generating LLVM code.
I am exposed to C because of embedded system programming, and I think it's one wonderful language in this field.
Yes. It's better than Java.
However, why is it used to write compilers?
This question can't be answered without asking the developers. I suspect that the majority of them will tell you that common compiler-writing software (yacc, flex, bison, etc) produce C code.
If the reason for gcc is that there aren't many good languages, there's no excuse for clang.
GCC isn't a programming language, and neither is Clang. They're both implementations of the C programming language.
Is it for performance reasons?
Don't confuse implementation with specification. Speed is an attribute introduced by your compiler and your computer, not by the programming language. GCC happens to produce fairly efficient machine code, which might influence developers to use C as their primary programming language... but in ten years time, it could* be that node.js produces more efficient machine code than GCC. Don't forget, StackOverflow is forever.
* could, but most likely won't. See Ira Baxter's comment below for more info.
Mostly interpreted languages are a bit slower compared with compiled languages, but I guess the difference is almost negligible in CoffeeScript (JavaScript), because of Node.js.
Similarly, interpretation or compilation isn't the choice of the language, but of the implementation of the language. For example, GCC and Clang choose to compile C to machine code. Ch and CINT are two interpreters that translate C code directly to behaviour, rather than machine code. Java was once predominantly translated using interpretation, too, but is now predominantly compiled into JVM bytecode. Javascript seems to be phasing towards predominant compilation, too. Who knows? Maybe you'll see compilers written predominantly in Javascript in ten years time...
From the perspective of developers, I suppose it's much easier to write a compiler using high-level languages.
All of these programming languages are technically high level. They're mostly defined in terms of an abstract machine; They're certainly not low level.
Unfortunately, most compilers out there are written in C/C++.
I don't consider it unfortunate that C++ is used to write software; It's not a bad programming language.
Is it just because of legacy code?
I suppose legacy code might influence the decision of a programmer. In the end though, as I said, you'd have to ask the developers. They might just decide to use C or C++ because C or C++ is their favourite programming language... Why do you speak English?
Compilers are very complex software in general. The front-end part is pretty simple (parsing), but the back-end part (scheduling, code generation, optimization, register allocation) involves NP-complete problems (of course, compilers try to approximate solutions to these problems). Thus, implementing the compiler in C helps keep compile times down. C is also very good at bitwise operators and other low-level stuff, which is useful for writing a compiler.
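As one small, made-up example of the "bitwise operators and low-level stuff" point: compiler back ends routinely represent sets of registers or live variables as bit sets, which is terse and fast in C:

    /* A toy bit set over virtual registers, of the kind a liveness analysis
     * or register allocator might use.  Illustrative sketch only. */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t RegSet;                       /* up to 64 registers */

    static RegSet set_add(RegSet s, int r)    { return s |  ((RegSet)1 << r); }
    static RegSet set_remove(RegSet s, int r) { return s & ~((RegSet)1 << r); }
    static int    set_has(RegSet s, int r)    { return (s >> r) & 1; }

    int main(void)
    {
        RegSet live = 0;
        live = set_add(live, 3);                   /* r3 becomes live */
        live = set_add(live, 7);                   /* r7 becomes live */
        live = set_remove(live, 3);                /* r3 dies */

        printf("r3 live? %d, r7 live? %d\n", set_has(live, 3), set_has(live, 7));
        /* Union and intersection of sets across basic blocks are just | and &. */
        return 0;
    }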
Note that not all compilers are written in C, though. For example, the Haskell compiler GHC is written in Haskell, using a bootstrapping technique.
Javascript is async, which doesn't suit compiler writing.
I see many reasons:
There is no elegant way of handling bit-precise code in Javascript
You can't write binary files easily in Javascript, so the assembler part of the compiler would have to be in a more low-level language
Huge JS codebases are very heavy to load in memory (it's plain text, remember?)
Writing optimization routines for compilers is heavily CPU-intensive, which is not yet very compatible with Javascript
You wouldn't be able to compile your compiler with it (bootstrap), because you need a Javascript interpreter behind your compiler. The bootstrap phase wouldn't be "pure":
JS Compiler compiles NodeJS -> NodeJS runs your new Compiler -> new JS Compiler
gcc is implemented primarily in C, but that is not true of all compilers, including some that are quite standard. It is a common pattern for a compiler to be implemented in the language that it compiles. ghc is written largely in Haskell. Recent versions of guile feature a compiler implemented mostly in Scheme.
Nope, CoffeeScript et al. are still much slower than natively compiled (and optimised) C code. Even if you take the subset of JavaScript that can be optimised (asm.js), it's still twice as slow as native C.
What people mean when they say node.js is just as fast as C code is that it's just as fast as part of an overall system that does other things, like read from disk, wait for data off the network, etc. In these systems the CPU is underused (especially with today's super-fast CPUs), so the performance problem is not the raw processing capability of the language. Hence, a node.js server is exactly as fast as a C server if they're both stuck waiting for a network call to return data. The kind of system written in node.js does a lot of waiting for the network, which is why people use node.js. The kind of system typically written in C does not suit being written in node.js.

Why is C the language of compilers, when a Scheme subset would seem to be a better fit?

I was just listening to episode 57 of Software Engineering Radio
(TRANSCRIPT: http://www.se-radio.net/transcript-57-compiletime-metaprogramming )
I'm only 40 minutes in, but I'm wondering why C is the language of compilers, when a Scheme subset (or some other HLL) would seem to be a better fit?
(excluding the obvious reason of not wanting to rewrite gcc)
PS originally posted this at LtU http://lambda-the-ultimate.org/node/3754
I won't bother to listen to 40 minutes of radio to perhaps understand your question more thoroughly, but I would claim the exact opposite. Only a minority of compilers are written in C. I rather have the impression that (at least where appropriate), most compilers are implemented in the language they are meant to compile.
C need not be the language for compilers, but it does have some advantages. C is available on almost all platforms and that makes it easy to port and bootstrap the compiler. C is closer to the hardware and makes possible many optimizations that will be difficult to achieve in other languages. It is easy for a compiler written in C to co-exist with other languages, libraries and systems as most of them provide a C interface. It is also easy for others to extend the compiler as C is the Esperanto of system programmers.
Well, one reason will be the issue of bootstrapping the compiler on unsupported architectures. That will usually require the existence of a working compiler for that architecture, which generally means C. I remember trying to compile MIT-scheme from source, and getting really pissed off that it required MIT-scheme to be installed before I could build MIT-scheme.
Incidentally, I'm not sure I agree with your premise... C certainly seems to be the most widely deployed language, but other language compilers (e.g. MIT-scheme) are often implemented in those languages.
It's probably a combination of factors:
C compilers are available for almost every platform, making it easier to build a new compiler for a new language.
History: C is a very popular language, so it makes sense that a lot of projects are in C (no matter the project).
Scheme, specifically, is very unpopular (compared to C).
C has Flex and Yacc, which help with implementing the front end (lexer and parser) of a compiler; if I remember right, their output is limited to C code.
Many compilers today are written in languages other than C (such as Scheme). To make them portable they initially generate C code as a target language.
I think a lot has to do with backends. Someone mentioned Flex and Yacc, but there's also GCC and LLVM that will help you with a lot of other important stuff, like optimizations.

Resources