How could one possibly bootstrap a C compiler (from source)? - c

I was looking into compiler bootstrapping, and I looked at how Golang implements bootstrapping from source, i.e., by building the last version of Golang implemented in C and using the generated executable to compile newer Go releases. This made me curious as to how the same could be done with C. Can you construct a C compiler on a computer with literally nothing present on it? If not, then how can I trust that the binary of the compiler I use doesn't automatically fill the binaries it compiles with spyware?
Related question, since the first C compiler was written in B and B was written in BCPL, what was BCPL written in?

Can you construct a C compiler on a computer with literally nothing present on it?
The main issue is how (in 2021) would you write a program for that computer! And how would you input it?
In the 1970s, computers (like IBM 360 mainframes) had many mechanical switches to enter an initial program. In the 1960s, they had even more, e.g. the IBM 1620.
Today, how would you input that initial program? Did you consider using some Arduino? Even oscilloscopes today contain microprocessors with programs....
Some hobbyists have designed (and spent a lot of money on building), a few years ago, computers made of mechanical relays. These are probably thousands of times slower than the cheapest laptop you could buy (or the micro-controller inside your computer mouse - and your mouse contains some software too).
You could also buy many discrete transistors (e.g. thousands of 2N2222) and make a computer by soldering them.
Even a cheap motherboard (like e.g. the MSI A320M A-PRO) ships today with a firmware program called UEFI or BIOS, rumored to be mostly written in C (several tens of thousands of statements).
In some ways, computer chips are "software" coded in VHDL, SystemC, etc... etc...
However, you can in principle still bootstrap a C compiler in 2021.
Here is a hypothetical tale....
Imagine you have today a laptop running a small Linux distribution on some isolated island (à la Robinson Crusoe), without any Internet connection - but with books (including Modern C and some book about x86-64 assembly and instruction set architecture and many other books in paper form), pencils, papers, food and a lot of time to spend. Imagine that system does not have any C compiler (e.g. because you just removed by mistake the gcc package from some Debian distribution), but just GNU binutils (that is, the linker ld and the assembler gas), some editor in binary form (e.g. GNU emacs or vim), GNU bash and GNU make as binary packages. We assume you are motivated enough to spend months in writing a C compiler. We also assume you have access to man pages in some paper form (notably elf(5) and ld(1)...). We have to assume you can inspect a file in binary form with od(1) and less(1).
Then you could design on paper a subset µC of the C language in EBNF notation. With months of effort, you can write a small assembler program, directly doing syscalls(2) (see the Linux Assembly HOWTO) and interpreting that µC language (since writing an interpreter is easier than writing a compiler; read for example the Dragon Book, Queinnec's Lisp in Small Pieces, and Scott's Programming Language Pragmatics).
Once you have your tiny µC interpreter, you can write a naive µC compiler in µC (after all, Fabrice Bellard managed to write his Tiny C Compiler).
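To give a concrete flavour of that stage, here is a minimal, purely hypothetical sketch (written in C for readability, with a made-up grammar rather than the actual µC): a recursive-descent compiler for integer expressions with + and * that emits x86-64 assembly for gas, exiting with the expression's value through the raw exit syscall, so the output needs only gas and ld.

/* Hypothetical micro-compiler sketch: reads an expression such as "2+3*4"
 * from argv[1] and prints x86-64 assembly (AT&T syntax) that computes it
 * and exits with the result, using only the raw exit syscall. */
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

static char *p;                       /* cursor into the source text */
static void expr(void);               /* forward declaration */

static void skip(void) { while (isspace((unsigned char)*p)) p++; }

/* factor := number | '(' expr ')'   -- result is left in %rax */
static void factor(void) {
    skip();
    if (*p == '(') {
        p++; expr(); skip();
        if (*p++ != ')') { fprintf(stderr, "expected )\n"); exit(1); }
    } else if (isdigit((unsigned char)*p)) {
        printf("    movq $%ld, %%rax\n", strtol(p, &p, 10));
    } else {
        fprintf(stderr, "parse error near '%s'\n", p); exit(1);
    }
}

/* term := factor { '*' factor } */
static void term(void) {
    factor();
    for (skip(); *p == '*'; skip()) {
        p++;
        printf("    pushq %%rax\n");
        factor();
        printf("    popq %%rcx\n    imulq %%rcx, %%rax\n");
    }
}

/* expr := term { '+' term } */
static void expr(void) {
    term();
    for (skip(); *p == '+'; skip()) {
        p++;
        printf("    pushq %%rax\n");
        term();
        printf("    popq %%rcx\n    addq %%rcx, %%rax\n");
    }
}

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s 'expression'\n", argv[0]); return 1; }
    p = argv[1];
    printf(".globl _start\n_start:\n");
    expr();
    printf("    movq %%rax, %%rdi\n"); /* exit status = expression value (low 8 bits) */
    printf("    movq $60, %%rax\n");   /* 60 = SYS_exit on x86-64 Linux */
    printf("    syscall\n");
    return 0;
}

Feed its output to gas and ld and you have a (tiny) compilation pipeline with no pre-existing C compiler on the machine; the real work is growing such a seed toward full C.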
Once you have debugged that µC compiler, you can extend it to accept all the syntax and semantics of C.
Once you have a full C compiler, you could improve it to optimize better, maybe extend it to accept a small subset of C++, and you might also write a static C code analyzer inspired by Frama-C.
PS. Bootstrapping can be generalized a lot - see Pitrat's blog on bootstrapping artificial intelligence (Jacques Pitrat, born in 1934, died in October 2019) and the RefPerSys project.

As Some programmer dude stated in a comment, since C is a portable programming language, you can use a compiler on a different platform to build a cross-compiler that runs on that platform and produces executables for the target platform.
You then compile that same C compiler for the target platform on that host platform so that the result is an executable for the target platform.
Then you copy that compiler binary onto the target machine and from thereon it is self-hosting.
Naturally at some point in early history someone really had to write something in assembler or machine code somewhere. Today, it is no longer a necessity but a "life choice".
As for the "how can I trust that the binary of the compiler I use doesn't automatically fill the binaries it compiles with spyware?" problem: it has been solved. You can use two independent compilers to compile the cross-compiler from the same source base for the same target, and both of the resulting cross-compilers should produce bitwise-identical results for the target executable. Then you know that the result is either free of spyware, or that the two independent compilers you started from would both infect the resulting executable with exactly the same spyware - which is exceedingly unlikely.

You can write a really feeble C compiler in assembly or machine code, then bootstrap from there.
Before programming languages existed you just wrote machine code. That was simply how it was done.
Later came assembler, which is like "easy mode" machine code, and from there evolved high-level languages like Fortran and BCPL. These were decoupled from the machine architecture by having a proper compiler to do the translation.
Today you'd probably write something in C and go from there, anything compiled is suitable, though "compiled" is a loose definition now that LLVM exists and you can just bang out LLVM IR code instead of actual machine code. Rust started in OCaml and is now "self-hosted" on top of LLVM, for example.

Related

Why was C not made a platform independent language?

I recently read the Dragon Book on compiler design. It mentions that the compiler has intermediate code generation as one of its phases, which produces machine-independent code. Why, then, was C not developed as a platform-independent language like Java?
What the Dragon Book is describing is the following process:
Compile the source code into an intermediate machine-independent byte code format
Perform optimizations and analyses on that IR
Translate the IR to the target platform's actual machine code
The upside of this is that if you want to support additional systems, you just need to add a new code generator for step 3 without having to touch steps 1 and 2.
All common C compilers work this way. So if your question is "Why don't C compilers do what the Dragon Book describes?", the answer is: "They do".
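To make that pipeline concrete, here is a small C function together with an informal sketch of the machine-independent form it passes through; the notation is invented for illustration, and real IRs such as GCC's GIMPLE or LLVM IR differ in detail.

/* Step 1: the source function. */
int scale_add(int a, int b, int k)
{
    return a * k + b;
}

/*
 * Step 2 (informally): a machine-independent three-address form that the
 * optimizer works on.  Real compilers use their own IR, but the shape is
 * similar:
 *
 *     t1 = a * k
 *     t2 = t1 + b
 *     ret t2
 *
 * Step 3: only the backend turns this into x86-64, ARM, RISC-V, ... code,
 * which is why adding a new target does not touch steps 1 and 2.
 */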
Now you mentioned Java. What a Java compiler does is the following:
Compile the Java code into Java byte code. As far as the Java compiler is concerned, this is not an intermediate format, but the actual target language.
The end
Now to run this byte code you need a JVM, which interprets the byte code and/or JIT-compiles it. The optimizations and analyses usually happen during JIT-compilation. This is not the process described in the Dragon Book.
From the language implementers' point of view, this doesn't change the effort of supporting a new target system very much. You no longer have to change the compiler, but instead you have to change the JVM: Instead of having to add a new backend to the javac compiler, you instead add a new backend to the JIT-compiler. The effort remains basically the same.
The major difference is for the Java programmers: Instead of compiling the program for every target platform and distributing packages for each platform, you can now compile the code once and give the resulting package to everyone. Now the people running your code need to install a JVM to be able to use the package, so you basically moved the effort from the programmer to the end user, but installing a JVM is something you need to do only once (not for every Java program you want to run).
So instead of "write once, compile everywhere", you now have "compile once, run everywhere".
So why didn't C do the same thing that Java does? Performance. Interpreting byte code is slow (compared to running compiled code) and JIT-compilation leads to increased start-up time.
C was initially designed for a particular use case, which involved a specific machine. Although it was loosely based on the language BCPL, which was implemented by way of a platform-independent virtual machine, the goal for C was to be able to write low-level code, such as an operating system, which meant that it needed to be able to take advantage of specific features of the target machine, particularly its ability to directly address individual bytes. By contrast, BCPL's underlying architecture is resolutely word-oriented.
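A tiny example of the byte-level addressing C was designed to expose (the byte order shown in the comment assumes a little-endian machine):

#include <stdio.h>

int main(void)
{
    unsigned int word = 0x11223344;
    unsigned char *bytes = (unsigned char *)&word;   /* address each byte directly */

    for (size_t i = 0; i < sizeof word; i++)
        printf("byte %zu = 0x%02x\n", i, bytes[i]);  /* 44 33 22 11 on little-endian */
    return 0;
}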
The fact that Bell Labs was able to rapidly reimplement the Unix operating system in their new language (C) certainly contributed to its popularity. (At least, that's why I initially learned it.) To allow for wider dissemination of the language, a version of the compiler was written more closely following the architecture outlined in the Dragon Book, with an initial generation of virtual machine code which is then used to produce code for a target machine. This Portable C Compiler was for many years a reference implementation, and continues to be available.
Other languages contemporary with C, notably Pascal, also used the tactic of targeting a platform-independent virtual machine, and it was once common to refer to virtual machine code as "P-code" because that's what Niklaus Wirth's Pascal project called their target architecture.
Although GCC does not use a virtual machine as such, it does start by generating a low-level, machine-independent internal representation, simplifying the task of porting the compiler to new architectures. And of course the Clang compiler produces LLVM (low-level virtual machine) code, which can be translated into various concrete machine codes, or interpreted directly.
C was originally designed and written as a "Write-Once, Compile-Anywhere" language, which was as close as they could get at the time to a Universal Language.
Processors and Architectures were so radically different, and resources were so small that the idea of a Universal Virtual Machine (like Java has) was just impossible.
The idea that a single code-base could be run through a compiler, and then you have the same software on any target platform was pretty incredible.
The short answer: Because it was not feasible at that time.
The long answer: the Java platform is a language plus a virtual machine. Java code compiles to something called bytecode; the virtual machine then takes this bytecode (which is similar to assembly language) and translates it at runtime into the relevant commands, meaning machine instructions that the local machine understands.
Every architecture has its own instruction set, meaning that an ARM architecture will not be able to understand code compiled for the x86 architecture, for example.
In C, the code is compiled directly to machine instructions; these instructions are then executed by the local machine.
To get behaviour like Java's, you would need some kind of interpreter that reads C and translates it to machine code at runtime; this is no cheap task and was way too much for the computers of the time (C was invented in 1972). Of course, another way this could be implemented is to have the user compile your program before using it, which could be nice but probably involves making your source code visible to the client, which is unwanted.
Hopefully that clarifies things a bit.
Aside from leaving a number of things implementation-defined (in practice this is largely platform/ABI-defined, but strictly speaking doesn't have to be), C is mostly a platform-independent language. Indeed there are implementations of C (such as emscripten) that produce output in a form that can run on any machine platform with the right runtime environment for it. If software written in C makes assumptions about the implementation-defined (or worse, undefined) aspects of the language, then it might fail to work on some implementations/machines, but quite often the cause is more a matter of API/environment/library assumptions (like assuming POSIX, or Windows, or glibcisms) than making nonportable assumptions about the language itself.
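For instance, a few of those implementation-defined points can be probed with a short program like the following; the values in the comments are merely typical, nothing guarantees them:

#include <stdio.h>
#include <limits.h>

int main(void)
{
    printf("sizeof(long)  = %zu\n", sizeof(long)); /* 8 on most 64-bit Unix ABIs, 4 on 64-bit Windows */
    printf("CHAR_MIN      = %d\n", CHAR_MIN);      /* 0 where plain char is unsigned, -128 where signed */
    printf("-7 >> 1       = %d\n", -7 >> 1);       /* right-shifting a negative value is implementation-defined */
    return 0;
}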

Why are compilers written in C/C++ instead of CoffeeScript (JavaScript, Node.js)?

I am exposed to C because of embedded system programming, and I think it's one wonderful language in this field. However, why is it used to write compilers? If the reason why gcc is implemented in C/C++ is that there weren't many good languages at the time, there's no excuse for why clang is taking the same path (using C/C++).
Is it for performance reasons? Mostly interpreted languages are a bit slower compared with compiled languages, but I guess the difference is almost negligible in CoffeeScript (JavaScript), because of Node.js.
From the perspective of developers, I suppose it's much easier to write one compiler using high level languages. Unfortunately, most of compilers out there are written in C/C++. Is it just because of legacy code?
Response to comments:
Bootstrapping is just one way to illustrate that a language is powerful enough to write a compiler in. It shouldn't be the dominant reason why we choose the language to implement the compiler.
I agree with the guess given below, that "most compiler developers would answer because most of compiler related tools (bison, yacc) emit C code". However, neither GCC nor Clang uses a generated parser; they implemented one themselves. This front-end process is independent of the target architecture, and should not be where C/C++'s strength lies.
There's more or less consensus that performance is one key factor. Indeed, even for GCC and Clang, building a reasonably sized C project (the Linux kernel) takes a lot of time. Is that because of the front-end or the back-end? I have to admit that I don't have much experience with compiler back-ends, as our compiler course finished with generating LLVM code.
I am exposed to C because of embedded system programming, and I think it's one wonderful language in this field.
Yes. It's better than Java.
However, why is it used to write compilers?
This question can't be answered without asking the developers. I suspect that the majority of them will tell you that common compiler-writing software (yacc, flex, bison, etc) produce C code.
If the reason for gcc is that there aren't many good languages, there's no excuse for clang.
GCC isn't a programming language, and neither is Clang. They're both implementations of the C programming language.
Is it for performance reasons?
Don't confuse implementation with specification. Speed is an attribute introduced by your compiler and your computer, not by the programming language. GCC happens to produce fairly efficient machine code, which might influence developers to use C as their primary programming language... but in ten years time, it could* be that node.js produces more efficient machine code than GCC. Don't forget, StackOverflow is forever.
* could, but most likely won't. See Ira Baxter's comment below for more info.
Mostly interpreted languages are a bit slower compared with compiled languages, but I guess the difference is almost negligible in CoffeeScript (JavaScript), because of Node.js.
Similarly, interpretation or compilation isn't the choice of the language, but of the implementation of the language. For example, GCC and Clang choose to compile C to machine code. Ch and CINT are two interpreters that translate C code directly to behaviour, rather than machine code. Java was once predominantly translated using interpretation, too, but is now predominantly compiled into JVM bytecode. Javascript seems to be phasing towards predominant compilation, too. Who knows? Maybe you'll see compilers written predominantly in Javascript in ten years time...
From the perspective of developers, I suppose it's much easier to write one compiler using high level languages.
All of these programming languages are technically high level. They're mostly defined in terms of an abstract machine; They're certainly not low level.
Unfortunately, most of compilers out there are written in C/C++.
I don't consider it unfortunate that C++ is used to write software; It's not a bad programming language.
Is it just because of legacy code?
I suppose legacy code might influence the decision of a programmer. In the end though, as I said, you'd have to ask the developers. They might just decide to use C or C++ because C or C++ is their favourite programming language... Why do you speak English?
Compilers are very complex software in general. The front end part is pretty simple (parsing), but the backend part (scheduling, code generation, optimizations, register allocations) involve NP-complete problems (of course compilers try to approximate solutions to these problems). Thus, implementing in C would help compile times. C is also very good at bitwise operators and other low level stuff, which is useful for writing a compiler.
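As a small illustration of that low-level convenience, here is the kind of bit fiddling a backend does constantly, tracking a register set and packing an instruction word; the instruction format is made up, and __builtin_ctz is a GCC/Clang extension:

#include <stdint.h>
#include <stdio.h>

/* Pack a made-up 3-operand instruction: 8-bit opcode, three 5-bit register fields. */
static uint32_t encode(uint32_t opcode, uint32_t rd, uint32_t rs1, uint32_t rs2)
{
    return (opcode << 24) | ((rd & 0x1f) << 16) | ((rs1 & 0x1f) << 11) | ((rs2 & 0x1f) << 6);
}

int main(void)
{
    uint32_t free_regs = 0xffffu;            /* bitmask: one bit per allocatable register */
    int r = __builtin_ctz(free_regs);        /* lowest free register (GCC/Clang builtin) */
    free_regs &= free_regs - 1;              /* clear that bit: the register is now taken */

    printf("allocated r%d, encoded word = 0x%08x\n", r, encode(0x2a, (uint32_t)r, 1, 2));
    return 0;
}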
Note that not all compilers are written in C though. For example, the Haskell compiler GHC is written in Haskell, using the bootstrapping technique.
Javascript is async, which doesn't suit compiler writing.
I see many reasons:
There is no elegant way of handling bit-precise code in Javascript
You can't write binary files easily in Javascript, so the assembler part of the compiler would have to be in a more low-level language (see the byte-writing sketch after this list)
Huge JS codebases are very heavy to load in memory (that's plain text, remember?)
Writing optimizing routines for compilers is heavily CPU-intensive, which is not yet very compatible with Javascript
You wouldn't be able to compile your compiler with it (bootstrap), because you would need a Javascript interpreter behind your compiler. The bootstrap phase wouldn't be "pure":
JS Compiler compiles NodeJS -> NodeJS runs your new Compiler -> new JS Compiler
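For the point above about binary output, that job is a few lines in C; the bytes below are x86-64 machine code for mov eax, 42 followed by ret, the sort of raw output an assembler or back end has to emit:

#include <stdio.h>

int main(void)
{
    /* x86-64 machine code for: mov eax, 42 ; ret */
    const unsigned char code[] = { 0xb8, 0x2a, 0x00, 0x00, 0x00, 0xc3 };

    FILE *f = fopen("out.bin", "wb");
    if (!f) return 1;
    fwrite(code, 1, sizeof code, f);   /* a back end emits raw bytes like these */
    fclose(f);
    return 0;
}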
gcc is implemented primarily in C, but that is not true of all compilers, including some that are quite standard. It is a common pattern for a compiler to be implemented in the language that it compiles. ghc is written largely in Haskell. Recent versions of guile feature a compiler implemented mostly in Scheme.
Nope, coffeescript et al are still much slower than natively-compiled (and optimised) C code. Even if you take the subset of javascript that is able to be optimised (asm.js), it's still twice as slow as native C.
What you hear when people say node.js is just as fast as C code means that it's just as fast as part of an overall system that does other things, like read from disk, wait for data off the network, etc. In these systems the CPU is under-used (especially with today's super-fast CPUs), so the performance problem is not the raw processing capability of the language. Hence, a node.js server is exactly as fast as a C server if they're both stuck waiting for a network call to return data. The type of system written in node.js does a lot of waiting on the network, which is why people use node.js. The type of system written in C does not suit being written in node.js.

How to create a C compiler without a native C compiler

It's a simple question. If a C compiler is needed to compile the C compiler... maybe it was done directly with assembly code? Perhaps the kernel provides a basic tool for converting C to assembler, creating an escalating infrastructure? It may be a silly question too, but I'm really interested in how an operating system is designed (not by me) from zero to interact with the CPU and memory.
Bootstrapping
This comes from the phrase, "pulling yourself up by your bootstraps", if you want to look it up. Bootstrapping is a straightforward process, but it is a lot of work.
Write a C compiler in another language.
Write a C compiler in C.
Use the compiler from step #1 to compile the compiler from step #2.
Some other compilers have to be bootstrapped, not just those for C. For example, GHC must be used to compile itself.
Note that bootstrapping is only necessary if both of the following are true:
You are inventing a new language, so there are no existing compilers.
You want to write the compiler in the language you are compiling.
This has nothing to do with operating systems. If you are designing an operating system, you have enough work ahead of you already. If your new operating system does not have a C compiler yet, you can cross-compile the compiler from a different operating system. This is much less work — maybe a few hours or days, instead of months or years.
It sounds like this is a task for tcc (http://bellard.org/tcc/), the Tiny C Compiler.
It's the smallest C compiler I know of, it can compile a bigger one once you've ported it, and there are numerous guides on how to port it to custom systems.

Are cores (device abstraction level) of OSs written entirely in C? (Like: "UNIX is written in C")

Are the cores of OSs (the device interaction level) really written in C, or does "written in C" mean that only most of the OS is written in C while interaction with devices is written in asm?
Why I ask that:
If the core is written in asm, it can't be cross-platform.
If it is written in C, I can't imagine how that device interaction could be written in C.
OK. And what about I/O exactly? I can't imagine how interaction with an HDD controller, a USB controller, or other real hardware we have to send signals to could be written without (or with only a small amount of) asm.
After all, thanks. I'll have a look at some other sources of web.
PS (Flood) It's a pity we have no OS course at university; despite the fact that MIPT is the Russian twin of MIT, I found that nobody here writes OSs like Minix.
The basic idea in Unix is to write nearly everything in C. So originally, something like 99% of it was C, it was the point, and the main goal was portability.
So to answer your question, interaction with devices is also written in C, yes. Even very low-level stuff is written in C, especially in Unix. But there are still very little parts written in assembly language. On x86 for example, the boot loader of any OS will include some part in assembler. You may have little parts of device drivers written in assembly language. But it is uncommon, and even when it's done it's typically a very small part of even the lowest-level code. How much exactly depends on implementations. For example, NetBSD can run on dozens of different architectures, so they avoid assembly language at all costs; conversely, Apple doesn't care about portability so a decent part of MacOS libc is written in assembly language.
It depends.
An OS for a small, embedded, device with a simple CPU can be written entirely in C (or C++ for that matter).
For more complicated OS-es, such as current Windows or Linux, it is very likely that there are small parts written in assembly. I would expect them most in the task scheduler, because it has some tricky fiddling to do with special CPU registers and it may need to use some special instructions that the compiler normally does not generate.
Device drivers can, almost always, be written entirely in C.
Typically, there's a minimal amount of assembly (since you need some), and the rest is written in C and interfaces with it. You can write functions in assembly and call them from C, so you can encapsulate whatever functionality you want.
With a little implementation-specific trickery, it can be possible to write drivers entirely in C, as it is normally possible to create, say, a volatile int * that will access a memory-mapped device register.
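A minimal sketch of that volatile-pointer trick; the address and bit layout below are invented for illustration, whereas on real hardware they come from the device datasheet or the platform memory map:

#include <stdint.h>

#define UART_BASE   0x10000000u                 /* hypothetical MMIO base address */
#define UART_STATUS (*(volatile uint32_t *)(UART_BASE + 0x0))
#define UART_TXDATA (*(volatile uint32_t *)(UART_BASE + 0x4))
#define TX_READY    (1u << 0)

static void uart_putc(char c)
{
    while (!(UART_STATUS & TX_READY))
        ;                                       /* busy-wait for the device */
    UART_TXDATA = (uint32_t)c;                  /* volatile write reaches the register */
}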
Some operating systems are written in Assembly language, but it is much more common for a kernel to be written in a high level language such as C for portability. Typically (although this is not always the case), a kernel written in a high level language will also have some small bits of assembly language for items that the compiler cannot express and need to be written in Assembler for some reason. Typical examples are:
Certain kernel-mode-only instructions that manipulate the MMU or perform other privileged operations cannot be generated by a standard compiler; that code must be written in assembly language (a sketch follows this list).
Platform-specific performance optimisations. For example, the x64 architecture has an endianness-swapping instruction, and ARM has a barrel shifter (rotating the word being read by N bits) that can be used on load operations.
Assembly 'glue' to interface something that won't play nicely with (for example) C's stack frame structure, data formats or parameter layout conventions.
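As a sketch of the first kind of item, here is a privileged x86-64 register access that no portable C compiler will emit on its own, wrapped in GCC/Clang inline assembly so the surrounding kernel code can stay in C (this only runs in ring 0):

#include <stdint.h>

static inline uint64_t read_cr3(void)          /* page-table base register */
{
    uint64_t value;
    __asm__ __volatile__("mov %%cr3, %0" : "=r"(value));
    return value;
}

static inline void write_cr3(uint64_t value)   /* switch address spaces */
{
    __asm__ __volatile__("mov %0, %%cr3" : : "r"(value) : "memory");
}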
There are also operating systems written in other languages, for example:
http://en.wikipedia.org/wiki/Oberon_%28operating_system%29
http://en.wikipedia.org/wiki/Cosmos_%28operating_system%29
http://en.wikipedia.org/wiki/JX_%28operating_system%29
You don't have to wonder about questions like these. Go grab the linux kernel source and look for yourself. Most of the assembly is stored per architecture in the arch directory. It's really not that surprising that the vast majority is in C. The compiler generates native machine code, after all. It doesn't have to be C either. Our embedded kernel at work is written in C++.
If you are interested in specific pointers, then consider the Linux kernel. The entire software tree is virtually all written in C. The most well-known portion of assembly used in the kernel is entry.S that is specific to each architecture:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/kernel/entry_64.S;h=17be5ec7cbbad332973b6b46a79cdb3db2832f74;hb=HEAD
Additionally, for each architecture, functionality and optimizations that are not possible in C (e.g. spinlocks, atomic operations) may be implemented in assembly:
http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=blob;f=arch/x86/include/asm/spinlock.h;h=3089f70c0c52059e569c8745d1dcca089daee8af;hb=HEAD
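As an illustration, here is roughly what such a spinlock primitive looks like when written with the GCC/Clang __atomic builtins instead of raw assembly; real kernels often still drop to asm to control the exact instructions, pause hints and memory ordering:

typedef struct { int locked; } spinlock_t;      /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l)
{
    /* Atomically set locked=1 and read the old value; loop while it was already held. */
    while (__atomic_exchange_n(&l->locked, 1, __ATOMIC_ACQUIRE))
        ;                                       /* busy-wait */
}

static void spin_unlock(spinlock_t *l)
{
    __atomic_store_n(&l->locked, 0, __ATOMIC_RELEASE);
}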

Languages used for OS development before C

I know that C is the standard programming language for operating system development, but out of curiosity I was wondering what preceded it. What was the main programming language used for operating system development before C?
There were a lot of systems before C was used for Unix (1969...). Here's a sparse timeline. Click on each system for details. Most early systems would be implemented in assembler. A notable exception (not listed in the timeline) was the ahead-of-its-time 1961 B5000 with an O/S written in ALGOL.
Burroughs was one of the first to use something other than assembler for OS development. They chose a dialect of Algol.
In 1965, design of Multics (Project MAC, funded by ARPA) began, and PL/I was chosen to develop the OS. In 1969 Multics was opened for use at MIT, but there were frustrations and Bell Labs withdrew from Project MAC. Ken Thompson, Dennis Ritchie, Doug McIlroy, and J. F. Ossanna continued to seek the holy grail, and Unics (later Unix) development began.
Multics History
Multics Failure?
Unix v. Multics
During the 1970's "Cold War" there was an effort to use the data security and parallel processing features of ALGOL 68 to create Secure/Capability based operating systems:
Cambridge CAP computer - All procedures constituting the operating system were written in ALGOL 68C, although a number of other closely associated protected procedures - such as a paginator - are written in BCPL. c.f. microsoft
Flex machine - The hardware was custom and microprogrammable, with an operating system, (modular) compiler, editor, garbage collector and filing system all written in Algol 68RS. A Linux port of this Algol 68RS compiler can be downloaded from Sourceforge: algol68toc.
/* Interestingly, portions of DRA's algebraically specified abstract machine Ten15 are still available, also from Sourceforge: TenDRA (for minux). Ten15 serves as DRA's intermediate language for compilers, and evolved to support C and Ada. Apparently an attempt was made to port FreeBSD/Unix using the TenDRA C compiler */
ICL VME - S3 programming language was the implementation language of the operating system VME. S3 was based on ALGOL 68 but with data types and operators aligned to those offered by the ICL 2900 Series. This OS is still in use as a Linux VM, and has some 100,000 users.
The Soviet-era computers Эльбрус-1 (Elbrus-1) and Эльбрус-2 were created using the high-level language Эль-76 (AL-76), rather than traditional assembly. Эль-76 resembles Algol-68; the main difference is the dynamic binding of types in Эль-76, supported at the hardware level. Эль-76 is used for application, job control, and system programming; c.f. e2k-spec.
Maybe the US military was doing something similar somewhere. Anyone?
There were many 16-bit Forth systems where the interpreter and (fairly primitive) OS layer were written in Forth.
The original Mac OS was written in a mix of 68k ASM and a slightly extended Pascal.
Ada has been used to write several OS's.
But I'd guess that the dominant language used for OS development prior to C was IBM 360 assembly language.
That depends. The Amiga OS for example was originally written to a certain extent in BCPL, and you would imagine that many ancient operating systems were written in pure assembly language.
CP/M (which is kind of MS-DOS' predecessor) was written in PL/M, but MS-DOS was written in assembly for performance reasons. Here is something on MS-DOS: http://www.patersontech.com/Dos/Byte/InsideDos.htm
(Edited, not sure where I picked up this Fortran garbage.)
Operating systems "want" to be written in assembler. If you're starting from scratch, once you have the interrupt routines done, you can just keep on going and not get around to a high-level language interface.
Furthermore, assemblers like to evolve. Once you've covered the specified instruction set, it's convenient to add alias names for instructions that serve multiple purposes. Next come pseudo-instructions that can alias a couple machine instructions. Then it's nice to have an extensible facility for writing macro subroutines, to generate arbitrary sections of code that look like instructions. (Unlike C macros, this often may allow flow control and script-like programming.) Then, there are scoping rules to ensure identifiers are only used in a particular context.
Bit by bit, languages evolve. C didn't pop out of thin air. It was preceded by a generation or two of languages (Algol, BCPL) that evolved from high-level assemblers. Many platform-specific assembly languages were in fact reasonably nice. IBM still makes a mean assembler. (Of course, before that were not-so-nice assemblers, and before that were punch cards and toggle switches.)
More recently, GNU as has given assembly a bit of a bad name by being relatively primitive. Don't believe the scare tactics, though.
