Bootstrapping a cross-platform compiler - c

Suppose you are designing, and writing a compiler for, a new language called Foo, among whose virtues is intended to be that it's particularly good for implementing compilers. A classic approach is to write the first version of the compiler in C, and use that to write the second version in Foo, after which it becomes self-compiling.
This does mean you have to be careful to keep backup copies of the binary (as opposed to most programs where you only have to keep backup copies of the source); once the language has evolved away from the first version, if you lost all copies of the binary, you would have nothing capable of compiling the current version. So be it.
But suppose it is intended to support both Linux and Windows. As long as it is in fact running on both platforms, it can compile itself on each platform, no problem. Supposing however you lost the binary on one platform (or had reason to suspect it had been compromised by an attacker); now there is a problem. And having to safeguard the binary for every supported platform is at least one more failure point than I'm comfortable with.
One solution would be to make it a cross-compiler, such that the binary on either platform can target both platforms.
This is not quite as easy as it sounds - while there is no problem selecting the binary output format, each platform provides the system API in the form of C header files, which normally only exist on their native platform, e.g. there is no guarantee code compiled against the Windows stdio.h will work on Linux even if compiled into Linux binary format.
Perhaps that problem could be solved by downloading the Linux header files onto a Windows box and using the Windows binary to cross-compile a Linux binary.
Are there any caveats with that solution I'm missing?
Another solution might be to maintain a separate minimum bootstrap compiler in Python, that compiles Foo into portable C, accepting only that subset of the language needed by the main Foo compiler and performing minimum error checking and no optimization, the intent being that the bootstrap compiler will thus remain simple enough that maintaining it across subsequent language versions wouldn't cost very much.
Again, are there any caveats with that solution I'm missing?
What methods have people used to solve this problem in the past?

This is a problem for C compilers themselves. It's typically solved by the use of a cross-compiler, exactly as you suggest.
The process of cross-compiling a compiler is no more difficult than cross-compiling any other project: that is to say, it's trickier than you'd like, but by no means impossible.
Of course, you first need the cross-compiler itself. This probably means some major surgery to your build-configuration system, and you'll need some kind of "sysroot" taken from the target (header, libraries, anything else you'll need to reference in a build).
So, in the end it depends on how your compiler is structured. Either it's easier to re-bootstrap using historical sources, repeating each phase of language compatibility you went through in the first place (you did use source revision control, right?), or it's easier to implement a cross-compiler configuration. I can't tell you which from here.
For many years, the GCC compiler was always written only in standard-compliant C code for exactly this reason: they wanted to be able to bring it up on any OS, given only the native C compiler for that system. Only in 2012 was it decided that C++ is now sufficiently widespread that the compiler itself can be written in it. Even then, they're only permitting themselves a subset of the language. In future, if anybody wants to port GCC to a platform that does not already have C++, they will need to either use a cross-compiler, or first port GCC 4.7 (that last major C-only version) and then move to the latest.
Additionally, the GCC build process does not "trust" the compiler it was built with. When you type "make", it first builds a reduced version of itself, it then uses that the build a full version. Finally, it uses the full version to rebuild another full version, and compares the two binaries. If the two do not match it knows that the original compiler was buggy and introduced some bad code, and the build has failed.

Related

GCC preprocessor directives for Arch Linux

Does GCC (or alternatively Clang) defines any macro when it is compiled for the Arch Linux OS?
I need to check that my software restricts itself from compiling under anything but Arch Linux (the reason behind this is off-topic). I couldn't find any relevant resources on the internet.
Does anyone know how to guarantee through GCC preprocessor directives that my binaries are only compilable under Arch Linux?
Of course I can always
#ifdef __linux__
...
#endif
But this is not precise enough.
Edit: This must be done through C source code and not by any building systems, so, for example, doing this through CMake is completely discarded.
Edit 2: Users faking this behaviour is not a problem since the software is distributed to selected clients and thus, actively trying to "misuse" our source code is "their decision".
Does GCC (or alternatively Clang) defines any macro when it is compiled for the Arch Linux OS?
No. Because there's nothing inherently specific to Arch Linux on the binary level. For what it's worth, when compiling the only things you/the compiler has to care about is the target architecture (i.e. what kind of CPU it's going to run with), data type sizes and alignments and function calling conventions.
Then later on, when it's time to link the compiled translation unit objects into the final binary executable, the runtime libraries around are also of concern. Without taking special precautions you're essentially locking yourself into the specific brand of runtime libraries (glibc vs. e.g. musl; libstdc++ vs. libc++) pulled by the linker.
One can easily sidestep the later problem by linking statically, but that limits the range of system and midlevel APIs available to the program. For example on Linux a purely naively statically linked program wouldn't be able to use graphics acceleration APIs like OpenGL-3.x or Vulkan, since those rely on loading components of the GPU drivers into the process. You can however still use X11 and indirect GLX OpenGL, since those work using wire protocols going over sockets, which are implemented using direct syscalls to the kernel.
And these kernel syscalls are exactly the same on the binary level for each and every Linux kernel of every distribution out there. Although inside of the kernel there's a lot of leeway when it comes to redefining interfaces, when it comes to the interfaces toward the userland (i.e. regular programs) there's this holy, dogmatic, ironclad rule that YOU NEVER BREAK USERLAND! Kernel developers breaking this rule, intentionally or not are chewed out publicly by Linus Torvalds in his in-/famous rants.
The bottom line to this is, that there is no such thing as a "Linux distribution specific identifier on the binary level". At the end of the day, a Linux distribution is just that: A distribution of stuff. That means someone or more decided on a set of files that make up a working Linux system, wrap it up somehow and slap a name on it. That's it. There's nothing inherently specific to "Arch" Linux other than it's called "Arch" and (for the time being) relies on the pacman package manager. That's it. Everything else about "Arch", or any other Linux distribution, is just a matter of happenstance.
If you really want to sort different Linux distributions into certain bins regarding binary compatibility, then you'd have to pigeonhole the combinations of
Minimum required set of supported syscalls. This translates into minimum required kernel version.
What libc variant is being used; and potentially which version, although it's perfectly possible to link against a minimally supported set of functions, that has been around for almost "forever".
What variant of the C++ standard library the distribution decided upon. This actually also inflicts programs that might appear to be purely C, because certain system level libraries (*cough* Mesa *cough*) will internally pull a lot of C++ infrastructure (even compilers), also triggering other "fun" problems¹
I need to check that my software restricts itself from running under anything but Arch Linux (the reason behind this is off-topic). I couldn't find any relevant resources on the internet.
You couldn't find resources on the Internet, because there's nothing specific on the binary level that makes "Arch" Arch. For what it's worth right now, this instant I could create a fork of Arch, change out its choice of default XDG skeleton – so that by default user directories are populated with subdirs called leech, flicks, beats, pics – and call it "l33tz" Linux. For all intents and purposes it's no longer Arch. It does behave significantly different from the default Arch behavior, which would also be of concern to you, if you'd relied on any specific thing, and be it most minute.
Your employer doesn't seem to understand what Linux is or what distinguished distributions from each other.
Hint: It's not the binary compatibility. As a matter of fact, as long as you stay within the boring old realm of boring old glibc + libstdc++ Linux distributions are shockingly compatible with each other. There might be slight differences in where they put libraries other than libc.so, libdl.so and ld-linux[-${arch}].so, but those two usually always can be found under /lib. And once ld-linux[-${arch}].so and libdl.so take over (that means pulling in all libraries loaded at runtime) all the specifics of where shared objects and libraries are to be found are abstracted away by the dynamic linker.
1: like becoming multithreaded only after global constructors were executed and libstdc++ deciding it wants to be singlethreaded, because libpthread wasn't linked into a program that didn't create a single thread on its own. That was a really weird bug I unearthed, but yshui finally understood https://gitlab.freedesktop.org/mesa/mesa/-/issues/3199
You can list the predefined preprocessor macros with
gcc -dM -E - /dev/null
clang -dM -E - /dev/null
None of those indicate what operating system the compiler is running under. So not only you can't tell whether the program is compiled under Arch Linux, you can't even tell whether the program is compiled under Linux. The macros __linux__ and friends indicate that the program is being compiler for Linux. They are defined when cross-compiling from another system to Linux, and not defined when cross-compiling from Linux to another system.
You can artificially make your program more difficult to compile by specifying absolute paths for system headers and relying on non-portable headers (e.g. /usr/include/bits/foo.h). That can make cross-compilation or compilation for anything other than Linux practically impossible without modifying the source code. However, most Linux distributions install headers in the same location, so you're unlikely to pinpoint a specific distribution.
You're very likely asking the wrong question. Instead of asking how to restrict compilation to Arch Linux, start from why you want to restrict compilation to Arch Linux. If the answer is “because the resulting program wouldn't be what I want under another distribution”, then start from there and make sure that the difference results in a compilation error rather than incorrect execution. If the answer to “why” is something else, then you're probably looking for a technical solution to a social problem, and that rarely ends well.
No, it doesn't. And even if it did, it wouldn't stop anyone from compiling the code on an Arch Linux distro and then running it on a different Linux.
If you need to prevent your software from "from running under anything but Arch Linux", you'll need to insert a run-time check. Although, to be honest, I have no idea what that check might consist of, since linux distros are not monolithic products. The actual check would probably have to do with your reasons for imposing the restriction.

Is it possible to build a C standard library that is agnostic to the both the OS and compiler being used?

First off, I know that any such library would need to have at least some interface shimming to interact with system calls or board support packages or whatever (For example; newlib has a limited and well defined interface to the OS/BSP that doesn't assume intimate access to the OS). I can als see where there will be need for something similar for interacting with the compiler (e.g. details of some things left as "implementation defined" by the standard).
However, most libraries I've looked into end up being way more wedded to the OS and compiler than that. The OS assumptions seems to assume access to whatever parts they want from a specific OS environment and the compiler interactions practically in collusion with the compiler implementation it self.
So, the basic questions end up being:
Would it be possible (and practical) to implement a full C standard library that only has a very limited and well defined set of prerequisites and/or interface for being built by any compiler and run on any OS?
Do any such implementation exist? (I'm not asking for a recommendation of one to use, thought I would be interested in examples that I can make my own evaluations of.)
If either of the above answers is "no"; then why? What fundamentally makes it impossible or impractical? Or why hasn't anyone bothered?
Background:
What I'm working on that has lead me to this rabbit hole is an attempt to make a fully versioned, fully hermetic build chain. The goal being that I should be able to build a project without any dependencies on the local environment (beyond access to the needs source control repo and say "a valid posix shell" or "language-agnostic-build-tool-of-choice is installed and runs"). Given those dependencies, the build should do the exact same thing, with the same compiler and libraries, regardless of which versions of which compilers and libraries are or are not installed. Universal, repeatable, byte identical output is the target I'm wanting to move in the direction of.
Causing the compiler-of-choice to be run from a repo isn't too hard, but I've yet to have found a C standard library that "works right out of the box". They all seem to assume some magic undocumented interface between them and the compiler (e.g. __gnuc_va_list), or at best want to do some sort of config-probing of the hosting environment which would be something between useless and counterproductive for my goals.
This looks to be a bottomless rabbit hole that I'm reluctant to start down without first trying to locate alternatives.

Why was C not made a platform independent language?

I recently read the dragon book of compiler design. It mentions that the compiler has intermediate code generation as one of its phases which produces a machine independent code. Then why was C not developed as a platform independent language like java?
What the Dragon Book is describing is the following process:
Compile the source code into an intermediate machine-independent byte code format
Perform optimizations and analyses on that IR
Translate the IR to the target platform's actual machine code
The upside of this is that if you want to support additional systems, you just need to add a new code generator for step 3 without having to touch steps 1 and 2.
All common C compilers work this way. So if your question is "Why don't C compilers do what the Dragon Book describes?", the answer is: "They do".
Now you mentioned Java. What a Java compiler does is the following:
Compile the Java code into Java byte code. As far as the Java compiler is concerned, this is not an intermediate format, but the actual target language.
The end
Now to run this byte code you need a JVM, which interprets the byte code and/or JIT-compiles it. The optimizations and analyses usually happen during JIT-compilation. This is not the process described in the Dragon Book.
From the language implementers' point of view, this doesn't change the effort of supporting a new target system very much. You no longer have to change the compiler, but instead you have to change the JVM: Instead of having to add a new backend to the javac compiler, you instead add a new backend to the JIT-compiler. The effort remains basically the same.
The major difference is for the Java programmers: Instead of compiling the program for every target platform and distributing packages for each platform, you can now compile the code once and give the resulting package to everyone. Now the people running your code need to install an JVM to be able to use the package, so you basically moved the effort from the programmer to the end user, but installing a JVM is something you need to do only once (not for every Java program you want to run).
So instead of "write once, compile everywhere", you now have "compile once, run everywhere".
So why didn't C do the same thing that Java does? Performance. Interpreting byte code is slow (compared to running compiled code) and JIT-compilation leads to increased start-up time.
C was initially designed for a particular use case, which involved a specific machine. Although it was loosely based on the language BCPL, which was implemented by way of a platform-independent virtual machine, the goal for C was to be able to write low-level code, such as an operating system, which meant that it needed to be able to take advantage of specific features of the target machine, particularly its ability to directly address individual bytes. By contrast, BCPL's underlying architecture is resolutely word-oriented.
The fact that Bell Labs was able to rapidly reimplement the Unix Operating System in their new language (C) certainly contributed ti its popularity. (At least, that's why I initially learned it.) To allow for a wider dissemination of the language, a version of the compiler was written more closely following the architecture outlined in the Dragon Book, with an initial generation of virtual machine code which is then used to produce code for a target machine. This Portable C Compiler was for many years a reference implementation, and continues to be available.
Other languages contemporary with C, notably Pascal, also used the tactic of targetting a platform independent vurtual machine, and it was once common to refer to virtual machine code as "P-Code" because that's what Niklaus Wirth's Pascal project called their target architecture.
Although GCC does not use a virtual machine as such, it does start by generating a liw-level machine-independent internal representation, simplifying the task of porting the compiler to new archutectures. And of course the Clang compiler produces LLVM (low-level virtual machine) code, which can be transpiled into various concrete machine codes, or interpreted directly.
C was originally designed and written as a "Write-Once, Compile-Anywhere" language, which was as close as they could get at the time to a Universal Language.
Processors and Architectures were so radically different, and resources were so small that the idea of a Universal Virtual Machine (like Java has) was just impossible.
The idea that a single code-base could be run through a compiler, and then you have the same software on any target platform was pretty incredible.
The short answer: Because it was not feasible at that time.
The long answer: the Java platform is a language + virtual machine, Java code compiles to a something called ByteCode, then the virtual machine can take this byte code (it is similar to assembly language) and translates it to the relevant command at runtime, meaning the machine instruction that will be understood by the local machine.
Every architecture has it's own instruction set, meaning that an ARM architecture will not be able to understand code compiled for x86 architecture for example.
in C, the c code is compiled directly to machine instructions, these instructions are then performed by the local machine.
to get a behaviour like Java, you will need to have some kind of interpretor that reads C and translates it to machine code at runtime, this is no cheap task and was way too much for the computers of the time (c was invented in 1972) of course another way this could be implemented is to have the user compile your program before using it, which could be nice but probably will involve making your source code visible to the client, which is unwanted.
hopefully that clarifies things a bit.
Aside from leaving a number of things implementation-defined (in practice this is largely platform/ABI-defined, but strictly speaking doesn't have to be), C is mostly a platform-independent language. Indeed there are implementations of C (such as emscripten) that produce output in a form that can run on any machine platform with the right runtime environment for it. If software written in C makes assumptions about the implementation-defined (or worse, undefined) aspects of the language, then it might fail to work on some implementations/machines, but quite often the cause is more a matter of API/environment/library assumptions (like assuming POSIX, or Windows, or glibcisms) than making nonportable assumptions about the language itself.

How to cross compile C code for an ia188em chip

I inherited an old project that uses an Innovasic ia188em processor (previously AM188 from AMD). I will likely need to modify the code, and so will need to recompile. Unfortunately, I'm not sure which compiler was used previously (it compiled into a .hex file), and searching through the source code (and in particular the header files) doesn't seem to indicate it either.
I did see one program that could work, but I was wondering if anyone knew of any free programs that might do this. I saw some forums where people said they thought either an old Borland compiler or Bruce's C Compiler may work with 80188 chips (which I assume my chip falls under?), but nothing concrete. I failed to compile with Borland C++ 5 when I tried, though I admit I probably didn't have it set up correctly.
This is for an embedded board (i.e. no OS). I don't program too often, so my compiler knowledge is limited. I mostly just write simple C programs and compile with gcc under linux. Any help is appreciated.
Updated 10/8: I apologize, I was looking at both this code, and the PC side code that talks to the embedded board, and got mixed up. The code for the ia188em (embedded board) is actually C (not C++). Updated title to reflect that. I'm not sure if it makes a huge difference or not.
You'll need a 16 bit "real mode" x86 compiler. If your compiler is a DOS targeted compiler, you will need some means of generating a raw binary rather than than MS-DOS load module (.exe), this may be possible through linker options or may require a non-DOS linker.
Any build scripts or makefiles included with the project code might help you identifier the toolchain used, but the likelihood is that it is no longer available, and you'll need to source "antique software".
When I used to do this sort of thing (1985 -> 1990) I used the intel toolchain, now long obsolete and no longer available from intel. The tools required were
iC-86 - The compiler
link-86 - the linker
loc-86 - the image locater.
There is some information on these tools at a very old site here.
Another method that was used at the time was to process the .exe file produced by a Microsoft standard real mode PC compiler (MS-Pascal was the language used on that project) into an absolutely located image that could be blown into EPROM. The tool used for the conversion was proprietary to the company so I have no idea whether there is an equivalent available

Minimum amount of information necessary to do (dynamic) linking?

This is a question I've come across repeatedly, usually concerning plug ins, but recently I came across it trying to hammer out some build system issues. My concern is primarily for *nix based systems, but I suppose it applies to windows as well.
The question is, what is the minimum amount of information necessary to do dynamic linking? I know linux distributions like Debian have simply an 'i686', which is enough. However, I suppose there is some implicit information here, and I probably won't be able to do dynamic linking of any shared object as long as they're compiled using -march=i686, will I?
So what must be matched correctly for me to be able to load a shared object successfully? I know that for c++ even the compiler (and sometimes version) must match due to name mangling, but I was kind of hoping this wasn't the case for c.
Any thoughts appreciated.
Edit:
Neil's answer made me realize I'm not really talking about dynamic linking, or rather, the question is two-fold,
what's needed for static linking, and
what's needed for dynamic linking
I have higher hopes for the first I guess.
Well at minimum, the code must have been compiled for the same processor family, and you need to know the names of the library and the function. On top of that, you need the same ABI. You should be aware that despite what people think, the C Standard does not specify an ABI and it is entirely possible for two C compilers (or versions of the same compiler) to adhere to the standard, run on the same platform, but have different ABIs.
As for exactly specifying architecture details - I must admit I've never done it. Are you planning on distributing binary libraries on different Linux variants?

Resources