How do compilers written in the language they compile deal with bugs? - c

Say I have a C compiler written in C. That C compiler would then be compiled using an earlier version of that compiler or by itself (by first compiling the source with an older version and then again with the new compiler, I assume).
What if the implementation of that C compiler would have a bug in it? That would mean the C compiler produces binaries that are possibly incorrect. If I were to fix the bug that code would still have to be compiled using the bugged version of the compiler, again resulting in a compiler that possibly does not behave correctly.
If the bug was caught right away I can see how to deal with this by just using an older version of the compiler. But what if the bug goes undetected for a number of iterations? It seems to me almost impossible to track down the source of the bug at that point, as the incorrect behavior could be the result of a bug in any of the previous versions of that compiler that has propagated through the various iterations.

There are several possibilities:
If there are other implementations of the compiler, you can use one of them to recompile your compiler. Most languages have multiple implementations, so this is often an option.
If you know what triggers the buggy behavior, look for the triggering code in your source code, and rewrite it. If this is suboptimal code, you only have to do it temporarily: compile this altered version, then compile the original version with the (now non-buggy) compiler.
Monkey-patch the compiler to "fix" the bug, then recompile. Or do this by hand in the debugger if the buggy situation doesn't arise too many times while compiling.
BTW, your concern is not dissimilar to a more insidious (theoretical) problem that was described by Ken Thompson in his Turing Award lecture, Reflections on Trusting Trust. He described a situation where the compiler has intentional code that detects when it's compiling the OS and inserts security vulnerabilities; it also detects when it's recompiling the compiler, and inserts the detection code.

"Language" is an abstract concept; a compiler is a specific, real, running program that implements the language. A compiler might well be written in the same language it compiles, but it is compiled by a specific version of a specific implementation of that language, so dealing with bugs is specific to that implementation.
It might be compiled with an implementation of the language from some other source, or from an earlier version of itself, or a limited version of itself, for example. It's quite possible for the code to contain workarounds for bugs in that other compiler. But again, such bugs exist in specific implementations, not in the abstract language, and they are dealt with the same way you would deal with compiler bugs in any other program.
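As a rough illustration of what such a workaround can look like in the compiler's own source, here is a minimal sketch. The bug is invented for the example: it pretends that GCC releases older than 4.0 miscompile the usual rotate idiom. Only the __GNUC__ macro is real; HAVE_ROTATE_BUG and rotate_left32 are hypothetical names.

```c
#include <stdint.h>

/* Hypothetical scenario: pretend the bootstrap compiler (GCC older than
 * 4.0) miscompiles the shift-and-or rotate idiom. The bug and the version
 * cut-off are invented for illustration; only __GNUC__ is a real macro. */
#if defined(__GNUC__) && (__GNUC__ < 4)
#define HAVE_ROTATE_BUG 1
#else
#define HAVE_ROTATE_BUG 0
#endif

/* Rotate x left by n bits (n is reduced modulo 32). */
uint32_t rotate_left32(uint32_t x, unsigned n)
{
    n &= 31u;
#if HAVE_ROTATE_BUG
    /* Workaround path: slower, but avoids the construct that the
     * (hypothetically) buggy compiler gets wrong. */
    while (n-- > 0) {
        x = (x << 1) | (x >> 31);
    }
    return x;
#else
    if (n == 0) {
        return x;               /* shifting by 32 would be undefined */
    }
    return (x << n) | (x >> (32u - n));
#endif
}
```

Both paths compute the same result, so once the fixed compiler has been built with the workaround in place, the guard can be dropped again.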

Related

Is using an outdated C compiler a security risk?

We have some build systems in production which no one cares about and these machines run ancient versions of GCC like GCC 3 or GCC 2.
And I can't persuade the management to upgrade to a more recent one: they say, "if it ain't broke, don't fix it".
Since we maintain a very old code base (written in the 80s), this C89 code compiles just fine on these compilers.
But I'm not sure it is a good idea to use this old stuff.
My question is:
Can using an old C compiler compromise the security of the compiled program?
UPDATE:
The same code is built by Visual Studio 2008 for Windows targets, and MSVC doesn't support C99 or C11 yet (I don't know if newer MSVC does), and I can build it on my Linux box using the latest GCC. So if we would just drop in a newer GCC it would probably build just as fine as before.
Actually I would argue the opposite.
There are a number of cases where behaviour is undefined by the C standard but where it is obvious what would happen with a "dumb compiler" on a given platform. Cases like allowing a signed integer to overflow or accessing the same memory through variables of two different types.
Recent versions of gcc (and clang) have started treating these cases as optimisation opportunities not caring if they change how the binary behaves in the "undefined behaviour" condition. This is very bad if your codebase was written by people who treated C like a "portable assembler". As time went on the optimisers have started looking at larger and larger chunks of code when doing these optimisations increasing the chance the binary will end up doing something other than "what a binary built by a dumb compiler" would do.
There are compiler switches to restore "traditional" behaviour (-fwrapv and -fno-strict-aliasing for the two I mentioned above), but first you have to know about them.
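To make the signed overflow case concrete, here is a small sketch. The function names are made up for the example. The first function is the kind of check a "portable assembler" programmer might write; it relies on wrap-around, so an optimiser is allowed to assume the condition is never true unless the code is built with -fwrapv. The second one asks the same question without ever overflowing.

```c
#include <limits.h>
#include <stdbool.h>

/* Legacy-style check: relies on a + b wrapping around on overflow.
 * Signed overflow is undefined behaviour, so a modern optimiser may
 * assume it never happens and remove the test (unless -fwrapv is used). */
bool add_overflows_legacy(int a, int b)
{
    return b > 0 && a + b < a;
}

/* Well-defined check: compares against the limits before adding, so no
 * overflow ever takes place regardless of compiler or flags. */
bool add_overflows(int a, int b)
{
    if (b > 0) {
        return a > INT_MAX - b;
    }
    if (b < 0) {
        return a < INT_MIN - b;
    }
    return false;
}
```

Code written like the second function behaves the same on an old and a new compiler, which is the property you want before switching toolchains.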
While in principle a compiler bug could turn compliant code into a security hole, I would consider the risk of this to be negligible in the grand scheme of things.
There are risks in both courses of action.
Older compilers have the advantage of maturity, and whatever was broken in them has probably (but there's no guarantee) been worked around successfully.
In this case, a new compiler is a potential source of new bugs.
On the other hand, newer compilers come with additional tooling:
GCC and Clang both now feature sanitizers which can instrument the runtime to detect undefined behaviors of various sorts (Chandler Carruth, of the Google Compiler team, claimed last year that he expects them to have reached full coverage)
Clang, at least, features hardening, for example Control Flow Integrity is about detecting hi-jacks of control flow, there are also hardening implements to protect against stack smashing attacks (by separating the control-flow part of the stack from the data part); hardening features are generally low overhead (< 1% CPU overhead)
Clang/LLVM is also working on libFuzzer, a tool to create instrumented fuzzing unit-tests that explore the input space of the function under test smartly (by tweaking the input to take not-as-yet explored execution paths)
Instrumenting your binary with the sanitizers (Address Sanitizer, Memory Sanitizer or Undefined Behavior Sanitizer) and then fuzzing it (using American Fuzzy Lop for example) has uncovered vulnerabilities in a number of high-profile software projects, see for example this LWN.net article.
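For a feel of what a fuzz target looks like, here is a minimal sketch of a libFuzzer harness. LLVMFuzzerTestOneInput is the entry point libFuzzer looks for; parse_name is a hypothetical function under test, invented for this example. With a recent Clang this would typically be built with something like clang -g -fsanitize=fuzzer,address harness.c, which links in libFuzzer's own main.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical function under test: the first byte of the input is a
 * length, followed by that many bytes of name. Copies the name into a
 * 16-byte buffer. Returns 0 on success, -1 on malformed input. */
int parse_name(const uint8_t *data, size_t size, char out[16])
{
    size_t len;

    if (size < 1) {
        return -1;
    }
    len = data[0];
    if (len > size - 1) {
        return -1;              /* declared length exceeds the input */
    }
    if (len >= 16) {
        return -1;              /* would not fit with the terminator */
    }
    memcpy(out, data + 1, len);
    out[len] = '\0';
    return 0;
}

/* libFuzzer calls this repeatedly with generated inputs. Any sanitizer
 * report or crash inside parse_name is recorded as a finding. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    char name[16];

    (void)parse_name(data, size, name);
    return 0;
}
```

If either bounds check in parse_name were missing, Address Sanitizer would be expected to flag the out-of-bounds access as soon as the fuzzer produced a suitable input.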
Those new tools, and all future tools, are inaccessible to you unless you upgrade your compiler.
By staying on an underpowered compiler, you are putting your head in the sand and crossing fingers that no vulnerability is found. If your product is a high-value target, I urge you to reconsider.
Note: even if you do NOT upgrade the production compiler, you might want to use a new compiler to check for vulnerability anyway; do be aware that since those are different compilers, the guarantees are lessened though.
Your compiled code contains bugs that could be exploited. The bugs come from three sources: Bugs in your source code, bugs in the compiler and libraries, and undefined behaviour in your source code that the compiler turns into a bug. (Undefined behaviour is a bug, but not a bug in the compiled code yet. As an example, i = i++; in C or C++ is a bug, but in your compiled code it may increase i by 1 and be Ok, or set i to some junk and be a bug).
The rate of bugs in your compiled code is presumably low due to testing and to fixing bugs due to customer bug reports. So there may have been a large number of bugs initially, but that has gone down.
If you upgrade to a newer compiler, you may lose bugs that were introduced by compiler bugs. But these bugs would all be bugs that to your knowledge nobody found and nobody exploited. But the new compiler may have bugs on its own, and importantly newer compilers have a stronger tendency to turn undefined behaviour into bugs in the compiled code.
So you will have a whole lot of new bugs in your compiled code; all bugs that hackers could find and exploit. And unless you do a whole lot of testing, and leave your code with customers to find bugs for a long time, it will be less secure.
If it aint broke, don't fix it
Your boss sounds right in saying this. However, the more important factor is safeguarding inputs and outputs and guarding against buffer overflows. Lack of those safeguards is invariably the weakest link in the chain from that standpoint, regardless of the compiler used.
However, if the code base is ancient, and work was put in place to mitigate the weaknesses of the K&R C used, such as lack of type safety, insecure gets, etc, weigh up the question "Would upgrading the compiler to more modern C99/C11 standards break everything?"
Provided that there's a clear path to migrate to the newer C standards, which could induce side effects, it might be best to attempt a fork of the old codebase, assess it, put in extra type checks and sanity checks, and determine whether upgrading to the newer compiler has any effect on input/output datasets.
Then you can show it to your boss, "Here's the updated code base, refactored, more in line with industry accepted C99/C11 standards...".
That's the gamble that would have to be weighed up very carefully. Resistance to change might show in that environment, and management may refuse to touch the newer stuff.
EDIT
Having sat back for a few minutes, I realized this: K&R-era code could be running on a 16-bit platform, so upgrading to a more modern compiler could actually break the code base. In terms of architecture, 32-bit code would be generated, and this could have odd side effects on the structures used for input/output datasets. That is another huge factor to weigh up carefully.
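One way to take the structure layout out of the equation is to read and write each field explicitly instead of dumping the struct as raw bytes. The following is only a sketch under assumptions: the record layout, the little-endian byte order and all names are hypothetical, not taken from the code base in question.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record as it appears in an input/output dataset:
 * a 16-bit id followed by a 32-bit value, both little-endian. */
struct record {
    uint16_t id;
    uint32_t value;
};

enum { RECORD_WIRE_SIZE = 6 };

/* Write the record byte by byte, so the result does not depend on the
 * width of int, struct padding or the host byte order. */
void record_encode(const struct record *r, uint8_t out[RECORD_WIRE_SIZE])
{
    out[0] = (uint8_t)(r->id & 0xFFu);
    out[1] = (uint8_t)(r->id >> 8);
    out[2] = (uint8_t)(r->value & 0xFFu);
    out[3] = (uint8_t)((r->value >> 8) & 0xFFu);
    out[4] = (uint8_t)((r->value >> 16) & 0xFFu);
    out[5] = (uint8_t)((r->value >> 24) & 0xFFu);
}

/* Rebuild the record from the same six bytes. */
void record_decode(const uint8_t in[RECORD_WIRE_SIZE], struct record *r)
{
    r->id = (uint16_t)(in[0] | ((uint16_t)in[1] << 8));
    r->value = (uint32_t)in[2]
             | ((uint32_t)in[3] << 8)
             | ((uint32_t)in[4] << 16)
             | ((uint32_t)in[5] << 24);
}
```

On most 32-bit targets sizeof(struct record) would be 8 because of padding, while the encoded form stays at 6 bytes, which is exactly the kind of difference that breaks old datasets when structs are written out directly.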
Also, since OP has mentioned using Visual Studio 2008 to build the codebase, using gcc could induce bringing into the environment either MinGW or Cygwin, that could have an impact change on the environment, unless, the target is for Linux, then it would be worth a shot, may have to include additional switches to the compiler to minimize noise on old K&R code base, the other important thing is to carry out a lot of testing to ensure no functionality is broken, may turn out to be a painful exercise.
There is a security risk where a malicious developer can sneak a back-door through a compiler bug. Depending on the quantity of known bugs in the compiler in use, the backdoor may look more or less inconspicuous (in any case, the point is that the code is correct, even if convoluted, at the source level. Source code reviews and tests using a non-buggy compiler will not find the backdoor, because the backdoor does not exist in these conditions). For extra deniability points, the malicious developer may also look for previously-unknown compiler bugs on their own. Again, the quality of the camouflage will depend on the choice of compiler bugs found.
This attack is illustrated on the program sudo in this article. bcrypt wrote a great follow-up for Javascript minifiers.
Apart from this concern, the evolution of C compilers has been to exploit undefined behavior more and more and more aggressively, so old C code that was written in good faith would actually be more secure compiled with a C compiler from the time, or compiled at -O0 (but some new program-breaking UB-exploiting optimizations are introduced in new versions of compilers even at -O0).
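A typical example of such good-faith code is reading the bits of a float through a pointer cast, which breaks the aliasing rules. The sketch below shows the old idiom in a comment only, and a replacement using memcpy that is well defined on any compiler. The name float_bits is made up, and the code assumes that float is 32 bits wide.

```c
#include <stdint.h>
#include <string.h>

/* Legacy idiom, shown as a comment because it violates the aliasing rules:
 *
 *     uint32_t bits = *(uint32_t *)&f;
 *
 * An old compiler reads the bytes as expected; a newer optimiser may
 * reorder or drop the access unless built with -fno-strict-aliasing. */

/* Well-defined alternative: copy the object representation.
 * Assumes sizeof(float) == sizeof(uint32_t), which holds on mainstream
 * platforms but is not guaranteed by the C standard. */
uint32_t float_bits(float f)
{
    uint32_t bits;

    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

Compilers generally turn the memcpy into a single register move, so the safe version usually costs nothing at run time.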
Can using an old C compiler compromise the security of the compiled program?
Of course it can, if the old compiler contains known bugs that you know would affect your program.
The question is, does it? To know for sure, you would have to read the whole change log from your version to present date and check every single bug fixed over the years.
If you find no evidence of compiler bugs that would affect your program, updating GCC just for the sake of it seems a bit paranoid. You would have to keep in mind that newer versions might contain new bugs, that are not yet discovered. Lots of changes were made recently with GCC 5 and C11 support.
That being said, code written in the 80s is most likely already filled to the brim with security holes and reliance on poorly-defined behavior, no matter the compiler. We're talking of pre-standard C here.
Older compilers may not have protection against known hacking attacks. Stack smashing protection, for example, was not introduced until GCC 4.1. So yeah, code compiled with older compilers may be vulnerable in ways that newer compilers protect against.
Another aspect to worry about is the development of new code.
Older compilers may have different behavior for some language features than what is standardized and expected by the programmer. This mismatch can slow development and introduce subtle bugs that can be exploited.
Older compilers offer fewer features (including language features!) and don't optimize as well. Programmers will hack their way around these deficiencies — e.g. by reimplementing missing features, or writing clever code that is obscure but runs faster — creating new opportunities for the creation of subtle bugs.
Nope
The reason is simple, old compiler may have old bugs and exploits, but the new compiler will have new bugs and exploits.
You're not "fixing" any bugs by upgrading to a new compiler. You're switching old bugs and exploits for new bugs and exploits.
Well there is a higher probability that any bugs in the old compiler are well known and documented as opposed to using a new compiler so actions can be taken to avoid those bugs by coding around them. So in a way that is not enough as argument for upgrading. We have the same discussions where I work, we use GCC 4.6.1 on a code base for embedded software and there is a great reluctance (among management) to upgrade to the latest compiler because of fear for new, undocumented bugs.
Your question falls into two parts:
Explicit: “Is there a greater risk in using the older compiler” (more or less as in your title)
Implicit: “How can I persuade management to upgrade”
Perhaps you can answer both by finding an exploitable flaw in your existing code base and showing that a newer compiler would have detected it. Of course your management may say “you found that with the old compiler”, but you can point out that it cost considerable effort. Or you run it through the new compiler to find the vulnerability, then exploit it, if you are able/allowed to compile the code with the new compiler. You may want help from a friendly hacker, but that depends on trusting them and being able/allowed to show them the code (and use the new compiler).
But if your system is not exposed to hackers, you should perhaps be more interested in whether a compiler upgrade would increase your effectiveness: MSVS 2013 Code Analysis quite often finds potential bugs much sooner than MSVS 2010, and it more or less supports C99/C11 – not sure if it does officially, but declarations can follow statements and you can declare variables in for-loops.

How to cross compile C code for an ia188em chip

I inherited an old project that uses an Innovasic ia188em processor (previously AM188 from AMD). I will likely need to modify the code, and so will need to recompile. Unfortunately, I'm not sure which compiler was used previously (it compiled into a .hex file), and searching through the source code (and in particular the header files) doesn't seem to indicate it either.
I did see one program that could work, but I was wondering if anyone knew of any free programs that might do this. I saw some forums where people said they thought either an old Borland compiler or Bruce's C Compiler may work with 80188 chips (which I assume my chip falls under?), but nothing concrete. I failed to compile with Borland C++ 5 when I tried, though I admit I probably didn't have it set up correctly.
This is for an embedded board (i.e. no OS). I don't program too often, so my compiler knowledge is limited. I mostly just write simple C programs and compile with gcc under linux. Any help is appreciated.
Updated 10/8: I apologize, I was looking at both this code, and the PC side code that talks to the embedded board, and got mixed up. The code for the ia188em (embedded board) is actually C (not C++). Updated title to reflect that. I'm not sure if it makes a huge difference or not.
You'll need a 16 bit "real mode" x86 compiler. If your compiler is a DOS targeted compiler, you will need some means of generating a raw binary rather than an MS-DOS load module (.exe); this may be possible through linker options or may require a non-DOS linker.
Any build scripts or makefiles included with the project code might help you identify the toolchain used, but the likelihood is that it is no longer available, and you'll need to source "antique software".
When I used to do this sort of thing (1985 -> 1990) I used the Intel toolchain, now long obsolete and no longer available from Intel. The tools required were
iC-86 - The compiler
link-86 - the linker
loc-86 - the image locater.
There is some information on these tools at a very old site here.
Another method that was used at the time was to process the .exe file produced by a Microsoft standard real mode PC compiler (MS-Pascal was the language used on that project) into an absolutely located image that could be blown into EPROM. The tool used for the conversion was proprietary to the company, so I have no idea whether there is an equivalent available.

Why compilers are written in C/C++ instead of using CoffeeScript (JavaScript, Node JS)?

I am exposed to C because of embedded system programming, and I think it's one wonderful language in this field. However, why is it used to write compilers? If the reason why gcc is implemented in C/C++ is that there weren't many good languages at that time, there's no excuse for why clang is taking the same path (using C/C++).
Is it for performance reasons? Mostly interpreted languages are a bit slower compared with compiled languages, but I guess the difference is almost negligible in CoffeeScript (JavaScript), because of Node.js.
From the perspective of developers, I suppose it's much easier to write a compiler using high level languages. Unfortunately, most compilers out there are written in C/C++. Is it just because of legacy code?
Response to comments:
Bootstrapping is just one way to illustrate that this language is powerful enough to write a compiler. It shouldn't be the dominant reason why we choose the language to implement the compiler.
I agree with the guess given below, that "most compiler developers would answer because most of compiler related tools (bison, yacc) emit C code". However, neither GCC nor Clang uses a generated parser; they implemented one themselves. This front-end process is independent of the target architecture, and should not be C/C++'s strength.
There's more or less consensus that performance is one key factor. Indeed, even for GCC and Clang, building a reasonably sized C project (Linux kernel) takes a lot of time. Is it because of the front-end or the back-end? I have to admit that I don't have much experience with the back-end of compilers, as we finished the compiler course with generated LLVM code.
I am exposed to C because of embedded system programming, and I think
it's one wonderful language in this field.
Yes. It's better than Java.
However, why is it used to write compilers?
This question can't be answered without asking the developers. I suspect that the majority of them will tell you that common compiler-writing software (yacc, flex, bison, etc) produce C code.
If the reason for gcc is that there aren't many good languages,
there's no excuse for clang.
GCC isn't a programming language, and neither is Clang. They're both implementations of the C programming language.
Is it for performance reasons?
Don't confuse implementation with specification. Speed is an attribute introduced by your compiler and your computer, not by the programming language. GCC happens to produce fairly efficient machine code, which might influence developers to use C as their primary programming language... but in ten years time, it could* be that node.js produces more efficient machine code than GCC. Don't forget, StackOverflow is forever.
* could, but most likely won't. See Ira Baxter's comment below for more info.
Mostly interpreted languages are a bit slower compared with compiled
languages, but I guess the difference is almost negligible in
CoffeeScript (JavaScript), because of Node.js.
Similarly, interpretation or compilation isn't the choice of the language, but of the implementation of the language. For example, GCC and Clang choose to compile C to machine code. Ch and CINT are two interpreters that translate C code directly to behaviour, rather than machine code. Java was once predominantly translated using interpretation, too, but is now predominantly compiled into JVM bytecode. Javascript seems to be phasing towards predominant compilation, too. Who knows? Maybe you'll see compilers written predominantly in Javascript in ten years time...
From the perspective of developers, I suppose it's much easier to
write one compiler using high level languages.
All of these programming languages are technically high level. They're mostly defined in terms of an abstract machine; They're certainly not low level.
Unfortunately, most of compilers out there are written in C/C++.
I don't consider it unfortunate that C++ is used to write software; It's not a bad programming language.
Is it just because of legacy code?
I suppose legacy code might influence the decision of a programmer. In the end though, as I said, you'd have to ask the developers. They might just decide to use C or C++ because C or C++ is their favourite programming language... Why do you speak English?
Compilers are very complex software in general. The front end part is pretty simple (parsing), but the backend part (scheduling, code generation, optimizations, register allocations) involve NP-complete problems (of course compilers try to approximate solutions to these problems). Thus, implementing in C would help compile times. C is also very good at bitwise operators and other low level stuff, which is useful for writing a compiler.
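As a small illustration of the bitwise point, a back end often tracks which machine registers are in use with one bit per register. The sketch below is not taken from any real compiler; the type regset and the function names are hypothetical, and it assumes at most 32 registers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register set: bit i is set when register i is live. */
typedef uint32_t regset;

/* Mark a register as live. */
regset regset_add(regset s, unsigned reg)
{
    assert(reg < 32u);
    return s | ((regset)1 << reg);
}

/* Mark a register as free again. */
regset regset_remove(regset s, unsigned reg)
{
    assert(reg < 32u);
    return s & ~((regset)1 << reg);
}

/* Return 1 if the register is live, 0 otherwise. */
int regset_contains(regset s, unsigned reg)
{
    assert(reg < 32u);
    return (int)((s >> reg) & 1u);
}

/* Return the lowest-numbered free register among the first nregs,
 * or -1 if all of them are live (the caller would then have to spill). */
int regset_first_free(regset live, unsigned nregs)
{
    unsigned reg;

    assert(nregs <= 32u);
    for (reg = 0; reg < nregs; reg++) {
        if (!regset_contains(live, reg)) {
            return (int)reg;
        }
    }
    return -1;
}
```

Each operation is a couple of machine instructions, and set union or intersection across basic blocks is a single OR or AND, which is part of why this representation is popular in data-flow analyses.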
Note that not all compilers are written in C though. For example, the GHC Haskell compiler is written in Haskell using the bootstrapping technique.
Javascript is async, which doesn't suit compiler writing.
I see many reasons:
There is no elegant way of handling bit-precise code in Javascript
You can't write binary files easily in Javascript, so the assembler part of the compiler would have to be in a more low-level language
Huge JS codebases are very heavy to load in memory (that's plain text, remember?)
Optimizing routines for compilers are heavily CPU-intensive, which is not yet very compatible with Javascript
You wouldn't be able to compile your compiler with it (bootstrap), because you need a Javascript interpreter behind your compiler. The bootstrap phase wouldn't be "pure":
JS Compiler compiles NodeJS -> NodeJS runs your new Compiler -> new JS Compiler
gcc is implemented primarily in C, but that is not true of all compilers, including some that are quite standard. It is a common pattern for a compiler to be implemented in the language that it compiles. ghc is written largely in Haskell. Recent versions of guile feature a compiler implemented mostly in Scheme.
Nope, coffeescript et al are still much slower than natively-compiled (and optimised) C code. Even if you take the subset of javascript that is able to be optimised (asm.js), it's still twice as slow as native C.
What people mean when they say node.js is just as fast as C code is that it's just as fast as part of an overall system that does other things like read from disk, wait for data off the network, etc. In these systems the CPU is underused (especially with today's superfast CPUs), so the performance problem is not the raw processing capability of the language. Hence, a node.js server is exactly as fast as a C server if they're both stuck waiting for a network call to return data. The type of system written in node.js does a lot of waiting for network, which is why people use node.js. The type of system written in C does not suit being written in node.js.

Bootstrapping a cross-platform compiler

Suppose you are designing, and writing a compiler for, a new language called Foo, among whose virtues is intended to be that it's particularly good for implementing compilers. A classic approach is to write the first version of the compiler in C, and use that to write the second version in Foo, after which it becomes self-compiling.
This does mean you have to be careful to keep backup copies of the binary (as opposed to most programs where you only have to keep backup copies of the source); once the language has evolved away from the first version, if you lost all copies of the binary, you would have nothing capable of compiling the current version. So be it.
But suppose it is intended to support both Linux and Windows. As long as it is in fact running on both platforms, it can compile itself on each platform, no problem. Supposing however you lost the binary on one platform (or had reason to suspect it had been compromised by an attacker); now there is a problem. And having to safeguard the binary for every supported platform is at least one more failure point than I'm comfortable with.
One solution would be to make it a cross-compiler, such that the binary on either platform can target both platforms.
This is not quite as easy as it sounds - while there is no problem selecting the binary output format, each platform provides the system API in the form of C header files, which normally only exist on their native platform, e.g. there is no guarantee code compiled against the Windows stdio.h will work on Linux even if compiled into Linux binary format.
Perhaps that problem could be solved by downloading the Linux header files onto a Windows box and using the Windows binary to cross-compile a Linux binary.
Are there any caveats with that solution I'm missing?
Another solution might be to maintain a separate minimum bootstrap compiler in Python, that compiles Foo into portable C, accepting only that subset of the language needed by the main Foo compiler and performing minimum error checking and no optimization, the intent being that the bootstrap compiler will thus remain simple enough that maintaining it across subsequent language versions wouldn't cost very much.
Again, are there any caveats with that solution I'm missing?
What methods have people used to solve this problem in the past?
This is a problem for C compilers themselves. It's typically solved by the use of a cross-compiler, exactly as you suggest.
The process of cross-compiling a compiler is no more difficult than cross-compiling any other project: that is to say, it's trickier than you'd like, but by no means impossible.
Of course, you first need the cross-compiler itself. This probably means some major surgery to your build-configuration system, and you'll need some kind of "sysroot" taken from the target (header, libraries, anything else you'll need to reference in a build).
So, in the end it depends on how your compiler is structured. Either it's easier to re-bootstrap using historical sources, repeating each phase of language compatibility you went through in the first place (you did use source revision control, right?), or it's easier to implement a cross-compiler configuration. I can't tell you which from here.
For many years, the GCC compiler was always written only in standard-compliant C code for exactly this reason: they wanted to be able to bring it up on any OS, given only the native C compiler for that system. Only in 2012 was it decided that C++ is now sufficiently widespread that the compiler itself can be written in it. Even then, they're only permitting themselves a subset of the language. In future, if anybody wants to port GCC to a platform that does not already have C++, they will need to either use a cross-compiler, or first port GCC 4.7 (the last major C-only version) and then move to the latest.
Additionally, the GCC build process does not "trust" the compiler it was built with. When you type "make", it first builds a reduced version of itself, then uses that to build a full version. Finally, it uses the full version to rebuild another full version, and compares the two binaries. If the two do not match, it knows that the original compiler was buggy and introduced some bad code, and the build has failed.
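The comparison step itself is conceptually simple. GCC does it inside its own build system, so the following is only a sketch of the idea, not GCC's implementation: a byte-for-byte comparison of two output files, with files_identical as a hypothetical helper name.

```c
#include <stdio.h>

/* Compare two files byte by byte.
 * Returns 1 if they are identical, 0 if they differ, and -1 if either
 * file cannot be opened. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    FILE *b = fopen(path_b, "rb");
    int result = 1;

    if (a == NULL || b == NULL) {
        result = -1;
    } else {
        int ca, cb;

        do {
            ca = fgetc(a);
            cb = fgetc(b);
            if (ca != cb) {     /* differing byte, or one file is shorter */
                result = 0;
                break;
            }
        } while (ca != EOF);
    }
    if (a != NULL) {
        fclose(a);
    }
    if (b != NULL) {
        fclose(b);
    }
    return result;
}
```

A build script for a self-hosting compiler could call something like this on the outputs of the last two stages and stop the build when the result is not 1.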

How create a C compiler without a C native compiler

It's a simple question. If a C compiler is needed to compile the C compiler, how is the first one created? Maybe directly with assembly code? Perhaps the kernel provides a basic tool for converting C to assembler, creating an escalating infrastructure? It may be a stupid question, but I'm really interested in how an operating system is designed (not by me) from zero to interact with the CPU and memory.
Bootstrapping
This comes from the phrase, "pulling yourself up by your bootstraps", if you want to look it up. Bootstrapping is a straightforward process, but it is a lot of work.
Write a C compiler in another language.
Write a C compiler in C.
Use the compiler from step #1 to compile the compiler from step #2.
Some other compilers have to be bootstrapped, not just those for C. For example, GHC must be used to compile itself.
Note that bootstrapping is only necessary if both of the following are true:
You are inventing a new language, so there are no existing compilers.
You want to write the compiler in the language you are compiling.
This has nothing to do with operating systems. If you are designing an operating system, you have enough work ahead of you already. If your new operating system does not have a C compiler yet, you can cross-compile the compiler from a different operating system. This is much less work — maybe a few hours or days, instead of months or years.
It sounds like this is a task for tcc (http://bellard.org/tcc/), the Tiny C Compiler.
It's the smallest C compiler I know of, it can compile a bigger one once you've ported it, and there are numerous guides on how to port it to custom systems.

Resources