Related
I have read this, saying that
For example, it is common for programs that are written primarily in C to contain portions that are in an assembly language for optimization of processor efficiency.
I have never seen a program written primarily in C that contains assembly code too, at least not directly as source code. Only, their example with the Linux kernel.
Is this statement true and if so, how could it possibly optimize processor efficiency?
Aren't C code just translated into assembly code by the compiler?
No, it's not true. I'd estimate that less than 1% of C programmers even know how to program in assembly, and the need to use it is very rare. It's generally only needed for very special applications, such as some parts of an OS kernel or programming embedded systems, because they need to perform machine operations that don't have corresponding C code (such as directly manipulating CPU registers). In earlier days some programmers would use it for performance-critical sections of code, but compiler optimizations have improved significantly, and CPUs have gotten faster, so this is rarely needed now. It might still be used in the built-in libraries, so that functions like strcpy() will be as fast as possible. But application programmers almost never have to resort to assembly.
Aren't C code just translated into assembly code by the compiler?
Yes, but...
There are situations where you may want to access a specific register or other platform-specific location, and Standard C doesn't provide good ways to do that. If you want to look at a status word or load/read a data register directly, then you often need to drop down to the assembler level.
Also, even in this age of very smart optimizing compilers, it's still possible for a human assembly programmer to write assembly code that will out-perform code generated by the compiler. If you need to wring every possible cycle out of your code, you may need to "go manual" for a couple of routines.
I've heard of the idea of bootstrapping a language, that is, writing a compiler/interpreter for the language in itself. I was wondering how this could be accomplished and looked around a bit, and saw someone say that it could only be done by either
writing an initial compiler in a different language.
hand-coding an initial compiler in Assembly, which seems like a special case of the first
To me, neither of these seem to actually be bootstrapping a language in the sense that they both require outside support. Is there a way to actually write a compiler in its own language?
Is there a way to actually write a compiler in its own language?
You have to have some existing language to write your new compiler in. If you were writing a new, say, C++ compiler, you would just write it in C++ and compile it with an existing compiler first. On the other hand, if you were creating a compiler for a new language, let's call it Yazzleof, you would need to write the new compiler in another language first. Generally, this would be another programming language, but it doesn't have to be. It can be assembly, or if necessary, machine code.
If you were going to bootstrap a compiler for Yazzleof, you generally wouldn't write a compiler for the full language initially. Instead you would write a compiler for Yazzle-lite, the smallest possible subset of the Yazzleof (well, a pretty small subset at least). Then in Yazzle-lite, you would write a compiler for the full language. (Obviously this can occur iteratively instead of in one jump.) Because Yazzle-lite is a proper subset of Yazzleof, you now have a compiler which can compile itself.
There is a really good writeup about bootstrapping a compiler from the lowest possible level (which on a modern machine is basically a hex editor), titled Bootstrapping a simple compiler from nothing. It can be found at https://web.archive.org/web/20061108010907/http://www.rano.org/bcompiler.html.
The explanation you've read is correct. There's a discussion of this in Compilers: Principles, Techniques, and Tools (the Dragon Book):
Write a compiler C1 for language X in language Y
Use the compiler C1 to write compiler C2 for language X in language X
Now C2 is a fully self hosting environment.
The way I've heard of is to write an extremely limited compiler in another language, then use that to compile a more complicated version, written in the new language. This second version can then be used to compile itself, and the next version. Each time it is compiled the last version is used.
This is the definition of bootstrapping:
the process of a simple system activating a more complicated system that serves the same purpose.
EDIT: The Wikipedia article on compiler bootstrapping covers the concept better than me.
A super interesting discussion of this is in Unix co-creator Ken Thompson's Turing Award lecture.
He starts off with:
What I am about to describe is one of many "chicken and egg" problems that arise when compilers are written in their own language. In this ease, I will use a specific example from the C compiler.
and proceeds to show how he wrote a version of the Unix C compiler that would always allow him to log in without a password, because the C compiler would recognize the login program and add in special code.
The second pattern is aimed at the C compiler. The replacement code is a Stage I self-reproducing program that inserts both Trojan horses into the compiler. This requires a learning phase as in the Stage II example. First we compile the modified source with the normal C compiler to produce a bugged binary. We install this binary as the official C. We can now remove the bugs from the source of the compiler and the new binary will reinsert the bugs whenever it is compiled. Of course, the login command will remain bugged with no trace in source anywhere.
Check out podcast Software Engineering Radio episode 61 (2007-07-06) which discusses GCC compiler internals, as well as the GCC bootstrapping process.
Donald E. Knuth actually built WEB by writing the compiler in it, and then hand-compiled it to assembly or machine code.
As I understand it, the first Lisp interpreter was bootstrapped by hand-compiling the constructor functions and the token reader. The rest of the interpreter was then read in from source.
You can check for yourself by reading the original McCarthy paper, Recursive Functions of Symbolic Expressions and Their Computation by Machine, Part I.
Every example of bootstrapping a language I can think of (C, PyPy) was done after there was a working compiler. You have to start somewhere, and reimplementing a language in itself requires writing a compiler in another language first.
How else would it work? I don't think it's even conceptually possible to do otherwise.
Another alternative is to create a bytecode machine for your language (or use an existing one if it's features aren't very unusual) and write a compiler to bytecode, either in the bytecode, or in your desired language using another intermediate - such as a parser toolkit which outputs the AST as XML, then compile the XML to bytecode using XSLT (or another pattern matching language and tree-based representation). It doesn't remove the dependency on another language, but could mean that more of the bootstrapping work ends up in the final system.
It's the computer science version of the chicken-and-egg paradox. I can't think of a way not to write the initial compiler in assembler or some other language. If it could have been done, I should Lisp could have done it.
Actually, I think Lisp almost qualifies. Check out its Wikipedia entry. According to the article, the Lisp eval function could be implemented on an IBM 704 in machine code, with a complete compiler (written in Lisp itself) coming into being in 1962 at MIT.
Some bootstrapped compilers or systems keep both the source form and the object form in their repository:
ocaml is a language which has both a bytecode interpreter (i.e. a compiler to Ocaml bytecode) and a native compiler (to x86-64 or ARM, etc... assembler). Its svn repository contains both the source code (files */*.{ml,mli}) and the bytecode (file boot/ocamlc) form of the compiler. So when you build it is first using its bytecode (of a previous version of the compiler) to compile itself. Later the freshly compiled bytecode is able to compile the native compiler. So Ocaml svn repository contains both *.ml[i] source files and the boot/ocamlc bytecode file.
The rust compiler downloads (using wget, so you need a working Internet connection) a previous version of its binary to compile itself.
MELT is a Lisp-like language to customize and extend GCC. It is translated to C++ code by a bootstrapped translator. The generated C++ code of the translator is distributed, so the svn repository contains both *.melt source files and melt/generated/*.cc "object" files of the translator.
J.Pitrat's CAIA artificial intelligence system is entirely self-generating. It is available as a collection of thousands of [A-Z]*.c generated files (also with a generated dx.h header file) with a collection of thousands of _[0-9]* data files.
Several Scheme compilers are also bootstrapped. Scheme48, Chicken Scheme, ...
All over the web, I am getting the feeling that writing a C backend for a compiler is not such a good idea anymore. GHC's C backend is not being actively developed anymore (this is my unsupported feeling). Compilers are targeting C-- or LLVM.
Normally, I would think that GCC is a good old mature compiler that does performs well at optimizing code, therefore compiling to C will use the maturity of GCC to yield better and faster code. Is this not true?
I realize that the question greatly depends on the nature of the language being compiled and on other factors such that getting more maintainable code. I am looking for a rather more general answer (w.r.t. the compiled language) that focuses solely on performance (disregarding code quality, ..etc.). I would be also really glad if the answer includes an explanation on why GHC is drifting away from C and why LLVM performs better as a backend (see this) or any other examples of compilers doing the same that I am not aware of.
Let me list my two biggest problems with compiling to C. If this is a problem for your language depends on what kind of features you have.
Garbage collection When you have garbage collection you may have to interrupt regular execution at just about any point in the program, and at this point you need to access all pointers that point into the heap. If you compile to C you have no idea where those pointers might be. C is responsible for local variables, arguments, etc. The pointers are probably on the stack (or maybe in other register windows on a SPARC), but there is no real access to the stack. And even if you scan the stack, which values are pointers? LLVM actually addresses this problem (thought I don't know how well since I've never used LLVM with GC).
Tail calls Many languages assume that tail calls work (i.e., that they don't grow the stack); Scheme mandates it, Haskell assumes it. This is not the case with C. Under certain circumstances you can convince some C compilers to do tail calls. But you want tail calls to be reliable, e.g., when tail calling an unknown function. There are clumsy workarounds, like trampolining, but nothing quite satisfactory.
While I'm not a compiler expert, I believe that it boils down to the fact that you lose something in translation to C as opposed to translating to e.g. LLVM's intermediate language.
If you think about the process of compiling to C, you create a compiler that translates to C code, then the C compiler translates to an intermediate representation (the in-memory AST), then translates that to machine code. The creators of the C compiler have probably spent a lot of time optimizing certain human-made patterns in the language, but you're not likely to be able to create a fancy enough compiler from a source language to C to emulate the way humans write code. There is a loss of fidelity going to C - the C compiler doesn't have any knowledge about your original code's structure. To get those optimizations, you're essentially back-fitting your compiler to try to generate C code that the C compiler knows how to optimize when it's building its AST. Messy.
If, however, you translate directly to LLVM's intermediate language, that's like compiling your code to a machine-independent high-level bytecode, which is akin to the C compiler giving you access to specify exactly what its AST should contain. Essentially, you cut out the middleman that parses the C code and go directly to the high-level representation, which preserves more of the characteristics of your code by requiring less translation.
Also related to performance, LLVM can do some really tricky stuff for dynamic languages like generating binary code at runtime. This is the "cool" part of just-in-time compilation: it is writing binary code to be executed at runtime, instead of being stuck with what was created at compile time.
Part of the reason for GHC's moving away from the old C backend was that the code produced by GHC was not the code gcc could particularly well optimise. So with GHC's native code generator getting better, there was less return for a lot of work. As of 6.12, the NCG's code was only slower than the C compiled code in very few cases, so with the NCG getting even better in ghc-7, there was no sufficient incentive to keep the gcc backend alive. LLVM is a better target because it's more modular, and one can do many optimisations on its intermediate representation before passing the result to it.
On the other hand, last I looked, JHC still produced C and the final binary from that, typically (exclusively?) by gcc. And JHC's binaries tend to be quite fast.
So if you can produce code the C compiler handles well, that is still a good option, but it's probably not worth jumping through too many hoops to produce good C if you can easier produce good executables via another route.
One point that hasn't been brought up yet is, how close is your language to C? If you're compiling a fairly low-level imperative language, C's semantics may map very closely to the language you're implementing. If that's the case, it's probably a win, because the code written in your language is likely to resemble the kind of code someone would write in C by hand. That was definitely not the case with Haskell's C backend, which is one reason why the C backend optimized so poorly.
Another point against using a C backend is that C's semantics are actually not as simple as they look. If your language diverges significantly from C, using a C backend means you're going to have to keep track of all those infuriating complexities, and possibly differences between C compilers as well. It may be easier to use LLVM, with its simpler semantics, or devise your own backend, than keep track of all that.
Aside form all the codegenerator quality reasons, there are also other problems:
The free C compilers (gcc, clang) are a bit Unix centric
Support more than one compiler (e.g. gcc on Unix and MSVC on Windows) requires duplication of effort.
compilers might drag in runtime libraries (or even *nix emulations) on Windows that are painful. Two different C runtimes (e.g. linux libc and msvcrt) to base on complicate your own runtime and its maintenance
You get a big externally versioned blob in your project, which means a major version transition (e.g. a change of mangling could hurts your runtime lib, ABI changes like change of alignment) might require quite some work. Note that this goes for compiler AND externally versioned (parts of the) runtime library. And multiple compilers multiply this. This is not so bad for C as backend though as in the case where you directly connect to (read: bet on) a backend, like being a gcc/llvm frontend.
In many languages that follow this path, you see Cisms trickle through into the main language. Of course this won't happy to you, but you will be tempted :-)
Language functionality that doesn't directly map to standard C (like nested procedures,
and other things that need stack fiddling) are difficult.
If something is wrong, users will be confronted with C level compiler or linker errors that are outside their field of experience. Parsing them and making them your own is painful, specially with multiple compilers and -versions
Note that point 4 also means that you will have to invest time to just keep things working when the external projects evolve. That is time that generally doesn't really go into your project, and since the project is more dynamic, multiplatform releases will need a lot of extra release engineering to cater for change.
So in short, from what I've seen, while such a move allows a swift start (getting a reasonable codegenerator for free for many architectures), there are downsides. Most of them are related to loss of control and poor Windows support of *nix centric projects like gcc. (LLVM is too new to say much on long term, but their rhetoric sounds a lot like gcc did ten years ago). If a project you are hugely dependent on keeps a certain course (like GCC going to win64 awfully slow), then you are stuck with it.
First, decide if you want to have serious non *nix ( OS X being more unixy) support, or only a Linux compiler with a mingw stopgap for Windows? A lot of compilers need first rate Windows support.
Second, how finished must the product become? What's the primary audience ? Is it a tool for the open source developer that can handle a DIY toolchain, or do you want to target a beginner market (like many 3rd party products, e.g. RealBasic)?
Or do you really want to provide a well rounded product for professionals with deep integration and complete toolchains?
All three are valid directions for a compiler project. Ask yourself what your primary direction is, and don't assume that more options will be available in time. E.g. evaluate where projects are now that chose to be a GCC frontend in the early nineties.
Essentially the unix way is to go wide (maximize platforms)
The complete suites (like VS and Delphi, the latter which recently also started to support OS X and has supported linux in the past) go deep and try maximize productivity. (support specially the windows platform nearly completely with deep levels of integration)
The 3rd party projects are less clear cut. They go more after self-employed programmers, and niche shops. They have less developer resources, but manage and focus them better.
As you mentioned, whether C is a good target language depends very much on your source language. So here's a few reasons where C has disadvantages compared to LLVM or a custom target language:
Garbage Collection: A language that wants to support efficient garbage collection needs to know extra information that interferes with C. If an allocation fails, the GC needs to find which values on the stack and in registers are pointers and which aren't. Since the register allocator is not under our control we need to use more expensive techniques such as writing all pointers to a separate stack. This is just one of many issues when trying to support modern GC on top of C. (Note that LLVM also still has some issues in that area, but I hear it's being worked on.)
Feature mapping & Language-specific optimisations: Some languages rely on certain optimisations, e.g., Scheme relies on tail-call optimisation. Modern C compilers can do this but are not guaranteed to do this which could cause problems if a program relies on this for correctness. Another feature that could be difficult to support on top of C is co-routines.
Most dynamically typed languages also cannot be optimised well by C-compilers. For example, Cython compiles Python to C, but the generated C uses calls to many generic functions which are unlikely to be optimised well even by latest GCC versions. Just-in-time compilation ala PyPy/LuaJIT/TraceMonkey/V8 are much more suited to give good performance for dynamic languages (at the cost of much higher implementation effort).
Development Experience: Having an interpreter or JIT can also give you a much more convenient experience for developers -- generating C code, then compiling it and linking it, will certainly be slower and less convenient.
That said, I still think it's a reasonable choice to use C as a compilation target for prototyping new languages. Given that LLVM was explicitly designed as a compiler backend, I would only consider C if there are good reasons not to use LLVM. If the source-level language is very high-level, though, you most likely need an earlier higher-level optimisation pass as LLVM is indeed very low-level (e.g., GHC performs most of its interesting optimisations before generating calling into LLVM). Oh, and if you're prototyping a language, using an interpreter is probably easiest -- just try to avoid features that rely too much on being implemented by an interpreter.
Personally I would compile to C. That way you have a universal intermediary language and don't need to be concerned about whether your compiler supports every platform out there. Using LLVM might get some performance gains (although I would argue the same could probably be achieved by tweaking your C code generation to be more optimizer-friendly), but it will lock you in to only supporting targets LLVM supports, and having to wait for LLVM to add a target when you want to support something new, old, different, or obscure.
As far as I know, C can't query or manipulate processor flags.
This answer is a rebuttal to some of the points made against C as a target language.
Tail call optimizations
Any function that can be tail call optimized is actually equivalent to an iteration (it's an iterative process, in SICP terminology). Additionally, many recursive functions can and should be made tail recursive, for performance reasons, by using accumulators etc.
Thus, in order for your language to guarantee tail call optimization, you would have to detect it and simply not map those functions to regular C functions - but instead create iterations from them.
Garbage collection
It can be actually implemented in C. You can create a run-time system for your language which consists of some basic abstractions over the C memory model - using for example your own memory allocators, constructors, special pointers for objects in the source language, etc.
For example instead of employing regular C pointers for the objects in the source language, a special structure could be created, over which a garbage collection algorithm could be implemented. The objects in your language (more accurately, references) - could behave just like in Java, but in C they could be represented along with meta-information (which you wouldn't have in case you were working just with pointers).
Of course, such a system could have problems integrating with existing C tooling - depends on your implementation and trade-offs that you're willing to make.
Lacking operations
hippietrail noted that C lacks rotate operators (by which I assume he meant circular shift) that are supported by processors. If such operations are available in the instruction set, then they can be added using inline assembly.
The frontend would in this case have to detect the architecture which it's running for and provide the proper snippets. Some kind of a fallback in the form of a regular function should also be provided.
This answer seems to be addressing some core issues seriously. I'd like to see some more substantiation on which problems exactly are caused by C's semantics.
There's a particular case where if you're writing a programming language with strong security* or reliability requirements.
For one, it would take you years to know a big enough subset of C well enough that you know all the C operations you will choose to employ in your compilation are safe and don't evoke undefined behaviour. Secondly, you'd then have to find an implementation of C that you can trust (which would mean a tiny trusted code base, and probably wont be very efficient). Not to mention you'll need to find a trusted linker, OS capable of executing compiled C code, and some basic libraries, all of which would need to be well-defined and trusted.
So in this case you might as well either use assembly language, if you care about about machine independence, some intermediate representation.
*please note that "strong security" here is not related at all to what banks and IT businesses claim to have
Is it a good idea to compile a language to C?
No.
...which begs one obvious question: why do some still think compiling via C is a good idea?
Two big arguments in favour of misusing C in this fashion is that it's stable and standardised:
For GCC, it's C or bust (but there is work underway which may allow other options).
For LLVM, there's the routine breaking of backwards-compatibility in its IR and APIs - what would you prefer: spending time on improving your project or chasing after LLVM?
Providing little more than a promise of stability is somewhat ironic, considering the intended purpose of LLVM.
For these and other reasons, there are various half-done, toy-like, lab-experiment, single-site/use, and otherwise-ignominious via-C backends scattered throughout cyberspace - being abandoned, most have succumbed to bit-rot. But there are some projects which do manage to progress to the mainstream, and their success is then used by via-C supporters to further perpetuate this fantasy.
But if you're one of those supporters, feel free to make fantasy into reality - there's that work happening in GCC, or the resurrected LLVM backend for C. Just imagine it: two well-built, well-maintained via-C backends into which the sum of all prior knowledge can be directed.
They just need you.
It seems that most new programming languages that have appeared in the last 20 years have been written in C. This makes complete sense as C can be seen as a sort of portable assembly language. But what I'm curious about is whether this has constrained the design of the languages in any way. What prompted my question was thinking about how the C stack is used directly in Python for calling functions. Obviously the programming language designer can do whatever they want in whatever language they want, but it seems to me that the language you choose to write your new language in puts you in a certain mindset and gives you certain shortcuts that are difficult to ignore. Are there other characteristics of these languages that come from being written in that language (good or bad)?
I tend to disagree.
I don't think it's so much that a language's compiler or interpreter is implemented in C — after all, you can implement a virtual machine with C that is completely unlike its host environment, meaning that you can get away from a C / near-assembly language mindset.
However, it's more difficult to claim that the C language itself didn't have any influence on the design of later languages. Take for example the usage of curly braces { } to group statements into blocks, the notion that whitespace and indentation is mostly unimportant, native type's names (int, char, etc.) and other keywords, or the way how variables are defined (ie. type declaration first, followed by the variable's name, optional initialization). Many of today's popular and wide-spread languages (C++, Java, C#, and I'm sure there are even more) share these concepts with C. (These probably weren't completely new with C, but AFAIK C came up with that particular mix of language syntax.)
Even with a C implementation, you're surprisingly free in terms of implementation. For example, chicken scheme uses C as an intermediate, but still manages to use the stack as a nursery generation in its garbage collector.
That said, there are some cases where there are constraints. Case in point: The GHC haskell compiler has a perl script called the Evil Mangler to alter the GCC-outputted assembly code to implement some important optimizations. They've been moving to internally-generated assembly and LLVM partially for that reason. That said, this hasn't constrained the language design - only the compiler's choice of available optimizations.
No, in short. The reality is, look around at the languages that are written in C. Lua, for example, is about as far from C as you can get without becoming Perl. It has first-class functions, fully automated memory management, etc.
It's unusual for new languages to be affected by their implementation language, unless said language contains serious limitations. While I definitely disapprove of C, it's not a limited language, just very error-prone and slow to program in compared to more modern languages. Oh, except in the CRT. For example, Lua doesn't contain directory functionality, because it's not part of the CRT so they can't portably implement it in standard C. That is one way in which C is limited. But in terms of language features, it's not limited.
If you wanted to construct an argument saying that languages implemented in C have XYZ limitations or characteristics, you would have to show that doing things another way is impossible in C.
The C stack is just the system stack, and this concept predates C by quite a bit. If you study theory of computing you will see that using a stack is very powerful.
Using C to implement languages has probably had very little effect on those languages, though the familiarity with C (and other C like languages) of people who design and implement languages has probably influenced their design a great deal. It is very difficult to not be influenced by things you've seen before even when you aren't actively copying the best bits of another language.
Many languages do use C as the glue between them and other things, though. Part of this is that many OSes provide a C API, so to access that it's easy to use C. Additionally, C is just so common and simple that many other languages have some sort of way to interface with it. If you want to glue two modules together which are written in different languages then using C as the middle man is probably the easiest solution.
Where implementing a language in C has probably influenced other languages the most is probably things like how escapes are done in strings, which probably isn't that limiting.
The only thing that has constrained language design is the imagination and technical skill of the language designers. As you said, C can be thought of as a "portable assembly language". If that is true, then asking if C has constrained a design is akin to asking if assembly has constrained language design. Since all code written in any language is eventually executed as assembly, every language would suffer the same constraints. Therefore, the C language itself imposes no constraints that would be overcome by using a different language.
That being said, there are some things that are easier to do in one language vs another. Many language designers take this into account. If the language is being designed to be, say, powerful at string processing but performance is not a concern, then using a language with better built-in string processing facilities (such as C++) might be more optimal.
Many developers choose C for several reasons. First, C is a very common language. Open source projects in particular like that it is relatively easier to find an experienced C-language developer than it is to find an equivalently-skilled developer in some other languages. Second, C typically lends itself to micro-optimization. When writing a parser for a scripted language, the efficiency of the parser has a big impact on the overall performance of scripts written in that language. For compiled languages, a more efficient compiler can reduce compile times. Many C compilers are very good at generating extremely optimized code (which is also part of the reason why many embedded systems are programmed in C), and performance-critical code can be written in inline assembly. Also, C is standardized and is generally a static target. Code can be written to the ANSI/C89 standard and not have to worry about it being incompatible with a future version of C. The revisions made in the C99 standard add functionality but don't break existing code. Finally, C is extremely portable. If at least one compiler exists for a given platform, it's most likely a C compiler. Using a highly-portable language like C makes it easier to maximize the number of platforms that can use the new language.
The one limitation that comes to mind is extensibility and compiler hosting. Consider the case of C#. The compiler is written in C/C++ and is entirely native code. This makes it very difficult to use in process with a C# application.
This has broad implications for the tooling chain of C#. Any code which wants to take advantage of the real C# parser or binding engine has to have at least one component which is written in native code. This eventually results in most of the tooling chain for the C# language being written in C++ which is a bit backwards for a language.
This doesn't limit the language per say but definitely has an effect on the experience around the language.
Garbage collection. Language implementations on top of Java or .NET use the VM's GC. Those on top of C tend to use reference counting.
One thing I can think of is that functions are not necessarily first class members in the language, and this is can't be blamed on C alone (I am not talking about passing a function pointer, though it can be argued that C provides you with that feature).
If one were to write a DSL in groovy (/scheme/lisp/haskell/lua/javascript/and some more that I am not sure of), functions can become first class members. Making functions first class members and allowing for anonymous functions allows to write concise and more human readable code (like demonstrated by LINQ).
Yes, eventually all of these are running under C (or assembly if you want to get to that level), but in terms of providing the user of the language the ability to express themselves better, these abstractions do a wonderful job.
Implementing a compiler/interpreter in C doesn't have any major limitations. On the other hand, implementing a language X to C compiler does. For example, according to the Wikipedia article on C--, when compiling a higher level language to C you can't do precise garbage collection, efficient exception handling, or tail recursion optimization. This is the kind of problem that C-- was intended to solve.
I wonder why other languages do not support this feature. What I can understand that C / C++ code is platform dependent so to make it work (compile and execute) across various platform, is achieved by using preprocessor directives. And there are many other uses of this apart from this. Like you can put all your debug printf's inside #if DEBUG ... #endif. So while making the release build these lines of code do not get compiled in the binary. But in other languages, achieving this thing (later part) is difficult (or may be impossible, I'm not sure). All code will get compiled in the binary increasing its size. So my question is "why do Java, or other modern compiled languages no support this kind of feature?" which allows you to include or exclude some piece of code from the binary in a much handy way.
The major languages that don't have a preprocessor usually have a different, often cleaner, way to achieve the same effects.
Having a text-preprocessor like cpp is a mixed blessing. Since cpp doesn't actually know C, all it does is transform text into other text. This causes many maintenance problems. Take C++ for example, where many uses of the preprocessor have been explicitly deprecated in favor of better features like:
For constants, const instead of #define
For small functions, inline instead of #define macros
The C++ FAQ calls macros evil and gives multiple reasons to avoid using them.
The portability benefits of the preprocessor are far outweighed by the possibilities for abuse. Here are some examples from real codes I have seen in industry:
A function body becomes so tangled with #ifdef that it is very hard to read the function and figure out what is going on. Remember that the preprocessor works with text not syntax, so you can do things that are wildly ungrammatical
Code can become duplicated in different branches of an #ifdef, making it hard to maintain a single point of truth about what's going on.
When an application is intended for multiple platforms, it becomes very hard to compile all the code as opposed to whatever code happens to be selected for the developer's platform. You may need to have multiple machines set up. (It is expensive, say, on a BSD system to set up a cross-compilation environment that accurately simulates GNU headers.) In the days when most varieties of Unix were proprietary and vendors had to support them all, this problem was very serious. Today when so many versions of Unix are free, it's less of a problem, although it's still quite challenging to duplicate native Windows headers in a Unix environment.
It Some code is protected by so many #ifdefs that you can't figure out what combination of -D options is needed to select the code. The problem is NP-hard, so the best known solutions require trying exponentially many different combinations of definitions. This is of course impractical, so the real consequence is that gradually your system fills with code that hasn't been compiled. This problem kills refactoring, and of course such code is completely immune to your unit tests and your regression tests—unless you set up a huge, multiplatform testing farm, and maybe not even then.
In the field, I have seen this problem lead to situations where a refactored application is carefully tested and shipped, only to receive immediate bug reports that the application won't even compile on other platforms. If code is hidden by #ifdef and we can't select it, we have no guarantee that it typechecks—or even that it is syntactically correct.
The flip side of the coin is that more advanced languages and programming techniques have reduced the need for conditional compilation in the preprocessor:
For some languages, like Java, all the platform-dependent code is in the implementation of the JVM and in the associated libraries. People have gone to huge lengths to make JVMs and libraries that are platform-independent.
In many languages, such as Haskell, Lua, Python, Ruby, and many more, the designers have gone to some trouble to reduce the amount of platform-dependent code compared to C.
In a modern language, you can put platform-dependent code in a separate compilation unit behind a compiled interface. Many modern compilers have good facilities for inlining functions across interface boundaries, so that you don't pay much (or any) penalty for this kind of abstraction. This wasn't the case for C because (a) there are no separately compiled interfaces; the separate-compilation model assumes #include and the preprocessor; and (b) C compilers came of age on machines with 64K of code space and 64K of data space; a compiler sophisticated enough to inline across module boundaries was almost unthinkable. Today such compilers are routine. Some advanced compilers inline and specialize methods dynamically.
Summary: by using linguistic mechanisms, rather than textual replacement, to isolate platform-dependent code, you expose all your code to the compiler, everything gets type-checked at least, and you have a chance of doing things like static analysis to ensure suitable test coverage. You also rule out a whole bunch of coding practices that lead to unreadable code.
Because modern compilers are smart enough to remove dead code in most any case, making manually feeding the compiler this way no longer necessary. I.e. instead of :
#include <iostream>
#define DEBUG
int main()
{
#ifdef DEBUG
std::cout << "Debugging...";
#else
std::cout << "Not debugging.";
#endif
}
you can do:
#include <iostream>
const bool debugging = true;
int main()
{
if (debugging)
{
std::cout << "Debugging...";
}
else
{
std::cout << "Not debugging.";
}
}
and you'll probably get the same, or at least similar, code output.
Edit/Note: In C and C++, I'd absolutely never do this -- I'd use the preprocessor, if nothing else that it makes it instantly clear to the reader of my code that a chunk of it isn't supposed to be complied under certain conditions. I am saying, however, that this is why many languages eschew the preprocessor.
A better question to ask is why did C resort to using a pre-processor to implement these sorts of meta-programming tasks? It isn't a feature as much as it is a compromise to the technology of the time.
The pre-processor directives in C were developed at a time when machine resources (CPU speed, RAM) were scarce (and expensive). The pre-processor provided a way to implement these features on slow machines with limited memory. For example, the first machine I ever owned had 56KB of RAM and a 2Mhz CPU. It still had a full K&R C compiler available, which pushed the system's resources to the limit, but was workable.
More modern languages take advantage of today's more powerful machines to provide better ways of handling the sorts of meta-programming tasks that the pre-processor used to deal with.
Other languages do support this feature, by using a generic preprocessor such as m4.
Do we really want every language to have its own text-substitution-before-execution implementation?
The C pre-processor can be run on any text file, it need not be C.
Of course, if run on another language, it might tokenize in weird ways, but for simple block structures like #ifdef DEBUG, you can put that in any language, run the C pre-processor on it, then run your language specific compiler on it, and it will work.
Note that macros/preprocessing/conditionals/etc are usually considered a compiler/interpreter feature, as opposed to a language feature, because they are usually completely independent of the formal language definition, and might vary from compiler to compiler implementation for the same language.
A situation in many languages where conditional compilation directives can be better than if-then-else runtime code is when compile-time statements (such as variable declarations) need to be conditional. For example
$if debug
array x
$endif
...
$if debug
dump x
$endif
only declares/allocates/compiles x when needing x, whereas
array x
boolean debug
...
if debug then dump x
probably has to declare x regardless of whether debug is true.
Many modern languages actually have syntactic metaprogramming capabilities that go way beyond CPP. Pretty much all modern Lisps (Arc, Clojure, Common Lisp, Scheme, newLISP, Qi, PLOT, MISC, ...) for example have extremely powerful (Turing-complete, actually) macro systems, so why should they limit themselves to the crappy CPP style macros which aren't even real macros, just text snippets?
Other languages with powerful syntactic metaprogramming include Io, Ioke, Perl 6, OMeta, Converge.
Because decreasing the size of the binary:
Can be done in other ways (compare the average size of a C++ executable to a C# executable, for example).
Is not that important, when it you weigh it against being able to write programs that actually work.
Other languages also have better dynamic binding. For example, we have some code that we cannot ship to some customers for export reasons. Our "C" libraries use #ifdef statements and elaborate Makefile tricks (which is pretty much the same).
The Java code uses plugins (ala Eclipse), so that we just don't ship that code.
You can do the same thing in C through the use of shared libraries... but the preprocessor is a lot simpler.
A other point nobody else mentioned is platform support.
Most modern languages can not run on the same platforms as C or C++ can and are not intended to run on this platforms. For example, Java, Python and also native compiled languages like C# need a heap, they are designed to run on a OS with memory management, libraries and large amount of space, they do not run in a freestanding environment. There you can use other ways to archive the same. C can be used to program controllers with 2KiB ROM, there you need a preprocessor for most applications.