Suppose someone wants to statically compile a given language using LLVM. What would be the biggest differences (advantages and disadvantages) between translating it first to C and then using Clang, versus doing a direct IR translation?
The obvious answer, I guess, would be that a front end that knows the source language can come up with a better-optimized IR representation than Clang can be expected to produce from the generated C.
Am I missing something here?
Advantages of using a generic C backend:
You can use any C compiler (not just Clang)
It is easier to debug the intermediate code when it is in such a high-level language
Depending on your source language semantics, it might be easier to translate it via C (but not necessarily)
And the disadvantages are:
If your language is compiled incrementally (e.g., no clearly separated modules, a complex macro system, or whatever else), compiling via LLVM IR in a single module with immediate JIT compilation makes more sense than generating hundreds of tiny C modules. In other words, C enforces separate compilation.
If your source language's semantics are too far from C, compiling it straight to a lower level can be easier.
Not all LLVM functionality is directly accessible from C: e.g., intrinsics, alternative calling conventions, or debug metadata for a higher-level language (see the sketch below).
Clang is big; leaving it out shrinks your memory footprint.
Clang is not easy to maintain: it depends on the presence and exact locations of headers, on some parts of GCC, etc. Without it, bare LLVM can be used on its own and your dependencies can be kept self-contained.
Optimisations are in most cases not an issue. Clang deliberately generates extremely non-optimal LLVM IR; LLVM is supposed to take care of all the optimisations, not the front ends. Unless, of course, you can do some high-level optimisations, but then they won't depend on your choice of back end.
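As a concrete illustration of the calling-convention point above, here is a minimal sketch using the LLVM-C API from llvm-c/Core.h (the module and function names are made up, and the build line is an assumption) of a front end emitting IR directly and requesting LLVM's fastcc convention, something generated portable C cannot express:

    /* Build (assumption): cc demo.c $(llvm-config --cflags --ldflags --libs core) */
    #include <llvm-c/Core.h>

    int main(void) {
        /* Build a module containing one function: i32 add(i32, i32). */
        LLVMModuleRef mod = LLVMModuleCreateWithName("direct_ir_demo");
        LLVMTypeRef params[] = { LLVMInt32Type(), LLVMInt32Type() };
        LLVMTypeRef fnty = LLVMFunctionType(LLVMInt32Type(), params, 2, 0);
        LLVMValueRef fn = LLVMAddFunction(mod, "add", fnty);

        /* Request the fast calling convention; generated C can't say this. */
        LLVMSetFunctionCallConv(fn, LLVMFastCallConv);

        /* Emit the body: return a + b. */
        LLVMBuilderRef b = LLVMCreateBuilder();
        LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
        LLVMBuildRet(b, LLVMBuildAdd(b, LLVMGetParam(fn, 0),
                                     LLVMGetParam(fn, 1), "sum"));

        LLVMDumpModule(mod);   /* prints the textual IR */
        LLVMDisposeBuilder(b);
        LLVMDisposeModule(mod);
        return 0;
    }

The same module can then be fed unchanged to the optimizer, a static back end, or a JIT; LLVM handles the target-specific lowering.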
Related
I'm trying to write a language runtime (and a language itself) that is similar to .NET or to the JVM. It has a custom bytecode format.
What I want is a way to translate said bytecode to actual, runnable machine code. Since I don't want to write such a translator myself (this is more of a toy project/personal side project), I want to find a good JIT library to use.
Here's what I want the library to do:
The library should be as easy to use as possible (toy project and I don't really have much experience here)
The library should support at least x86_64 (development machine), though preferably it should cover other architectures as well
The library should preferably do some low-level optimizations (register tracking and allocation, reducing memory accesses, etc.); those optimizations shouldn't be very expensive to do, though (I will myself do other optimizations, such as converting virtual calls to direct ones). I can accept a library with no optimization if it's the easiest to use, though.
The library must have an interface that is usable from C (preferred) or C++ (acceptable).
I will use Boehm GC for garbage collection, if it matters (probably doesn't, but just in case). Maybe a compacting GC would be nice, but I guess I shouldn't combine the questions...
I would suggest LLVM. There are some great tutorials on how to implement your own language with it, and the basic stuff is not too complicated. You also get the option to do a lot of more advanced stuff later on. As a bonus, not only can you use the JIT, you can also statically compile and optimize your binaries. LLVM also has a C interface and can target all common CPU architectures and even a lot of more obscure ones.
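To give a feel for that C interface in a JIT setting, here is a minimal sketch using the MCJIT engine (the older, simpler engine; recent LLVM releases steer you towards ORC instead). The function square, its body, and the build command are assumptions for illustration:

    /* Build (assumption):
       cc jit.c $(llvm-config --cflags --ldflags \
                  --libs core executionengine mcjit native --system-libs) */
    #include <stdio.h>
    #include <stdint.h>
    #include <llvm-c/Core.h>
    #include <llvm-c/ExecutionEngine.h>
    #include <llvm-c/Target.h>

    int main(void) {
        /* MCJIT needs the native target registered before use. */
        LLVMLinkInMCJIT();
        LLVMInitializeNativeTarget();
        LLVMInitializeNativeAsmPrinter();

        /* Build the IR for: i32 square(i32 x) { return x * x; } */
        LLVMModuleRef mod = LLVMModuleCreateWithName("jit_demo");
        LLVMTypeRef i32 = LLVMInt32Type();
        LLVMTypeRef fnty = LLVMFunctionType(i32, &i32, 1, 0);
        LLVMValueRef fn = LLVMAddFunction(mod, "square", fnty);
        LLVMBuilderRef b = LLVMCreateBuilder();
        LLVMPositionBuilderAtEnd(b, LLVMAppendBasicBlock(fn, "entry"));
        LLVMValueRef x = LLVMGetParam(fn, 0);
        LLVMBuildRet(b, LLVMBuildMul(b, x, x, "sq"));

        /* JIT-compile the module and fetch a callable function pointer. */
        LLVMExecutionEngineRef ee;
        char *err = NULL;
        if (LLVMCreateExecutionEngineForModule(&ee, mod, &err)) {
            fprintf(stderr, "EE error: %s\n", err);
            return 1;
        }
        int (*square)(int) =
            (int (*)(int))(intptr_t)LLVMGetFunctionAddress(ee, "square");
        printf("square(7) = %d\n", square(7));   /* prints 49 */

        LLVMDisposeBuilder(b);
        LLVMDisposeExecutionEngine(ee);   /* also frees the module */
        return 0;
    }

In a real bytecode runtime you would loop over your own instructions and emit the corresponding IR through the same builder calls; the engine setup stays the same.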
I was looking into compiler bootstrapping, and I looked at how Go implements bootstrapping from source, i.e., by building the last version of Go that was implemented in C and using the resulting executable to compile newer Go releases. This made me curious as to how the same could be done with C. Can you construct a C compiler on a computer with literally nothing present on it? If not, then how can I trust that the binary of the compiler I use doesn't automatically fill the binaries it compiles with spyware?
Related question: since the first C compiler was written in B, and B was written in BCPL, what was BCPL written in?
Can you construct a C compiler on a computer with literally nothing present on it?
The main issue is how (in 2021) you would write a program for that computer, and how you would input it!
In the 1970s, computers (like IBM 360 mainframes) had many mechanical switches to enter some initial program. In the 1960s, they had even more, e.g. the IBM 1620.
Today, how would you input that initial program? Did you consider using some Arduino? Even oscilloscopes today contain microprocessors with programs...
Some hobbyists have designed (and spent a lot of money making), a few years ago, computers built from mechanical relays. These are probably thousands of times slower than the cheapest laptop you could buy (or than the micro-controller inside your computer mouse; your mouse contains some software too).
You could also buy many discrete transistors (e.g. thousands of 2N2222s) and make a computer by soldering them together.
Even a cheap motherboard (like e.g. the MSI A320M A-PRO) ships today with some firmware program called UEFI or BIOS, rumored to be mostly written in C (several tens of thousands of statements).
In some ways, computer chips themselves are "software", coded in VHDL, SystemC, etc.
However, you can in principle still bootstrap a C compiler in 2021.
Here is a hypothetical tale...
Imagine you have today a laptop running a small Linux distribution on some isolated island (à la Robinson Crusoe), without any Internet connection, but with books (including Modern C, some book about x86-64 assembly and instruction set architecture, and many other books in paper form), pencils, paper, food, and a lot of time to spend. Imagine that the system does not have any C compiler (e.g. because you removed the gcc package from some Debian distribution by mistake), but does have GNU binutils (that is, the linker ld and the assembler gas), some editor in binary form (e.g. GNU emacs or vim), and GNU bash and GNU make as binary packages. We assume you are motivated enough to spend months writing a C compiler. We also assume you have access to man pages in paper form (notably elf(5) and ld(1)...), and that you can inspect a file in binary form with od(1) and less(1).
Then you could design on paper a subset µC of the C language in EBNF notation. With months of effort, you could write a small assembler program, directly doing syscalls(2) (see the Linux Assembly HOWTO), that interprets that µC language (writing an interpreter is easier than writing a compiler; read for example the Dragon Book, Queinnec's Lisp In Small Pieces, and Scott's Programming Language Pragmatics).
Once you have your tiny µC interpreter, you can write a naive µC compiler in µC (after all, Fabrice Bellard managed to write his Tiny C Compiler).
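To make "naive compiler" concrete, here is a sketch in ordinary C (standing in for µC; the expression subset and the emitted gas syntax are illustrative only) that compiles constant expressions such as 1+2-3 into x86-64 assembly for the GNU assembler:

    #include <stdio.h>
    #include <ctype.h>

    static const char *p;   /* cursor into the source text */

    /* Parse a decimal literal and emit code loading it into %rax. */
    static void number(void) {
        int n = 0;
        while (isdigit((unsigned char)*p))
            n = n * 10 + (*p++ - '0');
        printf("    movq $%d, %%rax\n", n);
    }

    /* expr := number (('+' | '-') number)*  -- left associative. */
    static void expr(void) {
        number();
        while (*p == '+' || *p == '-') {
            char op = *p++;
            printf("    pushq %%rax\n");
            number();
            printf("    movq %%rax, %%rcx\n");
            printf("    popq %%rax\n");
            printf("    %s %%rcx, %%rax\n", op == '+' ? "addq" : "subq");
        }
    }

    int main(void) {
        p = "1+2-3";                   /* the "source program" */
        printf(".globl main\nmain:\n");
        expr();
        printf("    ret\n");           /* result is left in %rax */
        return 0;
    }

Its output can be assembled with gas and (with the usual startup-code details taken care of) linked into a runnable program; the matching interpreter is even shorter, since it computes n directly instead of printing instructions.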
Once you have debugged that µC compiler, you can extend it to accept all the syntax and semantics of C.
Once you have a full C compiler, you could improve it to optimize better, maybe extend it to accept a small subset of C++, and you might also write a static C code analyzer inspired by Frama-C.
PS. Bootstrapping can be generalized a lot - see Pitrat's blog on bootstrapping artificial intelligence (Jacques Pitrat, born in 1934, died in October 2019) and the RefPerSys project.
As Some programmer dude stated in a comment, since C is a portable programming language, you can use a compiler on a different platform to produce a cross-compiler which, running on that platform, produces executables for the target platform.
You then compile that same C compiler for the target platform on that host platform so that the result is an executable for the target platform.
Then you copy that compiler binary onto the target machine, and from then on it is self-hosting.
Naturally at some point in early history someone really had to write something in assembler or machine code somewhere. Today, it is no longer a necessity but a "life choice".
As for the "how can I trust that the binary of the compiler I use doesn't automatically fill the binaries it compiles with spyware?" problem has been solved - you can use two independent compilers to compile the cross-compiler from the same source base and the target and both of those cross-compilers should produce bitwise-identical results for the target executable. Then you would know that the result is either free of spyware, or that the two independent compilers you used in the beginning would infect the resulting executable with exact same spyware - which is exceedingly unlikely.
You can write a really feeble C compiler in assembly or machine code, then bootstrap from there.
Before programming languages existed, you just wrote machine code. That was simply how it was done.
Later came assembler, which is like "easy mode" machine code, and from there evolved high-level languages like Fortran and BCPL. These were decoupled from the machine architecture by having a proper compiler to do the translation.
Today you'd probably write something in C and go from there, anything compiled is suitable, though "compiled" is a loose definition now that LLVM exists and you can just bang out LLVM IR code instead of actual machine code. Rust started in OCaml and is now "self-hosted" on top of LLVM, for example.
I am exposed to C because of embedded systems programming, and I think it's a wonderful language for that field. However, why is it used to write compilers? If the reason gcc is implemented in C/C++ is that there weren't many good languages back then, there's no excuse for Clang taking the same path (using C/C++).
Is it for performance reasons? Interpreted languages are mostly a bit slower than compiled languages, but I guess the difference is almost negligible for CoffeeScript (JavaScript) because of Node.js.
From the perspective of developers, I suppose it's much easier to write a compiler in a high-level language. Unfortunately, most of the compilers out there are written in C/C++. Is it just because of legacy code?
Response to comments:
Bootstrapping is just one way to show that a language is powerful enough to write a compiler in. It shouldn't be the dominant reason we choose the language in which to implement a compiler.
I agree with the guess given below that "most compiler developers would answer because most of the compiler-related tools (bison, yacc) emit C code". However, neither GCC nor Clang uses a generated parser; each implements one by hand. This front-end work is independent of the target architecture, so it should not be where C/C++'s strength matters.
There's more or less a consensus that performance is one key factor. Indeed, even for GCC and Clang, building a reasonably sized C project (the Linux kernel) takes a lot of time. Is that because of the front end or the back end? I have to admit that I don't have much experience with compiler back ends, as our compiler course ended with generating LLVM code.
I am exposed to C because of embedded systems programming, and I think
it's a wonderful language for that field.
Yes. It's better than Java.
However, why is it used to write compilers?
This question can't be answered without asking the developers. I suspect that the majority of them will tell you that common compiler-writing tools (yacc, flex, bison, etc.) produce C code.
If the reason for gcc is that there weren't many good languages,
there's no excuse for Clang.
GCC isn't a programming language, and neither is Clang. They're both implementations of the C programming language.
Is it for performance reasons?
Don't confuse implementation with specification. Speed is an attribute introduced by your compiler and your computer, not by the programming language. GCC happens to produce fairly efficient machine code, which might influence developers to use C as their primary programming language... but in ten years' time, it could* be that Node.js produces more efficient machine code than GCC. Don't forget, StackOverflow is forever.
* could, but most likely won't. See Ira Baxter's comment below for more info.
Interpreted languages are mostly a bit slower than compiled
languages, but I guess the difference is almost negligible for
CoffeeScript (JavaScript) because of Node.js.
Similarly, interpretation or compilation isn't a property of the language, but of the implementation of the language. For example, GCC and Clang choose to compile C to machine code. Ch and CINT are two interpreters that translate C code directly to behaviour, rather than to machine code. Java was once predominantly interpreted, too, but is now predominantly compiled into JVM bytecode. JavaScript seems to be phasing towards predominant compilation, too. Who knows? Maybe you'll see compilers written predominantly in JavaScript in ten years' time...
From the perspective of developers, I suppose it's much easier to
write a compiler in a high-level language.
All of these programming languages are technically high level. They're mostly defined in terms of an abstract machine; they're certainly not low level.
Unfortunately, most of the compilers out there are written in C/C++.
I don't consider it unfortunate that C++ is used to write software; it's not a bad programming language.
Is it just because of legacy code?
I suppose legacy code might influence the decision of a programmer. In the end though, as I said, you'd have to ask the developers. They might just decide to use C or C++ because C or C++ is their favourite programming language... Why do you speak English?
Compilers are very complex software in general. The front-end part is pretty simple (parsing), but the back-end parts (scheduling, code generation, optimization, register allocation) involve NP-complete problems (of course, compilers try to approximate solutions to these problems). Thus, implementing the compiler in C helps keep compile times down. C is also very good at bitwise operations and other low-level work, which is useful when writing a compiler.
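As a small illustration of that last point, here is a sketch of instruction encoding in C. The field layout follows the RISC-V R-type format for add, but treat the example as illustrative rather than as a reference encoder:

    #include <stdint.h>
    #include <stdio.h>

    /* Pack "add rd, rs1, rs2" into a 32-bit RISC-V R-type word:
       funct7 | rs2 | rs1 | funct3 | rd | opcode, with funct7 = funct3 = 0. */
    static uint32_t encode_add(unsigned rd, unsigned rs1, unsigned rs2) {
        const uint32_t OPCODE_OP = 0x33;   /* register-register ALU group */
        return OPCODE_OP
             | ((rd  & 0x1F) << 7)
             | ((rs1 & 0x1F) << 15)
             | ((rs2 & 0x1F) << 20);
    }

    int main(void) {
        printf("add x1, x2, x3 -> 0x%08X\n", encode_add(1, 2, 3));
        return 0;   /* prints 0x003100B3 */
    }

Fixed-width integer types, shifts, and masks make this kind of back-end code both terse and fast in C.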
Note that not all compilers are written in C, though. For example, GHC, the Haskell compiler, is written in Haskell, using the bootstrapping technique.
JavaScript is built around asynchronous I/O, which doesn't suit compiler writing.
I see many reasons:
There is no elegant way of handling bit-precise code in JavaScript
You can't write binary files easily in JavaScript, so the assembler part of the compiler would have to be in a lower-level language (see the C sketch after this list)
Huge JS codebases are very heavy to load into memory (it's all plain text, remember?)
Writing optimization routines for compilers is heavily CPU-intensive, which is not yet a good fit for JavaScript
You wouldn't be able to compile your compiler with itself (bootstrap), because you need a JavaScript interpreter behind your compiler. The bootstrap phase wouldn't be "pure":
JS Compiler compiles NodeJS -> NodeJS runs your new Compiler -> new JS Compiler
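For contrast with the binary-files point above, here is a minimal C sketch; the six bytes encode mov eax, 42 followed by ret on x86-64, and the file name is arbitrary:

    #include <stdio.h>

    int main(void) {
        /* Raw machine code: mov eax, 42 (B8 2A 00 00 00); ret (C3). */
        unsigned char code[] = { 0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3 };
        FILE *f = fopen("out.bin", "wb");
        if (!f) return 1;
        fwrite(code, 1, sizeof code, f);
        fclose(f);
        return 0;
    }

Emitting raw bytes is a single fwrite in C, while a JavaScript assembler would be juggling typed arrays or Buffer objects for the same effect.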
gcc is implemented primarily in C, but that is not true of all compilers, including some that are quite standard. It is a common pattern for a compiler to be implemented in the language that it compiles. ghc is written largely in Haskell. Recent versions of guile feature a compiler implemented mostly in Scheme.
Nope, CoffeeScript et al. are still much slower than natively compiled (and optimised) C code. Even if you take the subset of JavaScript that can be optimised (asm.js), it's still twice as slow as native C.
When people say Node.js is just as fast as C, they mean it is just as fast as part of an overall system that does other things, like reading from disk or waiting for data off the network. In such systems the CPU is under-used (especially with today's super-fast CPUs), so the raw processing capability of the language is not the performance problem. Hence, a Node.js server is exactly as fast as a C server if they're both stuck waiting for a network call to return data. The kind of system written in Node.js does a lot of waiting on the network, which is why people use Node.js; the kind of system written in C does not suit being written in Node.js.
I'm implementing a new back-end to LLVM, starting with the CBackend target.
The end goal is to use "llc" to generate source transforms of input C code.
However, there are a number of optimizations I'd like to make, which don't seem to be very well supported within this context.
The LLVM IR is very low level, and I have to inspect it to re-discover what's actually going on. This would be a lot simpler to do at the AST level.
However, it appears that the AST level is a Clang-internal construct, and there's no easy way to plug into this.
Do I have to inspect the LLVM IR and reverse-engineer the higher-level flow myself? (Does every back end have to do this? That seems wasteful!)
In general, you cannot reverse-engineer everything. So, you have only two possibilities:
Do everything at the Clang AST level.
Emit additional information (e.g. via metadata) which might help you recover some aspects of the input source (see the sketch below).
But really, you shouldn't do source-to-source transforms at the LLVM IR level; it's the wrong tool for that target. You certainly can plug in at the AST level: e.g., the Clang sources contain a rewriter which turns Objective-C code into plain C.
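For the second option, here is a minimal sketch using the LLVM-C API (the metadata name myfrontend.source_info and its content are invented; consider it a sketch under those assumptions):

    #include <string.h>
    #include <llvm-c/Core.h>

    int main(void) {
        LLVMModuleRef mod = LLVMModuleCreateWithName("annotated");

        /* Record a higher-level fact as module-level named metadata. */
        const char *note = "origin: while-loop at line 12";
        LLVMValueRef ops[] = { LLVMMDString(note, (unsigned)strlen(note)) };
        LLVMAddNamedMetadataOperand(mod, "myfrontend.source_info",
                                    LLVMMDNode(ops, 1));

        LLVMDumpModule(mod);   /* the !myfrontend.source_info node shows up */
        LLVMDisposeModule(mod);
        return 0;
    }

A later pass can look that node up by name and use it to guide source-level reconstruction; the IR alone would have lost the fact.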
I'm currently using MSVC for C++, but as I'm switching to C to write a very performance-intensive program (an interpreter), I have to find a fitting C compiler.
I've looked at some binaries produced by Turbo C, and even though it's old, they seem pretty straightforward and optimized.
Now I don't know what the best compiler for building an interpreter is, but maybe you can help me.
I've considered GCC, but as I don't know much about it, I can't really be sure.
99.9% of a program's performance depends on the code you write and the language you choose.
You can safely ignore the performance of the compiler.
Stick to MSVC... and don't waste time :)
If I were you, I would take the approach of worrying less about the compiler and worrying more about your own code. Write the code for the interpreter in a reasonable way. Then, profile it, and optimize spots based on how much time they take. That is more likely to produce a performance benefit than using a particular compiler.
If you want a lightweight program, it is not the compiler you need to worry about so much as the code you write and the libraries you use. Most compilers will produce similar results from the same source code.
For example, using C++ with MFC, a basic Windows application used to start off at about 900kB and grow rapidly. Linking with the dynamic MFC DLLs would get you down to a few hundred kB. But by dropping MFC totally (using Win32 APIs directly) and using a minimal C runtime, it was relatively easy to implement the same thing in an .exe of about 25kB or less (IIRC; it's been a long time since I did this).
So ditch the libraries and get back to proper low-level C (or even C++ if you don't use too many "clever" features), and you can easily write very compact applications.
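As a minimal sketch of the "raw Win32, no MFC" idea (a message box rather than a full window, and the build command is an assumption):

    /* Build (assumption): cl /O1 tiny.c user32.lib */
    #include <windows.h>

    int WINAPI WinMain(HINSTANCE hInst, HINSTANCE hPrev,
                       LPSTR cmdLine, int nShow) {
        (void)hInst; (void)hPrev; (void)cmdLine; (void)nShow;
        MessageBoxA(NULL, "Hello from raw Win32", "tiny", MB_OK);
        return 0;
    }

Nothing here drags in MFC or the C++ runtime machinery, which is where the dramatic size difference comes from.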
edit
I've just realised that the question title confused me into talking about lightweight applications rather than concentrating on performance, which appears to be the real thrust of the question. If you want performance, there is no specific need to use C or to move to a painful development environment; just write good, high-performance code. Fundamentally this is about using the right designs and algorithms, then profiling and optimising the resulting code to eliminate bottlenecks and inefficiencies. Note that these days you may get a much bigger bang for your buck by switching to a multithreaded approach than by concentrating on raw code optimisation; make sure you utilise the hardware well.
You can use GCC through MinGW, Eclipse CDT, or one of the other Windows ports. You can optimize for executable size, speed of the resulting executable, or speed of compilation.
C++ was designed to be backward compatible with C, so any C++ compiler should be able to compile pure C. You might want to tell it that it's C and not C++, so the compiler doesn't do name mangling, etc. (see the header sketch below). If the compiler is very good with C++, it should be equally good, or better, with C, because C is much simpler.
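A small illustration of the name-mangling point: the classic header idiom (the function name interp_run is a made-up example) that lets the same declarations be consumed by both C and C++ compilers with unmangled C linkage:

    /* interp.h -- usable from both C and C++ translation units. */
    #ifdef __cplusplus
    extern "C" {    /* tell a C++ compiler to use C linkage here */
    #endif

    int interp_run(const char *source);   /* hypothetical entry point */

    #ifdef __cplusplus
    }
    #endif

Compile the implementation as C, include the header from either language, and the exported symbol stays interp_run rather than a mangled C++ name.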
I would suggest sticking with MSVC. It's a very decent system. But if you are not convinced, compare your options: build the same program with multiple compilers, look at the assembly they produce, measure the resulting executables' performance, and so on.