LLVM, Clang and LLC optimization pass - c

I'm implementing a new back-end to LLVM, starting with the CBackend target.
The end goal is to use "llc" to generate source transforms of input C code.
However, there are a number of optimizations I'd like to make, which don't seem to be very well supported within this context.
The LLVM object code is very low level, and I have to inspect it to re-discover what's actually going on. This would be a lot simpler to do at the AST level.
However, it appears that the AST level is a Clang-internal construct, and there's no easy way to plug into this.
Do I have to inspect the LLVM object code and reverse-engineer the higher-level flow myself? (Does each back-end have to do this? That seems wasteful!)

In general, you cannot reverse-engineer everything, so you have only two options:
Do everything at the Clang AST level.
Emit additional information (e.g. via metadata) that helps you recover some aspects of the input source.
But really, you shouldn't do source-to-source transforms at the LLVM IR level; it's the wrong tool for the job. You can certainly plug into the AST level. For example, the Clang sources contain a rewriter that turns ObjC code into plain C.
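The metadata option can be tried without writing a pass. As a minimal sketch, assuming Clang is the front end: its annotate attribute attaches an arbitrary string to a declaration, and for functions that string should appear in the emitted IR under @llvm.global.annotations (inspect the output of clang -S -emit-llvm). The ANNOTATE macro, the tag string and the sum_array function are names made up for this illustration.

```c
#include <stddef.h>

/* ANNOTATE is a hypothetical helper macro for this sketch. Under Clang it
   expands to the annotate attribute; under other compilers it expands to
   nothing, so the file still builds. */
#if defined(__clang__)
#define ANNOTATE(text) __attribute__((annotate(text)))
#else
#define ANNOTATE(text)
#endif

/* "src:hot_loop" is an arbitrary tag chosen for illustration; a custom
   back-end or pass would have to agree on its meaning. */
ANNOTATE("src:hot_loop")
long sum_array(const int *values, size_t count)
{
    long total = 0;
    for (size_t i = 0; i < count; ++i)
        total += values[i];
    return total;
}
```

The annotation does not change what the function computes; it only travels along with the function into the IR, where a custom tool can look it up.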

Related

What would be a simple to use JIT library?

I'm trying to write a language runtime (and a language itself) that is similar to .NET or to the JVM. It's got a form of bytecode that is custom.
What I want is a way to translate said bytecode to actual, runnable machine code. Since I don't want to write such a translator myself (this is more of a toy project/personal side project), I want to find a good JIT library to use.
Here's what I want the library to do:
The library should be as easy to use as possible (toy project and I don't really have much experience here)
The library should support at least x86_64 (development machine), though preferably it should cover other architectures as well
The library should preferably do some low level optimizations (register tracking and allocation, reducing memory accesses etc); those optimizations shouldn't be very expensive to do though (I will myself do other optimizations to e.g. remove virtual calls and convert them to direct ones, for example). I can accept a library with no optimization if it's easiest to use though.
The library must have an interface that is usable from C (preferred) or C++ (acceptable).
I will use Boehm GC for garbage collection, if it matters (probably doesn't, but just in case). Maybe a compacting GC would be nice, but I guess I shouldn't combine the questions...
I would suggest LLVM. There are some great tutorials on how to implement your own language with it, and the basic stuff is not too complicated. You also get the option to do a lot of more advanced stuff later on. As a bonus, not only can you use the JIT, you can also statically compile and optimize your binaries. LLVM also has a C interface and can target all common CPU architectures and even a lot of more obscure ones.

Where in the GCC source code does it compile to the different assembly languages?

Where is the code in the GCC source code that actually constructs the assembly for the different architectures?
Wondering how many different assembly languages it compiles to, and how it actually does this (by taking a look at the source code).
Is it in the gcc repo somewhere, or in another repo? I have started to dig around but haven't found anything.
https://github.com/gcc-mirror/gcc
For example, here is some of the assembly generating code in V8:
https://github.com/v8/v8-git-mirror/tree/master/src/x64
Is there anything equivalent for GCC?
I am wondering because it's a mystery how GCC does this, and it would be a great way to learn how compilers are actually implemented down to the assembly level.
The .md (machine description) files in the GCC source contain the patterns used to generate assembly. GCC contains several specialized C/C++ code generators, and some of them translate the .md files into code that emits assembly.
GCC is a very complex program. The documentation of GCC MELT (an obsolete project) contains several interesting links and slides, notably referring to the Indian GCC Resource Center.
Most of the optimizations in GCC happen in the middle-end (which is mostly independent of source language or target system), notably in the many passes working on the GIMPLE representation.
The GCC repo was an SVN repository when this was written; the project has since migrated to Git.
See also this answer, notably the pictures inside it.
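The idea of describing each target as data and letting generic code emit the assembly can be sketched in a few lines. This is a toy illustration only: GCC's real .md files are written in their own pattern language and are far richer. Every name here (toy_op, toy_target, toy_emit, toy_x86, toy_arm) is invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Target-independent operations the "compiler" knows about. */
enum toy_op { TOY_ADD, TOY_SUB, TOY_MUL, TOY_OP_COUNT };

/* A toy "machine description": one mnemonic per operation, plus a flag
   saying whether instructions take two operands or three. */
struct toy_target {
    const char *name;
    const char *mnemonic[TOY_OP_COUNT];
    int three_operand;
};

const struct toy_target toy_x86 = { "toy-x86", { "add", "sub", "imul" }, 0 };
const struct toy_target toy_arm = { "toy-arm", { "add", "sub", "mul" }, 1 };

/* Generic emitter: writes "dst = dst op src" as one assembly line for the
   given target. Returns the value returned by snprintf. */
int toy_emit(const struct toy_target *target, enum toy_op op,
             const char *dst, const char *src, char *out, size_t out_size)
{
    const char *mnemonic;
    assert(op < TOY_OP_COUNT);
    mnemonic = target->mnemonic[op];
    if (target->three_operand)
        return snprintf(out, out_size, "%s %s, %s, %s", mnemonic, dst, dst, src);
    return snprintf(out, out_size, "%s %s, %s", mnemonic, dst, src);
}
```

Calling toy_emit with toy_x86 and TOY_MUL on eax and ebx produces the text imul eax, ebx, while the same call with toy_arm on r0 and r1 produces mul r0, r0, r1. Adding a target means adding a table, not new emitter code.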
The actual source code for GCC is most accessible from here:
https://gcc.gnu.org/svn.html
The software is accessible via SVN (subversion), a source code control system. This would be installed on many versions of Linux/UNIX, but if not on your platform, you can install the svn kit and then fetch the source using the following command:
svn checkout svn://gcc.gnu.org/svn/gcc/trunk SomeLocalDir
GCC is complex, and understanding how it compiles to different architectures takes significant experience.
In a nutshell, GCC has three major components: front-end, middle and back-end processing. The front end parses the source language (C, C++, Objective-C, etc.) and lowers the code to a portable intermediate form, which is then passed on toward the back end for compilation to the target environment.
The middle part performs code analysis and optimisation, attempting to restructure the code to generate the best possible output at the end of the full process. Technically, optimisation can occur at any stage of the process as patterns are discovered during analysis.
The back end lowers the code to a low-level intermediate representation (RTL, not yet final executable code). Depending on the intended target, this "pseudo-code" is optimised for register usage, bit sizes, endianness, and so on. The final code is then generated during the assembly phase, which converts the back-end output into machine-executable instructions.
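The three stages can be seen in miniature in a toy compiler. The sketch below is not how GCC is structured internally; it assumes a tiny expression language (integers, the variable x, + and *) and a made-up stack-machine assembly, and all function and type names are invented for the illustration.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

enum kind { NUM, VAR, ADD, MUL };

struct node {
    enum kind kind;
    long value;              /* used by NUM */
    struct node *lhs, *rhs;  /* used by ADD and MUL */
};

static struct node pool[64]; /* fixed arena, enough for short inputs */
static int pool_used;
static const char *cursor;   /* current parser position */

static struct node *make(enum kind kind, long value,
                         struct node *lhs, struct node *rhs)
{
    struct node *n = &pool[pool_used++];
    n->kind = kind; n->value = value; n->lhs = lhs; n->rhs = rhs;
    return n;
}

/* Front end: recursive descent for  sum := term ('+' term)*,
   term := atom ('*' atom)*,  atom := integer | 'x'.
   No whitespace or error handling, to keep the sketch short. */
static struct node *parse_atom(void)
{
    long v = 0;
    if (*cursor == 'x') { cursor++; return make(VAR, 0, NULL, NULL); }
    while (isdigit((unsigned char)*cursor)) v = v * 10 + (*cursor++ - '0');
    return make(NUM, v, NULL, NULL);
}

static struct node *parse_term(void)
{
    struct node *n = parse_atom();
    while (*cursor == '*') { cursor++; n = make(MUL, 0, n, parse_atom()); }
    return n;
}

static struct node *parse_sum(void)
{
    struct node *n = parse_term();
    while (*cursor == '+') { cursor++; n = make(ADD, 0, n, parse_term()); }
    return n;
}

/* Middle: target-independent constant folding on the tree. */
static struct node *fold(struct node *n)
{
    if (n->kind != ADD && n->kind != MUL) return n;
    n->lhs = fold(n->lhs);
    n->rhs = fold(n->rhs);
    if (n->lhs->kind == NUM && n->rhs->kind == NUM) {
        long l = n->lhs->value, r = n->rhs->value;
        n->value = (n->kind == ADD) ? l + r : l * r;
        n->kind = NUM;
    }
    return n;
}

/* Back end: append made-up stack-machine assembly to out. */
static void emit(const struct node *n, char *out, size_t size)
{
    size_t used = strlen(out);
    switch (n->kind) {
    case NUM: snprintf(out + used, size - used, "push %ld\n", n->value); break;
    case VAR: snprintf(out + used, size - used, "push x\n"); break;
    default:
        emit(n->lhs, out, size);
        emit(n->rhs, out, size);
        used = strlen(out);
        snprintf(out + used, size - used, "%s\n", n->kind == ADD ? "add" : "mul");
    }
}

/* Driver: front end, then middle, then back end. */
void compile(const char *source, char *out, size_t size)
{
    pool_used = 0;
    cursor = source;
    out[0] = '\0';
    emit(fold(parse_sum()), out, size);
}
```

For the input 2+3*4+x, the middle stage folds 2+3*4 into 14 before the back end runs, so the emitted text is push 14, push x, add (one instruction per line). Only emit would need to change to support a different target.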
It's important to note that the compiler has many options to deal with output formats, so you can create output for many classes of architecture, usually out of the box. For cross-compiling and target compiler options, try checking out this link:
https://gcc.gnu.org/install/configure.html

Translation to LLVM IR directly or via C/Clang

Let's say someone wants to statically compile a given language using LLVM. What would be the biggest differences (advantages and disadvantages) between translating it to C first and then using Clang, and translating it directly to IR?
The obvious answer, I guess, is that a front-end that knows the source language can more easily produce an optimized IR representation than Clang can when working from generated C.
Am I missing something here?
Advantages of using a generic C backend:
You can use any C compiler (not just Clang)
Intermediate code is easier to debug when it's in such a high-level language
Depending on your source language semantics, it might be easier to translate it via C (but not necessarily)
And disadvantages are:
If your language is compiled incrementally (e.g., no clearly separated modules, or complex macro system, or whatever else), compiling via LLVM IR in a single module with immediate JIT-compilation makes more sense than generating hundreds of tiny C modules. In other words, C is enforcing separate compilation.
If your source language semantics is too far from C, compiling it straight into a lower level can be easier.
Not all the LLVM functionality is directly accessible from C. E.g., intrinsics, alternative calling conventions, debug metadata for a higher level language.
Clang is big; excluding it will improve your memory footprint.
Clang is not easy to maintain: it depends on the presence and exact locations of the headers, depends on some parts of gcc, etc. Without it, bare LLVM can be used on its own and dependencies may be kept self-contained.
Optimisations are in most cases not an issue. Clang deliberately generates extremely unoptimised LLVM IR; LLVM, not the frontends, should take care of all the optimisations. Unless, of course, you can do some high-level optimisations, but then they won't depend on your backend choice.

Hints to the compiler using llvm

I am working on a tool that takes the LLVM IR and modifies it. I'm interested in allowing the programmer to give hints to the compiler. For example, he can give the hint that a particular loop is compute intensive. For this purpose, one thing that comes to my mind is to use a pragma. So my question is, how can we make the pragmas work? Can I have the pragma information there in the LLVM IR? What are the options for such kind of task?
This question can refer to several different things:
If you're looking to understand how to implement a pragma, take a look at how Clang does it, i.e. what the various pragma directives are translated to.
If you want to understand the existing hints (for instance inlinehint, byval etc.), look at attributes - for example Function Attributes.
If you want something more flexible and proprietary, you can use metadata. LLVM itself uses it for various purposes, but in your own compiler you're very free in what you can do with it. Hints to the compiler are one possible application.
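Two of these mechanisms can be observed with existing Clang features. As a small sketch, assuming Clang as the front end: the loop pragma below is lowered to llvm.loop metadata attached to the loop's branch (here an llvm.loop.unroll.count entry), and a function declared with the inline keyword typically receives the inlinehint function attribute when optimizing. The function names are made up for the illustration, and the exact IR can vary between Clang versions.

```c
#include <stddef.h>

/* Declared inline: Clang typically marks this function inlinehint in the
   IR when optimization is enabled. */
static inline int square(int x)
{
    return x * x;
}

long sum_of_squares(const int *values, size_t count)
{
    long total = 0;
    /* Clang-specific loop hint, guarded so other compilers do not see an
       unknown pragma. Clang records it as llvm.loop metadata. */
#if defined(__clang__)
#pragma clang loop unroll_count(4)
#endif
    for (size_t i = 0; i < count; ++i)
        total += square(values[i]);
    return total;
}
```

Compiling with clang -O1 -S -emit-llvm and searching the output for llvm.loop and inlinehint shows where the hints end up. A tool that rewrites the IR can read them from there; a hint that Clang does not already understand would need its own pragma handler or custom metadata.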

How to make use of Clang's AST?

I am looking at making use of Clang's AST for my C code to do some analysis over the AST. Some pointers on where to start, how to obtain Clang's AST, tutorials or anything in this regard would be of great help!
I have been trying to find some, and I got this link, which was created 2 years ago. But for some reason it is not working for me. The sample code in the tutorial gives me too many errors, so I am not sure whether I built the code improperly or something is wrong with the tutorial. I would be happy to start from some other page as well.
Start with the tutorial linked by sharth. Then go through Clang's Doxygen. Start with SemaConsumer.
Read a lot of source code. Clang is a moving target. If you are writing tools based on clang, then you need to recognize that clang is adding and fixing features daily, so you should be prepared to read a lot of code!
You probably want the stable C API provided in the libclang library, as opposed to the unstable C++ internal APIs that others have mentioned.
The best documentation to start with currently is the video/slides of the talk, "libclang: Thinking Beyond the Compiler" available on the LLVM Developers Meeting website.
However, do note that the stability of the API comes at a cost of comprehensiveness. You won't be able to do everything with this API, but it is much easier to use.
To obtain the AST and get to know the stages of the frontend, there is a frontend chapter in the book "LLVM core libraries". Basically it has the following flow (in the case of llvm-4.0.1; it should be similar for later versions):
cc1_main.cpp:cc1_main (ExecuteCompilerInvocation)
CompilerInstance.cpp:CompilerInstance::ExecuteAction
ParseAST.cpp:clang::ParseAST (Consumer->HandleTranslationUnit(S.getASTContext()))
CodeGenAction.cpp:HandleTranslationUnit
The last function handles the whole translation unit (top-level decls are already handled at this point) and calls EmitBackendOutput to do the backend work. So this function is a good spot to do something with the complete AST before the backend output is emitted.
In terms of how to manipulate the AST, Clang has a basic tutorial on this: http://clang.llvm.org/docs/RAVFrontendAction.html.
Also look at ASTDumper.cpp. It's the best example of visiting the AST.
Another good tutorial: https://jonasdevlieghere.com/understanding-the-clang-ast/ teaches you how to find a specific call expr in the AST via three different approaches.
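Before writing any visitor code, it helps to look at the AST Clang builds for a small input. The file below is a made-up sample (the file name sample.c and both functions are assumptions for this illustration); running clang -Xclang -ast-dump -fsyntax-only sample.c on it prints the tree, and the comments name the node kinds to look for.

```c
/* Appears in the dump as a FunctionDecl with two ParmVarDecl children and
   a CompoundStmt body. */
int add(int a, int b)
{
    /* ReturnStmt -> BinaryOperator '+' -> ImplicitCastExpr -> DeclRefExpr */
    return a + b;
}

/* Contains two nested CallExpr nodes, the kind of node the tutorials
   above search for. */
int apply_twice(int value)
{
    return add(add(value, 1), 1);
}
```

Matching the dump against the source line by line is a quick way to learn which node classes a visitor or matcher needs to handle.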
I find ASTUnit::LoadFromCompilerInvocation() to be the easiest way to construct the AST.
This link may give you some ideas http://comments.gmane.org/gmane.comp.compilers.clang.devel/12471

Resources