Compiling object file from an intermediate file of gcc - c

By using the -fdump-tree-* flag, one can dump an intermediate-format file during compilation of a source file. My question is whether one can use that intermediate file as input to gcc to get the final object file.
I'm asking because I want to add some code to the GIMPLE intermediate file (obtained with the flag -fdump-tree-gimple). Sure, I can use hooks and add my own pass, but I don't want to get to that level of complexity yet. I just want to give gcc my modified intermediate file so that it can resume compilation from there and give me the final object file. Any ideas how to achieve this?

GIMPLE is an internal, in-memory format that is hard to dump fully and reload correctly. By comparison, LLVM IR was designed to be dumped to and reloaded from ordinary files (the text and binary forms of such files are fully convertible into each other). You can run the Clang front end to emit LLVM IR, then run the opt program with some optimizations, then with others, and there will be LLVM IR bitcode files between the phases. You can then start code generation from IR bitcode into native code (in theory even for a different platform; see the PNaCl project).
There are some projects for dumping and reloading GCC's internal representation. I know of one such project that was created to integrate gcc with a commercial compiler tool. The author couldn't just link the commercial code with gcc, because gcc is VIRAL (it will infect any linked code with the anti-commercial GPL). So the author wrote a GPL dumper/loader of GIMPLE to an external (XML) format; the proprietary tool read this XML and translated it into other XML of the same format, which was then reloaded with the GPL tool.
In newer gcc you have the option of writing a plugin, which is also VIRAL (23.2.1) in terms of the GPL. A plugin operates on the in-memory representation of the program, so the problem of dumping and reloading GIMPLE via an external file does not arise.
Some plugins can be configured with, or can run, a user-supplied program, e.g. MELT (Lisp) and GCC Python (Python). Some list of gcc plugins is there

There's no built-in facility to translate the text GIMPLE representation back to the original internal GIMPLE representation.
You'll need to use a custom front end (such as the suggested GIMPLE FE) to make sense of dumped GIMPLE.

Related

Extract function source code from existing open source C library

I need to extract source code for a function from the existing C library (the library is open source). The problem is that functions are created using macros in header files, and when I write a test project and link the library to it, the debugger points me to that header file on the 'go to definition' action. I have the source code of the library, and I guess I need to build it together with my test code (maybe this is not correct, I am not sure). Any advice on how to proceed, what to use? Thank you.
Several C compilers are themselves open source. Both GCC and Clang are (and so is tinycc). So you could legally improve them (but that could take months of work).
In addition, recent GCC versions (e.g. in July 2020, GCC 10) accept plugins. Your GCC plugin could work on some internal GCC representations (e.g. GIMPLE, GENERIC) and so would know about functions (even those obtained by preprocessor expansion).
You could also consider using some open source static program analyzers, such as Frama-C or Clang static analyzer.
PS. Take into account open source license issues (legal ones). I am not a lawyer (and you might need to ask one, if you mix various software of different open source licenses).

How to check if object code is 16/32 bit?

Is there any way to identify whether a .obj file or an .exe file is 16-bit or 32-bit?
Basically I want to create a smart linker that will automatically identify which linker the given files need to be passed to.
Preferred Language: C (it can be different, if needed)
I am looking for a solution that can read the bytes of an .exe or the code of an .obj file and then determine whether it's 16-bit or 32-bit. Even an algorithm would do.
Note: I know that object code and an executable are two different entities.
All of this information is encoded in the binary object according to the relevant Application Binary Interface (ABI).
The current Linux ABI uses the Executable and Linkable Format (ELF), and you can query a specific binary file using a tool such as readelf or objdump.
The current Windows ABI uses the Portable Executable (PE) format. I'm not familiar with the toolset here, but a quick Google search suggests there are programs that function the same as readelf:
http://www.pe-explorer.com/peexplorer-tour.htm
Here's the Microsoft specification of the PE format:
https://learn.microsoft.com/en-us/windows/win32/debug/pe-format
However, neither of those formats is normally used for 16-bit binaries anymore. The older format on Linux is called "a.out", which can be read and queried with objdump (I'm not sure about readelf). The older Windows/DOS formats are called MZ and NE. Again, I'm not familiar with the tool support for these older Windows formats.
Wikipedia has a pretty comprehensive list of all the popular executable file formats that have been used, with links to more info:
https://en.wikipedia.org/wiki/Comparison_of_executable_file_formats
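Since the question asks for an algorithm in C, here is a minimal sketch of how the headers described above could be inspected by hand. It relies only on well-known header fields: the ELF magic followed by the class byte at offset 4, the "MZ" stub with the extended-header offset at 0x3C, and the "PE\0\0" or "NE" signature found there. The names binary_kind and classify_binary are hypothetical, invented for this illustration. It covers executables only (not .obj files), reports LE/LX headers as unknown, and treats any MZ file without a recognised extended header as a plain 16-bit DOS program, which is an assumption rather than a guarantee.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical result type for this sketch; not part of any real toolchain. */
typedef enum {
    KIND_UNKNOWN,
    KIND_ELF32,
    KIND_ELF64,
    KIND_DOS_MZ16,  /* plain MZ header: assumed to be a 16-bit DOS program */
    KIND_NE16,      /* NE header: 16-bit Windows or OS/2 */
    KIND_PE32,      /* PE with optional-header magic 0x10b */
    KIND_PE32_PLUS  /* PE with optional-header magic 0x20b (64-bit) */
} binary_kind;

/* Read little-endian fields byte by byte so host endianness does not matter. */
static uint16_t read_le16(const unsigned char *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t read_le32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Classify the first len bytes of a file held in buf. */
binary_kind classify_binary(const unsigned char *buf, size_t len)
{
    /* ELF: magic 0x7F 'E' 'L' 'F', then the class byte (1 = 32-bit, 2 = 64-bit). */
    if (len >= 5 && memcmp(buf, "\x7f" "ELF", 4) == 0) {
        if (buf[4] == 1) return KIND_ELF32;
        if (buf[4] == 2) return KIND_ELF64;
        return KIND_UNKNOWN;
    }

    /* DOS and Windows executables all begin with the "MZ" stub header. */
    if (len < 2 || buf[0] != 'M' || buf[1] != 'Z')
        return KIND_UNKNOWN;
    if (len < 0x40)
        return KIND_DOS_MZ16;

    /* Offset 0x3C holds the file offset of the extended header, if any. */
    size_t off = read_le32(buf + 0x3C);
    if (off > len - 4)
        return KIND_DOS_MZ16;  /* offset points outside the data we have */

    const unsigned char *sig = buf + off;
    if (memcmp(sig, "PE\0\0", 4) == 0) {
        /* 4-byte signature + 20-byte COFF header, then the optional-header magic. */
        if (len - off < 26) return KIND_UNKNOWN;
        uint16_t magic = read_le16(sig + 24);
        if (magic == 0x10b) return KIND_PE32;
        if (magic == 0x20b) return KIND_PE32_PLUS;
        return KIND_UNKNOWN;
    }
    if (sig[0] == 'N' && sig[1] == 'E')
        return KIND_NE16;
    if (sig[0] == 'L' && (sig[1] == 'E' || sig[1] == 'X'))
        return KIND_UNKNOWN;   /* LE/LX headers are not handled in this sketch */
    return KIND_DOS_MZ16;
}
```

A caller would read the start of the file into a buffer (a few hundred bytes is usually enough, though the extended header may sit further in) and pass it to classify_binary; a "smart linker" could then dispatch on the returned value.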

Linking object files of differing types

I am trying to link object files which were originally created by two different assemblers. We have some legacy assembly code that was assembled into object files using an old MRI assembler for the 68332 processor. We are developing a new application with the GNU Binutils m68k v2.24. We would like to use the original object files as built by the old assembler without change. I have configured our build environment to do this. For historic reasons, our build environment links into three output formats: Srecord, ieee, and ELF. When I run the build, it succeeds without error for the Srecord and ieee formats. However, for the ELF output format, I receive the following error:
m68k-elf-ld: failed to merge target specific data of file
As a result the ELF file is not created.
I am first trying to understand what this error message means, but I have not been able to. If anyone knows the GNU Binutils ld documentation well enough to point me to where this error is defined, I would appreciate it.
I have actually loaded our target and run the Srecord output. It seems to pass many tests the same as before so it appears that it is running to some degree.
It looks like our legacy object files may be in COFF format. I would guess that this is the problem. Is there any way to convert a COFF file to ELF format?
Thanks in advance for any support.
objcopy can be used to convert between formats. However, to do this it has to have been configured to understand both formats. You can check what formats it accepts with objcopy --info (a shortened list appears at the end of objcopy --help).
If your objcopy doesn't support the required formats, then you'll have to build binutils yourself.

Where in the GCC source code does it compile to the different assembly languages?

Where is the code in the GCC source code that actually constructs the assembly for the different architectures?
Wondering how many different assembly languages it compiles to, and how it actually does this (by taking a look at the source code).
Is it in the gcc repo somewhere, or in another repo? I have started to dig around but haven't found anything.
https://github.com/gcc-mirror/gcc
For example, here is some of the assembly generating code in V8:
https://github.com/v8/v8-git-mirror/tree/master/src/x64
Is there anything equivalent for GCC?
I am wondering because it's a mystery how GCC does this, and it would be a great way to learn how compilers are actually implemented down to the assembly level.
The .md (machine description) files in the GCC source contain the material used to generate assembly. GCC contains several specialized C/C++ code generators (some of which translate the .md files into code that emits assembly).
GCC is a very complex program. The documentation of GCC MELT (an obsolete project) contains several interesting links and slides, notably referring to the Indian GCC Resource Center.
Most of the optimizations in GCC happen in the middle-end (which is mostly independent of source language or target system), notably with many passes working on the Gimple representations.
The GCC repo was an SVN repository when this was written; it has been a Git repository since January 2020.
See also this answer, notably the pictures inside it.
The actual source code for GCC is most accessible from here:
https://gcc.gnu.org/svn.html
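At the time of writing, the software was accessible via SVN (Subversion), a source code control system. This would be installed on many versions of Linux/UNIX, but if not on your platform, you can install the svn kit and then fetch the source using the following command:
svn checkout svn://gcc.gnu.org/svn/gcc/trunk SomeLocalDir
GCC development has since moved to Git, so on a current system the equivalent is:
git clone git://gcc.gnu.org/git/gcc.git SomeLocalDir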
GCC is complex, and it takes significant experience to understand how it actually compiles to different architectures.
In a nutshell, GCC has three major components: front-end, middle-end and back-end processing. The front end parses the source language and understands its syntax (C, C++, Objective-C, etc.). It breaks the code down into a language-independent tree representation (GENERIC, then GIMPLE), which is handed on to the later stages for compilation to the target environment.
The middle part performs code analysis and optimisation, attempting to restructure the code so that the best possible output is generated at the end of the full process. Technically, optimisation can occur at any part of the process as patterns are discovered during analysis.
The back end lowers the code to a target-oriented intermediate form (RTL, the register transfer language), which is not yet final executable code. Depending on the target, this "pseudo-code" is optimised for registers, bit sizes, endianness, and so on, and is then emitted as assembly text. The final code is generated during the assembly phase, in which the assembler converts that text into machine-executable instructions.
It's important to note that the compiler has many options to deal with output formats so you can create output to many classes of architecture, usually out of the box. For cross-compiling and target compiler options, try checking out this link:
https://gcc.gnu.org/install/configure.html

How do I compile C code to a raw os-less binary?

Considering that C is a systems programming language, how can I compile C code into raw x86 machine code that could be invoked without the presence of an operating system? (I.e., you can assume I have a boot sector that loads the raw machine code from disk into memory and then jumps directly to the first instruction.)
And now, for bonus points: Ideally, I'd like to compile using Visual Studio 2010's compiler because I've already got it. Failing that, what's the best way to accomplish the task, without having to install a bunch of dependencies or having to make large sweeping configuration changes across my entire system? I'd be compiling on Windows 7.
Usually, you don't. Instead, you compile your code normally, and then (either with the linker or some other tool) extract a raw binary from the object file.
For example, on Linux, you can use the objcopy tool to copy an object file to a raw binary file.
$ objcopy -O binary object.elf object.binary
First off, you don't use any library functions that require a system call (printf, fopen, read, etc.). Then you compile the C files normally. The major difference is the linker step: if you are used to letting the C compiler call the linker (or letting some GUI do it), you will likely need to take that over manually in some form. The specific solution depends on your tools. You will need some bootstrap code (the small amount of assembly needed to satisfy the assumptions of C compilers and programmers and to launch the entry point of your C program), and a linker script or the right linker command-line options to control the address space of the binary as well as to link the objects together. Then, depending on the output format of the linker, you might have to convert the result to some other binary format (Intel HEX, SREC, EXE, COM, COFF, ELF, raw binary, etc.) to be compatible with wherever it is going to be loaded or run.
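To make the "no system calls" point concrete, here is a minimal sketch of what such C code might look like. It uses only the freestanding headers stddef.h and stdint.h and writes text directly into display memory instead of calling printf. The names put_string and kernel_main are hypothetical, and the 0xB8000 address assumes a PC in 80x25 VGA text mode; other targets will differ. The bootstrap assembly and the linker script mentioned above are not shown.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumption: a PC in 80x25 VGA text mode, with character cells at 0xB8000. */
#define VGA_TEXT_BASE 0xB8000u
#define VGA_COLS 80
#define VGA_ROWS 25

/*
 * Write a NUL-terminated string into VGA-style cells (low byte: character,
 * high byte: colour attribute). Stops at the end of the string or after
 * ncells cells, whichever comes first, and returns the number written.
 * Taking the buffer as a parameter keeps the routine testable on a host.
 */
size_t put_string(volatile uint16_t *cells, size_t ncells,
                  const char *s, uint8_t attr)
{
    size_t i = 0;
    while (i < ncells && s[i] != '\0') {
        cells[i] = (uint16_t)(((uint16_t)attr << 8) | (unsigned char)s[i]);
        i++;
    }
    return i;
}

/* Hypothetical entry point that the bootstrap code would jump to. */
void kernel_main(void)
{
    volatile uint16_t *vga = (volatile uint16_t *)(uintptr_t)VGA_TEXT_BASE;

    put_string(vga, (size_t)VGA_COLS * VGA_ROWS, "Hello", 0x07);
    for (;;) {
        /* There is no operating system to return to, so spin forever. */
    }
}
```

With GCC, code like this is typically compiled with -ffreestanding and linked with -nostdlib, and the linked result is then turned into a raw binary with objcopy -O binary as shown earlier; the exact load address and linker options depend on your boot sector.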
