AVR-GCC file output structure - file

I already searched the web, especially the avr-gcc website.
I want to know the STRUCTURE of the output file, of sourcecode, compiled with avr-gcc.
Example of a standard Microsoft .EXE file:
00h DW Signature word.
"N" is low-order byte.
"E" is high-order byte.
02h DB Version number of the linker.
03h DB Revision number of the linker.
Can someone please tell me the avr-gcc output file structure?
Thank you. -MW
edit:
As Rev1.0 said, it's the Intel-HEX format.

Since you talk about AVR, you probably mean the format of the HEX-file? That is encoded as Intel HEX format.
EDIT Regarding your question from the comment:
I see the Header has the field "DATA". What exactly is in that field? The pure assembly?
Each line of the HEX file is called a "record". There are several types of records. Depending on the record type, the data contents have a different meaning. A "data record" holds the actual firmware/program data. That is lowest level machine code, not assembly. It represents exactly the data that resides in the flash memory after the device has been programmed.

Can someone please tell me the avr-gcc output file structure?
avr-gcc is just a driver program that calls sub-processes on different files and file formats. For example, the compiler proper reads pre-processed input (text) and writes assembly (text).
The GNU assembler reads assembly (*.s, text) as generated by the compiler and writes object files (*.o, ELF32)
The GNU linker / locator reads this object files and resolves references to libraries like libgcc (*.a, ELF32) and produces a final executable (ELF32).
Depending on your loader, you can use ELF directly (for example with avrdude).
If you want something else like Intex HEX or plain binary as an AVR sees it, you'll convert ELF32 to the desired output by means of avr-objcopy from GNU Binutils.
In general, one wants to keep ELF as long as possible because formats like IHEX are "dumb". They don't have additional information like symbol info or debug info, these formats are only used to upload a program to an AVR, but most modern tools understand ELF as well.

Related

radare2 load register map

I'm reversing some stm32f030 code I downloaded from the chip. I do understand the stm32s and arm assembly but I'm completely new to radare2.
There are many special registers e.g. 0x40021000 is RCC_CR, 0x40021004 is RCC_CFGR, 0x48000000 is GPIOA_MODER an so on. s. https://www.st.com/resource/en/reference_manual/dm00091010-stm32f030x4x6x8xc-and-stm32f070x6xb-advanced-armbased-32bit-mcus-stmicroelectronics.pdf
Is there a way to import register definitions in some format so that the code analysis can automatically flag them? Or another way so that all referenzes to them are named?
From the Radare manual on symbols, you can read a dwarf file. So, you can make a dwarf with external definitions for the register at the absolute addresses and it may be able to use them in the disassembler of a second binary.
With bin-utils, you can merge a binary with the symbols. Radare looks to have lots of utilities which can probably do the same thing; if it is not able to merge symbols from a separate DWARF file.. and this step is entirely un-needed.

Get start and end of a section

Foreword
There already exist questions like this one, but the ones I have found so far were either specific to a given toolchain (this solution works with GCC, but not with Clang), or specific to a given format (this one is specific to Mach-O). I tagged ELF in this question merely out of familiarity, but I'm trying to figure out as "portable" a solution as reasonably possible, for other platforms. Anything that can be done with GCC and GCC-compatible toolchains (like MinGW and Clang) would suffice for me.
Problem
I have an ELF section containing a collection of relocatable "gadgets" that are to be copied/injected into something that can execute them. They are completely relocatable in that the raw bytes of the section can be copied verbatim (e.g. by memcpy) to any [correctly aligned] memory location, with little to no relocation processing (done by the injector). The problem is that I don't have a portable way of determining the size of such a section. I could "cheat" a little by using --section-start, and outright determine a start address, but that feels a little hacky, and I still wouldn't have a way to get the section's end/size.
Overview
Most of the gadgets in the section are written in assembly, so the section itself is declared there. No languages were tagged because I'm trying to be portable with this and get it working for various architectures/platforms. I have separate assembly sources for each (e.g. ARM, AArch64, x86_64, etc).
; The injector won't be running this code, so no need to be executable.
; Relocations (if any) will be done by the injector.
.section gadgets, "a", #progbits
...
The more "heavy duty" code is written in C and compiled in via a section attribute.
__attribute__((section("gadgets")))
void do_totally_innocent_things();
Alternatives
I technically don't need to make use of sections like this at all. I could instead figure out the ends of each function in the gadget, and then copy those however I like. I figured using a section would be a more straightforward way to go about it, to keep everything in one modular relocatable bundle.
I'm not sure if you considered this or this is out of picture but you could read the elf headers.
This would be sort of 'universal' as you can do the same thing with Mach-O binaries
So for example:
Creating 3 integer variables inside the 'custom_sect' section
These would add up to 12 or 0xC bytes which if we read the headers we can confirm:
Size with readelf
and here is how a section is represented in the ELF executables: ELF section representation
So each section will have its own size property which you can just read out

What's the difference between binary and executable files mentioned in ndisasm's manual?

I want to compile my C file with clang and then decompile it with with ndisasm (for educational purposes). However, ndisasm says in it's manual that it only works with binary and not executable files:
ndisasm only disassembles binary files: it has
no understanding of the header information
present in object or executable files. If you
want to disassemble an object file, you should
probably be using objdump(1).
What's the difference, exactly? And what does clang output when I run it with a simple C file, an executable or a binary?
An object file contains machine language code, and all sorts of other information. It sounds like ndisasm wants just the machine code, not the other stuff. So the message is telling you to use the objdump utility to extract just the machine code segment(s) from the object file. Then you can presumably run ndisasm on that.
And what does clang output when I run it with a simple C file, an executable or a binary?
A C compiler is usually able to create a 'raw' binary, which is Just The Code, hold the tomato, because for some (rare!) purposes that can be useful. Think, for instance, of boot sectors (which cannot 'load' an executable the regular way because the OS to load them is not yet started) and of programmable RAM chips. An Operating system in itself usually does not like to execute 'raw binary code' - pretty much for the same reasons. An exception is MS Windows, which still can run old format .com binaries.
By default, clang will create an executable. The intermediate files, called object files, are usually deleted after the executable is linked (glued together with library functions and an appropriate executable header). To get just a .o object file, use the -c switch.
Note that Object files also contain a header. After all, the linker needs to know what the file contains before it can link it to other parts.
For educational purposes, you may want to examine the object file format. Armed with that knowledge it should be possible to write a program that can tell you at what offset in the file the actual code starts. Then you can feed that information into ndisasm.
In addition to the header, files may contain even more data after the instructions. Again, ndisasm does not know and nor does it care. If your test program contains a string Hello world! somewhere at the end, it will happily try to disassemble that as well. It's up to you to recognize this garbage as such, and ignore what ndisasm does to it.

Usage differences between. a.out, .ELF, .EXE, and .COFF

Don't get me wrong by looking at the question title - I know what they are (format for portable executable files). But my interest scope is slightly different
MY CONFUSION
I am involved in re-hosting/retargeting applications that are originally from third parties. The problem is that sometimes the formats for object codes are also in .elf, .COFF formats and still says, "Executable and linkable".
I am primarily a Windows user and know that when you compile and assemble your C/C++ code, you get something similar to .o or .obj. that are not executable (well, I never tried to execute them). But when you complete linking static and dynamic libraries and finish building, the executable appears. My understanding is that you can then go about and link that executable or "bash" test it with some form of script if necessary.
However, in Linux (or UNIX-like systems) there are .o files after you compile and assemble the C/C++ code. And once the linking is done, the executable is in a.out format (at least in Ubuntu distribution of Linux). It may very well be .elf in some other distrib. In my quick web search none of the sources mentioned anything about .o files as executables.
QUESTIONS
Therefore my question turns into the followings:
What is the true definitions for portable executables and object code?
How is it that Windows and UNIX platform covers both executables annd object code under the same file format (.COFF, .elf).
Am I misinterpreting "Linkable"? My interpretation of "Linkable" is something that is compiled object code and can then be "linked" to other static/dynamic link libraries. Is this a stupid thought?
Based on question 1. (and perhaps 2) do I need to use symbol tables (e.g. .LUM or .MAP files) with object code then? Symbols as in debug symbols and using them when re-hosting the executables/object files on a different machine.
Thanks in advance for the right nudges. Meanwhile, I will keep digging and update the question if necessary.
UPDATE
I have managed to dig this out from somewhere :( Seems like a lot to swallow to me.
I am primarily a Windows user and know that when you compile your C/C++ code, you get something similar to .o or .obj. that are not executable
Well, last time I compiled stuff on Windows, the result of the compilation was an .obj file, which is exactly what its name suggests: it's an object file. You're right in that it's not an executable in itself. It contains machine code which doesn't (yet) contain enough information to be directly run on the CPU.
However, in Linux (or UNIX-like systems) there are .o files after you compile the C/C++ code. And once the linking is done, the executable is in a.out format (at least in Ubuntu distribution of Linux). It may very well be .elf in some other distrib.
Living in the 90's, that is :P No modern compilers I am aware of target the a.out format as their default output format for object code. Maybe it's a misleading default of GCC to put the object code into a file called a.out when no explicit output file name is specified, but if you run the file command on a.out, you'll find out that it's an ELF file. The a.out format is ancient and it's kind of "de facto obsolete".
What is the true definitions for portable executables and object code?
You've already got the Wikipedia link to object files, here's the one to "Portable Executable".
How is it that Windows and UNIX platform covers both executables annd object code under the same file format (.COFF, .elf).
Because the ELF format (and apparently COFF too) has been designed like so. And why not? It's just the very same machine code after all, it seems quite logical to use one file format during all the compilation steps. Just like we don't like when dynamic libraries and stand-alone executables have a different format. (That's why ELF is called ELF - it's an "Executable and Linkable Format".)
Am I misinterpreting "Linkable"?
I don't know. From your question it's not clear to me what you think "linkable" is. In general, it means that it's a file that can be linked against, i. e. a library.
Based on question 1. (and perhaps 2) do I need to use symbol tables (e.g. .LUM or .MAP files) with object code then? Symbols as in debug symbols and using them when re-hosting the object files on a different machine.
I think this one is not related to the executable format used. If you want to debug, you have to generate debugging information no matter what. But if you don't need to debug, then you're free to omit them of course.

What is the difference between ELF files and bin files?

The final images produced by compliers contain both bin file and extended loader format ELf file ,what is the difference between the two , especially the utility of ELF file.
A Bin file is a pure binary file with no memory fix-ups or relocations, more than likely it has explicit instructions to be loaded at a specific memory address. Whereas....
ELF files are Executable Linkable Format which consists of a symbol look-ups and relocatable table, that is, it can be loaded at any memory address by the kernel and automatically, all symbols used, are adjusted to the offset from that memory address where it was loaded into. Usually ELF files have a number of sections, such as 'data', 'text', 'bss', to name but a few...it is within those sections where the run-time can calculate where to adjust the symbol's memory references dynamically at run-time.
A bin file is just the bits and bytes that go into the rom or a particular address from which you will run the program. You can take this data and load it directly as is, you need to know what the base address is though as that is normally not in there.
An elf file contains the bin information but it is surrounded by lots of other information, possible debug info, symbols, can distinguish code from data within the binary. Allows for more than one chunk of binary data (when you dump one of these to a bin you get one big bin file with fill data to pad it to the next block). Tells you how much binary you have and how much bss data is there that wants to be initialised to zeros (gnu tools have problems creating bin files correctly).
The elf file format is a standard, arm publishes its enhancements/variations on the standard. I recommend everyone writes an elf parsing program to understand what is in there, dont bother with a library, it is quite simple to just use the information and structures in the spec. Helps to overcome gnu problems in general creating .bin files as well as debugging linker scripts and other things that can help to mess up your bin or elf output.
some resources:
ELF for the ARM architecture
http://infocenter.arm.com/help/topic/com.arm.doc.ihi0044d/IHI0044D_aaelf.pdf
ELF from wiki
http://en.wikipedia.org/wiki/Executable_and_Linkable_Format
ELF format is generally the default output of compiling.
if you use GNU tool chains, you can translate it to binary format by using objcopy, such as:
arm-elf-objcopy -O binary [elf-input-file] [binary-output-file]
or using fromELF utility(built in most IDEs such as ADS though):
fromelf -bin -o [binary-output-file] [elf-input-file]
bin is the final way that the memory looks before the CPU starts executing it.
ELF is a cut-up/compressed version of that, which the CPU/MCU thus can't run directly.
The (dynamic) linker first has to sufficiently reverse that (and thus modify offsets back to the correct positions).
But there is no linker/OS on the MCU, hence you have to flash the bin instead.
Moreover, Ahmed Gamal is correct.
Compiling and linking are separate stages; the whole process is called "building", hence the GNU Compiler Collection has separate executables:
One for the compiler (which technically outputs assembly), another one for the assembler (which outputs object code in the ELF format),
then one for the linker (which combines several object files into a single ELF file), and finally, at runtime, there is the dynamic linker,
which effectively turns an elf into a bin, but purely in memory, for the CPU to run.
Note that it is common to refer to the whole process as "compiling" (as in GCC's name itself), but that then causes confusion when the specifics are discussed,
such as in this case, and Ahmed was clarifying.
It's a common problem due to the inexact nature of human language itself.
To avoid confusion, GCC outputs object code (after internally using the assembler) using the ELF format.
The linker simply takes several of them (with an .o extension), and produces a single combined result, probably even compressing them (into "a.out").
But all of them, even ".so" are ELF.
It is like several Word documents, each ending in ".chapter", all being combined into a final ".book",
where all files technically use the same standard/format and hence could have had ".docx" as the extension.
The bin is then kind of like converting the book into a ".txt" file while adding as many whitespace as necessary to be equivalent to the size of the final book (printed on a single spool),
with places for all the pictures to be overlaid.
I just want to correct a point here. ELF file is produced by the Linker, not the compiler.
The Compiler mission ends after producing the object files (*.o) out of the source code files. Linker links all .o files together and produces the ELF.

Resources