Difference between different binary formats - c

What are the differences between binary formats like COFF, ELF, a.out, etc.? Why do so many different formats exist?
All they have to be is a sequence of instructions and their arguments (specified by the ISA). So as long as the processor is the same, you can use the same binary between computers (currently it also needs to be of a compatible ABI).

Short answer: History and development. See the feature comparison at Wikipedia.
Better answer: Your assumptions are wrong. An executable is definitely not just code and initialized variables; a much wider range of features is needed. There is no universal executable format, or a best executable format, as the features needed vary from case to case. Usually, we also want to keep backwards support; and why switch from a known working solution to a new one if you don't have to? Switching only because the new one is "universal" is just silly (actually stupid, if you consider how badly monocultures fare in the real, changing world).
Currently, the ELF format is the closest we have to a "universal" format, and it is used by many current operating systems, although both Windows and Mac OS use their own formats for basically historical reasons. (Windows retains backwards compatibility, having later switched to a COFF-based "Portable Executable" format, COFF itself not being that portable; Mac OS X uses the Mach-O format, which is directly related to its kernel.)
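To make the point that these files are more than raw instructions: each format starts with its own header, and the first few bytes (the "magic number") already tell them apart. Below is a minimal sketch, assuming you have the first bytes of a file in memory. The function name guess_format is made up for illustration; the magic values themselves are the well-known ones for ELF, the DOS/PE "MZ" stub and Mach-O.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: guess the container format from the first bytes of
   a file image. Only the leading magic number is inspected. */
const char *guess_format(const unsigned char *buf, size_t len)
{
    /* Mach-O magic 0xfeedface (32-bit) and 0xfeedfacf (64-bit): the first
       two rows are the big-endian byte order, the last two little-endian. */
    static const unsigned char macho[4][4] = {
        {0xfe, 0xed, 0xfa, 0xce}, {0xfe, 0xed, 0xfa, 0xcf},
        {0xce, 0xfa, 0xed, 0xfe}, {0xcf, 0xfa, 0xed, 0xfe}
    };
    size_t i;

    assert(buf != NULL);

    /* "\x7f" and "ELF" are split so the hex escape does not swallow 'E'. */
    if (len >= 4 && memcmp(buf, "\x7f" "ELF", 4) == 0)
        return "ELF";

    /* PE files begin with the DOS "MZ" stub, so a plain DOS executable
       matches here as well; the actual PE header sits further in. */
    if (len >= 2 && buf[0] == 'M' && buf[1] == 'Z')
        return "MZ (DOS/PE)";

    for (i = 0; i < 4; i++)
        if (len >= 4 && memcmp(buf, macho[i], 4) == 0)
            return "Mach-O";

    return "unknown";
}
```

Everything after the magic number (section tables, symbol tables, relocation and dynamic-linking information) is where the formats really differ, which is why a loader for one cannot simply read another.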

Related

Does the length of a data type depend on the architecture on the computer or on the compiler you are using?

I have read several times that the length of a data type in C depends on the architecture of the PC.
But recently I read the documentation of a compiler, and it specifies the data types you can use and the length of each one.
So my question is: does the length of a data type depend on the compiler you're using, the architecture of the computer, or both?
The final determination of the size of each type is up to the C implementation, subject to the constraints of the C standard. (The standard specifies things such as that char is one “byte,” and a byte is eight bits or more, and that certain types are the same size as each other.)
A C implementation typically consists of a compiler, header files, libraries, a linker, and a target system to run on, and possibly other components. The compiler has a large influence on the characteristics of types, especially the core types of C, but sometimes header files complete the definitions of types, so that one compiler may be used with different sets of header files that provide different types.
Most C implementations are influenced by the characteristics of the execution architecture they are designed for. An int will often be the same size as a general processor register, and a pointer (of any type) will often be the smallest size that is a power-of-two number of bytes and has enough bits to span the address space of the target architecture.
However, C implementations may be designed for other purposes. I have seen a C implementation that used a 32-bit pointer size for a 64-bit architecture, for the purpose of reducing memory usage in programs that had a great many pointers. A C implementation could also be designed to run older C code on a new processor, so the C implementation used the sizes of types that the older code was written for rather than sizes that would be more “natural” on the new processor.
It can depend on either. Of course the executable produced by the compiler will need to run efficiently on the CPU architecture that it is compiled for, so if (for example) you are cross-compiling software to run on a 16-bit microcontroller, then it is very likely that sizeof(int)==2 in the compiled code, even if you are using a modern 64-bit desktop PC to do the compilation.
At the same time, it's also up to the compiler writers (within limits) to decide what sizes the various non-width-specific data types should have in their compiler. So for example, some compilers set sizeof(int)==4 while others set sizeof(int)==8, for reasons particular to their own history and their users' needs.
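If you want to see what a particular implementation chose, the simplest thing is to ask it. A minimal sketch follows; the helper name describe_sizes is made up for illustration. Build it with each compiler and target you care about and compare the results.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the sizes chosen by this C implementation
   into out. Returns whatever snprintf reports (the length of the text). */
int describe_sizes(char *out, size_t cap)
{
    assert(out != NULL);
    /* sizeof counts in units of char, so CHAR_BIT is reported as well. */
    return snprintf(out, cap,
                    "CHAR_BIT=%d short=%zu int=%zu long=%zu pointer=%zu",
                    CHAR_BIT, sizeof(short), sizeof(int), sizeof(long),
                    sizeof(void *));
}
```

The values come from the implementation that compiled the code, not from the machine the compiler ran on, so a cross-compiled build reports the target's sizes.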

Transforming executables, objects or binaries between different architectures

Beforehand: This is just a nasty idea I had this night :-)
Think about the following scenario:
You have some arm-elf executable and for some reasons you want to run it on your amd64 box without emulating.
To simplify the scenario, let's say we just want to deal with simple console applications which are just linked against libc and there are no additional architecture specific requirements.
If you want to transform binaries between different architectures you have to consider the following points:
Endianness of the architectures
bit-width of registers
functionality of different registers
Endianness should be one of the lesser problems.
If the bit-width of the destination registers is smaller than that of the source architecture, one could insert additional instructions to reproduce the same behaviour. The same applies to the functionality of registers.
Finally, (and before bashing down this idea), have a look at the following simple code snippet and its corresponding disassembly of the objects.
C Code
Corresponding ARM Disassembly
Corresponding AMD64 Disassembly
In my opinion it should be possible to convert those objects between different architectures. Even function calls (like printf) could be mapped or wrapped to the destination architecture's libc.
And now my questions:
Did anyone already think about realising this?
Is it actually possible?
Are there already some projects dealing with this issue?
Thanks in advance!

Adding 64 bit support to existing 32 bit code, is it difficult?

There is a library which I build against different 32-bit platforms. Now, 64-bit architectures must be supported. What are the most general strategies to extend existing 32-bit code to support 64-bit architectures? Should I use #ifdef's or anything else?
The amount of effort involved will depend entirely on how well written the original code is. In the best possible case there will be no effort involved other than re-compiling. In the worst case you will have to spend a lot of time making your code "64 bit clean".
Typical problems are:
assumptions about sizes of int/long/pointer/etc
assigning pointers <=> ints
relying on default argument or function result conversions (i.e. no function prototypes)
inappropriate printf/scanf format specifiers
assumptions about size/alignment/padding of structs (particularly in regard to file or network I/O, or interfacing with other APIs, etc)
inappropriate casts when doing pointer arithmetic with byte offsets
Simply don't rely on assumptions about the machine word size: always use sizeof, stdint.h, etc. Unless you rely on different library calls for different architectures, there should be no need for #ifdefs.
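Two of the typical problems above (pointer/int assignment and printf format specifiers) can be sketched like this. The function names are made up for illustration; uintptr_t, the %zu specifier and the PRId64 macro are standard C99. Note that uintptr_t is formally optional, although common hosted implementations provide it.

```c
#include <assert.h>
#include <inttypes.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Round-trip a pointer through an integer. uintptr_t is wide enough by
   definition; plain int (and sometimes long) is not on 64-bit targets. */
int pointer_round_trips(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits == p;
}

/* Format values with specifiers that match their types on any data model:
   %zu for size_t and PRId64 for int64_t, instead of guessing %d or %ld. */
int format_values(char *out, size_t cap, size_t count, int64_t offset)
{
    assert(out != NULL);
    return snprintf(out, cap, "count=%zu offset=%" PRId64, count, offset);
}
```

Code written this way compiles unchanged for 32-bit and 64-bit targets, which is the reason no #ifdef is needed.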
The easiest strategy is to build what you have with 64-bit settings and test the heck out of it. Some code doesn't need to change at all. Other code, usually with wrong assumptions about the size of ints/pointers, will be much more brittle and will need to be modified to be independent of the architecture.
Very often binary files containing binary records cause the most problems. This is especially true in environments where ints grow from 32-bit to 64-bit in the transition to a 64-bit build. Primarily this is due to the fact that integers get written natively to files in their current (32-bit) length and read in using an incorrect length in a 64-bit build where ints are 64-bit.
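One common way around the record problem is to stop writing native integers and instead serialize each field with an explicit width and byte order. A minimal sketch, assuming a 32-bit little-endian field; the names put_u32_le and get_u32_le are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stddef.h>

/* Store a 32-bit value as exactly four bytes, least significant first,
   whatever the host's int size or byte order happens to be. */
void put_u32_le(unsigned char out[4], uint32_t v)
{
    assert(out != NULL);
    out[0] = (unsigned char)(v & 0xffu);
    out[1] = (unsigned char)((v >> 8) & 0xffu);
    out[2] = (unsigned char)((v >> 16) & 0xffu);
    out[3] = (unsigned char)((v >> 24) & 0xffu);
}

/* Read the same four bytes back into a 32-bit value. */
uint32_t get_u32_le(const unsigned char in[4])
{
    assert(in != NULL);
    return (uint32_t)in[0]
         | ((uint32_t)in[1] << 8)
         | ((uint32_t)in[2] << 16)
         | ((uint32_t)in[3] << 24);
}
```

The buffer is what goes to fwrite/fread, so a file produced by a 32-bit build reads back identically in a 64-bit build.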

Writing a portable C program - which things to consider?

For a project at university I need to extend an existing C application, which shall in the end run on a wide variety of commercial and non-commercial unix systems (FreeBSD, Solaris, AIX, etc.).
Which things do I have to consider when I want to write a C program which is most portable?
The best advice I can give is to move to a different platform every day, testing as you go.
This will make the platform differences stick out like a sore thumb, and teach you the portability issues at the same time.
Saving the cross-platform testing for the end will lead to failure.
That aside:
Integer sizes can vary.
Floating point numbers might be represented differently.
Integers can have different endianness.
Compilation options can vary.
Include file names can vary.
Bit field implementations will vary.
It is generally a good idea to set your compiler warning level up as high as possible,
to see the sorts of things the compiler can complain about.
I used to write C utilities that I would then support on 16 bit to 64 bit architectures, including some 60 bit machines. They included at least three varieties of "endianness," different floating point formats, different character encodings, and different operating systems (though Unix predominated).
Stay as close to standard C as you can. For functions/libraries not part of the standard, use as widely supported a code base as you can find. For example, for networking, use the BSD socket interface, with zero or minimal use of low level socket options, out-of-band signalling, etc. To support a wide number of disparate platforms with minimal staff, you'll have to stay with plain vanilla functions.
Be very aware of what's guaranteed by the standard, versus what's typical implementation behavior. For instance, pointers are not necessarily the same size as integers, and pointers to different data types may have different lengths. If you must make implementation dependent assumptions, document them thoroughly. Lint, or --strict, or whatever your development toolset has as an equivalent, is vitally important here.
Header files are your friend. Use implementation defined macros and constants. Use header definitions and #ifdef to help isolate those instances where you need to cover a small number of alternatives.
Don't assume the current platform uses EBCDIC characters and packed decimal integers. There are a fair number of ASCII - two's complement machines out there as well. :-)
With all that, if you avoid the temptation to write things multiple times and #ifdef major portions of code, you'll find that coding and testing across disparate platforms helps find bugs sooner. You'll end up producing more disciplined, understandable, maintainable programs.
Use at least two compilers.
Have a continuous build system in place, which preferably builds on the various target platforms.
If you do not need to work very low-level, try to use some library that provides abstraction. It is unlikely that you won't find third-party libraries that provide good abstraction for the things you need. For example, for network and communication, there is ACE. Boost (e.g. filesystem) is also ported to several platforms. These are C++ libraries, but there may be other C libraries too (like curl).
If you have to work at the low level, be aware that the platforms occasionally have different behavior even on things like POSIX where they are supposed to have the same behavior. You can have a look at the source code of the libraries above.
One particular issue that you may need to stay abreast of (for instance, if your data files are expected to work across platforms) is endianness.
Numbers are represented differently at the binary level on different architectures. Big-endian systems order the most significant byte first and little-endian systems order the least-significant byte first.
If you write some raw data to a file in one endianness and then read that file back on a system with a different endianness you will obviously have issues.
You should be able to get the endianness at compile-time on most systems from sys/param.h. If you need to detect it at runtime, one method is to use a union of an int and a char: set the int to 0, then set the char to 1 and see what value the int has.
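The runtime check with a union can be sketched as follows. The function name is made up for illustration, and the sketch assumes int is wider than char (true on ordinary hosts, and checked with an assert).

```c
#include <assert.h>

/* Returns 1 on a little-endian host and 0 otherwise. */
int is_little_endian(void)
{
    union {
        unsigned int word;
        unsigned char first; /* overlays the byte at the lowest address */
    } probe;

    /* The probe is meaningless if int occupies a single byte. */
    assert(sizeof(unsigned int) > 1);

    probe.word = 0;  /* clear every byte first */
    probe.first = 1; /* then set only the lowest-addressed byte */

    /* Little-endian: that byte is the least significant one, so the
       word reads as 1. Big-endian: the word reads as a large value. */
    return probe.word == 1;
}
```

For data files it is usually simpler to pick one byte order for the file format and convert on every read and write, so the host's endianness never needs to be known.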
It is a very long list. The best thing to do is to read examples. The source of perl, for example. If you look at the source of perl, you will see a gigantic process of constructing a header file that deals with about 50 platform issues.
Read it and weep, or borrow.
The list may be long, but it's not nearly as long as it gets when you also support Windows and MSDOS, which is common with many utilities.
The usual technique is to separate core algorithm modules from those which deal with the operating system—basically a strategy of layered abstraction.
Differentiating amongst several flavors of unix is rather simple in comparison. Either stick to features all use the same RTL names for, or look at the majority convention for the platforms supported and #ifdef in the exceptions.
Continually refer to the POSIX standards for any library functions you use. Parts of the standard are ambiguous and some systems return different styles of error codes. This will help you to proactively find those really hard to find slightly different implementation bugs.

size of a datatype in c

Is size of a datatype hardware architecture dependent or compiler dependent?
I want to know which factors really determine the size of a datatype.
The compiler (more properly the "implementation") is free to choose the sizes, subject to the limits in the C standard (for instance int must be at least 16 bits). The compiler optionally can subject itself to other standards, like POSIX, which can add more constraints. For example I think POSIX says all data pointers are the same size, whereas the C standard is perfectly happy for sizeof(int*) != sizeof(char*).
In practice, the compiler-writer's decisions are strongly influenced by the architecture, because unless there's a strong reason otherwise they want the implementation to be efficient and interoperable. Processor manufacturers or OS vendors often publish a thing called a "C ABI", which tells you (among other things), how big the types are and how they're stored in memory. Compilers are never obliged to follow the standard ABI for their architecture, and CPUs often have more than one common ABI anyway, but to call directly from code out of one compiler to code out of another, both compilers have to be using the same ABI. So if your C compiler doesn't use the Windows ABI on Windows, then you'd need extra wrappers to call into Windows dlls. If your compiler supports multiple platforms, then it quite likely uses different ABIs on different platforms.
You often see abbreviations used to indicate which of several ABIs is in use. So for instance when a compiler on a 64-bit platform says it's LP64, that means long and pointers are 64-bit, and by omission int is 32-bit. If it says ILP64, that means int is 64-bit too.
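As a rough sketch, these abbreviations can be derived from the sizes the implementation reports. The enum and function names are made up for illustration. LP64 and ILP64 are described above; ILP32 (everything 32-bit) and LLP64 (only long long and pointers 64-bit, as on 64-bit Windows) are the other commonly used names and are added here as an assumption about what you might want to recognize.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

enum model { MODEL_OTHER, MODEL_ILP32, MODEL_LLP64, MODEL_LP64, MODEL_ILP64 };

/* Classify a data model from the widths (in bits) of int, long and pointers. */
enum model classify(size_t int_bits, size_t long_bits, size_t ptr_bits)
{
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 32) return MODEL_ILP32;
    if (int_bits == 32 && long_bits == 32 && ptr_bits == 64) return MODEL_LLP64;
    if (int_bits == 32 && long_bits == 64 && ptr_bits == 64) return MODEL_LP64;
    if (int_bits == 64 && long_bits == 64 && ptr_bits == 64) return MODEL_ILP64;
    return MODEL_OTHER;
}

/* Classify the implementation this code was compiled with. */
enum model host_model(void)
{
    return classify(sizeof(int) * CHAR_BIT, sizeof(long) * CHAR_BIT,
                    sizeof(void *) * CHAR_BIT);
}

/* Printable name for a model value. */
const char *model_name(enum model m)
{
    static const char *const names[] = {
        "other", "ILP32", "LLP64", "LP64", "ILP64"
    };
    assert(m >= MODEL_OTHER && m <= MODEL_ILP64);
    return names[m];
}
```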
In the end, it's more a case of the compiler-writer choosing from a menu of sensible options, than picking numbers out of the air arbitrarily. But the implementation is always free to do whatever it likes. If you want to write a compiler for x86 which emulates a machine with 9-bit bytes and 3-byte words, then the C standard allows it. But as far as the OS is concerned you're on your own.
The size is ultimately determined by the compiler. e.g. Java has a fixed set of sizes (8,16,32,64) while the set of sizes offered by C for its various types depends in part on the hardware it runs on; i.e. the compiler makes the choice but (except in cases like Java where datatypes are explicitly independent of underlying hardware) is strongly influenced by what the hardware offers.
The size of the different data types depends on the compiler and its configuration (different compilers, or different switches to the same compiler, on the same machine can produce different sizes).
Usually the compiler is matched to the hardware it is installed on ... so you could say that the sizes of types are also hardware dependent. Making a compiler that emits 16-bit pointers on a machine where they are 48-bits is counter productive.
But it is possible to use a compiler on a computer to create a program meant to run on a different computer with different sizes.
It depends on the target hardware architecture, operating system and possibly the compiler.
The Intel compiler sizes a long integer as follows:
OS          Arch        Size
Windows     IA-32       4 bytes
Windows     Intel 64    4 bytes
Windows     IA-64       4 bytes
Linux       IA-32       4 bytes
Linux       Intel 64    8 bytes
Linux       IA-64       8 bytes
Mac OS X    IA-32       4 bytes
Mac OS X    Intel 64    8 bytes
Here is a link showing the sizes for the Microsoft Visual C++ compiler.
The size of the "native" datatypes is up to the compiler. While this in turn is influenced by the hardware, I wouldn't start guessing.
Have a look at <stdint.h> - that header has platform-independent typedef's that should cater for whatever needs you might have.
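A small sketch of what the fixed-width types buy you. The function names are made up for illustration; the types and UINT32_MAX are standard <stdint.h>. The exact-width types are only defined on implementations that can provide them, which covers mainstream platforms.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Exact-width types have no padding bits, so their size in bits is the
   number in the name wherever they exist. */
void check_exact_widths(void)
{
    assert(sizeof(uint16_t) * CHAR_BIT == 16);
    assert(sizeof(int32_t) * CHAR_BIT == 32);
    assert(sizeof(int64_t) * CHAR_BIT == 64);
}

/* Addition that wraps modulo 2^32 regardless of how wide int is: the
   conversion of the result back to uint32_t reduces it modulo 2^32. */
uint32_t add_wrapping(uint32_t a, uint32_t b)
{
    return (uint32_t)(a + b);
}
```

With a plain unsigned int the wrap-around point would depend on the implementation; with uint32_t it is the same everywhere.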
The answer to your question is yes, I'll explain.
Consider common storage types, i.e. size_t, int64_t, etc. These are decided (defined in most cases) at compile time depending on the architecture that you are building on. Don't have them? Fear not, the compiler figured out the underlying meaning of int and how unsigned affects it.
Once the standard C headers figure out the desired word length, everything just adjusts to your system.
Unless, of course, you happen to be cross compiling, then they are decided (or defined) by whatever architecture you specified.
In short, your compiler (well, mostly preprocessor) is going to adjust types to fit the target architecture. Be it the one you are using, or the one you are cross compiling for.
That is, if I understand your question correctly. This is one of the few 'magic' abstractions that the language provides, and part of the reason why it's often called 'portable assembly'.
It is exclusively compiler dependent.
Or, to put it more correctly, it depends on the C language standard (http://en.wikipedia.org/wiki/C99),
since the standard clearly constrains the sizes of the built-in types.
But the sizes are not fixed "one size for all" there.
There is just a minimum size (e.g. char is at least 8 bits) that must be preserved by any compiler on any architecture.
A char can also be 16 or even 32 bits, depending on the architecture.
The relative sizes of the different types are also preserved.
This means that, for example, short can be 16 or 32 bits (the standard requires at least 16), but it cannot be longer than the wider type int on the same architecture.
Only smaller or the same length.
Those are the boundaries within which compiler developers must work if they want to make a compiler that conforms to the C standard.
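Those boundaries can be checked against <limits.h>. A minimal sketch: the function names are made up for illustration, and the numeric minimums are the ones the C standard gives for each macro.

```c
#include <assert.h>
#include <limits.h>

/* Returns 1 when this implementation meets the minimum ranges required
   by the C standard and keeps the types in non-decreasing order of range. */
int meets_standard_minimums(void)
{
    int minimums = CHAR_BIT >= 8
                && SHRT_MAX >= 32767
                && INT_MAX >= 32767
                && LONG_MAX >= 2147483647L;
    int ordering = SCHAR_MAX <= SHRT_MAX
                && SHRT_MAX <= INT_MAX
                && INT_MAX <= LONG_MAX;
    return minimums && ordering;
}

/* Abort early if the implementation is outside those boundaries. */
void require_standard_minimums(void)
{
    assert(meets_standard_minimums());
}
```

On a conforming implementation the check always passes; the point is that only lower bounds and an ordering are fixed, and everything above that is the implementation's choice.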

Resources