Which of these items can safely be assumed to be defined in any practically-usable platform ABI?
Value of CHAR_BIT
Size, alignment requirements and object representation of:
void*, size_t, ptrdiff_t
unsigned char and signed char
intptr_t and uintptr_t
float, double and long double
short and long long
int and long (but here I expect a "no")
Pointer to an object type for which the platform ABI specifies these properties
Pointer to function whose type only involves types for which the platform ABI specifies these properties
Object representation of a null object pointer
Object representation of a null function pointer
For example, if I have a library (compiled by an unknown, but ABI-conforming compiler) which publishes this function:
void* foo(void *bar, size_t baz, void* (*qux)());
can I assume to be able to safely call it in my program regardless of the compiler I use?
Or, taken the other way round, if I am writing a library, is there a set of types such that if I limit the library's public interface to this set, it will be guaranteed to be usable on all platforms where it builds?
I don't see how you can expect any library to be universally compatible. If that were possible, there would not be so many compiled variations of libraries.
For example, you could call a 64-bit library from a 16-bit program as long as you set up the call correctly. But you would have to know you're calling a 64-bit based library.
Portability is a much-talked about goal, but few truly achieve it. After 30+ years of system-level, firmware and application programming, I think of it as more of a fantasy versus a goal. Unfortunately, hardware forces us to optimize for the hardware. Therefore, when I write a library, I use the following:
Compile for ABI
Use a pointer to a structure for input and output for all function calls:
int lib_func(struct *input, struct *output);
Where the returning int indicates errors only. I make all error codes unique. I require the user to call an init function prior to any use of the library. The user calls it as:
lib_init(sizeof(int), sizeof(char *), sizeof(long), sizeof(long long));
So that I can decide if there will be any trouble or modify any assumptions if needed. I also add a function allowing the user to learn my data sizes and alignment in addition to version numbers.
This is not to say the user or I am expected to "on-the-fly" modify code or spend lots of CPU power reworking structures. But this allows the application to make absolutely sure it's compatible with me and vice-versa.
The other option which I have employed in the past, is to simply include several entry-point functions with my library. For example:
int lib_func32();
int lib_func16();
int lib_func64();
It makes a bit of a mess for you, but you can then fix it up using the preprocessor:
#ifdef LIB_USE32
#define lib_function lib_func32
#endif
You can do the same with data structures but I'd recommend using the same size data structure regardless of CPU size -- unless performance is a top-priority. Again, back to the hardware!
The final option I explore is whether to have entry functions of all sizes and styles which convert the input to my library's expectations, as well as my library's output.
For example, your lib_func32(&input, &output) can be compiled to expect a 32-bit aligned, 32-bit pointer but it converts the 32-bit struct into your internal 64-bit struct then calls your 64 bit function. When that returns, it reformats the 64-bit struct to its 32-bit equivalent as pointed to by the caller.
int lib_func32(struct *input32, struct *output32)
{
struct input64;
struct output64;
int retval;
lib_convert32_to_64(input32, &input64);
retval = lib_func64(&input64, &output64);
lib_convert64_to_32(&output64, output32);
return(retval);
}
In summary, a totally portable solution is not viable. Even if you begin with total portability, eventually you will have to deviate. This is when things truly get messy. You break your style for deviations which then breaks your documentation and confuses users. I think it's better to just plan it from the start.
Hardware will always cause you to have deviations. Just consider how much trouble 'endianness' causes -- not to mention the number of CPU cycles which are used each day swapping byte orders.
The C standard contains an entire section in the appendix summarizing just that:
J.3 Implementation-defined behavior
A completely random subset:
The number of bits in a byte
Which of signed char and unsigned char is the same as char
The text encodings for multibyte and wide strings
Signed integer representation
The result of converting a pointer to an integer and vice versa (6.3.2.3). Note that this means any pointer, not just object pointers.
Update: To address your question about ABIs: An ABI (application binary interface) is not a standardized concept, and it isn't said anywhere that an implementation must even specify an ABI. The ingredients of an ABI are partly the implementation-defined behaviour of the language (though not all of it; e.g. signed-to-unsigned conversion is implementation defined, but not part of an ABI), and most of the implementation-defined aspects of the language are dictated by the hardware (e.g. signed integer representation, floating point representation, size of pointers).
However, more important aspects of an ABI are things like how function calls work, i.e. where the arguments are stored, who's responsible for cleaning up the memory, etc. It is crucial for two compilers to agree on those conventions in order for their code to be binarily compatible.
In practice, an ABI is usually the result of an implementation. Once the compiler is complete, it determines -- by virtue of its implementation -- an ABI. It may document this ABI, and other compilers, and future versions of the same compiler, may like to stick to those conventions. For C implementations on x86, this has worked rather well and there are only a few, usually well documented, free parameters that need to be communicated for code to be interoperable. But for other languages, most notably C++, you have a completely different picture: There is nothing coming near a standard ABI for C++ at all. Microsoft's compiler breaks the C++ ABI with every release. GCC tries hard to maintain ABI compatibility across versions and uses the published Itanium ABI (ironically for a now dead architecture). Other compilers may do their own, completely different thing. (And then you have of course issues with C++ standard library implementations, e.g. does your string contain one, two, or three pointers, and in which order?)
To summarize: many aspects of a compiler's ABI, especially pertaining to C, are dictated by the hardware architecture. Different C compilers for the same hardware ought to produce compatible binary code as long as certain aspects like function calling conventions are communicated properly. However, for higher-level languages all bets are off, and whether two different compilers can produce interoperable code has to be decided on a case-by-case basis.
If I understand your needs correctly, uint style ones are the only ones that will give you binary compatibility guarantee and of cause int, char will but others tend to differ. i.e long on Windows and Linux, Windows considers it 4byte and Linux as 8byte. If you are really dependent on ABI, you have to plan for the platforms you are going to deliver and may be use typedefs to make things standardized and readable.
Related
I have difficulty to understand the portability of bit fields in C. Imagine I have a shared library composed of only two files, libfoobar.h (the public header) and libfoobar.c, with the following simple content:
libfoobar.h:
typedef struct some_bitfield_T {
unsigned char foo:3;
unsigned char bar:2;
unsigned char tree:2;
unsigned char window:1;
} some_bitfield;
extern unsigned int some_function (some_bitfield input);
libfoobar.c:
#include "libfoobar.h"
unsigned int some_function (some_bitfield input) {
return input.foo * 3 + input.bar + input.tree + 1 - input.window;
}
After having compiled and installed the library, I test it with a program named test.
test.c:
#include <stdio.h>
#include <libfoobar.h>
int main () {
some_bitfield my_attempt = {
.foo = 6,
.bar = 3,
.tree = 1,
.window = 1
};
unsigned int some_number = some_function(my_attempt);
printf("Here is the result: %u\n", some_number);
return 0;
}
Is there any possibility that the test program above will produce anything different than the following output?
Here is the result: 22
If yes, when? What if the library is compiled by someone else other than me? What if I use different compilers for the library and the test program?
Here is the relevant section of the C11 standard:
An implementation may allocate any addressable storage unit large enough to hold a bit- field. If enough space remains, a bit-field that immediately follows another bit-field in a structure shall be packed into adjacent bits of the same unit. If insufficient space remains, whether a bit-field that does not fit is put into the next unit or overlaps adjacent units is implementation-defined. The order of allocation of bit-fields within a unit (high-order to low-order or low-order to high-order) is implementation-defined. The alignment of the addressable storage unit is unspecified.
§6.7.2.1 point 11.
This means that the compiler can use any suitable type it likes to hold the bitfields and that they will be adjacent to each and in the order they were defined.
However, the compiler can choose for itself whether to order them from high to low or from low to high. It can also choose to overlap bitfields if it run out of space or to allocate a new storage unit (not a problem in your case, you only have 8 bits).
With the above in mind, we can answer your question. You can only guarantee that your test program will give the right answer if the program and the library were compiled with the same compiler implementation. If you use two different compilers, even if they both use (say) unsigned char to store the bit fields in, one could start from the top of the byte and the other could start from the bottom.
In practice, as ensc says above, a platform ABI might define a bit field packing and ordering standard, that makes it OK between compilers on the same platform, but this is not guaranteed in principle.
Bitfields are implementation defined and not not portable. But for most relevant platforms, their extraction/packing is well specified by the ABI so that they can be safely used in shared libraries.
E.g.:
ARM EABI specifies them in 7.1.7 "Bit-fields".
i386 ABI specifies them on page 3-6+
x86-64 specifies them on page 15+
MIPS specifies them on page 3-7+
I have difficulty to understand the portability of bit fields in C
Well, there's nothing much understand - bitfields are not portable.
Is there any possibility that the test program above will produce anything different than the following output?
The most common case is communication between user space and kernel space. Sometimes the communication uses pointers to structures. Headers written by library implementators, like glibc, that wrap kernel syscalls sometimes duplicate the same structures that are defined inside kernel source tree. For proper communication, the padding in those structures must be the same on both sides - on the kernel side when kernel is compiled and on user space side when we happy compile our own project even years after kernel was compiled.
On most architectures + operating systems there exists a "ABI" - a set of rules that also determine how structures should be padded and how bit-fields should be packed. A compiler may adhere to that ABI or not. When gcc is used to cross-compile for windows from linux, ex. __attribute__ ((ms_struct)) needs be used to ensure that proper structure packing is used that is compatible with the shenanigans microsoft compilers do.
So to answer: Is there any possibility - sure there is, different compiler flags or settings may cause different packing or padding between structure members So I can compile the program with gcc -fpack-struct=100 and you compiled your shared library with gcc -fpack-struct=20 and oops. But this isn't limited to structure padding - someone else can compile your program with unsigned int having 64-bits instead of 32-bit, so the return value of the function may be unexpected.
If yes, when?
When incompatible code generation methods are used to create machine code that depend on specified ABI to communicate. From the practical side, does this ever happen? Each linux system has ton of shared libraries in /usr/lib.
What if the library is compiled by someone else other than me?
Then he can do whatever he wants. But you can communicate that your shared library follows some common ABI standard, if you really want to.
What if I use different compilers for the library and the test program?
Then be sure to read that compiler documentation to make sure it follows the common ABI that you need.
Read: How to Write Shared Libraries. Ulrich Drepper - section 3 seems to be related.
Given a CPU architecture, is the exact binary form of a struct determined exactly?
For example, struct stat64 is used by glibc and the Linux kernel. I see glibc define it in sysdeps/unix/sysv/linux/x86/bits/stat.h as:
struct stat64 {
__dev_t st_dev; /* Device. */
# ifdef __x86_64__
__ino64_t st_ino; /* File serial number. */
__nlink_t st_nlink; /* Link count. */
/* ... et cetera ... */
}
My kernel was compiled already. Now when I compile new code using this definition, they have binary compatibility. Where is this guaranteed? The only guarantees I know of are:
The first element has offset 0
Elements declared later have higher offsets
So if the kernel code declares struct stat64 in the exact same way (in the C code), then I know that the binary form has:
st_dev # offset 0
st_ino # offset at least sizeof(__dev_t)
But I'm not currently aware of any way to determine the offset of st_ino. Kernighan & Ritchie give the simple example
struct X {
char c;
int i;
}
where on my x86-64 machine, offsetof(struct X, i) == 4. Perhaps there are some general alignment rules that determine the exact binary form of a struct for each CPU architecture?
Given a CPU architecture, is the exact binary form of a struct determined exactly?
No, the representation or layout (“binary form”) of a structure is ultimately determined by the C implementation, not by the CPU architecture. Most C implementations intended for normal purposes follow recommendations provided by the manufacturer and/or the operating system. However, there may be circumstances where, for example, a certain alignment for a particular type might give slightly better performance but is not required, and so one C implementation might choose to require that alignment while another does not, and this can result in different structure layout.
In addition, a C implementation might be designed for special purposes, such as providing compatibility with legacy code, in which case it might choose to replicate the alignment of some old compiler for another architecture rather than to use the alignment required by the target processor.
However, let’s consider structures in separate compilations using one C implementation. Then C 2018 6.2.7 1 says:
… Moreover, two structure, union, or enumerated types declared in separate translation units are compatible if their tags and members satisfy the following requirements: If one is declared with a tag, the other shall be declared with the same tag. If both are completed anywhere within their respective translation units, then the following additional requirements apply: there shall be a one-to-one correspondence between their members such that each pair of corresponding members are declared with compatible types; if one member of the pair is declared with an alignment specifier, the other is declared with an equivalent alignment specifier; and if one member of the pair is declared with a name, the other is declared with the same name. For two structures, corresponding members shall be declared in the same order. For two structures or unions, corresponding bit-fields shall have the same widths…
Therefore, if two structures are declared identically in separate translation units, or with the minor variations permitted in that passage, then they are compatible, which effectively means they have the same layout or representation.
Technically, that passage applies only to separate translation units of the same program. The C standard defines behaviors for one program; it does not explicitly define interactions between programs (or fragments of programs, such as kernel extensions) and the operating system, although to some extent you might consider the operating system and everything running in it as one program. However, for practical purposes, it applies to everything compiled with that C implementation.
This means that as long as you use the same C implementation as the kernel is compiled with, identically declared structures will have the same representation.
Another consideration is that we might use different compilers for compiling the kernel and compiling programs. The kernel might be compiled with Clang while a user prefers to use GCC. In this case, it is a matter for the compilers to document their behaviors. The C standard does not guarantee compatibility, but the compilers can, if they choose to, perhaps by both documenting that they adhere to a particular Application Binary Interface (ABI).
Also note that a “C implementation” as discussed above is not just a particular compiler but a particular compiler with particular switches. Various switches may change how a compiler behaves in ways that cause to be effectively a different C implementation, such as switches to conform to one version of the C standard or another, switches affecting whether structures are packed, switches affecting sizes of integer types, and so on.
Does the following code have defined beavior:
#include <stdio.h>
int main() {
int x;
if (scanf("%x", &x) == 1) {
printf("decimal: %d\n", x);
}
return 0;
}
clang compiles it without any warnings even with all warnings enabled, including -pedantic. The C Standard seems unambiguous about this:
C17 7.21.6.2 The fscanf function
...
... the result of the conversion is placed in the object pointed to by the first argument following the format argument that has not already received a conversion result. If this object does not have an appropriate type, or if the result of the conversion cannot be represented in the object, the behavior is undefined.
...
The conversion specifiers and their meanings are:
...
x Matches an optionally signed hexadecimal integer, whose format is the same as expected for the subject sequence of the strtoul function with the value 16 for the base argument. The corresponding argument shall be a pointer to unsigned integer.
On two's complement architectures, converting -1 with %x seems to work, but it would not on ancient sign/magnitude or ones complement systems.
Is there any provision to make this behavior defined or at least implementation defined?
This falls in the category of behaviors which quality implementations should support unless they document a good reason for doing otherwise, but which the Standard does not mandate. The authors of the Standard seem to have refrained from trying to list all such behaviors, and there are at least three good reasons for that:
Doing so would have made the Standard longer, and spending ink describing obvious behaviors that readers would expect anyway would distract from the places where the Standard needed to call readers' attention to things that they might not otherwise expect.
The authors of the Standard may not have wanted to preclude the possibility that an implementation might have a good reason for doing something unusual. I don't know whether that was a consideration in your particular case, but it could have been.
Consider, for example, a (likely theoretical) environment whose calling convention that requires passing information about argument types fed to variadic functions, and that supplies a scanf function that validates those argument types and squawks if int* is passed to a %X argument. The authors of the Standard were almost certainly not aware of any such environment [I doubt any ever existed], and thus would be in no position to weigh the benefits of using the environment's scanf routine versus the benefits of supporting the common behavior. Thus, it would make sense to leave such judgment up to people who would be in a better position to assess the costs and benefits of each approach.
It would be extremely difficult for the authors of the Standard to ensure that they exhaustively enumerated all such cases without missing any, and the more exhaustively they were to attempt to enumerate such cases, the more likely it would be that accidental omissions would be misconstrued as deliberate.
In practice, some compiler writers seem to regard most situations where the Standard fails to mandate the behavior of some action as an invitation to assume code will never attempt it, even if all implementations prior to the Standard had behaved consistently and it's unlikely there would ever be any good reason for an implementation to do otherwise. Consequently, using %X to read an int falls in the category of behaviors that will be reliable on implementations that make any effort to be compatible with common idioms, but could fail on implementations whose designers place a higher value on being able to process useless programs more efficiently, or on implementations that are designed to squawk when given programs that could be undermined by such implementations.
Some background:
the header stdint.h is part of the C standard since C99. It includes typedefs that are ensured to be 8, 16, 32, and 64-bit long integers, both signed and unsigned. This header is not part of the C89 standard, though, and I haven't yet found any straightforward way to ensure that my datatypes have a known length.
Getting to the actual topic
The following code is how SQLite (written in C89) defines 64-bit integers, but I don't find it convincing. That is, I don't think it's going to work everywhere. Worst of all, it could fail silently:
/*
** CAPI3REF: 64-Bit Integer Types
** KEYWORDS: sqlite_int64 sqlite_uint64
**
** Because there is no cross-platform way to specify 64-bit integer types
** SQLite includes typedefs for 64-bit signed and unsigned integers.
*/
#ifdef SQLITE_INT64_TYPE
typedef SQLITE_INT64_TYPE sqlite_int64;
typedef unsigned SQLITE_INT64_TYPE sqlite_uint64;
#elif defined(_MSC_VER) || defined(__BORLANDC__)
typedef __int64 sqlite_int64;
typedef unsigned __int64 sqlite_uint64;
#else
typedef long long int sqlite_int64;
typedef unsigned long long int sqlite_uint64;
#endif
typedef sqlite_int64 sqlite3_int64;
typedef sqlite_uint64 sqlite3_uint64;
So, this is what I've been doing so far:
Checking that the "char" data type is 8 bits long, since it's not guaranteed to be. If the preprocessor variable "CHAR_BIT" is not equal to 8, compilation fails
Now that "char" is guaranteed to be 8 bits long, I create a struct containing an array of several unsigned chars, which correspond to several bytes in the integer.
I write "operator" functions for my datatypes. Addition, multiplication, division, modulo, conversion from/to string, etc.
I have abstracted this process in a header file, which is the best I can do with what I know, but I wonder if there is a more straightforward way to achieve this.
I'm asking because I want to write a portable C library.
First, you should ask yourself whether you really need to support implementations that don't provide <stdint.h>. It was standardized in 1999, and even many pre-C99 implementations are likely to provide it as an extension.
Assuming you really need this, Doug Gwyn, a member of the ISO C standard committee, created an implementation of several of the new headers for C9x (as C99 was then known), compatible with C89/C90. The headers are in the public domain and should be reasonably portable.
http://www.lysator.liu.se/(nobg)/c/q8/index.html
(As I understand it, the name "q8" has no particular meaning; he just chose it as a reasonably short and unique search term.)
One rather nasty quirk of integer types in C stems from the fact that many "modern" implementations will have, for at least one size of integer, two incompatible signed types of that size with the same bit representation and likewise two incompatible unsigned types. Most typically the types will be 32-bit "int" and "long", or 64-bit "long" and "long long". The "fixed-sized" types will typically alias to one of the standard types, though implementations are not consistent about which one.
Although compilers used to assume that accesses to one type of a given size might affect objects of the other, the authors of the Standard didn't mandate that they do so (probably because there would have been no point ordering people to do things they would do anyway and they couldn't imagine any sane compiler writer doing otherwise; once compilers started doing so, it was politically difficult to revoke that "permission"). Consequently, if one has a library which stores data in a 32-bit "int" and another which reads data from a 32-bit "long", the only way to be assured of correct behavior is to either disable aliasing analysis altogether (probably the sanest choice while using gcc) or else add gratuitous copy operations (being careful that gcc doesn't optimize them out and then use their absence as an excuse to break code--something it sometimes does as of 6.2).
I often times write to memory mapped I/O pins like this
P3OUT |= BIT1;
I assumed that P3OUT was being replaced with something like this by my preprocessor:
*((unsigned short *) 0x0222u)
But I dug into an H file today and saw something along these lines:
volatile unsigned short P3OUT # 0x0222u;
There's some more expansion going on before that, but it is generally that. A symbol '#' is being used. Above that there are some #pragma's about using an extended set of the C language. I am assuming this is some sort of directive to the linker and effectively a symbol is being defined as being at that location in the memory map.
Was my assumption right for what happens most of the time on most compilers? Does it matter one way or the other? Where did that # notation come from, is it some sort of standard?
I am using IAR Embedded workbench.
This question is similar to this one: How to place a variable at a given absolute address in memory (with GCC).
It matches what I assumed my compiler was doing anyway.
Although an expression like (unsigned char *)0x1234 will, on many compilers, yield a pointer to hardware address 0x1234, nothing in the standard requires any particular relationship between an integer which is cast to a pointer and the resulting address. The only thing which the standard specifies is that if a particular integer type is at least as large as intptr_t, and casting a pointer to that particular type yields some value, then casting that particular value back to the original pointer type will yield a pointer equivalent to the original.
The IAR compiler offers a non-standard extension which allows the compiler to request that variables be placed at specified hard-coded addresses. This offers some advantages compared to using macros to create pointer expressions. For one thing, it ensures that such variables will be regarded syntactically as variables; while pointer-kludge expressions will generally be interpreted correctly when used in legitimate code, it's possible for illegitimate code which should fail with a compile-time error to compile but produce something other than the desired effect. Further, the IAR syntax defines symbols which are available to the linker and may thus be used within assembly-language modules. By contrast, a .H file which defines pointer-kludge macros will not be usable within an assembly-language module; any hardware which will be used in both C and assembly code will need to have its address specified in two separate places.
The short answer to the question in your title is "differently". What's worse is that compilers from different vendors for the same target processor will use different approaches. This one
volatile unsigned short P3OUT # 0x0222u;
Is a common way to place a variable at a fixed address. But you will also see it used to identify individual bits within a memory mapped location = especially for microcontrollers which have bit-wide instructions like the PIC families.
These are things that the C Standard does not address, and should IMHO, as small embedded microcontrollers will eventually end up being the main market for C (yes, I know the kernel is written in C, but a lot of user-space stuff is moving to C++).
I actually joined the C committee to try and drive for changes in this area, but my sponsorship went away and it's a very expensive hobby.
A similar area is declaring a function to be an ISR.
This document shows one of the approaches we considered