Storage layout of C objects - c

The storage layout of C objects is mostly not defined. As far as I know only for struct members and array elements, the layout is defined.
Interestingly for function parameters the C11 standard explicitly mentions that the layout is not defined:
The layout of the storage for parameters is unspecified. (C11 § 6.9.1 P 9)
I was wondering if the standard also explicitly defines that for other objects, e.g., objects with automatic storage duration, the layout is undefined. Is someone aware of this? I couldn't find anything about this in the standard.
What about objects with internal or external linkage?

No, I don't think that there is an explicit mention of the fact that we can't know anything about relative object layout. In fact, the C standard is even more radical than that, you are not even allowed to do comparison with the < operator on two variables that are not elements of the same array, nor may you do arithmetic between pointers to objects that are not part of the same array.
So the whole question of "layout" cannot even be formulate with the terminology that the C standard provides.

Related

How are constant struct members handled in C?

Is it OK do do something like this?
struct MyStruct {
int x;
const char y; // notice the const
unsigned short z;
};
struct MyStruct AStruct;
fread(&MyStruct, sizeof (MyStruct), 1,
SomeFileThatWasDefinedEarlierButIsntIncludedInThisCodeSnippet);
I am changing the constant struct member by writing to the entire struct from a file. How is that supposed to be handled? Is this undefined behavior, to write to a non-constant struct, if one or more of the struct members is constant? If so, what is the accepted practice to handle constant struct members?
It's undefined behavior.
The C11 draft n1570 says:
6.7.3 Type qualifiers
...
...
If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined.
My interpretation of this is: To be compliant with the standard, you are only allowed to set the value of the const member during object creation (aka initialization) like:
struct MyStruct AStruct = {1, 'a', 2}; // Fine
Doing
AStruct.y = 'b'; // Error
should give a compiler error.
You can trick the compiler with code like:
memcpy(&AStruct, &AnotherStruct, sizeof AStruct);
It will probably work fine on most systems but it's still undefined behavior according to the C11 standard.
Also see memcpy with destination pointer to const data
How are constant struct members handled in C?
Read the C11 standard n1570 and its §6.7.3 related to the const qualifier.
If so, what is the accepted practice to handle constant struct members?
It depends if you care more about strict conformance to the C standard, or about practical implementations. See this draft report (work in progress in June 2020) discussing these concerns. Such considerations depend on the development efforts allocated on your project, and on portability of your software (to other platforms).
It is likely that you won't spend the same efforts on the embedded software of a Covid respirator (or inside some ICBM) and on the web server (like lighttpd or a library such as libonion or some FastCGI application) inside a cheap consumer appliance or running on some cheap rented Linux VPS.
Consider also using static analysis tools such as Frama-C or the Clang static analyzer on your code.
Regarding undefined behavior, be sure to read this blog.
See also this answer to a related question.
I am changing the constant struct member by writing to the entire struct from a file.
Then endianness issues and file system issues are important. Consider perhaps using libraries related to JSON, to YAML, perhaps mixed to sqlite or PostGreSQL or TokyoCabinet (and the source code of all these open source libraries, or from the Linux kernel, could be inspirational).
The Standard is a bit sloppy in its definition and use of the term "object". For a statement like "All X must be Y" or "No X may be Z" to be meaningful, the definition of X must have criteria that are not only satisfied by all X, but that would unambiguously exclude all objects that aren't required to be Y or are allowed to be Z.
The definition of "object", however, is simply "region of data storage in the execution environment, the contents of which can represent values". Such a definition, however, fails to make clear whether every possible range of consecutive addresses is always an "object", or when various possible ranges of addresses are subject to the constraints that apply to "objects" and when they are not.
In order for the Standard to unambiguously classify a corner case as defined or undefined, the Committee would have to reach a consensus as to whether it should be defined or undefined. If the Committee members fundamentally disagree about whether some cases should be defined or undefined, the only way to pass a rule by consensus will be if the rule is written ambiguously in a way that allows people with contradictory views about what should be defined to each think the rule supports their viewpoint. While I don't think the Committee members explicitly wanted to make their rules ambiguous, I don't think the Committee could have been consensus for rules that weren't.
Given that situation, many actions, including updating structures that have constant members, most likely falls in the realm of actions which the Standard doesn't require implementations to process meaningfully, but which the authors of the Standard would have expected that implementations would process meaningfully anyhow.

Valid programs in C89, but not in C99

Are there features / semantics introduced, or removed, in C99 which would make a well defined program written in C89 either
invalid (i.e not compiling anymore, according to the C99 standard)
compiling, but having different semantics.
My findings so far, concerning plainly invalid programs:
implicit int (C89 §3.5.2)
implicit function declaration (C89 §3.3.2.2)
not returning from a function expecting a return value (C89 §3.6.6.4)
using new keywords as identifier (for example restrict, inline, etc)
hacks involving //, which are now treated as comments. However, nearly never encountered in production code.
Subtle changes, making the same code having different semantics:
Integer division has been made well defined, for example -3 / 2 now has to truncate towards zero (C99 §6.5.5/6), instead of being implementation defined (C89 §3.3.5/6)
strtod gained the ability to parse hexadecimal numbers in C99, by parsing 0x or 0X
What have I missed?
There are a lot of programs which would have been considered valid under C89, prior to the publication of C99, which some people insist were never valid. C89 includes a rule that requires that an object of any type may only be accessed using a pointer of that type, a related type, or a character type. Prior to the publication of C99, this rule was generally interpreted as applying only to "named" objects (variables of static or automatic duration which are accessed directly by name), and only in situations where the object in question didn't have its address taken immediately before it was used as a different pointer type. Such interpretation was motivated by a number of factors:
One of the stated goals of the Standard was to fit with what existing compilers and programs were doing, and while it would have been rare for existing programs to access discrete named variables using pointers of different types other than in cases where the variable's address was taken immediately before such use, many other usages of pointer type punning were quite common.
The rationale for the Standard includes as its sole example a function which receives a pointer of one primitive type to write a global variable of another primitive type in such a way that a compiler would have no particular reason to expect aliasing. Being able to keep global variables in registers is clearly a useful optimization, and the stated purpose of the rule is to allow such optimizations in cases where a compiler would have no reason to expect aliasing to occur. Outlawing constructs like like (int*)&foo=23; does nothing to aid such optimizations, since the fact that code is taking foo's address and dereferencing it should make it abundantly clear to any compiler that isn't being deliberately obtuse that the code is going to modify foo.
There are many kinds of code which require semantically the ability to use memory bits as various types, and nothing in the Standard indicate that the rules were intended to make programmers jump through hoops (e.g. by using memcpy) to achieve semantics that could have been easily obtained in the absence of the rules, especially considering that using memcpy would prevent the compiler from keeping global variables in registers across the pointer accesses (thus defeating the purpose for which the rules were written in the first place).
If structure types V and W have a common initial sequence, U is any union type containing both, and p is a V* which identifies the V within a U, then (W*)(U*)p may be used to access those common members, and will be equivalent to (W*)p. Unless a compiler could show that p couldn't possibly be a pointer to a member of some union containing W, it would be required to allow (W*)p to access the common members; it was more helpful to simply treat such common member access as being legitimate regardless of whether or where U might exist than to search for excuses to deny it.
Nothing in the C89 rules makes clear how the "type" of a region of allocated storage is defined, or how storage which holds things of one type that are no longer needed might be re-purposed to hold things of another.
Keeping track of registers allocated to named variables was easier than keeping track of registers allocated to other pointer exceptions, and code which was interested in minimizing the number of loads and stores via pointers would often copy things to named variables and work on them there.
C99 added "effective type" rules which are explicitly applicable to allocated storage. Some people insist those were merely "clarifications" of rules which already existed in C89, but for the above reasons I find that viewpoint untenable. It's fashionable to claim that the only reasons compilers didn't apply aliasing rules to unnamed objects are #5 and #6, but objections #1-#4 are equally significant (and continue to apply to C99 just as much as C89). Still, since C99 added the effective type rules, many constructs which would have been treated as legitimate by most common interpretations of the C89 rules are clearly forbidden.
As an element of contrast and comparison, the git/git codebase remains strictly conform to C89 and does not use C99 initializers, or features from newer C standard.
This is detailed in Git 2.23 (Q3 2019) in Git Coding Guidelines.
This answer illustrates post-C89 feature that might be compatible with C89.
See commit cc0c429 (16 Jul 2019) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit fe9dc6b, 25 Jul 2019)
CodingGuidelines: spell out post-C89 rules
Even though we have been sticking to C89, there are a few handy features we borrow from more recent C language in our codebase after trying them in weather balloons and saw that nobody screamed.
Spell them out.
While at it, extend the existing variable declaration rule a bit to
read better with the newly spelled out rule for the for loop.
The coding guidelines now include:
You should not use features from newer C standard, even if your compiler groks them.
There are a few exceptions to this guideline:
since early 2012 with e1327023ea (Git v1.7.9.2), we have been using an enum definition whose last element is followed by a comma.
This, like an array initializer that ends with a trailing comma, can be used to reduce the patch noise when adding a new identifer at the end.
since mid 2017 with cbc0f81d (Git v2.15.0-rc0), we have been using designated
initializers for struct (e.g. "struct t v = { .val = 'a' };")
There are certain C99 features that might be nice to use in our code base, but we've hesitated to do so in order to avoid breaking compatibility with older compilers.
But we don't actually know if people are even using pre-C99 compilers these days.
If this patch can survive a few releases without complaint, then we can feel more confident that designated initializers are widely supported by our user base.
It also is an indication that other C99 features may be supported, but not a guarantee (e.g., gcc had designated initializers before C99 existed).
since mid 2017 with 512f41cf (Git v2.15.0-rc0), we have been using designated initializers for array (e.g. "int array[10] = { [5] = 2 }").
This is another test balloon to see if we get complaints from people
whose compilers do not support designated initializer for arrays.
These used to be forbidden, but we have not heard any breakage report, and they are assumed to be safe.
Variables have to be declared at the beginning of the block, before the first statement (i.e. -Wdeclaration-after-statement).
Declaring a variable in the for loop "for (int i = 0; i < 10; i++)" is still not allowed in this codebase.

Is it UB to declare same array as extern with different sizes in different compilation units

This is mainly a followup to Should definition and declaration match?
Question
Is it legal in C to have (for example) int a[10]; in one compilation unit and extern int a[4]; in another one ?
(You can find a working example in my answer to ref'd question)
Disclaimers :
I know it is dangerous and would not do it in production code
I know that if you have both in same compilation unit (typically through inclusion of a .h in the file containing the definition) compilers detects an error
I have already read Jonathan Leffler' excellent answer to How do I use extern to share variables between source files? but could not find the answer to this specific point there - even if Jonathan showed even worse usages ...
Even if different comments in referenced post spotted that as UB, I could not find any authoritative reference for it. So I would say that there is no UB here and that second compilation unit will have access to the beginning of the array, but I would really like a confirmation - or instead a reference about why it is UB
It is undefined behavior.
Section 6.2.7.2 of C99 states:
All declarations that refer to the same object or function shall have
compatible type; otherwise, the behavior is undefined.
NOTE: As mentioned in the comments below, the important part here is [...] that refer to the same object [...], which is further defined in 6.2.2:
In the set of translation units and libraries that constitutes an
entire program, each declaration of a particular identifier with
external linkage denotes the same object or function.
About the type compatibility rules for array types, section 6.7.5.2.4 of C99 clarifies what it means for two array types to be compatible:
For two array types to be compatible, both shall have compatible
element types, and if both size specifiers are present, and are
integer constant expressions, then both size specifiers shall have the
same constant value. If the two array types are used in a context
which requires them to be compatible, it is undefined behavior if the
two size specifiers evaluate to unequal values.
(Emphasis mine)
In the real world, as long as you stick to 1D arrays, it is probably harmless, because there is no bounds checking and the address of the first element remains the same regardless of the size specifier, but note that the sizeof operator will return different values in each source file (opening a wonderful opportunity to write buggy code).
Things start to get really ugly if you decide to extrapolate on this example and declare multidimensional arrays with different dimension sizes, because the offset of each element in the array will not match with the real dimensions any more.
Yes, it is legal. The language allows it.
In your specific case there will be no undefined behavior as the extern declared array is smaller than the actually allocated array.
It can be used in a case where the declaring module uses the "unpublished" array elements for e.g. housekeeping of its algorithms (abstraction hiding).

What do you call an instance of a struct?

This is purely out of interest. I searched around a bit, knowing that instances of classes are called objects, but I couldn't find what the correct word is for an instance of a struct in C, C++, C#, etc. Do we even have a word for this?
It is perfectly valid to call it an object even in C.
C99 Standard §3.15:
Para 1:
object
region of data storage in the execution environment, the contents of which can represent
values
In C and C++ word 'object' is much more general. It refers also to structure objects or primitives. From C99 standard:
object
region of data storage in the execution environment, the
contents of which can represent values
In C++ classes and structures are equivalent and types defined with struct are also considered class types. Standard uses expressions like object of class type or object of class X.

Can't initialize static structure with function pointer from another translation unit?

The Python documentation claims that the following does not work on "some platforms or compilers":
int foo(int); // Defined in another translation unit.
struct X { int (*fptr)(int); } x = {&foo};
Specifically, the Python docs say:
We’d like to just assign this to the tp_new slot, but we can’t, for
portability sake, On some platforms or compilers, we can’t statically
initialize a structure member with a function defined in another C
module, so, instead, we’ll assign the tp_new slot in the module
initialization function just before calling PyType_Ready(). --http://docs.python.org/extending/newtypes.html
Is the above standard C89 and/or C99? What compilers specifically cannot handle the above?
That kind of initialization has been permitted since at least C90.
From C90 6.5.7 "Initialization"
All the expressions in an initializer for an object that has static storage duration or in an initializer list for an object that has aggregate or union type shall be constant expressions.
And 6.4 "Constant expressions":
An address constant is a pointer to an lvalue designating an object of static storage duration, or to a function designator; it shall be created explicitly, using the unary & operator...
But it's certainly possible that some implementations might have trouble with the construct - I'd guess that wouldn't be true for modern implementations.
According to n1570 6.6 paragraph 9, the address of a function is an address constant, according to 6.7.9 this means that it can be used to initialize global variables. I am almost certain this is also valid C89.
However,
On sane platforms, the value of a function pointer (or any pointer, other than NULL) is only known at runtime. This means that the initialization of your structure can't take place until runtime. This doesn't always apply to executables but it almost always applies to shared objects such as Python extensions. I recommend reading Ulrich Drepper's essay on the subject (link).
I am not aware of which platforms this is broken on, but if the Python developers mention it, it's almost certainly because one of them got bitten by it. If you're really curious, try looking at an old Python extension and seeing if there's an appropriate message in the commit logs.
Edit: It looks like most Python modules just do the normal thing and initialize type structures statically, e.g., static type obj = { function_ptr ... };. For example, look at the mmap module, which is loaded dynamically.
The example is definitively conforming to C99, and AFAIR also C89.
If some particular (oldish) compiler has a problem with it, I don't think that the proposed solution is the way to go. Don't impose dynamic initialization to platforms that behave well. Instead, special case the weirdos that need special treatment. And try to phase them out as quickly as you may.

Resources