A few years ago, before standardization of C, it was allowed to use struct selectors on addresses. For example, the following code was allowed and frequently used.
#define PTR 0xAA000
struct { int integ; };
func() {
int i;
i = PTR->integ; /* here, c is set to the first int at PTR */
return c;
}
Maybe it wasn't very neat, but I like it. In my opinion, the power and the versatility of this language relies also on its lack of constraints. Nowadays, compilers just dump an error. I'd like to know if it is possible to remove this restraint in the GNU C compiler.
PS: similar code was used on the UNIX kernel by the inventors of C. (in V6, some dummy structures have been declared in param.h)
'A few years ago' is actually a very, very long time ago. AFAICR, the C in 7th Edition UNIX™ (1979, a decade before the C89 standard was defined) didn't support that notation any more (but see below).
The code shown in the question only worked when all structure members of all structures shared the same name space. That meant that structure.integ or pointer->integ always referred to an int at the start of a structure because there was only one possible structure member integ across the entire program.
Note that in 'modern' C (1978 onwards), you cannot reference the structure type; there's neither a structure tag nor a typedef for it — the type is useless. The original code also references an undefined variable c.
To make it work, you'd need something like:
#define PTR 0xAA000
struct integ { int integ; };
int func(void)
{
struct integ *ptr = (struct integ *)PTR;
return ptr->integ;
}
C for 7th Edition UNIX
I suggested that the C with 7th Edition UNIX supported separate namespaces for separate structure types. However, the C Reference Manual published with the UNIX Programmer's Manual Vol 2 mentions in §8.5 Structures:
The names of structure members and structure tags may be the same as ordinary variables, since a distinction can
be made by context. However, names of tags and members must be distinct. The same member name can appear in
different structures only if the two members are of the same type and if their origin with respect to their structure is
the same; thus separate structures can share a common initial segment.
However, that same manual also mentions the notations (see also What does =+ mean in C):
§7.14.2 lvalue =+ expression
§7.14.3 lvalue =- expression
§7.14.4 lvalue =* expression
§7.14.5 lvalue =/ expression
§7.14.6 lvalue =% expression
§7.14.7 lvalue =>> expression
§7.14.8 lvalue =<< expression
§7.14.9 lvalue =& expression
§7.14.10 lvalue =^ expression
§7.14.11 lvalue =| expression
The behavior of an expression of the form ‘‘E1 =op E2’’ may be inferred by taking it as equivalent to
‘‘E1 = E1 op E2’’; however, E1 is evaluated only once. Moreover, expressions like ‘‘i =+ p’’ in which a pointer is
added to an integer, are forbidden.
AFAICR, that was not supported in the first C compilers I used (1983 — I'm ancient, but not quite that ancient); only the modern += notations were allowed. In other words, I don't think the C described by that reference manual was fully current when the product was released. (I've not checked my 1st Edition of K&R — does anyone have one on hand to check?) You can find the UNIX 7th Edition manuals online at http://cm.bell-labs.com/7thEdMan/.
By giving the structure a type name and adjusting your macro slightly you can achieve the same effect in your code:
typedef struct { int integ; } PTR_t;
#define PTR ((PTR_t*)0xAA000)
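With those two lines in place, the function from the question can be written as a minimal sketch in standard C (whether reading that fixed address actually does anything useful is, of course, entirely platform-specific):
int func(void)
{
    return PTR->integ; /* reads the int stored at address 0xAA000 */
}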
I'd like to know if it is possible to remove this restraint in the GNU C compiler.
I'm reasonably sure the answer is no -- that is, unless you rewrite gcc to support the older version of the language.
The gcc manual documents the -traditional command-line option:
'-traditional' '-traditional-cpp'
Formerly, these options caused GCC to attempt to emulate a
pre-standard C compiler. They are now only supported with the
`-E' switch. The preprocessor continues to support a pre-standard
mode. See the GNU CPP manual for details.
This implies that modern gcc (the quote is from the 4.8.0 manual) no longer supports pre-ANSI C.
The particular feature you're referring to isn't just pre-ANSI, it's very pre-ANSI. The ANSI standard was published in 1989. The first edition of K&R was published in 1978, and as I recall the language it described didn't support the feature you're looking for. The initial release of gcc was in 1987, so it's very likely that no version of gcc has ever supported that feature.
Furthermore, enabling such a feature would break existing code which may depend on the ability to use the same member name in different structures. (Traces of the old rules survive in the standard C library, where for example the members of type struct tm all have names starting with tm_; in modern C that would not be necessary.)
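As a small made-up illustration of those old rules: with a single shared member namespace, a pair of declarations like the following could not coexist, because count would have had to denote a single (offset, type) pair for the whole program; prefixes such as tm_ were the usual way to avoid such collisions.
struct buffer { char *data; int count; }; /* count at a non-zero offset */
struct queue  { int count; void *head; }; /* fine in modern C; a clash under the old shared-namespace rules */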
You might be able to find sources for an ancient C compiler that works the way you want. The late Dennis Ritchie's home page would be a good starting point for that. It's not at all obvious that you'd be able to get such a compiler working on any modern system without a great deal of work. And the result would be a compiler that doesn't support a number of newer features of C that you might find useful, such as the long, signed, and unsigned keywords, the ability to pass structures by value, function prototypes, and diagnostics for attempts to mix pointers and integers.
C is better now than it was then. There are a few dangerous things that are slightly more difficult than they were, but I'm not aware that any actual expressive power has been lost.
Related
Is it OK to do something like this?
struct MyStruct {
int x;
const char y; // notice the const
unsigned short z;
};
struct MyStruct AStruct;
fread(&AStruct, sizeof AStruct, 1,
      SomeFileThatWasDefinedEarlierButIsntIncludedInThisCodeSnippet);
I am changing the constant struct member by writing to the entire struct from a file. How is that supposed to be handled? Is this undefined behavior, to write to a non-constant struct, if one or more of the struct members is constant? If so, what is the accepted practice to handle constant struct members?
It's undefined behavior.
The C11 draft n1570 says:
6.7.3 Type qualifiers
...
...
If an attempt is made to modify an object defined with a const-qualified type through use of an lvalue with non-const-qualified type, the behavior is undefined.
My interpretation of this is: To be compliant with the standard, you are only allowed to set the value of the const member during object creation (aka initialization) like:
struct MyStruct AStruct = {1, 'a', 2}; // Fine
Doing
AStruct.y = 'b'; // Error
should give a compiler error.
You can trick the compiler with code like:
memcpy(&AStruct, &AnotherStruct, sizeof AStruct);
It will probably work fine on most systems but it's still undefined behavior according to the C11 standard.
Also see memcpy with destination pointer to const data
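If you want to stay strictly within the rules, one option (a sketch, not taken from either answer; load, fp and MyStructPlain are made-up names, it requires C99 or later because the initializers are not constant, and it assumes the plain twin has the same layout, which the const qualifier does not affect) is to read into a twin struct without the const qualifier and initialize the const-containing struct from it:
#include <stdio.h>

struct MyStruct { int x; const char y; unsigned short z; };  /* as in the question */
struct MyStructPlain { int x; char y; unsigned short z; };   /* same members, no const */

int load(FILE *fp)
{
    struct MyStructPlain tmp;

    if (fread(&tmp, sizeof tmp, 1, fp) != 1)
        return -1;
    {
        /* The const member is set at initialization time, which is allowed. */
        struct MyStruct AStruct = { tmp.x, tmp.y, tmp.z };
        /* ... use AStruct here ... */
        (void)AStruct;
    }
    return 0;
}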
How are constant struct members handled in C?
Read the C11 standard n1570 and its §6.7.3 related to the const qualifier.
If so, what is the accepted practice to handle constant struct members?
It depends if you care more about strict conformance to the C standard, or about practical implementations. See this draft report (work in progress in June 2020) discussing these concerns. Such considerations depend on the development efforts allocated on your project, and on portability of your software (to other platforms).
It is likely that you won't spend the same efforts on the embedded software of a Covid respirator (or inside some ICBM) and on the web server (like lighttpd or a library such as libonion or some FastCGI application) inside a cheap consumer appliance or running on some cheap rented Linux VPS.
Consider also using static analysis tools such as Frama-C or the Clang static analyzer on your code.
Regarding undefined behavior, be sure to read this blog.
See also this answer to a related question.
I am changing the constant struct member by writing to the entire struct from a file.
Then endianness issues and file system issues are important. Consider perhaps using libraries related to JSON, to YAML, perhaps mixed to sqlite or PostGreSQL or TokyoCabinet (and the source code of all these open source libraries, or from the Linux kernel, could be inspirational).
The Standard is a bit sloppy in its definition and use of the term "object". For a statement like "All X must be Y" or "No X may be Z" to be meaningful, the definition of X must have criteria that are not only satisfied by all X, but that would unambiguously exclude all objects that aren't required to be Y or are allowed to be Z.
The definition of "object", however, is simply "region of data storage in the execution environment, the contents of which can represent values". Such a definition, however, fails to make clear whether every possible range of consecutive addresses is always an "object", or when various possible ranges of addresses are subject to the constraints that apply to "objects" and when they are not.
In order for the Standard to unambiguously classify a corner case as defined or undefined, the Committee would have to reach a consensus as to whether it should be defined or undefined. If the Committee members fundamentally disagree about whether some cases should be defined or undefined, the only way to pass a rule by consensus will be if the rule is written ambiguously, in a way that allows people with contradictory views about what should be defined to each think the rule supports their viewpoint. While I don't think the Committee members explicitly wanted to make their rules ambiguous, I don't think the Committee could have reached consensus on rules that weren't.
Given that situation, many actions, including updating structures that have constant members, most likely fall in the realm of actions which the Standard doesn't require implementations to process meaningfully, but which the authors of the Standard would have expected implementations to process meaningfully anyhow.
Are there features / semantics introduced, or removed, in C99 which would make a well defined program written in C89 either
invalid (i.e. not compiling anymore, according to the C99 standard)
compiling, but having different semantics.
My findings so far, concerning plainly invalid programs:
implicit int (C89 §3.5.2)
implicit function declaration (C89 §3.3.2.2)
not returning from a function expecting a return value (C89 §3.6.6.4)
using new keywords as identifier (for example restrict, inline, etc)
hacks involving //, which are now treated as comments. However, nearly never encountered in production code.
Subtle changes, making the same code having different semantics:
Integer division has been made well defined: for example, -3 / 2 now has to truncate towards zero (C99 §6.5.5/6) instead of being implementation-defined (C89 §3.3.5/6); see the snippet below
strtod gained the ability to parse hexadecimal numbers in C99, by parsing 0x or 0X
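A quick way to see the division change on your own implementation (C99 requires the quotient to truncate toward zero; a C89 compiler was allowed to round toward negative infinity instead):
#include <stdio.h>

int main(void)
{
    /* C99: prints "-1 -1"; a C89 implementation was free to print "-2 1". */
    printf("%d %d\n", -3 / 2, -3 % 2);
    return 0;
}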
What have I missed?
There are a lot of programs which would have been considered valid under C89, prior to the publication of C99, which some people insist were never valid. C89 includes a rule that requires that an object of any type may only be accessed using a pointer of that type, a related type, or a character type. Prior to the publication of C99, this rule was generally interpreted as applying only to "named" objects (variables of static or automatic duration which are accessed directly by name), and only in situations where the object in question didn't have its address taken immediately before it was used as a different pointer type. Such interpretation was motivated by a number of factors:
1. One of the stated goals of the Standard was to fit with what existing compilers and programs were doing, and while it would have been rare for existing programs to access discrete named variables using pointers of different types other than in cases where the variable's address was taken immediately before such use, many other usages of pointer type punning were quite common.
2. The rationale for the Standard includes as its sole example a function which receives a pointer of one primitive type to write a global variable of another primitive type in such a way that a compiler would have no particular reason to expect aliasing. Being able to keep global variables in registers is clearly a useful optimization, and the stated purpose of the rule is to allow such optimizations in cases where a compiler would have no reason to expect aliasing to occur. Outlawing constructs like *(int*)&foo = 23; does nothing to aid such optimizations, since the fact that code is taking foo's address and dereferencing it should make it abundantly clear to any compiler that isn't being deliberately obtuse that the code is going to modify foo.
3. There are many kinds of code which require, semantically, the ability to use memory bits as various types, and nothing in the Standard indicates that the rules were intended to make programmers jump through hoops (e.g. by using memcpy) to achieve semantics that could have been easily obtained in the absence of the rules, especially considering that using memcpy would prevent the compiler from keeping global variables in registers across the pointer accesses (thus defeating the purpose for which the rules were written in the first place).
4. If structure types V and W have a common initial sequence, U is any union type containing both, and p is a V* which identifies the V within a U, then (W*)(U*)p may be used to access those common members, and will be equivalent to (W*)p (see the sketch after this list). Unless a compiler could show that p couldn't possibly be a pointer to a member of some union containing W, it would be required to allow (W*)p to access the common members; it was more helpful to simply treat such common member access as being legitimate regardless of whether or where U might exist than to search for excuses to deny it.
5. Nothing in the C89 rules makes clear how the "type" of a region of allocated storage is defined, or how storage which holds things of one type that are no longer needed might be re-purposed to hold things of another.
6. Keeping track of registers allocated to named variables was easier than keeping track of registers allocated to other pointer expressions, and code which was interested in minimizing the number of loads and stores via pointers would often copy things to named variables and work on them there.
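A minimal sketch of the common-initial-sequence situation from point 4 (the type and function names are made up for illustration):
#include <stdio.h>

struct circle { int kind; double radius; };
struct square { int kind; double side; };

union shape { struct circle c; struct square s; };

/* struct circle and struct square share the common initial member 'kind',
   and both are members of union shape, so that common member may be
   inspected through either structure type. */
int shape_kind(union shape *u)
{
    return u->c.kind; /* reads the common member even if a square is stored */
}

int main(void)
{
    union shape sh;
    sh.s.kind = 2;
    sh.s.side = 3.0;
    printf("kind = %d\n", shape_kind(&sh));
    return 0;
}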
C99 added "effective type" rules which are explicitly applicable to allocated storage. Some people insist those were merely "clarifications" of rules which already existed in C89, but for the above reasons I find that viewpoint untenable. It's fashionable to claim that the only reasons compilers didn't apply aliasing rules to unnamed objects are #5 and #6, but objections #1-#4 are equally significant (and continue to apply to C99 just as much as C89). Still, since C99 added the effective type rules, many constructs which would have been treated as legitimate by most common interpretations of the C89 rules are clearly forbidden.
As an element of contrast and comparison, the git/git codebase remains strictly conform to C89 and does not use C99 initializers, or features from newer C standard.
This is detailed in Git 2.23 (Q3 2019) in Git Coding Guidelines.
This answer illustrates the post-C89 features that the project nonetheless allows in its otherwise-C89 codebase.
See commit cc0c429 (16 Jul 2019) by Junio C Hamano (gitster).
(Merged by Junio C Hamano -- gitster -- in commit fe9dc6b, 25 Jul 2019)
CodingGuidelines: spell out post-C89 rules
Even though we have been sticking to C89, there are a few handy features we borrow from more recent C language in our codebase after trying them in weather balloons and saw that nobody screamed.
Spell them out.
While at it, extend the existing variable declaration rule a bit to
read better with the newly spelled out rule for the for loop.
The coding guidelines now include:
You should not use features from newer C standard, even if your compiler groks them.
There are a few exceptions to this guideline:
since early 2012 with e1327023ea (Git v1.7.9.2), we have been using an enum definition whose last element is followed by a comma.
This, like an array initializer that ends with a trailing comma, can be used to reduce the patch noise when adding a new identifier at the end.
since mid 2017 with cbc0f81d (Git v2.15.0-rc0), we have been using designated
initializers for struct (e.g. "struct t v = { .val = 'a' };")
There are certain C99 features that might be nice to use in our code base, but we've hesitated to do so in order to avoid breaking compatibility with older compilers.
But we don't actually know if people are even using pre-C99 compilers these days.
If this patch can survive a few releases without complaint, then we can feel more confident that designated initializers are widely supported by our user base.
It also is an indication that other C99 features may be supported, but not a guarantee (e.g., gcc had designated initializers before C99 existed).
since mid 2017 with 512f41cf (Git v2.15.0-rc0), we have been using designated initializers for array (e.g. "int array[10] = { [5] = 2 }").
This is another test balloon to see if we get complaints from people
whose compilers do not support designated initializer for arrays.
These used to be forbidden, but we have not heard any breakage report, and they are assumed to be safe.
Variables have to be declared at the beginning of the block, before the first statement (i.e. -Wdeclaration-after-statement).
Declaring a variable in the for loop "for (int i = 0; i < 10; i++)" is still not allowed in this codebase.
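Put together, code that follows those guidelines looks roughly like this (a sketch with made-up names, not taken from the Git tree):
enum color {
    COLOR_RED,
    COLOR_GREEN,
    COLOR_BLUE,                        /* trailing comma: allowed since v1.7.9.2 */
};

struct option {
    char val;
    int flags;
};

static struct option opt = { .val = 'a' };   /* designated struct initializer (since v2.15.0) */
static int lut[10] = { [5] = 2 };            /* designated array initializer (since v2.15.0) */

int sum(void)
{
    int i;                             /* declarations first; no "for (int i = 0; ...)" */
    int total = opt.flags;

    for (i = 0; i < 10; i++)
        total += lut[i];
    return total;
}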
Why is it sensible for a language to allow implicit declarations of functions and typeless variables? I get that C is old, but allowing to omit declarations and default to int() (or int in case of variables) doesn't seem so sane to me, even back then.
So, why was it originally introduced? Was it ever really useful? Is it actually (still) used?
Note: I realise that modern compilers give you warnings (depending on which flags you pass them), and you can suppress this feature. That's not the question!
Example:
int main() {
static bar = 7; // defaults to "int bar"
return foo(bar); // defaults to a "int foo()"
}
int foo(int i) {
return i;
}
See Dennis Ritchie's "The Development of the C Language": http://web.archive.org/web/20080902003601/http://cm.bell-labs.com/who/dmr/chist.html
For instance,
In contrast to the pervasive syntax variation that occurred during the
creation of B, the core semantic content of BCPL—its type structure
and expression evaluation rules—remained intact. Both languages are
typeless, or rather have a single data type, the 'word', or 'cell', a
fixed-length bit pattern. Memory in these languages consists of a
linear array of such cells, and the meaning of the contents of a cell
depends on the operation applied. The + operator, for example, simply
adds its operands using the machine's integer add instruction, and the
other arithmetic operations are equally unconscious of the actual
meaning of their operands. Because memory is a linear array, it is
possible to interpret the value in a cell as an index in this array,
and BCPL supplies an operator for this purpose. In the original
language it was spelled rv, and later !, while B uses the unary *.
Thus, if p is a cell containing the index of (or address of, or
pointer to) another cell, *p refers to the contents of the pointed-to
cell, either as a value in an expression or as the target of an
assignment.
This typelessness persisted in C until the authors started porting it to machines with different word lengths:
The language changes during this period, especially around 1977, were largely focused on considerations of portability and type safety,
in an effort to cope with the problems we foresaw and observed in
moving a considerable body of code to the new Interdata platform. C at
that time still manifested strong signs of its typeless origins.
Pointers, for example, were barely distinguished from integral memory
indices in early language manuals or extant code; the similarity of
the arithmetic properties of character pointers and unsigned integers
made it hard to resist the temptation to identify them. The unsigned
types were added to make unsigned arithmetic available without
confusing it with pointer manipulation. Similarly, the early language
condoned assignments between integers and pointers, but this practice
began to be discouraged; a notation for type conversions (called
`casts' from the example of Algol 68) was invented to specify type
conversions more explicitly. Beguiled by the example of PL/I, early C
did not tie structure pointers firmly to the structures they pointed
to, and permitted programmers to write pointer->member almost without
regard to the type of pointer; such an expression was taken
uncritically as a reference to a region of memory designated by the
pointer, while the member name specified only an offset and a type.
Programming languages evolve as programming practices change. In modern C and the modern programming environment, where many programmers have never written assembly language, the notion that ints and pointers are interchangeable may seem nearly unfathomable and unjustifiable.
It's the usual story — hysterical raisins (aka 'historical reasons').
In the beginning, the big computers that C ran on (DEC PDP-11) had 64 KiB for data and code (later 64 KiB for each). There was a limit to how complex you could make the compiler and still have it run. Indeed, there was scepticism that you could write an O/S using a high-level language such as C, rather than needing to use assembler. So, there were size constraints. Also, we are talking a long time ago, in the early to mid 1970s. Computing in general was not as mature a discipline as it is now (and compilers specifically were much less well understood). Also, the languages from which C was derived (B and BCPL) were typeless. All these were factors.
The language has evolved since then (thank goodness). As has been extensively noted in comments and down-voted answers, in strict C99, implicit int for variables and implicit function declarations have both been made obsolete. However, most compilers still recognize the old syntax and permit its use, with more or less warnings, to retain backwards compatibility, so that old source code continues to compile and run as it always did. C89 largely standardized the language as it was, warts (gets()) and all. This was necessary to make the C89 standard acceptable.
There is still old code around using the old notations — I spend quite a lot of time working on an ancient code base (circa 1982 for the oldest parts) which still hasn't been fully converted to prototypes everywhere (and that annoys me intensely, but there's only so much one person can do on a code base with multiple millions of lines of code). Very little of it still has 'implicit int' for variables; there are too many places where functions are not declared before use, and a few places where the return type of a function is still implicitly int. If you don't have to work with such messes, be grateful to those who have gone before you.
Probably the best explanation for "why" comes from here:
Two ideas are most characteristic of C among languages of its class: the relationship between arrays and pointers, and the way in which declaration syntax mimics expression syntax. They are also among its most frequently criticized features, and often serve as stumbling blocks to the beginner. In both cases, historical accidents or mistakes have exacerbated their difficulty. The most important of these has been the tolerance of C compilers to errors in type. As should be clear from the history above, C evolved from typeless languages. It did not suddenly appear to its earliest users and developers as an entirely new language with its own rules; instead we continually had to adapt existing programs as the language developed, and make allowance for an existing body of code. (Later, the ANSI X3J11 committee standardizing C would face the same problem.)
Systems programming languages don't necessarily need types; you're mucking around with bytes and words, not floats and ints and structs and strings. The type system was grafted onto it in bits and pieces, rather than being part of the language from the very beginning. As C has moved from being primarily a systems programming language to a general-purpose programming language, it has become more rigorous in how it handles types. But, even though paradigms come and go, legacy code is forever. There's still a lot of code out there that relies on that implicit int, and the standards committee is reluctant to break anything that's working. That's why it took almost 30 years to get rid of it.
A long, long time ago, back in the K&R, pre-ANSI days, functions looked quite different than they do today.
add_numbers(x, y)
{
return x + y;
}
int ansi_add_numbers(int x, int y); // modern, ANSI C
When you call a function like add_numbers, there is an important difference in the calling conventions: all types are "promoted" when the function is called. So if you do this:
// no prototype for add_numbers
short x = 3;
short y = 5;
short z = add_numbers(x, y);
What happens is x is promoted to int, y is promoted to int, and the return type is assumed to be int by default. Likewise, if you pass a float it is promoted to double. These rules ensured that prototypes weren't necessary, as long as you got the right return type, and as long as you passed the right number and type of arguments.
Note that the syntax for prototypes is different:
// K&R style function
// number of parameters is UNKNOWN, but fixed
// return type is known (int is default)
add_numbers();
// ANSI style function
// number of parameters is known, types are fixed
// return type is known
int ansi_add_numbers(int x, int y);
A common practice back in the old days was to avoid header files for the most part, and just stick the prototypes directly in your code:
void *malloc();
char *buf = malloc(1024);
if (!buf) abort();
Header files are accepted as a necessary evil in C these days, but just as modern C derivatives (Java, C#, etc.) have gotten rid of header files, old-timers didn't really like using header files either.
Type safety
From what I understand about the old old days of pre-C, there wasn't always much of a static typing system. Everything was an int, including pointers. In this old language, the only point of function prototypes would be to catch arity errors.
So if we hypothesize that functions were added to the language first, and then a static type system was added later, this theory explains why prototypes are optional. This theory also explains why arrays decay to pointers when used as function arguments -- since in this proto-C, arrays were nothing more than pointers which get automatically initialized to point to some space on the stack. For example, something like the following may have been possible:
function()
{
auto x[7];
x += 1;
}
Citations
The Development of the C Language, Dennis M. Ritchie
On typelessness:
Both languages [B and BCPL] are typeless, or rather have a single data type, the 'word,' or 'cell,' a fixed-length bit pattern.
On the equivalence of integers and pointers:
Thus, if p is a cell containing the index of (or address of, or pointer to) another cell, *p refers to the contents of the pointed-to cell, either as a value in an expression or as the target of an assignment.
Evidence for the theory that prototypes were omitted due to size constraints:
During development, he continually struggled against memory limitations: each language addition inflated the compiler so it could barely fit, but each rewrite taking advantage of the feature reduced its size.
Some food for thought. (It's not an answer; we actually know the answer — it's permitted for backward compatibility.)
And people should look at COBOL code bases or f66 libraries before asking why it hasn't been cleaned up in 30 years or so!
gcc with its defaults does not spit out any warnings.
With -Wall and -std=c99, gcc does spit out the correct warnings:
main.c:2: warning: type defaults to ‘int’ in declaration of ‘bar’
main.c:3: warning: implicit declaration of function ‘foo’
The lint functionality built into modern gcc is showing its colors.
Interestingly the modern clone of lint, the secure lint — I mean splint — gives only one warning by default.
main.c:3:10: Unrecognized identifier: foo
Identifier used in code has not been declared. (Use -unrecog to inhibit
warning)
The LLVM C compiler clang, which also has a static analyser built into it like gcc, spits out the two warnings by default.
main.c:2:10: warning: type specifier missing, defaults to 'int' [-Wimplicit-int]
static bar = 7; // defaults to "int bar"
~~~~~~ ^
main.c:3:10: warning: implicit declaration of function 'foo' is invalid in C99
[-Wimplicit-function-declaration]
return foo(bar); // defaults to a "int foo()"
^
People used to think we wouldn't need backward compatibility for 80's stuff, and that all the code would be cleaned up or replaced. But it turns out that's not the case; a lot of production code stays in prehistoric, non-standard times.
EDIT:
I didn't look through the other answers before posting mine, so I may have misunderstood the intention of the poster. But the thing is, there was a time when you hand-compiled your code and used toggles to put the binary pattern in memory. They didn't need a "type system". Nor did the PDP machine in front of which Ritchie and Thompson posed in the famous photo:
Don't look at the beard, look at the "toggles", which I heard were used to bootstrap the machine.
And also look how they used to boot UNIX in this paper. It's from the Unix 7th edition manual.
http://wolfram.schneider.org/bsd/7thEdManVol2/setup/setup.html
The point of the matter is that they didn't need much of a software layer to manage a machine with KB-sized memory. Knuth's MIX has 4000 words. You don't need all these types to program a MIX computer; you can happily compare an integer with a pointer on a machine like this.
I thought the reason they did this was quite self-evident, so I focused on how much is left to be cleaned up.
This is a nitpicky-details question with three parts. The context is that I wish to persuade some folks that it is safe to use <stddef.h>'s definition of offsetof unconditionally rather than (under some circumstances) rolling their own. The program in question is written entirely in plain old C, so please ignore C++ entirely when answering.
Part 1: When used in the same manner as the standard offsetof, does the expansion of this macro provoke undefined behavior per C89, why or why not, and is it different in C99?
#define offset_of(tp, member) (((char*) &((tp*)0)->member) - (char*)0)
Note: All implementations of interest to the people whose program this is supersede the standard's rule that pointers may only be subtracted from each other when they point into the same array, by defining all pointers, regardless of type or value, to point into a single global address space. Therefore, please do not rely on that rule when arguing that this macro's expansion provokes undefined behavior.
Part 2: To the best of your knowledge, has there ever been a released, production C implementation that, when fed the expansion of the above macro, would (under some circumstances) behave differently than it would have if its offsetof macro had been used instead?
Part 3: To the best of your knowledge, what is the most recently released production C implementation that either did not provide stddef.h or did not provide a working definition of offsetof in that header? Did that implementation claim conformance with any version of the C standard?
For parts 2 and 3, please answer only if you can name a specific implementation and give the date it was released. Answers that state general characteristics of implementations that may qualify are not useful to me.
There is no way to write a portable offsetof macro. You must use the one provided by stddef.h.
Regarding your specific questions:
1. The macro invokes undefined behavior. You cannot subtract pointers except when they point into the same array.
2. The big difference in practical behavior is that the macro is not an integer constant expression, so it can't safely be used for static initializers, bitfield widths, etc. (see the sketch after this answer). Also, strict bounds-checking C implementations might completely break it.
3. There has never been any C standard that lacked stddef.h and offsetof. Pre-ANSI compilers might lack it, but they have much more fundamental problems that make them unusable for modern code (e.g. lack of void * and const).
Moreover, even if some theoretical compiler did lack stddef.h, you could just provide a drop-in replacement, just like the way people drop in stdint.h for use with MSVC...
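As a concrete illustration of point 2 (a sketch; struct hdr and length_off are made up for this example): the standard offsetof can appear wherever an integer constant expression is required, whereas a conforming compiler is free to reject the hand-rolled pointer arithmetic in the same position.
#include <stddef.h>

struct hdr { char tag; long length; };

/* OK: offsetof expands to an integer constant expression of type size_t,
   so it may be used in a static initializer. */
static size_t length_off = offsetof(struct hdr, length);

/* A conforming compiler may reject the hand-rolled equivalent here, because
   the pointer subtraction is not an integer constant expression:

   static size_t length_off2 =
       (size_t)(((char *)&((struct hdr *)0)->length) - (char *)0);
*/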
To answer #2: yes, gcc-4* (I'm currently looking at v4.3.4, released 4 Aug 2009, but it should hold true for all gcc-4 releases to date). The following definition is used in their stddef.h:
#define offsetof(TYPE, MEMBER) __builtin_offsetof (TYPE, MEMBER)
where __builtin_offsetof is a compiler builtin like sizeof (that is, it's not implemented as a macro or run-time function). Compiling the code:
#include <stddef.h>
struct testcase {
char array[256];
};
int main (void) {
char buffer[offsetof(struct testcase, array[0])];
return 0;
}
would result in an error using the expansion of the macro that you provided ("size of array ‘buffer’ is not an integral constant-expression") but would work when using the macro provided in stddef.h. Builds using gcc-3 used a macro similar to yours. I suppose that the gcc developers had many of the same concerns regarding undefined behavior, etc that have been expressed here, and created the compiler builtin as a safer alternative to attempting to generate the equivalent operation in C code.
Additional information:
A mailing list thread from the Linux kernel developer's list
GCC's documentation on offsetof
A sort-of-related question on this site
Regarding your other questions: I think R's answer and his subsequent comments do a good job of outlining the relevant sections of the standard as far as question #1 is concerned. As for your third question, I have not heard of a modern C compiler that does not have stddef.h. I certainly wouldn't consider any compiler lacking such a basic standard header as "production". Likewise, if their offsetof implementation didn't work, then the compiler still has work to do before it could be considered "production", just like if other things in stddef.h (like NULL) didn't work. A C compiler released prior to C's standardization might not have these things, but the ANSI C standard is over 20 years old so it's extremely unlikely that you'll encounter one of these.
The whole premise of this problem begs a question: If these people are convinced that they can't trust the version of offsetof that the compiler provides, then what can they trust? Do they trust that NULL is defined correctly? Do they trust that long int is no smaller than a regular int? Do they trust that memcpy works like it's supposed to? Do they roll their own versions of the rest of the C standard library functionality? One of the big reasons for having language standards is so that you can trust the compiler to do these things correctly. It seems silly to trust the compiler for everything else except offsetof.
Update: (in response to your comments)
I think my co-workers behave like yours do :-) Some of our older code still has custom macros defining NULL, VOID, and other things like that since "different compilers may implement them differently" (sigh). Some of this code was written back before C was standardized, and many older developers are still in that mindset even though the C standard clearly says otherwise.
Here's one thing you can do to both prove them wrong and make everyone happy at the same time:
#include <stddef.h>
#ifndef offsetof
#define offsetof(tp, member) (((char*) &((tp*)0)->member) - (char*)0)
#endif
In reality, they'll be using the version provided in stddef.h. The custom version will always be there, however, in case you run into a hypothetical compiler that doesn't define it.
Based on similar conversations that I've had over the years, I think the belief that offsetof isn't part of standard C comes from two places. First, it's a rarely used feature. Developers don't see it very often, so they forget that it even exists. Second, offsetof is not mentioned at all in Kernighan and Ritchie's seminal book "The C Programming Language" (even the most recent edition). The first edition of the book was the unofficial standard before C was standardized, and I often hear people mistakenly referring to that book as THE standard for the language. It's much easier to read than the official standard, so I don't know if I blame them for making it their first point of reference. Regardless of what they believe, however, the standard is clear that offsetof is part of ANSI C (see R's answer for a link).
Here's another way of looking at question #1. The ANSI C standard gives the following definition in section 4.1.5:
offsetof( type, member-designator)
which expands to an integral constant expression that has type size_t,
the value of which is the offset in bytes, to the structure member
(designated by member-designator ), from the beginning of its
structure (designated by type ).
Using the offsetof macro does not invoke undefined behavior. In fact, the behavior is all that the standard actually defines. It's up to the compiler writer to define the offsetof macro such that its behavior follows the standard. Whether it's implemented using a macro, a compiler builtin, or something else, ensuring that it behaves as expected requires the implementor to deeply understand the inner workings of the compiler and how it will interpret the code. The compiler may implement it using a macro like the idiomatic version you provided, but only because they know how the compiler will handle the non-standard code.
On the other hand, the macro expansion you provided indeed invokes undefined behavior. Since you don't know enough about the compiler to predict how it will process the code, you can't guarantee that particular implementation of offsetof will always work. Many people define their own version like that and don't run into problems, but that doesn't mean that the code is correct. Even if that's the way that a particular compiler happens to define offsetof, writing that code yourself invokes UB while using the provided offsetof macro does not.
Rolling your own macro for offsetof can't be done without invoking undefined behavior (ANSI C section A.6.2 "Undefined behavior", 27th bullet point). Using stddef.h's version of offsetof will always produce the behavior defined in the standard (assuming a standards-compliant compiler). I would advise against defining a custom version since it can cause portability problems, but if others can't be persuaded then the #ifndef offsetof snippet provided above may be an acceptable compromise.
(1) The undefined behavior is already there before you do the subtraction.
First of all, (tp*)0 is not what you think it is. It is a null pointer; such a beast is not necessarily represented with an all-zero bit pattern.
Then the member operator -> is not simply an offset addition. On a CPU with segmented memory this might be a more complicated operation.
Taking the address with a & operation is UB if the expression is
not a valid object.
(2) For point 2, there are certainly still architectures out in the wild (embedded stuff) that use segmented memory. For point 3, the point that R makes about integer constant expressions has another drawback: if the code is badly optimized the & operation might be done at runtime and signal an error.
(3) Never heard of such a thing, but this is probably not enough to convince your colleagues.
I believe that nearly every optimizing compiler has broken that macro at multiple points in time. Your coworkers have apparently been lucky enough not to have been hit by it.
What happens is that some junior compiler engineer decides that because the zero page is never mapped on their platform of choice, any time anyone does anything with a pointer to that page, that's undefined behavior and they can safely optimize away the whole expression. At that point, everyone's homebrew offsetof macros break until enough people scream about it, and those of us who were smart enough not to roll our own go happily about our business.
I don't know of any compiler where this is the behavior in the current released version, but I think I've seen it happen at some point with every compiler I've ever worked with.
The Wikipedia article on ANSI C says:
One of the aims of the ANSI C standardization process was to produce a superset of K&R C (the first published standard), incorporating many of the unofficial features subsequently introduced. However, the standards committee also included several new features, such as function prototypes (borrowed from the C++ programming language), and a more capable preprocessor. The syntax for parameter declarations was also changed to reflect the C++ style.
That makes me think that there are differences. However, I didn't see a comparison between K&R C and ANSI C. Is there such a document? If not, what are the major differences?
EDIT: I believe the K&R book says "ANSI C" on the cover. At least I believe the version that I have at home does. So perhaps there isn't a difference anymore?
There may be some confusion here about what "K&R C" is. The term refers to the language as documented in the first edition of "The C Programming Language." Roughly speaking: the input language of the Bell Labs C compiler circa 1978.
Kernighan and Ritchie were involved in the ANSI standardization process. The "ANSI C" dialect superseded "K&R C", and subsequent editions of "The C Programming Language" adopt the ANSI conventions. "K&R C" is a "dead language," except to the extent that some compilers still accept legacy code.
Function prototypes were the most obvious change between K&R C and C89, but there were plenty of others. A lot of important work went into standardizing the C library, too. Even though the standard C library was a codification of existing practice, it codified multiple existing practices, which made it more difficult. P.J. Plauger's book, The Standard C Library, is a great reference, and also tells some of the behind-the-scenes details of why the library ended up the way it did.
The ANSI/ISO standard C is very similar to K&R C in most ways. It was intended that most existing C code should build on ANSI compilers without many changes. Crucially, though, in the pre-standard era, the semantics of the language were open to interpretation by each compiler vendor. ANSI C brought in a common description of language semantics which put all the compilers on an equal footing. It's easy to take this for granted now, some 20 years later, but this was a significant achievement.
For the most part, if you don't have a pre-standard C codebase to maintain, you should be glad you don't have to worry about it. If you do--or worse yet, if you're trying to bring an old program up to more modern standards--then you have my sympathies.
There are some minor differences, but I think later editions of K&R are for ANSI C, so there's no real difference anymore.
"C Classic" for lack of a better terms had a slightly different way of defining functions, i.e.
int f( p, q, r )
int p;
float q;
double r;
{
// Code goes here
}
I believe the other difference was function prototypes. Prototypes didn't have to - in fact they couldn't - take a list of arguments or types. In ANSI C they do.
function prototype.
constant & volatile qualifiers.
wide character support and internationalization.
permit function pointer to be used without dereferencing.
Another difference is that function return types and parameter types did not need to be defined. They would be assumed to be ints.
f(x)
{
return x + 1;
}
and
int f(x)
int x;
{
return x + 1;
}
are identical.
The major differences between ANSI C and K&R C are as follows:
function prototyping
support of the const and volatile data type qualifiers
support wide characters and internationalization
permit function pointers to be used without dereferencing
ANSI C adopts the C++ function prototype technique, in which function definitions and declarations include function names, argument data types, and return value data types. Function prototypes enable an ANSI C compiler to check for function calls in user programs that pass an invalid number of arguments or incompatible argument data types. These fix a major weakness of the K&R C compiler.
Example: this declares and defines a function foo that takes two arguments:
unsigned long foo (char* fmt, double data)
{
/*body of foo */
}
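With such a prototype in scope, a mismatched call becomes a compile-time diagnostic instead of a runtime crash; a small sketch (caller is a made-up function):
unsigned long foo(char *fmt, double data);   /* prototype */

unsigned long caller(void)
{
    return foo("x=%f", 3.14);  /* OK: arguments match the prototype */
    /* return foo(42);            a conforming compiler must diagnose this call:
                                  wrong number and types of arguments */
}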
FUNCTION PROTOTYPING: ANSI C adopts the C++ function prototype technique, in which function definitions and declarations include function names, argument data types, and return value data types. Function prototypes enable ANSI C compilers to check for function calls in user programs that pass an invalid number of arguments or incompatible argument data types. These fix a major weakness of the K&R C compilers: invalid calls in user programs often passed compilation but caused the program to crash when executed.
The difference is:
Prototype
wide character support and internationalisation
Support for const and volatile keywords
permit function pointers to be used without dereferencing
A major difference nobody has yet mentioned is that before ANSI, C was defined largely by precedent rather than specification; in cases where certain operations would have predictable consequences on some platforms but not others (e.g. using relational operators on two unrelated pointers), precedent strongly favored making platform guarantees available to the programmer. For example:
On platforms which define a natural ranking among all pointers to all objects, application of the relational operators to arbitrary pointers could be relied upon to yield that ranking.
On platforms where the natural means of testing whether one pointer is "greater than" another never has any side-effect other than yielding a true or false value, application of the relational operators to arbitrary pointers could likewise be relied upon never to have any side-effects other than yielding a true or false value.
On platforms where two or more integer types shared the same size and representation, a pointer to any such integer type could be relied upon to read or write information of any other type with the same representation.
On two's-complement platforms where integer overflows naturally wrap silently, an operation involving unsigned values smaller than "int" could be relied upon to behave as though the values were unsigned in cases where the result would be between INT_MAX+1u and UINT_MAX and it was not promoted to a larger type, nor used as the left operand of >>, nor as either operand of /, %, or any comparison operator. Incidentally, the rationale for the Standard gives this as one of the reasons small unsigned types promote to signed.
Prior to C89, it was unclear to what lengths compilers for platforms where the above assumptions wouldn't naturally hold might be expected to go to uphold those assumptions anyway, but there was little doubt that compilers for platforms which could easily and cheaply uphold such assumptions should do so. The authors of the C89 Standard didn't bother to expressly say that because:
Compilers whose writers weren't being deliberately obtuse would continue doing such things when practical without having to be told (the rationale given for promoting small unsigned values to signed strongly reinforces this view).
The Standard only required implementations to be capable of running one possibly-contrived program without a stack overflow, and recognized that an obtuse implementation could treat any other program as invoking Undefined Behavior, but its authors didn't think it was worth worrying about obtuse compiler writers writing implementations that were "conforming" but useless.
Although "C89" was interpreted contemporaneously as meaning "the language defined by C89, plus whatever additional features and guarantees the platform provides", the authors of gcc have been pushing an interpretation which excludes any features and guarantees beyond those mandated by C89.
The biggest single difference, I think, is function prototyping and the syntax for describing the types of function arguments.
Despite all the claims to the contrary, K&R was and is quite capable of providing any sort of stuff from low down close to the hardware on up.
The problem now is to find a compiler (preferably free) that can give a clean compile on a couple of million lines of K&R C without having to mess with it, and that runs on something like an AMD multi-core processor.
As far as I can see, having looked at the source of the GCC 4.x.x series, there is no simple hack to reactivate the -traditional and -cpp-traditional flag functionality to their previous working state without more effort than I am prepared to put in; it would be simpler to build a K&R pre-ANSI compiler from scratch.