The title nearly says it all, but I will restate the question...
Is the following program a "strictly conforming program" under the C99 standard?
#include <stdlib.h>
/* Removing any pre-existing macro definition, in case one should exist in the implementation.
* Seems to be allowed under 7.1.3 para 3, as malloc does not begin with _X where X is any capital letter.
* And 7.1.4 para 1 explicitly permits #undef of such macros.
*/
#ifdef malloc
#undef malloc
#endif
/* Macro substitution has no impact on the external name malloc
* which remains accessible, e.g., via "(malloc)(s)". Such use of
* macro substitution seems enabled by 7.1.4 para 1, but not specifically
* mentioned otherwise.
*/
void * journalling_malloc(size_t size);
#define malloc(s) ((journalling_malloc)(s))
int main(void)
{
return malloc(10) == NULL ? 1 : 0;
/* Just for the sake of expanding the
* macro one time, return as exit code
* whether the allocation was performed.
*/
}
Let's look at what the C99 standard has to say about it:
See 7.1.3, §1, clause 5:
Each identifier with file scope listed in any of the following subclauses [...] is reserved for use as a macro name and as an identifier with file scope in the same name space if any of its associated headers is included.
As you include stdlib.h, the name malloc is reserved for use as a macro name.
But 7.1.4, §1 allows using #undef on reserved names:
The use of #undef to remove any macro definition will also ensure that an
actual function is referred to.
This makes it possible to re-#define malloc, which results in undefined behaviour according to 7.1.3, §2:
If the program [...] defines a reserved identifier as a macro name, the behavior is undefined.
Why does the standard make this restriction? Because other functions of the standard library may be implemented as function-like macros in terms of the original function, so hiding the declaration might break these other functions.
In practice, you should be fine as long as your definition of malloc satisfies all provisions the standard provides for the library function, which can be achieved by wrapping an actual call to malloc().
You will want to change journalling_malloc(...) from void to void *, change the comments to // (because they are commenting out your undef) and add a #endif near the top, but otherwise it looks fine.
Will it work: Yes.
Is it conformant: No.
According to the C Standard:
http://www.open-std.org/jtc1/sc22/wg14/www/docs/n1124.pdf
All names in the standard library are reserved (including malloc).
7.1.3 Reserved identifiers
Specifically:
<quote>Each macro name in any of the following subclauses</quote>
<quote>All identifiers with external linkage in any of the
following subclauses</quote>
Also, a strictly conforming program cannot define names that are reserved for the implementation (this includes reserved names and identifiers, both those used by the current libraries and those reserved for future use).
7.1.3 (note 2)
Specifically:
<quote>If the program declares or defines an identifier in a context in which
it is reserved or defines a reserved identifier as a macro name,
the behavior is undefined.</quote>
Thus by definition: defining malloc() is non conformant because it is undefined behavior (Illegal).
Macro identifiers are proper names, and all library identifiers of any kind are forbidden from aliasing to a macro regardless of the linguistic status of macros.
§6.10.3/7
The identifier immediately following
the define is called the macro name.
There is one name space for macro
names.
§7.1.3/1
Each identifier with file scope listed
in any of the following subclauses
(including the future library
directions) is reserved for use as a
macro name and as an identifier with
file scope in the same name space if
any of its associated headers is
included.
§7.1.3/2
If the program declares or defines an
identifier in a context in which it is
reserved (other than as allowed by
7.1.4), or defines a reserved identifier as a macro name, the
behavior is undefined.
Related
Consider this code:
/*
* stdio.h
*
* note: it is an example of a particular implementation of stdio.h
* containing _x; it is not "my code added to stdio.h"
*/
void _x(void);
/* t627.c */
#define _x 0
#include <stdio.h>
Invocation:
$ gcc t627.c
t627.c:1:12: error: expected identifier or ‘(’ before numeric constant
1 | #define _x 0
| ^
stdio.h:1:6: note: in expansion of macro ‘_x’
1 | void _x(void);
At translation phase 4 the identifier _x is non-reserved. At translation phase 7 the identifier _x is reserved (for use as identifier with file scope in both the ordinary and tag name spaces). Since translation phase 4 precedes translation phase 7, then at translation phase 7 the identifier _x (currently defined as a macro name) is already replaced by its replacement list 0, invalidating the program.
Does it mean that in cases when the user-defined macro (that begins with an underscore, followed by a lowercase letter) can collide/overlap with the file scope identifier with the same name, such file scope identifier cannot be reserved?
#define macros are always a textual substitution.
Headers, of course, are not compiled entities in their own right, so are only evaluated at the point they are #included.
Let's say you have a header containing a certain non-macro identifier*.
In a C module, you #define that same identifier to expand to something arbitrary and pathological, and then #include the header.
Since the compiler encounters the #define before it encounters the #include, all mentions in the header of the colliding identifier will be substituted with the macro's expansion. The consequences can be (and often are) disastrous, or at the very least hard to debug.
It doesn't really matter whether or not the identifier starts with an underscore. If you wrote #define printf scanf, just for instance, that would cause chaos!
(* I stipulate "non-macro" just to avoid the complications of what would happen if the header redefined - or tried to - the macro you defined first.)
You're not allowed to define macros with any of the reserved names. This is stated explicitly in section 7.1.3p2 of the C standard:
If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined.
(Boldface: my emphasis.)
To put it another way, every identifier that is reserved in some phase-7 context is also reserved for use as a macro name.
Found a relevant quote from P.J. Plauger (emphasis added):
Remember that the user can write macros that begin with an underscore, followed by a lowercase letter. These names overlap those reserved to the implementor for naming external functions and data objects.
So, the answer seems to be "yes".
For macros, are there any name limitations other than it needs to be an identifier? For example, would something like the following be valid?
#define assert getchar
#include <stdio.h>
int main(void)
{
assert();
}
Code link: https://godbolt.org/z/ra63na.
main:
push rbp
mov rbp, rsp
mov eax, 0
call getchar
mov eax, 0
pop rbp
ret
And does the preprocessor have any knowledge of the C language? Or is it more like a find-and-replace program?
For macros, are there any name limitations other than it needs to be an identifier?
Yes, they are subject to the provisions of section 7.1.3 of the language specification ("Reserved Identifiers"), in particular:
All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use
[including as macro names].
[...]
Each macro name in any of the [standard library specification] subclauses (including the future library directions) is reserved for
use as specified if any of its associated headers is included; unless
explicitly stated otherwise
[...]
Each identifier with file scope listed in any of the [standard library specification] subclauses (including the future library
directions) is reserved for use as a macro name and as an identifier
with file scope in the same name space if any of its associated
headers is included.
[...] If the program declares or defines an identifier in a context in
which it is reserved (other than as allowed by 7.1.4), or defines a
reserved identifier as a macro name, the behavior is undefined.
The second bullet point in particular would be relevant to your example code if it also included the assert.h header. The identifier assert would then be reserved for use as a macro name. Defining it as one yourself would trigger undefined behavior. That does not place any particular requirements on the implementation -- in fact that's exactly the meaning of "undefined behavior". It does not require the implementation to accept the code, nor to reject it, nor to emit any kind of diagnostic in either case. If it did accept it, the preprocessor would not be required to perform macro substitution on assert, nor would it be forbidden to do so, nor, in fact, would it be required to behave in a way that seems in any way rational or predictable.
The same would apply, based on the third bullet point, if you defined getchar as a macro name in code that includes stdio.h, as the example does. The code actually presented is ok, however.
You also ask,
And does the preprocessor have any knowledge of the C language? Or is
it more like a find-and-replace program?
A little. The C preprocessor is not a general-purpose macro language, and attempts to use it as one often go poorly. The preprocessor's input is a series of tokens, determined according to rules consistent with C syntax, and it uses the same syntax for identifiers that C does. Conditional inclusion directives recognize a subset of the arithmetic expressions of C, and they work in terms of one of the host implementation's integer data types. The preprocessor (or at least the tokenization stage preceding it) understands C string literals and character constants, so macro replacement does not affect the contents of these.
This is covered in section 7.1.2 and 7.1.3 of the standard (C11). Here is a selection of rules pertaining to macros:
If used, a header shall be included outside of any external declaration or definition, and it shall first be included before the first reference to any of the functions or objects it declares, or to any of the types or macros it defines.
The program shall not have any macros with names lexically identical to keywords currently defined prior to the inclusion of the header or when any macro defined in the header is expanded.
Each macro name in any of the following subclauses (including the future library
directions) is reserved for use as specified if any of its associated headers is included;
unless explicitly stated otherwise.
Each identifier with file scope listed in any of the following subclauses (including the
future library directions) is reserved for use as a macro name and as an identifier with
file scope in the same name space if any of its associated headers is included.
So the exact program you posted is correct, since <assert.h> has not been included. But it would be undefined behaviour if you did include that header.
It's really dumb. It understands enough to do token replacement, but not much more.
For example: #define test fail will replace test in test(...) but not tested or "test".
Since C has a very basic syntax, writing a parser that can work through and identify tokens like that is actually not that hard. Making it understand the totality of C syntax is beyond the scope of that tool.
In other words, for an input program like:
#define test fail
int main() {
test(9, "test", tested());
return 0;
}
The C pre-processor breaks this up into tokens that end up something like:
[ "#", "define", "test", "fail" ]
[ "int", "main", "(", ")", "{" ]
[ "test", "(", "9", "\"test\"", "tested", "(", ")", ")", ";" ]
...
Where each of those is processed using the simple pre-processor grammar.
This is slightly more complicated because macros can include arguments, but you get the idea. The grammar used is a simple subset of the whole C grammar.
Yes, it is valid. No, the pre-processor is not language aware. The pre-processor does exactly what it is told - includes content, replaces macros - if that results in invalid syntax, the compiler must detect that.
Other than C symbol naming rules, there are no C language dependencies or reserved words. All pre-processor directives start with #, which is not a valid C symbol name, so there is no need for reserved words.
The pre-processor can be run on its own - either by a command line option to the compiler driver or, in the case of the GNU toolchain, as the standalone executable cpp - making it useful for purposes other than just C and C++ source pre-processing.
I have read in many books and in other SO questions that the standard may expand the set of identifiers such as size_t or int32_t, so it reserves any use of the _t suffix for identifiers.
Is that true?
I could not find anything that discourages the use of this suffix in the ISO9899:1999 standard, but that standard is hard to read :(
No, it's not true.
The standard reserves the right to add identifiers starting with int or uint and ending _t to the stdint.h header (§7.31.10). Those identifiers are technically only reserved if that header is included but since it almost always is, they should be treated as reserved.
In general, the standard reserves identifiers defined in standard headers, or mentioned in the future directions for standard headers (§7.31). Identifiers having external linkage (library functions) are reserved for that use (which doesn't stop you from using them as local or static variables, for example). If the library header is included, then its identifiers are reserved for use at file scope. Read §7.1.3 for details.
As that section indicates, the only identifiers unconditionally reserved are those which start with an underscore followed by a capital letter or a second underscore.
While reading the standard, it's important to understand the difference between contexts in which a name is reserved:
Reserved for any use (identifiers starting with an underscore followed by another underscore or a capital letter): these identifiers may be used by the implementation as macros or special symbols which are handled in some idiosyncratic way by the compiler. Do not ever define one of these in your code, and use the ones which are documented only as indicated by the documentation. Do not use such a symbol if it is not documented, even if you see it used in some standard library header. Or someone else's code.
Reserved at file scope (other identifiers starting with an underscore, not part of any standard header): These identifiers will not be used as macros, and you must not define them as macros either. You may use them as local variables, labels, parameters, and struct or union members. Personally, I wouldn't do this, but it's permitted. I prefer to put an underscore at the end of an identifier which is used in some internal context.
Reserved at file scope and as a macro name (any identifier mentioned in an included standard header, including in the future directions clause): Again, since these identifiers may be macros, you should treat them as off limits if you #include the associated header. The standard does allow you to #undef an identifier used as a function name in the standard library, although you might find that performance suffers because the macro wraps a construct with equivalent semantics but optimised performance.
Reserved for use as an identifier with external linkage (any identifier defined in any standard library header as having external linkage, whether or not the header is included, including the identifier errno): The weakest reservation. If you don't include the associated header, you're free to use such an identifier, even at file scope, as long as it is not externally visible. So it could be a file-scope static or an enumeration member or the tag of a struct or union. The point of this clause is not to allow you to deliberately shadow the name of a standard library function. Rather, it is to protect you from future additions to the standard library which might export an external symbol you're currently using. Of course, if your current use is as an externally visible identifier, you're still going to have a future problem. But on the whole, externally visible symbols should be prefixed with a package name to avoid name collisions with other libraries.
Having said all that, it's unwise to use an identifier that looks like it might be a standard identifier. Posix includes a list of over a hundred patterns for identifier names that it might use in the future, including all identifiers ending _t, so if you expect your code to be used in a Posix environment, you'll want to avoid those names. And while future C standard revisions might avoid adding new type names to existing headers (aside from the integer typenames mentioned above), you don't really want to preclude using any such new types, since they may well be useful. (And, according to a comment by @JensGustedt, who knows a lot more about the workings of the C working group than I do, there will be a couple of new type names in existing headers in C2x.)
The _t suffix is not reserved by ISO 9899 as such. The future library directions of the C11 revision say only that (C11 7.31.10):
Typedef names beginning with int or uint and ending with _t may be
added to the types defined in the <stdint.h> header. [...]
That said, there are great many types with _t suffix defined in C11:
char16_t, char32_t, clock_t, cnd_t, constraint_handler_t, div_t, double_t, errno_t, fenv_t, fexcept_t, float_t, fpos_t, imaxdiv_t,
int_fastN_t, int_leastN_t, intmax_t, intN_t, intptr_t, ldiv_t, lldiv_t, max_align_t, mbstate_t, mtx_t, ptrdiff_t, rsize_t,
sig_atomic_t, size_t, thrd_start_t, thrd_t, time_t, tss_dtor_t, tss_t, uint_fastN_t, uint_leastN_t,
uintmax_t, uintN_t, uintptr_t, wchar_t, wctrans_t, wctype_t, wint_t
POSIX, on the other hand, reserves the _t suffix for system use. The POSIX 1003.1 rationale has this excerpt:
To allow implementors to provide their own types, all conforming applications are required to avoid symbols ending in _t, which permits the implementor to provide additional types.
All in all, considering that chances are you will want to use your C code on a POSIX system now or later, it is best to steer away from using _t for your own types.
Standard C allows you to use the _t suffix so long as you don't end up with a token that starts with a double underscore. (Note that C++ restricts this further in that a double underscore is not allowed anywhere in the token; worth adhering to should you anticipate your code reaching C++.)
It's POSIX that reserves _t.
What is the conventional way to set up your include guards? I usually write them as (for example.h):
#ifndef _EXAMPLE_H_
#define _EXAMPLE_H_
#include "example.h"
#endif
Does underscore convention matter? I've seen conflicting information when I googled this. Does the _EXAMPLE_H_ even have to match the name of the header?
Does underscore convention matter?
Yes. It matters.
Identifiers with a leading underscore followed by an upper case letter are reserved for the implementation. So what you have would cause undefined behaviour.
The following is the C standard's specification for naming the identifiers (C11 draft):
7.1.3 Reserved identifiers
Each header declares or defines all identifiers listed in its
associated subclause, and optionally declares or defines identifiers
listed in its associated future library directions subclause and
identifiers which are always reserved either for any use or for use as
file scope identifiers.
— All identifiers that begin with an underscore and either an
uppercase letter or another underscore are always reserved for any
use.
— All identifiers that begin with an underscore are always reserved
for use as identifiers with file scope in both the ordinary and tag
name spaces.
— Each macro name in any of the following subclauses (including the
future library directions) is reserved for use as specified if any of
its associated headers is included; unless explicitly stated otherwise
(see 7.1.4).
— All identifiers with external linkage in any of the
following subclauses (including the future library directions) and
errno are always reserved for use as identifiers with external
linkage.184)
— Each identifier with file scope listed in any of the
following subclauses (including the future library directions) is
reserved for use as a macro name and as an identifier with file scope
in the same name space if any of its associated headers is included.
No other identifiers are reserved. If the program declares or defines
an identifier in a context in which it is reserved (other than as
allowed by 7.1.4), or defines a reserved identifier as a macro name,
the behavior is undefined.
If the program removes (with #undef) any macro definition of an
identifier in the first group listed above, the behavior is undefined.
Without violating any of the above, the include guard name can be anything and doesn't have to be the name of the header file. But usually the convention I have seen/used is to use same name as that of the header file name so that it doesn't cause any unnecessary confusion.
There is no absolute requirement as to how include guards are named. It does not have to match the header name. I have seen (and used myself) some that use a UUID, basically composed of a randomly generated hexadecimal string.
Technically as KingsIndian said, identifiers beginning with underscores are reserved:
The rules, paraphrased from ANSI Sec. 4.1.2.1, are:
1. All identifiers beginning with an underscore followed
by an upper-case letter or another underscore are always
reserved (all scopes, all namespaces).
2. All identifiers beginning with an underscore are reserved
for ordinary identifiers (functions, variables, typedefs, enumeration
constants) with file scope.
...
comp.lang.c FAQ list · Question 1.29
Maybe the new ISO C11 (?) standard relaxes these rules, but that's been the bottom line for a while.
extern int ether_hostton (__const char *__hostname, struct ether_addr *__addr)
__THROW;
I found the above function declaration in /usr/include/netinet/ether.h on a Linux box.
Can someone explain what the double underscores mean in front of const (keyword), addr (identifier) and at last __THROW.
In C, symbols starting with an underscore followed by either an upper-case letter or another underscore are reserved for the implementation. You as a user of C should not create any symbols that start with the reserved sequences. In C++, the restriction is more stringent; you the user may not create a symbol containing a double-underscore.
Given:
extern int ether_hostton (__const char *__hostname, struct ether_addr *__addr)
__THROW;
The __const notation is there to allow for the possibility (somewhat unlikely) that a compiler that this code is used with supports prototype notations but does not have a correct understanding of the C89 standard keyword const. The autoconf macros can still check whether the compiler has working support for const; this code could be used with a broken compiler that does not have that support.
The use of __hostname and __addr is a protection measure for you, the user of the header. If you compile with GCC and the -Wshadow option, the compiler will warn you when any local variables shadow a global variable. If the function used just hostname instead of __hostname, and if you had a function called hostname(), there'd be a shadowing. By using names reserved to the implementation, there is no conflict with your legitimate code.
The use of __THROW means that the code can, under some circumstances, be declared with some sort of 'throw specification'. This is not standard C; it is more like C++. But the code can be used with a C compiler as long as one of the headers (or the compiler itself) defines __THROW to empty, or to some compiler-specific extension of the standard C syntax.
Section 7.1.3 of the C standard (ISO 9899:1999) says:
7.1.3 Reserved identifiers
Each header declares or defines all identifiers listed in its associated subclause, and
optionally declares or defines identifiers listed in its associated future library directions
subclause and identifiers which are always reserved either for any use or for use as file
scope identifiers.
— All identifiers that begin with an underscore and either an uppercase letter or another
underscore are always reserved for any use.
— All identifiers that begin with an underscore are always reserved for use as identifiers
with file scope in both the ordinary and tag name spaces.
— Each macro name in any of the following subclauses (including the future library
directions) is reserved for use as specified if any of its associated headers is included;
unless explicitly stated otherwise (see 7.1.4).
— All identifiers with external linkage in any of the following subclauses (including the
future library directions) are always reserved for use as identifiers with external
linkage.154)
— Each identifier with file scope listed in any of the following subclauses (including the
future library directions) is reserved for use as a macro name and as an identifier with
file scope in the same name space if any of its associated headers is included.
No other identifiers are reserved. If the program declares or defines an identifier in a
context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved
identifier as a macro name, the behavior is undefined.
If the program removes (with #undef) any macro definition of an identifier in the first
group listed above, the behavior is undefined.
Footnote 154) The list of reserved identifiers with external linkage includes errno, math_errhandling,
setjmp, and va_end.
See also What are the rules about using an underscore in a C++ identifier; a lot of the same rules apply to both C and C++, though the embedded double-underscore rule is in C++ only, as mentioned at the top of this answer.
C99 Rationale
The C99 Rationale says:
7.1.3 Reserved identifiers
To give implementors maximum latitude in packing library functions into files, all external
identifiers defined by the library are reserved in a hosted environment. This means, in effect, that no user-supplied external names may match library names, not even if the user function has
the same specification. Thus, for instance, strtod may be defined in the same object module as printf, with no fear that link-time conflicts will occur. Equally, strtod may call printf, or printf may call strtod, for whatever reason, with no fear that the wrong function will be called.
Also reserved for the implementor are all external identifiers beginning with an underscore, and all other identifiers beginning with an underscore followed by a capital letter or an underscore. This gives a name space for writing the numerous behind-the-scenes non-external macros and functions a library needs to do its job properly.
With these exceptions, the Standard assures the programmer that all other identifiers are available, with no fear of unexpected collisions when moving programs from one implementation to another.5) Note, in particular, that part of the name space of internal identifiers beginning with underscore is available to the user: translator implementors have not been the only ones to find use for “hidden” names. C is such a portable language in many respects that the issue of “name space pollution” has been and is one of the principal barriers to writing completely portable code. Therefore the Standard assures that macro and typedef names are reserved only if the associated header is explicitly included.
Footnote 5) See §6.2.1 for a discussion of some of the precautions an implementor should take to keep this promise. Note also that any implementation-defined member names in structures defined in <time.h> and <locale.h> must begin with an underscore, rather than following the pattern of other names in those structures.
And the relevant part of the rationale for §6.2.1 Scopes of identifiers is:
Although the scope of an identifier in a function prototype begins at its declaration and ends at the end of that function’s declarator, this scope is ignored by the preprocessor. Thus an identifier
in a prototype having the same name as that of an existing macro is treated as an invocation of that macro. For example:
#define status 23
void exit(int status);
generates an error, since the prototype after preprocessing becomes
void exit(int 23);
Perhaps more surprising is what happens if status is defined
#define status []
Then the resulting prototype is
void exit(int []);
which is syntactically correct but semantically quite different from the intent.
To protect an implementation’s header prototypes from such misinterpretation, the implementor must write them to avoid these surprises. Possible solutions include not using identifiers in prototypes, or using names in the reserved name space (such as __status or _Status).
See also P J Plauger The Standard C Library (1992) for an extensive discussion of name space rules and library implementations. The book refers to C90 rather than any later version of the standard, but most of the implementation advice in it remains valid to this day.
Names with double leading underscores are reserved for use by the implementation. This does not necessarily mean they are internal per se, although they often are.
The idea is, you're not allowed to use any names starting with __, so the implementation is free to use them in places like macro expansions, or in the names of syntax extensions (e.g. __gcnew is not part of C++, but Microsoft can add it to C++/CLI confident that no existing code should have something like int __gcnew; in it that would stop compiling).
To find out what these specific extensions mean, i.e. __const you'll need to consult the documentation for your specific compiler/platform. In this particular case, you should probably consider the prototype in the documentation (e.g. http://www.kernel.org/doc/man-pages/online/pages/man3/ether_aton.3.html) to be the function's interface and ignore the __const and __THROW decorations that appear in the actual header.
By convention in some libraries, this indicates that a particular symbol is for internal use and not intended to be part of the public API of the library.
The underscore in __const means that this keyword is a compiler extension and using it is not portable (the const keyword was added to C in a later revision, C89 I think).
The __THROW is also some kind of extension. I assume that it gets defined to some __attribute__(something) if gcc is used, but I'm not sure about that and too lazy to check.
The __addr can mean anything the programmer wanted it to mean; it's just a name.