Is a C compiler allowed to add functions to standard headers and still conform to the C standard?
I read this somewhere, but I can't find any reference in the standard, except in annex J.5:
The inclusion of any extension that may cause a strictly conforming
program to become invalid renders an implementation nonconforming.
Examples of such extensions are new keywords, extra library functions
declared in standard headers, or predefined macros with names that do
not begin with an underscore.
However, Annex J is informative and not normative... so it isn't helping.
So I wonder if it is okay or not for a conforming compiler to add additional functions in standard headers?
For example, lets say it adds non-standard itoa to stdlib.h.
In 4. "Conformance" §6, there is:
A conforming implementation may have extensions (including additional
library functions), provided they do not alter the behavior of any strictly conforming
program.
with the immediate conclusion in a footnote:
This implies that a conforming implementation reserves no identifiers other than those explicitly
reserved in this International Standard.
The reserved identifiers are described in 7.1.3. Basically, it is everything starting with an underscore and everything explicitly listed as used for the standard libraries.
So, yes the compiler is allowed to add extensions. But they have to have a name starting with an underscore or one of the prefixes reserved for libraries.
itoa is not a reserved identifier and a compiler defining it in a standard header is not conforming.
In "7.26 Future library directions" you have a list of the identifiers that may be added to the standard headers, this includes identifiers starting with str or mem, macros starting with E and stuff like that.
Other than that, implementations are restricted to the generic names as reserved in "7.1.3 Reserved identifiers".
Compilers for embedded systems regularly add functions and macros to standard headers, usually to make a special processor feature available for use.
If I read the standard correctly, they can do so without sacrificing conformity if they do use names specified as reserved by the standard. Since a conforming program may use any non-reserved name as a variable or a function name, using such a non-reserved name as an addition to a standard header would break a conforming program.
In practice, however, the compiler writers usually do not care too much. They will at most provide a list of elements defined for the system you may not use if you want your program to work with their implementation.
Related
#define and#undefshall not be used on a reserved identifier or reserved macro name.
I have a violation for this rule for this code
"#define _POSIX_C_SOURCE 200809L".
_POSIX_C_SOURCE is a reserved identifier for macro.
Which is a formal deviation for this code?
This isn't just a MISRA violation but a standard C violation as well. See for example C11 7.1.3:
All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use.
Where "reserved for any use" means reserved for the compiler/library implementation.
The problem lies in the Glibc naming of the identifier. If the implementation vouches for this identifier, then you should be able to use it.
But here's the catch: getting Glibc MISRA compliant is mission impossible. And professional MISRA-C implementations don't allow non-compliant libraries to be used.
If you still insist on using these libraries, you have to create some massive, project-wide deviation for the whole of the standard libraries. The problem here is that a vast amount of the Glibc code relies on gcc non-standard extensions, such as writing code that would otherwise have poorly-defined behavior outside the library implementation - for the sake of writing a standard lib where normal C rules don't apply. You cannot possibly make an argument that such code is to be trusted from a MISRA-C perspective.
I'd ask the person who made the call to combine POSIX, Glibc and MISRA-C in the same project how to carry on from here...
Yes, that's right, there is a violation of rule 21.1 here.
One of the points in the text of rule 21.1 states that identifiers or macro names beginning with and underscore should not be used.
Rationale of this rule is that macro definitions started from underscore are used in standard library headers.
I came across this paragraph from this answer by #zwol recently:
The __libc_ prefix on read is because there are actually three different names for read in the C library: read, __read, and __libc_read. This is a hack to achieve "namespace cleanliness", which you only need to worry about if you ever set out to implement a full-fledged and fully standards compliant C library. The short version is that there are many functions in the C library that need to call read, but some of them cannot use the name read to call it, because a C program is technically allowed to define a function named read itself.
As some of you may know, I am setting out to implement my own full-fledged and fully standards-compliant C library, so I'd like more details on this.
What is "namespace cleanliness", and how does glibc achieve it?
First, note that the identifier read is not reserved by ISO C at all. A strictly conforming ISO C program can have an external variable or function called read. Yet, POSIX has a function called read. So how can we have a POSIX platform with read that at the same time allows the C program? After all fread and fgets probably use read; won't they break?
One way would be to split all the POSIX stuff into separate libraries: the user has to link -lio or whatever to get read and write and other functions (and then have fread and getc use some alternative read function, so they work even without -lio).
The approach in glibc is not to use symbols like read, but instead stay out of the way by using alternative names like __libc_read in a reserved namespace. The availability of read to POSIX programs is achieved by making read a weak alias for __libc_read. Programs which make an external reference to read, but do not define it, will reach the weak symbol read which aliases to __libc_read. Programs which define read will override the weak symbol, and their references to read will all go to that override.
The important part is that this has no effect on __libc_read. Moreover, the library itself, where it needs to use the read function, calls its internal __libc_read name that is unaffected by the program.
So all of this adds up to a kind of cleanliness. It's not a general form of namespace cleanliness feasible in a situation with many components, but it works in a two-party situation where our only requirement is to separate "the system library" and "the user application".
OK, first some basics about the C language as specified by the standard. In order that you can write C applications without concern that some of the identifiers you use might clash with external identifiers used in the implementation of the standard library or with macros, declarations, etc. used internally in the standard headers, the language standard splits up possible identifiers into namespaces reserved for the implementation and namespaces reserved for the application. The relevant text is:
7.1.3 Reserved identifiers
Each header declares or defines all identifiers listed in its associated subclause, and optionally declares or defines identifiers listed in its associated future library directions subclause and identifiers which are always reserved either for any use or for use as file scope identifiers.
All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use.
All identifiers that begin with an underscore are always reserved for use as identifiers with file scope in both the ordinary and tag name spaces.
Each macro name in any of the following subclauses (including the future library directions) is reserved for use as specified if any of its associated headers is included; unless explicitly stated otherwise (see 7.1.4).
All identifiers with external linkage in any of the following subclauses (including the future library directions) and errno are always reserved for use as identifiers with external linkage.184)
Each identifier with file scope listed in any of the following subclauses (including the future library directions) is reserved for use as a macro name and as an identifier with file scope in the same name space if any of its associated headers is included.
No other identifiers are reserved. If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined.
Emphasis here is mine. As examples, the identifier read is reserved for the application in all contexts ("no other..."), but the identifier __read is reserved for the implementation in all contexts (bullet point 1).
Now, POSIX defines a lot of interfaces that are not part of the standard C language, and libc implementations might have a good deal more not covered by any standards. That's okay so far, assuming the tooling (linker) handles it correctly. If the application doesn't include <unistd.h> (outside the scope of the language standard), it can safely use the identifier read for any purpose it wants, and nothing breaks even though libc contains an identifier named read.
The problem is that a libc for a unix-like system is also going to want to use the function read to implement parts of the base C language's standard library, like fgetc (and all the other stdio functions built on top of it). This is a problem, because now you can have a strictly conforming C program such as:
#include <stdio.h>
#include <stdlib.h>
void read()
{
abort();
}
int main()
{
getchar();
return 0;
}
and, if libc's stdio implementation is calling read as its backend, it will end up calling the application's function (not to mention, with the wrong signature, which could break/crash for other reasons), producing the wrong behavior for a simple, strictly conforming program.
The solution here is for libc to have an internal function named __read (or whatever other name in the reserved namespace you like) that can be called to implement stdio, and have the public read function call that (or, be a weak alias for it, which is a more efficient and more flexible mechanism to achieve the same thing with traditional unix linker semantics; note that there are some namespace issues more complex than read that can't be solved without weak aliases).
Kaz and R.. have explained why a C library will, in general, need to have two names for functions such as read, that are called by both applications and other functions within the C library. One of those names will be the official, documented name (e.g. read) and one of them will have a prefix that makes it a name reserved for the implementation (e.g. __read).
The GNU C Library has three names for some of its functions: the official name (read) plus two different reserved names (e.g. both __read and __libc_read). This is not because of any requirements made by the C standard; it's a hack to squeeze a little extra performance out of some heavily-used internal code paths.
The compiled code of GNU libc, on disk, is split into several shared objects: libc.so.6, ld.so.1, libpthread.so.0, libm.so.6, libdl.so.2, etc. (exact names may vary depending on the underlying CPU and OS). The functions in each shared object often need to call other functions defined within the same shared object; less often, they need to call functions defined within a different shared object.
Function calls within a single shared object are more efficient if the callee's name is hidden—only usable by callers within that same shared object. This is because globally visible names can be interposed. Suppose that both the main executable and a shared object define the name __read. Which one will be used? The ELF specification says that the definition in the main executable wins, and all calls to that name from anywhere must resolve to that definition. (The ELF specification is language-agnostic and does not make any use of the C standard's distinction between reserved and non-reserved identifiers.)
Interposition is implemented by sending all calls to globally visible symbols through the procedure linkage table, which involves an extra layer of indirection and a runtime-variable final destination. Calls to hidden symbols, on the other hand, can be made directly.
read is defined in libc.so.6. It is called by other functions within libc.so.6; it's also called by functions within other shared objects that are also part of GNU libc; and finally it's called by applications. So, it is given three names:
__libc_read, a hidden name used by callers from within libc.so.6. (nm --dynamic /lib/libc.so.6 | grep read will not show this name.)
__read, a visible reserved name, used by callers from within libpthread.so.0 and other components of glibc.
read, a visible normal name, used by callers from applications.
Sometimes the hidden name has a __libc prefix and the visible implementation name has just two underscores; sometimes it's the other way around. This doesn't mean anything. It's because GNU libc has been under continuous development since the 1990s and its developers have changed their minds about internal conventions several times, but haven't always bothered to fix up all the old-style code to match the new convention (sometimes compatibility requirements mean we can't fix up the old code, even).
This question is about the same subject as strdup or _strdup? but it is not the same. That question asks how to work around MS's renamings, this question asks why they did it in the first place.
For some reason Microsoft has deprecated a whole slew of POSIX C functions and replaced them with _-prefixed variants. One example among many is isatty:
https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/posix-isatty
This POSIX function is deprecated. Use the ISO C++ conformant _isatty instead.
What exactly is ISO C++ conformant about _isatty? It appears to me that the MSDN help is totally wrong.
The other questions answer explained how to deal with this problem. You add the _CRT_NONSTDC_NO_DEPRECATE define. Fine. But I want to know what Microsoft's thinking is. What was their point in renaming and deprecating functions? Was it just to make C programmers lives even harder?
The fact that _isatty() is ISO C++ conformant makes sense if you think of it like a language-lawyer.
Under ISO C++, the compiler is only supposed to provide the functions in the standard (at least for the standard headers) -- they're not allowed to freely add extra functions, because it could conflict with functions declared in the code being compiled. Since isatty() is not listed in the standard, providing an isatty() function in a standard header would not be ISO C++ compliant.
However, the standard does allow the compiler to provide any function it wants as long as the function starts with a single underscore. So -- language lawyer time -- _isatty() is compliant with ISO C++.
I believe that's the logic that leads to the error message being phrased the way it is.
(Now, in this specific case, isatty() was provided in io.h, which is not actually a C++ standard header, so technically Microsoft could provide it and still claim to be standards-conformant. But, they had other non-compliant functions like strcmpi() in string.h, which is a standard header. So, for consistency, they deprecated all of the POSIX functions the same way and they all report the same error message.)
Names starting with an underscore, like _isatty are reserved for the implementation. They do not have a meaning defined by ISO C++, nor by ISO C, and you can't use them for your own purposes. So Microsoft is entirely right in using this prefix, and POSIX is actually wrong.
C++ has namespaces, so a hypthetical "Posix C++" could define namespace posix, but POSIX has essentially become fossilized - no new innovation in that area.
isatty & co., although POSIX, are not standard C, and are provided as "extensions" by the VC++ runtime1.
As such, they are prefixed with an underscore supposedly to avoid name clashes - as names starting with an underscore followed by a lowercase letter are reserved for implementation-defined stuff at global scope. So if, for example, you wanted to use an actual POSIX compatibility layer providing its own versions of these functions, they wouldn't have to fight with the VC++-provided "fake" ones for the non-underscored names.
Extensions which have no presumption to be actually POSIX-compliant, by the way.
Does any standard mandate name decoration?
As far as I know most (all?) conforming implementations add underscore prefix to the name of each exported symbol. Is this guaranteed by a C, POSIX or some other standard?
I'm not sure about a "standard" but the prepended underscore seems to be a very common convention dating from the 1970s. From What is the reason function names are prefixed with an underscore by the compiler?:
At the time that UNIX was rewritten in C in about 1974, its authors
already had extensive assember language libraries, and it was easier
to mangle the names of new C and C-compatible code than to go back and
fix all the existing code.
It is required for C names if you want to interoperate with the Microsoft compilers on Windows (win32 only; win64 does not use decoration since it has only a single standard calling convention).
https://learn.microsoft.com/en-us/cpp/build/reference/decorated-names
See also: Why do C compilers prepend underscores to external names?
The C standard makes clear that a compiler/library combination is allowed to do whatever it likes with the following code:
int doubleFree(char *p)
{
int temp = *p;
free(p);
free(p);
return temp;
}
In the event that a compiler does not require use of a particular bundled library, however, is there anything in the C standard which would forbid a library from defining a meaningful behavior? As a simple example, suppose code were written for a platform which had reference-counted pointers, such that following p = malloc(1234); __addref(p); __addref(p); the first two calls to free(p) would decrement the counter but not free the memory. Any code written for use with such a library would naturally work only with such a library (and the __addref() calls would likely fail on most others), but such a feature could be helpful in many cases when e.g. it is necessary to pass the a string repeatedly to a method which expects to be given a string produced with strdup and consequently calls free on it.
In the event that a library would define a useful behavior for some action like double-freeing a pointer, is there anything in the C standard which would authorize a compiler to unilaterally break it?
There is really two question here, your formally stated one and your broader one outlined in your comments to questions raised by others.
Your formal question is answers by the definition of undefined behavior and section 4 on conformance. The definition says (emphasis mine):
behavior, upon use of a nonportable or erroneous program construct or of erroneous data,
for which this International Standard imposes no requirements
With emphasis on nonportable and imposes no requirements. This really says it all, the compiler is free to optimize in unpleasant manners or can also chose to make the behavior documented and well defined, this of course mean the program is no longer strictly conforming, which brings us to section 4:
A strictly conforming program shall use only those features of the language and library
specified in this International Standard.2) It shall not produce output dependent on any
unspecified, undefined, or implementation-defined behavior, and shall not exceed any
minimum implementation limit.
but a conforming implementation is allowed extensions as long as they don't break a conforming program:
A conforming implementation may have extensions (including additional
library functions), provided they do not alter the behavior of any strictly conforming
program.3)
As the C FAQ says:
There are very few realistic, useful, strictly conforming programs. On the other hand, a merely conforming program can make use of any compiler-specific extension it wants to.
Your informal question deals with compilers taking more aggressive optimization opportunies with undefined behavior and in the long run the fear this will make real world systems programming impossible. While I do understand how this relatively new aggressive stance seems very programmer unfriendly to many in the end a compiler won't last very long if people can not build useful programs with it. A related blog post by John Regehr: Proposal for a Friendly Dialect of C.
One could argue the opposite, that compilers have made a lot of effort to build extensions to support varying needs not supported by the standard. I think the article GCC hacks in the Linux kernel demonstrates this well. It goes into the many gcc extensions that the Linux kernel relies on and clang has in general attempted to support as many gcc extensions as possible.
Whether compilers have removed useful handling of undefined behavior which hampers effective systems programming is not clear to me. I think specific questions on alternatives for individual cases of undefined behavior that has been exploited in systems programming and no longer work would be useful and interesting to the community.
Does C standard mandate that platforms must not define behaviors beyond those given in standard
Quite simply, no, it does not. The standard says:
An implementation shall be accompanied by a document that defines all implementation-
defined and locale-specific characteristics and all extensions.
There is no restriction anywhere in the standard that prohibits implementations from providing any other documentation they like. If you like, you can read N1570, the latest freely available draft of the ISO C standard, and confirm the lack of any such prohibition.
In the event that a library would define a useful behavior for some action like double-freeing a pointer, is there anything in the C standard which would authorize a compiler to unilaterally break it?
A C implementation includes both the compiler and the standard library. free() is part of the standard library. The standard does not define the behavior of passing the same pointer value to free() twice, but an implementation is free to define the behavior. Any such documentation is not required, and is outside the scope of the C standard.
If a C implementation documented, for example, that calling free() a second time on the same pointer value has no effect, but then doing so actually causes the program to crash, that would violate the implementation's own documentation, but it would not violate the C standard. There is no specific requirement in the C standard that says an implementation must conform to its own documentation, beyond the documentation that's required by the standard. An implementation's conformance to its own documentation is enforce by the market and by common sense, not by the C standard.
In the event that a library would define a useful behavior for some action like double-freeing a pointer, is there anything in the C standard which would authorize a compiler to unilaterally break it?
The compiler and the standard library (i.e. the one in which free is defined) are both part of the implementation - it isn't really coherent to talk about one of them doing something "unilaterally".
If a compiler "does not require use of a particular bundled library", then (other than perhaps as a freestanding implementation) it alone is not an implementation, so the standard doesn't apply to it at all. The behavior of a combination of a library and a compiler are the responsibility of whoever chooses to combine them (which may be the author of either component, or someone else entirely) and label this combination as an implementation. It would, of course, be wise not to document extensions implemented by the library as features of this implementation without confirming that the compiler does not break them. For that matter, you would also need to make sure that the compiler doesn't break anything used internally by the library.
In answer to your main question: no, it does not. If the end result of combining a library and a compiler (and kernel, dynamic loader, etc) is a conforming hosted environment, it is a conforming implementation even if some extensions that the library's author would like to have provided are not supported by the final result of combining them, but it does not require them to work, either. Conversely, if the result does not conform - for example if the compiler breaks the internals of the library and thereby causes some library function not to conform - then it is not a conforming implementation. Any program which calls free twice on the same pointer, or uses any reserved identifier starting with two underscores, causes undefined behavior and therefore is not a strictly conforming program.