What is "namespace cleanliness", and how does glibc achieve it? - c

I came across this paragraph from this answer by @zwol recently:
The __libc_ prefix on read is because there are actually three different names for read in the C library: read, __read, and __libc_read. This is a hack to achieve "namespace cleanliness", which you only need to worry about if you ever set out to implement a full-fledged and fully standards compliant C library. The short version is that there are many functions in the C library that need to call read, but some of them cannot use the name read to call it, because a C program is technically allowed to define a function named read itself.
As some of you may know, I am setting out to implement my own full-fledged and fully standards-compliant C library, so I'd like more details on this.
What is "namespace cleanliness", and how does glibc achieve it?

First, note that the identifier read is not reserved by ISO C at all. A strictly conforming ISO C program can have an external variable or function called read. Yet POSIX has a function called read. So how can we have a POSIX platform that provides read and at the same time accepts such a C program? After all, fread and fgets probably use read; won't they break?
One way would be to split all the POSIX stuff into separate libraries: the user has to link -lio or whatever to get read and write and other functions (and then have fread and getc use some alternative read function, so they work even without -lio).
The approach in glibc is not to use symbols like read, but instead stay out of the way by using alternative names like __libc_read in a reserved namespace. The availability of read to POSIX programs is achieved by making read a weak alias for __libc_read. Programs which make an external reference to read, but do not define it, will reach the weak symbol read which aliases to __libc_read. Programs which define read will override the weak symbol, and their references to read will all go to that override.
The important part is that this has no effect on __libc_read. Moreover, the library itself, where it needs to use the read function, calls its internal __libc_read name that is unaffected by the program.
So all of this adds up to a kind of cleanliness. It's not a general form of namespace cleanliness feasible in a situation with many components, but it works in a two-party situation where our only requirement is to separate "the system library" and "the user application".
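The weak-alias arrangement described above can be sketched in a few lines. This is only an illustration, assuming GCC or Clang on an ELF target (the weak and alias attributes are compiler extensions, not ISO C). The names __mylib_answer, mylib_answer and mylib_twice are hypothetical; they stand in for __libc_read, read and a library function such as fread, and this is not glibc's actual source.

```c
/* Internal implementation under a reserved name (two leading underscores).
   Only the "library" side is entitled to define names like this. */
int __mylib_answer(void)
{
    return 42;
}

/* Public name: a weak alias for the internal one (GCC/Clang extension).
   A program that defines its own mylib_answer overrides this symbol. */
int mylib_answer(void) __attribute__((weak, alias("__mylib_answer")));

/* Library-internal caller: it uses the reserved name, so it is unaffected
   by whatever the program does with the public name. */
int mylib_twice(void)
{
    return 2 * __mylib_answer();
}
```

A program that does not define mylib_answer reaches the library's implementation through the weak symbol; one that does define it gets its own version, while mylib_twice keeps calling the internal name either way.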

OK, first some basics about the C language as specified by the standard. In order that you can write C applications without concern that some of the identifiers you use might clash with external identifiers used in the implementation of the standard library or with macros, declarations, etc. used internally in the standard headers, the language standard splits up possible identifiers into namespaces reserved for the implementation and namespaces reserved for the application. The relevant text is:
7.1.3 Reserved identifiers
Each header declares or defines all identifiers listed in its associated subclause, and optionally declares or defines identifiers listed in its associated future library directions subclause and identifiers which are always reserved either for any use or for use as file scope identifiers.
All identifiers that begin with an underscore and either an uppercase letter or another underscore are always reserved for any use.
All identifiers that begin with an underscore are always reserved for use as identifiers with file scope in both the ordinary and tag name spaces.
Each macro name in any of the following subclauses (including the future library directions) is reserved for use as specified if any of its associated headers is included; unless explicitly stated otherwise (see 7.1.4).
All identifiers with external linkage in any of the following subclauses (including the future library directions) and errno are always reserved for use as identifiers with external linkage.
Each identifier with file scope listed in any of the following subclauses (including the future library directions) is reserved for use as a macro name and as an identifier with file scope in the same name space if any of its associated headers is included.
No other identifiers are reserved. If the program declares or defines an identifier in a context in which it is reserved (other than as allowed by 7.1.4), or defines a reserved identifier as a macro name, the behavior is undefined.
Emphasis here is mine. As examples, the identifier read is reserved for the application in all contexts ("no other..."), but the identifier __read is reserved for the implementation in all contexts (bullet point 1).
Now, POSIX defines a lot of interfaces that are not part of the standard C language, and libc implementations might have a good deal more not covered by any standards. That's okay so far, assuming the tooling (linker) handles it correctly. If the application doesn't include <unistd.h> (outside the scope of the language standard), it can safely use the identifier read for any purpose it wants, and nothing breaks even though libc contains an identifier named read.
The problem is that a libc for a unix-like system is also going to want to use the function read to implement parts of the base C language's standard library, like fgetc (and all the other stdio functions built on top of it). This is a problem, because now you can have a strictly conforming C program such as:
#include <stdio.h>
#include <stdlib.h>

void read()
{
    abort();
}

int main()
{
    getchar();
    return 0;
}
and, if libc's stdio implementation is calling read as its backend, it will end up calling the application's function (not to mention, with the wrong signature, which could break/crash for other reasons), producing the wrong behavior for a simple, strictly conforming program.
The solution here is for libc to have an internal function named __read (or whatever other name in the reserved namespace you like) that can be called to implement stdio, and have the public read function call that (or, be a weak alias for it, which is a more efficient and more flexible mechanism to achieve the same thing with traditional unix linker semantics; note that there are some namespace issues more complex than read that can't be solved without weak aliases).

Kaz and R.. have explained why a C library will, in general, need to have two names for functions such as read, that are called by both applications and other functions within the C library. One of those names will be the official, documented name (e.g. read) and one of them will have a prefix that makes it a name reserved for the implementation (e.g. __read).
The GNU C Library has three names for some of its functions: the official name (read) plus two different reserved names (e.g. both __read and __libc_read). This is not because of any requirements made by the C standard; it's a hack to squeeze a little extra performance out of some heavily-used internal code paths.
The compiled code of GNU libc, on disk, is split into several shared objects: libc.so.6, ld.so.1, libpthread.so.0, libm.so.6, libdl.so.2, etc. (exact names may vary depending on the underlying CPU and OS). The functions in each shared object often need to call other functions defined within the same shared object; less often, they need to call functions defined within a different shared object.
Function calls within a single shared object are more efficient if the callee's name is hidden—only usable by callers within that same shared object. This is because globally visible names can be interposed. Suppose that both the main executable and a shared object define the name __read. Which one will be used? The ELF specification says that the definition in the main executable wins, and all calls to that name from anywhere must resolve to that definition. (The ELF specification is language-agnostic and does not make any use of the C standard's distinction between reserved and non-reserved identifiers.)
Interposition is implemented by sending all calls to globally visible symbols through the procedure linkage table, which involves an extra layer of indirection and a runtime-variable final destination. Calls to hidden symbols, on the other hand, can be made directly.
read is defined in libc.so.6. It is called by other functions within libc.so.6; it's also called by functions within other shared objects that are also part of GNU libc; and finally it's called by applications. So, it is given three names:
__libc_read, a hidden name used by callers from within libc.so.6. (nm --dynamic /lib/libc.so.6 | grep read will not show this name.)
__read, a visible reserved name, used by callers from within libpthread.so.0 and other components of glibc.
read, a visible normal name, used by callers from applications.
Sometimes the hidden name has a __libc prefix and the visible implementation name has just two underscores; sometimes it's the other way around. This doesn't mean anything. It's because GNU libc has been under continuous development since the 1990s and its developers have changed their minds about internal conventions several times, but haven't always bothered to fix up all the old-style code to match the new convention (sometimes compatibility requirements mean we can't fix up the old code, even).
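The three-name scheme can be illustrated with a small sketch, assuming GCC or Clang on an ELF target. The names __mylib_get, __mylib_hidden_get and mylib_get are hypothetical, and the attribute spelling is only one way to get this effect; glibc wraps the equivalent machinery in its own internal macros, which are not reproduced here.

```c
/* Visible reserved name: callable from sibling libraries that belong to
   the same implementation. This is where the function body lives. */
int __mylib_get(void)
{
    return 7;
}

/* Hidden name: an alias that is not exported from the shared object, so
   calls from inside the library can bind to it directly instead of going
   through the procedure linkage table. */
int __mylib_hidden_get(void)
    __attribute__((alias("__mylib_get"), visibility("hidden")));

/* Visible ordinary name for applications; weak, so a program's own
   definition of mylib_get takes precedence. */
int mylib_get(void) __attribute__((weak, alias("__mylib_get")));

/* Library-internal caller uses the hidden name. */
int mylib_get_plus_one(void)
{
    return __mylib_hidden_get() + 1;
}
```

When this is built into a shared object, nm --dynamic on the result should list __mylib_get and mylib_get but not __mylib_hidden_get.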

Related

Could anyone help me to solve "clang: error: Undefined symbol _revstring" [duplicate]

I've been working in C for so long that the fact that compilers typically add an underscore to the start of an extern is just understood... However, another SO question today got me wondering about the real reason why the underscore is added. A wikipedia article claims that a reason is:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
I think there's at least a kernel of truth to this, but it also seems to not really answer the question, since if the underscore is added to all externs it won't help much with preventing clashes.
Does anyone have good information on the rationale for the leading underscore?
Is the added underscore part of the reason that the Unix creat() system call doesn't end with an 'e'? I've heard that early linkers on some platforms had a limit of 6 characters for names. If that's the case, then prepending an underscore to external names would seem to be a downright crazy idea (now I only have 5 characters to play with...).
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
If the runtime support is provided by the compiler, you would think it would make more sense to prepend an underscore to the few external identifiers in the runtime support instead!
When C compilers first appeared, the basic alternative to programming in C on those platforms was programming in assembly language, and it was (and occasionally still is) useful to link together object files written in assembler and C. So really (IMHO) the leading underscore added to external C identifiers was to avoid clashes with the identifiers in your own assembly code.
(See also GCC's asm label extension; and note that this prepended underscore can be considered a simple form of name mangling. More complicated languages like C++ use more complicated name mangling, but this is where it started.)
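To see what a given toolchain does today, a short sketch like the following may help. It assumes GCC or Clang: __USER_LABEL_PREFIX__ is a macro predefined by those compilers that expands to the prefix added to C identifiers in assembly output (typically empty on Linux ELF targets and a single underscore on targets that keep the old convention), and the __asm__ label is the extension mentioned above. The names MYLANG_STR, user_label_prefix, next_ticket and mylang_next_ticket are made up for illustration.

```c
/* Two-step stringification so the macro is expanded before quoting. */
#define MYLANG_STR2(x) #x
#define MYLANG_STR(x) MYLANG_STR2(x)

/* Returns the symbol prefix this compiler applies to C identifiers:
   "" where no underscore is prepended, "_" where one is. */
const char *user_label_prefix(void)
{
    return MYLANG_STR(__USER_LABEL_PREFIX__);
}

/* asm label (GCC/Clang extension): the C identifier is next_ticket, but
   the assembler-level symbol is exactly the string given here, with no
   prefix added by the compiler. */
int next_ticket(void) __asm__("mylang_next_ticket");

int next_ticket(void)
{
    static int ticket = 0;
    return ++ticket;
}
```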
If the C compiler always prepends an underscore to every symbol,
then the startup/C-runtime code (which is usually written in assembly) can safely use labels and symbols that do not start with an underscore (such as the symbol 'start').
Even if you write a start() function in the C code, it gets generated as _start in the object/asm output. (Note that in this case there is no possibility for the C code to generate a symbol that does not start with an underscore.) So the startup coder doesn't have to worry about inventing obscure, improbable symbols (like $_dontuse42%$) for each of his/her global variables/labels.
So the linker won't complain about a name clash, and the programmer is happy. :)
The following is different from the practice of the compiler prepending an underscore in its output formats.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.
That is a convention followed for the C system libraries and other system components (and for things such as __FILE__, etc.).
(Note that such a symbol, e.g. _time, may end up with two leading underscores, __time, in the generated output.)
From what I always hear it is to avoid naming conflicts. Not for other extern variables but more so that when you use a library it will hopefully not conflict with the user code variable names.
The main function is not the real entry point of an executable. Some statically linked files have the real entry point that eventually calls main, and those statically linked files own the namespace that does not start with an underscore. On my system, in /usr/lib, there are gcrt1.o, crt1.o and dylib1.o among others. Each of those has a "start" function without an underscore that will eventually call the "_main" entry point. Everything else besides those files has external scope. The history has to do with mixing assembler and C in a project, where all C was considered external.
From Wikipedia:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support. Furthermore, when the C/C++ compiler needed to introduce names into external linkage as part of the translation process, these names were often distinguished with some combination of multiple leading or trailing underscores.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.

Is there a way to limit the use of a function to its library in C?

So I'm working on a static C library (like a library.a file) for a school project. There are multiple functions, some of which are placed in different files, so I can't use the static keyword. Is there a way that those functions could be limited to the library itself, an equivalent to static for libraries?
The C language does not have any formal sense of a unit of program organization larger than a single translation unit but smaller than a whole program. Libraries are a foreign concept to the language specification, provided by substantially all toolchains, but not part of the language itself. Thus, no, the C language does not define a mechanism other than static to declare that a function identifier can be referenced only by a proper subset of all translation units contributing to a program.
Such limitations are supported by some shared library formats, such as ELF, and it is common for C implementations targeting such shared libraries to provide extensions that enable those facilities to be engaged, but the same generally is not true for static libraries.
Note also that in all these cases, we're talking about the linkage of function identifiers, not about genuinely controlling access to the functions. In principle, any function in the program can be called from anywhere in the program via a function pointer pointing to it.
Frame challenge: why do you care?
The usual accommodation for having functions with external linkage that you don't want library clients to call directly would be to omit those functions from the library's public header files. What does it matter if some intrepid person can analyze the library to discover and possibly call those functions? Your public headers and documentation tell people how the library should be used. If people use it in other ways then that's on them.
It may not be possible to completely "hide" the existence of a set of restricted functions from other source files if they are defined with external linkage, since their identifiers will be visible when linking (as noted in the other answer).
However, if you are only looking to prevent someone from inadvertently calling restricted functions, one of these approaches may be useful:
In some of my projects, I have used #define and #ifdef statements to block restricted functions from being used throughout the program. For example, in my Hardware Abstraction Layer (HAL) library C source files, I typically place #define HAL__ prior to any #include statements. Then I place an #ifdef HAL__ ... #endif block around any restricted function declarations in my header files. Someone with intention could easily bypass this by adding #define HAL__ to their source code or by modifying the header file, but it provides some protection against unintentional use of restricted functions and other definitions.
Place the restricted function definitions in a separate header file used to build the library itself (for example library.a) and provide only header files containing non-restricted function declarations with the library. The identifiers for any functions defined will still be visible to the linker, but without the prototypes, it will be difficult for anyone to call them.
Again, if having the identifiers for any restricted functions visible throughout the program would be a problem (for example, duplicating other identifiers), then these options will not work. Also, if the goal is to prevent developers from intentionally calling restricted functions, then these options will not work, although option 2 would make this more difficult. If the intention is only to prevent unintentional calls to the restricted functions and there is no concern with having duplicate identifiers in the program, then these options may help.
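The first option (the HAL__ guard) might look roughly like this. It is a single-file sketch with the header contents inlined so that it compiles on its own; the file names hal.h and hal.c and the functions hal_read_sensor and hal_raw_register are hypothetical.

```c
/* hal.c (hypothetical): define the guard before the header is included. */
#define HAL__

/* ---- contents of hal.h, inlined here so the sketch is one file ---- */
int hal_read_sensor(void);            /* public: always declared */

#ifdef HAL__
int hal_raw_register(int index);      /* restricted: declared only for HAL sources */
#endif
/* ---- end of hal.h ---- */

/* Restricted helper. It still has external linkage, so the linker can
   see it; client code that includes hal.h simply gets no prototype. */
int hal_raw_register(int index)
{
    return 100 + index;
}

/* Public function built on the restricted helper. */
int hal_read_sensor(void)
{
    return hal_raw_register(2);
}
```

Client sources include hal.h without defining HAL__, so an accidental call to hal_raw_register there has no prototype in scope and is diagnosed by current compilers.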

Cannot use standard library function name for a global variable, even when no headers are included

Declaring a global variable with the same name as a standard function produces an error in clang (but not gcc). It is not due to a previous declaration in a header file. I can get the error by compiling the following one-line file:
extern void *memcpy[];
Clang says
foo.c:1:14: error: redefinition of 'memcpy' as different kind of symbol
foo.c:1:14: note: previous definition is here
Apparently this only happens for a few standard functions. printf produces an error, fprintf produces a warning, fseek just works.
Why is this an error? Is there a way to work around it?
Motivation. I am using the C compiler as a compiler backend. C code is programmatically generated. The generated code relies on byte-level address arithmetic and pointer type casting. All external symbols are declared as extern void *variablename[];.
According to the C standard (ISO 9899:1999 section 7.1.3), "all external identifiers defined by the library are reserved in a hosted environment. This means, in effect, that no user-supplied external names may match library names."
Your problem can be easily solved by adding a unique prefix to all your identifiers, e.g. "mylang_".
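As a rough sketch of the prefix approach, the generated declaration from the question could be emitted as follows. The mylang_ prefix comes from the suggestion above; the array size and the helper mylang_byte_offset are assumptions added only so that the example is self-contained.

```c
#include <stddef.h>

/* Generated declaration in the style of the question, with a "mylang_"
   prefix so the name no longer collides with the library's memcpy. */
extern void *mylang_memcpy[];

/* A definition so that this sketch links on its own; in real generated
   code it would normally come from another object file. */
void *mylang_memcpy[4];

/* Byte-level address arithmetic on the prefixed symbol. */
size_t mylang_byte_offset(size_t slot)
{
    return (size_t)((char *)&mylang_memcpy[slot] - (char *)mylang_memcpy);
}
```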
As an alternative, you can avoid the problem by using the LLVM or GCC -ffreestanding flag, which will compile your code for a non-hosted environment. (The C standard specifies that the restriction only applies to a hosted environment.) In this case you can use all the names you want (apart from main, which is still your program's entry point), but you must make your own arrangements for your library. This is how operating system kernels can legally define their own versions of the C library functions.
The reason is explained here and a relevant extract is given below. http://www.gnu.org/software/libc/manual/html_node/Reserved-Names.html
I get an error in gcc as well.
The names of all library types, macros, variables and functions that come from the ISO C standard are reserved unconditionally; your program may not redefine these names. All other library names are reserved if your program explicitly includes the header file that defines or declares them. There are several reasons for these restrictions:
Other people reading your code could get very confused if you were using a function named exit to do something completely different from what the standard exit function does, for example. Preventing this situation helps to make your programs easier to understand and contributes to modularity and maintainability.
It avoids the possibility of a user accidentally redefining a library function that is called by other library functions. If redefinition were allowed, those other functions would not work properly.
It allows the compiler to do whatever special optimizations it pleases on calls to these functions, without the possibility that they may have been redefined by the user. Some library facilities, such as those for dealing with variadic arguments (see Variadic Functions) and non-local exits (see Non-Local Exits), actually require a considerable amount of cooperation on the part of the C compiler, and with respect to the implementation, it might be easier for the compiler to treat these as built-in parts of the language.
The page also describes other restricted names.

What is the meaning of leading and trailing underscores in Linux kernel identifiers?

I keep running across little conventions like __KERNEL__.
Are the __ in this case a naming convention used by kernel developers or is it a syntax specific reason for naming a macro this way?
There are many examples of this throughout the code.
For example some functions and variables begin with an _ or even __.
Is there a specific reason for this?
It seems pretty widely used and I just need some clarification as to whether these things have a syntactical purpose or is it simply a naming convention.
Furthermore I see lots of user declared types such as uid_t. Again I assume this is a naming convention telling the reader that it is a user-defined type?
There are several cases:
In public facing headers, i.e. anything that libc will be taking over and putting under /usr/include/linux, the standards specify which symbols should be defined and any other symbols specific to the system shall start with underscore and capital letter or two underscores. That's the reason for __KERNEL__ in particular, because it is used in headers that are included both in kernel and in libc and some declarations are different.
In internal code, the convention usually is that the symbol __something is the workhorse for something, excluding some management, often locking. That is the reason for things like __d_lookup. A similar convention for system calls is that sys_something is the system call entry point that handles the context switch to and from the kernel and calls do_something to do the actual work.
The _t suffix is the standard library convention for typedefs, e.g. size_t, ptrdiff_t, foff_t and such. Kernel code follows this convention for its internal types too.
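The __something workhorse convention from the second case can be sketched in user space as follows. This is not kernel code: struct counter, COUNTER_INIT, counter_add and __counter_add are hypothetical names, and a C11 atomic_flag spinlock stands in for a kernel lock. In an ordinary hosted program the double-underscore name would belong to the implementation; it is used here only to mirror the kernel style.

```c
#include <stdatomic.h>

struct counter {
    atomic_flag lock;   /* stand-in for a kernel spinlock */
    long value;
};

#define COUNTER_INIT { ATOMIC_FLAG_INIT, 0 }

/* Workhorse: does the actual update and assumes the caller holds the lock. */
static void __counter_add(struct counter *c, long n)
{
    c->value += n;
}

/* Wrapper: handles the locking, then delegates to the workhorse. */
void counter_add(struct counter *c, long n)
{
    while (atomic_flag_test_and_set(&c->lock)) {
        /* spin until the lock is released */
    }
    __counter_add(c, n);
    atomic_flag_clear(&c->lock);
}
```

Code that already holds the lock calls __counter_add directly; everything else goes through counter_add.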
There are several __functions, e.g. __alloc_pages_nodemask(), that also seem to be exported. Also, __functions may be static, static inline, or global. By the way, observing that __alloc_pages_nodemask() is called only from mm code and nowhere else, such functions may be meant to be "internal" to some kernel framework.

Is a compiler allowed to add functions to standard headers?

Is a C compiler allowed to add functions to standard headers and still conform to the C standard?
I read this somewhere, but I can't find any reference in the standard, except in annex J.5:
The inclusion of any extension that may cause a strictly conforming
program to become invalid renders an implementation nonconforming.
Examples of such extensions are new keywords, extra library functions
declared in standard headers, or predefined macros with names that do
not begin with an underscore.
However, Annex J is informative and not normative... so it isn't helping.
So I wonder if it is okay or not for a conforming compiler to add additional functions in standard headers?
For example, let's say it adds non-standard itoa to stdlib.h.
In 4. "Conformance" §6, there is:
A conforming implementation may have extensions (including additional
library functions), provided they do not alter the behavior of any strictly conforming
program.
with the immediate conclusion in a footnote:
This implies that a conforming implementation reserves no identifiers other than those explicitly
reserved in this International Standard.
The reserved identifiers are described in 7.1.3. Basically, it is everything starting with an underscore and everything explicitly listed as used for the standard libraries.
So, yes the compiler is allowed to add extensions. But they have to have a name starting with an underscore or one of the prefixes reserved for libraries.
itoa is not a reserved identifier and a compiler defining it in a standard header is not conforming.
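The difference can be made concrete with a small sketch. __vendor_itoa is a hypothetical extension of the kind a standard header could declare without losing conformance, because its name is in the reserved namespace; the program's own itoa below deliberately has different semantics, to show that the program remains free to use that name. The example assumes a C library whose headers do not themselves declare itoa.

```c
#include <stdio.h>

/* Hypothetical vendor extension: the name starts with two underscores,
   so no strictly conforming program can already be using it. */
static int __vendor_itoa(int value, char *buf, size_t size)
{
    return snprintf(buf, size, "%d", value);
}

/* A program's own, unrelated itoa. This is legal because itoa is not a
   reserved identifier; a header that declared a conflicting itoa would
   break this program. */
int itoa(const char *digits)
{
    int n = 0;
    while (*digits >= '0' && *digits <= '9')
        n = n * 10 + (*digits++ - '0');
    return n;
}
```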
In "7.26 Future library directions" you have a list of the identifiers that may be added to the standard headers, this includes identifiers starting with str or mem, macros starting with E and stuff like that.
Other than that, implementations are restricted to the generic names as reserved in "7.1.3 Reserved identifiers".
Compilers for embedded systems regularly add functions and macros to standard headers, usually to make a special processor feature available for use.
If I read the standard correctly, they can do so without sacrificing conformity if they do use names specified as reserved by the standard. Since a conforming program may use any non-reserved name as a variable or a function name, using such a non-reserved name as an addition to a standard header would break a conforming program.
In practice, however, the compiler writers usually do not care too much. They will at most provide a list of elements defined for the system you may not use if you want your program to work with their implementation.
