Why do C compilers prepend underscores to external names? - c

I've been working in C for so long that the fact that compilers typically add an underscore to the start of an extern is just understood... However, another SO question today got me wondering about the real reason why the underscore is added. A wikipedia article claims that a reason is:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
I think there's at least a kernel of truth to this, but also it seems to no really answer the question, since if the underscore is added to all externs it won't help much with preventing clashes.
Does anyone have good information on the rationale for the leading underscore?
Is the added underscore part of the reason that the Unix creat() system call doesn't end with an 'e'? I've heard that early linkers on some platforms had a limit of 6 characters for names. If that's the case, then prepending an underscore to external names would seem to be a downright crazy idea (now I only have 5 characters to play with...).

It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
If the runtime support is provided by the compiler, you would think it would make more sense to prepend an underscore to the few external identifiers in the runtime support instead!
When C compilers first appeared, the basic alternative to programming in C on those platforms was programming in assembly language, and it was (and occasionally still is) useful to link together object files written in assembler and C. So really (IMHO) the leading underscore added to external C identifiers was to avoid clashes with the identifiers in your own assembly code.
(See also GCC's asm label extension; and note that this prepended underscore can be considered a simple form of name mangling. More complicated languages like C++ use more complicated name mangling, but this is where it started.)

if the c compiler always prepended an underscore before every symbol,
then the startup/c-runtime code, (which is usually written in assembly) can safely use labels and symbols that do not start with an underscore, (such as the symbol 'start').
even if you write a start() function in the c code, it gets generated as _start in the object/asm output. (note that in this case, there is no possibility for the c code to generate a symbol that does not start with an underscore) so the startup coder doesnt have to worry about inventing obscure improbable symbols (like $_dontuse42%$) for each of his/her global variables/labels.
so the linker wont complain about a name clash, and the programmer is happy. :)
the following is different from the practise of the compiler prepending an underscore in its output formats.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.
that is a convention followed, for the c sytem libraries and other system components. (and for things such as __FILE__ etc).
(note that such a symbol (ex: _time) may result in 2 leading underscores (__time) in the generated output)

From what I always hear it is to avoid naming conflicts. Not for other extern variables but more so that when you use a library it will hopefully not conflict with the user code variable names.

The main function is not the real entry point of an executable. Some statically linked files have the real entry point that eventually calls main, and those statically linked files own the namespace that does not start with an underscore. On my system, in /usr/lib, there are gcrt1.o, crt1.o and dylib1.o among others. Each of those has a "start" function without an underscore that will eventually call the "_main" entry point. Everything else besides those files has external scope. The history has to do with mixing assembler and C in a project, where all C was considered external.

From Wikipedia:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support. Furthermore, when the C/C++ compiler needed to introduce names into external linkage as part of the translation process, these names were often distinguished with some combination of multiple leading or trailing underscores.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.

Related

Could anyone help me to solve "clang: error: Undefined symbol _revstring" [duplicate]

I've been working in C for so long that the fact that compilers typically add an underscore to the start of an extern is just understood... However, another SO question today got me wondering about the real reason why the underscore is added. A wikipedia article claims that a reason is:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
I think there's at least a kernel of truth to this, but also it seems to no really answer the question, since if the underscore is added to all externs it won't help much with preventing clashes.
Does anyone have good information on the rationale for the leading underscore?
Is the added underscore part of the reason that the Unix creat() system call doesn't end with an 'e'? I've heard that early linkers on some platforms had a limit of 6 characters for names. If that's the case, then prepending an underscore to external names would seem to be a downright crazy idea (now I only have 5 characters to play with...).
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
If the runtime support is provided by the compiler, you would think it would make more sense to prepend an underscore to the few external identifiers in the runtime support instead!
When C compilers first appeared, the basic alternative to programming in C on those platforms was programming in assembly language, and it was (and occasionally still is) useful to link together object files written in assembler and C. So really (IMHO) the leading underscore added to external C identifiers was to avoid clashes with the identifiers in your own assembly code.
(See also GCC's asm label extension; and note that this prepended underscore can be considered a simple form of name mangling. More complicated languages like C++ use more complicated name mangling, but this is where it started.)
if the c compiler always prepended an underscore before every symbol,
then the startup/c-runtime code, (which is usually written in assembly) can safely use labels and symbols that do not start with an underscore, (such as the symbol 'start').
even if you write a start() function in the c code, it gets generated as _start in the object/asm output. (note that in this case, there is no possibility for the c code to generate a symbol that does not start with an underscore) so the startup coder doesnt have to worry about inventing obscure improbable symbols (like $_dontuse42%$) for each of his/her global variables/labels.
so the linker wont complain about a name clash, and the programmer is happy. :)
the following is different from the practise of the compiler prepending an underscore in its output formats.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.
that is a convention followed, for the c sytem libraries and other system components. (and for things such as __FILE__ etc).
(note that such a symbol (ex: _time) may result in 2 leading underscores (__time) in the generated output)
From what I always hear it is to avoid naming conflicts. Not for other extern variables but more so that when you use a library it will hopefully not conflict with the user code variable names.
The main function is not the real entry point of an executable. Some statically linked files have the real entry point that eventually calls main, and those statically linked files own the namespace that does not start with an underscore. On my system, in /usr/lib, there are gcrt1.o, crt1.o and dylib1.o among others. Each of those has a "start" function without an underscore that will eventually call the "_main" entry point. Everything else besides those files has external scope. The history has to do with mixing assembler and C in a project, where all C was considered external.
From Wikipedia:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support. Furthermore, when the C/C++ compiler needed to introduce names into external linkage as part of the translation process, these names were often distinguished with some combination of multiple leading or trailing underscores.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.

How to access assembly language symbols without a leading _ from C on 6502 (cc65)

I'm writing some code in 6502 assembly language using cc65.
Because I'm living in 2022 and not 1979 and have access to a development machine that is a million times more powerful than the target platform, I'm writing unit tests for the assembly language code in C.
Obviously the calling conventions for C and assembly language are different, so I have a bunch of wrapper functions that accept C-style arguments and then call the assembly language functions.
But after calling an assembly language function, I want to check the state of various globals that are defined in assembly language, but I can't because C expects all identifiers to start with an underscore '_' and the identifiers in my assembly language modules don't.
I could just export every symbol twice, once with a '_' prefix and once without, but it seems so clunky and I just wonder if there's an easier way? Is there a #pragma or something that I can use to tell C to use the symbol name exactly as-is, without adding an underscore?
I've looked in the cc65 docs and found nothing, but it seems like a pretty common need, and I'm wondering what other people do.
It is likely that the cc65 compiler only supports access to symbols with the ABI-specificed decoration, i.e. those beginning in an underscore _.
To access other symbols, they therefore must either be renamed to follow the decoration or a decorated alias must be created.
_foo EQU foo
For functions, it is also worth considering to write wrapper functions. This may improve the ability to debug the code as debuggers tend to get confused when two symbols refer to the same address.

Why symbols are prefixed with an underscore in compiled files? [duplicate]

I've been working in C for so long that the fact that compilers typically add an underscore to the start of an extern is just understood... However, another SO question today got me wondering about the real reason why the underscore is added. A wikipedia article claims that a reason is:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
I think there's at least a kernel of truth to this, but also it seems to no really answer the question, since if the underscore is added to all externs it won't help much with preventing clashes.
Does anyone have good information on the rationale for the leading underscore?
Is the added underscore part of the reason that the Unix creat() system call doesn't end with an 'e'? I've heard that early linkers on some platforms had a limit of 6 characters for names. If that's the case, then prepending an underscore to external names would seem to be a downright crazy idea (now I only have 5 characters to play with...).
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
If the runtime support is provided by the compiler, you would think it would make more sense to prepend an underscore to the few external identifiers in the runtime support instead!
When C compilers first appeared, the basic alternative to programming in C on those platforms was programming in assembly language, and it was (and occasionally still is) useful to link together object files written in assembler and C. So really (IMHO) the leading underscore added to external C identifiers was to avoid clashes with the identifiers in your own assembly code.
(See also GCC's asm label extension; and note that this prepended underscore can be considered a simple form of name mangling. More complicated languages like C++ use more complicated name mangling, but this is where it started.)
if the c compiler always prepended an underscore before every symbol,
then the startup/c-runtime code, (which is usually written in assembly) can safely use labels and symbols that do not start with an underscore, (such as the symbol 'start').
even if you write a start() function in the c code, it gets generated as _start in the object/asm output. (note that in this case, there is no possibility for the c code to generate a symbol that does not start with an underscore) so the startup coder doesnt have to worry about inventing obscure improbable symbols (like $_dontuse42%$) for each of his/her global variables/labels.
so the linker wont complain about a name clash, and the programmer is happy. :)
the following is different from the practise of the compiler prepending an underscore in its output formats.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.
that is a convention followed, for the c sytem libraries and other system components. (and for things such as __FILE__ etc).
(note that such a symbol (ex: _time) may result in 2 leading underscores (__time) in the generated output)
From what I always hear it is to avoid naming conflicts. Not for other extern variables but more so that when you use a library it will hopefully not conflict with the user code variable names.
The main function is not the real entry point of an executable. Some statically linked files have the real entry point that eventually calls main, and those statically linked files own the namespace that does not start with an underscore. On my system, in /usr/lib, there are gcrt1.o, crt1.o and dylib1.o among others. Each of those has a "start" function without an underscore that will eventually call the "_main" entry point. Everything else besides those files has external scope. The history has to do with mixing assembler and C in a project, where all C was considered external.
From Wikipedia:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support. Furthermore, when the C/C++ compiler needed to introduce names into external linkage as part of the translation process, these names were often distinguished with some combination of multiple leading or trailing underscores.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.

Name decoration in C

Does any standard mandate name decoration?
As far as I know most (all?) conforming implementations add underscore prefix to the name of each exported symbol. Is this guaranteed by a C, POSIX or some other standard?
I'm not sure about a "standard" but the prepended underscore seems to be a very common convention dating from the 1970s. From What is the reason function names are prefixed with an underscore by the compiler?:
At the time that UNIX was rewritten in C in about 1974, its authors
already had extensive assember language libraries, and it was easier
to mangle the names of new C and C-compatible code than to go back and
fix all the existing code.
It is required for C names if you want to interoperate with the Microsoft compilers on Windows (win32 only; win64 does not use decoration since it has only a single standard calling convention).
https://learn.microsoft.com/en-us/cpp/build/reference/decorated-names
See also: Why do C compilers prepend underscores to external names?

What is the meaning of leading and trailing underscores in Linux kernel identifiers?

I keep running across little conventions like __KERNEL__.
Are the __ in this case a naming convention used by kernel developers or is it a syntax specific reason for naming a macro this way?
There are many examples of this throughout the code.
For example some functions and variables begin with an _ or even __.
Is there a specific reason for this?
It seems pretty widely used and I just need some clarification as to whether these things have a syntactical purpose or is it simply a naming convention.
Furthermore I see lots of user declared types such as uid_t. Again I assume this is a naming convention telling the reader that it is a user-defined type?
There are several cases:
In public facing headers, i.e. anything that libc will be taking over and putting under /usr/include/linux, the standards specify which symbols should be defined and any other symbols specific to the system shall start with underscore and capital letter or two underscores. That's the reason for __KERNEL__ in particular, because it is used in headers that are included both in kernel and in libc and some declarations are different.
In internal code, the convention usually is that symbol __something is workhorse for something excluding some management, often locking. That is a reason for things like __d_lookup . Similar convention for system calls is that sys_something is the system call entry point that handles context switch to and from kernel and calls do_something to do the actual work.
The _t suffix is standard library convention for typedefs. E.g. size_t, ptrdiff_t, foff_t and such. Kernel code follows this convention for it's internal types too.
There are several __functions, as i.e. __alloc_pages_nodemask() that seems also be exported. Also, __functions are static, static inline, or globals too. Btw, observing __alloc_pages_nodemask() is called only from mm code, and in no other place, such functions may be meant "internal" of some kernel framekwork.

Resources