variable names in C by Dennis Ritchie [duplicate] - c

When taken literally, it makes sense, but what exactly does it mean to be a significant character of a variable name?
I'm a beginning learner of C using K&R. Here's a direct quote from the book:
"At least the first 31 characters of an internal name are significant. For function names and external variables, the number may be less than 31, because external names may be used by assemblers and loaders over which the language has no control. For external names, the standard guarantees uniqueness only for 6 characters and a single case."
By the way, what does it mean by "single case"?

Single Case usually means "lower case". Except in some OS's where it means "upper case". The point is that mixed case is not guaranteed to work.
abcdef
ABCDEF
differ only in case. This is not guaranteed to work.
The "Significance" issue is one of how many letters can be the same.
Let's say we only have 6 significant characters.
a_very_long_name
a_very_long_name_thats_too_similar
Look different, but the first 16 characters are the same. Since only 6 are significant, those are the same variable.
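To make the "6 significant characters and a single case" rule concrete, here is a minimal sketch of how such a linker might compare two names. The function name same_external_name and its significant parameter are my own inventions for illustration; no real linker exposes this interface.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of any real toolchain): returns 1 if a
 * linker that keeps only the first `significant` characters and folds
 * case would see `a` and `b` as the same external name, 0 otherwise. */
int same_external_name(const char *a, const char *b, size_t significant)
{
    for (size_t i = 0; i < significant; i++) {
        int ca = toupper((unsigned char)a[i]);
        int cb = toupper((unsigned char)b[i]);
        if (ca != cb)
            return 0;   /* names differ inside the significant prefix */
        if (ca == '\0')
            return 1;   /* both names ended here: identical */
    }
    return 1;           /* significant prefixes match; the rest is ignored */
}
```

With significant set to 6, both abcdef/ABCDEF and the two a_very_long_name variants compare as the same name. With 31, the two a_very_long_name variants become distinct because one of them ends after 16 characters.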

It means what you fear it means. For external names, the C standard at the time K&R 2nd ed. was written really does give only six case-insensitive characters! So you can't have afoobar and aFooBaz as independent entities.
This absurd limitation (which was to accommodate legacy linkers now long gone) is no longer relevant to most environments. The C99 standard offers 31 case-sensitive characters for external names and 63 internally, and commonly used linkers in practice support much longer names.

It just means that if you have two variables named
abcdefghijklmnopqrstuvwxyz78901A,
and
abcdefghijklmnopqrstuvwxyz78901B,
there is no guarantee that they will be treated as different, separate variables...

It means that:
foobar1
foobar2
might be the same external name, because only the first 6 characters need be considered. The single case means that upper and lower case names need not be distinguished.
Please note that almost all modern linkers will consider much longer names, though there will still be a limit, dependent on the linker.

G'day,
One of the problems with this limited symbol resolution occurs at link time.
Multiple symbols with the same name can exist across several libraries and the link editor usually only takes the first one it finds that matches what it is looking for.
So, using S.Lott's example from above, if your link editor is searching for the symbol "a_very_long_name" and it finds a library on its search path that contains the symbol "a_very_long_name_thats_too_similar" it will take this one. This will happen even if the library that contains the symbol that you want, i.e. "a_very_long_name" has been specified in your command. For example specifying the libraries as:
-L/my/library/path -lmy_wrong_lib -lmy_correct_lib
There are now compiler options, or more correctly compile time options which are passed through to the link editor, which enforce a search for multiple symbols in your link path. These are then usually raised as errors at link time.
In addition, many compilers, e.g. gcc, will default to such behaviour. You have to explicitly enable multiple definitions to allow the link editor to proceed without raising a fatal error if it finds multiple definitions for a symbol.
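As a rough model of the "first match wins" behaviour described above, the sketch below walks libraries in search order and compares only a significant prefix. Everything here (struct library, resolve, demo_libs) is hypothetical and only borrows the library names from the -l options above; real link editors are far more involved.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a library: its name and the symbols it exports. */
struct library {
    const char *name;
    const char *symbols[4];   /* NULL-terminated list */
};

/* Libraries in the order given on the command line (illustrative data). */
const struct library demo_libs[2] = {
    { "my_wrong_lib",   { "a_very_long_name_thats_too_similar", NULL } },
    { "my_correct_lib", { "a_very_long_name", NULL } },
};

/* Returns the name of the first library, in search order, that exports a
 * symbol whose first `significant` characters match `wanted`, or NULL. */
const char *resolve(const struct library *libs, size_t nlibs,
                    const char *wanted, size_t significant)
{
    for (size_t i = 0; i < nlibs; i++)
        for (size_t j = 0; libs[i].symbols[j] != NULL; j++)
            if (strncmp(libs[i].symbols[j], wanted, significant) == 0)
                return libs[i].name;   /* first match wins */
    return NULL;
}
```

Resolving a_very_long_name against demo_libs with 6 significant characters returns my_wrong_lib, while a limit of 64 returns my_correct_lib.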
BTW I'd highly recommend working through the exercises in conjunction with Clovis Tondo's book "The C Answer Book 2nd ed.".
Doing this really helps make C stick in your mind.
HTH
cheers,

Related

Could anyone help me to solve "clang: error: Undefined symbol _revstring" [duplicate]

I've been working in C for so long that the fact that compilers typically add an underscore to the start of an extern is just understood... However, another SO question today got me wondering about the real reason why the underscore is added. A wikipedia article claims that a reason is:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
I think there's at least a kernel of truth to this, but it also seems to not really answer the question, since if the underscore is added to all externs it won't help much with preventing clashes.
Does anyone have good information on the rationale for the leading underscore?
Is the added underscore part of the reason that the Unix creat() system call doesn't end with an 'e'? I've heard that early linkers on some platforms had a limit of 6 characters for names. If that's the case, then prepending an underscore to external names would seem to be a downright crazy idea (now I only have 5 characters to play with...).
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support
If the runtime support is provided by the compiler, you would think it would make more sense to prepend an underscore to the few external identifiers in the runtime support instead!
When C compilers first appeared, the basic alternative to programming in C on those platforms was programming in assembly language, and it was (and occasionally still is) useful to link together object files written in assembler and C. So really (IMHO) the leading underscore added to external C identifiers was to avoid clashes with the identifiers in your own assembly code.
(See also GCC's asm label extension; and note that this prepended underscore can be considered a simple form of name mangling. More complicated languages like C++ use more complicated name mangling, but this is where it started.)
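The prepended underscore can be modeled in a few lines. This is only a sketch of the convention as described here, not how any particular compiler implements it; c_symbol_name and clashes_with_asm_label are names I made up for illustration.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical model of the old convention: the symbol emitted for a C
 * identifier is "_" followed by the identifier.
 * Returns 0 on success, -1 if `out` is too small. */
int c_symbol_name(const char *c_name, char *out, size_t out_size)
{
    int n = snprintf(out, out_size, "_%s", c_name);
    return (n >= 0 && (size_t)n < out_size) ? 0 : -1;
}

/* Returns 1 if the decorated C identifier would collide with a label
 * written by hand in assembly, 0 otherwise. */
int clashes_with_asm_label(const char *c_name, const char *asm_label)
{
    /* The emitted symbol is "_" + c_name, so compare it piecewise. */
    return asm_label[0] == '_' && strcmp(asm_label + 1, c_name) == 0;
}
```

Under this model a C function called start is emitted as _start, so it cannot collide with an assembly label start; a clash would need the assembly author to write _start deliberately.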
If the C compiler always prepends an underscore to every symbol, then the startup/C runtime code (which is usually written in assembly) can safely use labels and symbols that do not start with an underscore, such as the symbol 'start'.
Even if you write a start() function in the C code, it is emitted as _start in the object/asm output. (Note that in this case there is no way for the C code to generate a symbol that does not start with an underscore.) So the startup coder doesn't have to worry about inventing obscure, improbable symbols (like $_dontuse42%$) for each of his/her global variables/labels.
So the linker won't complain about a name clash, and the programmer is happy. :)
The following is different from the practice of the compiler prepending an underscore in its output formats:
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.
That is a convention followed for the C system libraries and other system components (and for things such as __FILE__).
(Note that such a symbol, e.g. _time, may end up with two leading underscores, __time, in the generated output.)
From what I always hear it is to avoid naming conflicts. Not for other extern variables but more so that when you use a library it will hopefully not conflict with the user code variable names.
The main function is not the real entry point of an executable. Some statically linked files have the real entry point that eventually calls main, and those statically linked files own the namespace that does not start with an underscore. On my system, in /usr/lib, there are gcrt1.o, crt1.o and dylib1.o among others. Each of those has a "start" function without an underscore that will eventually call the "_main" entry point. Everything else besides those files has external scope. The history has to do with mixing assembler and C in a project, where all C was considered external.
From Wikipedia:
It was common practice for C compilers to prepend a leading underscore to all external scope program identifiers to avert clashes with contributions from runtime language support. Furthermore, when the C/C++ compiler needed to introduce names into external linkage as part of the translation process, these names were often distinguished with some combination of multiple leading or trailing underscores.
This practice was later codified as part of the C and C++ language standards, in which the use of leading underscores was reserved for the implementation.

K & R C Variable Names

I'm confused by a passage about variable names in K&R C. The original text is below:
At least the first 31 characters of an internal name are significant. For function names and external variables, the number may be less than 31, because external names may be used by assemblers and loaders over which the language has no control. For external names, the standard guarantees uniqueness only for 6 characters and a single case. Keywords like if, else, int, float, etc., are reserved: you can't use them as variable names. They must be in lower case.
It's wise to choose variable names that are related to the purpose of the variable, and that are unlikely to get mixed up typographically. We tend to use short names for local variables, especially loop indices, and longer names for external variables.
What confused me was the part about external names: the standard guarantees uniqueness only for 6 characters and a single case. Does it mean that for external names, only the 6 leading chars are valid and the remaining chars are all ignored? For example, if we define two external variables myexvar1 and myexvar2, will the compiler treat these two variables as one? If this is true, why do they advise us to use longer names for external variables?
Does it mean that for external names, only the 6 leading chars are valid and the remaining chars are all ignored? For example, if we define two external variables myexvar1 and myexvar2, will the compiler treat these two variables as one?
Yes this was true in 1990. Or rather, 6 unique leading characters of external identifiers was what the C90 standard set as minimum limit for a compiler. This was of course madness - which is why this limit was increased to 31 in C99.
In practice, most C90 compilers had at least 31 unique characters for internal and external identifiers both.
If this is true, why do they advise us to use longer names for external variables?
Not sure if they advise it. But the coding style used in K&R is often plain horrible, so it is definitely not a book you should consult for coding style advice.
In modern C, it is required (C17 5.2.4.1) that we have:
63 significant initial characters in an internal identifier or a macro name
31 significant initial characters in an external identifier
So don't worry too much about which limitations the dinosaurs faced, but follow modern standard C.
As pointed out in another answer, even the restriction of 31 significant initial characters for external identifiers is listed as obsolescent, meaning this might get increased even further, to 255, in future standards.
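As a small sketch of what those minimum limits mean in practice, the helper below reports whether two identifiers are guaranteed to be distinct under the C17 minimums quoted above (case-sensitive, 63 internal and 31 external). The name guaranteed_distinct and the enum constants are my own; real implementations usually accept far more than the minimum.

```c
#include <string.h>

/* Minimum numbers of significant initial characters quoted above. */
enum { INTERNAL_SIGNIFICANT = 63, EXTERNAL_SIGNIFICANT = 31 };

/* Hypothetical helper: returns 1 if the two identifiers differ within the
 * guaranteed significant prefix (so every conforming implementation must
 * tell them apart), 0 if they differ only beyond it or not at all. */
int guaranteed_distinct(const char *a, const char *b, int is_external)
{
    size_t limit = is_external ? EXTERNAL_SIGNIFICANT : INTERNAL_SIGNIFICANT;
    return strncmp(a, b, limit) != 0;
}
```

myexvar1 and myexvar2 are guaranteed distinct as external names, whereas two external names that first differ at the 32nd character are only distinct if the implementation goes beyond the minimum.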
Truth be told, K&R is pretty old, so I assume things have changed since then.
I really don't know the reason why they give exactly 6 characters here:
For external names, the standard guarantees uniqueness only for 6 characters and a single case.
But you have to understand that all a compiler does is translate a translation unit (usually a *.c file) into an object file (*.o). That's it. The compiler does not produce a ready-to-run program.
Those object files might contain references to unresolved symbols to be found in other object files as well as a table of their own external symbols, the ones they provide to be referenced from the outside. The symbols do have textual names, which are the names you've given to your external variables.
Linkers and dynamic loaders still have to do their jobs to build the program and get it running. Along the way they have to resolve all unresolved symbols, so they perform a textual lookup for those symbols in object files. Linkers and loaders are not the compiler. They might have their own rules about treating those names (back in the days of K&R, I guess). That's what this ...
because external names may be used by assemblers and loaders over which the language has no control.
... is about.
These days though all your K&R concerns sound outdated and irrelevant. Pick a newer standard to follow.
This is due to the historical background concerning the length of exported symbols to the linker of the system.
I quote from The New C Standard -- An Economic and Cultural Commentary.
The values of 6 and 10 were chosen so that the encodings \u1234 and \U12345678 could be used.
The Fortran significant character limit of six was followed by many suppliers of linkers for a long time. The need for longer identifiers to support name mangling in C++ ensured that most modern linkers support many more significant characters in an external identifier.
Common Implementations
Historically, the number of significant characters in an external identifier was driven by the behavior of the host vendor-supplied linker. Only since the success of MS-DOS have developers become used to translator vendors supplying their own linker. Previously, most linkers tended to be supplied by the hardware vendor. The mainframe world tended to be driven by the requirements of Fortran, which had six significant characters in an internal or external identifier. In this environment it was not always possible to replace the system linker by one supporting more significant characters. The importance of the mainframe environment waned in the 1990s. In modern environments it is very often possible to obtain alternative linkers.
So the main issue was to be able to link together libraries compiled in C with libraries compiled in Fortran, and Fortran imposed the limit of 6.
You can read more at the given reference.
That's a legacy of the past that is no longer important. No present-day compiler has those limits; they date from the time the old Unix was made. The reasons were the limits imposed by the compiler on names in the symbol table (31) and the limit used by the linker (6) at that time.
But that's not applicable anymore. At least you can be sure that today's linkers will keep different identifiers distinct even if they share a common prefix of length 100.

Can comments/identifiers impact code performance/operability?

Today I was presented with a weird fact (or not).
It was said:
"At it is disallowed to write long, descriptive identifier names, and forbidden to write Comments for Linux Drivers written in ANSI C."
When I asked "WTF? Why?" I was told it caused performance issues and errors of that sort...
not many details there.
I am surprised, but have to ask...
Can this be real?
Knowing that comments are stripped by the preprocessor,
and that identifiers are converted to addresses either way.
So... can it cause problems?
Well, ANSI C is a standard, and a standard is something that everyone who decides to support it (compiler designers and programmers alike) must follow.
The ANSI C standard only guarantees that the first 6 characters of an exported identifier are significant (yes, exported identifiers are stored as-is as symbols in the symbol table, not just as addresses), and the first 31 characters of a non-exported identifier. Longer names are allowed, but a conforming implementation need not distinguish them beyond those limits.
On commenting: apart from some obvious pitfalls, like accidentally swallowing code with a multi-line comment, I recommend reading the Coding Style article for kernel developers, which explains what kinds of comments are discouraged.
Absolutely not. Whatever identifiers you use in your code, the compiler translates them to symbols.
Also, all comments are removed during preprocessing.
The only effect of comments is to help you understand the code more quickly.
The only performance impact comments can have is at compile time, though I would say it is negligible, unless you write whole books as comments.
The identifier names are translated to symbols, so there is also, at most, a performance impact at compile time, which again is negligible. Identifier names might hit a maximum length limit, but to be honest, I have never encountered a problem because of overly long identifier names.
No, the first step in the compilation is pre-process your source code to remove comments and do other tricks like expanding macros.
Identifiers are often translated into pointers (to symbol table entries).
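If you want to convince yourself, here is a small sketch with the same routine written twice: once terse and uncommented, once with long names and comments. Both function names are made up for this example. The two compute the same result, and you can compare what your compiler generates for each (for example with gcc -S) to see whether anything other than the symbol names differs.

```c
/* Terse version: short names, no comments inside. */
unsigned s(const unsigned *a, unsigned n)
{
    unsigned t = 0;
    for (unsigned i = 0; i < n; i++)
        t += a[i];
    return t;
}

/* Descriptive version: long identifier names and comments.
 * The logic is identical to s() above. */
unsigned sum_of_all_sample_values(const unsigned *sample_values,
                                  unsigned number_of_samples)
{
    /* Running total of every sample seen so far. */
    unsigned running_total = 0;

    /* Visit each sample exactly once, in order. */
    for (unsigned sample_index = 0; sample_index < number_of_samples; sample_index++)
        running_total += sample_values[sample_index];

    return running_total;
}
```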

same name but with different case variable and function names in c

I have a variable named setlocal and a function named void SetLocal(void).
I am using the Keil C51 compiler to build the code, and the linker gives the following error:
"EXTERNAL ATTRIBUT DO NOT MATCH PUBLIC"
Is it not possible to use the same name, with different case, for a function and a variable?
That particular compiler is for embedded systems (using the 8051 chips) and is really targeted for those environments. I've seen compilers in that arena that don't even support floating point, and Keil make it clear that, while it's based on C90, there are deviations from that standard.
As per the compiler limitations listed on the Keil website:
Names may be up to 255 characters long. The C language provides for case sensitivity in regard to function and variable names. However, for compatibility reasons, all names in the object file appear in capital letters. It is therefore irrelevant if an external object name within the source program is written in capital or small letters.
So it's a safe bet that, as far as the linker is concerned, you have a conflict between the setlocal variable and the SetLocal function, both of which would be seen as SETLOCAL.
That also explains (as stated in one of your comments) why changing the variable name to setlocal1 fixes your problem. While the symbols are not case sensitive, they are unique to 255 characters.
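A minimal sketch of that collision, assuming (as the quoted documentation says) that every name is stored in capital letters in the object file. The helper same_symbol_when_uppercased is hypothetical and ignores the 255-character limit for simplicity.

```c
#include <ctype.h>

/* Hypothetical helper: returns 1 if two source-level names map to the
 * same object-file symbol once every letter is converted to upper case,
 * 0 otherwise. The 255-character limit is not modeled here. */
int same_symbol_when_uppercased(const char *a, const char *b)
{
    while (*a != '\0' && *b != '\0') {
        if (toupper((unsigned char)*a) != toupper((unsigned char)*b))
            return 0;   /* names differ even after case folding */
        a++;
        b++;
    }
    return *a == *b;    /* equal only if both names end at the same place */
}
```

Under this model setlocal and SetLocal map to the same symbol, while setlocal1 and SetLocal do not.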

