External and internal Identifier - c

I know how to code in C fairly well, but I thought of learning C from the book C - The Complete Reference by Herbert Schildt. Here is a quote from Chapter 2:
In C89, at least the first 6 characters of an external identifier and at
least the first 31 characters of an internal identifier will be significant. C99 has increased these values. In C99, an external identifier has at least 31 significant characters, and an internal identifier has at least 63 significant characters.
Can somebody explain what it means to be significant?

It means that those characters are used by the compiler to distinguish between different names.
For example, if only the first 6 characters are significant and you have two variables:
int abcdef_1;
int abcdef_2;
they will be treated as the same variable, and the compiler may generate a warning or an error.
About the minimum significance:
Maybe the compiler/assembler can handle more, but the linker cannot. Or maybe external tools that are outside the control of the assembler/linker vendor can handle less. That is why a minimum value (per kind of identifier, internal or external) is defined in the C standard(s).
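To make "significant" concrete, here is a minimal sketch of how a translator that only looks at the first N characters would compare two spellings. The function name same_when_truncated is made up for illustration; real compilers do not expose anything like it.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if a translator that only looks at the first `significant`
   characters could not tell the two spellings apart, 0 otherwise.
   (Hypothetical helper, for illustration only.) */
int same_when_truncated(const char *a, const char *b, size_t significant)
{
    /* strncmp stops at the first difference, at a terminating NUL,
       or after `significant` characters, whichever comes first. */
    return strncmp(a, b, significant) == 0;
}
```

With a limit of 6, same_when_truncated("abcdef_1", "abcdef_2", 6) reports the two names as the same; with a limit of 31 it reports them as different.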

Related

Why can you start a variable name with $ in C?

I was under the impression that you could only start variable names with letters and _, however while testing around, I also found out that you can start variable names with $, like so:
Code
#include <stdio.h>
int main() {
    int myvar=13;
    int $var=42;
    printf("%d\n", myvar);
    printf("%d\n", $var);
}
Output
13
42
This resource says that you can't start variable names with $ in C, which is wrong (at least when compiled using my gcc version, Apple LLVM version 10.0.1 (clang-1001.0.46.4)). Other resources that I found online also seem to suggest that variables can't start with $, which is why I'm confused.
Do these articles just fail to mention this nuance, and if so, why is this a feature of C?
In the C 2018 standard, clause 6.4.2, paragraph 1 allows implementations to allow additional characters in identifiers.
It defines an identifier to be an identifier-nondigit character followed by any number of identifier-nondigit or digit characters. It defines digit to be “0” to “9”, and it defines the identifier-nondigit characters to be:
a nondigit, which is one of underscore, “a” to “z”, or “A” to “Z”,
a universal-character-name, or
other implementation-defined characters.
Thus, implementations may define other characters that are allowed in identifiers.
The characters included as universal-character-name are those listed in ranges in Annex D of the C standard.
The resource you link to is wrong in several places:
Variable names in C are made up of letters (upper and lower case) and digits.
This is false; identifiers may include underscores and the above universal characters in every conforming implementation and other characters in implementations that permit them.
$ not allowed -- only letters, and _
This is incorrect. The C standard does not require an implementation to allow “$”, but it does not disallow an implementation from allowing it. “$” is allowed by some implementations and not others. It can be said not to be a part of strictly conforming C programs, but it may be a part of conforming C programs.
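As a rough sketch of the grammar described above, the following checker accepts the portable part (a nondigit followed by nondigits or digits) and has a flag that models one implementation-defined extension, the dollar sign. The function is_identifier and its allow_dollar parameter are hypothetical. Universal character names and keywords are deliberately not handled.

```c
#include <string.h>

/* Characters every implementation accepts as "nondigit" and "digit". */
static const char NONDIGITS[] =
    "_abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";
static const char DIGITS[] = "0123456789";

/* Returns 1 if c is in set; the NUL terminator never counts as a member. */
static int in_set(char c, const char *set)
{
    return c != '\0' && strchr(set, c) != NULL;
}

/* allow_dollar != 0 models an implementation that permits '$'
   as an "other implementation-defined character". */
int is_identifier(const char *s, int allow_dollar)
{
    /* First character: identifier-nondigit only, never a digit. */
    if (!in_set(s[0], NONDIGITS) && !(allow_dollar && s[0] == '$'))
        return 0;

    /* Remaining characters: identifier-nondigit or digit. */
    for (size_t i = 1; s[i] != '\0'; i++) {
        if (!in_set(s[i], NONDIGITS) && !in_set(s[i], DIGITS)
            && !(allow_dollar && s[i] == '$'))
            return 0;
    }
    return 1;
}
```

Here is_identifier("$var", 0) is rejected while is_identifier("$var", 1) is accepted, which mirrors the difference between a strictly conforming program and one that relies on the extension.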
This answers your question:
In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.
This is allowed in GCC and LLVM because many traditional C implementations allow identifiers like this.
One such reason is that VMS commonly uses these, where a lot of system library routines have names like SYS$SOMETHING.
Here's a link to the GCC docs describing this:
https://gcc.gnu.org/onlinedocs/gcc/Dollar-Signs.html
It depends on the dialect of C and the options selected. Historically some Cs supported $ to be compatible with existing libraries when C was new. You may need to use a command line option to enable $, or another to turn it off if strictly conforming C is valuable to you.
A spot of history: in my early years I got into enough mainframe rooms to know that $ is one of what IBM mainframes called "national characters" ($, #, and @), which could show up in identifiers of programming languages like PL/1 and mainframe assembler. This worked its way down to some mainframe spin-offs, such as the IBM 1130. It looked to me like early impact printers, which used shaped slugs to print with, and CRT terminals could swap out these characters to meet the national needs of foreign customers. The IBM 1403 printer had many "print chains" to choose from for different human languages and technical purposes.
Some non-IBM identifiers picked up on at least some of these characters. GNU C, VMS, and JavaScript kept "$". "$" is the only character of old that seems to have survived to this day, even as an option, in most languages. The odd thing is that back in the early IBM days the underscore was invalid in identifier names.
TL;DR: it's the assembler not the compiler
Ok, so I did some research into this. It's not really allowed, but what excludes it is the assembly pass. Trying to do the following fails:
#include <stdio.h>
extern int $func();
int main() {
    int myvar=13;
    int $var=42;
    printf("%d\n", myvar);
    printf("%d\n", $var);
    $func();
}
joshua@nova:/tmp$ gcc -c test.c
/tmp/ccg7zLVB.s: Assembler messages:
/tmp/ccg7zLVB.s:31: Error: operand type mismatch for `call'
joshua@nova:/tmp$
I pulled K&R C version 2 (this covers ANSI C) off my shelf and it says "Identifiers are a sequence of letters and digits. The first character must be a letter; the underscore _ character counts as a letter. Upper and lower case letters are different. Identifiers may have any length ... [obsolete verbiage omitted]."
This reference has clearly aged, and almost everybody accepts high Unicode characters as letters. What's going on is that the back-end assembler sees symbols bytewise, and every byte with the high bit set counts as a letter. If you're crazy enough to use Shift-JIS outside of string literals, chaos can ensue; but otherwise this tends to work well enough.
I accessed a draft of C18 which says identifier-nondigit: nondigit; universal-character-name; other implementation-defined characters. Therefore, implementations are allowed to permit additional characters.
For universal-character-name, we have a restriction: "A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive."
The following code still chokes at the assembly pass as expected:
#include <stdio.h>
extern int \U00000024func();
int main()
{
    return \U00000024func();
}
The following code builds:
#include <stdio.h>
extern int func\U00000024();
int main()
{
    return func\U00000024();
}

In C, what is the maximum number of identifiers you can have?

What is the maximum number of variables/identifiers you can have in C? Learning compiler theory and interpreter design, I've learned that identifiers and their values are stored via a symbol dictionary/hashmap.
Considering that hashmaps/dictionaries have a RAM limit, what would be the max amount of hashed identifiers possible in the C programming language?
In general the number of identifiers is a quality-of-implementation issue. All compilers I know are only limited by available resources (memory).
There is, however, a (nearly useless) specification of minimum limits in the C Standard, C11, emphasis for identifiers by me:
5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one
program that contains at least one instance of every one of the
following limits:
127 nesting levels of blocks
63 nesting levels of conditional inclusion
12 pointer, array, and function declarators (in any combinations) modifying an arithmetic, structure, union, or void type in a declaration
63 nesting levels of parenthesized declarators within a full declarator
63 nesting levels of parenthesized expressions within a full expression
63 significant initial characters in an internal identifier or a macro name (each universal character name or extended source character is considered a single character)
31 significant initial characters in an external identifier (each universal character name specifying a short identifier of 0000FFFF or less is considered 6 characters, each universal character name specifying a short identifier of 00010000 or more is considered 10 characters, and each extended source character is considered the same number of characters as the corresponding universal character name, if any)
4095 external identifiers in one translation unit
511 identifiers with block scope declared in one block
4095 macro identifiers simultaneously defined in one preprocessing translation unit
127 parameters in one function definition
127 arguments in one function call
127 parameters in one macro definition
127 arguments in one macro invocation
4095 characters in a logical source line
4095 characters in a string literal (after concatenation)
65535 bytes in an object (in a hosted environment only)
15 nesting levels for #included files
1023 case labels for a switch statement (excluding those for any nested switch statements)
1023 members in a single structure or union
1023 enumeration constants in a single enumeration
63 levels of nested structure or union definitions in a single struct-declaration-list
I consider it nearly useless due to the "at least one program" part. I think the intent is clear, but if your vendor sells you a compiler able to translate exactly one program testing these limits, then you won't get your money back :-)
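The counting rule for external identifiers in the list above is easy to misread, so here is a small sketch of it. It takes the source spelling of an identifier as a string and counts each \uXXXX or \UXXXXXXXX as 6 or 10 characters depending on its value. The function name external_identifier_cost is made up, and extended source characters written directly (not as universal character names) are simply counted as 1 here, which is a simplification.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Counts `spelling` the way the external-identifier limit counts it.
   Returns 0 if a universal character name is malformed.
   (Hypothetical helper; extended source characters are not modelled.) */
size_t external_identifier_cost(const char *spelling)
{
    size_t cost = 0;

    for (size_t i = 0; spelling[i] != '\0'; ) {
        if (spelling[i] == '\\'
            && (spelling[i + 1] == 'u' || spelling[i + 1] == 'U')) {
            size_t digits = (spelling[i + 1] == 'u') ? 4 : 8;
            char hex[9] = {0};

            for (size_t k = 0; k < digits; k++) {
                unsigned char c = (unsigned char)spelling[i + 2 + k];
                if (!isxdigit(c))
                    return 0;   /* also catches a premature end of string */
                hex[k] = (char)c;
            }
            /* Short identifier of 0000FFFF or less: 6, otherwise: 10. */
            cost += (strtoul(hex, NULL, 16) <= 0xFFFFul) ? 6 : 10;
            i += 2 + digits;
        } else {
            cost += 1;          /* ordinary character */
            i += 1;
        }
    }
    return cost;
}
```

So an external name spelled a\u00E9b already uses 8 of the 31 guaranteed characters, even though it looks like three characters in an editor.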
The standard doesn't specify a limit so it's down to the compiler or interpreter to make the choice.
You should also note that identifiers can be compiled out in the final binary.
There does not seem to be any information in the C standard, but the C++ standard does mention some minimum recommendations which you probably could use as a guideline:
Annex B (informative)
Implementation quantities
[implimits]
(2.8) — Identifiers with block scope declared in one block [1 024].

C source inclusion name length

According to the C Standard, subclause 6.10.2, paragraph 5 [ISO/IEC 9899:2011],
The implementation shall provide unique mappings for sequences
consisting of one or more nondigits or digits (6.4.2.1) followed by a
period (.) and a single nondigit. The first character shall not be a
digit. The implementation may ignore distinctions of alphabetical case
and restrict the mapping to eight significant characters before the
period.
This would mean that if two include files have first 8 characters in common, the header it actually picks is undefined.
When I compile using clang or gcc, I haven't really faced this issue. However, is there a documented behavior for source file inclusion in GCC and Clang?
In the modern world, I would find it weird if any compiler really restricts to 8 characters.
Reference: C11 WG14 draft version N1570, Cert C Coding standard
This would mean that if two include files have first 8 characters in common, the header it actually picks is undefined.
No, I'd argue against that: Looking at the exact wording we see that standard uses:
[..] The implementation may ignore [..]
It's "may", not "shall". If the latter were used it would indeed mean that the behavior was undefined (N1570 §4/2). Since "may" is used as-is, without an exact definition, I think it's safe to assume the normal meaning of the word (source, emphasis mine):
used to express opportunity or permission
Thus, an implementation is allowed to only consider the first 8 characters, but it doesn't have to.
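To see what the most restrictive mapping the standard permits would look like, here is a sketch that compares two header names while ignoring case and keeping only eight characters before the period. The function headers_may_collide and its helpers are hypothetical; as far as I can tell GCC and Clang do nothing like this and simply pass the full name to the file system.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of the part before the first period, capped at `cap`. */
static size_t stem_length(const char *name, size_t cap)
{
    size_t n = strcspn(name, ".");
    return n < cap ? n : cap;
}

/* Case-insensitive comparison of n characters; returns 1 when equal. */
static int equal_ignoring_case(const char *a, const char *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;
    }
    return 1;
}

/* Returns 1 if an implementation using the minimal mapping (eight
   significant characters before the period, case ignored) could map
   both names to the same header. */
int headers_may_collide(const char *a, const char *b)
{
    size_t stem_a = stem_length(a, 8);
    size_t stem_b = stem_length(b, 8);
    const char *ext_a = a + strcspn(a, ".");
    const char *ext_b = b + strcspn(b, ".");
    size_t len_a = strlen(ext_a);
    size_t len_b = strlen(ext_b);

    return stem_a == stem_b && equal_ignoring_case(a, b, stem_a)
        && len_a == len_b && equal_ignoring_case(ext_a, ext_b, len_a);
}
```

Under this model graphics_core.h and graphics_util.h could collide, and so could Foo.h and foo.h, while stdio.h and stdlib.h stay distinct.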
Funny thing: I cannot find exact documentation for the "distinction limit" of the "sequence" in GCC's manual, meaning (N1570 §4/8, emphasis mine) ...
An implementation shall be accompanied by a document that defines all implementation defined and locale-specific characteristics and all extensions.
... that GCC could (from a very pedantic point of view) be considered a nonconforming implementation. The practically relevant part of their manual, as @PaulGriffiths pointed out, is probably (source, point 4 in the list):
Significant initial characters in an identifier or macro name.
The preprocessor treats all characters as significant. The C standard requires only that the first 63 be significant.
Regarding the comment:
[..] I am actually trying to evaluate if this will bite me as long as I am using one of these compilers on a Linux platform. [..]
I really doubt that this will ever (again?) be an issue.

Ambiguous behavior of variable declaration in c

I have the following code:
#include<stdio.h>
int main()
{
    int a12345678901234567890123456789012345;
    int a123456789012345678901234567890123456;
    int sum;
    scanf("%d",&a12345678901234567890123456789012345);
    scanf("%d",&a123456789012345678901234567890123456);
    sum = a12345678901234567890123456789012345 + a123456789012345678901234567890123456;
    printf("%d\n",sum);
    return 0;
}
The problem is, we know that the ANSI standard recognizes variable names up to 31 characters, but both variables are the same for the first 35 characters. Still, the program compiles without any error or warning and gives the correct output.
But how?
Shouldn't it give a redeclaration error?
Many compilers are built to exceed ANSI specification (for instance, in recognizing longer than 31 character variable names) as a protection to programmers. While it works in the compiler you're using, you can't count on it working in just any C compiler...
[...] we know that ANSI standard recognizes variables upto 31 characters [...] shouldn't it give an error of redeclaration?
Well, not necessarily. Since you mentioned ANSI C, this is the relevant part of the C89 standard:
"Implementation limits"
The implementation shall treat at least the first 31 characters of an internal name (a macro name or an identifier that does not have external linkage) as significant. Corresponding lower-case and upper-case letters are different. The implementation may further restrict the significance of an external name (an identifier that has external linkage) to six characters and may ignore distinctions of alphabetical case for such names. These limitations on identifiers are all implementation-defined.
Any identifiers that differ in a significant character are different identifiers. If two identifiers differ in a non-significant character, the behavior is undefined.
http://port70.net/~nsz/c/c89/c89-draft.html#3.1.2 (emphasis mine)
It's also explicitly described as a common extension:
Lengths and cases of identifiers
All characters in identifiers (with or without external linkage) are significant and case distinctions are observed (3.1.2)
http://port70.net/~nsz/c/c89/c89-draft.html#A.6.5.3
So, you're just exploiting a C implementation choice of your compiler.
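If you want to know whether two names are guaranteed to be distinct on every conforming implementation, a quick way is to count how many initial characters they share and compare that with the guaranteed minimum. A small sketch follows; the helper names common_prefix_length and portably_distinct are my own.

```c
#include <stddef.h>

/* Number of leading characters the two spellings have in common. */
size_t common_prefix_length(const char *a, const char *b)
{
    size_t n = 0;
    while (a[n] != '\0' && a[n] == b[n])
        n++;
    return n;
}

/* Returns 1 if the two spellings first differ within the first
   `significant` characters, i.e. they are distinct even on an
   implementation that looks at no more than that many characters. */
int portably_distinct(const char *a, const char *b, size_t significant)
{
    size_t n = common_prefix_length(a, b);

    /* a[n] == b[n] only when both strings ended: same spelling. */
    return a[n] != b[n] && n < significant;
}
```

The two variables from the question share more than 31 initial characters, so portably_distinct reports 0 for them with a limit of 31. The program only works because the compiler keeps more characters than the C89 minimum.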
The C89 rationale elaborates on this:
3.1.2 Identifiers
While an implementation is not obliged to remember more than the first
31 characters of an identifier for the purpose of name matching, the
programmer is effectively prohibited from intentionally creating two
different identifiers that are the same in the first 31 characters.
Implementations may therefore store the full identifier; they are not
obliged to truncate to 31.
The decision to extend significance to 31 characters for internal
names was made with little opposition, but the decision to retain the
old six-character case-insensitive restriction on significance of
external names was most painful. While strong sentiment was expressed
for making C ``right'' by requiring longer names everywhere, the
Committee recognized that the language must, for years to come,
coexist with other languages and with older assemblers and linkers.
Rather than undermine support for the Standard, the severe
restrictions have been retained.
Compilers like GCC may store the full identifier.
The number of significant initial characters in an identifier (C90 6.1.2, C90, C99 and C11 5.2.4.1, C99 and C11 6.4.2).
For internal names, all characters are significant. For external
names, the number of significant characters are defined by the linker;
for almost all targets, all characters are significant.
A conforming implementation must support at least 31 characters for an external identifier (and your identifiers are internal, where the limit is 63 for C99 and C11).
In fact, having all characters significant is the intent of the standard, but the committee doesn't want to make implementations non-conforming for not providing it. The limits for external identifiers originate from some linkers being unable to handle more (in C89, only 6 characters were required to be significant, which is why the old standard library functions have names no longer than 6 characters).
To be precise, the standard doesn't exactly mandate these limits, the language in the standard is quite permissive:
C11 (n1570) 5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:18)
[...]
63 significant initial characters in an internal identifier or a macro name (each universal character name or extended source character is considered a single character)
31 significant initial characters in an external identifier (each universal character name specifying a short identifier of 0000FFFF or less is considered 6 characters, each universal character name specifying a short identifier of 00010000 or more is considered 10 characters, and each extended source character is considered the same number of characters as the corresponding universal character name, if any)19)
[...]
Footnote 18) clearly expresses the intent:
Implementations should avoid imposing fixed translation limits whenever possible.
Footnote 19) refers to Future language directions 6.11.3:
Restriction of the significance of an external name to fewer than 255 characters (considering each universal character name or extended source character as a single character) is an obsolescent feature that is a concession to existing implementations.
And to explain the permissiveness in the first sentence of 5.2.4.1, cf. the C99 rationale (5.10)
5.2.4 Environmental limits
The C89 Committee agreed that the Standard must say something about certain capacities and limitations, but just how to enforce these treaty points was the topic of considerable debate.
5.2.4.1 Translation limits
The Standard requires that an implementation be able to translate and execute some program that meets each of the stated limits. This criterion was felt to give a useful latitude to the implementor in meeting these limits. While a deficient implementation could probably contrive a program that meets this requirement, yet still succeed in being useless, the C89 Committee felt that such ingenuity would probably require more work than making something useful. The sense of both the C89 and C99 Committees was that implementors should not construe the translation limits as the values of hard-wired parameters, but rather as a set of criteria by which an implementation will be judged.
There is no limit.
Actually there is a limit: the name has to be small enough to fit in memory, but otherwise no. If there is a built-in limit (I don't believe there is), it is so huge you would be really hard-pressed to reach it. I generated C++ code with 2 variables whose names differ only in the last character, to check that names that long are still distinct. I got to a 64KB file and thought that was enough.

Why must the first 31 characters of an identifier be unique?

MISRA 2004 rule 5.1 states that all identifiers must have the first 31 characters unique. What is the reason for this rule? Is it a technical limitation with some compilers?
The C standards only guarantee that a certain number of initial characters in identifiers are significant. For C99 this is 31 characters for external identifiers. Even this is a huge step up from ANSI/ISO C, which guarantees only 6 significant characters for external identifiers… (So if you're wondering why so many old C functions have unpronounceable names, this is one reason.)
In practice compilers tend to support a higher number of significant characters in identifiers (and IIRC the C standard even has a footnote encouraging this), but MISRA probably wanted to pick a “safe” limit already guaranteed by the then-most-recent C standard, C99, without imposing the limit of 6 that would be guaranteed by C90 which MISRA 2004 otherwise follows.
edit: Since it has been questioned twice in the comments, let me clarify: MISRA 2004 does not follow C99, and there is no hard evidence that the C99 standard contributed to MISRA's chosen limit of specifically 31 characters. However, the limit does not come from C90 (ISO C), because C90 specifies a limit of 6 characters. So, one must either accept that MISRA picked the number 31 independently, or followed the example of C99 in this particular decision. Of course it might be that both picked the same number due to that being the lower bound in popular compilers of the day, but at the very least it can be argued that the example of the older C99 validates the choice.
MISRA-C:2004 follows the C90 standard, which only requires the first 6 characters of an external identifier to be treated as distinct. You can read the rationale in the MISRA document.
MISRA-C:2004 Rule 14:
The ISO standard requires external identifiers to be distinct in the
first 6 characters. However compliance with this severe and unhelpful
restriction is considered an unnecessary limitation since most
compilers/linkers allow at least 31 character significance (as for
internal identifiers).
The ISO standard referred to is ISO 9899:1990 (C90). The purpose of the rule is to ensure that you are using a sane, safe compiler with enough characters of significance.
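A rule like this is normally enforced by a static analysis tool. As a rough idea of what such a check does, here is a sketch that counts pairs of different names that agree in their first `limit` characters. The function count_prefix_collisions is hypothetical and is not taken from any MISRA checker.

```c
#include <stddef.h>
#include <string.h>

/* Counts pairs of different spellings that are identical in their
   first `limit` characters (31 for the rule discussed here). */
size_t count_prefix_collisions(const char *const names[], size_t count,
                               size_t limit)
{
    size_t collisions = 0;

    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (strcmp(names[i], names[j]) != 0          /* different names */
                && strncmp(names[i], names[j], limit) == 0) /* same prefix */
                collisions++;
        }
    }
    return collisions;
}
```

For example, sensor_calibration_offset_channel_left and sensor_calibration_offset_channel_right only differ after their 34th character, so they count as one collision with a limit of 31 and as none with a limit of 63.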

Resources