C source inclusion name length - c

According to the C Standard, subclause 6.10.2, paragraph 5 [ISO/IEC 9899:2011],
The implementation shall provide unique mappings for sequences
consisting of one or more nondigits or digits (6.4.2.1) followed by a
period (.) and a single nondigit. The first character shall not be a
digit. The implementation may ignore distinctions of alphabetical case
and restrict the mapping to eight significant characters before the
period.
This would mean that if two include files have the first 8 characters in common, which header actually gets picked is undefined.
When I compile using Clang or GCC, I haven't really faced this issue. However, is there documented behavior for source file inclusion in GCC and Clang?
In the modern world, I would find it weird if any compiler really restricted names to 8 characters.
Reference: C11 WG14 draft version N1570, CERT C Coding Standard

This would mean that if two include files have the first 8 characters in common, which header actually gets picked is undefined.
No, I'd argue against that. Looking at the exact wording, we see that the standard uses:
[..] The implementation may ignore [..]
It's "may", not "shall". If the latter were used, it would indeed mean that the behavior was undefined (N1570 §4/2). Since "may" is used as-is, without an explicit definition, I think it's safe to assume the normal meaning of the word (source, emphasis mine):
used to express opportunity or permission
Thus, an implementation is allowed to only consider the first 8 characters, but it doesn't have to.
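To make the permitted latitude concrete, here is a minimal sketch of the most restrictive mapping the quoted paragraph allows: case ignored, and only eight characters before the period treated as significant. The function name header_names_may_collide is made up for illustration; this models a hypothetical implementation, not what GCC or Clang actually do.

```c
#include <ctype.h>
#include <stddef.h>

/* Length of the part of a header name before the first period. */
static size_t stem_length(const char *name)
{
    size_t n = 0;
    while (name[n] != '\0' && name[n] != '.')
        n++;
    return n;
}

/* Hypothetical helper: returns 1 if an implementation that ignores case and
   keeps only 8 significant characters before the period would be allowed to
   map both names to the same file, 0 otherwise. */
int header_names_may_collide(const char *a, const char *b)
{
    size_t la = stem_length(a), lb = stem_length(b);
    size_t ka = la < 8 ? la : 8;   /* significant part of a's stem */
    size_t kb = lb < 8 ? lb : 8;   /* significant part of b's stem */

    if (ka != kb)
        return 0;
    for (size_t i = 0; i < ka; i++)
        if (tolower((unsigned char)a[i]) != tolower((unsigned char)b[i]))
            return 0;

    /* The period and what follows it must still match (ignoring case). */
    const char *sa = a + la, *sb = b + lb;
    while (*sa != '\0' && *sb != '\0') {
        if (tolower((unsigned char)*sa) != tolower((unsigned char)*sb))
            return 0;
        sa++;
        sb++;
    }
    return *sa == *sb;
}
```

Under this model "networking_v1.h" and "networking_v2.h" could collide, while "net.h" and "network.h" could not.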
Funny thing: I cannot find exact documentation for the "distinction limit" of the "sequence" in GCC's manual, meaning (N1570 §4/8, emphasis mine) ...
An implementation shall be accompanied by a document that defines all implementation defined and locale-specific characteristics and all extensions.
... that GCC could (from some very pedantic point of view) be considered a nonconforming implementation. The practically relevant part of their manual, as @PaulGriffiths pointed out, is probably (source, point 4 in the list):
Significant initial characters in an identifier or macro name.
The preprocessor treats all characters as significant. The C standard requires only that the first 63 be significant.
Regarding the comment:
[..] I am actually trying to evaluate if this will bite me as long as I am using one of these compilers on a Linux platform. [..]
I really doubt that this will ever (again?) be an issue.

Related

Format specifier guide for C

Is there a complete online guide for C format specifiers for every type of data and for all cases? I only found partial and contrasting references that don't explain all possible cases.
The definitive guide for this is the actual ISO standard itself. Any other source suffers from the potential flaw that it may be incorrect or incomplete. The standard is, by definition, both correct and complete(a).
And, while standards documents can sometimes be dry and difficult to read, the sections covering the format specifiers are reasonably clear, both in terms of what all the specifiers mean (including flags, width/precision specifiers, and length modifiers), and the data types you're allowed to use with those specifiers.
For example, C11(b) details all the format specifiers in 7.21.6.1 and 7.21.6.2 for the printf and scanf family of functions respectively. The last free draft of this iteration of the standard is the N1570 document.
That is, practically speaking, the C11 standard - officially, it is the latest draft of C11 and, to get the real standard, you need to buy it from the standards body of your country. However, the differences are minor and tend to be administrative in nature.
(a) I don't mean to imply the standard is totally coherent or bug-free, just that it is the standard. That means, pending authorised changes, implementations must follow said standard in order to be considered C. If an implementation does that, it's valid, regardless of what lunacy the standard may have in it :-)
(b) Although C11 (the iteration we use and are therefore most familiar with) may have been officially replaced by C18, the changes were only incorporations of TCs and defect fixes. There were no substantial changes to the "meat" of the standard, in particular for this question, the format specifiers.
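As a small illustration of how the conversion specifiers and length modifiers from 7.21.6.1 pair up with data types, here is a sketch that formats a handful of values into a buffer. The function name format_samples and the chosen values are mine, not from the standard; the specifiers themselves (z, ll, hh, and the PRId64 macro from <inttypes.h>) are standard.

```c
#include <inttypes.h>
#include <stddef.h>
#include <stdio.h>

/* Formats one value of several different types into buf, pairing each type
   with the length modifier and conversion specifier defined for it.
   Returns whatever snprintf returns. */
int format_samples(char *buf, size_t size)
{
    size_t        count = 3;              /* %zu: z modifier for size_t         */
    long long     big   = -5000000000LL;  /* %lld: ll modifier for long long    */
    unsigned char byte  = 255;            /* %hhu: hh modifier for unsigned char */
    int64_t       fixed = 42;             /* PRId64: macro for int64_t           */
    double        ratio = 2.5;            /* %.2f: double, precision of 2        */

    return snprintf(buf, size, "%zu %lld %hhu %" PRId64 " %.2f",
                    count, big, byte, fixed, ratio);
}
```

With a large enough buffer this produces the text "3 -5000000000 255 42 2.50". Passing a type that does not match its specifier is undefined behavior, which is why checking the standard's table matters.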

Unicode characters in C

Does the C standard require that compilers be able to deal with files not encoded as ASCII? Specifically, I am wondering whether UTF-8 files are standards compliant. Does the answer to the previous question differ between C89, C99 and C11?
Assuming that it is legal to use characters from outside of ASCII in C source files, which usages are legal?
I can think of a few distinct use cases:
Within comments
Within strings
Within identifiers
Within macro names
Here is an example showing all four:
#ifdef PRINT_©
// Print out the © notice
const char my©Notice[] = "This program is © 2016 ACME INC";
puts(my©Notice);
#endif
If C allows non-ASCII characters to appear in the above listed usages, are there any restrictions on the code points which may be used?
Keep in mind that this is a question about C standards. I already realize that putting unicode characters into identifiers and macros will make the code more difficult to use.
It's implementation defined, and thus not regulated by the standard.
I know of at least one compiler, namely clang, that requires the source to be UTF-8. But other compilers might use other requirements, or not allow it.
Since C99, identifiers are allowed to contain multi-byte characters, but before C99 it would be an extension to allow non-basic characters there. C11 expanded the set of allowed characters.
There are some additional restrictions on which characters are allowed in identifiers, and © is not in the list. The allowed ranges are listed in Annex D. These are Unicode code points, but that doesn't strictly mean the encoding of the file has to be Unicode-based.
Ranges of characters allowed
00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
0100−167F, 1681−180D, 180F−1FFF
200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
3004−3007, 3021−302F, 3031−303F
3040−D7FF
F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD
Ranges of characters disallowed initially
0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F
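The two lists above translate fairly directly into a lookup. The following is only a sketch: the function names are made up, and it covers just the universal character name ranges (not the basic letters, digits and underscore, which are always allowed). The supplementary planes are handled arithmetically, since every listed range there has the form P0000 to PFFFD.

```c
#include <stddef.h>
#include <stdint.h>

struct range { uint32_t lo, hi; };

/* C11 Annex D.1, Basic Multilingual Plane part. */
static const struct range d1_bmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD},
};

/* C11 Annex D.2: allowed, but not as the first character. */
static const struct range d2[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F},
};

static int in_ranges(uint32_t cp, const struct range *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (cp >= r[i].lo && cp <= r[i].hi)
            return 1;
    return 0;
}

/* 1 if the code point falls in one of the D.1 ranges, else 0. */
int ucn_allowed_in_identifier(uint32_t cp)
{
    /* Planes 1 to 14: everything except the last two positions of a plane. */
    if (cp >= 0x10000 && cp <= 0xEFFFD)
        return (cp & 0xFFFF) <= 0xFFFD;
    return in_ranges(cp, d1_bmp, sizeof d1_bmp / sizeof d1_bmp[0]);
}

/* 1 if the code point may also start an identifier (D.1 minus D.2). */
int ucn_allowed_as_first(uint32_t cp)
{
    return ucn_allowed_in_identifier(cp)
        && !in_ranges(cp, d2, sizeof d2 / sizeof d2[0]);
}
```

For instance, © (00A9) is rejected while ª (00AA) is accepted, and the combining grave accent (0300) is accepted inside an identifier but not as its first character.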

What constitutes a "valid" C Identifier?

At @Zaibis's suggestion (and related to my own answer to What are the valid characters for macro names?, as well as 😃 (and other unicode characters) in identifiers not allowed by g++))...
clang allows a lot of "crazy" characters.. although I have struggled to find much rhyme or reason - as to why some are allowed (🔴 ϟ ツ ⌘ ☁ ½), and others are not (▶︎ ∀ ★ ©).
For example, the following all compile A-OK (clang-700.1.76)
#define 💩 ?: // OK (Pile of poo)
#define ■ @end // OK (HALFWIDTH BLACK SQUARE)
#define 🅺 @interface // OK (NEGATIVE SQUARED LATIN CAPITAL LETTER K)
#define Ｐ @protocol // OK (FULLWIDTH LATIN CAPITAL LETTER P)
yet the following all result in the same compiler error...
Macro name must be an identifier.
#define ☎ TEL
#define ❌ NO
#define ⇧ UP
#define 〓 ==
#define 🍎 APPLE
clang's docs refer to the issue, stating only...
... support for extended identifiers in C99 and C++. This feature allows identifiers to contain certain Unicode characters, as specified by the active language standard; these characters can be written directly in the source file using the UTF-8 encoding, or referred to using universal character names (\u00E0, \U000000E0).
So, I guess I'm asking.. what IS the "active language standard", and how can I find an authoritative source for what identifiers are legal.
I created the following code just to see what clang would do with it. Out of about 63488 possible identifiers tested, 23 issued warnings and 9506 generated errors. That leaves almost 54,000 valid characters to use in identifiers. Certainly enough, but who got cut? And why?
As others have mentioned, Annex D of ISO/IEC 9899:2011 lists the hexadecimal values of characters valid for universal character names in C11. (I won't bother repeating it here.) I have been searching for an answer as to "why" this list was chosen.
Character set standards
First, there are two relevant standards defining a set of characters: ISO/IEC 10646 (defining UCS) and Unicode. To further confuse (or simplify) things, they both define the same characters since the ISO and Unicode keep them synchronized. UCS is essentially just a character map associating values to a set of characters ("repertoire"), while Unicode also gives further definitions such as how to compare strings in an alphabetical sorting order (collation), which code points represent "canonically equivalent" characters (normalization), a bidirectional algorithm for how to process characters in languages written right to left, and more.
Universal character names in C
Universal character names (UCNs) were a feature newly added in C99 (ISO/IEC 9899:1999). In the "Rationale for International Standard---Programming Languages---C" (Rev. 2, Oct. 1999), the purpose was "to enable the use of any 'native' character in identifiers, string literals and character constants, while retaining the portability objective of C" (sec. 5.2.1). This section continues on about issues of how to encode these characters in C (the \U and \u forms versus multibyte characters or native encodings) and policy models of how to deal with it (p.14, see PDF page 22).
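For the string literal case, a short sketch of what the \u form looks like in practice. The names notice and notice_byte_length are invented for the example, and the bytes that end up in the array depend on the execution character set, which is implementation-defined; the comments assume UTF-8, which is what GCC uses by default and what Clang uses.

```c
#include <stddef.h>
#include <string.h>

/* \u00A9 names the copyright sign without placing a non-ASCII byte in the
   source file. Assuming a UTF-8 execution character set, it is stored as
   the two bytes 0xC2 0xA9. */
static const char notice[] = "This program is \u00A9 2016 ACME INC";

/* Number of bytes (not characters) in the notice, excluding the terminator. */
size_t notice_byte_length(void)
{
    return strlen(notice);
}
```

The source stays pure ASCII, which is the portability point the Rationale is making.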
Rationale
I was hoping that the same "rationale" document from 1999 would give a reason of why each extended character range was selected as acceptable for C99's UCNs. The entirety of the rationale's Annex I is:
Annex I Universal character names for identifiers (normative)
A new feature of C9X.
This is not much of a rationale. They didn't even know what year the C standard would be published, so it's just called "C9X". A later rationale document from 2003 is slightly more enlightening:
Annex D Universal character names for identifiers (normative)
New feature for C99.
The intention is to keep current with ISO/IEC TR 10176.
ISO/IEC TR 10176 is "Guidelines for the preparation of programming language standards." It is basically a guidebook for people who write programming language standards. It includes guidelines for the use of character sets in programming languages as well as a "recommended extended repertoire for user-defined identifiers" (Annex A). But this quote from the 2003 rationale document is only an "intention to keep current," not a pledge of strict adherence to TR 10176.
There is a publicly available ISO/IEC TR 10176:2003 table of characters. The character values refer to ISO 10646. The table classifies ranges of characters from numerous languages as being "uppercase" Lu; "lowercase" Ll; "number, decimal digit" Nd, "punctuation, connector" Pc; etc. It should be clear what use such classifications have to a programming language.
An important reminder is that TR 10176 is a Technical Report, and not a standard. I have found several passing references to it on forums and in documents related to other programming languages, such as Ada, COBOL, and D language. Much of the discussion was about how closely standards of those languages should follow TR 10176 (not being a standard) and complaints that TR 10176 was lagging behind updates to ISO 10646.
Perhaps most enlightening is document WG21/N3146: "Recommendations for extended identifier characters for C and C++." It starts with a comment in 2010 to the standards body recommending restrictions on the initial characters of identifiers. It mentions similar complaints about C referencing TR 10176, and makes suggestions about what characters should be allowed as initial characters of an identifier based on restrictions from Unicode's Identifier and Pattern Syntax and XML's Common Syntactic Constructs. WG21/N3146 gives the proposed wording that later appeared in the C11 standard ISO/IEC 9899:2011. There is a table at the end of the document that helps shed light on the character ranges selected.
Characters allowed and not allowed in C11
Below is a compiled list of ranges for extended identifier characters. The boldface ranges are those given in C11 (ISO/IEC 9899:2011 Annex D). Some comments are added about the italicized ranges not listed in C11 (i.e. not allowed). They are either marked in WG21/N3146 as disallowed by Unicode's UAX#31 or XML's Common Syntactic Constructs, or prohibited by some other comment.
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00C0-00D6, 00D8-00F6, 00F8-00FF: (Various characters, such as feminine ª and masculine º ordinal indicators, vowels with diacritics, numeric characters such as superscript numbers, fractions, etc.)
(previous gaps): All disallowed by UAX31 and/or XML. (Generally punctuation type marks like «», monetary symbols ¥£, mathematical operators ×÷, etc.)
0100-167F: (Latin, Greek, Cyrillic, Arabic, Thai, Ethiopic, etc.---many others)
1680: "The Ogham block contains a script-specific space:  "
1681-180D: (Ogham, Tagalog, Mongolian, etc.)
180E: "The Mongolian block contains a script-specific space"
180F-1FFF: (More languages... phonetics, extended Latin & Greek, etc.)
2000: starts the "General Punctuation" block, but some are allowed:
200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F: (selections from "General Punctuation" block)
2070−218F: "Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols, Letterlike Symbols, Number Forms"
2190-245F: "Arrows, Mathematical Operators, Miscellaneous Technical, Control Pictures, Optical Character Recognition"
2460-24FF: "Enclosed Alphanumerics"
2500: starts "Box Drawing, Block Elements, Geometric Shapes", etc.
2776-2793: (some dingbats and circled dingbats)
2794-2BFF: (a different dingbat set, mathematical symbols, arrows, Braille patterns, etc.)
2C00-2DFF, 2E80-2FFF: "Glagolitic, Latin Extended-C, Coptic, Georgian Supplement, Tifinagh, Ethiopic Extended, Cyrillic Extended-A" (also CJK radical supplement)
3000: (start of "CJK Symbols and Punctuation", some selections allowed)
3004-3007, 3021-302F, 3031-303F: (allowed "CJK Symbols and Punctuation")
3040-D7FF: "Hiragana, Katakana," more CJK ideograms, radicals, etc.
D800-F8FF: (This starts the High and Low Surrogate Areas (number space needed for encodings), and Private Use)
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD: selections from "CJK Compatibility Ideographs," "Arabic Presentation Forms," etc.
10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD,
60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD,
B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD: WG21/N3146 gives the rationale for these final ranges:
The Supplementary Private Use Area extends from F0000 through 10FFFF; both [AltId] and [XML2008] disallow characters in that range.
In addition, [AltId] disallows, as non-characters, the last two code positions of each plane, i.e. every position of the form PFFFE or PFFFF, for any value of P.
The "Ranges of characters disallowed initially" from C11 Annex D.2 are 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F.
With this WG21/N3146 placed next to the Annex D of the C11 standard, much can be inferred about how they line up. For example, mathematical operators and punctuation seem to be not allowed. I hope this sheds some light on "why" or "how" the allowed characters were chosen.
TLDR; version
Authoritative source for legal identifier characters is the C11 standard ISO/IEC 9899:2011 (See Annex D).
This list is based on a technical report, ISO/IEC TR 10176, but with modifications.
C 2011 standard
6.4.2 Identifiers
6.4.2.1 General
...
3 Each universal character name in an identifier shall designate a character whose encoding
in ISO/IEC 10646 falls into one of the ranges specified in D.1. (71) The initial character
shall not be a universal character name designating a character whose encoding falls into
one of the ranges specified in D.2. An implementation may allow multibyte characters
that are not part of the basic source character set to appear in identifiers; which characters
and their correspondence to universal character names is implementation-defined.
...
71) On systems in which linkers cannot accept extended characters, an encoding of the universal character
name may be used in forming valid external identifiers. For example, some otherwise unused
character or sequence of characters may be used to encode the \u in a universal character name.
Extended characters may produce a long external identifier.
...
Annex D
(normative)
Universal character names for identifiers
1 This clause lists the hexadecimal code values that are valid in universal character names
in identifiers.
D.1 Ranges of characters allowed
1 00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6,
00D8−00F6, 00F8−00FF
2 0100−167F, 1681−180D, 180F−1FFF
3 200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
4 2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
5 3004−3007, 3021−302F, 3031−303F
6 3040−D7FF
7 F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
8 10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD,
60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD,
B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD
D.2 Ranges of characters disallowed initially
1 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F
The syntax for identifiers, which include macro names, is presented in section 6.4.2 of the C2011 standard, as interpreted in light of appendix D.1. These provisions hold that every identifier may contain underscores, upper- and lower-case Latin letters, decimal digits, sequences of characters constituting "universal character names" (subject to limitations), and any other character defined by the implementation.
Universal character names (UCNs) are Unicode escape sequences similar to those provided by Java, Python, and some other languages: they start with a backslash (\), which is followed by a u or U, and either four or eight hexadecimal digits, respectively. There are some limitations on the specific hex digit sequences that may be used, some general, others specific to identifier context. Note, however, that syntactically, the only additional character that the provision for UCNs allows to appear in identifiers is the backslash; all the other characters that can appear in a UCN are allowed in identifiers outside of UCN context, too.
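A minimal sketch of a UCN used inside an identifier. The function name is made up; U+00E9 is chosen because it lies in the 00D8−00F6 range of Annex D.1. This assumes a compiler in C99 mode or later with extended identifiers enabled, which is the default in recent GCC and Clang as far as I know.

```c
/* The local variable below is spelled "café": the \u00E9 escape stands for
   the single character U+00E9 inside the identifier. */
int ucn_identifier_demo(void)
{
    int caf\u00E9 = 42;
    return caf\u00E9;
}
```

Both spellings of the variable refer to the same object, because the UCN is part of the identifier rather than an operator or escape applied to it.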
Thus, speaking syntactically and restricting the discussion to the characters that the standard requires to be allowed in identifiers, the underscore, (unaccented) Latin letters, decimal digits, and the backslash are the only characters that C requires must be supported in identifiers. Support for the backslash is required only in the context of UCNs, and not all valid UCNs are allowed in identifiers. Additionally, the standard does not require support for digits as the first characters of identifiers.
On the other hand, the standard is quite liberal in allowing "other implementation-defined characters" in identifiers, including as the first character. Even decimal digits, which otherwise cannot be the first character in an identifier, could, in principle, be allowed at that position under this provision, at the discretion of the implementation. If you want your code to be portable among implementations then you will avoid relying on this provision anywhere. If you want to know which characters your particular implementation allows then you must consult its documentation.
Every standard-conforming implementation must document its behavior with respect to every detail the standard declares to be implementation defined. For example, GCC's documentation specifies that the dollar sign ($) is allowed in identifiers on most target architectures. You yourself linked to and quoted Clang's documentation of the same implementation-defined detail, which is more liberal -- it allows all the characters that can be represented in identifiers via UCNs to also be representable by UTF-8 byte sequences. In many cases, if you display or print source code containing such byte sequences, they will be rendered as a single display character.
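To show how a UCN and a UTF-8 byte sequence relate, here is a sketch of the standard UTF-8 encoding of a single code point. The function name utf8_encode is mine; it is not part of any compiler, it only illustrates which bytes a source file would contain for the character that a given \u or \U escape names.

```c
#include <stddef.h>
#include <stdint.h>

/* Encodes one code point as UTF-8 into out and returns the number of bytes
   written (1 to 4), or 0 if cp is a surrogate or beyond U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;
}
```

So \u00E9 corresponds to the two bytes C3 A9, and \U0001F4A9 to the four bytes F0 9F 92 A9, even though an editor shows each as one character.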
As already mentioned, the C11 Standard defines several allowed Ranges of Unicode characters.
00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
0100−167F, 1681−180D, 180F−1FFF
200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
3004−3007, 3021−302F, 3031−303F
3040−D7FF
F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD
This also means there are several ranges of characters excluded from usage.
From your examples:
☎ is 260E, from the "Miscellaneous Symbols" block (2600-26FF), which means you're missing out on all of these.
❌ is 274C, from the "Dingbats" block (2700-27BF); only some of these are allowed (2776−2793).
⇧ is 21E7, from the "Arrows" block (2190-21FF), which means you're missing out on all of these.
〓 is 3013, from the "CJK Symbols and Punctuation" block (3000-303F); only some of these are allowed.
🍎 is 1F34E, from the "Miscellaneous Symbols and Pictographs" block (1F300-1F5FF), all of which are allowed, so this should actually work (maybe a clang problem? By the way, this character is not displayed on my home computer (Ubuntu), but it is on my work PC (Win7)).

Ambiguous behavior of variable declaration in c

I have the following code:
#include <stdio.h>
int main()
{
    int a12345678901234567890123456789012345;
    int a123456789012345678901234567890123456;
    int sum;
    scanf("%d", &a12345678901234567890123456789012345);
    scanf("%d", &a123456789012345678901234567890123456);
    sum = a12345678901234567890123456789012345 + a123456789012345678901234567890123456;
    printf("%d\n", sum);
    return 0;
}
The problem is, we know that the ANSI standard recognizes variables up to 31 characters, but both variables are the same up to 35 characters. Still, the program compiles without any error or warning and gives the correct output.
But how?
Shouldn't it give a redeclaration error?
Many compilers are built to exceed ANSI specification (for instance, in recognizing longer than 31 character variable names) as a protection to programmers. While it works in the compiler you're using, you can't count on it working in just any C compiler...
[...] we know that the ANSI standard recognizes variables up to 31 characters [...] Shouldn't it give a redeclaration error?
Well, not necessarily. Since you mentioned ANSI C, this is the relevant part of the C89 standard:
"Implementation limits"
The implementation shall treat at least the first 31 characters of an internal name (a macro name or an identifier that does not have external linkage) as significant. Corresponding lower-case and upper-case letters are different. The implementation may further restrict the significance of an external name (an identifier that has external linkage) to six characters and may ignore distinctions of alphabetical case for such names. (10) These limitations on identifiers are all implementation-defined.
Any identifiers that differ in a significant character are different identifiers. If two identifiers differ in a non-significant character, the behavior is undefined.
http://port70.net/~nsz/c/c89/c89-draft.html#3.1.2 (emphasis mine)
It's also explicitly described as a common extension:
Lengths and cases of identifiers
All characters in identifiers (with or without external linkage) are significant and case distinctions are observed (3.1.2)
http://port70.net/~nsz/c/c89/c89-draft.html#A.6.5.3
So, you're just exploiting a C implementation choice of your compiler.
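To see what a different implementation choice would do with the two names from the question, here is a sketch that compares identifiers using only a given number of significant initial characters. The helper same_identifier_within is hypothetical; real compilers do not expose such a function.

```c
#include <stddef.h>
#include <string.h>

/* Returns 1 if an implementation that treats only the first `significant`
   characters as significant could not tell the two identifiers apart. */
int same_identifier_within(const char *a, const char *b, size_t significant)
{
    return strncmp(a, b, significant) == 0;
}
```

With the names a12345678901234567890123456789012345 and a123456789012345678901234567890123456, the result is 1 for a limit of 31 (the C89 minimum) and 0 for a limit of 63 or for a compiler that keeps every character, which is why the program compiles cleanly there.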
The C89 rationale elaborates on this:
3.1.2 Identifiers
While an implementation is not obliged to remember more than the first
31 characters of an identifier for the purpose of name matching, the
programmer is effectively prohibited from intentionally creating two
different identifiers that are the same in the first 31 characters.
Implementations may therefore store the full identifier; they are not
obliged to truncate to 31.
The decision to extend significance to 31 characters for internal
names was made with little opposition, but the decision to retain the
old six-character case-insensitive restriction on significance of
external names was most painful. While strong sentiment was expressed
for making C ``right'' by requiring longer names everywhere, the
Committee recognized that the language must, for years to come,
coexist with other languages and with older assemblers and linkers.
Rather than undermine support for the Standard, the severe
restrictions have been retained.
Compilers like GCC may store the full identifier.
The number of significant initial characters in an identifier (C90 6.1.2, C90, C99 and C11 5.2.4.1, C99 and C11 6.4.2).
For internal names, all characters are significant. For external
names, the number of significant characters are defined by the linker;
for almost all targets, all characters are significant.
A conforming implementation must support at least 31 characters for an external identifier (and your identifiers are internal, where the limit is 63 for C99 and C11).
In fact, having all characters significant is the intent of the standard, but the committee doesn't want to make implementations non-conforming for not providing it. The limits for external identifiers originate from some linkers being unable to provide more (in C89, only 6 characters were required to be significant, which is why the old standard library functions have names no longer than 6 characters).
To be precise, the standard doesn't exactly mandate these limits, the language in the standard is quite permissive:
C11 (n1570) 5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:18)
[...]
63 significant initial characters in an internal identifier or a macro name (each universal character name or extended source character is considered a single character)
31 significant initial characters in an external identifier (each universal character name specifying a short identifier of 0000FFFF or less is considered 6 characters, each universal character name specifying a short identifier of 00010000 or more is considered 10 characters, and each extended source character is considered the same number of characters as the corresponding universal character name, if any)19)
[...]
Footnote 18) clearly expresses the intent:
Implementations should avoid imposing fixed translation limits whenever possible.
Footnote 19) refers to Future language directions 6.11.3:
Restriction of the significance of an external name to fewer than 255 characters (considering each universal character name or extended source character as a single character) is an obsolescent feature that is a concession to existing implementations.
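The counting rule for external identifiers quoted from 5.2.4.1 can be written out as a small calculation. This is a sketch with a made-up function name, and it makes one simplifying assumption: every code point below 0x80 is treated as a basic character counting as 1, and everything else is counted like the corresponding universal character name.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of "characters" an external identifier uses up against the
   31-character minimum, given its code points: 1 for a basic character,
   6 for a code point up to FFFF, 10 for a code point of 10000 or more. */
size_t external_name_cost(const uint32_t *code_points, size_t count)
{
    size_t cost = 0;
    for (size_t i = 0; i < count; i++) {
        if (code_points[i] < 0x80)
            cost += 1;          /* assumed basic source character */
        else if (code_points[i] <= 0xFFFF)
            cost += 6;          /* like \uXXXX */
        else
            cost += 10;         /* like \UXXXXXXXX */
    }
    return cost;
}
```

An identifier such as café therefore counts as 3 + 6 = 9 characters, and a name made of four characters beyond FFFF already exceeds the guaranteed 31.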
And to explain the permissiveness in the first sentence of 5.2.4.1, cf. the C99 rationale (5.10)
5.2.4 Environmental limits
The C89 Committee agreed that the Standard must say something about certain capacities and limitations, but just how to enforce these treaty points was the topic of considerable debate.
5.2.4.1 Translation limits
The Standard requires that an implementation be able to translate and execute some program that meets each of the stated limits. This criterion was felt to give a useful latitude to the implementor in meeting these limits. While a deficient implementation could probably contrive a program that meets this requirement, yet still succeed in being useless, the C89 Committee felt that such ingenuity would probably require more work than making something useful. The sense of both the C89 and C99 Committees was that implementors should not construe the translation limits as the values of hard-wired parameters, but rather as a set of criteria by which an implementation will be judged.
There is no limit.
Actually, there is a limit: the name has to be small enough to fit in memory, but otherwise no. If there is a built-in limit (I don't believe there is), it is so huge that you would be really hard-pressed to reach it. I generated C++ code with 2 variables whose names differ only in the last character, to ensure that names that long are distinct. I got to a 64KB file and thought that was enough.

Why must the first 31 characters of an identifier be unique?

MISRA 2004 rule 5.1 states that all identifiers must have the first 31 characters unique. What is the reason for this rule? Is it a technical limitation with some compilers?
The C standards only guarantee that a certain number of initial characters in identifiers are significant. For C99 this is 31 characters for external identifiers. Even this is a huge step up from ANSI/ISO C, which guarantees only 6 significant characters for external identifiers… (So if you're wondering why so many old C functions have unpronounceable names, this is one reason.)
In practice compilers tend to support a higher number of significant characters in identifiers (and IIRC the C standard even has a footnote encouraging this), but MISRA probably wanted to pick a “safe” limit already guaranteed by the then-most-recent C standard, C99, without imposing the limit of 6 that would be guaranteed by C90 which MISRA 2004 otherwise follows.
edit: Since it has been questioned twice in the comments, let me clarify: MISRA 2004 does not follow C99, and there is no hard evidence that the C99 standard contributed to MISRA's chosen limit of specifically 31 characters. However, the limit does not come from C90 (ISO C), because C90 specifies a limit of 6 characters. So, one must either accept that MISRA picked the number 31 independently, or followed the example of C99 in this particular decision. Of course it might be that both picked the same number due to that being the lower bound in popular compilers of the day, but at the very least it can be argued that the example of the older C99 validates the choice.
MISRA-C:2004 follows the C90 standard, which only requires the 6 first characters of an identifier to be treated as distinct ones. You can read the rationale in the MISRA document.
MISRA-C:2004 Rule 14:
The ISO standard requires external identifiers to be distinct in the
first 6 characters. However compliance with this severe and unhelpful
restriction is considered an unnecessary limitation since most
compilers/linkers allow at least 31 character significance (as for
internal identifiers).
The ISO standard referred to is ISO 9899:1990 (C90). The purpose of the rule is to ensure that you are using a sane, safe compiler with enough characters of significance.
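A rough sketch of how such a rule could be checked mechanically: scan a list of identifiers and report the first pair that is not unique within the first 31 characters. The function find_clash and the sample names are invented for illustration; actual MISRA checkers work on parsed source, not on a plain list of strings.

```c
#include <stddef.h>
#include <string.h>

#define SIGNIFICANT_CHARS 31

/* Looks for two distinct entries that agree in their first 31 characters.
   On success returns 1 and stores their indices in *first and *second;
   returns 0 if every name is unique within the limit. */
int find_clash(const char *const *names, size_t count,
               size_t *first, size_t *second)
{
    for (size_t i = 0; i < count; i++) {
        for (size_t j = i + 1; j < count; j++) {
            if (strncmp(names[i], names[j], SIGNIFICANT_CHARS) == 0) {
                *first = i;
                *second = j;
                return 1;
            }
        }
    }
    return 0;
}
```

For example, sensor_calibration_offset_channel_left and sensor_calibration_offset_channel_right would be flagged, since they only start to differ at the 35th character.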
