I was under the impression that you could only start variable names with letters and _, however while testing around, I also found out that you can start variable names with $, like so:
Code
#include <stdio.h>
int main() {
int myvar=13;
int $var=42;
printf("%d\n", myvar);
printf("%d\n", $var);
}
Output
13
42
According to this resource, it says that you can't start variable names with $ in C, which is wrong (at least when compiled using my gcc version, Apple LLVM version 10.0.1 (clang-1001.0.46.4)). Other resources that I found online also seem to suggest that variables can't start with $, which is why I'm confused.
Do these articles just fail to mention this nuance, and if so, why is this a feature of C?
In the C 2018 standard, clause 6.4.2, paragraph 1 allows implementations to allow additional characters in identifiers.
It defines an identifier to be an identifier-nondigit character followed by any number of identifier-nondigit or digit characters. It defines digit to be “0“ to “9”, and it defines the identifier-nondigit characters to be:
a nondigit, which is one of underscore, “a” to “z”, or “A” to “Z”,
a universal-character-name, or
other implementation-defined characters.
Thus, implementations may define other characters that are allowed in identifiers.
The characters included as universal-character-name are those listed in ranges in Annex D of the C standard.
The resource you link to is wrong in several places:
Variable names in C are made up of letters (upper and lower case) and digits.
This is false; identifiers may include underscores and the above universal characters in every conforming implementation and other characters in implementations that permit them.
$ not allowed -- only letters, and _
This is incorrect. The C standard does not require an implementation to allow “$”, but it does not disallow an implementation from allowing it. “$” is allowed by some implementations and not others. It can be said not to be a part of strictly conforming C programs, but it may be a part of conforming C programs.
This answers your question:
In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.
This is allowed in GCC and LLVM because many traditional C implementations allow identifiers like this.
One such reason is that VMS commonly uses these, where a lot of system library routines have names like SYS$SOMETHING.
Here's a link to the GCC docs describing this:
https://gcc.gnu.org/onlinedocs/gcc/Dollar-Signs.html
Depends on the dialect of C and the options selected. Historically some Cs supported $ to be compatible with existing libraries when C was new. You may need to use a command line option to enable $ or another to turn if of if strictly conforming C is valuable to you.
A spot of history: in my early years I got into enough mainframe rooms to know that $ is one of what IBM mainframes called "national characters" of $,#, and # that could show up in identifiers of programming languages like PL/1 and mainframe assembler. This worked down to some mainframe spin-offs, such as the IBM 1130. It looked to me like early impact printers using pieces of shaped slugs to print with, and CRT terminals, could swap out these characters to meet the national needs of foreign customers. The IBM 1403 printer had many "print chains" to choose from for different human languages and technical purposes.
Some non-IBM identifiers picked up on at least some of these characters. GNU C, VMS, and JavaScript kept "$". "$" is the only character of old that seems to have survived to this day, even as an option, in most languages. The odd thing is back on early IBM days the underscore was invalid for identifier names.
TL;DR: it's the assembler not the compiler
Ok, so I did some research into this. It's not really allowed, but what excludes it as the assembly pass. Trying to do the following fails:
#include <stdio.h>
extern int $func();
int main() {
int myvar=13;
int $var=42;
printf("%d\n", myvar);
printf("%d\n", $var);
$func();
}
joshua#nova:/tmp$ gcc -c test.c
/tmp/ccg7zLVB.s: Assembler messages:
/tmp/ccg7zLVB.s:31: Error: operand type mismatch for `call'
joshua#nova:/tmp$
I pulled K&R C version 2 (this covers ANSI C) off my shelf and it says "Identifiers are a sequence of letters and digits. The first character must be a letter; the underscore _ character counts as a letter. Upper and lower case letters are different. Identifiers may have any length ... [obsolete verbiage omitted]."
This reference as clearly aged; and almost everybody accepts high-unicode as letters. What's going on is the back-end assembler sees symbols bytewise and every byte with the high bit set counts as a letter. If you're crazy enough to use shift-jis outside of string literals, chaos can ensue; but otherwise this tends to work well enough.
I accessed a draft of C18 which says identifier-nondigit: nondigit ; nondigit ; universal-character-name other-implementation-defined-characters. Therefore, implementations are allowed to permit additional characters.
For universal-character-name, we have a restriction: "A universal character name shall not specify a character whose short identifier is less than 00A0
other than 0024 ( $ ), 0040 ( # ), or 0060 (‘), nor one in the range D800 through DFFF inclusive."
The following code still chokes at the assembly pass as expected:
#include <stdio.h>
extern int \U00000024func();
int main()
{
return \U00000024func();
}
The following code builds:
#include <stdio.h>
extern int func\U00000024();
int main()
{
return func\U00000024();
}
Related
I stumbled on some C++ code like this:
int $T$S;
First I thought that it was some sort of PHP code or something wrongly pasted in there but it compiles and runs nicely (on MSVC 2008).
What kind of characters are valid for variables in C++ and are there any other weird characters you can use?
The only legal characters according to the standard are alphanumerics
and the underscore. The standard does require that just about anything
Unicode considers alphabetic is acceptable (but only as single
code-point characters). In practice, implementations offer extensions
(i.e. some do accept a $) and restrictions (most don't accept all of the
required Unicode characters). If you want your code to be portable,
restrict symbols to the 26 unaccented letters, upper or lower case, the
ten digits, and the '_'.
It's an extension of some compilers and not in the C standard
MSVC:
Microsoft Specific
Only the first 2048 characters of Microsoft C++ identifiers are significant. Names for user-defined types are "decorated" by the compiler to preserve type information. The resultant name, including the type information, cannot be longer than 2048 characters. (See Decorated Names for more information.) Factors that can influence the length of a decorated identifier are:
Whether the identifier denotes an object of user-defined type or a type derived from a user-defined type.
Whether the identifier denotes a function or a type derived from a function.
The number of arguments to a function.
The dollar sign is also a valid identifier in Visual C++.
// dollar_sign_identifier.cpp
struct $Y1$ {
void $Test$() {}
};
int main() {
$Y1$ $x$;
$x$.$Test$();
}
https://web.archive.org/web/20100216114436/http://msdn.microsoft.com/en-us/library/565w213d.aspx
Newest version: https://learn.microsoft.com/en-us/cpp/cpp/identifiers-cpp?redirectedfrom=MSDN&view=vs-2019
GCC:
6.42 Dollar Signs in Identifier Names
In GNU C, you may normally use dollar signs in identifier names. This is because many traditional C implementations allow such identifiers. However, dollar signs in identifiers are not supported on a few target machines, typically because the target assembler does not allow them.
http://gcc.gnu.org/onlinedocs/gcc/Dollar-Signs.html#Dollar-Signs
In my knowledge only letters (capital and small), numbers (0 to 9) and _ are valid for variable names according to standard (note: the variable name should not start with a number though).
All other characters should be compiler extensions.
This is not good practice. Generally, you should only use alphanumeric characters and underscores in identifiers ([a-z][A-Z][0-9]_).
Surface Level
Unlike in other languages (bash, perl), C does not use $ to denote the usage of a variable. As such, it is technically valid. In C it most likely falls under C11, 6.4.2. This means that it does seem to be supported by modern compilers.
As for your C++ question, lets test it!
int main(void) {
int $ = 0;
return $;
}
On GCC/G++/Clang/Clang++, this indeed compiles, and runs just fine.
Deeper Level
Compilers take source code, lex it into a token stream, put that into an abstract syntax tree (AST), and then use that to generate code (e.g. assembly/LLVM IR). Your question really only revolves around the first part (e.g. lexing).
The grammar (thus the lexer implementation) of C/C++ does not treat $ as special, unlike commas, periods, skinny arrows, etc... As such, you may get an output from the lexer like this from the below c code:
int i_love_$ = 0;
After the lexer, this becomes a token steam like such:
["int", "i_love_$", "=", "0"]
If you where to take this code:
int i_love_$,_and_.s = 0;
The lexer would output a token steam like:
["int", "i_love_$", ",", "_and_", ".", "s", "=", "0"]
As you can see, because C/C++ doesn't treat characters like $ as special, it is processed differently than other characters like periods.
Does the C standard require that compilers be able to deal with files not encoded as ascii? Specifially, I am wondering whether utf-8 files are standards compliant. Does the answer to the previous question differ between C89, C99 and C11?
Assuming that it is legal to use characters from outside of ASCII in C source files, which usages are legal?
I can think of a few distinct use cases:
Within comments
Within strings
Within identifiers
Within macro names
Here is an example showing all four:
#ifdef PRINT_©
// Print out the © notice
cont char my©Notice[] = "This program is © 2016 ACME INC";
puts(my©Notice);
#endif
If C allows non-ASCII characters to appear in the above listed usages, are there any restrictions on the code points which may be used?
Keep in mind that this is a question about C standards. I already realize that putting unicode characters into identifiers and macros will make the code more difficult to use.
It's implementation defined, and thus not regulated by the standard.
I know of at least one compiler, namely clang, that requires the source to be UTF-8. But other compilers might use other requirements, or not allow it.
Since C99, identifiers are allowed to contain multi-byte characters, but before C99 it would be an extension to allow non-basic characters there. C11 expanded the set of allowed characters.
There's some additional restrictions on what characters are allowed in identifiers, and © is not in the list. It's listed in appendix D. These are Unicode points, but that doesn't strictly mean the encoding in the file has to be unicode-based.
Ranges of characters allowed
00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
0100−167F, 1681−180D, 180F−1FFF
200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
3004−3007, 3021−302F, 3031−303F
3040−D7FF
F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD
Ranges of characters disallowed initially
0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F
I find in the new C++ Standard
2.11 Identifiers [lex.name]
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined character
with the additional text
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified
in E.1. [...]
I can not quite comprehend what this means. From the old std I am used to that a "universal character name" is written \u89ab for example. But using those in an identifier...? Really?
Is the new standard more open w.r.t to Unicode? And I do not refer to the new Literal Types "uHello \u89ab thing"u32, I think I understood those. But:
Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?
Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)
Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;
or even in an identifier in the source itself? That would be a treat... cough...
I think the answer to all thise questions is no but I can not map this reliably to the wording in the standard... :-)
Edit: I found "2.2 Phases of translation [lex.phases]", Phase 1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic
source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
By reading this I now think, that a compiler may choose to accept UTF-8, UTF-16 or any codepage it wishes (by meta information or user configuration). In Phase 1 it translates this into an ASCII form ("basic source character set") in which then the Unicode-characters are replaced by its \uNNNN notation (or the compiler can choose to continue to work in its Unicode-representation, but than has to make sure it handles the other \uNNNN the same way.
What do you think?
Is the new standard more open w.r.t to Unicode?
With respect to allowing universal character names in identifiers the answer is no; UCNs were allowed in identifiers back in C99 and C++98. However compilers did not implement that particular requirement until recently. Clang 3.3 I think introduces support for this and GCC has had an experimental feature for this for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would also be coming to VC++ at some point. (Although IIRC Herb refers to it as a C++11 feature; it is in fact a C++98 feature.)
It's not expected that identifiers will be written using UCNs. Instead the expected behavior is to write the desired character using the source encoding. E.g., source will look like:
long pörk;
not:
long p\u00F6rk;
However UCNs are also useful for another purpose; Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme where at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII compatible encoding).
UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful in, for example, writing a string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:
char const *degree_sign = "\u00b0";
This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.
Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?
It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), gcc allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.
(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)
Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)
Yes, the specification mandates this, although as I said not all compilers implement this requirement yet.
Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;
No, you cannot use Unicode long names.
or even in an identifier in the source itself? That would be a treat... cough...
If the compiler supports a source code encoding that contains the extended character you want then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that supports this requirement of the C++ spec then you may write any character in its source character set directly in the source without bothering with writing UCNs.
I think the intent is to allow Unicode characters in identifiers, such as:
long pöjk;
ostream* å;
I suggest using clang++ instead of g++. Clang is designed to be highly compatible with GCC (wikipedia-source), so you can most likely just substitute that command.
I wanted to use Greek symbols in my source code.
If code readability is the goal, then it seems reasonable to use (for example) α over alpha. Especially when used in larger mathematical formulas, they can be read more easily in the source code.
To achieve this, this is a minimal working example:
> cat /tmp/test.cpp
#include <iostream>
int main()
{
int α = 10;
std::cout << "α = " << α << std::endl;
return 0;
}
> clang++ /tmp/test.cpp -o /tmp/test
> /tmp/test
α = 10
This article https://www.securecoding.cert.org/confluence/display/seccode/PRE30-C.+Do+not+create+a+universal+character+name+through+concatenation works with the idea that int \u0401; is compliant code, though it's based on C99, instead of C++0x.
Present versions of gcc (up to version 5.2 so far) only support ASCII and in some cases EBCDIC input files. Therefore, unicode characters in identifiers have to be represented using \uXXXX and \UXXXXXXXX escape sequences in ASCII encoded files. While it may be possible to represent unicode characters as ??/uXXXX and ??/UXXXXXXX in EBCDIC encoded input files, I have not tested this. At anyrate, a simple one-line patch to cpp allows direct reading of UTF-8 input provided a recent version of iconv is installed. Details are in
https://www.raspberrypi.org/forums/viewtopic.php?p=802657
and may be summarized by the patch
diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c Mon Jan 5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
--- 1711,1717 ----
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, "C99", input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
i have the following code
#include<stdio.h>
int main()
{
int a12345678901234567890123456789012345;
int a123456789012345678901234567890123456;
int sum;
scanf("%d",&a12345678901234567890123456789012345);
scanf("%d",&a123456789012345678901234567890123456);
sum = a12345678901234567890123456789012345 + a123456789012345678901234567890123456;
printf("%d\n",sum);
return 0;
}
the problem is, we know that ANSI standard recognizes variables upto 31 characters...but, both variables are same upto 35 characters...but, still the program compiles without any error and warning and giving correct output...
but how?
shouldn't it give an error of redeclaration?
Many compilers are built to exceed ANSI specification (for instance, in recognizing longer than 31 character variable names) as a protection to programmers. While it works in the compiler you're using, you can't count on it working in just any C compiler...
[...] we know that ANSI standard recognizes variables upto 31 characters [...] shouldn't it give an error of redeclaration?
Well, not necessary. Since you mentioned ANSI C, this is the relevant part of C89 standard:
"Implementation limits"
The implementation shall treat at least the first 31 characters of an internal name (a macro name or an identifier that does not have external linkage) as significant. Corresponding lower-case and upper-case letters are different. The implementation may further restrict the significance of an external name (an identifier that has external linkage) to six characters and may ignore distinctions of alphabetical case for such names.10 These limitations on identifiers are all implementation-defined.
Any identifiers that differ in a significant character are different identifiers. If two identifiers differ in a non-significant character, the behavior is undefined.
http://port70.net/~nsz/c/c89/c89-draft.html#3.1.2 (emphasis mine)
It's also explicitly described as a common extension:
Lengths and cases of identifiers
All characters in identifiers (with or without external linkage) are significant and case distinctions are observed (3.1.2)
http://port70.net/~nsz/c/c89/c89-draft.html#A.6.5.3
So, you're just exploiting a C implementation choice of your compiler.
The C89 rationale elaborates on this:
3.1.2 Identifiers
While an implementation is not obliged to remember more than the first
31 characters of an identifier for the purpose of name matching, the
programmer is effectively prohibited from intentionally creating two
different identifiers that are the same in the first 31 characters.
Implementations may therefore store the full identifier; they are not
obliged to truncate to 31.
The decision to extend significance to 31 characters for internal
names was made with little opposition, but the decision to retain the
old six-character case-insensitive restriction on significance of
external names was most painful. While strong sentiment was expressed
for making C ``right'' by requiring longer names everywhere, the
Committee recognized that the language must, for years to come,
coexist with other languages and with older assemblers and linkers.
Rather than undermine support for the Standard, the severe
restrictions have been retained.
Compilers like GCC may store the full identifier.
The number of significant initial characters in an identifier (C90 6.1.2, C90, C99 and C11 5.2.4.1, C99 and C11 6.4.2).
For internal names, all characters are significant. For external
names, the number of significant characters are defined by the linker;
for almost all targets, all characters are significant.
A conforming implementation must support at least 31 characters for an external identifier (and your identifiers are internal, where the limit is 63 for C99 and C11).
In fact, having all characters significant is the intent of the standard, but the committe doesn't want to make implementations non-conforming by not providing it. The limits for external identifiers origin from some linkers unable to provide more (in C89, only 6 characters were required to be significant, which is why the old standard library functions have names not longer than 6 characters).
To be precise, the standard doesn't exactly mandate these limits, the language in the standard is quite permissive:
C11 (n1570) 5.2.4.1 Translation limits
The implementation shall be able to translate and execute at least one program that contains at least one instance of every one of the following limits:18)
[...]
63 significant initial characters in an internal identifier or a macro name (each universal character name or extended source character is considered a single character)
31 significant initial characters in an external identifier (each universal character name specifying a short identifier of 0000FFFF or less is considered 6 characters, each universal character name specifying a short identifier of 00010000 or more is considered 10 characters, and each extended source character is considered the same number of characters as the corresponding universal character name, if any)19)
[...]
Footnote 18) clearly expresses the intent:
Implementations should avoid imposing fixed translation limits whenever possible.
Footnote 19) refers to Future language directions 6.11.3:
Restriction of the significance of an external name to fewer than 255 characters (considering each universal character name or extended source character as a single character) is an obsolescent feature that is a concession to existing implementations.
And to explain the permissiveness in the first sentence of 5.2.4.1, cf. the C99 rationale (5.10)
5.2.4 Environmental limits
The C89 Committee agreed that the Standard must say something about certain capacities and limitations, but just how to enforce these treaty points was the topic of considerable debate.
5.2.4.1 Translation limits
The Standard requires that an implementation be able to translate and execute some program that meets each of the stated limits. This criterion was felt to give a useful latitude to the implementor in meeting these limits. While a deficient implementation could probably contrive a program that meets this requirement, yet still succeed in being useless, the C89 Committee felt that such ingenuity would probably require more work than making something useful. The sense of both the C89 and C99 Committees was that implementors should not construe the translation limits as the values of hard-wired parameters, but rather as a set of criteria by which an implementation will be judged.
There is no limit .
Actually there is a limit , it has to be small enough that it will fit in memory, but otherwise no . If there is a builtin limit (I don't believe there is) it is so huge you would be really hard-pressed to reach it. I
generated C++ code with 2 variables with a differing last character to ensure that the names that long are distinct . I got to 64KB file and thought that is enough.
I saw a line of C that looked like this:
!ErrorHasOccured() ??!??! HandleError();
It compiled correctly and seems to run ok. It seems like it's checking if an error has occurred, and if it has, it handles it. But I'm not really sure what it's actually doing or how it's doing it. It does look like the programmer is trying express their feelings about errors.
I have never seen the ??!??! before in any programming language, and I can't find documentation for it anywhere. (Google doesn't help with search terms like ??!??!). What does it do and how does the code sample work?
??! is a trigraph that translates to |. So it says:
!ErrorHasOccured() || HandleError();
which, due to short circuiting, is equivalent to:
if (ErrorHasOccured())
HandleError();
Guru of the Week (deals with C++ but relevant here), where I picked this up.
Possible origin of trigraphs or as #DwB points out in the comments it's more likely due to EBCDIC being difficult (again). This discussion on the IBM developerworks board seems to support that theory.
From ISO/IEC 9899:1999 §5.2.1.1, footnote 12 (h/t #Random832):
The trigraph sequences enable the input of characters that are not defined in the Invariant Code Set as
described in ISO/IEC 646, which is a subset of the seven-bit US ASCII code set.
Well, why this exists in general is probably different than why it exists in your example.
It all started half a century ago with repurposing hardcopy communication terminals as computer user interfaces. In the initial Unix and C era that was the ASR-33 Teletype.
This device was slow (10 cps) and noisy and ugly and its view of the ASCII character set ended at 0x5f, so it had (look closely at the pic) none of the keys:
{ | } ~
The trigraphs were defined to fix a specific problem. The idea was that C programs could use the ASCII subset found on the ASR-33 and in other environments missing the high ASCII values.
Your example is actually two of ??!, each meaning |, so the result is ||.
However, people writing C code almost by definition had modern equipment,1 so my guess is: someone showing off or amusing themself, leaving a kind of Easter egg in the code for you to find.
It sure worked, it led to a wildly popular SO question.
ASR-33 Teletype
1. For that matter, the trigraphs were invented by the ANSI committee, which first met after C become a runaway success, so none of the original C code or coders would have used them.
It's a C trigraph. ??! is |, so ??!??! is the operator ||
As already stated ??!??! is essentially two trigraphs (??! and ??! again) mushed together that get replaced-translated to ||, i.e the logical OR, by the preprocessor.
The following table containing every trigraph should help disambiguate alternate trigraph combinations:
Trigraph Replaces
??( [
??) ]
??< {
??> }
??/ \
??' ^
??= #
??! |
??- ~
Source: C: A Reference Manual 5th Edition
So a trigraph that looks like ??(??) will eventually map to [], ??(??)??(??) will get replaced by [][] and so on, you get the idea.
Since trigraphs are substituted during preprocessing you could use cpp to get a view of the output yourself, using a silly trigr.c program:
void main(){ const char *s = "??!??!"; }
and processing it with:
cpp -trigraphs trigr.c
You'll get a console output of
void main(){ const char *s = "||"; }
As you can notice, the option -trigraphs must be specified or else cpp will issue a warning; this indicates how trigraphs are a thing of the past and of no modern value other than confusing people who might bump into them.
As for the rationale behind the introduction of trigraphs, it is better understood when looking at the history section of ISO/IEC 646:
ISO/IEC 646 and its predecessor ASCII (ANSI X3.4) largely endorsed existing practice regarding character encodings in the telecommunications industry.
As ASCII did not provide a number of characters needed for languages other than English, a number of national variants were made that substituted some less-used characters with needed ones.
(emphasis mine)
So, in essence, some needed characters (those for which a trigraph exists) were replaced in certain national variants. This leads to the alternate representation using trigraphs comprised of characters that other variants still had around.