Does the C standard require that compilers be able to deal with files not encoded as ASCII? Specifically, I am wondering whether UTF-8 files are standards compliant. Does the answer differ between C89, C99 and C11?
Assuming that it is legal to use characters from outside of ASCII in C source files, which usages are legal?
I can think of a few distinct use cases:
Within comments
Within strings
Within identifiers
Within macro names
Here is an example showing all four:
#ifdef PRINT_©
// Print out the © notice
const char my©Notice[] = "This program is © 2016 ACME INC";
puts(my©Notice);
#endif
If C allows non-ASCII characters to appear in the above listed usages, are there any restrictions on the code points which may be used?
Keep in mind that this is a question about C standards. I already realize that putting unicode characters into identifiers and macros will make the code more difficult to use.
It's implementation defined, and thus not regulated by the standard.
I know of at least one compiler, namely clang, that requires the source to be UTF-8. But other compilers might use other requirements, or not allow it.
Since C99, identifiers are allowed to contain multi-byte characters, but before C99 it would be an extension to allow non-basic characters there. C11 expanded the set of allowed characters.
There are some additional restrictions on which characters are allowed in identifiers, and © is not in the list. The list is in Annex D. These are Unicode code points, but that doesn't strictly mean the encoding of the file has to be Unicode-based.
Ranges of characters allowed
00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
0100−167F, 1681−180D, 180F−1FFF
200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
3004−3007, 3021−302F, 3031−303F
3040−D7FF
F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD
Ranges of characters disallowed initially
0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F
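As a rough illustration of how these ranges can be applied, here is a minimal sketch of a lookup over the Annex D tables quoted above. The names `c11_ucn_allowed` and `c11_ucn_allowed_first` are hypothetical helpers invented for this example; they are not part of any compiler or library. The tables cover only universal character names, so basic characters such as ASCII letters are deliberately outside their scope.

```c
#include <stddef.h>

struct cp_range { unsigned long lo, hi; };

/* C11 Annex D.1, Basic Multilingual Plane part (copied from the list above). */
static const struct cp_range d1_bmp[] = {
    {0x00A8, 0x00A8}, {0x00AA, 0x00AA}, {0x00AD, 0x00AD}, {0x00AF, 0x00AF},
    {0x00B2, 0x00B5}, {0x00B7, 0x00BA}, {0x00BC, 0x00BE}, {0x00C0, 0x00D6},
    {0x00D8, 0x00F6}, {0x00F8, 0x00FF},
    {0x0100, 0x167F}, {0x1681, 0x180D}, {0x180F, 0x1FFF},
    {0x200B, 0x200D}, {0x202A, 0x202E}, {0x203F, 0x2040}, {0x2054, 0x2054},
    {0x2060, 0x206F},
    {0x2070, 0x218F}, {0x2460, 0x24FF}, {0x2776, 0x2793}, {0x2C00, 0x2DFF},
    {0x2E80, 0x2FFF},
    {0x3004, 0x3007}, {0x3021, 0x302F}, {0x3031, 0x303F},
    {0x3040, 0xD7FF},
    {0xF900, 0xFD3D}, {0xFD40, 0xFDCF}, {0xFDF0, 0xFE44}, {0xFE47, 0xFFFD}
};

/* C11 Annex D.2: allowed in an identifier, but not as its first character. */
static const struct cp_range d2_initial[] = {
    {0x0300, 0x036F}, {0x1DC0, 0x1DFF}, {0x20D0, 0x20FF}, {0xFE20, 0xFE2F}
};

static int in_ranges(unsigned long cp, const struct cp_range *r, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        if (cp >= r[i].lo && cp <= r[i].hi)
            return 1;
    return 0;
}

/* Hypothetical helper: may a UCN for `cp` appear in a C11 identifier? */
int c11_ucn_allowed(unsigned long cp)
{
    /* Planes 1 to 14: all but the last two code points of each plane. */
    if (cp >= 0x10000UL && cp <= 0xEFFFDUL)
        return (cp & 0xFFFFUL) <= 0xFFFDUL;
    return in_ranges(cp, d1_bmp, sizeof d1_bmp / sizeof d1_bmp[0]);
}

/* Hypothetical helper: may a UCN for `cp` start a C11 identifier? */
int c11_ucn_allowed_first(unsigned long cp)
{
    return c11_ucn_allowed(cp)
        && !in_ranges(cp, d2_initial, sizeof d2_initial / sizeof d2_initial[0]);
}
```

With these tables, `c11_ucn_allowed(0x00A9)` (the © sign) is false, while `c11_ucn_allowed(0x00F6)` (ö) is true. U+0300 is allowed inside an identifier but rejected by `c11_ucn_allowed_first`.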
I find in the new C++ Standard
2.11 Identifiers [lex.name]
identifier:
identifier-nondigit
identifier identifier-nondigit
identifier digit
identifier-nondigit:
nondigit
universal-character-name
other implementation-defined character
with the additional text
An identifier is an arbitrarily long sequence of letters and digits. Each universal-character-name in an identifier shall designate a character whose encoding in ISO 10646 falls into one of the ranges specified
in E.1. [...]
I can not quite comprehend what this means. From the old std I am used to that a "universal character name" is written \u89ab for example. But using those in an identifier...? Really?
Is the new standard more open w.r.t to Unicode? And I do not refer to the new Literal Types "uHello \u89ab thing"u32, I think I understood those. But:
Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?
Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)
Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;
or even in an identifier in the source itself? That would be a treat... cough...
I think the answer to all these questions is no, but I cannot map this reliably to the wording in the standard... :-)
Edit: I found "2.2 Phases of translation [lex.phases]", Phase 1:
Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set [...] if necessary. The set of physical source file characters accepted is implementation-defined. [...] Any source file character not in the basic
source character set (2.3) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e., using the \uXXXX notation), are handled equivalently except where this replacement is reverted in a raw string literal.)
By reading this I now think that a compiler may choose to accept UTF-8, UTF-16 or any codepage it wishes (by meta information or user configuration). In Phase 1 it translates this into an ASCII form ("basic source character set") in which the Unicode characters are replaced by their \uNNNN notation (or the compiler can choose to continue to work in its Unicode representation, but then has to make sure it handles the other \uNNNN the same way).
What do you think?
Is the new standard more open w.r.t to Unicode?
With respect to allowing universal character names in identifiers the answer is no; UCNs were allowed in identifiers back in C99 and C++98. However compilers did not implement that particular requirement until recently. Clang 3.3 I think introduces support for this and GCC has had an experimental feature for this for some time. Herb Sutter also mentioned during his Build 2013 talk "The Future of C++" that this feature would also be coming to VC++ at some point. (Although IIRC Herb refers to it as a C++11 feature; it is in fact a C++98 feature.)
It's not expected that identifiers will be written using UCNs. Instead the expected behavior is to write the desired character using the source encoding. E.g., source will look like:
long pörk;
not:
long p\u00F6rk;
However, UCNs are also useful for another purpose. Compilers are not all required to accept the same source encodings, but modern compilers all support some encoding scheme where at least the basic source characters have the same encoding (that is, modern compilers all support some ASCII-compatible encoding).
UCNs allow you to write source code with only the basic characters and yet still name extended characters. This is useful in, for example, writing a string literal "°" in source code that will be compiled both as CP1252 and as UTF-8:
char const *degree_sign = "\u00b0";
This string literal is encoded into the appropriate execution encoding on multiple compilers, even when the source encodings differ, as long as the compilers at least share the same encoding for basic characters.
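To see which bytes a compiler actually produced for such a literal, a small hex dump helps. The following is only a sketch in C; `hex_dump` is a hypothetical helper written for this illustration, not a library function.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the bytes of `s` into `out` as
   space-separated uppercase hex. Output is truncated if `out` is too small. */
void hex_dump(const char *s, char *out, size_t out_size)
{
    size_t used = 0;

    if (out_size == 0)
        return;
    out[0] = '\0';

    /* Each byte needs at most three characters plus the terminator. */
    for (; *s != '\0' && used + 3 < out_size; ++s) {
        used += (size_t)snprintf(out + used, out_size - used, "%s%02X",
                                 used ? " " : "",
                                 (unsigned)(unsigned char)*s);
    }
}
```

Calling `hex_dump("\u00b0", buf, sizeof buf)` yields `C2 B0` when the execution character set is UTF-8 and `B0` when it is CP1252, even though the source text of the literal is identical in both builds.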
Can (portable) source code be in any unicode encoding, like UTF-8, UTF-16 or any (how-ever-defined) codepage?
It's not required by the standard, but most compilers will accept UTF-8 source. Clang supports only UTF-8 source (although it has some compatibility for non-UTF-8 data in character and string literals), gcc allows the source encoding to be specified and includes support for UTF-8, and VC++ will guess at the encoding and can be made to guess UTF-8.
(Update: VS2015 now provides an option to force the source and execution character sets to be UTF-8.)
Can I write an identifier with \u1234 in it myfu\u1234ntion (for whatever purpose)
Yes, the specification mandates this, although as I said not all compilers implement this requirement yet.
Or can i use the "character names" that unicode defines like in the ICU, i.e.
const auto x = "German Braunb\U{LOWERCASE LETTER A WITH DIARESIS}r."u32;
No, you cannot use Unicode long names.
or even in an identifier in the source itself? That would be a treat... cough...
If the compiler supports a source code encoding that contains the extended character you want then that character written literally in the source must be treated exactly the same as the equivalent UCN. So yes, if you use a compiler that supports this requirement of the C++ spec then you may write any character in its source character set directly in the source without bothering with writing UCNs.
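The equivalence is easy to probe in a string literal. The sketch below is written in C rather than C++ and assumes the file is saved as UTF-8 and compiled by a compiler that reads UTF-8 source (for example recent GCC or Clang in C99 mode or later). `literal_matches_ucn` is a hypothetical name chosen for this example.

```c
#include <string.h>

/* Hypothetical check: returns 1 when the literally written character and
   its universal character name end up as the same bytes in the
   execution character set. Assumes this file is UTF-8 encoded. */
int literal_matches_ucn(void)
{
    return strcmp("ö", "\u00F6") == 0;
}
```

If the compiler misreads the source encoding (for instance, treating a UTF-8 file as Latin-1), the two literals are expected to differ, which makes this a quick sanity check for the encoding settings.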
I think the intent is to allow Unicode characters in identifiers, such as:
long pöjk;
ostream* å;
I suggest using clang++ instead of g++. Clang is designed to be highly compatible with GCC (wikipedia-source), so you can most likely just substitute that command.
I wanted to use Greek symbols in my source code.
If code readability is the goal, then it seems reasonable to use (for example) α over alpha. Especially when used in larger mathematical formulas, they can be read more easily in the source code.
To achieve this, this is a minimal working example:
> cat /tmp/test.cpp
#include <iostream>
int main()
{
int α = 10;
std::cout << "α = " << α << std::endl;
return 0;
}
> clang++ /tmp/test.cpp -o /tmp/test
> /tmp/test
α = 10
This article https://www.securecoding.cert.org/confluence/display/seccode/PRE30-C.+Do+not+create+a+universal+character+name+through+concatenation works with the idea that int \u0401; is compliant code, though it's based on C99, instead of C++0x.
Present versions of gcc (up to version 5.2 so far) only support ASCII and in some cases EBCDIC input files. Therefore, Unicode characters in identifiers have to be represented using \uXXXX and \UXXXXXXXX escape sequences in ASCII encoded files. While it may be possible to represent Unicode characters as ??/uXXXX and ??/UXXXXXXXX in EBCDIC encoded input files, I have not tested this. At any rate, a simple one-line patch to cpp allows direct reading of UTF-8 input provided a recent version of iconv is installed. Details are in
https://www.raspberrypi.org/forums/viewtopic.php?p=802657
and may be summarized by the patch
diff -cNr gcc-5.2.0/libcpp/charset.c gcc-5.2.0-ejo/libcpp/charset.c
*** gcc-5.2.0/libcpp/charset.c Mon Jan 5 04:33:28 2015
--- gcc-5.2.0-ejo/libcpp/charset.c Wed Aug 12 14:34:23 2015
***************
*** 1711,1717 ****
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, SOURCE_CHARSET, input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
--- 1711,1717 ----
struct _cpp_strbuf to;
unsigned char *buffer;
! input_cset = init_iconv_desc (pfile, "C99", input_charset);
if (input_cset.func == convert_no_conversion)
{
to.text = input;
At @Zaibis's suggestion (and related to my own answer to What are the valid characters for macro names?, as well as 😃 (and other unicode characters) in identifiers not allowed by g++))...
clang allows a lot of "crazy" characters.. although I have struggled to find much rhyme or reason - as to why some are allowed (🔴 ϟ ツ ⌘ ☁ ½), and others are not (▶︎ ∀ ★ ©).
For example, the following all compile A-OK (clang-700.1.76)
#define 💩 ?: // OK (Pile of poo)
#define ■ @end // OK (HALFWIDTH BLACK SQUARE)
#define 🅺 @interface // OK (NEGATIVE SQUARED LATIN CAPITAL LETTER K)
#define Ｐ @protocol // OK (FULLWIDTH LATIN CAPITAL LETTER P)
yet the following all result in the same compiler error...
Macro name must be an identifier.
#define ☎ TEL
#define ❌ NO
#define ⇧ UP
#define 〓 ==
#define 🍎 APPLE
clang's docs refer to the issue, stating only...
... support for extended identifiers in C99 and C++. This feature allows identifiers to contain certain Unicode characters, as specified by the active language standard; these characters can be written directly in the source file using the UTF-8 encoding, or referred to using universal character names (\u00E0, \U000000E0).
So, I guess I'm asking.. what IS the "active language standard", and how can I find an authoritative source for what identifiers are legal.
I created the following code just to see what clang would do with it. Out of about 63488 possible identifiers tested, 23 issued warnings and 9506 generated errors. That leaves almost 54,000 valid characters to use in identifiers. Certainly enough, but who got cut? And why?
As others have mentioned, Annex D of ISO/IEC 9899:2011 lists the hexadecimal values of characters valid for universal character names in C11. (I won't bother repeating it here.) I have been searching for an answer as to "why" this list was chosen.
Character set standards
First, there are two relevant standards defining a set of characters: ISO/IEC 10646 (defining UCS) and Unicode. To further confuse (or simplify) things, they both define the same characters, since ISO and the Unicode Consortium keep them synchronized. UCS is essentially just a character map associating values with a set of characters ("repertoire"), while Unicode also gives further definitions such as how to compare strings in an alphabetical sorting order (collation), which code points represent "canonically equivalent" characters (normalization), a bidirectional algorithm for how to process characters in languages written right to left, and more.
Universal character names in C
Universal character names (UCNs) were a feature newly added in C99 (ISO/IEC 9899:1999). In the "Rationale for International Standard---Programming Languages---C" (Rev. 2, Oct. 1999), the purpose was "to enable the use of any 'native' character in identifiers, string literals and character constants, while retaining the portability objective of C" (sec. 5.2.1). This section continues on about issues of how to encode these characters in C (the \U and \u forms versus multibyte characters or native encodings) and policy models of how to deal with it (p.14, see PDF page 22).
Rationale
I was hoping that the same "rationale" document from 1999 would give a reason of why each extended character range was selected as acceptable for C99's UCNs. The entirety of the rationale's Annex I is:
Annex I Universal character names for identifiers (normative)
A new feature of C9X.
This is not much of a rationale. They didn't even know what year the C standard would be published, so it's just called "C9X". A later rationale document from 2003 is slightly more enlightening:
Annex D Universal character names for identifiers (normative)
New feature for C99.
The intention is to keep current with ISO/IEC TR 10176.
ISO/IEC TR 10176 is "Guidelines for the preparation of programming language standards." It is basically a guidebook for people who write programming language standards. It includes guidelines for the use of character sets in programming languages as well as a "recommended extended repertoire for user-defined identifiers" (Annex A). But this quote from the 2003 rationale document is only an "intention to keep current," not a pledge of strict adherence to TR 10176.
There is a publicly available ISO/IEC TR 10176:2003 table of characters. The character values refer to ISO 10646. The table classifies ranges of characters from numerous languages as being "uppercase" Lu; "lowercase" Ll; "number, decimal digit" Nd, "punctuation, connector" Pc; etc. It should be clear what use such classifications have to a programming language.
An important reminder is that TR 10176 is a Technical Report, and not a standard. I have found several passing references to it on forums and in documents related to other programming languages, such as Ada, COBOL, and D language. Much of the discussion was about how closely standards of those languages should follow TR 10176 (not being a standard) and complaints that TR 10176 was lagging behind updates to ISO 10646.
Perhaps most enlightening is document WG21/N3146: "Recommendations for extended identifier characters for C and C++." It starts with a comment in 2010 to the standards body recommending restrictions on the initial characters of identifiers. It mentions similar complaints about C referencing TR 10176, and makes suggestions about what characters should be allowed as initial characters of an identifier based on restrictions from Unicode's Identifier and Pattern Syntax and XML's Common Syntactic Constructs. WG21/N3146 gives the proposed wording that later appeared in the C11 standard ISO/IEC 9899:2011. There is a table at the end of the document that helps shed light on the character ranges selected.
Characters allowed and not allowed in C11
Below is a compiled list of ranges for extended identifier characters. The boldface ranges are those given in C11 (ISO/IEC 9899:2011 Annex D). Some comments are added about the italicized ranges not listed in C11 (i.e. not allowed). They are either marked in WG21/N3146 as disallowed by Unicode's UAX#31 or XML's Common Syntactic Constructs, or prohibited by some other comment.
00A8, 00AA, 00AD, 00AF, 00B2-00B5, 00C0-00D6, 00D8-00F6, 00F8-00FF: (Various characters, such as feminine ª and masculine º ordinal indicators, vowels with diacritics, numeric characters such as superscript numbers, fractions, etc.)
(previous gaps): All disallowed by UAX31 and/or XML. (Generally punctuation type marks like «», monetary symbols ¥£, mathematical operators ×÷, etc.)
0100-167F: (Latin, Greek, Cyrillic, Arabic, Thai, Ethiopic, etc.---many others)
1680: "The Ogham block contains a script-specific space: "
1681-180D: (Ogham, Tagalog, Mongolian, etc.)
180E: "The Mongolian block contains a script-specific space"
180F-1FFF: (More languages... phonetics, extended Latin & Greek, etc.)
2000: starts the "General Punctuation" block, but some are allowed:
200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F: (selections from "General Punctuation" block)
2070−218F: "Superscripts and Subscripts, Currency Symbols, Combining Diacritical Marks for Symbols, Letterlike Symbols, Number Forms"
2190-245F: "Arrows, Mathematical Operators, Miscellaneous Technical, Control Pictures, Optical Character Recognition"
2460-24FF: "Enclosed Alphanumerics"
2500: starts "Box Drawing, Block Elements, Geometric Shapes", etc.
2776-2793: (some dingbats and circled dingbats)
2794-2BFF: (a different dingbat set, mathematical symbols, arrows, Braille patterns, etc.)
2C00-2DFF, 2E80-2FFF: "Glagolitic, Latin Extended-C, Coptic, Georgian Supplement, Tifinagh, Ethiopic Extended, Cyrillic Extended-A" (also CJK radical supplement)
3000: (start of "CJK Symbols and Punctuation", some selections allowed)
3004-3007, 3021-302F, 3031-303F: (allowed "CJK Symbols and Punctuation")
3040-D7FF: "Hiragana, Katakana," more CJK ideograms, radicals, etc.
D800-F8FF: (This starts the High and Low Surrogate Areas (number space needed for encodings), and Private Use)
F900-FD3D, FD40-FDCF, FDF0-FE44, FE47-FFFD: selections from "CJK Compatibility Ideographs," "Arabic Presentation Forms," etc.
10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD,
60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD,
B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD: WG21/N3146 gives the rationale for these final ranges:
The Supplementary Private Use Area extends from F0000 through 10FFFF; both [AltId] and [XML2008] disallow characters in that range.
In addition, [AltId] disallows, as non-characters, the last two code positions of each plane, i.e. every position of the form PFFFE or PFFFF, for any value of P.
The "Ranges of characters disallowed initially" from C11 Annex D.2 are 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F.
With this WG21/N3146 placed next to the Annex D of the C11 standard, much can be inferred about how they line up. For example, mathematical operators and punctuation seem to be not allowed. I hope this sheds some light on "why" or "how" the allowed characters were chosen.
TLDR; version
Authoritative source for legal identifier characters is the C11 standard ISO/IEC 9899:2011 (See Annex D).
This list is based on a technical report, ISO/IEC TR 10176, but with modifications.
C 2011 standard
6.4.2 Identifiers
6.4.2.1 General
...
3 Each universal character name in an identifier shall designate a character whose encoding
in ISO/IEC 10646 falls into one of the ranges specified in D.1.71) The initial character
shall not be a universal character name designating a character whose encoding falls into
one of the ranges specified in D.2. An implementation may allow multibyte characters
that are not part of the basic source character set to appear in identifiers; which characters
and their correspondence to universal character names is implementation-defined.
...
71) On systems in which linkers cannot accept extended characters, an encoding of the universal character
name may be used in forming valid external identifiers. For example, some otherwise unused
character or sequence of characters may be used to encode the \u in a universal character name.
Extended characters may produce a long external identifier.
...
Annex D
(normative)
Universal character names for identifiers
1 This clause lists the hexadecimal code values that are valid in universal character names
in identifiers.
D.1 Ranges of characters allowed
1 00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6,
00D8−00F6, 00F8−00FF
2 0100−167F, 1681−180D, 180F−1FFF
3 200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
4 2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
5 3004−3007, 3021−302F, 3031−303F
6 3040−D7FF
7 F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
8 10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD,
60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD,
B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD
D.2 Ranges of characters disallowed initially
1 0300−036F, 1DC0−1DFF, 20D0−20FF, FE20−FE2F
The syntax for identifiers, which include macro names, is presented in section 6.4.2 of the C2011 standard, as interpreted in light of appendix D.1. These provisions hold that every identifier may contain underscores, upper- and lower-case Latin letters, decimal digits, sequences of characters constituting "universal character names" (subject to limitations), and any other character defined by the implementation.
Universal character names (UCNs) are Unicode escape sequences similar to those provided by Java, Python, and some other languages: they start with a backslash (\), which is followed by a u or U, and either four or eight hexadecimal digits, respectively. There are some limitations on the specific hex digit sequences that may be used, some general, others specific to identifier context. Note, however, that syntactically, the only additional character that the provision for UCNs allows to appear in identifiers is the backslash; all the other characters that can appear in a UCN are allowed in identifiers outside of UCN context, too.
Thus, speaking syntactically and restricting the discussion to the characters that the standard requires to be allowed in identifiers, the underscore, (unaccented) Latin letters, decimal digits, and the backslash are the only characters that C requires must be supported in identifiers. Support for the backslash is required only in the context of UCNs, and not all valid UCNs are allowed in identifiers. Additionally, the standard does not require support for digits as the first characters of identifiers.
On the other hand, the standard is quite liberal in allowing "other implementation-defined characters" in identifiers, including as the first character. Even decimal digits, which otherwise cannot be the first character in an identifier, could, in principle, be allowed at that position under this provision, at the discretion of the implementation. If you want your code to be portable among implementations then you will avoid relying on this provision anywhere. If you want to know which characters your particular implementation allows then you must consult its documentation.
Every standard-conforming implementation must document its behavior with respect to every detail the standard declares to be implementation defined. For example, GCC's documentation specifies that the dollar sign ($) is allowed in identifiers on most target architectures. You yourself linked to and quoted Clang's documentation of the same implementation-defined detail, which is more liberal -- it allows all the characters that can be represented in identifiers via UCNs to also be representable by UTF-8 byte sequences. In many cases, if you display or print source code containing such byte sequences, they will be rendered as a single display character.
As already mentioned, the C11 Standard defines several allowed Ranges of Unicode characters.
00A8, 00AA, 00AD, 00AF, 00B2−00B5, 00B7−00BA, 00BC−00BE, 00C0−00D6, 00D8−00F6, 00F8−00FF
0100−167F, 1681−180D, 180F−1FFF
200B−200D, 202A−202E, 203F−2040, 2054, 2060−206F
2070−218F, 2460−24FF, 2776−2793, 2C00−2DFF, 2E80−2FFF
3004−3007, 3021−302F, 3031−303F
3040−D7FF
F900−FD3D, FD40−FDCF, FDF0−FE44, FE47−FFFD
10000−1FFFD, 20000−2FFFD, 30000−3FFFD, 40000−4FFFD, 50000−5FFFD, 60000−6FFFD, 70000−7FFFD, 80000−8FFFD, 90000−9FFFD, A0000−AFFFD, B0000−BFFFD, C0000−CFFFD, D0000−DFFFD, E0000−EFFFD
This also means there are several ranges of characters excluded from usage.
From your examples:
☎ is 260E and from the "Miscellaneous Symbols" block: 2600-26FF, which means you're missing out on all of these
❌ is 274C and from the "Dingbats" block: 2700-27BF, which is all of these, but some of them are allowed (2776−2793)
⇧ is 21E7 and from the "Arrows" block: 2190-21FF, which means you're missing out on all of these
〓 is 3013 and from the "CJK Symbols and Punctuation" block: 3000-303F, which is all of these, but some of them are allowed.
🍎 is 1F34E and from the "Miscellaneous Symbols and Pictographs" block: 1F300-1F5FF, which is all of these and actually should work (maybe a clang problem? btw this is not displayed on my home computer (Ubuntu) but is on my work PC (Win7))
Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly. For instance, one could opt to write a Hello World program as such:
%:include <stdio.h>
%:include <stdlib.h>
int main()
<%
fputs("Hello World!??/n", stdout);
return EXIT_SUCCESS;
%>
However, besides digraphs, I used the ??/ trigraph to write the \ character.
Given my assumptions above, is it possible to either
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
write a newline to a FILE * without using the '\n' character?
For stdout you could just use puts("") to output a newline. Or indeed replace the fputs in your original program with puts and delete the \n.
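A minimal sketch of that idea, written with digraphs only so that no backslash, trigraph, brace or number sign appears in the source. The function name `hello_without_backslash` is made up for this illustration.

```c
%:include <stdio.h>

/* Hypothetical helper: prints the greeting followed by a newline.
   puts() appends the newline itself, so no escape sequence is needed. */
int hello_without_backslash(void)
<%
    return puts("Hello World!");
%>
```

`puts` returns a non-negative value on success, so a caller can still detect a failed write.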
If you want to get the newline character into a variable so you can do other things with it, I know another standard function that gives you one for free:
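#include <string.h> /* strchr */
#include <time.h>   /* time, ctime */

int gimme_a_newline(void)
{
    /* ctime() returns a string ending in a newline; strchr(s, 0) points
       at the terminator, so index -1 is that newline character. */
    time_t t = time(0);
    return strchr(ctime(&t), 0)[-1];
}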
You could then say
fprintf(stderr, "Hello, world!%c", gimme_a_newline());
(I hope all of the characters I used are ISO646 or digraph-accessible. I found it surprisingly difficult to get a simple list of which ASCII characters are not in ISO646. Wikipedia has a color-coded table with not nearly enough contrast between colors for me to tell what's what.)
Your premise:
Assume that you're writing (portable) C99 code in the invariant set of ISO 646. This means that the \ (backslash, reverse solidus, however you name it) can't be written directly.
is questionable. C99 defines "source" and "execution" character sets, and requires that both include representations of the backslash character (C99 5.2.1). The only reason I can imagine for an effort such as you describe would be to try to produce source code that does not require character set transcoding upon movement among machines. In that case, however, the choice of ISO 646 as a common baseline is odd. You're more likely to run into an EBCDIC machine than one that uses an ISO 646 variant that is not coincident with the ISO-8859 family of character sets. (And if you can assume ISO 8859, then backslash does not present a problem.)
Nevertheless, if you insist on writing C source code without using a literal backslash character, then the trigraph for that character is the way to do so. That's what trigraphs were invented for. In character constants and string literals, you cannot portably substitute anything else for \n or its trigraph equivalent, ??/n, because it is implementation-dependent how that code is mapped. In particular, it is not safe to assume that it maps to a line-feed character (which, however, is included among the invariant characters of ISO 646).
Update:
You ask specifically whether it is possible to
include the '\n' character (which is translated to a newline in <stdio.h> functions) in a string without the use of trigraphs, or
No, it is not possible, because there is no one '\n' character. Moreover, there seems to be a bit of a misconception here: \n in a character or string literal represents one character in the execution character set. The compiler is therefore responsible for that transformation, not the stdio functions. The stdio functions' responsibility is to handle that character on output by writing a character or character sequence intended to produce the specified effect ("[m]oves the active position to the initial position of the next line").
You also ask whether it is possible to
write a newline to a FILE * without using the '\n' character?
This one depends on exactly what you mean. If you want to write a character whose code in the execution character set you know, then you can write a numeric constant having that numeric value. In particular, if you want to write the character with encoded value 0xa (in the execution character set) then you can do so. For example, you could
fputc(0xa, my_file);
but that does not necessarily produce a result equivalent to
fputc('\n', my_file);
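Whether the two calls write the same character comes down to the value the implementation assigns to '\n'. A tiny sketch for checking this on a given implementation follows; `newline_is_line_feed` is a hypothetical name.

```c
/* Hypothetical check: returns 1 when the execution character set encodes
   the newline escape as 0x0A, and 0 otherwise. */
int newline_is_line_feed(void)
{
    return '\n' == 0x0A;
}
```

On ASCII-based implementations this returns 1. Even then, a stream opened in text mode may translate the character on output, so the bytes that reach the file are a separate question.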
Short answer is, yes, for what you want to do, you have to use this trigraph.
Even if there was a digraph for \, it would be useless inside a string literal because digraphs must be tokens, they are recognized by the tokenizer, while trigraphs are pre-processed and so still work inside string literals and the like.
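The difference can be seen in a short sketch: digraphs work wherever a token is expected, but inside a string literal they stay as the two characters that were typed. Both function names below are hypothetical.

```c
#include <string.h>

/* Digraphs stand in for braces and brackets at the token level. */
int sum_with_digraphs(void)
<%
    int a<:3:> = <% 1, 2, 3 %>;
    return a<:0:> + a<:1:> + a<:2:>;
%>

/* Inside a string literal the same two characters are not replaced,
   so this literal has length 2, not 1. */
size_t digraph_in_string_length(void)
{
    return strlen("<:");
}
```

`sum_with_digraphs()` returns 6 and `digraph_in_string_length()` returns 2, which is why a hypothetical digraph for the backslash could not have helped with writing an escape sequence inside a string.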
Still wondering why somebody would encode source this way today ... :o
No. \n (or its trigraph equivalent) is the portable representation of a newline character.
No. You'd have to represent the literal newline somehow, and \n (or its trigraph equivalent) is the only portable representation.
It's very unusual to find C source code that uses trigraphs or digraphs! Some compilers (e.g. GNU gcc) require a command-line option to enable trigraphs; without it, they assume any trigraph was written unintentionally and issue a warning when they encounter one in the source code.
EDIT: I forgot about puts(""). That's a sneaky way to do it, but only works for stdout.
Yes of course it's possible
fputc(0x0A, file);
I saw a line of C that looked like this:
!ErrorHasOccured() ??!??! HandleError();
It compiled correctly and seems to run ok. It seems like it's checking if an error has occurred, and if it has, it handles it. But I'm not really sure what it's actually doing or how it's doing it. It does look like the programmer is trying to express their feelings about errors.
I have never seen the ??!??! before in any programming language, and I can't find documentation for it anywhere. (Google doesn't help with search terms like ??!??!). What does it do and how does the code sample work?
??! is a trigraph that translates to |. So it says:
!ErrorHasOccured() || HandleError();
which, due to short circuiting, is equivalent to:
if (ErrorHasOccured())
HandleError();
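A small sketch makes the short-circuit behaviour observable. The stub bodies, the counter and `run_check` are assumptions made for this illustration; the question does not show the real functions. Note that HandleError has to return a scalar value (here int) for the || form to compile at all.

```c
/* Stubs standing in for the functions from the question (assumed bodies). */
static int error_flag;
static int handled_count;

static int ErrorHasOccured(void) { return error_flag; }
static int HandleError(void) { ++handled_count; return 1; }

/* Hypothetical driver: runs the one-liner and reports how many times
   HandleError was called. */
int run_check(int error)
{
    error_flag = error;
    handled_count = 0;

    /* The right operand is evaluated only when the left one is false,
       that is, only when an error has occurred. */
    (void)(!ErrorHasOccured() || HandleError());

    return handled_count;
}
```

`run_check(0)` returns 0 because the left operand is already true, and `run_check(1)` returns 1 because the handler runs.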
Guru of the Week (deals with C++ but relevant here), where I picked this up.
Possible origin of trigraphs or as @DwB points out in the comments it's more likely due to EBCDIC being difficult (again). This discussion on the IBM developerworks board seems to support that theory.
From ISO/IEC 9899:1999 §5.2.1.1, footnote 12 (h/t @Random832):
The trigraph sequences enable the input of characters that are not defined in the Invariant Code Set as
described in ISO/IEC 646, which is a subset of the seven-bit US ASCII code set.
Well, why this exists in general is probably different than why it exists in your example.
It all started half a century ago with repurposing hardcopy communication terminals as computer user interfaces. In the initial Unix and C era that was the ASR-33 Teletype.
This device was slow (10 cps) and noisy and ugly and its view of the ASCII character set ended at 0x5f, so it had (look closely at the pic) none of the keys:
{ | } ~
The trigraphs were defined to fix a specific problem. The idea was that C programs could use the ASCII subset found on the ASR-33 and in other environments missing the high ASCII values.
Your example is actually two of ??!, each meaning |, so the result is ||.
However, people writing C code almost by definition had modern equipment,[1] so my guess is: someone showing off or amusing themselves, leaving a kind of Easter egg in the code for you to find.
It sure worked, it led to a wildly popular SO question.
ASR-33 Teletype
1. For that matter, the trigraphs were invented by the ANSI committee, which first met after C became a runaway success, so none of the original C code or coders would have used them.
It's a C trigraph. ??! is |, so ??!??! is the operator ||
As already stated ??!??! is essentially two trigraphs (??! and ??! again) mushed together that get replaced-translated to ||, i.e the logical OR, by the preprocessor.
The following table containing every trigraph should help disambiguate alternate trigraph combinations:
Trigraph   Replaces
??(        [
??)        ]
??<        {
??>        }
??/        \
??'        ^
??=        #
??!        |
??-        ~
Source: C: A Reference Manual 5th Edition
So a trigraph that looks like ??(??) will eventually map to [], ??(??)??(??) will get replaced by [][] and so on, you get the idea.
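The table can be turned into a few lines of C that imitate the replacement step on an ordinary string. This is only a sketch of the mapping, not how any particular compiler implements it, and `replace_trigraphs` is a hypothetical name. The test inputs are best written with escaped question marks so that a trigraph-enabled compiler does not rewrite them first.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copy `src` to `dst`, replacing each trigraph
   (two question marks plus one of nine characters) with the single
   character it stands for. `dst` needs room for strlen(src) + 1 bytes.
   Returns the length of the result. */
size_t replace_trigraphs(const char *src, char *dst)
{
    /* Third character of each trigraph and, at the same index,
       the character it is replaced by. */
    static const char third[]       = "=()/'<>!-";
    static const char replacement[] = "#[]\\^{}|~";
    size_t out = 0;

    while (*src != '\0') {
        if (src[0] == '?' && src[1] == '?' && src[2] != '\0') {
            const char *hit = strchr(third, src[2]);
            if (hit != NULL) {
                dst[out++] = replacement[hit - third];
                src += 3;
                continue;
            }
        }
        dst[out++] = *src++;
    }
    dst[out] = '\0';
    return out;
}
```

Feeding it the six characters of the operator from the question (written in source as `"\?\?!\?\?!"`) produces `||`.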
Since trigraphs are substituted during preprocessing you could use cpp to get a view of the output yourself, using a silly trigr.c program:
void main(){ const char *s = "??!??!"; }
and processing it with:
cpp -trigraphs trigr.c
You'll get a console output of
void main(){ const char *s = "||"; }
As you can notice, the option -trigraphs must be specified or else cpp will issue a warning; this indicates how trigraphs are a thing of the past and of no modern value other than confusing people who might bump into them.
As for the rationale behind the introduction of trigraphs, it is better understood when looking at the history section of ISO/IEC 646:
ISO/IEC 646 and its predecessor ASCII (ANSI X3.4) largely endorsed existing practice regarding character encodings in the telecommunications industry.
As ASCII did not provide a number of characters needed for languages other than English, a number of national variants were made that substituted some less-used characters with needed ones.
(emphasis mine)
So, in essence, some needed characters (those for which a trigraph exists) were replaced in certain national variants. This leads to the alternate representation using trigraphs comprised of characters that other variants still had around.