Is program translation direction well-defined?

It may sound obvious, but just out of curiosity: is program translation direction well-defined (i.e. top-to-bottom, left-to-right)? Is it explicitly defined in the standard?

A source file, and the translation unit that results from including headers via the #include directive, is implicitly a sequence of characters with one dimension. I do not see this explicitly stated in the standard, but there are numerous references to this dimension in the standard, referring to characters “followed by” (C 2018 5.1.1.2 1 1) or “before” (6.10.2 5) other characters, scopes begin “after” the appearance of a tag or declarator (6.2.1 7), and so on.
A compiler is free to read and compute with the parts of translation units in any order it wants, but the meaning of the translation unit is defined in terms of this start-to-finish order.
There is no “up” or “down” between lines. Lines are meaningful in certain parts of C translation, such as the fact that a preprocessing directive ends with a new-line character. However, there is no relationship defined between the same columns on different lines, so the standard does not define anything meaningful for going up or down by lines, beyond the fact that doing so means moving back or forth in the stream of characters by some amount.
The standard does allow that source files might be composed of lines of text, perhaps with Hollerith cards (punched cards) in which each individual card is a line or with fixed-length-record files in which each record is a fixed number of bytes (such as 80) and there are no new-line characters physically recorded in the file. (Implicitly, each line of text ends after 80 characters.) The standard treats these files as if a new-line character were inserted at the end of the line (5.2.1 3), thus effectively converting the file to a single stream of characters.
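As a rough illustration of that effective conversion, here is a minimal sketch assuming a hypothetical phase-1 helper (record_to_line is not a real API, and 80-byte records are just one possible record length):

#include <string.h>

/* Hypothetical illustration: turn one 80-byte fixed-length record into a
   conventional line by trimming the blank padding and appending the
   new-line character that the standard treats as ending the line. */
static size_t record_to_line(const char record[80], char line[82])
{
    size_t len = 80;
    while (len > 0 && record[len - 1] == ' ')
        len--;                     /* drop trailing padding */
    memcpy(line, record, len);
    line[len] = '\n';              /* end-of-line indicator */
    line[len + 1] = '\0';
    return len + 1;                /* length including the new-line */
}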

Related

Contradiction in C18 standard (regarding character sets)?

We read in the C18 standard:
5.1.1.2 Translation phases
The precedence among the syntax rules of translation is specified by the following phases.
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
Meaning that the source file character set is decoded and mapped to the source character set.
But then you can read:
5.2.1 Character sets
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).
Meaning that the source file character set is the source character set.
So the question is: which one did I understand wrong, or which one is actually wrong?
EDIT: Actually I was wrong. See my answer below.
Meaning that the source file character set is decoded and mapped to the source character set.
No, it does not mean that. My take is that the source is already assumed to be written in the source character set - how exactly would it make sense to "map the source character set to the source character set"? Either they are part of the set or they aren't. If you pick the wrong encoding for your source code, it will simply be rejected before the preprocessing even starts.
Translation phase 1 does two things not quite related to this at all:
Resolves trigraphs, which are standardized multibyte sequences.
Map multibyte characters into the source character set (defined in 5.2.1).
The source character set consists of the basic character set, which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implementation-specific.
The definition of multibyte characters is found at 5.2.1.2:
The source character set may contain multibyte characters, used to represent members of the extended character set. The execution character set may also contain multibyte characters, which need not have the same encoding as for the source character set.
Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
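To make the trigraph part concrete (a hedged aside: trigraphs were removed in C23, and modern compilers typically only honor them with an explicit option such as GCC's -trigraphs), here is a small example of what phase 1 rewrites:

??=include <stdio.h>                  /* the trigraph on this line becomes # */

int main(void)
{
    int arr??(3??) = ??<1, 2, 3??>;   /* becomes: int arr[3] = {1, 2, 3}; */
    return arr??(0??);                /* becomes: return arr[0]; */
}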
All of this multibyte madness goes back to the first standardization in 1990 - according to anecdotes from those who were part of that committee, this was because members from various European countries weren't able to use various symbols on their national keyboards.
(I'm not sure how widespread the AltGr key on such keyboards was at the time. It remains a key subject to some serious button mashing when writing C on non-English keyboards anyway, to get access to {}[] symbols etc.)
Well, after all it seems I was wrong. After contacting David Keaton, from the WG14 group (they are in charge of the C standard), I got this clarifying reply:
There is a subtle distinction. The source character set is the character set in which source files are written. However, the source character set is just the list of characters available, which does not say anything about the encoding.
Phase 1 maps the multibyte encoding of the source character set onto the abstract source characters themselves.
In other words, a character that looks like this:
<byte 1><byte 2>
is mapped to this:
<character 1>
The first is an encoding that represents a character in the source character set in which the program was written. The second is the abstract character in the source character set.
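As a small, hedged illustration of that mapping, assuming a UTF-8 encoded source file:

/* On disk, the literal below is stored as the bytes
   63 61 66 C3 A9 ('c', 'a', 'f', then the two-byte UTF-8 encoding of é).
   After phase 1 those five bytes are four abstract source characters:
   c, a, f, é. */
const char *s = "café";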
You have encountered cross compiling, where a program is compiled on one architecture and executed on another architecture and these architectures have different character sets.
5.1.1.2 applies early in reading, when the input file is converted into the compiler's single character set, which clearly must contain all of the characters required by a C program.
However when cross compiling, the execution character set may be different. 5.2.1 is allowing for this possibility. When the compiler emits code, it must translate all character and string constants to the target platform's character set. On modern platforms, this is a no-op, but on some ancient platforms it wasn't.
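A hedged sketch of the consequence: the value of a character constant comes from the execution character set of the target, not from whatever encoding the source file happened to use, so a cross compiler targeting EBCDIC would emit a different value here than one targeting ASCII.

#include <stdio.h>

int main(void)
{
    /* With an ASCII/Unicode execution character set this prints 0x41;
       an EBCDIC target would give 0xC1 for the very same source file. */
    printf("'A' has the value 0x%X\n", (unsigned)'A');
    return 0;
}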

C preprocessor: line continuation: why exactly is a comment not allowed after the backslash character ('\')?

Valid code:
#define M xxx\
yyy
Not valid code:
#define M xxx\/*comment*/
yyy
#define M xxx\//comment
yyy
Questions:
Why is a comment not allowed after the backslash character (\)?
What does the standard say?
UPD.
Extra question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) the backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
Lines are spliced together only if a backslash character is the last character on a line. C 2018 5.1.1.2 specifies phases of translating a C program. In phase 2:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines…
If a comment follows a backslash character, the backslash character is not followed by a new-line character, so no splicing is performed. Comments are processed in phase 3:
The source file is decomposed into preprocessing tokens7) and sequences of white-space characters (including comments)… Each comment is replaced by one space character…
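A tiny aside to make that phase-3 rule concrete (my own example, not part of the quoted text): since each comment is replaced by exactly one space, a comment can separate tokens just as whitespace does.

int/*this comment acts as one space*/x = 1;   /* tokenizes as: int x = 1; */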
Regarding the added question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) the backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
The earliest processing in compiling a C program is the simplest. Early C compilers may have been implemented as layers of simple filters: First local-environment characters or methods of file storage would be translated to a simple stream of characters, then lines would be spliced together (perhaps dealing with a problem of wanting a long source line while having to type your source code on 80-column punched cards), then comments would be removed, and so on.
Splicing together lines marked by a backslash at the end of a line is easy; it only requires looking at two characters. If instead we allow comments to follow the backslash that marks a splice, it becomes complicated:
A backslash followed by a comment followed by a new-line would be spliced, but a backslash followed by a comment followed by other source code would not. That requires looking possibly many characters ahead and parsing the comment delimiters, possibly for multiple comments.
One purpose of splicing lines was to allow continuing long strings across multiple lines. (This was before adjacent strings were concatenated in C.) So "abc\ on one line and def" on another would be spliced together, making "abcdef". While we might allow comments after backslashes intended to join lines, we do not want to splice after a line containing "abc\ /*" /*comment*/. That means the code doing the splicing has to be context-sensitive; if the backslash appears in a quoted string, it has to treat it differently.
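A minimal sketch of that old-style continuation: phase 2 deletes the backslash-newline, so the declaration below contains the single literal "abcdef" (modern code would simply write the adjacent literals "abc" "def" and let phase 6 concatenate them).

const char *s = "abc\
def";                /* after phase 2 this is: const char *s = "abcdef"; */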
There is actually a reason why backslash-newlines are processed before comments are removed. It's the same reason why backslash-newlines are entirely removed, instead of being replaced with (virtual) horizontal whitespace, as comments are. It's a ridiculous reason, but it's the official reason. It's so you can mechanically force-fit C code with long lines onto punched cards, by inserting backslash-newline at column 79 no matter what that happens to divide:
static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * st\
atbuf)
{
        static int warncount = 5;
        struct __old_kernel_stat tmp;
        if (warncount > 0) {
                warncount--;
                printk(KERN_WARNING "VFS: Warning: %s using old stat() call. Re\
compile your binary.\n",
(this is the first chunk of C I found on my hard drive that actually had lines that wouldn't fit on punched cards)
For this to work as intended, backslash-newline has to be able to split a /* or a */, like
/* this comment just so happens to be exactly 80 characters wide at the close *\
/
And you can't have it both ways: if comments were to be removed before processing backslash-newline, then backslash-newline could not affect comment boundaries; conversely, if backslash-newline is to be processed first, then comments can't appear between the backslash and the newline.
(I Am Not Making This Up™: C99 Rationale section 5.1.1.2 paragraph 30 reads
A backslash immediately before a newline has long been used to continue string literals, as well as preprocessing command lines. In the interest of easing machine generation of C, and of transporting code to machines with restrictive physical line lengths, the C89 Committee generalized this mechanism to permit any token to be continued by interposing a backslash/newline sequence.
Emphasis in original. Sorry, I don't know of any non-PDF version of this document.)
Per 5.1.1.2 Translation phases of the C11 standard:
5.1.1.2 Translation phases
1 The precedence among the syntax rules of translation is specified by the following phases.6)
1 Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
2 Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
...
Only backslash characters immediately followed by a new-line will cause lines to be spliced. A comment is not a new-line character.

Restrictions to Unicode escape sequences in C11

Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
static inline int test_unicode_single() {
    return strlen(u8"\u000A") > 1;
}
While I understand that it's not exactly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?
It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
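A short, hedged illustration of that equivalence (assuming a UTF-8 source file and an implementation that accepts extended identifiers): once phase 1 is done, the two spellings below denote the same identifier, which is why the symbol table only ever sees one spelling.

int a\u00F1o = 2024;      /* the identifier "año", spelled with a UCN */
/* int año = 2024; */     /* same identifier typed directly in UTF-8;
                             uncommenting this would be a redefinition */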
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the string literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, nor does it contain the ASCII characters $, @ and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
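A quick illustration of that carve-out (6.4.3 explicitly permits these three UCNs, while a UCN for a character in the basic source character set is a constraint violation):

const char *ok = "\u0024 \u0040 \u0060";   /* same string as "$ @ `" */
/* const char *bad = "\u0022"; */          /* " is in the basic character
                                              set, so this UCN is rejected */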
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with in Phase 1, prior to lexical analysis, and Phase 1 only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.

C translation phases concrete examples

According to the C11 standard (5.1.1.2 Translation phases) there are 8 translation phases.
Can anyone give a concrete example for each of the phases?
For example at phase 1 there is:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set...
so can I have an example of what happens when that mapping is executed, and so on for the other phases?
Well, one example of phase one would be storing your source code into a record-oriented format, such as in z/OS on the mainframe.
These data sets have fixed record sizes so, if your data set specification was FB80 (fixed, blocked, record length of 80), the "line":
int main (void)
would be stored as those fifteen characters followed by sixty-five spaces, and no newline.
Phase one translation would read in the record, possibly strip off the trailing spaces, and add a newline character, before passing the line on to the next phase.
As per the standard, this is also the phase that handles trigraphs, such as converting ??( into [ on a 3270 terminal that has no support for the [ character.
An example of phase five is if you're writing your code on z/OS (using EBCDIC) but cross-compiling it for Linux/x86 (using ASCII/Unicode).
In that case the source characters within string literals and character constants must have the ASCII representation rather than the EBCDIC one. Otherwise, you're likely to get some truly bizarre output on your Linux box.

Regarding C99, what is the definition of a logical source line?

Imagine we write this code:
printf ("testtest"
"titiritest%s",
" test");
Would this, according to ISO/IEC 9899 §5.1.1.2 (2),
be 3 different logical source lines or would it be a single one?
And is this
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
the only rule mentioned about forming logical source lines?
And regarding
5.2.4.1 Translation limits
[...]
— 4095 characters in a logical source line
Would that mean each translation unit must not get bigger than 4095 characters, as long as we don't use a \ right before our line breaks? I'm pretty sure that's not what they intend to say.
So where is the piece of the definition that I'm missing and need to look up?
It's three logical source lines.
Logical source lines are mostly important because macro definitions must fit into one logical source line; I cannot right now think of any other use for logical source lines spanning more than one physical line. To construct large string literals, you can either use logical source lines consisting of more than one physical source line (which I personally find very ugly) or rely on the fact that adjacent quoted strings are concatenated, which is much more readable and maintainable.
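A short sketch contrasting the two, using a hypothetical LOG macro of my own: the macro body has to sit on one logical source line, so it is continued with backslash-newline, while the long string literal uses adjacent-literal concatenation instead.

#include <stdio.h>

/* The macro definition must fit on one logical source line, so the
   physical lines are spliced with backslash-newline: */
#define LOG(msg)           \
    do {                   \
        puts("log: " msg); \
    } while (0)

/* For long string literals, adjacent literals concatenated in
   translation phase 6 are usually preferred over line splicing: */
static const char *banner = "first part of a long message, "
                            "second part of a long message";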
