C translation phases: concrete examples

According to the C11 standard (5.1.1.2 Translation phases), there are 8 translation phases.
Can anyone give a concrete example for each of the phases?
For example, phase 1 says:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set...
so can I have an example of what happens when that mapping is executed, and likewise for the other phases?

Well, one example of phase one would be storing your source code into a record-oriented format, such as in z/OS on the mainframe.
These data sets have fixed record sizes so, if your data set specification was FB80 (fixed, blocked, record length of 80), the "line":
int main (void)
would be stored as those fifteen characters followed by sixty-five spaces, and no newline.
Phase one translation would read in the record, possibly strip off the trailing spaces, and add a newline character, before passing the line on to the next phase.
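As a rough sketch of that phase-1 step, a reader for such a data set might look like the hypothetical helper below (record_to_line is my own name; the 80-byte record is assumed to have been read already). It strips the trailing blanks and supplies the new-line character that the physical record never contained:

#include <string.h>

/* Hypothetical phase-1 helper: convert one 80-byte fixed-length record
   into a conventional source line.  Trailing blanks are stripped and a
   new-line character is appended, as described above. */
static size_t record_to_line(char line[82], const char rec[80])
{
    size_t len = 80;
    while (len > 0 && rec[len - 1] == ' ')  /* drop the padding blanks */
        len--;
    memcpy(line, rec, len);
    line[len++] = '\n';                     /* end-of-line indicator */
    line[len] = '\0';
    return len;
}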
As per the standard, this is also the phase that handles trigraphs, such as converting ??( into [ on a 3270 terminal that has no support for the [ character.
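Here is a small sketch of that, assuming a compiler that still supports trigraphs (e.g. GCC with -trigraphs; trigraphs were removed in C23). Phase 1 rewrites the ??-sequences before anything else sees them:

#include <stdio.h>

int main(void)
{
    /* Phase 1 replaces ??( with [, ??) with ], ??< with { and ??> with },
       so this is the same as: int a[3] = {1, 2, 3}; */
    int a??(3??) = ??<1, 2, 3??>;
    printf("%d\n", a??(1??));   /* prints 2 */
    return 0;
}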
An example of phase five is if you're writing your code on z/OS (using EBCDIC) but cross-compiling it for Linux/x86 (using ASCII/Unicode).
In that case the source characters within string literals and character constants must have the ASCII representation rather than the EBCDIC one. Otherwise, you're likely to get some truly bizarre output on your Linux box.
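A minimal way to see the difference, assuming the usual character codes: the constant 'A' takes its value from the execution character set, so the same source yields different numbers on the two targets.

#include <stdio.h>

int main(void)
{
    /* Prints 65 (0x41) on an ASCII/Unicode execution environment,
       but 193 (0xC1) on an EBCDIC one.  A cross-compiler must use the
       value from the target's execution character set. */
    printf("%d\n", 'A');
    return 0;
}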

Related

The compilation process is preprocessing, compiling, assembling, and linking. Where in this process is the "A" converted, following ASCII or Unicode, into 65?

#include <stdio.h>
int main(void) {
printf("A");
}
Does a reserved keyword of the language such as int also follow the ASCII or Unicode character set? Is it split into individual characters to convert it into binary?
The character set of the source files is implementation-defined. Many common compiler systems use ASCII, some even Unicode, but the C standard mandates no specific character set.
This is the first translation phase specified by the C standard:
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
The chapter 5.2.1 of the C standard differentiates between the source character set and the execution character set. None of them is defined as a specific character set.
Chapter 5.1.1.2 defines several translation phases, of which the fourth phase executes all preprocessor directives, and as such finishes the preprocessing.
This is the fifth phase:
Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
So your title question can be answered as "The compiler converts characters from the source character set to the execution character set."
There are actually compilers with different character sets. For example, z88dk uses a target specific execution character set that is not necessarily ASCII, but accepts ASCII source files.
However, this conversion takes place only for character constants and string literals.
Keywords are not affected. They are processed by the preprocessor in their encoding in the source character set and are never converted to any other character set. This source character set can be ASCII or UTF-8, but the specific character set is implementation-defined.
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). [...]
Such a token could be implemented by some integer constant or enumeration member, or any other means seen fit by the developers.
Concerning your mentioning of "binary": anything in a computer as we know them is binary. It is always the interpretation of a binary value that leads to a numeric value, a character, a machine instruction, or any other meaning. Therefore, there are fewer conversions to binary than you might think.
Reserved words of programming languages are almost always composed of plain ASCII characters.
int is stored in the source file as 0x69, 0x6E, 0x74, and the compiler has to parse and identify it as one of the reserved words. This can be done by comparing those three bytes with the strings in a table of reserved words, or by some more advanced technique such as a hash lookup. Many languages have case-insensitive reserved words; in that case the parsed word int would first have to be converted to a uniform character case, INT.
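A minimal sketch of the table-lookup idea (is_keyword and the table contents are purely illustrative; real compilers typically use perfect hashing or similar):

#include <string.h>

static const char *const keywords[] = { "int", "return", "void", "while" };

/* Illustrative keyword check: compare the scanned identifier against
   a (partial) table of C reserved words. */
static int is_keyword(const char *word)
{
    for (size_t i = 0; i < sizeof keywords / sizeof keywords[0]; i++)
        if (strcmp(word, keywords[i]) == 0)
            return 1;
    return 0;
}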
The argument of the function printf() is the string literal "A". The compiler stores its value in the object file, typically in the .rodata section, as the byte sequence 0x41, 0x00 (a zero-terminated string in ASCII or UTF-8 encoding). It makes no sense to speak of converting the ASCII value of the letter A to binary; it is only a question of interpreting the byte content as the letter A or as its binary value 0x41 = 65.
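You can observe those two bytes directly; this small check (just an illustration, assuming an ASCII/UTF-8 execution character set) prints the numeric value of each byte of the literal:

#include <stdio.h>

int main(void)
{
    const char *s = "A";
    /* The literal occupies two bytes: 0x41 ('A') and 0x00 (the terminator). */
    printf("%#x %#x\n", (unsigned)(unsigned char)s[0], (unsigned)(unsigned char)s[1]);
    return 0;
}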

Is program translation direction well-defined?

It may sound obvious, but just out of curiosity: is program translation direction well-defined (i.e. top-to-bottom, left-to-right)? Is it explicitly defined in the standard?
A source file, and the translation unit that results from including headers via the #include directive, is implicitly a sequence of characters with one dimension. I do not see this explicitly stated in the standard, but there are numerous references to this dimension: the standard refers to characters “followed by” (C 2018 5.1.1.2 1 1) or “before” (6.10.2 5) other characters, to scopes that begin “after” the appearance of a tag or declarator (6.2.1 7), and so on.
A compiler is free to read and compute with the parts of translation units in any order it wants, but the meaning of the translation unit is defined in terms of this start-to-finish order.
There is no “up” or “down” between lines. Lines are meaningful in certain parts of C translation, such as the fact that a preprocessing directive ends with a new-line character. However, there is no relationship defined between the same columns on different lines, so the standard does not define anything meaningful for going up or down by lines beyond the fact this means going back or forth in the stream of characters by some amount.
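For example, the new-line that ends a (logical) line is what terminates a preprocessing directive; to continue a directive onto the next physical line you must splice the lines with a backslash:

/* Without the backslash the directive would end at the first new-line;
   with it, both physical lines form one logical line and one directive. */
#define MAX(a, b) \
    ((a) > (b) ? (a) : (b))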
The standard does allow that source files might be composed of lines of text, perhaps with Hollerith cards (punched cards) in which each individual card is a line or with fixed-length-record files in which each record is a fixed number of bytes (such as 80) and there are no new-line characters physically recorded in the file. (Implicitly, each line of text ends after 80 characters.) The standard treats these files as if a new-line character were inserted at the end of the line (5.2.1 3), thus effectively converting the file to a single stream of characters.

Contradiction in C18 standard (regarding character sets)?

We read in the C18 standard:
5.1.1.2 Translation phases
The precedence among the syntax rules of translation is specified by the following phases.
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
Meaning that the source file character set is decoded and mapped to the source character set.
But then you can read:
5.2.1 Character sets
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).
Meaning that the source file character set is the source character set.
So the question is: which one did I understand wrong, or which one is actually wrong?
EDIT: Actually I was wrong. See my answer below.
Meaning that the source file character set is decoded and mapped to the source character set.
No, it does not mean that. My take is that the source is already assumed to be written in the source character set - how exactly would it make sense to "map the source character set to the source character set"? Either they are part of the set or they aren't. If you pick the wrong encoding for your source code, it will simply be rejected before the preprocessing even starts.
Translation phase 1 does two things not quite related to this at all:
Resolves trigraphs, which are standardized multibyte sequences.
Maps multibyte characters into the source character set (defined in 5.2.1).
The source character set consists of the basic character set, which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implementation-specific.
The definition of multibyte characters is found at 5.2.1.2:
The source character set may contain multibyte characters, used to represent members of
the extended character set. The execution character set may also contain multibyte
characters, which need not have the same encoding as for the source character set.
Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
All of this multibyte madness goes back to the first standardization in 1990 - according to anecdotes from those who were part of that committee, this was because members from various European countries weren't able to use various symbols on their national keyboards.
(I'm not sure how widespread the AltGr key on such keyboards was at the time. It remains a key subject to some serious button mashing when writing C on non-English keyboards anyway, to get access to {}[] symbols etc.)
Well, after all it seems I was wrong. After contacting David Keaton, from the WG14 group (they are in charge of the C standard), I got this clarifying reply:
There is a subtle distinction. The source character set is the
character set in which source files are written. However, the source
character set is just the list of characters available, which does not
say anything about the encoding.
Phase 1 maps the multibyte encoding of the source character set onto
the abstract source characters themselves.
In other words, a character that looks like this:
<byte 1><byte 2>
is mapped to this:
<character 1>
The first is an encoding that represents a character in the source
character set in which the program was written. The second is the
abstract character in the source character set.
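A concrete illustration, assuming a UTF-8 encoded source file: the character é is stored on disk as the two bytes 0xC3 0xA9, and phase 1 maps that two-byte encoding to the single abstract source character é.

/* On disk (UTF-8): the bytes 0xC3 0xA9 inside the literal; after phase 1
   the compiler works with the single source character 'é'.  Phase 5 later
   converts it to the execution character set. */
const char *s = "é";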
You have encountered cross compiling, where a program is compiled on one architecture but executed on another, and the two architectures have different character sets.
5.1.1.2 applies early, while the source is being read: the input file is converted into the compiler's single (source) character set, which clearly must contain all of the characters required by a C program.
However, when cross compiling, the execution character set may be different. 5.2.1 allows for this possibility. When the compiler emits code, it must translate all character and string constants to the target platform's character set. On modern platforms this is a no-op, but on some ancient platforms it wasn't.

Restrictions on Unicode escape sequences in C11

Why is there a restriction for Unicode escape sequences (\unnnn and \Unnnnnnnn) in C11 such that only those characters outside of the basic character set may be represented? For example, the following code results in the compiler error: \u000A is not a valid universal character. (Some Unicode "dictionary" sites even give this invalid format as canon for the C/C++ languages, though admittedly these are likely auto-generated):
#include <string.h>

static inline int test_unicode_single(void) {
    return strlen(u8"\u000A") > 1;   /* error: UCNs below U+00A0 (other than \u0024, \u0040, \u0060) are not allowed */
}
While I understand that it's not exactly necessary for these basic characters to be supported, is there a technical reason why they're not? Something like not being able to represent the same character in more than one way?
It's precisely to avoid alternative spellings.
The primary motivations for adding Universal Character Names (UCNs) to C and C++ were to:
allow identifiers to include letters outside of the basic source character set (like ñ, for example).
allow portable mechanisms for writing string and character literals which include characters outside of the basic source character set.
Furthermore, there was a desire that the changes to existing compilers be as limited as possible, and in particular that compilers (and other tools) could continue to use their established (and often highly optimised) lexical analysis functions.
That was a challenge, because there are huge differences in the lexical analysis architectures of different compilers. Without going into all the details, it appeared that two broad implementation strategies were possible:
The compiler could internally use some single universal encoding, such as UTF-8. All input files in other encodings would be transcribed into this internal encoding very early in the input pipeline. Also, UCNs (wherever they appeared) would be converted to the corresponding internal encoding. This latter transformation could be conducted in parallel with continuation line processing, which also requires detecting backslashes, thus avoiding an extra test on every input character for a condition which very rarely turns out to be true.
The compiler could internally use strict (7-bit) ASCII. Input files in encodings allowing other characters would be transcribed into ASCII with non-ASCII characters converted to UCNs prior to any other lexical analysis.
In effect, both of these strategies would be implemented in Phase 1 (or equivalent), which is long before lexical analysis has taken place. But note the difference: strategy 1 converts UCNs to an internal character coding, while strategy 2 converts non-representable characters to UCNs.
What these two strategies have in common is that once the transcription is finished, there is no longer any difference between a character entered directly into the source stream (in whatever encoding the source file uses) and a character described with a UCN. So if the compiler allows UTF-8 source files, you could enter an ñ as either the two bytes 0xc3, 0xb1 or as the six-character sequence \u00F1, and they would both end up as the same byte sequence. That, in turn, means that every identifier has only one spelling, so no change is necessary (for example) to symbol table lookup.
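A sketch of that equivalence, assuming a compiler that accepts UTF-8 source files and extended characters in identifiers (e.g. recent GCC or Clang): both spellings below denote the same identifier, because after the early transcription they are the same sequence of characters.

#include <stdio.h>

int main(void)
{
    int a\u00F1o = 2024;    /* identifier "año", spelled with a UCN           */
    printf("%d\n", año);    /* the same identifier, spelled directly in UTF-8 */
    return 0;
}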
Typically, compilers just pass variable names through the compilation pipeline, leaving them to be eventually handled by assemblers or linkers. If these downstream tools do not accept extended character encodings or UCNs (depending on implementation strategy) then names containing such characters need to be "mangled" (transcribed) in order to make them acceptable. But even if that's necessary, it's a minor change and can be done at a well-defined interface.
Rather than resolve arguments between compiler vendors whose products (or development teams) had clear preferences between the two strategies, the C and C++ standards committees chose mechanisms and restrictions which make both strategies compatible. In particular, both committees forbid the use of UCNs which represent characters which already have an encoding in the basic source character set. That avoids questions like:
What happens if I put \u0022 inside a string literal:
const char* quote = "\u0022";
If the compiler translates UCNs to the characters they represent, then by the time the lexical analyser sees that line, "\u0022" will have been converted to """, which is a lexical error. On the other hand, a compiler which retains UCNs until the end would happily accept that as a string literal. Banning the use of a UCN which represents a quotation mark avoids this possible non-portability.
Similarly, would '\u005cn' be a newline character? Again, if the UCN is converted to a backslash in Phase 1, then in Phase 3 the string literal would definitely be treated as a newline. But if the UCN is converted to a character value only after the character literal token has been identified as such, then the resulting character literal would contain two characters (an implementation-defined value).
And what about 2 \u002B 2? Is that going to look like an addition, even though UCNs aren't supposed to be used for punctuation characters? Or will it look like an identifier starting with a non-letter code?
And so on, for a large number of similar issues.
All of these details are avoided by the simple expedient of requiring that UCNs cannot be used to spell characters in the basic source character set. And that's what was embodied in the standards.
Note that the "basic source character set" does not contain every ASCII character. It does not contain the majority of the control characters, nor does it contain the ASCII characters $, @ and `. These characters (which have no meaning in a C or C++ program outside of string and character literals) can be written as the UCNs \u0024, \u0040 and \u0060 respectively.
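So, for instance, these UCNs are permitted because the characters they name are not in the basic source character set (whereas \u0041 for A would be rejected):

const char *extra = "\u0024 \u0040 \u0060";   /* same as "$ @ `" */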
Finally, in order to see what sort of knots you need to untie in order to correctly lexically analyse C (or C++), consider the following snippet:
const char* s = "\\
n";
Because continuation lines are dealt with early (line splicing is translation phase 2, before lexical analysis), and that processing only looks for the two-character sequence consisting of a backslash followed by a newline, that line is the same as
const char* s = "\n";
But that might not have been obvious looking at the original code.

Regarding C99, what is the definition of a logical source line?

Imagine we write this code:
printf ("testtest"
"titiritest%s",
" test");
According to ISO/IEC 9899 §5.1.1.2, paragraph 2, would this be three different logical source lines or a single one?
And is this
2. Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
the only rule mentioned about forming logical source lines?
And regarding
5.2.4.1 Translation limits
[...]
— 4095 characters in a logical source line
would that mean each translation unit should not get bigger than 4095 characters, as long as we don't use a \ right before our line breaks? I'm pretty sure that's not what they intend to say.
So which piece of the definition am I missing, and where should I look it up?
It's three logical source lines.
Logical source lines are mostly important because a macro definition must fit into one logical source line; I cannot right now think of any other use for a logical source line made up of more than one physical line. To construct large string literals, you can either use logical source lines consisting of more than one physical source line (which I personally find very ugly) or rely on the fact that adjacent string literals are concatenated, which is much more readable and maintainable.
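A small sketch of the two alternatives mentioned above; both s1 and s2 end up containing "hello world":

/* One logical source line built from two physical lines by splicing: */
const char *s1 = "hello \
world";

/* Adjacent string literals, concatenated in translation phase 6: */
const char *s2 = "hello "
                 "world";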
