Regarding C99, what is the definition of a logical source line?

Imagine we write this code:
printf ("testtest"
"titiritest%s",
" test");
According to ISO/IEC 9899 §5.1.1.2, paragraph 2, would this be three different logical source lines, or a single one?
And is this
2. Each instance of a backslash character (\) immediately followed by a new-line
character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part
of such a splice. A source file that is not empty shall end in a new-line character,
which shall not be immediately preceded by a backslash character before any such
splicing takes place.
the only rule mentioned about forming logical source lines?
If it is, then this limit:
5.2.4.1 Translation limits
[...]
— 4095 characters in a logical source line
would mean that no translation unit may be larger than 4095 characters, as long as we don't use a \ right before our line breaks. And I'm pretty sure that's not what they intend to say.
So where is the piece of the definition that I'm missing?

It's three logical source lines.
Logical source lines matter mostly because a macro definition must fit into one logical source line; I cannot right now think of any other use for logical source lines that span more than one physical line. To construct large string literals, you could either use a logical source line consisting of more than one physical source line (which I personally find very ugly) or rely on the fact that adjacent string literals are concatenated, which is much more readable and maintainable.
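As a small sketch of the two options just described (the array names are made up for illustration), both definitions below produce the same string. The first is a single logical source line spliced from two physical lines; the second is two logical source lines whose adjacent literals are concatenated later in translation.

```c
#include <string.h>

/* One logical source line: the backslash-newline is deleted in phase 2.
   The continuation has to start in column 0, otherwise the indentation
   would become part of the string. */
static const char spliced[] = "testtest\
titiritest";

/* Two logical source lines: adjacent string literals are concatenated. */
static const char concatenated[] = "testtest"
                                   "titiritest";

/* Returns 1 when both spellings yield the same string. */
int literals_match(void)
{
    return strcmp(spliced, concatenated) == 0;
}
```

The concatenated form can be indented freely, which is a large part of why it tends to be easier to maintain.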

Related

Is program translation direction well-defined?

It may sound obvious, but just out of curiosity: is program translation direction well-defined (i.e. top-to-bottom, left-to-right)? Is it explicitly defined in the standard?
A source file, and the translation unit that results from including headers via the #include directive, is implicitly a sequence of characters with one dimension. I do not see this explicitly stated in the standard, but there are numerous references to this dimension in the standard, referring to characters “followed by” (C 2018 5.1.1.2 1 1) or “before” (6.10.2 5) other characters, scopes begin “after” the appearance of a tag or declarator (6.2.1 7), and so on.
A compiler is free to read and compute with the parts of translation units in any order it wants, but the meaning of the translation unit is defined in terms of this start-to-finish order.
There is no “up” or “down” between lines. Lines are meaningful in certain parts of C translation, such as the fact that a preprocessing directive ends with a new-line character. However, there is no relationship defined between the same columns on different lines, so the standard does not define anything meaningful for going up or down by lines beyond the fact this means going back or forth in the stream of characters by some amount.
The standard does allow that source files might be composed of lines of text, perhaps with Hollerith cards (punched cards) in which each individual card is a line or with fixed-length-record files in which each record is a fixed number of bytes (such as 80) and there are no new-line characters physically recorded in the file. (Implicitly, each line of text ends after 80 characters.) The standard treats these files as if a new-line character were inserted at the end of the line (5.2.1 3), thus effectively converting the file to a single stream of characters.

Contradiction in C18 standard (regarding character sets)?

We read in the C18 standard:
5.1.1.2 Translation phases
The precedence among the syntax rules of translation is specified by the following phases.
Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary.
Meaning that the source file character set is decoded and mapped to the source character set.
But then you can read:
5.2.1 Character sets
Two sets of characters and their associated collating sequences shall be defined: the set in which source files are written (the source character set), and the set interpreted in the execution environment (the execution character set).
Meaning that the source file character set is the source character set.
So the question is: which one did I understand wrong, or which one is actually wrong?
EDIT: Actually I was wrong. See my answer below.
Meaning that the source file character set is decoded and mapped to the source character set.
No, it does not mean that. My take is that the source is already assumed to be written in the source character set - how exactly would it make sense to "map the source character set to the source character set"? Either they are part of the set or they aren't. If you pick the wrong encoding for your source code, it will simply be rejected before the preprocessing even starts.
Translation phase 1 does two things not quite related to this at all:
Resolves trigraphs, which are standardized multibyte sequences.
Maps multibyte characters into the source character set (defined in 5.2.1).
The source character set consists of the basic character set, which is essentially the Latin alphabet plus various common symbols (5.2.1/3), and an extended character set, which is locale- and implementation-specific.
The definition of multibyte characters is found at 5.2.1.2:
The source character set may contain multibyte characters, used to represent members of
the extended character set. The execution character set may also contain multibyte
characters, which need not have the same encoding as for the source character set.
Meaning various locale-specific oddball special cases, such as locale-specific trigraphs.
All of this multibyte madness goes back to the first standardization in 1990 - according to anecdotes from those who were part of that committee, this was because members from various European countries weren't able to use various symbols on their national keyboards.
(I'm not sure how widespread the AltGr key on such keyboards was at the time. It remains a key subject to some serious button mashing when writing C on non-English keyboards anyway, to get access to {}[] symbols etc.)
Well, after all it seems I was wrong. After contacting David Keaton, from the WG14 group (they are in charge of the C standard), I got this clarifying reply:
There is a subtle distinction. The source character set is the
character set in which source files are written. However, the source
character set is just the list of characters available, which does not
say anything about the encoding.
Phase 1 maps the multibyte encoding of the source character set onto
the abstract source characters themselves.
In other words, a character that looks like this:
<byte 1><byte 2>
is mapped to this:
<character 1>
The first is an encoding that represents a character in the source
character set in which the program was written. The second is the
abstract character in the source character set.
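To make the byte-to-character mapping concrete, here is a minimal sketch that assumes the source file happens to be encoded in UTF-8. The standard does not require UTF-8 (the mapping is implementation-defined), and `decode_utf8` is a hypothetical helper, not anything a real compiler exposes.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decodes one UTF-8 sequence from s (n bytes available).
   Stores the abstract character (code point) in *out and returns the number
   of bytes consumed, or 0 on malformed input. Simplified: overlong forms and
   surrogates are not rejected. */
size_t decode_utf8(const unsigned char *s, size_t n, uint32_t *out)
{
    if (n == 0)
        return 0;

    unsigned char b = s[0];
    size_t len;
    uint32_t cp;

    if (b < 0x80)                { len = 1; cp = b; }
    else if ((b & 0xE0) == 0xC0) { len = 2; cp = b & 0x1F; }
    else if ((b & 0xF0) == 0xE0) { len = 3; cp = b & 0x0F; }
    else if ((b & 0xF8) == 0xF0) { len = 4; cp = b & 0x07; }
    else
        return 0;               /* stray continuation byte or invalid lead */

    if (len > n)
        return 0;               /* sequence is truncated */

    for (size_t i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80)
            return 0;           /* not a continuation byte */
        cp = (cp << 6) | (uint32_t)(s[i] & 0x3F);
    }

    *out = cp;
    return len;
}
```

Under that assumption, the two bytes 0xC3 0xA9 play the role of `<byte 1><byte 2>` and the decoded value U+00E9 plays the role of `<character 1>`.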
You have encountered cross compiling, where a program is compiled on one architecture and executed on another architecture and these architectures have different character sets.
5.1.1.2 applies early, while the source is being read: the input file is converted into the compiler's single character set, which clearly must contain all of the characters required by a C program.
However when cross compiling, the execution character set may be different. 5.2.1 is allowing for this possibility. When the compiler emits code, it must translate all character and string constants to the target platform's character set. On modern platforms, this is a no-op, but on some ancient platforms it wasn't.

C preprocessor: line continuation: why exactly is a comment not allowed after the backslash character ('\')?

Valid code:
#define M xxx\
yyy
Not valid code:
#define M xxx\/*comment*/
yyy
#define M xxx\//comment
yyy
Questions:
Why is a comment not allowed after the backslash character (\)?
What does the standard say?
UPD.
Extra question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) the backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
Lines are spliced together only if a backslash character is the last character on a line. C 2018 5.1.1.2 specifies phases of translating a C program. In phase 2:
Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines…
If a comment follows a backslash character, the backslash character is not followed by a new-line character, so no splicing is performed. Comments are processed in phase 3:
The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments)… Each comment is replaced by one space character…
Regarding the added question:
What is the motivation / reason / argumentation behind the requirement that (in order to achieve splicing of physical source lines) the backslash character (\) must be immediately followed by a new-line character? What is the obstacle to allowing comments (or spaces) after the backslash character (\)?
The earliest processing in compiling a C program is the simplest. Early C compilers may have been implemented as layers of simple filters: First local-environment characters or methods of file storage would be translated to a simple stream of characters, then lines would be spliced together (perhaps dealing with a problem of wanting a long source line while having to type your source code on 80-column punched cards), then comments would be removed, and so on.
Splicing together lines marked by a backslash at the end of a line is easy; it only requires looking at two characters. If instead we allow comments to follow the backslash that marks a splice, it becomes complicated:
A backslash followed by a comment followed by a new-line would be spliced, but a backslash followed by a comment followed by other source code would not. That requires looking possibly many characters ahead and parsing the comment delimiters, possibly for multiple comments.
One purpose of splicing lines was to allow continuing long strings across multiple lines. (This was before adjacent strings were concatenated in C.) So "abc\ on one line and def" on another would be spliced together, making "abcdef". While we might allow comments after backslashes intended to join lines, we do not want to splice after a line containing "abc\ /*" /*comment*/. That means the code doing the splicing has to be context-sensitive; if the backslash appears in a quoted string, it has to treat it differently.
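The "looking at two characters" point can be sketched as a tiny filter. This is only an illustration of how simple phase 2 can be, not how any particular compiler implements it, and `splice_lines` is a made-up name.

```c
#include <stddef.h>

/* Copies src to dst with every backslash-newline pair removed.
   dst must have room for strlen(src) + 1 bytes.
   Returns the length of the result. */
size_t splice_lines(const char *src, char *dst)
{
    size_t out = 0;

    for (size_t i = 0; src[i] != '\0'; i++) {
        /* Only the current and the next character are inspected. */
        if (src[i] == '\\' && src[i + 1] == '\n') {
            i++;                /* skip the new-line as well */
            continue;
        }
        dst[out++] = src[i];
    }
    dst[out] = '\0';
    return out;
}
```

With this filter, a backslash followed by a comment is left untouched because the character after the backslash is `/`, not a new-line. No knowledge of comments or string literals is needed at this stage.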
There is actually a reason why backslash-newlines are processed before comments are removed. It's the same reason why backslash-newlines are entirely removed, instead of being replaced with (virtual) horizontal whitespace, as comments are. It's a ridiculous reason, but it's the official reason. It's so you can mechanically force-fit C code with long lines onto punched cards, by inserting backslash-newline at column 79 no matter what that happens to divide:
static int cp_old_stat(struct kstat *stat, struct __old_kernel_stat __user * st\
atbuf)
{
static int warncount = 5;
struct __old_kernel_stat tmp;
if (warncount > 0) {
warncount--;
printk(KERN_WARNING "VFS: Warning: %s using old stat() call. Re\
compile your binary.\n",
(this is the first chunk of C I found on my hard drive that actually had lines that wouldn't fit on punched cards)
For this to work as intended, backslash-newline has to be able to split a /* or a */, like
/* this comment just so happens to be exactly 80 characters wide at the close *\
/
And you can't have it both ways: if comments were to be removed before processing backslash-newline, then backslash-newline could not affect comment boundaries; conversely, if backslash-newline is to be processed first, then comments can't appear between the backslash and the newline.
(I Am Not Making This Up™: C99 Rationale section 5.1.1.2 paragraph 30 reads
A backslash immediately before a newline has long been used to continue string literals, as well as preprocessing command lines. In the interest of easing machine generation of C, and of transporting code to machines with restrictive physical line lengths, the C89 Committee generalized this mechanism to permit any token to be continued by interposing a backslash/newline sequence.
Emphasis in original. Sorry, I don't know of any non-PDF version of this document.)
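A compilable sketch of that generalization (the function name is hypothetical): because splicing happens before tokens and comments are recognized, a backslash-newline may split an identifier or even a comment delimiter.

```c
#include <assert.h>

int spliced_count(void)
{
    /* The identifier "count" is split across two physical lines. */
    int cou\
nt = 3;
    /* the closing delimiter of this comment is split by a splice *\
/
    assert(count == 3);
    return count;
}
```

Nobody would write this by hand, but it is the kind of output a tool that mechanically wraps lines at a fixed column could produce.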
Per 5.1.1.2 Translation phases of the C11 standard (note the bolded text added)
5.1.1.2 Translation phases
1 The precedence among the syntax rules of translation is specified by the following phases.
1 Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
2 Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
...
Only backslash characters immediately followed by a new-line will cause lines to be spliced. A comment is not a new-line character.

What is octet meant in iCalendar Specification

Lines of text SHOULD NOT be longer than 75 octets, excluding the line break. Long content lines SHOULD be split into a multiple line representations using a line "folding" technique. That is, a long line can be split between any two characters by inserting a CRLF immediately followed by a single linear white-space character (i.e., SPACE or HTAB). iCalendar Specification 3.1.Content Lines
What is meant by octet here?
Does it mean number of characters over here?
No. It really means octet, as in 8 bits. UTF-8 characters have a variable length (multi-octet). You have another hint here:
Note: It is possible for very simple implementations to generate
improperly folded lines in the middle of a UTF-8 multi-octet
sequence. For this reason, implementations need to unfold lines
in such a way to properly restore the original sequence.
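The difference between octets and characters can be shown with a short sketch, assuming the content line is UTF-8 encoded. Both function names are made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Number of octets (bytes) in a NUL-terminated UTF-8 string. */
size_t octet_count(const char *s)
{
    return strlen(s);
}

/* Number of characters (code points): every byte that is not a
   UTF-8 continuation byte (10xxxxxx) starts a new character. */
size_t utf8_char_count(const char *s)
{
    size_t count = 0;

    for (; *s != '\0'; s++) {
        if (((unsigned char)*s & 0xC0) != 0x80)
            count++;
    }
    return count;
}
```

For the string "café" encoded as UTF-8, the first function reports 5 and the second reports 4. The 75-octet limit applies to the first number.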

C translation phases concrete examples

According to the C11 standard (5.1.1.2 Translation phases) there are 8 translation phases.
Can anyone give a concrete example for each of the phases?
For example at phase 1 there is:
Physical source file multibyte characters are mapped, in an
implementation-defined manner, to the source character set...
so can I have an example of what happens when that mapping is executed and so on
for other phases?
Well, one example of phase one would be storing your source code into a record-oriented format, such as in z/OS on the mainframe.
These data sets have fixed record sizes so, if your data set specification was FB80 (fixed, blocked, record length of 80), the "line":
int main (void)
would be stored as those fifteen characters followed by sixty-five spaces, and no newline.
Phase one translation would read in the record, possibly strip off the trailing spaces, and add a newline character, before passing the line on to the next phase.
As per the standard, this is also the phase that handles trigraphs, such as converting ??( into [ on a 3270 terminal that has no support for the [ character.
An example of phase five is if you're writing your code on z/OS (using EBCDIC) but cross-compiling it for Linux/x86 (using ASCII/Unicode).
In that case the source characters within string literals and character constants must have the ASCII representation rather than the EBCDIC one. Otherwise, you're likely to get some truly bizarre output on your Linux box.
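As a partial illustration of that conversion, the sketch below maps a few EBCDIC code points (upper-case letters, digits, and space, as laid out in code page 037) to their ASCII values. It is deliberately incomplete, and `ebcdic_to_ascii` is a made-up name; a real compiler would use a full table for whichever code pages are involved.

```c
/* Maps one EBCDIC (code page 037) byte to its ASCII code.
   Only A-Z, 0-9 and space are covered; anything else becomes '?' (0x3F). */
unsigned char ebcdic_to_ascii(unsigned char e)
{
    if (e >= 0xC1 && e <= 0xC9) return (unsigned char)(0x41 + (e - 0xC1)); /* A-I */
    if (e >= 0xD1 && e <= 0xD9) return (unsigned char)(0x4A + (e - 0xD1)); /* J-R */
    if (e >= 0xE2 && e <= 0xE9) return (unsigned char)(0x53 + (e - 0xE2)); /* S-Z */
    if (e >= 0xF0 && e <= 0xF9) return (unsigned char)(0x30 + (e - 0xF0)); /* 0-9 */
    if (e == 0x40)              return 0x20;                               /* space */
    return 0x3F;
}
```

Note that the EBCDIC letters are not contiguous, which is one reason an unconverted string literal would print as nonsense on an ASCII system.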
