Structure of C language - c

Why does this work
printf("Hello"
"World");
Whereas
printf("Hello
""World");
does not?
ANSI C concatenates adjacent Strings, that's ok... but it's a different thing.
Does this have something to do with the C language parser or something?
Thanks

The string must be terminated before the end of the line. This is a good thing. Otherwise, a forgotten close-quote could prevent subsequent lines of code from executing.
This could cost significant time to debug. These days syntax coloring would provide a clue, but in the early years there were monochrome displays.

You can't put a line break inside a string literal. This was a choice made by the designers of C. IMO it's a good feature though.
You can however do this:
printf("Hello\
""World");
Which gives the same results.
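For comparison, here is a minimal sketch of the two forms that do compile, assuming a C99 or later compiler. The function names concatenated, spliced and same_text are made up for this illustration.

```c
#include <string.h>

/* Two adjacent string literals: the compiler joins them in translation
   phase 6, so this is the single literal "HelloWorld". */
const char *concatenated(void)
{
    return "Hello"
           "World";
}

/* Backslash-newline is deleted in phase 2, before the literal is
   tokenized. The continuation line starts in column 0 on purpose:
   any leading spaces would become part of the string. */
const char *spliced(void)
{
    return "Hello\
World";
}

/* Returns 1 when both forms produce the same text. */
int same_text(void)
{
    return strcmp(concatenated(), spliced()) == 0;
}
```

Both functions return the same text, which is why the spliced version behaves like the concatenated one.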

The C language is defined in terms of tokens and one of the tokens is a string literal (in standardese: a string-literal, whose contents are an s-char-sequence). String literals start and end with unescaped double quotes and must not contain an unescaped newline.
Relevant standard (C99) quote:
> Syntax
> string-literal:
> " s-char-sequence(opt) "
> L" s-char-sequence(opt) "
> s-char-sequence:
> s-char
> s-char-sequence s-char
> s-char:
> any member of the source character set
> except the double-quote ", backslash \,
> or new-line character
> escape-sequence
Escaped newlines, however, are removed in an early translation phase called line splicing, so the compiler never gets to interpret them. Here's the relevant standard (C99) quote:
The precedence among the syntax rules of translation is specified by the following phases.
1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
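The ordering of these phases has visible consequences. The following is a small sketch, not taken from the standard's text, that shows two of them; the names count, two_chars and two_chars_size are invented for the example.

```c
#include <stddef.h>

/* Phase 3: the comment is replaced by one space, so this declares an
   int named "count". It is not the single identifier "intcount". */
int/**/count = 3;

/* Phases 5 and 6: escape sequences are converted before adjacent
   literals are concatenated. "\x41" is one character (0x41) and the
   '1' from the second literal stays a separate character. */
static const char two_chars[] = "\x41" "1";

size_t two_chars_size(void)
{
    return sizeof two_chars;   /* 0x41, '1' and the terminating '\0' */
}
```

If concatenation happened before escape conversion, the second literal's 1 would be absorbed into the hexadecimal escape instead of remaining a character of its own.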

Related

what characters are valid to occur before and after a preprocessing directive in C according to the C standard [duplicate]

What characters are valid to occur before and after a preprocessing directive in C according to the C standard?
/*what are all the valid characters that can occur here*/ #include <stdio.h> /*and here according to the C standard*/
int main(void)
{
    printf("Hello World");
    return 0;
}
As far as I can tell, the C standard does not say which characters are valid before and after a preprocessing directive. If someone can point me to the exact wording in the C standard, it would be much appreciated.
Note that before the preprocessor directives are examined, the compiler has already passed through phases 1 through 3 of the translation process. Translation phase 2 combines lines ending with a backslash with the following physical line in order to create logical lines. Translation phase 3 replaces every comment with a single space character. (It is allowed but not required that phase 3 also replaces every consecutive sequence of whitespace characters other than newline with a single space character.)
Once that is done, Phase 4 is entered, at which point preprocessor directives are identified. According to §6.10 paragraph 2 of the standard, a sequence of tokens is a preprocessor directive only if "The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character."
That's a very verbose way of saying that the # has to be the first token on a line, which means that it can only be preceded on the line by whitespace. But as the parenthetic comment in the sentence I quoted says, the test applies to the program as seen after phases 1 to 3, which means that comments have already been replaced with whitespace. So in the original program, the # might be preceded by a comment or by whitespace. (The comment must be a /*…*/ comment, since //… comments extend to the end of the line. Also note that continuation lines are combined before comments are identified, so a continuation marker can occur inside a //… comment.)
As to what can follow a preprocessor directive on the line, the answer is technically "nothing", since the directive extends up to and including the newline. (Again, a comment may have appeared in the original program.) The standard shows a grammar for each preprocessor directive which indicates what the directive's syntax is. If you were to add a non-whitespace character to a preprocessing directive, that would either create a syntax error or alter the meaning of the directive.
§6.10 paragraph 5 requires that whitespace within a preprocessor directive can only be a space or tab character, so that vertical tab and form-feed characters would be illegal. However, it is possible that the implementation has changed those characters to space characters in translation phase 3, so the use of vertical tab and form-feed in a preprocessor directive is implementation-dependent. Portable programs should only contain vertical tab and form-feed characters at the beginning of a line.
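A short sketch of what this means in practice, assuming a conforming compiler. The macro ANSWER and the functions answer and char_bits are made-up names for illustration.

```c
/* leading comment */ #include <limits.h> /* trailing comment */

/* By phase 4 a comment is already a single space, so a new-line
   inside the comment does not end the directive. */
#define ANSWER /* this comment spans
                  two physical lines */ 42

int answer(void)
{
    return ANSWER;
}

int char_bits(void)
{
    return CHAR_BIT;   /* comes from <limits.h>, so the #include was seen */
}
```

Both directives are recognized even though, in the original text, the # is not the first character on its line.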
Let's dissect the paragraph 2 of the C standard you linked in a comment:
A preprocessing directive consists of a sequence of preprocessing tokens that satisfies the following constraints:
The characters that make up a source file are divided into tokens. Such tokens are, for example, special characters like '#', identifiers beginning with a letter or underscore, or numbers beginning with a decimal digit.
The first token in the sequence is a # preprocessing token that (at the start of translation phase 4) is either the first character in the source file (optionally after white space containing no new-line characters) or that follows white space containing at least one new-line character.
"White space" is any sequence of white space characters without any other character. Commonly used white space characters are space (or blank), horizontal tabulator, line feed or carriage return.
"[...] white space containing at least one new-line character" means that one new-line character exists in the sequence of white space characters before the '#'. It does not matter where in the sequence it is.
So these are all valid sequences, shown as C strings:
"\n\t\t\t#..."
"\n #..."
"\n#..."
"\n\t#..."
"\n\t #..."
"\n \t#..."
The last token in the sequence is the first new-line character that follows the first token in the sequence.
Beginning with the token '#' all next tokens make up the preprocessor directive, until the next new-line character is found. The footnote 165 mentions the term "line" for such a sequence.
A new-line character ends the preprocessing directive even if it occurs within what would otherwise be an invocation of a function-like macro.
The invocation of a function-like macro looks like a function call in C: an identifier followed by a pair of parentheses. If there is a new-line before the closing parenthesis, the directive ends at that place.
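As a rough illustration of these rules, the sketch below puts white space before the #, white space between the # and the directive name, and uses a backslash to keep a directive going past the end of a physical line. The macro and function names are hypothetical.

```c
  #define INDENTED 1
   #   define SPACED 2

/* Without the backslash the directive would end at the new-line and
   the replacement list would be the incomplete "(INDENTED +". */
#define SUM (INDENTED + \
             SPACED)

int sum(void)
{
    return SUM;
}
```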
EDIT:
White space characters are listed concretely in chapter 7.4.1.10 "The isspace function" of the standard you linked:
The standard white-space characters are the following: space (' '), form feed ('\f'), new-line ('\n'), carriage return ('\r'), horizontal tab ('\t'), and vertical tab ('\v').
One can assume that this function is used by the preprocessor.
Your confusion might come from interpreting "[...] white space containing no new-line characters [...]" as "white space does not include new-line in general" or "new-line is a special white space character." Neither is true.
The new-line is a valid white space character. It just has a special meaning under the specific circumstances of marking the beginning and the end of a preprocessor directive. And that is why they request white space without any new-line.
If the white space contained a new-line, it would mark the beginning of a new token sequence in the context of the preprocessor.
Please note that the preprocessor and the C language are quite separate concepts. You can use the preprocessor for preprocessing any other source files; using it for assembly is quite common. And you can write C source files without any preprocessor directive.
The preprocessor knows nothing about C, and the C compiler knows nothing about preprocessing directives.

In which phase of the compiler, the error of an identifier name being too long is detected?

If I have a very long identifier name, in which phase of the compiler can this error be detected?
Also, if I assign a constant that is out of range to a variable, is that an error?
int a=1987655321467890008766555890765433111223;
The C standard defines eight phases of translation:
1. Physical source multibyte characters and trigraph sequences are mapped to characters of the source character set.
2. Each backslash followed by a new-line is deleted (splicing together two lines).
3. The source characters are grouped into preprocessing tokens, and each comment is replaced by one space; other sequences of white-space characters may also be replaced by one space, except that new-lines are kept.
4. Preprocessing directives and _Pragma operators are executed, and macro invocations are expanded.
5. Source characters in strings and character constants are converted to the execution character set.
6. Adjacent string literals are concatenated.
7. Each preprocessing token is converted into a grammar token, and white-space characters separating tokens are discarded. The resulting tokens are analyzed and translated (compiled).
8. All external references are resolved (the program is linked).
The C standard does not specify in which phase problems in names or values are detected, and the phases are largely conceptual. The phases explain how the C language is understood, not how a compiler must execute.
However, given that, phase 3 is a logical time to detect names that are too long, particularly since names can be preprocessing identifiers, not just identifiers for variables in the program. But this could also be done in phase 4 for preprocessing identifiers or 7 for other identifiers. Also, the compiler might accept long identifiers up to phase 7, but the linker in phase 8 might have a shorter limit, so errors could occur in 8.
Numbers that are much too large for the compiler to handle at all might be detected in phase 3, but 7 is more likely. For numbers that are too large for the object they are being used to initialize, phase 7 is the logical time to detect the problem.
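To get a feel for how far out of range the constant in the question is, here is a run-time analogue. It is only a sketch and is not how a compiler diagnoses the problem; the helper name exceeds_ullong is invented. It uses the standard strtoull function, which reports an out-of-range value by returning ULLONG_MAX and setting errno to ERANGE.

```c
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Returns 1 if the decimal text does not fit in unsigned long long,
   0 otherwise. */
int exceeds_ullong(const char *text)
{
    errno = 0;
    unsigned long long value = strtoull(text, NULL, 10);
    return value == ULLONG_MAX && errno == ERANGE;
}
```

Passing the 40-digit constant from the question as a string makes the helper return 1, whereas a small value such as "170" gives 0.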

Why does this compile without an error with Visual Studio and not GCC?

Do you know why this compiles without an error with Visual Studio (2012) and not GCC 4.7.2?
I am running some compiler tests on tricky source files.
According to the accepted answer here, GCC should not error (error: expected expression before / token): any backslash character (\) immediately followed by a new-line character is deleted as well as the new-line character.
So, this is equivalent to line splicing and should be preprocessed as a single line.
#include \
\
"my_header_\
file_example.h" /* this is a long trailing\
comment */
If you actually have an end of line immediately after the \, the source is correct and should be accepted by a conforming compiler. The draft of the C99 standard says in 5.1.1.2 Translation phases:
§2: Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines.
Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
This actually occurs in phase 2 before the preprocessor executes any #include so it should be accepted according to the standard:
§4: Preprocessing directives are executed, macro invocations are expanded, and
_Pragma unary operator expressions are executed. If a character sequence that
matches the syntax of a universal character name is produced by token
concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing
directive causes the named header or source file to be processed from phase 1
through phase 4, recursively.
Nevertheless, if there is a space between the \ and the end of line, operation of §2 will not happen and there will be an error.
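Because the splice happens before tokenization, it can even fall in the middle of a directive name or a string literal. A minimal sketch, assuming nothing follows each backslash on its line; HEADER_NAME and header_name_length are made-up names, and the header name is used only as a string here, not actually included.

```c
#inc\
lude <string.h>

/* The continuation line starts in column 0 so that no extra spaces
   end up inside the literal. */
#define HEADER_NAME "my_header_\
file_example.h"

size_t header_name_length(void)
{
    return strlen(HEADER_NAME);
}
```

After phase 2 the first two lines read as a normal #include directive, and the macro expands to one uninterrupted string.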
Ignoring the backslash/newline combination is correct in a macro (macro continuation). However, this occurs in a string. The backslash is not followed by an n or other defined escape letter but by a 0x0d or 0x0a which is then taken literally into the string. So the file it tries to open is something like my_header_0x0dfile_example.h.
Btw, it would be interesting to know the error it gives. As I suspect in the above, it complains about file not found, not about "illegal '\' in string literal".

Why does this C program compile without an error?

I'm a beginner in C, and I was playing with C. I typed a C code like this:
#include <stdio.h>
int main()
{
printf("hello world\n");
\
return 0;
}
Even though I used \ knowingly, the C compiler doesn't throw any error. What is this symbol used for in the C language?
Edit:
Even this works:
"\n";
The sequence backslash-newline is removed from the code in a very early phase (phase 2) of the translation process. It used to be how you created long string literals before there was string concatenation, and is how you still extend macros over multiple lines.
See §5.1.1.2 Translation Phases of the C99 standard:
The precedence among the syntax rules of translation is specified by the following phases.5)
1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
3. The source file is decomposed into preprocessing tokens6) and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.7)
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
5) Implementations shall behave as if these separate phases occur, even though many are typically folded together in practice.
6) As described in 6.4, the process of dividing a source file’s characters into preprocessing tokens is context-dependent. For example, see the handling of < within a #include preprocessing directive.
7) An implementation need not convert all non-corresponding source characters to the same execution character.
If you had a blank or any other character after your stray backslash, you would have a compilation error. We can tell that you don't have anything after it because you don't have a compilation error.
The other part of your question, about:
"\n";
is quite different. It is a simple expression that has no side-effects and therefore no effect on the program. The optimizer will completely discard it. When you write:
i = 1;
you have an expression with a value that is discarded; it is evaluated for its side-effect of modifying i.
Sometimes, you'll find code like:
*ptr++;
The compiler will warn you that the result of the expression is discarded; the expression can be simplified to:
ptr++;
and will achieve the same effect in the program.
The \, when immediately followed by a newline, is consumed by preprocessing and causes the next "physical" line to be joined to the current logical line. This is very important for writing long preprocessing directives, which have to be all on one logical line:
#define SHORT very long macro \
consisting of lots and \
lots of preprocessor \
tokens
If you remove the backslash-newline sequences, it is no longer correct. Some other languages from the Unix culture have a similar backslash line continuation syntax: the POSIX shell language derived from the Bourne shell, and also makefiles.
$ this is \
one shell command
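Back in C, a more realistic multi-line macro might look like the sketch below. SWAP_INT and swap_demo are hypothetical names chosen for the example; every line of the definition except the last ends in a backslash so that the whole directive is one logical line.

```c
/* Swap two int lvalues. The do { ... } while (0) wrapper lets the
   macro be used like a single statement followed by a semicolon. */
#define SWAP_INT(a, b)       \
    do {                     \
        int swap_tmp_ = (a); \
        (a) = (b);           \
        (b) = swap_tmp_;     \
    } while (0)

int swap_demo(void)
{
    int x = 1, y = 2;
    SWAP_INT(x, y);          /* now x == 2 and y == 1 */
    return x * 10 + y;
}
```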
About "\n";, that is a primary expression used to form an expression-statement. In C, expressions can be used as statements, and this is exploited all the time. Your printf call, for instance, is an expression statement. printf("hello world\n") is a postfix expression which calls a function, obtaining a return value. Because you used this expression as a statement, the return value is thrown away. The return value of printf indicates how many characters were printed, or whether it was successful at all, so by throwing it away, your program makes itself oblivious to whether the printf call actually worked.
Since the value of an expression-statement is discarded, if such a statement also has no side effects, it is a useless statement which does nothing (like your "\n"). But such useless expression statements are not erroneous. If you add warning options to your compiler command line you might get a warning such as "statement with no effect" or something like that.
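A small sketch of both points, with an invented function name print_and_count: the string literal is discarded explicitly with a (void) cast, which means the same as the bare "\n"; statement but usually avoids the warning, and the return value of printf is kept instead of being thrown away.

```c
#include <stdio.h>

/* Prints the greeting and returns what printf reported: the number
   of characters written, or a negative value on failure. */
int print_and_count(void)
{
    (void)"\n";                        /* evaluated, then discarded */
    int written = printf("hello world\n");
    return written;
}
```

On success the function returns 12, the length of "hello world" plus the newline.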
The backslash \ gets interpreted by the C preprocessor. It protects the character that follows it (the new-line character in your case).
The backslash is simply escaping the next character. In this case, probably a line end (CR) character. Perfectly reasonable.
The backslash plus what is following it is an escape sequence; "\n" together is the newline character (prints a newline). Another important one is "\t", for tab.

Working of the C Preprocessor

How does the following piece of code work, in other words what is the algorithm of the C preprocessor? Does this work on all compilers?
#include <stdio.h>
#define b a
#define a 170
int main() {
printf("%i", b);
return 0;
}
The preprocessor just replaces b with a wherever it finds it in the program and then replaces a with 170. It is just plain textual replacement.
Works on gcc.
It's at §6.10.3 (Macro Replacement):
6.10.3.4 Rescanning and further replacement
1) After all parameters in the replacement list have been substituted and #
and ## processing has taken place, all placemarker preprocessing tokens are removed. Then, the resulting preprocessing token sequence
is rescanned, along with all subsequent preprocessing tokens of the
source file, for more macro names to replace.
Further paragraphs state some complementary rules and exceptions, but this is basically it.
Though it may violate some definitions of "single pass", it's very useful. Like the recursive preprocessing of included files (§5.1.1.2p4).
This simple replacement (first b with a and then a with 170) should work with any compiler.
You should be careful with more complicated cases (usually involving stringification '#' and token concatenation '##') as there are corner cases handled differently at least by MSVC and gcc.
When in doubt, you can always check the ISO standard (a draft is available online) to see how things are supposed to work :). Section 6.10.3 is the most relevant in your case.
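One well-defined example of how '#' interacts with rescanning is the common two-level stringification idiom, sketched below. It does not reproduce any MSVC/gcc difference; it only shows the standard behaviour. VALUE stands in for the question's a, and the other names are illustrative.

```c
#include <string.h>

#define VALUE 170
#define STR(x)  #x       /* '#' stringifies the argument exactly as written */
#define XSTR(x) STR(x)   /* extra level: the argument is expanded first */

const char *unexpanded(void) { return STR(VALUE); }    /* "VALUE" */
const char *expanded(void)   { return XSTR(VALUE); }   /* "170" */

/* Returns 1, because the two spellings give different strings. */
int results_differ(void)
{
    return strcmp(unexpanded(), expanded()) != 0;
}
```

An argument that is an operand of # is not macro-expanded, which is why the extra XSTR level is needed to get "170".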
The preprocessor just replaces the symbols sequentially wherever they appear. The order of the definitions does not matter in this case: b is replaced by a first, and the printf statement becomes
printf("%i", a);
and after a is replaced by 170, it becomes
printf("%i", 170);
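If the order of the definitions were changed, i.e.
#define a 170
#define b a
the result would be the same. The replacement list of a #define is stored as written and is not itself macro-expanded when the directive is processed, so b is still defined as a. Expansion happens only where the macro is used: b is replaced by a, the result is rescanned, and a is replaced by 170. So, finally, the printf statement becomes
printf("%i",170);
This works for any compiler.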
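Both orders can be checked side by side. In this sketch ALIAS and VALUE play the roles of b and a in the question's order, and ALIAS2 and VALUE2 do the same in the reversed order; longer names are used only so the macros do not clash with ordinary identifiers.

```c
/* Question's order: the alias is defined before the name it refers to. */
#define ALIAS  VALUE
#define VALUE  170

/* Reversed order: the value is defined first. */
#define VALUE2 170
#define ALIAS2 VALUE2

/* Expansion happens here, at the point of use, by repeated rescanning. */
int first_order(void)  { return ALIAS; }
int second_order(void) { return ALIAS2; }
```

Both functions return 170, since a macro only needs to be defined by the time it is expanded, not by the time another macro mentions it.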
To get detailed info you can try gcc -E to analyse your pre-processor output which can easily clear your doubt
#define simply associates a replacement with a name.
Here, 'b' is defined as 'a' and 'a' is defined as '170', so both end up expanding to 170. For simplicity, it can be expressed as follows:
b=a=170
It's just a different way of defining the same thing.
I think you are trying to find out how the source code is processed by the compiler. To know exactly, you have to go through the translation phases. The general steps followed by every compiler (I tried to give every detail, gathered from different blogs and websites) are below:
First Step by Compiler - Physical source file characters are mapped to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
Second Step by Compiler - Each instance of a new-line character and an immediately preceding backslash character is deleted, splicing physical source lines to form logical source lines. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character.
Third Step by Compiler - The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of other white-space characters is retained or replaced by one space character is implementation-defined.
Fourth Step by Compiler - Preprocessing directives are executed and macro invocations are expanded. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively.
Fifth Step by Compiler - Each escape sequence in character constants and string literals is converted to a member of the execution character set.
Sixth Step by Compiler - Adjacent character string literal tokens are concatenated and adjacent wide string literal tokens are concatenated.
Seventh Step by Compiler - White-space characters separating tokens are no longer significant. Preprocessing tokens are converted into tokens. The resulting tokens are syntactically and semantically analyzed and translated.
Last Step - All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.

Resources