What are the quotes of string literals called in lexical analysis? - lexical-analysis

In C-like languages, string literals are arbitrary characters surrounded by quotes (e.g. "example"). But what is the " itself called? I know that it is a terminal, or terminal character, but it should be some specific...

Related

evaluate expression with backslash in middle of statement

Suppose in C, I have the following code:
i=5 \
+6;
If I print i, it gives me 11.
I do not understand how the above code executes correctly. At first glance, I expected a compiler error because of the unrecognized token \. Can somebody explain the logic? Is it related to maximal munch logic?
A backslash at the end of a line tells the compiler to ignore the new-line character.
It is a way of formatting lines to be readable for humans without interrupting the source text. E.g., if you have a long string enclosed in quotation marks, you can use a backslash to continue the string on a new line without inserting a new-line character in the string.
(This was more useful before the C standard added the property that adjacent strings, such as "abc" "def", are concatenated. Now you can put strings on consecutive lines, and they will be concatenated. Prior to that, you had to use the backslash to do it.)
Nowadays the most common use of the backslash is, as heretolearn points out, to continue preprocessor macro definitions. Unlike regular C statements, preprocessor statements must be on a single line. However, some preprocessor macro definitions are quite long. To format them (somewhat) nicely, a definition is spread over multiple physical lines, but the backslash makes them into one line for the compiler (including the preprocessor).
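As a small illustration of the macro case, here is a sketch of a definition spread over several physical lines. The macro CLAMP and the wrapper clamp_percent are made-up names for this example, not something from the question.

```c
#include <assert.h>

/* Hypothetical macro spread over several physical lines.
   Each trailing backslash must be the very last character on its line;
   after splicing, the preprocessor sees one logical line. */
#define CLAMP(value, low, high) \
    ((value) < (low)  ? (low)   \
   : (value) > (high) ? (high)  \
   : (value))

/* Small wrapper so the macro can be exercised like a function. */
static int clamp_percent(int value)
{
    int clamped = CLAMP(value, 0, 100);
    assert(clamped >= 0 && clamped <= 100);
    return clamped;
}
```

Removing any one of the backslashes ends the definition early, and the leftover lines are then compiled as ordinary (and here invalid) C code.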
A backslash followed by a new-line character is completely removed from the source text by the compiler, unlike a new-line character by itself. So the source text:
abc\
def
is equivalent to the single identifier abcdef, not abc def. You can use it in the middle of any operator or other language construction except trigraph sequences (trigraph sequences, such as ??=, are converted to replacement characters, such as #, before the backslash-new-line processing):
MyStructureVariable-\
>MemberName
IncrementMe+\
+
However, do not do that. Use it reasonably.
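To see the splicing at work, here is a minimal sketch (the function name spliced_tokens is made up for illustration). It should compile, although a compiler may warn about the style.

```c
#include <assert.h>

/* The backslash-newline pairs below are deleted before tokenization,
   so each split construct is read as a single token. The continuation
   lines start in column 0 on purpose: leading spaces would end up in
   the middle of the token and break it. */
static int spliced_tokens(void)
{
    int counter = 5;

    /* Read by the compiler as: counter++; */
    counter+\
+;

    assert(counter == 6);

    /* Read by the compiler as: return counter; */
    ret\
urn counter;
}
```

The same mechanism explains the original example: i=5 followed by a backslash-newline and +6; is simply i=5 +6; by the time the tokenizer sees it.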
The practice of escaping the newline at the end of a line is indeed to mark the continuation of the statement onto the next line. Apparently that was needed in old C compilers. There is only one place where I'm sure it is still needed, and that is in macro definitions of functions, something that is generally frowned upon in C++.
A continued line is a line which ends with a backslash, '\'. The
backslash is removed and the following line is joined with the current
one. No space is inserted, so you may split a line anywhere, even in
the middle of a word. (It is generally more readable to split lines
only at white space.)
The trailing backslash on a continued line is commonly referred to as
a backslash-newline.
If there is white space between a backslash and the end of a line,
that is still a continued line. However, as this is usually the result
of an editing mistake, and many compilers will not accept it as a
continued line, GCC will warn you about it.
Reference

two strings separated by blank being concatenated automatically

I just found something very interesting, which I stumbled on through a typo. Here's a very simple code sample:
printf("A" "B");
The result would be
$> AB
Can someone explain how this happens?
As a part of the C standard, string literals that are next to one another are concatenated:
For C (quoting C11; C99 has something similar in 6.4.5p4):
(C11, 6.4.5p5) "In translation phase 6, the multibyte character
sequences specified by any sequence of adjacent character and
identically-prefixed string literal tokens are concatenated into a
single multibyte character sequence."
C++ has a similar standard.
This is standard behaviour and can be very useful when splitting a very long string constant over multiple lines.
This is string literal concatenation, part of the C standard. Any two or more consecutive string literals are combined into one.
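A minimal sketch of the "long string over multiple lines" use case follows. The function name usage_text and its text are invented for the example.

```c
#include <assert.h>
#include <string.h>

/* The three adjacent literals are merged in translation phase 6, so
   text is one array holding one string. Nothing is inserted between
   the pieces except what is written inside the quotes. */
static const char *usage_text(void)
{
    static const char text[] =
        "usage: tool [options] file\n"
        "  -h  show help\n"
        "  -v  verbose output\n";

    /* One terminating null for the whole merged literal. */
    assert(sizeof text == strlen(text) + 1);
    return text;
}
```

Because the merge happens at compile time, there is no run-time cost compared with writing the string as one long literal.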

Matching words in ANSI C

How can I match a word (1-n characters) in ANSI C? (In addition: what is the pattern to match a constant in C source code?)
I tried reading the file and passing it to regexec() (regex.h).
Problem: The tool I'm writing should be able to read source code and find
all used constants (#define) to check if they're defined.
The pattern used for testing is: [a-zA-Z_0-9]{1,}. But this would match words such as the "h" in "test.h".
Identifiers must start with a letter or underscore, so the pattern is
[A-Za-z_][A-Za-z0-9_]*
I know of no syntactic difference between C and preprocessor identifiers. There is a convention to use upper case for preprocessor and lowercase for C identifiers, but no actual requirement. Unless defines are guaranteed to use a distinct naming convention you would basically have to find every identifier in the source file and any included files and sort them into preprocessor identifiers, C identifiers and undeclared identifiers.
From the GCC manual:
Preprocessing tokens fall into five broad classes: identifiers, preprocessing numbers, string literals, punctuators, and other. An identifier is the same as an identifier in C: any sequence of letters, digits, or underscores, which begins with a letter or underscore. Keywords of C have no significance to the preprocessor; they are ordinary identifiers. You can define a macro whose name is a keyword, for instance. The only identifier which can be considered a preprocessing keyword is defined.
Another option besides doing regex searches over C source code would be to use a preprocessor library like Boost Wave or perhaps something like Coan instead of starting from scratch.
Here is the lexer grammar and the parser grammar (in flex and bison format, respectively) for the entire C language. In particular, the part relevant to identifiers is:
D [0-9]
L [a-zA-Z_]
{L}({L}|{D})* { count(); return(check_type()); }
So the id can start with any uppercase or lowercase letter or an underscore, and then have more uppercase or lowercase letters, underscores, and numbers. I believe it doesn't match parts of file names because they're quoted and it handles quotes separately.
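If you would rather not pull in regex.h or a full lexer, the same idea can be sketched by hand. The helper next_identifier below is a hypothetical name, and the sketch assumes the default "C" locale (so isalpha means A-Z and a-z). It skips string literals, which is why "test.h" produces no matches, but it does not handle comments or character constants.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copies the next identifier found at or after *cursor into out
   (capacity out_size, truncating if needed) and advances *cursor past it.
   Returns 1 if an identifier was found, 0 at end of input. */
static int next_identifier(const char **cursor, char *out, size_t out_size)
{
    const char *p = *cursor;
    assert(out_size > 0);

    while (*p != '\0') {
        if (*p == '"') {                        /* skip a string literal */
            p++;
            while (*p != '\0' && *p != '"') {
                if (*p == '\\' && p[1] != '\0')
                    p++;                        /* skip the escaped character */
                p++;
            }
            if (*p == '"')
                p++;
        } else if (isalpha((unsigned char)*p) || *p == '_') {
            size_t len = 0;                     /* [A-Za-z_][A-Za-z0-9_]* */
            while (isalnum((unsigned char)*p) || *p == '_') {
                if (len + 1 < out_size)
                    out[len++] = *p;
                p++;
            }
            out[len] = '\0';
            *cursor = p;
            return 1;
        } else if (isdigit((unsigned char)*p)) {
            /* skip a number such as 0x1F or 10UL so that its letters
               are not reported as identifiers */
            while (isalnum((unsigned char)*p) || *p == '_' || *p == '.')
                p++;
        } else {
            p++;                                /* punctuation, white space */
        }
    }
    *cursor = p;
    return 0;
}
```

Call it in a loop until it returns 0, then sort the names into macros, C identifiers and undeclared identifiers as described above.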

Structure of C language

Why does this work
printf("Hello"
"World");
Whereas
printf("Hello
""World");
does not?
ANSI C concatenates adjacent strings, that's ok... but this is a different thing.
Does this have something to do with the C language parser or something?
Thanks
The string must be terminated before the end of the line. This is a good thing. Otherwise, a forgotten close-quote could prevent subsequent lines of code from executing.
This could cost significant time to debug. These days syntax coloring would provide a clue, but in the early years there were monochrome displays.
You can't put a raw new line inside a string literal. This was a choice made by the designers of C. IMO it's a good feature though.
You can however do this:
printf("Hello\
""World");
Which gives the same results.
The C language is defined in terms of tokens, and one of the tokens is a string literal (in standardese: a string-literal, whose body is an s-char-sequence). A string literal starts and ends with an unescaped double quote, and its s-char-sequence must not contain an unescaped newline.
Relevant standard (C99) quote:
> Syntax
>     string-literal:
>         " s-char-sequence(opt) "
>         L" s-char-sequence(opt) "
>     s-char-sequence:
>         s-char
>         s-char-sequence s-char
>     s-char:
>         any member of the source character set
>             except the double-quote ", backslash \,
>             or new-line character
>         escape-sequence
Escaped newlines, however, are removed in an early translation phase called line splicing, so the compiler never gets to interpret them. Here's the relevant standard (C99) quote:
The precedence among the syntax rules of translation is specified by the following phases.
1. Physical source file multibyte characters are mapped, in an implementation-defined manner, to the source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences are replaced by corresponding single-character internal representations.
2. Each instance of a backslash character (\) immediately followed by a new-line character is deleted, splicing physical source lines to form logical source lines. Only the last backslash on any physical source line shall be eligible for being part of such a splice. A source file that is not empty shall end in a new-line character, which shall not be immediately preceded by a backslash character before any such splicing takes place.
3. The source file is decomposed into preprocessing tokens and sequences of white-space characters (including comments). A source file shall not end in a partial preprocessing token or in a partial comment. Each comment is replaced by one space character. New-line characters are retained. Whether each nonempty sequence of white-space characters other than new-line is retained or replaced by one space character is implementation-defined.
4. Preprocessing directives are executed, macro invocations are expanded, and _Pragma unary operator expressions are executed. If a character sequence that matches the syntax of a universal character name is produced by token concatenation (6.10.3.3), the behavior is undefined. A #include preprocessing directive causes the named header or source file to be processed from phase 1 through phase 4, recursively. All preprocessing directives are then deleted.
5. Each source character set member and escape sequence in character constants and string literals is converted to the corresponding member of the execution character set; if there is no corresponding member, it is converted to an implementation-defined member other than the null (wide) character.
6. Adjacent string literal tokens are concatenated.
7. White-space characters separating tokens are no longer significant. Each preprocessing token is converted into a token. The resulting tokens are syntactically and semantically analyzed and translated as a translation unit.
8. All external object and function references are resolved. Library components are linked to satisfy external references to functions and objects not defined in the current translation. All such translator output is collected into a program image which contains information needed for execution in its execution environment.
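The line-splicing phase (phase 2) is simple enough to sketch in a few lines of C. This is only an illustration, not how any particular compiler implements it. The function name splice_lines is made up, and the sketch assumes line endings have already been normalized to '\n' (the job of phase 1).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copies src to dst, deleting every backslash that is immediately
   followed by a new-line, together with that new-line.
   dst must have room for strlen(src) + 1 bytes.
   Returns the length of the spliced text. */
static size_t splice_lines(const char *src, char *dst)
{
    size_t out = 0;
    assert(src != NULL && dst != NULL);

    for (size_t i = 0; src[i] != '\0'; i++) {
        if (src[i] == '\\' && src[i + 1] == '\n') {
            i++;                /* skip the new-line as well */
            continue;
        }
        dst[out++] = src[i];    /* everything else is kept as-is */
    }
    dst[out] = '\0';
    return out;
}
```

Run over the second printf example from the question, this yields printf("Hello""World"); on one logical line, which phase 3 then tokenizes as two adjacent string literals. Without the backslash, the new-line survives into phase 3 and the first literal is unterminated.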

Is there a tokenizer function written in C that can do what boost::escaped_list_separator does?

I am looking for a standalone tokenizer written in C that can parse and split strings on user-supplied separator characters such as tabs, semicolons, commas, etc.
Similar to what this boost library function does
http://www.boost.org/doc/libs/1_46_1/libs/tokenizer/escaped_list_separator.htm
The tokens may be double-quoted, the separators may be embedded in quoted tokens, and empty tokens are not skipped.
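You can use strtok from string.h as a starting point, but you will certainly have to adapt it to handle these specifics: strtok treats a run of consecutive separators as a single separator, so it skips empty tokens, and it has no notion of quoting. As it is C, be careful about null-terminated strings.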
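A minimal sketch of such an adaptation is below. It is not a port of boost::escaped_list_separator: the name next_field is hypothetical, the interface is loosely modelled on strsep(), and backslash escapes are not handled at all, only double quotes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the next field of *cursor, split on any character in seps,
   and advances *cursor. Returns NULL when no fields are left.
   - The buffer is modified in place (like strtok).
   - Empty fields are returned as "" rather than skipped.
   - Inside double quotes, separators are ordinary characters; the
     quotes themselves are removed from the result. */
static char *next_field(char **cursor, const char *seps)
{
    char *start, *read, *write;
    int in_quotes = 0;

    assert(cursor != NULL && seps != NULL);
    if (*cursor == NULL)
        return NULL;

    start = read = write = *cursor;
    for (; *read != '\0'; read++) {
        if (*read == '"') {
            in_quotes = !in_quotes;     /* drop the quote itself */
        } else if (!in_quotes && strchr(seps, *read) != NULL) {
            break;                      /* unquoted separator ends the field */
        } else {
            *write++ = *read;           /* compact the field in place */
        }
    }
    /* Compute the next position before terminating: write may equal read. */
    *cursor = (*read == '\0') ? NULL : read + 1;
    *write = '\0';
    return start;
}
```

Given a writable buffer containing a,"b,c",,d and the separator string ",", repeated calls return a, then b,c, then an empty string, then d, then NULL. The state lives in the caller's cursor rather than in a hidden static variable, so unlike strtok it can be used on several strings at once.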
