K and R exercise 1-24 - c

I am doing programs in The C Programming Language by Kernighan and Ritchie.
I am currently at exercise 1-24 that says:
Write a program to check a C Program for rudimentary syntax errors
like unbalanced parentheses, brackets and braces. Don't forget about
quotes, both single and double, escape sequences, and comments.
I have everything else working, but I don't see how escape sequences would affect these parentheses, brackets and braces.
Why did they warn about escape sequences?

In "\"", there are three double quote characters, but still it's a valid string literal. The middle " is escaped, meaning the outer two balance each other. Similarly, '\'' is a valid character literal.
Parentheses, brackets and braces are not affected, unless of course they appear in a string literal that you don't parse correctly because of an escaped quote.

I'd guess they mean that you need to differentiate between " (which starts or ends a string) and \" (which is a " character, possibly inside a string)
This is important if you're to avoid reporting e.g. strlen("\")"); as having unbalanced parentheses.

The obvious possibility would be an escaped quote inside a string. If you don't take the escape into account, you might think the string ended there. For example: "\")\"". The ) is part of the string literal, so it doesn't count as a mis-matched parenthesis.
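To make the point above concrete, here is a minimal sketch in C of a bracket checker that steps over string and character literals, treating a backslash as escaping the next character. The names `skip_literal` and `brackets_balanced` are made up for this illustration, the nesting limit of 256 is arbitrary, and comments and unterminated literals are deliberately not handled, so this is not a full solution to the exercise.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: skip a quoted literal starting at src[i] (which holds
   the opening quote). Returns the index just past the closing quote, or the
   index of the terminating '\0' if the literal is never closed. */
static size_t skip_literal(const char *src, size_t i)
{
    char quote = src[i++];              /* either '"' or '\'' */
    while (src[i] != '\0' && src[i] != quote) {
        if (src[i] == '\\' && src[i + 1] != '\0')
            i++;                        /* skip the escaped character too */
        i++;
    }
    return src[i] == quote ? i + 1 : i;
}

/* Returns 1 if (), [] and {} are balanced outside literals, 0 otherwise. */
int brackets_balanced(const char *src)
{
    char stack[256];                    /* open brackets seen so far */
    size_t depth = 0;

    assert(src != NULL);
    for (size_t i = 0; src[i] != '\0'; ) {
        char c = src[i];
        if (c == '"' || c == '\'') {
            i = skip_literal(src, i);   /* brackets inside do not count */
            continue;
        }
        if (c == '(' || c == '[' || c == '{') {
            if (depth == sizeof stack)
                return 0;               /* nesting too deep for this sketch */
            stack[depth++] = c;
        } else if (c == ')' || c == ']' || c == '}') {
            char open = c == ')' ? '(' : c == ']' ? '[' : '{';
            if (depth == 0 || stack[--depth] != open)
                return 0;               /* closer without matching opener */
        }
        i++;
    }
    return depth == 0;
}
```

Without the escape check in `skip_literal`, the input strlen("\")"); would appear to end its string at the second quote, and the ) that follows would be counted as a stray closing parenthesis.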

Related

Flex match string literal, escaping line feed

I am using flex to try and match C-like, simplified string literals.
A regular expression as such:
\"([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})*\"
will match all one-line string literals I am interested in.
A string literal cannot contain a non-escaped backslash. A string literal also cannot contain a literal line feed (0x0a) unless it is escaped by a backslash, in which case the line feed and any following spaces and tabs are ignored.
For example, assuming {LF} is an actual line feed and {TAB} an actual tabulation (I could not format it better than that).
In: "This is an example \{LF}{TAB}{TAB}{TAB}of a confusing valid string"
Token: "This is an example of a confusing valid string"
My first idea was to use a starting state, a trailing context and yymore() to match what I want and check for errors giving something like the following:
...
%%
\" { BEGIN STRING; yymore(); }
<STRING>{
\n { /* ERROR HERE! */ }
<<EOF>> { /* ERROR HERE AS WELL */ }
([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})* {
/* String ok up to here*/
yymore();
}
\\\n[ \t]* {
/* Valid inside a string but needs to be ignored! */
yymore();
}
\" { /* Full string matched */ BEGIN INITIAL;}
.|\n { /* Anything else is considered an error */ }
}
%%
...
Is there a way to do what I want in the way I am trying to do it? Or is there some other 'standard' method provided by flex that I just stupidly have not thought of? This does not look to me like an uncommon use case. Should I just match the pieces separately (from the beginning up to the backslash, then from after the whitespace to the end) and concatenate them? This is a bit complicated to do since a string can be split across an arbitrary number of lines using backslashes.
If all you want to do is to recognise a string literal, there's no need for start conditions. You can use some variant of the simple pattern which you'll find in many answers:
["]({normal}|{escape})*["]
(I used macros to make the structure clear, although in practice I would hardly ever use them.)
"Normal" here means any character without special significance in a string: any character other than " (which ends the literal), \ (which starts an escape sequence), or newline (which is usually an error, although some languages allow newlines in strings). In other words, [^"\n\\] (or something similar).
"escape" would be any valid escape sequence. If you didn't want to validate the escape sequence, you could just match a backslash followed by any single character (including newline): \\(.|\n). But since you do seem to want to validate, you'd need to be explicit about the escape sequences you're prepared for:
\\([\n\\btnr"]|x[[:xdigit:]]{2})
But all that only recognises valid string literals. Invalid string literals won't match the pattern, and will therefore fall back to whatever you're using as a fallback rule (matching only the initial "). Since that's practically never what you want, you need to add a second rule which detects error. The easiest way to write the second rule is ["]({normal}|{escape})*, i.e. the valid rule without the final double quote. That will only match erroneous string literals because of (f)lex's maximal munch rule: a valid string literal has a longer match with the valid rule than with the error rule (because the valid rule's match includes the final double quote).
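As an illustration of what the two rules accept, here is a hand-written C sketch that mimics them outside flex. It is not flex code and not how flex works internally; `escape_length` and `scan_string` are names invented for this example. The function consumes `["]({normal}|{escape})*` and then reports whether a closing quote follows, which corresponds to the valid rule winning over the error rule because its match is one character longer. The end of the C string is treated as end of input, which is an assumption of this sketch.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Length of a valid escape sequence at s (which points at a backslash),
   or 0 if what follows the backslash is not one of the accepted escapes. */
static size_t escape_length(const char *s)
{
    if (s[1] != '\0' && strchr("\n\\btnr\"", s[1]) != NULL)
        return 2;                       /* \\([\n\\btnr"]) */
    if (s[1] == 'x' && isxdigit((unsigned char)s[2])
                    && isxdigit((unsigned char)s[3]))
        return 4;                       /* \\x[[:xdigit:]]{2} */
    return 0;
}

/* Stand-in for the two rules. s must point at an opening double quote.
   Returns how many characters were consumed and sets *closed to 1 when the
   closing quote was found (the valid rule matches), or to 0 when scanning
   stopped early (only the error rule matches). */
size_t scan_string(const char *s, int *closed)
{
    size_t i = 1;

    assert(s != NULL && s[0] == '"' && closed != NULL);
    for (;;) {
        char c = s[i];
        if (c == '\\') {
            size_t n = escape_length(s + i);
            if (n == 0)
                break;                  /* invalid escape: stop here */
            i += n;
        } else if (c == '"' || c == '\n' || c == '\0') {
            break;                      /* not a {normal} character */
        } else {
            i++;                        /* {normal}: [^"\n\\] */
        }
    }
    *closed = (s[i] == '"');
    return *closed ? i + 1 : i;
}
```

When `*closed` comes back as 0, the returned length marks exactly where the error rule's match would end, which is a reasonable place to point an error message at.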
In real-life lexical scanners (as opposed to school exercises), it's more common to expect that the lexical scanner will actually resolve the string literal into the actual bytes it represents, by replacing escape sequences with the corresponding character. That is generally done with a start condition, but the individual patterns are more focussed (and there are more of them). For an example of such a parser, you could look at these two answers (and many others):
Flex / Lex Encoding Strings with Escaped Characters
Optimizing flex string literal parsing
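For a rough idea of what "resolving the literal into the bytes it represents" involves, here is a plain C sketch, independent of flex and of the linked answers. The function name `decode_string` and its return convention are assumptions made for this example. It accepts the escapes listed in the question (`\"`, `\?`, `\\`, `\b`, `\t`, `\n`, `\r`, `\xHH`) and drops a backslash-newline together with any spaces and tabs that follow it, as the question requires.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Value of one hexadecimal digit; the caller has checked isxdigit().
   Assumes an ASCII-compatible character set. */
static int hex_value(char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    return tolower((unsigned char)c) - 'a' + 10;
}

/* Decodes the literal at lit (opening quote included) into out as a
   NUL-terminated string. Returns the decoded length, or -1 on a malformed
   literal or when out is too small. */
long decode_string(const char *lit, char *out, size_t outsz)
{
    size_t i = 1, n = 0;

    assert(lit != NULL && out != NULL);
    if (lit[0] != '"')
        return -1;
    while (lit[i] != '"') {
        char c = lit[i++];
        if (c == '\0' || c == '\n')
            return -1;                      /* unterminated literal */
        if (c == '\\') {
            char e = lit[i++];
            switch (e) {
            case 'n': c = '\n'; break;
            case 't': c = '\t'; break;
            case 'r': c = '\r'; break;
            case 'b': c = '\b'; break;
            case '\\': case '"': case '?': c = e; break;
            case 'x':
                if (!isxdigit((unsigned char)lit[i]) ||
                    !isxdigit((unsigned char)lit[i + 1]))
                    return -1;
                c = (char)(hex_value(lit[i]) * 16 + hex_value(lit[i + 1]));
                i += 2;
                break;
            case '\n':                      /* escaped line feed */
                while (lit[i] == ' ' || lit[i] == '\t')
                    i++;
                continue;                   /* contributes no byte */
            default:
                return -1;                  /* unknown escape or end of input */
            }
        }
        if (n + 1 >= outsz)
            return -1;                      /* keep room for the NUL */
        out[n++] = c;
    }
    if (n >= outsz)
        return -1;
    out[n] = '\0';
    return (long)n;
}
```

In a flex scanner the same work is usually spread over several small rules inside a start condition, each appending to a buffer, rather than done in one function after the fact.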

How to detect string during lexical analysis?

I am using some syntax to detect string during lexical analysis
"".*"" return TOK_STRING;
but this is not working.
I think you want
\".*\"
but be aware that . in flex does not match newlines. And, as @chqrlie mentions in a comment, it does match ", so it will match to the end of the last string on the line, and not the current one.
So a better pattern might be:
\"[^"]*\"
([^"] matches any character including newlines, except ").
But then you have no way to include a " in a string. So you will have to decide what syntax that should be. If you wanted to implement SQL style, with doubled quotes representing a single quote inside a string, you could use
\"([^"]|\"\")*\"
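For the possibly more common backslash escape (the backslash has to be excluded from the character class, so that it can only ever start an escape):
\"([^"\\]|\\(.|\n))*\"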
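To show what the SQL-style pattern with doubled quotes accepts, here is a small hand-written C sketch of the same scan. It is only an illustration outside flex; the function name `doubled_quote_length` and the convention of returning 0 for an unterminated string are assumptions of this example.

```c
#include <assert.h>
#include <stddef.h>

/* s must point at an opening double quote. Returns the length of the string
   token, where "" inside the string stands for one literal quote, or 0 if
   the input ends before the string is closed. */
size_t doubled_quote_length(const char *s)
{
    size_t i = 1;

    assert(s != NULL && s[0] == '"');
    for (;;) {
        if (s[i] == '\0')
            return 0;                   /* ran out of input: unterminated */
        if (s[i] == '"') {
            if (s[i + 1] != '"')
                return i + 1;           /* a lone quote closes the string */
            i += 2;                     /* "" is an embedded quote */
        } else {
            i++;                        /* [^"], newlines included */
        }
    }
}
```

The one-character lookahead after each quote is what the regular expression expresses with the `\"\"` alternative.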

Writing Regular Expressions for a C string

I am currently learning about regex and I am trying to figure out how to capture a string in C that does not allow newlines. I have searched around and found answers regarding flex and lex, but I'm trying to learn it as simply as I can to gain a better understanding.
This is an expression that I found while searching, and it appears to be common (I have found it a lot). But I have yet to find a clear explanation of what it means and how it is used.
\"(\\.|[^"])*\"
What this expression means is that there must be a doublequote at the beginning and at the end \", and there will be a sequence of zero or more of the following:
A backslash character \\ followed by any single character ., or
A non-doublequote character [^"]
The second clause is self-explanatory. The first clause is there to treat any single character preceded by a backslash as an escape sequence. This ensures that the expression would capture any of the following strings to the end:
"string \"one\" has embedded doublequotes"
"string two \
is split across \
multiple lines"
"string\tthree\nhas\tembedded\tescape\tcharacters"

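One way to try the expression from C is the POSIX `<regex.h>` interface, sketched below. This assumes a POSIX system, and it uses a slightly stricter variant of the pattern, with `[^"\\]` in place of `[^"]`, so that a backslash can only be consumed as the start of an escape and the result does not depend on how the engine resolves the overlap between the two clauses. `find_string_literal` is a name made up for this example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Finds the first string literal in text. On success stores its start offset
   and length and returns 1; returns 0 if there is no match or on error. */
int find_string_literal(const char *text, size_t *start, size_t *len)
{
    /* As a C literal this spells the regex  "(\\.|[^"\\])*"  */
    static const char pattern[] = "\"(\\\\.|[^\"\\\\])*\"";
    regex_t re;
    regmatch_t m;
    int found;

    assert(text != NULL && start != NULL && len != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;                       /* pattern failed to compile */
    found = regexec(&re, text, 1, &m, 0) == 0;
    if (found) {
        *start = (size_t)m.rm_so;       /* offset of the opening quote */
        *len = (size_t)(m.rm_eo - m.rm_so);
    }
    regfree(&re);
    return found;
}
```

Note the two layers of escaping: each backslash of the regex has to be doubled again inside the C string literal, which is why the pattern looks so noisy.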
Connection Strings ambiguity: quoted value containing both single and double quote

I'm writing a connection string parser/formatter for python, and I'm following these syntax rules:
http://www.connectionstrings.com/formating-rules-for-connection-strings/
All blank characters, except those placed within a value or within quotation marks, are ignored
If a semicolon (;) is part of a value it must be delimited by quotation marks (") Note: I assume it means the whole value
Use a single-quote (') if the value begins with a double-quote (")
Conversely, use the double quote (") if the value begins with a single quote (')
No escape sequences are supported
Names are case iNsEnSiTiVe
If a keyword contains an equal sign (=), it must be preceded by an additional equal sign to indicate that it is part of the keyword.
If a value has preceding or trailing spaces it must be enclosed in single- or double quotes, ie Keyword=" value ", else the spaces are removed.
But what happens if a value needs to be quoted (due to a ; for instance), and it itself contains both ' and "?
For instance:
Key1=Value1;Keyw="ab;'"lol";
If a value starts with a quote, I have to scan the contents as a literal until the next quote of the same type, but it turns out that the next quote is still part of the value, and the string literal ends some characters later. If the value was quoted with single quotes, the problem would be the same.
Similarly, I have found another ambiguity if the keyword contains a = sign. The rule says it should be doubled. This mechanism works fine as long as this equal sign is not at the end of the keyword. Because there we have:
Key1===jkn23;
And there is no way of knowing if key value pair is (Key1, ==jkn23) or (Key1=, jkn23)
Life would be so easy if there was a proper escape system...
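The second ambiguity can be demonstrated from the formatting side. The sketch below is in C rather than Python, purely as an illustration; `format_pair` is a hypothetical helper that applies only the "double every = in the keyword" rule and ignores quoting of values. Two different key/value pairs come out as the same text, so no parser can tell them apart afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical formatter: writes "key=value;" into out, doubling every '='
   inside the key as the quoted rule requires. Returns 0 on success, -1 if
   out is too small. Quoting of values is deliberately left out. */
int format_pair(const char *key, const char *value, char *out, size_t outsz)
{
    size_t n = 0;

    assert(key != NULL && value != NULL && out != NULL);
    for (const char *p = key; *p != '\0'; p++) {
        size_t need = (*p == '=') ? 2 : 1;
        if (n + need >= outsz)
            return -1;
        out[n++] = *p;
        if (*p == '=')
            out[n++] = '=';             /* escape '=' by doubling it */
    }
    size_t vlen = strlen(value);
    if (n + 1 + vlen + 1 >= outsz)      /* '=' + value + ';' + NUL */
        return -1;
    out[n++] = '=';                     /* the real separator */
    memcpy(out + n, value, vlen);
    n += vlen;
    out[n++] = ';';
    out[n] = '\0';
    return 0;
}
```

Calling it with ("Key1", "==jkn23") and with ("Key1=", "jkn23") yields the same text, Key1===jkn23; in both cases.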

What does the \? escape mean in C grammar? [duplicate]

I was reading this and found the escape \?. What exactly does this escape mean? Is it a literal ? inside a string (I still can't see a reason for that), or is this a BNF grammar rule which I don't know about?
It specifies a literal question mark. See http://en.wikipedia.org/wiki/Digraphs_and_trigraphs
The backslash is used as a marker character to tell the compiler/interpreter that the next character has some special meaning. What that next character means is up to the implementation. For example C-style languages use \n to mean newline and \t to mean tab.
The use of the word "escape" really means to temporarily escape out of parsing the text and into another mode where the subsequent character is treated differently.
It exists because of a feature called trigraphs: three-character sequences starting with two question marks that stand in for another character. The escape \? specifies a plain question mark, so you can use it to write two question marks in a row without them being read as the start of a trigraph.
From C11
C11 §6.4.4.4 Character constants, paragraph 4
The double-quote " and question-mark ? are representable either by
themselves or by the escape sequences \" and \?, respectively, but the
single-quote ' and the backslash \ shall be represented, respectively,
by the escape sequences \' and \\.
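A short C sketch of what that paragraph implies in practice; the function name `check_question_escape` is invented for this example. The first check shows that \? and ? denote the same character. The second writes the three characters ? ? = using the escape, so the source never contains the corresponding trigraph and the result is the same whether or not the compiler has trigraph replacement enabled.

```c
#include <assert.h>
#include <string.h>

/* Self-checking demonstration; aborts via assert() if a claim is false. */
void check_question_escape(void)
{
    /* "\?" is just another spelling of "?" */
    assert(strcmp("\?", "?") == 0);

    /* The escape separates the two question marks in the source text, so no
       trigraph is formed and the string holds three characters: ? ? = */
    assert(strlen("?\?=") == 3);
    assert("?\?="[0] == '?' && "?\?="[1] == '?' && "?\?="[2] == '=');
}
```

Written without the escape, the same three characters would be replaced by # on a compiler that processes trigraphs, leaving a one-character string.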

Resources