I'm writing a connection string parser/formatter for python, and I'm following these syntax rules:
http://www.connectionstrings.com/formating-rules-for-connection-strings/
All blank characters, except those placed within a value or within quotation marks, are ignored
If a semicolon (;) is part of a value it must be delimited by quotation marks (") Note: I assume it means the whole value
Use a single-quote (') if the value begins with a double-quote (")
Conversely, use the double quote (") if the value begins with a single quote (')
No escape sequences are supported
Names are case iNsEnSiTiVe
If a keyword contains an equal sign (=), it must be preceded by an additional equal sign to indicate that it is part of the keyword.
If a value has preceding or trailing spaces it must be enclosed in single- or double quotes, ie Keyword=" value ", else the spaces are removed.
But what happens if a value needs to be quoted (due to a ; for instance), and it contains itself both ' and "??
For instance:
Key1=Value1;Keyw="ab;'"lol";
If a value starts with quotes, I have to scan the contents as literal until the next quote of the same type, but it turns out that the next quote it is still part of the value, and the string literal ends some characters later. If the value was quoted with single quotes, the problem would be the same.
Similarly, I have found another ambiguity if the keyword contains a = sign. The rule says it should be doubled. This mechanism works fine as long as this equal sign is not at the end of the keyword. Because there we have:
Key1===jkn23;
And there is no way of knowing if key value pair is (Key1, ==jkn23) or (Key1=, jkn23)
Life would be so easy if there was a proper escape system...
Related
This is the instruction and error message I received.
I just tried to write a line which possess two slace mark.
Single quotes used to denote character literals. More than one character is a string literal.So given data should be placed inside double quotes instead of single quotes.
And to display a string with escape characters/line breaks, have to add '#' symbol in front of a string.This marks the string as a verbatim string literal.
Find updated code here
I am using flex to try and match C-like, simplified string literals.
A regular expression as such:
\"([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})*\"
will match all one-line string literals I am interested in.
A string literal cannot contain a non-escaped backslash. A string literal also cannot contain a literal line feed (0x0a) unless it is escaped by a backslash, in which case the line feed and any following spaces and tabulations are ignored..
For example, assuming {LF} is an actual line feed and {TAB} an actual tabulation (I could not format it better than that).
In: "This is an example \{LF}{TAB}{TAB}{TAB}of a confusing valid string"
Token: "This is an example of a confusing valid string"
My first idea was to use a starting state, a trailing context and yymore() to match what I want and check for errors giving something like the following:
...
%%
\" { BEGIN STRING; yymore(); }
<STRING>{
\n { /* ERROR HERE! */ }
<<EOF>> { /* ERROR HERE AS WELL */ }
([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})* {
/* String ok up to here*/
yymore();
}
\\\n[ \t]* {
/*Vadid inside a tring but needs to be ignored! */
yymore();
}
\" { /* Full string matched */ BEGIN INITIAL;}
.|\n { \* Anything else is considered an error *\ }
}
%%
...
Is there a way to do what I want in the way I am trying to do it? Is there instead any other 'standard' maybe method provided by flex that I just stupidly have not though of? This does not look to me like an uncommon use case. Should I just match the strings separately (beginning to before , after whitespace to end) and concatenate them. This is a bit complicated to do since a string can be decomposed into an arbitrary number of lines using backslashes.
If all you want to do is to recognise a string literal, there's no need for start conditions. You can use some variant of the simple pattern which you'll find in many answers:
["]({normal}|{escape})*["]
(I used macros to make the structure clear, although in practice I would hardly ever use them.)
"Normal" here means any character without special significance in a string. In other words, any character other than " (which ends the literal), \ (which starts an escape sequence, or newline (which is usually an error although some languages allow newlines in strings). In other words, [^"\n\\] (or something similar).
"escape" would be any valid escape sequence. If you didn't want to validate the escape sequence, you could just match a backslash followed by any single character (including newline): \\(.|\n). But since you do seem to want to validate, you'd need to be explicit about the escape sequences you're prepared for:
\\([\n\\btnr"]|x[[:xdigit:]]{2})
But all that only recognises valid string literals. Invalid string literals won't match the pattern, and will therefore fall back to whatever you're using as a fallback rule (matching only the initial "). Since that's practically never what you want, you need to add a second rule which detects error. The easiest way to write the second rule is ["]({normal}|{escape})*, i.e. the valid rule without the final double quote. That will only match erroneous string literals because of (f)lex's maximal munch rule: a valid string literal has a longer match with the valid rule than with the error rule (because the valid rule's match includes the final double quote).
In real-life lexical scanners (as opposed to school exercises), it's more common to expect that the lexical scanner will actually resolve the string literal into the actual bytes it represents, by replacing escape sequences with the corresponding character. That is generally done with a start condition, but the individual patterns are more focussed (and there are more of them). For an example of such a parser, you could look at these two answers (and many others):
Flex / Lex Encoding Strings with Escaped Characters
Optimizing flex string literal parsing
I am using some syntax to detect string during lexical analysis
"".*"" return TOK_STRING;
but this is not working.
I think you want
\".*\"
but be aware that . in flex does not match newlines. And, as #chqrlie mentions in a comment, it does match ", so it will match to the end of the last string, and not the current one.
So a better pattern might be:
\"[^"]*\"
([^"] matches any character including newlines, except ").
But then you have no way to include a " in a string. So you will have to decide what syntax that should be. If you wanted to implement SQL style, with doubled quotes representing a single quote inside a string, you could use
\"([^"]|\"\")*\"
For the possibly more common backslash escape:
\"([^"]|\\(.|\n))*\"
A comment (which should probably be submitted as an answer) has the code
sscanf(string, "<title>%[^<]</title>", extracted_string);
Running the code seems to copy the text between the <title> tags to extracted_string, but I cannot find any references to a caret in the printf family, either in the man pages or elsewhere online.
Can someone point me to a resource that explains the use of %[^<], and other similar syntax, in the sscanf() family?
From the C11 standard document, chapter ยง7.21.6.2, Paragraph 12, conversion specifiers, (emphasis mine)
[
Matches a nonempty sequence of characters from a set of expected characters
(the scanset).
....
The conversion specifier includes all subsequent characters in the format
string, up to and including the matching right bracket (]). The characters
between the brackets (the scanlist) compose the scanset, unless the character
after the left bracket is a circumflex (^), in which case the scanset contains all characters that do not appear in the scanlist between the circumflex and the
right bracket.
A draft version of the standard, found online.
It means match anything that is not a <, it's not a good idea to do that without specifying the maximum destination buffer length, if your destination buffer can hold say 100 characters, then
char extracted_string[100];
sscanf(string, "<title>%99[^<]</title>", extracted_string);
would be a better solution.
Using strstr() for this purpose allows you to actually make extracted_string dynamic.
this link explains the [ and ^ usage in scanf family of functions
(emphasis mine)
http://www.cdf.toronto.edu/~ajr/209/notes/printf.html
[
Matches a nonempty sequence of characters from the specified set of accepted characters; the next pointer must be a pointer to char, and there must be enough room for all the characters in the string, plus a terminating null byte. The usual skip of leading white space is suppressed. The string is to be made up of characters in (or not in) a particular set; the set is defined by the characters between the open bracket [ character and a close bracket ] character. The set excludes those characters if the first character after the open bracket is a circumflex (^). To include a close bracket in the set, make it the first character after the open bracket or the circumflex; any other position will end the set. The hyphen character - is also special; when placed between two other characters, it adds all intervening characters to the set. To include a hyphen, make it the last character before the final close bracket. For instance, [^]0-9-] means the set "everything except close bracket, zero through nine, and hyphen". The string ends with the appearance of a character not in the (or, with a circumflex, in) set or when the field width runs out.
I am doing programs in The C Programming Language by Kernighan and Ritchie.
I am currently at exercise 1-24 that says:
Write a program to check a C Program for rudimentary syntax errors
like unbalanced parentheses, brackets and braces. Don't forget about
quotes, both single and double, escape sequences, and comments.
I have done everything well... But I am not getting how escape sequences would affect these parentheses, brackets and braces?
Why did they warned about escape sequences?
In "\"", there are three double quote characters, but still it's a valid string literal. The middle " is escaped, meaning the outer two balance each other. Similarly, '\'' is a valid character literal.
Parentheses, brackets and braces are not affected, unless of course they appear in a string literal that you don't parse correctly because of an escaped quote.
I'd guess they mean that you need to differentiate between " (which starts or ends a string) and \" (which is a " character, possibly inside a string)
This is important if you're to avoid reporting e.g. strlen("\")"); as having unbalanced parentheses.
The obvious possibility would be an escaped quote inside a string. If you don't take the escape into account, you might think the string ended there. For example: "\")\"". The ) is part of the string literal, so it doesn't count as a mis-matched parenthesis.