Parsing delimited strings using petitparser

I was originally looking to (manually) write a simple tokeniser/parser for my grammar, but one of my requirements means that tokenising is a bit fiddly.
I need to be able to support the notion of delimited strings where the delimiter could be any character. E.g. strings are most likely to be delimited using double quotes (e.g. "hello"), but it could just as easily be /hello/ or ,hello, or, pathologically, xhellox.
So, I started looking at what alternatives there might be to do a combined tokenise/parse... which is when I stumbled across petit parser.
Just curious whether this type of delimited string might be something that could be parsed using PetitParser? Thanks.

There are multiple ways to achieve this with PetitParser. Probably the most elegant is to use a continuation parser:
final delimited = any().callCC((continuation, context) {
  final delimiter = continuation(context).value.toParser();
  final parser = [
    delimiter,
    delimiter.neg().star().flatten(),
    delimiter,
  ].toSequenceParser().pick<String>(1);
  return parser.parseOn(context);
});
The above snippet parses the start character with any() (which can be further restricted, if necessary) and then dynamically creates a delimiter parser from that character. It then combines that delimiter parser into one that parses the start character, the contents (anything that is not the delimiter), and the end character, and uses this new parser to consume the input. This also gives really nice error messages.
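For instance, delimited.parse('"hello"') should succeed with the value hello, and the very same parser accepts /hello/, ,hello, and even xhellox, since the delimiter is simply whatever character comes first.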

Related

Splitting string in C by blank spaces, besides when said blank space is within a set of quotes

I'm writing a simple Lisp in C without any external dependencies (please do not link the BuildYourOwnLisp), and I'm following this guide as a basis to parse the Lisp. It describes two steps in tokenising an S-exp, those steps being:
Put spaces around every parenthesis
Split on white space
The first step is easy enough, I wrote a trivial function that replaces certain substrings with other substrings, but I'm having problems with the second step. In the article it only uses the string "Lisp" in its examples of S-exps; if I were to use strtok() to blindly split by whitespace, any string in an S-exp that had a space within it would become fragmented and interpreted incorrectly by my Lisp. Obviously, a language limited to single-word strings isn't very useful.
How would I write a function that splits a string by white space, besides when the text is in between two double quotes?
I've tried using regex, but from what I can see of the POSIX regex.h library and PCRE, just extracting the matches would be incredibly laborious in terms of the amount of auxiliary code I'd have to write, which would only serve to bloat my codebase. Besides, one of my goals with this project was to use only ANSI C, or, if need be, C99, solely for the sake of portability - fiddling with the POSIX library and the Win32 API would just fatten my code and make moving my Lisp around a nightmare.
When researching this problem I came across this StackOverflow answer; but the accepted answer only sends the tokenised string onto stdout, which isn't useful for me; I'd ideally have the tokens in a char** so that I could then parse them into useful in-memory data structures.
As well as this, the accepted answer on the aforementioned SO question is written to be restricted specifically to my problem - ideally, I'd have a general-purpose function that would allow me to tokenise a string, except when a substring is between two of character x. This isn't a huge deal, it's just that I'd like my codebase to be clean and composable.
You have two delimiters: the space and double quotes.
You can use the strcspn function for that (see the cppreference page on strcspn for an example).
Iterate over the string and look for the delimiters (space and quotes). strcspn returns the length of the initial segment containing none of the delimiters, i.e. the position of the first delimiter (or the string length if none was found). If a space was found, continue looking for both. If a double quote was found, the delimiter set changes from " \"" (space and quote) to "\"" (quote only). When you then hit the quote again, change the delimiter set back to " \"" (space and quote); a minimal sketch follows the example below.
Based on your comment:
Let's say you have a string like
This is an Example.
The output would be
This
is
an
Example.
If instead the string looked like
This "is an" Example.
The output would be
This
is an
Example.
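A minimal C sketch of that strcspn-based approach (the split function and its printed output are purely illustrative; a real tokeniser would copy the tokens into a char** instead of printing them):
#include <stdio.h>
#include <string.h>

/* Split `s` on spaces, but keep anything between double quotes as one token. */
static void split(const char *s)
{
    size_t len;
    while (*s) {
        if (*s == ' ') {                  /* skip runs of spaces */
            s++;
        } else if (*s == '"') {           /* quoted token: delimiter is now just the quote */
            s++;                          /* skip the opening quote */
            len = strcspn(s, "\"");
            printf("%.*s\n", (int)len, s);
            s += len;
            if (*s == '"')                /* skip the closing quote, back to space+quote */
                s++;
        } else {                          /* unquoted token: stop at space or quote */
            len = strcspn(s, " \"");
            printf("%.*s\n", (int)len, s);
            s += len;
        }
    }
}

int main(void)
{
    split("This is an Example.");
    split("This \"is an\" Example.");
    return 0;
}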

Flex match string literal, escaping line feed

I am using flex to try and match C-like, simplified string literals.
A regular expression as such:
\"([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})*\"
will match all one-line string literals I am interested in.
A string literal cannot contain a non-escaped backslash. A string literal also cannot contain a literal line feed (0x0a) unless it is escaped by a backslash, in which case the line feed and any following spaces and tabulations are ignored.
For example, assuming {LF} is an actual line feed and {TAB} an actual tabulation (I could not format it better than that).
In: "This is an example \{LF}{TAB}{TAB}{TAB}of a confusing valid string"
Token: "This is an example of a confusing valid string"
My first idea was to use a start condition, a trailing context and yymore() to match what I want and check for errors, giving something like the following:
...
%%
\" { BEGIN STRING; yymore(); }
<STRING>{
\n { /* ERROR HERE! */ }
<<EOF>> { /* ERROR HERE AS WELL */ }
([^"\\]|\\["?\\btnr]|\\x{HEXDIG}{HEXDIG})* {
/* String ok up to here*/
yymore();
}
\\\n[ \t]* {
/*Vadid inside a tring but needs to be ignored! */
yymore();
}
\" { /* Full string matched */ BEGIN INITIAL;}
.|\n { \* Anything else is considered an error *\ }
}
%%
...
Is there a way to do what I want in the way I am trying to do it? Is there instead some other, maybe 'standard', method provided by flex that I just stupidly have not thought of? This does not look to me like an uncommon use case. Should I just match the strings separately (beginning to before the backslash, after the whitespace to the end) and concatenate them? This is a bit complicated to do since a string can be split across an arbitrary number of lines using backslashes.
If all you want to do is to recognise a string literal, there's no need for start conditions. You can use some variant of the simple pattern which you'll find in many answers:
["]({normal}|{escape})*["]
(I used macros to make the structure clear, although in practice I would hardly ever use them.)
"Normal" here means any character without special significance in a string. In other words, any character other than " (which ends the literal), \ (which starts an escape sequence, or newline (which is usually an error although some languages allow newlines in strings). In other words, [^"\n\\] (or something similar).
"escape" would be any valid escape sequence. If you didn't want to validate the escape sequence, you could just match a backslash followed by any single character (including newline): \\(.|\n). But since you do seem to want to validate, you'd need to be explicit about the escape sequences you're prepared for:
\\([\n\\btnr"]|x[[:xdigit:]]{2})
But all that only recognises valid string literals. Invalid string literals won't match the pattern, and will therefore fall back to whatever you're using as a fallback rule (matching only the initial "). Since that's practically never what you want, you need to add a second rule which detects errors. The easiest way to write the second rule is ["]({normal}|{escape})*, i.e. the valid rule without the final double quote. That will only match erroneous string literals because of (f)lex's maximal-munch rule: a valid string literal has a longer match with the valid rule than with the error rule (because the valid rule's match includes the final double quote).
In real-life lexical scanners (as opposed to school exercises), it's more common to expect that the lexical scanner will actually resolve the string literal into the actual bytes it represents, by replacing escape sequences with the corresponding character. That is generally done with a start condition, but the individual patterns are more focussed (and there are more of them). For an example of such a parser, you could look at these two answers (and many others):
Flex / Lex Encoding Strings with Escaped Characters
Optimizing flex string literal parsing
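Independent of those linked answers, here is a minimal C sketch of what such a decoding step could look like for the escapes in the question (\", \?, \\, \b, \t, \n, \r, \xHH, plus the escaped line feed that swallows any following spaces and tabs); the name decode_string is purely illustrative:
#include <stdio.h>
#include <stdlib.h>
#include <ctype.h>

/* Decode the text between the quotes into `out`; returns the number of
   bytes written, or -1 on a malformed escape. */
static int decode_string(const char *raw, char *out)
{
    int n = 0;
    while (*raw) {
        if (*raw != '\\') { out[n++] = *raw++; continue; }
        raw++;                                /* skip the backslash */
        switch (*raw) {
        case '"': case '?': case '\\': out[n++] = *raw++; break;
        case 'b': out[n++] = '\b'; raw++; break;
        case 't': out[n++] = '\t'; raw++; break;
        case 'n': out[n++] = '\n'; raw++; break;
        case 'r': out[n++] = '\r'; raw++; break;
        case 'x': {                           /* \xHH: exactly two hex digits */
            char hex[3];
            if (!isxdigit((unsigned char)raw[1]) || !isxdigit((unsigned char)raw[2]))
                return -1;
            hex[0] = raw[1]; hex[1] = raw[2]; hex[2] = '\0';
            out[n++] = (char)strtol(hex, NULL, 16);
            raw += 3;
            break;
        }
        case '\n':                            /* escaped line feed: drop it and
                                                 any following spaces/tabs */
            raw++;
            while (*raw == ' ' || *raw == '\t') raw++;
            break;
        default:
            return -1;                        /* unknown escape */
        }
    }
    out[n] = '\0';
    return n;
}

int main(void)
{
    char buf[128];
    if (decode_string("This is an example \\\n\t\t\tof a confusing valid string", buf) >= 0)
        printf("%s\n", buf);   /* This is an example of a confusing valid string */
    return 0;
}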

How to detect string during lexical analysis?

I am using some syntax to detect string during lexical analysis
"".*"" return TOK_STRING;
but this is not working.
I think you want
\".*\"
but be aware that . in flex does not match newlines. And, as @chqrlie mentions in a comment, it does match ", so it will match to the end of the last string, and not the current one.
So a better pattern might be:
\"[^"]*\"
([^"] matches any character including newlines, except ").
But then you have no way to include a " in a string. So you will have to decide what syntax that should be. If you wanted to implement SQL style, with doubled quotes representing a single quote inside a string, you could use
\"([^"]|\"\")*\"
For the possibly more common backslash escape:
\"([^"]|\\(.|\n))*\"

Rigorous definition for CSV file reading/writing

I have written my own CSV reader/writer in C to store records in a character column in an ODBC database. Unfortunately I have discovered many edge cases that trip over my implementation, and I have come to the conclusion my problem is that I have not rigorously defined the rules for CSV. I've read RFC4180, but it seems incomplete and does not resolve ambiguities.
For example, should "" be considered an empty token or a double quote? Do quotes match outside-in or left to right? What do I do with an input string that has unmatched single quotes? The real mess begins when I have nested tokens, which doubles up the escaped quotation characters.
What I really need is a definitive CSV standard that I can implement in code. Every time I feel I have nailed every corner case, I find another one. I am sure this problem has been mulled over and solved many times over by superior minds to mine, has anyone written a rigorous definition of CSV that I can implement in code? I realise C is not the ideal language here, but I don't have a choice about the compiler at this stage; nor can I use a third party library (unless it compiles with C-90). Boost is not an option as my compiler doesn't support C++. I have contemplated ditching CSV for XML, but it seems like overkill for storing a few tokens in a 256 character database record. Anyone made a definitive CSV spec?
There is no standard (see Wikipedia's article, in particular http://en.wikipedia.org/wiki/Comma-separated_values#Lack_of_a_standard), so in order to use CSV, you need to follow the general principle of being conservative in what you generate and liberal in what you accept. In particular:
Do not use quotation marks for blank fields. Simply write an empty field (two adjacent delimiters, or a delimiter in the first/last position of the line).
Quote any field containing a quotation mark, comma, or newline.
Find the most authoritative CSV library you trust and read the source. CSV is not so complicated that you won't be able to understand its rules from a comprehensive reading of a source implementation. I have been happy with Java's opencsv; Perl and most other languages have comparable libraries.
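On the writing side, a minimal C sketch of those two rules (quote only when a field contains a quote, comma, or newline, and double any embedded quote as RFC 4180 describes; write_csv_field is an illustrative name, and the caller is assumed to emit the commas and line endings between fields):
#include <stdio.h>
#include <string.h>

static void write_csv_field(FILE *f, const char *field)
{
    const char *p;
    if (*field == '\0')                       /* blank field: write nothing at all */
        return;
    if (strpbrk(field, "\",\r\n") == NULL) {  /* nothing special: write as-is */
        fputs(field, f);
        return;
    }
    fputc('"', f);                            /* quote, doubling embedded quotes */
    for (p = field; *p; p++) {
        if (*p == '"')
            fputc('"', f);
        fputc(*p, f);
    }
    fputc('"', f);
}

int main(void)
{
    write_csv_field(stdout, "plain");
    fputc(',', stdout);
    write_csv_field(stdout, "has \"quotes\", commas");
    fputc('\n', stdout);                      /* prints: plain,"has ""quotes"", commas" */
    return 0;
}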
According to RFC 4180, fields should be parsed from left to right to correctly interpret a double quote. In some contexts "" is an escaped double quote (when inside a quoted field), otherwise it's either an empty string or two double quotes (when inside an otherwise non-empty field value).
For example, consider a file with 4 records (1 column):
"field""value" CRLF
"" CRLF
field""value CRLF
"field value" extra CRLF
"field""value" - should be read as field"value
"" - should be read as an empty string
field""value - should be read as field""value
"field value" extra - could be read as field value extra or you can reject it
Record 4 is really an invalid field so you can either accept it or reject it.
When you start reading a field, you need to check if the first character read is a double quote or not. If the first character is a double quote, the field value is quoted and you need to read until you find an unescaped closing double quote. In this case you can ignore newlines and comma characters, since the field is quoted - it only ends when you encounter a closing double quote.
If the first character is not a double quote, then all double quotes in the field value should be treated as literal double quotes. In this case you reach the end of the field when you encounter a comma or a newline character.
Based on this, I'd recommend always quoting all fields when you write out records, and writing a proper parser to read them back. This way you can store any data in your CSV files (even multiline text with embedded quotes) and your format will be clear. When reading a CSV file, I'd fail any file that cannot be correctly parsed - if this is a database, you can expect users not to mess with the records manually, unless they know what they're doing.
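A minimal C sketch of that field-reading logic, assuming the record is already in memory (read_csv_field and its error convention are illustrative only):
#include <stdio.h>

/* Read one field starting at `*pp` into `out` and advance `*pp` to the
   comma/newline that follows it; returns 0, or -1 for a malformed field
   such as `"field value" extra`. */
static int read_csv_field(const char **pp, char *out)
{
    const char *p = *pp;
    int n = 0;
    if (*p == '"') {                          /* quoted field */
        p++;
        for (;;) {
            if (*p == '\0') return -1;        /* unterminated quote */
            if (*p == '"') {
                if (p[1] == '"') {            /* "" inside quotes -> literal " */
                    out[n++] = '"';
                    p += 2;
                } else {                      /* closing quote */
                    p++;
                    break;
                }
            } else {
                out[n++] = *p++;              /* commas and newlines are ordinary here */
            }
        }
        if (*p != '\0' && *p != ',' && *p != '\r' && *p != '\n')
            return -1;                        /* trailing junk after the closing quote */
    } else {                                  /* unquoted field: quotes stay literal */
        while (*p != '\0' && *p != ',' && *p != '\r' && *p != '\n')
            out[n++] = *p++;
    }
    out[n] = '\0';
    *pp = p;
    return 0;
}

int main(void)
{
    const char *p = "\"field\"\"value\"";
    char buf[64];
    if (read_csv_field(&p, buf) == 0)
        printf("%s\n", buf);                  /* prints: field"value */
    return 0;
}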

C - clarifying delimiters in strtok

I'm trying to break up a shell command that contains both pipes (|) and the OR symbol (||), represented as characters in an array, with strtok - except, well, the OR operator is also just two pipes next to each other. Specifically, I need to know when |, ;, &&, or || show up in the command.
Is there a way to specify where one delimiter ends and another begins in strtok? As far as I know, the delimiters are usually one character long and you just list them all out with no spaces or anything in between.
Oh and, is a newline a valid delimiter? Or does strtok only do spaces?
Starting from your last question: yes, strtok can use new-line as a delimiter without any problems.
Unfortunately, the answer to your first question isn't nearly so positive. strtok treats all delimiter characters as equal, and does nothing to differentiate between a single delimiter and an arbitrary number of consecutive delimiters. In other words, if you give |&; as the delimiter, it'll treat ||||||||| or &&& or &|&|; all exactly the same way.
I'll go a little further: I'll go out on a limb and state as a fact that strtok simply isn't suitable for breaking a shell command into constituent pieces -- I'm pretty sure there's just no way to use it for this job that will produce usable results.
In particular, you don't have anything that just acts as a delimiter. For your purposes, the &, |, and || are tokens of their own. In a string being supplied to the shell, you don't necessarily have anything that qualifies as a delimiter the way strtok "thinks" of them.
strtok is oriented toward tokens that are separated by delimiters that are nothing except delimiters. As strtok reads the tokens, the delimiters between them are completely ignored (and, destroyed, for that matter). For the shell, a string like a|b is really three tokens -- you need the a, the | and the b -- there's nothing between them that strtok can safely overwrite and/or ignore -- but that's a requirement for how strtok works. For it to deliver you the first a, it overwrites the next character (the | in this case) with a '\0'. Then it has no way of recovering that pipe to tell you what the next token should be.
I think you probably need a greedy tokenizer instead -- i.e., one that builds the longest string of characters that can be a token, and stops when it encounters a character that can't be part of the current token. When you ask for the next token, it starts from the first character after the end of the previous token, without (necessarily) skipping/ignoring anything (though, of course, if it encounters something like white-space that hasn't been quoted somehow, it'll probably skip over it).
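A minimal sketch of such a greedy tokeniser in C (the operator set and the name next_token are illustrative, and quoting is left out):
#include <stdio.h>
#include <string.h>

/* Return the length of the next token and set `*start` to where it begins;
   `*pp` is advanced past the token. Nothing is overwritten or thrown away. */
static size_t next_token(const char **pp, const char **start)
{
    const char *p = *pp;
    size_t len = 0;
    while (*p == ' ' || *p == '\t' || *p == '\n')
        p++;                                  /* skip unquoted whitespace */
    *start = p;
    if (*p == '\0') {
        /* no more tokens: len stays 0 */
    } else if (*p == ';') {
        len = 1;
    } else if (*p == '|' || *p == '&') {
        len = (p[1] == p[0]) ? 2 : 1;         /* greedy: prefer || and && */
    } else {
        len = strcspn(p, " \t\n|&;");         /* a word runs until whitespace or an operator */
    }
    *pp = p + len;
    return len;
}

int main(void)
{
    const char *cmd = "ls -l | wc || echo fail; exit";
    const char *tok;
    size_t len;
    while ((len = next_token(&cmd, &tok)) > 0)
        printf("[%.*s]\n", (int)len, tok);    /* [ls] [-l] [|] [wc] [||] ... */
    return 0;
}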
For your purpose, strtok() is not the correct tool to use; it destroys the delimiter, so you can't tell what was at the end of a token if someone types ls|wc. It could have been a pipe, a semi-colon, an ampersand, or a space. Also, it treats multiple adjacent delimiters as part of a single delimiter.
Look at strspn() and strcspn(); both are in standard C and are non-destructive relatives of strtok().
strtok() is quite happy to use newline as a delimiter; in fact, any character except '\0' can be used as one of the delimiters.
There are other reasons for being extremely cautious about using strtok(), such as thread safety and the fact that it is highly unwise to use it in library code.
strtok() is a basic, all-purpose parsing function. For more advanced parsing, I don't recommend its use.
For example, in the case of '|', you really need to inspect the next character to determine if you've found '|' or '||'.
I've done a huge amount of parsing of this nature, including writing a small language interpreter. It's not that hard if you break it up into smaller tasks. But my advice is to write your own parsing routine in this case.
And, yes, a newline character is a valid delimiter.
