I'm trying to parse an iCal input file according to RFC 5545.
Specifically:
- Property name
- Optional parameters, each starting with a semicolon ";" and possibly having multiple comma-separated values (parameter values may be double-quoted, in which case they can contain colons, semicolons, and commas)
- Colon ":"
- Property value
Example line:
> ORGANIZER;CN=Obi-WanKenobi;SENTBY="mailto:obiwan#padawan.com":mailto:laowaion#padawan.com
In this case the line would be read into a buffer and parsed (using strtok currently) like this:
ORGANIZER is the property name;
CN=Obi-WanKenobi and SENTBY="mailto:obiwan#padawan.com" are parameters; mailto:laowaion#padawan.com is the property value.
I have no idea where to start. The number of different input cases is almost unlimited, and I haven't been able to figure out an effective algorithm that covers all of them. Is strtok the way to go, or is there another C library that has a more capable parser? I need someone to put me on the right track.
I'd suggest that you start by looking at existing implementations:
in C: libical
in C#: dday.ical
The answers above address your immediate question, but you might hit other issues as you progress through the RFC 5545 standard, and looking at what others have done may be helpful.
You can use flex (a free clone of lex) to write a lexical analyser that is tailored to your task. Ragel is another good tool for this problem.
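If you would rather hand-roll it than pull in a library, a small state machine that tracks "am I inside double quotes?" already covers the quoting rules from the question, which is exactly what strtok cannot do. The following is only a sketch under some assumptions: the names parse_content_line, struct content_line and MAX_PARAMS are made up for illustration, the line has already been unfolded, and comma-separated parameter values are left unsplit.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PARAMS 16   /* hypothetical limit, chosen for the sketch */

/* Hypothetical result type: all pointers refer into the caller's buffer. */
struct content_line {
    char *name;                 /* property name, e.g. "ORGANIZER" */
    char *params[MAX_PARAMS];   /* raw "KEY=value" strings, quotes kept */
    size_t nparams;
    char *value;                /* text after the first unquoted ':' */
};

/* Split one unfolded content line in place by overwriting the
 * separators with '\0'. Returns 0 on success, -1 if no unquoted ':'
 * is found or there are more than MAX_PARAMS parameters. */
int parse_content_line(char *line, struct content_line *out)
{
    int in_quotes = 0;
    char *p;

    assert(line != NULL && out != NULL);

    out->name = line;
    out->nparams = 0;
    out->value = NULL;

    for (p = line; *p != '\0'; p++) {
        if (*p == '"') {
            in_quotes = !in_quotes;          /* toggle on every quote */
        } else if (!in_quotes && *p == ';') {
            if (out->nparams == MAX_PARAMS)
                return -1;
            *p = '\0';                       /* end of name or previous param */
            out->params[out->nparams++] = p + 1;
        } else if (!in_quotes && *p == ':') {
            *p = '\0';                       /* end of name or last param */
            out->value = p + 1;              /* the value may contain ':' */
            return 0;
        }
    }
    return -1;                               /* no unquoted ':' at all */
}
```

Called on the example line from the question, this should give ORGANIZER as the name, two parameters, and mailto:laowaion#padawan.com as the value. Note that the buffer is modified, so pass a writable copy.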
Is there a way to use a string as a delimiter?
We can use characters as delimiters with sscanf().
Example:
I have
char url[]="username=jack&pwd=jack123&email=jack#example.com";
and I can use
char username[100],pwd[100],email[100];
sscanf(url, "username=%[^&]&pwd=%[^&]&email=%[^\n]", username,pwd,email);
It works fine for this string, but for
url="username=jack&jill&pwd=jack&123&email=jack#example.com"
it can't be used. This is to guard against SQL injection, but I want to learn a trick for using
&pwd and &email as delimiters, not necessarily with sscanf.
Update: The solution doesn't necessarily need to be in C. I only want to know of a way to use a string as a delimiter.
Just code your own parsing. In many cases, representing in memory the AST you have parsed is useful. But do specify and document your input language (perhaps using EBNF notation).
Your input language (which you have not defined in your question) seems to be similar to the MIME type application/x-www-form-urlencoded used in HTTP POST requests. So you might look, at least for inspiration, into the source code of free software libraries related to HTTP server processing (like libonion) and HTTP client processing (like libcurl).
You could read an entire line with getline (or perhaps fgets) and then parse it appropriately. sscanf with %n, or strtok, might be useful, but you can also parse the line "manually" (consider writing e.g. your own recursive descent parser). You might also use strchr or strstr.
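To make the strstr suggestion concrete, here is a minimal sketch that treats whole strings such as "&pwd=" and "&email=" as delimiters. The helper name extract_between is hypothetical, not a library function, and the approach still assumes that the marker strings themselves never occur inside a value.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the text found between the literal markers `start` and `end`
 * into dst. If `end` is NULL, copy up to the end of the string.
 * Returns 0 on success, -1 if a marker is missing or dst is too small. */
int extract_between(const char *src, const char *start, const char *end,
                    char *dst, size_t dstsize)
{
    const char *begin, *stop;
    size_t len;

    assert(src != NULL && start != NULL && dst != NULL);

    begin = strstr(src, start);          /* locate the opening marker */
    if (begin == NULL)
        return -1;
    begin += strlen(start);              /* the value starts after it */

    /* The closing marker is searched for only after the opening one. */
    stop = (end != NULL) ? strstr(begin, end) : begin + strlen(begin);
    if (stop == NULL)
        return -1;

    len = (size_t)(stop - begin);
    if (len >= dstsize)                  /* leave room for the '\0' */
        return -1;
    memcpy(dst, begin, len);
    dst[len] = '\0';
    return 0;
}
```

With the second url from the question, extract_between(url, "username=", "&pwd=", username, sizeof username) should yield jack&jill, because the lone '&' inside the value is no longer a delimiter on its own.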
BTW, in many cases, using common textual representations like JSON, YAML, XML can be helpful, and you can easily find many libraries to handle them.
Notice also that strings can be processed as FILE* by using fmemopen and/or open_memstream.
You could use parser generators such as bison (with flex).
In some cases, regular expressions could be useful. See regcomp and friends.
So what you want to achieve is quite easy to do and is standard practice. But you need more than just sscanf, and you may want to combine several things.
Many external libraries (e.g. glib from GTK) provide some parsing. And you should care about UTF-8 (today, you have UTF-8 everywhere).
On Linux, if permitted to do so, you might use GNU readline instead of getline when you want interactive input (with editing abilities and autocompletion). Then take inspiration from the source code of GNU bash (or of RefPerSys, if interested by C++).
If you are unfamiliar with usual parsing techniques, read a good book such as the Dragon Book. Most large programs deal somewhere with parsing, so you need to know how that can be done.
I just read some glibc 2.22 source code (the source file at /sysdeps/posix/readdir.c) and came across this comment:
/* The only version of `struct dirent*' that lacks `d_reclen' is fixed-size. */
(Newline removed.)
The weird emphasis of the type and identifier bugs me. Why not use just single-quotes or des accents graves? Is there some specific reason behind this? Might it be some character set conversion mistake?
I also searched the glibc style guide but didn't find anything concerning code formatting in comments.
Convention.
As you no doubt know, comments are ignored by the C compiler. They make no difference, but the developer who wrote that comment probably preferred their appearance to plain single quotes.
ASCII
Using non-ASCII characters (Unicode) in source code is a relatively new practice (more so where English-authored source code is concerned), and there are still compatibility issues remaining in many programming language implementations. Unicode in program input/output is a different thing entirely (and that isn't perfect either). In program source code, Unicode characters are still quite uncommon, and I doubt we'll see them make much of an appearance in older code like the POSIX header files for some time yet.
Source code filters
There are some source code filters, such as documentation generation packages like the well-known Javadoc, that look for specific comment strings, such as /** to open a comment. Some of these programs may treat your `quoted strings' specially, but that quoting convention is older than most (all?) of the source code filters that might give them special treatment, so that's probably not it.
Backticks for command substitution
There is a strong convention in many scripting languages (as well as StackExchange markdown!) to use backticks (``) to execute commands and include the output, such as in shell scripts:
echo "The current directory is `pwd`"
Which would output something like:
The current directory is /home/type_outcast
This may be part of the reason behind the convention; however, I believe Cristoph has a point as well about the quotes being balanced, similar to properly typeset opening and closing quotation marks.
So, again, and in a word: `convention'.
This goes back to early computer fonts, where backtick and apostrophe were displayed as mirror images. In fact, early versions of the ASCII standard blessed this usage.
Paraphrased from RFC 20, which is easier to get at than the actual standards like USAS X3.4-1968:
Column/Row Symbol Name
2/7 ' Apostrophe (Closing Single Quotation Mark Acute Accent)
6/0 ` Grave Accent (Opening Single Quotation Mark)
This heritage can also be seen in tools like troff, m4 and TeX, which also used this quoting style originally.
Note that syntactically, there is a benefit to having different opening and closing marks: they can be nested properly.
How to read contents from file in ocaml? Specifically how to parse them?
Example :
Suppose file contains (a,b,c);(b,c,d)| (a,b,c,d);(b,c,d,e)|
then after reading this, I want two lists containing l1 = [(a,b,c);(b,c,d)] and l2 = [(a,b,c,d);(b,c,d,e)]
Is there any good tutorial for parsing?
This is a good use case for the menhir parser generator (successor to ocamlyacc). You might want to use ocamllex for lexing. All have good documentation.
You could also use camlp4 or camlp5 stream parsing abilities.
Read also the wikipedia pages on lexing & parsing.
I'd be inclined to use Aurochs, a PEG parser for something like this. There is example code in the repo there.
If you want to specify a grammar and have ocaml generate lexers and parsers for you, check out these ocamllex and ocamlyacc tutorials. I recommend doing it this way. If you really only have one type of token in your file format, then ocamlyacc might be overkill if you can just use the lexer to split the file up into tokens that are considered valid by the grammar.
I am trying to write the semantic phase of a C compiler using lex and yacc. Right now the problem is that if the C program has multiple errors, it stops after the first one. What can I do?
I strongly recommend that you perform the semantic analysis as a separate phase, not as a part of the parsing phase. Use YACC only to build an abstract syntax tree, then traverse this tree in a separate function. That function will have unlimited freedom when it comes to moving around in the tree, as opposed to having to "follow the parsing". As for the specific problem you mentioned, @pmg's comment seems to have pinpointed the problem.
There is no one absolute answer to this. A typical way to handle it is to create a special pattern to read symbols until it gets to (for example) a semicolon at the end of a line, giving a reasonable signal that whatever's after that is intended as a new declaration, definition, statement, etc., and then re-start parsing from that point (retaining enough context to know that, for example, you're currently parsing a function body, so you accept/reject input on that basis).
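In yacc itself this kind of resynchronisation is normally written with the reserved error token (for example a rule that matches error followed by ';'). To show just the idea without a grammar file, here is a hand-written sketch in C for a toy language whose only statement form is IDENT = NUMBER ;. The toy language and the function names are assumptions made up for illustration; a real C front end would of course work on tokens, not characters.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Skip whitespace and return the new position. */
static const char *skip_ws(const char *p)
{
    while (isspace((unsigned char)*p))
        p++;
    return p;
}

/* Try to parse one statement of the toy form  IDENT = NUMBER ;
 * Returns the position just past ';' on success, NULL on failure. */
static const char *parse_stmt(const char *p)
{
    p = skip_ws(p);
    if (!isalpha((unsigned char)*p))
        return NULL;
    while (isalnum((unsigned char)*p))
        p++;
    p = skip_ws(p);
    if (*p != '=')
        return NULL;
    p = skip_ws(p + 1);
    if (!isdigit((unsigned char)*p))
        return NULL;
    while (isdigit((unsigned char)*p))
        p++;
    p = skip_ws(p);
    return (*p == ';') ? p + 1 : NULL;
}

/* Check every statement in src. A bad statement is reported, then the
 * input is discarded up to and including the next ';' so that parsing
 * can continue. Returns the number of errors found. */
int check_program(const char *src)
{
    int errors = 0;
    const char *p;

    assert(src != NULL);

    p = skip_ws(src);
    while (*p != '\0') {
        const char *next = parse_stmt(p);
        if (next == NULL) {
            errors++;
            fprintf(stderr, "syntax error in statement at offset %ld\n",
                    (long)(p - src));
            while (*p != '\0' && *p != ';')   /* panic mode: resync on ';' */
                p++;
            if (*p == ';')
                p++;
        } else {
            p = next;
        }
        p = skip_ws(p);
    }
    return errors;
}
```

The point of the design is that an error is counted once per damaged statement and the parser then carries on, instead of stopping at the first problem.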
I am looking for an easy way to print out a specific function from within some C/C++ source code. For example, assume that test.c has several functions defined within it. I want to be able to print out the source code associated with only one of those functions.
Edit: Sorry, I should be a bit more clear about my end goal. I want the function printed to the screen so I can use wc to get the word count of this specific function. Also, I want this to be part of a command-line tool-chain, so manually opening files and selecting the text isn't an option.
You can run your project through doxygen. It will index all your functions (and classes, structs etc) and can make them available in multiple formats (including PDF and HTML, both easily printable).
What is your end goal with printing out a function?
Do you want to use this as such:
if (error == Foo())
{
    PrintFunction(Foo);
    exit(1);
}
There are easier ways to output where errors are. I could maybe help more if I had a better idea of the problem you are trying to solve with this.
For an idea of how to implement such a PrintFunction():
Have a data struct that wraps around a function and contains: the function's start line, the function's end line, and maybe a pointer to the function.
Write a function that prints out a line of the source file based on its line number. __FILE__ gives you the source file name.
Knowing the start and end of where the function lies in the code, printing the function would be trivial.
This has an annoying pitfall of needing to update the line numbers of where your function lies in the file. But this could maybe be solved with a macro.
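As a rough sketch of the "print lines from start to end" part only, assuming you already know the two line numbers somehow: the function name print_line_range is hypothetical, and it works on already opened streams so that it can be pointed at any file.

```c
#include <assert.h>
#include <stdio.h>

/* Copy lines first..last (1-based, inclusive) from `in` to `out`.
 * Returns the number of lines copied; a final line without a trailing
 * newline is counted as a line too. */
long print_line_range(FILE *in, FILE *out, long first, long last)
{
    long line = 1, copied = 0;
    int c, pending = 0;      /* pending: part of an unterminated line copied */

    assert(in != NULL && out != NULL);

    while (line <= last && (c = getc(in)) != EOF) {
        if (line >= first) {
            putc(c, out);
            pending = 1;
        }
        if (c == '\n') {
            if (line >= first)
                copied++;
            pending = 0;
            line++;
        }
    }
    return copied + pending;
}
```

Opening the file named by __FILE__ with fopen and passing stdout as `out` would print the chosen range to the screen, where it can be piped into wc.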
I generally use the print-region (or preferably print-region-with-faces) from within emacs. However, it is not automated, I have to select the region by hand.
Works in other languages as well.
The following due to Tom Smith in the comments:
(defun print-fn ()
  "Print the function definition around point."
  (interactive)
  (save-excursion
    (mark-defun)
    (print-region (region-beginning) (region-end))))
If you liked this, follow the link to Tom's user-page and see if he deserves your vote...
Making this CW, so I won't benefit from people voting up Tom's good thinking. Cheers.
Edit after clarification: This doesn't seem to be pointed at the OP's actual question. Alas.