Parsing an iCalendar file in C - c

I am looking to parse iCalendar files using C. I already have a structure set up and the file reading in place, and I want to parse the file line by line into its components.
For example I would need to parse something like the following:
UID:uid1@example.com
DTSTAMP:19970714T170000Z
ORGANIZER;CN=John Doe;SENT-BY="mailto:smith@example.com":mailto:john.doe@example.com
CATEGORIES:Project Report, XYZ, Weekly Meeting
DTSTART:19970714T170000Z
DTEND:19970715T035959Z
SUMMARY:Bastille Day Party
Here are some of the rules:
The first word on each line is the property name
The property name will be followed by a colon (:) or a semicolon (;)
If it is a colon, the property value is everything to the right of the colon, up to the end of the line
A further layer of complexity is added here, as a comma-separated list of values is allowed, and these would then be stored in an array. So the CATEGORIES line, for example, would have 3 elements in its array of values
If a semicolon follows the property name, then optional parameters follow
The optional parameter format is ParamName=ParamValue. Again, a comma-separated list is supported here.
There can be more than one optional parameter, as seen on the ORGANIZER line. There would just be another semicolon followed by the next parameter and value.
And to throw in yet another wrench, quotation marks are allowed in the values. Anything inside quotes needs to be treated as part of the value instead of as part of the syntax. So a semicolon inside quotes does not mean that another parameter follows; it is part of the value.
I was going about this using strchr() and strtok() and have got some basic elements from that, however it is getting very messy and unorganized and does not seem to be the right way to do this.
How can I implement such a complex parser with the standard C libraries (or the POSIX regex library)? (not looking for whole solution, just starting point)

This answer is supposing that you want to roll your own parser using Standard C. In practice it is usually better to use an existing parser because they have already thought of and handled all the weird things that can come up.
My high level approach would be:
Read a line
Pass pointer to start of this line to a function parse_line:
Use strcspn on the pointer to identify the location of the first : or ; (aborting if no marker found)
Save the text so far as the property name
While the parsing pointer points to ;:
Call a function extract_name_value_pair passing address of your parsing pointer.
That function will extract and save the name and value, and update the pointer to point to the ; or : following the entry. Of course this function must handle quote marks in the value and the fact that there might be a ; or : inside the value
(At this point the parsing pointer is always on :)
Pass the rest of the string to a function parse_csv which will look for comma-separated values (again, being aware of quote marks) and store the results it finds in the right place.
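As a rough illustration of the outline above, here is a minimal sketch of parse_line. The struct names, MAX_PARAMS, and the helper skip_param are assumptions made up for this example. skip_param is a much-simplified stand-in for extract_name_value_pair: it only finds where a parameter ends and does not split the name from the value. The property value is also left unsplit rather than handed to parse_csv, and nothing is copied or allocated.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_PARAMS 8   /* arbitrary limit chosen for this sketch */

/* A slice [start, start + len) of the original line; nothing is copied. */
struct slice { const char *start; size_t len; };

struct content_line {
    struct slice name;
    struct slice params[MAX_PARAMS];  /* each one is "NAME=VALUE", unsplit */
    size_t nparams;
    struct slice value;               /* everything after the ':' */
};

/* Advance to the next unquoted ';' or ':' (or to the terminating '\0').
 * Characters inside double quotes are treated as part of the value. */
static const char *skip_param(const char *p)
{
    int in_quotes = 0;
    for (; *p != '\0'; p++) {
        if (*p == '"')
            in_quotes = !in_quotes;
        else if (!in_quotes && (*p == ';' || *p == ':'))
            break;
    }
    return p;
}

/* Returns 0 on success, -1 if the line is malformed. */
int parse_line(const char *line, struct content_line *out)
{
    assert(line != NULL && out != NULL);

    size_t n = strcspn(line, ":;");   /* length of the property name */
    if (n == 0 || line[n] == '\0')
        return -1;                    /* empty name, or no ':' / ';' marker */
    out->name.start = line;
    out->name.len = n;
    out->nparams = 0;

    const char *p = line + n;
    while (*p == ';') {
        const char *begin = p + 1;
        p = skip_param(begin);
        if (*p == '\0' || out->nparams == MAX_PARAMS)
            return -1;                /* ran off the end, or too many params */
        out->params[out->nparams].start = begin;
        out->params[out->nparams].len = (size_t)(p - begin);
        out->nparams++;
    }

    /* p now points at the ':' that introduces the value. */
    out->value.start = p + 1;
    out->value.len = strlen(p + 1);
    return 0;
}
```

For the ORGANIZER line in the question this yields the name ORGANIZER, two parameter slices (CN=John Doe and the quoted SENT-BY one, whose embedded colon is not mistaken for syntax), and the value starting at the final mailto:.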
The functions parse_csv and extract_name_value_pair should in fact be developed and tested first. Make a test suite and check that they work properly. Then write your overall parser function which calls those functions as needed.
Also, write all the memory allocation code as separate functions. Think of what data structure you want to store your parsed result in. Then code up that data structure, and test it, entirely independently of the parsing code. Only then, write the parsing code and call functions to insert the resulting data in the data structure.
You really don't want to have memory management code mixed up with parsing code. That makes it exponentially harder to debug.
When making a function that accepts a string (e.g. all three named functions above, plus any other helpers you decide you need) you have a few options as to their interface:
Accept pointer to null-terminated string
Accept pointer to start and one-past-the-end
Accept pointer to start, and integer length
Each way has its pros and cons: it's annoying to write null terminators everywhere and then unwrite them later if need be; but it's also annoying when you want to use strcspn or other string functions but you received a length-counted piece of string.
Also, when the function needs to let the caller know how much text it consumed in parsing, you have two options:
Accept pointer to character, and return the number of characters consumed; the calling function adds the two together to find where parsing stopped
Accept pointer to pointer to character, and update the pointer to character. Return value could then be used for an error code.
There's no one right answer; with experience you will get better at deciding which option leads to the cleanest code.
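To make the second option concrete, here is a small sketch of a quote-aware field splitter that takes a pointer to the parsing pointer and uses its return value as a status code. The name next_csv_field and its exact contract are assumptions for illustration, one possible building block for parse_csv, not a finished implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Extract the next comma-separated field starting at *pp.
 * Returns  1 and stores the field's start and length on success,
 *          0 when there are no more fields,
 *         -1 if a double quote is opened but never closed.
 * On success *pp is advanced past the field and its comma, or set to NULL
 * once the last field has been consumed. Commas inside quotes do not split. */
int next_csv_field(const char **pp, const char **field, size_t *len)
{
    assert(pp != NULL && field != NULL && len != NULL);
    if (*pp == NULL)
        return 0;                     /* previous call consumed the last field */

    const char *p = *pp;
    int in_quotes = 0;
    while (*p != '\0' && (in_quotes || *p != ',')) {
        if (*p == '"')
            in_quotes = !in_quotes;
        p++;
    }
    if (in_quotes)
        return -1;                    /* unterminated quote */

    *field = *pp;
    *len = (size_t)(p - *pp);
    *pp = (*p == ',') ? p + 1 : NULL;
    return 1;
}
```

A caller would loop with while (next_csv_field(&p, &f, &n) == 1) and store each field. Note that this sketch keeps leading spaces and the quote characters themselves in the field; trimming them would be a separate step.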

Related

Difficulties understanding how to take elements from a file and store them in C

I'm working on an assignment that is supposed to go over the basics of reading a file and storing the information from that file. I'm personally new to C and struggling with the lack of a "String" variable.
The file that the program is supposed to work with contains temperature values, but we are supposed to account for "corrupted data". The assignment states:
"Every input item read from the file should be treated as a stream of characters (string), you can
use the function atof() to convert a string value into a floating point number (invalid data can be
set to a value lower than the lowest minimum to identify it as corrupt)."
The number of elements in the file is undetermined but an example given is:
37.8, 38.a, 139.1, abc.5, 37.9, 38.8, 40.5, 39.0, 36.9, 39.8
After reading the file we're supposed to allow a user to query these individual entries, but as mentioned if the data entry contains a non-numeric value, we are supposed to state that the specific data entry is corrupted.
Overall, I understand how to functionally write a program that can fulfill those requirements. My issue is not knowing what data structure to use and/or how to store the information to be called upon later.
The closest to an actual string datatype which you find in C is a sequence of chars which is terminated by a '\0' value. That is used for most things which you'd expect to do with strings.
Storing them just requires sufficient memory, as offered by a sufficiently large array of char, or by malloc().
I think the requirements of your assignment would be met by making a char array as buffer, then reading in with fgets(), making sure to not read more than fits into your array and making sure that there is a '\0' at the end.
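Then you can use atof() on the content of the array and handle corrupted input if the conversion fails. Note that atof() itself has no way to report failure (it returns 0.0 when nothing can be converted, and it silently accepts a numeric prefix such as the 38 in 38.a), so I would prefer sscanf() for its better feedback via a separate return value.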
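As one possible way to detect corrupted entries, here is a minimal sketch that uses strtod() rather than the atof() or sscanf() mentioned above, because its end pointer shows exactly where the conversion stopped. The function name parse_reading is made up for this example.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>

/* Returns 1 and stores the value if the whole token is a number,
 * 0 if the token is corrupt (nothing converted, or trailing junk). */
int parse_reading(const char *token, double *out)
{
    assert(token != NULL && out != NULL);

    char *end;
    double v = strtod(token, &end);   /* skips leading whitespace itself */
    if (end == token)
        return 0;                     /* nothing converted, e.g. "abc.5" */
    while (isspace((unsigned char)*end))
        end++;                        /* allow trailing spaces or a newline */
    if (*end != '\0')
        return 0;                     /* trailing junk, e.g. "38.a" */
    *out = v;
    return 1;
}
```

Each token would come from reading a line with fgets() and splitting it on commas. When parse_reading returns 0, the caller can store a sentinel below the lowest valid temperature, as the assignment suggests. Be aware that strtod() also accepts forms such as inf, nan and hexadecimal floats, which an assignment may or may not want to treat as valid.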

Extracting the domain extension of a URL stored in a string using scanf()

I am writing a code that takes a URL address as a string literal as input, then runs the domain extension of the URL through an array and returns the index if finds a match, -1 if does not.
For example, an input would be www.stackoverflow.com, in this case, I'd need to extract only the com part. In case of www.google.com.tr, I'd need only com again, ignoring the .tr part.
I can think of basically writing a function that'll do that just fine but I'm wondering if it is possible to do it using scanf() itself?
Using scanf here is really overkill, but you can do something similar like this:
char a[MAXLEN],b[MAXLEN],c[MAXLEN];
scanf("%[^.].%[^.].%[^. \n]",a,b,c);
printf("Desired part is = %s\n",c);
To be sure that formatting is correct you can check whether this scanf call is successful or not. For example:
if (3 != scanf("%[^.].%[^.].%[^. \n]", a, b, c)) {
    fprintf(stderr, "Format must be at least sth.something.sth\n");
    exit(EXIT_FAILURE);
}
Another way to achieve the same thing is to read the whole line with fgets and then split it with strtok, using "." as the delimiter. This gives you each part of the name in turn. With fgets you can easily support different kinds of rules, whereas building them into the scanf format makes error handling difficult.
The scanf solution above only considers the first three parts of the URL; the rest are not parsed. That is rarely practical. Most of the time we have to process all the parts of the URL, and we don't know in advance how many there will be. In that case you are better off using fgets and strtok as described above.
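A minimal sketch of the strtok approach might look like the following. The function name split_host and the caller-supplied parts array are assumptions for the example; the input buffer is modified in place, as strtok always does.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Split host in place on '.' (and on the trailing newline that fgets keeps).
 * Stores up to max pointers into host and returns how many were stored. */
size_t split_host(char *host, char *parts[], size_t max)
{
    assert(host != NULL && parts != NULL);

    size_t n = 0;
    for (char *tok = strtok(host, ".\n");
         tok != NULL && n < max;
         tok = strtok(NULL, ".\n"))
        parts[n++] = tok;
    return n;
}
```

After reading a line with fgets into a writable buffer, the caller can pick whichever part it needs, for example parts[2] for the com in www.google.com.tr. Keep in mind that strtok treats runs of delimiters as one, so an empty part such as in a..b goes unnoticed.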

What is standard notation for passing array to program as CLI argument?

I need to pass an array as an argument to my app. It's an array of colors in hsl notation, so the values look like "hsl(123,20%,30%)","hsl(94,30%,30%)". Because of that, separating elements with , doesn't seem convenient here. What other notations are there? If I remember correctly, Java uses : as a list delimiter. Is that a widely used notation?
A (POSIX) shell doesn't really care that much about the arguments you pass to your program. The only issue here is that you need to quote them, because you are using parentheses, which the shell doesn't like unquoted except as part of some syntactic construct.
Whether you use a comma, a colon or something else is then unrelated to the shell and depends only on how your application parses its parameters. Having commas both as component delimiters and as array element delimiters is not that difficult for a parser to handle, but a different punctuation mark will simplify it. I might have used a semicolon here, still assuming the argument is quoted.
Another approach would be to pass each color as a separate argument, i.e. use a space character as the separator. This is simple to handle if the array does not need to be followed by other arguments.
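If each color is passed as its own argument, the program only has to parse one hsl(...) value at a time. Here is a minimal sketch of that step; the function name parse_hsl and the choice of plain int components are assumptions for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Parse a single argument of the form hsl(H,S%,L%).
 * Returns 1 on success, 0 if the argument does not match the format. */
int parse_hsl(const char *arg, int *h, int *s, int *l)
{
    assert(arg != NULL && h != NULL && s != NULL && l != NULL);

    int end = 0;
    /* %n records how many characters were consumed; it is only reached
     * if the closing ')' matched, and it does not count toward the result. */
    if (sscanf(arg, "hsl(%d,%d%%,%d%%)%n", h, s, l, &end) != 3)
        return 0;
    return end != 0 && arg[end] == '\0';  /* reject missing ')' or trailing text */
}
```

main would then loop over argv[1] to argv[argc - 1] and call parse_hsl on each one. From the shell, with a hypothetical program name, that looks like ./app 'hsl(123,20%,30%)' 'hsl(94,30%,30%)', with the quotes needed because of the parentheses.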

Getting character attributes

Using WinAPI to get the attribute of a character located in y line and x column of the screen console.
This is what I am trying to do after a call to GetConsoleScreenBufferInfo(GetStdHandle(STD_OUTPUT_HANDLE), &nativeData); where the console cursor is set to the specified location. This won't work. It will return the last used attribute change instead.
How do I obtain the attributes used on all the characters on their locations?
EDIT:
The code I used to test ReadConsoleOutput() : http://hastebin.com/atohetisin.pl
It throws garbage values.
I see several problems off the top of my head:
No error checking. You must check the return value for ReadConsoleOutput and other functions, as documented. If the function fails, you must call GetLastError() to get the error code. If you don't check for errors, you're flying blind.
You don't allocate a buffer to receive the data in. (Granted, the documentation confusingly implies that it allocates the buffer for you, but that's obviously wrong since there's no way for it to return a pointer to it. Also, the sample code clearly shows that you have to allocate the buffer yourself. I've added a note.)
It looks as if you had intended to read the characters you had written, but you are writing to (10,5) and reading from (0,0).
You're passing newpos, which is set to (10,5), as dwBufferCoord when you call ReadConsoleOutput, but you specified a buffer size of (2,1). It doesn't make sense for the target coordinates to be outside the buffer.
Taking those last two points together I think perhaps you have dwBufferCoord and lpReadRegion confused, though I'm not sure what you meant the coordinates (200,50) to do.
You're interpreting CHAR_INFO as an integer in the final printf statement. The first element of CHAR_INFO is the character itself, not the attribute. You probably wanted to say chiBuffer[0].Attributes rather than just chiBuffer[0]. (Of course, this is moot at the moment, since chiBuffer points to a random memory address.)
If you do want to retrieve the character, you'll first need to work out whether the console is in Unicode or ASCII mode, and retrieve UnicodeChar or AsciiChar accordingly.

matching brackets program in C

I am fairly new to C programming and I have a question about a bracket matching algorithm:
Basically, for a CS assignment, we have to do the following:
We need to prompt the user for a string of 1-20 characters. We then need to report whether or not any brackets match up. We need to account for the following types of brackets "{} [] ()".
Example:
Matching Brackets
-----------------
Enter a string (1-20 characters): (abc[d)ef]gh
The brackets do not match.
Another Example:
Enter a string (1-20 characters): ({[](){}[]})
The brackets match
One of the requirements is that we do NOT use any stack data structures, but use techniques
below:
Data types and basic operators
Branching and looping programming constructs
Basic input and output functions
Strings
Functions
Pointers
Arrays
Basic modularisation
Any ideas of the algorithmic steps I need to take ? I'm really stuck on this one. It isn't as simple as counting the brackets, because the case of ( { ) } wouldn't work; the bracket counts match, but obviously this is wrong.
Any help to put me in the right direction would be much appreciated.
You can use recursion (this essentially also simulates a stack, which is the general consensus for what needs to happen):
When you see an opening bracket, recurse down.
When you see a closing bracket:
If it's matched (i.e. the same type as the opening bracket in the current function), process it and continue with the next character (don't recurse)
If it's not matched, fail.
If you see any other character, just move on to the next character (don't recurse)
If we reach the end of the string and we currently have an opening bracket without a match, fail, otherwise succeed.
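The steps above can be sketched roughly as follows. The function names (brackets_match, match_until and the two small helpers) are made up for this example, and the call stack takes the place of an explicit stack.

```c
#include <assert.h>
#include <stddef.h>

/* Return the closing bracket for an opener, or '\0' if c is not an opener. */
static char closer_for(char c)
{
    switch (c) {
    case '(': return ')';
    case '[': return ']';
    case '{': return '}';
    default:  return '\0';
    }
}

static int is_closer(char c)
{
    return c == ')' || c == ']' || c == '}';
}

/* Consume characters from *pp until the closer `expected` is found.
 * expected == '\0' means we are at top level and expect the end of string.
 * Returns 1 on success, 0 on a mismatch. */
static int match_until(const char **pp, char expected)
{
    for (;;) {
        char c = **pp;
        if (c == '\0')
            return expected == '\0';   /* end of string: fine only at top level */
        (*pp)++;
        if (closer_for(c) != '\0') {
            /* Opening bracket: recurse until its own closer is found. */
            if (!match_until(pp, closer_for(c)))
                return 0;
        } else if (is_closer(c)) {
            /* Closing bracket: must be the one this call is waiting for. */
            return c == expected;
        }
        /* Any other character: just move on. */
    }
}

int brackets_match(const char *s)
{
    assert(s != NULL);
    return match_until(&s, '\0');
}
```

With the examples from the question, brackets_match("(abc[d)ef]gh") returns 0 and brackets_match("({[](){}[]})") returns 1.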
You are describing a Context-Free language in here that you need to verify if a word is in the language or not.
This means that there is a Context Free Grammar you can create that describes this language.
For this specific language, one can use a deterministic stack automaton to verify whether a word is in the language or not (this is not true for every context-free language; some require a non-deterministic stack automaton).
Note that you can use recursion to imitate a stack, using the implicit call stack for it.
Another alternative (which works for all context-free languages) is the CYK algorithm, but it's overkill here.
So you're not allowed to use stacks... but you ARE allowed to use arrays! This is good.
This might be against the rules, but you can mimic a stack with an array. Keep an index to the "next open spot" in the array, and make sure you do all of your insertions / deletions at that index.
My suggestion? Parse each character in the string, and use the "stack" described above to determine when to add and remove brackets / parens / curly braces.
Here is the easiest way to do it using no regex/complicated language stuff.
The only thing you need is a simple array of maximum length 10 to simulate a stack. You need this to keep track of the last bracket type opened. Every time you open a bracket, you will "push" the bracket type onto the end of the array. Every time you close a bracket, you will "pop" the bracket type off the end of the array if and only if the bracket types match.
Algorithm:
Iterate over each character in the string.
When you encounter an open bracket of any type, append it to your array. If your array is full (i.e. you are already storing 10 open bracket types), and you can't append it, you already know that the brackets do not match and you can end your program.
When you encounter a closed bracket of any type, if the closed-bracket type does not match the last element of your array, you already know that the brackets do not match and you can end the program, printing that they don't match. Else if the closed-bracket type does match the last element of your array, "pop" it off the end of your array.
Finally, if the array is empty at the end of your iteration, then you know that the brackets match.
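A minimal sketch of this algorithm, assuming the 20-character limit from the assignment (hence at most 10 unclosed brackets in any string that can still match). The names brackets_match_array and MAX_OPEN are made up for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_OPEN 10   /* a 20-character string holds at most 10 matched pairs */

/* Returns 1 if every bracket in s is properly matched and nested, else 0. */
int brackets_match_array(const char *s)
{
    assert(s != NULL);

    char open[MAX_OPEN];   /* bracket types opened so far, most recent last */
    size_t depth = 0;      /* index of the next free slot */

    for (; *s != '\0'; s++) {
        char c = *s;
        if (c == '(' || c == '[' || c == '{') {
            if (depth == MAX_OPEN)
                return 0;              /* too many open brackets to ever close */
            open[depth++] = c;         /* "push" */
        } else if (c == ')' || c == ']' || c == '}') {
            char want = (c == ')') ? '(' : (c == ']') ? '[' : '{';
            if (depth == 0 || open[depth - 1] != want)
                return 0;              /* nothing open, or wrong type */
            depth--;                   /* "pop" */
        }
    }
    return depth == 0;                 /* everything opened was closed */
}
```

With the examples from the question, brackets_match_array("(abc[d)ef]gh") returns 0 and brackets_match_array("({[](){}[]})") returns 1.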
EDIT: It has been pointed out to me in the comments that this is an explicit stack and that recursion may be a better method of using an implicit stack.
As amit answered, you definitely need some sort of stack. This can be mathematically proven. However, you can avoid using stack data structures in your code by using the compiler's stack mechanism. This requires you to use recursive function calls.
