Need to understand syntax in C program - c

I have been tasked with studying and modifying a C program. Generally, I write code in PL/SQL, not C. I have been able to decipher most of the code, but the program flow is still eluding me. Even after consulting several C reference guides, I do not understand how the C code works. I'm hoping someone here can answer a few syntax questions and tell me what each statement is trying to do.
Here is one sample, with my guesses below.
input(ask_fterm,TM_NLS_Get("0004","FROM TERM: "),6,ALPHA);
if ( !*ask_fterm ) goto opt_fterm;
tmstrcpy(fterm,ask_fterm);
goto nextparmb;
opt_fterm:
tmstrcpy(parm_no,_TMC("02"));
sel_optional_ind(FIRST_ROW);
if ( compare(rpt_optional_ind,_TMC("O"),EQS) ) goto nextparmb;
goto missing_parms;
First, I don't understand !*. What does the exclamation-asterisk combination do?
Second, I assume that if must be ended with endif, unless it is on a single line?
Third, tmstrcpy() apparently copies the value of the 2nd parameter into the 1st parameter?
I also have several parameters which I don't understand. I'm hoping someone gives me a hint.
tmstrcpy(valid_ind,_TMC("N"));
input(ask_toterm,TM_NLS_Get("0005","TO TERM: "),6,ALPHA);
I don't know where to find _TMC and TM_NLS_Get.

First, I don't understand !*. What does the exclamation-asterisk combination do?
That's two separate operators. ! is logical negation. Unary * is for dereferencing a pointer. Put together, they each have their separate effect, so !*ask_fterm means determine the value of the object to which pointer ask_fterm points (this is *); if that value is 0 then the result is 1, else the result is 0 (this is !). If ask_fterm is a pointer to the first character of a string, then that's a check for whether the string is empty (zero-length), because C strings are terminated by a character with value 0.
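As a small illustration of the idiom, here is a minimal sketch. The helper name is_empty is made up for this example and is not part of the program in the question.

```c
/* Hypothetical helper: returns 1 if the string is empty, 0 otherwise. */
int is_empty(const char *s)
{
    /* *s reads the first character; ! maps 0 (the terminator) to 1
       and any other value to 0. */
    return !*s;
}
```

Read this way, if ( !*ask_fterm ) goto opt_fterm; means "if nothing was entered, jump to the label opt_fterm".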
Second, I assume that if must be ended with endif, unless it is on a single line?
There is no endif in C. An if construct controls exactly one statement, but that can be and often is a compound one (which you can recognize by the { and } delimiters enclosing it). There may also be an else clause, also controlling exactly one statement, which can be a compound one.
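A short sketch of both forms, using a made-up function (clamp_to_limit and its parameters are only for illustration):

```c
/* Hypothetical example: limits value to the range 0..limit and reports
   through was_clamped whether the upper limit was applied. */
int clamp_to_limit(int value, int limit, int *was_clamped)
{
    *was_clamped = 0;

    /* A compound statement: the braces group two statements under one if. */
    if (value > limit) {
        value = limit;
        *was_clamped = 1;
    }

    /* A single controlled statement: no braces (and no endif) needed. */
    if (value < 0) value = 0;

    return value;   /* not controlled by any if; always runs */
}
```

Line breaks do not matter to the compiler: the single statement after an if may sit on the same line or on the next one.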
Third, tmstrcpy() apparently copies the value of the 2nd parameter into the 1st parameter?
That appears to be a user-defined function. It is certainly not from the C standard library. If I were to guess based on the name and usage, I would guess that it copies a trimmed version of the string to which the right-hand argument points into the space to which the left-hand argument points.
I don't know where to find _TMC and TM_NLS_Get.
Those are not standard C features. Possibly they are recognized directly by your C implementation, or possibly they are macros defined earlier in the file or in one of the header files it includes.

Related

Parsing an iCalendar file in C

I am looking to parse iCalendar files using C. I already have a structure set up and the file read in, and I want to parse it line by line into components.
For example I would need to parse something like the following:
UID:uid1@example.com
DTSTAMP:19970714T170000Z
ORGANIZER;CN=John Doe;SENT-BY="mailto:smith@example.com":mailto:john.doe@example.com
CATEGORIES:Project Report, XYZ, Weekly Meeting
DTSTART:19970714T170000Z
DTEND:19970715T035959Z
SUMMARY:Bastille Day Party
Here are some of the rules:
The first word on each line is the property name
The property name will be followed by a colon (:) or a semicolon (;)
If it is a colon, then the property value is everything to the right of the colon, up to the end of the line
A further layer of complexity is added here, as a comma-separated list of values is allowed, which would then be stored in an array. So the CATEGORIES line, for example, would have 3 elements in an array for the values
If a semicolon follows the property name, then optional parameters follow
The optional parameter format is ParamName=ParamValue. Again a comma separated list is supported here.
There can be more than one optional parameter as seen on the ORGANIZER line. There would just be another semicolon followed by the next parameter and value.
And to throw in yet another wrench, quotations are allowed in the values. If something is in quotes for the value it would need to be treated as part of the value instead of being part of the syntax. So a semicolon in a quotation would not mean that there is another parameter it would be part of the value.
I was going about this using strchr() and strtok() and have got some basic elements from that, however it is getting very messy and unorganized and does not seem to be the right way to do this.
How can I implement such a complex parser with the standard C libraries (or the POSIX regex library)? (not looking for whole solution, just starting point)
This answer assumes that you want to roll your own parser using Standard C. In practice it is usually better to use an existing parser, because its authors have already thought of and handled all the weird things that can come up.
My high level approach would be:
Read a line
Pass pointer to start of this line to a function parse_line:
    Use strcspn on the pointer to identify the location of the first : or ; (aborting if no marker found)
    Save the text so far as the property name
    While the parsing pointer points to ;:
        Call a function extract_name_value_pair, passing the address of your parsing pointer.
        That function will extract and save the name and value, and update the pointer to point to the ; or : following the entry. Of course, this function must handle quote marks in the value and the fact that there might be a ; or : in the value.
    (At this point the parsing pointer is always on :)
    Pass the rest of the string to a function parse_csv, which will look for comma-separated values (again, being aware of quote marks) and store the results it finds in the right place.
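The first two steps inside parse_line might look something like the sketch below. The function name split_property_name and its buffer-based interface are assumptions made for this example, not a prescribed design.

```c
#include <string.h>

/*
 * Hypothetical helper for the start of parse_line: copy the property name
 * into name and return a pointer to the ':' or ';' that follows it.
 * Returns NULL if there is no delimiter, the name is empty, or it does
 * not fit into the buffer.
 */
const char *split_property_name(const char *line, char *name, size_t name_size)
{
    /* Length of the initial run that contains neither ':' nor ';'. */
    size_t len = strcspn(line, ":;");

    if (line[len] == '\0' || len == 0 || len >= name_size)
        return NULL;

    memcpy(name, line, len);
    name[len] = '\0';
    return line + len;      /* points at the delimiter itself */
}
```

The caller can then test the returned pointer: if it points at ';' the parameter loop runs, and if it points at ':' the rest of the line is the value.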
The functions parse_csv and extract_name_value_pair should in fact be developed and tested first. Make a test suite and check that they work properly. Then write your overall parser function which calls those functions as needed.
Also, write all the memory allocation code as separate functions. Think of what data structure you want to store your parsed result in. Then code up that data structure, and test it, entirely independently of the parsing code. Only then, write the parsing code and call functions to insert the resulting data in the data structure.
You really don't want to have memory management code mixed up with parsing code. That makes it exponentially harder to debug.
When making a function that accepts a string (e.g. all three named functions above, plus any other helpers you decide you need) you have a few options as to their interface:
Accept pointer to null-terminated string
Accept pointer to start and one-past-the-end
Accept pointer to start, and integer length
Each way has its pros and cons: it's annoying to write null terminators everywhere and then unwrite them later if need be; but it's also annoying when you want to use strcspn or other string functions but you received a length-counted piece of string.
Also, when the function needs to let the caller know how much text it consumed in parsing, you have two options:
Accept pointer to character, and return the number of characters consumed; the calling function adds the two together to find where parsing stopped
Accept pointer to pointer to character, and update the pointer to character. The return value can then be used for an error code.
There's no one right answer; with experience you will get better at deciding which option leads to the cleanest code.
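To make the second option concrete, here is a minimal sketch of one step of parse_csv using the pointer-to-pointer interface. The name next_csv_item and the choice to keep the quote characters in the output are assumptions for this example.

```c
#include <stddef.h>

/*
 * Hypothetical sketch: copy the next comma-separated item from *pp into
 * out, then advance *pp past the item and its trailing comma, if any.
 * Commas inside double quotes are treated as part of the value.
 * Returns 0 on success, -1 if the item does not fit or a quote is
 * left unterminated (in which case *pp is not changed).
 */
int next_csv_item(const char **pp, char *out, size_t out_size)
{
    const char *p = *pp;
    size_t n = 0;
    int in_quotes = 0;

    if (out_size == 0)
        return -1;

    while (*p != '\0' && (in_quotes || *p != ',')) {
        if (*p == '"')
            in_quotes = !in_quotes;     /* toggle; the quote itself is kept */
        if (n + 1 >= out_size)
            return -1;                  /* no room for this char plus '\0' */
        out[n++] = *p++;
    }
    if (in_quotes)
        return -1;

    out[n] = '\0';
    if (*p == ',')
        p++;                            /* skip the separator */
    *pp = p;                            /* tell the caller how far we got */
    return 0;
}
```

The caller simply calls this in a loop until the pointer reaches the terminating null character, using the return value only for error handling.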

What is the syntax in c to combine statements as a parameter

I have an inkling there is an old nasty way to get a function run as a parameter is calculated, but since I do not know what it is called I cannot search out the rules.
An example
char dstr[20];
printf("a dynamic string %s\n", (prep_dstr(dstr),dstr));
The idea is that the "()" will return the address dstr after having executed the prep_dstr function.
I know it is ugly and I could just do it on the line before - but it is complicated...
#
Ok - in answer to the pleading not to do it.
I am actually doing a MISRA cleanup on some existing code (not mine, don't shoot me). Currently the 'prep_dstr' function takes a buffer, modifies it (without regard to the length of the buffer) and returns the pointer it was passed as a parameter.
I like to take a small step - test then another small step.
So - a slightly less nasty approach than returning a pointer with no clue about its persistence is to stop the function returning a pointer and use the comma operator (after making sure it does not romp off the end of the buffer).
That gets the MISRA error count down, when it all still works and the MISRA errors are gone I will try to get around to elegance - perhaps the year after next :).
Comma operator has the appropriate precedence and, besides, it gives a sequence point, that is, it defines a point in the execution flow of the program where all the previous side effects are resolved.
So, whatever your function prep_dstr() does to the string dstr, it's completely performed before the comma operator is reached.
On the other hand, the comma operator gives an expression whose value is that of the rightmost operand.
The following examples give you the value dstr, as you want:
5+3, prep_dstr(dstr), sqrt(25.0), dstr;
a+b-c, NULL, dstr;
(prep_dstr(dstr), dstr);
Of course, such expression can be used wherever you need the string dstr.
Therefore, the syntax you employed in the question does the job perfectly.
Since you are open to playing with the syntax, there is another possibility you can use.
A call to printf() is itself an expression, so it can be an operand of the comma operator.
In this way, it can be put in a comma expression:
prep_dstr(dstr), printf("Show me the string: %s\n", dstr);
It seems that everybody is telling you "don't write code this way" and so on.
This kind of dogmatic advice about programming style is overrated.
If you need to do something, just do it.
One of the principles of C says: "Don't prevent the programmer from doing what needs to be done."
However, whatever you do, try to write readable code.
Yes, the syntax you use will work for your purpose.
However, please consider writing clean and readable code. For instance,
char buffer[20];
char *destination = prepare_destination_string(buffer);
printf("a dynamic string %s\n", destination);
Everything can be cleanly named and understood, and the intended behaviour is easy to infer. You could even omit certain parts if you wished, like destination, or add error checking more easily.
Your inkling and your code are both correct. That said, please don't do this. Putting prep_dstr on its own line makes it much easier to reason about what happens and when.
What you're thinking of is the comma operator. In a context where the comma doesn't already have another meaning (such as separating function arguments), the expression a, b has the value of b, but evaluates a first. The extra parentheses in your code cause the comma to be interpreted this way, rather than as a function argument separator.
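A minimal sketch of the behaviour, for anyone who wants to try it. Here prep_dstr is a stand-in written for this example: it returns nothing and takes an explicit buffer size, which is an assumption about how the cleaned-up function might look rather than the real one.

```c
#include <stdio.h>

/* Hypothetical stand-in for prep_dstr: fills the buffer, returns nothing. */
void prep_dstr(char *buf, size_t size)
{
    snprintf(buf, size, "%s", "prepared");
}

/* The comma operator evaluates prep_dstr(...) first, discards its (void)
   result, and then yields buf as the value of the whole expression. */
const char *prepared(char *buf, size_t size)
{
    return (prep_dstr(buf, size), buf);
}
```

The same parenthesised expression can be passed directly as a printf argument, as in the question; the parentheses are what stop the comma from being read as an argument separator.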

Good way to create an assembler?

I have a homework assignment to do for my school. The goal is to create a really basic virtual machine as well as a simple assembler. I had no problem creating the virtual machine, but I can't think of a 'nice' way to create the assembler.
The grammar of this assembler is really basic: an optional label followed by a colon, then a mnemonic followed by 1, 2 or 3 operands. If there is more than one operand, they shall be separated by commas. Also, whitespace is ignored as long as it doesn't occur in the middle of a word.
I'm sure I can do this with strtok() and some black magic, but I'd prefer to do it in a 'clean' way. I've heard about Parse Trees/AST, but I don't know how to translate my assembly code into these kinds of structures.
I wrote an assembler like this when I was a teenager. You don't need a complicated parser at all.
All you need to do is five steps for each line:
Tokenize (i.e. split the line into tokens). This will give you an array of tokens and then you don't need to worry about the whitespace, because you will have removed it during tokenization.
Initialize some variables representing parts of the line to NULL.
A sequence of if statements to walk over the token array and check which parts of the line are present. If they are present put the token (or a processed version of it) in the corresponding variable, otherwise leave that variable as NULL (i.e. do nothing).
Report any syntax errors (i.e. combinations of types of tokens that are not allowed).
Code generation - I guess you know how to do this part!
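The first three steps might be sketched as follows. The AsmLine structure and the function name split_asm_line are invented for this example, and the sketch assumes the colon is attached to the label (as in "loop: add r1, r2"). Because commas are treated like whitespace here, it does not verify that operands are actually comma-separated.

```c
#include <string.h>

#define MAX_OPERANDS 3

/* Hypothetical result of splitting one source line. */
typedef struct {
    char *label;                    /* NULL if the line has no label */
    char *mnemonic;                 /* NULL if there is no instruction */
    char *operands[MAX_OPERANDS];
    int   operand_count;
} AsmLine;

/*
 * Splits line in place (strtok writes '\0' into it).
 * Returns 0 on success, -1 on a syntax error.
 */
int split_asm_line(char *line, AsmLine *out)
{
    static const char delims[] = " \t,\n";
    char *tok;
    size_t len;

    /* Step 2: everything starts out as "not present". */
    out->label = NULL;
    out->mnemonic = NULL;
    out->operand_count = 0;

    /* Step 1: tokenize; whitespace disappears here. */
    tok = strtok(line, delims);
    if (tok == NULL)
        return 0;                   /* blank line */

    /* Step 3: decide what each token is. */
    len = strlen(tok);
    if (tok[len - 1] == ':') {
        if (len == 1)
            return -1;              /* a lone ':' is not a label */
        tok[len - 1] = '\0';        /* drop the colon */
        out->label = tok;
        tok = strtok(NULL, delims);
        if (tok == NULL)
            return 0;               /* label-only line */
    }
    out->mnemonic = tok;

    while ((tok = strtok(NULL, delims)) != NULL) {
        if (out->operand_count == MAX_OPERANDS)
            return -1;              /* too many operands */
        out->operands[out->operand_count++] = tok;
    }
    return 0;
}
```

Error reporting and code generation would then work from the filled-in AsmLine rather than from the raw text.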
What you're looking for is actually lexical analysis, parsing and finally the generation of the compiled code. There are a lot of frameworks out there that help with creating or generating a parser, like Gold Parser or ANTLR. Creating a language definition (and learning how to, depending on the framework you use) is most often quite a lot of work.
I think you're best off implementing the shunting-yard algorithm, which converts infix expressions in your source into a postfix form that is easy for your virtual machine to process.
I also want to say that diving into parsers, abstract syntax trees, all the tools available on the web and reading a lot of papers about this subject is a really good learning experience!
You can take a look at some already-made assemblers, like PASMO, an assembler for the Z80 CPU, and get ideas from it. Here it is:
http://pasmo.speccy.org/
I've written a couple of very simple assemblers, both of them using string manipulation with strtok() and the like. For a grammar as simple as assembly language, that is enough. Key pieces of my assemblers are:
A symbol table: just an array of structs, with the name of a symbol and its value.
typedef struct
{
    char nombre[256];   /* symbol name */
    u8 valor;           /* symbol value */
} TSymbol;
TSymbol tablasim[MAXTABLA];   /* the symbol table */
int maxsim = 0;               /* number of symbols stored so far */
A symbol is just a name with an associated value. This value can be the current position (the address where the next instruction will be assembled), or it can be an explicit value assigned by the EQU pseudoinstruction.
Symbol names in this implementation are limited to 255 characters each, and one source file is limited to MAXTABLA symbols.
I perform two passes to the source code:
The first one is to identify symbols and store them in the symbol table, detecting whether they are followed by an EQU instruction or not. If so, the value next to EQU is parsed and assigned to the symbol. Otherwise, the value of the current position is assigned. To update the current position I have to detect if there is a valid instruction (although I do not assemble it yet) and update it accordingly (this is easy for me because my CPU has a fixed instruction size).
Here is a sample of my code that is in charge of updating the symbol table with a value from EQU or the current position, and advancing the current position if needed.
case 1:
    if (es_equ (token))
    {
        /* Symbol defined with EQU: the rest of the line is its value. */
        token = strtok (NULL, "\n");
        tablasim[maxsim].valor = parse_numero (token, &err);
        if (err)
        {
            if (err==1)
                fprintf (stderr, "Syntax error on line %d\n", nlinea);
            else if (err==2)
                fprintf (stderr, "Symbol [%s] not found on line %d\n", token, nlinea);
            estado = 2;     /* error state */
        }
        else
        {
            maxsim++;
            token = NULL;
            estado = 0;     /* back to the initial state */
        }
    }
    else
    {
        /* Plain label: its value is the current position. */
        tablasim[maxsim].valor = pcounter;
        maxsim++;
        if (es_instruccion (token))
            pcounter++;     /* fixed instruction size */
        token = NULL;
        estado = 0;
    }
    break;
The second pass is where I actually assemble instructions, replacing a symbol with its value when I find one. It's rather simple, using strtok() to split a line into its components, and using strncasecmp() to compare what I find with instruction mnemonics.
If the operands can be expressions, like "1 << (x + 5)", you will need to write a parser. If not, the parser is so simple that you do not need to think in those terms. For each line, get the first string (skipping whitespace). Does the string end with a colon? Then it is a label; otherwise it is the mnemonic. And so on.
For an assembler there's little need to build an explicit parse tree. Some assemblers do have fancy linkers capable of resolving complicated expressions at link time, but for a basic assembler an ad-hoc lexer and parser should do fine.
In essence you write a little lexer which consumes the input file character-by-character and classifies everything into simple tokens, e.g. numbers, labels, opcodes and special characters.
I'd suggest writing a BNF grammar even if you're not using a code generator. This specification may then be translated into a recursive-descent parser almost by rote. The parser simply walks through the whole code and emits assembled binary code along the way.
A symbol table registering every label and its value is also needed, traditionally implemented as a hash table. Initially when encountering an unknown label (say for a forward branch) you may not yet know the value however. So it is simply filed away for future reference.
The trick is then to spit out dummy values for labels and expressions the first time around but compute the label addresses as the program counter is incremented, then take a second pass through the entire file to fill in the real values.
For a simple assembler, e.g. no linker or macro facilities and a simple instruction set, you can get by with perhaps a thousand or so lines of code. Much of it is mindless, thought-free hand translation from syntax descriptions and opcode tables.
Oh, and I strongly recommend that you check out the dragon book from your local university library as soon as possible.
At least in my experience, normal lexer/parser generators (e.g., flex, bison/byacc) are all but useless for this task.
When I've done it, nearly the entire thing has been heavily table driven -- typically one table of mnemonics, and for each of those a set of indices into a table of instruction formats, specifying which formats are possible for that instruction. Depending on the situation, it can make sense to do that on a per-operand rather than a per-instruction basis (e.g., for mov instructions that have a fairly large set of possible formats for both the source and the destination).
In a typical case, you'll have to look at the format(s) of the operand(s) to determine the instruction format for a particular instruction. For a fairly typical example, a format of #x might indicate an immediate value, x a direct address, and @x an indirect address. Another common form for an indirect address is (x) or [x], but for your first assembler I'd try to stick to a format that specifies instruction format/addressing mode based only on the first character of the operand, if possible.
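A tiny sketch of the table-driven idea, with the addressing mode decided by the first character of the operand. The mnemonics, opcode values, and the choice of '#' for immediate and '@' for indirect operands are all made up for illustration.

```c
#include <stddef.h>
#include <string.h>

typedef enum { MODE_IMMEDIATE, MODE_INDIRECT, MODE_DIRECT, MODE_INVALID } AddrMode;

/* Addressing mode from the first character only (assumed prefixes). */
AddrMode classify_operand(const char *operand)
{
    switch (operand[0]) {
    case '\0': return MODE_INVALID;     /* empty operand */
    case '#':  return MODE_IMMEDIATE;
    case '@':  return MODE_INDIRECT;
    default:   return MODE_DIRECT;
    }
}

/* Hypothetical mnemonic table; the opcode values are invented. */
typedef struct {
    const char   *mnemonic;
    unsigned char opcode;
} OpcodeEntry;

static const OpcodeEntry opcode_table[] = {
    { "load",  0x01 },
    { "store", 0x02 },
    { "add",   0x03 },
};

/* Returns the opcode for mnemonic, or -1 if it is not in the table. */
int lookup_opcode(const char *mnemonic)
{
    size_t i;

    for (i = 0; i < sizeof opcode_table / sizeof opcode_table[0]; i++) {
        if (strcmp(opcode_table[i].mnemonic, mnemonic) == 0)
            return opcode_table[i].opcode;
    }
    return -1;
}
```

Adding an instruction then means adding a row to the table rather than writing new parsing code.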
Parsing labels is simpler, and (mostly) separate. Basically, each label is just a name with an address.
As an aside, if possible I'd probably follow the typical format of a label ending with a colon (":") instead of a semicolon (";"). Much more often, a semicolon will mark the beginning of a comment.

matching brackets program in C

I am fairly new to C programming and I have a question about a bracket matching algorithm:
Basically, for a CS assignment, we have to do the following:
We need to prompt the user for a string of 1-20 characters. We then need to report whether or not the brackets match up. We need to account for the following types of brackets "{} [] ()".
Example:
Matching Brackets
-----------------
Enter a string (1-20 characters): (abc[d)ef]gh
The brackets do not match.
Another Example:
Enter a string (1-20 characters): ({[](){}[]})
The brackets match
One of the requirements is that we do NOT use any stack data structures, but use the techniques below:
Data types and basic operators
Branching and looping programming constructs
Basic input and output functions
Strings
Functions
Pointers
Arrays
Basic modularisation
Any ideas of the algorithmic steps I need to take ? I'm really stuck on this one. It isn't as simple as counting the brackets, because the case of ( { ) } wouldn't work; the bracket counts match, but obviously this is wrong.
Any help to put me in the right direction would be much appreciated.
You can use recursion (this essentially also simulates a stack, which is the general consensus for what needs to happen):
When you see an opening bracket, recurse down.
When you see a closing bracket:
If it's matched (i.e. the same type as the opening bracket in the current function), process it and continue with the next character (don't recurse)
If it's not matched, fail.
If you see any other character, just move on to the next character (don't recurse)
If we reach the end of the string and we currently have an opening bracket without a match, fail, otherwise succeed.
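A minimal sketch of this recursive approach; the function names are chosen for this example. Each call remembers which closing bracket it is waiting for, so the call stack plays the role of the forbidden stack data structure.

```c
/* Returns the closing bracket that pairs with an opening one, or 0. */
static char closer_for(char c)
{
    switch (c) {
    case '(': return ')';
    case '[': return ']';
    case '{': return '}';
    default:  return 0;
    }
}

static int is_closer(char c)
{
    return c == ')' || c == ']' || c == '}';
}

/*
 * Scans from *pp until the expected closing bracket is consumed
 * (or until the end of the string when expected is 0, i.e. top level).
 * Returns 1 if everything seen so far matches, 0 otherwise.
 */
static int match_until(const char **pp, char expected)
{
    while (**pp != '\0') {
        char c = *(*pp)++;
        char closer = closer_for(c);

        if (closer != 0) {
            /* Opening bracket: recurse down and wait for its partner. */
            if (!match_until(pp, closer))
                return 0;
        } else if (is_closer(c)) {
            /* Closing bracket: it must be the one this call waits for. */
            return c == expected;
        }
        /* Any other character: just move on. */
    }
    /* End of string: fine only if no bracket is still open. */
    return expected == 0;
}

int brackets_match(const char *s)
{
    return match_until(&s, 0);
}
```

With a 20-character limit the recursion depth is bounded by the input length, so stack usage is not a concern here.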
You are describing a context-free language, and you need to verify whether a word is in that language.
This means that there is a context-free grammar you can create that describes this language.
For this specific language, one can use a deterministic pushdown (stack) automaton to verify whether a word is in the language (this is not true for every context-free language; some require a nondeterministic pushdown automaton).
Note that you can use recursion to imitate a stack, using the implicit call stack for it.
Another alternative (which works for all context-free languages) is the CYK algorithm, but it's overkill here.
So you're not allowed to use stacks... but you ARE allowed to use arrays! This is good.
This might be against the rules, but you can mimic a stack with an array. Keep an index to the "next open spot" in the array, and make sure you do all of your insertions / deletions from that index.
My suggestion? Parse each character in the string, and use the "stack" described above to determine when to add and remove brackets / parens / curly braces.
Here is the easiest way to do it using no regex/complicated language stuff.
The only thing you need is a simple array of maximum length 10 to simulate a stack. You need this to keep track of the last bracket type opened. Every time you open a bracket, you will "push" the bracket type onto the end of the array. Every time you close a bracket, you will "pop" the bracket type off the end of the array if and only if the bracket types match.
Algorithm:
Iterate over each character in the string.
When you encounter an open bracket of any type, append it to your array. If your array is full (i.e. you are already storing 10 open bracket types), and you can't append it, you already know that the brackets do not match and you can end your program.
When you encounter a closed bracket of any type, if the array is empty or the closed-bracket type does not match the last element of your array, you already know that the brackets do not match and you can end the program, printing that they don't match. Otherwise, the closed-bracket type matches the last element of your array, so "pop" it off the end of your array.
Finally, if the array is empty at the end of your iteration, then you know that the brackets match.
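The algorithm above could be sketched like this. The function name and the MAX_OPEN constant are chosen for this example; the limit of 10 follows from the 20-character input limit.

```c
#define MAX_OPEN 10   /* at most 10 brackets can be open in 20 characters */

/* Returns 1 if all brackets in s match, 0 otherwise. */
int brackets_match_array(const char *s)
{
    char open[MAX_OPEN];    /* bracket types opened so far, oldest first */
    int depth = 0;          /* index of the next free slot in open[] */

    for (; *s != '\0'; s++) {
        char c = *s;

        if (c == '(' || c == '[' || c == '{') {
            if (depth == MAX_OPEN)
                return 0;               /* too many open: cannot match */
            open[depth++] = c;          /* "push" */
        } else if (c == ')' || c == ']' || c == '}') {
            char want = (c == ')') ? '(' : (c == ']') ? '[' : '{';

            /* Nothing open, or the wrong type open: no match. */
            if (depth == 0 || open[depth - 1] != want)
                return 0;
            depth--;                    /* "pop" */
        }
    }
    return depth == 0;                  /* everything opened was closed */
}
```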
EDIT: It has been pointed out to me in the comments that this is an explicit stack and that recursion may be a better method of using an implicit stack.
As amit answered, you definitely need some sort of stack. This can be mathematically proven. However, you can avoid using stack data structures in your code by using the compiler's stack mechanism. This requires you to use recursive function calls.

Syntactic errors

I'm writing code for the exercise 1-24, K&R2, which asks to write a basic syntactic debugger.
I made a parser with states normal, dquote, squote etc...
So I'm wondering if a code snippet like
/" text "
is allowed in the code? Should I report this as an error? (The problem is my parser goes into comment_entry state after / and ignores the ".)
Since a single / just means division it should not be interpreted as a comment. There is no division operator defined for strings, so something like "abc"/"def" doesn't make much sense, but it should not be a syntax error. Figuring out if this division is possible should not be done by the parser, but be left for later stages of the compilation to be decided there.
That is syntactically valid, but not semantically. It should parse as the division operator followed by a string literal. You can't divide stuff by a string literal, so it's not legal code, overall.
Comments start with a two-character token, /*, and end with */.
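One way to handle this in a state machine is to add an intermediate state after a lone /, and only enter the comment state once a * follows. The sketch below uses state names invented for this example, handles only /* */ comments, and ignores escape sequences and character constants for brevity.

```c
/* Hypothetical scanner states. */
enum state { NORMAL, SLASH, COMMENT, COMMENT_STAR, DQUOTE };

/* Returns the number of complete comments found in src. */
int count_comments(const char *src)
{
    enum state st = NORMAL;
    int comments = 0;

    for (; *src != '\0'; src++) {
        char c = *src;

        switch (st) {
        case NORMAL:
            if (c == '/')      st = SLASH;
            else if (c == '"') st = DQUOTE;
            break;
        case SLASH:                     /* seen '/', not yet a comment */
            if (c == '*')      st = COMMENT;
            else if (c == '"') st = DQUOTE;   /* '/' was a division */
            else if (c != '/') st = NORMAL;
            break;
        case COMMENT:
            if (c == '*')      st = COMMENT_STAR;
            break;
        case COMMENT_STAR:              /* seen '*' inside a comment */
            if (c == '/') {
                comments++;
                st = NORMAL;
            } else if (c != '*') {
                st = COMMENT;
            }
            break;
        case DQUOTE:                    /* inside a string literal */
            if (c == '"')      st = NORMAL;
            break;
        }
    }
    return comments;
}
```

With this arrangement the input /" text " is scanned as a division operator followed by a string literal, and no comment is ever entered.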
As a standalone syntactical element this should be reported as an error.
Theoretically (as part of an expression) it would be possible to write
a = b / "text"; // a = b divided by the address of string literal "text"
which is also wrong (you can't divide by a pointer).
But on the surface it would seem okay, because it would syntactically decode as: variable operator variable operator constant-expression (address of string).
The real error would probably have to be caught in a later stage of analysis (i.e. when checking whether the given types are suitable for the division operator).
