Matching brackets program in C

I am fairly new to C programming and I have a question about a bracket-matching algorithm.
For a CS assignment, we have to do the following:
We need to prompt the user for a string of 1-20 characters. We then need to report whether or not the brackets match up. We need to account for the following types of brackets: "{} [] ()".
Example:
Matching Brackets
-----------------
Enter a string (1-20 characters): (abc[d)ef]gh
The brackets do not match.
Another Example:
Enter a string (1-20 characters): ({[](){}[]})
The brackets match
One of the requirements is that we do NOT use any stack data structures, but use only the techniques below:
Data types and basic operators
Branching and looping programming constructs
Basic input and output functions
Strings
Functions
Pointers
Arrays
Basic modularisation
Any ideas about the algorithmic steps I need to take? I'm really stuck on this one. It isn't as simple as counting the brackets, because a case like ( { ) } would slip through: the bracket counts match, but the nesting is obviously wrong.
Any help to put me in the right direction would be much appreciated.

You can use recursion (this essentially simulates a stack too, which the other answers agree is what needs to happen):
When you see an opening bracket, recurse down.
When you see a closing bracket:
If it's matched (i.e. the same type as the opening bracket in the current function), process it and continue with the next character (don't recurse)
If it's not matched, fail.
If you see any other character, just move on to the next character (don't recurse)
If we reach the end of the string and we currently have an opening bracket without a match, fail; otherwise succeed.
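The steps above could be sketched roughly like this. The function names (closer_for, is_closer, match_until, brackets_match) are made up for illustration, and each recursive call stands for one bracket that is currently open, so the depth never exceeds the length of the string.

```c
/* Return the closing bracket that pairs with `open`, or '\0' if `open`
   is not an opening bracket. */
static char closer_for(char open)
{
    switch (open) {
    case '(': return ')';
    case '[': return ']';
    case '{': return '}';
    default:  return '\0';
    }
}

static int is_closer(char c)
{
    return c == ')' || c == ']' || c == '}';
}

/* Scan from *pp until the closer `expected` has been consumed ('\0'
   means we are at the top level and expect the end of the string).
   Returns 1 on success and 0 on a mismatch; *pp is advanced as we go. */
static int match_until(const char **pp, char expected)
{
    while (**pp != '\0') {
        char c = *(*pp)++;
        if (closer_for(c) != '\0') {
            /* Opening bracket: recurse to consume everything up to its closer. */
            if (!match_until(pp, closer_for(c)))
                return 0;
        } else if (is_closer(c)) {
            /* Closing bracket: it must be the one this call is waiting for. */
            return c == expected;
        }
        /* Any other character is simply skipped. */
    }
    /* End of string: fine only if no bracket is still open. */
    return expected == '\0';
}

int brackets_match(const char *s)
{
    return match_until(&s, '\0');
}
```

With this sketch, brackets_match("({[](){}[]})") returns 1 and brackets_match("(abc[d)ef]gh") returns 0.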

You are describing a context-free language here, and you need to verify whether a word is in that language or not.
This means that there is a context-free grammar you can create that describes this language.
For this specific language, one can use a deterministic pushdown (stack) automaton to verify whether a word is in the language (this is not true for every context-free language; some require a non-deterministic pushdown automaton).
Note that you can use recursion to imitate a stack, using the implicit call stack for it.
Another alternative (which works for all context-free languages) is the CYK algorithm, but it's overkill here.

So you're not allowed to use stacks... but you ARE allowed to use arrays! This is good.
This might be against the rules, but you can mimic a stack with an array. Keep an index to the "next open spot" in the array, and make sure you do all of your insertions and deletions at that index.
My suggestion? Parse each character in the string, and use the "stack" described above to determine when to add and remove brackets / parens / curlies.

Here is the easiest way to do it using no regex/complicated language stuff.
The only thing you need is a simple array of maximum length 10 to simulate a stack. You need this to keep track of the last bracket type opened. Every time you open a bracket, you will "push" the bracket type onto the end of the array. Every time you close a bracket, you will "pop" the bracket type off the end of the array if and only if the bracket types match.
Algorithm:
Iterate over each character in the string.
When you encounter an open bracket of any type, append it to your array. If your array is full (i.e. you are already storing 10 open bracket types), and you can't append it, you already know that the brackets do not match and you can end your program.
When you encounter a closed bracket of any type, if the closed-bracket type does not match the last element of your array, you already know that the brackets do not match and you can end the program, printing that they don't match. Else if the closed-bracket type does match the last element of your array, "pop" it off the end of your array.
Finally, if the array is empty at the end of your iteration, then you know that the brackets match.
EDIT: It has been pointed out to me in the comments that this is an explicit stack and that recursion may be a better method of using an implicit stack.
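For comparison with the recursive approach, here is a minimal sketch of the array version described in this answer. The name brackets_match_array and the constant MAX_OPEN are illustrative, and the limit of 10 assumes the assignment's 20-character maximum (a longer input with deeper nesting would be rejected).

```c
#define MAX_OPEN 10  /* a 20-character string holds at most 10 matched pairs */

int brackets_match_array(const char *s)
{
    char open[MAX_OPEN];  /* bracket types currently open, oldest first */
    int top = 0;          /* index of the next free slot */

    for (; *s != '\0'; s++) {
        char c = *s;
        if (c == '(' || c == '[' || c == '{') {
            if (top == MAX_OPEN)
                return 0;          /* too many open brackets to ever close */
            open[top++] = c;       /* "push" */
        } else if (c == ')' || c == ']' || c == '}') {
            char want = (c == ')') ? '(' : (c == ']') ? '[' : '{';
            if (top == 0 || open[top - 1] != want)
                return 0;          /* nothing open, or wrong type */
            top--;                 /* "pop" */
        }
    }
    return top == 0;               /* everything opened was closed */
}
```

Whether this counts as "using a stack data structure" under the assignment's rules is a question for whoever set the assignment.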

As amit answered, you definitely need some sort of stack. This can be mathematically proven. However, you can avoid using stack data structures in your code by using the compiler's stack mechanism. This requires you to use recursive function calls.


Need to understand syntax in C program

I have been tasked with studying and modifying a C program. Generally, I write code in PL/SQL, not C. I have been able to decipher most of the code, but the program flow is still eluding me. After looking through several C reference guides, I still don't understand how the C code works. I'm hoping someone here can answer a few syntax questions and tell me what each statement is trying to do.
Here is one sample, with my guesses below.
input(ask_fterm,TM_NLS_Get("0004","FROM TERM: "),6,ALPHA);
if ( !*ask_fterm ) goto opt_fterm;
tmstrcpy(fterm,ask_fterm);
goto nextparmb;
opt_fterm:
tmstrcpy(parm_no,_TMC("02"));
sel_optional_ind(FIRST_ROW);
if ( compare(rpt_optional_ind,_TMC("O"),EQS) ) goto nextparmb;
goto missing_parms;
First, I don't understand !*. What does the exclamation-asterisk combination do?
Second, I assume that an if must be ended with endif, unless it is on a single line?
Third, tmstrcpy() apparently copies the value of the 2nd parameter into the 1st parameter?
I also see several names which I don't understand. I'm hoping someone can give me a hint.
tmstrcpy(valid_ind,_TMC("N"));
input(ask_toterm,TM_NLS_Get("0005","TO TERM: "),6,ALPHA);
I don't know where to find _TMC and TM_NLS_Get.
First, I don't understand !*. What does the exclamation-asterisk combination do?
That's two separate operators. ! is logical negation. Unary * is for dereferencing a pointer. Put together, they each have their separate effect, so !*ask_fterm means determine the value of the object to which pointer ask_fterm points (this is *); if that value is 0 then the result is 1, else the result is 0 (this is !). If ask_fterm is a pointer to the first character of a string, then that's a check for whether the string is empty (zero-length), because C strings are terminated by a character with value 0.
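As a tiny illustration of that last point, here is a sketch with a made-up helper name, is_empty, that is not part of the program in the question:

```c
/* Returns 1 when `s` points at an empty string, i.e. when its first
   character is the terminating '\0'. The expression !*s is shorthand
   for *s == '\0'. */
int is_empty(const char *s)
{
    return !*s;
}
```

So the line `if ( !*ask_fterm ) goto opt_fterm;` most likely means "if the user entered nothing, jump to opt_fterm", assuming ask_fterm is a character buffer filled in by input().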
Second I assume that if must be ended with endif, unless it is on a single line?
There is no endif in C. An if construct controls exactly one statement, but that can be and often is a compound one (which you can recognize by the { and } delimiters enclosing it). There may also be an else clause, also controlling exactly one statement, which can be a compound one.
Third, tmstrcpy() apparently copies the value of the 2nd parameter into the 1st parameter?
That appears to be a user-defined function. It is certainly not from the C standard library. If I were to guess based on the name and usage, I would guess that it copies a trimmed version of the string to which the right-hand argument points into the space to which the left-hand argument points.
I don't know where to find _TMC and TM_NLS_Get.
Those are not standard C features. Possibly they are recognized directly by your C implementation, or possibly they are macros defined earlier in the file or in one of the header files it includes.

Parsing an iCalendar file in C

I am looking to parse iCalendar files using C. I already have a structure set up and the file reading in place, and I want to parse line by line into components.
For example I would need to parse something like the following:
UID:uid1@example.com
DTSTAMP:19970714T170000Z
ORGANIZER;CN=John Doe;SENT-BY="mailto:smith@example.com":mailto:john.doe@example.com
CATEGORIES:Project Report, XYZ, Weekly Meeting
DTSTART:19970714T170000Z
DTEND:19970715T035959Z
SUMMARY:Bastille Day Party
Here are some of the rules:
The first word on each line is the property name
The property name will be followed by a colon (:) or a semicolon (;)
If it is a colon, then the property value is everything to the right of the colon up to the end of the line
A further layer of complexity is added here, as a comma-separated list of values is allowed, which would then be stored in an array. So the CATEGORIES line, for example, would have 3 elements in an array for the values
If there is a semicolon after the property name, then optional parameters follow
The optional parameter format is ParamName=ParamValue. Again a comma separated list is supported here.
There can be more than one optional parameter as seen on the ORGANIZER line. There would just be another semicolon followed by the next parameter and value.
And to throw in yet another wrench, quotations are allowed in the values. If something is in quotes for the value it would need to be treated as part of the value instead of being part of the syntax. So a semicolon in a quotation would not mean that there is another parameter it would be part of the value.
I was going about this using strchr() and strtok() and have got some basic elements from that, however it is getting very messy and unorganized and does not seem to be the right way to do this.
How can I implement such a complex parser with the standard C libraries (or the POSIX regex library)? (not looking for whole solution, just starting point)
This answer assumes that you want to roll your own parser using Standard C. In practice it is usually better to use an existing parser, because its authors have already thought of and handled all the weird things that can come up.
My high level approach would be:
Read a line
Pass pointer to start of this line to a function parse_line:
Use strcspn on the pointer to identify the location of the first : or ; (aborting if no marker found)
Save the text so far as the property name
While the parsing pointer points to ;:
Call a function extract_name_value_pair passing address of your parsing pointer.
That function will extract and save the name and value, and update the pointer to point to the ; or : following the entry. Of course this function must handle quote marks in the value and the fact that there might be ; or : in the value
(At this point the parsing pointer is always on :)
Pass the rest of the string to a function parse_csv which will look for comma-separated values (again, being aware of quote marks) and store the results it finds in the right place.
The functions parse_csv and extract_name_value_pair should in fact be developed and tested first. Make a test suite and check that they work properly. Then write your overall parser function which calls those functions as needed.
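As a starting point for the first step of parse_line, here is a minimal sketch of splitting off the property name with strcspn. The function name split_property_name and its signature are only illustrative; how you actually store the name depends on your own data structure.

```c
#include <string.h>

/* Copy the property name at the start of `line` into `name` (capacity
   `cap`, including the terminator). Returns a pointer to the ':' or ';'
   that ends the name, or NULL if there is no delimiter or the name
   does not fit. */
const char *split_property_name(const char *line, char *name, size_t cap)
{
    size_t len = strcspn(line, ":;");   /* length of the run before ':' or ';' */
    if (line[len] == '\0' || len >= cap)
        return NULL;
    memcpy(name, line, len);
    name[len] = '\0';
    return line + len;
}
```

The returned pointer tells the caller whether parameters follow (it points at ;) or the value starts next (it points at :).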
Also, write all the memory allocation code as separate functions. Think of what data structure you want to store your parsed result in. Then code up that data structure, and test it, entirely independently of the parsing code. Only then, write the parsing code and call functions to insert the resulting data in the data structure.
You really don't want to have memory management code mixed up with parsing code. That makes it exponentially harder to debug.
When making a function that accepts a string (e.g. all three named functions above, plus any other helpers you decide you need) you have a few options as to their interface:
Accept pointer to null-terminated string
Accept pointer to start and one-past-the-end
Accept pointer to start, and integer length
Each way has its pros and cons: it's annoying to write null terminators everywhere and then unwrite them later if need be; but it's also annoying when you want to use strcspn or other string functions but you received a length-counted piece of string.
Also, when the function needs to let the caller know how much text it consumed in parsing, you have two options:
Accept pointer to character, Return the number of characters consumed; calling function will add the two together to know what happened
Accept pointer to pointer to character, and update the pointer to character. Return value could then be used for an error code.
There's no one right answer; with experience you will get better at deciding which option leads to the cleanest code.
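To make the "pointer to pointer" option concrete, here is a rough sketch of a quote-aware scanner with that interface. The name skip_to_delim is invented for this example, and it assumes double quotes simply toggle a quoted section with no escape sequences inside, which you should check against the iCalendar specification.

```c
#include <string.h>

/* Advance *pp to the first character from `stops` that is not inside
   double quotes, or to the terminating '\0' if there is none.
   Returns 0 on success, or -1 if a quoted section is never closed
   (in which case *pp is left unchanged). */
int skip_to_delim(const char **pp, const char *stops)
{
    const char *p = *pp;
    int in_quotes = 0;

    for (; *p != '\0'; p++) {
        if (*p == '"')
            in_quotes = !in_quotes;          /* enter or leave a quoted section */
        else if (!in_quotes && strchr(stops, *p) != NULL)
            break;                           /* real delimiter found */
    }
    if (in_quotes)
        return -1;
    *pp = p;
    return 0;
}
```

The caller keeps its own copy of the starting pointer, so after a successful call the text between the old and new positions is the piece that was consumed, and the return value stays free for error reporting.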

Is there a known O(nm)-time/O(1)-space algorithm for POSIX filename matching (fnmatch)?

Edit: WHOOPS! Big admission, I screwed up the definition of the ? in fnmatch pattern syntax and seem to have proposed (and possibly solved) a much harder problem where it behaves like .? in regular expressions. Of course it actually is supposed to behave like . in regular expressions (matching exactly one character, not zero or one). Which in turn means my initial problem-reduction work was sufficient to solve the (now rather boring) original problem. Solving the harder problem is rather interesting still though; I might write it up sometime.
On the plus side, this means there's a much greater chance that something like 2way/SMOA needle factorization might be applicable to these patterns, which in turn could yield the better-than-originally-desired O(n) or even O(n/m) performance.
In the question title, let m be the length of the pattern/needle and n be the length of the string being matched against it.
This question is of interest to me because all the algorithms I've seen/used have either pathologically bad performance and possible stack overflow exploits due to backtracking, or required dynamic memory allocation (e.g. for a DFA approach or just avoiding doing backtracking on the call stack) and thus have failure cases that could also be dangerous if a program is using fnmatch to grant/deny access rights of some sort.
I'm willing to believe that no such algorithm exists for regular expression matching, but the filename pattern language is much simpler than regular expressions. I've already simplified the problem to the point where one can assume the pattern does not use the * character, and in this modified problem you're not matching the whole string but searching for an occurrence of the pattern in the string (like the substring match problem). If you further simplify the language and remove the ? character, the language is just composed of concatenations of fixed strings and bracket expressions, and this can easily be matched in O(mn) time and O(1) space, which perhaps can be improved to O(n) if the needle factorization techniques used in 2way and SMOA substring search can be extended to such bracket patterns. However, naively each ? requires trials with or without the ? consuming a character, bringing in a time factor of 2^q where q is the number of ? characters in the pattern.
Anyone know if this problem has already been solved, or have ideas for solving it?
Note: In defining O(1) space, I'm using the transdichotomous model.
Note 2: This site has details on the 2way and SMOA algorithms I referenced: http://www-igm.univ-mlv.fr/~lecroq/string/index.html
Have you looked into the re2 regular expression engine by Russ Cox (of Google)?
It's a regular expression matching engine based on deterministic finite automata, which is different than the usual implementations (Perl, PCRE) using backtracking to simulate a non-deterministic finite automaton. One of the specific design goals was to eliminate the catastrophic backtracking behaviour you mention.
It disallows some of the Perl extensions like backreferences in the search pattern, but you don't need that for glob matching.
I'm not sure if it guarantees O(mn) time and O(1) memory constraints specifically, but it was good enough to run the Google Code Search service while it existed.
At the very least it should be cool to look inside and see how it works. Russ Cox has written three articles about re2, and the re2 code is open source.
Edit: WHOOPS! Big admission, I screwed up the definition of the ? in fnmatch pattern syntax and seem to have solved a much harder problem where it behaves like .? in regular expressions. Of course it actually is supposed to behave like . in regular expressions (matching exactly one character, not zero or one). Which in turn means my initial problem-reduction work was sufficient to solve the (now rather boring) original problem. Solving the harder problem is rather interesting still though; I might write it up sometime.
Possible solution to the harder problem follows below.
I have worked out what seems to be a solution in O(log q) space (where q is the number of question marks in the pattern, and thus q < m) and uncertain but seemingly better-than-exponential time.
First of all, a quick explanation of the problem reduction. First break the pattern at each *; it decomposes as a (possibly zero length) initial and final component, and a number of internal components flanked on both sides by a *. This means once we've determined if the initial/final components match up, we can apply the following algorithm for internal matches: Starting with the last component, search for the match in the string that starts at the latest offset. This leaves the most possible "haystack" characters free to match earlier components; if they're not all needed, it's no problem, because the fact that a * intervenes allows us to later throw away as many as needed, so it's not beneficial to try "using more ? marks" of the last component or finding an earlier occurrence of it. This procedure can then be repeated for every component. Note that here I'm strongly taking advantage of the fact that the only "repetition operator" in the fnmatch expression is the * that matches zero or more occurrences of any character. The same reduction would not work with regular expressions.
With that out of the way, I began looking for how to match a single component efficiently. I'm allowing a time factor of n, so that means it's okay to start trying at every possible position in the string, and give up and move to the next position if we fail. This is the general procedure we'll take (no Boyer-Moore-like tricks yet; perhaps they can be brought in later).
For a given component (which contains no *, only literal characters, brackets that match exactly one character from a given set, and ?), it has a minimum and maximum length string it could match. The minimum is the length if you omit all ? characters and count bracket expressions as one character, and the maximum is the length if you include ? characters. At each position, we will try each possible length the pattern component could match. This means we perform q+1 trials. For the following explanation, assume the length remains fixed (it's the outermost loop, outside the recursion that's about to be introduced). This also fixes a length (in characters) from the string that we will be comparing to the pattern at this point.
Now here's the fun part. I don't want to iterate over all possible combinations of which ? characters do/don't get used. The iterator is too big to store. So I cheat. I break the pattern component into two "halves", L and R, where each contains half of the ? characters. Then I simply iterate over all the possibilities of how many ? characters are used in L (from 0 to the total number that will be used based on the length that was fixed above) and then the number of ? characters used in R is determined as well. This also partitions the string we're trying to match into part that will be matched against pattern L and pattern R.
Now we've reduced the problem of checking if a pattern component with q ? characters matches a particular fixed-length string to two instances of checking if a pattern component with q/2 ? characters matches a particular smaller fixed-length string. Apply recursion. And since each step halves the number of ? characters involved, the number of levels of recursion is bounded by log q.
You can create a hash of both strings and then compare these. The hash computation is done in O(m), while the search is O(m + n).
You can use something like this for calculating the hash of the string where s[i] is a character
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
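In C, that formula could be computed with Horner's rule, roughly as follows. The function name poly_hash is made up here, and the use of unsigned long with wrap-around on overflow is an assumption for the sketch rather than something this answer specifies.

```c
#include <stddef.h>

/* Polynomial hash: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
   evaluated with Horner's rule. Unsigned arithmetic wraps around on
   overflow, which is well defined in C. */
unsigned long poly_hash(const char *s, size_t n)
{
    unsigned long h = 0;
    for (size_t i = 0; i < n; i++)
        h = h * 31 + (unsigned char)s[i];
    return h;
}
```

Two equal hashes only suggest a match, so the candidate substring still has to be compared character by character.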
As you said this is for file-name matching and you can't use this where you have wildcards in the strings. Good luck!
My feeling is that this is not possible.
Though I can't provide a bullet-proof argument, my intuition is that you will always be able to construct patterns containing q=Theta(m) ? characters where it will be necessary for the algorithm to, in some sense, account for all 2^q possibilities. This will then require O(q)=O(m) space to keep track of which of the possibilities you're currently looking at. For example, the NFA algorithm uses this space to keep track of the set of states it's currently in; the brute-force backtracking approach uses the space as stack (and to add insult to injury, it uses O(2^q) time in addition to the O(q) of space).
OK, here's how I solved the problem.
Attempt to match the initial part of the pattern up to the first * against the string. If this fails, bail out. If it succeeds, throw away this initial part of both the pattern and the string; we're done with them. (And if we hit the end of pattern before hitting a *, we have a match iff we also reached the end of the string.)
Skip all the way to the end of the pattern (everything after the last *, which might be a zero-length pattern if the pattern ends with a *). Count the number of characters needed to match it, and examine that many characters from the end of the string. If they fail to match, we're done. If they match, throw away this component of the pattern and string.
Now, we're left with a (possibly empty) sequence of subpatterns, all of which are flanked on both sides by *'s. We try searching for them sequentially in what remains of the string, taking the first match for each and discarding the beginning of the string up through the match. If we find a match for each component in this manner, we have a match for the whole pattern. If any component search fails, the whole pattern fails to match.
This algorithm has no recursion and only stores a finite number of offsets in the string/pattern, so in the transdichotomous model it's O(1) space. Step 1 was O(m) in time, step 2 was O(n+m) in time (or O(m) if we assume the input string length is already known, but I'm assuming a C string), and step 3 is (using a naive search algorithm) O(nm). Thus the algorithm overall is O(nm) in time. It may be possible to improve step 3 to be O(n) but I haven't yet tried.
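For readers who want something concrete, here is a compact sketch in the same spirit. It is not the three-step decomposition described above: instead of splitting the pattern up front, it remembers where the most recent * was and restarts from there on a mismatch, which gives the same "take the first match for each component" behaviour. It handles only * and ? (no bracket expressions, no escapes, no locale handling), and the name glob_match is made up.

```c
#include <stddef.h>

/* Match `str` against `pat`, where '*' matches any run of characters
   (including none) and '?' matches exactly one character.
   Uses two saved pointers and no recursion: O(1) extra space,
   O(n*m) time in the worst case. Returns 1 on a match, 0 otherwise. */
int glob_match(const char *pat, const char *str)
{
    const char *star = NULL;    /* pattern position just after the last '*' */
    const char *resume = NULL;  /* string position where that '*' stopped */

    while (*str != '\0') {
        if (*pat == '*') {
            star = ++pat;       /* remember the component after the '*' */
            resume = str;       /* initially let '*' match nothing */
        } else if (*pat == '?' || *pat == *str) {
            pat++;              /* one character consumed on both sides */
            str++;
        } else if (star != NULL) {
            pat = star;         /* mismatch: let the last '*' eat one more */
            str = ++resume;
        } else {
            return 0;           /* mismatch with no '*' to fall back on */
        }
    }
    while (*pat == '*')         /* trailing stars can match the empty string */
        pat++;
    return *pat == '\0';
}
```

For example, glob_match("*.c", "main.c") returns 1 and glob_match("a?c", "ac") returns 0.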
Finally, note that the original harder problem is perhaps still useful to solve. That's because I didn't account for multi-character collating elements, which most people implementing regex and such tend to ignore because they're ugly to get right and there's no standard API to interface with the system locale and obtain the necessary info to get them. But with that said, here's an example: Suppose ch is a multi-character collating element. Then [c[.ch.]] could consume either 1 or 2 characters. And we're back to needing the more advanced algorithm I described in my original answer, which I think needs O(log m) space and perhaps somewhat more than O(nm) time (I'm guessing O(n²m) at best). At the moment I have no interest in implementing multi-character collating element support, but it does leave a nice open problem...

My Simpler Dead-code Remover

I am writing a simulation of a dead-code remover in a very simple manner.
My idea is as follows:
Step 1: Read the input C program line by line and store it in a doubly linked list or an array (since deletion and insertion will be easier than with file operations).
Doubt: Is my approach correct? If so, how can I minimize traversing the linked list each time?
Step 2: The strings read in are analyzed in parallel, and tables are created to track variable names and their details, functions and their calls, etc.
Step 3: Each entry in the variable table is searched for, and the variable is replaced by the value it holds at that point.
(E.g.)
i=0;
if(i==3) will be replaced by if(0==3).
But on situation like..
get(a);
i=a;
if(i){}
Here, 'i' will not be replaced since it depends on another variable, and 'a' will not be replaced since it depends on user input.
Doubt: if the input is
if(5*5+6){print hello;}
the check is surely unnecessary. How can I evaluate this expression to simplify the code to
{
print hello;
}
Step 4: The strings are searched for if(0), while(0), etc., and the action block is removed using a stack. if(0){/*this will be removed*/}
Step 5: (E.g.) function foo(){/**/} ... if(0) foo(); ... Once all the dead code is removed, foo()'s entry in the function table is checked to get the number of times it is referenced in the code. If it is 0, that function is removed using the same stack method.
Step 6: In the remaining functions, the lines below the return statements (if any) are removed, except the '}'. This removal continues to the end of the function, which is identified using a stack.
Step 7: I will assume that my dead-code-free program is ready now. Write the linked list or array to an output file.
My questions are:
1. Is my idea meaningful, and is it implementable? How can I improve this algorithm?
2. While trying to implement this idea, I have to deal more with string manipulation than with removing dead code. Is there any way to reduce the string manipulation in this algorithm?
Do not do it this way. C is a free-form language, and trying to process it line-by-line will result in supporting a subset of C that is so ridiculously restricted that it doesn't deserve the name.
What you need to do is to write a proper parser. There is copious literature about that out there. Find out which textbook your school uses for its compiler-construction course, and work through that -- or just take the course! Only when you've got the parser down should you even begin to consider semantics. Then do your work on abstract syntax trees instead of strings. Alternatively, find an already written and tested parser for C that you can reuse (but you'll still need to learn quite a bit in order to integrate it with your own processing).
If you end up writing the parser yourself, and it's only for your own edification, consider using a simpler language than C as your subject. Even though C at its core is fairly compact as languages go, getting all details of the declaration syntax right is surprisingly tricky, and will probably distract you from what you're actually interested in. And the presence of the preprocessor is an issue in itself which can make it very difficult to design meaningful source-to-source transformations.
By the way, the transformations you sketch are known in the trade as "constant propagation", or (in more ambitious variants that will clone functions and loop bodies when they have differing constant inputs) "partial evaluation". Googling those terms may be interesting.
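To give a feel for the simplest piece of this, evaluating a constant expression such as 5*5+6, here is a small recursive-descent sketch. It is only an illustration: it works directly on text rather than on a syntax tree, handles just non-negative integer literals, + - * / and parentheses, ignores overflow, and all the names in it (fold_constant and the parse_* helpers) are made up.

```c
#include <ctype.h>

/* Grammar (integer constants only, no variables):
     expr   := term   (('+' | '-') term)*
     term   := factor (('*' | '/') factor)*
     factor := NUMBER | '(' expr ')'
   Each function advances *pp past what it consumed and sets *ok to 0
   on a syntax error or a division by zero. */
static long parse_expr(const char **pp, int *ok);

static void skip_spaces(const char **pp)
{
    while (isspace((unsigned char)**pp))
        (*pp)++;
}

static long parse_factor(const char **pp, int *ok)
{
    long v = 0;
    skip_spaces(pp);
    if (**pp == '(') {
        (*pp)++;
        v = parse_expr(pp, ok);
        skip_spaces(pp);
        if (**pp == ')')
            (*pp)++;
        else
            *ok = 0;                       /* missing closing parenthesis */
    } else if (isdigit((unsigned char)**pp)) {
        while (isdigit((unsigned char)**pp))
            v = v * 10 + (*(*pp)++ - '0');
    } else {
        *ok = 0;                           /* a variable or anything else */
    }
    return v;
}

static long parse_term(const char **pp, int *ok)
{
    long v = parse_factor(pp, ok);
    for (;;) {
        skip_spaces(pp);
        char op = **pp;
        if (op != '*' && op != '/')
            return v;
        (*pp)++;
        long rhs = parse_factor(pp, ok);
        if (op == '*')
            v *= rhs;
        else if (rhs != 0)
            v /= rhs;
        else
            *ok = 0;                       /* division by zero: do not fold */
    }
}

static long parse_expr(const char **pp, int *ok)
{
    long v = parse_term(pp, ok);
    for (;;) {
        skip_spaces(pp);
        char op = **pp;
        if (op != '+' && op != '-')
            return v;
        (*pp)++;
        long rhs = parse_term(pp, ok);
        v = (op == '+') ? v + rhs : v - rhs;
    }
}

/* Evaluate a constant expression. Returns 1 and stores the value in
   *out on success, 0 if the text is not a pure constant expression. */
int fold_constant(const char *text, long *out)
{
    int ok = 1;
    long v = parse_expr(&text, &ok);
    skip_spaces(&text);
    if (!ok || *text != '\0')
        return 0;
    *out = v;
    return 1;
}
```

With this sketch, fold_constant("5*5+6", &v) succeeds with v == 31, while an expression containing a variable such as "i+1" is reported as not foldable. A real implementation would do the same walk over expression nodes of the abstract syntax tree.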

Parsing a stream of data for control strings

I feel like this is a pretty common problem but I wasn't really sure what to search for.
I have a large file (so I don't want to load it all into memory) that I need to parse control strings out of and then stream that data to another computer. I'm currently reading in the file in 1000 byte chunks.
So for example if I have a string that contains ASCII codes escaped with ('$' some number of digits ';') and the data looked like this... "quick $33;brown $126;fox $a $12a". The string going to the other computer would be "quick !brown ~fox $a $12a".
In my current approach I have the following problems:
What happens when a control string falls on a buffer boundary?
If the string is '$' followed by anything but digits and a ';' I want to ignore it. So I need to read ahead until the full control string is found.
I'm writing this in straight C so I don't have streams to help me.
Would an alternating double buffer approach work and if so how does one manage the current locations etc.
If I've followed what you are asking about it is called lexical analysis or tokenization or regular expressions. For regular languages you can construct a finite state machine which will recognize your input. In practice you can use a tool that understands regular expressions to recognize and perform different actions for the input.
Depending on different requirements you might go about this differently. For more complicated languages you might want to use a tool like lex to help you generate an input processor, but for this, as I understand it, you can use a much more simple approach, after we fix your buffer problem.
You should use a circular buffer for your input, so that indexing off the end wraps around to the front again. Whenever half of the data that the buffer can hold has been processed you should do another read to refill that. Your buffer size should be at least twice as large as the largest "word" you need to recognize. The indexing into this buffer will use the modulus (remainder) operator % to perform the wrapping (if you choose a buffer size that is a power of 2, such as 4096, then you can use bitwise & instead).
Now you just look at the characters until you read a $, output what you've looked at up until that point, and then, knowing that you are in a different state because you saw a $, you look at more characters until you see another character that ends the current state (the ;) and perform some other action on the data that you had read in. How to handle the case where the $ is seen without a well-formatted number followed by a ; wasn't entirely clear in your question -- what to do if there are a million digits before you see ;, for instance.
The regular expressions would be:
[^$]
Any non-dollar sign character. This could be augmented with a closure ([^$]* or [^$]+) to recognize a string of non$ characters at a time, but that could get very long.
$[0-9]{1,3};
This would recognize a dollar sign followed by 1 to 3 digits followed by a semicolon.
[$]
This would recognize just a dollar sign. It is in the brackets because $ is special in many regular expression representations when it is at the end of a symbol (which it is in this case) and means "match only if at the end of line".
Anyway, in this case it would recognize a dollar sign in the case where it is not recognized by the other, longer, pattern that recognizes dollar signs.
In lex you might have
[^$]{1,1024} { write_string(yytext); }
$[0-9]{1,3}; { write_char(atoi(yytext + 1)); /* skip the leading '$' */ }
[$] { write_char(*yytext); }
and it would generate a .c file that will function as a filter similar to what you are asking for. You will need to read up a little more on how to use lex though.
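If you would rather hand-write the state machine described earlier than use lex, a rough sketch might look like the following. It keeps the partially read escape in a small struct between calls, so an escape that straddles two 1000-byte chunks needs no special handling and no circular buffer. The names (esc_decoder, esc_decode, esc_finish) are invented, and the limits of 1 to 3 digits and codes up to 255 are assumptions taken from the regular expressions above, not from the question.

```c
#include <stddef.h>

/* Decoder state kept between chunks. */
struct esc_decoder {
    char held[4];    /* '$' plus up to 3 digits seen so far */
    size_t nheld;    /* 0 means we are not inside a candidate escape */
};

/* Copy the held characters to `out` unchanged and leave the escape state. */
static size_t flush_held(struct esc_decoder *d, char *out)
{
    size_t n = d->nheld;
    for (size_t i = 0; i < n; i++)
        out[i] = d->held[i];
    d->nheld = 0;
    return n;
}

/* Decode one chunk of input. `out` must have room for len + 4 bytes.
   Returns the number of bytes written to `out`. */
size_t esc_decode(struct esc_decoder *d, const char *in, size_t len, char *out)
{
    size_t o = 0;
    for (size_t i = 0; i < len; i++) {
        char c = in[i];
        if (d->nheld > 0 && c >= '0' && c <= '9' && d->nheld < 4) {
            d->held[d->nheld++] = c;             /* another digit of the escape */
            continue;
        }
        if (d->nheld > 1 && c == ';') {
            int code = 0;
            for (size_t k = 1; k < d->nheld; k++)
                code = code * 10 + (d->held[k] - '0');
            if (code <= 255) {
                out[o++] = (char)code;           /* complete escape: emit the byte */
                d->nheld = 0;
                continue;
            }
        }
        /* Either ordinary text or a malformed escape: emit anything held
           verbatim, then handle c as if no escape were in progress. */
        o += flush_held(d, out + o);
        if (c == '$')
            d->held[d->nheld++] = c;             /* possible start of an escape */
        else
            out[o++] = c;
    }
    return o;
}

/* Call once at end of input to emit an unfinished escape verbatim. */
size_t esc_finish(struct esc_decoder *d, char *out)
{
    return flush_held(d, out);
}
```

Start with a zero-initialised struct esc_decoder, call esc_decode once per chunk read from the file, send each result on, and call esc_finish after the last chunk.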
The "f" family of functions in <stdio.h> can take care of the streaming for you. Specifically, you're looking for fopen(), fgets(), fread(), etc.
Nategoose's answer about using lex (and I'll add yacc, depending on the complexity of your input) is also worth considering. They generate lexers and parsers that work, and after you've used them you'll never write one by hand again.
