We're creating a very simple programming language, using Flex and Bison for parsing and syntax analysis, and using C to build the compiler.
Before going straight to assembly, we're creating an abstract syntax tree from the language rules. But we're having trouble representing one specific function from the language.
The function is described as follows:
FILTERC: It takes a condition and an expression list as input and returns how many of those expressions match the condition. The condition can be simple or compound.
It is used in this form: FILTERC (condition, [expression list])
The condition has to have an underscore before each element, representing where the expressions should be placed for comparison. Example: FILTERC ( _>4 and _<=6.5 , [a+c,b,c-a,d])
This is how the "filterc" function is expressed in BNF rules (we actually use tokens from Flex, but I've simplified it to the actual characters here since that's not the point, and the syntax analysis is done correctly by Bison):
filter ::= FILTERC ( condition_filter , [ expression_list ] )
;
condition_filter ::= comparison_filter | comparison_filter AND comparison_filter | comparison_filter OR comparison_filter
;
comparison_filter ::= _ > expression | _ < expression | _ == expression | _ >= expression | _ <= expression | _ != expression
;
expression_list ::= expression | expression , expression_list
;
expression ::= term | expression + term | expression - term
;
term ::= factor | term * factor | term / factor
;
factor ::= ID | INT_LITERAL | REAL_LITERAL | STRING_LITERAL | ( expression ) | filter
;
We now have to write the functions that create the nodes of the abstract syntax tree. At a low level, the "filterc" function is nothing but a bunch of "IF"s that verify that each one of the expressions matches the condition, only that now the expressions are placed where the underscore is. So each check would be something like: (expression) (comparison operator) (condition).
The thing is, the actual FILTERC sentence works "backwards": at execution time each expression comes first and is then compared against the condition. But the source is read sequentially: the underscore is read before the actual expressions are found. So we're really confused as to how to build the tree.
I'm not going to include all the code we use to create the nodes and leaves of the tree, or this would be a total mess. Basically, there is a function that creates nodes with two children (left and right); when a child shouldn't exist, that pointer is set to NULL. The basic structure we use is to place the operator in the root node and the operands as the children (e.g., in an "if" sentence, the "if" keyword is the root, the condition is the left child and the code block to execute if true is the right child). Like this:
IF condition THEN block {thenPtr = blockPtr;} ENDIF {createNode("if", conditionPtr, thenPtr);}
("condition" and "block" are defined elsewhere, where their pointers are created).
We were able to successfully create the tree for the expression rules and for all the other rules in the language, but this "filter" function is really confusing.
It is true that when the parser reads a piece of an expression (e.g., the ">"), it hasn't got enough to build the tree for the expression. The same is true for any concept ("nonterminal") in your language, and from that perspective I can see why you are confused.
Apparently you don't understand how LR parsers like Bison work. Assume we have rules R1, R2, ..., each with a right-hand side, e.g., Rn = T1 T2 T3, and each rule having a right-hand-side length L(Rn).
The key idea you need is that an LR parser collects ("stacks"; yes, it really uses a stack of tokens) tokens from the input stream, left to right. These steps are called "shifts". The parser shifts repeatedly, continually looking for situations which indicate that enough tokens have been read (e.g., T1, T2, then T3) to satisfy the right-hand side of some grammar rule Rn. The magic of the parser generator and the LR tables it produces allows the parser to keep efficient track of all the "live" rules at once; we're not going to discuss that further here.
At the point where a right-hand side has been recognized, the LR parser performs a "reduce" action and replaces the stacked tokens that match the body of the rule with the nonterminal token Rn ("pops the stack L(Rn) times and then pushes Rn"). It does as many reductions as it can before returning to collecting terminal tokens from the input stream. It is worth your trouble to simulate this process by hand on a really tiny grammar. [A subtle detail: some rules have empty right-hand sides, i.e., L(Rn) == 0; in that case, when a reduction happens, zero pops occur. Yes, that sounds funny, but it is dead right.]
At every point where the parser does a reduce action, it offers you, the parser programmer, an opportunity to do some additional work. That additional work is almost invariably "tree building". Clearly, the tokens that make up the rule Rn have all been seen, so one can build a tree representing Rn if the tokens were all terminals. In fact, if all the tokens for Rn have been seen and Rn contains some nonterminals, then there must have been reduce actions to produce each of those nonterminals. If each of them produced a tree representing itself, then when the rule containing those nonterminals is reduced, their trees have already been produced and can be combined to produce the tree for the current rule.
LR parser generator tools like Bison help you with this, usually by providing tree-building operators that you can invoke in a reduce action. They also help by making the trees for already-processed nonterminals available to your reduce action, so it can combine them to produce the tree for the reduction. (Bison does this by keeping track of generated trees in a stack parallel to the token stack.) At no point does the parser ever attempt to reduce, or do you ever attempt to produce a tree, where you don't have all the subtrees needed.
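To make that concrete for your filter rules, here is a minimal sketch of what the embedded reduce actions could look like, assuming the createNode helper from your question plus a hypothetical createLeaf for terminals (neither is a Bison built-in). The underscore simply becomes a placeholder leaf in the condition subtree; by the time the whole FILTERC rule is reduced, both the condition tree and the expression-list tree exist, so a later code-generation pass can walk the list and substitute each expression where the placeholder sits, emitting one IF per expression.

comparison_filter
    : '_' '>' expression    { $$ = createNode(">",  createLeaf("_"), $3); }
    | '_' '<' expression    { $$ = createNode("<",  createLeaf("_"), $3); }
    /* ... one alternative per comparison operator ... */
    ;

condition_filter
    : comparison_filter                         { $$ = $1; }
    | comparison_filter AND comparison_filter   { $$ = createNode("and", $1, $3); }
    | comparison_filter OR  comparison_filter   { $$ = createNode("or",  $1, $3); }
    ;

expression_list
    : expression                       { $$ = createNode("list", $1, NULL); }
    | expression ',' expression_list   { $$ = createNode("list", $1, $3); }
    ;

filter
    : FILTERC '(' condition_filter ',' '[' expression_list ']' ')'
          { $$ = createNode("filterc", $3, $6); }
    ;

Nothing about the underscore has to be resolved while parsing; the substitution is done on the finished tree, where both subtrees are available.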
I think you need to read the Bison manual carefully and all this will become clear as you attempt to implement the parser and the reductions; the manual has good examples. It is clear that you haven't done that yet (fear of not knowing how to handle the trees?), because a) your rules as expressed are broken (there's no way to generate a term), and b) you don't have any embedded reduce actions.
Related
I'm wondering how $ variables work with non-tokens, like blocks of code. And my question can be reduced to this:
I have a rule like this, with a block of code in the middle of it. In this case, what are $3 and $4?
func-header: ret-type ID { strcpy(func_id,current_id); } LPAREN params RPAREN
Mid-rule actions (MRAs) are implemented as non-terminals which match an empty sequence. (Such non-terminals are sometimes called "markers".) The mid-rule action becomes the semantic action of the generated non-terminal.
Like any non-terminal, these automatically-generated markers have a semantic value, which is set by assigning to $$ inside the action. However, the numbering of $n inside an MRA differs slightly from the numbering in normal actions. Inside an MRA, each n in $n is translated to a negative index, representing values on the top of the stack when the marker is reduced, by subtracting the MRA's own index.
Negative indices are always allowed by yacc/bison, but as the manual states they are quite dangerous and should only be used if you can prove that an appropriately typed value is necessarily at the indicated point on the stack. In the case of automatically-generated markers, yacc/bison can prove this because the marker is only used in a single production and the generated negative indices always fall into the part of the stack occupied by the right-hand side containing the MRA.
In the rule shown:
ret-type is $1.
ID is $2.
The code block is $3.
LPAREN is $4.
params is $5.
RPAREN is $6.
In other words, code blocks act as non-terminals.
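For illustration, this is roughly the rewrite Bison performs internally for the rule above (the marker name $@1 is the kind of name Bison uses in its reports; take this as a sketch, not the exact generated grammar):

$@1: /* empty */  { strcpy(func_id, current_id); } ;

func-header: ret-type ID $@1 LPAREN params RPAREN ;

The marker is reduced, and its action run, right after ID is shifted and before LPAREN, params and RPAREN are consumed, which is why those later symbols still get their usual positions $4, $5 and $6.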
I was wondering how inserting an equation such as 5 * 4 + 2 / 3 into a binary tree would work. I have tried doing this on my own, but I can only make the tree grow to one side.
I am not an expert in the field, but I wrote basic expression parsers in the past.
You will need to tokenize your expression. Transform it from a string of characters to a list of understandable chunks.
You may want to create two structures, one for operators and one for operands. Hint: operators have a priority (precedence) associated with them.
You can apply an algorithm to transform your operators/operands into an abstract syntax tree (AST); it's basically just a set of rules, generally implemented with a queue and a stack.
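To make that concrete, here is a minimal sketch in C that turns "5 * 4 + 2 / 3" into a binary tree. For brevity it uses recursive descent with two precedence levels instead of the explicit queue-and-stack (shunting-yard) formulation, and it only handles single-digit numbers and + - * /; all names are made up for this example.

#include <stdio.h>
#include <stdlib.h>

typedef struct Tree {
    char op;                  /* operator, or 0 for a number leaf */
    int value;                /* used only when op == 0 */
    struct Tree *left, *right;
} Tree;

static const char *p;         /* cursor into the input string */

static Tree *leaf(int v) {
    Tree *t = calloc(1, sizeof *t); t->value = v; return t;
}

static Tree *node(char op, Tree *l, Tree *r) {
    Tree *t = calloc(1, sizeof *t); t->op = op; t->left = l; t->right = r; return t;
}

static void skip_spaces(void) { while (*p == ' ') p++; }

static Tree *parse_factor(void) {        /* a single digit */
    skip_spaces();
    return leaf(*p++ - '0');
}

static Tree *parse_term(void) {          /* factors joined by * and / */
    Tree *t = parse_factor();
    skip_spaces();
    while (*p == '*' || *p == '/') {
        char op = *p++;
        t = node(op, t, parse_factor()); /* higher-priority operators bind here */
        skip_spaces();
    }
    return t;
}

static Tree *parse_expr(void) {          /* terms joined by + and - */
    Tree *t = parse_term();
    while (*p == '+' || *p == '-') {
        char op = *p++;
        t = node(op, t, parse_term());
    }
    return t;
}

int main(void) {
    p = "5 * 4 + 2 / 3";
    Tree *t = parse_expr();
    printf("root operator: %c\n", t->op); /* prints '+': 5*4 is its left subtree, 2/3 its right */
    return 0;
}

Because * and / are handled one level deeper than + and -, the higher-priority operators end up lower in the tree; for 5 * 4 + 2 / 3 the root is the +, which is what stops the tree from simply growing to one side.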
I have an array of elements whose key is a regex.
I would like to come up with a fast algorithm that, given a string (not a regex), will return in less than O(N) time the matching array values, based on execution of the key regexes.
Currently I do a linear scan of the array: for each element I execute the respective regex with the POSIX regexec API, but this means that to find the matching elements I have to search across the whole array.
I understand that if the array were composed of only simple strings as keys, I could have kept it ordered and used a bsearch-style API, but with regexes it doesn't look so easy.
Am I missing something here?
Example follows
// this is mainly to be considered
// as pseudocode
typedef struct {
    regex_t r;
    ... some other data
} value;

const char *key = "some/key";
value my_array[1024];
bool my_matches[1024];

for (int i = 0; i < 1024; ++i) {
    if (!regexec(&my_array[i].r, key, 0, 0, REG_EXTENDED))
        my_matches[i] = 1;
    else
        my_matches[i] = 0;
}
But the above, as you can see, is linear.
Thanks
Addendum:
I've put together a simple executable which runs the above algorithm and the approach proposed in the answer below, where, from one large regex, it builds a binary tree of sub-regexes and navigates it to find all the matches.
Source code is here (GPLv3): http://qpsnr.youlink.org/data/regex_search.cpp
Compile with: g++ -O3 -o regex_search ./regex_search.cpp -lrt
And run with: ./regex_search "a/b" (or use --help flag for options)
Interestingly (and I would say as expected), when searching in the tree, fewer regexes have to be executed, but each one is far more complex to run, so in the end the time it takes balances out with the linear scan of the vectors. Results are printed on std::cerr so you can see that they are the same.
When running with long strings and/or many tokens, watch out for memory usage; be ready to hit Ctrl-C to stop it before it brings your system down.
This is possible but I think you would need to write your own regex library to achieve it.
Since you're using posix regexen, I'm going to assume that you intend to actually use regular expressions, as opposed to the random collection of computational features which modern regex libraries tend to implement. Regular expressions are closed under union (and many other operations), so you can construct a single regular expression from your array of regular expressions.
Every regular expression can be recognized by a DFA (deterministic finite-state automaton), and a DFA -- regardless of how complex -- recognizes (or fails to recognize) a string in time linear to the length of the string. Given a set of DFAs, you can construct a union DFA which recognizes the languages of all DFAs, and furthermore (with a slight modification of what it means for a DFA to accept a string), you can recover the information about which subset of the DFAs matched the string.
I'm going to try to use the same terminology as the Wikipedia article on DFAs. Let's suppose we have a set of DFAs M = {M1...Mn} which share a single alphabet Σ. So we have:
Mi = (Qi, Σ, δi, qi0, Fi) where Qi = {qij} for 0 ≤ j < |Qi|, and Fi ⊆ Qi.
We construct the union-DFA M⋃ = (Q⋃, Σ, δ⋃, q⋃0) (yes, no F; I'll get to that) as follows:
q⋃0 = <q10,...,qn0>
δ⋃(<q1j1,...,qnjn>, α) = <δ1(q1j1, α),... , δn(qnjn, α)> for each α ∈ Σ
Q⋃ consists of all states reachable through δ⋃ starting from q⋃0.
We can compute this using a standard closure algorithm in time proportional to the product of the sizes of the δi transition functions.
Now to do a union match on a string α1...αm, we run the union DFA in the usual fashion, starting in its start state and applying its transition function to each α in turn. Once we've read the last symbol in the string, the DFA will be in some state <q1j1,...,qnjn>. From that state, we can extract the set of Mi which would have matched the string as: {Mi | qiji ∈ Fi}.
In order for this to work, we need the individual DFAs to be complete (i.e., they have a transition from every state on every symbol). Some DFA construction algorithms produce DFAs which are lacking transitions on some symbols (indicating that no string with that prefix is in the language); such DFAs must be augmented with a non-accepting "sink" state which has a transition to itself on every symbol.
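As a rough illustration in code (every name here is made up, and each component DFA is assumed to have already been compiled into a complete transition table): instead of materializing Q⋃ up front, this sketch simply runs the component DFAs in lockstep, which computes δ⋃ on the fly; precomputing the product table with the closure algorithm above is the optimization over this.

#include <stdbool.h>
#include <stddef.h>

#define ALPHABET 256

typedef struct {
    int n_states;
    const int *delta;      /* complete table: delta[state * ALPHABET + symbol] */
    const bool *accepting; /* accepting[state] is true for final states */
} Dfa;

/* Run all n DFAs over `input`; matched[i] is set iff DFA i accepts the whole string. */
void union_match(const Dfa *dfas, size_t n, const unsigned char *input, bool *matched) {
    int state[n];                        /* the tuple <q1,...,qn>; every DFA starts in state 0 */
    for (size_t i = 0; i < n; i++)
        state[i] = 0;

    for (const unsigned char *c = input; *c; c++)       /* one pass over the string */
        for (size_t i = 0; i < n; i++)                  /* delta_union, componentwise */
            state[i] = dfas[i].delta[state[i] * ALPHABET + *c];

    for (size_t i = 0; i < n; i++)
        matched[i] = dfas[i].accepting[state[i]];       /* {Mi | qi in Fi} */
}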
I don't know of any regex library which exposes its DFAs sufficiently to implement the above algorithm, but it's not too much work to write a simple regex library which does not attempt to implement any non-regular features. You might also be able to find a DFA library.
Constructing a DFA from a regular expression is potentially exponential in the size of the expression, although such cases are rare. (The non-deterministic FA can be constructed in linear time, but in some cases, the powerset construction on the NFA will require exponential time and space. See the Wikipedia article.) Once the DFAs are constructed, however, the union FA can be constructed in time proportional to the product of the sizes of the DFAs.
So it should be easy enough to allow dynamic modification to the set of regular expressions, by compiling each regex to a DFA once, and maintaining the set of DFAs. When the set of regular expressions changes, it is only necessary to regenerate the union DFA.
Hope that all helps.
I'm having issues 'describing each step' when creating an NFA from a regular expression. The question is as follows:
Convert the following regular expression to a non-deterministic finite-state automaton (NFA), clearly describing the steps of the algorithm that you use:
(b|a)*b(a|b)
I've made a simple 3-state machine but it's very much from intuition.
This is a question from a past exam written by my lecturer, who also wrote the following explanation of Thompson's algorithm: http://www.cs.may.ie/staff/jpower/Courses/Previous/parsing/node5.html
Can anyone clear up how to 'describe each step clearly'? It just seems like a set of basic rules rather than an algorithm with steps to follow.
Maybe there's an algorithm I've glossed over somewhere but so far I've just created them with an educated guess.
Short version of the general approach.
There's an algorithm out there called the Thompson-McNaughton-Yamada construction, or sometimes just "Thompson construction." You build intermediate NFAs, filling in the pieces along the way, while respecting operator precedence: first parentheses, then Kleene star (e.g., a*), then concatenation (e.g., ab), followed by alternation (e.g., a|b).
Here's an in-depth walkthrough for building (b|a)*b(a|b)'s NFA
Building the top level
Handle parentheses. Note: in an actual implementation, it can make sense to handle parentheses via a recursive call on their contents. For the sake of clarity, I'll defer evaluation of anything inside of parens.
Kleene Stars: only one * there, so we build a placeholder Kleene Star machine called P (which will later contain b|a).
Intermediate result:
Concatenation: Attach P to b, and attach b to a placeholder machine called Q (which will contain a|b). Intermediate result:
There's no alternation outside of parentheses, so we skip it.
Now we're sitting on a P*bQ machine. (Note that our placeholders P and Q are just concatenation machines.) We replace the P edge with the NFA for b|a, and replace the Q edge with the NFA for a|b via recursive application of the above steps.
Building P
Skip. No parens.
Skip. No Kleene stars.
Skip. No concatenation.
Build the alternation machine for b|a. Intermediate result:
Integrating P
Next, we go back to that P*bQ machine and we tear out the P edge. We have the source of the P edge serve as the starting state for the P machine, and the destination of the P edge serve as the destination state for the P machine. We also make that state reject (take away its property of being an accept state). The result looks like this:
Building Q
Skip. No parens.
Skip. No Kleene stars.
Skip. No concatenation.
Build the alternation machine for a|b. Incidentally, alternation is commutative, so a|b is logically equivalent to b|a. (Read: skipping this minor footnote diagram out of laziness.)
Integrating Q
We do what we did with P above, except replacing the Q edge with the intermediate b|a machine we constructed. This is the result:
Tada! Er, I mean, QED.
Want to know more?
All the images above were generated using an online tool for automatically converting regular expressions to non-deterministic finite automata. You can find its source code for the Thompson-McNaughton-Yamada Construction algorithm online.
The algorithm is also addressed in Aho's Compilers: Principles, Techniques, and Tools, though its explanation is sparse on implementation details. You can also learn from an implementation of the Thompson construction in C by the excellent Russ Cox, who described it in some detail in a popular article about regular expression matching.
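If you want something more concrete than the diagrams, here is a minimal sketch in C of the textbook fragment constructors (each fragment has exactly one start and one accept state, glued together with epsilon edges). It is much simpler than Cox's code, and every name in it is made up for this example.

#include <stdlib.h>

typedef struct State State;
struct State {
    int symbol;              /* 0 means no labelled edge out of this state */
    State *next;             /* target of the labelled edge, if any */
    State *eps1, *eps2;      /* up to two epsilon edges */
};

typedef struct { State *start, *accept; } Frag;

static State *new_state(void) { return calloc(1, sizeof(State)); }

/* single character c: start --c--> accept */
Frag frag_literal(int c) {
    Frag f = { new_state(), new_state() };
    f.start->symbol = c;
    f.start->next = f.accept;
    return f;
}

/* concatenation ab: accept of a gets an epsilon edge into start of b */
Frag frag_concat(Frag a, Frag b) {
    a.accept->eps1 = b.start;
    return (Frag){ a.start, b.accept };
}

/* alternation a|b: new start and accept states wired with epsilon edges */
Frag frag_alt(Frag a, Frag b) {
    Frag f = { new_state(), new_state() };
    f.start->eps1 = a.start;   f.start->eps2 = b.start;
    a.accept->eps1 = f.accept; b.accept->eps1 = f.accept;
    return f;
}

/* Kleene star a*: allow skipping a entirely, or looping back through it */
Frag frag_star(Frag a) {
    Frag f = { new_state(), new_state() };
    f.start->eps1 = a.start;   f.start->eps2 = f.accept;
    a.accept->eps1 = a.start;  a.accept->eps2 = f.accept;
    return f;
}

With those pieces, (b|a)*b(a|b) is just frag_concat(frag_concat(frag_star(frag_alt(frag_literal('b'), frag_literal('a'))), frag_literal('b')), frag_alt(frag_literal('a'), frag_literal('b'))), one constructor call per step of the walkthrough above.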
In the GitHub repository below, you can find a Java implementation of Thompson's construction where first an NFA is being created from the regex and then an input string is being matched against that NFA:
https://github.com/meghdadFar/regex
https://github.com/White-White/RegSwift
No more tedious words. Check out this repo: it translates your regular expression to an NFA and visually shows you the state transitions of the NFA.
We have many documents that consist of words.
What is the most appropriate way to index the documents?
A search query should support the AND/OR operations.
The query runtime should be as efficient as possible.
Please describe the space required for the index.
The documents contain words only (excluding AND/OR), and the query contains words and the keywords AND/OR.
EDIT: what would the algorithm be if I allowed only 2 keywords and an operation
(e.g. w1 AND w2)
The basic data structure needed is an inverted index. This maps each word to the set of documents that contain it. Let's say lookup is a function from words to document sets: lookup: W -> Pos(D) (where W is the set of words, D the set of documents, and Pos(D) the power set of D).
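To make lookup concrete, here is a minimal sketch in C; it uses a linear-scan array of posting lists instead of a real hash map purely to keep it short, and every name in it is illustrative.

#include <string.h>

#define MAX_WORDS    1000
#define MAX_POSTINGS 1000

typedef struct {
    const char *word;
    int docs[MAX_POSTINGS];   /* ids of the documents containing this word */
    int ndocs;
} Posting;

static Posting inv_index[MAX_WORDS];
static int nwords;

/* Record that document `doc` contains `word` (assumes each word/doc pair is added once). */
void index_add(const char *word, int doc) {
    for (int i = 0; i < nwords; i++)
        if (strcmp(inv_index[i].word, word) == 0) {
            inv_index[i].docs[inv_index[i].ndocs++] = doc;
            return;
        }
    inv_index[nwords].word = word;
    inv_index[nwords].docs[0] = doc;
    inv_index[nwords].ndocs = 1;
    nwords++;
}

/* lookup(word): the set of documents containing the word, or NULL if unknown. */
const Posting *lookup(const char *word) {
    for (int i = 0; i < nwords; i++)
        if (strcmp(inv_index[i].word, word) == 0)
            return &inv_index[i];
    return NULL;
}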
Let's say you have a query expression of the form:
Expr ::= Word | AndExpression | OrExpression
AndExpression ::= Expr 'AND' Expr
OrExpression ::= Expr 'OR' Expr
So you get an abstract syntax tree representing your query. That's a tree with the following kinds of nodes:
abstract class Expression { }
class Word extends Expression {
String word
}
class AndExpression extends Expression {
Expression left
Expression right
}
class OrExpression extends Expression {
Expression left
Expression right
}
For example, foo AND (bar OR baz) would be translated to this tree:
AndExpression
/ \
/ \
Word('foo') OrExpression
/ \
/ \
Word('bar') Word('baz')
To evaluate this tree, follow these simple rules, expressed in pseudocode:
Set<Document> evaluate(Expr e) {
if (e is Word)
return lookup(e.word)
else if (e is AndExpression)
return intersection(evaluate(e.left), evaluate(e.right))
else if (e is OrExpression)
return union(evaluate(e.left), evaluate(e.right))
//otherwise, throw assertion error: no other case remaining
}
//implemented by the inverted index, not shown
Set<Document> lookup(String word)
Thus, AND expressions are basically translated into set intersections, while OR expressions are translated into set unions, all evaluated recursively. I'm sure if you stare at the above long enough, you'll see its beauty :)
You could represent each set (that lookup returns) as a HashSet. If you use Java, you could also use Guava's lazy union and intersection implementations; that should be fun (especially if you study the code or use your imagination to see what 'lazy' really means in this context).
To the best of my knowledge, though, intersections are rarely computed by intersecting hashtables. Instead, what usually happens is the following: assume there are 3 sets to be intersected; we pick one (preferably the smallest) and assign a counter (equal to 1) to each of its documents. We then iterate over the other sets, incrementing the counter of each document we find. Finally, we report each document whose counter becomes 3 (which means the document appeared in all sets, and thus exists in their intersection).
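A minimal sketch of that counting idea in C (with a plain array indexed by document id standing in for the per-document counters, just to keep it short; counting occurrences across all lists reports the same documents as seeding the counters from the smallest list):

#include <stdio.h>
#include <stddef.h>

#define MAX_DOC_ID 1000

/* Report every document id that occurs in all k posting lists. */
void intersect_report(const int *sets[], const size_t lens[], size_t k) {
    static int counter[MAX_DOC_ID + 1];              /* counter[doc] = how many lists contain doc */
    for (int d = 0; d <= MAX_DOC_ID; d++)
        counter[d] = 0;

    for (size_t s = 0; s < k; s++)                   /* walk every posting list once */
        for (size_t j = 0; j < lens[s]; j++)
            counter[sets[s][j]]++;                   /* assumes no duplicates within a list */

    for (int d = 0; d <= MAX_DOC_ID; d++)
        if (counter[d] == (int)k)                    /* seen in every set => in the intersection */
            printf("doc %d\n", d);
}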
In case only 2 keywords are allowed (e.g. "key1 AND key2")
This is the solution I found so far:
1) keyMap, a HashMap, where the key is the keyword and the value is a LinkedList of documents.
2) docMap, a HashMap, where the key is the document id and the value is a HashSet of keywords.
Now on such a query ("key1 AND key2") I would:
LinkedList<Document> docs = keyMap.get(key1);
for each (Document doc : docs)
    if (docMap.get(doc).contains(key2))
        result.add(doc);
return result;
What do you think?
Is there any better way?
How about 3 keywords?
Google Desktop springs to mind, but all the major operating systems have similar features built in or easily added. I'd describe the space used by Spotlight on my Mac as 'reasonable'.
With many documents, a dedicated package like Lucene becomes very attractive.
As #Wrikken said, use Lucene.
As you are interested in the algorithms used: there are many, and you can find a starting point here and more information here.