I'm wondering how $ variables work with non-tokens, like blocks of code. And my question can be reduced to this:
I have a rule like this, with a block of code in the middle of it. In this case, what are $3 and $4?
func-header: ret-type ID { strcpy(func_id,current_id); } LPAREN params RPAREN
Mid-rule actions (MRA) are implemented as non-terminals which match an empty sequence. (Such non-terminals are sometimes called "markers".) The mid-rule action is the semantic action of the generated non-terminal.
Like any non-terminal, these automatically-generated markers have a semantic value, which is set by assigning to $$ inside the action. However, the numbering of $n inside an MRA differs slightly from the numbering in normal actions. Inside an MRA, each n in $n is translated to a negative index, representing values on the top of the stack when the marker is reduced, by subtracting the MRA's own index.
Negative indices are always allowed by yacc/bison, but as the manual states they are quite dangerous and should only be used if you can prove that an appropriately typed value is necessarily at the indicated point on the stack. In the case of automatically-generated markers, yacc/bison can prove this because the marker is only used in a single production and the generated negative indices always fall into the part of the stack occupied by the right-hand side containing the MRA.
In the rule shown:
ret-type is $1.
ID is $2.
The code block is $3.
LPAREN is $4.
params is $5.
RPAREN is $6.
In other words, code blocks act as non-terminals.
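To make that concrete, Bison conceptually rewrites the rule above using a generated marker (the name $@1 below is only a placeholder for whatever internal name the generator picks):

$@1 : /* empty */ { strcpy(func_id, current_id); /* here $2 would read ID's value */ } ;

func-header : ret-type ID $@1 LPAREN params RPAREN
              { /* here $3 is the marker's value, i.e. whatever the MRA assigned to $$ */ } ;

This is only a sketch of the transformation, not the exact generated grammar.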
I have a calculation inside a case statement structured like this:
SQRT(POWER()+POWER()+...x84)
That's 84 power functions. I keep getting the error:
Internal error: An expression services limit has been reached. Please look for potentially complex expressions in your query, and try to simplify them.
I understand the limit of identifiers in an expression is 65,535, but mine seemingly doesn't even come close to that. Do the functions change the effective number of identifiers in my expression?
From some web service, I receive this:
"time":"0.301*0.869*1.387*2.93*3.653*3.956*4.344*6.268*6.805*7.712*9.099*9.784*11.071*11.921*13.347*14.253*14.965*16.313*16.563*17.426*17.62*18.114"
I want to separate the numbers and insert them into a table like this. How?
0.301
0.869
1.387
2.93
3.653
3.956
4.344
6.268
6.805
7.712
9.099
9.784
11.071
11.921
13.347
14.253
14.965
16.313
16.563
17.426
17.62
18.114
A little string-matching should get the job done:
local str = [["time":"0.301*0.869*1.387*2.93*3.653*3.956*4.344*6.268*6.805*7.712*9.099*9.784*11.071*11.921*13.347*14.253*14.965*16.313*16.563*17.426*17.62*18.114"]]
local list = {}
for num in str:gmatch("%**(%d+%.%d+)") do
table.insert(list, tonumber(num))
end
A Little Explanation
I'll first briefly summarize what some of the symbols here are:
%d means to look for a digit.
%. means to look specifically for a period.
+ means to look for 1 or more of the item that comes right before it.
%* means to look specifically for a star.
* when the percent sign isn't in front, this means to match 0 or more of the item that comes right before it.
Now, let's put this together to look at it from the start:
%** This means that we want the string to start with a star, but that it is optional. The reason we need it to be optional is that the first number you wanted does not have a star in front of it.
%d+ means to look for a sequence of digit(s) until something else pops up. In our case, this would be like the '18' in '18.114' or the '1' in '1.387'
%. as I said means we want the next thing found to be a period.
%d+ means we want another sequence of digit(s). Such as the 114 in 18.114
So, what do the parentheses mean? They mark the capture: only the part inside the parentheses is returned when the pattern matches, so the leading star is matched but not kept.
We're creating a very simple programming language, using Flex and Bison for parsing and syntax analysis, and using C to build the compiler.
Before going straight to assembly, we're creating an abstract syntax tree from the language rules. But we're having trouble representing one specific function from the language.
The function is described as follows:
FILTERC: It takes a condition and an expression list as input and it returns how many of those expressions match the condition. The condition can be a single or compound condition.
It is used in this form: FILTERC (condition, [expression list])
The condition has to have an underscore before each element, representing where the expressions should be placed for comparison. Example: FILTERC ( _>4 and _<=6.5 , [a+c,b,c-a,d])
This is how the "filterc" function is expressed in BNF rules (we actually used tokens with Flex, but I simplified it with the actual characters since that's not the point and the syntax analysis is correctly done by Bison):
filter ::= FILTERC ( condition_filter , [ expression_list ] )
;
condition_filter ::= comparison_filter | comparison_filter AND comparison_filter | comparison_filter OR comparison_filter
;
comparison_filter ::= _ > expression | _ < expression | _ == expression | _ >= expression | _ <= expression | _ != expression
;
expression_list ::= expression | expression , expression_list
;
expression: term | expression + term | expression - term
;
term: factor | term * factor | term / factor
;
factor: ID | INT_LITERAL | REAL_LITERAL | STRING_LITERAL | ( expression ) | filter
;
We now have to write functions that create the nodes of the abstract syntax tree. At a low level, the "filterc" function is nothing but a bunch of "IF"s that verify that each one of the expressions matches the condition, only now the expressions are placed where the underscore is. So it would be something like: (expression) (comparison operator) (condition)
The thing is, the actual FILTERC sentence is read "backwards": the expressions are read first and then compared to the condition. But the program is read sequentially: the underscore is read before the actual expression is found. So we're really confused as to how to build the tree.
I'm not going to add all the code we use to create the nodes and leaves of the tree or this would be a total mess. But basically, there is a function that creates nodes with 2 children (left and right) and when there shouldn't be any children, those pointers are set to null. The basic structure we use is to place the operator in the root node and the operands as the children (e.g.: in an "if" sentence, the "if" keyword should be the root, the condition would be the left child and the code block to execute if true would be the right child). Like this:
IF condition THEN block {thenPtr = blockPtr;} ENDIF {createNode("if", conditionPtr, thenPtr);}
("condition" and "block" are defined elsewhere, where their pointers are created).
We were able to successfully create the tree for the expression regex and for all the other rules in the language, but this "filter" function is really confusing.
It is true that when the parser reads a piece of an expression (e.g., the ">"), it hasn't got enough to build the tree for that expression. The same is true for any concept ("nonterminal") in your language, and from that perspective I can see why you are confused.
Apparently you don't understand how LR parsers like Bison work. Assume we have rules R1, R2, ..., each with a right-hand side, e.g., Rn = T1 T2 T3 ;, and each rule Rn having a right-hand-side length L(Rn).
The key idea you need is that an LR parser collects ("stacks"; yes, it really uses a stack of tokens) tokens from the input stream, left to right. These steps are called "shifts". The parser shifts repeatedly, continually looking for situations which indicate that enough tokens have been read (e.g., T1, T2, then T3) to satisfy the right-hand side of some grammar rule Rn. The magic of the parser generator and the LR tables it produces allows the parser to keep efficient track of all the "live" rules at once; we're not going to discuss that further here.
At the point where a right-hand side has been recognized, the LR parser performs a "reduce" action and replaces the stacked tokens that match the body of the rule with the nonterminal token Rn ("pops the stack L(Rn) times and then pushes Rn"). It does as many reductions as it can before returning to collecting terminal tokens from the input stream. It is worth your trouble to simulate this process by hand on a really tiny grammar. [A subtle detail: some rules have empty right-hand sides, i.e., L(Rn) == 0; in that case, when a reduction happens, zero pops occur. Yes, that sounds funny, but it is dead right.]
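As a tiny illustration (a hand simulation, not tied to any particular parser's tables), take the grammar E ::= E + T | T ; T ::= NUM ; and the input NUM + NUM:

shift NUM              stack: NUM
reduce T ::= NUM       stack: T
reduce E ::= T         stack: E
shift +                stack: E +
shift NUM              stack: E + NUM
reduce T ::= NUM       stack: E + T
reduce E ::= E + T     stack: E        (then accept)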
At every point where the parser does a reduce action, it offers you, the parser programmer, an opportunity to do some additional work. That additional work is almost invariably "tree building". Clearly, the tokens that make up the rule Rn have all been seen, so one can build a tree representing Rn if the tokens were all terminals. If Rn contains some nonterminals, then there must already have been reduce actions to produce each of those nonterminals; if each of them produced a tree representing itself, then when the rule containing them is reduced, their trees are already available and can be combined to produce the tree for the current rule.
LR parser generator tools like Bison help you with this, usually by providing tree-building operators that you can invoke in a reduce action. They also help by making the trees for already-processed nonterminals available to your reduce action, so it can combine them to produce the tree for the reduction. (This is done by keeping track of the generated trees in a stack parallel to the token stack.) At no point does the parser attempt a reduction, or do you ever attempt to produce a tree, where you don't have all the subtrees needed.
I think you need to read the Bison manual carefully and all of this will become clear as you attempt to implement the parser and the reductions; the manual has good examples. It is clear that you haven't done that yet (fear of not knowing how to handle the trees?), because a) your rules as expressed are broken (there's no way to generate a term), and b) you don't have any embedded reduce actions.
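To give a concrete starting point, here is a hedged sketch of what embedded reduce actions for the filter construct might look like, reusing the createNode helper from the question (token names such as UNDERSCORE, GT, LEQ, COMMA are assumptions standing in for whatever Flex returns). The underscore side of each comparison is stored as a null child, so a later pass or the code generator can plug each expression from the list into that slot:

comparison_filter : UNDERSCORE GT expression        { $$ = createNode(">", NULL, $3); }
                  | UNDERSCORE LEQ expression       { $$ = createNode("<=", NULL, $3); }
                  /* ... one alternative per comparison operator ... */
                  ;

condition_filter  : comparison_filter                         { $$ = $1; }
                  | comparison_filter AND comparison_filter   { $$ = createNode("and", $1, $3); }
                  | comparison_filter OR comparison_filter    { $$ = createNode("or", $1, $3); }
                  ;

expression_list   : expression                       { $$ = createNode("list", $1, NULL); }
                  | expression COMMA expression_list { $$ = createNode("list", $1, $3); }
                  ;

filter            : FILTERC LPAREN condition_filter COMMA LBRACKET expression_list RBRACKET RPAREN
                    { $$ = createNode("filterc", $3, $6); }
                  ;

Each reduce action only combines subtrees that already exist, which is exactly the property described above; the "backwards" feel goes away because the condition subtree and the expression-list subtree are simply stored, and only walked (or expanded into IFs) after the whole filter rule has been reduced.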
I have an array of elements whose key is a regex.
I would like to come up with a fast algorithm that, given a string (not a regex), will return in less than O(N) time the matching array values, based on execution of the key regexes.
Currently I do a linear scan of the array, for each element I execute the respective regex with posix regexec API, but this means that to find the matching elements I have to search across the whole array.
I understand that if the array were composed of only simple strings as keys, I could have kept it ordered and used a bsearch-style API, but with regexes it looks like it is not so easy.
Am I missing something here?
Example follows
// this is mainly to be considered
// as pseudocode
typedef struct {
regex_t r;
... some other data
} value;
const char *key = "some/key";
value my_array[1024];
bool my_matches[1024];
for(int i =0; i < 1024; ++i) {
if(!regexec(&my_array[i].r, key, 0, 0, REG_EXTENDED))
my_matches[i] = 1;
else
my_matches[i] = 0;
}
But the above, as you can see, is linear.
Thanks
Addendum:
I've put together a simple executable which executes the above algorithm and something proposed in the answer below, where, from a large regex, it builds a binary tree of sub-regexes and navigates it to find all the matches.
Source code is here (GPLv3): http://qpsnr.youlink.org/data/regex_search.cpp
Compile with: g++ -O3 -o regex_search ./regex_search.cpp -lrt
And run with: ./regex_search "a/b" (or use --help flag for options)
Interestingly (and I would say as expected), when searching in the tree, fewer regexes have to be executed, but each one is far more complex to run, so eventually the time it takes balances out against the linear scan of the vector. Results are printed on std::cerr so you can see that they are the same.
When running with long strings and/or many tokens, watch out for memory usage; be ready to hit Ctrl-C to stop it before it brings your system down.
This is possible but I think you would need to write your own regex library to achieve it.
Since you're using posix regexen, I'm going to assume that you intend to actually use regular expressions, as opposed to the random collection of computational features which modern regex libraries tend to implement. Regular expressions are closed under union (and many other operations), so you can construct a single regular expression from your array of regular expressions.
Every regular expression can be recognized by a DFA (deterministic finite-state automaton), and a DFA -- regardless of how complex -- recognizes (or fails to recognize) a string in time linear to the length of the string. Given a set of DFAs, you can construct a union DFA which recognizes the languages of all DFAs, and furthermore (with a slight modification of what it means for a DFA to accept a string), you can recover the information about which subset of the DFAs matched the string.
I'm going to try to use the same terminology as the Wikipedia article on DFAs. Let's suppose we have a set of DFAs M = {M1...Mn} which share a single alphabet Σ. So we have:
Mi = (Qi, Σ, δi, qi0, Fi) where Qi = {qij} for 0 ≤ j < |Qi|, and Fi ⊆ Qi.
We construct the union-DFA M⋃ = (Q⋃, Σ, δ⋃, q⋃0) (yes, no F; I'll get to that) as follows:
q⋃0 = <q10,...,qn0>
δ⋃(<q1j1,...,qnjn>, α) = <δ1(q1j1, α),... , δn(qnjn, α)> for each α ∈ Σ
Q⋃ consists of all states reachable through δ⋃ starting from q⋃0.
We can compute this using a standard closure algorithm in time proportional to the product of the sizes of the δi transition functions.
Now to do a union match on a string α1...αm, we run the union DFA in the usual fashion, starting in its start state and applying its transition function to each α in turn. Once we've read the last symbol in the string, the DFA will be in some state <q1j1,...,qnjn>. From that state, we can extract the set of Mi which would have matched the string as: {Mi | qiji ∈ Fi}.
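A minimal sketch of just that matching step in C, assuming the union DFA has already been built as a flat transition table over bytes and that each union state records which component states are accepting (all type and field names here are illustrative, not from any existing library):

#include <stdbool.h>
#include <string.h>

#define NSYM 256                 /* alphabet: raw bytes */

typedef struct {
    int nstates;                 /* number of union (product) states */
    int nmachines;               /* number of original DFAs M1..Mn */
    int *delta;                  /* nstates x NSYM next-state table */
    bool *accept;                /* nstates x nmachines: is component i accepting here? */
} union_dfa;

/* Run the union DFA over key once; matches[i] becomes true iff Mi accepts key. */
static void union_match(const union_dfa *u, const char *key, bool *matches)
{
    int s = 0;                                        /* the start state */
    for (const unsigned char *p = (const unsigned char *)key; *p; ++p)
        s = u->delta[s * NSYM + *p];                  /* one table lookup per input byte */
    memcpy(matches, u->accept + (size_t)s * u->nmachines,
           u->nmachines * sizeof(bool));
}

The point of the construction is visible here: a query costs one transition per input byte, independent of how many regexes are in the array.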
In order for this to work, we need the individual DFAs to be complete (i.e., they have a transition from every state on every symbol). Some DFA construction algorithms produce DFAs which are lacking transitions on some symbols (indicating that no string with that prefix is in the language); such DFAs must be augmented with a non-accepting "sink" state which has a transition to itself on every symbol.
I don't know of any regex library which exposes its DFAs sufficiently to implement the above algorithm, but it's not too much work to write a simple regex library which does not attempt to implement any non-regular features. You might also be able to find a DFA library.
Constructing a DFA from a regular expression is potentially exponential in the size of the expression, although such cases are rare. (The non-deterministic FA can be constructed in linear time, but in some cases, the powerset construction on the NFA will require exponential time and space. See the Wikipedia article.) Once the DFAs are constructed, however, the union FA can be constructed in time proportional to the product of the sizes of the DFAs.
So it should be easy enough to allow dynamic modification to the set of regular expressions, by compiling each regex to a DFA once, and maintaining the set of DFAs. When the set of regular expressions changes, it is only necessary to regenerate the union DFA.
Hope that all helps.
We have many documents that consist of words.
What is the most appropriate way to index the documents?
A search query should support the AND/OR operations.
The query runtime should be as efficient as possible.
Please describe the space required for the index.
The documents contain words only (excluding AND/OR) and the query contains words and the keywords AND/OR.
EDIT: What would the algorithm be if I allowed only 2 keywords and one operation
(e.g. w1 AND w2)
The basic data structure needed is an inverted index. This maps each word to the set of documents that contain it. Let's say lookup is a function from words to document sets: lookup: W -> Pos(D) (where W is the set of words, D the set of documents, and Pos(D) the power set of D).
Let's say you have a query expression of the form:
Expr ::= Word | AndExpression | OrExpression
AndExpression ::= Expr 'AND' Expr
OrExpression ::= Expr 'OR' Expr
So you get an abstract syntax tree representing your query. That's a tree with the following kinds of nodes:
abstract class Expression { }
class Word extends Expression {
String word
}
class AndExpression extends Expression {
Expression left
Expression right
}
class OrExpression extends Expression {
Expression left
Expression right
}
For example, foo AND (bar OR baz) would be translated to this tree:
AndExpression
/ \
/ \
Word('foo') OrExpression
/ \
/ \
Word('bar') Word('baz')
To evaluate this tree, follow these simple rules, expressed in pseudocode:
Set<Document> evaluate(Expr e) {
if (e is Word)
return lookup(e.word)
else if (e is AndExpression)
return intersection(evaluate(e.left), evaluate(e.right))
else if (e is OrExpression)
return union(evaluate(e.left), evaluate(e.right))
//otherwise, throw assertion error: no other case remaining
}
//implemented by the inverted index, not shown
Set<Document> lookup(String word)
Thus, AND expressions are basically translated to set intersections, while OR expressions are translated to set unions, all evaluated recursively. I'm sure if you stare long enough at the above, you'll see its beauty :)
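For the example tree above, this works out to intersection(lookup('foo'), union(lookup('bar'), lookup('baz'))).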
You could represent each set (that lookup returns) as a HashSet. If you use Java, you could also use Guava's lazy union and intersection implementations; that should be fun (especially if you study the code or use your imagination to see what 'lazy' really means in this context).
To the best of my knowledge, though, intersections are rarely computed by intersecting hashtables. Instead, what usually happens is the following: assume there are 3 sets to be intersected; we pick one (preferably the smallest) and assign a counter (equal to 1) to each of its documents. We then iterate the other sets, incrementing the counter of each found document. Finally, we report each document whose counter reaches 3 (which means the document appeared in all sets and thus exists in their intersection).
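Here is a minimal sketch of that counter idea in C, assuming each set has been materialized as a sorted array of integer document ids and that the first array passed in is the smallest one (all names are illustrative):

#include <stdlib.h>

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* Intersect k document-id sets, each a sorted array; lists[0] should be the
   smallest. Writes the common ids to out and returns how many there are. */
static int intersect(const int *const *lists, const int *lens, int k, int *out)
{
    int *count = malloc(lens[0] * sizeof(int));
    for (int i = 0; i < lens[0]; ++i)
        count[i] = 1;                         /* every doc in the smallest set starts at 1 */

    for (int j = 1; j < k; ++j)               /* iterate the other sets ... */
        for (int i = 0; i < lens[j]; ++i) {
            const int *hit = bsearch(&lists[j][i], lists[0], lens[0],
                                     sizeof(int), cmp_int);
            if (hit)                          /* ... incrementing the counter of each found doc */
                ++count[hit - lists[0]];
        }

    int n = 0;
    for (int i = 0; i < lens[0]; ++i)
        if (count[i] == k)                    /* appeared in all k sets */
            out[n++] = lists[0][i];

    free(count);
    return n;
}

With sorted arrays, a k-way merge walk over all the lists is another common way to do this; the version above just mirrors the counter description.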
In case only 2 keywords are allowed (e.g. "key1 AND key2")
This is the solution I found so far:
1) keyMap, a HashMap where the key is a keyword and the value is a LinkedList of the documents that contain it.
2) docMap, a HashMap where the key is the document id and the value is a HashSet of that document's keywords.
Now on such a query ("key1 AND key2") I would:
LinkedList<Document> docs = keyMap.get(key1);
for (Document doc : docs)
    if (docMap.get(doc).contains(key2))
        result.add(doc);
return result;
What do you think?
Is there any better way?
How about 3 keywords?
Google Desktop springs to mind, but all the major OSes have similar features built in or easily added. I'd describe the space used by Spotlight on my Mac as 'reasonable'.
With many documents, a dedicated package like Lucene becomes very attractive.
As @Wrikken said, use Lucene.
As you are interested in the algorithms used: there are many; you can find a starting point here and more information here.