Index many documents to enable queries that support AND/OR operations - database

We have many documents that consist of words.
What is the most appropriate way to index the documents?
A search query should support the AND/OR operations.
The query runtime should be as efficient as possible.
Please describe the space required for the index.
The documents contain only words (no AND/OR); the query contains words plus the keywords AND/OR.
EDIT: What would the algorithm be if I allowed only 2 keywords and one operation
(e.g. w1 AND w2)?

The basic data structure needed is an inverted index. This maps each word to the set of documents that contain it. Let's say lookup is a function from words to document sets: lookup: W -> P(D) (where W is the set of words, D the set of documents, and P(D) the power set of D).
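For concreteness, here is a minimal sketch of such an inverted index in Java, using integer document ids for simplicity (the class and method names are just illustrative, not part of any particular library):
import java.util.*;

class InvertedIndex {
    // word -> ids of the documents containing it
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // index one document, identified by docId, containing the given words
    void add(int docId, List<String> words) {
        for (String word : words)
            postings.computeIfAbsent(word, w -> new HashSet<>()).add(docId);
    }

    // lookup: W -> P(D); unknown words map to the empty set
    Set<Integer> lookup(String word) {
        return postings.getOrDefault(word, Collections.emptySet());
    }
}
Space-wise, such an index stores roughly one posting per (word, document) pair, so it grows in proportion to the total number of word occurrences across the corpus.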
Let's say you have a query expression of the form:
Expr ::= Word | AndExpression | OrExpression
AndExpression ::= Expr 'AND' Expr
OrExpression ::= Expr 'OR' Expr
So you get an abstract syntax tree representing your query. That's a tree with the following kinds of nodes:
abstract class Expression { }
class Word extends Expression {
    String word;
}
class AndExpression extends Expression {
    Expression left;
    Expression right;
}
class OrExpression extends Expression {
    Expression left;
    Expression right;
}
For example, foo AND (bar OR baz) would be translated to this tree:
            AndExpression
           /             \
    Word('foo')       OrExpression
                     /            \
              Word('bar')     Word('baz')
To evaluate this tree, follow these simple rules, expressed in pseudocode:
Set<Document> evaluate(Expr e) {
    if (e is Word)
        return lookup(e.word)
    else if (e is AndExpression)
        return intersection(evaluate(e.left), evaluate(e.right))
    else if (e is OrExpression)
        return union(evaluate(e.left), evaluate(e.right))
    //otherwise, throw assertion error: no other case remaining
}
//implemented by the inverted index, not shown
Set<Document> lookup(String word)
Thus, AND expressions are basically translated to set intersections, while OR expressions are translated to set unions, all evaluated recursively. I'm sure if you stare at the above long enough, you'll see its beauty :)
You could represent each set (that lookup returns) as a HashSet. If you use Java, you could also use Guava's lazy union and intersection implementations; that should be fun (especially if you study the code or use your imagination to see what 'lazy' really means in this context).
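For instance, with Guava's Sets utilities (com.google.common.collect.Sets), using the lookup function from above, union and intersection return lazy views rather than copied sets; a small sketch:
Set<Document> foo = lookup("foo");
Set<Document> barOrBaz = Sets.union(lookup("bar"), lookup("baz"));   // lazy view, nothing copied
Set<Document> result = Sets.intersection(foo, barOrBaz);             // also a lazy view
// the intersection is only actually computed as you iterate over result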
To the best of my knowledge, though, intersections are rarely computed by intersecting hashtables. Instead, what usually happens is the following: assume there are 3 sets to be intersected; we pick one (preferably the smallest) and assign a counter (equal to 1) to each of its documents. We then iterate over the other sets, incrementing the counter of each document we find. Finally, we report each document whose counter has reached 3 (which means the document appeared in all sets, and thus exists in their intersection).
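A rough sketch of that counting approach (the Document type is whatever your index stores; the helper name and signature are mine):
Set<Document> intersectAll(List<Set<Document>> sets) {
    sets.sort(Comparator.comparingInt(Set::size));       // start counting from the smallest set
    Map<Document, Integer> counter = new HashMap<>();
    for (Document d : sets.get(0))
        counter.put(d, 1);
    for (int i = 1; i < sets.size(); i++)
        for (Document d : sets.get(i))
            counter.computeIfPresent(d, (doc, c) -> c + 1);   // only count docs seen in the smallest set
    Set<Document> result = new HashSet<>();
    for (Map.Entry<Document, Integer> e : counter.entrySet())
        if (e.getValue() == sets.size())                  // appeared in every set
            result.add(e.getKey());
    return result;
}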

In case only 2 keywords are allowed (e.g. "key1 AND key2")
This is the solution I found so far:
1) keyMap, a HashMap, where the key is a keyword and the value is a LinkedList of the documents containing it.
2) docMap, a HashMap, where the key is a document id and the value is a HashSet of the keywords the document contains.
Now on such a query ("key1 AND key2") I would:
LinkedList docs = keyMap.get(key1);
for each (doc : docs)
    if (docMap.get(doc.id).contains(key2))
        result.add(doc);
return result;
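For concreteness, the same idea spelled out a bit more fully in Java (Document and its id field are placeholders for whatever is actually stored):
List<Document> andQuery(String key1, String key2) {
    List<Document> result = new LinkedList<>();
    // documents containing key1
    for (Document doc : keyMap.getOrDefault(key1, new LinkedList<>())) {
        Set<String> keywords = docMap.get(doc.id);   // keywords of that document
        if (keywords != null && keywords.contains(key2))
            result.add(doc);
    }
    return result;
}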
What do you think?
Is there any better way?
How about 3 keywords?

Google Desktop springs to mind, but all the major O/S have similar features built in or easily added. I'd describe the space used by Spotlight on my Mac as 'reasonable'.

With many documents, a dedicated package like Lucene becomes very attractive.

As @Wrikken said, use Lucene.
As for the algorithms used: there are many; you can find a starting point here and more information here.

Related

Forward index vs inverted index: why?

I was reading about inverted indexes (used by text search engines like Solr, Elasticsearch, etc.), and as I understand it (taking "Person" as an example):
The attribute-to-Person relationship is inverted:
John -> PersonId(1), PersonId(2), PersonId(3)
London -> PersonId(1), PersonId(2), PersonId(5)
I can now search the person records for 'John who lives in London'
Doesn't this solve all the problems? Why do we have the forward (or regular database) index at all? In other words, in what cases is regular indexing useful? Please explain. Thanks.
The point that you're missing is that there is no real technical distinction between a forward index and an inverted index. "Forward" and "inverted" in this case are just descriptive terms to distinguish between:
A list of words contained in a document.
A list of documents containing a word.
The concept of an inverted index only makes sense if the concept of a regular (forward) index already exists. In the context of a search engine, a forward index would be the term vector; a list of terms contained within a particular document. The inverted index would be a list of documents containing a given term.
When you understand that the terms "forward" and "inverted" are really just relative terms used to describe the nature of the index you're talking about - and that really an index is just an index - your question doesn't really make sense any more.
Here's an explanation of inverted index, from Elasticsearch:
Elasticsearch uses a structure called an inverted index, which is designed to allow very fast full-text searches. An inverted index consists of a list of all the unique words that appear in any document, and for each word, a list of the documents in which it appears.
https://www.elastic.co/guide/en/elasticsearch/guide/current/inverted-index.html
Inverted indexing is for fast full-text search. Regular (forward) indexing is less efficient at query time, because the engine has to look through all the entries for a term, but it is very fast at indexing time!
You can say this:
Forward index: fast indexing, less efficient queries
Inverted index: fast query, slower indexing
But it's always context related. If you compare it with MySQL: MyISAM has fast reads, InnoDB has fast inserts/updates and slower reads.
Read more here: https://www.found.no/foundation/indexing-for-beginners-part3/
In a forward index, the input is a document and the output is the words contained in the document.
{
    doc1: [word1, word2, word3],
    doc2: [word4, word5]
}
In the reverse/inverted index, the input is a word, and the output is all the documents in which the word is contained.
{
    word1: [doc1, doc10, doc3],
    word2: [doc5, doc3]
}
Search engines make use of reverse/inverted index to get us documents from keywords.
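Building the inverted index is essentially just flipping the forward one around. A small Java sketch, using plain strings for the doc and word names from the example above (Map.of/List.of need Java 9+):
Map<String, List<String>> forward = Map.of(
        "doc1", List.of("word1", "word2", "word3"),
        "doc2", List.of("word4", "word5"));

Map<String, Set<String>> inverted = new HashMap<>();
for (Map.Entry<String, List<String>> e : forward.entrySet())
    for (String word : e.getValue())
        inverted.computeIfAbsent(word, w -> new TreeSet<>()).add(e.getKey());
// inverted.get("word2") now yields [doc1]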

Abstract Syntax Tree in compiler: how exactly to represent a function?

We're creating a very simple programming language, using Flex and Bison for parsing and syntax analysis, and using C to build the compiler.
Before going straight to assembly, we're creating an abstract syntax tree from the language rules. But we're having trouble representing one specific function from the language.
The function is described as follows:
FILTERC: it takes a condition and an expression list as input, and it returns how many of those expressions match the condition. The condition can be single or compound.
It is used in this form: FILTERC (condition, [expression list])
The condition has to have an underscore before each element, representing where the expressions should be placed for comparison. Example: FILTERC ( _>4 and _<=6.5 , [a+c,b,c-a,d])
This is how the "filterc" function is expressed in BNF rules (we actually used tokens with Flex, but I simplified it with the actual characters since that's not the point and the syntax analysis is correctly done by Bison):
filter ::= FILTERC ( condition_filter , [ expression_list ] )
;
condition_filter ::= comparison_filter | comparison_filter AND comparison_filter | comparison_filter OR comparison_filter
;
comparison_filter ::= _ > expression | _ < expression | _ == expression | _ >= expression | _ <= expression | _ != expression
;
expression_list ::= expression | expression , expression_list
;
expression: term | expression + term | expression - term
;
term: factor | term * factor | term / factor
;
factor: ID | INT_LITERAL | REAL_LITERAL | STRING_LITERAL | ( expression ) | filter
;
We now have to write functions that create the nodes of the abstract syntax tree. At a low level, the "filterc" function is nothing but a bunch of IFs verifying that each one of the expressions matches the condition, only that the expressions are now substituted where the underscore is. So each check would be something like: (expression) (comparison operator) (condition).
The thing is, the actual FILTERC sentence is read "backwards": the expressions are read first and then compared to the condition. But the program is read sequentially: the underscore is read before the actual expression is found. So we're really confused as to how to build the tree.
I'm not going to add all the code we use to create the nodes and leaves of the tree or this would be a total mess. But basically, there is a function that creates nodes with 2 children (left and right) and when there shouldn't be any children, those pointers are set to null. The basic structure we use is to place the operator in the root node and the operands as the children (e.g.: in an "if" sentence, the "if" keyword should be the root, the condition would be the left child and the code block to execute if true would be the right child). Like this:
IF condition THEN block {thenPtr = blockPtr;} ENDIF {createNode("if", conditionPtr, thenPtr);}
("condition" and "block" are defined elsewhere, where their pointers are created).
We were able to successfully create the tree for the expression regex and for all the other rules in the language, but this "filter" function is really confusing.
It is true that when the parser reads a piece of an expression (e.g., the ">"), it hasn't got enough to build the tree for the expression. The same is true for any concept ("nonterminal") in your language, and from that perspective I see how you can be confused.
Apparently you don't understand how LR parsers like Bison work. Assume we have rules R1, R2, ..., each with a right-hand side, e.g., Rn = T1 T2 T3, and each rule Rn having a right-hand-side length of L(Rn).
The key idea you need is that an LR parser collects ("stacks"; yes, it really uses a stack of tokens) tokens from the input stream, left to right. These steps are called "shifts". The parser shifts repeatedly, continually looking for situations which indicate that enough tokens have been read (e.g., T1, T2, then T3) to satisfy the right-hand side of some grammar rule Rn. The magic of the parser generator and the LR tables it produces allows the parser to keep efficient track of all the "live" rules at once; we're not going to discuss that further here.
At the point where a right-hand side has been recognized, the LR parser performs a "reduce" action and replaces the stacked tokens that match the body of the rule with the nonterminal token Rn ("pops the stack L(Rn) times and then pushes Rn"). It does as many reductions as it can before returning to collecting terminal tokens from the input stream. It is worth your trouble to simulate this process by hand on a really tiny grammar. [A subtle detail: some rules have empty right-hand sides, i.e., L(Rn) == 0; in that case, when a reduction happens, zero pops occur. Yes, that sounds funny, but it is dead right.]
At every point where the parser does a reduce action, it offers you, the parser programmer, an opportunity to do some additional work. That additional work is almost invariably "tree building". Clearly, the tokens that make up the rule Rn have all been seen, so one can build a tree representing Rn if the tokens were all terminals. In fact, if all the tokens for Rn have been seen, and Rn contains some nonterminals, then there must have been reduce actions to produce each of those nonterminals. If each of them produced a tree representing itself, then when the rule containing the nonterminals is reduced, trees have already been produced for those nonterminals, and these can be combined to produce the tree for the current rule.
LR parser generator tools like Bison help you with this, usually by providing tree-building operators that you can invoke in a reduce action. They also help by making the trees for already-processed nonterminals available to your reduce action, so it can combine them to produce the tree for the reduction. (Bison does this by keeping track of the generated trees in a stack parallel to the token stack.) At no point does it ever attempt to reduce, or do you ever attempt to produce a tree, where you don't have all the subtrees needed.
I think you need to read the Bison manual carefully, and all this will become clear as you attempt to implement the parser and the reductions; the manual has good examples. It is clear that you haven't done that (fear of not knowing how to handle the trees?), because a) your rules as expressed are broken; there's no way to generate a term, and b) you don't have any embedded reduce actions.

Short-circuit OR operator in Lucene/Solr

I understand that lucene's AND (&&), OR (||) and NOT (!) operators are shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why one can't treat them as boolean operators (adhering to boolean algebra).
I have been trying to construct a simple OR expression, as follows
q = +(field1:value1 OR field2:value2)
with the intent of matching on either field1 or field2. But since the OR merely makes the clauses optional, for documents where both field1:value1 and field2:value2 match, the query returns a score reflecting a match on both clauses.
How do I enforce short-circuiting in this context? In other words, how to implement short-circuiting as in boolean algebra where an expression A || B || C returns true if A is true without even looking into whether B or C could be true.
Strictly speaking, no, there is no short-circuiting boolean logic. If a document is found for one term, you can't simply tell it not to check for the other. Lucene is an inverted index, so it doesn't really check documents for matches directly. If you search for A OR B, it finds A and gets all the documents which have indexed that value. Then it gets B in the index, and then the list of all documents containing it (this is simplifying somewhat, but I hope it gets the point across). It doesn't really make sense for it not to check the documents in which A is found. Further, for the query provided, all the matches on a document still need to be enumerated in order to compute a correct score.
However, you did mention scores! I suspect what you are really trying to get at is that if one query term in a set is found, you don't want to compound the score with the other elements. That is, for (A OR B), the score is either score-A or score-B, rather than score-A * score-B or some such (sorry if I am making a wrong assumption here, of course).
That is what DisjunctionMaxQuery is for. Adding each subquery to it will render a score from it equal to the maximum of the scores of all subqueries, rather than a product.
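For example, in recent Lucene versions (where the DisjunctionMaxQuery constructor takes the disjuncts plus a tie-breaker multiplier; older versions used an add method instead), this might look roughly like the following, with TermQuery/Term from org.apache.lucene.search and org.apache.lucene.index:
Query q1 = new TermQuery(new Term("field1", "value1"));
Query q2 = new TermQuery(new Term("field2", "value2"));
// a tie breaker of 0.0f means the score is just that of the best matching clause,
// rather than a sum over all matching clauses
Query query = new DisjunctionMaxQuery(List.of(q1, q2), 0.0f);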
In Solr, you should learn about the DisMaxQParserPlugin and its more recent incarnation, the ExtendedDisMax, which, if I'm close to the mark here, should serve you very well.

How can I quickly do a string matching on many regex keys?

I have an array of elements whose key is a regex.
I would like to come up with a fast algorithm that, given a string (not a regex), will return in less than O(N) time the matching array values, based on executing the key regexes.
Currently I do a linear scan of the array: for each element I execute the respective regex with the POSIX regexec API, but this means that to find the matching elements I have to search across the whole array.
I understand that if the array keys were just simple strings, I could have kept it ordered and used a bsearch-style API, but with regexes it looks like that is not so easy.
Am I missing something here?
Example follows
// this is mainly to be considered
// as pseudocode
typedef struct {
    regex_t r;
    ... some other data
} value;

const char *key = "some/key";
value my_array[1024];
bool my_matches[1024];

for(int i = 0; i < 1024; ++i) {
    if(!regexec(&my_array[i].r, key, 0, 0, REG_EXTENDED))
        my_matches[i] = 1;
    else
        my_matches[i] = 0;
}
But the above, as you can see, is linear.
Thanks
Addendum:
I've put together a simple executable which runs the above algorithm and the approach proposed in the answer below, where from a large regex it builds a binary tree of sub-regexes and navigates it to find all the matches.
Source code is here (GPLv3): http://qpsnr.youlink.org/data/regex_search.cpp
Compile with: g++ -O3 -o regex_search ./regex_search.cpp -lrt
And run with: ./regex_search "a/b" (or use --help flag for options)
Interestingly (and I would say as expected), searching in the tree requires fewer regex executions, but each of those regexes is far more complex to run, so in the end the time taken balances out against the linear scan of the vector. Results are printed on std::cerr so you can check that they are the same.
When running with long strings and/or many tokens, watch out for memory usage; be ready to hit Ctrl-C to stop it before it brings your system down.
This is possible but I think you would need to write your own regex library to achieve it.
Since you're using posix regexen, I'm going to assume that you intend to actually use regular expressions, as opposed to the random collection of computational features which modern regex libraries tend to implement. Regular expressions are closed under union (and many other operations), so you can construct a single regular expression from your array of regular expressions.
Every regular expression can be recognized by a DFA (deterministic finite-state automaton), and a DFA -- regardless of how complex -- recognizes (or fails to recognize) a string in time linear to the length of the string. Given a set of DFAs, you can construct a union DFA which recognizes the languages of all DFAs, and furthermore (with a slight modification of what it means for a DFA to accept a string), you can recover the information about which subset of the DFAs matched the string.
I'm going to try to use the same terminology as the Wikipedia article on DFAs. Let's suppose we have a set of DFAs M = {M1...Mn} which share a single alphabet Σ. So we have:
Mi = (Qi, Σ, δi, qi0, Fi), where Qi = {qij} for 0 ≤ j < |Qi|, and Fi ⊆ Qi.
We construct the union DFA M∪ = (Q∪, Σ, δ∪, q∪0) (yes, no F; I'll get to that) as follows:
q∪0 = <q10, ..., qn0>
δ∪(<q1j1, ..., qnjn>, α) = <δ1(q1j1, α), ..., δn(qnjn, α)> for each α ∈ Σ
Q∪ consists of all the states reachable through δ∪ starting from q∪0.
We can compute this using a standard closure algorithm in time proportional to the product of the sizes of the δi transition functions.
Now to do a union match on a string α1...αm, we run the union DFA in the usual fashion, starting with its start state and applying its transition function to each α in turn. Once we've read the last symbol of the string, the DFA will be in some state <q1j1, ..., qnjn>. From that state, we can extract the set of Mi which would have matched the string as: {Mi | qiji ∈ Fi}.
In order for this to work, we need the individual DFAs to be complete (i.e., they have a transition from every state on every symbol). Some DFA construction algorithms produce DFAs which are lacking transitions on some symbols (indicating that no string with that prefix is in the language); such DFAs must be augmented with a non-accepting "sink" state which has a transition to itself on every symbol.
I don't know of any regex library which exposes its DFAs sufficiently to implement the above algorithm, but it's not too much work to write a simple regex library which does not attempt to implement any non-regular features. You might also be able to find a DFA library.
Constructing a DFA from a regular expression is potentially exponential in the size of the expression, although such cases are rare. (The non-deterministic FA can be constructed in linear time, but in some cases, the powerset construction on the NFA will require exponential time and space. See the Wikipedia article.) Once the DFAs are constructed, however, the union FA can be constructed in time proportional to the product of the sizes of the DFAs.
So it should be easy enough to allow dynamic modification to the set of regular expressions, by compiling each regex to a DFA once, and maintaining the set of DFAs. When the set of regular expressions changes, it is only necessary to regenerate the union DFA.
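To make the construction concrete, here is a rough Java sketch. It assumes each regex has already been compiled into a complete DFA whose transition table is indexed by character value (the compilation step itself is not shown), and it represents product states as lists of the component states:
import java.util.*;

class Dfa {
    int[][] delta;      // delta[state][symbol] = next state; complete, so no missing entries
    boolean[] accept;   // accept[state] = is this an accepting (final) state?
    // by convention the start state is 0
}

class UnionDfa {
    private final Dfa[] dfas;
    private final char[] alphabet;
    private final List<Integer> start;
    // product transition function: product state -> (symbol -> product state)
    private final Map<List<Integer>, Map<Character, List<Integer>>> delta = new HashMap<>();

    UnionDfa(Dfa[] dfas, char[] alphabet) {
        this.dfas = dfas;
        this.alphabet = alphabet;
        this.start = new ArrayList<>(Collections.nCopies(dfas.length, 0));
        // closure: explore every product state reachable from the start state
        Deque<List<Integer>> work = new ArrayDeque<>();
        work.add(start);
        delta.put(start, new HashMap<>());
        while (!work.isEmpty()) {
            List<Integer> q = work.poll();
            for (char c : alphabet) {
                List<Integer> next = new ArrayList<>(dfas.length);
                for (int i = 0; i < dfas.length; i++)
                    next.add(dfas[i].delta[q.get(i)][c]);
                delta.get(q).put(c, next);
                if (!delta.containsKey(next)) {   // newly discovered product state
                    delta.put(next, new HashMap<>());
                    work.add(next);
                }
            }
        }
    }

    // Run the union DFA over the key (assumed to contain only alphabet symbols)
    // and return the indices of the original regexes/DFAs that match it.
    List<Integer> match(String key) {
        List<Integer> q = start;
        for (char c : key.toCharArray())
            q = delta.get(q).get(c);
        List<Integer> matches = new ArrayList<>();
        for (int i = 0; i < dfas.length; i++)
            if (dfas[i].accept[q.get(i)])
                matches.add(i);
        return matches;
    }
}
Once the UnionDfa is built, match runs in time proportional to the key length (plus the final scan over the accept flags), independent of how many regexes are in the set; the price is the up-front product construction, which can blow up as noted above.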
Hope that all helps.

Query Term elimination

In the boolean retrieval model, a query consists of terms which are combined using different operators. Conjunction is the most obvious choice at first glance, but as query length grows, bad things happen: recall drops significantly when using conjunction, and precision drops when using disjunction (for example, stanford OR university).
For now we use conjunction in our search system (and the boolean retrieval model), and we have a problem when a user enters a very rare word or a long sequence of words. For example, if a user enters toyota corolla 4wd automatic 1995, we probably don't have such a document, but if we drop at least one word from the query, we do. As far as I understand, in the vector space model this problem is solved automatically: we do not filter documents on the mere presence of terms, we rank documents by the presence of terms.
So I'm interested in more advanced ways of combining terms in the boolean retrieval model, and in methods of rare-term elimination within it.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector where the wi are: 0 if the ith search term doesn't appear in the file, 1 if it does; the number of times search term i appears in the file; etc. Then, rank pages based on e.g. Manhattan distance, Euclidean distance, etc. and sort in descending order, possibly culling results with distance below a specified match tolerance.
If you want to handle more complex queries, you can put the query into CNF - e.g. (term1 or term2 or ... termn) AND (item1 or item2 or ... itemk) AND ... and then redefine the weights wi accordingly. You could list with each result the terms that failed to match in the file... so that the users would at least know how good a match it is.
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...
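As a toy illustration of that looser interpretation, here is a small Java sketch that scores each document by how many of the query terms it contains, culls results below a tolerance, and sorts the rest in descending order (all names here are illustrative):
// docsToTerms: document id -> set of terms it contains
List<String> rank(Map<String, Set<String>> docsToTerms, List<String> query, int minMatches) {
    Map<String, Integer> scores = new HashMap<>();
    for (Map.Entry<String, Set<String>> e : docsToTerms.entrySet()) {
        int score = 0;
        for (String term : query)
            if (e.getValue().contains(term))
                score++;
        if (score >= minMatches)               // cull weak matches
            scores.put(e.getKey(), score);
    }
    List<String> ranked = new ArrayList<>(scores.keySet());
    ranked.sort((a, b) -> scores.get(b) - scores.get(a));   // best matches first
    return ranked;
}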
