Short-circuit OR operator in Lucene/Solr

I understand that Lucene's AND (&&), OR (||) and NOT (!) operators are shorthand for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why one can't treat them as Boolean operators (adhering to Boolean algebra).
I have been trying to construct a simple OR expression, as follows:
q = +(field1:value1 OR field2:value2)
with a match on either field1 or field2. But since OR merely marks a clause as optional, for documents where both field1:value1 and field2:value2 match, the query returns a score that reflects a match on both clauses.
How do I enforce short-circuiting in this context? In other words, how can I implement short-circuiting as in Boolean algebra, where an expression A || B || C returns true if A is true, without even looking into whether B or C could be true?

Strictly speaking, no, there is no short-circuiting Boolean logic. If a document is found for one term, you can't simply tell Lucene not to check for the other. Lucene is an inverted index, so it doesn't really check documents for matches directly. If you search for A OR B, it looks up A and gets all the documents which have indexed that value; then it looks up B in the index and gets the list of all documents containing it (this is simplifying somewhat, but I hope it gets the point across). It doesn't really make sense for it to skip the documents in which A is found. Further, for the query provided, all the matches on a document still need to be enumerated in order to compute a correct score.
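To make that concrete, here is a toy illustration in plain Java (not Lucene internals; the posting lists are made up):

import java.util.*;

Map<String, List<Integer>> postings = Map.of(
    "A", List.of(1, 3, 5),  // doc ids that indexed term A
    "B", List.of(3, 4));    // doc ids that indexed term B
Set<Integer> hits = new TreeSet<>();
hits.addAll(postings.get("A"));  // every document containing A...
hits.addAll(postings.get("B"));  // ...unioned with every document containing B
System.out.println(hits);        // [1, 3, 4, 5]: doc 3 matched both clauses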
However, you did mention scores! I suspect what you are really trying to get at is that if one query term in a set is found, the score should not be compounded with the other elements. That is, for (A OR B), the score would be either score-A or score-B, rather than score-A + score-B or some such (sorry if I am making a wrong assumption here, of course).
That is what DisjunctionMaxQuery is for. Adding each subquery to it will yield a score equal to the maximum of the scores of all subqueries, rather than the sum.
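A minimal sketch in the Lucene Java API (the collection-plus-tiebreaker constructor is from recent Lucene versions; check the signature for yours):

import java.util.List;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.DisjunctionMaxQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// score = max(score of field1:value1, score of field2:value2);
// a tie-breaker of 0 gives no credit at all to the lower-scoring clause
Query q = new DisjunctionMaxQuery(
    List.of(new TermQuery(new Term("field1", "value1")),
            new TermQuery(new Term("field2", "value2"))),
    0.0f);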
In Solr, you should learn about the DisMaxQParserPlugin and its more recent incarnation, ExtendedDisMax (edismax), which, if I'm close to the mark here, should serve you very well.
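For example, a hedged sketch of an edismax request (parameter names come from the dismax/edismax documentation; the field names are taken from your question):

q=value1 value2&defType=edismax&qf=field1 field2&tie=0.0

Here qf lists the fields each term is searched in; edismax scores each term by the maximum over those fields, and tie=0.0 keeps it a pure max (a small positive tie would add back a fraction of the lower-scoring fields).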

Related

Flink, what's the behavior of minBy or maxBy if multiple records meet the condition

I'm a newbie to Flink and I'm wondering about the behavior of minBy (I guess maxBy is the same) if there are multiple records that have the minimum value. I noticed that Flink outputs only one record in this case, but which one? The first, the last, or a random one?
Thanks for help.
Note that as of FLIP-134, all of these relational methods on DataStreams, namely Windowed/KeyedStream#sum, min, max, minBy, and maxBy, are planned to be deprecated. The entire DataSet API is also planned for eventual deprecation.
The only long-term support for relational methods like these is what is provided by the Table and SQL APIs.
But to answer your question, minBy and maxBy work the same way.
The javadoc for DataSet#maxBy says
If multiple values with maximum value at the specified fields exist, a random one will be picked.
while the javadocs for AllWindowedStream#maxBy(int positionToMaxBy) and KeyedStream#maxBy(int positionToMaxBy) say
If more elements have the same maximum value the operator returns the first by default.
and the javadocs for AllWindowedStream#maxBy(int positionToMaxBy, boolean first) and KeyedStream#maxBy(int positionToMaxBy, boolean first) explain that
If [first is] true, then the operator returns the first element with the maximum value, otherwise returns the last.
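For illustration, a minimal sketch of the tie-breaking flag on a KeyedStream (pre-deprecation DataStream API; the data and the key are made up):

import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.fromElements(Tuple2.of("a", 3), Tuple2.of("b", 3)) // both elements tie on field 1
   .keyBy(t -> 0)   // a single key, so the two elements compete
   .maxBy(1, true)  // first = true: emit the first element on a tie
   .print();        // with first = false it would emit the last
env.execute();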

Solr query: prefer phrase over occurrence of single words, but accept both

For our Solr product search a new requirement has been specified:
A given list of terms needs to be queried over some fields, and the score shall be higher when they are found as a phrase than when all of the terms occur but in a different order; that in turn must score higher than the occurrence of only some of the terms in a single field, or all of the terms occurring but spread across different fields. Also, some fields need higher scores than others (title higher than description).
I have thought of a solution like this (with a, b, c being search terms; there could be any number of them):
q=title:"a b c"^40.0 OR title:(+a +b +c)^20.0 OR title:(a b c)^5.0 OR description:"a b c"^30.0 OR description:(+a +b +c)^10.0 OR description:(a b c)^3.0 ...
Some fields need a different treatment, e.g. person names should be scored higher, when they match exactly, but shall also be searched fuzzy, like that:
q=name:(+a +b +c)^40.0 OR name:(a b c)^20.0 OR name:(a~0.9 b~0.9 c~0.9)^5.0 etc.
Other criteria have to be matched exactly to model certain restrictions, like
active:true AND publicationDate:[* TO NOW] ...
Is this a valid solution? Are there better ones?
As I have no practical experience with the edismax parser, I am not quite sure whether it would be able to solve my problem.
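For what it's worth, edismax can express much of this declaratively. A hedged sketch (boost values are illustrative; parameter names come from the edismax documentation):

q=a b c
defType=edismax
qf=title^2.0 description^1.0 name^4.0
pf=title^40.0 description^30.0
mm=75%
fq=active:true AND publicationDate:[* TO NOW]

qf sets the per-field term boosts, pf adds an extra boost when the whole query matches as a phrase (the phrase-over-scattered-terms preference), mm relaxes the all-terms-required behavior, and fq applies the exact-match restrictions without influencing the score. Fuzzy clauses like a~0.9 would still need to be spelled out, since edismax does not apply fuzziness to qf terms on its own.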

ndb.OR makes queries cost more

Using AppEngine Appstats I profiled my queries and noticed that although the docs say a query costs one read, queries using ndb.OR (or .IN, which expands to OR) cost n reads, where n equals the number of OR clauses.
e.g.:
votes = (Vote.query(ndb.OR(Vote.object == keys[0], Vote.object == keys[1]))
         .filter(Vote.user_id == user_id)
         .fetch(keys_only=True))
This query costs 2 reads (it matches 0 entities). If I replace the ndb.OR with Vote.object.IN, the number of reads equals the length of the list I pass to IN.
This behavior kind of contradicts the docs.
I was wondering if anyone else has experienced the same, and whether this is a bug in App Engine, the docs, or my understanding.
Thanks.
The query docs for ndb are not particularly explicit, but this paragraph is your best answer:
In addition to the native operators, the API supports the != operator, combining groups of filters using the Boolean OR operation, and the IN operation, which test for equality to one of a list of possible values (like Python's 'in' operator). These operations don't map 1:1 to the Datastore's native operations; thus they are a little quirky and slow, relatively. They are implemented using in-memory merging of result streams. Note that p != v is implemented as "p < v OR p > v". (This matters for repeated properties.)
That paragraph is in this doc: https://developers.google.com/appengine/docs/python/ndb/queries

Double negation in Lucene query

Let's say I have a binary field checked.
Let's also assume that 3 documents out of 10 have checked:1 and the others checked:0.
When I search in Lucene:
checked:1 - returns correct result (3)
checked:0 - returns correct result (7)
-checked:1 - returns correct result (7)
-checked:0 - returns correct result (3)
BUT
-(-(checked:1)) - suddenly returns a wrong result (10, i.e. the entire data set).
Any idea why the Lucene query parser acts so weird?
Each Lucene query has to contain at least one positive term (either a MUST/+ or a SHOULD clause) so that it matches at least one document. So your queries -checked:1 and -checked:0 are invalid, and I am surprised you are getting any results.
These queries should (most likely) look like this:
+*:* -checked:1
+*:* -checked:0
Getting back to your problem: double negation makes no sense in Lucene. Why would you have double negation, what are you trying to query?
Generally speaking, don't look at the Lucene query operators (!, &, |) as Boolean operators; they aren't exactly what you think they are.
After some research and trial and error, and building on the answer from midas, I have come up with a method to resolve this inconsistency. When I say inconsistency, I mean from a common-sense point of view for a user. From an information retrieval perspective, midas has linked an interesting article which explains why such a query makes no sense.
So, the trick is to keep each negative expression paired with a MatchAllDocsQueryNode (the *:* clause); namely, the rewritten query has to look like this:
-(-(checked:1 *:*) *:*)
Then the query will produce the expected result. I accomplished this by writing my own node processor class, which performs the necessary operations.
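For reference, here is a minimal sketch of the same pairing written directly against Lucene's BooleanQuery API (this is not the node processor itself, just the MatchAllDocsQuery trick expressed by hand):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

// inner = (*:* -checked:1), i.e. "all documents except checked:1"
BooleanQuery inner = new BooleanQuery.Builder()
    .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
    .add(new TermQuery(new Term("checked", "1")), BooleanClause.Occur.MUST_NOT)
    .build();
// outer = (*:* -inner), i.e. "all documents except that complement",
// which is exactly the 3 documents with checked:1
Query outer = new BooleanQuery.Builder()
    .add(new MatchAllDocsQuery(), BooleanClause.Occur.MUST)
    .add(inner, BooleanClause.Occur.MUST_NOT)
    .build();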

Query Term elimination

In the Boolean retrieval model a query consists of terms which are combined using different operators. Conjunction is the most obvious choice at first glance, but when the query length grows, bad things happen: recall drops significantly when using conjunction, and precision drops when using disjunction (for example, stanford OR university).
For now we use conjunction in our search system (and the Boolean retrieval model). And we have a problem if the user enters some very rare word or a long sequence of words. For example, if a user enters toyota corolla 4wd automatic 1995, we probably don't have such a document. But if we delete at least one word from the query, we do. As far as I understand, in the Vector Space Model this problem is solved automatically: we do not filter documents on the mere presence of terms; we rank documents using the presence of terms.
So I'm interested in more advanced ways of combining terms in the Boolean retrieval model, and in methods of rare-term elimination for the Boolean retrieval model.
It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector where the wi are 0 if the ith search term doesn't appear in the file and 1 if it does; or the number of times search term i appears in the file; etc. Then rank pages based on e.g. Manhattan distance, Euclidean distance, etc., sort in descending order, and possibly cull results that fall below a specified match tolerance.
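A toy sketch of that idea in plain Java (the weights are occurrence counts, the score is the Manhattan norm of the weight vector, and the documents are made up for illustration):

import java.util.*;
import java.util.function.Function;

String[] terms = {"toyota", "corolla", "4wd", "automatic", "1995"};
List<String> docs = List.of(
    "toyota corolla automatic 1996",
    "toyota yaris 4wd automatic 1995");
// w_i = number of times term i occurs in the document; score = sum of w_i
Function<String, Integer> score = doc -> {
    List<String> tokens = Arrays.asList(doc.toLowerCase().split("\\W+"));
    int s = 0;
    for (String term : terms) s += Collections.frequency(tokens, term);
    return s;
};
docs.stream()
    .sorted(Comparator.comparing(score).reversed())  // best match first
    .forEach(d -> System.out.println(score.apply(d) + "  " + d));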
If you want to handle more complex queries, you can put the query into CNF, e.g. (term1 OR term2 OR ... termn) AND (item1 OR item2 OR ... itemk) AND ..., and then redefine the weights wi accordingly. You could list with each result the terms that failed to match, so that users would at least know how good a match it is.
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A1 and A2 and ... Am should fail if any of the terms is missing...
