Double negation in Lucene query


Let's say I have a binary field checked.
Let's also assume that 3 documents out of 10 have checked:1 and the others have checked:0.
When I search in Lucene:
checked:1 - returns correct result (3)
checked:0 - returns correct result (7)
-checked:1 - returns correct result (7)
-checked:0 - returns correct result (3)
BUT
-(-(checked:1)) - suddenly returns the wrong result (10, i.e. the entire data set).
Any idea why the Lucene query parser acts so weirdly?

Each Lucene query has to contain at least one positive term (either MUST/+ or SHOULD) so it matches at least one document. So your queries -checked:1 and -checked:0 are invalid, and I am surprised you are getting any results.
These queries should (most likely) look like this:
+*:* -checked:1
+*:* -checked:0
Getting back to your problem: double negation makes no sense in Lucene. Why would you have double negation, what are you trying to query?
Generally speaking, don't treat Lucene query operators (!, &&, ||) as Boolean operators; they aren't exactly what you think they are.

After some research, trial and error, and building on the answer from midas, I have come up with a method to resolve this inconsistency. When I say inconsistency, I mean from a common-sense view for a user. From an information retrieval perspective, midas has linked an interesting article which explains why such a query makes no sense.
So, the trick is to pair each negative expression with a MatchAllDocsQueryNode, namely the rewritten query has to look like this:
-(-(checked:1 *:*) *:*)
Then the query will produce the expected result. I have accomplished this by writing my own node processor class, which performs the necessary operations.
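To see why pairing every negation with a match-all clause matters, here is a toy set model in Python. It is only a sketch, not Lucene code: it assumes that a boolean clause group with no positive clause matches nothing, and that only a purely negative top-level query gets an implicit match-all. The function name and the document ids are made up for illustration.

```python
# Toy set model of boolean clause groups (not Lucene code).
ALL_DOCS = frozenset(range(10))       # the 10 documents from the question
CHECKED_1 = frozenset({0, 1, 2})      # the 3 documents with checked:1

def bool_query(positive, negative, top_level=False):
    """Union of the positive clauses minus the union of the negative ones."""
    if positive:
        base = frozenset().union(*positive)
    else:
        # Assumption: a purely negative group has nothing to subtract from,
        # unless it is the top-level query, where match-all is implied.
        base = ALL_DOCS if top_level else frozenset()
    excluded = frozenset().union(*negative) if negative else frozenset()
    return base - excluded

# -(-(checked:1)): the inner group is purely negative and matches nothing,
# so the outer negation removes nothing from the whole data set.
naive_inner = bool_query([], [CHECKED_1])
naive = bool_query([], [naive_inner], top_level=True)

# Every negation paired with an explicit match-all clause.
fixed_inner = bool_query([ALL_DOCS], [CHECKED_1])
fixed = bool_query([ALL_DOCS], [fixed_inner])

print(len(naive), len(fixed))
```

Under those assumptions the naive form keeps all 10 documents, while the form with explicit match-all clauses gets back to the 3 documents with checked:1.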

Related

solr fuzzy vs wildcard vs stemmer

I have a couple of questions here.
I want to search for the term jumps.
With Fuzzy search, I can do jump~
With wild card search, I can do jump*
With stemmer I can do, jump
My understanding is that, fuzzy search gives pump. Wildcard search gives jumping as well. Stemmer gives "jumper" also.
I totally agree with the results.
What is the performance of these three?
A wildcard is not recommended at the beginning of a term; my understanding is that it then has to be matched against all the tokens in the index. But in this case it would only be the tokens that start with jump.
Fuzzy search gives me unpredictable results. I assume it does something like a spellcheck.
A stemmer suits only particular scenarios; for example, it can't match pumps.
How should I use these to get more relevant results?
I am probably more confused about all this because of this section. Any suggestions, please?

Question 1
Wildcard queries are (generally) not analysed (i.e. they're not tokenized or run through filters), meaning that anything that depends on filters doing their processing of the input/output tokens will give weird results (for example if the input string is broken into multiple strings).
The matching happens on the tokens, so what you've input is almost (lowercasing still works) matched directly against the prefix / postfix of the tokens in the index. Generally you'd want to avoid wildcard queries for general search queries, since they're rather limited for natural search and can give weird results (as shown).
Fuzzy search is based on "edit distance" - i.e. a number that tells Solr how many characters can be removed/inserted/changed to get to the resulting token. This will give your users OK-ish results, but might be hard to decipher in the sense of "why did this give me a hit" when the allowed distance is larger (Lucene/Solr supports up to 2 in edit distance which is also the default if no edit distance is given).
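To make the "edit distance" idea concrete, here is a small Python sketch. It implements plain Levenshtein distance (Lucene's fuzzy matching can also count a transposition as one edit, which this sketch leaves out), and the term list and the fuzzy_matches helper are made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]

def fuzzy_matches(term, indexed_terms, max_edits=2):
    """Hypothetical helper: keep indexed terms within max_edits of the term."""
    return [t for t in indexed_terms if edit_distance(term, t) <= max_edits]

# Made-up index terms for illustration.
indexed_terms = ["jump", "jumps", "jumping", "pump", "pumps", "bumper"]
matches = fuzzy_matches("jump", indexed_terms)
print(matches)
```

With a maximum distance of 2, jump~ picks up pump and pumps alongside jumps, which is exactly the "why did this give me a hit" effect described above.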
Stemming is usually the way to go, as it's the actual "formal" process of taking a term and reducing it down to its stem, the actual "meaning" of the word. (It doesn't really know anything about meaning in the natural language processing sense; it works from a set of static rules and exceptions for the configured language.) It can be adjusted per language to rules suitable for that language, which neither of the other two options can.
For your downside regarding stemming ("Since it can't match pumps") - that might actually be a good thing. It'll be clearer to your users what the search results are based on, and instead of including pumps in your search result, include it as a spelling correction ("Did you mean pump / pumps instead?"). It'll give a far better experience for any user, where the search results will more closely match what they're searching for.
The requirements might differ based on what your actual use case is, e.g. if it's just for programmatic attempts to find terms that look similar.
Question 2
Present those results you deem more relevant as the first hits - if you're doing wildcard or fuzzy searches you can't do this through scoring alone, so you'll have to make several queries and then present them after each other. I usually suggest making that an explicit action by the user of the search when discussing this in projects.
Instead, as the main search, you can use an NGramFilter in a separate field and use a copyField instruction to get the same content into both fields, and then score hits in the ngram field far lower than hits in the more "exact" field. Usually you want three fields in that case: one for exact hits (non-stemmed), one for stemmed hits and one for ngram hits, scored appropriately with the qf parameter to edismax. This is usually the quickest and easiest route to decent search results for your users, but make sure to give them decent ways of either filtering the result set (facets) or changing their queries into something more meaningful (did you mean, also see xyz, etc.).
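As a rough illustration of the three-field setup, the request parameters could be built like this in Python. The field names text_exact, text_stemmed and text_ngram and the boost values are assumptions made up for the sketch; only defType, q and qf are actual Solr parameters.

```python
from urllib.parse import parse_qs, urlencode

# Hypothetical field names and boosts: exact hits weigh most, ngram hits least.
params = {
    "defType": "edismax",
    "q": "jumps",
    "qf": "text_exact^10 text_stemmed^4 text_ngram^1",
}

# URL-encode the parameters so spaces and ^ survive the HTTP request.
query_string = urlencode(params)
decoded = parse_qs(query_string)  # round-trip check of the encoding
print(query_string)
```

The boosts are a starting point to tune against your own data, not recommended values.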
Guessing the user's intent is usually very hard unless you have invested a lot of time and resources into personalisation (think Google), so leave that for later - most users are happy as long as they have a clear and distinct way of solving their own problems, even if you don't get it perfect for the first result.

For question 2 you can go strict to permissive.
Option one: Only give strict search result. If no result found give stemmer results. Continue with fuzzy or wildcard search if no result found previously.
Option two: Give all results but rank them by level (i.e. exact matches first, then stemmer results, ...)
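Option one could be sketched like this in Python. The three search functions are toy stand-ins over an in-memory list (a real version would issue separate Solr queries), and toy_stem is a deliberately crude suffix stripper, not a real stemmer.

```python
from typing import Callable, List, Sequence

SearchFn = Callable[[str], List[str]]

def cascade_search(query: str, strategies: Sequence[SearchFn]) -> List[str]:
    """Try each strategy from strict to permissive; stop at the first hit."""
    for strategy in strategies:
        results = strategy(query)
        if results:
            return results
    return []

# Toy stand-ins for the real queries (all hypothetical).
DOCS = ["jump", "jumps", "pump"]

def toy_stem(word: str) -> str:
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def exact(query: str) -> List[str]:
    return [d for d in DOCS if d == query]

def stemmed(query: str) -> List[str]:
    return [d for d in DOCS if toy_stem(d) == toy_stem(query)]

def prefix(query: str) -> List[str]:
    return [d for d in DOCS if d.startswith(query)]

strategies = [exact, stemmed, prefix]
print(cascade_search("jumps", strategies))    # answered by the exact search
print(cascade_search("jumping", strategies))  # falls through to the stemmer
```

The trade-off is extra round trips for queries that only match at the permissive levels.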

Custom SOLR-sorting that is aware of its neighbours

For a SOLR search, I want to treat some results differently (where the field "is_promoted" is set to "1") to give them a better ranking. After the "normal" query is performed, the order of the results should be rearranged so that approximately 30 % of the results in a given range (say, the first 100 results) should be "promoted results". The ordering of the results should otherwise be preserved.
I thought it would be a good idea to solve this by making a custom SOLR plugin. So I tried writing a SearchComponent, but it seems like you can't change the ordering of search results after it has passed through the QueryComponent (since they are cached)?
One could have written some kind of custom sort function (or a function query?) but the challenge is that the algorithm needs to know about the score/ordering of the other surrounding results. A simple increase in the score won't do the trick.
Any suggestions on how this should be implemented?

Just answered this question on the Solr users list. The RankQuery feature in Solr 4.9 is designed to solve this type of problem. You can read about RankQueries here: http://heliosearch.org/solrs-new-rankquery-feature/
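Independent of how it is wired into Solr, the reordering described in the question can be sketched in plain Python. This is only one possible interpretation of "approximately 30 % promoted, ordering otherwise preserved"; the function name, the dict layout of a result and the greedy rule are all assumptions, and none of this uses the RankQuery API.

```python
from collections import deque

def rerank_promoted(results, window=100, ratio=0.3):
    """Greedy reordering: inside the window, pull the next promoted result
    forward whenever the promoted share would otherwise drop below ratio.
    Relative order within each group is preserved."""
    promoted = deque(i for i, r in enumerate(results) if r["is_promoted"])
    normal = deque(i for i, r in enumerate(results) if not r["is_promoted"])
    order = []
    promoted_so_far = 0
    while promoted or normal:
        in_window = len(order) < window
        needs_boost = in_window and promoted_so_far < ratio * (len(order) + 1)
        if promoted and (not normal or needs_boost or promoted[0] < normal[0]):
            order.append(promoted.popleft())
            promoted_so_far += 1
        else:
            order.append(normal.popleft())
    return [results[i] for i in order]

# Ten results in score order; only the last three are promoted.
results = [{"id": i, "is_promoted": i >= 7} for i in range(10)]
reranked_ids = [r["id"] for r in rerank_promoted(results, window=10)]
print(reranked_ids)
```

Because the rule needs to see the neighbouring results, it has to run over the ranked list as a whole rather than per document, which is the reason a simple score boost does not work.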

Short-circuit OR operator in Lucene/Solr

I understand that lucene's AND (&&), OR (||) and NOT (!) operators are shorthands for REQUIRED, OPTIONAL and EXCLUDE respectively, which is why one can't treat them as boolean operators (adhering to boolean algebra).
I have been trying to construct a simple OR expression, as follows
q = +(field1:value1 OR field2:value2)
with a match on either field1 or field2. But since OR merely marks the clauses as optional, for documents where both field1:value1 and field2:value2 match, the query returns a score that reflects a match on both clauses.
How do I enforce short-circuiting in this context? In other words, how do I implement short-circuiting as in boolean algebra, where an expression A || B || C returns true if A is true without even looking into whether B or C could be true?

Strictly speaking, no, there is no short-circuiting boolean logic. If a document is found for one term, you can't simply tell it not to check for the other. Lucene is an inverted index, so it doesn't really check documents for matches directly. If you search for A OR B, it finds A and gets all the documents which have indexed that value. Then it finds B in the index, and then the list of all documents containing it (this is simplifying somewhat, but I hope it gets the point across). It doesn't really make sense for it to not check the documents in which A is found. Further, for the query provided, all the matches on a document still need to be enumerated in order to acquire a correct score.
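A tiny Python model of an inverted index shows why there is nothing to short-circuit. It is a simplification (whitespace tokenizing, sets instead of real posting lists), and the documents are made up.

```python
from collections import defaultdict

def build_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

def search_or(index, terms):
    """A OR B: look up each term's document set and merge them."""
    matches = set()
    for term in terms:
        matches |= index.get(term, set())
    return matches

# Made-up documents for illustration.
docs = {
    1: "stanford university",
    2: "stanford",
    3: "oxford university",
    4: "lucene",
}
index = build_index(docs)
hits = search_or(index, ["stanford", "university"])
print(sorted(hits))
```

The lookup works term by term, not document by document, so there is no point at which a document "already matched A" and evaluation of B could be skipped.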
However, you did mention scores! I suspect what you are really trying to get at is that if one query term in a set is found, the score should not be compounded with the other elements. That is, for (A OR B), the score is either score-A or score-B, rather than score-A + score-B or some such (sorry if I am making a wrong assumption here, of course).
That is what DisjunctionMaxQuery is for. Adding each subquery to it will render a score equal to the maximum of the scores of all subqueries, rather than their sum.
In Solr, you should learn about the DisMaxQParserPlugin and its more recent incarnation, ExtendedDisMax, which, if I'm close to the mark here, should serve you very well.
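The difference in scoring can be sketched in a few lines of Python. This only models the combination step, with made-up clause scores; a plain boolean OR adds the scores of the matching optional clauses, while a disjunction-max takes the best one plus an optional tie-breaker share of the rest.

```python
def boolean_or_score(clause_scores):
    """Plain boolean OR: scores of the matching clauses are added up."""
    return sum(clause_scores)

def dismax_score(clause_scores, tie_breaker=0.0):
    """Disjunction-max: the best clause score, plus tie_breaker times
    the sum of the remaining clause scores."""
    if not clause_scores:
        return 0.0
    best = max(clause_scores)
    return best + tie_breaker * (sum(clause_scores) - best)

# Made-up scores for a document matching both field1:value1 and field2:value2.
scores = [2.0, 1.0]
print(boolean_or_score(scores))               # both clauses compound
print(dismax_score(scores))                   # only the best clause counts
print(dismax_score(scores, tie_breaker=0.5))  # partial credit for the rest
```

With a tie-breaker of 0 a document matching both clauses scores the same as one matching only the better clause, which is the closest thing to "short-circuiting" the score.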

Difference between 2 solr queries

What is the difference between running the following two Solr queries? They seem to be giving me a different number of results.
fq=field1:value1&fq=(field2:value21 OR field2:value22)
versus
fq=field1:value1&fq=field2:value21 OR field2:value22
The first one gives me a larger result set whereas the second one gives me a smaller result set. Do the parentheses have any effect in this case? If so, what is it?

Check this guide on using Boolean operators in Solr:
http://robotlibrarian.billdueber.com/solr-and-boolean-operators/

Query Term elimination

In the boolean retrieval model, a query consists of terms combined using different operators. Conjunction is the most obvious choice at first glance, but bad things happen as the query length grows: recall drops significantly when using conjunction, and precision drops when using disjunction (for example, stanford OR university).
For now we use conjunction in our search system (and the boolean retrieval model). We have a problem if the user enters a very rare word or a long sequence of words. For example, if the user enters toyota corolla 4wd automatic 1995, we probably don't have a matching document. But if we delete at least one word from the query, we do have such documents. As far as I understand, the Vector Space Model solves this problem automatically: we do not filter documents on term presence, we rank documents by the presence of terms.
So I'm interested in more advanced ways of combining terms in the boolean retrieval model, and in methods of rare term elimination in the boolean retrieval model.

It seems like the sky's the limit in terms of defining a ranking function here. You could define a vector where the w_i are: 0 if the i-th search term doesn't appear in the file, 1 if it does; the number of times search term i appears in the file; etc. Then, rank pages based on e.g. Manhattan distance, Euclidean distance, etc. and sort in descending order, possibly culling results with distance below a specified match tolerance.
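As a minimal Python sketch of the binary-weight variant: each document gets a 0/1 vector over the query terms, and the score is simply the number of ones (the Manhattan length of that vector). The documents, the min_matches tolerance and the function names are made up for illustration.

```python
def match_vector(query_terms, doc_terms):
    """w_i = 1 if the i-th query term appears in the document, else 0."""
    return [1 if term in doc_terms else 0 for term in query_terms]

def rank(query_terms, docs, min_matches=1):
    """Score documents by the number of matched terms, best first,
    culling anything below the match tolerance."""
    scored = []
    for doc_id, text in docs.items():
        vector = match_vector(query_terms, set(text.lower().split()))
        score = sum(vector)
        if score >= min_matches:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return scored

# Made-up listings for the query from the question.
docs = {
    "a": "toyota corolla automatic 1995",
    "b": "toyota corolla 4wd manual 1997",
    "c": "honda civic automatic 1995",
    "d": "ford focus manual 2001",
}
query = ["toyota", "corolla", "4wd", "automatic", "1995"]
ranking = rank(query, docs, min_matches=3)
print(ranking)
```

No document contains all five terms, so a strict conjunction would return nothing, while this ranking still surfaces the near matches.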
If you want to handle more complex queries, you can put the query into CNF - e.g. (term_1 or term_2 or ... term_n) AND (item_1 or item_2 or ... item_k) AND ... and then redefine the weights w_i accordingly. You could list with each result the terms that failed to match in the file... so that the users would at least know how good a match it is.
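For the CNF variant, a Python sketch could score a document by the number of satisfied OR-groups and report the groups that failed. The clause contents and the function name are assumptions for the example.

```python
def evaluate_cnf(clauses, doc_terms):
    """Each clause is a list of alternative terms (an OR-group).
    Returns the number of satisfied clauses and the clauses that failed."""
    satisfied = [any(term in doc_terms for term in clause) for clause in clauses]
    failed = [clause for clause, ok in zip(clauses, satisfied) if not ok]
    return sum(satisfied), failed

# (toyota) AND (corolla OR camry) AND (4wd OR awd), made up for illustration.
clauses = [["toyota"], ["corolla", "camry"], ["4wd", "awd"]]
doc_terms = {"toyota", "camry", "automatic"}
score, failed = evaluate_cnf(clauses, doc_terms)
print(score, failed)
```

The list of failed clauses is what you would show next to the result so the user can see which part of the query was relaxed.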
I guess what I'm really trying to say is that to really get an answer that works for you, you have to define exactly what you are willing to accept as a valid search result. Under the strict interpretation, a query that is looking for A_1 and A_2 and ... A_m should fail if any of the terms is missing...