I have a Solr (6.2) DisMax Select Query which uses pf (phrase fields) and ps (phrase slop).
pf = text^2.2 title^2.2, ps = 2;
I want my query to return results following this algorithm:
If there are exact matches for the queried phrase, return them first, sorted by date
If there are documents that have at least one of the words of the queried phrase, return them second, sorted by date
Example Data: text (last_modified timestamp in parenthesis)
stuff about important people (2018)
important people: the article (2019)
some people find that important (2020)
important news (2015)
people of the decade (2020)
The desired result:
phrases with acceptable slop first
some people find that important (2020)
important people: the article (2019)
stuff about important people (2018)
then at least one of the words
people of the decade (2020)
important news (2015)
What I've tried:
wrapping the query in double quotes and using qs (query phrase slop); this works as desired for phrases, but ignores the "at least one of the words" part;
using a bq (boost query) like last_modified:[NOW/DAY-3MONTH TO NOW/DAY]^20.0;
using a bf (boost function) like recip(ms(NOW,last_modified),3.16e-11,1,1);
explicit last_modified desc sort - but it ignores the score completely
using multiple sort score desc, last_modified desc - but the second sort will work only if there is a tie for the first one (and there is almost never a tie)
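To get a feel for the recip boost mentioned in the attempts above, here is a minimal Python sketch of Solr's recip(x,m,a,b) = a / (m*x + b), evaluated for a few document ages. The helper names (date_boost, MS_PER_YEAR) are made up for illustration and the ages are arbitrary.

```python
MS_PER_YEAR = 365.25 * 24 * 3600 * 1000  # roughly 3.16e10 milliseconds


def recip(x, m, a, b):
    """Same formula as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)


def date_boost(age_years):
    """Value of recip(ms(NOW,last_modified),3.16e-11,1,1) for a given document age."""
    return recip(age_years * MS_PER_YEAR, 3.16e-11, 1, 1)


# Print the boost for a brand-new document and for older ones
for age in (0, 1, 2, 5):
    print(f"{age} year(s) old -> boost {date_boost(age):.3f}")
```

With these constants the value always stays between 0 and 1 and only halves after about a year, so as an additive bf term it is probably small next to typical text scores unless it is multiplied by a large factor.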
I've managed to get the (almost) desired result by using:
Boost Functions (bf) = recip(ms(NOW,last_modified),3.16e-11,1,1)^1500 (had to use a huge boost number to bubble up the most recent results);
Query Fields qf = 'text^4 title^2';
Phrase Fields pf = 'text^5 title^2';
Phrase Slop ps = 4;
Query Phrase Slop qs = 2;
Minimum Should Match mm = len(split('\s', query)) + 1 (pseudocode)
Split the query by whitespace, join the exact phrase and each separate word with OR, and set the Minimum Should Match parameter (mm) to len(split)+1. For example, the query "apple dog" transforms into "apple dog" OR apple OR dog. The double quotes are necessary for the qs parameter to work and to force results with the exact phrase to bubble up.
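The query rewriting described above could be sketched in Python roughly as follows. The function name build_params is hypothetical, and the parameter values are simply the ones listed above; this only builds the request parameters and does not talk to Solr.

```python
def build_params(user_query):
    """Build the request parameters for the approach described above (illustrative helper)."""
    words = user_query.split()  # split on any whitespace
    # The exact phrase in double quotes comes first, then every word as its own clause
    clauses = ['"' + " ".join(words) + '"'] + words
    return {
        "q": " OR ".join(clauses),
        "qf": "text^4 title^2",
        "pf": "text^5 title^2",
        "ps": 4,
        "qs": 2,
        "mm": len(words) + 1,  # the pseudocode len(split('\s', query)) + 1
        "bf": "recip(ms(NOW,last_modified),3.16e-11,1,1)^1500",
    }


params = build_params("apple dog")
print(params["q"])
print(params["mm"])
```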
Maybe there are some tweaks to the method I'm using; any comments are appreciated.
Related
I am implementing Solr fuzzy search using the complex phrase query parser.
But I am facing a weird case:
q={!complexphrase}name:"woo~1 grou~2" return "wood group" as a result.
q={!complexphrase}name:"woo~1 gro~2" does not return "wood group".
although the distance between gro and group is 2!
searching for this query:
q={!complexphrase}name:"Anderso~1 Interes~2" returns 'Anderson Interests'.
The distance between Interes and Interests is the same as between gro and group!
Any idea what the reason is?
I believe you are running into a problem with query rewrites.
Any multi-term query (fuzzy queries, prefix queries, etc.) gets expanded, in Lucene, into the exact terms that it matches. There is a maximum to the number of terms that can be generated this way though, so when rewriting the query, it will just try to pick the best within that limit. I suspect there are just too many matches for gro~2.
Perhaps you'll find it odd that there are so many matches that it can't incorporate all of them into the query. It looks like you are trying to search for words beginning with gro, with up to two more letters tacked onto the end. How many could there be? But that isn't what you're searching for. Fuzzy queries are based on Levenshtein distance. The matches for that term include:
g__ -- Three-letter words beginning with g
_r_ -- Three-letter words with an r in the middle
__o -- Three-letter words with an o on the end
gr__ -- Any four-letter word beginning with gr
etc.
In short, it could match a massive list of terms, and in terms of similarity algorithm, "arm" and "cron" match just as well as "group".
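The point about edit distance can be checked with a small Python sketch of the classic Levenshtein algorithm. This is plain Levenshtein (insert, delete, substitute); Lucene's fuzzy matching may additionally treat a transposition as a single edit, which this sketch ignores.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# "group", "arm" and "cron" are all equally far from "gro"
for candidate in ("group", "arm", "cron"):
    print(candidate, levenshtein("gro", candidate))
```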
If you really just want to match terms that start with "gro", use a prefix query instead: "woo* gro*".
If you want to actually search with a fuzzy query, including the list of possible matches seen above, you can raise the maxBooleanClauses value (1024 is the default) in the query section of your solrconfig.xml:
<query>
<maxBooleanClauses>1024</maxBooleanClauses>
</query>
I have observed that Solr/Lucene gives too much weight to matching all the query terms compared with the tf of a particular query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B gets a much higher score because it contains all three terms of the query, but only once each, whereas Document A gets a very low score even though it contains one term a large number of times.
Can I create a query in such a manner that if Lucene finds a match for "red jacket" it does not consider it as match for "red" and "jacket" individually ?
I would recommend using a DisjunctionMaxQuery. In raw Lucene, this would look something like:
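DisjunctionMaxQuery dismax = new DisjunctionMaxQuery(0.0f); // tie-breaker 0: only the best sub-query score counts
dismax.add(parser.parse("red"));
dismax.add(parser.parse("jacket"));
dismax.add(parser.parse("\"red jacket\"")); // escaped quotes so the parser builds a phrase query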
The dismax query scores a document using the maximum score among its subqueries, rather than the sum of the scores of its subqueries.
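The difference between the two ways of combining sub-query scores can be illustrated with a few lines of Python. The per-clause scores below are invented numbers for Document A and Document B, not real Lucene output; the dismax formula is the best score plus the tie-breaker times the sum of the remaining scores.

```python
def boolean_style_score(sub_scores):
    """A plain boolean OR query adds up the scores of all matching clauses."""
    return sum(sub_scores)


def dismax_score(sub_scores, tie=0.0):
    """Disjunction-max: best clause score plus tie * (sum of the other clause scores)."""
    best = max(sub_scores)
    return best + tie * (sum(sub_scores) - best)


# Hypothetical scores for the clauses: red, jacket, "red jacket"
doc_a = [0.0, 0.9, 0.0]  # "jacket" 40 times, nothing else
doc_b = [0.3, 0.3, 0.5]  # "red jacket" once

print("sum:   ", boolean_style_score(doc_a), boolean_style_score(doc_b))
print("dismax:", dismax_score(doc_a), dismax_score(doc_b))
```

With these made-up numbers, summing puts Document B first, while taking the maximum puts Document A first, because B is no longer rewarded three times for the same occurrence.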
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
Tf-idf is what search engines normally do but not what you always want. It is not what you want if you want to ignore repeated key words.
Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequent a word is in a text. idf (inverse document frequency) measures how unique a word is among all the documents that you have in the search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated with a base-10 logarithm as log10(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
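The arithmetic of the cat example can be reproduced in Python. This is the textbook form with a base-10 logarithm used in the example; Lucene's own similarity classes use different variants of these formulas.

```python
import math


def tf(term_count, total_words):
    """Term frequency: share of the text taken up by the term."""
    return term_count / total_words


def idf(total_docs, docs_with_term):
    """Inverse document frequency with a base-10 logarithm, as in the example."""
    return math.log10(total_docs / docs_with_term)


tf_cat = tf(3, 100)
idf_cat = idf(10_000_000, 1_000)
weight = tf_cat * idf_cat
print(f"tf={tf_cat}, idf={idf_cat}, tf-idf={weight:.2f}")
```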
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.
I have a series of docs containing a nickname ( even with spaces ) and an ID.
The nickname can be like ["example","nick n4me", "nosp4ces","A fancy guy"].
I have to find a query that allows me to find profiles by an exact match, a fuzzy match, or even a match on only part of the characters.
So if I write "nick" or "nick name" or "nick name", the document "nickname" must always come out.
I tried with something like:
nickname:(%1%^4 %1%~^3 %1%*^1)
where "%1%" is what I'm searching for, but it doesn't work, especially for nicknames with spaces or numbers. For example, if I try to search for "nick n" the query would be:
nickname:(nick n^4 nick n~^3 nick n*^1)
Boosting with ^ will only affect the scoring and not the matching, i.e. if your query does not match at all, boosting the terms or not won't make any difference.
In your specific example, the query won't match because:
1) nick n won't match because that would require that either the token nick or n have been tokenized;
2) EDIT: I found out that fuzzy queries work only on single terms if you use the standard query parser. In your case, you should probably rewrite nick n~ using ComplexPhraseQueryParser, so you can do a fuzzy query on the whole PhraseQuery. Also, you can specify a threshold for your fuzzy query (technically, a maximum Levenshtein edit distance in current Lucene versions, or a minimum similarity in older ones). You will have to adjust the threshold, and that usually requires some trial and error.
An easier tactic is to load all nicknames into one field -- in your example you would have 4 values for your nickname field. If you want embedded spaces in your nicknames, you will need to use a simpler analyzer than StandardAnalyzer or use phrase searches.
I am searching for "i want honda bike" on a text field using edismax query handler.
My intent is to find out docs having "honda bike" in it.
Currently the results contain "honda", "bike" and "honda bike". Basically I am not interested in "honda" and "bike" on their own; I am actually interested in "honda bike".
Is there any way to identify if the phrase in field has matched the user query?
I would investigate these parameters -- pf, pf2, and pf3.
pf -- phrase fields. This will let you boost the documents that have your q values in close proximity.
pf2 and pf3 -- chop the input into bigrams (pf2) or trigrams (pf3) and boost documents that contain those word groups as phrases.
There are also slop settings to give some leeway in matching.
http://wiki.apache.org/solr/ExtendedDisMax#pf_.28Phrase_Fields.29
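To make the bigram/trigram idea concrete, here is a rough Python sketch of how the words of a query can be chopped into consecutive word groups. The function name shingles is made up, and the real edismax parser works on analyzed terms rather than on a plain whitespace split, so treat this only as an approximation.

```python
def shingles(query, n):
    """Return all runs of n consecutive words from the query."""
    words = query.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


query = "i want honda bike"
print("pf2 phrases:", shingles(query, 2))
print("pf3 phrases:", shingles(query, 3))
```

One of the pf2 bigrams here is "honda bike", which is the phrase the question is actually after, so a pf2 boost would favor documents containing it even though the full four-word phrase never matches.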
The problem was that IDF was distorting the score, so I could not rely on the score alone to say confidently which documents had matched the phrase exactly.
So I disabled IDF calculation.
take a look at
http://lucene.472066.n3.nabble.com/Identify-exact-search-in-edismax-td4011859.html#a4011976
mm (Minimum 'Should' Match) feature of edismax can be used here
http://wiki.apache.org/solr/ExtendedDisMax
I am implementing Solr dismax search and also using the function recip(ms(NOW,PubDate),3.16e-11,1000,1000) for a date boost. Everything is working fine except for one problem:
if the search keywords are repeated in the Title, the document gets a higher score than more recent results.
e.g.
1) Title = solr lucene
Date = 1 day old
2) Title = solr lucene is best, love solr lucene
Date = 15 days old
If user searched for 'solr lucene', then #2 comes at first position only because keywords are repeated in the Title.
I have many records which are 1, 2 or 3 days old and have exactly the title "SOLR LUCENE", but those records don't appear on the first page only because older records have the keywords repeated in the Title.
I don't want to sort the results entirely by date. Currently I am sorting like this: sort=score desc, date asc
You shouldn't use a sort clause if you are using a boost.
If you want to give the date more relevance, tune your boost function. It's up to you how big the influence of the date on the order of the search results is.
It also depends on the dismax-handler you are using:
{!edismax boost=recip(pow(ms(NOW,PubDate),<val>),3.16e-11,1,1)}
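Replace the <val> placeholder with a value between 0 and 2. Since pow(x,0) is 1, a value of 0 makes the boost constant, so the results are ordered by relevance alone; a value of 2 lets the date dominate, which is nearly "order by date".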
Not sure if this works for dismax, but it works for the standard Solr search handler (with a different syntax than in the example above) and for edismax.
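One way to see how the exponent changes the date influence is to evaluate the boost formula for the two documents from the question (1 day and 15 days old). This is a minimal Python sketch of recip(pow(ms(NOW,PubDate),val),3.16e-11,1,1); the helper names are made up, and the exponents tried are arbitrary.

```python
MS_PER_DAY = 24 * 3600 * 1000


def recip(x, m, a, b):
    """Same formula as Solr's recip(x, m, a, b): a / (m * x + b)."""
    return a / (m * x + b)


def boost(age_days, val):
    """recip(pow(ms(NOW,PubDate),val),3.16e-11,1,1) for a document of the given age."""
    return recip((age_days * MS_PER_DAY) ** val, 3.16e-11, 1, 1)


# How much more boost does the 1-day-old document get than the 15-day-old one?
for val in (0, 0.5, 1, 2):
    ratio = boost(1, val) / boost(15, val)
    print(f"val={val}: 1-day boost is {ratio:.3f}x the 15-day boost")
```

Because edismax multiplies the relevance score by the boost, a ratio close to 1 means the date barely changes the order, while a large ratio means the newer document wins unless the older one has a far better text score.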