Solr: boost direct match over fuzzy match

Let's say I have a query like this:
text_data:(Apple OR Apple~2)
How do I know what boost value to provide to give the direct match a clear priority over the fuzzy match?

You can't really guarantee a clear priority, as the fuzzy search will naturally match more terms (Apple, Appl, App, Appla, and so on). Just give the direct match a high enough boost value that it outscores the fuzzy search in everything but edge cases. The fuzzy search will also help you out by scoring an exact match for 'Apple' higher than any matches that involve deletions or substitutions:
text_data:(Apple^10 OR Apple~2)
This multiplies the normal score for the Apple search term by 10.
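To see which variants a clause like Apple~2 can pick up, here is a minimal sketch of the Levenshtein edit distance that Lucene's FuzzyQuery is based on (plain Python for illustration, not Solr's actual implementation):

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

# Every term below is within edit distance 2 of 'Apple' -- including
# 'Maple' (two substitutions) -- which shows how broadly Apple~2 matches.
terms = ["Apple", "Appla", "Appl", "App", "Maple"]
matched = [t for t in terms if edit_distance("Apple", t) <= 2]
```

The exact term has distance 0 and therefore scores highest within the fuzzy clause itself, but the extra `Apple^10` clause is still what reliably separates it from the near-misses.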

Related

solr fuzzy vs wildcard vs stemmer

I have a couple of questions here.
I want to search for the term jumps.
With fuzzy search, I can do jump~
With wildcard search, I can do jump*
With a stemmer, I can do jump
My understanding is that fuzzy search also gives pump, wildcard search also gives jumping, and the stemmer gives "jumper" as well.
I agree with all of those results.
What is the performance of these three?
A wildcard is not recommended at the beginning of a term, since (as I understand it) it then has to be matched against all the tokens in the index; but in this case it would only be the tokens that start with jump.
Fuzzy search gives me unpredictable results; I assume it does something akin to a spellcheck.
A stemmer suits only particular scenarios, e.g. it can't match pumps.
How should I use these so that they give more relevant results?
I'm probably more confused about all of this because of this section. Any suggestions, please?
Question 1
Wildcard queries are (generally) not analysed (i.e. they're not tokenized or run through filters), meaning that anything that depends on filters processing the input/output tokens will give weird results (for example if the input string would normally be broken into multiple tokens).
The matching happens on the tokens, so your input is matched almost directly (lowercasing still works) against the prefix / postfix of the tokens in the index. Generally you'd want to avoid wildcard queries for general search, since they're rather limited for natural search and can give weird results (as shown).
Fuzzy search is based on "edit distance" - i.e. a number that tells Solr how many characters can be removed/inserted/changed to get to the resulting token. This will give your users OK-ish results, but might be hard to decipher in the sense of "why did this give me a hit" when the allowed distance is larger (Lucene/Solr supports up to 2 in edit distance which is also the default if no edit distance is given).
Stemming is usually the way to go, as it's the actual "formal" process of taking a term and reducing it down to its stem. It doesn't really know anything about the word's meaning in the natural-language-processing sense, but it works according to a set of static rules and exceptions for the configured language. It can be adjusted per language with rules suitable for that language, which neither of the other two options can.
For your downside regarding stemming ("Since it can't match pumps") - that might actually be a good thing. It'll be clearer to your users what the search results are based on, and instead of including pumps in your search result, include it as a spelling correction ("Did you mean pump / pumps instead?"). It'll give a far better experience for any user, where the search results will more closely match what they're searching for.
The requirements might differ based on what your actual use case is; i.e. if it's just for programmatic attempts to find terms that look similar.
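The difference between wildcard and stemmed matching described above can be sketched with a toy example. The suffix-stripping `stem` function here is a deliberately naive stand-in for a real analyzer such as Solr's PorterStemFilterFactory, and the token list is invented:

```python
def stem(word: str) -> str:
    """Naive suffix-stripping 'stemmer' -- illustration only."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

# Pretend these are the tokens in the index.
index_terms = ["jump", "jumps", "jumping", "jumper", "jumpy", "pump", "pumps"]

wildcard = [t for t in index_terms if t.startswith("jump")]      # jump*
stemmed  = [t for t in index_terms if stem(t) == stem("jumps")]  # stemmed match
```

Note that `jump*` also pulls in unrelated prefix matches like `jumpy`, while the stemmer only groups genuine inflections of jump; and neither matches `pumps`, which (per the answer above) is better handled as a "did you mean" suggestion.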
Question 2
Present the results you deem more relevant as the first hits. If you're doing wildcard or fuzzy searches you can't do this through scoring alone, so you'll have to make several queries and then present them after each other. When discussing this in projects I usually suggest making that an explicit action by the user of the search.
Instead, as the main search, you can use an NGramFilter in a separate field and use a copyField instruction to get the same content into both fields, then weight hits in the ngram field far lower than hits in the more "exact" field. Usually you want three fields in that case: one for exact hits (non-stemmed), one for stemmed hits and one for ngram hits, scored appropriately with the qf parameter to edismax. It usually gives you the quickest and easiest route to decent search results for your users, but make sure to give them good ways of either filtering the result set (facets) or changing their queries into something more meaningful (did you mean, see also xyz, etc.).
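The three-field setup described above might look like this in schema.xml. All field and type names here are made up for illustration (the `text_exact` and `text_stem` types are omitted for brevity), but the tokenizer and filter factories are Solr's own:

```xml
<!-- Hypothetical fields: same content copied into all three -->
<field name="title_exact" type="text_exact" indexed="true" stored="true"/>
<field name="title_stem"  type="text_stem"  indexed="true" stored="false"/>
<field name="title_ngram" type="text_ngram" indexed="true" stored="false"/>
<copyField source="title_exact" dest="title_stem"/>
<copyField source="title_exact" dest="title_ngram"/>

<fieldType name="text_ngram" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="5"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

A query could then weight the fields along the lines of `defType=edismax&qf=title_exact^10 title_stem^3 title_ngram^0.5` (the exact boost values are something to tune against your own data).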
Guessing the user's intent is usually very hard unless you have invested a lot of time and resources into personalisation (think Google), so leave that for later - most users are happy as long as they have a clear and distinct way of solving their own problems, even if you don't get it perfect for the first result.
For question 2 you can go from strict to permissive.
Option one: only return strict search results. If nothing is found, fall back to stemmer results, and continue with fuzzy or wildcard search if there are still no results.
Option two: return all results, but rank them by level (i.e. exact matches first, then stemmer results, ...).
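Option one above can be sketched as a simple fallback loop. Here `search` stands in for a real Solr client call (e.g. via pysolr or plain HTTP), and the field names are assumptions:

```python
def tiered_search(term, search):
    """Try increasingly permissive query variants; stop at the first
    tier that returns hits. `search(query) -> list` is a stand-in for
    a real Solr request."""
    tiers = [
        f"text:{term}",        # exact
        f"text_stem:{term}",   # stemmed field (assumed to exist)
        f"text:{term}~2",      # fuzzy fallback
        f"text:{term}*",       # wildcard fallback
    ]
    for query in tiers:
        hits = search(query)
        if hits:
            return query, hits
    return None, []

# Demo with a stub backend where only the fuzzy tier has a hit:
demo_hits = {"text:jumps~2": ["doc42"]}
query, hits = tiered_search("jumps", lambda q: demo_hits.get(q, []))
```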

Solr: character proximity ranks misspellings higher because of inverse document frequency

I'm using character proximity to allow for some misspellings, for example:
text:manager~1
This allows both 'manager' and 'managre' to be matched. The problem is, the misspellings are always ranked higher than the proper spelling because there are fewer of those in the index. For example, let's say I have 3 documents as follows:
1) text:manager
2) text:manager
3) text:managre
Then the character proximity query above will give an inverse document frequency (idf) of 1.7 to 'managre' and 1.2 to 'manager', effectively ranking the misspelled 'managre' higher. From a technical perspective this makes sense (there are fewer occurrences of 'managre' than of 'manager'), but in reality it doesn't. Is there a way to get Solr to set the idf of misspelled words to match that of the correct spelling?
The short answer is no. The long answer is that you have good options here; you need to solve this in a different way.
To begin with take the power of query time boosting. So you can query something like:
text:manager^1.2 OR text:manager~1^0.8
Here you are saying: my user is smart, so I will give a higher boost to the user's literal query, but just in case I will give its variants a slightly lower boost. You need to combine the exact match (with the higher boost) and the fuzzy query in a boolean OR so that exact matches rank higher. Do not worry about the extra work for Solr; it is built for very complex Lucene query trees. Using a combination of queries to get the expected relevancy ranking is common practice.
TF, IDF and Solr's built-in relevancy ranking are somewhat arbitrary on their own; framing queries with boosts, boolean clauses, and context-based filters is where the power and flexibility of Solr lie.
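To see why the rarer misspelling wins, here is the classic Lucene-style idf formula in a simplified form (the exact numbers depend on the Similarity implementation your Solr version uses, so this only shows the shape of the effect, not the 1.7/1.2 values above):

```python
import math

def idf(num_docs: int, doc_freq: int) -> float:
    """Simplified classic-Lucene idf: rarer terms get a larger idf."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))

n = 3                      # the three example documents above
idf_manager = idf(n, 2)    # 'manager' appears in 2 docs
idf_managre = idf(n, 1)    # 'managre' appears in 1 doc
# idf_managre > idf_manager, which is why the query-time boost
# (text:manager^1.2 OR text:manager~1^0.8) is needed to flip the order.
```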

Fuzzy Search in Solr

I am working on a fuzzy query using Solr, which goes over a repository of data that could contain misspelled or abbreviated words. For example, the repository could have a name containing the word "Hlth" (an abbreviated form of 'Health').
If I do a fuzzy search for Name:'Health'~0.35 I get results with word 'Health' but not 'Hlth'.
If I do a fuzzy search for Name:'Hlth'~0.35 I get records with names 'Health' and 'Hlth'.
I would like to get the first query to work. In my business use-case, I would have to use the clean data to query for all the misspelled or abbreviated words.
Could someone please throw some light on why the first fuzzy search is not working, and on whether there are other ways of achieving the same?
You are using the fuzzy query in the wrong way.
According to Mike McCandless (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms.
With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
So you need to write queries like this - Health~2
You write: "I wanted to match Parkway with Pkwy"
Parkway and Pkwy have an edit distance of 3. You could achieve this by substituting "~3" for the "~2" in the first response, but Solr fuzzy matching is not recommended for values greater than 2, for performance reasons.
I think the best way to approach your problem would be to generate a context-specific dictionary of synonyms and do query-time expansion.
Using phonetic filters may solve your problem.
Please consider looking at the following
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PhoneticFilter
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
Hope this helps.
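To illustrate why a phonetic filter helps where edit distance struggles, here is a minimal Soundex sketch (a simplification of standard American Soundex: it ignores the h/w separator subtlety, and Solr's PhoneticFilterFactory offers more sophisticated encoders such as DoubleMetaphone):

```python
CODES = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
         **dict.fromkeys("dt", "3"), "l": "4",
         **dict.fromkeys("mn", "5"), "r": "6"}

def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three digit codes."""
    word = word.lower()
    digits = [CODES.get(c, "") for c in word]   # vowels/h/w/y -> ""
    out = [word[0].upper()]
    prev = digits[0]
    for d in digits[1:]:
        if d and d != prev:                     # skip adjacent duplicates
            out.append(d)
        prev = d
    return ("".join(out) + "000")[:4]

# 'Health' and 'Hlth' encode identically, so a phonetic field matches
# them even though they look quite different character-wise.
```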

what this `^` mean here in solr

I am confused here, but I want to clear up my doubt. I think it is a stupid question, but I want to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
What does this ^ mean here?
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This is giving a boost (making it more important) to the value NeXT versus next in this query. From the wiki page you linked to "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on boosting, please see the Boosting Ranking Terms section in the Solr Relevancy Cookbook. This slide deck about boosting from the Lucene Revolution conference earlier this year also contains good information on how boosting works and how to apply it to various scenarios.
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher/more relevant in this query because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of highest boost value), any documents matching term:Next will be scored lower than term:NeXT, but higher than term:next.
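As a toy model of the scoring described above (real Lucene scoring also involves tf-idf, field norms and query normalization, so this only shows the shape of the calculation): each matching clause contributes its boost times the raw term score, and clause scores are summed.

```python
def clause_score(matches: bool, raw_score: float, boost: float = 1.0) -> float:
    """A clause contributes boost * raw score if it matches, else 0."""
    return boost * raw_score if matches else 0.0

raw = 0.5  # pretend raw score for any single term match

# Query: (term:NeXT^10 OR term:next)
doc_mixed_case = clause_score(True, raw, 10) + clause_score(True, raw)   # "NeXT"
doc_lower_only = clause_score(False, raw, 10) + clause_score(True, raw)  # "next"
```

The document containing the original mixed-case "NeXT" matches both clauses, and the boosted clause dominates its score.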

Boosting documents in Solr based on the vote count

I have a field in my schema which holds the number of votes a document has. How can I boost documents based on that number?
Something like: the one with the maximum number gets a boost of 10, the one with the smallest number gets 0.5, and the values in between are calculated automatically.
What I do now is this, but it doesn't give the desired results:
recip(rord(vote_count),1,1000,1000)^10.0
Thanks.
I tend to build my indexes using raw Lucene, in which case it is extremely easy:
doc.setBoost(boost_val);
I'm just starting on this and it looks like either a linear or a log-based boost will help most, i.e. log(votecount)^10 (don't forget that ^10 means a boost of times 10, not to the tenth power).
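The log-based mapping from vote counts into the requested [0.5, 10] boost range could be sketched like this (plain Python; the interpolation scheme is my own assumption, not something Solr prescribes):

```python
import math

def vote_boost(votes, all_votes, low=0.5, high=10.0):
    """Map a vote count onto [low, high] on a log scale:
    the minimum count maps to low, the maximum to high."""
    lo = math.log1p(min(all_votes))
    hi = math.log1p(max(all_votes))
    if hi == lo:               # all documents have the same vote count
        return high
    frac = (math.log1p(votes) - lo) / (hi - lo)
    return low + frac * (high - low)

counts = [0, 3, 12, 250]
boosts = [vote_boost(v, counts) for v in counts]
```

Inside Solr itself, a function query along the lines of `scale(log(sum(vote_count,1)),0.5,10)` expresses a similar idea; check the function query support of your Solr version before relying on it.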
