I'm trying to understand how the scoring is generated for Azure Search matches, as some of my results are distinctly odd (though probably correct if only I understood why!). There is nothing officially documented, but is there anything like Lucene's Explain for Azure Search?
Thanks
The default scoring method uses the TF-IDF algorithm to calculate a value for each searchable field in the document. Those values are then summed together to create the final score.
More details on TFIDF here: https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
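As a rough illustration of that per-field calculation (a simplified sketch in Python, not Lucene's full practical scoring formula, which also applies length norms and query normalization; all numbers below are made up):

import math

def tf_idf(term_freq: int, doc_freq: int, num_docs: int) -> float:
    # Simplified TF-IDF in the spirit of Lucene's classic TFIDFSimilarity:
    # tf is sqrt(term frequency), idf is 1 + log(N / (df + 1)).
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1))
    return tf * idf

# Hypothetical numbers: the term occurs twice in the field and
# appears in 10 of 1000 documents in the index.
print(tf_idf(term_freq=2, doc_freq=10, num_docs=1000))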
You can alter the score further by using scoring profiles to boost the score of certain fields.
https://learn.microsoft.com/en-us/rest/api/searchservice/add-scoring-profiles-to-a-search-index
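For example (a minimal sketch; the profile name, field names and weights below are hypothetical), a scoring profile that favours title matches would be added to the index definition's scoringProfiles array and selected at query time with scoringProfile=boostTitle:

# Hypothetical scoring profile fragment for an Azure Search index definition.
# The weights multiply each field's TF-IDF contribution before they are summed.
scoring_profile = {
    "name": "boostTitle",
    "text": {
        "weights": {
            "title": 5.0,        # hypothetical field names and weights
            "description": 1.0,
        }
    },
}
print(scoring_profile)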
Hi there, I was having the same problem that you're having. A client of mine asked me to help improve their search performance, so I reverse engineered the Azure Search scoring algorithm and documented it in a blog post. Please take a look at it and let me know if it's helpful.
It basically comes down to the following equation.
totalScore = weightedFieldScores * functionAggregation
weightedFieldScores = (f1 * w1) + (f2 * w2) + ...
Where fi is the TF-IDF score of field i and wi is the weight configured in the scoring profile for that field. The sum of these weighted field scores is the total weighted field score.
This is then multiplied by the aggregated function score, which is:
functionAggregation = fa(f1(x), f2(x), ...)
Where fa is the aggregation function, which can be the sum of all functions, the first, the average, etc., and f1, f2 are the scoring functions themselves (tag, magnitude, etc.).
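Putting the two equations together, a back-of-the-envelope version of the calculation (with made-up field scores, weights and function scores) might look like this in Python:

# Hypothetical per-field TF-IDF scores and scoring-profile weights.
field_scores = {"title": 2.1, "description": 0.8}
weights = {"title": 5.0, "description": 1.0}
weighted_field_scores = sum(field_scores[f] * weights[f] for f in field_scores)

# Hypothetical outputs of two scoring functions (e.g. tag, magnitude),
# aggregated here with "sum"; other aggregations (first, average, ...) exist.
function_scores = [1.5, 1.2]
function_aggregation = sum(function_scores)

total_score = weighted_field_scores * function_aggregation
print(total_score)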
Please let me know if this is helpful.
https://dibranmulder.github.io/2020/09/22/Improving-your-Azure-Seach-performance/
Related
We're having some relevance issues with Solr results. In this particular example we have product A showing up above product B. Product A's title contains the search term. Product B's title also contains the search term along with its Description and Category Name. So logically, Product B should be more relevant and appear above Product A, but it does not.
The schema is configured to take all of these extra fields into account. After analyzing the debug info of the query with ...&debugQuery=true&debug.explain.structured=true it appears that both products have achieved the same score. Looking further, I can see that these extra fields have scores calculated, but for some reason the parser only takes the maximum of these scores instead of the sum, which causes them to be the same.
Is there a reason that Solr behaves this way? Is there any way to change this behavior to use the sum instead of the max? (Just like in the parent element in the images)
You can control how the score is calculated using the tie parameter, provided that you are using the DisMax/eDisMax query parser.
The Solr documentation explains it very well:
tie (Tie Breaker) parameter:
The tie parameter specifies a float value (which should be something much less than 1) to use as a tiebreaker in DisMax queries.
When a term from the user's input is tested against multiple fields, more than one field may match. If so, each field will generate a different score based on how common that word is in that field (for each document relative to all other documents). The tie parameter lets you control how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field.
A value of "0.0" - the default - makes the query a pure "disjunction max query": that is, only the maximum scoring subquery contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query" where it doesn't matter what the maximum scoring subquery is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
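In other words, for a given document the DisMax score is roughly the best field score plus tie times the remaining field scores. A quick sketch (the per-field scores are made up):

def dismax_score(field_scores, tie=0.1):
    # DisMax-style combination: max score plus tie * the other scores.
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

scores = [1.8, 0.9, 0.4]                 # hypothetical per-field scores
print(dismax_score(scores, tie=0.0))     # pure "disjunction max": 1.8
print(dismax_score(scores, tie=1.0))     # pure "disjunction sum": 3.1
print(dismax_score(scores, tie=0.1))     # typical low tie value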
I can see from the documentation that I can use referencePointParameter and tagsParameter to pass parameters into the distance and tags scoring functions respectively.
I'd like to do the same with the magnitude scoring function, but can't see from the documentation how to do this (or if it's even possible).
For example, if a product was £100, I'd like to get similar products with a similar price. I think I could do this with 2 magnitude functions (e.g. one boosting from £80 up to £100 and another from £120 down to £100, so that products closest to the original product's £100 price are boosted the most).
Is this possible?
No, it is not possible to do magnitude boosting based on relative values of a field across documents. This feature is intended for situations where you statically know the ranges that you want to boost (for example, when boosting based on a rating field with a fixed scale).
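For the static case described above (a fixed scale such as a rating field), a magnitude function in a scoring profile looks roughly like this; the profile name, field name and boost values are hypothetical, so check the scoring profile reference linked earlier for the exact schema:

# Hypothetical scoring profile with a static magnitude function that
# boosts a rating field over its fixed 0-5 scale (part of the index definition).
scoring_profile = {
    "name": "boostByRating",
    "functionAggregation": "sum",
    "functions": [
        {
            "type": "magnitude",
            "fieldName": "rating",
            "boost": 2.0,
            "interpolation": "linear",
            "magnitude": {
                "boostingRangeStart": 0,
                "boostingRangeEnd": 5,
                "constantBoostBeyondRange": False,
            },
        }
    ],
}
print(scoring_profile)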
I am working on a fuzzy query using Solr, which goes over a repository of data that could contain misspelled or abbreviated words. For example, the repository could have a name containing the word "Hlth" (an abbreviated form of the word "Health").
If I do a fuzzy search for Name:'Health'~0.35 I get results with word 'Health' but not 'Hlth'.
If I do a fuzzy search for Name:'Hlth'~0.35 I get records with names 'Health' and 'Hlth'.
I would like to get the first query to work. In my business use case, I would have to use the clean data to query for all the misspelled or abbreviated words.
Could someone please help and throw some light on why the first fuzzy search is not working, and whether there are any other ways of achieving the same?
You are using the fuzzy query in the wrong way.
According to what Mike McCandless says (http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html):
FuzzyQuery matches terms "close" to a specified base term: you specify an allowed maximum edit distance, and any terms within that edit distance from the base term (and, then, the docs containing those terms) are matched.
The QueryParser syntax is term~ or term~N, where N is the maximum allowed number of edits (for older releases N was a confusing float between 0.0 and 1.0, which translates to an equivalent max edit distance through a tricky formula).
FuzzyQuery is great for matching proper names: I can search for mcandless~1 and it will match mccandless (insert c), mcandles (remove s), mkandless (replace c with k) and a great many other "close" terms. With max edit distance 2 you can have up to 2 insertions, deletions or substitutions. The score for each match is based on the edit distance of that term; so an exact match is scored highest; edit distance 1, lower; etc.
So you need to write queries like this - Health~2
You write: "I wanted to match Parkway with Pkwy"
Parkway and Pkwy have an edit distance of 3. You could achieve this by subbing in "~3" for "~2" from the first response, but Solr fuzzy matching is not recommended for values greater than 2 for performance reasons.
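A quick way to sanity-check which N you need is to compute the edit distance yourself. A small sketch (plain Levenshtein; Lucene's fuzzy matching also counts transpositions as single edits):

def levenshtein(a: str, b: str) -> int:
    # Plain Levenshtein edit distance (insertions, deletions, substitutions).
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("health", "hlth"))     # 2 -> Health~2 can reach Hlth
print(levenshtein("parkway", "pkwy"))    # 3 -> more edits than the recommended max of 2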
I think the best way to approach your problem would be to generate a context-specific dictionary of synonyms and do query-time expansion.
Using phonetic filters may solve your problem.
Please consider looking at the following
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-PhoneticFilter
https://cwiki.apache.org/confluence/display/solr/Phonetic+Matching
Hope this helps.
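To see why a phonetic filter can help here, a tiny Soundex sketch (Soundex is one of the encoders the PhoneticFilter can be configured with) shows "Health" and "Hlth" collapsing to the same code:

def soundex(word: str) -> str:
    # Minimal Soundex: words with the same code sound roughly alike.
    codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
             **dict.fromkeys("dt", "3"), "l": "4",
             **dict.fromkeys("mn", "5"), "r": "6"}
    word = word.lower()
    digits, prev = [], codes.get(word[0], "")
    for ch in word[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            digits.append(code)
        if ch not in "hw":               # h and w do not reset the previous code
            prev = code
    return (word[0].upper() + "".join(digits) + "000")[:4]

print(soundex("Health"), soundex("Hlth"))   # both H430, so they match phonetically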
I am confused here, but I want to clear my doubt. I think it is a stupid question, but I want to know.
Use a TokenFilter that outputs two tokens (one original and one lowercased) for each input token. For queries, the client would need to expand any search terms containing upper case characters to two terms, one lowercased and one original. The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score.
text:NeXT ==> (text:NeXT^10 OR text:next)
What does this ^ mean here?
http://wiki.apache.org/solr/SolrRelevancyCookbook#Relevancy_and_Case_Matching
This is giving a boost (making it more important) to the value NeXT versus next in this query. From the wiki page you linked to "The original search term may be given a boost, although it may not be necessary given that a match on both terms will produce a higher score."
For more on Boosting please see the Boosting Ranking Terms section in the Solr Relevancy Cookbook. This Slide Deck about Boosting from the Lucene Revolution Conference earlier this year also contains good information on how boosting works and how to apply it to various scenarios.
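The query rewriting described in the cookbook (expanding a mixed-case term into the boosted original plus a lowercase copy) can be done on the client side. A minimal sketch; the field name and boost value are just placeholders:

def expand_case_term(field: str, term: str, boost: int = 10) -> str:
    # Expand a mixed-case term into the boosted original plus its lowercase form.
    if term == term.lower():
        return f"{field}:{term}"
    return f"({field}:{term}^{boost} OR {field}:{term.lower()})"

print(expand_case_term("text", "NeXT"))   # (text:NeXT^10 OR text:next)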
Edit1:
For more information on the boost values (the number after the ^), please refer to the following:
Lucene Score Boosting
Lucene Similarity Implementation
Edit2:
The value of the boost influences the score/relevancy of an item returned from the search results.
(term:NeXT^10 term:next) - Any documents matching term:NeXT will be scored higher/more relevant in this query because they have a boost value of 10 applied.
(term:NeXT^10 term:Next^5 term:next) - Any documents matching term:NeXT will be scored the highest (because of highest boost value), any documents matching term:Next will be scored lower than term:NeXT, but higher than term:next.
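As a toy illustration of how the boost value feeds into the score (a simplification; Lucene multiplies the query-time boost into each matching clause's contribution before combining them):

# Hypothetical base scores for the clauses that matched one document.
matched_clauses = {"term:NeXT": 1.2, "term:next": 1.2}
boosts = {"term:NeXT": 10.0, "term:next": 1.0}

# Each clause's contribution is scaled by its boost before being summed.
score = sum(matched_clauses[c] * boosts[c] for c in matched_clauses)
print(score)   # the NeXT clause dominates because of the ^10 boost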
When I choose to view the score field in solr results I see the score assigned by solr to every document returned and a maxscore value that is the score of the topmost returned document.
I need to know whether there is a cut-off to the Solr score or not. I mean, if the maxscore is 6.89343 or 2.34365, does this mean that it is 6.89343 out of 10 as the final score? And how can I decide that I'm close to the most correct result?
If possible, I need a simple explanation of the scoring algorithm used by solr.
The maxscore is the score of the topmost document in the search results.
There is no cutoff for the maxscore; it depends on the scoring calculations and normalization done by Lucene/Solr.
The topmost document has the maxscore, and the scores of the documents below it give you an idea of how far off they are from the topmost.
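If you want a bounded number, one option is to look at each score relative to the maxscore of the same result set (this is only meaningful within a single query, since scores are not comparable across queries):

scores = [6.89343, 5.2, 1.7]                 # hypothetical scores for one query
maxscore = max(scores)
relative = [s / maxscore for s in scores]
print(relative)                              # 1.0 for the top hit, fractions below it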
For an explanation of scoring, you can check this link.
If it is indeed a z-score from a normal distribution, then you can calculate the CDF (as it appears here). The CDF will give you a bounded score from 0 to 1. It's hard for me to interpret what the CDF really means in this case, given that the un-normalized score is calculated in several steps, but you can sort of think of it as the probability that you got the right answer, as long as your collection is well populated with the relevant material.
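A small sketch of that CDF calculation, using the error function for the standard normal distribution (this only makes sense under the answer's assumption that the score behaves like a z-score):

import math

def normal_cdf(z: float) -> float:
    # Standard normal CDF: maps a z-score to a value in (0, 1).
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(normal_cdf(1.5))   # hypothetical score treated as a z-score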