I'm loading data into Solr from a MySQL database with the DataImportHandler. Every document contains a popularity field (int type) that is calculated by another application and saved into MySQL (this field is based on some rules related to the application domain).
How can I use this value to improve Solr ranking? Would it be correct to sum the Solr score with the popularity value?
How can bf be used here?
A good starting point that'd probably work is multiplying the score by a sublinear function that increases (slowly) with popularity. For example,
newScore = score * log(1 + 0.5 * popularity)
To apply this boost you should use Solr's EDisMax query parser and pass the boost parameter with the following value:
&boost=log(sum(1, product(0.5, popularity)))
where popularity is the name of the field. You don't need to use the bf parameter since you should use a multiplicative boost, not an additive one.
The reason for adding 1 is to handle the case in which popularity=0 (so if each document's popularity is always at least 1, you don't need to add 1). The strength of the popularity effect can be increased or decreased by changing the 0.5 factor to some other value. For example, you can use a factor of 2 to increase the effect:
newScore = score * log(1 + 2 * popularity)
A good factor is probably around 9 / m, where m is what you expect the median popularity to be: since Solr's log function is base 10, the boost of a "median document" (one whose popularity equals m) comes out to log(1 + 9) = 1 (that is, its score won't be changed at all).
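As a quick sanity check, here is a Python sketch of that boost function (assuming Solr's base-10 log function query; the median value of 50 is just an illustrative assumption):

```python
import math

def boost(popularity, factor):
    """Multiplicative boost mirroring log(sum(1, product(factor, popularity))).
    Solr's log function query is base 10, hence math.log10."""
    return math.log10(1 + factor * popularity)

m = 50                    # assumed median popularity
factor = 9 / m
print(boost(m, factor))   # median document: log10(1 + 9) -> 1.0, score unchanged
print(boost(0, factor))   # popularity 0: log10(1) -> 0.0, so the score is zeroed
```

Note that a popularity of 0 yields a multiplicative boost of 0, which wipes out the score entirely; if that is undesirable, the constant inside the log can be raised above 1.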
Again, this is just a starting point and you'll have to try out different boosting functions until you find one that performs well.
Related
We're having some relevance issues with Solr results. In this particular example we have product A showing up above product B. Product A's title contains the search term. Product B's title also contains the search term along with its Description and Category Name. So logically, Product B should be more relevant and appear above Product A, but it does not.
The schema is configured to take all of these extra fields into account. After analyzing the debug info of the query with ...&debugQuery=true&debug.explain.structured=true, it appears that both products have achieved the same score. Looking further, I can see that these extra fields have scores calculated, but for some reason the parser only takes the maximum of these scores instead of the sum, which causes them to be the same:
Is there a reason that Solr behaves this way? Is there any way to change this behavior to use the sum instead of the max? (Just like in the parent element in the images)
You can control how the score is calculated using the tie parameter, provided that you are using Dismax/eDismax query parser.
The Solr documentation explains it very well:
tie (Tie Breaker) parameter:
The tie parameter specifies a float value (which should be something much less than 1) to use as a tiebreaker in DisMax queries.
When a term from the user’s input is tested against multiple fields, more than one field may match. If so, each field will generate a different score based on how common that word is in that field (for each document relative to all other documents). The tie parameter lets you control how much the final score of the query will be influenced by the scores of the lower scoring fields compared to the highest scoring field.
A value of "0.0" - the default - makes the query a pure "disjunction max query": that is, only the maximum scoring subquery contributes to the final score. A value of "1.0" makes the query a pure "disjunction sum query" where it doesn’t matter what the maximum scoring subquery is, because the final score will be the sum of the subquery scores.
Typically a low value, such as 0.1, is useful.
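The combination can be sketched in Python (the per-field scores below are hypothetical, not real Solr output): with tie = t, the combined score is the best field score plus t times the sum of the remaining field scores.

```python
def dismax_score(field_scores, tie=0.0):
    """DisMax combination: the best field score plus `tie` times the rest."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)

scores = [1.2, 0.8, 0.5]           # hypothetical per-field scores for one document
print(dismax_score(scores, 0.0))   # pure max -> 1.2
print(dismax_score(scores, 1.0))   # pure sum -> 2.5
print(dismax_score(scores, 0.1))   # low tie -> 1.2 + 0.1 * 1.3 ~ 1.33
```

With tie=1.0 the lower-scoring fields count fully, which is the "sum instead of max" behavior the question asks for.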
In Apache Solr there is an omitTermFreqAndPositions property, and there is an omitPositions property. Is there a built in way to omit term frequency but preserve term positions when a field's score is calculated, or is it otherwise simple to do so?
No, not unless you use a custom similarity class. Similarities are field-specific from Solr 4.x onwards, so you can define a custom similarity for a single field; if you don't want term frequency to contribute to that field's score, have it return 1.0f for the term frequency regardless of how many times the term occurs in the field.
I have written a small search engine as my weekly project. It is based upon cosine similarity between the query vector and document vectors. The vectors are calculated using the tf-idf scores of tokens.
I have come to know about Apache Solr, which is a full-text search engine. My question is: does Solr use cosine similarity internally when ranking search results?
No. Solr uses something similar to cosine similarity, but not quite the same - there are some key differences.
If you visit that same link (https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) and scroll further down, you will see "Lucene Conceptual Scoring Formula" and "Lucene Practical Scoring Formula" that give more details.
Ignoring any index/query-time boosts, the following are some key differences:
1. Different document normalization factor
Instead of normalizing each document by the Euclidean norm of its tf-idf vector, Lucene uses a "doc-len-norm". For the default similarity measure (DefaultSimilarity) this is just 1/sqrt(number of terms in the doc), which basically equals 1/sqrt(sum(tf)) - where sum(tf) is the sum of the term counts in the doc - so there is no squaring as with the Euclidean norm, and the idf of each term is left out. Furthermore, this value is rounded to a byte to save space. It will most often come out to a different value than the normalization factor used for cosine similarity.
2. Extra "coord" boost
There is also an extra value multiplied onto the score equal to:
the number of query terms matched in the document / the total number of terms in the query.
This gives an extra boost to fields (documents) matching more of the query terms, and may be of questionable value. It essentially multiplies the tf-idf vector score by another inner product: the inner product of the document and query vectors converted to boolean vectors (0 if the term is absent, 1 if it is present), with only the query vector normalized by its Euclidean norm.
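A rough numeric sketch of these two differences (simplified: idf and Lucene's byte-rounding of norms are ignored, and the numbers are illustrative, not real Solr scores):

```python
import math

def cosine_norm_factor(tfs):
    # Cosine similarity divides by the Euclidean norm of the tf vector
    return 1.0 / math.sqrt(sum(tf * tf for tf in tfs))

def lucene_doc_len_norm(tfs):
    # DefaultSimilarity's doc-len-norm: 1/sqrt(total number of terms in the doc)
    return 1.0 / math.sqrt(sum(tfs))

def coord(matched_query_terms, total_query_terms):
    # Extra "coord" boost: fraction of query terms found in the document
    return matched_query_terms / total_query_terms

tfs = [3, 1, 1]                     # term frequencies of one toy document
print(cosine_norm_factor(tfs))      # 1/sqrt(11) ~ 0.30 (squares the tfs)
print(lucene_doc_len_norm(tfs))     # 1/sqrt(5)  ~ 0.45 (no squaring)
print(coord(2, 3))                  # 2 of 3 query terms matched -> ~0.67
```

Even on this tiny example the two normalization factors differ, and the coord factor further separates Lucene's score from a pure cosine similarity.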
Yes, Solr (which runs on top of Lucene) does use Cosine similarity. From the Lucene documentation:
VSM score of document d for query q is the Cosine Similarity of the
weighted query vectors V(q) and V(d)
cosine-similarity(q,d) = V(q) · V(d) / |V(q)| |V(d)|
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
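The formula above can be sketched directly (a minimal Python illustration with toy weighted vectors, not Lucene's actual implementation):

```python
import math

def cosine_similarity(vq, vd):
    """cosine-similarity(q, d) = V(q) . V(d) / (|V(q)| * |V(d)|)"""
    dot = sum(a * b for a, b in zip(vq, vd))
    norm_q = math.sqrt(sum(a * a for a in vq))
    norm_d = math.sqrt(sum(b * b for b in vd))
    return dot / (norm_q * norm_d)

vq = [1.0, 0.0, 2.0]   # toy weighted query vector
vd = [2.0, 0.0, 4.0]   # toy weighted document vector (same direction)
print(cosine_similarity(vq, vd))   # parallel vectors -> similarity ~ 1.0
```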
If you're looking for actual vector similarity in Solr, there are two approaches:
1) use delimited payloads. There are a few plugins that implement this already, like https://github.com/moshebla/solr-vector-scoring and https://github.com/saaay71/solr-vector-scoring
2) use streaming expressions, which comes out of the box: https://lucene.apache.org/solr/guide/8_5/vector-math.html
The latter is slower, but more flexible.
Is it possible to do a custom multiplication of the Score returned by Solr? We have a factor in the range of 1.00-1.30 based on our own formula and I wish to just multiply the "final" Solr score with this - without having it normalized.
I've tried using various boosts in DisMax, but none of them produce the desired result, because 1) the custom value is added (not multiplied) to the score and 2) it is normalized (queryNorm) before the addition.
I found a way to do this using the Extended DisMax query parser, introduced in Solr 3.1. It offers all the same features as the normal DisMax, plus a few useful enhancements.
The one I needed was the boost parameter. It acts the same way as the bf parameter from DisMax, but instead of adding a normalized value to the score, it multiplies the boost into the score (without any normalization).
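The difference can be sketched numerically (hypothetical score and factor values, not real Solr output; the queryNorm applied to bf is omitted here for simplicity):

```python
base_score = 4.2   # hypothetical raw Solr score
factor = 1.25      # our custom factor, in the 1.00-1.30 range

# bf (DisMax): the function value is ADDED to the score
additive = base_score + factor

# boost (eDisMax): the function value is MULTIPLIED into the score
multiplicative = base_score * factor

print(additive)        # 5.45
print(multiplicative)  # 5.25
```

With a multiplicative boost, a factor of 1.00 leaves the score untouched and 1.30 scales it by exactly 30%, which is what the question asks for; an additive bf has no such proportional interpretation.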
For more info, see the Solr Wiki on ExtendedDisMax
When I choose to view the score field in Solr results, I see the score assigned by Solr to every document returned, as well as a maxscore value, which is the score of the topmost returned document.
I need to know whether there is a cut-off to the Solr score or not. I mean, if the maxscore is 6.89343 or 2.34365, does that mean it is 6.89343 out of 10 as a final score? Or how can I tell that I'm close to the most relevant result?
If possible, I need a simple explanation of the scoring algorithm used by Solr.
The maxscore is the score of the topmost document in the search results.
There is no cut-off for the maxscore; it depends upon the scoring calculations and normalization done by Lucene/Solr.
The topmost document has the maxscore, and the scores of the documents below it give you an idea of how far off they are from the topmost.
For an explanation of scoring, you can check this link.
If it is indeed a z-score from a normal distribution, then you can calculate the CDF (as it appears here). The CDF will give you a bounded score from 0 to 1. It's hard for me to interpret what the CDF really means in this case, given that the un-normalized score is calculated in several steps, but you can roughly think of it as the probability that you got the right answer, as long as your collection is well populated with the relevant material.
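If you do want to map such a value onto [0, 1], the standard normal CDF can be computed with math.erf (a sketch under this answer's assumption that the score behaves like a z-score, which Solr does not guarantee):

```python
import math

def normal_cdf(z):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

print(normal_cdf(0.0))      # an "average" score maps to 0.5
print(normal_cdf(2.34365))  # a high maxscore maps close to 1
```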