Flat multiplying of Solr score - solr

Is it possible to do a custom multiplication of the Score returned by Solr? We have a factor in the range of 1.00-1.30 based on our own formula and I wish to just multiply the "final" Solr score with this - without having it normalized.
I've tried using various boosts in DisMax, but none of them produce the desired result, because 1) the custom value is added to the score (not multiplied) and 2) the values are normalized (queryNorm) before addition.

I found a way to do this: the Extended DisMax query parser, introduced in Solr 3.1. It offers all the same features as the normal DisMax parser, plus a few useful enhancements.
The one I needed was the boost parameter. It acts the same way as the bf parameter from DisMax, but instead of adding a normalized value to the score, it multiplies the boost into the score (without any normalization).
For more info, see the Solr Wiki page on ExtendedDisMax.
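As a rough illustration of the difference, the sketch below contrasts an additive boost with a multiplicative one and builds a request query string that uses the boost parameter. The field name custom_factor and the search term are hypothetical placeholders, and the two arithmetic helpers are simplifications that leave out normalization entirely; only defType, q, and boost are actual request parameters.

```python
from urllib.parse import urlencode


def additive_boost(score, factor):
    """Simplified picture of a bf-style boost: the value is added to the score."""
    return score + factor


def multiplicative_boost(score, factor):
    """Simplified picture of the edismax boost parameter: the score is multiplied."""
    return score * factor


def edismax_params(user_query, boost_function):
    """Build request parameters for an edismax query with a multiplicative boost.

    boost_function is any function query; here it is assumed to be the name of
    a numeric field (hypothetical: custom_factor) holding the 1.00-1.30 factor.
    """
    return {"defType": "edismax", "q": user_query, "boost": boost_function}


# Hypothetical search term and field name, for illustration only.
query_string = urlencode(edismax_params("ipod", "custom_factor"))
print(query_string)
print(additive_boost(2.0, 1.3), multiplicative_boost(2.0, 1.3))
```

With a factor stored per document, every hit's final score is scaled by that document's own value, which is the "flat multiplication" asked for in the question.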

Related

Can I pass a parameter to a Magnitude scoring function in Azure Search?

I can see from the documentation that I can use referencePointParameter and tagsParameter to pass parameters into the distance and tags scoring functions respectively.
I'd like to do the same with the magnitude scoring function, but can't see from the documentation how to do this (or if it's even possible).
For example, if a product was £100, I'd like to get similar products with a similar price. I think I could do this with 2 magnitude functions (e.g. boost from £80 to £100, and again from £120 to £100 will boost products closest to the £100 price of the original product).
Is this possible?
No, it is not possible to do magnitude boosting based on relative values of a field across documents. This feature is intended for situations where you statically know the ranges that you want to boost (for example, when boosting based on a rating field with a fixed scale).

How to omit term frequency in Apache Solr

In Apache Solr there is an omitTermFreqAndPositions property, and there is an omitPositions property. Is there a built in way to omit term frequency but preserve term positions when a field's score is calculated, or is it otherwise simple to do so?
No, not unless you use a custom similarity class. Similarities are field-specific as of Solr 4.x, so you can assign a custom similarity to just one field. To keep term frequency from contributing to that field's score, have the similarity return 1.0f for the term frequency regardless of how many times the term occurs in the field.
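The effect of such a similarity can be sketched in a few lines. In Solr itself this would be a Java similarity class; the Python below only illustrates the idea, assuming the classic TF-IDF behaviour where the tf factor is the square root of the raw frequency. Both function names are made up for this illustration.

```python
import math


def classic_tf(freq):
    """tf factor of the classic TF-IDF similarity: grows with sqrt(freq)."""
    return math.sqrt(freq)


def flat_tf(freq):
    """tf factor that ignores how often the term occurs (1.0 for any match)."""
    return 1.0 if freq > 0 else 0.0


# Compare how the two factors react to repeated occurrences of a term.
for freq in (1, 4, 9):
    print(freq, classic_tf(freq), flat_tf(freq))
```

Because only the tf factor changes, positions stay in the index and phrase queries keep working, which is what distinguishes this approach from omitTermFreqAndPositions.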

Does solr use cosine similarity?

I have written a small search engine as my weekly project. It is based on cosine similarity between the query vector and the document vector. The vectors are calculated using the tf-idf scores of the tokens.
I have come to know about Apache Solr, which is a full-text search engine. My question is: does Solr use cosine similarity internally when ranking search results?
No. Solr uses something similar to cosine similarity, but not quite the same - there are some key differences.
If you visit that same link (https://lucene.apache.org/core/4_10_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html) and scroll further down, you will see "Lucene Conceptual Scoring Formula" and "Lucene Practical Scoring Formula" that give more details.
Ignoring any index/query-time boosts, the following are some key differences:
1. Different document normalization factor
Instead of normalizing each document by the Euclidean norm of its tf-idf vector, it uses "doc-len-norm". For the default similarity measure (DefaultSimilarity) this is just 1/sqrt(number of terms in the doc), which basically equals 1/sqrt(sum(tf)), where the sum runs over the term counts in the doc: there is no squaring as with the Euclidean norm, and the idf for each term is left out. Furthermore, this value is rounded to a byte to save space. This will most often come out to a different value than the normalization factor used for cosine similarity.
2. Extra "coord" boost
There is also an extra value multiplied onto the score equal to:
the number of query terms matched in the document / the total number of terms in the query.
This gives an extra boost for fields (documents) matching more of the query terms, and may be of questionable value. This essentially is multiplying the tf-idf vector score with another inner product - the inner product of these vectors converted to boolean vectors (0 if it does not have the given term, 1 if it does) with the query vector only normalized by its Euclidean norm.
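The two differences can be made concrete with a small sketch. It is a simplification based only on the description above: byte rounding, boosts, and queryNorm are left out, and the function names and sample document are invented for the illustration.

```python
import math


def euclidean_norm_factor(tfidf_weights):
    """1 / |V(d)|: the document normalization used by true cosine similarity."""
    return 1.0 / math.sqrt(sum(w * w for w in tfidf_weights.values()))


def doc_len_norm(term_counts):
    """1 / sqrt(number of terms in the doc); no squaring, no idf, no byte rounding."""
    return 1.0 / math.sqrt(sum(term_counts.values()))


def coord(query_terms, term_counts):
    """Fraction of distinct query terms that occur in the document."""
    distinct = set(query_terms)
    matched = sum(1 for term in distinct if term_counts.get(term, 0) > 0)
    return matched / len(distinct)


# Hypothetical document: raw term counts and made-up tf-idf weights.
counts = {"solr": 3, "score": 1}
weights = {"solr": 3.0, "score": 4.0}

print(doc_len_norm(counts))            # length-based normalization
print(euclidean_norm_factor(weights))  # cosine-style normalization
print(coord(["solr", "boost"], counts))
```

For the same document the two normalization factors generally differ, and coord further scales the score down when only some of the query terms match.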
Yes, Solr (which runs on top of Lucene) does use Cosine similarity. From the Lucene documentation:
VSM score of document d for query q is the Cosine Similarity of the
weighted query vectors V(q) and V(d)
cosine-similarity(q,d) = V(q) · V(d) / |V(q)| |V(d)|
https://lucene.apache.org/core/4_0_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html
If you're looking for actual vector similarity in Solr, there are two approaches:
1) use delimited payloads. There are a few plugins that implement this already, like https://github.com/moshebla/solr-vector-scoring and https://github.com/saaay71/solr-vector-scoring
2) use streaming expressions, which come out of the box: https://lucene.apache.org/solr/guide/8_5/vector-math.html
The latter is slower, but more flexible.

Solr: character proximity ranks misspellings higher because of inverse document frequency

I'm using character proximity to allow for some misspellings, for example:
text:manager~1
This allows both 'manager' and 'managre' to be matched. The problem is, the misspellings are always ranked higher than the proper spelling because there are fewer of those in the index. For example, let's say I have 3 documents as follows:
1) text:manager
2) text:manager
3) text:managre
Then the character proximity query above will give an inverse document frequency (idf) of 1.7 to 'managre' and 1.2 to 'manager', effectively ranking the misspelled 'managre' higher. From a technical perspective, this makes sense (there are fewer occurrences of 'managre' than 'manager'), but in reality, this doesn't make sense. Is there a way to get Solr to set the idf of misspelled words to match that of the correct spelling?
The short answer is no. The long answer is that you have good options here; you just need to solve this in a different way.
To begin with, take advantage of query-time boosting. You can query something like:
text:manager^1.2 OR text:manager~1^0.8
Here you are saying: my user is smart, so I will give a higher boost to the query as typed, but just in case, I will give its variants a slightly lower boost. You need to combine an exact-match query, with the higher boost, and a fuzzy query through a Boolean OR so that exact matches rank higher. Do not worry about extra work for Solr; it is built for very complex Lucene query trees. Using a combination of queries to get the expected relevancy ranking is common practice.
TF, IDF, and Solr's built-in relevancy ranking are somewhat arbitrary; framing the query with boosts, boolean queries, and context-based filters is where the power and flexibility of Solr lie.
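If the query is assembled in application code, the exact-plus-fuzzy pattern can be wrapped in a small helper. This is a minimal sketch: the function name and default boosts are assumptions taken from the example above, and the term is assumed to contain no characters that need escaping in the query syntax.

```python
def exact_or_fuzzy(field, term, max_edits=1, exact_boost=1.2, fuzzy_boost=0.8):
    """Build 'exact match OR fuzzy match' with a higher boost on the exact clause.

    Assumes `term` is a single plain word that needs no query-syntax escaping.
    """
    exact = f"{field}:{term}^{exact_boost}"
    fuzzy = f"{field}:{term}~{max_edits}^{fuzzy_boost}"
    return f"{exact} OR {fuzzy}"


print(exact_or_fuzzy("text", "manager"))
```

A document containing the correct spelling matches both clauses, while a misspelled one matches only the lower-boosted fuzzy clause, which is what pushes exact matches up.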

Boosting relevance, based on absolute numerical closeness

In my Solr schema, I have a numeric field that stores a color value (out of, say, 65535). How can I make the search relevance get boosted when I search for a particular color, depending on how close (in absolute value) the stored value is to the one asked for?
You can use function queries to calculate the closeness and boost the value.
e.g. div(x,65535), which generates a value of 1 when x is 65535 and smaller values the further x falls below it.
You can check the other function queries as well to factor the boost accordingly.
Then boost the results: q={!boost b=div(x,65535)}text:supervillians
Together with the function queries, you can use the recip function to calculate the boost factor from the color distance: http://wiki.apache.org/solr/FunctionQuery#recip
Example:
recip(div(x,65535),1,10000,10000)
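To see how recip turns a distance into a boost, here is a small sketch. recip(x,m,a,b) evaluates to a/(m*x+b). The example above passes div(x,65535) as the input; the sketch instead assumes the distance is taken as abs(sub(field,target)), so that an exact color match gets the largest boost. The field name color, the target value, and the constants are hypothetical.

```python
def recip(x, m, a, b):
    """Mirror of the recip(x, m, a, b) function query: a / (m*x + b)."""
    return a / (m * x + b)


def closeness_boost(value, target, a=1000.0, b=1000.0):
    """Boost of 1.0 on an exact match, decaying as |value - target| grows."""
    return recip(abs(value - target), 1.0, a, b)


def closeness_function_query(field, target, a=1000, b=1000):
    """Function-query string computing the same boost inside Solr.

    Assumes `field` is a single-valued numeric field (hypothetical: color).
    """
    return f"recip(abs(sub({field},{target})),1,{a},{b})"


print(closeness_boost(1234, 1234))
print(closeness_boost(2234, 1234))
print("{!boost b=" + closeness_function_query("color", 1234) + "}text:supervillians")
```

The constants a and b control how quickly the boost falls off with distance, so they would need tuning against the actual range of color values.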
