Solr Multi Value Field - Boost values nearer to start - solr

As I understand it, Solr scores multi-valued fields based on a few factors.
In particular, it scores shorter fields higher than longer ones, even if the search term appears nearer the beginning of the longer field.
The scoring factors I found in the above link:
termFreq: how often a term appears in the document
idf: how often the term appears across the index
fieldNorm: importance of the term, depending on index-time boosting and field length
However, I would like to boost documents where the matching value is nearer the start of the list in a multi-valued field. For example:
When searching for herceptin, PRODUCT 1 should rank higher than PRODUCT 2, but PRODUCT 2 scores higher because of its shorter field length.
PRODUCT 1
"herceptin",
"succinimidyl",
"radiolabeling",
"labeling",
"stability",
"discovery",
"potent",
"cb2 agonists",
"agonists",
"linkers",
"yield",
"esters",
"agent",
"syntheses",
"elimination",
"ligands",
"analogue",
"chemistry",
"functionality",
"formation",
"proteins",
"product",
"oxidizing",
"agonist",
"conjugated",
"receptor",
"activity",
"model".
PRODUCT 2
"trastuzumab",
"breast",
"cancer",
"patients",
"breast cancer",
"treatment",
"growth",
"antibody",
"receptor",
"human",
"clinical",
"chemotherapy",
"herceptin",
"combination",
"results".
Any ideas on how I could achieve this?
Thanks

Related

Document size adjusting Search.Score - virtually reducing Scoring profile score

We are using a scoring profile to drive relevance and adjust scores, e.g. boosting relevance by 50 when the attribute isActive is 1 via a function in the scoring profile, while searching specific fields on the index by passing &searchFields=******
However, Search.Score seems to be heavily affected by the size of the document: the smaller the document, the higher the score, probably due to TF-IDF.
This defeats the purpose of using a scoring profile. In our case we don't want the score to be affected by document size, since we are passing searchFields.
In cases where searchFields is not passed (i.e. free-form search across all searchable fields), we do want scores to be adjusted by size.
Example search query:
agency temps&$count=true&$top=30&$skip=0&searchMode=All&$filter=(CompanyCode eq '13453' and VNumber eq '00023232312016') &scoringProfile=BusinessProfile1&searchFields=VCategory
I wonder if the new featuresMode preview capability would be helpful for you? Using this, you can get a lot more information back from the search query such as uniqueTokenMatches and termFrequency on a field by field basis. Using this, you could adjust the ordering as needed on the client side.
Also, you are correct that the default is TF-IDF-like scoring. You might also be interested in trying BM25 which, although it does not solve exactly what you are asking for, could be more effective at producing the scores you are looking for.
For now I adopted the approach of adjusting the BM25 parameters as advised by Liam, setting b to 0.0 in the index creation JSON so that document size is not used when calculating the score for a document:
"similarity": {
  "@odata.type": "#Microsoft.Azure.Search.BM25Similarity",
  "b" : 0.0,
  "k1" : 1.3
}
At the same time, I identified another field on the index whose value correlates with the size of the record (the larger the record, the higher the value), and I use that field in a scoring profile for the case where document size should be considered in scoring.
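To see why setting b to 0.0 removes the document-length effect, here is a minimal Python sketch of the term-frequency part of the textbook BM25 formula. The function name and the sample numbers are illustrative assumptions, not Azure Search internals, and the service's exact implementation may differ.

```python
def bm25_tf_component(tf, doc_len, avg_doc_len, k1=1.3, b=0.75):
    """Term-frequency part of the textbook BM25 formula (idf omitted)."""
    # b controls how strongly length relative to the average damps the score.
    length_factor = 1.0 - b + b * (doc_len / avg_doc_len)
    return tf * (k1 + 1.0) / (tf + k1 * length_factor)


# Same term frequency, very different document lengths (made-up numbers).
short_doc = dict(tf=2, doc_len=50, avg_doc_len=200)
long_doc = dict(tf=2, doc_len=800, avg_doc_len=200)

for b in (0.75, 0.0):
    short_score = bm25_tf_component(b=b, **short_doc)
    long_score = bm25_tf_component(b=b, **long_doc)
    print(f"b={b}: short={short_score:.3f} long={long_score:.3f}")
```

With b at 0.0 the length factor is always 1.0, so both documents get the same value; with a non-zero b the shorter document scores higher.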

Autocomplete in solr with popularity

I am now using Solr's autocomplete and search functions, and I want to use the popularity of searched terms to rank the autocomplete suggestions.
For example, if 'usb' was searched 10 times last week and 'user' was searched 100 times last week, then when typing 'us', user should be ranked higher than usb.
Is there any way to fulfill this requirement? Thanks
In short, you need to use index-time boosting to boost the values in a 'search queries index', and refresh it periodically.
Let's say there is an index of all searched queries. That index can be created with an index-time boost for each doc as a function of the number of times it was searched. The boost factor could simply be the number of times searched. https://wiki.apache.org/solr/SolrRelevancyFAQ#index-time_boosts
E.g. the search queries 'foo', 'foo', 'bar', 'bar', 'bar', 'abcd' will be added to a new index with 'foo' having a boost of 2, 'bar' a boost of 3, and 'abcd' a boost of 1.
You can do a dynamic search on this index (starting with *:*), adding the typed-ahead characters to the query. The document score reflects the index-time boost.
E.g. *:* will return docs with the highest score first. After the user types an 'f', the query 'f*' returns 'foo' above the others because of its higher index-time boost.
I don't know how the terms component behaves here. Its score is based on term frequency, not on the index-time boost.
As you accumulate more search requests, you have to re-index with updated boost values that reflect the newer search counts.
E.g. if there is a new search for 'bar', you re-index the 'bar' document with a boost of 4.
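The steps above can be sketched in plain Python, without Solr, to show the intended ranking behaviour. This is only a simulation under stated assumptions: `build_boosts` and `suggest` are hypothetical helper names, and an in-memory counter stands in for the separate 'search queries index'.

```python
from collections import Counter


def build_boosts(search_log):
    """Map each logged query to a boost equal to how often it was searched."""
    return Counter(search_log)


def suggest(boosts, prefix, limit=5):
    """Return queries starting with `prefix`, highest boost first."""
    matches = [(q, n) for q, n in boosts.items() if q.startswith(prefix)]
    # Sort by descending boost, then alphabetically for a stable order.
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


log = ["foo", "foo", "bar", "bar", "bar", "abcd"]
boosts = build_boosts(log)
print(suggest(boosts, ""))   # all queries, most searched first
print(suggest(boosts, "f"))  # only queries starting with 'f'

# A new search for 'bar' arrives: bump its count, which corresponds to
# re-indexing the 'bar' document with a boost of 4.
boosts["bar"] += 1
```

In a real deployment the counter would be rebuilt from the search logs on a schedule, and the counts written to the suggestion index as boosts.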

Solr TF vs All Terms match

I have observed that Solr/Lucene gives too much weight to matching all the query terms compared with the tf of a particular query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B gets a much higher score because it contains all three terms of the query, even though only once each, whereas Document A gets a very low score even though it contains one term a large number of times.
Can I create a query in such a manner that if Lucene finds a match for "red jacket" it does not also count it as a match for "red" and "jacket" individually?
I would recommend using a DisjunctionMaxQuery. In raw Lucene, this would look something like:
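// tie-breaker of 0: the score is purely the best-matching subquery's score
DisjunctionMaxQuery dismax = new DisjunctionMaxQuery(0.0f);
dismax.add(parser.parse("red"));
dismax.add(parser.parse("jacket"));
// escaped quotes so the parser builds a phrase query
dismax.add(parser.parse("\"red jacket\""));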
The dismax query will score using the maximum score among its subqueries, rather than the sum of the scores of its subqueries.
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
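To make the difference concrete, here is a small Python sketch comparing the two ways of combining clause scores: adding them up, as a plain boolean OR query roughly does, versus taking the maximum plus an optional tie-breaker share of the rest. The per-clause scores are made-up numbers chosen only for illustration, and the function names are hypothetical; real Lucene scores depend on the similarity in use.

```python
def boolean_sum_score(sub_scores):
    """Combine clause scores by adding them up."""
    return sum(sub_scores)


def dismax_score(sub_scores, tie_breaker=0.0):
    """Disjunction-max: best clause score plus tie_breaker times the rest."""
    best = max(sub_scores)
    return best + tie_breaker * (sum(sub_scores) - best)


# Hypothetical per-clause scores for: red, jacket, "red jacket"
doc_a = [0.0, 0.9, 0.0]  # "jacket" many times, nothing else
doc_b = [0.3, 0.3, 0.5]  # every clause matches once

for name, scores in (("A", doc_a), ("B", doc_b)):
    print(name, "sum:", boolean_sum_score(scores), "dismax:", dismax_score(scores))
```

With these numbers Document B wins when scores are summed, while Document A wins under dismax with a tie-breaker of 0, because B no longer gets credit three times for what is essentially one match.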
Tf-idf is what search engines normally do but not what you always want. It is not what you want if you want to ignore repeated key words.
Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequent a word is in a text. idf (inverse document frequency) measures how rare a word is across all the documents you have in a search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log10(10,000,000 / 1,000) = 4. Thus, the tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
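The arithmetic in this example can be checked with a few lines of Python. This is the textbook form of tf-idf with a base-10 logarithm, as used in the example; Lucene's own similarity classes use somewhat different formulas, so treat the helper names and the formula as illustrative rather than as Solr's implementation.

```python
import math


def tf(term_count, total_terms):
    """Fraction of the text's words that are the term."""
    return term_count / total_terms


def idf(total_docs, docs_with_term):
    """Base-10 log of how rare the term is across the collection."""
    return math.log10(total_docs / docs_with_term)


cat_tf = tf(3, 100)
cat_idf = idf(10_000_000, 1_000)
print(cat_tf, cat_idf, cat_tf * cat_idf)
```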
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.

Haystack/Solr boosting results if the query is found in a specific field

We're having issues with non relevant results being returned as the highest results in our search and we're trying to improve that behavior, but not really sure how.
We have SearchIndex with about a dozen fields. The document=True field is a template backed field that we have placed the majority of the content into. Some of the stuff found in there is much less relevant than other stuff, even if it's still useful.
To give a concrete example: if a user searches for "red rose", we want to return red roses as the top results...even better if lower results are just roses or just red, or even are described as being "rose red" in color.
The issue is that our document=True field has a ton of items that are described as being "rose red". Worse, the actual red roses don't have "red" and "rose" particularly close to each other, as those values come from disparate fields. As a result, the top few hundred results are completely irrelevant.
What we would like to do is either:
A. Search the primary document and then search each of our other fields and boost (but not hard filter) accordingly. If the term "rose" appears in one of the item's names and "red" appears as one of its attribute values, then that result should have a higher score. In theory this gives us the optimal results, sorted by relevancy.
B. Search all fields at once and boost if the value is in any of the "boosted" fields.
It seems like using field boost should be the answer, but we can't figure out how to express it since filtering based on a field is a harsh exclude and we want it to only impact the relevance scoring.
The result of both of these is effectively the same. We just can't figure out how to do either of them with Haystack or, if we have to fall back to raw queries, how to write a Solr query that accomplishes this.
I can give you some pointers, though I did not fully understand the exact use case.
You can look at the Solr edismax query parser, which lets you configure:
Fields you want to search on - Mainly to select the results
Variable boost on fields for relevancy - To determine the importance on fields
Variable boost for different words combination e.g. single words, phrase match, shingle match with slop to determine relevancy
Provide additional boost on other fields
This will help you select the results and order them according to the field and word-combination matches.
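As a rough illustration of the pointers above, the following Python sketch assembles a set of edismax request parameters and prints the resulting query string. `defType`, `q`, `qf`, `pf` and `ps` are standard edismax parameters; the field names (`name`, `attributes`, `text`), the boost values and the helper name are assumptions made up for this example and would need to match your own schema.

```python
from urllib.parse import urlencode


def build_edismax_params(user_query):
    """Assemble edismax parameters that boost, rather than filter, by field."""
    return {
        "defType": "edismax",
        "q": user_query,
        # qf: fields to search, with per-field boosts (hypothetical field names).
        "qf": "name^5 attributes^3 text",
        # pf: extra boost when the query terms occur as a phrase in these fields.
        "pf": "name^10 text^2",
        # ps: phrase slop, i.e. how far apart the phrase terms may be.
        "ps": "2",
    }


params = build_edismax_params("red rose")
print("/select?" + urlencode(params))
```

Because `qf` only weights fields instead of restricting them, a match in the catch-all `text` field still returns the document, just with a lower score than a match in `name`.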

Boost evenly across field of varying length

I've got a text field that can potentially have multiple values.
doc 1:
field a:"X Y"
doc 2:
field a:"X"
I want to be able to do :
a:X^5
And have both doc 1 and 2 get an identical score.
I've been messing around with all the field options, but I always end up with doc 2 getting double the score of doc 1.
I've tried setting multiValued="true", but get the same result.
Is there some way I can set up my search or the field definition so that it boosts based only on the existence of the search term and is not affected by the rest of the field's contents?
Disable norms by setting omitNorms=true in your schema and reindex - it should disable the length normalization for the field and give you the desired results.
For more details of what omitNorms does, see this.
The field a of doc 2 has only one term as compared to doc 1 which has two.
Solr's DefaultSimilarity implementation takes the length norm (based on the number of terms in the field) into account when calculating the score.
lengthNorm is 1.0 / Math.sqrt(numTerms)
lengthNorm makes shorter documents score higher.
You can provide your own implementation of the Similarity class which doesn't take lengthNorm into account.
Check the computeNorm method implementation.
Alternatively, you can turn off norms using omitNorms=true.
Norms allow for index-time boosts and field length normalization. This allows you to add boosts to fields at index time and makes shorter documents score higher.
So you would lose both of the above if you turn norms off.
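A small Python sketch of the formula quoted above shows why doc 2 outscores doc 1 and why disabling norms evens them out. It is a simplification under stated assumptions: it ignores index-time boosts and the lossy way Lucene stores norms, and the `omit_norms` flag here merely mimics the effect of the schema setting.

```python
import math


def length_norm(num_terms, omit_norms=False):
    """Classic length normalization: 1 / sqrt(number of terms in the field)."""
    if omit_norms:
        return 1.0  # norms disabled: field length no longer affects the score
    return 1.0 / math.sqrt(num_terms)


doc1_terms = ["X", "Y"]
doc2_terms = ["X"]

for omit in (False, True):
    n1 = length_norm(len(doc1_terms), omit_norms=omit)
    n2 = length_norm(len(doc2_terms), omit_norms=omit)
    print(f"omitNorms={omit}: doc1={n1:.3f} doc2={n2:.3f}")
```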
