Boosting search results for numbers in solr - solr

Suppose I have two documents with just one field as follows:
Document 1: foo bar 1
Document 2: foo baz 2
And a user searches for "foo baz 1"
Document 1 matches "foo" and "1", and Document 2 matches "baz" and "foo", so they would ordinarily be tied. Is there any way to weight a match on a number higher than a match on text, so that Document 1's match is preferred over Document 2's?
I don't want to boost by the number that matched, I want all numbers to be boosted by the same amount.

Your question is about boosting numbers in a query.
At query time you can boost a term, or you could use payloads at index time; see "Adding Boost to Score According to Payload of Multivalued Field at Solr".
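As a rough sketch of the query-time option, the user's input could be rewritten before it is sent to Solr so that every purely numeric token carries the same boost, using the standard term^boost query syntax. The helper name boost_numeric_terms and the boost value of 5 are assumptions made for illustration, not part of Solr:

```python
import re


def boost_numeric_terms(user_query: str, boost: float = 5.0) -> str:
    """Append ^boost to every purely numeric token of a whitespace-separated query.

    Hypothetical helper: all numbers get the same boost, regardless of their value.
    """
    rewritten = []
    for token in user_query.split():
        if re.fullmatch(r"\d+", token):
            # Same boost for every number, e.g. "1" becomes "1^5".
            rewritten.append(f"{token}^{boost:g}")
        else:
            rewritten.append(token)
    return " ".join(rewritten)


print(boost_numeric_terms("foo baz 1"))  # foo baz 1^5
```

The rewritten string would then be passed as the q parameter. This sketch does not escape query-syntax characters in the input, which a real implementation would need to handle.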

Related

Boost rare results in solr

In a collection there are several different categories of documents. I want the highest ranked search results to be the documents from categories where, for the specific query, there are fewest matching documents.
Concrete example
Let the categories be "foo", "bar", and "baz". If I were to search for "Fred", faceted by category, I would get back the following counts:
foo: 17
bar: 1
baz: 201312
I want to construct a search and/or configure the index such that the one match from the "bar" category would be top of the search results, the 17 "foo" matches would be next, and finally the "baz" matches.
One way I think I could do this would be first to do a faceted search to get the count of matching documents in each category, and then do a second search with boosts based on the category counts - something along the lines of bq=category:bar^10000&bq=category:foo^100; the boosts of 10000 and 100 would obviously be derived from the facet counts and inserted into the query.
I would like to know if something roughly equivalent to this could be achieved in a more efficient way using only a single query, i.e. avoiding the need for a pre-query to fetch the facet counts.
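For reference, the two-pass approach described above could be sketched as follows. This only illustrates the second step (turning facet counts into bq clauses) and does not answer the single-query question. The function name category_boosts and the choice of "largest count divided by this category's count" as the boost are assumptions for illustration:

```python
from typing import Dict, List


def category_boosts(facet_counts: Dict[str, int]) -> List[str]:
    """Build bq clauses so that categories with fewer matches get larger boosts.

    Hypothetical helper: boost = largest facet count / this category's count.
    The most common category (and any empty category) gets no clause at all.
    """
    largest = max(facet_counts.values())
    clauses = []
    # Rarest categories first, so the strongest boosts come first.
    for category, count in sorted(facet_counts.items(), key=lambda kv: kv[1]):
        if count == 0 or count == largest:
            continue
        clauses.append(f"category:{category}^{round(largest / count)}")
    return clauses


counts = {"foo": 17, "bar": 1, "baz": 201312}
for clause in category_boosts(counts):
    print("bq=" + clause)
```

Each clause would be sent as its own bq parameter in the second request.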

Solr Multi Value Field - Boost values nearer to start

As I understand it, for multi-value fields Solr computes scores based on a few things.
Specifically, it scores shorter fields higher than longer ones (even if the search string is nearer the beginning of the longer field).
The scoring factors I found in the above link:
termFreq: how often a term appears in the document
idf: how often the term appears across the index
fieldNorm: importance of the term, depending on index-time boosting and field length
However, I would like to boost values in a multi-value field where the value is nearer the start of the list. For example:
When searching for a document with herceptin, PRODUCT 1 should rank higher than PRODUCT 2, except that PRODUCT 2 scores higher due to its shorter field length.
PRODUCT 1
"herceptin",
"succinimidyl",
"radiolabeling",
"labeling",
"stability",
"discovery",
"potent",
"cb2 agonists",
"agonists",
"linkers",
"yield",
"esters",
"agent",
"syntheses",
"elimination",
"ligands",
"analogue",
"chemistry",
"functionality",
"formation",
"proteins",
"product",
"oxidizing",
"agonist",
"conjugated",
"receptor",
"activity",
"model".
PRODUCT 2
"trastuzumab",
"breast",
"cancer",
"patients",
"breast cancer",
"treatment",
"growth",
"antibody",
"receptor",
"human",
"clinical",
"chemotherapy",
"herceptin",
"combination",
"results".
Any ideas on how I could achieve this?
Thanks

Cloudant Lucene index with different relevance per field

How can I specify during the index creation that one field should receive more relevance than another field?
Example: I have documents with a title and a description field and want the content of the title field to be more important during query time.
doc1: title:"Hello, world", description:"Just a greeting"
doc2: title:"Greetings", description:"Hello, everybody. Hello, hello"
index("default", doc.title);
index("default", doc.description);
A search for the term "hello" should return doc1 with a higher relevance than doc2, because the word "hello" is present in the title field, even though doc2 contains the word 3 times.
How can this be accomplished?
You can specify a boost at query time, provided you index the fields separately:
index("title", doc.title);
index("description", doc.description);
Then at query time you can specify that the title field gets more weight than the description field:
q=(title:hello)^100 OR (description:hello)
where ^100 indicates that this term is boosted. See https://docs.cloudant.com/search.html#query-syntax

solr optional query fields exempt from mm criteria

I have 5 query fields in my search query and a fairly complicated mm parameter that starts with 3, meaning that a minimum of 3 matches is needed (when there are at least 3 search terms), or as many matches as there are search terms (when there are fewer than three). I want one particular query field, out of the 5 specified in the qf parameter, to be exempt from this matching criterion. In other words, I want it to be used not for determining which documents match but only for ranking the matched results. Is this possible?
If a field is not used for matching, it probably should not appear in that list. If you want to use it afterwards to change the ranking, you could experiment with boost queries or with Query Re-Ranking.
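A minimal sketch of the boost-query variant, assuming the edismax parser: the rank-only field is left out of qf, so it plays no part in mm, and is added through bq, which only influences the score. The field names, the helper name build_params and the placeholder mm value of "3" are all hypothetical:

```python
from urllib.parse import urlencode


def build_params(user_terms, match_fields, rank_only_field, mm="3"):
    """Build request parameters where one field only influences ranking.

    Hypothetical helper: rank_only_field is kept out of qf and used in bq only.
    Terms are not escaped here; real input would need query-syntax escaping.
    """
    terms = " ".join(user_terms)
    return {
        "defType": "edismax",
        "q": terms,
        "qf": " ".join(match_fields),        # fields that decide what matches
        "mm": mm,                            # placeholder for the real mm spec
        "bq": f"{rank_only_field}:({terms})",  # optional clause, scoring only
    }


params = build_params(["red", "jacket"], ["title", "body"], "tags")
print(urlencode(params))
```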

Solr TF vs All Terms match

I have observed that Solr/Lucene gives too much weight to matching all the query terms, compared with the tf of a particular query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B gets a much higher score because it contains all three terms of the query, but just once each, whereas Document A gets a very low score even though it contains one term a large number of times.
Can I create a query such that, if Lucene finds a match for "red jacket", it does not also count it as a match for "red" and "jacket" individually?
I would recommend using a DisjunctionMaxQuery. In raw Lucene, this would look something like:
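// parser is assumed to be an existing QueryParser for the "text" field
DisjunctionMaxQuery dismax = new DisjunctionMaxQuery(0.0f); // tie-breaker 0: score is the pure maximum
dismax.add(parser.parse("red"));
dismax.add(parser.parse("jacket"));
dismax.add(parser.parse("\"red jacket\"")); // quoted so it is parsed as a phrase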
The dismax query will score using the maximum score among its subqueries, rather than the sum of the scores of its subqueries.
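To make the difference concrete, here is a small sketch in plain Python (not Lucene code) that compares adding the subquery scores together with taking their maximum. The per-subquery scores are invented for illustration and the function names are hypothetical; the tie_breaker argument mirrors the tie-breaker multiplier passed to the DisjunctionMaxQuery constructor:

```python
def sum_score(sub_scores):
    """Score as a plain OR of clauses would: add up every subquery score."""
    return sum(sub_scores)


def dismax_score(sub_scores, tie_breaker=0.0):
    """Score as a disjunction-max would: best subquery plus a fraction of the rest."""
    best = max(sub_scores)
    return best + tie_breaker * (sum(sub_scores) - best)


# Invented scores for the subqueries "red", "jacket", "red jacket".
doc_a = [0.0, 2.0, 0.0]  # many occurrences of "jacket" only
doc_b = [0.5, 0.5, 1.2]  # a single occurrence of the phrase

print("sum:   ", sum_score(doc_a), sum_score(doc_b))
print("dismax:", dismax_score(doc_a), dismax_score(doc_b))
```

With these made-up numbers, summing ranks Document B above Document A, while taking the maximum ranks Document A above Document B.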
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
Tf-idf is what search engines normally do but not what you always want. It is not what you want if you want to ignore repeated key words.
Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequently a word occurs in a text. idf (inverse document frequency) measures how rare a word is across all the documents in the search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
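The arithmetic above can be checked with a few lines of Python. This follows the textbook formula from the example (base-10 logarithm, term count divided by document length), which is not necessarily the exact formula Lucene uses; the function names are chosen here for illustration:

```python
import math


def tf(term_count: int, total_terms: int) -> float:
    """Term frequency: share of the document's words that are the term."""
    return term_count / total_terms


def idf(num_docs: int, docs_with_term: int) -> float:
    """Inverse document frequency with a base-10 logarithm, as in the example."""
    return math.log10(num_docs / docs_with_term)


def tf_idf(term_count, total_terms, num_docs, docs_with_term) -> float:
    """Product of the two factors."""
    return tf(term_count, total_terms) * idf(num_docs, docs_with_term)


# "cat" appears 3 times in 100 words, and in 1,000 of 10,000,000 documents.
print(round(tf_idf(3, 100, 10_000_000, 1_000), 2))  # 0.12
```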
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.
