As a simplified example, I have two fields: title and keywords.
I am using edismax with the following parameter:
qf: title + keywords^2
Previously, it was working fine. I have about 15M records indexed in Solr. All records have a non-empty title. Most records HAD non-empty keywords.
But recently, we decided to remove keywords for most records. As a result, we currently have only 1 record (out of 15M) with non-empty keywords.
Unfortunately, since that change, the keywords^2 boost specified in qf no longer seems to work.
For that record, we have title, say, "good store", and keywords, say, "pants clothing". Now, if I search for 'good store pants', the Solr matching score is exactly the same regardless of whether I use qf: title or qf: title keywords^2.5. (Again, I think it worked before, when most records had non-empty keywords, since the Solr matching scores were different for the above comparison.)
Answering my own question.
Since only one record has non-empty keywords, the IDF that Solr computes for a term in that field is smaller than 1. Therefore, boosting it by ^2 does not help much.
So, I think the "solution" is to add more records with non-empty keywords. Of course, this is not a real solution.
See the following output from debugQuery:
0.84748024 = weight(keywords:good in 4161) [], result of:
  0.84748024 = score(doc=4161,freq=1.0 = termFreq=1.0
), product of:
    3.0 = boost
    0.2876821 = idf(docFreq=1, docCount=1)
    0.9819638 = tfNorm, computed from:
      1.0 = termFreq=1.0
      1.2 = parameter k1
      0.75 = parameter b
      5.0 = avgFieldLength
      5.2244897 = fieldLength
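The numbers in this explain output can be reproduced by hand. The following is a minimal Python sketch, assuming the BM25 formulas used by Lucene's BM25Similarity in the 6.x line; the function names are my own and not part of Solr. It also evaluates the idf for a hypothetical index in which all 15M records have the field, for comparison.

```python
import math


def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


def bm25_tf_norm(freq: float, k1: float, b: float,
                 field_length: float, avg_field_length: float) -> float:
    """Term-frequency part of BM25 with document length normalization."""
    length_ratio = field_length / avg_field_length
    return (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * length_ratio))


# Values copied from the debugQuery output above.
idf = bm25_idf(doc_freq=1, doc_count=1)
tf_norm = bm25_tf_norm(freq=1.0, k1=1.2, b=0.75,
                       field_length=5.2244897, avg_field_length=5.0)
score = 3.0 * idf * tf_norm  # boost * idf * tfNorm

# Hypothetical comparison: the same term if 15M records had the field.
idf_full_field = bm25_idf(doc_freq=1, doc_count=15_000_000)

print(idf, tf_norm, score, idf_full_field)
```

With docCount=1 the idf is about 0.29, whereas the hypothetical 15M-document case gives roughly 16, so the boost multiplies a much smaller base than it did before the keywords were removed.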
I have a Solr (6.2) DisMax Select Query which uses pf (phrase fields) and ps (phrase slop).
pf = text^2.2 title^2.2, ps = 2;
I want my query to return results following this algorithm:
If there are exact matches for the queried phrase, return them first, sort by date
If there are documents that have at least one of the words of the queried phrase, return them second, sort by date
Example Data: text (last_modified timestamp in parenthesis)
stuff about important people (2018)
important people: the article (2019)
some people find that important (2020)
important news (2015)
people of the decade (2020)
The desired result:
phrases with acceptable slop first
some people find that important (2020)
important people: the article (2019)
stuff about important people (2018)
then at least one of the words
people of the decade (2020)
important news (2015)
What I've tried:
wrapping the query in double quotes and using qs (query phrase slop); this works as desired, but ignores the "at least one of the words" part;
using a bq (boost query) like last_modified:[NOW/DAY-3MONTH TO NOW/DAY]^20.0;
using a bf (boost function) like recip(ms(NOW,last_modified),3.16e-11,1,1);
explicit last_modified desc sort - but it ignores the score completely
using multiple sort score desc, last_modified desc - but the second sort will work only if there is a tie for the first one (and there is almost never a tie)
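For context on the bf attempt, Solr documents recip(x,m,a,b) as a/(m*x+b). The sketch below evaluates that formula in plain Python with the constants from the post; the helper names and the 365.25-day year are my own assumptions.

```python
MS_PER_YEAR = 365.25 * 24 * 60 * 60 * 1000  # about 3.16e10 milliseconds


def recip(x: float, m: float, a: float, b: float) -> float:
    """Mirror of Solr's recip(x, m, a, b) = a / (m * x + b)."""
    return a / (m * x + b)


def date_boost(age_years: float) -> float:
    """recip(ms(NOW,last_modified),3.16e-11,1,1) for a document of the given age."""
    age_ms = age_years * MS_PER_YEAR
    return recip(age_ms, 3.16e-11, 1, 1)


for years in (0, 1, 2, 5):
    print(years, round(date_boost(years), 3))
```

Because m is roughly one over the number of milliseconds in a year, the value only falls from 1 to about 0.5 over a whole year. An additive boost in that range is small next to typical text scores, which is consistent with needing a large multiplier to make recency noticeable.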
I've managed to get the (almost) desired result by using:
Boost Functions (bf) = recip(ms(NOW,last_modified),3.16e-11,1,1)^1500 (had to use a huge boost number to bubble up the most recent results);
Query Fields qf = 'text^4 title^2';
Phrase Fields pf = 'text^5 title^2';
Phrase Slop ps = 4;
Query Phrase Slop qs = 2;
Minimum Should Match mm = len(split('\s', query)) + 1 (pseudocode)
Split the query by whitespace, join the exact phrase and each separate word with OR, and set the Minimum Should Match parameter (mm) to len(split)+1. For example, the query "apple dog" is transformed into "apple dog" or apple or dog. The double quotes are necessary for the qs parameter to work and to force results with the exact phrase to bubble up.
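The query rewriting described above could be sketched as follows. This is an illustration only: the function name is hypothetical, the operator is written as uppercase OR on the assumption that the parser expects it that way, and special characters in the user input are not escaped.

```python
def build_phrase_or_terms_query(user_query: str) -> dict:
    """Build request parameters: the exact phrase OR each individual word."""
    words = user_query.split()  # split on whitespace
    phrase = '"' + " ".join(words) + '"'
    clauses = [phrase] + words
    return {
        "q": " OR ".join(clauses),
        # mm follows the pseudocode above: number of words + 1
        "mm": len(words) + 1,
        # Remaining values are the ones listed in the post.
        "qf": "text^4 title^2",
        "pf": "text^5 title^2",
        "ps": 4,
        "qs": 2,
        "bf": "recip(ms(NOW,last_modified),3.16e-11,1,1)^1500",
    }


params = build_phrase_or_terms_query("apple dog")
print(params["q"], params["mm"])
```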
Maybe there are some tweaks to the method I'm using; any comments are appreciated.
I am reviewing the similarity calculations performed by the DefaultSimilarity class in Lucene, as invoked by Solr. Specifically, I am not clear on how field normalization is calculated when the Solr query doesn't reference a specific field.
norm(t,d) = doc.getBoost() · lengthNorm · ∏ f.getBoost() .... field f in d named as t
where
doc.getBoost() = document's boost specified at index time
f.getBoost() = field's boost specified at index time
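lengthNorm = computed from the number of terms/tokens in the field (in DefaultSimilarity, 1/sqrt(number of terms), so shorter fields get a larger norm)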
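As a rough illustration of the formula above, here is a Python sketch that assumes DefaultSimilarity's length norm of 1/sqrt(numTerms). The function names are mine, and the sketch ignores that Lucene stores the norm in a lossy compact encoding, so real values are less precise.

```python
import math
from typing import Sequence


def length_norm(num_terms: int) -> float:
    """Length normalization assumed here: 1 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)


def norm(doc_boost: float, field_boosts: Sequence[float], num_terms: int) -> float:
    """norm(t,d) = doc boost * lengthNorm * product of the boosts of fields named t."""
    return doc_boost * length_norm(num_terms) * math.prod(field_boosts)


# A 4-term field with no index-time boosts.
print(norm(1.0, [1.0], 4))
# The same field with a document boost of 2 and two field instances boosted 1.5 and 2.
print(norm(2.0, [1.5, 2.0], 4))
```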
My question is, if a Solr query is specified as
/select?q=indian cricket&rows=5&wt=json
without reference to a specific field in schema.xml, how is norm(t,d) calculated? Is it calculated for every field the term is found in? If so, how are these combined?
Thanks in advance for your insights!
Terms without a field name will use the defaultSearchField setting from the schema, the df (default field) query parameter, or the qf (query fields) parameter if using (e)dismax, and the terms will be prefixed with the field name. Each (field, term) combination for each queried field will then be used to evaluate the norm.
Use the debugQuery feature of Solr to see each scored part and how it affects the score.
I have observed that Solr/Lucene gives too much weight to matching all the query terms compared with the tf of a particular query term.
e.g.
Say our query is : text: ("red" "jacket" "red jacket")
Document A -> contains "jacket" 40 times
Document B -> contains "red jacket" 1 time (and because of this "red" 1 time and "jacket" 1 time as well)
Document B gets a much higher score because it contains all three terms of the query, even though only once each, whereas Document A gets a very low score even though it contains one term a large number of times.
Can I create a query in such a manner that if Lucene finds a match for "red jacket" it does not consider it as match for "red" and "jacket" individually ?
I would recommend using a DisjunctionMaxQuery. In raw Lucene, this would look something like:
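// Tie-breaker 0: only the best-scoring subquery contributes to the score.
DisjunctionMaxQuery dismax = new DisjunctionMaxQuery(0.0f);
dismax.add(parser.parse("red"));
dismax.add(parser.parse("jacket"));
// Escaped quotes so the parser builds a phrase query, not two separate terms.
dismax.add(parser.parse("\"red jacket\""));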
The dismax query will score using the maximum score among its subqueries, rather than the sum of the scores of its subqueries.
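To make the difference concrete, here is a small Python sketch comparing the two ways of combining subquery scores. The per-clause scores are made-up numbers for illustration, and the sum variant leaves out any additional factors a boolean query may apply.

```python
from typing import Sequence


def boolean_or_score(scores: Sequence[float]) -> float:
    """Sum of the scores of all matching subqueries."""
    return sum(scores)


def dismax_score(scores: Sequence[float], tie_breaker: float = 0.0) -> float:
    """Best subquery score plus tie_breaker times the sum of the others."""
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie_breaker * (sum(scores) - best)


# Hypothetical clause scores for the two documents in the question.
doc_a = [2.0]            # matches only "jacket", but many times
doc_b = [0.8, 0.5, 1.2]  # matches "red", "jacket" and the phrase once each

print(boolean_or_score(doc_a), boolean_or_score(doc_b))
print(dismax_score(doc_a), dismax_score(doc_b))
```

With these numbers, Document B wins under summation, while Document A wins under dismax with a tie-breaker of 0, because B is no longer rewarded three times for one occurrence of the phrase.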
Using Solr, the dismax and edismax query parsers are the way to go for this, as well as many other handy features. Something like:
select/?q=red+jacket+"red jacket"&defType=dismax
Tf-idf is what search engines normally use, but it is not always what you want. In particular, it is not what you want if you need to ignore repeated keywords.
Tf-idf is calculated as the product of two factors: tf x idf. tf (term frequency) is how frequent a word is in a text. idf (inverse document frequency) measures how unique a word is among all the documents in the search engine.
Consider a text containing 100 words wherein the word cat appears 3 times. The term frequency (i.e., tf) for cat is then (3 / 100) = 0.03. Now, assume we have 10 million documents and the word cat appears in one thousand of these. Then, the inverse document frequency (i.e., idf) is calculated as log(10,000,000 / 1,000) = 4. Thus, the Tf-idf weight is the product of these quantities: 0.03 * 4 = 0.12. See original source of example.
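The arithmetic of this example can be checked with a few lines of Python. This is a sketch of the textbook formula with a base-10 logarithm, as in the example; it is not the exact formula Lucene uses.

```python
import math


def term_frequency(term_count: int, total_terms: int) -> float:
    """Share of the text's words that are the given term."""
    return term_count / total_terms


def inverse_document_frequency(total_docs: int, docs_with_term: int) -> float:
    """Base-10 log of (all documents / documents containing the term)."""
    return math.log10(total_docs / docs_with_term)


tf = term_frequency(3, 100)
idf = inverse_document_frequency(10_000_000, 1_000)
weight = tf * idf
print(tf, idf, weight)
```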
The best way to ignore tf-idf is probably the Solr exists function, which is accessible through the bf relevance boost parameter. For example:
bf=if(exists(query(location:A)),5,if(exists(query(location:B)),3,0))
See original source and context of second example.
I read in the CF10 docs that the attribute 'FieldBoost' has been added to CFIndex in order to specify which fields should have more importance in Solr's scoring.
However, it seems that not only does it not work as intended, it in fact causes the whole indexing operation to fail completely!
I've seen other posts on the Adobe forums mentioning exactly the same issue, but no replies or resolution available.
I'm running CF10 Update 11.
The following code works and indexes 14,000 records:
<cfindex collection = "MyCollection"
         action = "refresh"
         type = "custom"
         query = "Local.MyContent"
         key = "ID"
         title = "Name"
         body = "Name,Description"
>
However, if I add the FieldBoost value, there are no errors and the index operation appears to run correctly, but the collection now contains zero records:
<cfindex collection = "MyCollection"
         action = "refresh"
         type = "custom"
         query = "Local.MyContent"
         key = "itemID"
         title = "Name"
         body = "Name,Description"
         fieldBoost = "title"
>
Has anyone had this working?
From the comments...
I found this bug which I believe is similar to your situation (although it was reported on a Mac platform).
Although it is not documented very well, you need to include the weight with the fieldboost attribute. In ColdFusion's implementation, you specify the weight by appending it to the name of the field you want boosted, delimited by a : (colon). The attribute should look something like this:
fieldboost="title:6"
I was able to find a little bit of documentation on this attribute in the Adobe ColdFusion 10 Beta documentation (on page 106 of that document specifically). Here is an excerpt from that document:
Improving search result rankings
The following attributes in cfindex help you improve the search result rankings:
fieldBoost: Boost specific fields while indexing.
fieldBoost enhances the score of the fields and thereby the ranking in the search results. Multiple fields can be boosted by specifying the values as a comma-separated list.
docBoost: Boost entire document while indexing.
docBoost enhances the score of the documents and thereby the ranking in the search results
And the following code is the example they used to show the fieldboost attribute (notice that they are boosting two fields, separated by a comma):
<cfindex collection="autocommit_check" action="update" type="file"
key="#Expandpath(".")#/_boost1.txt" first_t="fieldboost" second_t="secondfield"
fieldboost="first_t:1,second_t:2" docboost="6" autocommit="true">
Also check this related question for a way to boost fields during the search - CF10 Fieldboost on cfindex has no effect
I am implementing a Solr dismax search and also using the function recip(ms(NOW,PubDate),3.16e-11,1000,1000) for a date boost. Everything is working fine, except for one problem.
If the search keywords are repeated in the Title, those records get a higher score than more recent results.
e.g.
1) Title = solr lucene
Date = 1 day old
2) Title = solr lucene is best, love solr lucene
Date = 15 days old
If a user searches for 'solr lucene', then #2 comes in first position, only because the keywords are repeated in the Title.
I have many records which are 1, 2 or 3 days old and have exactly the title "SOLR LUCENE", but those records don't appear on the first page, only because older records have the keywords repeated in the Title.
I don't want to sort the results entirely by date. Currently I am sorting like this: sort= score desc, date asc
You shouldn't use a sort clause if you are using a boost.
If you want the date to carry more weight, tune your boost function. How much influence the date has on the ordering of the search results is up to you.
It also depends on the dismax-handler you are using:
{!edismax boost=recip(pow(ms(NOW,PubDate),<val>),3.16e-11,1,1)}
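Put a value between 0 and 2 in place of the <val> placeholder. With 0, pow(...) is always 1, so the boost is the same for every document and the results are ordered by relevance alone; the larger the value, the more the date dominates, and at 2 the result is close to "order by date".
I am not sure whether this works for dismax, but it works for the standard Solr search handler (with a different syntax than the example above) and for edismax.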
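To see how the exponent changes the balance, the boost formula can be evaluated outside Solr. The following Python sketch assumes recip(x,m,a,b) = a/(m*x+b), as documented for Solr function queries, and compares the 1-day-old and 15-day-old documents from the question; the helper names are my own.

```python
MS_PER_DAY = 24 * 60 * 60 * 1000


def recip(x: float, m: float, a: float, b: float) -> float:
    """Mirror of Solr's recip(x, m, a, b) = a / (m * x + b)."""
    return a / (m * x + b)


def boost(age_days: float, exponent: float) -> float:
    """recip(pow(ms(NOW,PubDate),<val>),3.16e-11,1,1) for a given age and <val>."""
    age_ms = age_days * MS_PER_DAY
    return recip(age_ms ** exponent, 3.16e-11, 1, 1)


def recency_advantage(exponent: float) -> float:
    """How many times larger the boost of a 1-day-old document is than a 15-day-old one."""
    return boost(1, exponent) / boost(15, exponent)


for exponent in (0, 0.5, 1, 2):
    print(exponent, recency_advantage(exponent))
```

An advantage of exactly 1 means the date has no effect on the ranking; the advantage grows as the exponent increases, so the newer document needs progressively less text relevance to come first.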