I have a simple setup with IndexSearcher, QueryParser and SimpleAnalyzer.
While running some queries I noticed that a query with more than one term returns a different ScoreDoc[i].score than the one shown in the explain output. Apparently it is the score shown in explain divided by the number of search terms. Is there any explanation for this behaviour?
Running search(TERM1 TERM2 TERM3)
line:term1 line:term2 line:term3
2.167882 = sum of:
0.6812867 = weight(line:term1 in 6594) [DefaultSimilarity], result of:
0.6812867 = score(doc=6594,freq=2.0), product of:
0.5389907 = queryWeigh
totalHits 1
1678413725, TERM1 TERM2 TERM3, score: 0.72262734
I understand the coord() statement would be used to penalise documents which include only a subset of the search terms provided. However this document includes all terms. Any suggestions?
EDIT: It seems the division only occurs if the query is configured to use OR instead of AND. So an OR query that matches all terms is still divided by the number of terms in the search query. I couldn't find this in the documentation, but at least it explains the difference.
However, applying a QueryWrapperFilter seems to change the scoring again, although according to the documentation it should only filter the results without any impact on scoring.
More details
These two scores are the result of the same query. Only the second one gets divided:
0.114700586 = product of:
0.34410176 = sum of:
0.34410176 = weight(line:term1 in 24) [DefaultSimilarity], result of:
0.34410176 = score(doc=24,freq=1.0), product of:
0.5389907 = queryWeight, product of:
8.17176 = idf(docFreq=14, maxDocs=19532)
0.065957725 = queryNorm
0.63841873 = fieldWeight in 24, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
8.17176 = idf(docFreq=14, maxDocs=19532)
0.078125 = fieldNorm(doc=24)
0.33333334 = coord(1/3)
item_id: 1495958818, item_name: term 1 dolor sit met, score: 0.114700586
0.18352094 = product of:
0.5505628 = sum of:
0.5505628 = weight(line:term 1 in 6112) [DefaultSimilarity], result of:
0.5505628 = score(doc=6112,freq=1.0), product of:
0.5389907 = queryWeight, product of:
8.17176 = idf(docFreq=14, maxDocs=19532)
0.065957725 = queryNorm
1.02147 = fieldWeight in 6112, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
8.17176 = idf(docFreq=14, maxDocs=19532)
0.125 = fieldNorm(doc=6112)
0.33333334 = coord(1/3)
item_id: 1677761523, item_name: some text term 1, score: 0.061173648
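The arithmetic in the explain output above can be checked by hand. The following is a minimal sketch, not Lucene code: the class name CoordCheck is made up for illustration, and coord() is written out under the assumption that it is simply the number of matching optional clauses divided by the total number of optional clauses, which is what the coord(1/3) lines suggest. It only reproduces the numbers; it does not explain where the extra division comes from.

```java
// Sanity check of the numbers quoted in the question; not Lucene code.
final class CoordCheck {

    // Assumed coord formula: matched optional clauses / total optional clauses.
    static float coord(int overlap, int maxOverlap) {
        return overlap / (float) maxOverlap;
    }

    public static void main(String[] args) {
        // doc 24: explain total = sum of matching clauses * coord(1/3)
        float explainDoc24 = 0.34410176f * coord(1, 3);
        // doc 6112: same structure
        float explainDoc6112 = 0.5505628f * coord(1, 3);

        System.out.printf("doc 24   explain total: %.7f%n", explainDoc24);
        System.out.printf("doc 6112 explain total: %.7f%n", explainDoc6112);
        // The score reported for doc 6112 is the explain total divided by 3 once more.
        System.out.printf("doc 6112 total / 3:     %.7f%n", explainDoc6112 / 3);
        // The three-term match: explain total 2.167882, reported score 0.72262734.
        System.out.printf("2.167882 / 3:           %.7f%n", 2.167882f / 3);
    }
}
```

For doc 24 the explain total and the reported score agree, while for doc 6112 the reported score corresponds to the explain total divided by 3 a second time.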
Hello, when I execute the following query
fl=id model label timestamp score uuid&echoParams=all&qf=label^6 content_level_high^5 content_level_middle^2 content_level_less^1&hl.fl=teaser&wt=xml&rows=9&defType=edismax&facet=true&bq=model:"Component"^10 model:"Object"^90 model:"Address"^1 model:"eav_table_54f5d74b4efef9.49994240"^14&debugQuery=on&start=0&q=Fraumünster
The same query in a more readable form:
defType=edismax
fl=id model label timestamp score uuid
qf=label^6 content_level_high^5 content_level_middle^2 content_level_less^1
bq=model:"Component"^10 model:"Object"^90 model:"Address"^1 model:"eav_table_54f5d74b4efef9.49994240"^14
q=Fraumünster
start=0
rows=9
wt=xml
facet=true
echoParams=all
debugQuery=on
hl.fl=teaser
against a Solr 3.6.2 server, it seems that the boost on the "model" field is totally ignored,
because all entries get the same score from having a single hit in "label".
So, in my opinion, the order should be determined by the boost queries.
Here is the full explain:
http://explain.solr.pl/explains/ipu6qacf
The raw query result:
http://pastebin.com/3uFdd8uw
Solr schema (for solr 5.x):
http://pastebin.com/0pZB5gDt
Solr config:
http://pastebin.com/Kd6W2nYD
The documents in Solr add syntax:
http://pastebin.com/HMBrwAWV
Does anyone have an idea what is wrong with the boost query?
Please specify each boost query in its own bq parameter:
bq=model:"Component"^10&bq=model:"Object"^90&bq=model:"Address"^1&bq=model:"eav_table_54f5d74b4efef9.49994240"^14
Then the query is parsed correctly and the boosts show up in the relevance score:
+(content_level_less:chang | label:chang^6.0 | content_level_high:chang^5.0 | content_level_middle:chang^2.0) model:Component^10.0 model:Object^90.0 model:Address model:eav_table_54f5d74b4efef9.49994240^14.0
0.1813628 = (MATCH) sum of:
0.13184154 = (MATCH) max of:
0.13184154 = (MATCH) weight(label:chang^6.0 in 4) [DefaultSimilarity], result of:
0.13184154 = score(doc=4,freq=3.0), product of:
0.041205455 = queryWeight, product of:
6.0 = boost
1.8472979 = idf(docFreq=2, maxDocs=7)
0.003717633 = queryNorm
3.1996138 = fieldWeight in 4, product of:
1.7320508 = tf(freq=3.0), with freq of:
3.0 = termFreq=3.0
1.8472979 = idf(docFreq=2, maxDocs=7)
1.0 = fieldNorm(doc=4)
0.04952125 = (MATCH) weight(model:Component^10.0 in 4) [DefaultSimilarity], result of:
0.04952125 = score(doc=4,freq=1.0), product of:
0.04290709 = queryWeight, product of:
10.0 = boost
1.1541507 = idf(docFreq=5, maxDocs=7)
0.003717633 = queryNorm
1.1541507 = fieldWeight in 4, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
1.1541507 = idf(docFreq=5, maxDocs=7)
1.0 = fieldNorm(doc=4)
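If the request is assembled in code, one way to make sure every boost clause ends up in its own bq parameter is to append them one at a time. This is only a sketch using the Java standard library (Java 10 or later); the class name BoostQueryUrl, the helper buildQueryString, and the host and core name in the URL are assumptions for illustration, not part of the original setup.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;

final class BoostQueryUrl {

    // Builds the query string with one bq parameter per boost clause.
    static String buildQueryString(String q, List<String> boostQueries) {
        StringBuilder sb = new StringBuilder("defType=edismax");
        sb.append("&q=").append(URLEncoder.encode(q, StandardCharsets.UTF_8));
        for (String bq : boostQueries) {
            sb.append("&bq=").append(URLEncoder.encode(bq, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> boosts = List.of(
            "model:\"Component\"^10",
            "model:\"Object\"^90",
            "model:\"Address\"^1",
            "model:\"eav_table_54f5d74b4efef9.49994240\"^14");
        // Hypothetical host and core name, for illustration only.
        String base = "http://localhost:8983/solr/example/select?";
        // The q value from the question, with the u-umlaut written as a Unicode escape.
        System.out.println(base + buildQueryString("Fraum\u00fcnster", boosts));
    }
}
```

The remaining parameters from the question (qf, fl, rows and so on) would be appended the same way.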
I hope that helps!
I have the following records, shown with their scores, when I search for "iphone":
Record1:
FieldName - DisplayName : "Iphone"
FieldName - Name : "Iphone"
11.654595 = (MATCH) sum of:
11.654595 = (MATCH) max plus 0.01 times others of:
7.718274 = (MATCH) weight(DisplayName:iphone^10.0 in 915195), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(DisplayName:iphone in 915195), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=DisplayName, doc=915195)
11.577413 = (MATCH) weight(Name:iphone^15.0 in 915195), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(Name:iphone in 915195), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=Name, doc=915195)
Record2:
FieldName - DisplayName : "The Iphone Book"
FieldName - Name : "The Iphone Book"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 453681), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 453681), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=453681)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 453681), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 453681), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=453681)
Record3:
FieldName - DisplayName: "iPhone"
FieldName - Name: "iPhone"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 5737775), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 5737775), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=5737775)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 5737775), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 5737775), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=5737775)
Why do Record2 and Record3 have the same score when Record2 has three words and Record3 has just one? Record3 should have higher relevancy than Record2. Why is the fieldNorm the same for both Record2 and Record3?
QueryParser: Dismax
FieldType: text fieldtype as default in solrconfig.xml
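For reference, the "max plus 0.01 times others" lines in the explain output can be reproduced from the two per-field scores. This is a small sketch of that arithmetic only; the class and method names are made up, and 0.01 is the tie-breaker value shown in the explain output.

```java
final class DisMaxCheck {

    // Best field score plus tieBreaker times the sum of the other field scores.
    // Expects at least one score.
    static float disMax(float tieBreaker, float... fieldScores) {
        float max = Float.NEGATIVE_INFINITY;
        float sum = 0f;
        for (float s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Record1: DisplayName and Name field scores from the explain output.
        System.out.println(disMax(0.01f, 7.718274f, 11.577413f));
        // Record2 and Record3 share the same per-field scores.
        System.out.println(disMax(0.01f, 4.823921f, 7.2358828f));
    }
}
```

Since Record2 and Record3 have identical per-field scores, any difference between them would have to come from inside the field weights, that is, from tf, idf or fieldNorm.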
Adding DataFeed:
Record1: Iphone
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"190502",
"EAN":"9780596804299",
"ISBN":"0596804296",
"Author":"Pogue, David",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2009-08-07 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"397",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Pogue, David",
"DisplayName":"Iphone",
"url":"/iphone-pogue-david/books/9780596804299.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Oreilly & Associates Inc",
"Name":"Iphone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780596804299"
}
Record2: The Iphone Book
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"94694",
"EAN":"9780321534101",
"ISBN":"0321534107",
"Author":"Kelby, Scott/ White, Terry",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":1,
"COD":0,
"PublicationDate":"2007-08-13 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Handheld Devices",
"Binding":"Paperback",
"BAMcategory_src":"Computers",
"Category_fq":"Computers",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"219",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Kelby, Scott/ White, Terry",
"DisplayName":"The Iphone Book",
"url":"/iphone-book-kelby-scott-white-terry/books/9780321534101.htm",
"CurrencyType":"USD",
"SubSubCategory":" Handheld Devices",
"BAMcategory_fq":"Computers",
"Mask":0,
"Publisher":"Pearson P T R",
"Name":"The Iphone Book",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780321534101"
}
Record 3: iPhone
{
"ListPrice":278.46,
"ShipsTo":1,
"OID":"694715",
"EAN":"9781411423527",
"ISBN":"1411423526",
"Author":"Quamut (COR)",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"5.9500",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2010-08-03 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"CallOrder":"0",
"TrackInventory":"BNT",
"Author_fq":"Quamut (COR)",
"DisplayName":"iPhone",
"url":"/iphone-quamut-cor/books/9781411423527.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Sterling Pub Co Inc",
"Name":"iPhone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9781411423527"
}
fieldNorm takes the field length, i.e. the number of terms, into account.
The field type used for the DisplayName and Name fields is text, which includes the stopword and word delimiter filters.
Record 1 - Iphone
Would generate a single token - Iphone
Record 2 - The Iphone Book
Would generate 2 tokens - Iphone, Book
"The" would be removed by the stopword filter.
Record 3 - iPhone
Would also generate 2 tokens - i, Phone
Because iPhone has a case change, the word delimiter filter with splitOnCaseChange splits it into 2 tokens (i, Phone), which produces the same fieldNorm as Record 2.
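The token counts above can be illustrated with a rough simulation. The sketch below is not Solr's analysis chain: it stands in for whitespace tokenizing, stopword removal and splitOnCaseChange with plain Java string handling, and the class name, the stopword list and the regular expression are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

final class TokenCountSketch {

    // Assumed stopword list; the real one comes from the field type's stopwords file.
    private static final Set<String> STOPWORDS = Set.of("the", "a", "an");

    // Rough stand-in for whitespace tokenizing, stopword removal and
    // splitOnCaseChange (a split at every lower-to-upper transition).
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (STOPWORDS.contains(word.toLowerCase(Locale.ROOT))) {
                continue;
            }
            for (String part : word.split("(?<=\\p{Lower})(?=\\p{Upper})")) {
                out.add(part);
            }
        }
        return out;
    }

    // Length norm before it is compressed to a single byte in the index.
    static double rawLengthNorm(int numTokens) {
        return 1.0 / Math.sqrt(numTokens);
    }

    public static void main(String[] args) {
        for (String name : List.of("Iphone", "The Iphone Book", "iPhone")) {
            List<String> t = tokens(name);
            System.out.println(name + " -> " + t
                + ", raw lengthNorm = " + rawLengthNorm(t.size()));
        }
    }
}
```

With two tokens the raw length norm is 1/sqrt(2), about 0.707, which Lucene's one-byte norm encoding stores as 0.625. That is the fieldNorm shown for Record2 and Record3, while the single-token Record1 keeps 1.0.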
This is the answer to user1021590's follow-up question/answer on the "da vinci code" search example.
The reason all the documents get the same score is a subtle implementation detail of lengthNorm. The Lucene TFIDFSimilarity documentation states the following about norm(t, d):
the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
If you dig into the code, you see that this float-to-byte encoding is implemented as follows:
public static byte floatToByte315(float f)
{
    int bits = Float.floatToRawIntBits(f);
    int smallfloat = bits >> (24 - 3);
    if (smallfloat <= ((63 - 15) << 3))
    {
        return (bits <= 0) ? (byte) 0 : (byte) 1;
    }
    if (smallfloat >= ((63 - 15) << 3) + 0x100)
    {
        return -1;
    }
    return (byte) (smallfloat - ((63 - 15) << 3));
}
and the decoding of that byte to float is done as:
public static float byte315ToFloat(byte b)
{
    if (b == 0)
        return 0.0f;
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
}
lengthNorm is calculated as 1 / sqrt( number of terms in field ). This is then encoded for storage using floatToByte315. For a field with 3 terms, we get:
floatToByte315( 1/sqrt(3.0) ) = 120
and for a field with 4 terms, we get:
floatToByte315( 1/sqrt(4.0) ) = 120
so both of them get decoded to:
byte315ToFloat(120) = 0.5.
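To see the effect across several field lengths, the two functions quoted above can be dropped into a small standalone program. This is a sketch for experimenting, not part of Lucene: the wrapper class NormPrecisionDemo and the helper storedNorm are made up here, and index-time boosts are assumed to be 1.

```java
final class NormPrecisionDemo {

    // Copied from the encoder quoted above.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    // Copied from the decoder quoted above.
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    // lengthNorm = 1 / sqrt(number of terms), then the encode/decode round trip.
    static float storedNorm(int numTerms) {
        float lengthNorm = (float) (1.0 / Math.sqrt(numTerms));
        return byte315ToFloat(floatToByte315(lengthNorm));
    }

    public static void main(String[] args) {
        for (int n = 1; n <= 8; n++) {
            float raw = (float) (1.0 / Math.sqrt(n));
            System.out.println(n + " terms: raw=" + raw
                + " byte=" + floatToByte315(raw)
                + " stored=" + storedNorm(n));
        }
    }
}
```

Fields with 3 and 4 terms land on the same byte, so they are indistinguishable to the scorer once the norm has been stored.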
The doc also states this:
The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
UPDATE: As of Solr 4.10, this implementation and corresponding statements are part of DefaultSimilarity.
I want my search results to be ordered by score, which they are, but the score is being calculated improperly. That is to say, not necessarily improperly, but differently than I expected, and I'm not sure why. My goal is to remove whatever is changing the score.
If I perform a search that matches on two objects (where ObjectA is expected to have a higher score than ObjectB), ObjectB is being returned first.
Let's say, for this example, that my query is a single term: "apples".
ObjectA's title: "apples are apples" (2/3 terms)
ObjectA's description: "There were apples in the apples-apples and now the apples went all apples all over the apples!" (6/18 terms)
ObjectB's title: "apples are great" (1/3 terms)
ObjectB's description: "There were apples in the apples-room and now the apples went all bad all over the apples!" (4/18 terms)
The title field has no boost (or rather, a boost of 1) and the description field has a boost of 0.8. I have not specified a document boost through solrconfig.xml or through the query that I'm passing through. If there is another way to specify a document boost, there is the chance that I'm missing one.
After analyzing the explain printout, it looks like ObjectA is properly calculating a higher score than ObjectB, just like I want, except for one difference: ObjectB's title fieldNorm is always higher than ObjectA's.
Here follows the explain printout. Just so you know: the title field is mditem5_tns and the description field is mditem7_tns:
ObjectB:
1.3327172 = (MATCH) sum of:
1.0352166 = (MATCH) max plus 0.1 times others of:
0.9766194 = (MATCH) weight(mditem5_tns:appl in 0), product of:
0.53929156 = queryWeight(mditem5_tns:appl), product of:
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.8109303 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
1.0 = tf(termFreq(mditem5_tns:appl)=1)
1.8109303 = idf(docFreq=3, maxDocs=9)
1.0 = fieldNorm(field=mditem5_tns, doc=0)
0.58597165 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
0.8 = boost
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.3581977 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
2.0 = tf(termFreq(mditem7_tns:appl)=4)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.375 = fieldNorm(field=mditem7_tns, doc=0)
0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
0.999001 = 1000.0/(1.0*float(1)+1000.0)
1.0 = boost
0.2977981 = queryNorm
ObjectA:
1.2324848 = (MATCH) sum of:
0.93498427 = (MATCH) max plus 0.1 times others of:
0.8632177 = (MATCH) weight(mditem5_tns:appl in 0), product of:
0.53929156 = queryWeight(mditem5_tns:appl), product of:
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.6006513 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
1.4142135 = tf(termFreq(mditem5_tns:appl)=2)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.625 = fieldNorm(field=mditem5_tns, doc=0)
0.7176658 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
0.8 = boost
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.6634457 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
2.4494898 = tf(termFreq(mditem7_tns:appl)=6)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.375 = fieldNorm(field=mditem7_tns, doc=0)
0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
0.999001 = 1000.0/(1.0*float(1)+1000.0)
1.0 = boost
0.2977981 = queryNorm
The problem is caused by the stemmer. It expands "apples are apples" to "apples appl are apples appl", which makes the field longer. Since document B contains only one term that the stemmer expands, its field stays shorter than document A's.
This results in different fieldNorms.
fieldNorm is computed from three components: the index-time boost on the field, the index-time boost on the document, and the field length. Assuming that you are not supplying any index-time boost, the difference must come from the field length.
Thus, since lengthNorm is higher for shorter field values, for B to have a higher fieldNorm for the title, it must have fewer tokens in the title than A.
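The effect on the title field can be reproduced from the numbers in the explain printout. The sketch below recomputes fieldWeight as tf * idf * fieldNorm, with tf taken as the square root of the term frequency, which is what the tf lines in the printout show. The class and method names are made up for illustration.

```java
final class FieldWeightCheck {

    // fieldWeight as broken down in the explain output: tf * idf * fieldNorm,
    // where tf is the square root of the term frequency.
    static float fieldWeight(int termFreq, float idf, float fieldNorm) {
        float tf = (float) Math.sqrt(termFreq);
        return tf * idf * fieldNorm;
    }

    public static void main(String[] args) {
        float idf = 1.8109303f;
        // ObjectA title: termFreq=2, fieldNorm=0.625
        float titleA = fieldWeight(2, idf, 0.625f);
        // ObjectB title: termFreq=1, fieldNorm=1.0
        float titleB = fieldWeight(1, idf, 1.0f);
        System.out.println("ObjectA title fieldWeight: " + titleA);
        System.out.println("ObjectB title fieldWeight: " + titleB);
        System.out.println("B outranks A on title: " + (titleB > titleA));
    }
}
```

Even though ObjectA has the higher term frequency in the title, its lower fieldNorm more than cancels that out, so ObjectB's title fieldWeight ends up larger.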
See the following pages for a detailed explanation of Lucene scoring:
http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html