solr: boost query will be ignored

Hello, when I execute the following query
fl=id model label timestamp score uuid&echoParams=all&qf=label^6 content_level_high^5 content_level_middle^2 content_level_less^1&hl.fl=teaser&wt=xml&rows=9&defType=edismax&facet=true&bq=model:"Component"^10 model:"Object"^90 model:"Address"^1 model:"eav_table_54f5d74b4efef9.49994240"^14&debugQuery=on&start=0&q=Fraumünster
The same query, formatted to be easier to read:
defType=edismax
fl=id model label timestamp score uuid
qf=label^6 content_level_high^5 content_level_middle^2 content_level_less^1
bq=model:"Component"^10 model:"Object"^90 model:"Address"^1 model:"eav_table_54f5d74b4efef9.49994240"^14
q=Fraumünster
start=0
rows=9
wt=xml
facet=true
echoParams=all
debugQuery=on
hl.fl=teaser
against a Solr 3.6.2 server, it seems that the boost on the "model" field is totally ignored,
because all entries get the same score from a single hit in "label".
IMHO the ordering should therefore follow the boost query.
Here is a full explain:
http://explain.solr.pl/explains/ipu6qacf
The raw query result:
http://pastebin.com/3uFdd8uw
Solr schema (for solr 5.x):
http://pastebin.com/0pZB5gDt
Solr config:
http://pastebin.com/Kd6W2nYD
The documents, in Solr add syntax:
http://pastebin.com/HMBrwAWV
Does anyone have an idea what is wrong with the boost query?

Specify each boost query as a separate bq parameter:
bq=model:"Component"^10&bq=model:"Object"^90&bq=model:"Address"^1&bq=model:"eav_table_54f5d74b4efef9.49994240"^14
Then the query is parsed correctly and the boosts are reflected in the relevance calculation:
+(content_level_less:chang | label:chang^6.0 | content_level_high:chang^5.0 | content_level_middle:chang^2.0) model:Component^10.0 model:Object^90.0 model:Address model:eav_table_54f5d74b4efef9.49994240^14.0
0.1813628 = (MATCH) sum of: 0.13184154 = (MATCH) max of: 0.13184154 = (MATCH) weight(label:chang^6.0 in 4) [DefaultSimilarity], result of: 0.13184154 = score(doc=4,freq=3.0), product of: 0.041205455 = queryWeight, product of: 6.0 = boost 1.8472979 = idf(docFreq=2, maxDocs=7) 0.003717633 = queryNorm 3.1996138 = fieldWeight in 4, product of: 1.7320508 = tf(freq=3.0), with freq of: 3.0 = termFreq=3.0 1.8472979 = idf(docFreq=2, maxDocs=7) 1.0 = fieldNorm(doc=4) 0.04952125 = (MATCH) weight(model:Component^10.0 in 4) [DefaultSimilarity], result of: 0.04952125 = score(doc=4,freq=1.0), product of: 0.04290709 = queryWeight, product of: 10.0 = boost 1.1541507 = idf(docFreq=5, maxDocs=7) 0.003717633 = queryNorm 1.1541507 = fieldWeight in 4, product of: 1.0 = tf(freq=1.0), with freq of: 1.0 = termFreq=1.0 1.1541507 = idf(docFreq=5, maxDocs=7) 1.0 = fieldNorm(doc=4)
I hope that helps!
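For illustration, here is a minimal JDK-only sketch (hypothetical class, no SolrJ) that emits one URL-encoded bq parameter per boost clause instead of packing all clauses into a single bq:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

public class BqParams {
    // Emits one URL-encoded bq parameter per boost clause, so edismax
    // parses each boost query separately.
    static String bqParams(Map<String, Integer> boosts) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : boosts.entrySet()) {
            if (sb.length() > 0) sb.append('&');
            String clause = "model:\"" + e.getKey() + "\"^" + e.getValue();
            sb.append("bq=").append(URLEncoder.encode(clause, StandardCharsets.UTF_8));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Integer> boosts = new LinkedHashMap<>();
        boosts.put("Component", 10);
        boosts.put("Object", 90);
        boosts.put("Address", 1);
        boosts.put("eav_table_54f5d74b4efef9.49994240", 14);
        System.out.println(bqParams(boosts));
    }
}
```

The resulting string can be appended to the select URL in place of the single bq parameter from the question.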

Related

Solr/Lucene simple operator "or" misunderstanding / Search for the same word in different fields

Learning Solr/Lucene syntax, using the Solr Admin UI in the browser.
There I try to search for the same word in two different fields with the following syntax:
content:myword -> results found
content:myword OR title:existingTitle -> results found
but
content:myword OR title:myword -> ZERO results found, why? It is "or".
I also tried it without an operator (which should be equivalent to "or"), and also tried "|" and "||".
This happens whenever I try to find the same word in one of multiple fields.
[edit]
Here are the solr url requests:
content:fahrzeug title:fahrzeug
http://xxx/solr/core_de/select?q=content%3Afahrzeug%20title%3Afahrzeug
content:fahrzeug OR title:fahrzeug
http://xxx/solr/core_de/select?q=content%3Afahrzeug%20OR%20title%3Afahrzeug
content:fahrzeug | title:fahrzeug
http://xxx/solr/core_de/select?q=content%3Afahrzeug%20%7C%20title%3Afahrzeug
{
"responseHeader":{
"status":400,
"QTime":5,
"params":{
"q":"content:fahrzeug OR title:fahrzeug",
"debugQuery":"1"}},
"error":{
"metadata":[
"error-class","org.apache.solr.common.SolrException",
"root-error-class","org.apache.solr.common.SolrException"],
"msg":"invalid boolean value: 1",
"code":400}}
I guess that it is configured like this:
Try:
http://www119.pxia.de:8983/solr/core_de/select?fq=content%3Afahrzeug%20title%3Afahrzeug&q=*%3A* - this returns the correct documents, so the documents are there when only filtering is used. The query path uses more complex conditions; your default configuration is:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
<str name="qf">content^40.0 title^5.0 keywords^2.0 tagsH1^5.0 tagsH2H3^3.0 tagsH4H5H6^2.0 tagsInline^1.0</str>
<str name="pf">content^2.0</str>
<str name="df">content</str>
<int name="ps">15</int>
<str name="mm">2<-35%</str>
<str name="mm.autoRelax">true</str>
...
The parser and its boosting may play a key role here.
I am not familiar with the edismax parser; please check the documentation.
I would guess the mm (minimum should match) parameter is causing this.
Anyway, it's strange that OR does not behave the way we are used to from Boolean algebra.
"debug":{
"queryBoosting":{
"q":"title:Home OR content:Perfekt",
"match":null},
"rawquerystring":"title:Home OR content:Perfekt",
"querystring":"title:Home OR content:Perfekt",
"parsedquery":"+(title:hom content:perfekt)~2 ()",
"parsedquery_toString":"+((title:hom content:perfekt)~2) ()",
"explain":{
"bf72a75534ba703e4b8dc7194f92ce34223fc0d2/pages/1/0/0/0":"\n4.8893824 = sum of:\n 4.8893824 = sum of:\n 1.9924302 = weight(title:hom in 0) [SchemaSimilarity], result of:\n 1.9924302 = score(doc=0,freq=1.0 = termFreq=1.0\n), product of:\n 1.9924302 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n 1.0 = docFreq\n 10.0 = docCount\n 1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.0 = parameter b (norms omitted for field)\n 2.8969522 = weight(content:perfekt in 0) [SchemaSimilarity], result of:\n 2.8969522 = score(doc=0,freq=5.0 = termFreq=5.0\n), product of:\n 1.4816046 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n 2.0 = docFreq\n 10.0 = docCount\n 1.9552802 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n 5.0 = termFreq=5.0\n 1.2 = parameter k1\n 0.75 = parameter b\n 508.3 = avgFieldLength\n 184.0 = fieldLength\n"},
"QParser":"ExtendedDismaxQParser",
Check "parsedquery":"+(title:hom content:perfekt)~2 ()": the ~2 basically says that both the title and the content clause must match:
Solr operators
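For what it's worth, here is a minimal sketch of how the configured rule "2<-35%" behaves under the documented edismax mm semantics (hypothetical helper; with 2 or fewer optional clauses all are required, above that 35% of them may be missing, with the computed percentage rounded down):

```java
public class MmRule {
    // Sketch of the conditional mm rule "2<-35%":
    // with 2 or fewer optional clauses, all are required;
    // with more, 35% of them (rounded down) may be missing.
    static int minShouldMatch(int clauses) {
        if (clauses <= 2) return clauses;
        return clauses - (int) (clauses * 0.35);
    }

    public static void main(String[] args) {
        // This is why "content:fahrzeug OR title:fahrzeug" (2 clauses)
        // requires BOTH terms to match: mm resolves to 2.
        System.out.println(minShouldMatch(2));
        System.out.println(minShouldMatch(10));
    }
}
```

So with this configuration, a two-clause OR query only matches documents containing the term in both fields, which is exactly the "~2" visible in the parsed query.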

Lucene 5.4 - scoring divided by number of search terms?

I have a simple setup with IndexSearcher, QueryParser, SimpleAnalyzer.
Running some queries, I noticed that a query with more than one term returns a different ScoreDoc[i].score than the one shown by the explain statement. Apparently it is the score shown in explain divided by the number of search terms. Any explanation for this behaviour?
Running search(TERM1 TERM2 TERM3)
line:term1 line:term2 line:term3
2.167882 = sum of:
0.6812867 = weight(line:term1 in 6594) [DefaultSimilarity], result of:
0.6812867 = score(doc=6594,freq=2.0), product of:
0.5389907 = queryWeight ...
totalHits 1
1678413725, TERM1 TERM2 TERM3, score: 0.72262734
I understand that the coord() factor is used to penalise documents which match only a subset of the search terms, but this document includes all terms. Any suggestions?
EDIT: It seems the division only occurs if the query uses OR clauses instead of AND. So with OR queries, even a document matching all terms still has its score divided by the number of terms. I couldn't find this part in the documentation, but at least it explains the difference.
However, applying a QueryWrapperFilter seems to change the scoring again, although according to the documentation it should only filter the results without affecting scoring.
More details
These two explains are the result of the same query; only the second result's score gets divided again:
0.114700586 = product of:
0.34410176 = sum of:
0.34410176 = weight(line:term1 in 24) [DefaultSimilarity], result of:
0.34410176 = score(doc=24,freq=1.0), product of:
0.5389907 = queryWeight, product of:
8.17176 = idf(docFreq=14, maxDocs=19532)
0.065957725 = queryNorm
0.63841873 = fieldWeight in 24, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
8.17176 = idf(docFreq=14, maxDocs=19532)
0.078125 = fieldNorm(doc=24)
0.33333334 = coord(1/3)
item_id: 1495958818, item_name: term 1 dolor sit met, score: 0.114700586
0.18352094 = product of:
0.5505628 = sum of:
0.5505628 = weight(line:term 1 in 6112) [DefaultSimilarity], result of:
0.5505628 = score(doc=6112,freq=1.0), product of:
0.5389907 = queryWeight, product of:
8.17176 = idf(docFreq=14, maxDocs=19532)
0.065957725 = queryNorm
1.02147 = fieldWeight in 6112, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
8.17176 = idf(docFreq=14, maxDocs=19532)
0.125 = fieldNorm(doc=6112)
0.33333334 = coord(1/3)
item_id: 1677761523, item_name: some text term 1, score: 0.061173648
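For reference, the division visible above matches Lucene's coord factor, which in DefaultSimilarity is simply overlap / maxOverlap. A quick sketch (hypothetical helper) reproducing the first reported score:

```java
public class CoordCheck {
    // DefaultSimilarity's coord factor: the fraction of query terms
    // that the document matched.
    static float coord(int overlap, int maxOverlap) {
        return (float) overlap / maxOverlap;
    }

    public static void main(String[] args) {
        // First explain above: the inner sum 0.34410176 scaled by
        // coord(1/3) reproduces the reported score 0.114700586.
        System.out.println(0.34410176f * coord(1, 3));
    }
}
```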

solr / lucene idf score

I'm trying to get a better understanding of how lucene scored my search so that I can make necessary tweaks to my search configuration or the document content.
The below is a part of the score breakdown.
product of:
0.34472802 = queryWeight, product of:
2.2 = boost
7.880174 = idf(docFreq=48, maxDocs=47667)
0.019884655 = queryNorm
1.9700435 = fieldWeight in 14363, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
7.880174 = idf(docFreq=48, maxDocs=47667)
0.25 = fieldNorm(doc=14363)
0.26806915 = (MATCH) max of:
0.07832639 = (MATCH) weight(shortDescription:tires^1.1 in 14363) [DefaultSimilarity], result of:
0.07832639 = score(doc=14363,freq=1.0 = termFreq=1.0
I understand how the boost is calculated, as that is my configuration value,
but how was the idf calculated (7.880174)?
According to the lucene, the idf formula is: idf(t) = 1 + log(numDocs/(docFreq+1))
I checked the core admin console and found that my docFreq = maxDocs = 47667.
Using the formula from Lucene, I was not able to get the expected 7.880174. Instead I get: idf = 3.988 = 1 + log(47667/(48+1)).
Is there something I am missing in my formula?
I think your log function uses 10 as its base, while Lucene uses e:
log(47667/(48+1), 10) = 2.9880217397306
log(47667/(48+1), e) = 6.8801743154459
The source code of idf method of lucene is:
public float idf(int docFreq, int numDocs) {
return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}
As you can see, idf uses Java's Math.log, and Math.log is the natural (base-e) logarithm. See the Java Math API for details.
It looks like the Lucene site has a typo.
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html
states 1 + log(numDocs/(docFreq+1))
but it is actually 1 + ln(numDocs/(docFreq+1))
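You can reproduce the value from the explain output with the idf method quoted above (the only inputs assumed are the docFreq=48, maxDocs=47667 pair shown in the explain):

```java
public class IdfCheck {
    // Lucene's idf, as in the source quoted above; Math.log is base e.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        System.out.println(idf(48, 47667)); // ~7.880174, as in the explain
    }
}
```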

solr scoring - fieldnorm

I have the following records and the scores against it when I search for "iphone" -
Record1:
FieldName - DisplayName : "Iphone"
FieldName - Name : "Iphone"
11.654595 = (MATCH) sum of:
11.654595 = (MATCH) max plus 0.01 times others of:
7.718274 = (MATCH) weight(DisplayName:iphone^10.0 in 915195), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(DisplayName:iphone in 915195), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=DisplayName, doc=915195)
11.577413 = (MATCH) weight(Name:iphone^15.0 in 915195), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(Name:iphone in 915195), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=Name, doc=915195)
Record2:
FieldName - DisplayName : "The Iphone Book"
FieldName - Name : "The Iphone Book"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 453681), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 453681), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=453681)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 453681), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 453681), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=453681)
Record3:
FieldName - DisplayName: "iPhone"
FieldName - Name: "iPhone"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 5737775), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 5737775), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=5737775)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 5737775), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 5737775), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=5737775)
Why do Record2 and Record3 have the same score when Record2 has 3 words and Record3 has just one? Record3 should therefore have higher relevancy than Record2. Why are the fieldNorms of Record2 and Record3 the same?
QueryParser: Dismax
FieldType: text fieldtype as default in solrconfig.xml
Adding the data feed:
Record1: Iphone
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"190502",
"EAN":"9780596804299",
"ISBN":"0596804296",
"Author":"Pogue, David",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2009-08-07 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"397",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Pogue, David",
"DisplayName":"Iphone",
"url":"/iphone-pogue-david/books/9780596804299.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Oreilly & Associates Inc",
"Name":"Iphone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780596804299"
}
Record2: The Iphone Book
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"94694",
"EAN":"9780321534101",
"ISBN":"0321534107",
"Author":"Kelby, Scott/ White, Terry",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":1,
"COD":0,
"PublicationDate":"2007-08-13 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Handheld Devices",
"Binding":"Paperback",
"BAMcategory_src":"Computers",
"Category_fq":"Computers",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"219",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Kelby, Scott/ White, Terry",
"DisplayName":"The Iphone Book",
"url":"/iphone-book-kelby-scott-white-terry/books/9780321534101.htm",
"CurrencyType":"USD",
"SubSubCategory":" Handheld Devices",
"BAMcategory_fq":"Computers",
"Mask":0,
"Publisher":"Pearson P T R",
"Name":"The Iphone Book",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780321534101"
}
Record 3: iPhone
{
"ListPrice":278.46,
"ShipsTo":1,
"OID":"694715",
"EAN":"9781411423527",
"ISBN":"1411423526",
"Author":"Quamut (COR)",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"5.9500",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2010-08-03 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"CallOrder":"0",
"TrackInventory":"BNT",
"Author_fq":"Quamut (COR)",
"DisplayName":"iPhone",
"url":"/iphone-quamut-cor/books/9781411423527.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Sterling Pub Co Inc",
"Name":"iPhone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9781411423527"
}
fieldNorm takes into account the field length, i.e. the number of terms.
The field type used for DisplayName and Name is text, which includes the stopword and word-delimiter filters.
Record 1 - Iphone
Would generate a single token - Iphone
Record 2 - The Iphone Book
Would generate 2 tokens - Iphone, Book
"The" would be removed by the stopword filter.
Record 3 - iPhone
Would also generate 2 tokens - i, Phone
Because iPhone has a case change, the word-delimiter filter with splitOnCaseChange splits it into the 2 tokens i and Phone, which produces the same field norm as Record 2.
This is the answer to user1021590's follow-up question/answer on the "da vinci code" search example.
The reason all the documents get the same score is due to a subtle implementation detail of lengthNorm. The Lucene TFIDFSimilarity doc states the following about norm(t, d):
the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
If you dig into the code, you see that this float-to-byte encoding is implemented as follows:
public static byte floatToByte315(float f)
{
int bits = Float.floatToRawIntBits(f);
int smallfloat = bits >> (24 - 3);
if (smallfloat <= ((63 - 15) << 3))
{
return (bits <= 0) ? (byte) 0 : (byte) 1;
}
if (smallfloat >= ((63 - 15) << 3) + 0x100)
{
return -1;
}
return (byte) (smallfloat - ((63 - 15) << 3));
}
and the decoding of that byte to float is done as:
public static float byte315ToFloat(byte b)
{
if (b == 0)
return 0.0f;
int bits = (b & 0xff) << (24 - 3);
bits += (63 - 15) << 24;
return Float.intBitsToFloat(bits);
}
lengthNorm is calculated as 1 / sqrt( number of terms in field ). This is then encoded for storage using floatToByte315. For a field with 3 terms, we get:
floatToByte315( 1/sqrt(3.0) ) = 120
and for a field with 4 terms, we get:
floatToByte315( 1/sqrt(4.0) ) = 120
so both of them get decoded to:
byte315ToFloat(120) = 0.5.
The doc also states this:
The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
UPDATE: As of Solr 4.10, this implementation and corresponding statements are part of DefaultSimilarity.
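To see the collapse concretely, here are self-contained copies of the two helpers quoted above, run on the 3-term and 4-term norms:

```java
public class NormEncoding {
    // Self-contained copies of the two helpers quoted above.
    static byte floatToByte315(float f) {
        int bits = Float.floatToRawIntBits(f);
        int smallfloat = bits >> (24 - 3);
        if (smallfloat <= ((63 - 15) << 3)) {
            return (bits <= 0) ? (byte) 0 : (byte) 1;
        }
        if (smallfloat >= ((63 - 15) << 3) + 0x100) {
            return -1;
        }
        return (byte) (smallfloat - ((63 - 15) << 3));
    }

    static float byte315ToFloat(byte b) {
        if (b == 0) return 0.0f;
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    public static void main(String[] args) {
        // 1/sqrt(3) and 1/sqrt(4) encode to the same byte...
        System.out.println(floatToByte315((float) (1 / Math.sqrt(3)))); // 120
        System.out.println(floatToByte315((float) (1 / Math.sqrt(4)))); // 120
        // ...and therefore decode to the same norm value.
        System.out.println(byte315ToFloat((byte) 120)); // 0.5
    }
}
```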

Solr: fieldNorm different per document, with no document boost

I want my search results to order by score, which they are doing, but the score is being calculated improperly. This is to say, not necessarily improperly, but differently than expected and I'm not sure why. My goal is to remove whatever is changing the score.
If I perform a search that matches on two objects (where ObjectA is expected to have a higher score than ObjectB), ObjectB is being returned first.
Let's say, for this example, that my query is a single term: "apples".
ObjectA's title: "apples are apples" (2/3 terms)
ObjectA's description: "There were apples in the apples-apples and now the apples went all apples all over the apples!" (6/18 terms)
ObjectB's title: "apples are great" (1/3 terms)
ObjectB's description: "There were apples in the apples-room and now the apples went all bad all over the apples!" (4/18 terms)
The title field has no boost (or rather, a boost of 1) and the description field has a boost of 0.8. I have not specified a document boost through solrconfig.xml or through the query that I'm passing through. If there is another way to specify a document boost, there is the chance that I'm missing one.
After analyzing the explain printout, it looks like ObjectA is properly calculating a higher score than ObjectB, just like I want, except for one difference: ObjectB's title fieldNorm is always higher than ObjectA's.
Here follows the explain printout. Just so you know: the title field is mditem5_tns and the description field is mditem7_tns:
ObjectB:
1.3327172 = (MATCH) sum of:
1.0352166 = (MATCH) max plus 0.1 times others of:
0.9766194 = (MATCH) weight(mditem5_tns:appl in 0), product of:
0.53929156 = queryWeight(mditem5_tns:appl), product of:
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.8109303 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
1.0 = tf(termFreq(mditem5_tns:appl)=1)
1.8109303 = idf(docFreq=3, maxDocs=9)
1.0 = fieldNorm(field=mditem5_tns, doc=0)
0.58597165 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
0.8 = boost
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.3581977 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
2.0 = tf(termFreq(mditem7_tns:appl)=4)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.375 = fieldNorm(field=mditem7_tns, doc=0)
0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
0.999001 = 1000.0/(1.0*float(1)+1000.0)
1.0 = boost
0.2977981 = queryNorm
ObjectA:
1.2324848 = (MATCH) sum of:
0.93498427 = (MATCH) max plus 0.1 times others of:
0.8632177 = (MATCH) weight(mditem5_tns:appl in 0), product of:
0.53929156 = queryWeight(mditem5_tns:appl), product of:
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.6006513 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
1.4142135 = tf(termFreq(mditem5_tns:appl)=2)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.625 = fieldNorm(field=mditem5_tns, doc=0)
0.7176658 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
0.8 = boost
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.6634457 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
2.4494898 = tf(termFreq(mditem7_tns:appl)=6)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.375 = fieldNorm(field=mditem7_tns, doc=0)
0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
0.999001 = 1000.0/(1.0*float(1)+1000.0)
1.0 = boost
0.2977981 = queryNorm
The problem is caused by the stemmer. It expands "apples are apples" to "apples appl are apples appl", thus making the field longer. As document B only contains 1 term that gets expanded by the stemmer, its field stays shorter than document A's.
This results in different fieldNorms.
fieldNorm is computed from 3 components: the index-time boost on the field, the index-time boost on the document, and the field length. Assuming that you are not supplying any index-time boost, the difference must be the field length.
Thus, since lengthNorm is higher for shorter field values, for B to have a higher fieldNorm value for the title, it must have fewer tokens in the title than A.
See the following pages for a detailed explanation of Lucene scoring:
http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
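A minimal sketch of the classic lengthNorm (1/sqrt(number of terms)) shows why the extra stemmer tokens matter; the token counts of 5 and 4 below are assumptions based on the stemmer expansion described above, not values taken from the explain output:

```java
public class LengthNormDemo {
    // Classic Lucene lengthNorm: shorter fields get a higher norm.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Assumed: stemming expands A's title to 5 tokens and B's to 4,
        // so B's title ends up with the larger norm.
        System.out.println(lengthNorm(5)); // ObjectA's title (assumed)
        System.out.println(lengthNorm(4)); // ObjectB's title (assumed)
    }
}
```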
