Solr Results Ranking

I am using Solr 6.0.1 and have the following field type declaration:
<fieldType name="customy_icu" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.NGramTokenizerFactory" minGramSize="2" maxGramSize="20"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.LengthFilterFactory" min="1" max="100"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The customy_icu type is used for storing text data in Hebrew, a language that is written and read from right to left.
When the query is "מי פנים",
I get the results in the wrong order: product_3351 scores higher (more relevant) than product_3407, but it should be the other way around.
Here is the debug output:
<str name="product_3351">
2.711071 = sum of:
2.711071 = max of:
0.12766865 = weight(meta_keyword:"מי פנים" in 882) [ClassicSimilarity], result of:
0.12766865 = score(doc=882,freq=1.0), product of:
0.05998979 = queryWeight, product of:
8.5126915 = idf(), sum of:
4.7235003 = idf(docFreq=21, docCount=910)
3.7891912 = idf(docFreq=55, docCount=910)
0.0070471005 = queryNorm
2.1281729 = fieldWeight in 882, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
8.5126915 = idf(), sum of:
4.7235003 = idf(docFreq=21, docCount=910)
3.7891912 = idf(docFreq=55, docCount=910)
0.25 = fieldNorm(doc=882)
2.711071 = weight(name:"מי פנים" in 882) [ClassicSimilarity], result of:
2.711071 = score(doc=882,freq=1.0), product of:
0.6178363 = queryWeight, product of:
9.99 = boost
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.0070471005 = queryNorm
4.3880086 = fieldWeight in 882, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.5 = fieldNorm(doc=882)
</str>
and
<str name="product_3407">
2.711071 = sum of:
2.711071 = max of:
2.711071 = weight(name:"מי פנים" in 919) [ClassicSimilarity], result of:
2.711071 = score(doc=919,freq=1.0), product of:
0.6178363 = queryWeight, product of:
9.99 = boost
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.0070471005 = queryNorm
4.3880086 = fieldWeight in 919, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = phraseFreq=1.0
8.776017 = idf(), sum of:
4.8417873 = idf(docFreq=22, docCount=1071)
3.93423 = idf(docFreq=56, docCount=1071)
0.5 = fieldNorm(doc=919)
</str>
Product 3351 has the name field value:
סאבליים סופט מי פנים
And product 3407 has the name field value:
מי פנים מיסלרים
http://screencast.com/t/2iBwLQqu
How can I boost product 3407 so that it appears higher in the result list?
Thanks a lot!

If you have a specific query where you want to boost a document to the top of the result set, regardless of its own score, use the Query Elevation Component.
There is no automagic boosting for "appears earlier in the document", but there are a few ways to work around it. See How to boost scores for early matches for a couple of possible solutions.
"Relevancy" is a fluid term, and you have to implement the kind of scoring you feel is suitable for your application outside of the standard rules. The debugQuery output you've included shows that, by default, the two documents are scored identically on relevancy.

You can use an elevate.xml file to make a particular document appear at the top of the result set for a specific search term.
Example:
<elevate>
<query text="מי פנים">
<doc id="your_product_ID" />
</query>
</elevate>
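For completeness, a minimal sketch of the solrconfig.xml wiring that goes with it (the component and handler names and the df default are illustrative and should be adapted to your setup):
<searchComponent name="elevator" class="solr.QueryElevationComponent">
<str name="queryFieldType">string</str>
<str name="config-file">elevate.xml</str>
</searchComponent>
<requestHandler name="/elevate" class="solr.SearchHandler">
<lst name="defaults">
<str name="df">name</str>
</lst>
<arr name="last-components">
<str>elevator</str>
</arr>
</requestHandler>
Requests sent to /elevate with the query text defined in elevate.xml will then return the listed document first; elevation can be switched off per request with enableElevation=false.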

Related

Solr Shingle Is Not Visible In Debug Query

I am trying to use Solr to find exact matches on categories in a user search (e.g. "skinny jeans" in "blue skinny jeans"). I am using the following type definition:
<fieldType name="subphrase" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="\ "
replacement="_"/>
<tokenizer class="solr.KeywordTokenizerFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"
outputUnigrams="true"
outputUnigramsIfNoShingles="true"
tokenSeparator="_"
minShingleSize="2"
maxShingleSize="99"/>
</analyzer>
</fieldType>
The type will index categories without tokenizing, only replacing whitespace with underscores. But it will tokenize queries and shingle them (with underscores).
What I am trying to do is match the query shingles against the indexed categories. On the Solr Analysis page I can see that the whitespace/underscore replacement works at both index and query time, and that the query is being shingled correctly.
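To illustrate with the analyzer above (a worked example rather than the screenshot): indexing the category "skinny jeans" yields the single token "skinny_jeans", while query analysis of "blue skinny jeans" yields "blue", "skinny", "jeans", "blue_skinny", "skinny_jeans" and "blue_skinny_jeans".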
My problem is that in the Solr Query page, I cannot see shingles being generated, and I presume that as a result the category "skinny jeans" is not matched, but the category "jeans" is matched :(
This is the debug output:
{
"responseHeader": {
"status": 0,
"QTime": 1,
"params": {
"q": "name:(skinny jeans)",
"indent": "true",
"wt": "json",
"debugQuery": "true",
"_": "1464170217438"
}
},
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"id": 33,
"name": "jeans",
}
]
},
"debug": {
"rawquerystring": "name:(skinny jeans)",
"querystring": "name:(skinny jeans)",
"parsedquery": "name:skinny name:jeans",
"parsedquery_toString": "name:skinny name:jeans",
"explain": {
"33": "\n2.2143755 = product of:\n 4.428751 = sum of:\n 4.428751 = weight(name:jeans in 54) [DefaultSimilarity], result of:\n 4.428751 = score(doc=54,freq=1.0), product of:\n 0.6709952 = queryWeight, product of:\n 6.600272 = idf(docFreq=1, maxDocs=541)\n 0.10166174 = queryNorm\n 6.600272 = fieldWeight in 54, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 6.600272 = idf(docFreq=1, maxDocs=541)\n 1.0 = fieldNorm(doc=54)\n 0.5 = coord(1/2)\n"
},
"QParser": "LuceneQParser"
}
}
It's clear that the parsedquery parameter does not display the shingled query. What do I need to do to complete the process of matching query shingles against indexed values? I feel like I am very close to cracking this problem. Any advice is appreciated!
This is an incomplete answer, but it might be enough to get you moving.
1: You probably want outputUnigrams="false", so you don't match the category "jeans" on the query "skinny jeans".
2: You actually do want to search with quotes (a phrase query), or the field will never see more than one token to shingle.
3: It seems like you're trying to do the same thing as this person was:
http://comments.gmane.org/gmane.comp.jakarta.lucene.user/34746
That thread looks like it led to the inclusion of the PositionFilterFactory:
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.PositionFilterFactory
If you're using Solr < 5.0, try putting that filter at the end of your query-time analysis chain and see if it works.
Unfortunately, that filter factory was removed in 5.0. This is the only comment I've found about what to do instead:
http://lucene.apache.org/core/4_10_0/analyzers-common/org/apache/lucene/analysis/position/PositionFilter.html
I played with autoGeneratePhraseQueries a little, but I have yet to find another way to prevent Solr from generating a MultiPhraseQuery.
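For reference, a minimal sketch of the adjusted query-time analyzer (same field type and tokenSeparator as above, with unigrams turned off per point 1), to be combined with a quoted phrase query per point 2:
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"
outputUnigrams="false"
outputUnigramsIfNoShingles="true"
tokenSeparator="_"
minShingleSize="2"
maxShingleSize="99"/>
</analyzer>
A query such as name:"blue skinny jeans" then hands the whole phrase to the analyzer, so the shingle filter actually sees multiple tokens and can produce "skinny_jeans".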

Solr and hyphenated numbers

I have a number with hyphens: 91-21-22020-4.
I would like to get hits even if the hyphens are moved within the number string. As it is now, 912122020-4 gives a hit but 91212202-04 does not.
The debug info looks like this:
"debug": {
"rawquerystring": "91212202-04",
"querystring": "91212202-04",
"parsedquery": "+((freetext:91212202 freetext:9121220204)/no_coord) +freetext:04",
"parsedquery_toString": "+(freetext:91212202 freetext:9121220204) +freetext:04",
"explain": {},
"QParser": "LuceneQParser",
AND
"debug": {
"rawquerystring": "912122020-4",
"querystring": "912122020-4",
"parsedquery": "+((freetext:912122020 freetext:9121220204)/no_coord) +freetext:4",
"parsedquery_toString": "+(freetext:912122020 freetext:9121220204) +freetext:4",
"explain": {
"ATEST003-81419": "\n0.33174315 = (MATCH) sum of:\n 0.17618936 = (MATCH) sum of:\n 0.17618936 = (MATCH) weight(freetext:9121220204 in 0) [DefaultSimilarity], result of:\n 0.17618936 = score(doc=0,freq=1.0), product of:\n 0.5690552 = queryWeight, product of:\n 3.3025851 = idf(docFreq=1, maxDocs=20)\n 0.17230599 = queryNorm\n 0.30961734 = fieldWeight in 0, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 3.3025851 = idf(docFreq=1, maxDocs=20)\n 0.09375 = fieldNorm(doc=0)\n 0.15555379 = (MATCH) weight(freetext:4 in 0) [DefaultSimilarity], result of:\n 0.15555379 = score(doc=0,freq=2.0), product of:\n 0.44962177 = queryWeight, product of:\n 2.609438 = idf(docFreq=3, maxDocs=20)\n 0.17230599 = queryNorm\n 0.34596586 = fieldWeight in 0, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 2.609438 = idf(docFreq=3, maxDocs=20)\n 0.09375 = fieldNorm(doc=0)\n"
},
My schema.xml looks like:
<fieldType name="text_indexed" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.HyphenatedWordsFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-index.txt"/>
<filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-index.txt"/>
</analyzer>
</fieldType>
Use a PatternReplaceCharFilter to remove all traces of the hyphens before the text is tokenized and indexed in Solr (or use a PatternReplaceFilter to change the tokens produced rather than the raw incoming text).
91212202-04 would then be indexed (and searched) as 9121220204, which effectively removes any dependency on where the hyphens sit.
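A minimal sketch of what that could look like for the text_indexed type above (assuming you only ever want hyphens ignored in this field):
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=""/>
Place it at the very top of both the index and query analyzers (char filters run before the tokenizer); with it in place, 91-21-22020-4, 912122020-4 and 91212202-04 are all reduced to 9121220204 before tokenization. Remember to reindex after changing index-time analysis.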

Unexpected results in solr index when searching for string

I have set up a Solr environment and am using a text_nl field type that I fill from several other fields.
I am experiencing some odd behavior. Whenever I search for "new", the query returns results that contain "new", but also some results that do not contain the string "new" at all. I have already disabled all the filter factories, but to no avail; I keep getting results that do not contain this word.
Below are the relevant pieces of my solrconfig.xml and schema.xml.
Fieldtype text_nl:
<fieldType name="text_nl" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
<filter class="solr.ReversedWildcardFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
<!-- <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" /> -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Field names:
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Merk" type="text_nl" indexed="false" stored="true"/>
<field name="Model" type="text_nl" indexed="false" stored="true" multiValued="true" />
<field name="Kleur" type="text_nl" indexed="false" stored="true"/>
<field name="Collectie" type="text_nl" indexed="false" stored="true"/>
<field name="Categorie" type="text_nl" indexed="true" stored="true"/>
<field name="MateriaalSoort" type="text_nl" indexed="false" stored="true"/>
<field name="Zool" type="text_nl" indexed="false" stored="true"/>
<field name="Omschrijving" type="text_nl" indexed="false" stored="true"/>
<field name="text" type="text_nl" indexed="true" stored="true" multiValued="true"/>
Solrconfig.xml
<requestHandler name="/query" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">50000</int>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="df">text</str>
<str name="fl">id,Merk,Model,Kleur,Collectie,Categorie,Zool,Omschrijving</str>
<str name="qf">Merk^100 Model^0.8 Omschrijving^0.3 id^1.0</str>
<str name="pf">Merk^100 Model^0.8 Omschrijving^0.3 id^1.0</str>
</lst>
</requestHandler>
The data is as follows:
/query?q=new
Yields:
{
"id":"3215.70.101204",
"Merk":"New balance",
"Model":["M576"],
"Kleur":"Groen",
"Collectie":"Herenschoenen",
"Categorie":"Sneakers",
"Zool":"Rubber",
"Omschrijving":"Groene nubuck special runner van het merk New Balance. Het logo is van groen nubuck."},
{
"id":"3215.26.104592",
"Merk":"Greve",
"Model":["6260"],
"Kleur":"Jeans",
"Collectie":"Herenschoenen",
"Categorie":"Sneakers",
"Zool":"Rubber",
"Omschrijving":"Deze jeans blauwe suède/lederen runner is van het merk Greve. De runner heeft een merklabel van Greve aan de achterzijde. De runner heeft een witte met houten middenzool en een rubberen zool, verder heeft de runner zilveren studs details."},
As you can see there is no "new" in the result of the second id.
This is the result of the debug query:
debug":{
"rawquerystring":"new",
"querystring":"new",
"parsedquery":"text:new",
"parsedquery_toString":"text:new",
"explain":{
"3215.13.101204":"\n1.4514455 = (MATCH) weight(text:new in 2047) [DefaultSimilarity], result of:\n 1.4514455 = fieldWeight in 2047, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.1875 = fieldNorm(doc=2047)\n",
"3215.30.101204":"\n1.4514455 = (MATCH) weight(text:new in 2142) [DefaultSimilarity], result of:\n 1.4514455 = fieldWeight in 2142, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.1875 = fieldNorm(doc=2142)\n",
"3215.70.101204":"\n1.4514455 = (MATCH) weight(text:new in 2217) [DefaultSimilarity], result of:\n 1.4514455 = fieldWeight in 2217, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.1875 = fieldNorm(doc=2217)\n",
"3215.26.104592":"\n1.3966541 = (MATCH) weight(text:new in 2137) [DefaultSimilarity], result of:\n 1.3966541 = fieldWeight in 2137, product of:\n 2.0 = tf(freq=4.0), with freq of:\n 4.0 = termFreq=4.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.15625 = fieldNorm(doc=2137)\n",
"3215.34.104592":"\n1.3966541 = (MATCH) weight(text:new in 2185) [DefaultSimilarity], result of:\n 1.3966541 = fieldWeight in 2185, product of:\n 2.0 = tf(freq=4.0), with freq of:\n 4.0 = termFreq=4.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.15625 = fieldNorm(doc=2185)\n",
"3215.70.104592":"\n1.3966541 = (MATCH) weight(text:new in 2232) [DefaultSimilarity], result of:\n 1.3966541 = fieldWeight in 2232, product of:\n 2.0 = tf(freq=4.0), with freq of:\n 4.0 = termFreq=4.0\n 4.469293 = idf(docFreq=113, maxDocs=3661)\n 0.15625 = fieldNorm(doc=2232)\n",
This is probably happening due to the combination of the EdgeNGramFilter and the ReversedWildcardFilter. The EdgeNGramFilter first splits terms into n-grams of size three or more. Each of these is then indexed in both forward and reversed form, so if you index the word "went", you end up with:
ngrams: "wen", "ent", "went"
reversed wildcard: "wen", "new", "ent", "tne", "went", "tnew"
And so you get a match on the term "went" for a query on "new". Any word containing either "new" or "wen" can be expected to match.
Really, I think using both of these filters is overkill. Reversing n-grams doesn't make a great deal of sense to me: both are approaches to similar problems, and to my mind they don't make sense used together.
Also, you may have a synonym defined in "synonyms.txt" for the word "new".
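If you drop the reversed-wildcard step, a sketch of the index analyzer (keeping the rest of the chain from the question unchanged) could look like this:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_nl.txt" format="snowball" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
After changing the index-time chain you need to reindex before the unwanted matches disappear.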

Solr search query does not consider special character

I have indexed in solr shop names like
H&M
Lotte & Anna
fan & more
Tele2
Pure Tea
I have the following two issues (in order of priority):
If I search for "H&M" I never get any results. If I search for "te & Ann" I get the expected results.
If I search for "te & an" the results I get are Tele2 and Pure Tea, whereas I would have expected "Lotte & Anna" to appear first in the list.
It appears as if the & character is not taken into consideration. What am I doing wrong here?
These are my analyzers for the field in question (both query and index):
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
OK, so the 1st problem was addressed with the WordDelimiterFilterFactory, specifying "& => ALPHA" in wdfftypes.txt and switching from the StandardTokenizerFactory to the WhitespaceTokenizerFactory:
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"/>
(applied to both the index and query analyzers).
The 2nd question still remains.
In the debugQuery output I get the following:
"debug": {
"rawquerystring": "te & an",
"querystring": "te & an",
"parsedquery": "text:te text:an",
"parsedquery_toString": "text:te text:an",
"explain": {
"": "\n0.8152958 = (MATCH) product of:\n 1.6305916 = (MATCH) sum of:\n 1.6305916 = (MATCH) weight(text:te in 498) [DefaultSimilarity], result of:\n 1.6305916 = score(doc=498,freq=1.0 = termFreq=1.0\n), product of:\n 0.8202942 = queryWeight, product of:\n 5.300835 = idf(docFreq=87, maxDocs=6491)\n 0.15474811 = queryNorm\n 1.9878132 = fieldWeight in 498, product of:\n 1.0 = tf(freq=1.0), with freq of:\n 1.0 = termFreq=1.0\n 5.300835 = idf(docFreq=87, maxDocs=6491)\n 0.375 = fieldNorm(doc=498)\n 0.5 = coord(1/2)\n"
},
So, what should I modify so that the weights shift in favour of the desired result?
Use "NGramFilterFactory" instead of "EdgeNGramFilterFactory". That way, "Lotte & Anne", gets indexed into "lo, ot, tt, te, lot, ott, tte, lott, otte, lotte" and "an, nn, ne, ann, nne, anne". so when you search for "tte & ann", the document will match.

Solr: fieldNorm different per document, with no document boost

I want my search results to be ordered by score, which they are, but the score is being calculated improperly. That is to say, not necessarily improperly, but differently than I expect, and I'm not sure why. My goal is to remove whatever is changing the score.
If I perform a search that matches on two objects (where ObjectA is expected to have a higher score than ObjectB), ObjectB is being returned first.
Let's say, for this example, that my query is a single term: "apples".
ObjectA's title: "apples are apples" (2/3 terms)
ObjectA's description: "There were apples in the apples-apples and now the apples went all apples all over the apples!" (6/18 terms)
ObjectB's title: "apples are great" (1/3 terms)
ObjectB's description: "There were apples in the apples-room and now the apples went all bad all over the apples!" (4/18 terms)
The title field has no boost (or rather, a boost of 1) and the description field has a boost of 0.8. I have not specified a document boost through solrconfig.xml or through the query I'm passing in. If there is another way to specify a document boost, I may be missing it.
After analyzing the explain printout, it looks like ObjectA is properly calculating a higher score than ObjectB, just like I want, except for one difference: ObjectB's title fieldNorm is always higher than ObjectA's.
Here follows the explain printout. Just so you know: the title field is mditem5_tns and the description field is mditem7_tns:
ObjectB:
1.3327172 = (MATCH) sum of:
1.0352166 = (MATCH) max plus 0.1 times others of:
0.9766194 = (MATCH) weight(mditem5_tns:appl in 0), product of:
0.53929156 = queryWeight(mditem5_tns:appl), product of:
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.8109303 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
1.0 = tf(termFreq(mditem5_tns:appl)=1)
1.8109303 = idf(docFreq=3, maxDocs=9)
1.0 = fieldNorm(field=mditem5_tns, doc=0)
0.58597165 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
0.8 = boost
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.3581977 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
2.0 = tf(termFreq(mditem7_tns:appl)=4)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.375 = fieldNorm(field=mditem7_tns, doc=0)
0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
0.999001 = 1000.0/(1.0*float(1)+1000.0)
1.0 = boost
0.2977981 = queryNorm
ObjectA:
1.2324848 = (MATCH) sum of:
0.93498427 = (MATCH) max plus 0.1 times others of:
0.8632177 = (MATCH) weight(mditem5_tns:appl in 0), product of:
0.53929156 = queryWeight(mditem5_tns:appl), product of:
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.6006513 = (MATCH) fieldWeight(mditem5_tns:appl in 0), product of:
1.4142135 = tf(termFreq(mditem5_tns:appl)=2)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.625 = fieldNorm(field=mditem5_tns, doc=0)
0.7176658 = (MATCH) weight(mditem7_tns:appl^0.8 in 0), product of:
0.43143326 = queryWeight(mditem7_tns:appl^0.8), product of:
0.8 = boost
1.8109303 = idf(docFreq=3, maxDocs=9)
0.2977981 = queryNorm
1.6634457 = (MATCH) fieldWeight(mditem7_tns:appl in 0), product of:
2.4494898 = tf(termFreq(mditem7_tns:appl)=6)
1.8109303 = idf(docFreq=3, maxDocs=9)
0.375 = fieldNorm(field=mditem7_tns, doc=0)
0.2975006 = (MATCH) FunctionQuery(1000.0/(1.0*float(top(rord(lastmodified)))+1000.0)), product of:
0.999001 = 1000.0/(1.0*float(1)+1000.0)
1.0 = boost
0.2977981 = queryNorm
The problem is caused by the stemmer. It expands "apples are apples" to "apples appl are apples appl", thus making the field longer. As document B only contains one term that is expanded by the stemmer, its title field stays shorter than document A's.
This results in different fieldNorms.
fieldNorm is computed from three components: the index-time boost on the field, the index-time boost on the document, and the field length. Assuming you are not supplying any index-time boost, the difference must come from field length.
Thus, since lengthNorm is higher for shorter field values, for B to have a higher fieldNorm value for the title, it must have a smaller number of tokens in the title than A.
See the following pages for a detailed explanation of Lucene scoring:
http://lucene.apache.org/java/2_4_0/scoring.html
http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
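As a rough reference, the norm that shows up as fieldNorm here is computed (for the DefaultSimilarity seen in the explain output, before its lossy byte encoding) as:
fieldNorm = documentBoost * fieldBoost * lengthNorm, with lengthNorm = 1 / sqrt(numberOfTermsInField)
so every extra token emitted by the analysis chain (such as the stemmed duplicates described above) lowers the norm, and with it the field's contribution to the score.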
