I'm learning Solr/Lucene syntax, using the Solr Admin UI in the browser.
There I try to search for the same word in two different fields, with the following syntax:
content:myword -> results found
content:myword OR title:existingTitle -> results found
but
content:myword OR title:myword -> ZERO results found. Why? It is an "OR".
I also tried it without an operator (which should be equivalent to "OR"), and with "|" and "||".
This happens whenever I try to find the same word in one of multiple fields.
[edit]
Here are the Solr URL requests:
content:fahrzeug title:fahrzeug
http://xxx/solr/core_de/select?q=content%3Afahrzeug%20title%3Afahrzeug
content:fahrzeug OR title:fahrzeug
http://xxx/solr/core_de/select?q=content%3Afahrzeug%20OR%20title%3Afahrzeug
content:fahrzeug | title:fahrzeug
http://xxx/solr/core_de/select?q=content%3Afahrzeug%20%7C%20title%3Afahrzeug
{
  "responseHeader":{
    "status":400,
    "QTime":5,
    "params":{
      "q":"content:fahrzeug OR title:fahrzeug",
      "debugQuery":"1"}},
  "error":{
    "metadata":[
      "error-class","org.apache.solr.common.SolrException",
      "root-error-class","org.apache.solr.common.SolrException"],
    "msg":"invalid boolean value: 1",
    "code":400}}
(Note: this 400 seems to come from debugQuery=1 - Solr's boolean parameters expect values like true/false - not from the OR query itself.)
I guess it is configured like this.
Try:
http://www119.pxia.de:8983/solr/core_de/select?fq=content%3Afahrzeug%20title%3Afahrzeug&q=*%3A* - this returns the correct documents, so the documents are there when only filtering is used. The q parameter goes through more complex handling; your default configuration is:
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
    <str name="qf">content^40.0 title^5.0 keywords^2.0 tagsH1^5.0 tagsH2H3^3.0 tagsH4H5H6^2.0 tagsInline^1.0</str>
    <str name="pf">content^2.0</str>
    <str name="df">content</str>
    <int name="ps">15</int>
    <str name="mm">2<-35%</str>
    <str name="mm.autoRelax">true</str>
    ...
The parser and boosting may play a key role here.
I am not familiar with the edismax parser; please check the documentation.
I would guess the mm (minimum should match) parameter is causing this.
Anyway, it is strange that OR does not work the way we are used to from boolean algebra.
"debug":{
"queryBoosting":{
"q":"title:Home OR content:Perfekt",
"match":null},
"rawquerystring":"title:Home OR content:Perfekt",
"querystring":"title:Home OR content:Perfekt",
"parsedquery":"+(title:hom content:perfekt)~2 ()",
"parsedquery_toString":"+((title:hom content:perfekt)~2) ()",
"explain":{
"bf72a75534ba703e4b8dc7194f92ce34223fc0d2/pages/1/0/0/0":"\n4.8893824 = sum of:\n 4.8893824 = sum of:\n 1.9924302 = weight(title:hom in 0) [SchemaSimilarity], result of:\n 1.9924302 = score(doc=0,freq=1.0 = termFreq=1.0\n), product of:\n 1.9924302 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n 1.0 = docFreq\n 10.0 = docCount\n 1.0 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1) from:\n 1.0 = termFreq=1.0\n 1.2 = parameter k1\n 0.0 = parameter b (norms omitted for field)\n 2.8969522 = weight(content:perfekt in 0) [SchemaSimilarity], result of:\n 2.8969522 = score(doc=0,freq=5.0 = termFreq=5.0\n), product of:\n 1.4816046 = idf, computed as log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5)) from:\n 2.0 = docFreq\n 10.0 = docCount\n 1.9552802 = tfNorm, computed as (freq * (k1 + 1)) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength)) from:\n 5.0 = termFreq=5.0\n 1.2 = parameter k1\n 0.75 = parameter b\n 508.3 = avgFieldLength\n 184.0 = fieldLength\n"},
"QParser":"ExtendedDismaxQParser",
Check "parsedquery":"+(title:hom content:perfekt)~2 ()" - the ~2 is the minimum-should-match value that edismax derived from your mm setting, and it effectively says that both title and content must match:
Solr operators
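In Lucene terms, what edismax built here is a BooleanQuery with two SHOULD clauses and minimumNumberShouldMatch=2. A minimal sketch of that equivalent query (Lucene 5.x-style API; field and term values are taken from the parsed query above):
import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.TermQuery;

// Rough equivalent of the parsed query "+(title:hom content:perfekt)~2":
BooleanQuery.Builder builder = new BooleanQuery.Builder();
builder.add(new TermQuery(new Term("title", "hom")), Occur.SHOULD);
builder.add(new TermQuery(new Term("content", "perfekt")), Occur.SHOULD);
builder.setMinimumNumberShouldMatch(2); // the "~2": both SHOULD clauses must match
BooleanQuery query = builder.build();
So with mm resolving to 2 for a two-clause query, the OR behaves like an AND; with mm=1 (or a low percentage) the plain OR behaviour comes back.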
I have a simple setup with IndexSearcher, QueryParser and SimpleAnalyzer.
Running some queries, I noticed that a query with more than one term returns a different ScoreDoc[i].score than the one shown in the explain output. Apparently it is the score shown in explain divided by the number of search terms. Any explanation for this behaviour?
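The setup looks roughly like this (a minimal sketch with a Lucene 5.x-style API; index opening and exception handling omitted; the field name "line" is taken from the output below):
import org.apache.lucene.analysis.core.SimpleAnalyzer;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;

// Parse an OR query over the "line" field, then compare the ranking score
// of each hit with its explain breakdown:
IndexSearcher searcher = new IndexSearcher(reader); // reader: an open IndexReader
Query query = new QueryParser("line", new SimpleAnalyzer()).parse("TERM1 TERM2 TERM3");
TopDocs hits = searcher.search(query, 10);
for (ScoreDoc sd : hits.scoreDocs) {
    System.out.println("score: " + sd.score);             // the ScoreDoc score
    System.out.println(searcher.explain(query, sd.doc));  // the explain breakdown
}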
Running search(TERM1 TERM2 TERM3)
line:term1 line:term2 line:term3
2.167882 = sum of:
0.6812867 = weight(line:term1 in 6594) [DefaultSimilarity], result of:
0.6812867 = score(doc=6594,freq=2.0), product of:
0.5389907 = queryWeight
totalHits 1
1678413725, TERM1 TERM2 TERM3, score: 0.72262734
I understand the coord() factor is used to penalise documents which include only a subset of the search terms provided. However, this document includes all terms. Any suggestions?
EDIT: it seems the division only occurs if the query uses OR statements instead of AND. So with OR queries, even when all terms match, the score is still divided by the number of terms in the query. I couldn't find this part in the documentation, but at least it explains the difference.
However, applying a QueryWrapperFilter seems to change the scoring again, although according to the documentation it should only filter the results without affecting the scoring.
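For reference, the coord() factor in DefaultSimilarity is just the fraction of query clauses that matched; its implementation is a one-liner:
// From DefaultSimilarity: "0.33333334 = coord(1/3)" in the explains below
// means 1 matching clause out of 3.
public float coord(int overlap, int maxOverlap) {
    return overlap / (float) maxOverlap;
}
With AND, every clause is required, so matching documents always have overlap == maxOverlap and coord is 1 - which fits the observation that the division only shows up with OR.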
More details
These two scores are the result of the same query. Only the second one gets divided:
0.114700586 = product of:
0.34410176 = sum of:
0.34410176 = weight(line:term1 in 24) [DefaultSimilarity], result of:
0.34410176 = score(doc=24,freq=1.0), product of:
0.5389907 = queryWeight, product of:
8.17176 = idf(docFreq=14, maxDocs=19532)
0.065957725 = queryNorm
0.63841873 = fieldWeight in 24, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
8.17176 = idf(docFreq=14, maxDocs=19532)
0.078125 = fieldNorm(doc=24)
0.33333334 = coord(1/3)
item_id: 1495958818, item_name: term 1 dolor sit met, score: 0.114700586
0.18352094 = product of:
0.5505628 = sum of:
0.5505628 = weight(line:term 1 in 6112) [DefaultSimilarity], result of:
0.5505628 = score(doc=6112,freq=1.0), product of:
0.5389907 = queryWeight, product of:
8.17176 = idf(docFreq=14, maxDocs=19532)
0.065957725 = queryNorm
1.02147 = fieldWeight in 6112, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
8.17176 = idf(docFreq=14, maxDocs=19532)
0.125 = fieldNorm(doc=6112)
0.33333334 = coord(1/3)
item_id: 1677761523, item_name: some text term 1, score: 0.061173648
Hello, when I execute the following query
fl=id model label timestamp score uuid&echoParams=all&qf=label^6 content_level_high^5 content_level_middle^2 content_level_less^1&hl.fl=teaser&wt=xml&rows=9&defType=edismax&facet=true&bq=model:"Component"^10 model:"Object"^90 model:"Address"^1 model:"eav_table_54f5d74b4efef9.49994240"^14&debugQuery=on&start=0&q=Fraumünster
The same query, formatted to be easier to read:
defType=edismax
fl=id model label timestamp score uuid
qf=label^6 content_level_high^5 content_level_middle^2 content_level_less^1
bq=model:"Component"^10 model:"Object"^90 model:"Address"^1 model:"eav_table_54f5d74b4efef9.49994240"^14
q=Fraumünster
start=0
rows=9
wt=xml
facet=true
echoParams=all
debugQuery=on
hl.fl=teaser
against a Solr 3.6.2 server, it seems that the boost on the "model" field is totally ignored,
because all entries get the same score from a single hit in "label".
In my opinion, the order should therefore follow the boost queries.
Here is a full explain:
http://explain.solr.pl/explains/ipu6qacf
The raw query result:
http://pastebin.com/3uFdd8uw
Solr schema (for solr 5.x):
http://pastebin.com/0pZB5gDt
Solr config:
http://pastebin.com/Kd6W2nYD
The documents, in Solr add syntax:
http://pastebin.com/HMBrwAWV
Does anyone have an idea what is wrong with the boost query?
Please specify each boost query as a separate parameter:
bq=model:"Component"^10&bq=model:"Object"^90&bq=model:"Address"^1&bq=model:"eav_table_54f5d74b4efef9.49994240"^14
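If you build the query in code, the same fix means calling add (not set) once per boost query. A SolrJ sketch (client setup omitted; only the bq handling matters here):
import org.apache.solr.client.solrj.SolrQuery;

SolrQuery query = new SolrQuery("Fraumünster");
query.set("defType", "edismax");
query.set("qf", "label^6 content_level_high^5 content_level_middle^2 content_level_less^1");
// One bq parameter per boost query, instead of one space-separated bq value:
query.add("bq", "model:\"Component\"^10");
query.add("bq", "model:\"Object\"^90");
query.add("bq", "model:\"Address\"^1");
query.add("bq", "model:\"eav_table_54f5d74b4efef9.49994240\"^14");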
Then the query is parsed correctly and the boost queries are recognized in the relevance calculation:
+(content_level_less:chang | label:chang^6.0 | content_level_high:chang^5.0 | content_level_middle:chang^2.0) model:Component^10.0 model:Object^90.0 model:Address model:eav_table_54f5d74b4efef9.49994240^14.0
0.1813628 = (MATCH) sum of:
  0.13184154 = (MATCH) max of:
    0.13184154 = (MATCH) weight(label:chang^6.0 in 4) [DefaultSimilarity], result of:
      0.13184154 = score(doc=4,freq=3.0), product of:
        0.041205455 = queryWeight, product of:
          6.0 = boost
          1.8472979 = idf(docFreq=2, maxDocs=7)
          0.003717633 = queryNorm
        3.1996138 = fieldWeight in 4, product of:
          1.7320508 = tf(freq=3.0), with freq of:
            3.0 = termFreq=3.0
          1.8472979 = idf(docFreq=2, maxDocs=7)
          1.0 = fieldNorm(doc=4)
  0.04952125 = (MATCH) weight(model:Component^10.0 in 4) [DefaultSimilarity], result of:
    0.04952125 = score(doc=4,freq=1.0), product of:
      0.04290709 = queryWeight, product of:
        10.0 = boost
        1.1541507 = idf(docFreq=5, maxDocs=7)
        0.003717633 = queryNorm
      1.1541507 = fieldWeight in 4, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        1.1541507 = idf(docFreq=5, maxDocs=7)
        1.0 = fieldNorm(doc=4)
I hope that helps!
I'm trying to get a better understanding of how Lucene scored my search, so that I can make the necessary tweaks to my search configuration or the document content.
Below is part of the score breakdown.
product of:
0.34472802 = queryWeight, product of:
2.2 = boost
7.880174 = idf(docFreq=48, maxDocs=47667)
0.019884655 = queryNorm
1.9700435 = fieldWeight in 14363, product of:
1.0 = tf(freq=1.0), with freq of:
1.0 = termFreq=1.0
7.880174 = idf(docFreq=48, maxDocs=47667)
0.25 = fieldNorm(doc=14363)
0.26806915 = (MATCH) max of:
0.07832639 = (MATCH) weight(shortDescription:tires^1.1 in 14363) [DefaultSimilarity], result of:
0.07832639 = score(doc=14363,freq=1.0 = termFreq=1.0
I understand how the boost is calculated, as that is my configured value.
But how was the idf calculated (idf = 7.880174)?
According to the Lucene docs, the idf formula is: idf(t) = 1 + log(numDocs/(docFreq+1))
I checked the core admin console and found numDocs = maxDocs = 47667.
Using the formula from Lucene, I was not able to arrive at the expected 7.880174. Instead I get: idf = 3.988 = 1 + log(47667/(48+1)).
Is there something I am missing in my formula?
I think your log function uses base 10, while Lucene uses base e:
log(47667/(48+1), 10) = 2.9880217397306
log(47667/(48+1), e) = 6.8801743154459
and 1 + 6.8801743154459 = 7.8801743154459, which matches the 7.880174 in your explain output.
The source code of Lucene's idf method is:
public float idf(int docFreq, int numDocs) {
return (float)(Math.log(numDocs/(double)(docFreq+1)) + 1.0);
}
As you can see, idf uses Java's Math.log to calculate the value, and Math.log is the natural (base-e) logarithm. See the Java Math API for details.
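A quick check in plain Java, with the numbers from the explain output above:
// idf = 1 + ln(numDocs / (docFreq + 1)); Math.log is base e
double idf = 1.0 + Math.log(47667.0 / (48 + 1));
System.out.println(idf); // ~7.8801743154459, matching the 7.880174 in the explain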
Looks like the Lucene site has a typo.
http://lucene.apache.org/core/3_6_0/api/core/org/apache/lucene/search/Similarity.html
states 1 + log(numDocs/(docFreq+1))
but it is actually 1 + ln(numDocs/(docFreq+1))
I have the following records, and the scores against them when I search for "iphone":
Record1:
FieldName - DisplayName : "Iphone"
FieldName - Name : "Iphone"
11.654595 = (MATCH) sum of:
11.654595 = (MATCH) max plus 0.01 times others of:
7.718274 = (MATCH) weight(DisplayName:iphone^10.0 in 915195), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(DisplayName:iphone in 915195), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=DisplayName, doc=915195)
11.577413 = (MATCH) weight(Name:iphone^15.0 in 915195), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
11.598244 = (MATCH) fieldWeight(Name:iphone in 915195), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
1.0 = fieldNorm(field=Name, doc=915195)
Record2:
FieldName - DisplayName : "The Iphone Book"
FieldName - Name : "The Iphone Book"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 453681), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 453681), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=453681)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 453681), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 453681), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=453681)
Record3:
FieldName - DisplayName: "iPhone"
FieldName - Name: "iPhone"
7.284122 = (MATCH) sum of:
7.284122 = (MATCH) max plus 0.01 times others of:
4.823921 = (MATCH) weight(DisplayName:iphone^10.0 in 5737775), product of:
0.6654692 = queryWeight(DisplayName:iphone^10.0), product of:
10.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(DisplayName:iphone in 5737775), product of:
1.0 = tf(termFreq(DisplayName:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=DisplayName, doc=5737775)
7.2358828 = (MATCH) weight(Name:iphone^15.0 in 5737775), product of:
0.99820393 = queryWeight(Name:iphone^15.0), product of:
15.0 = boost
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.0057376726 = queryNorm
7.2489023 = (MATCH) fieldWeight(Name:iphone in 5737775), product of:
1.0 = tf(termFreq(Name:iphone)=1)
11.598244 = idf(docFreq=484, maxDocs=19431244)
0.625 = fieldNorm(field=Name, doc=5737775)
Why do Record2 and Record3 have the same score, when Record2 has 3 words and Record3 has just one word? Record3 should have higher relevancy than Record2. Why is the fieldNorm of Record2 and Record3 the same?
QueryParser: Dismax
FieldType: text fieldtype as default in solrconfig.xml
Adding DataFeed:
Record1: Iphone
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"190502",
"EAN":"9780596804299",
"ISBN":"0596804296",
"Author":"Pogue, David",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2009-08-07 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"397",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Pogue, David",
"DisplayName":"Iphone",
"url":"/iphone-pogue-david/books/9780596804299.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Oreilly & Associates Inc",
"Name":"Iphone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780596804299"
}
Record2: The Iphone Book
{
"ListPrice":1184.526,
"ShipsTo":1,
"OID":"94694",
"EAN":"9780321534101",
"ISBN":"0321534107",
"Author":"Kelby, Scott/ White, Terry",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"24.9900",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":1,
"COD":0,
"PublicationDate":"2007-08-13 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Handheld Devices",
"Binding":"Paperback",
"BAMcategory_src":"Computers",
"Category_fq":"Computers",
"ShippingCharges":"0",
"OIDType":8,
"Pages":"219",
"CallOrder":"0",
"TrackInventory":"Ingram",
"Author_fq":"Kelby, Scott/ White, Terry",
"DisplayName":"The Iphone Book",
"url":"/iphone-book-kelby-scott-white-terry/books/9780321534101.htm",
"CurrencyType":"USD",
"SubSubCategory":" Handheld Devices",
"BAMcategory_fq":"Computers",
"Mask":0,
"Publisher":"Pearson P T R",
"Name":"The Iphone Book",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9780321534101"
}
Record 3: iPhone
{
"ListPrice":278.46,
"ShipsTo":1,
"OID":"694715",
"EAN":"9781411423527",
"ISBN":"1411423526",
"Author":"Quamut (COR)",
"product_type_fq":"Books",
"ShipmentDurationDays":"21",
"CurrencyValue":"5.9500",
"ShipmentDurationText":"NORMALLY SHIPS IN 21 BUSINESS DAYS",
"Availability":0,
"COD":0,
"PublicationDate":"2010-08-03 00:00:00.0",
"Discount":"25",
"SubCategory_fq":"Hardware",
"Binding":"Paperback",
"Category_fq":"Non Classifiable",
"ShippingCharges":"0",
"OIDType":8,
"CallOrder":"0",
"TrackInventory":"BNT",
"Author_fq":"Quamut (COR)",
"DisplayName":"iPhone",
"url":"/iphone-quamut-cor/books/9781411423527.htm",
"CurrencyType":"USD",
"SubSubCategory":"Handheld Devices",
"Mask":0,
"Publisher":"Sterling Pub Co Inc",
"Name":"iPhone",
"Language":"English",
"DisplayPriority":"999",
"rowid":"books_9781411423527"
}
fieldNorm takes into account the field length, i.e. the number of terms.
The field type used for the fields DisplayName and Name is text, which includes the stopword and word delimiter filters.
Record 1 - Iphone
Would generate a single token - Iphone
Record 2 - The Iphone Book
Would generate 2 tokens - Iphone, Book
"The" would be removed by the stopword filter.
Record 3 - iPhone
Would also generate 2 tokens - i, Phone
As iPhone has a case change, the word delimiter filter with splitOnCaseChange splits iPhone into the 2 tokens i and Phone, which produces the same fieldNorm as Record 2.
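A minimal sketch of that token stream (Lucene 5.x-style API; the actual schema's analysis chain has more stages, this only shows the case-change split):
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
tokenizer.setReader(new StringReader("iPhone"));
TokenStream stream = new WordDelimiterFilter(
    tokenizer,
    WordDelimiterFilter.GENERATE_WORD_PARTS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
    null); // no protected words
CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
stream.reset();
while (stream.incrementToken()) {
    System.out.println(term.toString()); // prints "i", then "Phone"
}
stream.end();
stream.close();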
This is the answer to user1021590's follow-up question/answer on the "da vinci code" search example.
The reason all the documents get the same score is a subtle implementation detail of lengthNorm. The Lucene TFIDFSimilarity doc states the following about norm(t, d):
the resulted norm value is encoded as a single byte before being stored. At search time, the norm byte value is read from the index directory and decoded back to a float norm value. This encoding/decoding, while reducing index size, comes with the price of precision loss - it is not guaranteed that decode(encode(x)) = x. For instance, decode(encode(0.89)) = 0.75.
If you dig into the code, you see that this float-to-byte encoding is implemented as follows:
public static byte floatToByte315(float f)
{
    // Compress the float's exponent and top mantissa bits into 8 bits
    // (the name "315" = 3 mantissa bits, exponent zero point 15).
    int bits = Float.floatToRawIntBits(f);
    int smallfloat = bits >> (24 - 3);
    if (smallfloat <= ((63 - 15) << 3))
    {
        // Underflow: zero/negative maps to 0, tiny positive values to 1.
        return (bits <= 0) ? (byte) 0 : (byte) 1;
    }
    if (smallfloat >= ((63 - 15) << 3) + 0x100)
    {
        // Overflow: clamp to the largest encodable value (0xFF).
        return -1;
    }
    return (byte) (smallfloat - ((63 - 15) << 3));
}
and the decoding of that byte to float is done as:
public static float byte315ToFloat(byte b)
{
    // Inverse mapping: rebuild the float's bit pattern from the 8-bit code.
    if (b == 0)
        return 0.0f;
    int bits = (b & 0xff) << (24 - 3);
    bits += (63 - 15) << 24;
    return Float.intBitsToFloat(bits);
}
lengthNorm is calculated as 1 / sqrt( number of terms in field ). This is then encoded for storage using floatToByte315. For a field with 3 terms, we get:
floatToByte315( 1/sqrt(3.0) ) = 120
and for a field with 4 terms, we get:
floatToByte315( 1/sqrt(4.0) ) = 120
so both of them get decoded to:
byte315ToFloat(120) = 0.5.
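A quick way to see this, reusing the two methods above (the range of field lengths is arbitrary):
// Encode 1/sqrt(terms) for a few field lengths and decode the stored byte:
for (int terms = 1; terms <= 5; terms++) {
    byte encoded = floatToByte315((float) (1.0 / Math.sqrt(terms)));
    System.out.println(terms + " terms -> byte " + encoded
            + " -> norm " + byte315ToFloat(encoded));
}
// Both 3 and 4 terms print "byte 120 -> norm 0.5", so their fieldNorms collide.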
The doc also states this:
The rationale supporting such lossy compression of norm values is that given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.
UPDATE: As of Solr 4.10, this implementation and corresponding statements are part of DefaultSimilarity.