I am reviewing the similarity calculations performed by the DefaultSimilarity class in Lucene, as invoked by Solr. Specifically, I am not clear on how field normalization is calculated when the Solr query doesn't reference a specific field.
norm(t,d) = doc.getBoost() · lengthNorm · ∏ f.getBoost(), with the product taken over every field f in d named as t
where
doc.getBoost() = document's boost specified at index time
f.getBoost() = field's boost specified at index time
lengthNorm = factor computed at index time from the number of terms/tokens in the field
My question is: if a Solr query is specified as
/select?q=indian cricket&rows=5&wt=json
without reference to a specific field in schema.xml, how is norm(t,d) calculated? Is it calculated for every field the term is found in? If so, how are these combined?
Thanks in advance for your insights!
Terms without a field name will use the defaultSearchField setting from the schema, the df (default field) query parameter, or the qf (query fields) parameter if using (e)dismax, and the terms will be prefixed with the field name. Each field/term combination for each queried field will then be used to evaluate the norm.
Use the debugQuery feature of Solr to see each scored part and how it affects the score.
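As a rough illustration of the formula quoted in the question, the sketch below computes norm(t,d) separately for each field a term is queried against. It assumes DefaultSimilarity's length norm of 1/sqrt(numTerms); the class and method names are made up for this example, the term counts are invented, and it ignores the lossy single-byte encoding Lucene applies when it stores norms.

```java
// Illustrative only: NormSketch and its methods are hypothetical names, not Lucene API.
final class NormSketch {

    // DefaultSimilarity-style length norm: shorter fields get a larger value.
    static double lengthNorm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // norm(t,d) = doc boost * lengthNorm * product of the boosts of the
    // field instances in d that share the field name of t.
    static double norm(double docBoost, int numTerms, double... fieldBoosts) {
        double product = 1.0;
        for (double boost : fieldBoosts) {
            product *= boost;
        }
        return docBoost * lengthNorm(numTerms) * product;
    }

    public static void main(String[] args) {
        // The same document queried in two different fields: each
        // field/term combination gets its own norm (term counts are made up).
        System.out.println("title norm: " + norm(1.0, 4, 1.0));
        System.out.println("body norm:  " + norm(1.0, 100, 1.0));
    }
}
```

The per-field norms are not merged into one value; each feeds the score of its own field/term clause, which is what the debugQuery output lists clause by clause.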
Related
I am trying to understand the root cause of an issue with my Solr search query. The code below is SolrJ client code.
query.setStart(0);
query.setRows(1000);
query.set("debugQuery", true);
query.set("defType", "edismax");
query.setQuery("title:business OR statistics) OR (name:business OR statistics)");
query.add("fq", "bsuiness_id:(101 102)");
query.add("tie", "0.1");
query.set("bq","weight:[0 TO 500]^1 weight:[501 TO 1000]^3");
returns 200 search results
query.setStart(0);
query.setRows(1000);
query.set("debugQuery", true);
query.set("defType", "edismax");
query.setQuery("title:statistics OR business) OR (name:statistics OR business)");
query.add("fq", "bsuiness_id:(101 102)");
query.add("tie", "0.1");
query.set("bq","weight:[0 TO 500]^1 weight:[501 TO 1000]^3");
returns 100 search results
My understanding is that the keywords "business statistics" and "statistics business" should yield the same results. However, as you can see above, they do not.
Can someone please provide any pointers about what is missing?
The two queries are not the same. (And you're missing a ( at the start)
title:business OR statistics) OR (name:business OR statistics)
searches for business in the title field and statistics in the default search field (since it doesn't seem like you have a qf parameter), or business in the name field and again, statistics in the default search field.
So in effect:
title = business or name = business or statistics in default search field
Your second query:
title:statistics OR business) OR (name:statistics OR business)
.. searches for statistics in the title field, or business in the default search field, or statistics in the name field, or business (again) in the default search field. In effect:
title = statistics or name = statistics or business in the default search field
.. as you can see, these two queries are not the same. The field: prefix is only valid for the token that follows right behind it - not for those other tokens.
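To make the binding rule concrete, here is a deliberately tiny sketch. It is not Lucene's query parser: it only splits on whitespace, skips the OR operator, strips stray parentheses, and gives each term either its own field: prefix or a placeholder default field. The class name and the default field name "text" are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Toy model only: this is NOT how Lucene parses queries. It just mimics the
// rule that a field: prefix applies to the single term right after it.
final class FieldBindingSketch {

    static List<String> bind(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            // Strip stray parentheses; this toy does not model the
            // field:(a OR b) grouping form.
            String term = token.replace("(", "").replace(")", "");
            if (term.isEmpty() || term.equals("OR")) {
                continue;
            }
            // A term without its own prefix falls back to the default field.
            clauses.add(term.contains(":") ? term : defaultField + ":" + term);
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(bind("title:business OR statistics) OR (name:business OR statistics)", "text"));
        System.out.println(bind("title:statistics OR business) OR (name:statistics OR business)", "text"));
    }
}
```

Run on the two queries from the question, the first produces title:business and name:business with statistics in the default field, and the second produces the reverse, so the two clause lists differ.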
Since you're using the edismax handler, I suggest you rewrite this to use the qf parameter (query fields) instead, which tells Solr which fields to query. Your two examples can then be simplified to:
q=statistics business&qf=name title
.. which searches for statistics and business in the two fields named in the qf parameter. You can use q.op=OR to get hits where any of the terms are present (as in your example), or q.op=AND to get hits where both are present.
In that case statistics business and business statistics as the query will give you the same result.
If you want to use the explicit syntax (aka the Lucene syntax), you can use the form field:(term1 OR term2) - title:(business OR statistics) OR name:(business OR statistics) - but since you're already using the edismax handler, I recommend using the built-in support for more natural queries and using qf to say which fields to search. You can also use weights with qf to weigh hits in the two fields differently - qf=name^3 title will give three times the weight to any hits in the name field.
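For reference, a minimal sketch of how the request parameters suggested above could be assembled and URL-encoded with plain Java (Java 10 or later for this URLEncoder overload). The class and helper names are hypothetical; only the parameter names defType, q, qf and q.op and the field names come from the answer. With SolrJ you would set the same name/value pairs on the query object instead.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Hypothetical helper, not part of SolrJ.
final class EdismaxParamsSketch {

    // The edismax parameters from the answer, in a stable order.
    static Map<String, String> edismaxParams(String userQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("defType", "edismax");
        params.put("q", userQuery);          // plain terms, no field: prefixes
        params.put("qf", "name^3 title");    // fields to search, name weighted 3x
        params.put("q.op", "OR");            // any term may match
        return params;
    }

    // URL-encodes each name/value pair and joins them with '&'.
    static String toQueryString(Map<String, String> params) {
        StringJoiner joined = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            joined.add(URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                    + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return joined.toString();
    }

    public static void main(String[] args) {
        System.out.println("/select?" + toQueryString(edismaxParams("statistics business")));
    }
}
```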
How can I boost a record depending on the value of a field in Solr?
Reference link: https://wiki.apache.org/solr/SolrRelevancyFAQ#How_can_I_increase_the_score_for_specific_documents
But it is not clear to me how to apply it in my case.
I get some records back after a search.
How can I move the records with Id 5, 8, 17 and 1 up a few positions? Not to the top of the list, just a modest boost, because their price is higher.
This is my raw query:
select?mm=100%25&version=2.2&q=(book)&defType=edismax&spellcheck.q=(book)&qf=Price^10+Name^1+nGramContent&spellcheck=true&stats=true&facet.mincount=1&facet=true&spellcheck.collate=true&stats.field=Price&rows=50&indent=on&wt=json&fl=Id,score,Price
Please help me.
Thanks!
The qf parameters are for hits in the field and will not affect the ranking unless the query produces a hit in the field. Your example would require you to search for the price (and not book) for anything to be boosted by the qf=Price^10 argument.
The FAQ you've linked to answers your question, just not the question you've referenced: How can I change the score of a document based on the value of a field. From the example (replace popularity with price for your case):
# simple boosts by popularity
defType=dismax&qf=text&q=supervillians&bf=popularity
q={!boost b=popularity}text:supervillians
# boosts based on complex functions of the popularity field
defType=dismax&qf=text&q=supervillians&bf=sqrt(popularity)
q={!boost b=sqrt(popularity)}text:supervillians
edismax makes the {!boost} (multiplicative boost) available as the boost= parameter as well, so you can reference it directly instead of having it in your query.
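The practical difference between the two forms in the FAQ excerpt is how the function value is combined with the text relevance score: bf adds it, while {!boost} (and the boost= parameter) multiplies by it. The arithmetic below is a simplified sketch with made-up scores and prices; real Solr scores involve further factors, and the class and method names are hypothetical.

```java
// Simplified arithmetic only; not Solr code.
final class BoostArithmeticSketch {

    // bf=sqrt(price): the function value is added to the text score.
    static double additive(double textScore, double price) {
        return textScore + Math.sqrt(price);
    }

    // {!boost b=sqrt(price)} or boost=sqrt(price): the text score is multiplied.
    static double multiplicative(double textScore, double price) {
        return textScore * Math.sqrt(price);
    }

    public static void main(String[] args) {
        double textScore = 2.0; // made-up relevance score for the term "book"
        for (double price : new double[] {100.0, 400.0}) {
            System.out.println("price=" + price
                    + " additive=" + additive(textScore, price)
                    + " multiplicative=" + multiplicative(textScore, price));
        }
    }
}
```

With an additive boost a large field value can swamp the text score entirely, whereas a multiplicative boost scales whatever relevance the document already has, so a dampening function such as sqrt or log is a common way to get "a few positions up" rather than "top of the list".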
I'm indexing several different fields in a document using Apache SOLR 3.6.
When I search for a term, Solr returns all the occurrences of the term in each field. However, the score is the same no matter which field the term occurred in. For example, if USC occurs in the title field and in the contents field, both matches get the same score.
Is there a way to index a document of different fields and have a weighted score based on the type of field within the document?
Use dismax or edismax and set the qf (query fields) parameter to something like this to give the title more weight than the body:
qf=title^3 body
I have ngram-indexed 2 fields (columns in the database) and the third one is my full text field. Now my default text field is the full text field and while querying I use dismax handler and specify in it both the ngrammed field with certain boost values and also full text field with a certain boost value.
My problem is this: if I don't use dismax and just search the full text field (i.e. the default field specified in the schema), synonyms work correctly, i.e. ca returns all results containing california. If I use dismax, however, ca is also searched in the ngrammed fields, which returns partial matches of the word ca, and the synonym expansion is not applied at all.
I want to use synonyms in every case, so how should I go about it?
Make sure you have correctly configured the SynonymFilterFactory filter in the query analyzer of your ngram fields.
If it still doesn't work, the analysis page in the Solr admin interface shows the output of each tokenizer/filter step, so you can check whether the synonym part works as expected.
I've got a text field that can potentially have multiple values.
doc 1:
field a:"X Y"
doc 2:
field a:"X"
I want to be able to do :
a:X^5
And have both doc 1 and 2 get an identical score.
I've been messing around with all the field options, but I always end up with doc 2 getting double the score of doc 1.
I've tried setting multiValued="true", but get the same result.
Is there some way that I can set up my search or the field definition so that it boosts based only on the existence of the search term and is not affected by the rest of the field's contents?
Disable norms by setting omitNorms=true in your schema and reindex - it should disable the length normalization for the field and give you the desired results.
For more details of what omitNorms does, see this.
The field a of doc 2 has only one term as compared to doc 1 which has two.
Solr's DefaultSimilarity implementation takes the length norm, which is based on the number of terms in the field, into account when calculating the score.
lengthNorm is 1.0 / Math.sqrt(numTerms)
LengthNorm allows you to make shorter documents score higher.
You can provide your own implementation of Similarity class which doesn't take into account the lengthNorm.
Check computeNorm method implementation.
You can turn off norms using omitNorms=true.
Norms allow for index time boosts and field length normalization. This allows you to add boosts to fields at index time and makes shorter documents score higher.
So you would lose both of the above if you turn norms off.
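A small sketch of the effect on the two documents from the question, assuming the 1.0 / Math.sqrt(numTerms) length norm given above and a plain whitespace tokenizer. The class and method names are hypothetical, and the lossy byte encoding of stored norms is ignored.

```java
// Hypothetical names; a sketch of the length factor only, not Lucene code.
final class OmitNormsSketch {

    // Counts whitespace-separated tokens, standing in for the field's analyzer.
    static int countTerms(String fieldValue) {
        return fieldValue.trim().split("\\s+").length;
    }

    // Length factor applied to a match: 1/sqrt(numTerms) when norms are
    // stored, a constant 1.0 when the field has omitNorms=true.
    static double lengthFactor(String fieldValue, boolean omitNorms) {
        return omitNorms ? 1.0 : 1.0 / Math.sqrt(countTerms(fieldValue));
    }

    public static void main(String[] args) {
        for (boolean omitNorms : new boolean[] {false, true}) {
            System.out.println("omitNorms=" + omitNorms
                    + " doc1(\"X Y\")=" + lengthFactor("X Y", omitNorms)
                    + " doc2(\"X\")=" + lengthFactor("X", omitNorms));
        }
    }
}
```

With norms stored, doc 2 gets a factor of 1.0 against roughly 0.707 for doc 1, so it scores higher for a:X; with omitNorms=true both factors are 1.0 and the field length no longer separates the two documents.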