Solr will give as many highlights as I specify in hl.snippets, returning a list of highlights. What I want is the set of (2 or 3) highlight snippets that best match the query. Is there an innate Solr feature that does this?
The Unified Highlighter allows you to tell it how it should score the returned highlights. It should already do some scoring by default, so the first task would be to switch to the Unified Highlighter if you're not already using it.
You can then tweak how it uses BM25 to score the returned highlights:
hl.score.k1 [Optional] [Default: 1.2]
Specifies BM25 term frequency normalization parameter 'k1'. For example, it can be set to 0 to rank passages solely based on the number of query terms that match.
hl.score.b [Optional] [Default: 0.75]
Specifies BM25 length normalization parameter 'b'. For example, it can be set to "0" to ignore the length of passages entirely when ranking.
hl.score.pivot [Optional] [Default: 87]
Specifies BM25 average passage length in characters.
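As a rough sketch of how these parameters could be sent together, the snippet below builds a request query string in Python. The field name content and the query content:solr are made up for illustration; hl.method=unified is the parameter that selects the Unified Highlighter, and the hl.score.* defaults are the ones listed above.

```python
from urllib.parse import urlencode

def highlight_params(query, field, snippets=3, k1=1.2, b=0.75, pivot=87):
    """Build request parameters for a Unified Highlighter query."""
    return {
        "q": query,
        "hl": "true",
        "hl.method": "unified",    # select the Unified Highlighter
        "hl.fl": field,            # field to highlight (hypothetical name below)
        "hl.snippets": snippets,   # maximum number of snippets per field
        "hl.score.k1": k1,         # BM25 term frequency normalization
        "hl.score.b": b,           # BM25 length normalization
        "hl.score.pivot": pivot,   # BM25 average passage length in characters
    }

# Rank passages purely by how many query terms they match, ignoring length.
params = highlight_params("content:solr", "content", k1=0, b=0)
query_string = urlencode(params)
print(query_string)
```

The resulting string would be appended to the select handler URL of your own core; which values work best is something to try out against your data.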
Related
I'm attempting to query Solr for documents, given a basic schema with the following field names, data types irrelevant:
I'm attempting to match documents that match at least one of the following:
occupation, name, age, gender, but I want to OR them together
How do you OR together many terms and require the document to match at least one?
This seems to be failing: +(name:Sarah age:24 occupation:doctor gender:male)
How do you convert a boolean expression into Solr query syntax? I can't figure out the syntax with + and - and the default operator for OR.
I still don't fully understand your requirement, but you just need to query like this:
+(age:24 OR gender:male)
Or, if you want to match multiple values in the same field with an OR condition, i.e. get documents with age:24 as well as age:25:
+(age:24 OR age:25 OR gender:male)
which you can shorten to:
+(age:(24 25) OR gender:male)
If that isn't your requirement, let me know.
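If the query is assembled in client code, a small helper can produce this shape. The following is only a sketch: any_of is a made-up name, values are not escaped, and the grouped form age:(24 25) relies on OR being the default operator.

```python
def any_of(clauses):
    """Return a required group that matches documents with at least one clause.

    `clauses` maps a field name to a list of accepted values.
    Special characters in values are NOT escaped in this sketch.
    """
    parts = []
    for field, values in clauses.items():
        if len(values) == 1:
            parts.append(f"{field}:{values[0]}")
        else:
            # Group several values for the same field, e.g. age:(24 25)
            grouped = " ".join(str(v) for v in values)
            parts.append(f"{field}:({grouped})")
    return "+(" + " OR ".join(parts) + ")"

query = any_of({"age": [24, 25], "gender": ["male"]})
print(query)  # +(age:(24 25) OR gender:male)
```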
If you want to make it as simple as possible for the client, just go for the dismax[1] or edismax[2] query parser.
Specifically, you can configure a request parameter called "qf":
"The qf parameter introduces a list of fields, each of which is assigned a boost factor to increase or decrease that particular field’s importance in the query. For example, the query below:
qf=fieldOne^2.3 fieldTwo fieldThree^0.4
assigns fieldOne a boost of 2.3, leaves fieldTwo with the default boost (because no boost factor is specified), and fieldThree a boost of 0.4.
These boost factors make matches in fieldOne much more significant than matches in fieldTwo, which in turn are much more significant than matches in fieldThree." from the wiki
Then you can just pass a free text query, and it will be searched in the fields you specified, giving also different importance to each one, if necessary.
[1] https://lucene.apache.org/solr/guide/6_6/the-dismax-query-parser.html
[2] https://lucene.apache.org/solr/guide/6_6/the-extended-dismax-query-parser.html
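To illustrate, here is a minimal sketch of building an edismax request with a qf list. The helper name edismax_params, the field names and the boost values are assumptions for the example; defType, q and qf are the request parameters described in the linked guides.

```python
from urllib.parse import urlencode

def edismax_params(user_text, field_boosts):
    """Build edismax parameters; a boost of None leaves the default boost."""
    qf = " ".join(
        field if boost is None else f"{field}^{boost}"
        for field, boost in field_boosts.items()
    )
    return {"defType": "edismax", "q": user_text, "qf": qf}

# Hypothetical fields and boosts, mirroring the qf example quoted above.
params = edismax_params(
    "Sarah doctor",
    {"name": 2.3, "occupation": None, "gender": 0.4},
)
print(params["qf"])  # name^2.3 occupation gender^0.4
print(urlencode(params))
```

The client then only passes free text in q, and the field list and boosts stay in one place.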
I am implementing Solr search through an API. When I call it with the parameter "Chillout Lounge", it returns the collection of items that are the same as or similar to the string "Chillout Lounge".
But when I search for "Chillout Lounge Box", it returns results that don't contain any of these three words (the DB has values that contain all three words, but they are not returned).
As far as I understand, Solr uses fuzzy search, but in that case it should return some values that contain at least one of these words.
Or what changes should I make to my schema.xml so that it gives me proper values?
First of all - "Fuzzy search" is a feature you'll have to ask for (by using ~ in standard Lucene query syntax).
If you're talking about regular searches, you can use q.op to select which operator to use. q.op=AND will make sure that all the terms match, while q.op=OR will return any document that contains at least one of the terms. As long as you aren't using fq for this, the documents that match more terms should be scored higher (as the score will add up across multiple terms), and thus be shown higher in the result set.
You can use the debug query feature in the web interface to see scores for each term for a document, and find out why the document was returned at all. If the document doesn't match any terms, it shouldn't be returned, unless you're asking for all documents to be returned.
Be aware that the analyzer chain defined for the field you're searching might affect what is and isn't considered a match.
You'll have to add a proper example to get a more detailed answer.
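For reference, a minimal sketch of a request that sets the operator explicitly and asks for debug output might look like this. The helper name search_params is made up; q.op and debugQuery are the request parameters behind the operator choice and the debug output mentioned above.

```python
from urllib.parse import urlencode

def search_params(text, operator="OR", debug=False):
    """Build parameters for a regular search with an explicit operator."""
    if operator not in ("AND", "OR"):
        raise ValueError("operator must be 'AND' or 'OR'")
    params = {"q": text, "q.op": operator}
    if debug:
        # Ask Solr to include per-document score explanations in the response.
        params["debugQuery"] = "true"
    return params

params = search_params("Chillout Lounge Box", operator="OR", debug=True)
print(urlencode(params))  # q=Chillout+Lounge+Box&q.op=OR&debugQuery=true
```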
I query using
qf=Name+Tag
Now I want documents that have the phrase in Tag to come first, so I use
qf=Name+Tag^2
and they do appear first.
What should be the rule of thumb regarding the number that comes after the field?
How do I know what number to set it to?
The number is purely a matter of preference and is mostly found by trial and error.
It expresses how much one field weighs in comparison to the other fields.
The scoring takes various factors into account; some of them are worth considering and testing,
e.g. term frequency: if a word appears twice in Name, should it outrank a single occurrence in the Tag field?
Also, if you are checking for a phrase match, you should use pf with the edismax parser.
qf matches individual words, whereas pf matches the whole phrase.
For example, if you have the fields name and tag and you search for ruby rails:
qf would cause scoring on name:ruby tag:ruby and name:rails tag:rails
pf would cause scoring on name:"ruby rails" tag:"ruby rails"
So it is better to use qf to match the results and boost single-term matches, but give pf higher boost values.
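A minimal sketch of sending qf and pf together is shown below. The boost numbers are placeholders to tune by trial and error, and edismax_phrase_params is a made-up helper name.

```python
from urllib.parse import urlencode

def edismax_phrase_params(text, qf, pf):
    """Combine per-term matching (qf) with whole-phrase boosting (pf)."""
    return {"defType": "edismax", "q": text, "qf": qf, "pf": pf}

# Placeholder boosts: phrase matches in pf are weighted above single terms in qf.
params = edismax_phrase_params("ruby rails", qf="name tag^2", pf="name^5 tag^10")
print(urlencode(params))
```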
Hello stackOverflowers
I have a Solr document collection with a field called
names_txt - this is a multiValued="true" field.
This field contains the names of all persons associated with a document.
I want to be able to do a fuzzy search and at the same time limit the number of terms between the two matching terms.
The query
names_txt:("markus foss"~2)
This will return all documents that contain the terms markus and foss with at most 2 terms between them.
But when I search in a fuzzy way AND also want to specify the max number of terms between the matches, I can't get the syntax right.
The query:
names_txt:(markus~0.7 foss~0.7)
This does work, but returns false positives, since it will return a document with "markus something" in one value and "foss somethingElse" in another.
What I would like to write is:
(markus~0.7 foss~0.7)~2
but this syntax is illegal in Solr.
Anyone out there have a solution for my problem?
Since in one single query term Solr can either process a word distance restraint or a fuzzy search restraint, we will need two terms for this:
names_txt:("markus foss"~2) AND names_txt:(markus~0.7 foss~0.7)
Note that quantifying fuzziness with a float is deprecated. Internally, Lucene converts the float to an int between 0 and 2 anyway, so we should use this integer (Damerau-Levenshtein) edit distance in our search terms from the start. So my final proposal is:
names_txt:("markus foss"~2) AND names_txt:(markus~1 foss~1)
(For those who are interested: The deprecated, somewhat quirky function that converts the similarity float to an edit distance int can be found at the end of this code file.)
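If you need to generate such a query for arbitrary names, a helper along these lines could assemble it. This is only a sketch: fuzzy_near is a made-up name and the terms are not escaped.

```python
def fuzzy_near(field, terms, slop=2, max_edits=1):
    """Build the combined proximity + fuzzy query proposed above."""
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1 or 2")
    phrase = " ".join(terms)
    fuzzy = " ".join(f"{term}~{max_edits}" for term in terms)
    # First clause: word distance restraint. Second clause: fuzzy restraint.
    return f'{field}:("{phrase}"~{slop}) AND {field}:({fuzzy})'

query = fuzzy_near("names_txt", ["markus", "foss"])
print(query)  # names_txt:("markus foss"~2) AND names_txt:(markus~1 foss~1)
```

Keep in mind that the first clause is a plain proximity match on the terms as written, so it is worth checking the results against a few misspelled names in your own data.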
I think you could do that using SpanQuery. The issue is that the usual query parsers in Solr don't support span queries. Look at this article, which mentions the parsers that do support spans: Surround, Xml-Query-Parser and Qsol. But check the status of each in your current Solr version.
I've got a text field that can potentially have multiple values.
doc 1:
field a:"X Y"
doc 2:
field a:"X"
I want to be able to do :
a:X^5
And have both doc 1 and 2 get an identical score.
I've been messing around with all the field options, but I always end up with doc 2 getting double the score of doc 1.
I've tried setting multiValued="true", but get the same result.
Is there some way that I can set up my search or the field definition so that it boosts based only on the existence of the search term and is not affected by the rest of the field's contents?
Disable norms by setting omitNorms=true in your schema and reindex - it should disable the length normalization for the field and give you the desired results.
For more details of what omitNorms does, see this.
The field a of doc 2 has only one term as compared to doc 1 which has two.
Solr's DefaultSimilarity implementation takes the length norm (based on the number of terms in the field) into account when calculating the score.
lengthNorm is 1.0 / Math.sqrt(numTerms)
LengthNorm allows you to make shorter documents score higher.
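As a quick worked example, the formula can be evaluated for the two documents in the question. This is plain arithmetic in Python rather than Solr code, and the norm stored in the index is compressed, so observed scores may not follow these numbers exactly.

```python
import math

def length_norm(num_terms):
    """Length norm as given above: 1.0 / sqrt(number of terms in the field)."""
    return 1.0 / math.sqrt(num_terms)

doc1 = length_norm(2)  # field a: "X Y" (two terms)
doc2 = length_norm(1)  # field a: "X"   (one term)
print(round(doc1, 3), doc2)  # 0.707 1.0
```

The shorter field gets the larger norm, which is why doc 2 scores higher than doc 1 for the same term.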
You can provide your own implementation of Similarity class which doesn't take into account the lengthNorm.
Check computeNorm method implementation.
You can turn off norms using omitNorms=true.
Norms allow for index time boosts and field length normalization. This allows you to add boosts to fields at index time and makes shorter documents score higher.
So you would lose both of the above if you omit norms.