I have a SOLR query which supports both exact and partial matches. The query terms have appropriate boost factors added where exact matches have higher boost compared to partial matches.
However, within partial matches too, we want to define the boost factors so that a partial match on a full word gets more priority than a partial match that is only part of a word.
For example: if a user searches for the string "Annie Hall", then documents containing values like "Tanner Hall" or "Hall Pass" should have a higher weight (priority) than values like "Halloween" or "The Dog Who Saved Halloween". They are all partial matches, but "Hall" appears as a separate word in "Tanner Hall" and "Hall Pass", and hence they should score higher.
Please help.
I am assuming you are using an n-gram filter for your queries, as it can match both full and partial matches.
If so, you can always have two fields:
Non-n-grammed field with higher boost - text
N-grammed field with normal boost - text_ngram
E.g. for dismax, a qf of text^2 text_ngram would result in perfect matches having a higher boost than partial matches.
Remember that if there is a full match, there will be a partial match as well, so it's a cumulative boost.
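As a minimal sketch of the two-field setup above, the request parameters could be assembled like this. The field names text and text_ngram come from the answer; the helper name build_dismax_params and the fl value are assumptions made for illustration, and the boost of 2 is only a starting point for tuning.

```python
from urllib.parse import urlencode


def build_dismax_params(user_query, exact_boost=2.0):
    """Build dismax parameters that search a plain field and an n-gram field.

    Hypothetical helper: "text" and "text_ngram" are the field names from the
    answer above, and must exist in your own schema for this to work.
    """
    return {
        "defType": "dismax",
        "q": user_query,
        # Full-word hits in "text" are boosted; "text_ngram" keeps boost 1.
        "qf": f"text^{exact_boost:g} text_ngram",
        "fl": "*,score",
    }


params = build_dismax_params("Annie Hall")
# URL-encoded form, ready to append to a /select request.
query_string = urlencode(params)
print(query_string)
```

Because a full-word match also matches in the n-gram field, a document such as "Tanner Hall" collects score from both fields, while "Halloween" only scores in text_ngram.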
I use Solr and I have trouble with the result scores. For example,
I have these docs with one field (for example "content"):
content = car
content = cars
content = carable awesome
content = awful for carable
And I make a search query with these params:
{
"mm":"1",
"q":"car",
"tie":"0.1",
"defType":"dismax",
"fl":"*, score"
}
I expect to see a result like this:
car: 5 score
cars: 4.8 score
carable awesome: 3
awful for carable: 3
The word without "s" should rank higher, but I get strange results. How can I boost an exact match (like "car")?
This happens because the field type you're using for the field has a stemming filter (or an ngramfilter) attached (which makes cars and car generate hits against each other). You can't boost "exact hits" inside such a field, since for Lucene they are the same value. What's stored in the index is the same for both car and cars - the latter is processed down to car as well.
To implement this and get exact hits higher, you add a second field without that filter present that only tokenizes (splits) your content on whitespace and lowercases the token. That way you have a field where cars and car are stored as different tokens, and tokens won't contribute to the score if they're not being matched.
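The difference between the two fields can be sketched with a toy model of the analysis chains. This is not Solr's implementation: the stem function is a deliberately naive stand-in for a real stemming filter, and the function names are hypothetical.

```python
def stem(token):
    # Naive stand-in for a stemming filter: strip a single trailing "s".
    if len(token) > 1 and token.endswith("s"):
        return token[:-1]
    return token


def analyze_stemmed(text):
    # Lowercase, split on whitespace, then stem each token.
    return [stem(token) for token in text.lower().split()]


def analyze_exact(text):
    # Lowercase and split on whitespace only, with no stemming.
    return text.lower().split()


docs = ["car", "cars", "carable awesome", "awful for carable"]

# In the stemmed field, "cars" is indexed as "car", so both documents hit.
stemmed_hits = [d for d in docs if "car" in analyze_stemmed(d)]
# In the exact field, only the document that really contains "car" hits.
exact_hits = [d for d in docs if "car" in analyze_exact(d)]

print(stemmed_hits)
print(exact_hits)
```

Only the exact field can tell "car" and "cars" apart, which is why the extra score has to come from a second field rather than from a boost inside the stemmed one.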
You can use qf in Solr to tell Solr which fields you want to search against, and you can give a boost at the same time - so in your case you'd have qf=exact_field^10 text_field where hits in exact_field would be valued ten times higher than hits in the regular field (the exact boost values will depend on your use case and how you want the query profile to behave).
You can also use the different boost arguments (bq and boost) to apply boosts outside of your regular query (i.e. add a query to bq that replicates your original query), but the previous suggestion will probably work just fine.
I need to give priority to the documents where the full search term occurs. For example, if the search term is "Georgia Tech", I want a document containing "Georgia Tech" to get a higher boost than documents that merely contain the term "Georgia" more frequently.
That is almost standard:
Index it in two fields (use copyField), one whitespace (or similar) tokenized, one as a keyword.
Then use edismax and boost the keyword field with more weight than the other.
I query using
qf=Name+Tag
Now I want documents that have the phrase in Tag to come first, so I use
qf=Name+Tag^2
and they do appear first.
What is the rule of thumb for the number that comes after the field?
How do I know what number to set?
The number is purely a matter of preference and is mainly found by trial and error.
It expresses how much the field weighs in comparison to the other fields.
The scoring takes various factors into account; however, some of them can be considered and tested,
e.g. term frequency: if a word appears twice in Name, should it override a single occurrence in the Tag field?
Also, if you are checking for a phrase match, you should use pf if using the edismax parser.
qf will match individual words, whereas pf will match the whole phrase.
E.g. if you have the fields name and tag and you search for ruby rails,
qf would cause scoring on name:ruby tag:ruby and name:rails tag:rails
pf would cause scoring on name:"ruby rails" tag:"ruby rails"
So it would be better to use qf to match the results and boost single matches, but give pf higher boost values.
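A rough sketch of combining qf and pf in one edismax request is shown below. The helper build_edismax_params and the specific boost numbers are assumptions chosen only to illustrate "higher pf values"; the right numbers still have to be found by trial and error.

```python
from urllib.parse import urlencode


def format_fields(boosts):
    # Turn {"name": 1, "tag": 2} into the "name^1 tag^2" syntax.
    return " ".join(f"{field}^{boost:g}" for field, boost in boosts.items())


def build_edismax_params(user_query, qf_boosts, pf_boosts):
    """Hypothetical helper: qf scores single terms, pf scores the whole phrase."""
    return {
        "defType": "edismax",
        "q": user_query,
        "qf": format_fields(qf_boosts),
        "pf": format_fields(pf_boosts),
        "fl": "*,score",
    }


params = build_edismax_params(
    "ruby rails",
    qf_boosts={"name": 1, "tag": 2},   # term matches, tag weighted higher
    pf_boosts={"name": 5, "tag": 10},  # phrase matches weighted higher still
)
query_string = urlencode(params)
print(params["qf"])
print(params["pf"])
```

Note that pf does not decide which documents match; it only adds score to documents that already matched through qf and also contain the terms as a phrase.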
How do you set up partial (substring) fuzzy match in Solr 4.2.1?
For example, if you have a list of US cities indexed, I would like a search term "Alber" to match "Alburquerque".
I have tried using the NGramFilterFactory on the <fieldType> and rebuilt the index but queries do not return results as expected - they still work as if I had just done the standard text_general defaults. Exact matches work, and explicit fuzzy searches would work given sufficient similarity (for example "Alberquerque~" with one misspelling would work.)
I did go to the analyzer tool in the Solr admin and saw that my ngrams were indeed being generated.
Is there something I'm missing on the query side?
Or should I take a different approach altogether?
And can this work with dismax? (Multiple fields indexed like this with different weights)
Thanks!
Is there any way to map analyzer to Query-types (phrase, range) similar to the way we do with Analyzer to field names?
I want to support exact match in case of phrase searches and search on even stemmed words if it's not a phrase search. During indexing I'm indexing both the original token and stemmed token at the same position.
Consider the following case:
document1 : feature flipping
document2 : feature flip
Tokens generated during indexing phase:
document1 : feature featur flipping flip
document2 : feature featur flip
feature & featur are at the same position, and flipping & flip are at the same position.
When I search using the phrase query "feature flipping", the query generated is
Your Query: +matchAllDocs:true +(alltext:("feature flipping"))
Lucene's: +matchAllDocs:true +alltext:"(feature featur) (flipping flip)"
And this returns both documents. Is there any way to return only the exact match (document 1)? I thought that if it's possible to map analyzers to query types, then I could skip the stem filter for phrase queries.
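To make the behaviour concrete, here is a toy model of positional phrase matching. It is not Lucene's implementation, and the function names are hypothetical. It assumes that skipping the stem filter for phrase queries would produce a query containing only the original tokens.

```python
def index_positions(token_groups):
    """Map each token to the set of positions at which it was indexed."""
    positions = {}
    for pos, group in enumerate(token_groups):
        for token in group:
            positions.setdefault(token, set()).add(pos)
    return positions


def phrase_matches(positions, query_groups):
    """True if consecutive positions hold a token from each query group."""
    starts = set()
    for token in query_groups[0]:
        starts |= positions.get(token, set())
    for start in starts:
        if all(
            any(start + offset in positions.get(token, set()) for token in group)
            for offset, group in enumerate(query_groups)
        ):
            return True
    return False


# Original and stemmed tokens share a position, as described above.
doc1 = index_positions([["feature", "featur"], ["flipping", "flip"]])
doc2 = index_positions([["feature", "featur"], ["flip"]])

# Query analyzed with stemming: "(feature featur) (flipping flip)".
stemmed_query = [["feature", "featur"], ["flipping", "flip"]]
# Assumed query if the stem filter were skipped for phrases.
exact_query = [["feature"], ["flipping"]]

print(phrase_matches(doc1, stemmed_query), phrase_matches(doc2, stemmed_query))
print(phrase_matches(doc1, exact_query), phrase_matches(doc2, exact_query))
```

In this model the stemmed query matches document 2 through flip at the second position, while the unstemmed query matches only document 1.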
UPDATE
https://issues.apache.org/jira/browse/LUCENE-2892 is what I'm looking for.
Thanks