Solr eDismax Search - Prioritize phrase over individual words - solr

I am trying to use the eDismax Query Parser with the following requirement: a search query should be interpreted both as a phrase and as individual words, with phrase matches taking precedence over individual word matches.
Example:
Search query: We are cool
Results should be:
Documents whose fields contain the phrase 'we are cool' appear at the top of the list
Documents whose fields contain any of 'we', 'are', 'cool', where the highest number of occurrences takes precedence.
How would I go about implementing this? Thanks.

The simplest way is to boost phrase matches with the pf (phrase fields) parameter; see the eDisMax documentation.
For example, you could add this (if you had these two fields):
q=We are cool&pf=mytitle^10 mydescription
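If you build the request from code, a minimal sketch could look like the one below. The host, the core name mycore and the qf value are assumptions made for illustration; only q and pf come from the answer above.

```python
from urllib.parse import urlencode


def build_edismax_query(user_query):
    """Return a URL-encoded query string for an eDisMax request with phrase boosting."""
    params = {
        "defType": "edismax",              # select the eDisMax parser
        "q": user_query,                   # matched word by word against qf
        "qf": "mytitle mydescription",     # assumed query fields (not given in the answer)
        "pf": "mytitle^10 mydescription",  # phrase fields: boost whole-phrase matches
    }
    return urlencode(params)


query_string = build_edismax_query("We are cool")
# Hypothetical host and core name, adjust to your installation.
url = "http://localhost:8983/solr/mycore/select?" + query_string
print(url)
```

Documents are still found through the individual words in qf; pf only adds score to those where the whole phrase occurs.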

Related

Solr Results boosted according to keyword match

Has anyone worked on boosting Solr search results based on how many of the search keywords match? I am querying Solr with multiple keywords and need to boost the results according to the number of matching keywords.
Let's say my search term is field:("suresh" or "ramesh" or "vikas"). If a result matches all three words, it should come first. If a result matches only two words, it should come second, and so on.
Thanks !

how to use pf(Phrase Fields) and ps(Phrase Slop) of eDisMax Query Parser in solr?

What are Phrase Fields, Phrase Slop and Query Phrase Slop in eDisMax? I have gone through many websites but still do not understand how to use them. I want to know how a query using them is passed to Solr and how the output differs for each, given the following data.
{
"id":"2",
"shipping_firstname":"Sudhanshu",
"address":"H.No. 444, Gali No.2 Jain Nagar",
"date_added":"2017-01-21T14:15:15Z",
"_version_":1562029999829024768
}
Welcome,
Phrase Fields, Phrase Slop and Query Phrase Slop in eDisMax parser are used to boost a document based on certain criteria.
Based on your use case you can give different boost values to manipulate the overall score of a document.
The pf (Phrase Fields) parameter can be used to boost the score of documents in which all of the terms in the q parameter appear in close proximity. The pf parameter takes a list of fields and optional corresponding boosts. The eDisMax query parser will attempt to make phrase queries out of all the terms in the q parameter, and if it’s able to find the exact phrase in any of the phrase fields, it will apply the specified boost to the match for that document.
The ps (Phrase Slop) parameter:
When using the pf parameter, you may not want to require all terms in the query to appear as an exact phrase. You can make use of the ps (phrase slop) parameter to specify how many term positions the terms in the query can be off by to be considered a match on the phrase fields.
The qs (Query Phrase Slop) parameter:
Just as the ps parameter allows you to define the amount of slop (edit distances) on phrases matching in the phrase fields (pf parameter), the qs parameter allows you to do the same for phrases the user explicitly specifies in the main q parameter. Think of the qs parameter as redefining what an exact match is, allowing you to change the slop from the default of 0 (terms must appear beside each other) to a higher number.
What is your requirement here? These parameters only help with ranking, that is, with boosting some documents towards the top; they do not change the actual search criteria or which documents match.
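To get a feel for what a slop value means, here is a simplified sketch, not Solr's actual implementation. It assumes plain whitespace tokenization and only considers terms that appear in the same order as in the phrase (Lucene's sloppy phrase matching also allows reordering, at a higher slop cost). The function name is made up for illustration.

```python
from itertools import product


def min_in_order_slop(doc_tokens, phrase_tokens):
    """Smallest slop at which the phrase terms match in order, or None if they never do."""
    # Positions at which each phrase term occurs in the document.
    positions = [
        [i for i, token in enumerate(doc_tokens) if token == term]
        for term in phrase_tokens
    ]
    best = None
    for combo in product(*positions):
        # Keep only assignments where the terms appear in phrase order.
        if all(a < b for a, b in zip(combo, combo[1:])):
            # Extra positions beyond an exact, adjacent phrase match.
            extra = (combo[-1] - combo[0]) - (len(phrase_tokens) - 1)
            if best is None or extra < best:
                best = extra
    return best


# Assumed analysis of the address field: lowercase and split on whitespace.
address = "h.no. 444 gali no.2 jain nagar".split()

print(min_in_order_slop(address, ["jain", "nagar"]))  # adjacent terms
print(min_in_order_slop(address, ["gali", "jain"]))   # one token in between
```

Under these assumptions, "jain nagar" would count as a phrase match with ps=0, while "gali jain" would need ps=1 or higher to receive the pf boost.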

How to boost a document if full query text is present in it? - Solr

I need to give priority to the documents where full search term occurs. For example if the search term is "Georgia Tech", I want the document having "Georgia Tech" in it to have extra boost than those having more frequent "Georgia" term in them.
That is almost standard:
Index it into two fields (use copyField), one tokenized on whitespace (or similar) and one as a keyword.
Then use edismax and boost the keyword field with more weight than the other.

Solr - how to plan field boosting

I query using
qf=Name+Tag
Now I want that documents that have the phrase in tag will arrive first so I use
qf=Name+Tag^2
and they do appear first.
What should be the rule of thumb regarding the number that comes after the field?
How do I know what number to set it?
The number is purely a matter of preference and is mostly found by trial and error.
It expresses how much one field weighs in comparison to the other fields.
Scoring takes various factors into account, but some of them can be considered and tested,
e.g. term frequency: if a word appears twice in Name, should that override a single occurrence in the Tag field?
Also, if you are checking for a Phrase match you should use pf if using the edismax parser.
qf will match individual words, whereas pf will match the whole phrase.
For e.g. if you have fields name & tag and you search for ruby rails
qf would cause scoring name:ruby tag:ruby & name:rails tag:rails
pf would cause scoring name:"ruby rails" tag:"ruby rails"
So it would be better to use qf to match the results and boost single-word matches, but give pf higher boost values.
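The difference can be sketched in a few lines. This is only a conceptual illustration of the clauses described above, not the real parser output; the function names are made up.

```python
def expand_qf(query, fields):
    """Per-term clauses: every word of the query is searched in every qf field."""
    return [[f"{field}:{term}" for field in fields] for term in query.split()]


def expand_pf(query, fields):
    """Phrase clauses: the whole query becomes one quoted phrase per pf field."""
    return [f'{field}:"{query}"' for field in fields]


fields = ["name", "tag"]
print(expand_qf("ruby rails", fields))
# [['name:ruby', 'tag:ruby'], ['name:rails', 'tag:rails']]
print(expand_pf("ruby rails", fields))
# ['name:"ruby rails"', 'tag:"ruby rails"']
```

The qf clauses decide which documents match; the pf clauses only add score when the words occur together as a phrase.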

Solr query results using *

I want to provide for partial matching, so I am tacking on * to the end of search queries. What I've noticed is that a search query of gatorade will return 12 results whereas gatorade* returns 7. So * seems to be 1 or many as opposed to 0 or many ... how can I achieve this? Am I going about partial matching in Solr all wrong? Thanks.
First, I think Solr wildcards are better summarized by "0 or many" than "1 or many". I doubt that's the source of your problem. (For example, see the javadocs for WildcardQuery.)
Second, are you using stemming? My first guess is that you're dealing with a stemming issue. Solr wildcards can behave kind of oddly with stemming. This is because wildcard expansion is done by searching through the list of terms stored in the inverted index; these terms are going to be in stemmed form (perhaps something like "gatorad"), rather than the words from the original source text (perhaps "gatorade" or "gatorades").
For example, suppose you have a stemmer that maps both "gatorade" and "gatorades" to the stem "gatorad". This means your inverted index will not contain either "gatorade" or "gatorades", only "gatorad". If you then issue the query gatorade*, Solr will walk the term index looking for all the stems beginning with "gatorade". But there are no such stems, so you won't get any matches. Similarly, if you searched gatorades*, Solr will look for all stems beginning with "gatorades". But there are no such stems, so you won't get any matches.
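The effect can be reproduced with a toy inverted index. Everything below is a sketch under assumptions: toy_stem is a made-up stemmer that only mimics the gatorade/gatorades example, and the two sample documents are invented.

```python
from collections import defaultdict


def toy_stem(word):
    """Hypothetical stemmer for illustration; real Solr stemmers behave differently."""
    for suffix in ("es", "e", "s"):
        if word.endswith(suffix):
            return word[: -len(suffix)]
    return word


def build_index(docs):
    """Map each stemmed term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[toy_stem(word)].add(doc_id)
    return index


def term_query(index, word):
    """A normal term query is analysed (stemmed) like the indexed text."""
    return set(index.get(toy_stem(word.lower()), set()))


def prefix_query(index, prefix):
    """A wildcard query is not stemmed; it is compared with the stored stems."""
    hits = set()
    for term, doc_ids in index.items():
        if term.startswith(prefix):
            hits |= doc_ids
    return hits


docs = {1: "cold gatorade", 2: "two gatorades"}  # invented sample documents
index = build_index(docs)

print(sorted(term_query(index, "gatorade")))    # [1, 2]
print(sorted(prefix_query(index, "gatorade")))  # []
print(sorted(prefix_query(index, "gatorad")))   # [1, 2]
```

The plain query finds both documents because it is stemmed to "gatorad", while the prefix "gatorade" finds nothing because no stored stem starts with it.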
Third, for optimal help, I'd suggest posting some more information, in particular:
Some particular query URLs you are submitting to Solr
An excerpt from your schema.xml file. In particular, include A) the field elements for the fields you are having trouble with, and B) the field type definitions corresponding to those fields
so what I was looking for is to make the search term for 'gatorade' -> 'gatorade OR gatorade*' which will give me all the matches i'm looking for.
If you want a query to return all documents that match either a stemmed form of gatorade or words that begin with gatorade, you'll need to construct the query yourself: +(gatorade gatorade*). You could alternatively extend the SolrParser to do this, but that's more work.
Another alternative is to use NGrams and TokenFilterFactories, specifically the EdgeNGramFilterFactory.
This will index n-grams, i.e. parts of words. With a min ngram size of 5 and a max ngram size of 8, the word "Documents" would be indexed as: Docum Docume Documen Document
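As a rough sketch of what front-edge n-gramming does to a single token (plain Python that mimics the idea, not the Solr filter itself; the function name is made up):

```python
def edge_ngrams(token, min_size, max_size):
    """Return the prefixes of token whose length is between min_size and max_size."""
    upper = min(max_size, len(token))
    return [token[:n] for n in range(min_size, upper + 1)]


print(edge_ngrams("Documents", 5, 8))
# ['Docum', 'Docume', 'Documen', 'Document']
```

Tokens shorter than the minimum size produce no n-grams at all in this sketch, so pick the minimum with your shortest useful search term in mind.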
There is a bit of a tradeoff in index size and indexing time. One of the Solr books quotes as a rough guide: indexing takes 10 times longer, uses 5 times more disk space, and creates 6 times more distinct terms.
However, the EdgeNGram will do better than that.
You do need to make sure that you don't submit a wildcard character in your queries. You aren't doing a wildcard search; you are matching the search term against the indexed ngrams (parts of words).
My guess is the missing matches are "Gatorade" (with a capital 'G'), and you have a lowercase filter on your field. The idea is that you have filters in your schema.xml that preprocess the input data, but wildcard queries do not use them;
see this about how Solr deals with wildcard queries:
http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
("Solr and wildcard handling").
From what I've read the wildcards only matched words with additional characters after the search term. "Gatorade*" would match Gatorades but not Gatorade itself. It appears there's been an update to Solr in version 3.6 that takes this into account by using the 'multiterm' field type instead of the 'text' field.
A better description is here:
http://bensch.be/the-solr-wildcard-problem-and-multiterm-solution
