We have a large restaurant menu database where users can search for menu items. Many dish names are unique only when their words appear side by side, but the individual words are so common that they show up all over the place.
Example: Users want to search for "cheese steak"
In the database, it can appear as "cheesesteak" or "cheese steak".
In my synonym file I have:
cheesesteak => cheesesteak, cheese steak
cheese steak => cheesesteak, cheese steak
When I search for "cheesesteak", I get valid results: menu items with "cheesesteak" and also "cheese steak" (words side by side).
But when I search for "cheese steak", I get all kinds of irrelevant results like "steak salad with blue cheese". It's picking up anything that contains the words cheese and steak.
Is there a way to configure this synonym file so it works? I don't want to force users to enter quotes, etc.
What you are looking for is proximity search, where scoring improves with the correct ordering and distance of words. From the Solr FAQ:
A proximity search can be done with a sloppy phrase query. The closer
together the two terms appear in the document, the higher the score
will be. A sloppy phrase query specifies a maximum "slop", or the
number of positions tokens need to be moved to get a match.
This example for the standard request handler will find all documents
where "batman" occurs within 100 words of "movie":
q=text:"batman movie"~100
What you should do is use edismax and let boosting surface the most relevant docs. You can also do this with the standard handler by adding boosting queries, or an optional phrase clause alongside the required terms, like +cheese +steak ("cheesesteak"^100 "steak cheese"^50)
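As a rough sketch of the edismax suggestion, the request below sends the raw user input as q and repeats the searched field in pf, so documents where the words sit side by side are boosted without the user typing quotes. The field name menu_item and the boost of 100 are assumptions made for illustration; qf, pf, ps and q.op are standard edismax parameters.

```python
from urllib.parse import urlencode, parse_qs

def phrase_boost_params(user_input, field="menu_item"):
    """Build edismax parameters that require every term and boost adjacent matches.

    `field` is a hypothetical field name and the boost of 100 is arbitrary.
    """
    return {
        "defType": "edismax",
        "q": user_input,        # raw user input, no quotes needed
        "q.op": "AND",          # every term must match somewhere in qf
        "qf": field,            # field searched term by term
        "pf": f"{field}^100",   # boost docs where the whole input occurs as a phrase
        "ps": "0",              # slop for the pf phrase: 0 means the terms must be adjacent
    }

# Query string to append to the select handler URL of your core.
query_string = urlencode(phrase_boost_params("cheese steak"))
```

With this setup "steak salad with blue cheese" still matches, but it should rank below items that contain the two words next to each other. The boost needs tuning against real data.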
I'm trying to build a query that does the following:
Find a set of words that appear in a group of fields. For example, I want to find the documents that have the words soccer, ball and goalkeeper in one or both fields: 'sport_name' and 'description'.
The problem I'm having is that I need to treat both fields as a single field in order to get results like:
{
"sport_name":"soccer",
"description": "...played with a ball... positions are goalkeeper"
}
The words may appear in either field, but all of them need to appear in the "concatenated bigger field".
Is there a way to do this during query time?
Thanks!!
You can do this by using the edismax handler (defType=edismax), setting q.op=AND (since all the terms have to be present) and using qf=sport_name description to tell Solr to search for the given terms in both fields.
You can also use qf=sport_name^2 description to say that you want to weigh hits in the sport_name field twice as much as hits in the description field. So if there was a sport named something with ball, that hit would contribute more to the score than if the same content were present in the description field.
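A minimal sketch of those parameters put together, using the field names from the question. The helper name and the (field, boost) tuple format are made up for this example; only defType, q.op, q and qf come from the answer.

```python
from urllib.parse import urlencode

def all_terms_any_field_params(terms, weighted_fields):
    """Build edismax parameters: every term is required, in any of the listed fields.

    `weighted_fields` is a list of (field_name, boost) pairs; a boost of None
    means the field is listed without a weight.
    """
    qf = " ".join(
        name if boost is None else f"{name}^{boost}"
        for name, boost in weighted_fields
    )
    return {
        "defType": "edismax",
        "q.op": "AND",            # all terms must be present
        "q": " ".join(terms),
        "qf": qf,                 # fields each term is looked up in
    }

params = all_terms_any_field_params(
    ["soccer", "ball", "goalkeeper"],
    [("sport_name", 2), ("description", None)],
)
query_string = urlencode(params)
```

Each term only has to match in one of the qf fields, so a document with "soccer" in sport_name and the other two words in description qualifies. This assumes both fields are analyzed in a compatible way.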
Scenario: eCommerce - Product search.
Is there a feature in Solr that allows us to add stop words or "keywords to ignore" per product?
Example:
Search word: "Maker"
Expected results: ABC Coffee Maker, XYZ Juice Maker, MNO Label Maker, DEF Coffee Maker.
Search word: "Coffee Maker"
Expected results: ABC Coffee Maker, XYZ Juice Maker, DEF Coffee Maker.
MNO Label Maker should not be displayed when the user searches for "Coffee Maker".
Thanks,
Jitendra.
The only way I know how to do what you want is to search on a phrase and give it a good boost. Here's an example from one of my own queries:
desc_search:(20%^10.0 AND SMD^10.0 OR "20% SMD"^100.0)
Note the "20% SMD" in quotes. This tells Solr to search on that exact phrase and boost documents that contain it. Depending on your boosting scheme, 100.0 may be too much or too little, so you'll need to experiment.
You'll still have the unrelated results down near the bottom of the results list.
I haven't noticed any speed or efficiency issues yet with this approach, but I imagine if you let a user search on a super common word like "the", "and", etc., you could see a lot of results coming back and that could slow things down a little. I've had as many as 10K docs come back without problems, though.
If the desired behavior is to show all "makers" but prioritize "coffee makers", then boosting (phrases or shingles as above) is the way to go.
If, instead, you want all coffee makers and no label makers for the search coffee maker, then just have your client code do this:
Run a phrase search for "coffee maker"
If no results are found, then optionally, run the normal term search before displaying results.
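The two steps above can be sketched on the client side roughly as follows. The run_query callable is a placeholder for whatever Solr client you use, and fake_run_query is only an in-memory stand-in so the sketch runs on its own; neither is a real Solr API.

```python
from typing import Callable, List

def search_with_fallback(user_input: str, run_query: Callable[[str], List[str]]) -> List[str]:
    """Try an exact phrase search first; fall back to a plain term search if it finds nothing."""
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    phrase_hits = run_query(f'"{escaped}"')
    if phrase_hits:
        return phrase_hits
    return run_query(user_input)

# In-memory stand-in for Solr, for illustration only.
PRODUCTS = ["ABC Coffee Maker", "XYZ Juice Maker", "MNO Label Maker", "DEF Coffee Maker"]

def fake_run_query(q: str) -> List[str]:
    if q.startswith('"') and q.endswith('"'):
        # Quoted input: treat as a phrase (substring) match.
        phrase = q.strip('"').lower()
        return [p for p in PRODUCTS if phrase in p.lower()]
    # Unquoted input: match any product sharing at least one term.
    terms = q.lower().split()
    return [p for p in PRODUCTS if any(t in p.lower().split() for t in terms)]

coffee_hits = search_with_fallback("coffee maker", fake_run_query)
tea_hits = search_with_fallback("tea maker", fake_run_query)
```

Here the search for "coffee maker" stops at the phrase results, while "tea maker" finds no phrase match and falls back to the term search. The cost is a second round trip to Solr whenever the phrase search comes back empty.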
To remove stop words:
Add the stopwords filter to your fieldType in schema.xml (solr/collection1/conf/schema.xml)
Customize the stopwords.txt list (solr/collection1/conf/stopwords.txt)
Restart Solr
Words in the stopwords list will be excluded at index time.
I am a newbie with Solr and I have a question about the query mechanism.
In my Solr schema.xml, for a particular field (say field1), I have a standard tokenizer that splits text into words, plus a couple of filters. One of the filters is a solr.KeepWordFilterFactory filter that has an extremely short dictionary (just 10 words, say: red, orange, yellow, green, etc.). I tested the schema with the analysis screen in Solr and everything works.
That is, a document with the text "Red fox was sitting on green grass" would translate to {"red", "green"}.
However, when I submit the query field1:"red green", it fails to find such a document, as if the query were applied to the unfiltered yet tokenized source.
Can you confirm that this is what the standard query parser actually does? That is, are the filters applied exclusively at index time and not for the actual search? (I understand that the search will be applied only to those documents where the index matches the analyzed query.) If not, how does the phrase query actually work in the above example?
When you do a query like this: "red green", Lucene expects to find these terms in consecutive positions, so pos(green) = pos(red) + 1. When you do it like this: "red green"~10, you give it 10 moves to shuffle the terms around and try to make them seem consecutive (this is called phrase slop).
Other than that, what a KeywordMarkerFilter does is mark tokens with the keyword flag. Filters that follow it can check whether a token is a keyword before modifying it. It does not stop Lucene from indexing tokens that are not marked as keywords, but it can stop later filters from modifying the marked ones.
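The position argument can be illustrated with a small simulation. This is a simplified model, not Lucene's actual matching algorithm: it assumes that a filter which drops tokens leaves a gap in the position numbering (which, as far as I know, is how Lucene's token-removing filters behave by default), and it only handles two terms that appear in query order.

```python
def keep_word_positions(text, keep_words):
    """Return (token, position) pairs for kept words.

    Removed words still consume a position, so kept words keep their
    original distance from each other (an assumption of this sketch).
    """
    tokens = text.lower().split()
    return [(tok, pos) for pos, tok in enumerate(tokens) if tok in keep_words]

def phrase_matches(indexed, first, second, slop=0):
    """Simplified in-order, two-term phrase check.

    With slop=0 the second term must sit exactly one position after the
    first; each unit of slop allows one extra position of distance.
    """
    firsts = [pos for tok, pos in indexed if tok == first]
    seconds = [pos for tok, pos in indexed if tok == second]
    return any(0 <= p2 - p1 - 1 <= slop for p1 in firsts for p2 in seconds)

indexed = keep_word_positions("Red fox was sitting on green grass", {"red", "green"})
exact = phrase_matches(indexed, "red", "green")             # like field1:"red green"
sloppy = phrase_matches(indexed, "red", "green", slop=10)   # like field1:"red green"~10
```

In this model "red" and "green" end up five positions apart, so the exact phrase fails while the sloppy version succeeds.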
I've got an index of about 500,000 documents, and about 10 of these documents contain the title "at the moon" ('title' field) and the tag "nasa" ('tag' field). When I do a search for "at the moon nasa", these documents come up quite far down the list of search results. This is because the title field does not get boosted, but the tag field gets boosted quite a bit. So other documents with the tag 'nasa' take precedence over the documents that almost match the entire query through the title field.
However, even though Solr can't know, the query "at the moon nasa" almost matches the document title "at the moon". If I remove the "nasa" part from the query, the documents come up at the top.
Is there some way to tell Solr to do some sort of approximate phrase query? Would it make sense to implement some sort of gram-ish search through the bq parameter, where I would split the search phrase up into word combinations such as:
// PHP-ish pseudocode: one boost query per run of consecutive words
$bq[] = 'title:"at the"^2';
$bq[] = 'title:"at the moon"^3';
$bq[] = 'title:"at the moon nasa"^4';
$bq[] = 'title:"the moon"^2';
$bq[] = 'title:"the moon nasa"^3';
$bq[] = 'title:"moon nasa"^4';
Would this make sense at all, and would it make sense to boost documents according to how large a part of the query they match?
Before you do anything else, try using eDisMax with the pf3 parameter. That does the 3-grams for you automatically.
You may also be interested in the recent vifun project, which helps visualize the effects of various parameters.
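To make the "3-grams" concrete, here is a small sketch of the word shingles that pf3 conceptually turns into phrase queries (pf2 does the same with pairs). The helper names are invented for this example, and boosting each shingle by its word count is an assumption borrowed loosely from the question, not something eDisMax does on its own.

```python
def word_shingles(query, size):
    """Return every run of `size` consecutive words in the query."""
    words = query.split()
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]

def boost_clauses(query, field, sizes=(2, 3)):
    """Build phrase clauses like the hand-written bq list, boosted by shingle length."""
    return [
        f'{field}:"{shingle}"^{size}'
        for size in sizes
        for shingle in word_shingles(query, size)
    ]

trigrams = word_shingles("at the moon nasa", 3)
clauses = boost_clauses("at the moon nasa", "title")
```

For "at the moon nasa" the trigrams are "at the moon" and "the moon nasa", so a document titled "at the moon" gets the phrase boost even though "nasa" only appears in its tag field.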
We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field shop_keyword which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product which has a keyword "apple" and another which has "orange", a search for shops having Apple AND Orange would return the shop for these products.
However, this is incorrect since we want that a search for shops having Apple AND Orange returns shop(s) having products with both "apple" and "orange" as keywords.
We tried solving this problem by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post, Querying Solr documents with one of the fields multi-valued, Solr does not support "all words must match in the same value of a multi-valued field".
(Hope I explained myself well)
How can we go about this? Ideally, we shouldn't change our search infrastructure dramatically.
Thanks!
Krt_Malta
I am going to assume shop_keyword is a text field.
A keyword search of Apple AND Orange would return only documents whose shop_keyword contains both Apple and Orange, provided you are searching on that field exclusively (shop_keyword:(Apple AND Orange)). For example, you should only see results that contain:
Apple Orange
And not:
Apple Mango
(I was able to confirm this on my local Solr instance with a text field)
However, you would see results that contain:
Apple Lime Orange Tree
(where "Orange Tree" is a single word but has spaces)
From the link you posted, it seems like this is the problem. So your real problem is that you have spaces in your keywords, which Solr is also using as a delimiter of sorts, in which case the technical solutions listed there are the only ones I know of. However...
If you have control of the terms and they aren't used in a free text search (or for google), you could consider removing the spaces from the keywords and adding quotes to your search. That would solve your problem:
shop_keyword:("Apple" AND "Orange")
This wouldn't return "Orange_Tree".
If you went this route you could use a separate field to index terms for free text search and other non-programmatic purposes.
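A minimal sketch of that idea: multi-word keywords are collapsed into single tokens before indexing, and the query quotes each keyword. The helper names are hypothetical, and whether "Orange_Tree" really stays a single token depends on the analyzer configured for the field, so check it on the analysis screen first.

```python
import re

def keyword_token(keyword):
    """Collapse a multi-word keyword into one token, e.g. for indexing."""
    return re.sub(r"\s+", "_", keyword.strip())

def shop_keyword_query(keywords, field="shop_keyword"):
    """Build a query that requires every keyword, each as a quoted single token."""
    quoted = " AND ".join(f'"{keyword_token(k)}"' for k in keywords)
    return f"{field}:({quoted})"

indexed_value = keyword_token("Orange Tree")
query = shop_keyword_query(["Apple", "Orange"])
```

The parentheses keep both keywords scoped to shop_keyword instead of letting the second one fall through to the default search field.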
Not ideal, but I hope that kinda helps =).