Solr: stopwords per document

Scenario: eCommerce - Product search.
Is there a feature in Solr that allows us to add stop words, or "keywords to ignore", per product?
Example:
Search word: "Maker"
Expected results: ABC Coffee Maker, XYZ Juice Maker, MNO Label Maker, DEF Coffee Maker.
Search word: "Coffee Maker"
Expected results: ABC Coffee Maker, XYZ Juice Maker, DEF Coffee Maker.
MNO Label Maker should not be displayed when the user searches for "Coffee Maker".
Thanks,
Jitendra.

The only way I know how to do what you want is to search on a phrase and give it a good boost. Here's an example from one of my own queries:
desc_search:(20%^10.0 AND SMD^10.0 OR "20% SMD"^100.0)
Note the "20% SMD" in quotes. This tells Solr to search on that exact phrase and boost documents that contain it. Depending on your boosting scheme, 100.0 may be too much or too little, so you'll need to experiment.
The unrelated matches will still be returned, but they end up near the bottom of the results list.
I haven't noticed any speed or efficiency issues yet with this approach, but I imagine if you let a user search on a super common word like "the", "and", etc., you could see a lot of results coming back and that could slow things down a little. I've had as many as 10K docs come back without problems, though.

If the desired behavior is to show all "makers" but prioritize "coffee makers", then boosting (phrases or shingles as above) is the way to go.
If, instead, you want all coffee makers and no label makers for the search "coffee maker", then just have your client code do this:
Run a phrase search for "coffee maker".
If no results are found, then optionally run the normal term search before displaying results.
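The two client-side steps above can be sketched as follows. This is only an illustration: `phrase_first_search` and the `run_query` callback are hypothetical names, and `run_query` stands in for whatever Solr client call your application already uses to send a query string and get back a list of results.

```python
from typing import Callable, List


def phrase_first_search(user_input: str,
                        run_query: Callable[[str], List[str]],
                        fall_back: bool = True) -> List[str]:
    """Run an exact phrase query first; optionally fall back to a term query.

    run_query is a hypothetical callback that sends a query string to Solr
    and returns the matching documents.
    """
    # Drop embedded double quotes so the generated phrase stays well-formed.
    terms = user_input.replace('"', " ").split()
    if not terms:
        return []
    results = run_query('"' + " ".join(terms) + '"')
    # A one-word phrase is the same as a term query, so no fallback is needed.
    if results or not fall_back or len(terms) == 1:
        return results
    # No phrase hits: run the normal term search instead.
    return run_query(" ".join(terms))
```

Passing `fall_back=False` gives the strict behavior (coffee makers only, or nothing), while the default shows ordinary term matches when the phrase finds nothing.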

To remove stop words:
Add the stop filter (solr.StopFilterFactory) to your fieldType in schema.xml (solr/collection1/conf/schema.xml).
Customize the stopwords.txt list (solr/collection1/conf/stopwords.txt).
Restart Solr.
Words in the stopwords list will be excluded at index time.

Related

Solr phrase query with minimum match

I have a phrase that I want to find in Solr, for example: (Ann OR Annie) is walking her dog. I want it to match Solr documents such as:
Ann is walking a dog (changed token)
Ann is walking dog (missing token)
Ann is walking her wonderful dog (additional token).
The first can be done (more or less) with ComplexPhraseQueryParser, using for example (her OR a), although that is not perfect because I might not know the alternatives. The parser also handles the third case through the proximity operator ~, but it does not work at all for the second case, because one of the tokens is missing.
The second and third can be achieved with eDisMax by combining minimum match with ps2 and ps3, but that does not give the variability needed for Ann OR Annie: the whole query is parsed as an OR, so a document containing both Ann and Annie would score higher than one containing only one of them (I want to treat them equally). I am also still not sure whether this works well when the searched words (Ann and Annie) are at the same position in Solr (increment=0).
The perfect solution would be something like ComplexPhraseQueryParser with minimum match. Is it possible to achieve that with a query alone, or do I have to write my own parser?

Is there a way to use Lucene to discover the relevancy of words based on search queries?

All:
I wonder if there is any way to use Lucene to discover keyword relevancy based on search history.
For example:
The code would read in the user's search string, parse it, extract the keywords, and find out which words are most likely to occur together in a search.
When I tried Solr, I found that Lucene has a lot of text analysis features, which is why I am wondering whether we can use it, combined with other machine learning libraries if necessary, to achieve my goal.
Thanks
Yes and No.
Yes.
It should work. Simply treat every keyword as a document and then use the MoreLikeThis feature of Lucene, which constructs a Lucene query on the fly based on the terms within the raw query. The Lucene query is then used to find other similar documents (keywords) in the index.
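import org.apache.lucene.queries.mlt.MoreLikeThis;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

MoreLikeThis mlt = new MoreLikeThis(reader); // reader: an open IndexReader
mlt.setFieldNames(new String[] {"keywords"}); // field used to measure similarity
Query query = mlt.like(docID); // docID: Lucene document number of the source keyword
TopDocs similarDocs = searcher.search(query, 20); // fetch the 20 most similar keywords
// Note: in Lucene 8 and later, totalHits is a TotalHits object rather than a number.
if (similarDocs.totalHits == 0) {
    // Handle the case where nothing similar was found
}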
Suppose in your indexed keywords, you have such keywords as
iphone 6
apple iphone
iphone on sale
apple and fruit
apple and pear
When you launch a query with "iphone", the first three keywords above should come back as "most similar" because of the full term match on "iphone".
No.
The default similarity function in Lucene has no way of knowing that iphone is related to Apple Inc. and is therefore relevant to "apple store". If your raw query is just "apple store", an ideal search result within your current keywords would be as follows (ordered by relevancy from high to low):
apple iphone
iphone 6
iphone on sale
Unfortunately, you will get the results below:
apple iphone
apple and fruit
apple and pear
The first one is great, but the other two are totally unrelated. To get real, semantic relevancy discovery, you need to do more work with topic modeling. If you happen to have a good way (e.g., a pre-trained LDA model or word2vec) to pre-process each keyword and produce a list of topic IDs, you can store those topic IDs in a separate field with each keyword document. Something like below:
[apple iphone] -> topic_iphone:1.0, topic_apple_inc:0.8
[apple and fruit] -> topic_apple_fruit:1.0
[apple and pear] -> topic_apple_fruit:0.99, topic_pear_fruit:0.98
Here each keyword is mapped to a few topic IDs, each with a weight value.
At query time, you should run the same topic modeling tool to generate topic ids for the raw query together with its terms. For example,
[apple store] -> topic_apple_inc:0.75, topic_shopping_store:0.6
Now you should combine the two fields (keyword and topic) to compute the overall similarity.
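One way to picture the combination step is the sketch below. It is an illustration under stated assumptions, not Lucene's scoring: Jaccard overlap of terms stands in for the keyword-field score, cosine similarity over the topic weights stands in for the topic-field score, and the equal weighting (`alpha = 0.5`) and all function names are hypothetical. The topic weights are the ones listed above.

```python
import math
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Term overlap; a simple stand-in for the keyword-field score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse topic-weight vectors."""
    dot = sum(w * v.get(topic, 0.0) for topic, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def combined_score(q_terms: Set[str], q_topics: Dict[str, float],
                   d_terms: Set[str], d_topics: Dict[str, float],
                   alpha: float = 0.5) -> float:
    """Weighted sum of keyword similarity and topic similarity."""
    return alpha * jaccard(q_terms, d_terms) + (1 - alpha) * cosine(q_topics, d_topics)


# Keyword documents and topic weights taken from the example above.
docs = {
    "apple iphone": {"topic_iphone": 1.0, "topic_apple_inc": 0.8},
    "apple and fruit": {"topic_apple_fruit": 1.0},
    "apple and pear": {"topic_apple_fruit": 0.99, "topic_pear_fruit": 0.98},
}
query = "apple store"
query_topics = {"topic_apple_inc": 0.75, "topic_shopping_store": 0.6}

scores = {
    keyword: combined_score(set(query.split()), query_topics,
                            set(keyword.split()), topics)
    for keyword, topics in docs.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
```

With these numbers, "apple iphone" ranks first because it shares the topic_apple_inc topic with the query, while the two fruit keywords only get credit for the shared term "apple".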

Solr : best way to match "at the moon nasa" against "at the moon" through phrase query?

I've got an index of about 500,000 documents, and about 10 of these documents contain the title "at the moon" ('title' field) and the tag "nasa" ('tag' field). When I search for "at the moon nasa", these documents come up quite far down the list of search results. This is because the title field does not get boosted, while the tag field gets boosted quite a bit, so other documents with the tag 'nasa' take precedence over the documents that almost match the entire query through the title field.
However, even though Solr can't know, the query "at the moon nasa" almost matches the document title "at the moon". If I remove the "nasa" part from the query, the documents come up at the top.
Is there some way to tell Solr to do some sort of approximate phrase query? Would it make sense to implement some sort of gram-ish search through the bq parameter, where I would split the search phrase up into word combinations such as:
// PHP: one boost query per multi-word n-gram, boosted by its word count
$bq[] = 'title:"at the"^2';
$bq[] = 'title:"at the moon"^3';
$bq[] = 'title:"at the moon nasa"^4';
$bq[] = 'title:"the moon"^2';
$bq[] = 'title:"the moon nasa"^3';
$bq[] = 'title:"moon nasa"^2';
Would this make sense at all, and would it make sense to boost documents according to how large a part of the query they match?
Before you do anything else, try using eDisMax with the pf3 parameter. That does the 3-grams for you automatically.
You may also be interested in the recent vifun project, which helps visualize the effects of various parameters.
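If pf3 turns out not to be enough and you still want the hand-built boost queries from the question, they can be generated rather than typed out. The sketch below is a minimal illustration: the function name is hypothetical, the boost is assumed to equal the number of words in each n-gram, and no escaping of Solr special characters is attempted.

```python
from typing import List


def ngram_boost_queries(phrase: str, field: str = "title",
                        min_words: int = 2) -> List[str]:
    """Build one boosted phrase query per contiguous word n-gram.

    Assumption: the boost is simply the n-gram's word count, so longer
    matches against the user's phrase are rewarded more.
    """
    words = phrase.split()
    queries = []
    for start in range(len(words)):
        for end in range(start + min_words, len(words) + 1):
            gram = " ".join(words[start:end])
            queries.append(f'{field}:"{gram}"^{end - start}')
    return queries


# Each entry would be sent as a separate bq parameter.
bq = ngram_boost_queries("at the moon nasa")
```

The number of generated queries grows quadratically with the number of words, so for long user input you may want to cap the n-gram length.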

Solr synonyms aren't working right

We have a large restaurant menu database where users can search for menu items. There are many items where the words side by side name a unique dish, but the individual words are so common that they appear all over the place.
Example: Users want to search for "cheese steak"
In the database...it can be "cheesesteak" or "cheese steak"
In my synonym file I have:
cheesesteak => cheesesteak, cheese steak
cheese steak => cheesesteak, cheese steak
When I search for "cheesesteak", I get valid results. I get menu items with "cheesesteak" and also "cheese steak" (words side by side).
But when I search for "cheese steak", I get all kinds of non-relevant results like "steak salad with blue cheese". It's picking up anything with the words cheese and steak.
Is there a way to configure this synonym file so it works? I don't want to force users to enter quotes, etc.
What you are looking for is proximity search, where scoring improves with the correct ordering and distance of words. From the Solr FAQ:
A proximity search can be done with a sloppy phrase query. The closer
together the two terms appear in the document, the higher the score
will be. A sloppy phrase query specifies a maximum "slop", or the
number of positions tokens need to be moved to get a match.
This example for the standard request handler will find all documents
where "batman" occurs within 100 words of "movie":
q=text:"batman movie"~100
What you should do is use eDisMax and let boosting surface the most relevant docs. You can also do this with the standard handler if you add boosting queries or an optional phrase containing all the terms, like +cheese +steak ("cheesesteak"^100 "cheese steak"^50)
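Since users should not have to type quotes themselves, the client can build a query of that shape from the raw input. The following is a rough sketch, assuming the phrase is kept in the user's word order; the function name and default boost values are hypothetical, and escaping of Solr special characters is left out.

```python
def required_terms_with_phrase_boost(user_input: str,
                                     phrase_boost: int = 50,
                                     joined_boost: int = 100) -> str:
    """Require every term, and boost the exact phrase and the joined word.

    Every term stays mandatory (+term); the optional clauses only change
    the ranking of documents that already match.
    """
    terms = user_input.lower().split()
    if not terms:
        return ""
    required = " ".join(f"+{term}" for term in terms)
    if len(terms) < 2:
        return required
    phrase = " ".join(terms)   # e.g. "cheese steak"
    joined = "".join(terms)    # e.g. "cheesesteak"
    return f'{required} ("{joined}"^{joined_boost} "{phrase}"^{phrase_boost})'


query = required_terms_with_phrase_boost("cheese steak")
```

Because both terms remain required, a document that contains only the single token "cheesesteak" matches this query only if synonym expansion (or a similar analysis step) also produces the separate terms.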

Query problem in Solr

We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field shop_keyword which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product which has a keyword "apple" and another which has "orange", a search for shops having Apple AND Orange would return the shop for these products.
However, this is incorrect: we want a search for shops having Apple AND Orange to return only shop(s) that have a single product with both "apple" and "orange" as keywords.
We tried solving this problem by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post, Querying Solr documents with one of the fields multi-valued, Solr does not support "all words must match in the same value of a multi-valued field".
(Hope I explained myself well)
How can we go about this? Ideally, we shouldn't change our search infrastructure dramatically.
Thanks!
Krt_Malta
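The difference between the current behavior and the desired one can be illustrated outside Solr with plain sets. This is only a model of the matching semantics described in the question, not a Solr query; the shop data and function names are made up for the illustration.

```python
from typing import Dict, List, Set

# Hypothetical data: each shop maps to its products' keyword sets.
shops: Dict[str, List[Set[str]]] = {
    "shop_a": [{"apple"}, {"orange"}],           # two separate products
    "shop_b": [{"apple", "orange"}, {"mango"}],  # one product has both
}


def match_flattened(shops: Dict[str, List[Set[str]]], wanted: Set[str]) -> List[str]:
    """Current behavior: all product keywords merged into one shop field."""
    return [shop for shop, products in shops.items()
            if wanted <= set().union(*products)]


def match_per_product(shops: Dict[str, List[Set[str]]], wanted: Set[str]) -> List[str]:
    """Desired behavior: a single product must carry every wanted keyword."""
    return [shop for shop, products in shops.items()
            if any(wanted <= keywords for keywords in products)]


flattened_hits = match_flattened(shops, {"apple", "orange"})
per_product_hits = match_per_product(shops, {"apple", "orange"})
```

Here the flattened match returns both shops, while the per-product match returns only shop_b, which is the result the question is asking for.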
I am going to assume shop_keyword is a text field.
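A keyword search of Apple AND Orange would return only shop_keyword values that contain both Apple and Orange, provided you are searching on that field exclusively. Note that the field prefix must cover both terms, as in shop_keyword:(Apple AND Orange); written as shop_keyword:Apple AND Orange, only Apple is searched in shop_keyword and Orange goes to the default field. For example, you should only see results that contain: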
Apple Orange
And not:
Apple Mango
(I was able to confirm this on my local Solr instance with a text field)
However, you would see results that contain:
Apple Lime Orange Tree
(where "Orange Tree" is a single keyword that contains a space)
From the link you posted, it seems like this is the problem. So your real problem is that you have spaces in your keywords, which Solr is also using as a delimiter of sorts, in which case the technical solutions listed there are the only ones I know of. However...
If you have control of the terms and they aren't used in a free text search (or for Google), you could consider removing the spaces from the keywords and adding quotes to your search. That would solve your problem:
shop_keyword:("Apple" AND "Orange")
This wouldn't return "Orange_Tree".
If you went this route you could use a separate field to index terms for free text search and other non-programmatic purposes.
Not ideal, but I hope that kinda helps =).

Resources