Query problem in Solr - solr

We're using Solr to search on a Shop index and a Product index. Currently a Shop has a field shop_keyword which also contains the keywords of the products assigned to it. The shop keywords are separated by a space. Consequently, if there is a product which has a keyword "apple" and another which has "orange", a search for shops having Apple AND Orange would return the shop for these products.
However, this is incorrect since we want that a search for shops having Apple AND Orange returns shop(s) having products with both "apple" and "orange" as keywords.
We tried solving this problem by making shop keywords multi-valued and assigning the keywords of every product of the shop as a new value in shop keywords. However, as was confirmed in another post (Querying Solr documents with one of the fields multi-valued), Solr does not support "all words must match in the same value of a multi-valued field".
(Hope I explained myself well)
How can we go about this? Ideally, we shouldn't change our search infrastructure dramatically.
Thanks!
Krt_Malta

I am going to assume shop_keyword is a text field.
A keyword search for Apple AND Orange would return only documents whose shop_keyword contains both Apple and Orange, provided you are searching on that field exclusively (shop_keyword:(Apple AND Orange); without the parentheses, only Apple is scoped to shop_keyword and Orange goes to the default field). For example, you should only see results that contain:
Apple Orange
And not:
Apple Mango
(I was able to confirm this on my local Solr instance with a text field)
However, you would see results that contain:
Apple Lime Orange Tree
(where "Orange Tree" is a single word but has spaces)
From the link you posted, it seems like this is the problem. So your real problem is that you have spaces in your keywords, which Solr is also using as a delimiter of sorts, in which case the technical solutions listed there are the only ones I know of. However...
If you have control of the terms and they aren't used in a free text search (or for google), you could consider removing the spaces from the keywords and adding quotes to your search. That would solve your problem:
shop_keyword:("Apple" AND "Orange")
This wouldn't return "Orange_Tree".
If you went this route you could use a separate field to index terms for free text search and other non-programmatic purposes.
Not ideal, but I hope that kinda helps =).
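To make the semantics asked for in the question concrete, here is a minimal sketch in plain Python (not Solr). The shop and product names are made-up sample data, and both helper functions are hypothetical. It only contrasts "keywords flattened per shop" with "all keywords on a single product".

```python
# Hypothetical sample data: each shop maps product ids to keyword sets.
shops = {
    "shop_a": {"p1": {"apple"}, "p2": {"orange"}},
    "shop_b": {"p3": {"apple", "orange"}},
}

def match_flattened(shops, required):
    """Shops whose merged keywords (all products combined) cover `required`."""
    required = set(required)
    return sorted(
        shop for shop, products in shops.items()
        if required <= set().union(*products.values())
    )

def match_per_product(shops, required):
    """Shops with at least one single product carrying every required keyword."""
    required = set(required)
    return sorted(
        shop for shop, products in shops.items()
        if any(required <= keywords for keywords in products.values())
    )
```

With this data the flattened check returns both shops for apple and orange, while the per-product check returns only shop_b, which is the behavior the question wants.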

Related

SOLR - Stopwords-document

Scenario: eCommerce - Product search.
Is there a feature in SOLR that allow us to add STOP Words or "Keywords to ignore " per Product?
Example:
Search word: "Maker"
Expected results: ABC Coffee Maker, XYZ Juice Maker, MNO Label Maker, DEF Coffee Maker.
Search word: "Coffee Maker"
Expected results: ABC Coffee Maker, XYZ Juice Maker, DEF Coffee Maker.
MNO Label Maker should not be displayed when the user searches for "Coffee Maker".
Thanks,
Jitendra.
The only way I know how to do what you want is to search on a phrase and give it a good boost. Here's an example from one of my own queries:
desc_search:(20%^10.0 AND SMD^10.0 OR "20% SMD"^100.0)
Note the "20% SMD" in quotes. This tells Solr to search on that exact phrase and boost documents that contain it. Depending on your boosting scheme, 100.0 may be too much or too little, so you'll need to experiment.
You'll still have the unrelated searches down near the bottom of the results list.
I haven't noticed any speed or efficiency issues yet with this approach, but I imagine if you let a user search on a super common word like "the", "and", etc., you could see a lot of results coming back and that could slow things down a little. I've had as many as 10K docs come back without problems, though.
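A query of that shape can be assembled in client code. The following is only a sketch: the function name and default boost values are assumptions, and it does no escaping of Lucene special characters, which real user input would need.

```python
def build_boosted_query(field, terms, term_boost=10.0, phrase_boost=100.0):
    """Build field:(t1^b AND t2^b OR "t1 t2"^B), the same shape as the query above.

    Note: terms are inserted as-is; special characters are NOT escaped here.
    """
    all_terms = " AND ".join(f"{term}^{term_boost}" for term in terms)
    phrase = " ".join(terms)
    return f'{field}:({all_terms} OR "{phrase}"^{phrase_boost})'

query = build_boosted_query("desc_search", ["20%", "SMD"])
```

The boosts are parameters so they can be tuned by experiment, as suggested above.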
If the desired behavior is to show all "makers" but prioritize "coffee makers", then boosting (phrases or shingles as above) is the way to go.
If, instead, you want all coffee makers and no label makers for the search coffee maker, then just have your client code do this:
Run a phrase search for "coffee maker"
If no results are found, then optionally, run the normal term search before displaying results.
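The two client-side steps above could look roughly like this. It is a sketch only: `run_query` is a hypothetical callable standing in for whatever sends a query string to Solr and returns a list of documents.

```python
def search_with_fallback(run_query, user_input):
    """Try an exact phrase first; fall back to a plain term search if empty.

    `run_query` is a hypothetical callable: query string in, list of docs out.
    """
    # Drop any quotes the user typed, then wrap the input as one phrase.
    phrase = '"' + user_input.replace('"', " ").strip() + '"'
    results = run_query(phrase)
    if results:
        return results
    # No phrase hits: optionally run the normal term search instead.
    return run_query(user_input)
```

Passing the query function in keeps the fallback logic independent of the Solr client library in use.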
To remove stop words:
Add the stopwords filter to your fieldType in schema.xml (solr/collection1/conf/schema.xml)
Customize the stopwords.txt list (solr/collection1/conf/stopwords.txt)
Restart Solr
Words in the stopwords list will be excluded at index time

Is there a way we can use lucene to discover the relevancy of word based on search query

All:
I wonder if there is any way we can use Lucene to discover search keyword relevancy based on search history.
For example:
The code can read in a user's search string, parse it, extract the keywords, and find out which words are most likely to appear together in a search.
When I tried Solr, I found that Lucene has a lot of text analysis features, which is why I am wondering whether we can use it, combined with other machine learning libraries if necessary, to achieve my goal.
Thanks
Yes and No.
Yes.
It should work. Simply treat every keyword as a document and then use the MoreLikeThis feature of Lucene, which constructs a Lucene query on the fly based on terms within the raw query. The Lucene query is then used to find other similar documents (keywords) in the index.
MoreLikeThis mlt = new MoreLikeThis(reader); // Pass the index reader
mlt.setFieldNames(new String[] {"keywords"}); // specify the field for similarity
Query query = mlt.like(docID); // Pass the doc id
TopDocs similarDocs = searcher.search(query, 20); // Use the searcher
if (similarDocs.totalHits == 0) {
    // Handle the case where no similar keywords were found
}
Suppose in your indexed keywords, you have such keywords as
iphone 6
apple iphone
iphone on sale
apple and fruit
apple and pear
when you launch a query with "iphone", I am sure you will find the first three keywords above as "most similar" due to the full term match of "iphone".
No.
The default similarity function in lucene never understands that iphone is relevant to Apple Inc, thus iphone is relevant to "apple store". If your raw query is just "apple store", an ideal search result within your current keywords would be as follows (ordered by relevancy from high to low):
apple iphone
iphone 6
iphone on sale
Unfortunately, you will get the results below:
apple iphone
apple and fruit
apple and pear
The first one is great; however, the other two are totally unrelated. To get real, semantic relevancy discovery, you need to do more work, namely topic modeling. If you happen to have a good way (e.g., a pre-trained LDA model or word2vec) to pre-process each keyword and produce a list of topic ids, you can store those topic ids in a separate field with each keyword document. Something like the following:
[apple iphone] -> topic_iphone:1.0, topic_apple_inc:0.8
[apple and fruit] -> topic_apple_fruit:1.0
[apple and pear] -> topic_apple_fruit:0.99, topic_pear_fruit:0.98
where each keyword is also mapped to a few topic ids with weight value.
At query time, you should run the same topic modeling tool to generate topic ids for the raw query together with its terms. For example,
[apple store] -> topic_apple_inc:0.75, topic_shopping_store:0.6
Now you should combine the two fields (keyword and topic) to compute the overall similarity.
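One possible way to combine the two signals is sketched below in plain Python. The scoring choices are assumptions made for illustration, not Lucene's own similarity: Jaccard overlap on the keyword terms, cosine similarity on the topic weights, and an equal-weight blend controlled by `alpha`. The topic ids and weights are the example values from above.

```python
import math

def jaccard(a, b):
    """Term overlap between two whitespace-tokenized strings."""
    a, b = set(a.split()), set(b.split())
    return len(a & b) / len(a | b) if a | b else 0.0

def cosine(u, v):
    """Cosine similarity between two sparse topic-weight dicts."""
    dot = sum(w * v.get(topic, 0.0) for topic, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def rank(query, query_topics, keywords, alpha=0.5):
    """Order keywords by alpha * term score + (1 - alpha) * topic score."""
    scored = [
        (alpha * jaccard(query, kw) + (1 - alpha) * cosine(query_topics, topics), kw)
        for kw, topics in keywords.items()
    ]
    return [kw for score, kw in sorted(scored, reverse=True)]

keywords = {
    "apple iphone": {"topic_iphone": 1.0, "topic_apple_inc": 0.8},
    "apple and fruit": {"topic_apple_fruit": 1.0},
    "apple and pear": {"topic_apple_fruit": 0.99, "topic_pear_fruit": 0.98},
}
ranking = rank("apple store", {"topic_apple_inc": 0.75, "topic_shopping_store": 0.6}, keywords)
```

Here "apple iphone" ranks first because it shares both a term and a topic with the query, while the fruit keywords share only the term "apple".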

Apache Solr phrase query is not aware of filters from schema.xml

I am a newbie with solr and I have a question about query mechanism.
In my Solr schema.xml, for a particular field (say field1), I have a standard tokenizer that splits into words and a couple of filters. One of the filters is a solr.KeepWordFilterFactory filter that has an extremely short dictionary (just 10 words, say: red, orange, yellow, green, etc.). I tested the schema with the Analysis menu of Solr and everything works.
That is, a document with the text "Red fox was sitting on green grass" would translate to {"red", "green"}.
However, when I submit the query field1:"red green", it fails to find such a document, as if the query were applied to the tokenized but unfiltered source.
Can you confirm that this is what the standard query parser actually does? I.e., the filters are applied exclusively for the index, but not for the actual search? (I understand that the search will be applied only to those documents where the index matches the analyzed query.) Or, if not, how does the phrase query actually work in the above example?
When you do a query like this: "red green", Lucene expects to find these terms in consecutive positions, so pos(green) = pos(red) + 1. When you do it like this: "red green"~10, you give it 10 moves to shuffle the terms around and try to make them seem consecutive (this is called phrase slop).
Other than that, what a KeywordMarkerFilter does is mark tokens with the keyword flag. Filters following it can implement logic that checks whether a token is a keyword before modifying it. It does not stop Lucene from indexing tokens not marked as keywords, but it can stop later filters from further modifying marked tokens.
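The position rule can be illustrated with a simplified model in plain Python. It assumes the keep-word filter leaves position gaps where tokens were removed, which is worth checking on the Analysis screen for your own schema. The matcher only handles an in-order, two-term phrase; real Lucene slop also allows reordering at extra cost. All names here are hypothetical.

```python
KEEP_WORDS = {"red", "orange", "yellow", "green"}

def index_positions(text, keep=KEEP_WORDS):
    """Map each kept token to the positions it had before filtering."""
    positions = {}
    for pos, token in enumerate(text.lower().split()):
        if token in keep:
            positions.setdefault(token, []).append(pos)
    return positions

def phrase_matches(positions, first, second, slop=0):
    """Simplified in-order, two-term phrase check: gap between terms <= slop."""
    return any(
        0 <= p2 - p1 - 1 <= slop
        for p1 in positions.get(first, [])
        for p2 in positions.get(second, [])
    )

doc = index_positions("Red fox was sitting on green grass")
exact = phrase_matches(doc, "red", "green")            # positions 0 and 5
sloppy = phrase_matches(doc, "red", "green", slop=4)   # four removed tokens between them
```

Under that assumption "red" and "green" stay five positions apart, so the exact phrase fails while a slop of 4 or more matches.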

Solr - How do I get the number of documents for each field containing the search term within that field in Solr?

Imagine an index like the following:
id partno name description
1 1000.001 Apple iPod iPod by Apple
2 1000.123 Apple iPhone The iPhone
When the user searches for "Apple" both documents would be returned. Now I'd like to give the user the possibility to narrow down the results by limiting the search to one or more fields that have documents containing the term "Apple" within those fields.
So, ideally, the user would see something like this in the filter section of the ui after his first query:
Filter by field
name (2)
description (1)
When the user applies the filter for field "description", only documents which contain the term "Apple" within the field "description" would be returned. So the result set of that second request would be the iPod document only. For that I'd use a query like ?q=Apple&qf=description (I'm using the Extended DisMax Query Parser)
How can I accomplish that with Solr?
I already experimented with faceting, grouping and highlighting components, but did not really come to a decent solution to this.
[Update]
Just to make that clear again: The main problem here is to get the information needed for displaying the "Filter by field" section. This includes the names of the fields and the hits per field. Sending a second request with one of those filters applied already works.
Solr just plain Doesn't Do This. If you absolutely need it, I'd try the multiple-requests solution and benchmark it. Solr tends to be a lot faster than what people put in front of it, so a couple of extra requests might not be that big of a deal.
you could achieve this with two different search requests/queries:
name:apple -> 2 hits
description:apple -> 1 hit
EDIT:
You also could implement your own SearchComponent that executes multiple queries in the background and put it in the SearchHandler processing chain so you only will need a single query in the frontend.
if you want the term to be searched over the same fields every time, you have 2 options not breaking the "single query" requirement:
1) copyField: at index time you group all the fields that should match together. With just one copyField your problem doesn't exist; if you need more than one, you're at the same spot.
2) you could filter the query each time dynamically adding the "fq" parameter at the end
http://<your_url_and_stuff>/?q=Apple&fq=name:Apple ...
this works if you'll be searching always on the same two fields (or you can setup them before querying) otherwise you'll always need at least a second query
Since i said "you have 2 options" but you actually have 3 (and i rushed my answer), here's the third:
3) the dismax plugin described by them like this:
The DisMaxQParserPlugin is designed to process simple user entered phrases
(without heavy syntax) and search for the individual words across several fields
using different weighting (boosts) based on the significance of each field.
so, if you can use it, you may want to give it a look and start from the qf parameters (that is what the option number 2 wanted to be about, but i changed it in favor of fq... don't ask me why...)
SolrFaceting should solve your problem.
Have a look at the Examples.
This can be achieved with Solr faceting, but it's not neat. For example, I can issue this query:
/select?q=*:*&rows=0&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
to find the number of documents containing donkey in the title and text fields. I may get this response:
{
"responseHeader":{"status":0,"QTime":1,"params":{"facet":"true","facet.query":["title:donkey","text:donkey"],"q":"*:*","wt":"json","rows":"0"}},
"response":{"numFound":3365840,"start":0,"docs":[]},
"facet_counts":{
"facet_queries":{
"title:donkey":127,
"text:donkey":4108
},
"facet_fields":{},
"facet_dates":{},
"facet_ranges":{}
}
}
Since you also want the documents back for the field-disjunctive query, something like the following works:
/select?q=donkey&defType=edismax&qf=text+title&rows=10&facet=true&facet.query=title:donkey&facet.query=text:donkey&wt=json
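For the "Filter by field" display, the request parameters and the per-field counts could be handled in client code roughly as follows. This is a sketch: both function names are hypothetical, the term is inserted without escaping, and the response shape is assumed to match the JSON shown above.

```python
from urllib.parse import urlencode

def build_field_count_params(term, fields, rows=10):
    """Query string for one edismax search plus one facet.query per field."""
    params = [
        ("q", term),
        ("defType", "edismax"),
        ("qf", " ".join(fields)),
        ("rows", str(rows)),
        ("facet", "true"),
        ("wt", "json"),
    ]
    # One facet.query per field yields a hit count per field.
    params += [("facet.query", f"{field}:{term}") for field in fields]
    return urlencode(params)

def field_counts(response):
    """Turn {"name:Apple": 2, ...} from facet_queries into {"name": 2, ...}."""
    facet_queries = response["facet_counts"]["facet_queries"]
    return {key.split(":", 1)[0]: count for key, count in facet_queries.items()}
```

Fields with a count of zero can then be left out of the filter section.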

Solr synonyms aren't working right

We have a large restaurant menu database where users can search for menu items. There are many items where the words side by side name a unique dish, but the individual words are very common and appear all over the place.
Example: Users want to search for "cheese steak"
In the database...it can be "cheesesteak" or "cheese steak"
In my synonym file I have:
cheesesteak => cheesesteak, cheese steak
cheese steak => cheesesteak, cheese steak
When I search for "cheesesteak", I get valid results. I get menu items with "cheesesteak" and also "cheese steak" (words side by side)
But when I search for "cheese steak", I get all kinds of irrelevant results like "steak salad with blue cheese"; it's picking up anything with the words cheese and steak.
Is there a way to configure this synonym file so it works? I don't want to force user to enter quotes, etc.
What you are looking for is proximity search, where scoring improves with the correct ordering and distance of words. From the Solr FAQ:
A proximity search can be done with a sloppy phrase query. The closer
together the two terms appear in the document, the higher the score
will be. A sloppy phrase query specifies a maximum "slop", or the
number of positions tokens need to be moved to get a match.
This example for the standard request handler will find all documents
where "batman" occurs within 100 words of "movie":
q=text:"batman movie"~100
What you should do is use edismax and let boosting surface the most relevant docs. You can also do this with the standard handler if you add boosting queries or an optional phrase alongside the required terms, like +cheese +steak ("cheesesteak"^100 "steak cheese"^50)
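With edismax, the phrase boost can be expressed through request parameters instead of hand-built query syntax. The sketch below makes several assumptions: the field name `name` is hypothetical, the boost and slop values are arbitrary starting points, and mm=100% is used to mirror the required +cheese +steak terms.

```python
from urllib.parse import urlencode

def build_edismax_params(user_input, field="name", phrase_boost=100, slop=1):
    """Require all terms (mm) and boost docs where they appear as a phrase (pf)."""
    return urlencode({
        "defType": "edismax",
        "q": user_input,                   # raw user text, no quotes needed
        "qf": field,                       # field(s) searched term by term
        "mm": "100%",                      # every term must match
        "pf": f"{field}^{phrase_boost}",   # boost when terms occur as a phrase
        "ps": str(slop),                   # slop allowed for the pf phrase
    })

params = build_edismax_params("cheese steak")
```

Documents containing the words side by side then score above ones like "steak salad with blue cheese", without the user typing quotes.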

Resources