How to find documents that contain only the searched words in Solr

For example, I have a solr collection that contains documents with a field called "key_phrase".
I know it is easy to find all documents that contain all the searched words in a search query (e.g. using mm=100% in edismax).
However, what I am asking is how to return documents whose "key_phrase" contains only the searched words and nothing else. This "key_phrase" is also a multivalued field.
For example:
Search query: 'kids soccer gear'
The query would return a document whose "key_phrase" field contains "kids soccer".
It would also return a document that has two "key_phrase" values such as 'kids gear' and 'any other word', since one of them contains no words that are absent from the search query.
On the other hand, it would not return a document that has only 'kids soccer gear for boy', since this value contains 'for' and 'boy', which are not present in the search query.
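To pin down the requirement, here is a minimal Python sketch of the matching rule described above. It is not a Solr feature; the function name `matches` and the whitespace-based word splitting are assumptions made only for illustration.

```python
def matches(query, key_phrases):
    """Return True if at least one key_phrase value uses only words from the query."""
    query_words = set(query.lower().split())
    return any(
        # An empty value never matches; otherwise every word must be in the query.
        bool(phrase.split()) and set(phrase.lower().split()) <= query_words
        for phrase in key_phrases
    )


print(matches("kids soccer gear", ["kids soccer"]))
print(matches("kids soccer gear", ["kids gear", "any other word"]))
print(matches("kids soccer gear", ["kids soccer gear for boy"]))
```

The first two calls correspond to the documents that should be returned, the third to the one that should not.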

You can try indexing the field with the ShingleFilterFactory, e.g.
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true"/>
The ShingleFilterFactory documentation has an example that uses the default settings:
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory"/>
</analyzer>
With this analyzer, the input is processed as follows:
In: "To be, or what?"
Tokenizer to Filter: "To"(1), "be"(2), "or"(3), "what"(4)
Out: "To"(1), "To be"(1), "be"(2), "be or"(2), "or"(3), "or what"(3), "what"(4)
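The shingling step can be sketched in plain Python to see which tokens end up in the index. This is only an approximation of what the filter does (it ignores token positions), and the function name `shingles` and its parameter names are made up for this sketch, loosely mirroring the maxShingleSize and outputUnigrams attributes.

```python
def shingles(tokens, max_shingle_size=2, output_unigrams=True):
    """Approximate shingle output: each token plus the word n-grams starting at it."""
    out = []
    for start in range(len(tokens)):
        if output_unigrams:
            out.append(tokens[start])
        for size in range(2, max_shingle_size + 1):
            if start + size <= len(tokens):
                out.append(" ".join(tokens[start:start + size]))
    return out


print(shingles(["To", "be", "or", "what"]))
# ['To', 'To be', 'be', 'be or', 'or', 'or what', 'what']
```

With the defaults this reproduces the output listed above; raising `max_shingle_size` to 3 adds three-word shingles as well.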

Related

What is the easiest way to implement SVD algorithm for my searched results on Solr?

I created my own core on http://localhost:8983/solr and added some documents so I could query them. But when I query something like "dog", I want documents that contain "pooch" to be returned too. So I want to implement the SVD algorithm to improve my results.
I am new to search engines. All I know is that I can use Mahout to implement SVD, but that seems a bit difficult because I have to install Maven, Hadoop and Mahout.
Any suggestions will be appreciated.
You can use SynonymGraphFilterFactory
This filter maps single- or multi-token synonyms, producing a fully correct graph output. This filter is a replacement for the Synonym Filter, which produces incorrect graphs for multi-token synonyms.
If you use this filter during indexing, you must follow it with a Flatten Graph Filter to squash tokens on top of one another like the Synonym Filter.
Create a file, e.g. mysynonyms.txt, in the directory your_collection/conf/ and list the synonyms using the => sign:
pooch,pup,fido => dog
huge,ginormous,humungous => large
An example schema would be:
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
<filter class="solr.FlattenGraphFilterFactory"/> <!-- required on index analyzers after graph filters -->
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="mysynonyms.txt"/>
</analyzer>
Source : https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions
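To see what the explicit `=>` rules do to a token stream, here is a rough Python sketch. It only handles single-token synonyms and ignores the graph handling that the real filter performs; `parse_synonyms` and `apply_synonyms` are hypothetical helper names, not part of Solr.

```python
def parse_synonyms(lines):
    """Parse explicit 'a,b => c' rules into a dict from source term to replacement terms."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=>" not in line:
            continue  # skip blanks, comments, and rule styles this sketch does not cover
        left, right = line.split("=>", 1)
        targets = [t.strip() for t in right.split(",") if t.strip()]
        for source in left.split(","):
            mapping[source.strip()] = targets
    return mapping


def apply_synonyms(tokens, mapping):
    """Replace each token that has a rule; leave all other tokens unchanged."""
    out = []
    for tok in tokens:
        out.extend(mapping.get(tok, [tok]))
    return out


rules = parse_synonyms(["pooch,pup,fido => dog", "huge,ginormous,humungous => large"])
print(apply_synonyms(["my", "pooch"], rules))  # ['my', 'dog']
```

Because "pooch" is rewritten to "dog" at index time, a query for "dog" can then match a document that originally said "pooch".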
There is another way to augment your index with terms not in the content. Synonyms are good, as @ashraful says, but there are two other problems you will run into:
words that are used but are not in the synonym list
behavioral search: using other users' behavior as a hint to what they are looking for
These require you to augment the index with terms learned from 1) other searches, and 2) user behavior. Mahout's Correlated Cross Occurrence algorithm can help with both. You can set it up to find terms that lead to people reading an item and (if you have something like purchase or other preference data) conversion items that correlate with items in the index. In the second case you would add user conversions to the search query to personalize the results.
A blog about the technique here: http://actionml.com/blog/personalized_search
The page on Mahout docs here: http://mahout.apache.org/users/algorithms/intro-cooccurrence-spark.html
You should also look at word2vec, which will (given the right training data) find that "dog" and "pooch" are synonyms regardless of the synonym list because it is learned from the data. I'm not sure how you add word2vec to Solr but it is integrated into Fusion, the closed source product of Lucid.

Solr query data with white space needs to be queried

I am new to Solr. I have data in Solr like "name":"John Lewis".
The query fq=name%3A+%22John+Lewis%22 is formed correctly and searches perfectly.
It is formed in the Solr console and works well.
My requirement is to search for a word coming from my Java layer as "JohnLewis". It has to be matched with "John Lewis" in the Solr repository.
This search is not restricted to the name field (2 words with a space in between).
I have other values like "Cash Reward Credit Cards", which has 4 words, and the user would query it as "CashRewardCreditCards".
Could someone help me with this, and tell me whether it can be handled in schema.xml with any of the parsers available in Solr?
You need to create a custom fieldType.
First, define a fieldType in your Solr schema:
<fieldType name="word_concate" class="solr.TextField" indexed="true" stored="false">
<analyzer>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=""/>
<tokenizer class="solr.StandardTokenizerFactory"/>
</analyzer>
</fieldType>
Here the fieldType is named word_concate.
It uses the char filter solr.PatternReplaceCharFilterFactory.
A char filter is a component that pre-processes input characters. Char filters can be chained like token filters and are placed in front of the tokenizer. PatternReplaceCharFilterFactory uses regular expressions to replace or change character patterns.
The pattern \s+ matches one or more whitespace characters, which are replaced with the empty string.
Second, create a field with word_concate as its type:
<field name="cfname" type="word_concate"/>
Then copy your name field to cfname with a copyField:
<copyField source="name" dest="cfname"/>
Third, reindex the data.
Now you can query cfname:"JohnLewis" and it will return the document whose name is John Lewis.
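The effect of the char filter can be checked outside Solr with a regular expression. This Python sketch only mimics the whitespace removal; `concat_words` is a made-up name and the tokenizer step is left out.

```python
import re


def concat_words(text):
    """Remove every run of whitespace, as the char filter does before tokenizing."""
    return re.sub(r"\s+", "", text)


print(concat_words("John Lewis"))                # JohnLewis
print(concat_words("Cash Reward Credit Cards"))  # CashRewardCreditCards
```

Note that the analyzer shown has no lowercase filter, so under this sketch's assumption "johnlewis" would not be treated the same as "JohnLewis".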
Assuming your input is CamelCase as shown, I would use Solr's Word Delimiter Filter with the splitOnCaseChange parameter on the query side of your analyzer as a starting point. This will take an input token such as CashRewardCreditCards and generate the tokens Cash, Reward, Credit and Cards.
See also:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
Look at WordDelimiterFilterFactory
It has a splitOnCaseChange property. If you set that to 1, JohnLewis will be indexed as John Lewis.
You'll need to add this to your query analyzer. If the user searches for JohnLewis, the search will be translated to John Lewis.
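As a rough illustration of what splitting on case change means, the following Python sketch breaks a token at every lowercase-to-uppercase boundary. The real filter does considerably more (delimiters, numbers, catenation options); `split_on_case_change` is a hypothetical name used only here.

```python
import re


def split_on_case_change(token):
    """Split a token wherever a lowercase letter is followed by an uppercase letter."""
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", token).split()


print(split_on_case_change("JohnLewis"))              # ['John', 'Lewis']
print(split_on_case_change("CashRewardCreditCards"))  # ['Cash', 'Reward', 'Credit', 'Cards']
```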

Solr search index on different tokens of a sentence

Folks,
We want to search in Solr so that partial phrase matches within a sentence get priority.
For example:
The sentence is "Have wonderful evening today here".
If the user supplies "today here", it should match.
If the user supplies "wonderful evening", it should match.
If the user supplies "Have wonderful", it should match.
We want to give keyword matches lower priority than the phrase matches above.
Keyword matches could be: "today", "wonderful", "evening", etc.
Is there any way to achieve this in Solr, given that Solr works on an inverted index of the words in a sentence?
You can use a separate field with a ShingleFilter defined - this will combine runs of tokens into separate tokens, so that "Have wonderful evening today here" is indexed as "have wonderful", "wonderful evening", "evening today" and "today here".
Make hits in this field a higher priority than hits in your regular search field by using qf=shinglefield^<boostvalue> - the exact boost value depends on the scoring profile of your index and on whether you're doing other boosts.
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="2" outputUnigrams="false"/>
</analyzer>
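A request using the boosted shingle field could be assembled like this. The regular field name `text` and the boost of 5 are placeholders picked for this sketch, and `build_params` is a hypothetical helper; only `shinglefield` and the `qf` syntax come from the answer above.

```python
from urllib.parse import urlencode, parse_qs


def build_params(user_query, shingle_boost=5):
    """Build edismax parameters that weight the shingle field above the regular field."""
    return {
        "defType": "edismax",
        "q": user_query,
        # "text" is an assumed name for the regular keyword field.
        "qf": f"text shinglefield^{shingle_boost}",
    }


query_string = urlencode(build_params("wonderful evening"))
print(query_string)
print(parse_qs(query_string)["qf"])  # ['text shinglefield^5']
```

The boost value would need tuning against real data, as the answer notes.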

SOLR eDISMAX product search

I'm new to SOLR and am implementing it to search our product catalog. I'm creating ngrams and edge ngrams on the brand name, display name and category fields.
I'm using edismax and have qf defined as displayname_nge displayname_ng category_nge category_ng brandname_nge brandname_ng.
When I search for 'vitamin c' (without the quotes) I get all of the vitamins. If I surround it with quotes then I only get vitamin c. The problem is that I can't always surround the query string with quotes because a person might enter 'chewable vitamin c', or 'vendor x vitamin c'. I've tried the mm parameter without luck. I've also tried applying different boost levels and still not getting the expected results.
Any suggestions would be greatly appreciated. Thank you
Was there a reason for using only ngrams fields for searching? I'm not sure this is the problem in your case, but you may want to look at your ngrams analysis configuration in schema.xml. One from one of my indexes looks like this:
<fieldType name="ngram" class="solr.TextField" >
<analyzer type="index">
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
Though you can see this is actually using the safer EdgeNGramFilterFactory, the important thing to note here is minGramSize="2". This means that during the indexing process only grams of at least two characters will be created. The word 'c'? That doesn't get any grams at all. While you could set minGramSize="1" and rebuild your index, single character grams are a very bad idea, as your search for 'c' would match against any document with a word that starts with 'c' (or contains the letter 'c' with NGramFilterFactory).
If you're currently using NGrams with minGramSize="2", a search for 'ca' would find any documents with any words containing the letters 'ca' consecutively in that order. This may not be exactly what you want, either.
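The gram sizes are easier to reason about with a small sketch. These Python helpers approximate what the edge n-gram and n-gram filters emit for a single lowercased word; the function names are invented here and the ordering of the real filters' output may differ.

```python
def edge_ngrams(word, min_gram=2, max_gram=15):
    """Prefixes of the word whose length is between min_gram and max_gram."""
    return [word[:n] for n in range(min_gram, min(max_gram, len(word)) + 1)]


def ngrams(word, min_gram=2, max_gram=15):
    """All substrings of the word whose length is between min_gram and max_gram."""
    return [
        word[i:i + n]
        for n in range(min_gram, min(max_gram, len(word)) + 1)
        for i in range(len(word) - n + 1)
    ]


print(edge_ngrams("vitamin"))        # ['vi', 'vit', 'vita', 'vitam', 'vitami', 'vitamin']
print(edge_ngrams("c"))              # [] because 'c' is shorter than min_gram
print("ca" in ngrams("scale"))       # True: 'ca' occurs inside the word
print("ca" in edge_ngrams("scale"))  # False: 'ca' is not a prefix
```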
My top suggestion would be to drop the ngrams in favor of a more vanilla Text field. Whether you want to keep the edge-ngrams around for better truncation support is up to you, but I suspect you'll have better luck if the Text field is at least in the mix.
You could also take a look at this question on StackOverflow: "Can I protect short words from an n-gram filter in Solr?" if you want to pursue the ngrams further.
Also, you should consider using Solr's built-in analysis tool to figure out where your searches are failing. You choose a field or fieldType, and provide values for what was entered into the index and what is being searched. It will show you how the analysis works against both values so you can see how each string is broken down and why it does or doesn't create matching tokens. The URL for the tool depends on whether you're in a multi-core environment, but if you go to Solr's web interface you should be able to find the Analysis link on the left.
Update:
Now that I have a little more detail from you and am thinking about it again, the results you're getting are very explainable.
With minGramSize="1", your unquoted search for 'vitamin c' is looking for records with the word 'vitamin' (or a longer word containing 'vitamin'), and the word 'c' (or a longer word containing 'c'). Since most records are likely to have a 'c' somewhere, this is hardly a limiting factor and your results will be very close to or exactly the same as your results for just the word 'vitamin'.
In the quoted search for 'vitamin c', the 'c' now has to appear in a word immediately following vitamin, making it a much more useful search, but still not great. You should be able to test this by finding records that have a word following vitamin that isn't a vitamin designation. For example, a record mentioning "vitamin tablets" should be found when searching for "vitamin b" (because there's a 'b' in "tablets"), and a record mentioning "vitamin chart" or "vitamin deficiency" should be found when searching for "vitamin c".
The upshot of this is that I strongly recommend having a set of fields for searching separate from your fields for autocomplete. The NGrams with minGramSize="1" are just not going to give you reasonable results for the actual search step.
Another option is edismax's mm parameter, which sets the percentage of query terms that must match. With 100%, every term must match, which gives exact results; 75% will give you the broader list of vitamins. You can set the percentage programmatically according to your needs.
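To get a feel for what a percentage means for short queries, this Python sketch computes how many terms would have to match. It assumes a plain positive percentage that is rounded down, which is how I understand the documented mm behaviour; `required_clauses` is a made-up helper and does not cover negative or conditional mm expressions.

```python
def required_clauses(num_optional_clauses, mm_percent):
    """Number of clauses that must match for a positive mm percentage, rounded down."""
    return num_optional_clauses * mm_percent // 100


print(required_clauses(2, 100))  # 2: both 'vitamin' and 'c' must match
print(required_clauses(2, 75))   # 1: either term is enough
print(required_clauses(3, 75))   # 2: e.g. two of 'chewable', 'vitamin', 'c'
```

With only two terms, anything below 100% already lets a single term satisfy the query, which is why 75% returns the whole vitamin list.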
You may consider rewriting the query this way: "vitamin c" vitamin c (the quoted phrase followed by the individual terms). In that case, records matching the phrase "vitamin c" get a higher score than those matching 'vitamin' and 'c' separately, while the search still returns all matching records. Please see if this helps, and feel free to comment.
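The rewrite can be done in the application layer before the query is sent. A minimal Python sketch, assuming the clauses are optional (OR semantics) and that stray double quotes in the user input can simply be dropped; `phrase_plus_terms` is a hypothetical helper name.

```python
def phrase_plus_terms(user_query):
    """Return the query as a quoted phrase followed by the individual terms."""
    # Drop double quotes and collapse whitespace so the phrase stays well-formed.
    cleaned = " ".join(user_query.replace('"', " ").split())
    return f'"{cleaned}" {cleaned}'


print(phrase_plus_terms("vitamin c"))           # "vitamin c" vitamin c
print(phrase_plus_terms("chewable vitamin c"))  # "chewable vitamin c" chewable vitamin c
```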

How to work with solr phrases

I'm using Solr 4.1.0 and I'm trying to get common word phrase search to work. This means when searching for "the cat" I want documents containing this phrase to be shown, but not documents containing "the" and "cat" somewhere or in different fields.
What I have:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords.txt" format="snowball" />
<filter class="solr.StopFilterFactory" words="lang/stopwords.txt" format="snowball" enablePositionIncrements="true" />
</analyzer>
</fieldType>
This should output special gram tokens when a "normal" word is combined with a stopword from stopwords.txt. In analyze view this works as expected, so "the cat" gets common-grammed to "the_cat cat".
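The combined effect of the two filters can be approximated in Python. This sketch ignores token positions and is only meant to show which terms survive; `common_grams` and `remove_stopwords` are illustrative names, and the word set passed in is a stand-in for the contents of stopwords.txt.

```python
def common_grams(tokens, common_words):
    """Emit each token, plus a joined bigram whenever either neighbour is a common word."""
    out = []
    for i, tok in enumerate(tokens):
        out.append(tok)
        if i + 1 < len(tokens) and (tok in common_words or tokens[i + 1] in common_words):
            out.append(f"{tok}_{tokens[i + 1]}")
    return out


def remove_stopwords(tokens, stopwords):
    """Drop plain stop words; joined grams such as 'the_cat' are kept."""
    return [t for t in tokens if t not in stopwords]


stop = {"the"}
print(remove_stopwords(common_grams(["the", "cat"], stop), stop))          # ['the_cat', 'cat']
print(remove_stopwords(common_grams(["the", "nice", "cat"], stop), stop))  # ['the_nice', 'nice', 'cat']
```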
The solution my client is after is that when stop words in the query are used in conjunction with normal words, only elements with this exact phrase (stop-word-2-shingle) should match. The overall default operator is still AND.
For example, I have documents with the following fields
id: 1; title: my cat in its natural surroundings; desc: the nicest animal in da world is a cat
id: 2; title: the cat is evil; desc: everyone knows that cats are pure evil
id: 3; title: cat solving mysteries; desc: our cat is called Sherlock
The following are examples of what I'd like to achieve... basically the users are more or less illiterate with respect to searches and queries and operators, thus the search should interpret the input and "do the right thing". The right thing would be:
input: cat
result: docs 1, 2, 3 (w/o scoring for the sake of easiness)
input: cat world
result: doc 1
AND is default
input: cat everyone
result: doc 2
AND spanning multiple fields
input: the cat
result: doc 1
because only this field contains the phrase "the cat", which somehow has to magically appear during the query
input: the nice cat
result: []
because no document contains the phrase "the nice" and the algorithm would interpret this as a common word phrase
input: the cat world
result: doc 1
input: the pure
result: []
The reasoning behind this is that the client has some specific ideas regarding some (carefully selected) stop words.
So is this a realistic way of doing it? Is it necessary to do some kind of query pre-parsing before passing it to solr? Are there other ways to achieve the desired results?

Resources