Searching for Solr stop words

One of my Solr fields is configured in the following manner:
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
This works in cases where I don't want stemming, but a new use case is causing a problem. People are beginning to search for the following combinations:
The Ivy: In this case, results containing just "ivy" are returned, when the expected results would include "The". I understand that this is because of the stop word, but is there a way to achieve this? For example, if they search for "the ivy" in quotes, then this should work.
(Mom & Me) OR ("mom and me"): In this case too, either the & is dropped, or results that contain both "mom" and "me" anywhere in the text are returned.
I am OK if only new data behaves correctly, as I wouldn't be able to reindex. Also, would changing the schema.xml file trigger a full replication?
Regards,
Ayush

You are using the whitespace tokenizer.
So "The Ivy" is split into 2 words.
You could use a less aggressive tokenizer, followed by the WordDelimiterFilterFactory with its protected="protwords.txt" option enabled, where you can list "the ivy" as a protected word so that Solr will not split it.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
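To make the effect on "The Ivy" concrete, here is a rough Python sketch of whitespace tokenization followed by stop word removal. It is only a simulation, not Solr code: the stop word set is an assumption (the actual stopwords.txt is not shown in the question), and lowercasing is folded into the same step for brevity.

```python
def analyze(text, stopwords):
    """Simulate a whitespace tokenizer followed by a case-insensitive stop filter.

    Returns (term, position) pairs. Positions keep counting over removed
    stop words, similar in spirit to enablePositionIncrements="true".
    """
    tokens = []
    position = 0
    for raw in text.split():
        position += 1
        term = raw.lower()
        if term in stopwords:
            continue  # the stop word is dropped, but its position is still consumed
        tokens.append((term, position))
    return tokens


# Assumed stop word list, for illustration only.
STOPWORDS = {"the", "and"}

print(analyze("The Ivy", STOPWORDS))
print(analyze("ivy", STOPWORDS))
```

Both inputs end up with the single term "ivy", which is why a search for The Ivy also matches documents that contain only "ivy", even inside a quoted phrase.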

Related

How to configure SOLR for handling similar search keywords: `ABCWord`, `abc word`?

I have content with two title types: ABCWord and ABC Word. When I enter keywords such as abc-word or abc word in the search box, content titled ABC Word is found, but I also need to find content titled ABCWord.
I've tried to use solr.EdgeNGramFilterFactory and solr.WordDelimiterFilterFactory for this, but it seems I'm using them wrong.
My current schema.xml text field configuration:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
</fieldType>
You aren't using it wrong, but you may be using too many filters, and they affect the final result.
The EdgeNGram filter should solve your problem, since it creates tokens of length 3 to 30 from your input. So "ABCWord" becomes "abc", "abcw", "abcwo", "abcwor" and "abcword", and a search for "abc" should then match.
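As a rough illustration of what front edge n-grams look like, here is a small Python sketch. It only mimics the idea of the filter with minGramSize=3 and maxGramSize=30; the function name is made up for this example, and lowercasing stands in for the LowerCaseFilterFactory earlier in the chain.

```python
def edge_ngrams(token, min_gram=3, max_gram=30):
    """Return the front edge n-grams of a token, lowercased first."""
    token = token.lower()
    longest = min(max_gram, len(token))
    # One gram per length, always anchored at the start of the token.
    return [token[:size] for size in range(min_gram, longest + 1)]


grams = edge_ngrams("ABCWord")
print(grams)           # ['abc', 'abcw', 'abcwo', 'abcwor', 'abcword']
print("abc" in grams)  # True
```

A query term matches the document when it equals one of the indexed grams, which is why "abc" finds "ABCWord" here.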
First of all, I'd recommend not adding n-grams to the fieldType you are using now, because they will increase your index size a lot. It's better to create a new field type and use it only for the fields that really need it, instead of the "text" fieldType, which probably also indexes other values where you don't need n-grams.
Second, if your analyzer definition can be the same at index and query time, you don't need to duplicate the config: just use 'analyzer' instead of 'analyzer type="index"' and 'analyzer type="query"'.
I strongly recommend checking the Analysis tab in your Solr admin to see how Solr processes the indexed and queried text for your input. You can also remove some of the filters from your fieldType config while you work toward a specific result. It's better to understand what each filter does to your input.

SOLR stop word: words with 'of' give no results, but when of is excluded we get correct results

Can anyone explain how stop words in Solr work?
In my stopword.txt I have defined "of". In schema.xml I have
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
Now when I search, anything that contains the word "of" does not show up in the results.
Example: oil of olay shows no results, whereas oil olay returns correct results.
More of the field definition:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
splitOnCaseChange="0"
splitOnNumerics="0"
types="wdtypes.txt"
/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.TrimFilterFactory" updateOffsets="false"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
splitOnCaseChange="0"
splitOnNumerics="0"
types="wdtypes.txt"
/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
When debugging:
+(upclist:cream+of+wheat&qt=productresults&rows=10&fq=status%3AActive&fq=facilitystatus%3AActive&fq=facilityid%3A100&fq=inventoryctrlcode%3A%5B0+TO+100%5D&fq=weblifecycle%3A%283+OR+4%29&fq=groupnumber%3A2^1.2 | keywords:cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2^20.0 | product_elevate:cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2^5.0 | area:"(cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2 cream) of wheat qt productresult (row creamofwheatqtproductresultsrow) 10 fq status%3aactive fq facilitystatus%3aactive fq facilityid%3a100 fq inventoryctrlcode%3a%5b0 (to fqstatus%3aactivefqfacilitystatus%3aactivefqfacilityid%3a100fqinventoryctrlcode%3a%5b0to) 100%5d fq weblifecycle%3a%283 (or fqweblifecycle%3a%283or) 4%29 fq (groupnumber%3a2 fqgroupnumber%3a2 creamofwheatqtproductresultsrows10fqstatus%3aactivefqfacilitystatus%3aactivefqfacilityid%3a100fqinventoryctrlcode%3a%5b0to100%5dfqweblifecycle%3a%283or4%29fqgroupnumber%3a2)"~3^2.5 | productid:cream+of+wheat&qt=productresults&rows=10&fq=status%3AActive&fq=facilitystatus%3AActive&fq=facilityid%3A100&fq=inventoryctrlcode%3A%5B0+TO+100%5D&fq=weblifecycle%3A%283+OR+4%29&fq=groupnumber%3A2^1.7 | productname:cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2^10.0)~0.01 ()
This might not be relevant, since you say you were searching on only one field (I'm posting it anyway because you say you are using edismax and qf). I had a similar issue when I wanted to boost an exact match, so I set qf to something like this: <str name="qf">title^45 title_str^55</str>. The title field used stop words and title_str obviously did not. The reason it would often fail to find searches containing stop words is described here. Their solution was to adjust the mm value. The solution that worked in my case was to put title_str in the pf parameter (and remove it from qf), so that exact matches come to the top.
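For anyone wondering how the mm values in this thread (2<-25% and 2<-36%) play out, here is a small Python sketch of the arithmetic as I understand it from the Solr documentation: for a spec of the form N<P%, queries with N or fewer optional clauses require all of them, and above that a negative percentage says how many clauses may be missing, rounded down. The function is hypothetical and handles only this one form of the syntax.

```python
def min_should_match(spec, clause_count):
    """Evaluate a conditional mm spec of the form 'N<P%' for a given clause count."""
    upper, _, rest = spec.partition("<")
    if clause_count <= int(upper):
        return clause_count  # at or below the threshold, every clause is required
    percent = float(rest.rstrip("%"))
    calc = clause_count * percent / 100.0
    if calc < 0:
        # Negative percentage: this share of clauses may be missing (truncated).
        return clause_count + int(calc)
    return int(calc)


# A three-word query such as "oil of olay", assuming three optional clauses.
print(min_should_match("2<-25%", 3))  # 3
print(min_should_match("2<-36%", 3))  # 2
```

With 2<-25%, 25% of 3 is 0.75, which rounds down to 0 clauses allowed missing, so all three clauses must match, including the one for "of". With 2<-36%, one clause may be missing.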
I finally resolved this issue by changing "mm" from 2<-25% to 2<-36%.

Solr set more relevance in position of string

How can I make Solr give more relevance to words based on their position in the string?
For example, if I search for "Macbook", the first results are like "Case Logic LAPS-113 13.3-Inch Laptop / MacBook Air", and only after that "Apple MacBook Pro MD101LL/A 13.3-Inch".
This is my field declaration:
<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
</fieldType>
What if the product name were "MacBook/Dell/Lenovo Laptop cheap case"? It contains MacBook in the first position; would you still want to boost this document?
I think you should fix the root cause instead: the common problem of accessories (such as 'case', 'battery', 'lock', etc.) scoring better than the products themselves.
The obvious best choice: index a field that says whether the doc is an accessory (I gather you don't have that info, otherwise this is the best way), and boost the ones that are not accessories.
If you don't have that info, you can try penalizing the docs that contain 'typical' accessory words. For this you need to build such a list, but that is not hard. I have used this approach with good results.
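One way to express such a penalty, sketched in Python: since a boost query cannot directly lower a score, a common idiom is to boost every document that does not contain the accessory words. The field name "name", the word list, and the boost value below are assumptions for illustration, and the words are not escaped, so this only suits plain alphanumeric terms.

```python
def accessory_penalty_bq(field, accessory_words, boost):
    """Build a boost query that favours docs NOT containing any accessory word."""
    if not accessory_words:
        raise ValueError("accessory_words must not be empty")
    words = " OR ".join(accessory_words)
    # (*:* -field:(...)) matches every document except those with an accessory word.
    return f"(*:* -{field}:({words}))^{boost}"


# Hypothetical field name and word list.
bq = accessory_penalty_bq("name", ["case", "battery", "lock"], 10)
print(bq)  # (*:* -name:(case OR battery OR lock))^10
```

The resulting string could be passed as the bq parameter of a dismax or edismax request; the boost value would need tuning against real queries.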

Solr WildCard Search Issue

I am facing an issue with Solr search. Wildcard search seems to work fine, but there are issues when I try to find terms inside another word. For example, with "rtebiggestBug", when I search for biggest, I get no results. I have the following entries in the schema.xml file:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" preserveOriginal="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Another issue is that it does not find strings at the end of a word. Example: I searched for "bug" and found "bugs" but not "samplebug".
It would be really helpful if you could help me with this issue.
Thanks in advance.
By default Solr does not support left truncation, such as searching for *bug to find samplebug.
Use solr.ReversedWildcardFilterFactory to reverse the term and index it in reversed form, like gubelpmas.
Here is a tutorial: http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
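The idea behind indexing reversed terms can be sketched in a few lines of Python. This is a simplified simulation, not how the filter is implemented internally: a leading-wildcard search such as *bug turns into a cheap prefix lookup for gub on the reversed terms.

```python
from bisect import bisect_left


def build_reversed_index(terms):
    """Store every term reversed, in sorted order, like a sorted term dictionary."""
    return sorted(term[::-1] for term in terms)


def find_by_suffix(reversed_index, suffix):
    """Answer a '*suffix' query as a prefix scan over the reversed terms."""
    prefix = suffix[::-1]
    start = bisect_left(reversed_index, prefix)
    matches = []
    for reversed_term in reversed_index[start:]:
        if not reversed_term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        matches.append(reversed_term[::-1])
    return matches


index = build_reversed_index(["samplebug", "bugs", "rtebiggestbug", "sample"])
print(find_by_suffix(index, "bug"))  # ['samplebug', 'rtebiggestbug']
```

Note that this helps with terms ending in the search string; finding a string in the middle of a word is a different problem.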
It seems your query parser is not able to handle leading wildcard searches.
Which query parser are you using?
The Extended DisMax query parser supports searches with leading wildcards. You may want to check whether you are using it.

"protected phrase" in Solr

A customer of mine is a photo agency specialized in photojournalism (well, and gossip), so many of their customers' searches revolve around specific people.
We index about 1.5m documents, with full-text search on headline and caption; and full-text search without stemming on tags. We have a decent list of stop words, and they provide a list of protected words that they feel are not stemmed correctly.
We are using Dismax to search over headline, caption and tags, with different boosts.
This is all working pretty nicely.
However, a few people are proving tricky to get right. For instance, Al Gore. In Italian "al" is a stop word, so a simple query for 'al gore' (without quotes) becomes:
+((DisjunctionMaxQuery((caption_text:gor | tags_text:gore^100.0 | headline_text:gor)))~1) ()
That does return hits for the ex VP, but of course also for "Lesley Gore" and "Tipper Gore"; and also, because of stemming, hits for "Gori" and more.
Leaving aside sorting for a second, it does clutter up results, and I'd like to do better.
Wrapping the search terms in quotes doesn't help, "al" gets stripped away anyway.
Marking "gore" as a protected word gets me halfway there, limiting the number of false positives.
I tried playing with SynonymFilterFactory too, but didn't get too far--I have the SynonymFilterFactory as the first filter, so "al" gets removed anyway.
What I think I really need is a way of tokenizing "al gore" as a single token. Is there anything that will allow me to do that, for a set of configurable "phrases"?
Is there another approach I'm overlooking? solr.CommonGramsFilterFactory perhaps?
Some more background info: we are using Solr 1.4.0.
Relevant portions of schema.xml
<!-- used for headline and caption -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="tagsText" class="solr.TextField" sortMissingLast="true" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Have you looked into the CommonGramsFilterFactory?
It will combine multiple tokens into a single token, and is usually used when searching for a phrase that contains stop words.
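To show roughly what that looks like for "al gore", here is a Python sketch of the common-grams idea: every token is kept, and a joined token is added whenever a common word sits next to another token. It is a simplified simulation with a made-up function name, and it assumes "al" is in the common words list and is not removed by a stop filter earlier in the chain.

```python
def common_grams(tokens, common_words):
    """Emit each token, plus a 'left_right' bigram when either side is a common word."""
    output = []
    for i, token in enumerate(tokens):
        output.append(token)
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            if token in common_words or following in common_words:
                output.append(f"{token}_{following}")
    return output


COMMON = {"al"}  # assumed common words list, for illustration only

print(common_grams(["al", "gore"], COMMON))      # ['al', 'al_gore', 'gore']
print(common_grams(["tipper", "gore"], COMMON))  # ['tipper', 'gore']
```

Only the first document gets the al_gore token, so a phrase search that targets that token would no longer match "Tipper Gore" or "Lesley Gore".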
