Solr filter for apostrophes - allow search both with and without the apostrophe - solr

Using Solr 9.
I'd like the same results to return for the terms
Lowe's
as well as
Lowes
I can't seem to find the correct combination with this filter:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.LowerCaseFilterFactory" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
</analyzer>
</fieldType>
When testing in Solr's analyzer, I would expect that
<filter class="solr.KStemFilterFactory"/>
would remove the s from the Lowes example in the query, thus matching Lowe in the index step.
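One possible approach, offered as an untested sketch rather than a confirmed fix for this schema: strip apostrophes with a char filter before tokenization, so that Lowe's and Lowes reach the tokenizer as the same text and whatever the stemmer does afterwards applies to both equally. solr.PatternReplaceCharFilterFactory is a standard Solr char filter; the pattern (straight and curly apostrophes) is an assumption about the input. The filter order shown (lowercasing before stop words and stemming) is also an assumption.

```xml
<analyzer type="index">
  <!-- Assumption: strip straight (') and curly (’) apostrophes so "Lowe's" becomes "Lowes" before tokenizing -->
  <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="['’]" replacement=""/>
  <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <!-- Lowercase first so the stop word and stemming steps see normalized tokens -->
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <filter class="solr.KStemFilterFactory"/>
</analyzer>
```

The same charFilter would need to appear in the query analyzer as well, and existing documents would have to be reindexed. The trade-off is that every apostrophe is removed, so don't is indexed as dont.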

Related

SynonymGraphFilterFactory with ShingleFilterFactory missing original string

Synonym.txt
Phone Case,Mobile cover,Mobile case
Managed-Schema
<fieldType name="standardField" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,])" replacement=" " replace="all"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,])" replacement=" " replace="all"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
Document
Mobile cover
Search Term
Mobile cover
Issue
I'm not sure whether this is caused by a bad configuration of SynonymGraphFilterFactory or of ShingleFilterFactory. The index never contains the full string as a value to match: the original string mobile cover never appears in the output of the filter chain, so when I search for mobile cover I get no documents.
Analysis
Update I
I added solr.FlattenGraphFilterFactory after solr.SynonymGraphFilterFactory, as recommended by the Solr docs. The char index on the synonym output now looks good to me, but I still have the same issue.

Solr nGram Filter minGramSize - token for word with only 2 characters

I am working with Solr and would like to understand how EdgeNGramFilterFactory works.
For example, I am searching for the term "1 tb". Please note that I have a few products with attributes in the fields I am searching on.
Here is the filter applied at index time for this field type.
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" />
Now, when I search for the term "1 tb", I do not get the desired results.
I have a few products with the term "5 MegaPixels", and when I search for "5 meg" I get results. Later I found that it does not work for "5 me", which returns no results.
If I remove the nGram filter, it works fine. Moreover, if I set minGramSize="1", the query "1 tb" works fine.
I was assuming that for the term tb, the token tb should be valid, but it seems that it is not created when I apply a minGramSize of 2.
Can someone explain why?
Here is the field defined in schema.
<fieldType name="AttributesField" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Can you try the fieldType below?
<fieldType name="AttributesField" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
PorterStemFilterFactory applies a normalization process that removes common endings from words.
Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
With the configuration above, the string value Nigerian is broken down into the following terms:
Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
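As for the "1 tb" query in the question, a likely explanation is that the token 1 is shorter than minGramSize="2", so the edge n-gram filter emits nothing for it, while tb (exactly two characters) is still indexed. A minimal sketch follows. It assumes a Solr version whose EdgeNGramFilterFactory supports the preserveOriginal attribute, which should be checked against the reference guide for your release. The setting keeps such short tokens in the index:

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Assumption: preserveOriginal is available in this Solr version.
       It also emits the unmodified token, so tokens shorter than
       minGramSize (such as "1") are still indexed. -->
  <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" preserveOriginal="true"/>
</analyzer>
```

Setting minGramSize="1", as the question already found, has a similar effect for single-character tokens, at the cost of indexing a one-character gram for every token.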

Solr Porter stemming not returning results

My schema is below. I have added PorterStemFilterFactory to schema.xml. I restarted Solr and reimported the data, but it is still not working:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" tokenizerFactory="solr.StandardTokenizerFactory" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>

Solr stop words replaced with _ symbol

I have problems with Solr stop words in my autosuggest. All stop words were replaced by the _ symbol.
For example, I have the text "the simple text in" in the field "deal_title". When I search for the word "simple", Solr shows me the result "_ simple text _", but I expect "simple text".
Could someone explain why it works this way and how to fix it?
Here is part of my schema.xml
<fieldType class="solr.TextField" name="text_auto">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false" />
</analyzer>
<analyzer type="query">
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
<field name="deal_title" type="text_auto" indexed="true" stored="true" required="false" multiValued="false"/>
<fieldType name="text_general" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
My solution to this in Solr 6.3 (where enablePositionIncrements="false" isn't possible anymore) was to:
remove stopwords
shingle with fillerToken="" (which removes the _)
remove leading and trailing spaces
remove duplicates
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.ShingleFilterFactory" fillerToken=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(^ | $)" replacement=""/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
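For context, here is a sketch of how those four steps might slot into the index analyzer of the text_auto type from the question. The char filter, tokenizer, stopwords.txt and maxShingleSize are carried over from the question; the placement of LowerCaseFilterFactory before the stop filter is an assumption, and the chain as a whole is untested.

```xml
<analyzer type="index">
  <charFilter class="solr.HTMLStripCharFilterFactory"/>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- fillerToken="" replaces the default "_" placeholder with an empty string -->
  <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" fillerToken=""/>
  <!-- Removes the space left at the start or end of a shingle where a stop word used to be -->
  <filter class="solr.PatternReplaceFilterFactory" pattern="(^ | $)" replacement=""/>
  <!-- Trimming can produce identical tokens at the same position; drop the repeats -->
  <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```

The shingles appear with _ because the stop filter leaves position gaps where words were removed, and the shingle filter fills each gap with its filler token.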
To fix this you need to use <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="false" /> and set <luceneMatchVersion>4.3</luceneMatchVersion> in solrconfig.xml.

retrieve ngrams in solr for a particular word

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2" outputUnigrams="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2" outputUnigrams="true"/>
<solrQueryParser defaultOperator="OR" />
</analyzer>
</fieldType>
I am using the ShingleFilterFactory to create ngrams. Now I want to retrieve all the ngrams for a particular word.
Suppose I enter "night"; then I want all the ngrams containing the word night.
Right now I am getting only the top results from all the ngrams in my documents with the query below:
http://localhost/solr/admin/luke?fl=text&numTerms=50000&wt=json
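One option that may fit better than the Luke handler is Solr's TermsComponent, which can restrict the returned terms with a regular expression. The fragment below is a sketch for solrconfig.xml under several assumptions: the field name text comes from the fl=text parameter in the request above, the handler name /terms is only the conventional one, and the regex is one guess at what "ngrams containing the word night" means (shingles joined by single spaces).

```xml
<searchComponent name="terms" class="solr.TermsComponent"/>

<requestHandler name="/terms" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <bool name="terms">true</bool>
    <bool name="distrib">false</bool>
    <!-- Field whose indexed terms (the shingles) are listed -->
    <str name="terms.fl">text</str>
    <!-- Assumption: match terms where "night" is a whole word, e.g. "night", "last night", "night sky" -->
    <str name="terms.regex">(.* )?night( .*)?</str>
    <!-- A negative limit lifts the default cap on the number of terms returned -->
    <int name="terms.limit">-1</int>
  </lst>
  <arr name="components">
    <str>terms</str>
  </arr>
</requestHandler>
```

The same parameters (terms.fl, terms.regex, terms.limit) can instead be passed on the request URL if the word changes per query.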

Resources