Solr nGram Filter minGramSize - token for word with only 2 characters

I am working with Solr and would like to understand how EdgeNGramFilterFactory works.
For example, I am searching for the term "1 tb". Please note that I have a few products with attributes in the fields I'm searching on.
Here is the filter applied at index time for this field type:
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" />
Now, when I search for the term "1 tb", I do not get the desired results.
I have a few products with the term "5 MegaPixels", and when I search for "5 meg" I get results. Later I found that it does not work for "5 me", which returns no results.
If I remove the nGram filter, it works fine. Moreover, if I set minGramSize="1", the query "1 tb" works fine.
I was assuming that for the term tb, the token tb would be valid. But it seems that it is not created when I apply a minGramSize of 2!
Can someone explain why?
Here is the field type defined in the schema:
<fieldType name="AttributesField" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>

Can you try the fieldType below?
<fieldType name="AttributesField" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
PorterStemFilterFactory: it performs a normalization step that removes common endings from words.
Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
With the configuration above, the string value Nigerian gets broken down into the following terms:
Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigerian"
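As a rough illustration of why "1 tb" can fail with minGramSize="2", here is a minimal Python sketch of front-edge n-gram generation. It assumes the filter emits nothing for a token shorter than minGramSize, and it replaces the real analyzer chain (StandardTokenizer, stop filter, stemmer) with a plain whitespace split and lowercasing. The function names edge_ngrams and index_terms are made up for this example.

```python
def edge_ngrams(token: str, min_gram: int, max_gram: int) -> list[str]:
    """Front-edge n-grams of one token, shortest first.

    A token shorter than min_gram yields an empty list, which models
    the assumption that such tokens are dropped from the index.
    """
    longest = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, longest + 1)]


def index_terms(text: str, min_gram: int, max_gram: int) -> list[str]:
    """Simplified index-time chain: split on whitespace, lowercase, n-gram."""
    terms = []
    for token in text.lower().split():
        terms.extend(edge_ngrams(token, min_gram, max_gram))
    return terms


print(edge_ngrams("nigerian", 2, 15))
# ['ni', 'nig', 'nige', 'niger', 'nigeri', 'nigeria', 'nigerian']
print(index_terms("1 TB", 2, 10))  # ['tb']: the one-character token "1" is gone
print(index_terms("1 TB", 1, 10))  # ['1', 't', 'tb']
```

Under these assumptions, the token tb is indexed with minGramSize="2", but the token 1 is not. If every term of the query "1 tb" is required to match, the missing 1 would be enough to return no results.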

Related

Solr filter for apostrophes - allow search both with and without apostrophe

Using Solr 9.
I'd like the same results to return for the terms
Lowe's
as well as
Lowes
I can't seem to find the correct combination with this filter:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.LowerCaseFilterFactory" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
</analyzer>
</fieldType>
When testing in Solr's analyzer, I would expect that
<filter class="solr.KStemFilterFactory"/>
would remove the s from the Lowes example in the query, thus matching Lowe in the index step.
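To make the expectation concrete, here is a small Python sketch of trailing possessive removal, modelled on how EnglishPossessiveFilterFactory is understood to behave (it strips a final 's). It leaves out tokenization details and the KStem step, so it only shows what possessive removal alone does to the two inputs. strip_possessive and analyze are hypothetical helper names.

```python
# Apostrophe variants treated as possessive markers in this sketch.
APOSTROPHES = ("'", "\u2019", "\uff07")


def strip_possessive(token: str) -> str:
    """Drop a trailing apostrophe + s (or S); leave other tokens unchanged."""
    if len(token) >= 2 and token[-1] in "sS" and token[-2] in APOSTROPHES:
        return token[:-2]
    return token


def analyze(text: str) -> list[str]:
    """Whitespace split, possessive removal, lowercasing (no stemming)."""
    return [strip_possessive(t).lower() for t in text.split()]


print(analyze("Lowe's"))  # ['lowe']
print(analyze("Lowes"))   # ['lowes']
```

With possessive removal only, the two inputs end up as different terms, so whether they match depends entirely on what the stemmer does with lowes.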

SynonymGraphFilterFactory with ShingleFilterFactory missing original string

synonyms.txt
Phone Case,Mobile cover,Mobile case
Managed-Schema
<fieldType name="standardField" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,])" replacement=" " replace="all"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,])" replacement=" " replace="all"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
Document
Mobile cover
Search Term
Mobile cover
Issue
I'm not sure whether this is caused by a bad configuration of SynonymGraphFilterFactory or of ShingleFilterFactory. The index never contains the full string as a value to match against: after all the filters have run, the original string mobile cover is never present, so when I search for mobile cover I get no documents.
Analysis
Update I
I added solr.FlattenGraphFilterFactory after solr.SynonymGraphFilterFactory, as recommended by the Solr docs. The character offsets in the synonym output now look good to me, but I still have the same issue.

Stop Solr from stemming the synonyms.txt file

In my synonyms.txt file I have the entry
marine => saltwater,marine, but both words are getting stemmed, to 'saltwat' and 'marin' respectively, in spite of being in the protected words file. Is there a way to avoid this?
schema.xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" analyzer="org.apache.lucene.analysis.en.EnglishAnalyzer" />
</analyzer>
</fieldType>
synonyms.txt
marine => saltwater,marine
protwords.txt
saltwater
marine
Now when I run the analysis in the admin panel and query for saltwat, saltwat | marin comes up, which means that saltwater did get stemmed to saltwat in the synonyms.txt file.
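Solr analysis runs the filters in the same order in which you declare them inside the fieldType definition in the schema. So, if you declare a stem filter after the synonym filter, it is applied to the output of the synonym filter. If you don't want this, configure the synonym filter after the stem filter. Note also that the analyzer attribute on SynonymGraphFilterFactory is used to parse the entries of synonyms.txt, and EnglishAnalyzer includes a stemmer, so the attribute is dropped here. For example: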
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>
</fieldType>
I recommend using the Analysis tool in the Solr Admin UI to check what's going on with your field at both index and query time.
Please share your schema if you need more help.
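The ordering point can be illustrated with a small Python sketch that runs the same two steps in both orders. This is a toy model, not Solr code: toy_stem is a hypothetical suffix stripper standing in for a real stemmer (it is not the Porter algorithm), SYNONYMS mirrors the single rule from synonyms.txt, and protected words are ignored here.

```python
def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a stemmer: strips 'er' or a final 'e'."""
    for suffix in ("er", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


# Mirrors the rule "marine => saltwater,marine".
SYNONYMS = {"marine": ["saltwater", "marine"]}


def expand_synonyms(tokens: list[str]) -> list[str]:
    """Replace each token that has a rule with the rule's right-hand side."""
    out = []
    for token in tokens:
        out.extend(SYNONYMS.get(token, [token]))
    return out


def stem_all(tokens: list[str]) -> list[str]:
    return [toy_stem(t) for t in tokens]


def run_chain(tokens: list[str], filters) -> list[str]:
    """Apply the filters strictly in the order given, like an analyzer chain."""
    for f in filters:
        tokens = f(tokens)
    return tokens


print(run_chain(["marine"], [expand_synonyms, stem_all]))  # ['saltwat', 'marin']
print(run_chain(["marine"], [stem_all, expand_synonyms]))  # ['marin']
```

In the second order the synonym output is never stemmed, but the rule no longer fires at all, because the lookup now sees the stemmed token marin. In this toy model, putting the synonym filter after the stemmer only works if the incoming token still matches the left-hand side of the rule, for example because it was protected from stemming.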
Protwords (protected words) are words that would be stemmed by the English Porter stemmer but that you do not want to be stemmed.
A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>
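A minimal Python sketch of the idea behind this chain: a marker step flags protected tokens, and the stemmer skips anything that is flagged. The names PROTECTED, mark_keywords and stem_unless_keyword are invented for illustration, and toy_stem is a hypothetical suffix stripper, not the Porter stemmer.

```python
# Stands in for the contents of protwords.txt.
PROTECTED = {"saltwater", "marine"}


def toy_stem(token: str) -> str:
    """Hypothetical stand-in for a stemmer: strips 'er' or a final 'e'."""
    for suffix in ("er", "e"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def mark_keywords(tokens: list[str], protected: set[str]) -> list[tuple[str, bool]]:
    """Pair every token with a flag saying whether it is protected."""
    return [(t, t in protected) for t in tokens]


def stem_unless_keyword(marked: list[tuple[str, bool]]) -> list[str]:
    """Stem only the tokens that were not flagged as keywords."""
    return [t if is_keyword else toy_stem(t) for t, is_keyword in marked]


tokens = ["saltwater", "marine", "boater"]
print(stem_unless_keyword(mark_keywords(tokens, PROTECTED)))
# ['saltwater', 'marine', 'boat']
```

The flag is set on tokens that pass through the marker step, which is why the marker has to come before the stemmer in the chain.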

Solr stop words replaced with _ symbol

I have a problem with Solr stop words in my autosuggest. All stop words are replaced by the _ symbol.
For example, I have the text "the simple text in" in the field "deal_title". When I search for the word "simple", Solr shows me the result "_ simple text _", but I expect "simple text".
Could someone explain why it works this way and how to fix it?
Here is part of my schema.xml:
<fieldType class="solr.TextField" name="text_auto">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false" />
</analyzer>
<analyzer type="query">
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
<field name="deal_title" type="text_auto" indexed="true" stored="true" required="false" multiValued="false"/>
<fieldType name="text_general" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
My solution to this in Solr 6.3 (where enablePositionIncrements="false" isn't possible anymore) was to:
remove stopwords
shingle with fillerToken="" (which removes the _)
remove leading and trailing spaces
remove duplicates
<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.ShingleFilterFactory" fillerToken=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(^ | $)" replacement=""/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
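The four steps can be sketched in Python as follows. This is a simplified model rather than a reimplementation of ShingleFilter: it builds shingles of one fixed size only, and the helper names (remove_stopwords, shingles, trim) are made up for the example. It does show where the _ comes from: removing a stop word leaves a position gap, and the shingle step fills each gap with the filler token.

```python
import re


def remove_stopwords(tokens, stopwords):
    """Replace stop words with None so the position gap stays visible."""
    return [None if t.lower() in stopwords else t for t in tokens]


def shingles(slots, size, filler="_"):
    """Word n-grams of exactly `size` slots; each gap is written as `filler`."""
    out = []
    for i in range(len(slots) - size + 1):
        window = slots[i : i + size]
        if all(s is None for s in window):
            continue  # nothing but gaps, skip
        out.append(" ".join(filler if s is None else s for s in window))
    return out


def trim(shingle):
    """Same pattern as the PatternReplaceFilterFactory line above."""
    return re.sub(r"(^ | $)", "", shingle)


slots = remove_stopwords("the simple text in".split(), {"the", "in"})

print(shingles(slots, 3))
# ['_ simple text', 'simple text _']

cleaned = [trim(s) for s in shingles(slots, 3, filler="")]
print(list(dict.fromkeys(cleaned)))  # order-preserving de-duplication
# ['simple text']
```

With an empty filler the gaps turn into stray leading or trailing spaces, which is why the trimming step and the duplicate removal are still needed afterwards.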
To fix this you need to use <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="false" /> and <luceneMatchVersion>4.3</luceneMatchVersion> in solrconfig.xml.

Solr edismax query not returning any result

The following Solr query is eluding me. Can anyone provide some advice?
fq={!edismax qf=$kwf}myToken&kwf=schemaField1 schemaField2
When myToken is found in the first field, all is well, but I never get any hit on the second. I have already checked the tokenizer and analyzer configurations and they seem OK.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory" ... />
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Is there anything else I could be doing obviously wrong?
Best,
Edoardo
