SynonymGraphFilterFactory with ShingleFilterFactory missing original string - solr

Synonym.txt
Phone Case,Mobile cover,Mobile case
Managed-Schema
<fieldType name="standardField" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,])" replacement=" " replace="all"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" outputUnigrams="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="([\.,])" replacement=" " replace="all"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.TrimFilterFactory" />
</analyzer>
</fieldType>
Document
Mobile cover
Search Term
Mobile cover
Issue
Not sure if this is caused by bad config of SynonymGraphFilterFactory or ShingleFilterFactory Index never having full string as value to match it. Like I never have original string mobile cover as a result all filters, when I search mobile cover I get no documents.
Analysis
Update I
I added solr.FlattenGraphFilterFactory after solr.SynonymGraphFilterFactory as recommended by SOLR docs, now char index looking good to me on Synonym output. But still having same issue.

Related

Solr filter for apostrophe's - allow search for both with and without apostophe

Using Solr 9.
I'd like the same results to return for the terms
Lowe's
as well as
Lowes
I can't seem to find the correct combination with this filter:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.KStemFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.LowerCaseFilterFactory" />
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
</analyzer>
</fieldType>
When testing in Solr's analyzer, I would expect that
<filter class="solr.KStemFilterFactory"/>
Would remove the s from the Lowes example in the query, thus matching Lowe in the index step.

stop synonyms.txt file Solr from being stemmed

In synonyms.txt file I have an entry
marine => saltwater,marine but both the words are getting stemmed to 'saltwat', 'marin' respectively inspite of being in protected words file. Is there a way to avoid it?
schema.xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" analyzer="org.apache.lucene.analysis.en.EnglishAnalyzer" />
</analyzer>
</fieldType>
synonyms.txt
marine => saltwater,marine
protwords.txt
saltwater
marine
now when I do the analysis in admin panel and query for saltwat then saltwat | marin comes up. which means that saltwater did get stemmed to saltwat in synonyms.txt file
The solr analysis works in the same sequence you declare it inside your fieldType definition in schema. So, if you declare any Stem filter after the Synonyms filter, it will be applied after the synonyms changes. If you don't want this, the SynonymsFilter should be configured after the StemFilter, for example:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.CommonGramsFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
</analyzer>
</fieldType>
I recommend you to check Solr Analysis tool in your Solr Admin to check what's going on with your field in both indexing and querying time.
Please share your schema if you need more help.
Protwords (protected words) are words that would be stemmed by the
English porter stemmer that you do not want to be stemmed.
A customized protected word list may be specified with the "protected" attribute in the schema. Any words in the protected word list will not be modified by any stemmer in Solr.
<fieldtype name="myfieldtype" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt" />
<filter class="solr.PorterStemFilterFactory" />
</analyzer>
</fieldtype>

Solr nGram Filter minGramSize - token for word with only 2 characters

I am working with Solr and would like to understand how EdgeNGramFilterFactory works.
For example, I am searching for a term "1 tb". Pls note that I have few products with attributes for the fields I'm search on.
Here is the filter applied on Index time for this fieldtype.
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" />
Now, when I search for the term "1 tb", I do not get desired results.
I have few products with term "5 MegaPixels" and when I search for "5 meg" it gives me result. Later I found it does not work for "5 me". And does not give results.
If I remove nGram filter, it works fine. Moreover, if we set minGramSize ="1", then query "1 tb" works fine.
I was assuming that for term tb, the token tb should be valid. But it seems that it is not created when I apply minGramSize of 2!
Can someone explain why?
Here is the field defined in schema.
<fieldType name="AttributesField" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" />
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Can you try with the below fieldType
<fieldType name="AttributesField" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="10" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
PorterStemFilterFactory : it does normalization process that removes common endings from words.
Example: "riding", "rides", "horses" ==> "ride", "ride", "hors".
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
With the configuration below the string value Nigerian gets broken down to the following terms
Nigerian => "ni", "nig", "nige", "niger", "nigeri", "nigeria", "nigeria", "nigerian"

Solr : Search for synonyms of query substring

We wanted to search for synonyms of query strings.
if I have "pm, project manager" entry in synonym file. I am getting results properly if I am searching for "pm" or "project manager".
But when I am searching for "project manager operations", Solr not giving proper results including search results of "pm". But when I debug I can see it is expanding "project manager" to "pm"
Following is my configuration
<fieldType name="text_jobs_synonym" class="solr.TextField" positionIncrementGap="1" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
This is mm value "2<90%".
How to Deal with Multi-Term Synonyms
Using edismax query parser to handle user queries migth help.
check this link

retrieve ngrams in solr for a particular word

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2" outputUnigrams="true"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="5" minShingleSize="2" outputUnigrams="true"/>
<solrQueryParser defaultOperator="OR" />
</analyzer>
</fieldType>
I am using the ShingleFilterFactory to create ngrams. Now i want to retrive all the ngrams for a particular word.
Suppose i entered "night" then i want all the ngrams with the word night.
right now i am getting the only the top results from all the ngrams from my documents with the below query:
http://localhost/solr/admin/luke?fl=text&numTerms=50000&wt=json

Resources