How to make Solr's spell checker ignore case? - solr

How do I make the example spellchecker ignore case?
I am using all the defaults shown in the demo.
If I type Ancient, it asks "did you mean ancient?" What should I do?
PS: I don't have anything containing the word "spell" in my schema.xml. How is it working?

The schema should have a field type called "spell" that is used for spell checking. It lowercases every word the spellchecker sees, so you don't have to worry about case. Here is an example of how to use this field type.
Create a field in your schema for spell checking.
<field name="spelling" type="spell" indexed="true" stored="false"/>
Then use a copy field to copy data into this field. For example, the line below copies the "product_name" field into the spell checking field.
<copyField source="product_name" dest="spelling"/>
Edit...
Sorry... I thought the "spell" field type was in the default schema. Add this to your schema in the same section as the other fieldType tags.
<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>

Please post your solrconfig.xml; I think it will provide a clue.
My best guess is that solrconfig.xml contains the config for the spellchecker, which specifies the field used for generating spelling suggestions, and that this field does not have a LowerCaseFilter in your schema.xml.
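As a rough illustration of the guess above, a spellcheck component in solrconfig.xml usually names the field it builds suggestions from. The sketch below assumes that field is the "spelling" field from the earlier answer and that the component is called "spellcheck"; the names and the spellchecker class in your own solrconfig.xml may well differ.

```xml
<!-- Sketch only: component and field names are assumptions, check your own solrconfig.xml -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- field type whose query analyzer is applied to the text being spell checked -->
  <str name="queryAnalyzerFieldType">spell</str>
  <lst name="spellchecker">
    <str name="name">default</str>
    <!-- suggestions are built from the indexed terms of this field -->
    <str name="field">spelling</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
  </lst>
</searchComponent>
```

Whichever field is named in the "field" entry, its analyzer (together with the queryAnalyzerFieldType) decides whether case is normalized before suggestions are compared.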

Related

Apache Solr - Default Schema Configuration

I have copied below an example default field type from the managed-schema.xml file. Generally people use classes such as solr.LowerCaseFilterFactory, but in the field type below there is, for example, a filter called lowercase without a class. So, is this configuration actually in effect, or is it just a template?
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer name="standard"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
<analyzer type="query">
<tokenizer name="standard"/>
<filter name="synonymGraph" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
</fieldType>
It depends on which version of Solr you're using; later versions are able to look up the class name from the short form (i.e. without the FilterFactory suffix). See the example in the current reference guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="englishPorter"/>
</analyzer>
</fieldType>
Compared to the legacy format shown in the same guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
As you can see, the class names contain a lot of repetition, so instead of requiring the complete class name, Solr resolves it from the short name and the element type (tokenizer or filter).

Solr not returning the exact element

Using Solr 7.7.3
I have an element with the label "alpha-ravi".
When I search in Solr for label:"alpha", it returns the element with the label "alpha-ravi".
According to the Solr documentation, it should not return this element.
Can anyone explain this behavior?
If you want exact results (i.e. return docs with "alpha-ravi" only if the user types exactly "alpha-ravi" in the search), I would suggest the Keyword tokenizer (solr.KeywordTokenizerFactory). This tokenizer treats the entire "alpha-ravi" as a single token and thus will not return partial matches for "alpha" or "ravi".
For example, in your schema.xml file you could add something like this (configure the filter chains as needed):
<fieldType name="single_token_string" class="solr.TextField" sortMissingLast="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Then you can use this fieldType for a field in the same schema.xml (it uses the KeywordTokenizer chain we just defined):
<field name="myField" type="single_token_string" indexed="true" stored="true" />
The text field types in Solr's example schemas typically use the StandardTokenizer, which splits "alpha-ravi" on the hyphen into multiple tokens (thus matching "alpha" and "ravi").
As an alternative, you could run a phrase query, which is still analyzed but requires the resulting tokens to appear together and in order. Possibly something like http://localhost:8983/solr/...fq=label:"alpha-ravi"
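Another option, sketched here under assumptions, is to keep the tokenized field for normal search and copy its value into an untokenized field for exact matching. The field name "label_exact" is hypothetical, and the "string" type is assumed to be a solr.StrField as in Solr's example schemas; adapt both to your schema.

```xml
<!-- Sketch only: label_exact is a hypothetical field name -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="label_exact" type="string" indexed="true" stored="false"/>
<!-- copy the raw value of label into the untokenized field at index time -->
<copyField source="label" dest="label_exact"/>
```

A query such as label_exact:"alpha-ravi" would then match only the whole value. Note that a StrField is not analyzed at all, so the match is also case-sensitive.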
Hope that helps. All the best!

Searching for a word that ends with a "/" (forward slash), such as "Non-Conformity/", does not work in Solr

I am having trouble searching for the word "Non-Conformity" in Solr when the word ends with a "/" (forward slash).
This is the URL I used to search:
http://localhost:8983/solr/sms/select?q="Non-Conformity"
Keyword "Non-Conformity" or "Non-Conformity/": not working
Keyword "Non-Conformity/Deficiency": working
Text available in the document: "Non-Conformity/Deficiency (NCD) Report for Class audits/surveys"
Apply the field type below to your field.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Your field would look like this:
<field name="name" type="text_general" indexed="true" stored="true" multiValued="false"/>
If you analyse the same text on the admin Analysis page, you will see that it matches.

How can I have a one-way synonym in Solr?

I am trying to implement a one-way synonym, or one-way thesaurus (as in Endeca), in Solr: when I search for camcorder I should also get results for camera, but not vice versa. I tried adding the following to synonyms.txt, but it does not seem to work and gives weird results:
camcorder => camera
And my schema.xml is:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
I think what you were looking for is:
camcorder => camera, camcorder
If you don't include camcorder on the right side, camcorder won't return any results if you search for "camcorder".
Since you're only expanding synonyms when you're indexing (where you have the SynonymFilter defined), camcorder will be changed to camera for each document on the way in. When you don't have the same expansion taking place when querying, Solr will still search for camcorder (as there is no SynonymFilter defined for the query analysis chain). There is no camcorder token in the index, so there will be no hit.
You'll have to expand synonyms when querying as well as when indexing to achieve what you want with one-way synonyms.
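A minimal sketch of what adding query-time expansion could look like, reusing the query analyzer from the question. It assumes synonyms.txt contains the "camcorder => camera, camcorder" rule; whether the index-time SynonymFilter should stay as well depends on which direction you need, so treat this as a starting point rather than a definitive configuration.

```xml
<analyzer type="query">
  <tokenizer class="solr.ClassicTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.ClassicFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  <!-- added: a query for camcorder is expanded to camera and camcorder; camera is left as is -->
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
</analyzer>
```

The Analysis page in the admin UI can confirm which tokens each chain produces for "camcorder" and "camera". If the index-time chain is changed too, existing documents need to be reindexed for the change to take effect.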

Solr: the query phrase returns results for some cases and doesn't for some

I get Solr results for following:
Sports
World Health Organisation
percent
but I don't get results for the below:
Sport (UK)
World Health Organisat
1-percent
All of these are in the text field, which definitely contains these phrases, and I have used an ngram filter in the index analyzer, so the combinations do exist.
While the Analysis tab of the Solr UI shows me exactly what I am expecting, I am not getting the required results in my Java output.
My SolrJ code is as below:
query.setQuery("full_text:\"World Health Organisation\"");
Also, I have to add the \"..\" quotes, as I always get errors in my front end if I remove them, and half the results I otherwise get don't turn up either.
Can someone help with what I might be missing?
Much thanks!
Edit Inclusion: Definition of full_text in schema.xml
<field name="full_text" type="text_en" indexed="true" stored="false" multiValued="true"/>
<copyField source="title" dest="full_text"/>
<copyField source="content" dest="full_text"/>
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
Solution:
I figured out what the problem was. For "Sport (UK)" and "1-percent", the tokeniser I was using was removing all special characters, so I changed my tokeniser.
As for "World Health Organisat", the problem was caused by the stemmer, which changed Organisation to Organis at index time, while a query term like "Organisat" was kept as it is.
Hence I did not get results. So I removed the stemmer, as I am using an ngram filter.
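The change described above might look roughly like the following. The post does not say which tokeniser was chosen, so solr.WhitespaceTokenizerFactory is an assumption here; the stemmer, and the keyword marker that only exists to protect words from it, is dropped from both chains.

```xml
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- assumption: the whitespace tokenizer keeps characters such as "(", ")" and "-" inside tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
    <!-- no PorterStemFilterFactory: partial matches come from the ngram filter instead -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="20"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishPossessiveFilterFactory"/>
  </analyzer>
</fieldType>
```

With the stemmer gone from both sides, a query term such as "organisat" is compared directly against the ngrams of "organisation" produced at index time. Documents have to be reindexed after a change like this.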
Hope this helps others in the long run. :)

Resources