How to exclude specific Solr spellchecking results - solr

We got the problem that we get spellchecking results that are technically correct but not suitable for the context of the input term.
For example the user searches for "ventilator" and the spellchecker returns "vibrator" as the corrected term.
We could remove the value "vibrator" from the possible results but if someone misspells "vibrator" we should return the corrected term.
Is it possible to exclude specific mappings (e.g. "ventilator" > "vibrator")?
The current config:
solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">text_spell</str>
<lst name="spellchecker">
<str name="name">de</str>
<str name="field">spellcheck_de</str>
<str name="buildOnCommit">true</str>
<str name="buildOnOptimize">true</str>
</lst>
And the Field config from schema.xml:
<fieldType name="text_spell_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>

you can stuff like 'exclude terms that are less frequent than X' on the index, and the like. But if you want to 'exclude term X when serving suggestions for term Y only' then no, you can't.

Related

Solr Suggester returns 0 results when context language is en-AU

I have a list of product pages for one of my Australia sites where the content is in 2 language versions:
EN
en-AU
I have a search suggestion box where I am trying to populate few of the title fields through a computed field named as autosuggestiontitle_sm
Here's my suggester component defined in solrconfig.xml:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">**autosuggestiontitle_sm**</str>
<str name="contextField">**_contextLanguage**</str>
<str name="suggestAnalyzerFieldType">**text_suggester**</str>
<str name="buildOnStartup">true</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.dictionary">mySuggester</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Since my suggestAnalyzerFieldType is a custom field, I have included the below entries in managed-schema file as below:
<fieldType **name="text_suggester"** class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
And have added the 2 custom fields by defining the type as text_suggester:
<field name="autosuggestiontitle_sm" type="text_suggester" multiValued="true" indexed="true" stored="true"/>
<field name="_contextLanguage" type="text_suggester" multiValued="false" indexed="true" stored="true"/>
Since _language is a string type I have defined a custom field name as _contextLanguage of type text_suggester so added the below **copyField **entry:
Then, I did restart my solr server and re-indexed my custom index pertaining to my website context.
Now my search term is "fit".
Scenario 1 Query: https://localhost:8983/solr/custom_master_index/suggest?q=fit
Result is as expected which is picking 7 results where "fit" terms appears in title text from both EN and en-AU versions
Scenario 2 Query: https://localhost:8983/solr/custom_master_index/suggest?q=fit&suggest=true&suggest.cfq=en
Result is as expected which is picking 2 results where "fit" terms appears in title text from EN content.
But the issue is that when I query with en-AU which is my current context language of my Australia site, the result is either 0 or at time I see the EN results.
(Issue)Scenario 3 Query: https://localhost:8983/solr/custom_master_index/suggest?q=fit&suggest=true&suggest.cfq=en-AU
Note: I have tried to run the query with different values like suggest.cfq=en-au, suggest.cfq=au (nothing helped)
Can someone help me understand what is being missed so that en-AU contextField is not querying the right values.
Thanks in advance!

Solr 8.8 - trouble matching partial words with eDisMax and EdgeNGramFilter

I am new to Solr and trying to provide partial word matching with Solr 8.8.1, but partials are giving no results. I have combed the blogs without luck to fix this.
For example, the text of the document contains the word longer. Index analysis gives lon, long, longe, longer. If I query longer using alltext_en:longer, I get a match. However, if I query (for example) longe using alltext_en:longe, I get no match. explainOther returns 0.0 = No matching clauses.
It seems that I am missing something obvious, since this is not a complex phrase query.
Apologies in advance if I have missed any needed details - I will update the question if you tell me what else is needed to know.
Here are the relevant field specs from my managed-schema:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<field name="alltext_en" type="text_en" multiValued="true" indexed="true" stored="true"/>
<copyField source="*_txt_en" dest="alltext_en"/>
Here is the relevant part of solrconfig.xml:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- Query settings -->
<str name="defType">edismax</str>
<str name="q">*:*</str>
<str name="q.alt">*:*</str>
<str name="rows">50</str>
<str name="fl">*,score,[explain]</str>
<str name="ps">10</str>
<!-- Highlighting defaults -->
<str name="hl">on</str>
<str name="hl.fl">_text_</str>
<str name="hl.preserveMulti">true</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre"><span class="artica-snippet"></str>
<str name="hl.simple.post"></span></str>
<!-- Spell checking defaults -->
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.alternativeTermCount">2</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.maxCollations">3</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
That stemming filter will modify the tokens in ways you don't predict - and since they only happen on the token you try to match agains the ngrammed tokens when querying, the token might not be what you expect). If you're generating ngrams, stemming filters should usually be removed. I'd also remove the possessive filter (Also, small note - try to avoid using * when formatting text, since it's hard to know if you've used it when querying and the formatting is an error - instead use a backtick to indicate that the text is a code keyword/query.) – MatsLindh
That answered it - I removed the stemmer from the index step and everything was fine. Brilliant, thank you, #MatsLindh!

SOLR Proximity Search setting

I have some address data that I need to search. I am struggling a bit with the proximity search.
An eg. of address that I am trying to search is:
CATO STREET WEST LAUNCESTON TAS
and my search query for proximity search doesn't return anything when I try to search for (CATO WEST)~2
The configuration for the data field (schema.xml) is as follows:
<field name="street_name_space" type="text_general" indexed="true" stored="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Request handler is as follows:
<requestHandler name="/proximity" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
<str name="qf">street_name_space</str>
<str name="qs">10</str>
<str name="pf">street_name_space</str>
<str name="ps">10</str>
<str name="echoParams">explicit</str>
<str name="fl">street_name, street_name_clean, street_name_space</str>
</lst>
</requestHandler>
Any idea what I shall be doing to get the results?
the KeywordTokenizerFactory you are using keeps the whole thing as a single term, so the only term indexed is 'cato street west launceston tas'. Of course this does not match your query.
Use some other tokenizer, like the WhitespaceTokenizerFactory and it should work

Lucene / SOLR term to number range proximity search

I am using SOLR 4.9.0 with the following configuration (I am including only the part I consider relevant to the question):
<field name="content" type="text" indexed="true" stored="false"
termVectors="true" multiValued="false" />
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
I can do proximity search for a term being close to another term:
content:"very suggestion"~100
I need to add the functionality of being able to search for a term being close to a number token, such as in:
content:"very [0.01 TO 0.99]"~100
content:"very [100 TO 1000000]"~100
Is there a tokenizer that already provides this functionality?
If not, what would roughly be the steps in order to adapt the standard tokenizer to be able to do that?
Any speculations on what the effect on the index structure, size, and indexing/searching speed would be?
EDIT:
I think that the following SOLR configuration is actually also relevant to my question:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">id</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="fl">* score</str>
</lst>
</requestHandler>
More than two years later, I found the answer to my question :)
By using the
https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-ComplexPhraseQueryParser
one can do:
{!complexphrase inOrder=false}content:"fee [100 10000]"~10

solr suggester not returning any results

I've followed the solr wiki article for suggester almost to the T here: http://wiki.apache.org/solr/Suggester. I have the following xml in my solrconfig.xml:
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">description</str>
<float name="threshold">0.05</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
However, when I run the following query (or something similar):
../suggest/?q=barbequ
I only get the following result xml back:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">78</int>
</lst>
<lst name="spellcheck">
<lst name="suggestions"/>
</lst>
</response>
As you can see, this isn't very helpful. Any suggestions to help resolve this?
A couple of things I can think of that might cause this problem:
The source field ("description") is incorrect - ensure that this is indeed the field that seeds terms for your spell checker. It could even be that the field is a different case (eg. "Description" instead of "description").
The source field in your schema.xml is not set up correctly or is being processed by filters that cause the source dictionary to be invalid. I use a separate field to seed the dictionary, and use <copyfield /> to copy relevant other fields to that.
The term "barbeque" doesn't appear in at least 5% of records (you've indicated this requirement by including <float name="threshold">0.05</float>) and therefore is not included in the lookup dictionary
In SpellCheckComponent the <str name="spellcheck.onlyMorePopular">true</str> setting means that only terms that would produce more results are returned as suggestions. According to the Suggester documentation this has a different function (sorting suggestions by weight) but it might be worth switching this to false to see if it is causing the issue.
Relevant parts of my schema.xml:
<schema>
<types>
<!-- Field type specifically for spell checking -->
<fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StandardFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.StandardFilterFactory" />
</analyzer>
</fieldType>
</types>
<fields>
<field name="spell" type="textSpell" indexed="true" stored="false" multiValued="true" />
</fields>
<!-- Copy fields which are used to seed the spell checker -->
<copyField source="name" dest="spell" />
<copyField source="description" dest="spell" />
<schema>
Could the problem be that you're querying /suggest instead of /spell
../suggest/?q=barbequ
In my setup this the string I pass in:
/solr/spell?q=barbequ&spellcheck=true&spellcheck.collate=true
And the first time you do a spellcheck you need to include
&spellcheck.build=true
I'm running on solr 4 btw. So, perhaps /suggest is an entirely different endpoint that does something else. If so, apologize.
Please check, if the term-parameter are set in the schema.xml, like:
<field name="TEXT" type="text_en" indexed="true" stored="true" multiValued="true"
termVectors="true"
termPositions="true"
termOffsets="true"/>
...restart solr and reindex again

Resources