Solr Suggester returns 0 results when context language is en-AU - solr

I have a list of product pages for one of my Australian sites, where the content exists in two language versions:
EN
en-AU
I have a search suggestion box that I am trying to populate with a few of the title fields through a computed field named autosuggestiontitle_sm.
Here's my suggester component defined in solrconfig.xml:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">BlendedInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">autosuggestiontitle_sm</str>
<str name="contextField">_contextLanguage</str>
<str name="suggestAnalyzerFieldType">text_suggester</str>
<str name="buildOnStartup">true</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.dictionary">mySuggester</str>
<str name="suggest.count">10</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Since my suggestAnalyzerFieldType is a custom field type, I have added the following entries to the managed-schema file:
<fieldType name="text_suggester" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I have also added the two custom fields, both with the type text_suggester:
<field name="autosuggestiontitle_sm" type="text_suggester" multiValued="true" indexed="true" stored="true"/>
<field name="_contextLanguage" type="text_suggester" multiValued="false" indexed="true" stored="true"/>
Since _language is a string type, I defined a custom field named _contextLanguage of type text_suggester and added a copyField entry for it.
Then I restarted my Solr server and re-indexed the custom index for my website context.
Now my search term is "fit".
Scenario 1 Query: https://localhost:8983/solr/custom_master_index/suggest?q=fit
The result is as expected: 7 suggestions where the term "fit" appears in the title text, from both the EN and en-AU versions.
Scenario 2 Query: https://localhost:8983/solr/custom_master_index/suggest?q=fit&suggest=true&suggest.cfq=en
The result is as expected: 2 suggestions where the term "fit" appears in the title text of the EN content.
The issue is that when I query with en-AU, which is the current context language of my Australian site, I get either 0 results or, at times, the EN results.
(Issue)Scenario 3 Query: https://localhost:8983/solr/custom_master_index/suggest?q=fit&suggest=true&suggest.cfq=en-AU
Note: I have tried running the query with different values, such as suggest.cfq=en-au and suggest.cfq=au, but nothing helped.
Can someone help me understand what I am missing, so that filtering on the en-AU context returns the right values?
Thanks in advance!

Related

Solr 8.8 - trouble matching partial words with eDisMax and EdgeNGramFilter

I am new to Solr and trying to provide partial word matching with Solr 8.8.1, but partials are giving no results. I have combed the blogs without luck to fix this.
For example, the text of the document contains the word longer. Index analysis gives lon, long, longe, longer. If I query longer using alltext_en:longer, I get a match. However, if I query (for example) longe using alltext_en:longe, I get no match. explainOther returns 0.0 = No matching clauses.
It seems that I am missing something obvious, since this is not a complex phrase query.
Apologies in advance if I have missed any needed details - I will update the question if you tell me what else is needed to know.
Here are the relevant field specs from my managed-schema:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="15" minGramSize="3"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
<dynamicField name="*_txt_en" type="text_en" indexed="true" stored="true"/>
<field name="alltext_en" type="text_en" multiValued="true" indexed="true" stored="true"/>
<copyField source="*_txt_en" dest="alltext_en"/>
Here is the relevant part of solrconfig.xml:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<!-- Query settings -->
<str name="defType">edismax</str>
<str name="q">*:*</str>
<str name="q.alt">*:*</str>
<str name="rows">50</str>
<str name="fl">*,score,[explain]</str>
<str name="ps">10</str>
<!-- Highlighting defaults -->
<str name="hl">on</str>
<str name="hl.fl">_text_</str>
<str name="hl.preserveMulti">true</str>
<str name="hl.encoder">html</str>
<str name="hl.simple.pre">&lt;span class="artica-snippet"&gt;</str>
<str name="hl.simple.post">&lt;/span&gt;</str>
<!-- Spell checking defaults -->
<str name="spellcheck">on</str>
<str name="spellcheck.extendedResults">false</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.alternativeTermCount">2</str>
<str name="spellcheck.maxResultsForSuggest">5</str>
<str name="spellcheck.collate">true</str>
<str name="spellcheck.collateExtendedResults">true</str>
<str name="spellcheck.maxCollationTries">5</str>
<str name="spellcheck.maxCollations">3</str>
</lst>
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>
That stemming filter will modify the tokens in ways you might not predict, and since at query time it is applied to the token you are trying to match against the n-grammed tokens, that token might not be what you expect. If you're generating n-grams, stemming filters should usually be removed. I'd also remove the possessive filter. (Also, a small note: try to avoid using * when formatting text, since it's hard to know whether you used it when querying or whether the formatting is an error. Use backticks instead to indicate that the text is a code keyword/query.) – MatsLindh
That answered it - I removed the stemmer from the index step and everything was fine. Brilliant, thank you, @MatsLindh!
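As a rough sketch of what the comment above suggests, the field type from the question could look like this once the stemmer and the possessive filter are dropped. The name text_en_ngram is hypothetical, and removing the filters from the query analyzer as well is an assumption (the asker only reports removing the stemmer from the index step).

```xml
<!-- Sketch only: "text_en_ngram" is a made-up name; filters are the ones
     from the question, minus PorterStemFilter and EnglishPossessiveFilter. -->
<fieldType name="text_en_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Edge n-grams are now built from the unstemmed token,
         e.g. "longer" gives lon, long, longe, longer -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
    <!-- No n-gram filter here: the typed prefix is matched as-is
         against the indexed grams -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Any change to the index analyzer only affects newly indexed documents, so the documents need to be reindexed before a query such as alltext_en:longe can match.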

SOLR Proximity Search setting

I have some address data that I need to search. I am struggling a bit with the proximity search.
An example of an address I am trying to search for is:
CATO STREET WEST LAUNCESTON TAS
My proximity query returns nothing when I search for (CATO WEST)~2.
The configuration for the data field (schema.xml) is as follows:
<field name="street_name_space" type="text_general" indexed="true" stored="true"/>
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Request handler is as follows:
<requestHandler name="/proximity" class="solr.SearchHandler">
<lst name="defaults">
<str name="defType">edismax</str>
<str name="echoParams">explicit</str>
<str name="qf">street_name_space</str>
<str name="qs">10</str>
<str name="pf">street_name_space</str>
<str name="ps">10</str>
<str name="echoParams">explicit</str>
<str name="fl">street_name, street_name_clean, street_name_space</str>
</lst>
</requestHandler>
Any idea what I should do to get results?
The KeywordTokenizerFactory you are using keeps the whole value as a single term, so the only term indexed is 'cato street west launceston tas'. Of course, this does not match your query.
Use some other tokenizer, such as the WhitespaceTokenizerFactory, and it should work.
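A minimal sketch of that change, assuming a new field type (the name text_ws is hypothetical) so that the existing text_general definition stays untouched:

```xml
<!-- Sketch only: "text_ws" is a made-up name. A single <analyzer> is used
     for both indexing and querying. -->
<fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Splits on whitespace, so the address is indexed as
         cato / street / west / launceston / tas -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
<field name="street_name_space" type="text_ws" indexed="true" stored="true"/>
```

After reindexing, each word has its own position, which is what a proximity match needs. Note that in the standard Lucene syntax a proximity search is written as a quoted phrase with a slop, for example street_name_space:"cato west"~2, rather than with parentheses.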

Solr : Suggester dictionary build creates huge temporary files

I'm using sunspot-solr 2.3.0 for my rails app.
I implemented a suggester (AnalyzingSuggester) on Solr for autocompletion. I have a database of about 11M entries with 5 fields indexed by Solr.
When building the suggestions dictionary, two files are created in my /tmp/ folder:
AnalyzingSuggester1784590344675447619.input (the number varies). This file gets bigger and bigger until eventually I have no space left, and then it seems to disappear.
AnalyzingSuggester8456478182934503596.sorted (the number varies too). This file is 0 KB.
I searched a lot but can't work out what exactly is happening and whether, or how, I should prevent this weird behavior. Is this a normal part of the dictionary build? Is it just some logging?
I have the same problem with my solr 6.0.1. The tmp file blows up indefinitely until the hard drive is full.
My index only contains about 2500 documents.
Search component:
<searchComponent class="solr.SuggestComponent" name="autoSuggest">
<lst name="suggester">
<str name="name">analyzingSuggester</str>
<str name="lookupImpl">AnalyzingLookupFactory</str>
<str name="storeDir">analyzing_suggestions</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="buildOnCommit">false</str>
<str name="buildOnStartup">false</str>
<str name="field">text_suggest_auto</str>
<str name="suggestAnalyzerFieldType">text_suggestion_auto</str>
</lst>
</searchComponent>
Request handler:
<requestHandler class="solr.SearchHandler" name="/suggestAuto" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.dictionary">analyzingSuggester</str>
<str name="suggest.onlyMorePopular">true</str>
<str name="suggest.count">10</str>
<str name="suggest.collate">true</str>
</lst>
<arr name="components">
<str>autoSuggest</str>
</arr>
</requestHandler>
Field:
<fieldType name="text_suggestion_auto" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
preserveOriginal="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_general.txt" format="snowball" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_general.txt" format="snowball" />
</analyzer>
</fieldType>

How to exclude specific Solr spellchecking results

Our problem is that we get spellchecking results that are technically correct but not suitable for the context of the input term.
For example the user searches for "ventilator" and the spellchecker returns "vibrator" as the corrected term.
We could remove the value "vibrator" from the possible results but if someone misspells "vibrator" we should return the corrected term.
Is it possible to exclude specific mappings (e.g. "ventilator" > "vibrator")?
The current config:
solrconfig.xml:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">text_spell</str>
<lst name="spellchecker">
<str name="name">de</str>
<str name="field">spellcheck_de</str>
<str name="buildOnCommit">true</str>
<str name="buildOnOptimize">true</str>
</lst>
</searchComponent>
And the Field config from schema.xml:
<fieldType name="text_spell_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
</fieldType>
You can do things like 'exclude terms that are less frequent than X' on the index, and the like. But if you want to 'exclude term X when serving suggestions for term Y only', then no, you can't.

Lucene / SOLR term to number range proximity search

I am using SOLR 4.9.0 with the following configuration (I am including only the part I consider relevant to the question):
<field name="content" type="text" indexed="true" stored="false"
termVectors="true" multiValued="false" />
<fieldType name="text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
I can do proximity search for a term being close to another term:
content:"very suggestion"~100
I need to add the functionality of being able to search for a term being close to a number token, such as in:
content:"very [0.01 TO 0.99]"~100
content:"very [100 TO 1000000]"~100
Is there a tokenizer that already provides this functionality?
If not, what would roughly be the steps in order to adapt the standard tokenizer to be able to do that?
Any speculations on what the effect on the index structure, size, and indexing/searching speed would be?
EDIT:
I think that the following SOLR configuration is actually also relevant to my question:
<requestHandler name="/select" class="solr.SearchHandler">
<lst name="defaults">
<str name="echoParams">explicit</str>
<int name="rows">10</int>
<str name="df">id</str>
<str name="wt">json</str>
<str name="indent">true</str>
<str name="fl">* score</str>
</lst>
</requestHandler>
More than two years later, I found the answer to my question :)
By using the ComplexPhraseQueryParser (https://lucene.apache.org/solr/guide/6_6/other-parsers.html#OtherParsers-ComplexPhraseQueryParser), one can do:
{!complexphrase inOrder=false}content:"fee [100 10000]"~10
