Solr suggester returning terms from deleted documents

I have a SolrCloud setup and I'm testing the suggestion component. I have several hundred documents in the index. I did not want some of the documents in the index because they contain gibberish (they were binary files that got improperly converted to text). I've removed them from the index, but the gibberish words from them are still showing up in the suggestions.
My suggest configuration looks like this:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">fuzzySuggester</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">HighFrequencyDictionaryFactory</str>
<str name="storeDir">suggester_fuzzy_dir</str>
<str name="field">dictionary_text</str>
<str name="suggestAnalyzerFieldType">phrase_suggest</str>
<str name="exactMatchFirst">true</str>
<float name="threshold">0.001</float>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.dictionary">fuzzySuggester</str>
<str name="suggest.onlyMorePopular">true</str>
<str name="suggest.count">5</str>
<str name="suggest.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
Note that buildOnCommit is set to true. I also tried to remove them using a /suggest query with the suggest.build=true parameter, but that had no effect.
Is there something else required to remove terms from the dictionary?

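Despite using expungeDeletes=true in the update, the deleted documents were still present in the index. A delete only marks a document as deleted; its terms stay in the segment until that segment is merged, and expungeDeletes does not necessarily merge every segment that contains deletions. Optimizing the index removed them, and the gibberish terms appear to be gone from the suggestions.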
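For reference, an optimize can be requested with an XML update command. The fragment below is a minimal sketch, assuming it is POSTed to the collection's /update handler with an XML content type; whether a full optimize is acceptable on a given cluster is a separate question, since it rewrites the whole index.

```xml
<!-- Sketch only: XML update command sent to the collection's /update handler. -->
<!-- It forces a merge, so the merged segments no longer carry deleted documents. -->
<optimize waitSearcher="true"/>
```

The suggester still has to be rebuilt afterwards, for example with the suggest.build=true request mentioned in the question, or by setting buildOnOptimize to true in the suggester definition.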

Related

Solr Suggester taking too long to provide response

I am using the Solr Suggester to provide suggestions on the search page of our application, but every suggestion request takes too long to return. I have tried several lookupImpl values, such as AnalyzingLookupFactory, AnalyzingInfixLookupFactory and FuzzyLookupFactory.
Below is my configuration:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">mySuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">spell_suggest</str>
<str name="weightField">spell_suggest</str>
<str name="suggestAnalyzerFieldType">text_general</str>
<str name="buildOnStartup">false</str>
</lst>
<lst name="suggester">
<str name="name">altSuggester</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="field">spell_suggest</str>
<str name="weightField">spell_suggest</str>
<str name="suggestAnalyzerFieldType">text_general</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<!--<str name="suggest.dictionary">mySuggester</str> -->
<str name="suggest.dictionary">altSuggester</str>
<str name="suggest">true</str>
<str name="suggest.count">6</str>
<str name="spellcheck">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
With just 42,000 indexed documents, each request takes 5 to 7 seconds to return, which badly affects the application.
This is my request: http://<myIP>:8983/solr/mycollection/suggest?df=spell_suggest&suggest=true&suggest.build=true&q=Vendor
Please let me know whether I need to add or change any configuration to improve performance.
Thanks!
Because you pass suggest.build=true with every request, you are asking Solr to rebuild the suggestion index from scratch on each query.
The index should only be rebuilt after the underlying data changes, and only if necessary (depending on which dictionaryImpl you're using).
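As a rough sketch of that idea, the build can be tied to index changes in the suggester definition rather than to the query. The fragment below reuses the altSuggester settings from the question and assumes commits are infrequent enough that rebuilding on each one is acceptable; if they are not, leave buildOnCommit off and send a single request with suggest.build=true after each indexing run.

```xml
<searchComponent name="suggest" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">altSuggester</str>
    <str name="lookupImpl">FuzzyLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">spell_suggest</str>
    <str name="weightField">spell_suggest</str>
    <str name="suggestAnalyzerFieldType">text_general</str>
    <!-- Assumption: commits are rare, so a rebuild per commit is affordable. -->
    <str name="buildOnCommit">true</str>
    <!-- Do not rebuild every time the core starts. -->
    <str name="buildOnStartup">false</str>
  </lst>
</searchComponent>
```

Regular suggestion requests would then drop suggest.build from the URL entirely.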

How to optimize documentdictionary build on solr cloud suggester?

I have around 300,000 records to load into a SolrCloud suggester. The records are dynamic: new documents will be added and some documents will be deleted on a regular basis. I see two options, and each has a problem:
Use FileDictionaryFactory: this method is an operational nightmare. I would need to keep regenerating the file and uploading it to ZooKeeper (I still haven't figured out how to upload a file this large to ZooKeeper), and I might need to build the index on each server in the SolrCloud cluster separately. Doing this frequently does not seem feasible.
Use DocumentDictionaryFactory: this method seems like the obvious choice, but building the index is a nightmare as well. Every time I try to build it, I get the "No space left on the device" error. Building it on 5K records succeeded, but it took 40 minutes and used all 10 GB of memory for the entire time.
My question is: can the index build time be reduced if I follow the second approach?
Or, if I follow the first approach, what is the best way to deal with frequent changes that need to be indexed on SolrCloud?
My configs:
For FileDictionaryFactory:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggestions</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">FileDictionaryFactory</str>
<str name="field">searchfield</str>
<str name="weightField">searchscore</str>
<str name="suggestAnalyzerFieldType">text_ngram</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">false</str>
<str name="sourceLocation">spellings.txt</str>
<str name="storeDir">autosuggest_dict</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">suggestions</str>
<str name="suggest.dictionary">results</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
For DocumentDictionaryFactory:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">suggestions</str>
<str name="lookupImpl">FuzzyLookupFactory</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="field">searchfield</str>
<str name="weightField">searchscore</str>
<str name="payloadField">payload</str>
<str name="suggestAnalyzerFieldType">text_ngram</str>
<str name="buildOnStartup">false</str>
<str name="buildOnCommit">false</str>
<str name="sourceLocation">spellings.txt</str>
<str name="storeDir">autosuggest_dict</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">suggestions</str>
<str name="suggest.dictionary">results</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
I think the main issue with DocumentDictionaryFactory (my preferred option) is that you are using text_ngram. You didn't share the text_ngram definition, so this is a guess, but unless your values are very short it will produce a very large FST, which would explain the build time.
Unless I am missing something, you don't need n-grams here: use a field type that tokenizes with StandardTokenizerFactory and suggestions should still work.
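A field type along these lines might be enough. This is only a sketch: the name text_suggest_std is made up for illustration, and the lowercase filter is an assumption about what the application wants rather than something the question specifies.

```xml
<!-- Hypothetical field type for the schema; the name is illustrative only. -->
<fieldType name="text_suggest_std" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Plain word tokenization instead of n-grams keeps the FST small. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumption: case-insensitive matching is wanted. -->
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The suggester's suggestAnalyzerFieldType would then point at this type instead of text_ngram.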

Query on suggester while building dictionary

My suggester conf:
<searchComponent name="suggest" class="solr.SuggestComponent">
<lst name="suggester">
<str name="name">titleSuggester</str>
<str name="lookupImpl">AnalyzingInfixLookupFactory</str>
<str name="field">name</str>
<str name="suggestAnalyzerFieldType">text_pt</str>
<str name="payloadField">type</str>
<str name="weightField">weightField</str>
<str name="buildOnCommit">false</str>
<str name="buildOnStartup">false</str>
<str name="dictionaryImpl">DocumentDictionaryFactory</str>
<str name="indexPath">/home/dev/suggestions</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy" >
<lst name="defaults">
<str name="suggest">true</str>
<str name="suggest.count">10</str>
<str name="suggest.dictionary">titleSuggester</str>
<str name="suggest.onlyMorePopular">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
It works! But I need to rebuild my dictionary every hour, and the build takes 2 minutes.
Every hour I run:
localhost:8983/solr/AutoComplete/suggest?suggest.q=term&suggest.build=true
During this time I still need to get results, but when I run a query such as:
localhost:8983/solr/AutoComplete/suggest?suggest.q=term
I get this response (because the build is running):
<response>
<lst name="responseHeader">
<int name="status">500</int>
<int name="QTime">5</int>
</lst>
<lst name="error">
<str name="msg">suggester was not built</str>
What can I do to get results while the build is running?
This question is quite old, but I have the same problem (my rebuild can run for an hour) and I arrived at this solution:
Configure two components, e.g. suggest_A and suggest_B, with different indexPath values.
Configure two request handlers, e.g. suggest and suggest_Rebuild.
Assign suggest_A to suggest and suggest_B to suggest_Rebuild.
Do the rebuild on the suggest_Rebuild handler. After the rebuild has finished, swap the component assignments of the two handlers via the Config API (update-requesthandler).
The drawback of this solution is that you need twice the disk space.
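The starting state of that setup might look roughly like the fragment below. It is a sketch based on the configuration in the question: the two indexPath directories are hypothetical, and the payload and weight settings are left out for brevity.

```xml
<!-- Component currently serving queries. -->
<searchComponent name="suggest_A" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">titleSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="suggestAnalyzerFieldType">text_pt</str>
    <str name="buildOnCommit">false</str>
    <str name="buildOnStartup">false</str>
    <!-- Hypothetical path; must differ from the one used by suggest_B. -->
    <str name="indexPath">/home/dev/suggestions_A</str>
  </lst>
</searchComponent>
<!-- Component that receives the next rebuild. -->
<searchComponent name="suggest_B" class="solr.SuggestComponent">
  <lst name="suggester">
    <str name="name">titleSuggester</str>
    <str name="lookupImpl">AnalyzingInfixLookupFactory</str>
    <str name="dictionaryImpl">DocumentDictionaryFactory</str>
    <str name="field">name</str>
    <str name="suggestAnalyzerFieldType">text_pt</str>
    <str name="buildOnCommit">false</str>
    <str name="buildOnStartup">false</str>
    <!-- Hypothetical path. -->
    <str name="indexPath">/home/dev/suggestions_B</str>
  </lst>
</searchComponent>
<!-- Handler used by the application. -->
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.count">10</str>
    <str name="suggest.dictionary">titleSuggester</str>
  </lst>
  <arr name="components">
    <str>suggest_A</str>
  </arr>
</requestHandler>
<!-- Handler used only for suggest.build=true requests. -->
<requestHandler name="/suggest_Rebuild" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="suggest">true</str>
    <str name="suggest.dictionary">titleSuggester</str>
  </lst>
  <arr name="components">
    <str>suggest_B</str>
  </arr>
</requestHandler>
```

After each rebuild the two components lists are exchanged, so /suggest always points at the most recently built index.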

Results returned by "did you mean" feature of Solr 4.6 are not in proper order

Following is the configuration that I have in solrconfig.xml:
<searchComponent name="suggest" class="solr.SpellCheckComponent">
<str name="queryAnalyzerFieldType">text_m_ss</str>
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">solr.IndexBasedSpellChecker</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="field">text_m_ss</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">false</str>
<str name="spellcheck.count">25</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
</arr>
</requestHandler>
Now when I query, say, appel, it returns: appear, happen, apply, apple
The desired result is for apple to be returned first, whereas here it is 4th.
Is there a way to get apple as the first result?
There's nothing "improper" about your result set. Solr can't read your mind and know that you think apple is better than appear -- that's why this feature is called the suggester.
That said, you can change the order by which the results are sorted -- but you have to realize that it's not going to always pick the same thing your brain does.
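One way to influence the ordering is through the spellchecker definition itself. The fragment below is a hedged sketch built on the configuration in the question: comparatorClass accepts score (the default), freq, or a fully qualified class name, and distanceMeasure swaps the string-distance implementation. Neither setting is guaranteed to put apple first for appel; they only change the criteria used for ranking.

```xml
<searchComponent name="suggest" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">text_m_ss</str>
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">solr.IndexBasedSpellChecker</str>
    <str name="spellcheckIndexDir">./spellchecker</str>
    <str name="field">text_m_ss</str>
    <!-- score (default): string distance first, then frequency. -->
    <!-- freq: frequency in the index first, then string distance. -->
    <str name="comparatorClass">score</str>
    <!-- Optional: try a different distance measure and compare the results. -->
    <str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
```

The spellcheck index needs to be rebuilt after changing these settings before the new ordering can be judged.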

Removing unwanted items from solr autosuggester

I am trying to implement auto-suggest over a large set of indexed paragraphs, but I want to filter certain unwanted words out of the suggestions. For example, words like "and", "how", "when", etc. need to be excluded. How do I go about it?
This is the configuration I have for auto-suggest in solrconfig.xml:
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggest</str>
<str name="spellcheck.onlyMorePopular">true</str>
<str name="spellcheck.count">5</str>
<str name="spellcheck.collate">true</str>
</lst>
<arr name="components">
<str>suggest</str>
</arr>
</requestHandler>
<searchComponent class="solr.SpellCheckComponent" name="suggest">
<lst name="spellchecker">
<str name="name">suggest</str>
<str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">keywords</str>
<float name="threshold">0.005</float>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>
I would recommend adding the StopFilterFactory to the backing fieldType definition for your keywords field in your schema.xml file. If you need those words ("and", "how", "when") in your keywords field for other searching requirements, I would suggest creating a custom field in your schema.xml just for the suggester and you can use the copyField directive to populate this new field.
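In schema.xml that could look something like the sketch below. The names text_suggest and keywords_suggest are invented for illustration, and it assumes a stopwords.txt in the config directory that lists the words to exclude.

```xml
<!-- Hypothetical field type that drops stop words before terms reach the suggester. -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Assumption: stopwords.txt contains "and", "how", "when", and so on. -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
  </analyzer>
</fieldType>
<!-- Hypothetical field used only by the suggester. -->
<field name="keywords_suggest" type="text_suggest" indexed="true" stored="false" multiValued="true"/>
<!-- Populate it from the existing keywords field at index time. -->
<copyField source="keywords" dest="keywords_suggest"/>
```

The suggester's field setting would then reference keywords_suggest instead of keywords, and the documents need to be reindexed so the new field is populated.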
