Spell checking with Solr

I use Solr to index documents (PDF, Word, .txt, etc.). I need a spell checker for French, but I don't know how to set one up. I only need it on the field "content", whose type is text_general.

The spellchecker builds its suggestion terms from the content of your index, so there is no language configuration: as long as the indexed content is French, the suggestions returned to the user will be based on those French terms.
The exception is the FileBasedSpellChecker, where you supply the dictionary yourself as a file of correctly spelled terms.
# spellcheck.q is only necessary if you want to use a different query than your actual query
&spellcheck=true&spellcheck.q=foo
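As a rough sketch of how those parameters end up in a request, the Python snippet below only builds the URL and does not contact a server. The host, the core name (mycore), the /select handler path and the misspelled example term are assumptions for illustration, and the handler is assumed to have the spellcheck component configured.

```python
from urllib.parse import urlencode


def spellcheck_url(base_url, query, spellcheck_q=None):
    """Build a request URL with spellchecking switched on.

    spellcheck_q is optional: pass it only when the text to spellcheck
    should differ from the main query.
    """
    params = [("q", query), ("spellcheck", "true")]
    if spellcheck_q is not None:
        params.append(("spellcheck.q", spellcheck_q))
    return base_url + "?" + urlencode(params)


# Hypothetical host and core name, for illustration only.
BASE = "http://localhost:8983/solr/mycore/select"

url = spellcheck_url(BASE, "content:documant", spellcheck_q="documant")
print(url)
```

Passing only the bare term in spellcheck.q keeps the field prefix (content:) out of the text that gets spellchecked.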

Related

Solr multilingual stemming

I'm using Solr to index documents such as .pdf or .docx files. These documents are in French or in English, and I want stemming for both languages.
For example, if I search for "chevaux" I want to find "cheval" (French), and if I search for "raise" I want to find "raising" (English).
Is there a way to do this without creating two cores (one for English and one for French)?
Have two fields, one with the field definition you want for French, and one with the field definition you want for English. Then use the Language Detection feature to submit the content to the correct field.
When searching, query the field that matches the user's language; if you don't know it, search both fields, or use language detection on the query to make a better guess.
You can also index the same content into both fields, but my initial guess is that it'll give you weird results down the road: someone enters a French word and, because of the processing rules for English, you get a hit that wouldn't have happened if you had only indexed into the correct field.
By enabling langid.map, you can tell Solr to index the content into fields named fieldname_langcode (where fieldname is picked up from langid.fl).
langid.map: Enables field name mapping. If true, Solr will map field names for all fields listed in langid.fl.
You can use langid.map.replace or langid.map.pattern if you want to change the default fieldname_langcode naming, but I'd leave those alone for now.
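To make the naming concrete, here is a small Python sketch of the default fieldname_langcode convention and of a query that searches both language fields when the user's language is unknown. The field name content and the helper names are hypothetical, and the query is a plain OR of fielded clauses with no escaping, so treat it as an illustration rather than production code.

```python
def mapped_field(field, lang):
    """Field name following the default fieldname_langcode convention."""
    return f"{field}_{lang}"


def multilingual_query(field, languages, term):
    """OR together one fielded clause per language (no query escaping)."""
    clauses = [f"{mapped_field(field, lang)}:{term}" for lang in languages]
    return " OR ".join(clauses)


print(mapped_field("content", "fr"))
# content_fr
print(multilingual_query("content", ["fr", "en"], "chevaux"))
# content_fr:chevaux OR content_en:chevaux
```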

Solr 3.6.2 spellcheck multi-word phrase: how to get collations without ignored stopwords?

I'm having a problem with the Solr 3.6.2 default (field-based) spellchecker, configured with these query-time parameters:
spellcheck.onlyMorePopular=true
spellcheck.count=5
spellcheck.collate=true
spellcheck.maxCollations=5
spellcheck.maxCollationTries=5
on a field type which has a solr.StopFilterFactory filter on its analyzers.
The suggestion phase works as intended:
the indexed field does not contain any stopword
no suggestion is provided for a given stopword
But the resulting collation always contains the ignored stopwords, which I don't want: I'd prefer a raw suggestion of combined terms over something which looks like a "sort of" natural language answer.
For instance, searching for "handfum of perries", I'd prefer "handful berry" over "handful of berry".
I don't think the stopwords that the field's query analyzer excludes from spellchecking suggestions are "marked" for preservation in the way the official documentation describes for other query elements:
Note that the non-spellcheckable terms such as those for range
queries, prefix queries etc. are detected and excluded for
spellchecking. Such non-spellcheckable terms are preserved in the
collated output so that the original query can be run again, as is.
It seems there are two possible solutions:
either a custom query converter, so that the stopwords are ignored right from the start (I'm not sure this is possible in 3.6.2),
or a custom spellchecker that would not try to find any suggestion for a stopword (or would always suggest an "empty" string), without messing up the collation process.
Am I missing something?
Regards
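A client-side workaround, which sidesteps the problem rather than solving it inside Solr, would be to strip the stopwords from each collation after the response comes back. The sketch below assumes the collation has already been extracted from the response as a plain string and that the client holds the same stopword list as the field's StopFilterFactory; both the function name and the stopword set are hypothetical.

```python
def strip_stopwords(collation, stopwords):
    """Drop stopword tokens from a whitespace-separated collation string."""
    kept = [tok for tok in collation.split() if tok.lower() not in stopwords]
    return " ".join(kept)


# Hypothetical stopword list; in practice it should mirror the field's list.
STOPWORDS = {"of", "the", "a"}

print(strip_stopwords("handful of berry", STOPWORDS))
# handful berry
```

This only works for simple keyword collations; a collation that carries query syntax (field prefixes, ranges, quotes) would need real parsing.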

Is there a way to get a list of all the words in the solr spellcheck index?

I'm using Solr's SpellCheckComponent with IndexBasedSpellChecker. Wondering if there's a way to get an output of all the words in the dictionary.
Might help us catch some of the misspellings on our site.
Yes, there is. According to the documentation: "The IndexBasedSpellChecker uses a Solr index as the basis for a parallel index used for spell checking. It requires defining a field as the basis for the index terms."
So it just uses one field that you choose from the index. To enumerate all the terms in a field, use the Terms component and set terms.fl to that field. If you have lots of terms, you can page through them with terms.lower, terms.upper and terms.limit, fetching the list over multiple calls.
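A minimal sketch of that paging in Python, building URLs only. The host, core name, /terms handler path and the field name spell are assumptions; the idea is to request terms in index order and pass the last term of each page as an exclusive terms.lower for the next call.

```python
from urllib.parse import urlencode


def terms_page_url(base_url, field, lower=None, limit=100):
    """URL for one page of terms from `field`, in index (lexical) order.

    `lower` is the last term of the previous page; it is excluded from the
    next page by setting terms.lower.incl=false.
    """
    params = [
        ("terms", "true"),
        ("terms.fl", field),
        ("terms.limit", str(limit)),
        ("terms.sort", "index"),
    ]
    if lower is not None:
        params += [("terms.lower", lower), ("terms.lower.incl", "false")]
    return base_url + "?" + urlencode(params)


# Hypothetical handler URL and field name, for illustration only.
TERMS_BASE = "http://localhost:8983/solr/mycore/terms"

first_page = terms_page_url(TERMS_BASE, "spell")
next_page = terms_page_url(TERMS_BASE, "spell", lower="apple")
print(first_page)
print(next_page)
```

A client would repeat the second call, feeding in the last term it received each time, until a page comes back with fewer than terms.limit entries.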

Sunspot/Solr: word concatenation

I'm using Solr with the Sunspot Ruby gem. It works great, but I'm noticing that sometimes users will get poor search results because they have concatenated their search terms (e.g. 'foolproof') where the document text was 'fool proof'. Or vice-versa.
I was going to try and address this by creating a set of alternate match fields by manually concatenating the words from the source documents together. This seems kind of hackish, and implementing the other side (breaking up user concatenations into words) is not obvious.
Is there a way to do this properly in Solr/Sunspot?
Did you have a look at the Solr SpellCheckComponent?
http://wiki.apache.org/solr/SpellCheckComponent
For example, there is a WordBreakSolrSpellChecker, which may provide valid suggestions in such cases.
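To illustrate the idea behind word-break suggestions (this is a toy sketch of the concept, not how WordBreakSolrSpellChecker is implemented), the Python below splits a query word in two when both halves are known terms, and joins adjacent words when the result is a known term. The term set stands in for the indexed vocabulary and is made up for the example.

```python
def word_break_suggestions(word, known_terms):
    """Return every two-word split of `word` whose halves are both known."""
    splits = []
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in known_terms and right in known_terms:
            splits.append(f"{left} {right}")
    return splits


def combine_suggestion(words, known_terms):
    """Return the concatenation of `words` if it is a known term, else None."""
    joined = "".join(words)
    return joined if joined in known_terms else None


# Made-up vocabulary standing in for the terms of the indexed field.
terms = {"fool", "proof", "foolproof"}

print(word_break_suggestions("foolproof", terms))
# ['fool proof']
print(combine_suggestion(["fool", "proof"], terms))
# foolproof
```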

Solr Spell Check result based filter query

I implemented the Solr SpellCheckComponent following the documentation at http://wiki.apache.org/solr/SpellCheckComponent , and it works well. Now I am trying to filter the spellcheck results by another field. Consider the following schema:
product_name
product_text
product_category
product_spell -> copies the strings from product_name and product_text, tokenized with a whitespace analyzer
For the above schema, I am trying to filter the spellcheck results by a given category. I tried querying like http://127.0.0.1:8080/solr/colr1/myspellcheck/?q=product_category:160%20appl&spellcheck=true&spellcheck.extendedResults=true&spellcheck.collate=true . The spellcheck results do not take product_category:160 into account.
Is it because the dictionary was built for all the categories? If so, is it a good idea to create a dictionary for every category?
Is it not possible to have another filter condition in the spellcheck component?
I am using Solr 3.5.
I previously understood from the SOLR-2010 issue that filtering through the fq parameter should be possible using collation, but it isn't; I think I misunderstood.
In fact, the SpellCheckComponent most likely uses a separate index, except with the DirectSolrSpellChecker implementation. That means the field you select is indexed into a different index, which contains only the information for the specific field you chose for spelling corrections.
If you're curious, you can look at that additional index with Luke, since it is of course a Lucene index. Unfortunately, filtering by other fields isn't an option there, simply because the index holds only one field: the one you use for spelling corrections.
