I would be grateful if anyone who has managed to configure spellcheck in Solr could share how, so that queries still return results when Polish characters are replaced with their plain ASCII equivalents.
I have spellcheck enabled; however, I get no results when searching for 'slub', while I get plenty for 'ślub'.
Cheers
You should add an ASCIIFoldingFilterFactory to your spellcheck field's analyzer configuration.
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
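For example, a spellcheck field type could fold diacritics like this (a sketch; the field type name is illustrative, not from the original post):
<fieldType name="text_spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
  </analyzer>
</fieldType>
With this, both 'ślub' and 'slub' fold to the same token, so a spellchecker built on this field can match either form. Remember to reindex (and rebuild the spellcheck dictionary) after changing the analyzer.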
Related
After changing splitOnNumerics="0" I can search words that mix numbers and letters, such as "90s", "omega30", etc., but it still does not work with special characters like "80"", "40)", etc., even when I escape them: 80\", 40\), etc. Do you have any idea?
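For reference, splitOnNumerics is a parameter of WordDelimiterFilterFactory; the kind of configuration being described would look roughly like this (attribute values other than splitOnNumerics are illustrative):
<filter class="solr.WordDelimiterFilterFactory"
        splitOnNumerics="0"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"/>
Note that punctuation such as " and ) is typically stripped by this filter (or by the tokenizer) at index time, so escaping it in the query cannot help: the characters are no longer present in the indexed tokens.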
I want to index text data that contains special characters (such as currency symbols) and emoticons. Currently I am using the following configuration to index this data:
<fieldType name="text" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
But when retrieving the data I can see that all the special characters and emoticons are spoiled, e.g.
Debtof��1,590.79settledfor��436.00
Please suggest what can be done here.
Application flow: data is first stored in HBase and is updated to Solr with real-time indexers.
CDH Ver: 5.4.5
Solr Ver: 4.10.3
HBase Ver: 1.0.0
I solved this by converting smileys to HTML hex entities before storing them in Solr. In Solr I can now see the hex codes intact, and they can be converted back to smileys.
Library used:
emoji-java, to convert emoticons to hex
How can we map non-ASCII characters to ASCII characters?
E.g.: in the Solr index we have words containing the characters ñ, Ñ [LATIN CAPITAL LETTER N WITH TILDE] as well as plain n, N.
Which filter/tokenizer should we use so that searching with either the plain N or Ñ matches both?
Merging the answers of Solr, Special Chars, and Latin to Cyrilic char conversion
Take a look at Solr's Analyzers, Tokenizers, and Token Filters page, which gives a good intro to the kind of manipulation you're looking for.
Probably the ASCIIFoldingFilterFactory does exactly what you want.
When changing an analyzer to remove accents, keep in mind that you need to reindex. Otherwise the accented characters remain in the index, but no user input will ever match them.
Update
I tried the ICUFoldingFilterFactory, and it works fine with those accents. If it is tricky to set up, have a look at the SO question Can not use ICUTokenizerFactory in Solr
This analyzer
<fieldType name="spanish" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory" />
</analyzer>
</fieldType>
got me the expected analysis results; the screenshot was taken from the Solr admin UI
When I pass queries to Solr, I pass them as quoted strings (“blah blah”). I do this because I have encoding problems with Greek (my input field accepts Greek characters only as a string). But Solr treats the characters inside the quotes as an “exact match” phrase. Is there a way to remove the double quotes in Solr?
Thanks
If you use solr.StrField in your schema, it makes sense that you get exact matches, see:
http://azeckoski.blogspot.com/2009/06/tricky-solr-schema-issue-with-strfield.html
You should really use solr.TextField, which would allow you to use the Greek analyzers. I don't quite understand why it accepts Greek characters only as strings; can you explain?
About Greek lower case and stemming:
http://wiki.apache.org/solr/LanguageAnalysis#Greek
On the other hand, please note that if you use stemming, you won't be able to do exact matches anymore...
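A sketch of a Greek text field type along these lines (the field type name and stopword file path are illustrative; these factories ship with Solr, but check availability in your version):
<fieldType name="text_el" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.GreekLowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="false" words="lang/stopwords_el.txt"/>
    <filter class="solr.GreekStemFilterFactory"/>
  </analyzer>
</fieldType>
If you do need exact matches, leave out the GreekStemFilterFactory.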
I need to index words in Spanish and have tested ASCIIFoldingFilterFactory. This filter works great for accented characters (it converts á -> a), but it also converts ñ -> n, and that is not valid behaviour (it gives wrong results for some words).
Is there a way to exclude a letter from ASCIIFoldingFilterFactory or another filter to try?
Thanks
You can use MappingCharFilter and customise the mappings found in mapping-FoldToASCII.txt:
<charFilter class="solr.MappingCharFilterFactory"
            mapping="/solr/trunk/solr/example/solr/conf/mapping-FoldToASCII.txt"/>
(change the mapping path to the file's location on your system)
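For example, you could copy mapping-FoldToASCII.txt, remove the ñ/Ñ entries, and point the char filter at your copy (the file name here is illustrative):
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-FoldToASCII-es.txt"/>
The mapping file uses one rule per line, e.g.:
"á" => "a"
"é" => "e"
Simply delete (or comment out with #) the lines mapping "ñ" => "n" and "Ñ" => "N", and those characters will pass through unchanged.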
Alternatively, you can try extending BaseTokenFilterFactory and pointing to your custom class in schema.xml as one of your index/search filters.