Solr indexing ñ - solr

I need to index words in Spanish and have tested with ASCIIFoldingFilterFactory. This filter works great for accented characters (it converts á -> a), but it also converts ñ -> n, which is not valid behaviour (it gives wrong results for some words).
Is there a way to exclude a letter from ASCIIFoldingFilterFactory, or is there another filter to try?
Thanks

You can use MappingCharFilter and customise the mappings that are in mapping-FoldToASCII.txt
<charFilter class="solr.MappingCharFilterFactory"
mapping="/solr/trunk/solr/example/solr/conf/mapping-FoldToASCII.txt"/>
(change the file location to the path on your system)
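As a rough sketch of that approach: the field type below applies a customised copy of the mapping file before tokenizing. The field type name text_es_folded and the file name mapping-FoldToASCII-keep-enye.txt are hypothetical. The file is assumed to be a copy of mapping-FoldToASCII.txt, placed in the core's conf directory, with the rules for ñ and Ñ deleted so that only the other accented characters are folded.

```xml
<!-- Sketch only. mapping-FoldToASCII-keep-enye.txt is a hypothetical copy of
     mapping-FoldToASCII.txt in which the rules that map ñ and Ñ to n and N
     have been removed; all other rules are left untouched. -->
<fieldType name="text_es_folded" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer, on the raw character stream -->
    <charFilter class="solr.MappingCharFilterFactory"
                mapping="mapping-FoldToASCII-keep-enye.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this in place, á would still be indexed as a while ñ stays ñ. Existing documents need to be reindexed after the analyzer changes.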

You can also try extending BaseTokenFilterFactory and pointing to your class in the schema.xml file as one of your index/search filters.

Related

Solr spellcheck polish characters

I would be grateful if somebody who has managed to configure spellcheck in Solr could tell me how to make queries return results when Polish characters are replaced with their unaccented equivalents.
I have spellcheck enabled; however, I am not getting any results when searching for 'slub', while I get plenty for 'ślub'.
Cheers
You should add an ASCIIFoldingFilterFactory to your spellchecking field configuration.
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
This filter converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.
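A minimal sketch of how the filter might sit in a spellchecking field type. The names textSpell_folded and spell are assumptions for illustration, not taken from the question. With folding applied, 'ślub' is indexed as 'slub', so both spellings produce the same term.

```xml
<!-- Sketch only: field type and field names are hypothetical -->
<fieldType name="textSpell_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- A single <analyzer> is used at both index and query time -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Folds ś to s, ł to l, and so on; the accented original is not kept -->
    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
  </analyzer>
</fieldType>

<field name="spell" type="textSpell_folded" indexed="true" stored="false" multiValued="true"/>
```

The spellcheck component would then need to be pointed at this field, and the index rebuilt, before the folded terms show up in suggestions.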

SOLR Special Characters and Emoticons

I want to index text data that contains special characters (such as currency symbols) and emoticons. At present I am using the following configuration to index this data:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt"
ignoreCase="true"/>
</analyzer>
</fieldType>
But when retrieving the data I can see that all the special characters and emoticons are garbled, e.g.
Debtof��1,590.79settledfor��436.00
Please suggest what can be done here.
Application flow: data is first stored in HBase and is then pushed to Solr by real-time indexers.
CDH version: 5.4.5
Solr version: 4.10.3
HBase version: 1.0.0
I solved this by converting the smileys to HTML hex codes and then storing them in Solr. In Solr I can now see the hex codes intact, and they can be converted back to smileys.
Library used to convert emoticons to hex: emoji-java

Query Solr accented and unaccented

I'm configuring a Solr core that stores Brazilian Portuguese data.
About accents, I need to query something like:
search | return
computação | computacao
computacao | computação
Basically, what I need is for a query to return both forms of a word, whether or not the query contains the accent.
I tried:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
Without success
I'm using Solr 5.2.1
Try adding the BrazilianStemFilterFactory as a filter to the field type used for searching the field.
It is written specifically for Brazilian Portuguese, so it could solve your issue.
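A possible shape for such a field type, as a sketch under assumptions: the name text_pt_folded is hypothetical, and the filter order is one choice among several. Here folding runs before stemming, so 'computação' and 'computacao' reach the stemmer as the same string and should always end up as the same term. The trade-off is that the stemmer no longer sees the accents.

```xml
<!-- Sketch only: the field type name text_pt_folded is hypothetical -->
<fieldType name="text_pt_folded" class="solr.TextField" positionIncrementGap="100">
  <!-- One <analyzer> element applies to both indexing and querying -->
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Fold accents first so accented and unaccented input become identical -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- Then stem; identical input gives identical stems -->
    <filter class="solr.BrazilianStemFilterFactory"/>
  </analyzer>
</fieldType>
```

The field has to use this type at both index and query time, and the documents must be reindexed after the change. Otherwise the old accented terms remain in the index.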
When using a multilingual index what I have done is create a new field for each language that uses the language specific text field.
So let's say you have English and Portuguese and thus you would declare two fields:
descriptionPt and use text_pt
descriptionEn and use text
Now when you run your search you would specify which field you would like to use, or both, via qf, and specify defType=edismax.
Worked fine for me.
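The setup described above could be sketched roughly as follows. The two field declarations belong in schema.xml and the handler defaults in solrconfig.xml. The handler name /select and the attribute values are assumptions. The field and type names are the ones mentioned in the answer.

```xml
<!-- schema.xml: one field per language, each with its own analysis chain -->
<field name="descriptionPt" type="text_pt" indexed="true" stored="true"/>
<field name="descriptionEn" type="text" indexed="true" stored="true"/>

<!-- solrconfig.xml: search both fields by default with edismax
     (sketch only; the same parameters can be sent per request instead) -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">descriptionPt descriptionEn</str>
  </lst>
</requestHandler>
```

To search a single language, the request can override qf with just one of the two fields.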

Solr: how to search ñ and Ñ with the normal char N and vice versa

How can we map non-ASCII characters to ASCII characters?
For example, the Solr index contains words with the characters ñ, Ñ [LATIN CAPITAL LETTER N WITH TILDE] as well as the normal n, N.
Which filter or tokenizer should we use so that a search with either the normal N or Ñ matches both?
Merging the answers of Solr, Special Chars, and Latin to Cyrilic char conversion
Take a look at Solr's Analyzers, Tokenizers, and Token Filters which give you a good intro to the type of manipulation you're looking for.
Probably the ASCIIFoldingFilterFactory does exactly what you want.
When changing an analyzer to remove the accents, keep in mind that you need to reindex. Otherwise the accented characters will stay in the index, but no user input will be able to match them.
Update
I tried using the ICUFoldingFilterFactory; it works fine with those accents. If it is tricky to set up, have a look at the SO question "Can not use ICUTokenizerFactory in Solr".
This analyzer
<fieldType name="spanish" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.ICUFoldingFilterFactory" />
</analyzer>
</fieldType>
got me these analysis results, the screen-shot is taken from solr-admin

How to index words with special character in Solr

I would like to index some words with special characters all together.
For example, given m&m, I would like to index it as a whole, rather than delimiting it as m and m (normally & would be considered as a delimiter).
Is there a way to achieve this by using standard tokenizer/filter or should I have to write one myself?
Basically, the text field type filters out special characters before indexing. You can use the string type instead, but it is not advisable for searching. You can also use the types option of WordDelimiterFilterFactory, or convert those special characters to alphabetic words:
% => percent
& => and
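One way the types option could be used, as a sketch: the field type name text_keep_amp and the file name wdfftypes.txt are hypothetical. The types file is assumed to contain a single rule that reclassifies & as an alphabetic character, so the filter no longer treats it as a delimiter.

```xml
<!-- Sketch only. wdfftypes.txt is a hypothetical file in the conf directory
     containing the line:   & => ALPHA   -->
<fieldType name="text_keep_amp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Split on whitespace only, so "m&m" reaches the filter as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- With & typed as ALPHA, "m&m" is not split into "m" and "m" -->
    <filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt"
            generateWordParts="1" generateNumberParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Other punctuation still acts as a delimiter here, so a token such as "m&m;" would be expected to lose only the trailing semicolon.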
The StandardTokenizerFactory splits the given text at special characters. To index terms with their special characters intact, you could either write your own custom tokenizer or do the following:
Take a list of characters at which you want to split the text. For example, my list is {" ",";"}.
Use a PatternTokenizer with the above list of characters instead of the StandardTokenizer. Your configuration will look like:
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=" |;" />
</analyzer>
You can use WhitespaceTokenizerFactory.
http://docs.lucidworks.com/display/solr/Tokenizers#Tokenizers-WhiteSpaceTokenizer
It tokenizes only on whitespace. For example,
"m&m" will be treated as a single token and indexed as such.

Resources