Solr Special Characters and Emoticons

I want to index text data that contains special characters (like currency symbols) and emoticons. Presently I am using the following field type to index this data:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
</analyzer>
</fieldType>
But while retrieving the data I can see that all the special characters and emoticons are spoiled, e.g.:
Debt of ��1,590.79 settled for ��436.00
Please suggest what can be done here.
Application flow: data is first stored in HBase and then pushed to Solr by real-time indexers.
CDH version: 5.4.5
Solr version: 4.10.3
HBase version: 1.0.0

I solved this by converting the smileys to HTML hex entities and then storing them in Solr. In Solr I can now see the hex codes intact, and they can be converted back to smileys on retrieval.
Library used: emoji-java, to convert emoticons to hex entities.
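For illustration, a minimal sketch of that round trip using emoji-java's EmojiParser (the sample text is made up; only the two parser calls come from the library):
import com.vdurmont.emoji.EmojiParser;

public class EmojiRoundTrip {
    public static void main(String[] args) {
        String raw = "Debt settled \uD83D\uDE00"; // input containing an emoticon
        // Replace emoji with HTML hex entities before indexing into Solr
        String safe = EmojiParser.parseToHtmlHexadecimal(raw);
        System.out.println(safe); // e.g. Debt settled &#x1f600;
        // Restore the original emoticon after retrieving the stored value
        String restored = EmojiParser.parseToUnicode(safe);
        System.out.println(restored);
    }
}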

Related

substring match in solr query

I have a requirement where I have to match a substring in a query.
E.g. if the field has the value:
PREFIXabcSUFFIX
I have to create a query which matches abc. I always know the length of the prefix.
I cannot use EdgeNGram and NGram because of space constraints (they will inflate the index).
So I need to do this at query time and not at index time. Using a wildcard prefix, something like *abc*, will have a high impact on performance.
Since I will know the length of the prefix, I am hoping for some way to do something like ....abc*, where the dots represent the exact length of the prefix, so that the query is not as bad as scanning the whole index the way a leading-wildcard query (*abc*) does.
Is this possible in Solr? Thanks for your time.
Solr version: 4.10
Sure. Wildcard syntax is documented in the Lucene query parser documentation; you could search for something like ????abc*. You could also use a regex query.
However, the performance benefit from this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
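For illustration, a minimal SolrJ sketch issuing such a fixed-length wildcard query (the field name content and the core URL are assumptions; HttpSolrServer is the SolrJ 4.x client class):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FixedPrefixSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        // One ? per known prefix character, then the substring, then *
        SolrQuery query = new SolrQuery("content:????abc*");
        QueryResponse response = server.query(query);
        System.out.println("Hits: " + response.getResults().getNumFound());
        server.shutdown();
    }
}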
You could use the RegularExpressionPatternTokenizer for this. For the sample below I guessed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would become abcSUFFIX. This way you may search for abc*.
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
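To see what that pattern does outside Solr, here is a plain java.util.regex check of the same expression (prefix length 6 assumed, as above):
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PrefixStrip {
    public static void main(String[] args) {
        // Same pattern as in the tokenizer config: skip 6 characters, capture the rest
        Pattern pattern = Pattern.compile(".{6}(.+)");
        Matcher matcher = pattern.matcher("PREFIXabcSUFFIX");
        if (matcher.matches()) {
            System.out.println(matcher.group(1)); // prints abcSUFFIX
        }
    }
}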

Solr spellcheck Polish characters

I would be more than grateful for information on whether somebody has been able to configure spellcheck in Solr so that queries return results when Polish characters are replaced with plain ASCII ones.
I have spellcheck enabled; however, I am not getting any results when searching for 'slub', while I am getting plenty for 'ślub'.
Cheers
You should add an ASCIIFoldingFilterFactory to your spellchecking field configuration.
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="false"/>
Converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists.

Query Solr accented and unaccented

I'm working on configuring my Solr core that stores Brazilian Portuguese data.
Regarding accents, I need queries to behave like this:
search | return
computação | computacao
computacao | computação
What I basically need is that, with or without accents in the query, both forms of the word are returned.
I tried:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
Without success
I'm using Solr 5.2.1
Try adding the BrazilianStemFilterFactory as a filter on the field type used for searching.
It is written specifically for Brazilian Portuguese.
This could solve your issue.
When using a multilingual index, what I have done is create a new field for each language, using the language-specific text field type.
So let's say you have English and Portuguese; you would then declare two fields:
descriptionPt and use text_pt
descriptionEn and use text
Now when you run your search, you specify which of the fields (or both) to search via qf, and set defType=edismax.
Worked fine for me.
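As an illustration, a minimal SolrJ 5.x sketch of such a query (the core URL is an assumption; the field names are the ones declared above):
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class MultilingualSearch {
    public static void main(String[] args) throws Exception {
        HttpSolrClient client = new HttpSolrClient("http://localhost:8983/solr/mycore");
        SolrQuery query = new SolrQuery("computacao");
        query.set("defType", "edismax");                // use the eDisMax query parser
        query.set("qf", "descriptionPt descriptionEn"); // search both language fields
        QueryResponse response = client.query(query);
        System.out.println(response.getResults());
        client.close();
    }
}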

Solr: Localities & solr.ICUCollationField usage?

I'm learning Solr and have become confused trying to figure out ICUCollation: what it does, what it is for, and how to use it. I haven't found any good explanation of this online. The docs appear to be saying that I need to use ICUCollation and imply that it does magical things for me, but do not seem to explain exactly why or exactly what, or how it integrates with anything else.
Say I have a text field in French and I want stopwords removed, accents, punctuation and case ignored and stemming... how does ICUCollation come into this? Do I set solr.ICUCollationField and locale='fr' and it will do everything else automatically? Or do I set solr.ICUCollationField and then tokenizer and filters on this in addition? Or do I not use solr.ICUCollationField at all because that's for something completely different? And if so, then what?
Collation is the organisation of written information into an order - ICUCollationField (the API documentation also provides a good description) is meant to enable you to provide locale-aware sorting, as the sort order is defined by cultural norms and specific language properties. This is useful to allow different sorting based on those rules, such as the difference between Norwegian and Swedish, where a Swede would order Å before Æ/Ä and Ø/Ö, while a Norwegian would order them Æ/Ä, Ø/Ö and then Å.
Since you usually don't want to sort by a tokenized field (exception: KeywordTokenizer) or a multivalued field, these fields are usually not processed any more than allowing for the sorting / collation to be performed.
There is a case to be made for collation filters for searching as well, as search in practice is just comparison. This means that if you're aiming to search for two words that would be identical when compared in the locale provided, it would be a hit. The tokens indexed will not make any sense when inspected, but as long as the values are reduced to the same token both when indexing and searching, it would work. There's an example of this on the wiki under UnicodeCollation.
Collation does not affect stopwords (StopFilterFactory), accents (ICUFoldingFilterFactory), punctuation, case (depending on locale - if the locale for sorting is case aware, then it does not) (LowercaseFilterFactory or ICUNormalizer2FilterFactory) or stemming (SnowballPorterFilterFactory). Have a look at the suggested filters for that. Most filters and tokenizers in Solr do very specific tasks, and try to avoid doing "everything and the kitchen sink" in one single filter.
You normally have two or more fields for one text input if you want to do different things like:
search: text analysis
sort: language sensitive / case insensitive sorting
facet: string
For search use something like:
<fieldType name="textFR" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.ElisionFilterFactory"/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.FrenchLightStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
For sorting use:
<fieldType name="textSortFR" class="solr.ICUCollationField"
locale="fr"
strength="primary" />
or simply:
<fieldType name="textSort" class="solr.ICUCollationField"
locale=""
strength="primary" />
(If you have to support many languages. Should work fine enough in most cases.)
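To actually sort on such a type at query time you need a field that uses it; here is a sketch (the field name titleSortFR is an assumption, typically populated via a copyField from the searchable text field):
import org.apache.solr.client.solrj.SolrQuery;

public class FrenchSortedQuery {
    public static void main(String[] args) {
        SolrQuery query = new SolrQuery("*:*");
        // Locale-aware ordering via the collation field
        query.setSort("titleSortFR", SolrQuery.ORDER.asc);
        System.out.println(query); // prints the resulting query parameters
    }
}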
Do make use of the Analysis UI in the SOLR Admin: open the analysis view for your index, select the field type (e.g. your sort field), add a representative input value in the left text area and a test value in the right field (in case of sorting, this right side value is not as interesting as the sort field is not used for matching).
The output will show you whether:
accents are removed
elisions are removed
lower casing is applied
etc.
For example, if you see that elisions are not removed (l'atelier should become atelier) but you would like to discard them for sorting, you would have to add the elision filter (see the example for the search field type above).
https://cwiki.apache.org/confluence/display/solr/Language+Analysis

Solr - removing special characters

A pretty basic question, but can anyone tell me how to remove special characters from documents while indexing in Solr? I went through the Solr wiki but couldn't find anything relevant. I saw a few tokenizers like WhiteSpaceTokenizerFactory and StandardTokenizerFactory. I am using WhiteSpaceTokenizerFactory in my schema.xml but it doesn't seem to serve the purpose. I am still able to query using "*" and "-" etc.
Consider using the standard tokenizer:
<tokenizer class="solr.StandardTokenizerFactory"/>
It should remove the characters you have mentioned.
Once the words have been tokenized you may apply further processing, like splitting on case change and numerics, using the WordDelimiterFilterFactory for better matching.
Also very useful, almost all the time when dealing with schema configuration, is Solr's analysis page. It gives you a lot of valuable feedback.
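If you want to verify this outside Solr, a small sketch against the Lucene 4.10 API (matching the Solr versions discussed here; the sample input is made up) shows the standard tokenizer dropping those characters:
import java.io.StringReader;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardTokenizerDemo {
    public static void main(String[] args) throws Exception {
        // StandardTokenizer discards punctuation such as "*" and "-" as it tokenizes
        StandardTokenizer tokenizer = new StandardTokenizer(new StringReader("*hello* foo-bar"));
        CharTermAttribute term = tokenizer.addAttribute(CharTermAttribute.class);
        tokenizer.reset();
        while (tokenizer.incrementToken()) {
            System.out.println(term.toString()); // hello, foo, bar
        }
        tokenizer.end();
        tokenizer.close();
    }
}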
