SOLR Snowball Porter for Arabic - solr

Is there a Snowball Porter filter or any similar filter for Arabic?
<filter class="solr.SnowballPorterFilterFactory" language="English" />
I need it to normalize plural words into singular words for the Arabic language

Solr provides language support for quite a wide range of languages.
Check for the Arabic ones # link
From the documentation:
Provides character normalization and stemming
If you have English and Arabic text, you can check Solr's language detection, which will help you keep the fields separate and search accordingly.
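For Arabic specifically, a minimal sketch along those lines is the text_ar field type shipped with Solr's example schema, which combines normalization and a light stemmer (the stopwords path follows the default lang/ layout, so adjust it to your configset):
<fieldType name="text_ar" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- tokenize, lowercase and drop Arabic stopwords -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_ar.txt"/>
    <!-- normalize character variants, then apply the light Arabic stemmer, which strips common plural suffixes -->
    <filter class="solr.ArabicNormalizationFilterFactory"/>
    <filter class="solr.ArabicStemFilterFactory"/>
  </analyzer>
</fieldType>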

Related

Difference between Solr SnowballPorterFilterFactory and PortugueseStemFilterFactory

Solr has the SnowballPorterFilterFactory, which you can use with a language parameter:
<filter class="solr.SnowballPorterFilterFactory" language="Portuguese" />
Solr also has some language-specific stemmers, like the PortugueseStemFilterFactory. I have read the documentation, but I am unable to find out what the difference between them is.
From the source comments:
Portuguese stemmer implementing the RSLP (Removedor de Sufixos da Lingua Portuguesa) algorithm. This is sometimes also referred to as the Orengo stemmer.
The algorithm used is specifically tailored to the necessities of the Portuguese language, and knows about the different word classes and how they should be stemmed in Portuguese.
The Snowball stemmer, however, is a general stemmer engine, where you give it a dictionary to work with - i.e. suffixes that should be stemmed, etc. This does not allow the same kind of knowledge about how to classify and stem specific word classes.
I can't see any reason why you'd want to use the Snowball version when you have the Portuguese RSLP available, but I haven't done any work in Portuguese (I did, however, have to manually update the Norwegian one for certain edge cases that Snowball didn't catch by default).
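If you want to try the RSLP stemmer, a minimal sketch of a field type using it could look like this (the field type name is just illustrative; the stemmer expects lowercased input, hence the LowerCaseFilterFactory in front):
<fieldType name="text_pt_rslp" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- RSLP (Orengo) stemmer instead of the generic Snowball one -->
    <filter class="solr.PortugueseStemFilterFactory"/>
  </analyzer>
</fieldType>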

Solr query data with white space needs to be queried

I am new to solr. I have data in solr something like "name":"John Lewis".
The query formed looks like fq=name%3A+%22John+Lewis%22 and searches perfectly.
This is formed in Solr console and works well.
My requirement is to search for a particular word coming from my Java layer as "JohnLewis". It has to be mapped to "John Lewis" in the Solr repository.
This search is not restricted to just the name field (2 words with a space in between).
I have some other details like "Cash Reward Credit Cards", which has 4 words, and a user would query it as "CashRewardCreditCards".
Could someone help me with this, if it can be handled in schema.xml with any parser that is available in Solr?
You need to create custom fieldType.
First define a fieldType in your solr schema :
<fieldType name="word_concate" class="solr.TextField" indexed="true" stored="false">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s*" replacement=""/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>
Here we named the fieldType as word_concate.
We used the char filter solr.PatternReplaceCharFilterFactory.
A char filter is a component that pre-processes input characters. Char filters can be chained like token filters and placed in front of a tokenizer. PatternReplaceCharFilterFactory uses regular expressions to replace or change character patterns.
The pattern \s* means zero or more whitespace characters.
Second, create a field with word_concate as its type:
<field name="cfname" type="word_concate"/>
Copy your name field to cfname with a copyField:
<copyField source="name" dest="cfname"/>
Third, reindex the data.
Now you can query cfname:"JohnLewis" and it will return the name John Lewis.
Assuming your input is CamelCase as shown, I would use Solr's Word Delimiter Filter with the splitOnCaseChange parameter on the query side of your analyzer as a starting point. This will take an input token such as CashRewardCreditCards and generate the tokens Cash Reward Credit Cards.
See also:
https://cwiki.apache.org/confluence/display/solr/Filter+Descriptions#FilterDescriptions-WordDelimiterFilter
Look at WordDelimiterFilterFactory
It has a splitOnCaseChange property. If you set that to 1, JohnLewis will be indexed as John Lewis.
You'll need to add this to your query analyzer. If the user searches for JohnLewis, the search will be translated to John Lewis.
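A minimal sketch of what that could look like (the field type name and the index-side chain are illustrative; the important part is that the delimiter filter runs on the query side before lowercasing, so the case information is still there to split on):
<fieldType name="text_split" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- JohnLewis -> John, Lewis -->
    <filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="1" generateWordParts="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>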

Query Solr accented and unaccented

I'm working on configuring my Solr core, which stores Brazilian Portuguese data.
About accents, I need to query something like:
search | return
computação | computacao
computacao | computação
Basically, what I need is that a query with or without accents returns both types of words.
I tried:
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
Without success
I'm using Solr 5.2.1
Try adding the BrazilianStemFilterFactory as a filter on the field type used for searching the field.
It is written specifically for Brazilian Portuguese.
This could solve your issue.
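A minimal sketch of such a field type (the name is illustrative and this is one possible ordering - verify the output with the Analysis screen and reindex after changing it; the same analyzer runs at index and query time, so computação and computacao end up as the same folded token):
<fieldType name="text_ptbr" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- fold accented characters to their ASCII equivalents -->
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <filter class="solr.BrazilianStemFilterFactory"/>
  </analyzer>
</fieldType>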
When using a multilingual index, what I have done is create a new field for each language, using the language-specific text field type.
So let's say you have English and Portuguese and thus you would declare two fields:
descriptionPt and use text_pt
descriptionEn and use text
Now when you run your search, you specify which field (or both) you would like to use via qf, and specify defType=edismax.
Worked fine for me.
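For example, the request parameters could look like this (field names as declared above, query term just for illustration; remember to URL-encode the space in qf when building the request):
q=computação
defType=edismax
qf=descriptionPt descriptionEn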

Solr: Localities & solr.ICUCollationField usage?

I'm learning Solr and have become confused trying to figure out ICUCollation: what it does, what it is for and how to use it. From here. I haven't found any good explanation of this online. The docs appear to say that I need to use this ICUCollation and imply that it does magical things for me, but do not seem to explain exactly why or what, or how it integrates with anything else.
Say I have a text field in French and I want stopwords removed, accents, punctuation and case ignored and stemming... how does ICUCollation come into this? Do I set solr.ICUCollationField and locale='fr' and it will do everything else automatically? Or do I set solr.ICUCollationField and then tokenizer and filters on this in addition? Or do I not use solr.ICUCollationField at all because that's for something completely different? And if so, then what?
Collation is the organisation of written information into an order - ICUCollationField (the API documentation also provides a good description) is meant to enable you to provide locale-aware sorting, as the sort order is defined by cultural norms and specific language properties. This is useful to allow different sorting based on those rules, such as the difference between Norwegian and Swedish, where a Swede would order Å before Æ/Ä and Ø/Ö, while a Norwegian would order them Æ/Ä, Ø/Ö and then Å.
Since you usually don't want to sort by a tokenized field (exception: KeywordTokenizer) or a multivalued field, these fields are usually not processed any more than allowing for the sorting / collation to be performed.
There is a case to be made for collation filters for searching as well, as search in practice is just comparison. This means that if you're aiming to search for two words that would be identical when compared in the locale provided, it would be a hit. The tokens indexed will not make any sense when inspected, but as long as the values are reduced to the same token both when indexing and searching, it would work. There's an example of this on the wiki under UnicodeCollation.
Collation does not affect stopwords (StopFilterFactory), accents (ICUFoldingFilterFactory), punctuation, case (depending on locale - if the locale for sorting is case aware, then it does not) (LowerCaseFilterFactory or ICUNormalizer2FilterFactory) or stemming (SnowballPorterFilterFactory). Have a look at the suggested filters for that. Most filters or tokenizers in Solr do very specific tasks, and try to avoid doing "everything and the kitchen sink" in one single filter.
You normally have two or more fields for one text input if you want to do different things like:
search: text analysis
sort: language sensitive / case insensitive sorting
facet: string
For search use something like:
<fieldType name="textFR" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.ElisionFilterFactory"/>
    <filter class="solr.KeywordRepeatFilterFactory"/>
    <filter class="solr.FrenchLightStemFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
For sorting use:
<fieldType name="textSortFR" class="solr.ICUCollationField"
locale="fr"
strength="primary" />
or simply:
<fieldType name="textSort" class="solr.ICUCollationField"
locale=""
strength="primary" />
(if you have to support many languages; this should work well enough in most cases).
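To actually sort on it, you declare a field of the collation type, copy your text field into it and reference it in the sort parameter - a sketch with illustrative field names:
<field name="titleSortFR" type="textSortFR" indexed="true" stored="false"/>
<copyField source="title" dest="titleSortFR"/>
and then query with sort=titleSortFR asc.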
Do make use of the Analysis UI in the Solr Admin: open the analysis view for your index, select the field type (e.g. your sort field), add a representative input value in the left text area and a test value in the right field (in the case of sorting, this right-hand value is less interesting, since the sort field is not used for matching).
The output will show you whether:
accents are removed
elisions are removed
lower casing is applied
etc.
For example, if you see that elisions are not removed (l'atelier is kept instead of atelier) but you would like to discard them for sorting, you would have to add the elision filter (see the example for the search field type above).
https://cwiki.apache.org/confluence/display/solr/Language+Analysis

Solr british and american spelling

Searching for 'globalization' only returns results for 'globalization' but doesn't include any results for 'globalisation', and vice versa.
I'm looking into the solr.HunspellStemFilterFactory filter (available in Solr 3.5):
<filter class="solr.HunspellStemFilterFactory" dictionary="en_GB.dic,en_US.dic" affix="en_GB.aff,en_US.aff" ignoreCase="true" />
Before upgrading from Solr 3.4 to 3.6.1, I was wondering if the Hunspell filter is the way to go?
Thanks
If stemming doesn't solve this for you, you could always use a SynonymFilterFactory to normalize both spellings into one; I guess a dictionary containing US/UK spelling variations wouldn't be hard to come by.
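A minimal sketch of that approach (the file name and its entries are illustrative; expand="true" maps each variant to all of them, so either spelling matches):
<filter class="solr.SynonymFilterFactory" synonyms="spelling_variants.txt" ignoreCase="true" expand="true"/>
with spelling_variants.txt containing lines such as:
globalisation,globalization
colour,color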
