Solr autocompletion with NGrams and MappingCharFilter


I want to implement autocompletion search with Solr. Users search for names of persons, and the autocompletion is done with NGrams. This works properly: when I search for "Caro" I find "Caroline". What I want to add now is a character mapping, so that the user also finds "Caroline" by entering "Karo"; that is, "k" should be mapped to "c". With the config below I get an empty result when searching for "Karo" or "Karoline" ("Caro" works).
I have created a mapping.txt with following content:
"k" => "c"
Here is my field configuration:
<fieldType name="string_wildcard" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="/home/martin/mapping.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
I hope you can help me. Thanks!

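You are using "k" => "c", which only replaces the lowercase k with c. Char filters run on the raw input before the tokenizer and the token filters, so an uppercase "K" in "Karo" is not mapped; it needs its own "K" => "c" rule in the mapping file.
You also need to add lowercase filters to the filter chain to make matching case-insensitive.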
<fieldType name="string_wildcard" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="/Users/jayendrapatil/solr/trunk/solr/example/solr/conf/mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
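A minimal sketch of how the asker's own mapping could be combined with the lowercasing above. It assumes the mapping file at /home/martin/mapping.txt is extended with a second rule for the uppercase letter, and it assumes (as a design choice, not something stated in the question) that the mapping is applied at index time as well, so that a stored "Karoline" and a stored "Caroline" are both indexed under "c".

```xml
<!-- Sketch only. Assumes /home/martin/mapping.txt holds two rules:
       "k" => "c"
       "K" => "c"
     Char filters see the raw text before LowerCaseFilterFactory runs,
     so the uppercase rule is needed for input such as "Karo". -->
<fieldType name="string_wildcard" class="solr.TextField">
  <analyzer type="index">
    <!-- Assumption: mapping also applied at index time, so "Karoline" is indexed as "caroline" -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="/home/martin/mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.MappingCharFilterFactory" mapping="/home/martin/mapping.txt"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, "Karo" would be rewritten to "caro" at query time and compared against the lowercased edge n-grams of "Caroline". Documents need to be reindexed after changing the index analyzer.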

Related

Front and back EdgeNGrams in Solr

I would like to use EdgeNGramFilterFactory to generate Edge NGrams from front and back. For front I am using
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="4"/>
and for back, I am using
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15"/>
<filter class="solr.ReverseStringFilterFactory"/>
But when they are used together in a single analyzer, the second set of filter factories acts on the output of the first EdgeNGramFilterFactory instead of on the original tokens.
Is it possible to generate both front and back EdgeNGrams in a single analyzer? Or do I have to create separate analyzers and use copyField to create a field with both the front and back EdgeNGrams?
Update
Example schema as requested in comments below
<fieldType name="text_suggest_edge" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
</fieldType>
<fieldType name="text_suggest_edge_end" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
</fieldType>
<field name="item_name_edge" type="text_suggest_edge" indexed="true" stored="false" multiValued="true"/>
<field name="item_name_edge_end" type="text_suggest_edge_end" indexed="true" stored="false" multiValued="true"/>
<copyField source="item_name" dest="item_name_edge"/>
<copyField source="item_name" dest="item_name_edge_end"/>
Update 2: Including sample input and expected output
Input String
Washington
Required Edge Ngrams
Was, Wash, Washi, ... Washington, ashington, shington, hington ... gton, ton

You could do it in a single analyzer chain if you create your own customized version of EdgeNGramFilterFactory (in Java, plugged into your schema.xml) that also creates the additional n-grams from the back.
Otherwise, you are going to need a copyField into an additional field with a separate chain.
I honestly think the first option is too much trouble, but it is certainly possible.
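With the copyField approach, both fields from the question's schema can be searched together. The following is a minimal sketch for solrconfig.xml, assuming the edismax query parser is available; the handler name /name_suggest is hypothetical, while item_name_edge and item_name_edge_end are the fields defined in the question.

```xml
<!-- Sketch only: "/name_suggest" is a hypothetical handler name -->
<requestHandler name="/name_suggest" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Search the front and the back edge n-gram fields in one query -->
    <str name="qf">item_name_edge item_name_edge_end</str>
  </lst>
</requestHandler>
```

A request such as q=ington against this handler would then be matched against both fields without the client having to name them.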

Ignore special characters

I have the following field type in my Solr configuration:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Within the field I could be storing:
Spiderman, Spider-man, Spider man
What I would like is for someone who searches for spiderman to get all 3 options and ideally someone who searches spider-man to get all 3 options. Apart from amending the content when it is indexed is there another way to effectively ignore special characters but not necessarily split on them?

One possible solution, especially if the number of delimiter characters is small, is to remove them with solr.PatternReplaceFilterFactory like this:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
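<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<!-- same replacements as at index time, so that "spider-man" is reduced to "spiderman" in queries too -->
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>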
</fieldType>
If the keyword tokenizer is a bad option because it keeps the whole value as a single token (which could be fine for a field like title), you could either create your own tokenizer that splits the title only on the symbols you need, or add additional filters such as an n-gram filter to allow partial matches on the title field.
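As a sketch of the second suggestion, the variant below adds an edge n-gram filter on the index side so that a prefix of the normalized title can match. The field type name title_partial and the gram sizes are assumptions chosen for illustration; the two replacement patterns are merged into one character class.

```xml
<!-- Sketch only: "title_partial" and the gram sizes are illustrative assumptions -->
<fieldType name="title_partial" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Remove dashes and spaces: "Spider-man" and "Spider man" both become "Spiderman" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[- ]" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index prefixes such as "sp", "spi", ... "spiderman" -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[- ]" replacement="" replace="all"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The n-gram filter is left out of the query analyzer on purpose, so the normalized query is compared as a whole against the indexed prefixes.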

I know this is an old post, but the correct answer here is that you should add "Spiderman, Spider-man, Spider man" to your synonyms.txt file and restart Solr. If this still doesn't work, make sure the analyzer in your schema uses the SynonymGraphFilterFactory filter. What you've described here is synonyms.
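A minimal sketch of the synonym approach, assuming a synonyms.txt in the core's conf directory that contains the single line `spiderman, spider-man, spider man`. The field type name title_syn is hypothetical, and synonyms are applied at query time only. Multi-word entries such as "spider man" typically also depend on the query parser not splitting the input on whitespace (the sow parameter), so this is a starting point to verify on the Analysis screen rather than a finished configuration.

```xml
<!-- Sketch only: "title_syn" is a hypothetical name; synonyms.txt is assumed to contain
     spiderman, spider-man, spider man -->
<fieldType name="title_syn" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Expand each variant to all three spellings at query time -->
    <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  </analyzer>
</fieldType>
```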

Solr full name search: how can I find entries containing a dash with wildcards

I'm using Solr 4.10.3. I tried to configure Solr to ignore dashes in searches:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- ignore special characters .,-\/ -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- includes u-umlaut -> u, lowercasing and UTF-8 decomposition -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- ignore special characters .,-\/ -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- includes u-umlaut -> u, lowercasing and UTF-8 decomposition -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldtype>
I have an entry "pan-pan, peter", which is found if I search for
(peter pa*)
(peter panpa*)
or even
(pe-te-r panpa*)
Also,
(peter pa-n-pa-n)
(without *) matches.
But
(peter pan-p*)
(peter pan\-p*)
give no results.
It seems as if the combination of a dash and * is the problem?
I'd like to find "pan-pan, peter" at every stage of typing "peter pan-pan"...

Try using the below field type.
<fieldType name="text_delimeter" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I ran your text through the analysis tool with this field type, and it appears to work for your case.
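A different way to match at every stage of typing, in line with the edge n-gram setups used elsewhere on this page, is to index prefixes and query without the trailing *. The sketch below reuses the char filter and ICU components from the question; the field type name text_prefix and the gram sizes are assumptions, and it has not been tested against Solr 4.10.3.

```xml
<!-- Sketch only: "text_prefix" and the gram sizes are illustrative assumptions -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Strip . , - \ / before tokenizing: "pan-pan, peter" becomes "panpan peter" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <!-- Index every prefix of each token: "p", "pa", "pan", "panp", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25"/>
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
    <tokenizer class="solr.ICUTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
  </analyzer>
</fieldType>
```

With such a field, a plain query like (peter pan-p) goes through the full query analyzer, so the dash is removed before matching and "panp" is compared against the indexed prefixes of "panpan".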

Solr search from beginning of string

I need to boost results that match from the beginning of the string. For example, I have two countries: Egypt and Seychelles.
The user types "e" in a text field and the Solr response is:
Seychelles
Egypt
But as you can see, "Egypt" starts with "e", and I need this result to be boosted to the top:
Egypt
Seychelles
Any other results should be scored as usual. Are there any special tokenizers or filters for this? Or maybe special characters in the Solr query syntax?
Update:
Part of my schema.xml which describes text field type:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Problem solved by using EdgeNGramFilterFactory instead of NGramFilterFactory:
<fieldType name="text_start_end" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.PositionFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
</fieldType>

How to ignore accent search in Solr

I am using Solr as a search engine. I have a case where a text field contains accented text like "María". When a user searches for "María", it returns a result. But when a user searches for "Maria", it returns no result.
My schema definition looks like below:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.

If you're on Solr > 3.x, you can try solr.ASCIIFoldingFilterFactory, which converts accented characters to their unaccented equivalents from the basic 127-character ASCII set.
Remember to put it after any stemming filter you have configured (you're not using one, so you should be OK).
So your config could look like this:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>

Answering here because it's the first result that pops up when searching for "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by @soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line to the text_en fieldType.
