Solr search from beginning of string - solr

I need to boost results where the match is at the beginning of the string. For example, I have two countries: Egypt and Seychelles.
The user types "e" in a text field and the Solr response is:
Seychelles
Egypt
But as you can see, "Egypt" starts with "e", so I need this result to be boosted:
Egypt
Seychelles
Any other results should be scored as usual. Are there any special tokenizers or serializers for this? Or maybe special characters in the Solr query syntax?
Update:
This is the part of my schema.xml that describes the text field type:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Problem solved by using EdgeNGramFilterFactory instead of NGramFilterFactory:
<fieldType name="text_start_end" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.PositionFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
</fieldType>
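To see why switching filters changes the result, here is a minimal Python sketch of the two gram strategies. The helper names `ngrams` and `edge_ngrams` are made up for illustration, and they only approximate what the Solr filters emit for a single lowercased token.

```python
def ngrams(token, min_size=1, max_size=20):
    """All substrings of length min_size..max_size (roughly what an n-gram filter emits)."""
    return {
        token[start:start + size]
        for size in range(min_size, max_size + 1)
        for start in range(len(token) - size + 1)
    }


def edge_ngrams(token, min_size=1, max_size=20):
    """Only prefixes of length min_size..max_size (front edge n-grams)."""
    return {token[:size] for size in range(min_size, min(max_size, len(token)) + 1)}


countries = ["egypt", "seychelles"]
query = "e"

# With plain n-grams both countries contain the gram "e", so both match.
ngram_hits = [c for c in countries if query in ngrams(c)]

# With edge n-grams only tokens that start with "e" produce that gram.
edge_hits = [c for c in countries if query in edge_ngrams(c)]

print(ngram_hits)
print(edge_hits)
```

In this model "seychelles" stops matching "e" altogether with edge n-grams, rather than being ranked lower.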

Related

Apache Solr - Default Schema Configuration

Below is an example default field type from the managed-schema.xml file. I have noticed that people generally use full class names such as solr.LowerCaseFilterFactory, but in the field type below there is, for example, a filter called lowercase without a class. So, is this configuration actually working, or is it just a template?
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="standard"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
<analyzer type="query">
<tokenizer class="standard"/>
<filter name="synonymGraph" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
</fieldType>
It depends on which version of Solr you're using; later versions can look up the class from its short name (i.e. without the FilterFactory suffix). See the example in the current reference guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="englishPorter"/>
</analyzer>
</fieldType>
Compared to the legacy format shown in the same guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
As you can see, the full class names are very repetitive, so instead of requiring the complete class name, Solr resolves it from the short name and the element type (tokenizer, filter, and so on).
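The relationship between the two formats can be pictured as a lookup table. The Python sketch below is a toy model only: the `REGISTRY` dictionary and the `resolve` helper are hypothetical, the case-insensitive matching is a simplification of mine, and none of this is meant to describe how Solr is implemented internally. The name/class pairs are the ones from the two snippets above.

```python
# Hypothetical registry: short name -> legacy class name, per element type.
REGISTRY = {
    "tokenizer": {"standard": "solr.StandardTokenizerFactory"},
    "filter": {
        "lowercase": "solr.LowerCaseFilterFactory",
        "englishporter": "solr.EnglishPorterFilterFactory",
    },
}


def resolve(component_type, name):
    """Look up the legacy class name for a short name, ignoring case."""
    try:
        return REGISTRY[component_type][name.lower()]
    except KeyError:
        raise ValueError(f"unknown {component_type} name: {name!r}") from None


print(resolve("tokenizer", "standard"))
print(resolve("filter", "englishPorter"))
```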

Front and back EdgeNGrams in Solr

I would like to use EdgeNGramFilterFactory to generate Edge NGrams from front and back. For front I am using
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="4"/>
and for back, I am using
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15"/>
<filter class="solr.ReverseStringFilterFactory"/>
But when they are used together in a single analyzer, the second set of filter factories acts on the output of the first EdgeNGramFilterFactory.
Is it possible to generate both front and back EdgeNGrams in a single analyzer? Or do I have to create separate analyzers and use copyField to create a field with both the front and back EdgeNGrams?
Update
Example schema as requested in comments below
<fieldType name="text_suggest_edge" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
</fieldType>
<fieldType name="text_suggest_edge_end" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
</fieldType>
<field name="item_name_edge" type="text_suggest_edge" indexed="true" stored="false" multiValued="true"/>
<field name="item_name_edge_end" type="text_suggest_edge_end" indexed="true" stored="false" multiValued="true"/>
<copyField source="item_name" dest="item_name_edge"/>
<copyField source="item_name" dest="item_name_edge_end"/>
Update 2: Including sample input and expected output
Input String
Washington
Required Edge Ngrams
Was, Wash, Washi, ... Washington, ashington, shington, hington ... gton, ton
You could do it in a single analyzer chain if you write your own customized version of EdgeNGramFilterFactory (in Java, and then plug it into your schema.xml) that also creates the additional n-grams from the back.
Otherwise, you will need a copyField into an additional field with a separate chain.
I honestly think the first option is too much trouble, but it is certainly possible.
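As a rough illustration of what the two separate chains produce for the sample input, here is a small Python sketch. The function names are invented for this example; the back chain mimics the reverse, edge n-gram, reverse sequence from the question, using the gram sizes 3 to 12 from the example schema.

```python
def edge_ngrams(token, min_size, max_size):
    """Prefixes of token with lengths min_size..max_size."""
    return [token[:n] for n in range(min_size, min(max_size, len(token)) + 1)]


def back_edge_ngrams(token, min_size, max_size):
    """Suffixes of token: reverse, take prefixes, reverse each gram back."""
    return [gram[::-1] for gram in edge_ngrams(token[::-1], min_size, max_size)]


token = "washington"  # already lowercased, as the lowercase filter would do
front = edge_ngrams(token, 3, 12)
back = back_edge_ngrams(token, 3, 12)

print(front)
print(back)
```

Each list corresponds to one field in the copyField approach; a query can then search both fields.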

Solr full name search: how can I find entries containing a dash with wildcards

I'm using Solr 4.10.3. I tried to configure Solr to ignore dashes in searches:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- ignore the special characters .,-\/ -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- includes u-umlaut -> u, lowercasing, and UTF-8 decomposition -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- ignore the special characters .,-\/ -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- includes u-umlaut -> u, lowercasing, and UTF-8 decomposition -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldtype>
I have an entry "pan-pan, peter", which is found if I search for
(peter pa*)
(peter panpa*)
or even
(pe-te-r panpa*)
The query
(peter pa-n-pa-n)
(without *) also matches.
But
(peter pan-p*)
(peter pan\-p*)
give no results.
It seems that the combination of a dash and * is the problem.
I'd like to find "pan-pan, peter" at every stage of typing "peter pan-pan".
Try using the below field type.
<fieldType name="text_delimeter" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I ran your text through the analysis tool with this field type, and it should work for you.
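To make the idea behind this field type more concrete, here is a very rough Python model of the index side. It assumes that a trailing-wildcard query is compared as a prefix against the indexed terms, and it only imitates three behaviours of the word delimiter filter (keep the original token, emit the word parts, emit the joined parts). All function names are hypothetical and most real options are ignored.

```python
import re


def word_delimiter_tokens(token):
    """Rough model: keep the original token, its alphanumeric parts, and the parts joined."""
    parts = re.findall(r"[A-Za-z0-9]+", token)
    tokens = [token] + parts
    if len(parts) > 1:
        tokens.append("".join(parts))
    # Lowercase and de-duplicate while keeping order.
    return list(dict.fromkeys(t.lower() for t in tokens))


def index_tokens(text):
    """Whitespace tokenization followed by the word-delimiter model."""
    return [t for raw in text.split() for t in word_delimiter_tokens(raw)]


def prefix_matches(prefix, tokens):
    """Model a trailing-wildcard query as a prefix test on the indexed terms."""
    return any(t.startswith(prefix.lower()) for t in tokens)


tokens = index_tokens("pan-pan, peter")
print(tokens)
for prefix in ["pa", "pan-p", "panpa"]:
    print(prefix, prefix_matches(prefix, tokens))
```

Because the original token with its dash is kept in the index next to the joined form, both the dashed and the undashed prefix find a term in this model.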

How to ignore accent search in Solr

I am using Solr as a search engine. I have a case where a text field contains accented text like "María". When the user searches for "María", it returns a result. But when the user searches for "Maria", it returns nothing.
My schema definition looks like this:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.
If you're on Solr > 3.x, you can try solr.ASCIIFoldingFilterFactory, which converts accented characters to their unaccented equivalents from the basic ASCII set (the first 127 characters), where such an equivalent exists.
Remember to put it after any stemming filter you have configured (you're not using one, so you should be OK).
So your config could look like this:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>
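The effect of folding can be approximated in a few lines of Python. This is only a sketch: it strips combining marks after Unicode decomposition, which covers cases like "í" but is not identical to what ASCIIFoldingFilterFactory does for every character. The function names are made up for the example.

```python
import unicodedata


def fold_accents(text):
    """Decompose characters (NFKD) and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def analyze(text):
    """Whitespace tokenize, lowercase, then fold accents, mirroring the filter order above."""
    return [fold_accents(tok.lower()) for tok in text.split()]


print(analyze("María"))
print(analyze("Maria"))
```

Since both the indexed text and the query go through the same folding step, the accented and unaccented spellings end up as the same term.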
Answering here because it's the first result that pops up when searching for "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by @soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line to the text_en fieldType.

Solr autocompletion with NGrams and MappingCharFilter

I want to implement an autocompletion search with Solr. The user searches for names of persons. The autocompletion is done with NGrams. This works properly, so when I search for "Caro" I find "Caroline". What I want to add now is a character mapping. The user should find "Caroline" by entering "Karo" in the search, so "k" should be mapped to "c". When I search with the config below, I get an empty result for "Karo" or "Karoline" ("Caro" works).
I have created a mapping.txt with following content:
"k" => "c"
Here is my field configuration:
<fieldType name="string_wildcard" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="/home/martin/mapping.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>
I hope you can help me. Thanks!
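Your mapping "k" => "c" only replaces the lowercase k with c, so a query such as "Karo" is not mapped. Char filters are applied to the raw text before the tokenizer and the token filters run, so the mapping file also needs a rule for the uppercase K.
You also need to add lowercase filters to both analyzer chains to make the match case insensitive: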
<fieldType name="string_wildcard" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" side="front"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="/Users/jayendrapatil/solr/trunk/solr/example/solr/conf/mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
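The order of operations matters here: char filters see the raw query text before the tokenizer and the token filters run, so lowercasing happens after the mapping. The Python sketch below models that order with hypothetical helper names and two hypothetical mapping tables; it is not Solr code.

```python
def apply_char_mapping(text, mapping):
    """Character-level replacement on the raw input, before any tokenizing."""
    return "".join(mapping.get(ch, ch) for ch in text)


def analyze_query(text, mapping):
    """Char mapping first, then whitespace tokenization, then lowercasing."""
    mapped = apply_char_mapping(text, mapping)
    return [tok.lower() for tok in mapped.split()]


lowercase_only = {"k": "c"}
both_cases = {"k": "c", "K": "C"}

print(analyze_query("Karo", lowercase_only))  # the capital K is not mapped
print(analyze_query("Karo", both_cases))
```

With a rule for both cases, the query term lines up with the lowercased edge n-grams of "Caroline" in this model.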

Resources