Apache Solr - Default Schema Configuration - solr

I have written below an example default field from the managed-schema.xml file. What I observed is that generally people use classes such as solr.LowerCaseFilterFactory etc., but in the field below, for example, there is a filter called lowercase without a class. So, is this configuration actively working, or is it just a template?
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"/>
<analyzer type="index"/>
<tokenizer class="standard"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
<analyzer type="query">
<tokenizer class="standard"/>
<filter name="synonymGraph" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
</fieldType>

It depends on which version of Solr you're using; later versions are able to look up the class name from the short form (i.e. without the FilterFactory postfix. See the example in the current reference guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="englishPorter"/>
</analyzer>
</fieldType>
Compared to the legacy format shown in the same guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
As you can see there's just a lot of repetition in the class names given, so instead of having the complete class name, Solr resolves it based on the common pattern and the type given instead.

Related

Synonym Graph Filter with synonyms file in Wordnet format not working

I am using Solr 8.3 and trying to pass a synonym file in wordnet format, such as-
s(300880586,1,'augmented',s,1,0).
s(300880765,1,'enhanced',s,1,0).
s(300881030,1,'hyperbolic',s,1,2).
s(300881030,2,'inflated',s,1,1).
In the managed-schema file, I have configured the Synonym Graph Filter as-
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="wn_s.pl" format="wordnet" ignoreCase="true"/>
<filter class="solr.FlattenGraphFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.SynonymGraphFilterFactory" synonyms="wn_s.pl" format="wordnet" expand="true" ignoreCase="true" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
However, it did not work. Perhaps I might have missed some configuration or maybe some issue with format.
So I tried converting the file into Solr format, it works that way somehow.
I wanted to use the wordnet format only, so if anyone can help me understand the mistake I am making here, it would be helpful.

Front and back EdgeNGrams in Solr

I would like to use EdgeNGramFilterFactory to generate Edge NGrams from front and back. For front I am using
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="4"/>
and for back, I am using
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15"/>
<filter class="solr.ReverseStringFilterFactory"/>
But when they are used together in a single analyzer, the second set of filter factories are acting on the output of the first EdgeNGramFilterFactory.
Is it possible to generate both front and back EdgeNGrams in a single analyzer? Or do I have to create separate analyzers and use copyField to create a field with both the front and back EdgeNGrams?
Update
Example schema as requested in comments below
<fieldType name="text_suggest_edge" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
</fieldType>
<fieldType name="text_suggest_edge_end" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
</fieldType>
<field name="item_name_edge" type="text_suggest_edge" indexed="true" stored="false" multiValued="true"/>
<field name="item_name_edge_end" type="text_suggest_edge_end" indexed="true" stored="false" multiValued="true"/>
<copyField source="item_name" dest="item_name_edge"/>
<copyField source="item_name" dest="item_name_edge_end"/>
Update 2: Including sample input and expected output
Input String
Washington
Required Edge Ngrams
Was, Wash, Washi, ... Washington, ashington, shington, hington ... gton, ton
you could do it in a single analyzer chain if you create your customized version of EdgeNGramFilterFactory (in java, and then plug it into your schema.xml) that creates the additional ngrams from the back.
Otherwise, you are going to need the copyField into an additional field with a separate chain.
I honestly thing the first option is too much trouble, but it is possible for sure.

Solr search from beginning of string

I need to boost results which are found from beginning of string. For Example i have to countries: Egypt and Seychelles.
User types "e" in a text field and solr response will be:
Seychelles
Egypt
But as you can see "Egypt" starts with "e". And i need this result to be boosted up:
Egypt
Seychelles
Any other results should be scored as usual. Is there any kind of special tokenizers/serializers? Or may be special characters in SolrQuery syntax?
UPD:
Part of my schema.xml which describes text field type:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Problem solved by using EdgeNGramFilterFactory instead of NGramFilterFactory:
<fieldType name="text_start_end" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.PositionFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
</fieldType>

Solr CharFilterFactory makes analysis tool return empty results

I have used a CharFilterFactory in my schema.xml for fileType text_general, so that queries for cafe and café return the same results. It works correctly. Here's the relevant part of my schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
...however, the analysis tool within the admin user interface of solr seems to suggest that the tokenizers and filters that come after the charfilterfactory aren't doing anything. That's because if I analyse text_general with any field values for index and query, after MCF (MappingCharFilter), the output for ST, SF and LCF are all empty (grrr - a screen dump would be useful here, but I'm not allowed to post one because my 'reputation' isn't high enough apparently). Is that expected behaviour? Could someone please explain that analysis tool output please?

How to ignore accent search in Solr

I am using solr as a search engine. I have a case where a text field contains accent text like "María". When user search with "María", it is giving resut. But when user search with "Maria" it is not giving any result.
My schema definition looks like below:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.
If you're on solr > 3.x you can try using solr.ASCIIFoldingFilterFactory which will change all the accented characters to their unaccented versions from the basic ascii 127-character set.
Remember to put it after any stemming filter you have configured (you're not using one, so you should be ok).
So your config could look like:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>
Answering here because it's the first result that pop when searching "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by #soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line in the text_en fieldType.
Screenshot 1, screenshot 2.

Resources