Front and back EdgeNGrams in Solr - solr

I would like to use EdgeNGramFilterFactory to generate Edge NGrams from front and back. For front I am using
<filter class="solr.EdgeNGramFilterFactory" maxGramSize="20" minGramSize="4"/>
and for back, I am using
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="4" maxGramSize="15"/>
<filter class="solr.ReverseStringFilterFactory"/>
But when they are used together in a single analyzer, the second set of filter factories are acting on the output of the first EdgeNGramFilterFactory.
Is it possible to generate both front and back EdgeNGrams in a single analyzer? Or do I have to create separate analyzers and use copyField to create a field with both the front and back EdgeNGrams?
Update
Example schema as requested in comments below
<fieldType name="text_suggest_edge" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
</analyzer>
</fieldType>
<fieldType name="text_suggest_edge_end" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ReverseStringFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="12"/>
<filter class="solr.ReverseStringFilterFactory"/>
</analyzer>
</fieldType>
<field name="item_name_edge" type="text_suggest_edge" indexed="true" stored="false" multiValued="true"/>
<field name="item_name_edge_end" type="text_suggest_edge_end" indexed="true" stored="false" multiValued="true"/>
<copyField source="item_name" dest="item_name_edge"/>
<copyField source="item_name" dest="item_name_edge_end"/>
Update 2: Including sample input and expected output
Input String
Washington
Required Edge Ngrams
Was, Wash, Washi, ... Washington, ashington, shington, hington ... gton, ton

you could do it in a single analyzer chain if you create your customized version of EdgeNGramFilterFactory (in java, and then plug it into your schema.xml) that creates the additional ngrams from the back.
Otherwise, you are going to need the copyField into an additional field with a separate chain.
I honestly thing the first option is too much trouble, but it is possible for sure.

Related

Apache Solr - Default Schema Configuration

I have written below an example default field from the managed-schema.xml file. What I observed is that generally people use classes such as solr.LowerCaseFilterFactory etc., but in the field below, for example, there is a filter called lowercase without a class. So, is this configuration actively working, or is it just a template?
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"/>
<analyzer type="index"/>
<tokenizer class="standard"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
<analyzer type="query">
<tokenizer class="standard"/>
<filter name="synonymGraph" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter name="stop" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter name="lowercase"/>
<filter name="englishPossessive"/>
<filter protected="protwords.txt" name="keywordMarker"/>
<filter name="porterStem"/>
</analyzer>
</fieldType>
It depends on which version of Solr you're using; later versions are able to look up the class name from the short form (i.e. without the FilterFactory postfix. See the example in the current reference guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer name="standard"/>
<filter name="lowercase"/>
<filter name="englishPorter"/>
</analyzer>
</fieldType>
Compared to the legacy format shown in the same guide:
<fieldType name="text" class="solr.TextField">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory"/>
</analyzer>
</fieldType>
As you can see there's just a lot of repetition in the class names given, so instead of having the complete class name, Solr resolves it based on the common pattern and the type given instead.

I am facing issue in search word end with "/" forward slash "Non-Conformity" in solr search engine

I am facing issue in search word "Non-Conformity" in solr search engine that word end with "/" forward slash.
This is the url I used to search
http://localhost:8983/solr/sms/select?q="Non-Conformity"
key word - "Non-Conformity" or "Non-Conformity/" (not working)
Key word - "Non-Conformity/Deficiency" (working)
key word available in document - "Non-Conformity/Deficiency (NCD) Report for Class audits/surveys"
Apply the below field type to your field.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Your field would be like
<field name="name" type="text_general" indexed="true" stored="true" multiValued="false"/>
</analyzer>
If you analyse the same in admin page you will find the text is matching.

Solr search from beginning of string

I need to boost results which are found from beginning of string. For Example i have to countries: Egypt and Seychelles.
User types "e" in a text field and solr response will be:
Seychelles
Egypt
But as you can see "Egypt" starts with "e". And i need this result to be boosted up:
Egypt
Seychelles
Any other results should be scored as usual. Is there any kind of special tokenizers/serializers? Or may be special characters in SolrQuery syntax?
UPD:
Part of my schema.xml which describes text field type:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Problem solved by using EdgeNGramFilterFactory instead of NGramFilterFactory:
<fieldType name="text_start_end" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.PositionFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
</fieldType>

SOLR How to highlight only multi term matches in a multi valued field

I have a multivalued cathegory field and these cathegories can have multible terms,
e.g. cathegory={"foo","foo-123", "foo-456"}
when I search for "foo-45" only the last cathegory "foo-456" should be highlightet, but instead the string "foo" will be highlightet in all three cathegories.
Which Highlighter can I use and how can I configer it to highlight only matches with all query terms matching?
This is my definition of the field and its type:
<field name="cathegory_field" type="cathegory_field_type" indexed="true" stored="true" multiValued="true" />
<fieldType class="solr.TextField" name="cathegory_field_type">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnCaseChange="0" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I just tried it in my machine, WordDelimiterFilterFactory creates two tokens (foo and 456).
In my opinion, one thing you could do is removing WordDelimiterFilterFactory from the query analysis.
Another solution is to use another field for highlighting; where you do not use WordDelimiterFilterFactory. Below is a simple definition:
<fieldType class="solr.TextField" name="text_cat_hl">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" splitOnCaseChange="0" catenateWords="1" catenateNumbers="1" catenateAll="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Then create a new field:
<field name="cat_hl" type="text_cat_hl" indexed="true" stored="true" multiValued="true" />
You need to copy the content of category_field to it:
<copyField source="cathegory_field" dest="cat_hl"/>
Lastly, you issue a query like this:
http://127.0.0.1:8983/solr/collection1?select?q=cathegory_field:foo-456&hl.q=cathegory_hl:foo-456

How to ignore accent search in Solr

I am using solr as a search engine. I have a case where a text field contains accent text like "María". When user search with "María", it is giving resut. But when user search with "Maria" it is not giving any result.
My schema definition looks like below:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.
If you're on solr > 3.x you can try using solr.ASCIIFoldingFilterFactory which will change all the accented characters to their unaccented versions from the basic ascii 127-character set.
Remember to put it after any stemming filter you have configured (you're not using one, so you should be ok).
So your config could look like:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>
Answering here because it's the first result that pop when searching "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by #soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line in the text_en fieldType.
Screenshot 1, screenshot 2.

Resources