SOLR Tokenizer "solr.SimplePatternSplitTokenizerFactory" splits at unexpected characters - solr

I'm having unexpected results with the solr.SimplePatternSplitTokenizerFactory. The pattern used is actually from an example in the SOLR documentation and I do not understand where I made a mistake or why it does not work as expected.
If we take the example input "operative", the analyzer shows that during indexing the input gets split into the tokens "ope", "a" and "ive"; that is, the tokenizer splits at the characters "r" and "t" and not at the expected whitespace characters (CR, TAB). Just to be sure, I also tried using more than one backslash in the pattern (e.g. \t and \\t), but this did not change how the input is tokenized during indexing.
What am I missing?
SOLR version used is 7.5.0.
The definition of the field type in the schema is as follows:
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Update: I found this post on the "Solr - User" mailing list archive:
http://lucene.472066.n3.nabble.com/Solr-Reference-Guide-issue-for-simplified-tokenizers-td4385540.html
It seems the documentation (or at least its example) is not correct. The following usage of the tokenizer works as intended:
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ &#x9;&#xA;&#xD;]+"/>
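A possible explanation for the symptom, offered as an assumption rather than something the mailing list post is quoted as saying: the simplified pattern syntax appears to treat a backslash as "take the next character literally", so \t, \r and \n end up meaning the letters t, r and n. The sketch below imitates that in plain Python (this is not Solr code, and `lucene_like_split` is a hypothetical helper name) and reproduces the tokens seen in the analyzer:

```python
import re


def lucene_like_split(text, literal_chars):
    """Split text on runs of the given characters, dropping empty tokens."""
    pattern = "[" + re.escape(literal_chars) + "]+"
    return [tok for tok in re.split(pattern, text) if tok]


# What was intended: space, TAB, CR and LF as real control characters.
intended = lucene_like_split("operative\tword", " \t\r\n")

# What apparently happened: the backslashes were dropped, leaving t, r and n.
observed = lucene_like_split("operative", " trn")

print(intended)
print(observed)
```

If that reading is right, writing the characters as XML character references hands the tokenizer the real control characters, with no backslash involved.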

Related

Solr Queries With Dashes

I am currently using Solr edismax to do searches on our website. What I'm looking to do is essentially have dashes ignored.
So if I search for the words "wi-fi adapter" and I have a document with the title "wifi adapter", I get no results.
I am currently using solr.MappingCharFilterFactory to map dashes to spaces. This is what my text_general fieldtype looks like in my schema.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
</fieldType>
My mapping.txt contains the line..
"-" => " "
This rule converts each dash to a space.
So if I search "wi-fi adapter", it will always show the same results as "wi fi adapter", but won't show results for "wifi adapter".
Is there any way to treat dashes like this? Essentially I'd want to treat "wifi adapter", "wi-fi adapter", and "wi fi adapter" the same.
You can use the WordDelimiterGraphFilterFactory in your analyzer. It has many attributes; I have listed a few of them here.
generateWordParts : (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" → "Camel", "Case", "hot", "spot"
preserveOriginal : (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" → "Zap-Master-9000", "Zap", "Master", "9000"
catenateWords : (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" → "hotspotsensor"
So in your case it would be like
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- Splits words based on whitespace characters -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- splits words at delimiters based on different arguments -->
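<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
<!-- graph filters used at index time should be followed by a flatten step, because the indexer cannot consume a token graph -->
<filter class="solr.FlattenGraphFilterFactory"/>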
<!-- Transforms text to lower case -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
More information can be found at Filters available in Solr
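To make the attribute descriptions above more concrete, here is a very rough model in plain Python of what the index chain might emit for a hyphenated word with generateWordParts left at its default, preserveOriginal="1" and catenateWords="1". It is a sketch under simplifying assumptions, not the filter's real algorithm: case-change splitting and the separate number handling are ignored, and `index_tokens` and `query_tokens` are hypothetical helper names.

```python
import re


def index_tokens(text):
    """Whitespace split, then original + word parts + catenated form, lowercased."""
    tokens = []
    for word in text.split():
        parts = [p for p in re.split(r"[^A-Za-z0-9]+", word) if p]
        candidates = [word] + parts              # preserved original, then parts
        if len(parts) > 1:
            candidates.append("".join(parts))    # catenated run of parts
        for tok in candidates:
            tok = tok.lower()
            if tok not in tokens:
                tokens.append(tok)
    return tokens


def query_tokens(text):
    """Query chain from the answer: whitespace split and lowercase only."""
    return [w.lower() for w in text.split()]


indexed = index_tokens("Wi-Fi adapter")
print(indexed)
for q in ("wifi adapter", "wi-fi adapter", "wi fi adapter"):
    print(q, all(t in indexed for t in query_tokens(q)))
```

Under this model a document containing "Wi-Fi adapter" is reachable by all three spellings. A document that only ever contains "wifi" indexes just that one token, so the reverse direction would presumably need similar handling on the query side.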

Ignore special characters

I have the following field type in my Solr configuration:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Within the field I could be storing:
Spiderman, Spider-man, Spider man
What I would like is for someone who searches for spiderman to get all 3 options and ideally someone who searches spider-man to get all 3 options. Apart from amending the content when it is indexed is there another way to effectively ignore special characters but not necessarily split on them?
One possible solution, especially if the number of delimiter characters is small, is to remove them with solr.PatternReplaceFilterFactory like this:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If the keyword tokenizer is a bad option because it keeps the whole value as a single token (which could be okay for a field like title), you could either create your own tokenizer that splits the title only on the symbols you need, or add further filters such as an n-gram filter to allow partial matches on the title field.
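As a small illustration of the idea, the index chain above can be modelled in plain Python as "keep the whole title as one token, drop hyphens and spaces, lowercase". This is only a sketch, not Solr code; `normalize_title` and `query_as_posted` are hypothetical helper names.

```python
def normalize_title(text):
    """Model of KeywordTokenizer + the two PatternReplace filters + LowerCase."""
    return text.replace("-", "").replace(" ", "").lower()


def query_as_posted(text):
    """Model of the posted query chain: KeywordTokenizer + LowerCase only."""
    return text.lower()


variants = ["Spiderman", "Spider-man", "Spider man"]
indexed = {v: normalize_title(v) for v in variants}
print(indexed)
print(query_as_posted("Spider-man"))
```

All three stored variants collapse to the same indexed token. With the query chain as posted, a search for "Spider-man" keeps its hyphen, so it is an assumption on my part that the same two replacements would also be wanted on the query side to cover that case.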
I know this is an old post, but the correct answer here is that you should add "Spiderman, Spider-man, Spider man" to your synonyms.txt file and restart Solr. If this still doesn't work, make sure your schema uses the SynonymGraphFilterFactory filter. What you've described here is synonyms.

How can I have a one-way synonym in Solr?

I am trying to implement a one-way synonym, or one-way thesaurus (as in Endeca), in Solr. When I search for camcorder I want to get results for camera as well, but not vice versa. I tried adding the following to Synonyms.txt, but it does not seem to work and gives weird results:
camcorder => camera
And my schema.xml is:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ClassicFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
</analyzer>
</fieldType>
I think what you were looking for is:
camcorder => camera, camcorder
If you don't include camcorder on the right side, camcorder won't return any results if you search for "camcorder".
Since you're only expanding synonyms when you're indexing (where you have the SynonymFilter defined), camcorder will be changed to camera for each document on the way in. When you don't have the same expansion taking place when querying, Solr will still search for camcorder (as there is no SynonymFilter defined for the query analysis chain). There is no camcorder token in the index, so there will be no hit.
You'll have to expand synonyms when querying as well as when indexing to achieve what you want with one-way synonyms.
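The effect described above can be sketched with a toy model in plain Python. It is not Solr code: `parse_rule`, `expand` and `search` are hypothetical helpers, and matching is reduced to "any shared term".

```python
def parse_rule(line):
    """Parse an explicit mapping such as 'camcorder => camera, camcorder'."""
    left, right = line.split("=>")
    targets = [t.strip() for t in right.split(",")]
    return {src.strip(): targets for src in left.split(",")}


def expand(words, rules):
    """Replace each word by its mapped terms, or keep it if no rule applies."""
    terms = set()
    for word in words:
        terms.update(rules.get(word, [word]))
    return terms


def search(query, doc, index_rules=None, query_rules=None):
    """True if the analyzed query shares at least one term with the document."""
    doc_terms = expand(doc.lower().split(), index_rules or {})
    query_terms = expand(query.lower().split(), query_rules or {})
    return bool(doc_terms & query_terms)


replace_only = parse_rule("camcorder => camera")
keep_original = parse_rule("camcorder => camera, camcorder")

# Index-time only, as in the schema from the question.
print(search("camcorder", "Sony camcorder", index_rules=replace_only))
print(search("camcorder", "Sony camcorder", index_rules=keep_original))

# Same rule applied at query time instead.
print(search("camcorder", "Canon camera", query_rules=keep_original))
print(search("camera", "Sony camcorder", query_rules=keep_original))
```

In this model the replace-only rule at index time makes "camcorder" unfindable, keeping the original term on the right side fixes that, and applying the rule on the query side gives the one-way behaviour asked for (camcorder finds camera documents, camera does not find camcorder documents).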

Solr CharFilterFactory makes analysis tool return empty results

I have used a CharFilterFactory in my schema.xml for the fieldType text_general, so that queries for cafe and café return the same results. It works correctly. Here's the relevant part of my schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
...however, the analysis tool in the Solr admin UI seems to suggest that the tokenizer and filters that come after the char filter aren't doing anything: if I analyse text_general with any field value for index and query, the output for ST, SF and LCF after MCF (MappingCharFilter) is empty (grrr - a screen dump would be useful here, but I'm not allowed to post one because my 'reputation' isn't high enough apparently). Is that expected behaviour? Could someone please explain that analysis tool output?

How to ignore accent search in Solr

I am using Solr as a search engine. I have a case where a text field contains accented text like "María". When a user searches for "María", it returns a result, but when a user searches for "Maria", it returns nothing.
My schema definition looks like below:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.
If you're on a Solr version newer than 3.x, you can try solr.ASCIIFoldingFilterFactory, which converts accented characters to their unaccented equivalents from the basic ASCII set (the first 127 characters).
Remember to put it after any stemming filter you have configured (you're not using one, so you should be ok).
So your config could look like:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>
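As a rough illustration of why folding on both sides helps, the chain above can be approximated in plain Python. The assumptions are loud ones: Unicode decomposition stands in for ASCIIFoldingFilterFactory, which covers more characters than combining accents, and `fold_accents`, `edge_ngrams`, `index_terms` and `query_terms` are hypothetical helper names.

```python
import unicodedata


def fold_accents(text):
    """Approximate accent folding: decompose, then drop combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def edge_ngrams(token, min_size=1, max_size=32):
    """Front edge n-grams, mirroring minGramSize/maxGramSize above."""
    return [token[:n] for n in range(min_size, min(len(token), max_size) + 1)]


def index_terms(text):
    """Whitespace split, lowercase, fold accents, then edge n-grams."""
    terms = set()
    for word in text.split():
        terms.update(edge_ngrams(fold_accents(word.lower())))
    return terms


def query_terms(text):
    """Whitespace split, lowercase, fold accents."""
    return [fold_accents(word.lower()) for word in text.split()]


indexed = index_terms("María")
for q in ("María", "Maria", "mar"):
    print(q, all(t in indexed for t in query_terms(q)))
```

Because both the stored value and the query are folded to "maria", the accented and unaccented spellings land on the same term, and the edge n-grams still allow prefix matches.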
Answering here because this is the first result that pops up when searching for "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by @soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line to the text_en fieldType.

Resources