Some characters break phrase search in text field - Solr

I have a text field which contains titles of TV series or movies. In several cases I want to perform a phrase query on what I'd consider a pretty normal text field. This works fine for most phrase terms, but in some reproducible cases it simply returns nothing. It seems to be related to certain "special" characters, though not all special characters I'd expect to be affected actually are.
Title:("Mission: Impossible") works
Title:("Disney A.N.T.") doesn't work
Title:("Stephen King's Shining") doesn't work
Title:("Irgendwie L. A.") works
After trying several other titles I'd assume that it is somehow related to the dot . and the apostrophe ', and maybe other characters I don't know about yet. I have no idea where to look now.
The relevant part of schema.xml:
<fieldType name="title" class="solr.TextField" sortMissingLast="true"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
generateWordParts="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="0" catenateAll="0" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Your question is about phrase queries on a field whose "index" analyzer contains a solr.WordDelimiterFilterFactory while the "query" analyzer does not.
As MatsLindh said, the first step is to open the analysis screen.
In this case the position value is important.
With your attributes, solr.WordDelimiterFilterFactory converts the token "King's" into "king's", "king", "kings" and "s", and that last "s" ends up at the second position.
So if you search for the phrase "Stephen King's Shining" without solr.WordDelimiterFilterFactory, the token "Shining" is at position three, but if you index with solr.WordDelimiterFilterFactory, the token "Shining" is at position four. As a result, only "Stephen King's Shining"~2 (with slop) will match, but not "Stephen King's Shining".
This does not explain your problem with "Disney A.N.T.". But be aware that solr.StandardTokenizerFactory would remove the last dot, and solr.WhitespaceTokenizerFactory does not.
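One option, as mentioned above, is to add slop to the phrase, e.g. Title:("Stephen King's Shining"~2). If you would rather keep exact phrases working, a possible alternative (only a sketch, not verified against your data) is to make the query analyzer mirror the index analyzer so both sides produce the same positions:
<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <filter class="solr.ICUFoldingFilterFactory"/>
  <!-- same settings as the index analyzer, so phrase positions line up -->
  <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
          splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
          generateWordParts="1" generateNumberParts="0"
          catenateWords="1" catenateNumbers="0" catenateAll="0" />
  <filter class="solr.TrimFilterFactory" />
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Be aware that WordDelimiterFilterFactory at query time can itself cause surprising phrase behaviour, so check both chains in the Analysis screen after any change.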

Related

SOLR Tokenizer "solr.SimplePatternSplitTokenizerFactory" splits at unexpected characters

I'm having unexpected results with the solr.SimplePatternSplitTokenizerFactory. The pattern used is actually from an example in the SOLR documentation and I do not understand where I made a mistake or why it does not work as expected.
If we take the example input "operative", the analyzer shows that during indexing the input gets split into the tokens "ope", "a" and "ive"; that is, the tokenizer splits at the characters "r" and "t", and not at the expected whitespace characters (CR, TAB). Just to be sure I also tried more than one backslash in the pattern (e.g. \t and \\t), but this did not change how the input is tokenized during indexing.
What am I missing?
SOLR version used is 7.5.0.
The definition of the field type in the schema is as follows:
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Update: found this post on the "Solr - User" mailing list archive:
http://lucene.472066.n3.nabble.com/Solr-Reference-Guide-issue-for-simplified-tokenizers-td4385540.html
Seems the documentation (or the example) is not correct/working. The following usage of the tokenizer is working as intended:
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[
]+"/>

Solr 4.7 using 'solr.EdgeNGramFilterFactory' highlighting issue

Can someone help me with a highlighting issue I'm having? When I search for 'cars' it highlights 'car' and 'cars' (expected behavior), but also all the words that start with 'car', for example 'cards', 'carriers' etc.
The user requirement is that we don't want to highlight everything that merely starts with 'car'. Here is my schema.xml:
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[({.,\[\]})]" replacement=" "/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" preserveOriginal="1" catenateAll="1" />
<filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" />
The problem is that when you're indexing "cards" with an EdgeNGram filter (minGramSize=3), you get the tokens "car", "card" and "cards". When you're then searching for "cars" and you have the same EdgeNGram filter on the field, you'll search for any document matching any of the tokens "car" and "cars", so "cards", "carriers" etc. match (and get highlighted) through their shared "car" prefix.
The solution is to either drop the EdgeNGram filter when indexing (so that you don't get a hit for "car" alone), or use a different field for highlighting (with hl.fl) that only has standard tokenization / whitespace tokenization applied, together with possibly a stemmer (I'd go with solr.EnglishMinimalStemFilterFactory to only remove plural indicators).
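A sketch of that second option (the field and type names below are made up for illustration, and assume your searchable content lives in a field called "content"): keep the existing ngram field for matching, but copy the text into a plain field and point the highlighter at it.
<fieldType name="text_hl" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EnglishMinimalStemFilterFactory"/>
  </analyzer>
</fieldType>
<field name="content_hl" type="text_hl" indexed="true" stored="true"/>
<copyField source="content" dest="content_hl"/>
Then request highlighting with hl=true&hl.fl=content_hl while still querying the ngram field.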

Ignore special characters

I have the following field within my Solr configuration:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Within the field I could be storing:
Spiderman, Spider-man, Spider man
What I would like is for someone who searches for spiderman to get all 3 options, and ideally for someone who searches spider-man to get all 3 options as well. Apart from amending the content when it is indexed, is there another way to effectively ignore special characters but not necessarily split on them?
One of the possible solutions, especially if the number of delimiter characters is small, is to replace them via solr.PatternReplaceFilterFactory like this:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If the keyword tokenizer is a bad option, since it keeps the whole input as a single token (which could be okay for a field like a title), you could either create your own tokenizer which splits the title only on the symbols you need, or add additional filters like ngram to allow partial matches on the title field.
I know this is an old post, but the correct answer here is that you should add "Spiderman, Spider-man, Spider man" to your synonyms.txt file and restart Solr. If this still doesn't work, make sure your schema uses the SynonymGraphFilterFactory filter. What you've described here is synonyms.
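A sketch of the synonym approach (the analyzer shown is illustrative, not taken from the question): list the variants on one line in synonyms.txt and expand them at query time.
synonyms.txt:
Spiderman, Spider-man, Spider man

<analyzer type="query">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
Keep SynonymGraphFilterFactory on the query side only (or add FlattenGraphFilterFactory after it at index time), since its graph output is not meant to be indexed directly.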

Solr 5.1: Problems with search queries containing underscores

I've indexed an internal website using Solr 5.1 and the new managed schema. I've indexed the page title, url, and body using "text_en" and "text_en_splitting". I get pretty much the behavior I want except when the query string contains underscores.
My use case is the following: Suppose we have 3 terms, "first", "second" and "third", and that "second" does not exist in the index but "first" and "third" do. When the search term is "first second third", I get the behavior I want (i.e. pages with "first" and "third" are returned).
However, when the search term is "first_second_third", I get 0 results, but I would expect to get something since "first" and "third" exist in the index.
I'm using edismax search with qf=url_txt_en title_txt_en title_txt_en_split text_txt_en_split
Can someone suggest a way to tweak my config to get what I want?
Are you using the definition for text_en_splitting that comes with the Solr examples?
If so, the issue is that this type uses WhitespaceTokenizerFactory, which creates tokens by splitting on whitespace only. It does not split on underscores, so "first_second_third" stays a single token.
Instead, it sounds like you need to tokenize on both whitespace and underscores. So try replacing that with PatternTokenizerFactory, like so:
<tokenizer class="solr.PatternTokenizerFactory" pattern="_\s*" />
Don't forget to change this in both the index and query analyzer blocks.
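A minimal sketch of what that could look like (the type name is made up; keep whatever other filters you need from text_en_splitting):
<fieldType name="text_underscore" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s_]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.PatternTokenizerFactory" pattern="[\s_]+"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>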
Try the field type below, which uses WordDelimiterFilterFactory. It splits words into subwords and performs optional transformations on subword groups.
By default, words are split into subwords with the following rules:
1. split on intra-word delimiters (all non-alphanumeric characters): "Wi-Fi" -> "Wi", "Fi"
2. split on case transitions (can be turned off - see the splitOnCaseChange parameter): "PowerShot" -> "Power", "Shot"
3. split on letter-number transitions (can be turned off - see the splitOnNumerics parameter): "SD500" -> "SD", "500"
<fieldtype name="subword" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="1"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
preserveOriginal="1"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldtype>
You can also just replace _ with any non-alphanumeric character that your tokenizer splits on. In the following case I replaced it with a hyphen '-', which is a valid delimiter for StandardTokenizerFactory:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="_"
replacement="-"/>
<tokenizer class="solr.StandardTokenizerFactory"/>

Search for partial words using Solr

I'm trying to search for a partial word using Solr, but I can't get it to work.
I'm using this in my schema.xml file.
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" splitOnNumerics="1" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
</analyzer>
</fieldType>
Searching for die h won't work, but die hard returns some results.
I've reindexed the database after the above configuration was added.
Here is the url and output when searching for die hard. The debugger is turned on.
Here is the url and output when searching for die h. The debugger is turned on.
I'm using Solr 3.3. Here is the rest of the schema.xml file.
The query you've shared is searching the "title_text" field, but the schema you posted above defines the "text" field. Assuming this was just an oversight, and the title_text field is defined as in your post, I think a probable issue is that the NGramTokenizer is configured with minGramSize="3", and you are expecting to match using a single-character token.
You could try changing minGramSize to 1, but this will inevitably lead to some very inefficient indexes, and I wonder whether you really want "e" to match every movie with an e in the title.
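For completeness, that change would just be lowering minGramSize in the index analyzer (followed by a full reindex), at the cost of a much larger index:
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15" />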
