Solr 5.1: Problems with search queries containing underscores

I've indexed an internal website using Solr 5.1 and the new managed schema. I've indexed the page title, url, and body using "text_en" and "text_en_splitting". I get pretty much the behavior I want except when the query string contains underscores.
My use case is the following: Suppose we have 3 terms, "first", "second" and "third", and that "second" does not exist in the index but "first" and "third" do. When the search term is "first second third", I get the behavior I want (i.e. pages with "first" and "third" are returned).
However, when the search term is "first_second_third", I get 0 results, but I would expect to get something since "first" and "third" exist in the index.
I'm using edismax search with qf=url_txt_en title_txt_en title_txt_en_split text_txt_en_split
Can someone suggest a way to tweak my config to get what I want?

Are you using the definition for text_en_splitting that comes with the Solr examples?
If so, the issue is that this type uses WhitespaceTokenizerFactory, which creates tokens by splitting on whitespace only. It does not split on underscores, so the tokenizer emits first_second_third as a single token.
Instead, it sounds like you need to tokenize on both whitespace and underscores. So try replacing that with PatternTokenizerFactory, using a pattern that matches runs of either character, like so:
<tokenizer class="solr.PatternTokenizerFactory" pattern="[_\s]+" />
Don't forget to change this in both the index and query analyzer blocks, and to reindex afterwards so that existing documents are tokenized the same way.
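To make the difference concrete, here is a minimal Python sketch that imitates the two splitting strategies with regular expressions. It is only an illustration, not Solr's tokenizer code: the function names are made up for this example, and the pattern [_\s]+ is assumed to be used as a delimiter.

```python
import re


def whitespace_tokens(text):
    # Imitates splitting on whitespace only: underscores stay inside the token.
    return text.split()


def pattern_tokens(text, pattern=r"[_\s]+"):
    # Imitates a split-style pattern tokenizer: the pattern marks the delimiters.
    return [tok for tok in re.split(pattern, text) if tok]


print(whitespace_tokens("first_second_third"))  # ['first_second_third']
print(pattern_tokens("first_second_third"))     # ['first', 'second', 'third']
print(pattern_tokens("first second_third"))     # ['first', 'second', 'third']
```

With the underscore-aware pattern, "first" and "third" become separate query terms, which is what lets them match documents even when "second" is absent from the index.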

Try the field type below, which uses WordDelimiterFilterFactory. It splits words into subwords and performs optional transformations on subword groups.
By default, words are split into subwords with the following rules:
1. Split on intra-word delimiters (all non-alphanumeric characters).
"Wi-Fi" -> "Wi", "Fi"
2. Split on case transitions (can be turned off; see the splitOnCaseChange parameter).
"PowerShot" -> "Power", "Shot"
3. Split on letter-number transitions (can be turned off; see the splitOnNumerics parameter).
"SD500" -> "SD", "500"
<fieldtype name="subword" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="0"
            catenateNumbers="0"
            catenateAll="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1"
            generateNumberParts="1"
            catenateWords="1"
            catenateNumbers="1"
            catenateAll="0"
            preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldtype>
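The three default splitting rules can be approximated in a few lines of Python. This is a rough, ASCII-only sketch written for illustration; split_subwords is a hypothetical helper, not part of Solr, and the real filter handles more cases (Unicode, possessives, protected words).

```python
import re

# Zero-width boundaries: lower->upper case change, letter->digit, digit->letter.
BOUNDARY = re.compile(
    r"(?<=[a-z])(?=[A-Z])|(?<=[A-Za-z])(?=[0-9])|(?<=[0-9])(?=[A-Za-z])"
)


def split_subwords(token):
    # Mark case and letter-number transitions with a space, then split on
    # every run of non-alphanumeric characters (which includes that space).
    marked = BOUNDARY.sub(" ", token)
    return [part for part in re.split(r"[^A-Za-z0-9]+", marked) if part]


print(split_subwords("Wi-Fi"))               # ['Wi', 'Fi']
print(split_subwords("PowerShot"))           # ['Power', 'Shot']
print(split_subwords("SD500"))               # ['SD', '500']
print(split_subwords("first_second_third"))  # ['first', 'second', 'third']
```

The last call shows why this filter helps with the original question: the underscore is just another intra-word delimiter.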

You can simply convert _ to any non-alphanumeric character that your tokenizer splits on. In the following case I converted it to a hyphen '-', which is a valid delimiter for StandardTokenizerFactory:
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="_"
replacement="-"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
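The key point is that a char filter rewrites the raw text before the tokenizer sees it. The Python sketch below imitates that order of operations. It assumes a tokenizer that keeps underscores inside a token but breaks on hyphens; the regex \w+ is used as a crude stand-in for such a tokenizer, and both function names are invented for this example.

```python
import re


def char_filter(text):
    # Stand-in for the pattern-replace char filter: "_" becomes "-".
    return text.replace("_", "-")


def tokenize(text):
    # Crude stand-in tokenizer: \w matches letters, digits and "_",
    # so underscores do not break a token but hyphens do.
    return re.findall(r"\w+", text)


raw = "first_second_third"
print(tokenize(raw))               # ['first_second_third']
print(tokenize(char_filter(raw)))  # ['first', 'second', 'third']
```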

Related

How to search a field that could contain spaces, hyphens, and a concatenated number?

Hi, I have a field with the following schema:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
I am storing complete PDF documents.
Now suppose I have 4 documents with the following content.
1. stackoverflow is a good site.
2. stack-overflow is a good site.
3. stack overflow is a good site.
4. stackoverflow2018 is a good site.
When I search for stackoverflow, it should return document 1.
When I search for stack-overflow, it should return document 2.
When I search for stack overflow, it should return document 3.
When I search for stackoverflow2018, it should return document 4.
What should the schema be for this? The schema above does not work in this case.
Is there anything I could specify in the query?
A Word Delimiter Graph Filter will split on non-alphanumerics (-), case changes, and numbers by default.
The rules for determining delimiters are determined as follows:
A change in case within a word: "CamelCase" -> "Camel", "Case". This can be disabled by setting splitOnCaseChange="0".
A transition from alpha to numeric characters or vice versa: "Gonzo5000" -> "Gonzo", "5000"; "4500XL" -> "4500", "XL". This can be disabled by setting splitOnNumerics="0".
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
A trailing "'s" is removed: "O’Reilly’s" -> "O", "Reilly"
Any leading or trailing delimiters are discarded: "--hot-spot--" -> "hot", "spot"
If you don't want that behavior, remove the WordDelimiterFilter from your filter list and add other filters to support the part of the WDF behavior that you need.

Solr wildcard issue with '-' character

I am using Solr and tokenizing a field as follows:
<field name="Title" type="text_general" multiValued="false" indexed="true" stored="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</field>
I append * to each search term to get the matching results:
Title:app*
For example, app* will give me app, application and similar results.
But if I search for a term with '-' in it, the query fails to return anything.
For example:
Title:child-play*
does not return any results,
but Title:child-play does!
Can anyone point out what the issue might be?
After debugging, I got this:
for Title:child-play
"debug":{
"rawquerystring":"Title:child-play",
"querystring":"Title::child-play",
"parsedquery":"Title::child Title::play",
"parsedquery_toString":"Title::child Title::play",
for Title:child-play*
"debug":{
"rawquerystring":"CompanyName:child-play*",
"querystring":"CompanyName:child-play*",
"parsedquery":"CompanyName:child-play*",
"parsedquery_toString":"CompanyName:child-play*",
I recommend using WordDelimiterFilterFactory.
Just change the type of the field to a custom type; in my case it's "text_general":
<field name="Title" type="text_general"/>
Then you need to create the new type.
For example, here are my settings. You can customise them however you want.
<fieldType name="text_general" class="solr.TextField" omitNorms="false" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt" generateNumberParts="0" stemEnglishPossessive="0" splitOnCaseChange="1" preserveOriginal="1" catenateAll="1" catenateWords="1" catenateNumbers="1" generateWordParts="1" splitOnNumerics="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt" generateNumberParts="1" stemEnglishPossessive="0" splitOnCaseChange="1" preserveOriginal="1" catenateAll="1" catenateWords="1" catenateNumbers="1" generateWordParts="1" splitOnNumerics="1"/>
</analyzer>
</fieldType>
Please read more information here
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Arguments:
generateWordParts: (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" -> "Camel", "Case", "hot", "spot"
generateNumberParts: (integer, default 1) If non-zero, splits numeric strings at delimiters: "1947-32" -> "1947", "32"
splitOnCaseChange: (integer, default 1) If 0, words are not split on camel-case changes: "BugBlaster-XL" -> "BugBlaster", "XL"
splitOnNumerics: (integer, default 1) If 0, don't split words on transitions from alpha to numeric: "FemBot3000" -> "Fem", "Bot3000"
catenateWords: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor's" -> "hotspotsensor"
catenateNumbers: (integer, default 0) If non-zero, maximal runs of number parts will be joined: "1947-32" -> "194732"
catenateAll: (0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" -> "ZapMaster9000"
preserveOriginal: (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" -> "Zap-Master-9000", "Zap", "Master", "9000"
protected: (optional) The pathname of a file that contains a list of protected words that should be passed through without splitting.
stemEnglishPossessive: (integer, default 1) If 1, strips the possessive "'s" from each subword.
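To show how preserveOriginal and catenateAll change the emitted tokens, here is a small Python sketch. It only models delimiter splitting plus those two flags; word_delimiter is a hypothetical function for illustration, and the order in which real Solr emits the tokens may differ.

```python
import re


def word_delimiter(token, preserve_original=False, catenate_all=False):
    # Split on runs of non-alphanumeric characters (the delimiters).
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    out = []
    if preserve_original:
        out.append(token)           # keep the untouched input token
    out.extend(parts)               # the individual word and number parts
    if catenate_all and len(parts) > 1:
        out.append("".join(parts))  # all parts joined without delimiters
    return out


print(word_delimiter("Zap-Master-9000", preserve_original=True))
# ['Zap-Master-9000', 'Zap', 'Master', '9000']
print(word_delimiter("Zap-Master-9000", catenate_all=True))
# ['Zap', 'Master', '9000', 'ZapMaster9000']
```

Keeping the original token next to its parts is one way to let both an exact form such as child-play and its pieces match the same document.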

Some characters break phrase search in a text field

I have a text field that contains titles of TV series or movies. In several cases I want to perform a phrase query on what I'd call a pretty normal text field. This works fine for most phrases, but in some reproducible cases it simply returns nothing. It seems to be related to some "special" characters, though not every special character I would have suspected is affected.
Title:("Mission: Impossible") works
Title:("Disney A.N.T.") doesn't work
Title:("Stephen King's Shining") doesn't work
Title:("Irgendwie L. A.") works
After trying several other titles, I assume it is somehow related to the dot (.) and the apostrophe (') and maybe other characters I don't know about yet. I have no idea where to look now.
Relevant schema.xml:
<fieldType name="title" class="solr.TextField" sortMissingLast="true"
positionIncrementGap="100" autoGeneratePhraseQueries="false">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"
splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
generateWordParts="1" generateNumberParts="0"
catenateWords="1" catenateNumbers="0" catenateAll="0" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.GermanNormalizationFilterFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
<filter class="solr.TrimFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Your question is about phrases on a field where the analyzer of type "index" contains a solr.WordDelimiterFilterFactory but the analyzer of type "query" does not.
As MatsLindh said, the first step is to open the analysis screen.
In this case the position value is important.
With your attributes in solr.WordDelimiterFilterFactory, the token "King's" is converted to "king's", "king", "kings" and "s", and the final "s" lands on a second position of its own.
So if you search for the phrase "Stephen King's Shining", the query side (without solr.WordDelimiterFilterFactory) puts the token "Shining" on position three, but the index side (with solr.WordDelimiterFilterFactory) puts "Shining" on position four. So only "Stephen King's Shining"~2 (with slop) will match, but not "Stephen King's Shining".
This does not explain your problem with "Disney A.N.T.". But be aware that solr.StandardTokenizerFactory would remove the last dot, whereas solr.WhitespaceTokenizerFactory does not.
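The position mismatch can be reproduced with a toy phrase matcher in Python. The positions below are taken from the description above and are an assumption about what the analysis screen shows; phrase_matches is a simplified, hypothetical check (each term may drift by at most slop positions), not Lucene's actual slop algorithm.

```python
def phrase_matches(index_positions, query_terms, slop=0):
    # index_positions maps each indexed term to the set of positions it occupies.
    for start in index_positions.get(query_terms[0], ()):
        if all(
            any(abs(pos - (start + offset)) <= slop
                for pos in index_positions.get(term, ()))
            for offset, term in enumerate(query_terms[1:], start=1)
        ):
            return True
    return False


# Index side: the word delimiter filter gives the trailing "s" its own position.
indexed = {
    "stephen": {1},
    "king's": {2}, "king": {2}, "kings": {2},
    "s": {3},
    "shining": {4},
}
# Query side: no word delimiter filter, so the terms sit on positions 1, 2, 3.
query = ["stephen", "king's", "shining"]

print(phrase_matches(indexed, query))          # False
print(phrase_matches(indexed, query, slop=2))  # True
```

Using the same word delimiter settings in both analyzers, or turning off the extra position on the index side, would remove the need for slop.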

Searching for Solr Stop words

One of my Solr fields is configured in the following manner:
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
This works in cases where I don't want stemming, but now there is another use case that is causing a problem. People are beginning to search for the following combinations:
The Ivy: In this case, results with just "ivy" are returned, when the expected results would include "The". I understand that this is because of the stop word, but is there a way to achieve this? For example, if they search for "the ivy" within quotes, then this should work.
(Mom & Me) OR ("mom and me"): In this case too, & is dropped, or results that contain both "mom" and "me" anywhere in the text are returned.
I am OK if only new data behaves the right way, but I wouldn't be able to reindex. Also, would changing the schema.xml file trigger a full replication?
Regards,
Ayush
You are using the whitespace tokenizer.
So "The Ivy" is split into 2 words.
You could use a less aggressive tokenizer, followed by the WordDelimiterFilterFactory, in order to use the protected="protwords.txt" option, where you can set "the ivy" as a protected word so that Solr will not split it.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
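As background, the effect of whitespace tokenizing followed by a stop filter can be imitated in Python. The stop word list here is a made-up stand-in for stopwords.txt, and analyze is a hypothetical helper, so treat this purely as an illustration of why "The" disappears.

```python
# Hypothetical stop word list standing in for the contents of stopwords.txt.
STOPWORDS = {"the", "a", "an", "and", "is"}


def analyze(text):
    # Split on whitespace, lowercase for comparison, and drop stop words.
    return [tok for tok in text.lower().split() if tok not in STOPWORDS]


print(analyze("The Ivy"))     # ['ivy']
print(analyze("mom and me"))  # ['mom', 'me']
```

Once "the" is removed on both the index and query side, a query for The Ivy cannot be told apart from a query for Ivy.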

Search for partial words using Solr

I'm trying to search for a partial word using Solr, but I can't get it to work.
I'm using this in my schema.xml file.
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" splitOnNumerics="1" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
</analyzer>
</fieldType>
Searching for die h won't work, but die hard returns some results.
I've reindexed the database after the above configuration was added.
Here is the url and output when searching for die hard. The debugger is turned on.
Here is the url and output when searching for die h. The debugger is turned on.
I'm using Solr 3.3. Here is the rest of the schema.xml file.
The query you've shared is searching the "title_text" field, but the schema you posted above defines the "text" field. Assuming this was just an oversight, and the title_text field is defined as in your post, I think a probable issue is that the NGramTokenizer is configured with minGramSize="3", and you are expecting to match using a single-character token.
You could try changing minGramSize to 1, but this will inevitably lead to some very inefficient indexes; and I wonder whether you really are keen on having "e" match every movie with an e in the title?
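The gram-size argument can be checked with a short Python sketch that generates character n-grams over the whole lowercased title, roughly the way an n-gram tokenizer would. The ngrams function is written for this example and is not Solr code; the sizes 3 and 15 come from the posted schema.

```python
def ngrams(text, min_size=3, max_size=15):
    # Collect every substring whose length lies between min_size and max_size.
    text = text.lower()
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(text) - size + 1):
            grams.add(text[start:start + size])
    return grams


indexed = ngrams("Die Hard")
print("die" in indexed)   # True
print("hard" in indexed)  # True
print("h" in indexed)     # False

# With a minimum gram size of 1, the single letter becomes searchable.
print("h" in ngrams("Die Hard", min_size=1))  # True
```

The query term "h" has no counterpart in the index when the minimum gram size is 3, which is consistent with die h returning nothing while die hard matches.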
