Solr wildcard issue with '-' character

Solr wildcard issue with '-' character - solr

i am using solr and tokenizing a field as follows:
<field name="Title" type="text_general" multiValued="false" indexed="true" stored="true">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</field>
i append * at each search field to get the matching result:
Title:app*
for example app* will give me app,application and similar result
But if i search for the term with '-' in it the query fails to return anything.
For example:
Title:child-play*
Does not return any result
but Title:child-play does !!
Can anyone point me what might be the issue.
after debug i got this :
for Title:child-play
"debug":{
"rawquerystring":"Title:child-play",
"querystring":"Title::child-play",
"parsedquery":"Title::child Title::play",
"parsedquery_toString":"Title::child Title::play",
for Title:child-play*
"debug":{
"rawquerystring":"CompanyName:child-play*",
"querystring":"CompanyName:child-play*",
"parsedquery":"CompanyName:child-play*",
"parsedquery_toString":"CompanyName:child-play*",

I recommend you to use WordDelimiterFilterFactory
Just change type of field to "custom type", in my case it's 'text_general"
<field name="Title" type="text_general"/>
Then you need to create a new type
For example, my settings. You can customise it how you want.
<fieldType name="text_general" class="solr.TextField" omitNorms="false" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt" generateNumberParts="0" stemEnglishPossessive="0" splitOnCaseChange="1" preserveOriginal="1" catenateAll="1" catenateWords="1" catenateNumbers="1" generateWordParts="1" splitOnNumerics="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" types="wdfftypes.txt" generateNumberParts="1" stemEnglishPossessive="0" splitOnCaseChange="1" preserveOriginal="1" catenateAll="1" catenateWords="1" catenateNumbers="1" generateWordParts="1" splitOnNumerics="1"/>
</analyzer>
</fieldType>
Look at my screenshot.
Please read more information here
https://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
Arguments:
generateWordParts: (integer, default 1) If non-zero, splits words at delimiters.
For example:"CamelCase", "hot-spot" -> "Camel", "Case", "hot", "spot"
generateNumberParts: (integer, default 1) If non-zero, splits numeric strings at delimiters:"1947-32" ->"1947", "32"
splitOnCaseChange: (integer, default 1) If 0, words are not split on camel-case changes:"BugBlaster-XL" -> "BugBlaster", "XL". Example 1 below illustrates the default (non-zero) splitting behavior.
splitOnNumerics: (integer, default 1) If 0, don't split words on transitions from alpha to numeric:"FemBot3000" -> "Fem", "Bot3000"
catenateWords: (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor's" -> "hotspotsensor"
catenateNumbers: (integer, default 0) If non-zero, maximal runs of number parts will be joined: 1947-32" -> "194732"
catenateAll: (0/1, default 0) If non-zero, runs of word and number parts will be joined: "Zap-Master-9000" -> "ZapMaster9000"
preserveOriginal: (integer, default 0) If non-zero, the original token is
preserved: "Zap-Master-9000" -> "Zap-Master-9000", "Zap", "Master", "9000"
protected: (optional) The pathname of a file that contains a list of protected words that should be passed through without splitting.
stemEnglishPossessive: (integer, default 1) If 1, strips the possessive "'s" from each subword.

Related

Solr Queries With Dashes

I am currently using solr edismax to do searches on our website. What I'm looking to do, is essentially have dashes get ignored.
So if I search the words, "wi-fi adapter". And I have a document, with a title, "wifi adapter". I'll get no results.
I am currently using solr.MappingCharFilterFactory to map dashes to spaces. This is what my text_general fieldtype looks like in my schema.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
</fieldType>
My mapping.txt contains the line..
"-" => " "
So what this rule does, is it converts the dashes to a space.
So if I search "wi fi adapter", it will always show the same results as "wi fi adapter", but won't show results for "wifi adapter".
Is there any way to treat dashes like this? Essentially I'd want to treat "wifi adapter", "wi-fi adapter", and "wi fi adapter" the same.

You can use the WordDelimiterGraphFilterFactory for your analyzer. It has lot many attributes that could be used. I have listed few.
The WordDelimiterGraphFilterFactory has many attributes.
generateWordParts : (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" → "Camel", "Case", "hot", "spot"
preserveOriginal : (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" → "Zap-Master-9000", "Zap", "Master", "9000"
catenateWords : (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" → "hotspotsensor"
So in your case it would be like
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- Splits words based on whitespace characters -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- splits words at delimiters based on different arguments -->
<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
<!-- Transforms text to lower case -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The more information on it would be found at Fiters available in solr

How to search the field which could contains spaces,- and a concatenated number.?

Hi I have a field with the following schema,
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" generateNumberParts="1" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<filter class="solr.WordDelimiterFilterFactory" catenateNumbers="0" protected="protwords.txt" splitOnCaseChange="1" generateWordParts="0" preserveOriginal="1" catenateAll="0" catenateWords="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
I am storing complete pdf documents.
Now suppose I have 4 documents with the following content.
1. stackoverflow is a good site.
2. stack-overflow is a good site.
3. stack overflow is a good site.
4. stackoverflow2018 is a good site.
Now when I search stackoverflow It should return me 1,
when I search stack-overflow it should return me 2.
when I search stack overflow it should return me 3.
when I search stackoverflow2018 it should return me 4.
what should the schema for it the schema not working in this case.
Is there any thing I could specify in the query ?

A Word Delimiter Graph Filter will split on non-alphanumerics (-), case changes, and numbers by default.
The rules for determining delimiters are determined as follows:
A change in case within a word: "CamelCase" -> "Camel", "Case". This
can be disabled by setting splitOnCaseChange="0".
A transition from alpha to numeric characters or vice versa:
"Gonzo5000" -> "Gonzo", "5000" "4500XL" -> "4500", "XL". This can be
disabled by setting splitOnNumerics="0".
Non-alphanumeric characters (discarded): "hot-spot" -> "hot", "spot"
A trailing "'s" is removed: "O’Reilly’s" -> "O", "Reilly"
Any leading or trailing delimiters are discarded: "--hot-spot--" ->
"hot", "spot"
If you don't want that behavior, remove the WordDelimiterFilter from your filter list and add other filters to support the part of the WDF behavior that you need.

haystack 2 SearchQuerySet with filter by list matching with exact string

How can I filter by string list in haystack 2.0?
In Haystack 1.2 with Solr, if I have this code:
result = SearchQuerySet().models(MyModel).filter(my_field__in=['A', 'B', 'C'])
Result will exactly return objects with my_field equals to 'A', 'B' or 'C'. Instead, in Haystack 2.0 with Solr we will obtain objects with my_field as 'A', 'A something', 'B', 'B something'. I need preserve the haystack 1.2 behavior. Any idea?
If I use in Haystack 2.0:
result = SearchQuerySet().models(MyModel).filter(my_field=Exact('A'))
I will obtain objects with my_field equals to 'A'. Good! But I do not find one solution for filter with exact values in a list.
I need your help. Thanks you.

I have found one solution. In Haystack 2.0, the schema.xml generated for solr has the next definition related to default text field:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPossessiveFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<!-- Optionally you may want to use this less aggressive stemmer instead of PorterStemFilterFactory:
<filter class="solr.EnglishMinimalStemFilterFactory"/>
-->
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
In Haystack 1.2, the schema.xml generated for solr has this other definition:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_en.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
If you compare the analyzers in both definitions you can check that Haystack 1.2 version use solr.WhitespaceTokenizerFactory, and Haystack 2.0 version use solr.StandardTokenizerFactory.
To keep the behavior of Haystack 1.2 in Haystack 2.0 you can create new text field type named, for example, "text_exact" with the same content of text field definition of Haystack 1.2 and then you associate all text fields in schema.xml with "text_en" to "text_exact".
From this version:
<field name="sales_user" type="text_en" indexed="true" stored="true" multiValued="false" />
We will obtain this other version:
<field name="sales_user" type="text_exact" indexed="true" stored="true" multiValued="false" />
The Haystack 2.0 official documentation does not provide a solution to this big change. It would be interesting to include some examples like this in section dedicated to migration from Haystack 1.x to 2.x.

I am using the Haystack 2.4.1.
I faced similar problem. The reason is that the field should not be analyzed.
To make the index of field 'not_analyzed' pass indexed parameter as False.
So while indexing the field I have used the following parameters.
field_name = indexes.CharField(model_attr='db_field',indexed=False)
So that field_name will become 'not_analyzed'
After that you apply your query.

Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundires

I have configured WordDelimiterFilterFactory for custom tokenizers for & and - which is working fine.
And for few tokenizer (like . _ :) we need to split on boundries only. And not to split if in between of word.
e.g.
test.com (should tokenized to test.com)
newyear. coming (should tokenized to newyear and coming)
new_car (should tokenized to new_car)
..
..
I checked that types can be used in Solr.WordDelimiterFilterFactory are LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, SUBWORD_DELIM . there's no description available for use of each type. as per name suggest , i thought type SUBWORD_DELIM may fulfill my need, but it doesn't seem to work.
Below is defination for text field
<fieldType name="text_general_preserved" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
below is wdfftypes_general.txt content
& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM
Can anybody suggest me how can i set configuration for Solr.WordDelimiterFilterFactory to fulfill my requirement.
Thanks.

Based on the documentation for WordDelimiterFilterFactory, the SUBOWRD_DELIM settings in the wdfftypes.txt file only impact the behavior based on the splitOnCaseChange and splitOnNumerics settings. Therefore, I would add : _ . as ALPHA entries in the wdfftypes.txt file and add a new PatternReplaceCharFilterFactory after the WordDelimiterFilterFactory in your fieldType to remove those leading or trailing character from any tokens.

Searching for Solr Stop words

On of my solr fields is configured in the following manned,
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
This works in cases where i don't want stemming, but now there is another use case which is causing a problem, people are beginning to seach for the following combinations,
The Ivy : In this case results with just ivy is being returned, when the expected result would be with The. I understand that this is because of the stop word but is the way to achieve this. For example if they search for "the ivy" within quotes than this should work.
(Mom & Me) OR ("mom and me"): In this case also & is dropped or results including both mom and me in some part of the statement is returned.
I am ok if only new data behaves in the right way but wouldnt be able to reindex. Also, would changing the schema.xml file trigger a full replication?
Regards,
Ayush

You are using the white space tokenizer.
So "The Ivy" is slitted into 2 words.
You could use an less agressive tokenize an followed by the WordDelimiterFilterFactory in order to activate the protected="protwords.txt" options, where you can set "the ivy" as an protected word so that solr will not tokenize that.
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory