I'm using Solr for search on an e-commerce site.
Many products contain a dimension in the description, using the " notation for inches, and the ' for feet.
So I have two questions:
What analyzer/tokenizer should I use to get those characters into the index, and
would a simple addition to synonyms.txt (inch => " and feet => ') work?
I ran into the same problem. My preference was to use the StandardTokenizer but it strips the ' and " and I could not find a way to add an exception. This meant synonyms, which are post-tokenizer, would be useless for the task. I searched for another tokenizer that would not strip the quotes and apostrophe but still be useful for "standard" tokenizing. I came up empty.
The solution I ended up going with was to use a charFilter before the tokenizer to change the " and ' to something else that was easier to work with. I used the PatternReplaceCharFilter to achieve this.
Since I am using the StandardTokenizer on the index and the query, I decided to also do this text replacement on both. In my case I wanted to be sure that the value was followed or preceded by white space. You can adjust the regex to your particular needs.
I should note that I do have the synonyms set as well (from my prior, failed, efforts). However, I am assuming that they are not playing a role in the case of these two characters, since they are being converted pre-tokenizer.
The index analyzer also has a PatternCaptureGroupFilter to help better index things like 1x1mm or 2.5"x15".
Analyzer
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)&quot;\s" replacement="$1 inch "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)'\s" replacement="$1 feet "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)&quot;" replacement=" $1 inch"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)'" replacement=" $1 feet"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.PatternCaptureGroupFilterFactory" pattern=".*(([0-9\.]+([a-z&quot;']?)x[0-9\.]+)([a-z&quot;']?))\s*" preserve_original="true"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="word-delim-special-chars.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)&quot;\s" replacement="$1 inch "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)'\s" replacement="$1 feet "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)&quot;" replacement=" $1 inch"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)'" replacement=" $1 feet"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="word-delim-special-chars.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
I am including the following for clarity and thoroughness, but I do not believe they are playing a role in the final result (related to the quote and apostrophe).
word-delim-special-chars.txt
" => ALPHA
' => ALPHA
. => ALPHANUM
_ => ALPHA
synonyms.txt
",inch,inches,in.
feet,ft,',ft.,foot
oz,ounce,ounces,oz.
mm,millimeter,mm.,millimeters,mms
by,x
gram,g,grams
cm,centimeter,centimeters
Related
I have content with two title styles: ABCWord and ABC Word. When I enter keywords such as abc-word or abc word in the search box, content titled ABC Word is found, but I also need to get content titled ABCWord.
I've tried using solr.EdgeNGramFilterFactory and solr.WordDelimiterFilterFactory for this, but it seems I'm using them wrong.
My current schema.xml text field configuration:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
</fieldType>
You aren't using it wrong, but you may be using too many filters, and they are affecting the final result.
The EdgeNGram filter should resolve your problem, since it creates prefix tokens of length 3 to 30 from each input token. So "ABCWord" will become "abc", "abcw", "abcwo", "abcwor" and "abcword", and then a search for "abc" should match.
First of all, I'd recommend changing the fieldType you use for n-grams, because they increase your index size a lot. It's better to create a new field type and use it only for the fields that really need it, instead of the "text" fieldType, which probably indexes other values where you don't need n-grams.
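A minimal sketch of such a dedicated field type is shown here. The names text_prefix, title and title_prefix are hypothetical, and applying the n-gram filter only at index time is an assumption of this sketch, not something taken from the question's config:

```xml
<!-- Hypothetical field type, used only for fields that need prefix matching -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Each lowercased token is expanded into its prefixes of length 3 to 30 -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <!-- Assumption: the query text is not n-grammed, only lowercased -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Hypothetical fields: copy the title into the n-gram field instead of n-gramming everything -->
<field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>
<copyField source="title" dest="title_prefix"/>
```

Keeping the n-grams in a separate field means the regular "text" fields stay small, and the n-gram field can be given its own boost at query time.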
Second, if your analyzer definition can be the same at index and search time, you don't need to duplicate the config. Just use a single 'analyzer' element instead of 'analyzer type="index"' and 'analyzer type="query"'.
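As an illustration, a field type with one shared analyzer could look like the following. The name text_simple and the choice of filters are placeholders for this sketch only:

```xml
<fieldType name="text_simple" class="solr.TextField" positionIncrementGap="100">
  <!-- No type attribute: the same chain is applied when indexing and when querying -->
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```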
I strongly recommend checking the Analysis tab in the Solr admin to see how Solr processes the indexed and queried text for your input. You can also remove filters from your fieldType config one at a time while you work toward a specific result. That makes it easier to understand what each filter is doing to your input.
How can I make Solr give more relevance to words based on their position in the string?
For example, if I search for "Macbook", the first results are ones like "Case Logic LAPS-113 13.3-Inch Laptop / MacBook Air", and only after them comes "Apple MacBook Pro MD101LL/A 13.3-Inch".
This is my field declaration:
<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
</fieldType>
What if the product name were "MacBook/Dell/Lenovo Laptop cheap case"? It contains MacBook in the first position. Would you still want to boost this document?
I think you should fix the root cause instead: the common problem of accessories (such as 'case', 'battery', 'lock', etc.) scoring better than the products themselves.
The obvious best choice is to index a field that says whether the doc is an accessory, and boost the docs that are not accessories. (I gather you don't have that info; if you do, this is the best way.)
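A rough sketch of that boost in solrconfig.xml, assuming the (e)dismax query parser and two hypothetical fields, a searchable name field and a boolean is_accessory field. The boost value 5 is arbitrary and would need tuning:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Hypothetical searchable field -->
    <str name="qf">name</str>
    <!-- Boost query: docs flagged as non-accessories get extra score -->
    <str name="bq">is_accessory:false^5</str>
  </lst>
</requestHandler>
```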
If you don't have that info, you can try penalizing the docs that contain 'typical' accessory words. For this you need to build a list of such words, but that is not hard. I have used this approach with good results.
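One way to sketch the penalty with the (e)dismax parser is to boost every doc that does not contain an accessory word, which has the same relative effect as penalizing the ones that do. The field name `name`, the three-word list and the boost value are illustrative assumptions only:

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Hypothetical searchable field -->
    <str name="qf">name</str>
    <!-- All docs except those whose name contains a typical accessory word get the boost -->
    <str name="bq">(*:* -name:(case OR battery OR lock))^5</str>
  </lst>
</requestHandler>
```

A real word list would be longer and is probably easier to maintain outside the handler defaults, for example by sending bq from the application.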
I am using Solr 3.6.2. Search matches on the prefix, the suffix and the middle of a word.
If I search for "20%", the results include "20%", "* 0%" and "* 20 *". How do I exclude "0% *" and "* 20 *" from the results and keep only the exact match "20%"? My schema.xml is below:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[.-_]" replacement=" "/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you refer to this information on the WordDelimiterFilterFactory you can see that non-alphanumeric characters are discarded. For example, the string "20%50" will be broken into two tokens "20" and "50".
A Solr wiki page covering WordDelimiterFilterFactory explains how to change this behavior. In summary, the query analyzer for your field type will need to change to:
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<!-- The last parameter to the next filter is new! -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" types="myTypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
The parameter types="myTypes.txt" specifies a file in which you control how special characters should be interpreted. Your myTypes.txt should be in the solr/conf directory, and its contents might look like this:
% => DIGIT
This causes the '%' to be treated as a digit. See the Solr wiki link above for more details.
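The types file only helps if the '%' is still attached to the token when WordDelimiterFilterFactory sees it, and if the index side treats the character the same way. As far as I can tell, StandardTokenizerFactory drops '%' before any filter runs, so the sketch below assumes a WhitespaceTokenizerFactory and a single analyzer shared by index and query. The field type name text_percent is hypothetical:

```xml
<fieldType name="text_percent" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer for both index and query, so "20%" is handled identically on both sides -->
  <analyzer>
    <!-- Assumption: a tokenizer that keeps '%' attached to the token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- myTypes.txt contains: % => DIGIT -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1" types="myTypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The Analysis page in the Solr admin is the quickest way to confirm whether "20%" actually survives as one token with a given chain.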
I have configured WordDelimiterFilterFactory with custom types for & and -, which is working fine.
For a few other characters (like . _ :) we need to split on word boundaries only, and not split when the character occurs in the middle of a word.
e.g.
test.com (should be tokenized to test.com)
newyear. coming (should be tokenized to newyear and coming)
new_car (should be tokenized to new_car)
..
..
I checked that the types that can be used with solr.WordDelimiterFilterFactory are LOWER, UPPER, ALPHA, DIGIT, ALPHANUM and SUBWORD_DELIM. There is no description available of how each type is used. As the name suggests, I thought SUBWORD_DELIM might fulfill my need, but it doesn't seem to work.
Below is the definition for the text field:
<fieldType name="text_general_preserved" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Below is the content of wdfftypes_general.txt:
& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM
Can anybody suggest how I can configure solr.WordDelimiterFilterFactory to fulfill my requirement?
Thanks.
Based on the documentation for WordDelimiterFilterFactory, the SUBWORD_DELIM settings in the wdfftypes_general.txt file only affect the behavior driven by the splitOnCaseChange and splitOnNumerics settings. Therefore, I would add : _ . as ALPHA entries in the wdfftypes_general.txt file, and add a new PatternReplaceFilterFactory after the WordDelimiterFilterFactory in your fieldType to remove those characters from the start or end of any token. (It has to be the token filter rather than PatternReplaceCharFilterFactory, because char filters run before the tokenizer and cannot be placed after a filter.)
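A sketch of that suggestion is shown here, using the token filter solr.PatternReplaceFilterFactory, since a char filter runs before the tokenizer and cannot follow WordDelimiterFilterFactory. The regular expression and the revised types entries are my assumptions about what the change would look like, not a tested configuration:

```xml
<!-- Assumed new content of wdfftypes_general.txt:
       & => ALPHA
       - => ALPHA
       _ => ALPHA
       : => ALPHA
       . => ALPHA
-->
<analyzer>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
  <!-- With . _ : mapped to ALPHA, test.com and new_car are no longer split -->
  <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          preserveOriginal="0"
          protected="protwords_general.txt" types="wdfftypes_general.txt"/>
  <!-- Strip . _ : only at the start or end of a token, e.g. "newyear." becomes "newyear" -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^[._:]+|[._:]+$" replacement="" replace="all"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

A token made up only of these characters would become empty after the replacement, so a length filter after it may be worth considering.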
A customer of mine is a photo agency specialized in photojournalism (well, and gossip), so many of their customers' searches revolve around specific people.
We index about 1.5m documents, with full-text search on headline and caption; and full-text search without stemming on tags. We have a decent list of stop words, and they provide a list of protected words that they feel are not stemmed correctly.
We are using Dismax to search over headline, caption and tags, with different boosts.
This is all working pretty nicely.
However, a few people are proving tricky to get right. For instance, Al Gore. In Italian "al" is a stop word, so a simple query for `al gore' (without quotes) becomes:
+((DisjunctionMaxQuery((caption_text:gor | tags_text:gore^100.0 | headline_text:gor)))~1) ()
That does return hits for the ex VP, but of course also for "Lesley Gore" and "Tipper Gore"; and also, because of stemming, hits for "Gori" and more.
Leaving aside sorting for a second, it does clutter up results, and I'd like to do better.
Wrapping the search terms in quotes doesn't help, "al" gets stripped away anyway.
Marking "gore" as a protected word gets me halfway there, limiting the number of false positives.
I tried playing with SynonymFilterFactory too, but didn't get too far--I have the SynonymFilterFactory as the first filter, so "al" gets removed anyway.
What I think I really need is a way of tokenizing "al gore" as a single token. Is there anything that will allow me to do that, for a set of configurable "phrases"?
Is there another approach I'm overlooking? solr.CommonGramsFilterFactory perhaps?
Some more background info: we are using Solr 1.4.0.
Relevant portions of schema.xml
<!-- used for headline and caption -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="tagsText" class="solr.TextField" sortMissingLast="true" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Have you looked into the CommonGramsFilterFactory?
It combines multiple tokens into a single token, and it is usually used when searching for a phrase that contains stop words.
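A minimal sketch of how it might be wired into a field type, assuming the existing stopwords.it.txt doubles as the common-words list. The field type name text_commongrams is hypothetical and the rest of the original chain (word delimiter, stemming) is left out for brevity:

```xml
<fieldType name="text_commongrams" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits "al", "gore" and the joined token "al_gore" instead of dropping "al" -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Query-side counterpart: in a phrase query it prefers the joined tokens -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
  </analyzer>
</fieldType>
```

The common-grams filters take the place of StopFilterFactory here. If a stop filter ran first, "al" would already be gone before the pair could be formed. The benefit shows up mainly for phrase queries such as "al gore" in quotes.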