Solr - WordDelimiterFactory with Custom Tokenizer to split only on Boundaries

I have configured WordDelimiterFilterFactory with custom character types for & and -, which is working fine.
For a few other characters (such as . _ :) we need to split on word boundaries only, and not split when the character appears in the middle of a word.
e.g.
test.com (should be tokenized to test.com)
newyear. coming (should be tokenized to newyear and coming)
new_car (should be tokenized to new_car)
..
..
I checked that the types that can be used with solr.WordDelimiterFilterFactory are LOWER, UPPER, ALPHA, DIGIT, ALPHANUM and SUBWORD_DELIM. There is no description available for the use of each type. Going by the name, I thought the SUBWORD_DELIM type might fulfill my need, but it doesn't seem to work.
Below is the definition of the text field:
<fieldType name="text_general_preserved" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Below is the content of wdfftypes_general.txt:
& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM
Can anybody suggest how I can configure solr.WordDelimiterFilterFactory to fulfill my requirement?
Thanks.

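Based on the documentation for WordDelimiterFilterFactory, the SUBWORD_DELIM settings in the types file only affect behavior in combination with the splitOnCaseChange and splitOnNumerics settings. Therefore, I would add : _ . as ALPHA entries in the types file (wdfftypes_general.txt) so that they are never treated as delimiters, and then add a PatternReplaceFilterFactory after the WordDelimiterFilterFactory in your fieldType to remove those characters from the start or end of any token. Note that it has to be the token filter (PatternReplaceFilterFactory) rather than PatternReplaceCharFilterFactory, because char filters run before the tokenizer and cannot be placed after a token filter.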
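A minimal sketch of that suggestion, assuming wdfftypes_general.txt now maps the three characters with `_ => ALPHA`, `: => ALPHA` and `. => ALPHA`. The regular expression is illustrative and not part of the original answer, so check the result on the Analysis screen before relying on it.

```xml
<!-- Sketch only: place in both the index and query analyzers -->
<filter class="solr.WordDelimiterFilterFactory"
        splitOnCaseChange="0"
        splitOnNumerics="0"
        stemEnglishPossessive="0"
        generateWordParts="1"
        generateNumberParts="1"
        catenateWords="0"
        catenateNumbers="0"
        catenateAll="0"
        preserveOriginal="0"
        protected="protwords_general.txt"
        types="wdfftypes_general.txt"/>
<!-- Strip any . _ : left at the start or end of a token (illustrative regex) -->
<filter class="solr.PatternReplaceFilterFactory"
        pattern="^[._:]+|[._:]+$"
        replacement=""
        replace="all"/>
```

With this chain, a token such as newyear. should come out as newyear, while test.com and new_car stay intact. A token made up only of these characters would become empty, so a solr.LengthFilterFactory with min="1" after the replace filter may be worth adding.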

Related

How to configure SOLR for handling similar search keywords: `ABCWord`, `abc word`?

I have content with two kinds of titles: ABCWord and ABC Word. When I enter keywords such as abc-word or abc word in the search box, content titled ABC Word is found, but I also need to get content titled ABCWord.
I've tried to use solr.EdgeNGramFilterFactory and solr.WordDelimiterFilterFactory for this, but it seems I'm using them wrong.
My current schema.xml text field configuration:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
</fieldType>
You aren't using it wrong, but you may be using too many filters, and together they affect the final result.
The EdgeNGram filter should resolve your problem, since it creates tokens of 3 to 30 characters from your input. So "ABCWord" becomes "abc", "abcw", "abcwo", "abcwor" and "abcword", and a search for "abc" should then match.
First of all, I'd recommend changing the fieldType you use for n-grams, because they increase your index size considerably. It's better to create a new field type used only for the fields that really need it, instead of the "text" fieldType, which probably indexes other values that don't need n-grams.
Second, if your analyzer definition can be the same at index and query time, you don't need to duplicate the configuration: just use a single 'analyzer' element instead of 'analyzer type="index"' and 'analyzer type="query"'.
I strongly recommend checking the Analysis tab in the Solr admin UI to see how Solr processes the indexed and queried text for your input. You can also remove some of the filters from your fieldType config while you're trying to achieve a specific result; it helps to understand what each filter does to your input.
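As a rough sketch of the first recommendation, a dedicated field type for prefix matching could look like the following. The name text_prefix is hypothetical, and applying n-grams at index time only (so the query string is not itself expanded into n-grams) is a design choice of this sketch rather than something stated in the answer above.

```xml
<!-- Hypothetical field type; the name "text_prefix" is illustrative -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index front n-grams so "abcword" also yields "abc", "abcw", ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Only the fields that need prefix matching would use this type, which keeps the n-gram overhead out of the general "text" fields.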

To find an exact match in Solr 3.6.2

I am using Solr 3.6.2. Search matches on the prefix, the suffix and the middle of a word.
If I search for "20%", the search results contain "20%", "*0%" and "*20*". How do I exclude "*0%" and "*20*" from the search results and keep only the exact match "20%"? My schema.xml is below:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[.-_]" replacement=" "/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If you refer to the documentation on WordDelimiterFilterFactory, you can see that non-alphanumeric characters are discarded. For example, the string "20%50" will be broken into two tokens, "20" and "50".
A Solr wiki page covering WordDelimiterFilterFactory explains how to change this behavior. In summary, the query analyzer in your field type will need to change to:
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<!-- The last parameter to the next filter is new! -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" types="myTypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
The parameter types="myTypes.txt" specifies a file in which you control how special characters should be interpreted. Your myTypes.txt should be in the solr/conf directory, and its contents might look like this:
% => DIGIT
This causes the '%' to be treated as a digit. See the Solr wiki link above for more details.
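A hedged sketch of how the pieces might fit together in a complete field type. One assumption worth flagging: StandardTokenizerFactory generally discards punctuation such as % before any filter sees it, so this sketch swaps in WhitespaceTokenizerFactory to let the % reach WordDelimiterFilterFactory. The field type name text_exact is hypothetical, and the sketch deliberately leaves out n-gram filters, since those are what make partial matches possible.

```xml
<!-- Hypothetical field type; the name "text_exact" is illustrative -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Whitespace tokenizer (swapped in here) keeps "20%" as one token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- myTypes.txt contains the single line:  % => DIGIT  -->
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="0" catenateNumbers="0" catenateAll="0"
            splitOnCaseChange="1" types="myTypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because the same analyzer is used at index and query time here, a single analyzer element is enough. The Analysis screen in the Solr admin UI is the quickest way to confirm that "20%" survives as one token.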

Searching/Indexing SOLR documents with a quote in the field

I'm using SOLR for search on an e-commerce site.
Many products contain a dimension in the description, using the " notation for inches, and the ' for feet.
So I have 2 questions:
What analyzer/tokenizer would I use to add that to the Index, and
Would a simple addition to synonyms.txt (inch => " feet => ') work?
I ran into the same problem. My preference was to use the StandardTokenizer but it strips the ' and " and I could not find a way to add an exception. This meant synonyms, which are post-tokenizer, would be useless for the task. I searched for another tokenizer that would not strip the quotes and apostrophe but still be useful for "standard" tokenizing. I came up empty.
The solution I ended up going with was to use a charFilter before the tokenizer to change the " and ' to something else that was easier to work with. I used the PatternReplaceCharFilter to achieve this.
Since I am using the StandardTokenizer on the index and the query, I decided to also do this text replacement on both. In my case I wanted to be sure that the value was followed or preceded by white space. You can adjust the regex to your particular needs.
I should note that I do have the synonyms set as well (from my prior, failed, efforts). However, I am assuming that they are not playing a role in the case of these two characters, since they are being converted pre-tokenizer.
This also uses a PatternCaptureGroupFilter to help better index things like 1x1mm or 2.5"x15".
Analyzer
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)&quot;\s" replacement="$1 inch "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)&apos;\s" replacement="$1 feet "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)&quot;" replacement=" $1 inch"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)&apos;" replacement=" $1 feet"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.PatternCaptureGroupFilterFactory" pattern=".*(([0-9\.]+([a-z&quot;&apos;]?)x[0-9\.]+)([a-z&quot;&apos;]?))\s*" preserve_original="true"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="word-delim-special-chars.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)&quot;\s" replacement="$1 inch "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([\d\.]+)&apos;\s" replacement="$1 feet "/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)&quot;" replacement=" $1 inch"/>
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s([\d\.]+)&apos;" replacement=" $1 feet"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.WordDelimiterFilterFactory" splitOnNumerics="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1" types="word-delim-special-chars.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
I am including the following for clarity and thoroughness, but I do not believe they are playing a role in the final result (related to the quote and apostrophe).
word-delim-special-chars.txt
" => ALPHA
' => ALPHA
. => ALPHANUM
_ => ALPHA
synonyms.txt
",inch,inches,in.
feet,ft,',ft.,foot
oz,ounce,ounces,oz.
mm,millimeter,mm.,millimeters,mms
by,x
gram,g,grams
cm,centimeter,centimeters

Solr WildCard Search Issue

I am facing an issue with Solr search. Wildcard search seems to be working fine, but there are issues when I try to find terms within another word. For example, with "rtebiggestBug", when I search for biggest, it doesn't return any results. I have the following entries in the schema.xml file:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" preserveOriginal="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Another issue is that it does not find strings at the end of a word. Example: I searched for "bug" and found bugs, but not the word "samplebug".
It would be really helpful if you could help me with this issue.
Thanks in advance.
By default Solr does not support left truncation, i.e. searches such as *bug to find samplebug.
Use solr.ReversedWildcardFilterFactory to index each term in reversed form as well, e.g. gubelpmas.
Here is a tutorial: http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
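A minimal sketch of such a field type, assuming the filter is applied at index time only. The field type name text_rev is hypothetical, and the attribute values are illustrative starting points rather than tuned settings.

```xml
<!-- Hypothetical field type; the name "text_rev" is illustrative -->
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index each token reversed as well; withOriginal keeps the forward form -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With a field of this type, a query such as *bug can be answered from the reversed terms instead of scanning every term in the index.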
It seems your query parser is not able to handle leading wildcard searches.
Which query parser are you using?
The Extended DisMax (edismax) query parser supports searches with leading wildcards, so you may want to look into it.

"protected phrase" in Solr

A customer of mine is a photo agency specialized in photojournalism (well, and gossip), so many of their customers' searches revolve around specific people.
We index about 1.5m documents, with full-text search on headline and caption; and full-text search without stemming on tags. We have a decent list of stop words, and they provide a list of protected words that they feel are not stemmed correctly.
We are using Dismax to search over headline, caption and tags (with different boosts).
This is all working pretty nicely.
However, a few people are proving tricky to get right. For instance, Al Gore. In Italian "al" is a stop word, so a simple query for 'al gore' (without quotes) becomes:
+((DisjunctionMaxQuery((caption_text:gor | tags_text:gore^100.0 | headline_text:gor)))~1) ()
That does return hits for the ex VP, but of course also for "Lesley Gore" and "Tipper Gore"; and also, because of stemming, hits for "Gori" and more.
Leaving aside sorting for a second, it does clutter up results, and I'd like to do better.
Wrapping the search terms in quotes doesn't help, "al" gets stripped away anyway.
Marking "gore" as a protected word gets me halfway there, limiting the number of false positives.
I tried playing with SynonymFilterFactory too, but didn't get too far--I have the SynonymFilterFactory as the first filter, so "al" gets removed anyway.
What I think I really need is a way of tokenizing "al gore" as a single token. Is there anything that will allow me to do that, for a set of configurable "phrases"?
Is there another approach I'm overlooking? solr.CommonGramsFilterFactory perhaps?
Some more background info: we are using Solr 1.4.0.
Relevant portions of schema.xml
<!-- used for headline and caption -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="tagsText" class="solr.TextField" sortMissingLast="true" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Have you looked into the CommonGramsFilterFactory?
It combines multiple tokens into a single token, and is usually used when searching for a phrase that contains stop words.
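A rough sketch of how this could be wired up, assuming the common-grams filters take the place of the StopFilterFactory in the chain (if "al" is removed first, there is nothing left to pair with "gore"). The stop word file name is reused from the question; the rest of the chain is trimmed for brevity.

```xml
<analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Emits bigrams such as "al_gore" alongside the single tokens -->
  <filter class="solr.CommonGramsFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
</analyzer>
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- Query-side counterpart: prefers the bigram when a phrase contains a common word -->
  <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
</analyzer>
```

This mainly helps phrase queries such as "al gore" in quotes. A full reindex is needed after changing the index analyzer.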

Resources