I am using Solr 3.6.2. The search matches on the prefix, the suffix, and the middle of a word.
If I search for "20%", the results include "20%", "* 0%" and "* 20 *". How do I exclude "0% *" and "* 20 *" from the results and keep only the exact match "20%"? My schema.xml is below:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.PatternReplaceCharFilterFactory"
pattern="[.-_]" replacement=" "/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="back"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" side="front"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
According to the documentation for WordDelimiterFilterFactory, non-alphanumeric characters are treated as delimiters and discarded. For example, the string "20%50" is split into two tokens, "20" and "50".
A Solr wiki page covering WordDelimiterFilterFactory explains how to change this behavior. In summary, the query analyzer of your field type needs to change to:
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<!-- The last parameter to the next filter is new! -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" types="myTypes.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
The parameter types="myTypes.txt" specifies a file in which you control how special characters should be interpreted. Your myTypes.txt should be in the solr/conf directory, and its contents might look like this:
% => DIGIT
This causes the '%' to be treated as a digit. See the Solr wiki link above for more details.
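Two caveats are worth checking on the Analysis page before relying on this. First, as far as I know, StandardTokenizerFactory strips "%" before WordDelimiterFilterFactory ever sees it, so the types file only has an effect if the tokenizer keeps the character. Second, the prefix and suffix matches described in the question most likely come from the two EdgeNGramFilterFactory entries in the index analyzer. A minimal sketch of an exact-match field type, assuming a hypothetical name text_exact and the myTypes.txt file shown above:

```xml
<!-- Hypothetical field type for exact token matches; the name text_exact is an assumption -->
<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100">
  <!-- One analyzer for index and query time, so "20%" is analyzed the same way on both sides -->
  <analyzer>
    <!-- Whitespace tokenization keeps "%" attached to the token -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- myTypes.txt contains the line: % => DIGIT -->
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" types="myTypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because there is no n-gram filter, only whole tokens are indexed. Any change to the index analyzer requires reindexing before it shows up in search results.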
Related
I have content with two title types: "ABCWord" and "ABC Word". When I enter keywords such as abc-word or abc word in the search box, content titled "ABC Word" is found, but I also need to find content titled "ABCWord".
I've tried using solr.EdgeNGramFilterFactory and solr.WordDelimiterFilterFactory for this, but it seems I'm using them incorrectly.
My current schema.xml text field configuration:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
</fieldType>
You aren't using it wrong, but you may be using too many filters, and they may be affecting the final result.
EdgeNGramFilterFactory should solve your problem, since it creates tokens of length 3 to 30 from each input token. With side="front", "ABCWord" becomes "abc", "abcw", "abcwo", "abcwor" and "abcword", so a search for "abc" should match.
First, I'd recommend not adding n-grams to a general-purpose field type, because they increase the index size considerably. It's better to create a new field type and use it only for the fields that really need it, instead of the "text" fieldType, which probably also indexes other values that don't need n-grams.
Second, if your analyzer definition can be the same at index and query time, you don't need to duplicate the configuration: just use 'analyzer' instead of 'analyzer type="index"' and 'analyzer type="query"'.
I strongly recommend checking the Analysis tab in the Solr admin to see how Solr processes the indexed and queried text for your input. You can also remove some of the filters from your fieldType config while you work toward a specific result; that makes it easier to understand what each filter does to your input.
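As a rough illustration of the dedicated field type suggested above, here is a minimal sketch. The names text_prefix and title_prefix are hypothetical, and it assumes the schema already has a title field. It applies n-grams at index time only, which is one common arrangement rather than the only valid one:

```xml
<!-- Hypothetical field type used only where prefix matching is needed -->
<fieldType name="text_prefix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- "abcword" is indexed as abc, abcw, abcwo, abcwor, abcword -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-grams at query time: the query term is matched against the indexed grams -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Assumed field names; adjust to the real schema -->
<field name="title_prefix" type="text_prefix" indexed="true" stored="false"/>
<copyField source="title" dest="title_prefix"/>
```

With this layout the main "text" field stays free of n-grams, and title_prefix can be added to the query fields only for searches that need prefix matching.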
How can I make Solr give more relevance to words based on their position in the string?
For example, if I search for "Macbook", the first results are like "Case Logic LAPS-113 13.3-Inch Laptop / MacBook Air", and only after them comes "Apple MacBook Pro MD101LL/A 13.3-Inch ".
This is my field declaration:
<fieldType name="text_pt" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/index_synonyms.txt" ignoreCase="true" expand="false"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_pt.txt" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_pt.txt"/>
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms.txt" ignoreCase="true" expand="false"/>
</analyzer>
</fieldType>
What if the product name were "MacBook/Dell/Lenovo Laptop cheap case"? It contains "MacBook" in the first position; would you still want to boost this document?
I think you should try to fix the root cause instead: the common problem of accessories (such as 'case', 'battery', 'lock', etc.) scoring higher than the products themselves.
The best choice is to index a field that says whether the doc is an accessory, and boost the docs that are not accessories. (I gather you don't have that information; if you do, this is the way to go.)
If you don't have that information, you can try penalizing the docs that contain 'typical' accessory words. For this you need to build a list of such words, but that is not hard. I have used this approach with good results.
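Both options can be expressed with a boost query (bq) on the dismax or edismax parser. The fragment below is only a sketch for solrconfig.xml: the handler name, the field names name and is_accessory, the word list and the boost value are all assumptions. Since boost queries add to the score rather than subtract from it, the "penalty" variant boosts every document that does not contain an accessory word:

```xml
<!-- solrconfig.xml sketch; handler name, field names and boost values are assumptions -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="qf">name</str>
    <!-- Variant 1: a hypothetical boolean field is_accessory exists; boost non-accessories -->
    <str name="bq">is_accessory:false^5</str>
    <!-- Variant 2: no such field; boost docs whose name contains none of the accessory words.
    <str name="bq">(*:* -name:(case OR battery OR lock))^5</str>
    -->
  </lst>
</requestHandler>
```

The boost value usually needs tuning against real queries; debugQuery=on shows how much the bq clause contributes to each score.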
In Apache Solr 3.6, when I search for USC with highlighting enabled, why doesn't it also pick up U.S.C. in the highlighted results?
The field type is the following:
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I want SOLR to return U.S.C. as well as USC in the highlighted search results.
However it's returning just USC:
<response><lst name="responseHeader"><int name="status">0</int><int name="QTime">7</int><lst name="params"><str name="explainOther"/><str name="fl">*,score</str><str name="indent">on</str><str name="start">0</str><str name="q">USC</str><str name="hl.fl">*</str><str name="wt"/><str name="fq"/><str name="hl">on</str><str name="version">2.2</str><str name="rows">10</str></lst></lst><result name="response" numFound="1" start="0" maxScore="0.047945753"><doc><float name="score">0.047945753</float><str name="id">978-064172344522</str><arr name="title"><str>my link power-shot PowerShot USC Utility <br>hello</br> Rejections Under 35 U.S.C. 101 and 35 U.S.C. 112, First Paragraph Petitions to correct inventorship of an issued patent are decided by the <Underline>Supervisory Patent Examiner</Underline>, as set forth</str></arr></doc></result><lst name="highlighting"><lst name="978-064172344522"><arr name="title"><str>my link power-shot PowerShot <em>USC</em> Utility <br>hello</br> Rejections Under</str></arr></lst></lst></response>
If you go to the Analysis page in the Solr admin and run the string "U.S.C." through the text_en_splitting field type, you will see that it gets indexed as three separate tokens: u, s, and c. Experiment with the attributes of WordDelimiterFilterFactory (perhaps catenateAll) and see whether you can get it indexed as a single token, usc, instead of three separate tokens. If that doesn't work, you may have to extend the tokenizer to handle your case.
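As a starting point for that experiment, the index-time filter could be changed as sketched below. This is an untested assumption about what helps here, so the result should be confirmed on the Analysis page:

```xml
<!-- Index analyzer only: catenateAll="1" additionally emits all subword parts joined into one token -->
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" splitOnCaseChange="1"/>
```

The documents have to be reindexed after the change. Separately, if both forms already match but only one is highlighted, it may be worth checking the hl.snippets and hl.fragsize parameters, which limit how many fragments are returned per field and how long they are.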
I am facing an issue with Solr search. Wildcard search seems to work fine, but there are problems when I try to find terms inside another word. For example, with "rtebiggestBug", a search for biggest returns no results. I have the following entries in schema.xml:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" preserveOriginal="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Another issue is that it does not find strings at the end of a word. Example: I searched for "bug" and it found "bugs" but not the word "samplebug".
It would be really helpful if you could help me with this issue.
Thanks in advance.
By default, Solr does not support left truncation, i.e. searches such as *bug to find samplebug.
Use solr.ReversedWildcardFilterFactory to reverse each term and index it in reversed form, like gubelpmas.
Here is a tutorial: http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
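A minimal sketch of such a field type is shown below. The name text_rev and the attribute values follow the pattern used in the Solr example schema, as far as I remember it, and should be treated as assumptions to adapt:

```xml
<!-- Hypothetical field type that also indexes reversed tokens for leading-wildcard queries -->
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- withOriginal="true" keeps the normal token next to the reversed one -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <!-- The filter is configured at index time only -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With the reversed tokens in the index, a query such as *bug can be executed as a prefix search on the reversed term instead of scanning every term in the field.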
It seems your query parser is not able to handle leading wildcard searches.
Which query parser are you using?
The Extended DisMax (edismax) query parser supports searches with leading wildcards, so you may want to try it.
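Neither suggestion above covers the first part of the question: finding biggest inside rtebiggestBug without typing any wildcard. One common option is to index n-grams from every position with NGramFilterFactory instead of EdgeNGramFilterFactory. A minimal sketch, with the field type name and the gram sizes as assumptions:

```xml
<!-- Hypothetical field type for infix ("contains") matching -->
<fieldType name="text_infix" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Emits every substring of length 3 to 15, so "rtebiggestbug" also yields "biggest" and "bug" -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <!-- Query terms are not split into grams; they are matched against the indexed grams -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The trade-off is index size, which grows quickly with n-grams. Query terms shorter than minGramSize or longer than maxGramSize will not match anything in this field.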
A customer of mine is a photo agency specialized in photojournalism (well, and gossip), so many of their customers' searches revolve around specific people.
We index about 1.5m documents, with full-text search on headline and caption; and full-text search without stemming on tags. We have a decent list of stop words, and they provide a list of protected words that they feel are not stemmed correctly.
We are using DisMax to search over headline, caption and tags, with different boosts.
This is all working pretty nicely.
However, a few people are proving tricky to get right. For instance, Al Gore. In Italian "al" is a stop word, so a simple query for `al gore' (without quotes) becomes:
+((DisjunctionMaxQuery((caption_text:gor | tags_text:gore^100.0 | headline_text:gor)))~1) ()
That does return hits for the ex VP, but of course also for "Lesley Gore" and "Tipper Gore"; and also, because of stemming, hits for "Gori" and more.
Leaving aside sorting for a second, it does clutter up results, and I'd like to do better.
Wrapping the search terms in quotes doesn't help, "al" gets stripped away anyway.
Marking "gore" as a protected word gets me halfway there, limiting the number of false positives.
I tried playing with SynonymFilterFactory too, but didn't get too far--I have the SynonymFilterFactory as the first filter, so "al" gets removed anyway.
What I think I really need is a way of tokenizing "al gore" as a single token. Is there anything that will allow me to do that, for a set of configurable "phrases"?
Is there another approach I'm overlooking? solr.CommonGramsFilterFactory perhaps?
Some more background info: we are using Solr 1.4.0.
Relevant portions of schema.xml
<!-- used for headline and caption -->
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="Italian" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
<fieldType name="tagsText" class="solr.TextField" sortMissingLast="true" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.it.txt"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Have you looked into CommonGramsFilterFactory?
It combines a common word (such as a stop word) with its neighboring token into a single token, and it is usually used when searching for a phrase that contains stop words.
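A minimal sketch of how this could look, assuming a hypothetical field type name text_commongrams and reusing the stopwords.it.txt file from the question. CommonGramsFilterFactory takes the place of StopFilterFactory here, so "al" is kept and paired with its neighbor instead of being dropped:

```xml
<!-- Hypothetical field type; the name and the choice of word list are assumptions -->
<fieldType name="text_commongrams" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Emits the single tokens plus pairs such as "al_gore" for words in the list -->
    <filter class="solr.CommonGramsFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Query-side variant: prefers the combined tokens when a common word is present -->
    <filter class="solr.CommonGramsQueryFilterFactory" words="stopwords.it.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Stemming is left out of the sketch because it is unclear how the Italian stemmer would treat a combined token such as "al_gore"; the Analysis page is the place to check that before adding it back. A phrase query for "al gore" on such a field should then match the combined token rather than every document containing "gore".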