I have a Solr instance configured for French content. Search works fine, but when I enable faceting, the facet terms are truncated in an odd way.
Every trailing e disappears: for example, automobil instead of automobile, montagn instead of montagne, styl instead of style, homm instead of homme, etc.
<lst name="keywords">
<int name="automobil">1</int>
<int name="citroen">1</int>
<int name="minist">0</int>
<int name="polit">0</int>
<int name="pric">0</int>
<int name="shinawatr">0</int>
<int name="thailand">0</int>
</lst>
Here is the query: q=fulltextfield:champpions&facet=true&facet.field=keywords
The content of the keywords field:
<arr name="keywords">
<str>Ski</str>
<str>sport</str>
<str>Free style</str>
<str>automobile</str>
<str>Rallye</str>
<str>Citroen</str>
<str>montagne</str>
</arr>
Here is the schema used:
<fieldtype name="text_fr" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" />
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_fr.txt"/>
<filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.ISOLatin1AccentFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="French"/>
</analyzer>
</fieldtype>
The field definition:
If anybody has an idea about this issue, I would appreciate it.
Thanks for your answer.
regards
Jerome longet
Generally, if you want to use a field as a facet, it should be a string field (solr.StrField) rather than an analyzed text field.
You're faceting on a tokenized and filtered field, so the facet values are the processed terms of your keywords field, here the output of the French Snowball stemmer.
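A minimal sketch of that setup, assuming a hypothetical extra field named keywords_facet (the name, and the exact attributes of your existing keywords field, are assumptions). The analyzed field stays available for searching, while an untokenized copy is used only for faceting:

```xml
<!-- Sketch only: keywords_facet is a hypothetical field name. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>

<!-- Analyzed field, still used for full-text search (stemmed by text_fr). -->
<field name="keywords" type="text_fr" indexed="true" stored="true" multiValued="true"/>

<!-- Untokenized copy, used only as the facet field. -->
<field name="keywords_facet" type="string" indexed="true" stored="false" multiValued="true"/>

<!-- copyField copies the raw input value, before any analysis is applied. -->
<copyField source="keywords" dest="keywords_facet"/>
```

With this in place the query would facet on the copy, for example facet.field=keywords_facet, and values such as automobile and Free style should come back unchanged after reindexing.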
Everything said above is correct; I just want to add one thing about facets. The facet values are the indexed terms, not the stored ones. The usual recommendation for facets is to use a string type, which is often a good choice. But sometimes you want to apply some processing to your facet terms. In that case you can use a text type, but process the input only lightly. In any case, avoid the stemming (SnowballPorter) and WordDelimiter filters you chose above.
A good starting point is KeywordTokenizerFactory; you can use PatternReplace to clean up your input terms and add a TrimFilter at the end. Don't lowercase if your users are going to see the terms.
An example: my input consists of alphabetic language codes. The first PatternReplace removes non-alphabetic characters; the second corrects a known input mistake:
`
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="([^a-z])"
replacement=""
replace="all" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="fer|xxx"
replacement="und"
replace="all" />
<filter class="solr.LengthFilterFactory" min="3" max="3" />
</analyzer>
`
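For facet values that users will see, such as the keywords in the question, a lighter variant might look like the following. This is only a sketch: the type name facet_text and the whitespace-collapsing pattern are assumptions, not part of the original setup.

```xml
<!-- Sketch only: facet_text is a hypothetical type name. -->
<fieldType name="facet_text" class="solr.TextField" omitNorms="true">
  <analyzer>
    <!-- Keep the whole input value as a single token. -->
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- Collapse runs of whitespace into one space. -->
    <filter class="solr.PatternReplaceFilterFactory"
            pattern="\s+" replacement=" " replace="all"/>
    <!-- Strip leading and trailing whitespace; no lowercasing, no stemming. -->
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```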
Have fun with solr
Oliver
I have content with two kinds of titles: ABCWord and ABC Word. When I enter keywords like abc-word or abc word in the search box, content titled ABC Word is found, but I also need to get content titled ABCWord.
I've tried solr.EdgeNGramFilterFactory and solr.WordDelimiterFilterFactory for this, but it seems I'm using them wrong.
My current schema.xml text field configuration:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<!-- Case insensitive stop word removal.
add enablePositionIncrements=true in both the index and query
analyzers to leave a 'gap' for more accurate phrase queries.
-->
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.WordDelimiterFilterFactory"
protected="protwords.txt"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
splitOnCaseChange="0"
preserveOriginal="1"/>
<filter class="solr.LengthFilterFactory" min="2" max="100" />
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="front"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30" side="back"/>
</analyzer>
</fieldType>
You aren't using it wrong, but you may be using too many filters, and they are affecting the final result.
The EdgeNGram filter should solve your problem, since it creates tokens of length 3 to 30 from your input. So "ABCWord" becomes "abc", "abcw", "abcwo", "abcwor" and "abcword", and a search for "abc" should then match.
First of all, I'd recommend changing the fieldType you use for ngrams, because they increase your index size a lot. It's better to create a new field type and use it only for the fields that really need it, instead of the "text" fieldType, which probably indexes other values where you don't need ngrams.
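A dedicated ngram type could be sketched roughly as follows. The names text_edge_ngram, title and title_ngram are hypothetical, and applying the ngram filter only at index time is a design choice made for this sketch, not something taken from the original configuration:

```xml
<!-- Sketch only: all type and field names here are hypothetical. -->
<fieldType name="text_edge_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index prefixes of 3 to 30 characters: abc, abcw, abcwo, ... -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <!-- No ngrams at query time, so the typed term is matched as a whole. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Only the fields that need prefix matching use the ngram type. -->
<field name="title_ngram" type="text_edge_ngram" indexed="true" stored="false"/>
<copyField source="title" dest="title_ngram"/>
```

Other fields keep the regular "text" type, so the index only grows for the fields copied into the ngram field.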
Second, if your analyzer definition can be the same at index and query time, you don't need to duplicate the configuration; just use a single 'analyzer' element instead of 'analyzer type="index"' and 'analyzer type="query"'.
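As an illustration of the shorter form, a field type with one shared analyzer might look like this (the type name text_simple and the filter selection are assumptions made for the sketch):

```xml
<!-- Sketch only: one <analyzer> without a type attribute is used
     for both indexing and querying. -->
<fieldType name="text_simple" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```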
I strongly recommend checking the Analysis tab in the Solr admin UI to see how Solr processes the indexed and queried text for your input. You can also remove filters from your fieldType config one at a time while you work toward a specific result. It's better to understand what each filter does to your input.
Can anyone explain how stop words work in Solr?
In my stopwords.txt I have defined the word "of". In schema.xml I have:
<filter class="solr.StopFilterFactory" ignoreCase="true"
words="stopwords.txt" enablePositionIncrements="true"/>
Now when I search for anything that contains the word "of", nothing shows up in the results.
Example: oil of olay shows no results, whereas oil olay shows the correct results.
More of the field type definition:
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
splitOnCaseChange="0"
splitOnNumerics="0"
types="wdtypes.txt"
/>
<filter class="solr.KeywordRepeatFilterFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.TrimFilterFactory" updateOffsets="false"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="stopwords.txt"
enablePositionIncrements="true"
/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory"
generateWordParts="1"
generateNumberParts="1"
catenateWords="1"
catenateNumbers="1"
catenateAll="1"
preserveOriginal="1"
splitOnCaseChange="0"
splitOnNumerics="0"
types="wdtypes.txt"
/>
<filter class="solr.EnglishMinimalStemFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
When debugging:
+(upclist:cream+of+wheat&qt=productresults&rows=10&fq=status%3AActive&fq=facilitystatus%3AActive&fq=facilityid%3A100&fq=inventoryctrlcode%3A%5B0+TO+100%5D&fq=weblifecycle%3A%283+OR+4%29&fq=groupnumber%3A2^1.2 | keywords:cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2^20.0 | product_elevate:cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2^5.0 | area:"(cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2 cream) of wheat qt productresult (row creamofwheatqtproductresultsrow) 10 fq status%3aactive fq facilitystatus%3aactive fq facilityid%3a100 fq inventoryctrlcode%3a%5b0 (to fqstatus%3aactivefqfacilitystatus%3aactivefqfacilityid%3a100fqinventoryctrlcode%3a%5b0to) 100%5d fq weblifecycle%3a%283 (or fqweblifecycle%3a%283or) 4%29 fq (groupnumber%3a2 fqgroupnumber%3a2 creamofwheatqtproductresultsrows10fqstatus%3aactivefqfacilitystatus%3aactivefqfacilityid%3a100fqinventoryctrlcode%3a%5b0to100%5dfqweblifecycle%3a%283or4%29fqgroupnumber%3a2)"~3^2.5 | productid:cream+of+wheat&qt=productresults&rows=10&fq=status%3AActive&fq=facilitystatus%3AActive&fq=facilityid%3A100&fq=inventoryctrlcode%3A%5B0+TO+100%5D&fq=weblifecycle%3A%283+OR+4%29&fq=groupnumber%3A2^1.7 | productname:cream+of+wheat&qt=productresults&rows=10&fq=status%3aactive&fq=facilitystatus%3aactive&fq=facilityid%3a100&fq=inventoryctrlcode%3a%5b0+to+100%5d&fq=weblifecycle%3a%283+or+4%29&fq=groupnumber%3a2^10.0)~0.01 ()
This might not be relevant, since you say you were searching on only one field, but I'm posting it anyway because you say you are using edismax and qf. I had a similar issue when I wanted to boost exact matches, so I set qf to something like this: <str name="qf">title^45 title_str^55</str>. The title field used stop words and title_str obviously did not. The reason it would often fail to find searches containing stop words is described here. Their solution was to play with the mm values. The solution that worked in my case was to move title_str into the pf parameter (and remove it from qf), so that exact matches come to the top.
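In solrconfig.xml that change could look roughly like this. The handler name /select and the placement in the defaults section are assumptions; the boosts are the ones quoted above:

```xml
<!-- Sketch only: handler name and placement are assumptions. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <!-- Only the stop-word-filtered field takes part in term matching. -->
    <str name="qf">title^45</str>
    <!-- The unanalyzed field is used only to boost phrase (exact) matches. -->
    <str name="pf">title_str^55</str>
  </lst>
</requestHandler>
```

Because pf only affects scoring, a stop word in the query can no longer produce a required clause that exists for title_str alone.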
I finally resolved this issue by changing "mm" from 2<-25% to 2<-36%.
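For a three-word query such as oil of olay, the effect of the two values can be worked through as below. This is a sketch that assumes mm is set in the request handler defaults and relies on the documented rule that the number computed from a percentage is rounded down. Note that the < character has to be written as &lt; inside solrconfig.xml.

```xml
<!--
  Spec 2<-36% reads: with 1 or 2 optional clauses, all are required;
  with more than 2, up to 36% of them (rounded down) may be missing.

  For 3 clauses (oil, of, olay):
    2<-25% gives 25% of 3 = 0.75, rounded down to 0 missing, so 3 of 3 are required
    2<-36% gives 36% of 3 = 1.08, rounded down to 1 missing, so 2 of 3 are required
-->
<str name="mm">2&lt;-36%</str>
```

With the old value, the clause for "of" still had to match even in fields where the stop filter removed it; with the new value one clause may be missing.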
I have configured WordDelimiterFilterFactory with custom types for & and -, which works fine.
For a few other delimiters (such as . _ :) we need to split only at word boundaries, and not when they occur inside a word.
e.g.
test.com (should be tokenized to test.com)
newyear. coming (should be tokenized to newyear and coming)
new_car (should be tokenized to new_car)
..
..
I checked that the types that can be used with solr.WordDelimiterFilterFactory are LOWER, UPPER, ALPHA, DIGIT, ALPHANUM, and SUBWORD_DELIM. There is no description of what each type does. Going by the name, I thought SUBWORD_DELIM might fulfill my need, but it doesn't seem to work.
Below is the definition of the text field:
<fieldType name="text_general_preserved" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false" />
<filter class="solr.WordDelimiterFilterFactory"
splitOnCaseChange ="0"
splitOnNumerics ="0"
stemEnglishPossessive ="0"
generateWordParts="1"
generateNumberParts="1"
catenateWords="0"
catenateNumbers="0"
catenateAll="0"
preserveOriginal="0"
protected="protwords_general.txt"
types="wdfftypes_general.txt"
/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Below is the content of wdfftypes_general.txt:
& => ALPHA
- => ALPHA
_ => SUBWORD_DELIM
: => SUBWORD_DELIM
. => SUBWORD_DELIM
Can anybody suggest how I can configure solr.WordDelimiterFilterFactory to fulfill my requirement?
Thanks.
Based on the documentation for WordDelimiterFilterFactory, the SUBWORD_DELIM settings in the types file only affect behavior governed by the splitOnCaseChange and splitOnNumerics settings. Therefore, I would add : _ . as ALPHA entries in your wdfftypes_general.txt file and add a PatternReplaceFilterFactory after the WordDelimiterFilterFactory in your fieldType to remove those characters from the start or end of any token.
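A sketch of that approach is below. It uses the token filter PatternReplaceFilterFactory, since char filters run before the tokenizer and so cannot follow the WordDelimiterFilterFactory. The regular expression and the final LengthFilterFactory are assumptions added for this sketch and should be checked on the Analysis page.

```xml
<!-- wdfftypes_general.txt (sketch): treat the characters as ordinary letters
     so the filter no longer splits on them.
       & => ALPHA
       - => ALPHA
       _ => ALPHA
       : => ALPHA
       . => ALPHA
-->
<analyzer type="index">
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
  <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="0"
          generateWordParts="1" generateNumberParts="1"
          catenateWords="0" catenateNumbers="0" catenateAll="0"
          preserveOriginal="0"
          protected="protwords_general.txt"
          types="wdfftypes_general.txt"/>
  <!-- Strip . _ : only at the start or end of a token. -->
  <filter class="solr.PatternReplaceFilterFactory"
          pattern="^[._:]+|[._:]+$" replacement="" replace="all"/>
  <!-- Optional: drop tokens that became empty, e.g. a lone ".". -->
  <filter class="solr.LengthFilterFactory" min="1" max="255"/>
  <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
```

Under these assumptions test.com and new_car should pass through unchanged, while "newyear." should be reduced to newyear. The same two filters would be added to the query analyzer.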
In Apache Solr 3.6, when searching for USC with highlighting enabled, why doesn't it also pick up U.S.C. in the highlighted results?
The field type is the following:
<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
<analyzer type="index">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
<analyzer type="query">
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory"
ignoreCase="true"
words="lang/stopwords_en.txt"
enablePositionIncrements="true"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>
I want Solr to return U.S.C. as well as USC in the highlighted search results.
However, it returns only USC:
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">7</int>
<lst name="params">
<str name="explainOther"/>
<str name="fl">*,score</str>
<str name="indent">on</str>
<str name="start">0</str>
<str name="q">USC</str>
<str name="hl.fl">*</str>
<str name="wt"/>
<str name="fq"/>
<str name="hl">on</str>
<str name="version">2.2</str>
<str name="rows">10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0" maxScore="0.047945753">
<doc>
<float name="score">0.047945753</float>
<str name="id">978-064172344522</str>
<arr name="title">
<str>my link power-shot PowerShot USC Utility <br>hello</br> Rejections Under 35 U.S.C. 101 and 35 U.S.C. 112, First Paragraph Petitions to correct inventorship of an issued patent are decided by the <Underline>Supervisory Patent Examiner</Underline>, as set forth</str>
</arr>
</doc>
</result>
<lst name="highlighting">
<lst name="978-064172344522">
<arr name="title">
<str>my link power-shot PowerShot <em>USC</em> Utility <br>hello</br> Rejections Under</str>
</arr>
</lst>
</lst>
</response>
If you go to the Analysis page in Solr and run the string "U.S.C." against the text_en_splitting fieldType, you will see that it is indexed as three separate tokens: u, s, and c. Experiment with the attributes of the WordDelimiterFilterFactory (perhaps catenateAll) and see whether you can get it indexed as usc (one token) instead of three separate tokens. If that doesn't work, you may have to extend the tokenizer to handle your case.
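As a starting point for that experiment, the index-time filter could be changed as sketched here. Only catenateAll differs from the configuration in the question, and whether it actually yields a single usc token is an assumption to verify on the Analysis page before reindexing.

```xml
<!-- Sketch only: index-time filter with catenateAll switched on. -->
<filter class="solr.WordDelimiterFilterFactory"
        generateWordParts="1" generateNumberParts="1"
        catenateWords="1" catenateNumbers="1"
        catenateAll="1"
        splitOnCaseChange="1"/>
```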
I am facing an issue with Solr search. Wildcard search seems to work fine, but there are problems when I try to find a term inside another word. For example, with "rtebiggestBug", a search for biggest returns no results. I have the following entries in the schema.xml file:
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- in this example, we will only use synonyms at query time
<filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
-->
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/> -->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="50" preserveOriginal="1"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
<!-- <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>-->
<filter class="solr.LowerCaseFilterFactory"/>
<!-- <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt"/>-->
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
Another issue is that it does not find strings at the end of a word. Example: I searched for "bug" and found bugs, but not the word "samplebug".
It would be really helpful if you could help me with this issue.
Thanks in advance.
By default Solr does not support left truncation (leading wildcards), such as searching for *bug to find samplebug.
Use solr.ReversedWildcardFilterFactory to index each term in reversed form as well, for example gubelpmas for samplebug.
Here is a tutorial: http://solr.pl/en/2010/12/20/wildcard-queries-and-how-solr-handles-them/
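A field type using this filter might be sketched as follows. The type name text_rev is hypothetical, and the attribute values follow the commonly shipped example schema rather than anything tuned for this index:

```xml
<!-- Sketch only: text_rev is a hypothetical type name. -->
<fieldType name="text_rev" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- withOriginal="true" indexes both the term and its reversed form,
         so ordinary searches keep working. -->
    <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true"
            maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
  </analyzer>
  <analyzer type="query">
    <!-- The filter is applied only at index time. -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

After reindexing, a query such as *bug on a field of this type can be answered as a prefix search on the reversed terms.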
It seems your query parser is not able to handle leading wildcard searches.
Which query parser are you using?
The Extended DisMax query parser supports searches with leading wildcards, so you may want to check it out.