I am currently using solr edismax to do searches on our website. What I'm looking to do, is essentially have dashes get ignored.
So if I search the words, "wi-fi adapter". And I have a document, with a title, "wifi adapter". I'll get no results.
I am currently using solr.MappingCharFilterFactory to map dashes to spaces. This is what my text_general fieldtype looks like in my schema.
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
</fieldType>
My mapping.txt contains the line..
"-" => " "
So what this rule does, is it converts the dashes to a space.
So if I search "wi fi adapter", it will always show the same results as "wi fi adapter", but won't show results for "wifi adapter".
Is there any way to treat dashes like this? Essentially I'd want to treat "wifi adapter", "wi-fi adapter", and "wi fi adapter" the same.
You can use the WordDelimiterGraphFilterFactory for your analyzer. It has lot many attributes that could be used. I have listed few.
The WordDelimiterGraphFilterFactory has many attributes.
generateWordParts : (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" → "Camel", "Case", "hot", "spot"
preserveOriginal : (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" → "Zap-Master-9000", "Zap", "Master", "9000"
catenateWords : (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" → "hotspotsensor"
So in your case it would be like
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- Splits words based on whitespace characters -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- splits words at delimiters based on different arguments -->
<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
<!-- Transforms text to lower case -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
The more information on it would be found at Fiters available in solr
Related
I'm having unexpected results with the solr.SimplePatternSplitTokenizerFactory. The pattern used is actually from an example in the SOLR documentation and I do not understand where I made a mistake or why it does not work as expected.
If we take the example input "operative", the analyzer shows that during indexing, the input gets split into the tokens "ope", "a" and "ive", that is the tokenizer splits at the characters "r" and "t", and not at the expected whitespace characters (CR, TAB). Just to be sure I also tried to use more than one backspace in the pattern (e.g. \t and \\t), but this did not change how the input is tokenized during indexing.
What am I missing?
SOLR version used is 7.5.0.
The definition of the field type in the schema is as follows:
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[ \t\r\n]+"/>
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Update found this post on the "Solr - User" mailing list archive:
http://lucene.472066.n3.nabble.com/Solr-Reference-Guide-issue-for-simplified-tokenizers-td4385540.html
Seems the documentation (or the example) is not correct/working. The following usage of the tokenizer is working as intended:
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[
]+"/>
Found this post on the "Solr - User" mailing list archive: http://lucene.472066.n3.nabble.com/Solr-Reference-Guide-issue-for-simplified-tokenizers-td4385540.html
Seems the documentation (or the example) is not correct/working. The following usage of the tokenizer is working as intended:
<tokenizer class="solr.SimplePatternSplitTokenizerFactory" pattern="[
]+"/>
I have the following field within my SOLR configure:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Within the field I could be storing:
Spiderman, Spider-man, Spider man
What I would like is for someone who searches for spiderman to get all 3 options and ideally someone who searches spider-man to get all 3 options. Apart from amending the content when it is indexed is there another way to effectively ignore special characters but not necessarily split on them?
One of the possible solutions, especially if the number of delimeter character is small is to replace them via solr.PatternReplaceFilterFactory like this:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
If keyword tokenizer is a bad option, since it will preserve one token (which could be okay for a field like title), you could either create your own tokenizer, which will split title only on needed symbols or add additional filters like ngram to allow partial match on the title field.
I know this is an old post, but the correct answer here is you should add "Spiderman, Spider-man, Spider man" to your synonyms.txt file and restart solr. If this still doesn't work, make sure your schema uses the SynonymGraphFilterFactory analyzer. What you've described here is synonyms.
I'm using solr 4.10.3. I tried to configure Solr to ignore dashes in searches:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- sonderzeichen .,-\/ ignorieren -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- enthaelt u-umlaut -> u, lowercase und uft8 decomposed -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- sonderzeichen .,-\/ ignorieren -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- enthaelt u-umlaut -> u, lowercase und uft8 decomposed -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldtype>
I have an entry "pan-pan, peter", which is found, if i search
(peter pa*)
(peter panpa*)
or even
(pe-te-r panpa*)
also
(peter pa-n-pa-n)
(without *) matches.
but
(peter pan-p*)
(peter pan\-p*)
gives no result.
It seems as if the combination of dash and * is a problem?
I'd like to find "pan-pan, peter" in every stage of typing "peter pan-pan"...
Try using the below field type.
<fieldType name="text_delimeter" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I tried with your text and analysed the same. I found the above type would work for you. I have analysed the same in the tool as well.
I am using solr as a search engine. I have a case where a text field contains accent text like "María". When user search with "María", it is giving resut. But when user search with "Maria" it is not giving any result.
My schema definition looks like below:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.
If you're on solr > 3.x you can try using solr.ASCIIFoldingFilterFactory which will change all the accented characters to their unaccented versions from the basic ascii 127-character set.
Remember to put it after any stemming filter you have configured (you're not using one, so you should be ok).
So your config could look like:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>
Answering here because it's the first result that pop when searching "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by #soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line in the text_en fieldType.
Screenshot 1, screenshot 2.
Solr 4.3.0:
I want to enter "Basel 3" in the Solr-searchfield and want to get all results for either "Basel 3" or "Basel III". I'm on the right way, because the analyzer shows a match with this query:
http://localhost/solr/analysis/field?analysis.fieldvalue=Basel III&q=Basel:3&analysis.fieldname=description&analysis.showmatch=true
Also a result of the search for "Basel 3" shows a correct highlighting of "Basel III", which happen to be in the same result.
But, and that's my problem now, all the results containing only "Basel III" are missing.
schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigramsIfNoShingles="true" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" tokenizerFactory="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.ShingleFilterFactory" minShingleSize="2" maxShingleSize="2" outputUnigramsIfNoShingles="true"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
synonyms.txt:
Basel 3, Basel III
"toggle parsed query" shows:
(+DisjunctionMaxQuery((title:"(basel basel 3) 3"^2.5 | recruiter:basel 3^1.5 | description:"(basel basel 3) 3"^0.5)))/no_coord
EDIT:
There seems to be a whole family of open issues around "multiword synonyms":
- https://issues.apache.org/jira/browse/LUCENE-2605
- https://issues.apache.org/jira/browse/LUCENE-1622
- https://issues.apache.org/jira/browse/SOLR-4381
(...)
A very interesting article about this subject: http://nolanlawson.com/2012/10/31/better-synonym-handling-in-solr/