I have the following field type in my Solr configuration:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Within the field I could be storing:
Spiderman, Spider-man, Spider man
What I would like is for someone who searches for spiderman to get all 3 options, and ideally for someone who searches for spider-man to get all 3 options as well. Apart from amending the content when it is indexed, is there another way to effectively ignore special characters without necessarily splitting on them?
One possible solution, especially if the number of delimiter characters is small, is to replace them via solr.PatternReplaceFilterFactory, like this:
<fieldType name="title" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.PatternReplaceFilterFactory" pattern="-" replacement=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern=" " replacement=""/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
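If more delimiter characters need to be handled, the two replace filters above could likely be collapsed into a single character-class pattern (the regex here is only illustrative):
  <filter class="solr.PatternReplaceFilterFactory" pattern="[\- ]" replacement=""/>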
If the keyword tokenizer is not a good fit — it keeps the whole value as a single token, which may be acceptable for a field like a title — you could either create your own tokenizer that splits the title only on the symbols you need, or add additional filters such as n-grams to allow partial matches on the title field.
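For illustration, one way to read that n-gram suggestion is to build on the keyword-tokenizer sketch above and n-gram the collapsed value at index time; the field name and gram sizes below are only illustrative:
<fieldType name="title_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <!-- collapse delimiters so "Spider-man" and "Spider man" are indexed like "Spiderman" -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[\- ]" replacement=""/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- index substrings of the collapsed title to allow partial matches -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="15"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="[\- ]" replacement=""/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>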
I know this is an old post, but the correct answer here is that you should add "Spiderman, Spider-man, Spider man" to your synonyms.txt file and restart Solr. If this still doesn't work, make sure your schema uses the SynonymGraphFilterFactory filter in its analyzer. What you've described here is synonyms.
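For reference, a rough sketch of that synonym approach (the file name and filter placement follow the stock Solr examples; adjust to your own query analyzer). Add a line to synonyms.txt:
spiderman, spider-man, spider man
and reference it from the query analyzer of the field type:
<filter class="solr.SynonymGraphFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>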
Related
I am currently using Solr edismax to do searches on our website. What I'm looking to do is essentially have dashes ignored.
So if I search for "wi-fi adapter" and I have a document with the title "wifi adapter", I'll get no results.
I am currently using solr.MappingCharFilterFactory to map dashes to spaces. This is what my text_general fieldType looks like in my schema:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100" multiValued="true">
<analyzer type="index">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
<charFilter class="solr.MappingCharFilterFactory" mapping="mapping.txt"/>
<filter class="solr.SynonymGraphFilterFactory" expand="true" ignoreCase="true" synonyms="synonyms.txt"/>
<filter class="solr.LowerCaseFilterFactory"/>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
</analyzer>
</fieldType>
My mapping.txt contains the line:
"-" => " "
What this rule does is convert dashes to spaces.
So if I search "wi-fi adapter", it will always show the same results as "wi fi adapter", but won't show results for "wifi adapter".
Is there any way to treat dashes like this? Essentially I'd want to treat "wifi adapter", "wi-fi adapter", and "wi fi adapter" the same.
You can use the WordDelimiterGraphFilterFactory in your analyzer. It has many attributes that can be used; I have listed a few of them below.
generateWordParts : (integer, default 1) If non-zero, splits words at delimiters. For example: "CamelCase", "hot-spot" → "Camel", "Case", "hot", "spot"
preserveOriginal : (integer, default 0) If non-zero, the original token is preserved: "Zap-Master-9000" → "Zap-Master-9000", "Zap", "Master", "9000"
catenateWords : (integer, default 0) If non-zero, maximal runs of word parts will be joined: "hot-spot-sensor’s" → "hotspotsensor"
So in your case it would look like this:
<fieldType name="text_wd" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- Splits words based on whitespace characters -->
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<!-- splits words at delimiters based on different arguments -->
<filter class="solr.WordDelimiterGraphFilterFactory" preserveOriginal="1" catenateWords="1"/>
<!-- Transforms text to lower case -->
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
More information can be found at "Filters available in Solr" in the Solr reference documentation.
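To illustrate why this helps (the token lists below are approximate, assuming the default generateWordParts="1"): the index analyzer turns a title like "wi-fi adapter" into the original, split and catenated forms, so all three query spellings can find it:
index:  "wi-fi adapter"  -> wi-fi, wi, fi, wifi, adapter
query:  "wifi adapter"   -> wifi, adapter       (matches the catenated form)
query:  "wi-fi adapter"  -> wi-fi, adapter      (matches the preserved original)
query:  "wi fi adapter"  -> wi, fi, adapter     (matches the word parts)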
I'm using solr 4.10.3. I tried to configure Solr to ignore dashes in searches:
<fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<!-- sonderzeichen .,-\/ ignorieren -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- enthaelt u-umlaut -> u, lowercase und uft8 decomposed -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
<analyzer type="query">
<!-- sonderzeichen .,-\/ ignorieren -->
<charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\.\-\\\/,]" replacement=""/>
<!-- enthaelt u-umlaut -> u, lowercase und uft8 decomposed -->
<tokenizer class="solr.ICUTokenizerFactory"/>
<filter class="solr.ICUFoldingFilterFactory"/>
</analyzer>
</fieldtype>
I have an entry "pan-pan, peter", which is found if I search
(peter pa*)
(peter panpa*)
or even
(pe-te-r panpa*)
Also
(peter pa-n-pa-n)
(without the *) matches.
But
(peter pan-p*)
(peter pan\-p*)
give no result.
It seems as if the combination of dash and * is a problem?
I'd like to find "pan-pan, peter" in every stage of typing "peter pan-pan"...
Try using the below field type.
<fieldType name="text_delimeter" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" preserveOriginal="1" catenateAll="1" splitOnCaseChange="0"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
I tried it with your text and analysed it; I found that the above field type should work for you. I have verified this in the Analysis tool as well.
I am using Solr as a search engine. I have a case where a text field contains accented text like "María". When a user searches for "María", it returns a result, but when a user searches for "Maria" it doesn't return anything.
My schema definition looks like below:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.
If you're on a Solr version newer than 3.x, you can try using solr.ASCIIFoldingFilterFactory, which will change accented characters to their unaccented versions from the basic 127-character ASCII set.
Remember to put it after any stemming filter you have configured (you're not using one, so you should be ok).
So your config could look like:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>
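After changing the analyzers you will need to reindex. Once that's done, a query such as q=my_text:maria as well as q=my_text:maría (field name taken from the schema above) should match documents containing "María", since the folding is applied both at index time and at query time.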
Answering here because it's the first result that pops up when searching "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by @soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line in the text_en fieldType.
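For example, a rough sketch of where the line goes (the rest of the haystack-generated text_en chain is omitted here and may differ in your schema.xml):
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!-- ...remaining generated filters unchanged... -->
  </analyzer>
</fieldType>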
Example: Beautiful
Search input: auti...
I would like to search with only part of a word, not the whole word.
For example, when I search for auti (just the middle letters, not the whole word) I am not getting any results. At the moment I am using the Search API with Apache Solr (and perhaps Views).
Any suggestions please?
I am using this field type:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
</analyzer>
</fieldType>
You can use a wildcard query.
In your example above, you should prepend and append your search term with an asterisk, so if someone searches for auti, the query you send to the server will be *auti*.
This should pull back all results containing words that have auti within them.
http://www.solrtutorial.com/solr-query-syntax.html
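For example (title here is just a placeholder field name):
q=title:*auti*
Keep in mind that queries with a leading wildcard like this can be slow on large indexes.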
Now, since you want to search for sub-strings inside words, you can add side="back" to your definition, and that should help you achieve your goal.
So your fieldtype definition will look like this:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" side="front" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" side="back" />
</analyzer>
</fieldType>
I'm trying to search for a partial word using Solr, but I can't get it to work.
I'm using this in my schema.xml file.
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
<filter class="solr.WordDelimiterFilterFactory" stemEnglishPossessive="1" splitOnNumerics="1" splitOnCaseChange="1" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="1" preserveOriginal="1"/>
</analyzer>
</fieldType>
Searching for die h won't work, but die hard returns some results.
I've reindexed the database after the above configuration was added.
Here is the url and output when searching for die hard. The debugger is turned on.
Here is the url and output when searching for die h. The debugger is turned on.
I'm using Solr 3.3. Here is the rest of the schema.xml file.
The query you've shared is searching the "title_text" field, but the schema you posted above defines the "text" field. Assuming this was just an oversight, and the title_text field is defined as in your post, I think a probable issue is that the NGramTokenizer is configured with minGramSize="3", and you are expecting to match using a single-character token.
You could try changing minGramSize to 1, but this will inevitably lead to some very inefficient indexes; and I wonder whether you really are keen on having "e" match every movie with an e in the title?
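For reference, that experiment would only mean changing the one attribute on the index-time tokenizer from the schema above, leaving everything else as it is:
  <tokenizer class="solr.NGramTokenizerFactory" minGramSize="1" maxGramSize="15" />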