Solr tokenizer for search - solr

I have defined a new field type in Solr for a auto suggest,
<fieldType name="auto_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
now if I search for a particular field for example
/solr/select?q=ree
Im able to get the response like "reebok shirt" but not able to fetch the records like "white reebok shirt", should I add any other tokenizer to acheive the same???

See wiki. KeywordTokenizerFactory does this: Treats the entire field as a single token, regardless of its content. Use WhitespaceTokenizerFactory instead.

Related

Solr search from beginning of string

I need to boost results which are found from beginning of string. For Example i have to countries: Egypt and Seychelles.
User types "e" in a text field and solr response will be:
Seychelles
Egypt
But as you can see "Egypt" starts with "e". And i need this result to be boosted up:
Egypt
Seychelles
Any other results should be scored as usual. Is there any kind of special tokenizers/serializers? Or may be special characters in SolrQuery syntax?
UPD:
Part of my schema.xml which describes text field type:
<fieldType name="text" class="solr.TextField" omitNorms="false">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
Problem solved by using EdgeNGramFilterFactory instead of NGramFilterFactory:
<fieldType name="text_start_end" class="solr.TextField" omitNorms="false">
<analyzer>
<tokenizer class="solr.ClassicTokenizerFactory"/>
<filter class="solr.PositionFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="20" side="front" />
</analyzer>
</fieldType>

How to ignore accent search in Solr

I am using solr as a search engine. I have a case where a text field contains accent text like "María". When user search with "María", it is giving resut. But when user search with "Maria" it is not giving any result.
My schema definition looks like below:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldtype>
Please help to solve this issue.
If you're on solr > 3.x you can try using solr.ASCIIFoldingFilterFactory which will change all the accented characters to their unaccented versions from the basic ascii 127-character set.
Remember to put it after any stemming filter you have configured (you're not using one, so you should be ok).
So your config could look like:
<fieldtype name="my_text" class="solr.TextField">
<analyzer type="Index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="32" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
</analyzer>
</fieldtype>
Answering here because it's the first result that pop when searching "ignore accents solr".
In the schema.xml generated by haystack (and using aldryn_search, djangocms & djangocms-blog), the answer provided by #soulcheck works if you add the <filter class="solr.ASCIIFoldingFilterFactory"/> line in the text_en fieldType.
Screenshot 1, screenshot 2.

KeywordTokenizerFactory with LowerCaseFilterFactory

I wanted to use a NGramFilterFactory for my index and saw following example and tryed it out:
<fieldType name="NGramText" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="25" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>
<field name="mark" type="NGramText" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="true"/>
The example is using KeywordTokenizerFactory. What is the purpose of using this? From what i understand it really do not do anythig, " the entire
input string is preserved as a single token " it says on the net.
Is there a good reason to use KeywordTokenizerFactory to make Ngrams or could i change it for WhitespaceTokenizerFactory with out slowing down the searches?
And also with this example LowerCaseFilterFactory is not making the fields lowercase could that has something to do with the conjunction with KeywordTokenizerFactory?

partial word search in solr example: sarvesh , i want search like rves

examples:Beautiful
search based: auti...
I would like to search with only part of a word, not the whole word.
For example when I search auti only the middle 3 letters ,not the whole word.I am not getting results : For the moment I am using the search api with apache solr (and perhaps views).
Any suggestions please?
I am using this one
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
</analyzer>
</fieldType>
You can use wildcard query.
In your example above, you should prepend and append your search terms with an asterix, so if someone searches for auti, the query you send to server will be auti
This should pull all results with all words that contain the word auti within them.
http://www.solrtutorial.com/solr-query-syntax.html
Now since you wanna search for sub-strings inside words, you can add side="back" to your definition, and that should help you achieve your goal.
So your fieldtype definition will look like this:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" side="front" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" side="back" />
</analyzer>
</fieldType>

Solr filter factory syntax not working

So I am attempting to have a custom field in my Solr schema that is filtered and processed a certain way but it doesn't seem to be working.
<fieldType name="removeWhitespace" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="\s" replacement="" replace="all" />
</analyzer>
</fieldType>
<field name="whiteSpaceRmved" type="removeWhitespace" stored="true" indexed="true"/>
<copyField source="original" dest="whiteSpaceRmved"/>
Basically, if I have a field like,
Hello World
I want to have that field, and a new field name that looks like,
HelloWorld
But when I try it, it copies the field, but doesn't change it in any way. Any ideas?
You need to move the tokenizer <tokenizer class="solr.StandardTokenizerFactory" />to the end of your analyzer chain. Currently, it is breaking the field values into tokens before you are removing whitespace. And actually since you are removing whitespace, you might not even need a tokenizer, since it looks like you want to store the values as strings really.
You should use KeywordTokenizer, which does no actual tokenizing, so the entire input string is preserved as a single token
<fieldType name="removeWhitespace" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="(\s)" replacement="" replace="all"
/>
</analyzer>
</fieldType>

Resources