Solr preserve whitespace search

Below is my fieldtype and I want to preserve the white space during search
<fieldType name="searchterm" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="250" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
For example, with the input "alpha beta", a search for either "alpha" or "beta" matches. How do I make sure that a search term like "alpha eta" does not match? Searches for "eta" or "pha" should still match, but "alpha eta" should not.

Would be nice to know what kind of application needs such a search :-).
You can do the following:
If your search term has no spaces, use your existing field searchterm.
To handle search queries that contain space(s), create a new field (say, newsearchterm), populated through a copyField, whose type uses EdgeNGramFilterFactory instead of NGramFilterFactory.
For newsearchterm the analysis will happen this way:
alpha beta ==> alp, alph, alpha, bet, beta
So a search for newsearchterm:(alpha AND eta) won't match alpha beta, because eta is not among the indexed prefixes.
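A minimal sketch of what that second field could look like in schema.xml. The type name searchterm_edge is hypothetical, and the copyField assumes the existing field really is called searchterm, as the answer implies; the gram sizes are simply carried over from the original type.

```xml
<!-- Hypothetical type: same chain as "searchterm", but with edge n-grams at index time -->
<fieldType name="searchterm_edge" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Only prefixes of each word are indexed: alp, alph, alpha, bet, beta -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="250" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>

<!-- Assumed field names; adjust to your schema -->
<field name="newsearchterm" type="searchterm_edge" indexed="true" stored="false" />
<copyField source="searchterm" dest="newsearchterm" />
```

The application would then send single-word queries to searchterm and multi-word queries to newsearchterm.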

Related

Solr not returning the exact element

I am using Solr 7.7.3.
I have an element with the label "alpha-ravi".
When I search in Solr for label:"alpha", it returns the element with the label "alpha-ravi".
According to the Solr documentation, it should not return this element.
Can anyone explain this behavior?
If you want to retrieve exact results only (i.e., return docs with "alpha-ravi" only if the user types exactly "alpha-ravi" in the search), then I would suggest the Keyword tokenizer (solr.KeywordTokenizerFactory). This tokenizer treats the entire "alpha-ravi" as a single token and thus will not return partial results when only "alpha" or "ravi" matches.
For example, in your schema.xml file you could add something like this (configure the filter chains as per your need):
<fieldType name="single_token_string" class="solr.TextField" sortMissingLast="true">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.ASCIIFoldingFilterFactory"/>
<filter class="solr.TrimFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
</fieldType>
And then you can use this fieldType in the same schema.xml (referencing the single_token_string type we just defined):
<field name="myField" type="single_token_string" indexed="true" stored="true" />
The text field types in Solr's default schema typically use the StandardTokenizer, which splits "alpha-ravi" on the hyphen into two tokens ("alpha" and "ravi"), so a search for either one matches.
As an alternative, you could also run a phrase query. The phrase is still analyzed, but its tokens must then occur next to each other and in the same order. Possibly something like http://localhost:8983/solr/...fq=label:"alpha-ravi"
Hope that helps. All the best!

Solr substring search yields all indexed results

To do a substring search, I have added a new fieldType, "text", with an NGram filter.
It works fine, but there is one downside.
Example:
name = ['Apple','Samy','And','a']
When I search for name:a, all of the above items are returned. Even when the search changes to "App", all of the above items are still returned. How can I fix this issue?
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100" />
</analyzer>
</fieldType>
As you can see in the analysis, both the indexed value and the query value get processed by the EdgeNGramFilter, meaning that the query is also expanded into grams and matches any document that shares even one of them. Add a simpler analyzer chain for querying the field, and you should be good to go.
The example from the Wiki should be usable by just copying and pasting it:
<fieldType name="text_general_edge_ngram" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="15" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.LowerCaseTokenizerFactory"/>
</analyzer>
</fieldType>
My initial guess was that since you don't provide two alternative definitions, Solr will use the same chain for both. Your analysis output confirms that suspicion. Try adding an analyzer with type="query" to have a specific chain for querying the field (you do not want EdgeNGram in both places).
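Applied to the fieldType from the question, that advice might look like the sketch below. The LowerCaseFilterFactory in both chains is an addition that the original type does not have, assumed here so that matching is case-insensitive. The query chain deliberately has no EdgeNGramFilterFactory.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumption: lowercase so that "App" and "apple" can match -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="100"/>
  </analyzer>
  <analyzer type="query">
    <!-- No n-gram filter here: the query term is looked up as typed -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this chain, name:App is looked up as the single term app, which among the example values only the edge n-grams of Apple contain. Documents have to be reindexed after the schema change.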

Solr tokenizer for search

I have defined a new field type in Solr for auto-suggest:
<fieldType name="auto_text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.KeywordTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>
Now if I search on a particular field, for example
/solr/select?q=ree
I am able to get responses like "reebok shirt" but not records like "white reebok shirt". Should I add another tokenizer to achieve this?
See wiki. KeywordTokenizerFactory does this: Treats the entire field as a single token, regardless of its content. Use WhitespaceTokenizerFactory instead.
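As a rough sketch of that suggestion, here is the auto_text type with the tokenizer swapped in both chains. Everything else is kept from the question; switching the query analyzer as well is an assumption, not something the answer spells out.

```xml
<fieldType name="auto_text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <!-- Splits "white reebok shirt" into white, reebok, shirt before the n-grams are built -->
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Each word now gets its own prefixes (re, ree, reeb, ...), so ree can match a word in the middle of the field value. The trade-off is that a multi-word query is treated as separate terms rather than as one prefix of the whole field.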

Partial word search in Solr, example: sarvesh, I want to search like rves

Example: Beautiful
Search based on: auti...
I would like to search with only part of a word, not the whole word.
For example, when I search for auti (only the middle letters, not the whole word), I am not getting any results. For the moment I am using the Search API with Apache Solr (and perhaps Views).
Any suggestions, please?
I am using this one:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10"/>
</analyzer>
</fieldType>
You can use a wildcard query.
In your example above, you should prepend and append an asterisk to your search terms, so if someone searches for auti, the query you send to the server will be *auti*
This should pull all results with words that contain the string auti within them.
http://www.solrtutorial.com/solr-query-syntax.html
Now, since you want to search for substrings inside words, you can add side="back" to your definition, and that should help you achieve your goal.
So your fieldtype definition will look like this:
<fieldType name="string_ci" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory"/>
<filter class="solr.StandardFilterFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" side="front" />
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="10" side="back" />
</analyzer>
</fieldType>
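An alternative sketch, in case the side attribute is not available in your Solr version: a single NGramFilterFactory at index time also produces the interior substrings of each word. The type name string_ci_ngram is hypothetical, the StandardFilterFactory from the question is left out on the assumption that it is not needed, and the separate query analyzer is an addition so that the search term itself is not split into grams.

```xml
<fieldType name="string_ci_ngram" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Indexes every substring of 1 to 10 characters, e.g. "auti" from "beautiful" -->
    <filter class="solr.NGramFilterFactory" minGramSize="1" maxGramSize="10"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Search terms longer than maxGramSize will not match with this setup, and the index grows noticeably, so the gram sizes are worth tuning to the data.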

Solr filter factory syntax not working

So I am attempting to have a custom field in my Solr schema that is filtered and processed a certain way but it doesn't seem to be working.
<fieldType name="removeWhitespace" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory" pattern="\s" replacement="" replace="all" />
</analyzer>
</fieldType>
<field name="whiteSpaceRmved" type="removeWhitespace" stored="true" indexed="true"/>
<copyField source="original" dest="whiteSpaceRmved"/>
Basically, if I have a field like,
Hello World
I want to have that field, and a new field name that looks like,
HelloWorld
But when I try it, it copies the field, but doesn't change it in any way. Any ideas?
The problem is that <tokenizer class="solr.StandardTokenizerFactory" /> breaks the field value into tokens before your filters get a chance to remove the whitespace, so the PatternReplaceFilterFactory only ever sees individual words. A tokenizer always runs before the filters in an analyzer, so the whitespace has to survive tokenization. And since you are removing whitespace, you might not even need real tokenization, since it looks like you want to treat the values as single strings.
You should use KeywordTokenizer, which does no actual tokenizing, so the entire input string is preserved as a single token:
<fieldType name="removeWhitespace" class="solr.TextField" sortMissingLast="true" omitNorms="true">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.TrimFilterFactory" />
<filter class="solr.PatternReplaceFilterFactory"
pattern="(\s)" replacement="" replace="all"
/>
</analyzer>
</fieldType>
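Another option, sketched here as an assumption rather than something the answers above propose, is to strip the whitespace with a char filter, which runs on the raw text before the tokenizer sees it. This keeps the StandardTokenizerFactory from the question.

```xml
<fieldType name="removeWhitespace" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- Char filters run before the tokenizer: "Hello World" becomes "HelloWorld" -->
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement="" />
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>
```

Either way, analysis only changes the indexed terms. The stored value returned in search results is still the original text, which is why the copied field appears unchanged when you look at a document.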
