How to configure Solr to use synonyms with KeywordTokenizerFactory - solr

Synonym, e.g.: "AAA" => "AVANT AT ALJUNIED"
If I search for AAA*BBB,
I want to get AVANT AT ALJUNIEDBBB.
I was using StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative position of the search words.
Alternatively, I tried to use StandardTokenizerFactory or other filters such as WordDelimiterFilterFactory to split the words at the *. It doesn't work.

You can't - synonyms work with tokens, and KeywordTokenizer keeps the whole string as a single token. So you can't expand just one part of the string when indexing if you're using KT.
In addition, the SynonymFilter isn't MultiTermAware, so it's not invoked at query time when doing a wildcard search - so you can't expand synonyms for parts of the string there, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr, or, if the number of replacements is small, having filters do pattern replacements inside the strings when indexing so that both versions are indexed.
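The preprocessing idea could look roughly like the sketch below. It is only an illustration: the SYNONYMS table and both function names are made up here, and escaping the spaces assumes the target field is keyword-tokenized, so the whole value has to reach the query parser as one term.

```python
import re

# Hypothetical synonym table; in practice it would mirror your synonyms file.
SYNONYMS = {"AAA": "AVANT AT ALJUNIED"}

def expand_synonyms(query, synonyms=SYNONYMS):
    """Replace each synonym key that appears as a whole alphanumeric run."""
    if not synonyms:
        return query
    # Longest keys first so a longer key wins over a key that is its prefix.
    keys = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![A-Za-z0-9])("
        + "|".join(re.escape(k) for k in keys)
        + r")(?![A-Za-z0-9])"
    )
    return pattern.sub(lambda m: synonyms[m.group(1)], query)

def escape_spaces(query):
    """Backslash-escape spaces so the query parser sees a single term."""
    return query.replace(" ", "\\ ")

expanded = expand_synonyms("AAA*BBB")
print(expanded)                 # AVANT AT ALJUNIED*BBB
print(escape_spaces(expanded))  # AVANT\ AT\ ALJUNIED*BBB
```

The wildcard is left untouched, so the rewritten string can be sent to Solr as the wildcard query in place of the original one.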

Related

How to search word with and without special characters in Solr

We use StandardTokenizerFactory in Solr, but we run into a problem when we search without special characters.
For example, the title "What’s the Score?" contains special characters. When we search for "Whats the Score" we don't get the proper result. Searching for the title should work both with and without the special characters.
Please suggest which filter we need to use to satisfy both conditions.
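If you have a recent version of Solr, try adding solr.WordDelimiterGraphFilterFactory with catenateWords=1 to your analyzer chain.
Starting from What's, this should create three tokens: What, s and Whats. Note that the filter's stemEnglishPossessive option defaults to 1, which strips a trailing 's, so you may need to set stemEnglishPossessive=0 for the s and Whats tokens to appear.
I'm not sure whether ' is in the list of characters the filter uses to split and concatenate words; in any case you can add it using the parameter types="characters.txt".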

substring match in solr query

I have a requirement where I have to match a substring in a query.
E.g. if the field has the value:
PREFIXabcSUFFIX
I have to create a query which matches abc. I always know the length of the prefix.
I cannot use EdgeNGram or NGram because of space constraints (they would create a larger index).
So I need to do this at query time and not at index time. Using a leading wildcard, something like *abc*, will have a high impact on performance.
Since I know the length of the prefix, I am hoping there is some way to do something like ....abc*, where the dots represent the exact length of the prefix, so that the query is not as bad as scanning the whole index, as in the case of the wildcard query (*abc*).
Is this possible in Solr? Thanks for your time.
Solr version : 4.10
Sure, Wildcard syntax is documented here, you could search something like ????abc*. You could also use a regex query.
However, the performance benefit from this over *abc* will be very small. It will still have to perform a sequential search over the whole index. But if there is no way you can improve your analysis to support your search needs, there may be no getting around that (GIGO).
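If you build the query string in client code, a small helper along these lines could generate the ????abc* form. This is a sketch under assumptions: the field name "code" and the function name are hypothetical, and the escaping list is my reading of the characters the standard query parser treats as special.

```python
import re

# Characters assumed to be special to the query parser; each is backslash-escaped.
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\/&|\s])')

def fixed_prefix_query(field, term, prefix_len):
    """Build field:??..?term* with one '?' per prefix character."""
    escaped = _SPECIAL.sub(r"\\\1", term)
    return f"{field}:{'?' * prefix_len}{escaped}*"

print(fixed_prefix_query("code", "abc", 6))  # code:??????abc*
```

Only the user-supplied term is escaped; the ? and * wildcards added by the helper are meant to stay active.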
You could use the RegularExpressionPatternTokenizer for this. For the sample below I guessed that the length of your prefix is 6. Your example text PREFIXabcSUFFIX would become abcSUFFIX. This way you may search for abc*
<analyzer>
<tokenizer class="solr.PatternTokenizerFactory" pattern=".{6}(.+)" group="1"/>
</analyzer>
About the Tokenizer:
This tokenizer uses a Java regular expression to break the input text stream into tokens. The expression provided by the pattern argument can be interpreted either as a delimiter that separates tokens, or to match patterns that should be extracted from the text as tokens.
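To check what the pattern would extract before changing the schema, you can try the same expression outside Solr. The sketch below uses Python's re module instead of Java's regex engine (the two should agree for an expression this simple) and only approximates the tokenizer's group mode; extract_tokens is a made-up name.

```python
import re

# Same expression as in the analyzer above; 6 is the guessed prefix length.
PATTERN = re.compile(r".{6}(.+)")

def extract_tokens(text, group=1):
    """Emit the chosen capture group of every match in the input."""
    return [m.group(group) for m in PATTERN.finditer(text)]

print(extract_tokens("PREFIXabcSUFFIX"))  # ['abcSUFFIX']
print(extract_tokens("short"))            # []
```

Values shorter than the prefix plus one character produce no token at all, so they would not be searchable through this field.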

solr edismax search for words containing substring

I am using eDisMax with SOLR 5.2.1 to search for a string. When I set the q parameter to that string, SOLR only matches fields containing that string as a whole word. For example,
q=bc123 will match "aa-bc123" but not "aabc123". If I add the * character before and after the phrase, then there must be leading and trailing characters for the search to match. For example, q=*bc123* will match "abc123a" but will not match "bc123".
The question is: what query string will match words containing the search words, with or without leading/trailing characters?
Please note:
There are multiple fields to match, which are defined using the qf parameter
qf=field1^4 field2^3 field2^2 ...
The search may contain multiple words, eg. for q=abc def I want fields that contain both words containing "abc" and words containing "def", such as using q.op=AND
I have tried to use fuzzy search, but I have gotten a varying degree of false positives or omitted results, depending on the threshold.
You can use an NGramFilter to achieve this. It will split the terms into multiple tokens, where each token will be a substring of the original token.
The filter is only required when indexing (when querying, the tokens should match directly).
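To see why this works, here is a rough emulation of what an n-gram filter emits for one term. The function name and the gram sizes (2 to 7) are arbitrary choices for illustration, not recommended settings.

```python
def ngrams(token, min_size, max_size):
    """Return every substring of token whose length is in [min_size, max_size]."""
    grams = set()
    for size in range(min_size, max_size + 1):
        for start in range(len(token) - size + 1):
            grams.add(token[start:start + size])
    return grams

indexed = ngrams("aabc123", 2, 7)
print("bc123" in indexed)    # True: the plain query term matches a gram
print("aabc123" in indexed)  # True: the full term is still there
print("bc124" in indexed)    # False
```

Because every substring is indexed as its own token, the query term is matched as-is with no wildcards, at the cost of a larger index.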

Disable boolean query in Solr for edismax

How do I disable boolean operators in edismax for solr?
The following query: Edismax -The Extended DisMax Query Parser should not exclude results mentioning "the" (given that stop words are not used).
I don't believe that Solr has an option to deactivate boolean operators. (Though I could be unaware of it - Solr is huge!)
My standard practice is to modify user-entered queries before passing them along to Solr. If punctuation isn't relevant in your search structure anyway, you could simply remove the hyphen, replace it with a space, or if you want to preserve the structure of hyphenated terms for your Solr analyzers to play with, you might selectively replace the specific pattern " -" with a single space " ", and so leave regular hyphenated expressions alone.
If you're not sure that the hyphen is irrelevant data in your search, you could replace it instead with a sentinel character or sequence of characters that will pass cleanly through your query parser and field analysis, but you would probably want to do the same thing to the input data going into the search index so the two sentinel values can match within Solr.
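A minimal sketch of the selective replacement, assuming the query is cleaned up in client code before it reaches Solr. The function name is hypothetical, and it only deals with hyphens; other operators such as +, AND, OR and NOT would need their own handling.

```python
import re

def neutralize_leading_hyphens(user_query):
    """Drop hyphens that start a word; keep hyphens inside words."""
    # A hyphen (or run of hyphens) at the start of the query or after
    # whitespace is what the parser would read as an exclusion operator.
    cleaned = re.sub(r"(?:^|\s)-+", " ", user_query)
    # Collapse any doubled or leading whitespace left by the replacement.
    return " ".join(cleaned.split())

print(neutralize_leading_hyphens("Edismax -The Extended DisMax Query Parser"))
# Edismax The Extended DisMax Query Parser
print(neutralize_leading_hyphens("well-known -term"))
# well-known term
```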

tokenizer for keepwordfilterfactory in solr

I want to use Solr's KeepWordFilterFactory, but I can't find the appropriate tokenizer for it. My use case: I have a string, say hi i am coming, bla-bla go out. From this string I want to keep words like hi i, coming, bla-bla, etc. Which tokenizer should I use with the filter factory so that I can get any such combination in facets? I have tried different tokenizers but did not get the exact result. I am using Solr 4.0. Is there a tokenizer that tokenizes based on the keep words used?
What are your 'rules' for tokenization (splitting long text into individual tokens)? The example above seems to imply that sometimes you have single-word tokens and sometimes multi-word ones ("hi i"). The multi-word case is problematic here, but you might be able to do it by using ShingleFilterFactory to give you multi-word tokens as well as the original ones, and then keeping only the items you want.
I am not sure whether the KeepWord filter deals correctly with multi-word strings. If it does not, you may want to use a special separator character during the shingle process and then regex-filter it back to a space as the last step.
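The shingle-then-keep idea can be tried out in a few lines before touching the schema. This is only an emulation of the two steps, not what the Solr filters do internally: the whitespace split, the punctuation stripping and the KEEP set are assumptions made for the example.

```python
def shingles(tokens, max_size=2, separator=" "):
    """Emit every run of 1..max_size adjacent tokens, joined by separator."""
    out = []
    for start in range(len(tokens)):
        for size in range(1, max_size + 1):
            if start + size <= len(tokens):
                out.append(separator.join(tokens[start:start + size]))
    return out

def keep_words(tokens, keep):
    """Keep only the tokens that appear in the keep list."""
    return [t for t in tokens if t in keep]

text = "hi i am coming, bla-bla go out"
# Assumed tokenization: split on whitespace, strip trailing punctuation.
tokens = [w.strip(",.") for w in text.split()]
KEEP = {"hi i", "coming", "bla-bla"}  # hypothetical keep-word list

print(keep_words(shingles(tokens), KEEP))  # ['hi i', 'coming', 'bla-bla']
```

Passing a different separator to shingles mirrors the special-separator workaround mentioned above; the keep list would then have to use the same separator.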
