How to search for words with and without special characters in Solr

We use StandardTokenizerFactory in Solr, but we run into a problem when we search without special characters.
For example, the title "What’s the Score?" contains special characters. When we search for "Whats the Score" we don't get the proper result.
In other words, searching for a title should work both with and without the special characters.
Please suggest which filter we need to use to satisfy both conditions.

If you have a recent version of Solr, try adding solr.WordDelimiterGraphFilterFactory with catenateWords=1 to your analyzer chain.
Starting from What's, this should create three tokens: What, s and Whats. Note that the filter's stemEnglishPossessive option defaults to 1, which strips a trailing 's, so you may need to set it to 0 to get all three tokens.
I'm not sure whether ' is in the list of characters the filter treats as word delimiters; in any case you can add it using the parameter types="characters.txt".
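To make the effect of catenateWords concrete, here is a toy Python sketch of the "split on delimiters, then also emit the joined form" idea. It is not Lucene's implementation: the function name split_and_catenate is hypothetical, the splitting rule (any non-alphanumeric character) is a simplification, and it assumes possessive stemming is turned off.

```python
import re

def split_and_catenate(token, catenate_words=True):
    """Toy model: split on non-alphanumeric runs, optionally add the joined form."""
    # Simplified rule: every character outside 0-9, A-Z, a-z acts as a delimiter.
    parts = [p for p in re.split(r"[^0-9A-Za-z]+", token) if p]
    out = list(parts)
    # Only add the catenated token when there was something to join.
    if catenate_words and len(parts) > 1:
        out.append("".join(parts))
    return out

# Both the ASCII and the typographic apostrophe act as delimiters in this toy.
indexed = split_and_catenate("What's")
indexed_curly = split_and_catenate("What’s")
queried = split_and_catenate("Whats")

# The query token "Whats" is among the indexed tokens, so the search can match.
print(indexed, indexed_curly, queried)
```

The point is only that the indexed side ends up containing Whats as well as What, so a query typed without the apostrophe has a token to match.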

Related

How to configure Solr to use synonyms with KeywordTokenizerFactory

Synonym example: "AAA" => "AVANT AT ALJUNIED"
If I search for AAA*BBB
I want to get AVANT AT ALJUNIEDBBB.
I used StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative position of the search words.
On the other hand, I tried StandardTokenizerFactory with other filters such as WordDelimiterFilterFactory to split the word on *. It doesn't work.
You can't - synonyms work on tokens, and KeywordTokenizer keeps the whole string as a single token. So you can't expand just one part of the string when indexing if you're using KeywordTokenizer.
In addition, the SynonymFilter isn't MultiTermAware, so it's not invoked at query time when doing a wildcard search - so you can't expand synonyms for parts of the string there either, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr or, if the number of replacements is small, having filters do pattern replacements inside the strings when indexing, so that both versions are indexed.
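The preprocessing route could look roughly like the Python sketch below, run on the client before documents are sent to Solr. The SYNONYMS table and the function names are hypothetical, and the sketch assumes plain substring replacement is acceptable for your data.

```python
import re

# Hypothetical replacement table, taken from the example in the question.
SYNONYMS = {"AAA": "AVANT AT ALJUNIED"}

def expand(value, synonyms=SYNONYMS):
    """Replace every occurrence of a synonym key inside the string."""
    if not synonyms:
        return value
    # Longest keys first so that overlapping keys prefer the longer match.
    keys = sorted(synonyms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(k) for k in keys))
    return pattern.sub(lambda m: synonyms[m.group(0)], value)

def both_versions(value, synonyms=SYNONYMS):
    """Return the original value plus the expanded one, if they differ."""
    expanded = expand(value, synonyms)
    return [value] if expanded == value else [value, expanded]

# Values to index, e.g. into a multi-valued field.
print(both_versions("AAABBB"))
```

Indexing both versions means a wildcard query can match either the abbreviated or the expanded form without any synonym handling at query time.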

Solr OR query on a text field

How do I perform a simple query on a text field with an OR condition? Something like name:ABC OR name:XYZ, so that the result set contains only those docs where name is exactly "XYZ" or "ABC".
I've dug through tons of manuals and cannot figure this out.
I use Solr 5.5.0
Update: Upgraded to Solr 6.6.0, still cannot figure it out. Below are illustrations to demonstrate my issue:
[Screenshots not preserved: three example queries captioned as working, followed by one captioned "But this does not! Omg why!?"]
There are many ways to perform an OR query. I have listed some of them below; you can use any of them.
[Simple Query]
q=name:(XYZ OR ABC)
[Lucene Query Parser]
q={!lucene q.op=OR df=name v="XYZ ABC"}
Your syntax is right, but what you're asking for isn't what text fields are made for. A text field is tokenized (split into multiple tokens), and each token is searched by itself. So if the text inserted is "ABC DEF GHI", it will be split into three separate tokens, namely "ABC", "DEF" and "GHI". So when you're searching field:ABC, you're really asking for any document that has the token "ABC" somewhere.
Since you want to perform an exact match, you want to query against a field that is defined as a string field, as this will keep the value verbatim (including casing, so the matching will be case sensitive). You can tell Solr to index the same content into multiple fields by adding a copyField instruction, telling it to take the content submitted for field foo and also copy it into field bar, allowing you to perform both an exact match and a more general search as needed.
If you need to perform exact, but case insensitive, searches, you can use a KeywordTokenizer - the KeywordTokenizer does nothing, keeping the whole string as a single token, while still allowing you to add filters to the analysis chain. By adding a LowerCaseFilter you tell Solr to lowercase the string as well before storing it (or querying for it).
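The difference between the three kinds of field can be illustrated with a toy Python model. This is not how Solr is implemented; the analyzer functions and the matches helper are hypothetical stand-ins for a tokenized text field, a string field, and a KeywordTokenizer plus lowercasing.

```python
def analyze_text(value):
    # Tokenized field: the value is split into separate tokens.
    return value.split()

def analyze_string(value):
    # String field: the whole value is kept verbatim as one token.
    return [value]

def analyze_keyword_lowercase(value):
    # Keyword tokenizer + lowercase filter: one token, lowercased.
    return [value.lower()]

def matches(analyzer, indexed_value, query_term):
    """True if every analyzed query token occurs among the indexed tokens."""
    indexed_tokens = analyzer(indexed_value)
    return all(tok in indexed_tokens for tok in analyzer(query_term))

# A tokenized field matches on any single token.
print(matches(analyze_text, "ABC DEF GHI", "ABC"))
# A string field only matches the complete, identically cased value.
print(matches(analyze_string, "ABC DEF GHI", "ABC"))
# Keyword + lowercase matches the complete value regardless of case.
print(matches(analyze_keyword_lowercase, "ABC", "abc"))
```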
You can use the "Analysis" page under the Solr admin page to experiment and see how content for your field is being processed for each step.
After that, querying with string_field:ABC OR string_field:XYZ should do what you want (or string_field:(ABC OR XYZ), or a few other ways of expressing the same thing).
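If the query is assembled in client code, a small helper along these lines can build the OR clause and URL-encode it. This is a sketch only: the field name name_exact is a hypothetical string field, and the values are wrapped in double quotes so that values containing spaces stay intact.

```python
from urllib.parse import urlencode

def or_query(field, values):
    """Build field:("v1" OR "v2" ...) with quotes and backslashes escaped."""
    quoted = []
    for v in values:
        escaped = v.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append('"' + escaped + '"')
    return field + ":(" + " OR ".join(quoted) + ")"

# name_exact is an assumed string-typed copy of the name field.
q = or_query("name_exact", ["ABC", "XYZ"])
params = urlencode({"q": q, "rows": 10})

print(q)
print(params)
```

The resulting params string can be appended to the select handler URL of your core.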
A wacky workaround I've just come up with:

How to do a Solr search including special characters like (-, &, etc.)?

I need to do a Solr search for a string like BEBIL1407-GREEN, which contains a special character (-), but Solr ignores the - and searches only for BEBIL1407. I need to search for the whole word. I'm using Solr 4.5.1.
Example Query :
q=BEBIL1407-GREEN&qt=select&start=0&rows=24&fq=clg%3A%222%22&fq=isAproved%3A%22Y%22&fl=id
Your question is about searching for BEBIL1407-GREEN but finding BEBIL1407.
You did not post your schema or your query parser.
By default, Solr uses the standard query parser on the field "text" with the field type "text_general".
You can use the Solr analysis screen to follow the path from a word (in real text) to the corresponding tokens in the index.
For "text_general", the word "BEBIL1407-GREEN" becomes two tokens: "bebil1407" and "green".
The standard query parser does support escaping of special characters, which would help if your word started with a hyphen (minus sign). But in this case the tokenizer is most likely the reason for "finding unexpected documents".
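For reference, escaping on the client side can be sketched in Python as below. The helper name escape_query_chars is hypothetical; the character set follows the special characters of the standard query parser. Escaping only stops the parser from treating the hyphen as an operator; it does not change how the field's analyzer tokenizes the term.

```python
# Characters with special meaning to the standard query parser
# (& and | are escaped individually, which also covers && and ||).
SPECIAL_CHARS = set('\\+-!():^[]"{}~*?|&/')

def escape_query_chars(term):
    """Prefix every special character with a backslash."""
    return "".join("\\" + ch if ch in SPECIAL_CHARS else ch for ch in term)

print(escape_query_chars("BEBIL1407-GREEN"))
print(escape_query_chars("-GREEN"))
```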
Solution:
You can search with a phrase. In this case "BEBIL1407-GREEN" will also find "BEBIL1407 GREEN".
You can use another field type, e.g. one with WhitespaceTokenizer.
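The difference between the two tokenization styles can be seen in a toy Python comparison. These functions only approximate the behaviour described above (splitting on non-alphanumerics and lowercasing versus splitting on whitespace only); they are not the Lucene tokenizers themselves.

```python
import re

def standard_like(text):
    # Rough stand-in for a text_general style chain: split on anything
    # that is not a letter or digit, then lowercase.
    return re.findall(r"[0-9a-z]+", text.lower())

def whitespace_only(text):
    # Rough stand-in for a whitespace tokenizer: split on whitespace only.
    return text.split()

sku = "BEBIL1407-GREEN"
print(standard_like(sku))    # the hyphen acts as a token boundary
print(whitespace_only(sku))  # the hyphenated word stays one token
```

With the whitespace-style analysis, a query for BEBIL1407 alone no longer matches the hyphenated value, which is the behaviour asked for in the question.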
Hope this helps, otherwise post your search field and your definition from schema.xml...

Escape LukeRequest with spaces, slashes, and colon

I am using Solr 4.1. Using LukeRequest, I want to get the number of documents with data for a specific field. The name of the field is something like http://foo.org/bar/ baz (note the space between bar/ and baz). When I visit http://127.0.0.1:8983/root/admin/luke I get a list of all of my fields, including the aforementioned one. When I visit
http://127.0.0.1:8983/root/admin/luke?fl=http://foo.org/bar/ baz
I get no hits. I have tried URL-encoding the string, escaping slashes, escaping the colon, escaping the space, using + instead of space, and every possible combination of backslashes I can think of. The solution posted at another StackOverflow question, "field listing in solr with "fl" parameter for a field having space in between", didn't work for me.
I am really only looking for a yes-no answer to whether any documents have a value for this particular field, so if there is a better way to do this than LukeRequest, I'm all ears for that too.
AFAIK, escaping special characters using a backslash works for values, not for parameters like fl or sort.
This answer on the Lucene mailing list also confirms my thoughts. I guess you shouldn't have spaces in field names.
I believe you could accomplish the same thing using the TermsComponent as it can tell you if there are any terms associated with a field in the index. However, you will need to specify the field name in the query, so you will run into a similar issue. As Srikanth answered, you are better off not using spaces or special characters in field names.

Tokenizer for KeepWordFilterFactory in Solr

I want to use the Solr KeepWordFilterFactory but can't find an appropriate tokenizer for it. My use case: I have a string, say hi i am coming, bla-bla go out. From this string I want to keep words like hi i, coming, bla-bla, etc. Which tokenizer should I use with the filter factory so that I can get any such combination in facets? I have tried different tokenizers but am not getting the exact result. I am using Solr 4.0. Is there a tokenizer that tokenizes based on the keep words used?
What are your 'rules' for tokenization (splitting long text into individual tokens)? The example above seems to imply that sometimes you have single-word tokens and sometimes a multi-word one ("hi i"). The multi-word case is problematic here, but you might be able to do it by using ShingleFilterFactory to give you multi-word tokens as well as the original ones, and then keeping only the items you want.
I am not sure whether the KeepWord filter deals correctly with multi-word strings. If it does not, you may want to use a special separator character during the shingle process and then use a regex filter to turn it back into a space as the last step.
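The shingle-then-keep idea can be sketched in plain Python to see which tokens would survive. This is a toy model, not the Solr filters: the function names are hypothetical, tokenization is a whitespace split with trailing punctuation stripped, and shingles are limited to two words.

```python
def shingles(tokens, max_size=2):
    """Emit every token plus every run of up to max_size adjacent tokens."""
    out = []
    for i in range(len(tokens)):
        for size in range(1, max_size + 1):
            if i + size <= len(tokens):
                out.append(" ".join(tokens[i:i + size]))
    return out

def keep_words(tokens, keep):
    """Keep only the tokens that appear in the keep list."""
    return [t for t in tokens if t in keep]

text = "hi i am coming, bla-bla go out"
# Whitespace split, then strip commas and periods from the token edges.
tokens = [w.strip(",.") for w in text.split()]

kept = keep_words(shingles(tokens), {"hi i", "coming", "bla-bla"})
print(kept)
```

The single-word and two-word candidates are generated together, so the keep list can mix entries such as coming and hi i.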

Resources