Azure Search doesn't seem to handle Unicode punctuation - azure-cognitive-search

I have an Azure Search index with a number of text entries. I've observed that if the index contains an entry like "AI's" (with the Unicode apostrophe, character 8217 / U+2019), searching for the word 'AI' fails to return that entry. The index should handle punctuation, including Unicode variants: searching for "John" should return an item that contains "John's." Please confirm whether this is a known bug and, if so, when it will be fixed.
Expected: searching for "AI" finds "AI's" (where the apostrophe is Unicode character 8217). Instead, the item is not returned as one would expect.

Can you confirm which analyzer you are using in your index? We support many analyzers that break your search terms and document terms down into different tokens. For example, if your content is in English, you could use the en.microsoft analyzer, which should split your "AI's" term into two tokens -> "AI" and "AI's". A minimal sketch of setting that analyzer follows the links below.
More info on analyzers here ->
https://learn.microsoft.com/en-us/azure/search/search-analyzers
and here
https://learn.microsoft.com/en-us/azure/search/index-add-language-analyzers
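
To make that concrete, here is a minimal sketch using the azure-search-documents Python SDK; the service endpoint, key, index name, and field names are hypothetical placeholders:

```python
# Sketch: assign the en.microsoft language analyzer to a searchable field.
# Endpoint, key, index name, and field names are hypothetical.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SearchIndex,
    SearchFieldDataType,
    SimpleField,
    SearchableField,
)

client = SearchIndexClient(
    endpoint="https://<your-service>.search.windows.net",
    credential=AzureKeyCredential("<admin-key>"),
)

index = SearchIndex(
    name="docs-index",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        # en.microsoft tokenizes "AI's" (with U+2019) so that a search
        # for "AI" matches; the default analyzer may keep "AI's" whole.
        SearchableField(
            name="content",
            type=SearchFieldDataType.String,
            analyzer_name="en.microsoft",
        ),
    ],
)
client.create_or_update_index(index)
```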

Related

Azure Search: wildcard queries do not work with Japanese/Chinese characters

I used the icu_tokenizer in a custom analyzer to create a search index for Japanese words, and the index was created successfully. I chose icu_tokenizer because it works better for Asian languages than the default Azure Search tokenizer.
Now when I query for a string such as 赤城, I see multiple search results (131 in total) from the index. But when I use a wildcard search with the same word, e.g. 赤城* (adding * at the end of the word) or /赤城.*/ (a regex search query), I get 0 results. The weird part is that * does work with a single Japanese character: 赤* gives the same number of results as 赤. But as soon as the query has more than one Japanese character, wildcard queries with * stop working and return 0 results. I am testing all of these queries in Search explorer in the Azure portal with querytype=full (Lucene syntax query).
In my application, search terms are normally used as prefix searches, so we append * to the end of the search string to fetch results, but these Lucene wildcard queries just do not work with Japanese characters. Any idea how I can make these prefix queries (using a wildcard * at the end of the search string) work when the search strings are in Japanese?
Any quick help will be much appreciated!
I tested this with my installation and can confirm that wildcards only work with Japanese content when you use a Japanese analyzer.
In my example I set up one index with a property Body that does not have a specific analyzer defined. Then I set up another index where Body uses the ja.microsoft language analyzer. The content in both indexes is identical. I then searched for 自動車 (automobile) with a trailing wildcard.
自動車* returns multiple hits from the index using the Japanese analyzer. No hits are returned from the index without a specific analyzer defined. A sketch of the relevant field definition follows.
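
For reference, a minimal sketch of that field definition using the azure-search-documents Python SDK; only the Body field name comes from the test above, the rest is illustrative:

```python
# Sketch: the Body field assigned the ja.microsoft language analyzer,
# mirroring the second index in the test above. Illustrative, not verbatim.
from azure.search.documents.indexes.models import SearchableField, SearchFieldDataType

body_field = SearchableField(
    name="Body",
    type=SearchFieldDataType.String,
    # ja.microsoft segments Japanese text into word tokens, so a
    # trailing-wildcard query like 自動車* can match indexed tokens.
    analyzer_name="ja.microsoft",
)
```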
Sorry for the late reply.
Have you tried using one of the Japanese language analyzers? For example, ja.microsoft.
Also, if you want to use prefix search, you can try experimenting with the suggester feature, which is designed to be efficient for this scenario; a sketch follows.
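
A minimal sketch of the suggester approach, assuming the azure-search-documents Python SDK; the suggester name, index name, and field are hypothetical:

```python
# Sketch: a suggester over the Body field for prefix matching.
# Suggester name, index name, endpoint, and key are hypothetical.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.indexes.models import SearchSuggester

# Attached to the index definition at creation time, e.g.:
suggester = SearchSuggester(name="sg", source_fields=["Body"])
# index = SearchIndex(name="docs-ja", fields=[...], suggesters=[suggester])

# Query side: suggestions are prefix matches against the suggester fields.
search_client = SearchClient(
    endpoint="https://<your-service>.search.windows.net",
    index_name="docs-ja",
    credential=AzureKeyCredential("<query-key>"),
)
results = search_client.suggest(search_text="自動", suggester_name="sg")
for r in results:
    print(r["text"])
```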

How to configure Solr to use synonyms based on KeywordTokenizerFactory

I have a synonym, e.g.: "AAA" => "AVANT AT ALJUNIED"
If I search for AAA*BBB,
I want to get AVANT AT ALJUNIEDBBB.
I used StandardTokenizerFactory, but it always breaks the field data into lexical units and then ignores the relative position of the search words.
I also tried StandardTokenizerFactory with other filters, such as WordDelimiterFilterFactory, to split the word at *. It doesn't work.
You can't - synonyms work with tokens, and KeywordTokenizer keeps the whole string as a single token. So you can't expand just one part of the string when indexing if you're using KeywordTokenizer.
In addition, the SynonymFilter isn't MultiTermAware, so it's not invoked at query time when doing a wildcard search - so you can't expand synonyms for parts of the string there either, regardless of which tokenizer you're using.
This is probably a good case for preprocessing the string and doing the replacements before sending it to Solr, or, if the number of replacements is small, having filters do pattern replacements inside the strings when indexing so that both versions are indexed. A sketch of the preprocessing approach follows.
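
A minimal sketch of the preprocessing approach in Python; the synonym map and the simple string replacement are illustrative assumptions:

```python
# Sketch: expand synonyms client-side before sending the query to Solr,
# since SynonymFilter won't fire inside a wildcard query.
# The synonym map and replacement strategy are illustrative assumptions.
SYNONYMS = {"AAA": "AVANT AT ALJUNIED"}

def expand_query(q: str) -> str:
    """Replace known synonym keys anywhere in the raw query string."""
    for key, expansion in SYNONYMS.items():
        q = q.replace(key, expansion)
    return q

print(expand_query("AAA*BBB"))  # -> "AVANT AT ALJUNIED*BBB"
```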

How to search for a word with and without special characters in Solr

We are using StandardTokenizerFactory in Solr, but we face an issue when searching without the special character.
For example, our content contains "What’s the Score?" (with the special character). When we search for "Whats the Score" we don't get the proper result.
Searching the title with and without the special character should both work.
Please suggest which filter we need to use to satisfy both conditions.
If you have a recent version of Solr, try adding solr.WordDelimiterGraphFilterFactory with catenateWords=1 to your analyzer chain.
Starting from What’s, this should create three tokens: What, s, and Whats.
I'm not sure whether ' is in the list of characters the filter uses to concatenate words; in any case, you can add it using the parameter types="characters.txt". A sketch of such a chain follows.
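
As a sketch, such a field type could be registered through the Solr Schema API; the field type name, collection name, and the exact filter chain are assumptions, not a drop-in configuration:

```python
# Sketch: register a text field type whose chain includes
# WordDelimiterGraphFilterFactory with catenateWords=1, via the Schema API.
# Collection name, field type name, and the full chain are hypothetical.
import json
import requests

field_type = {
    "add-field-type": {
        "name": "text_wdgf",
        "class": "solr.TextField",
        "indexAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                # "What’s" -> "What", "s", plus the catenated token "Whats"
                {
                    "class": "solr.WordDelimiterGraphFilterFactory",
                    "catenateWords": "1",
                },
                # index-time graph output must be flattened
                {"class": "solr.FlattenGraphFilterFactory"},
                {"class": "solr.LowerCaseFilterFactory"},
            ],
        },
        "queryAnalyzer": {
            "tokenizer": {"class": "solr.StandardTokenizerFactory"},
            "filters": [
                {
                    "class": "solr.WordDelimiterGraphFilterFactory",
                    "catenateWords": "1",
                },
                {"class": "solr.LowerCaseFilterFactory"},
            ],
        },
    }
}

resp = requests.post(
    "http://localhost:8983/solr/mycollection/schema",
    data=json.dumps(field_type),
    headers={"Content-Type": "application/json"},
)
print(resp.json())
```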

How to do a Solr search including special characters like -, &, etc.?

I need to do a Solr search for a string like BEBIL1407-GREEN with the special character (-), but it ignores the - and searches only for BEBIL1407. I need to search for the whole word. I'm using Solr 4.5.1.
Example Query :
q=BEBIL1407-GREEN&qt=select&start=0&rows=24&fq=clg%3A%222%22&fq=isAproved%3A%22Y%22&fl=id
Your question is about searching for BEBIL1407-GREEN but finding BEBIL1407.
You did not post your schema or your query parser.
By default, Solr uses the standard query parser on the field "text" with field type "text_general".
You can use the Solr analysis screen to test how a word (in real text) is turned into the corresponding tokens in the index.
For "text_general", the word "BEBIL1407-GREEN" becomes two tokens: "bebil1407" and "green".
The standard parser does support escaping of special characters, which would help if your word started with a hyphen (minus sign). But in this case the tokenizer is most likely the reason for finding unexpected documents.
Solution:
You can search with a phrase. In this case "BEBIL1407-GREEN" will also find "BEBIL1407 GREEN"; see the sketch below.
You can use another field type, e.g. one with WhitespaceTokenizer.
Hope this helps; otherwise post your search field and its definition from schema.xml...
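
A minimal sketch of the phrase-query approach in Python with requests; the core name is hypothetical, while the filter queries mirror the query from the question:

```python
# Sketch: send the term as a quoted phrase so "BEBIL1407-GREEN" also
# matches documents tokenized as "bebil1407" followed by "green".
# Core name is hypothetical; fq/fl values mirror the question's query.
import requests

params = {
    "q": '"BEBIL1407-GREEN"',  # quoted -> phrase query, token order preserved
    "start": 0,
    "rows": 24,
    "fq": ['clg:"2"', 'isAproved:"Y"'],
    "fl": "id",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
print(resp.json()["response"]["numFound"])
```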

Solr StandardTokenizer: strange behavior with a wildcard and a punctuation sign together in the same word

I have a problem with the StandardTokenizer in Solr.
If I search for:
text_field:lastname
it finds something.
If I search for:
text_field:last*ame
it finds something.
If I search for:
text_field:lastname;
it also finds something.
But if I search for:
text_field:last*ame;
the search doesn't return anything. Why? Shouldn't StandardTokenizer strip the punctuation sign from the end of the word? Basically, if I use a wildcard and a punctuation sign in a word, the punctuation sign is no longer stripped. Is there a way to strip out punctuation signs even when using wildcards?
Solr does not perform any analysis on the query when you're doing wildcard queries. The term is just used as a wildcard match against the tokens stored for the field. StandardTokenizer splits on word boundaries, and the ; is considered a boundary - which means the indexed tokens do not contain ;, but the query does.
You probably want to remove the ; in your query layer; a sketch follows.
Here is the link to the Solr documentation that further explains why wildcard and other multi-term queries don't undergo analysis.
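
A minimal sketch of that cleanup step in Python; the set of stripped characters is an illustrative assumption:

```python
# Sketch: strip trailing punctuation from the user's term before building
# the wildcard query, since multi-term queries bypass analysis.
# The set of stripped characters is an illustrative assumption.
def clean_wildcard_term(term: str) -> str:
    return term.rstrip(";,.!?:")

print(clean_wildcard_term("last*ame;"))  # -> "last*ame"
```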
