Azure search services issue for white space and wildcard search of special characters - azure-cognitive-search

We have an application that lets users enter anything in the summary field. Users can type any special characters such as #$!#~, including white space, and they want to be able to search on those special characters as well. For example, one of the entries is "test testing **** #### !!!!! ???? # $".
I created a cognitive search index with the analyzer set to standard.lucene, as shown below:
{
"name": "Summary",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"synonymMaps": []
}
When I used the postman query:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "testing",
"searchFields": "Summary",
"count":true
}
I can get the expected result.
If I use the following:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "testing ****",
"searchFields": "Summary",
"count":true
}
I get an "InvalidRequestParameter" error.
If I changed to the following query:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "\"****\"",
"searchFields": "Summary",
"count":true
}
Then I am not getting any results back.
Per this article:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#escaping-special-characters
In order to use any of the search operators as part of the search text, escape the character by prefixing it with a single backslash (\).
Special characters that require escaping include the following:
+ - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
So I need to prefix the special characters with a single backslash, but in my case that doesn't seem to work. Any help will be appreciated!

If you are using the standard Lucene analyzer for indexing, I believe "****" is not stored as a token, because that analyzer breaks text on special characters and discards them.
For fields that need to be searchable on such characters, e.g., the summary field in your example, you need to create a custom analyzer for that field. This document explains how to do that and how to test your analyzer. Once you have built an analyzer that tokenizes the input the way you want, you can reference it in your index definition for the fields that need it, as follows.
...
{
"name": "Summary",
"type": "Edm.String",
"retrievable": true,
"searchable": true,
"analyzer": "custom_analyzer_for_tokenizing_as_is"
},
...
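To see why "****" finds nothing, a rough sketch may help. The Python below is only an approximation, not the actual Lucene implementation: `standard_like_tokens` is a hypothetical stand-in that keeps runs of ASCII letters and digits (roughly what the standard analyzer produces for this input), and `whitespace_tokens` is a hypothetical stand-in for a whitespace tokenizer followed by lowercasing.

```python
import re

def standard_like_tokens(text: str) -> list[str]:
    """Approximation only: keep lowercase runs of letters/digits, drop punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

def whitespace_tokens(text: str) -> list[str]:
    """Approximation only: split on whitespace and lowercase, keeping symbols."""
    return text.lower().split()

summary = "test testing **** #### !!!!! ???? # $"
print(standard_like_tokens(summary))  # only the word-like tokens survive
print(whitespace_tokens(summary))     # the symbol runs are kept as tokens
```

With the first tokenizer there is no "****" token for a query to match, which is consistent with the empty result in the question. To check what a real analyzer emits, the service's Analyze Text (analyzer test) API is the authoritative source.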

I finally resolved this by creating a custom analyzer. The field definition in the index:
{
"name": "FieldName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "specialcharanalyzer",
"synonymMaps": []
},
The analyzer is specified below:
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "specialcharanalyzer",
"tokenizer": "whitespace",
"tokenFilters": [
"lowercase"
],
"charFilters": []
}
],
Then you can use the format specified in this document: https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#special-characters
Special characters that require escaping include the following:
+ - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
For characters that are not in the list above, use the following format for infix search:
"search": "/.*SearchChar.*/",
For example, if you want to search for $, then use the following format:
"search": "/.*$.*/",
For special characters in the list, use this format:
"search" : "/.*\\escapingcharacter.*/",
For example, to search for +, use the following query:
"search" : "/.*\\+.*/",
# is also treated as a character that needs escaping when it appears in the expression.
To search for *, use this format:
"search":"/\\**/",

Related

Null or empty values are not stored in solr

I have a Solr database where I added a string field like this:
{
"add-field": [
{
"name": "string__single_line_text_field__LC",
"type": "string",
"stored": true,
"indexed": true,
"required": true,
"default": ""
}
]
}
I set the field to be required and defined its default value. In my Solr database, this field looks like this:
The problem is that Solr doesn't store my default value, an empty string, when the value is null or empty; the field simply doesn't exist on the document. It stores only non-null, non-empty values. Any idea how to solve this issue?

How to get all solr field names except for multivalued fields?

I'm new to Solr and I'm trying to query for field names, excluding fields with multiValued=true.
So far I have
select?q=*:*&wt=csv&rows=0&facet
which returns all the fields.
Is there a way to modify the query to check if a field is multivalued?
You can retrieve information about all the defined fields through the Schema API. The response will contain a multiValued field set to true if the field is defined as multivalued:
v1 API:
http://localhost:8983/solr/techproducts/schema/fields
v2 API:
http://localhost:8983/api/collections/techproducts/schema/fields
{
"fields": [
{
"indexed": true,
"name": "_version_",
"stored": true,
"type": "long"
},
{
"indexed": true,
"multiValued": true, <----
"name": "cat",
"stored": true,
"type": "string"
},
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
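Once the response is parsed, filtering out the multivalued fields takes a few lines. The following Python sketch uses a hypothetical helper name, `single_valued_field_names`, and works on an already-decoded response rather than calling Solr. It only looks at the `multiValued` key on each field entry, so a field that inherits `multiValued` from its field type would not be detected unless the response includes inherited properties (the Schema API's `showDefaults=true` parameter may help there, depending on your Solr version).

```python
def single_valued_field_names(schema_response: dict) -> list[str]:
    """Return names of fields whose definition does not set multiValued to true."""
    return [
        field["name"]
        for field in schema_response.get("fields", [])
        if not field.get("multiValued", False)
    ]

# Trimmed copy of the sample response shown above.
sample = {
    "fields": [
        {"indexed": True, "name": "_version_", "stored": True, "type": "long"},
        {"indexed": True, "multiValued": True, "name": "cat", "stored": True, "type": "string"},
    ],
    "responseHeader": {"QTime": 1, "status": 0},
}
print(single_valued_field_names(sample))
```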

Azure Cognitive Search Complex object filtering

I have an index in Azure Cognitive Search but can't seem to find the right syntax to query it for what I need.
I have documents that look like the ones below, and I want to pass in a search for "black denim shirt" and have it matched against each item object in the document rather than against the whole document.
I need the match to be confined to individual objects because I don't want the "black" and "denim" from the "black denim shirt" query to match a "black denim jeans" item. The matching (or higher-ranked) result should therefore be Document 2.
Document 1:
{
"id": "Style1",
"itemKeyWords": [
{
"productKeyWords": "shirt,oversized shirt,denim",
"attributeKeyWords": "blue"
},
{
"productKeyWords": "Skinny, denim, jeans",
"attributeKeyWords": "black"
}
]
}
Document 2:
{
"id": "Style2",
"itemKeyWords": [
{
"productKeyWords": "shirt,oversized shirt,denim",
"attributeKeyWords": "black"
},
{
"productKeyWords": "Skinny, denim, jeans",
"attributeKeyWords": "blue"
}
]
}
I have itemKeyWords set up in the index as follows:
{
"name": "itemKeyWords",
"type": "Collection(Edm.ComplexType)",
"fields": [
{
"name": "productKeyWords",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "en.lucene",
"normalizer": null,
"synonymMaps": []
},
{
"name": "attributeKeyWords",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "en.lucene",
"normalizer": null,
"synonymMaps": []
}
]
}
I have made various attempts using this as a guide but can't seem to get the syntax right:
https://learn.microsoft.com/en-gb/azure/search/search-howto-complex-data-types?tabs=portal
Unfortunately, as of today, it is not possible to make "search" requests (queries that rely on the tokenized content) that enforce the requirement to have all matches within a specific entry of a complex object collection. This is only supported for filters right now (as long as the filter does not rely on the search.in function).
I can think of two (less than ideal) workarounds:
Index each entry of the collection as separate documents
Flatten the sub-fields into a single field:
aggregatedFields: "Skinny, denim, jeans. black"
And then issue a query that uses proximity search (to make sure all terms are within a certain distance of each other):
queryType=full&search="black denim jeans"~5
If it's important for you to keep the structured version of the content in the document (attributes and keywords separately), you can still index them alongside the aggregated field for retrieval purposes. You can target one field for matching and return different fields in the response by using searchFields and select:
queryType=full&search="black denim jeans"~3&searchFields=aggregatedFields&select=productKeyWords,attributeKeyWords
or
queryType=full&search=aggregatedFields:"black denim jeans"~3&select=productKeyWords,attributeKeyWords
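As a rough illustration of the two workarounds combined, the Python sketch below splits one source document into one search document per collection entry and adds a flattened text field to each. The function name `flatten_items`, the `parentId` field, and the `<id>_<position>` key scheme are all assumptions made for this example; only `aggregatedFields`, `productKeyWords` and `attributeKeyWords` come from the discussion above.

```python
def flatten_items(doc: dict) -> list[dict]:
    """Emit one search document per itemKeyWords entry with a combined text field."""
    flattened = []
    for index, item in enumerate(doc["itemKeyWords"]):
        flattened.append({
            # Hypothetical key scheme: parent id plus the entry's position.
            "id": f'{doc["id"]}_{index}',
            "parentId": doc["id"],
            "productKeyWords": item["productKeyWords"],
            "attributeKeyWords": item["attributeKeyWords"],
            # Combined field to target with searchFields or fielded search.
            "aggregatedFields": f'{item["productKeyWords"]}. {item["attributeKeyWords"]}',
        })
    return flattened

document_2 = {
    "id": "Style2",
    "itemKeyWords": [
        {"productKeyWords": "shirt,oversized shirt,denim", "attributeKeyWords": "black"},
        {"productKeyWords": "Skinny, denim, jeans", "attributeKeyWords": "blue"},
    ],
}
for flat in flatten_items(document_2):
    print(flat["id"], "->", flat["aggregatedFields"])
```

Because each entry becomes its own document, a match on the combined field can only come from terms within the same item; the trade-off is that results then need to be grouped back by the parent id on the client side.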

Azure Cognitive Search prefix searching as single token

I'm trying to create an Azure Search index with a searchable Name field that should not be tokenized and be treated as a single string.
So if I have two values:
"Total Insurance"
"Invoice Total"
With a search term like this: search=Total*, then only "Total Insurance" should be returned because it starts with "Total".
My assumption was that the 'keyword' analyzer is to be used for this type of search
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
But it doesn't seem to work like that, it doesn't return any results with search=Total*.
Is there a different setup for this type of search?
Something like this is required:
{
"name":"myIndex",
"fields": [
{
"name":"Name",
"type":"Edm.String",
"searchable":true,
"filterable": true,
"retrievable": true,
"sortable": true,
"searchAnalyzer":"keyword",
"indexAnalyzer":"prefixAnalyzer"
}
],
"analyzers": [
{
"name":"prefixAnalyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"keyword_v2",
"tokenFilters":[ "lowercase", "my_edgeNGram" ]
}
],
"tokenFilters": [
{
"name":"my_edgeNGram",
"@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"minGram":3,
"maxGram":7
}
]
}
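To get a feel for what this index-time analyzer would store, here is a small Python approximation. It is a sketch under assumptions, not the service's implementation: `prefix_index_tokens` is a hypothetical helper that treats the whole value as one token (keyword tokenizer), lowercases it, and then emits front edge n-grams between `minGram` and `maxGram` characters long.

```python
def prefix_index_tokens(value: str, min_gram: int = 3, max_gram: int = 7) -> list[str]:
    """Approximate keyword tokenizer + lowercase + front edge n-gram filter."""
    token = value.lower()  # keyword tokenizer: the whole value is a single token
    longest = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, longest + 1)]

print(prefix_index_tokens("Total Insurance"))
print(prefix_index_tokens("Invoice Total"))
```

In this approximation only "Total Insurance" yields a token equal to "total", which is the prefix behaviour asked for. Note that the sketched tokens are lowercase and limited to 3 to 7 characters, so the casing and length of the search term would matter with these settings.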

Possibly broken azure-search tokenizer - PathHierarchyTokenizerV2

Lately, I wanted to take advantage of a field on my search index that uses a custom analyzer with the PathHierarchyTokenizerV2 tokenizer.
This same index used to work, and the custom analyzer did break the text into the correct path segments when using the "Analyzer Test" API.
For example, the text l1/l2/l3 turned into:
l1,
l1/l2,
l1/l2/l3,
At the moment, it seems like this functionality no longer works. Or, am I doing something wrong?
I can reproduce this by creating an index with the following field:
{
"name": "tags",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "categoryPathAnalyzer",
"synonymMaps": []
}
Where categoryPathAnalyzer is defined as:
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "categoryPathAnalyzer",
"tokenizer": "path_hierarchy_v2",
"tokenFilters": [
"lowercase"
],
"charFilters": []
}
The "Analyzer Test" API is called with the following body:
{
"text": "a/b",
"analyzer": "categoryPathAnalyzer"
}
And the result is empty:
{
"@odata.context": "https://x.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01_Preview.AnalyzeResult",
"tokens": []
}
If it matters, this index and calls are all using the latest 2016-09-01-Preview API version.
Thanks for reporting this. We found a bug in the built-in path_hierarchy_v2 tokenizer. The bug has been fixed. Please let us know if the issue persists.
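For anyone verifying the fix, the expected tokens described in the question can be approximated locally. The Python below is a simplified sketch, not the Lucene tokenizer: `path_hierarchy_tokens` is a hypothetical helper that lowercases the input and emits each cumulative path prefix, and it ignores edge cases such as leading delimiters.

```python
def path_hierarchy_tokens(path: str, delimiter: str = "/") -> list[str]:
    """Approximate a path-hierarchy tokenizer followed by a lowercase filter."""
    parts = path.lower().split(delimiter)
    return [delimiter.join(parts[: i + 1]) for i in range(len(parts))]

print(path_hierarchy_tokens("l1/l2/l3"))
print(path_hierarchy_tokens("a/b"))
```

Comparing this output with the "tokens" array returned by the Analyzer Test API for the same text is a quick way to confirm that the tokenizer is behaving as intended again.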
