Possibly broken azure-search tokenizer - PathHierarchyTokenizerV2


Lately, I wanted to take advantage of a field on my search index that uses a custom analyzer with the PathHierarchyTokenizerV2 tokenizer.
This same index used to work: the custom analyzer broke the text into the correct path segments when tested with the "Analyzer Test" API.
For example, the text l1/l2/l3 turned into:
l1,
l1/l2,
l1/l2/l3,
At the moment, this functionality no longer seems to work. Or am I doing something wrong?
I can reproduce this by creating an index with the following field:
{
"name": "tags",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "categoryPathAnalyzer",
"synonymMaps": []
}
Where categoryPathAnalyzer is defined as:
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "categoryPathAnalyzer",
"tokenizer": "path_hierarchy_v2",
"tokenFilters": [
"lowercase"
],
"charFilters": []
}
The "Analyzer Test" API is called with the following body:
{
"text": "a/b",
"analyzer": "categoryPathAnalyzer"
}
And the result is empty:
{
"@odata.context": "https://x.search.windows.net/$metadata#Microsoft.Azure.Search.V2016_09_01_Preview.AnalyzeResult",
"tokens": []
}
If it matters, the index and all calls use the latest 2016-09-01-Preview API version.

Thanks for reporting this. We found a bug in the built-in path_hierarchy_v2 tokenizer. The bug has been fixed. Please let us know if the issue persists.
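
For reference, the token sequence the question expects can be reproduced locally with a small sketch. This is plain Python that only mimics the behavior described above (path prefixes, then lowercasing); it is not the service's implementation, and the function name path_hierarchy_tokens is hypothetical.

```python
def path_hierarchy_tokens(text, delimiter="/", lowercase=True):
    """Return the cumulative path prefixes of `text`, one per path segment.

    Assumption: mirrors the expected output described in the question,
    not the actual path_hierarchy_v2 implementation.
    """
    if lowercase:
        # Stand-in for the "lowercase" token filter in categoryPathAnalyzer.
        text = text.lower()
    parts = text.split(delimiter)
    prefixes = (delimiter.join(parts[:i]) for i in range(1, len(parts) + 1))
    # Skip the empty prefix produced by a leading delimiter.
    return [p for p in prefixes if p]


print(path_hierarchy_tokens("l1/l2/l3"))
print(path_hierarchy_tokens("a/b"))
```

Comparing this list with the "tokens" array returned by the Analyzer Test API is one way to confirm that the tokenizer behaves as expected again.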

Related

How to get all solr field names except for multivalued fields?

I'm new to solr and I'm trying to query for field names excluding fields with multiValued=true.
So far I have
select?q=*:*&wt=csv&rows=0&facet
which returns all the fields.
Is there a way to modify the query to check if a field is multivalued?

You can retrieve information about all the defined fields through the Schema API. The response will contain a multiValued field set to true if the field is defined as multivalued:
v1 API:
http://localhost:8983/solr/techproducts/schema/fields
v2 API:
http://localhost:8983/api/collections/techproducts/schema/fields
{
"fields": [
{
"indexed": true,
"name": "_version_",
"stored": true,
"type": "long"
},
{
"indexed": true,
"multiValued": true, <----
"name": "cat",
"stored": true,
"type": "string"
},
],
"responseHeader": {
"QTime": 1,
"status": 0
}
}
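
Once the Schema API response is parsed, the single-valued field names can be filtered out on the client. The following is a minimal sketch that works on an already-parsed response; the function name single_valued_field_names is hypothetical, and it assumes that a field without an explicit multiValued property is single-valued (a field can also inherit multiValued from its field type, which this sketch does not check).

```python
def single_valued_field_names(schema_response):
    """Return names of fields that are not explicitly marked multiValued.

    Assumption: `schema_response` is the parsed JSON body returned by the
    schema/fields endpoint, shaped like the sample above.
    """
    return [
        field["name"]
        for field in schema_response.get("fields", [])
        if not field.get("multiValued", False)
    ]


sample = {
    "fields": [
        {"indexed": True, "name": "_version_", "stored": True, "type": "long"},
        {"indexed": True, "multiValued": True, "name": "cat",
         "stored": True, "type": "string"},
    ],
    "responseHeader": {"QTime": 1, "status": 0},
}
print(single_valued_field_names(sample))
```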

Azure Cognitive Search Complex object filtering

I have an index in Azure Cognitive Search but can't seem to find the right syntax to query it for what I need.
I have documents that look like the ones below, and I want to pass in a search for "black denim shirt" and have it matched against each item object in the document rather than against the whole document.
The match needs to be confined to a single object, because I don't want the "black" and "denim" from the "black denim shirt" query to match a "black denim jeans" item. Therefore the match (or the higher-ranked result) should be Document 2.
Document 1:
{
"id": "Style1",
"itemKeyWords": [
{
"productKeyWords": "shirt,oversized shirt,denim",
"attributeKeyWords": "blue"
},
{
"productKeyWords": "Skinny, denim, jeans",
"attributeKeyWords": "black"
}
]
}
Document 2:
{
"id": "Style2",
"itemKeyWords": [
{
"productKeyWords": "shirt,oversized shirt,denim",
"attributeKeyWords": "black"
},
{
"productKeyWords": "Skinny, denim, jeans",
"attributeKeyWords": "blue"
}
]
}
I have itemKeyWords set up in the index as follows:
{
"name": "itemKeyWords",
"type": "Collection(Edm.ComplexType)",
"fields": [
{
"name": "productKeyWords",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "en.lucene",
"normalizer": null,
"synonymMaps": []
},
{
"name": "attributeKeyWords",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "en.lucene",
"normalizer": null,
"synonymMaps": []
}
]
}
I have made various attempts using this as a guide but can't seem to get the syntax right:
https://learn.microsoft.com/en-gb/azure/search/search-howto-complex-data-types?tabs=portal

Unfortunately, as of today, it is not possible to make "search" requests (queries that rely on the tokenized content) that enforce the requirement to have all matches within a specific entry of a complex object collection. This is only supported for filters right now (as long as the filter does not rely on the search.in function).
I can think of two (less than ideal) workarounds:
1. Index each entry of the collection as a separate document.
2. Flatten the sub-fields into a single field:
aggregatedFields: "Skinny, denim, jeans. black"
and then issue a query that uses proximity search (to make sure all terms are within a certain distance of each other):
queryType=full&search="black denim jeans"~5
If it's important to keep the structured version of the content in the document (attribute and keywords separately), you can still index those fields alongside the aggregated field for retrieval purposes. By using select and searchFields, you can match against one field while returning different ones in the response:
queryType=full&search="black denim jeans"~3&searchFields=aggregatedFields&select=productKeyWords,attributeKeyWords
or
queryType=full&search=aggregatedFields:"black denim jeans"~3&select=productKeyWords,attributeKeyWords
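
The two workarounds can be combined in a preprocessing step before indexing. The sketch below is only an illustration under stated assumptions: the function name flatten_items, the parentId field, and the "<id>_<index>" key scheme are hypothetical, and the aggregated field simply joins the two sub-fields in the same way as the example value above.

```python
def flatten_items(doc):
    """Emit one flat document per entry of doc["itemKeyWords"].

    Assumption: each output document carries the original sub-fields plus
    an aggregated field used for proximity matching.
    """
    flat_docs = []
    for index, item in enumerate(doc["itemKeyWords"]):
        flat_docs.append({
            "id": f"{doc['id']}_{index}",   # hypothetical key scheme
            "parentId": doc["id"],          # hypothetical back-reference
            "productKeyWords": item["productKeyWords"],
            "attributeKeyWords": item["attributeKeyWords"],
            "aggregatedFields": f"{item['productKeyWords']}. {item['attributeKeyWords']}",
        })
    return flat_docs


document_2 = {
    "id": "Style2",
    "itemKeyWords": [
        {"productKeyWords": "shirt,oversized shirt,denim", "attributeKeyWords": "black"},
        {"productKeyWords": "Skinny, denim, jeans", "attributeKeyWords": "blue"},
    ],
}
for flat in flatten_items(document_2):
    print(flat["id"], "->", flat["aggregatedFields"])
```

With this layout, a proximity query can only match terms that come from the same item, at the cost of having to group the results by parentId afterwards.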

Azure Cognitive Search prefix searching as single token

I'm trying to create an Azure Search index with a searchable Name field that should not be tokenized and be treated as a single string.
So if I have two values:
"Total Insurance"
"Invoice Total"
With a search term like this: search=Total*, then only "Total Insurance" should be returned because it starts with "Total".
My assumption was that the 'keyword' analyzer is to be used for this type of search
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
But it doesn't seem to work like that: search=Total* doesn't return any results.
Is there a different setup for this type of search?

Something like this is required:
{
"name":"myIndex",
"fields": [
{
"name":"Name",
"type":"Edm.String",
"searchable":true,
"filterable": true,
"retrievable": true,
"sortable": true,
"searchAnalyzer":"keyword",
"indexAnalyzer":"prefixAnalyzer"
}
],
"analyzers": [
{
"name":"prefixAnalyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"keyword_v2",
"tokenFilters":[ "lowercase", "my_edgeNGram" ]
}
],
"tokenFilters": [
{
"name":"my_edgeNGram",
"@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"minGram":3,
"maxGram":7
}
]
}
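
To see why this definition gives prefix matching on the whole value, it can help to look at the tokens the index-side analyzer would produce. The sketch below only simulates the keyword tokenizer, the lowercase filter and the edge n-gram filter in plain Python; the function names are hypothetical. It lowercases the query for illustration, which is an assumption, since the keyword search analyzer may leave the query text unchanged.

```python
def edge_ngrams(value, min_gram=3, max_gram=7):
    """Simulate keyword tokenizer + lowercase + edge n-grams for one value.

    The whole value is a single token, so every n-gram starts at the
    first character of the field value.
    """
    token = value.lower()
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def matches_prefix(query, value):
    """Check whether the (lowercased) query equals one of the indexed n-grams."""
    return query.lower() in edge_ngrams(value)


print(edge_ngrams("Total Insurance"))
print(matches_prefix("Total", "Total Insurance"))
print(matches_prefix("Total", "Invoice Total"))
```

Under this simulation, a query shorter than minGram or longer than maxGram matches nothing, so those two values would need to cover the prefix lengths users actually type.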

Azure search services issue for white space and wildcard search of special characters

We have an application that allows users to enter anything in the summary field. They can type any special characters such as #$!#~, including white space, and they have asked to be able to search on those special characters as well. For example, one of the entries is "test testing **** #### !!!!! ???? # $".
I created a cognitive search index with the analyzer set to standard.lucene, as shown below:
{
"name": "Summary",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "standard.lucene",
"synonymMaps": []
}
When I used the following query in Postman:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "testing",
"searchFields": "Summary",
"count":true
}
I can get the expected result.
If I use the following:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "testing ****",
"searchFields": "Summary",
"count":true
}
I got an "InvalidRequestParameter" error.
If I change it to the following query:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "\"****\"",
"searchFields": "Summary",
"count":true
}
Then I don't get any results back.
Per this article:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#escaping-special-characters
In order to use any of the search operators as part of the search text, escape the character by prefixing it with a single backslash (\).
Special characters that require escaping include the following:
+ - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
I need to prefix the special characters with a single backslash, but in my case it doesn't seem to work. Any help will be appreciated!

If you are using the standard Lucene analyzer for indexing, I believe "****" is not stored as a token, because that analyzer breaks words on special characters.
For fields that need to be searchable this way, e.g. the Summary field in your example, you need to create a custom analyzer. The documentation on custom analyzers describes how to do that and how to test your analyzer. Once you have built an analyzer that tokenizes the input the way you want, you can use it in your index definition for the fields that need it, as follows:
...
{
"name": "Summary",
"type": "Edm.String",
"retrievable": true,
"searchable": true,
"analyzer": "custom_analyzer_for_tokenizing_as_is"
},
...

I finally got this resolved by creating a custom analyzer. The field definition in the index:
{
"name": "FieldName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": "specialcharanalyzer",
"synonymMaps": []
},
The analyzer is specified below:
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "specialcharanalyzer",
"tokenizer": "whitespace",
"tokenFilters": [
"lowercase"
],
"charFilters": []
}
],
Then you can use the format specified in this document:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#special-characters
Special characters that require escaping include the following:
+ - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
For characters that are not in the list above, use the following format for infix search:
"search": "/.*SearchChar.*/",
For example, if you want to search for $, then use the following format:
"search": "/.*$.*/",
For special characters in the list, use this format:
"search" : "/.*\\escapingcharacter.*/",
For example, to search for +, use the following query:
"search" : "/.*\\+.*/",
# is also treated as a character that needs escaping when it appears in a statement.
To search for *, use this format:
"search":"/\\**/",

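The escaping rule and the effect of the whitespace tokenizer can be sketched in a few lines of Python. This is only an illustration: the helper names escape_lucene and whitespace_tokens are hypothetical, the character set is the one quoted from the documentation above, and whether a given escaped query actually matches still depends on the analyzer configured on the field.

```python
import json

# Characters listed above as requiring a backslash in full Lucene syntax.
LUCENE_SPECIAL_CHARS = set('+-&|!(){}[]^"~*?:\\/')


def escape_lucene(text):
    """Prefix every listed special character with a single backslash."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL_CHARS else ch for ch in text)


def whitespace_tokens(text):
    """Approximate the whitespace tokenizer followed by the lowercase filter."""
    return text.lower().split()


summary = "test testing **** #### !!!!! ???? # $"
print(whitespace_tokens(summary))

# json.dumps doubles each backslash, which is why the request bodies
# above show two backslashes in front of an escaped character.
print(json.dumps({"queryType": "full", "search": escape_lucene("+"),
                  "searchFields": "Summary"}))
```

Unlike the standard analyzer, this tokenization keeps runs such as **** and #### as tokens of their own, which is what makes them searchable at all.
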
Prioritize search results based on certain parameter in AzureSearch.

I have an index on AzureSearch similar to this one:
"fields": [
{
"name": "key",
"type": "Edm.String",
"filterable": true
},
{
"name": "title",
"type": "Edm.String",
"searchable": true
},
{
"name": "followers",
"type": "Collection(Edm.String)",
"filterable": true
}
]
Here, title is the title of a Post and is text-searchable. followers contains the user ids of the users who are following that particular Post.
I get the currently logged-in userId from the session. When a user does a text search, I want to show the Posts that the current user is following at the top.
Is this achievable in AzureSearch using ScoringProfiles or anything else?

Tag boosting in a scoring profile does exactly that. All you need to do is add a scoring profile like the one below:
{
"scoringProfiles": [
  {
    "name": "personalized",
    "functions": [
    {
      "type": "tag",
      "boost": 2,
      "fieldName": "followers",
      "tag": { "tagsParameter": "follower" }
    }
    ]
  }
  ]
}
Then, at query time, issue a search query that specifies the scoring profile and the parameter used to customize the ranking:
docs?search=some%20post&scoringProfile=personalized&scoringParameter=follower:user_abc
Hope this helps. You can read more about it here:
https://azure.microsoft.com/en-us/blog/personalizing-search-results-announcing-tag-boosting-in-azure-search/
Nate
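
Since the follower value changes per user, the query string would typically be built from the session at request time. The following is a minimal sketch of that step; the function name build_personalized_query is hypothetical, the name:value form of scoringParameter is taken from the example above, and the service URL, API version and authentication are deliberately left out.

```python
from urllib.parse import quote, urlencode


def build_personalized_query(search_text, user_id,
                             profile="personalized", tag_parameter="follower"):
    """Build the relative query string for a tag-boosted search.

    Assumption: the parameter format follows the example above
    (scoringParameter=<name>:<value>); other API versions may differ.
    """
    params = {
        "search": search_text,
        "scoringProfile": profile,
        "scoringParameter": f"{tag_parameter}:{user_id}",
    }
    # quote_via=quote encodes spaces as %20 instead of "+".
    return "docs?" + urlencode(params, quote_via=quote)


print(build_personalized_query("some post", "user_abc"))
```

The colon in the parameter value is percent-encoded as %3A by urlencode, which is equivalent to the unencoded form shown above.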
