Azure Search on text files - azure-cognitive-search

I'm trying to set up Azure Search on a blob container that holds text files.
I have created a storage account, and I'm using Azure Data Lake Storage Gen2 with the data stored in the blob container. I have a single file while I'm testing Azure Search. I have created the index and the data source, but when I try to create the indexer I get:
{"error":{"code":"","message":"Error with data source: The remote server returned an error: (400) Bad Request. Please adjust your data source definition in order to proceed."}}
My file has no headers, contains 16 columns, and is pipe-delimited.
Here is what I have tried for the index:
{
"name" : "test-index",
"fields": [
{ "name": "id", "type": "Edm.String", "key": true, "searchable": false },
{"name":"TransactionId", "type": "Edm.String", "key": false, "searchable": true },
{"name":"TransactionEventId", "type": "Edm.String", "key": false, "searchable": true },
{"name":"EventTypeId", "type": "Edm.String", "key": false, "searchable": true },
{"name":"EventSourceId", "type": "Edm.String", "key": false, "searchable": true },
{"name":"SourceUserId", "type": "Edm.String", "key": false, "searchable": true },
{"name":"SourceRecordId", "type": "Edm.String", "key": false, "searchable": true },
{"name":"SourceDetails", "type": "Edm.String", "key": false, "searchable": true },
{"name":"UserGlobalId", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"CallDistributorKey", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"CreatedDateTime", "type": "Edm.DateTimeOffset", "key": false, "searchable": false },
{"name":"AccountId", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"LobId", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"StartEvent", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"EndEvent", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"OnCall", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"PresenceEventId", "type": "Edm.Int32", "key": false, "searchable": false },
{"name":"EventProcessedUtcTime", "type": "Edm.DateTimeOffset", "key": false, "searchable": false }
]
}
---also tried something much simpler
"fields": [
{ "name": "id", "type": "Edm.String", "key": true, "searchable": false },
{ "name": "content", "type": "Edm.String", "key":false, "retrievable": false , "filterable": false, "sortable": false, "facetable":false, "searchable": true}
]
--datasource
{
"name" : "test-ds",
"type" : "azureblob",
"credentials" : { "connectionString" :"DefaultEndpointsProtocol=https;AccountName=......;AccountKey=..." },
"container" : { "name" : "test" }
}
--indexer
{
"name" : "test-indexer",
"dataSourceName" : "test-ds",
"targetIndexName" : "test-index"
}
---get error
{"error":{"code":"","message":"Error with data source: The remote server returned an error: (400) Bad Request. Please adjust your data source definition in order to proceed."}}
--tried this indexer create as well
{
"name" : "test-indexer",
"dataSourceName" : "test-ds",
"targetIndexName" : "test-index",
"parameters" : { "configuration" : { "parsingMode" : "delimitedText", "delimitedTextDelimiter" : "|" , "delimitedTextHeaders" : "TransactionId,TransactionEventId,EventTypeId,EventSourceId,SourceUserId,SourceRecordId,SourceDetails,UserGlobalId,CallDistributorKey,CreatedDateTime,AccountId,LobId,StartEvent,EndEvent,OnCall,PresenceEventId,EventProcessedUtcTime" } }
}
---get error
{"error":{"code":"","message":"Error with data source: The remote server returned an error: (400) Bad Request. Please adjust your data source definition in order to proceed."}}
Any pointers would be great....

This is an existing issue of API interoperability between Azure Data Lake Storage Gen2 and Blob Storage.
Azure Search uses Blob Storage APIs and these APIs are currently not compatible with hierarchical namespaces. You can disable the hierarchical namespaces feature to enable Azure Search indexing, but you'll lose some Azure Data Lake Storage Gen2 specific features.
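As a minimal sketch, assuming the container lives in a storage account that the indexer can read, the delimited-text indexer definition from the question could be assembled like this. The helper name `build_indexer` and the shortened header list are illustrative assumptions; the configuration keys are the ones used in the question. Building the header string with a join avoids the missing-comma and line-break problems that hand-typed header lists are prone to.

```python
import json

# First few column names from the question's index, in file order (shortened here).
HEADERS = ["TransactionId", "TransactionEventId", "EventTypeId"]


def build_indexer(name, data_source, index, headers, delimiter="|"):
    """Return an indexer definition for a headerless, delimited text blob."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": index,
        "parameters": {
            "configuration": {
                "parsingMode": "delimitedText",
                "delimitedTextDelimiter": delimiter,
                # The headers are passed as one comma-separated string.
                "delimitedTextHeaders": ",".join(headers),
            }
        },
    }


indexer = build_indexer("test-indexer", "test-ds", "test-index", HEADERS)
print(json.dumps(indexer, indent=2))
```

The printed JSON is the request body for creating the indexer; it does not work around the hierarchical namespace limitation described above.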

Related

Filter inside nested ComplexTypes

I'm trying to filter based upon the value of tagdata/tags/tag. Any ideas for me? Basically I just want to select documents where the text of the tag matches a specific string. Thanks!
The filter:
search=*&$filter=tagdata/tags/any(tag: tagdata/tags/tag eq 'text1')
The error:
Invalid expression: The parent value for a property access of a property 'tag' is not a single value. Property access can only be applied to a single value.
I've got a ComplexType definition that looks like this:
{
"name": "tagdata",
"type": "Edm.ComplexType",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "tags",
"type": "Collection(Edm.ComplexType)",
"analyzer": null,
"synonymMaps": [],
"fields": [
{
"name": "tagid",
"type": "Edm.Int64",
"facetable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"analyzer": null,
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
},
{
"name": "tag",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "en.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
}
]
},
The data looks like this:
{
"tags": [
{
"tagid": 83,
"tag": "text1"
},
{
"tagid": 29,
"tag": "text2"
},
{
"tagid": 69,
"tag": "text3"
},
{
"tagid": 115,
"tag": "text4"
}
]
}

This should work:
search=*&$filter=tagdata/tags/any(t: t/tag eq 'text1')
Think of the any lambda expression as a loop over the tags collection, where the identifier before the colon is the loop variable. That variable is of complex type, so you can access its properties using a slash.
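The loop analogy can be made concrete with a small Python sketch. It only mimics the filter's logic on the sample data from the question; the function name `matches` is an illustrative assumption, and nothing here calls the search service.

```python
# Sample document shaped like the data in the question.
doc = {
    "tagdata": {
        "tags": [
            {"tagid": 83, "tag": "text1"},
            {"tagid": 29, "tag": "text2"},
            {"tagid": 69, "tag": "text3"},
            {"tagid": 115, "tag": "text4"},
        ]
    }
}


def matches(document, wanted):
    """Mimic: tagdata/tags/any(t: t/tag eq '<wanted>')."""
    # `t` plays the role of the lambda's range variable: one element of the
    # collection at a time, with its properties read from that element.
    return any(t["tag"] == wanted for t in document["tagdata"]["tags"])


print(matches(doc, "text1"))
print(matches(doc, "missing"))
```

The original filter failed because it read `tag` from the whole `tagdata/tags` collection instead of from the range variable.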

path_hierarchy_v2 not working along with facet field in Azure cognitive search

I am unable to use the path_hierarchy_v2 tokenizer on a facet field, even though analyzing text with the Analyze API tokenizes it into the path hierarchy as expected.
{
"text": "a/b/c",
"analyzer": "my_path_analyzer"
}
this gives:
a, a/b, a/b/c
But when I facet on the field, it does not work.
The result it returns is:
{
"count": 2,
"value": "a/b/c"
}
But I want to get something like this:
{
"count": 2,
"value": "a"
},
{
"count": 1,
"value": "a/b"
},
{
"count": 1,
"value": "a/b/c"
}
This is my field mapping:
{
"name": "hierarchy_field",
"type": "Edm.String",
"facetable": true,
"filterable": true,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": true,
"analyzer": "my_path_analyzer",
"indexAnalyzer": "",
"searchAnalyzer": "",
"synonymMaps": []
}

Apologies for the delay. You may try to break that down during pre-processing and assign each 'tier' of the hierarchy to a different field, as below:
Top_Level_Field="a/b/c"
Next_Level_Field="b/c"
Bottom_Level_Field="c"
Typically, each facet tier can have different values but you need to have a pre-determined hierarchy depth.
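One way to do that pre-processing is sketched below. Note the assumptions: it emits cumulative prefixes (a, a/b, a/b/c), which is the facet output the question asks for, rather than the exact field values listed above, and the field names `level_1`, `level_2`, ... are hypothetical. The depth is fixed up front, as the answer requires.

```python
def hierarchy_tiers(path, depth, separator="/"):
    """Split a path into one value per tier, up to a fixed depth."""
    parts = path.split(separator)
    tiers = {}
    for level in range(1, depth + 1):
        if level <= len(parts):
            # Tier N holds the first N path segments joined back together.
            tiers[f"level_{level}"] = separator.join(parts[:level])
        else:
            # Shallower paths leave the deeper tiers empty.
            tiers[f"level_{level}"] = None
    return tiers


print(hierarchy_tiers("a/b/c", depth=3))
print(hierarchy_tiers("a/b", depth=3))
```

Each tier would then be indexed as its own facetable field, so faceting on one field returns counts for that level only.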

Azure Search Normalized Lowercase Field

I am unable to add a normalized copy of the "Title" field to our search index. Ultimately, I'm trying to use this field for case-insensitive order by. Currently, titles are returned in the following order (with $orderBy=TitleCaseInsensitive):
Abc
Bbc
abc
And instead I want: Abc->abc->Bbc. I have forked the "Title" field out into two fields via a Field Mapping and am then applying a Custom Analyzer with the "lowercase" tokenFilter, to the normalized field. Can someone explain why I am not getting the desired results? Here is the relevant portion of the index definition:
"index":{
"name": "current-local-inventory",
"fields": [
{"name": "TitleCaseInsensitive","indexAnalyzer":"caseInsensitiveAnalyzer","searchAnalyzer":"keyword", "type": "Edm.String","filterable": false, "sortable": true, "facetable": false, "searchable": true},
{"name": "Title", "type": "Edm.String","filterable": true, "sortable": true, "facetable": false, "searchable": true}
],
"analyzers": [
{
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"name":"caseInsensitiveAnalyzer",
"charFilters":[],
"tokenizer":"keyword_v2",
"tokenFilters":["lowercase"]
}
]
},
"indexers":[{
"fieldMappings" : [
{"sourceFieldName" : "Title", "targetFieldName" : "Title" },
{"sourceFieldName" : "Title", "targetFieldName" : "TitleCaseInsensitive" }
]
}]

See my answer in the related post "Azure Search - Accent insensitive analyzer not working when sorting". When you include the lowercase token filter, it only affects search and not sorting. See also the Azure Search User Voice entry "Case-insensitive sorting for string fields".
My suggested workaround as I explain in the related post is to create a forked/shadow property. However, using an analyzer with a lowercase token filter won't help. The only way I could get your example working was to include a copy of your Title property that was already lowercased. Notice that I don't use fieldMapping and I don't use different analyzers for indexing and search like you have in your example.
CREATE INDEX
Create the index. Replace the placeholders wrapped in double curly braces as suitable for your environment.
{
"@odata.context": "https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/$metadata#indexes/$entity",
"@odata.etag": "\"0x8D8761DCBBCCD00\"",
"name": "{{INDEX_NAME}}",
"defaultScoringProfile": null,
"fields": [
{"name": "Id", "type": "Edm.String", "searchable": false, "filterable": true, "retrievable": true, "sortable": true, "facetable": false, "key": true, "indexAnalyzer": null, "searchAnalyzer": null, "analyzer": null, "synonymMaps": [] },
{"name": "TitleCaseInsensitive","indexAnalyzer": null, "searchAnalyzer": null, "analyzer": "caseInsensitiveAnalyzer", "type": "Edm.String","filterable": false, "sortable": true, "facetable": false, "searchable": true},
{"name": "Title", "type": "Edm.String","filterable": true, "sortable": true, "facetable": false, "searchable": true}
],
"scoringProfiles": [],
"corsOptions": null,
"suggesters": [],
"analyzers": [ {
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"name":"caseInsensitiveAnalyzer",
"charFilters":[],
"tokenizer":"keyword_v2",
"tokenFilters":["lowercase"]
}],
"tokenizers": [],
"tokenFilters": [],
"charFilters": [],
"encryptionKey": null
}
UPLOAD
Upload three sample documents.
{
"value": [
{
"@search.action": "mergeOrUpload",
"Id": "1",
"Title": "Abc",
"TitleCaseInsensitive": "abc"
},
{
"@search.action": "mergeOrUpload",
"Id": "2",
"Title": "abc",
"TitleCaseInsensitive": "abc"
},
{
"@search.action": "mergeOrUpload",
"Id": "3",
"Title": "Bbc",
"TitleCaseInsensitive": "bbc"
}
]
}
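The lowercased copy does not have to be typed by hand. A minimal sketch, assuming the documents are prepared on the client before upload: the helper `to_upload_batch` is hypothetical, the field names come from the index above, and `@search.action` is the key name the REST API uses for the per-document action. The last lines show why the shadow field helps: an ordinal sort puts uppercase before lowercase, while sorting on the lowercased value does not.

```python
def to_upload_batch(rows):
    """Build an upload payload that adds a lowercased shadow copy of Title."""
    return {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                "Id": row["Id"],
                "Title": row["Title"],
                # Normalized copy used only for $orderby.
                "TitleCaseInsensitive": row["Title"].lower(),
            }
            for row in rows
        ]
    }


rows = [
    {"Id": "1", "Title": "Abc"},
    {"Id": "2", "Title": "abc"},
    {"Id": "3", "Title": "Bbc"},
]
batch = to_upload_batch(rows)

titles = ["Abc", "Bbc", "abc"]
ordinal = sorted(titles)                 # code-point order: uppercase sorts first
folded = sorted(titles, key=str.lower)   # order by the lowercased value
print(ordinal)
print(folded)
```

Python's `str.lower` is only an approximation of whatever lowercasing the source system applies; the point is that the normalization happens before the document reaches the index.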
QUERY
Then, query with $orderby on your lowercased (normalized) property.
https://{{SEARCH_SVC}}.{{DNS_SUFFIX}}/indexes/{{INDEX_NAME}}/docs?search=*&$count=true&$select=Id,Title,TitleCaseInsensitive&searchMode=all&queryType=full&api-version={{API-VERSION}}&$orderby=TitleCaseInsensitive asc
And you'll get the expected results where Title is sorted in a case-insensitive way.
{
"@odata.context": "https://<your-search-service>.search.windows.net/indexes('dg-test-65526118')/$metadata#docs(*)",
"@odata.count": 3,
"value": [
{
"@search.score": 1.0,
"Id": "2",
"TitleCaseInsensitive": "abc",
"Title": "abc"
},
{
"@search.score": 1.0,
"Id": "1",
"TitleCaseInsensitive": "abc",
"Title": "Abc"
},
{
"@search.score": 1.0,
"Id": "3",
"TitleCaseInsensitive": "bbc",
"Title": "Bbc"
}
]
}
I would love to be corrected with a simple way to accomplish this.

Please check out the Text normalization for case-insensitive filtering, faceting and sorting feature that's in Preview.
You can update your index to use this "normalizer" feature for the fields in which you'd like case-insensitive order-by operations.
You don't need a separate field TitleCaseInsensitive anymore. You can add "normalizer": "lowercase" to the Title field, and $orderBy=Title will sort in the order you'd like, ignoring casing.
The "lowercase" normalizer is pre-defined. If you'd like other filters to be applied, please look at the predefined and custom normalizers.
"index": {
"name": "current-local-inventory",
"fields": [
{"name": "Title", "type": "Edm.String", "filterable": true, "sortable": true, "facetable": false, "searchable": true, "normalizer":"lowercase"}
]
},
"indexers":[{
"fieldMappings" : [
{"sourceFieldName" : "Title", "targetFieldName" : "Title" }
]
}]

Azure Search - unaccent

I'm trying to figure out how to make Azure Search ignore accents. The texts in my application are in Polish. For searchable fields I tried the pl.microsoft and pl.lucene analyzers. Both of them can match singular forms to plural ones. What I can't achieve is ignoring accents. The only way I found (How to ignore accents in Azure Search?) is to use the standardasciifolding.lucene analyzer. It ignores accents, but it doesn't match singular forms to plural ones. Is there any way to combine the two analyzers?
Thanks

I think I found a solution, but I'm not sure if it's the simplest approach or maybe I complicated it too much:
{
"name": "test",
"fields": [
{
"name": "id",
"type": "Edm.String",
"searchable": false,
"filterable": false,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null
},
{
"name": "name",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": false,
"facetable": false,
"key": false,
"indexAnalyzer": null,
"searchAnalyzer": "pl_analyzer",
"analyzer": null
}
],
"scoringProfiles": [],
"defaultScoringProfile": "",
"corsOptions": null,
"suggesters": [],
"analyzers":[
{
"name":"pl_analyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"polish_tokenizer",
"tokenFilters":[ "lowercase", "asciifolding" ]
}],
"tokenizers": [
{
"@odata.type": "#Microsoft.Azure.Search.MicrosoftLanguageStemmingTokenizer",
"name": "polish_tokenizer",
"isSearchTokenizer": true,
"language": "polish"
}
],
"tokenFilters": [],
"charFilters": []
}
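To get a feel for what the "lowercase" and "asciifolding" token filters in that analyzer do to Polish text, here is a rough client-side approximation. It is not the service's implementation: it relies on Unicode decomposition from Python's standard library, plus an explicit mapping for "ł", which has no decomposed form. The function name `fold` is an illustrative assumption.

```python
import unicodedata

# "ł"/"Ł" do not decompose into a base letter plus a mark, so map them by hand.
_EXTRA = str.maketrans({"ł": "l", "Ł": "L"})


def fold(text):
    """Approximate lowercase + ASCII folding for accent-insensitive matching."""
    lowered = text.lower().translate(_EXTRA)
    # NFKD splits accented letters into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", lowered)
    # Dropping the combining marks leaves the unaccented base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(fold("Żółć"))
print(fold("Łódź"))
```

With folding applied to both the indexed terms and the query terms, "Lodz" and "Łódź" reduce to the same token. The Analyze API remains the authoritative way to see what the real analyzer emits.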

How can I rank exact matches higher in azure search

I have an index in azure search that consists of person data like firstname and lastname.
When I search for 3 letter lastnames with a query like
rau&searchFields=LastName
/indexes/customers-index/docs?api-version=2016-09-01&search=rau&searchFields=LastName
The name rau is found but it is quite far at the end.
{
"@odata.context": "myurl/indexes('customers-index')/$metadata#docs(ID,FirstName,LastName)",
"value": [
{
"@search.score": 8.729204,
"ID": "someid",
"FirstName": "xxx",
"LastName": "Liebetrau"
},
{
"@search.score": 8.729204,
"ID": "someid",
"FirstName": "xxx",
"LastName": "Damerau"
},
{
"@search.score": 8.729204,
"ID": "someid",
"FirstName": "xxx",
"LastName": "Rau"
More to the top are names like "Liebetrau","Damerau".
Is there a way to have exact matches at the top?
EDIT
Querying the index definition using the RestApi
GET https://myproduct.search.windows.net/indexes('customers-index')?api-version=2015-02-28-Preview
returned for LastName
"name": "LastName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": "prefix",
"searchAnalyzer": "standard",
"analyzer": null,
"synonymMaps": []
Edit 1
The analyzer definition
"scoringProfiles": [],
"defaultScoringProfile": null,
"corsOptions": null,
"suggesters": [],
"analyzers": [
{
"name": "prefix",
"tokenizer": "standard",
"tokenFilters": [
"lowercase",
"my_edgeNGram"
],
"charFilters": []
}
],
"tokenizers": [],
"tokenFilters": [
{
"name": "my_edgeNGram",
"minGram": 2,
"maxGram": 20,
"side": "back"
}
],
"charFilters": []
Edit 2
In the end, specifying a scoring profile that I use when querying did the trick:
{
"name": "person-index",
"fields": [
{
"name": "ID",
"type": "Edm.String",
"searchable": false,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": true,
"indexAnalyzer": null,
"searchAnalyzer": null,
"analyzer": null
}
,
{
"name": "LastName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"analyzer": "my_standard"
},
{
"name": "PartialLastName",
"type": "Edm.String",
"searchable": true,
"filterable": true,
"retrievable": true,
"sortable": true,
"facetable": true,
"key": false,
"indexAnalyzer": "prefix",
"searchAnalyzer": "standard",
"analyzer": null
}
],
"analyzers":[
{
"name":"my_standard",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "asciifolding" ]
},
{
"name":"prefix",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard_v2",
"tokenFilters":[ "lowercase", "my_edgeNGram" ]
}
],
"tokenFilters":[
{
"name":"my_edgeNGram",
"@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
"minGram":2,
"maxGram":20,
"side": "back"
}
],
"scoringProfiles":[
{
"name":"exactFirst",
"text":{
"weights":{ "LastName":2, "PartialLastName":1 }
}
}
]
}
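A scoring profile only takes effect when the query names it. As a sketch, a query URL against the index above could be built as follows; the service name `myproduct` and the api-version are taken from earlier in the question and are assumptions about the actual deployment, and `build_search_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_search_url(service, index, term, api_version="2016-09-01"):
    """Build a docs query that searches both fields and applies the profile."""
    query = urlencode(
        {
            "api-version": api_version,
            "search": term,
            # Search the exact field and the edge-ngram field together.
            "searchFields": "LastName,PartialLastName",
            # Weights LastName 2 : PartialLastName 1, so exact hits score higher.
            "scoringProfile": "exactFirst",
        }
    )
    return f"https://{service}.search.windows.net/indexes/{index}/docs?{query}"


url = build_search_url("myproduct", "person-index", "rau")
print(url)
```

A document whose LastName is exactly "Rau" matches in both fields, while "Liebetrau" matches only in PartialLastName, which is what pushes the exact match to the top.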

The analyzer "prefix" set on the LastName field produces the following terms for the name Liebetrau: au, rau, trau, etrau, betrau, ebetrau, iebetrau, liebetrau. These are edge n-grams with lengths ranging from 2 to 20, starting from the back of the word, as defined in the my_edgeNGram token filter in your index definition. The analyzer will process other names in the same way.
When you search for the name rau, it matches all names as they all end with those characters. That's why all documents in your result set have the same relevance score.
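The term expansion described above can be reproduced with a few lines of Python. This is an illustration of back edge n-grams with the question's settings (minGram 2, maxGram 20, lowercase first), not the service's tokenizer; the function name is an assumption.

```python
def back_edge_ngrams(term, min_gram=2, max_gram=20):
    """Return the suffixes of `term` with lengths min_gram..max_gram."""
    term = term.lower()  # the analyzer applies the lowercase filter first
    longest = min(max_gram, len(term))
    # term[-n:] is the last n characters, i.e. an n-gram anchored at the back.
    return [term[-n:] for n in range(min_gram, longest + 1)]


for name in ("Liebetrau", "Damerau", "Rau"):
    grams = back_edge_ngrams(name)
    print(name, grams, "rau" in grams)
```

All three names produce the term "rau", so the query term matches each document through an equivalent indexed term and nothing distinguishes the exact match.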
You can test your analyzer configurations using the Analyze API.
To learn more about custom analyzers, see the Azure Search documentation on custom analyzers.
Hope that helps
