We are trying to use a custom analyzer (KeywordAnalyzer) with the Azure Search REST API, api-version 2015-02-28-Preview.
The index definition below is copied exactly from the Microsoft docs.
It works if we set the analyzer type to CustomAnalyzer. However, if we make a single change, switching the analyzer type from CustomAnalyzer to any other analyzer such as KeywordAnalyzer, we get a Bad Request error when creating the index, and the index is not created.
We'd appreciate it if anyone could tell us how to specify an analyzer.
Many thanks
{
  "name": "homes",
  "fields": [
    {
      "name": "Id",
      "type": "Edm.String",
      "key": true,
      "searchable": false
    },
    {
      "name": "IdStd",
      "type": "Edm.String",
      "searchable": true,
      "analyzer": "my_analyzer"
    }
  ],
  "analyzers": [
    {
      "name": "my_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "my_standard_tokenizer",
      "tokenFilters": [
        "my_asciifolding",
        "lowercase"
      ]
    }
  ],
  "tokenizers": [
    {
      "name": "my_standard_tokenizer",
      "@odata.type": "#Microsoft.Azure.Search.StandardTokenizer",
      "maxTokenLength": 20
    }
  ],
  "tokenFilters": [
    {
      "name": "my_asciifolding",
      "@odata.type": "#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
      "preserveOriginal": true
    }
  ]
}
I'm from Azure Search. What's the error message you're seeing together with the BadRequest response code?
Edit:
I reread your question. You are likely specifying the tokenizer and tokenFilters properties on the KeywordAnalyzer; those properties only apply to the CustomAnalyzer. Please let me know if you find the documentation insufficient or confusing, and we'll make it clearer and easier to follow.
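In other words, a predefined analyzer such as KeywordAnalyzer is not defined in the "analyzers" section at all; it is referenced by its predefined name directly on the field. A minimal sketch (field names are illustrative, not from the original question):

```json
{
  "name": "homes",
  "fields": [
    { "name": "Id", "type": "Edm.String", "key": true, "searchable": false },
    { "name": "IdStd", "type": "Edm.String", "searchable": true, "analyzer": "keyword" }
  ]
}
```

The "analyzers", "tokenizers", and "tokenFilters" sections are only needed when defining a CustomAnalyzer; predefined analyzers such as keyword, simple, or whitespace are referenced by name alone.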
Related
I have an index definition that includes a token filter definition and a corresponding custom analyzer definition as shown below.
"suggesters": [],
"scoringProfiles": [],
"defaultScoringProfile": "",
"corsOptions": null,
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "test_soundex",
"tokenizer": "standard_v2",
"tokenFilters": [
"lowercase",
"test_phonetic"
],
"charFilters": []
}
],
"charFilters": [],
"tokenFilters": [
{
"@odata.type": "#Microsoft.Azure.Search.PhoneticTokenFilter",
"name": "test_phonetic",
"encoder": "soundex",
"replace": false
}
],
"tokenizers": [],
When I attempt to create a new index using this definition I get the following error:
Microsoft.Rest.Azure.CloudException: The request is invalid. Details: index : A type named ‘Analyzer’ could not be resolved by the model. When a model is available, each type name must resolve to a valid type.
If I remove the analyzers and tokenFilters elements from the definition the index gets created with no issue. If I remove the custom analyzer definition I get a similar error where "A type named ‘TokenFilter’ could not be resolved by the model".
I’m running with the latest version of the SDK (10.1.0).
For further clarity here is the code that I'm using to create the index. I'm not instantiating the Analyzer directly. It's being created when the Index object is deserialized below.
var text = System.IO.File.ReadAllText(@"index.json");
var indexDefinition = JsonConvert.DeserializeObject<Index>(text);
_searchServiceClient.Indexes.Create(indexDefinition);
I know the index definition is valid as the same JSON works fine for creating the index when submitted using the API via Postman. Any thoughts?
The JSON deserializer needs to instantiate a CustomAnalyzer instead of an Analyzer. The Analyzer class ought to be abstract, but isn't currently. Unless you use the same JSON serializer settings as the SDK itself, you won't be able to successfully serialize and deserialize SDK model classes on your own. This is not actually a supported scenario; you might be able to get it to work, but it's not something we test.
As an aside for folks who might instantiate Analyzer manually by mistake, we're working on a new .NET SDK for Azure Cognitive Search that is currently in preview. In the new library, Analyzer has been renamed to LexicalAnalyzer and its constructor is no longer accessible.
I've built an index with a Custom Analyzer
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "ingram",
"tokenizer": "whitespace",
"tokenFilters": [ "lowercase", "NGramTokenFilter" ],
"charFilters": []
}
],
"tokenFilters": [
{
"@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"name": "NGramTokenFilter",
"minGram": 3,
"maxGram": 8
}
],
I came upon suggesters and was wondering about the pros and cons of these two approaches.
Basically, I'm building a JavaScript autocomplete text box. I need to do partial-text search inside of the search text (e.g., search=ell would match "Hello World").
Azure Search offers two features to enable this depending on the experience you want to give to your users:
- Suggestions: https://learn.microsoft.com/en-us/rest/api/searchservice/suggestions
- Autocomplete: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
Suggestions will return a list of matching documents even with incomplete query terms, and you are right that it can be reproduced with a custom analyzer that uses ngrams. It's just a simpler way to accomplish that (since we took care of setting up the analyzer for you).
Autocomplete is very similar, but instead of returning matching documents, it will simply return a list of completed "terms" that match the incomplete term in your query. This will make sure terms are not duplicated in the autocomplete list (which can happen when using the suggestions API, since as I mentioned above, suggestions return matching documents, rather than a list of terms).
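For example, an autocomplete request is posted to /indexes/{index-name}/docs/autocomplete and references a suggester defined on the index. A sketch of the request body, assuming a suggester named "sg" (the suggester name and query term are illustrative):

```json
{
  "search": "hel",
  "suggesterName": "sg",
  "autocompleteMode": "oneTerm"
}
```

This would return completed terms such as "hello" rather than the matching documents themselves.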
I have an index created in Azure Search service where I have several string fields marked as searchable using Czech - Lucene analyzer. In Czech language we use some accented characters and it is common that people replace accented characters with non-accented when typing. Therefore, for example "Václav" (name) has the same meaning as "Vaclav". In my index, I have few documents with word "Václav" and none with word "Vaclav".
Much as I'd expect Azure Search to return all documents containing the word "Václav" when I search for "Vaclav", that is not the case. I'm wondering if I have to parse the query somehow before sending it to the search engine.
I ran my tests both through the Azure Portal (setting the API version to 2015-02-28-Preview) and through my code using the very latest SDK, Microsoft.Azure.Search 1.1.1.
By default Lucene and Microsoft analyzers for the Czech language don't ignore diacritics. The easiest way to achieve what you want is to use standardasciifolding.lucene analyzer instead. Alternatively, you could build a custom analyzer to add the ASCII folding token filter to the standard analysis chain for Czech. For example:
{
  "name": "example",
  "fields": [
    {
      "name": "id",
      "type": "Edm.String",
      "key": true
    },
    {
      "name": "text",
      "type": "Edm.String",
      "searchable": true,
      "retrievable": true,
      "analyzer": "my_czech_analyzer"
    }
  ],
  "analyzers": [
    {
      "name": "my_czech_analyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "standard",
      "tokenFilters": [
        "lowercase",
        "czech_stop_filter",
        "czech_stemmer",
        "asciifolding"
      ]
    }
  ],
  "tokenFilters": [
    {
      "name": "czech_stop_filter",
      "@odata.type": "#Microsoft.Azure.Search.StopTokenFilter",
      "stopwordsList": "czech"
    },
    {
      "name": "czech_stemmer",
      "@odata.type": "#Microsoft.Azure.Search.StemmerTokenFilter",
      "language": "czech"
    }
  ]
}
We realize that the experience is not optimal now. We’re working to make customizations like this easier.
Let me know if this answers your question.
The suggester in Azure Search has only one search mode: it matches on any word within the field. While this might be appropriate for many applications, it isn't for many others.
Is there any way to configure the suggester so that a match occurs only when the beginning of the field matches?
Many thanks for your assistance.
Consider creating a custom analyzer that at index time generates prefixes of words from your documents:
{
"name":"names",
"fields": [
{ "name":"id", "type":"Edm.String", "key":true, "searchable":false },
{ "name":"partialName", "type":"Edm.String", "searchable":true, "searchAnalyzer":"standard.lucene", "indexAnalyzer":"prefixAnalyzer" }
],
"analyzers": [
{
"name":"prefixAnalyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard",
"tokenFilters":[ "lowercase", "my_edgeNGram" ]
}
],
"tokenFilters": [
{
"name":"my_edgeNGram",
"@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilter",
"minGram":2,
"maxGram":20
}
]
}
Notice that the partialName field uses the standard analyzer for search and the custom prefixAnalyzer for indexing. You can now issue regular search requests with word prefixes as query terms.
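A query sketch posted to /indexes/names/docs/search (the query term is illustrative): because the edge n-gram filter stored prefixes like "va", "vac", "vacl" at index time, a whole-term match on the prefix succeeds at query time.

```json
{
  "search": "vac",
  "searchFields": "partialName"
}
```

This is why the search-side analyzer must not generate n-grams itself; the query term is matched as-is against the stored prefixes.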
You can learn more about the EdgeNGramTokenFilter from our documentation page about Analysis in Azure Search.
Let me know if this helps.
Currently only infix matching is supported in suggestions.
Is it possible to define which part of the text in which of the indexed text fields matches the query?
No. As far as I know, and as far as I can tell from the Jira, no such feature currently exists. You can, of course, attempt to highlight the parts of the text yourself, but that requires you to implement the highlighting and also to reimplement the stemming according to the rules applied by MongoDB.
The whole feature is somewhat complicated, even just consuming it, as can be seen from the corresponding Elasticsearch documentation.
Refer to the MongoDB documentation on highlighting:
db.fruit.aggregate([
  {
    $searchBeta: {
      "search": {
        "path": "description",
        "query": ["variety", "bunch"]
      },
      "highlight": {
        "path": "description"
      }
    }
  },
  {
    $project: {
      "description": 1,
      "_id": 0,
      "highlights": { "$meta": "searchHighlights" }
    }
  }
])
I'm afraid that solution applies only to MongoDB Atlas at the moment, @LF00.