Azure Search language analyzers and diacritics

We're setting up an Azure Search index, and we're using language analyzers, which seem to work great (splitting words, adding stems, etc.). However, we have an issue with diacritics (accents).
In Dutch, a patient is written as patiënt. When adding text containing patiënt to a field set to nl.microsoft, the analyzer also adds a token for patient. So if I search for patient (without the ë), it also finds this document.
The problem arises when the situation is reversed. If someone types patient in a document (because they are too lazy to add the ë), the tokenizer doesn't add a patiënt token. When someone now searches for patiënt, the document is not found.
What is the proper solution for this? I would like it not to matter whether I add the diacritics in the search text or not. I've been looking at custom analyzers to remove diacritics altogether, but they don't seem to play nice with language analyzers.
To clarify: I'm looking for a solution for all cases with diacritics, not only for this specific word.
current field definition:
{
  "name": "Contents_nlnl",
  "type": "Edm.String",
  "facetable": false,
  "filterable": false,
  "key": false,
  "retrievable": false,
  "searchable": true,
  "sortable": false,
  "analyzer": "nl.microsoft",
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
}
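For context, the custom-analyzer route mentioned above might look something like the sketch below (the analyzer name is hypothetical): the asciifolding token filter strips diacritics, but a custom analyzer replaces the language analyzer entirely, so the Dutch stemming that nl.microsoft provides is lost, which is presumably why the two don't play nice together.
"analyzers": [
  {
    "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
    "name": "nl_folding",
    "tokenizer": "standard_v2",
    "tokenFilters": [ "lowercase", "asciifolding" ],
    "charFilters": []
  }
]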

You could use a synonym map with explicit mapping:
https://learn.microsoft.com/en-us/azure/search/search-synonyms#explicit-mapping
See the example below. Note the direction of the rule: since nl.microsoft already indexes patiënt under the unaccented form patient as well, rewriting the accented query term to the unaccented one makes a search for patiënt match both spellings.
{
  "name": "patient",
  "format": "solr",
  "synonyms": "patiënt => patient"
}
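For the map to take effect, it also has to be attached to the searchable field via its synonymMaps attribute. A minimal sketch of the updated field definition, assuming the map above was created under the name patient (a field can reference at most one synonym map, and rules for other accented words can be added to the same map, one rule per line):
{
  "name": "Contents_nlnl",
  "type": "Edm.String",
  "searchable": true,
  "analyzer": "nl.microsoft",
  "synonymMaps": [ "patient" ]
}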

Related

How to use Split Skill in Azure Cognitive Search?

I am new to Azure Cognitive Search. I have a docx file stored in Azure Blob Storage. I am using #Microsoft.Skills.Text.SplitSkill to split the document into multiple pages (chunks). But when I index the output of this skill, I get the entire docx file content. How do I return the "pages" from the SplitSkill so that the user sees the portion of the original document that was found by their search, instead of the entire document?
Please assist me. Thank you in advance.
The split skill allows you to split text into smaller chunks/pages that can then be processed by additional cognitive skills.
Here is what a minimalistic skillset that does splitting and translation may look like:
"skillset": [
{
"#odata.type": "#Microsoft.Skills.Text.SplitSkill",
"textSplitMode": "pages",
"maximumPageLength": 1000,
"defaultLanguageCode": "en",
"inputs": [
{
"name": "text",
"source": "/document/content"
},
{
"name": "languageCode",
"source": "/document/language"
}
],
"outputs": [
{
"name": "textItems",
"targetName": "mypages"
}
]
},
{
"#odata.type": "#Microsoft.Skills.Text.TranslationSkill",
"name": "#2",
"description": null,
"context": "/document/mypages/*",
"defaultFromLanguageCode": null,
"defaultToLanguageCode": "es",
"suggestedFrom": "en",
"inputs": [
{
"name": "text",
"source": "/document/mypages/*"
}
],
"outputs": [
{
"name": "translatedText",
"targetName": "translated_text"
}
]
}
]
Note that the split skill generates a collection of text elements under the "/document/mypages" node in the enrichment tree. Also note that by providing the context "/document/mypages/*" to the translation skill, we are telling the translation skill to perform translation on each page.
I should point out that documents will still be indexed at the document level, though. Skillsets are not really built to change the cardinality of the index. That said, a workaround may be to project each of the pages as separate elements into a knowledge store, and then create a separate index that is actually focused on indexing each page.
Learn more about the knowledge store projections here:
https://learn.microsoft.com/en-us/azure/search/knowledge-store-concept-intro
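As a rough sketch, such a projection could be added to the skillset definition along these lines (the container name is hypothetical and the connection string is elided):
"knowledgeStore": {
  "storageConnectionString": "<storage-connection-string>",
  "projections": [
    {
      "tables": [],
      "objects": [
        {
          "storageContainer": "pages",
          "source": "/document/mypages/*"
        }
      ],
      "files": []
    }
  ]
}
Each page then lands as a separate JSON blob in the pages container, which a second data source and indexer can pick up into a page-level index.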

Extend Azure Search fields with Cosmos DB indexer

I have multiple Cosmos DB document types which are aggregated into a single Azure Search index.
For each document type I have a data source + indexer pair. It is an autogenerated pair that maps Cosmos DB document fields into the index.
I ran into an issue: I generated a new indexer + data source for a new document type. I see that the indexer completed the indexing, but my index does not contain the new fields. And I can't even add a new field manually to hold the data added by the indexer.
Based on the docs it should be possible to add a new field without rebuilding the index, but it's not clear how exactly this can be done.
What problem are you seeing when you try to add the new index field manually? You should be able to add a new field to the index definition via REST or the portal interface.
Via portal: Navigate to your search service, select the appropriate index from the list, click on the "Fields" tab and click "Add field". A new field should show up at the bottom of the index grid and you should be able to edit its attributes/name/type.
Via REST: You have the right link to the update operation. You can do a GET on the existing index definition and add your new field to the "fields" array before using PUT to update. Your new index field JSON should be similar to the other fields in the index, for example:
"fields": [
...,
{
"name": "new field",
"type": "Edm.String",
"facetable": false,
"filterable": false,
"key": false,
"retrievable": true,
"searchable": true,
"sortable": false,
"analyzer": "standard.lucene",
"indexAnalyzer": null,
"searchAnalyzer": null,
"synonymMaps": [],
"fields": []
}
]
Here's a doc reference on the index definition JSON.
Be sure to set the index field attributes (filterable, sortable, searchable, ...) when you create the field, as not all attributes can be added without rebuilding the index. Only retrievable, searchAnalyzer, and synonymMaps can be modified on an existing field.
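Put together, the REST round trip might look like the sketch below (service name, index name, and api-version are placeholders; note the PUT body must be the full index definition from the GET response, not just the new field):
GET https://<service>.search.windows.net/indexes/<index-name>?api-version=2020-06-30
api-key: <admin-api-key>

PUT https://<service>.search.windows.net/indexes/<index-name>?api-version=2020-06-30
Content-Type: application/json
api-key: <admin-api-key>

{
  "name": "<index-name>",
  "fields": [
    ...existing fields from the GET response...,
    { "name": "newField", "type": "Edm.String", "retrievable": true, "searchable": true }
  ]
}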

Differences between Suggesters and NGram

I've built an index with a Custom Analyzer
"analyzers": [
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "ingram",
"tokenizer": "whitespace",
"tokenFilters": [ "lowercase", "NGramTokenFilter" ],
"charFilters": []
}
],
"tokenFilters": [
{
"#odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"name": "NGramTokenFilter",
"minGram": 3,
"maxGram": 8
}
],
I came upon suggesters and was wondering what the pros/cons were between these two approaches.
Basically, I'm building a JavaScript autocomplete text box. I need to do partial text search inside the search text (e.g. search=ell would match "Hello World").
Azure Search offers two features to enable this, depending on the experience you want to give your users:
- Suggestions: https://learn.microsoft.com/en-us/rest/api/searchservice/suggestions
- Autocomplete: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
Suggestions will return a list of matching documents even with incomplete query terms, and you are right that it can be reproduced with a custom analyzer that uses ngrams. It's just a simpler way to accomplish that (since we took care of setting up the analyzer for you).
Autocomplete is very similar, but instead of returning matching documents, it will simply return a list of completed "terms" that match the incomplete term in your query. This will make sure terms are not duplicated in the autocomplete list (which can happen when using the suggestions API, since as I mentioned above, suggestions return matching documents, rather than a list of terms).
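Both of these APIs require a suggester to be defined in the index; the service then builds the necessary internal structures for you, instead of you maintaining an ngram analyzer yourself. A minimal sketch (the source field names are hypothetical):
"suggesters": [
  {
    "name": "sg",
    "searchMode": "analyzingInfixMatching",
    "sourceFields": [ "title", "description" ]
  }
]
Queries then go to the suggest or autocomplete endpoints, for example GET /indexes/<index>/docs/autocomplete?search=ell&suggesterName=sg&api-version=2020-06-30.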

Text inside entities in Draft.js

I've been playing with the Entity system in Draft.js. One limitation I see is that entities have to correspond with a range of text in the content they are inserted into. I was hoping I could make a zero-length entity which would have a display based on the data in the entity rather than the text-content in the block. Is this possible?
This is possible when you have a whole block. As you can see in the code example, this serialised blockMap contains a block with no text, but the character list has one entry with an entity attached to it. There is also some discussion going on regarding adding metadata to a block; see https://github.com/facebook/draft-js/issues/129
"blockMap": {
"80sam": {
"key": "80sam",
"type": "sticker",
"text": "",
"characterList": [
{
"style": [],
"entity": "1"
}
],
"depth": 0
},
},

Highlight matches in MongoDB full text search

Is it possible to define which part of the text in which of the indexed text fields matches the query?
No. As far as I know, and as far as I can tell from the Jira, no such feature currently exists. You can, of course, attempt to highlight the matching parts of the text yourself, but that requires you to implement the highlighting and also to reimplement the stemming according to the rules applied by MongoDB.
The whole feature is somewhat complicated, even just consuming it, as can be seen from the respective Elasticsearch documentation.
Refer to the MongoDB doc on Highlighting:
db.fruit.aggregate([
  {
    $searchBeta: {
      "search": {
        "path": "description",
        "query": ["variety", "bunch"]
      },
      "highlight": {
        "path": "description"
      }
    }
  },
  {
    $project: {
      "description": 1,
      "_id": 0,
      "highlights": { "$meta": "searchHighlights" }
    }
  }
])
I'm afraid that solution applies only to MongoDB Atlas at the moment, @LF00.
