Azure Search - basic search in Czech language - azure-cognitive-search

I have an index created in Azure Search service where I have several string fields marked as searchable using Czech - Lucene analyzer. In Czech language we use some accented characters and it is common that people replace accented characters with non-accented when typing. Therefore, for example "Václav" (name) has the same meaning as "Vaclav". In my index, I have few documents with word "Václav" and none with word "Vaclav".
As much as I'd expect that Azure Search would return all documents containing word "Václav" when I search for "Vaclav", it is not the case. I'm wondering if I have to parse the query somehow before sending to the search engine.
I ran my tests both thru Azure Portal (setting API version to 2015-02-28-Preview) and thru my code using the very latest SDK Microsoft.Azure.Search 1.1.1.

By default Lucene and Microsoft analyzers for the Czech language don't ignore diacritics. The easiest way to achieve what you want is to use standardasciifolding.lucene analyzer instead. Alternatively, you could build a custom analyzer to add the ASCII folding token filter to the standard analysis chain for Czech. For example:
{
"name":"example",
"fields":[
{
"name":"id",
"type":"Edm.String",
"key":true
},
{
"name":"text",
"type":"Edm.String",
"searchable":true,
"retrievable":true,
"analyzer":"my_czech_analyzer"
}
],
"analyzers":[
{
"name":"my_czech_analyzer",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard",
"tokenFilters":[
"lowercase",
"czech_stop_filter",
"czech_stemmer",
"asciifolding"
]
}
],
"tokenFilters":[
{
"name":"czech_stop_filter",
"#odata.type":"#Microsoft.Azure.Search.StopTokenFilter",
"stopwords_list":"_czech_"
},
{
"name":"czech_stemmer",
"#odata.type":"#Microsoft.Azure.Search.StemmerTokenFilter",
"language":"czech"
}
]
}
We realize that the experience is not optimal now. We’re working to make customizations like this easier.
Let me know if this answers your question

Related

Alfresco 7: Search both in the version store and the content store

I have Alfresco 7.0 community edition, with search services version 2.0.2.
My need is to search for all documents having some metadata values, both in the version store and the content store.
I've tryied to search against the versionStore from public api with the following body
{
"query": {
"language": "afts",
"query": "TYPE:\"myc:projects\" AND myc:prop:\"pippo\""
},
"paging": {
"maxItems": 100,
"skipCount": 0
},
"include": [
"allowableOperations",
"properties"
],
"scope": {
"locations": "versions"
}
}
but I get http status 500.
If I try to search from Node Browser I receive this error message No solr query support for store workspace://version2Store.
I also tried the ligthweigth store (which is the difference?)
Is it possible to search with lucene, AFTS against the versions? Do I need to enable some property in alfresco-global.properties?
I've also seen this question and it's seems to me they can do searches.
Thanks a lot.
I've been able to handle both types of searches doing a CMIS query, but I had to write a module that handle that type of reseach (the form, results to be joined and displayed...).
Versioned nodes have a property which point to the noreRef of the current node.

Differences between Suggesters and NGram

I've built an index with a Custom Analyzer
"analyzers": [
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "ingram",
"tokenizer": "whitespace",
"tokenFilters": [ "lowercase", "NGramTokenFilter" ],
"charFilters": []
}
],
"tokenFilters": [
{
"#odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"name": "NGramTokenFilter",
"minGram": 3,
"maxGram": 8
}
],
I came upon Suggesters and was wondering what the pros/cons were between these 2 approaches.
Basically, I'm doing an JavaScript autocomplete text box. I need to do partial text search inside of the search text (i.e. search=ell would match on "Hello World".
Azure Search offers two features to enable this depending on the experience you want to give to your users:
- Suggestions: https://learn.microsoft.com/en-us/rest/api/searchservice/suggestions
- Autocomplete: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
Suggestions will return a list of matching documents even with incomplete query terms, and you are right that it can be reproduced with a custom analyzer that uses ngrams. It's just a simpler way to accomplish that (since we took care of setting up the analyzer for you).
Autocomplete is very similar, but instead of returning matching documents, it will simply return a list of completed "terms" that match the incomplete term in your query. This will make sure terms are not duplicated in the autocomplete list (which can happen when using the suggestions API, since as I mentioned above, suggestions return matching documents, rather than a list of terms).

Using search.in with all

Follwing statement find all profiles that has Facebook or twitter and this works:
$filter=SocialAccounts/any(x: search.in(x, 'Facebook,Twitter'))
But I cant find any samples for finding all that has both Facebook and twitter. I tried:
$filter=SocialAccounts/all(x: search.in(x, 'Facebook,Twitter'))
But this is not valid query.
Azure Search does not support the type of ‘all’ filter that you’re looking for. Using search.in with ‘all’ would be equivalent to using OR, but Azure Search can only handle AND in the body of an ‘all’ lambda (which is equivalent to OR in the body of an ‘any’ lambda).
You might try a workaround like this:
$filter=tags/any(t: t eq 'Facebook') and tags/any(t: t eq 'Twitter')
However, this isn't actually equivalent to using all with search.in. The query as expressed using all is matching documents where every social account is strictly either Facebook or Twitter. If any other social account is present, the document won’t match. The workaround doesn’t have this property. A document must have at least Facebook and Twitter in order to match, but not exclusively those. This is certainly a valid scenario; it just isn't the same as using all with search.in, which was the original question.
No matter how you try to rewrite the query, you won’t be able to express an equivalent to the all query. This is a limitation due to the way Azure Search stores collections of strings and other primitive types in the inverted index.
Please vote on user voice to help prioritize:
https://feedback.azure.com/forums/263029-azure-search/suggestions/37166749-efficient-way-to-express-a-true-all
A possible workaround is to use the new Complex Types feature, which does allow more expressive filters inside lambda expressions. For example, if you model tags as objects with a single value property instead of as a collection of strings, you should be able to execute a filter like this:
$filter=tags/all(t: search.in(t/value, 'Facebook,Twitter'))
In the REST API, you'd define tags like this:
{
"name": "myindex",
"fields": [
...
{
"name": "tags",
"type": "Collection(Edm.ComplexType)",
"fields": [
{ "name": "value", "type": "Edm.String", "filterable": true }
]
}
]
}
Note that this feature is in preview at the time of this writing, but will be generally available (and publicly documented) soon.

Azure Search Custom Analyzer

We are trying to use a custom analyzer (KeywordAnalyzer) using Azure Search Rest api-version: 2015-02-28-preview.
The Index definition code you see below is copied exactly from Microsoft docs.
This works if we put the Analyzer Type to CustomAnalyzer. However, if we make a single change by changing the analyzer type from CustomAnalyzer to any other analyzer such as KeywordAnalyzer, you get a Bad Request error when creating the Index and the Index is not created.
Would appreciate if anyone coud tell us how we can specify an Analyzer.
Many thanks
{
"name":"homes",
"fields":[
{
"name":"Id",
"type":"Edm.String",
"key":true,
"searchable":false},
{
"name":"IdStd",
"type":"Edm.String",
"searchable":true,
"analyzer":"my_analyzer"}
],
"analyzers":[
{
"name":"my_analyzer",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"my_standard_tokenizer",
"tokenFilters":[
"my_asciifolding",
"lowercase"
]
}
],
"tokenizers":[
{
"name":"my_standard_tokenizer",
"#odata.type":"#Microsoft.Azure.Search.StandardTokenizer",
"maxTokenLength":20}
],
"tokenFilters":[
{
"name":"my_asciifolding",
"#odata.type":"#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
"preserveOriginal":true}
]
}
I'm from Azure Search. What's the error message you're seeing together with the BadRequest response code?
Edit:
I reread you question. Potentially you are specifying the tokenizer and tokenFilter properties for the KeywordAnalyzer. These properties only apply to the CustomAnalyzer. Please let me know if you find the documentation insufficient or confusing. We'll make sure to make it more clear and easier to follow.

Azure Search Suggester

The suggester in Azure Search has only 1 SearchMode and that is it will match on any word within the field. Although this might be appropriate for many applications, it also is not for many others.
Is there any way we can configure the suggester so that a match occurs only when the beginning of the field is a match?
Many thanks for your assistance.
Consider creating a custom analyzer that at index time generates prefixes of words from your documents:
{
"name":"names",
"fields": [
{ "name":"id", "type":"Edm.String", "key":true, "searchable":false },
{ "name":"partialName", "type":"Edm.String", "searchable":true, "searchAnalyzer":"standard", "indexAnalyzer":"prefixAnalyzer" }
],
"analyzers": [
{
"name":"prefixAnalyzer",
"#odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard",
"tokenFilters":[ "lowercase", "my_edgeNGram" ]
}
],
"tokenFilters": [
{
"name":"my_edgeNGram",
"#odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilter",
"minGram":2,
"maxGram":20
}
]
}
Notice the partialName field uses the standard analyzer for search and the custom (prefixAnalyzer) analyzer for indexing. You can now issue regular Search requests with prefixes of words as query terms.
You can learn more about the EdgeNGramTokenFilter from our documentation page about Analysis in Azure Search.
Let me know if this helps.
Currently only infix matching is supported in suggestions.

Resources