Azure Search Suggester - azure-cognitive-search

The suggester in Azure Search has only one SearchMode: it matches on any word within the field. Although this is appropriate for many applications, it is not for many others.
Is there any way we can configure the suggester so that a match occurs only when the beginning of the field is a match?
Many thanks for your assistance.

Consider creating a custom analyzer that at index time generates prefixes of words from your documents:
{
"name":"names",
"fields": [
{ "name":"id", "type":"Edm.String", "key":true, "searchable":false },
{ "name":"partialName", "type":"Edm.String", "searchable":true, "searchAnalyzer":"standard", "indexAnalyzer":"prefixAnalyzer" }
],
"analyzers": [
{
"name":"prefixAnalyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard",
"tokenFilters":[ "lowercase", "my_edgeNGram" ]
}
],
"tokenFilters": [
{
"name":"my_edgeNGram",
"@odata.type":"#Microsoft.Azure.Search.EdgeNGramTokenFilter",
"minGram":2,
"maxGram":20
}
]
}
Notice the partialName field uses the standard analyzer for search and the custom (prefixAnalyzer) analyzer for indexing. You can now issue regular Search requests with prefixes of words as query terms.
You can learn more about the EdgeNGramTokenFilter from our documentation page about Analysis in Azure Search.
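To see why this gives prefix matching, here is a rough Python sketch of what an edge n-gram filter with minGram 2 and maxGram 20 emits at index time. It is only an illustration, not the service's implementation: the function names `edge_ngrams` and `index_terms` are made up, and the regex split is a crude stand-in for the standard tokenizer.

```python
import re

def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the leading substrings of token with lengths min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]

def index_terms(text, min_gram=2, max_gram=20):
    """Approximate the analyzer chain: split into words, lowercase, emit prefixes."""
    terms = []
    for word in re.findall(r"\w+", text.lower()):
        terms.extend(edge_ngrams(word, min_gram, max_gram))
    return terms

terms = index_terms("Azure Search")
print(terms)
```

A query term such as "sea" is itself one of the indexed terms, so it matches, while "earch" is never emitted because only leading substrings are indexed.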
Let me know if this helps.

Currently only infix matching is supported in suggestions.

Related

Differences between Suggesters and NGram

I've built an index with a Custom Analyzer
"analyzers": [
{
"@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "ingram",
"tokenizer": "whitespace",
"tokenFilters": [ "lowercase", "NGramTokenFilter" ],
"charFilters": []
}
],
"tokenFilters": [
{
"@odata.type": "#Microsoft.Azure.Search.NGramTokenFilterV2",
"name": "NGramTokenFilter",
"minGram": 3,
"maxGram": 8
}
],
I came upon Suggesters and was wondering what the pros/cons were between these 2 approaches.
Basically, I'm building a JavaScript autocomplete text box. I need to do partial text search inside the searched text (i.e. search=ell would match on "Hello World").
Azure Search offers two features to enable this depending on the experience you want to give to your users:
- Suggestions: https://learn.microsoft.com/en-us/rest/api/searchservice/suggestions
- Autocomplete: https://learn.microsoft.com/en-us/rest/api/searchservice/autocomplete
Suggestions will return a list of matching documents even with incomplete query terms, and you are right that it can be reproduced with a custom analyzer that uses ngrams. It's just a simpler way to accomplish that (since we took care of setting up the analyzer for you).
Autocomplete is very similar, but instead of returning matching documents, it will simply return a list of completed "terms" that match the incomplete term in your query. This will make sure terms are not duplicated in the autocomplete list (which can happen when using the suggestions API, since as I mentioned above, suggestions return matching documents, rather than a list of terms).
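For comparison, this is roughly what the n-gram analyzer from the question produces at index time (whitespace tokenizer, lowercase, n-grams of length 3 to 8). It is a minimal Python sketch with made-up function names, not the actual Lucene filter, but it shows why search=ell can match "Hello World".

```python
def ngrams(token, min_gram=3, max_gram=8):
    """Return every substring of token whose length is between min_gram and max_gram."""
    out = []
    for size in range(min_gram, max_gram + 1):
        for start in range(0, len(token) - size + 1):
            out.append(token[start:start + size])
    return out

def index_terms(text, min_gram=3, max_gram=8):
    """Approximate the analyzer chain: whitespace split, lowercase, n-grams."""
    terms = set()
    for word in text.lower().split():
        terms.update(ngrams(word, min_gram, max_gram))
    return terms

terms = index_terms("Hello World")
print(sorted(terms))
```

Because "ell" is one of the indexed terms for "Hello", an infix query matches. The price is a larger index, since every word contributes many terms.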

Azure Search - basic search in Czech language

I have an index created in the Azure Search service where I have several string fields marked as searchable using the Czech - Lucene analyzer. In the Czech language we use some accented characters, and it is common for people to replace accented characters with non-accented ones when typing. Therefore, for example, "Václav" (a name) has the same meaning as "Vaclav". In my index, I have a few documents with the word "Václav" and none with the word "Vaclav".
As much as I'd expect that Azure Search would return all documents containing word "Václav" when I search for "Vaclav", it is not the case. I'm wondering if I have to parse the query somehow before sending to the search engine.
I ran my tests both through the Azure Portal (setting API version to 2015-02-28-Preview) and through my code using the very latest SDK, Microsoft.Azure.Search 1.1.1.
By default Lucene and Microsoft analyzers for the Czech language don't ignore diacritics. The easiest way to achieve what you want is to use standardasciifolding.lucene analyzer instead. Alternatively, you could build a custom analyzer to add the ASCII folding token filter to the standard analysis chain for Czech. For example:
{
"name":"example",
"fields":[
{
"name":"id",
"type":"Edm.String",
"key":true
},
{
"name":"text",
"type":"Edm.String",
"searchable":true,
"retrievable":true,
"analyzer":"my_czech_analyzer"
}
],
"analyzers":[
{
"name":"my_czech_analyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"standard",
"tokenFilters":[
"lowercase",
"czech_stop_filter",
"czech_stemmer",
"asciifolding"
]
}
],
"tokenFilters":[
{
"name":"czech_stop_filter",
"@odata.type":"#Microsoft.Azure.Search.StopTokenFilter",
"stopwords_list":"_czech_"
},
{
"name":"czech_stemmer",
"@odata.type":"#Microsoft.Azure.Search.StemmerTokenFilter",
"language":"czech"
}
]
}
We realize that the experience is not optimal now. We’re working to make customizations like this easier.
Let me know if this answers your question.
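The effect of ASCII folding can be approximated in a few lines of Python, which may help when checking what the index should contain. This is only a sketch using Unicode decomposition; the function name `ascii_fold` is made up, and the real asciifolding filter covers more characters than this does (letters with no decomposition, such as "ø", pass through unchanged here).

```python
import unicodedata

def ascii_fold(text):
    """Strip diacritics by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

folded = ascii_fold("Václav")
print(folded)
```

When the same folding is applied to both the indexed text and the query, "Václav" and "Vaclav" end up as the same term.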

Azure Search Custom Analyzer

We are trying to use a custom analyzer (KeywordAnalyzer) using Azure Search Rest api-version: 2015-02-28-preview.
The Index definition code you see below is copied exactly from Microsoft docs.
This works if we put the Analyzer Type to CustomAnalyzer. However, if we make a single change by changing the analyzer type from CustomAnalyzer to any other analyzer such as KeywordAnalyzer, you get a Bad Request error when creating the Index and the Index is not created.
We would appreciate it if anyone could tell us how we can specify an Analyzer.
Many thanks
{
"name":"homes",
"fields":[
{
"name":"Id",
"type":"Edm.String",
"key":true,
"searchable":false},
{
"name":"IdStd",
"type":"Edm.String",
"searchable":true,
"analyzer":"my_analyzer"}
],
"analyzers":[
{
"name":"my_analyzer",
"@odata.type":"#Microsoft.Azure.Search.CustomAnalyzer",
"tokenizer":"my_standard_tokenizer",
"tokenFilters":[
"my_asciifolding",
"lowercase"
]
}
],
"tokenizers":[
{
"name":"my_standard_tokenizer",
"@odata.type":"#Microsoft.Azure.Search.StandardTokenizer",
"maxTokenLength":20}
],
"tokenFilters":[
{
"name":"my_asciifolding",
"@odata.type":"#Microsoft.Azure.Search.AsciiFoldingTokenFilter",
"preserveOriginal":true}
]
}
I'm from Azure Search. What's the error message you're seeing together with the BadRequest response code?
Edit:
I reread your question. Potentially you are specifying the tokenizer and tokenFilters properties for the KeywordAnalyzer. These properties only apply to the CustomAnalyzer. Please let me know if you find the documentation insufficient or confusing. We'll make sure to make it clearer and easier to follow.
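A quick local check along these lines could catch the problem before the index definition is sent. This is a hypothetical Python helper, not part of any SDK: the function name is made up, it assumes the type discriminator property is `@odata.type`, and the KeywordAnalyzer type string is simply the one implied by the question.

```python
CUSTOM_TYPE = "#Microsoft.Azure.Search.CustomAnalyzer"
# Properties the answer says apply only to the CustomAnalyzer.
CUSTOM_ONLY_PROPERTIES = ("tokenizer", "tokenFilters")

def misplaced_properties(analyzer):
    """Return custom-only properties set on an analyzer that is not a CustomAnalyzer."""
    if analyzer.get("@odata.type") == CUSTOM_TYPE:
        return []
    return [prop for prop in CUSTOM_ONLY_PROPERTIES if prop in analyzer]

# The definition from the question with only the type changed (assumed type string).
changed = {
    "name": "my_analyzer",
    "@odata.type": "#Microsoft.Azure.Search.KeywordAnalyzer",
    "tokenizer": "my_standard_tokenizer",
    "tokenFilters": ["my_asciifolding", "lowercase"],
}
original = dict(changed, **{"@odata.type": CUSTOM_TYPE})

print(misplaced_properties(changed))
print(misplaced_properties(original))
```

If the first list is not empty, removing those properties from the non-custom analyzer is the first thing to try.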

Highlight matches in MongoDB full text search

Is it possible to define which part of the text in which of the indexed text fields matches the query?
No. As far as I know, and as far as I can tell from the Jira, no such feature exists currently. You can, of course, attempt to highlight the parts of the text yourself, but that requires you to implement both the highlighting and the stemming according to the rules applied by MongoDB.
The whole feature is somewhat complicated - even consuming it - as can be seen from the respective elasticsearch documentation.
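To give an idea of the do-it-yourself route, here is a naive client-side highlighter in Python. It is a sketch only: the function name and the `<em>` tags are arbitrary choices, and it deliberately does no stemming, which is exactly the part that is hard to reproduce.

```python
import re

def highlight(text, query_terms, open_tag="<em>", close_tag="</em>"):
    """Wrap whole-word, case-insensitive occurrences of the query terms in tags."""
    terms = [t for t in query_terms if t]
    if not terms:
        return text
    alternatives = "|".join(re.escape(t) for t in terms)
    pattern = re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: open_tag + m.group(0) + close_tag, text)

marked = highlight("A bunch of grapes, any variety", ["variety", "bunch"])
unstemmed = highlight("Two bunches of grapes", ["bunch"])
print(marked)
print(unstemmed)
```

The second call leaves "bunches" untouched even though a stemmed text search would match it, so the highlights can disagree with the documents the query actually returns.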
Refer to Mongodb Doc Highlighting
db.fruit.aggregate([
{
$searchBeta: {
"search": {
"path": "description",
"query": ["variety", "bunch"]
},
"highlight": {
"path": "description"
}
}
},
{
$project: {
"description": 1,
"_id": 0,
"highlights": { "$meta": "searchHighlights" }
}
}
])
I'm afraid that solution applies only to MongoDB Atlas at the moment @LF00.

fuzzy search in elasticsearch different than fuzziness match boolean

i'm trying to figure out why the following queries produce vastly different results. i'm told a fuzzy query is almost never a good idea per this document Found-fuzzy so i'm trying to use a match query with a fuzziness parameter. they produce extremely different results. i'm not sure what's the best way of doing this.
my example is a movie title containing 'batman'. the user, however, types 'bat man' (with a space). this would make sense that a fuzzy query should find batman. it should also find other variations like spider man, but for now that's ok i guess. (not really, but...)
so the fuzzy search is actually returning more relevant results than the match one below. any ideas?
--fuzzy:
{
"query":{
"bool":{
"should": [
{
"fuzzy": {
"title": {
"value": "bat man",
"boost": 4
}
}
}
], "minimum_number_should_match": 1
}
}
}
--match:
{
"query":{
"bool":{
"should": [
{
"match": {
"title": {
"query": "bat man",
"boost": 4
}
}
}
], "minimum_number_should_match": 1
}
}
}
EDIT
i'm adding examples of what gets returned.
first, nothing gets returned using the match query, even with a high fuzziness value added (fuzziness: 5)
but i do get several 'batman' related titles using the fuzzy query such as 'batman' or 'batman returns'.
this gets even stranger when i do multiple fuzzy searches on 'bat man' using the fuzzy search... if i search my 'starring' field, in addition to the title field, (starring contains lists of actors), i get 'jason bateman' as well as the title 'batman'.
{
"_index": "store24",
"_type": "searchdata",
"_id": "081227987909",
"_score": 4.600759,
"fields": {
"title": [
"Batman"
]
}
},
{
"_index": "store24",
"_type": "searchdata",
"_id": "883929053353",
"_score": 4.1418676,
"fields": {
"title": [
"Batman Forever"
]
}
},
{
"_index": "store24",
"_type": "searchdata",
"_id": "883929331789",
"_score": 3.5298011,
"fields": {
"title": [
"Batman Returns"
]
}
}
BEST SO FAR (STILL NOT GREAT)
what i've found that works best so far is to combine both queries. this seems redundant, but i can't as yet make one work like the other. so, this seems to be better:
"should": [
{
"fuzzy": {
"title": {
"boost": 6.0,
"min_similarity": 1.0,
"value": "batman"
}
}
},
{
"match": {
"title": {
"query": "batman",
"boost": 6.0
,"fuzziness": 1
}
}
}
]
Elasticsearch analyzes documents and converts them into terms, which are what is actually searched (not the documents themselves). The key difference between the two query types is that the match query analyzes the query text before searching, while the fuzzy query is a term-level query and does not. So consider the example below:
A match query for 'bat man' first tokenizes the text into the terms 'bat' and 'man' and looks each one up separately. Neither is within the allowed edit distance of 'batman', so adding fuzziness does not help. A fuzzy query treats 'bat man' as a single unanalyzed term, which is one edit away from the indexed term 'batman' (delete the space). This is also why Jason Bateman showed up: 'bateman' is one edit away from 'bat man' (replace the space with 'e').
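Fuzzy matching in Elasticsearch is based on edit distance, so it can help to compute the distances for the strings in the question by hand. The Python sketch below uses plain Levenshtein distance with a made-up function name; Elasticsearch also counts transpositions by default, which does not change any of these particular values.

```python
def levenshtein(a, b):
    """Return the number of single-character inserts, deletes and substitutions
    needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = cur
    return prev[-1]

for query, term in [("bat man", "batman"), ("bat man", "bateman"),
                    ("bat", "batman"), ("man", "batman")]:
    print(query, "->", term, levenshtein(query, term))
```

Taken as one string, 'bat man' is a single edit from both 'batman' and 'bateman', while the separate words 'bat' and 'man' are each three edits from 'batman', more than the maximum fuzziness of 2.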
More detailed information on how text fields are analyzed when searching can be found at http://exploringelasticsearch.com/searching_data.html#sec-searching-analysis
When a search is performed on an analyzed field, the query itself is
analyzed, matching it up to the documents which are analyzed when
added to the database. Reducing words to these short tokens normalizes
the text allowing for fast efficient lookups. Whether you’re searching
for "rollerblading" in any form, internally we’re just looking for
"rollerblad".

Resources