Azure Cognitive Search prefix searching as single token

I'm trying to create an Azure Search index with a searchable Name field that should not be tokenized and be treated as a single string.
So if I have two values:
"Total Insurance"
"Invoice Total"
With a search term like search=Total*, only "Total Insurance" should be returned, because it starts with "Total".
My assumption was that the 'keyword' analyzer should be used for this type of search:
https://learn.microsoft.com/en-us/azure/search/index-add-custom-analyzers#built-in-analyzers
But it doesn't seem to work that way: no results are returned for search=Total*.
Is there a different setup for this type of search?

Something like this is required:
{
  "name": "myIndex",
  "fields": [
    {
      "name": "Name",
      "type": "Edm.String",
      "searchable": true,
      "filterable": true,
      "retrievable": true,
      "sortable": true,
      "searchAnalyzer": "keyword",
      "indexAnalyzer": "prefixAnalyzer"
    }
  ],
  "analyzers": [
    {
      "name": "prefixAnalyzer",
      "@odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
      "tokenizer": "keyword_v2",
      "tokenFilters": [ "lowercase", "my_edgeNGram" ]
    }
  ],
  "tokenFilters": [
    {
      "name": "my_edgeNGram",
      "@odata.type": "#Microsoft.Azure.Search.EdgeNGramTokenFilterV2",
      "minGram": 3,
      "maxGram": 7
    }
  ]
}
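One subtlety with this definition: the prefixes are materialized at index time by the edge n-gram filter, and the keyword search analyzer does not lowercase, so the query term should be sent lowercase and needs no trailing wildcard. A sketch of the corresponding request (the api-version is a placeholder):

POST /indexes/myIndex/docs/search?api-version=2020-06-30
{
  "search": "total",
  "searchFields": "Name"
}

Note that with minGram 3 and maxGram 7, only prefixes of 3 to 7 characters are indexed, so search terms outside that length range will not match.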

Related

How to get all solr field names except for multivalued fields?

I'm new to solr and I'm trying to query for field names excluding fields with multiValued=true.
So far I have
select?q=*:*&wt=csv&rows=0&facet
which returns all the fields.
Is there a way to modify the query to check if a field is multivalued?
You can retrieve information about all the defined fields through the Schema API. The response will contain a multiValued field set to true if the field is defined as multivalued:
v1 API:
http://localhost:8983/solr/techproducts/schema/fields
v2 API:
http://localhost:8983/api/collections/techproducts/schema/fields
{
  "fields": [
    {
      "indexed": true,
      "name": "_version_",
      "stored": true,
      "type": "long"
    },
    {
      "indexed": true,
      "multiValued": true, <----
      "name": "cat",
      "stored": true,
      "type": "string"
    }
  ],
  "responseHeader": {
    "QTime": 1,
    "status": 0
  }
}
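If you only want the names of the single-valued fields, you can filter the Schema API response client-side. A minimal sketch in Python (using the techproducts core from above; the requests library is assumed):

import requests

# Fetch all field definitions from the Schema API
resp = requests.get("http://localhost:8983/solr/techproducts/schema/fields")
fields = resp.json()["fields"]

# multiValued may be absent from a field definition, in which case it defaults to false
single_valued = [f["name"] for f in fields if not f.get("multiValued", False)]
print(single_valued)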

Azure search services issue for white space and wildcard search of special characters

We have an application that allows users to enter anything in the summary field. Users can type any special characters, like #$!#~, including whitespace, and they have requested to be able to search on those special characters as well. For example, one of the entries is "test testing **** #### !!!!! ???? # $".
I created a cognitive search index with the analyzer set to standard.lucene, shown below:
{
  "name": "Summary",
  "type": "Edm.String",
  "searchable": true,
  "filterable": true,
  "retrievable": true,
  "sortable": true,
  "facetable": true,
  "key": false,
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "analyzer": "standard.lucene",
  "synonymMaps": []
}
When I use the following Postman query:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "testing",
"searchFields": "Summary",
"count":true
}
I get the expected result.
If I use the following:
{ "top":"1000",
"queryType": "full",
"searchMode":"all",
"search": "testing ****",
"searchFields": "Summary",
"count":true
}
I get an "InvalidRequestParameter" error.
If I change to the following query:
{
  "top": "1000",
  "queryType": "full",
  "searchMode": "all",
  "search": "\"****\"",
  "searchFields": "Summary",
  "count": true
}
Then I am not getting any results back.
Per this article:
https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#escaping-special-characters
In order to use any of the search operators as part of the search text, escape the character by prefixing it with a single backslash (\).
Special characters that require escaping include the following:
& | ! ( ) { } [ ] ^ " ~ * ? : \ /
So I need to prefix the special characters with a single backslash, but in my case that doesn't seem to work. Any help will be appreciated!
If you are using the standard Lucene analyzer for indexing, I believe "****" is not saved as a term at all: the standard analyzer breaks words on special characters.
For fields that you need to search this way, e.g. the summary field in your example, you need to create a custom analyzer. This document describes how to create a custom analyzer and test it. Once you have built an analyzer that tokenizes the input the way you want, you can use it in your index definition for the fields that need it, as follows.
...
{
  "name": "Summary",
  "type": "Edm.String",
  "retrievable": true,
  "searchable": true,
  "analyzer": "custom_analyzer_for_tokenizing_as_is"
},
...
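Before reindexing, you can verify how a candidate analyzer tokenizes your text with the Analyze Text API; a sketch (the index name and api-version are placeholders):

POST /indexes/myindex/analyze?api-version=2020-06-30
{
  "text": "test testing **** #### !!!!!",
  "analyzer": "whitespace"
}

With the whitespace tokenizer, "****" should come back as a token of its own, which is what makes it searchable.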
I finally got this resolved by creating a custom analyzer. The index definition:
{
  "name": "FieldName",
  "type": "Edm.String",
  "searchable": true,
  "filterable": true,
  "retrievable": true,
  "sortable": true,
  "facetable": true,
  "key": false,
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "analyzer": "specialcharanalyzer",
  "synonymMaps": []
},
The analyzer is specified below:
"analyzers": [
{
"#odata.type": "#Microsoft.Azure.Search.CustomAnalyzer",
"name": "specialcharanalyzer",
"tokenizer": "whitespace",
"tokenFilters": [
"lowercase"
],
"charFilters": []
}
],
Then you can use the format specified in this document: https://learn.microsoft.com/en-us/azure/search/query-lucene-syntax#special-characters
Special characters that require escaping include the following:
+ - & | ! ( ) { } [ ] ^ " ~ * ? : \ /
For characters not in the list above, use the following format for infix search:
"search": "/.*SearchChar.*/",
For example, if you want to search for $, then use the following format:
"search": "/.*$.*/",
For special characters in the list, use this format:
"search" : "/.*\\escapingcharacter.*/",
For example to search for +, use the following query;
"search" : "/.*\\+.*/",
# is also treated as a special character that needs escaping when it appears inside a query.
To search for *, use this format:
"search":"/\\**/",

Mongo 4.2: Remove Null fields

Documents in my MongoDB collection look like this:
My Mongo version is 4.2.3
{
  "_id": "SAGE-UW-00005",
  "carriers": [
    {
      "bindable": true,
      "carrierCode": "LMICO",
      "mapped": true,
      "products": [
        {
          "industries": [
            {
              "industryCode": null,
              "states": "GA"
            }
          ],
          "isAllNCCIValid": null,
          "isAllstateValid": true
        }
      ],
      "questionCode": "LMGENRL17"
    }
  ],
  "column": 1,
  "dataType": null
}
This is my desired output:
{
  "_id": "SAGE-UW-00005",
  "carriers": [
    {
      "bindable": true,
      "carrierCode": "LMICO",
      "mapped": true,
      "products": [
        {
          "industries": [
            {
              "states": "GA"
            }
          ],
          "isAllstateValid": true
        }
      ],
      "questionCode": "LMGENRL17"
    }
  ],
  "column": 1
}
I am not sure of the depth of the nested subdocuments in the collection, but there should be a lot of null fields in it. My backend code uses $exists to query fields in the collection, and $exists: true also matches fields whose value is null, so the nulls are creating a problem here.
Since the nesting depth is dynamic, the best option is to remove the null fields in application code and replace each document. As the documents are nested, I would suggest mapping your data to POJOs and checking each entry and field for null. Unless you know the fields in advance, there is no efficient way to remove them server-side.
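As a sketch of that approach in Python with pymongo (rather than POJOs; the connection string, database, and collection names are placeholders), recursively strip the null fields and replace each document:

from pymongo import MongoClient

def strip_nulls(value):
    # Drop keys whose value is None, recursing into nested documents and arrays
    if isinstance(value, dict):
        return {k: strip_nulls(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [strip_nulls(v) for v in value]
    return value

client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["mycollection"]

for doc in coll.find({}):
    coll.replace_one({"_id": doc["_id"]}, strip_nulls(doc))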

"Unknown Error: mango_idx :: {no_usable_index,missing_sort_index}"}

I have the following index:
{
  "type": "text",
  "name": "album-rating-text",
  "index": {
    "fields": [
      {"type": "string", "name": "user_id"},
      {"type": "string", "name": "album_id"},
      {"type": "number", "name": "timestamp"}
    ]
  }
}
Here is the query:
{
  "sort": [
    {"user_id": "desc"},
    {"album_id": "desc"},
    {"timestamp": "desc"}
  ],
  "limit": 1,
  "fields": ["user_id", "album_id", "timestamp"],
  "selector": {
    "$and": [
      {"user_id": {"$eq": "a#a.com"}},
      {"album_id": {"$in": ["bf129f0d", "380e3a05"]}}
    ]
  }
}
The error:
{
  "error": "unknown_error",
  "reason": "Unknown Error: mango_idx :: {no_usable_index,missing_sort_index}"
}
I've seen a similar question; however, all the fields I'm indexing on are in my sort list.
Update:
As a workaround, I attempted to simplify by dropping the timestamp field:
{"type": "text",
"name": "album-rating-text",
"index": {"fields": [
{"type": "string", "name": "user_id"},
{"type": "string", "name": "album_id"}
]}}
And query like so:
{"selector": {"$and": [
{"user_id": {"$eq": "a#a.com"}},
{"album_id": {"$in": ["bf129f0d", "380e3a05"]}
}]},
"fields": ["user_id", "album_id"]}
I get the following response with a warning and no results:
{"warning":"no matching index found, create an index to optimize query time",
"docs":[
]}
To use the sort function on a field, that field needs to be manually registered in a query index.
Cloudant doesn't do this automatically, because it can be resource-consuming:
"The example in the editor shows how to index the field "foo" using
the json type index. You can automatically index all the fields in all
of your documents using a text type index with the syntax '{ "index":
{}, "type": "text" }', Note that indexing all fields can be resource
consuming on large data sets."
You can do this using the Cloudant dashboard. Go to your database and look for "Queryable indexes". Click Edit.
Add your field to the default template:
{
  "index": {
    "fields": [
      "user_id"
    ]
  },
  "type": "json"
}
Press "Create index"
Field "user_id" is now queryable, and you can now use sort-function to it.
All fields need to be add manually, or you can register all fields as Query-index with:
{ "index": {}, "type": "text" }
Video instructions for creating Query-index:
https://www.youtube.com/watch?v=B3ZkxSFau8U
Try using a JSON index instead of the text index:
{
  "type": "json",
  "name": "album-rating-text",
  "index": {
    "fields": ["user_id", "album_id", "timestamp"]
  }
}
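A query that sorts on all three indexed fields in one direction should then be able to use that index; a sketch with the values from the question:

{
  "selector": {
    "user_id": {"$eq": "a#a.com"},
    "album_id": {"$in": ["bf129f0d", "380e3a05"]}
  },
  "sort": [
    {"user_id": "desc"},
    {"album_id": "desc"},
    {"timestamp": "desc"}
  ],
  "fields": ["user_id", "album_id", "timestamp"],
  "limit": 1
}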
If I remember correctly, my query requirements changed and I chose to use a standard Cloudant Search index instead of a Mango index.

Elasticsearch not returning hits for multi-valued field

I am using Elasticsearch with no modifications whatsoever, meaning the mappings, norms, and analyzed/not_analyzed settings are all default config. I have a very small data set of two items for experimentation purposes. The items have several fields but I query only on one, which is a multi-valued (array of strings) field. The doc looks like this:
{
  "_index": "index_profile",
  "_type": "items",
  "_id": "ega",
  "_version": 1,
  "found": true,
  "_source": {
    "clicked": [
      "ega"
    ],
    "profile_topics": [
      "Twitter",
      "Entertainment",
      "ESPN",
      "Comedy",
      "University of Rhode Island",
      "Humor",
      "Basketball",
      "Sports",
      "Movies",
      "SnapChat",
      "Celebrities",
      "Rite Aid",
      "Education",
      "Television",
      "Country Music",
      "Seattle",
      "Beer",
      "Hip Hop",
      "Actors",
      "David Cameron",
      ... // other topics
    ],
    "id": "ega"
  }
}
A sample query is:
GET /index_profile/items/_search
{
  "size": 10,
  "query": {
    "bool": {
      "should": [{
        "terms": {
          "profile_topics": [
            "Basketball"
          ]
        }
      }]
    }
  }
}
Again, there are only two items, and the one listed should match the query because its profile_topics field contains the term "Basketball". The other item does not match. I only get a result if I ask for clicked = ega in the should.
With Solr I would probably specify that the fields are multi-valued string arrays with no norms and no analyzer, so that profile_topics values are not stemmed or tokenized and each value is treated as a single token (even with spaces). I'm not sure this would solve the problem, but it is how I treat similar data in Solr.
I assume I have run afoul of some norm/analyzer/TF-IDF issue; if so, how do I solve this so that even with two items the query returns ega? If possible I'd like to solve this index- or type-wide rather than per field.
Basketball (with a capital B) in a terms query is not analyzed; it is looked up in the Elasticsearch index exactly as written.
You say you have the defaults. If so, indexing Basketball in the profile_topics field means the actual term in the index is basketball (with a lowercase b), the output of the standard analyzer. So either set profile_topics to not_analyzed, or search for basketball rather than Basketball.
Read this about terms.
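For example, with the default mapping, searching the lowercased term should match; a sketch against the index from the question:

GET /index_profile/items/_search
{
  "query": {
    "terms": {
      "profile_topics": ["basketball"]
    }
  }
}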
Regarding setting all the fields to not_analyzed: you can do that with a dynamic template. With a template you can also do what Logstash does: define a .raw subfield for each string field, where only this subfield is not_analyzed. The original/parent field still holds the analyzed version of the same text, in case you want to use the analyzed field in the future.
Take a look at this dynamic template. It's the one Logstash is using.
More specifically:
{
  "template": "your_indices_name-*",
  "mappings": {
    "_default_": {
      "_all": {
        "enabled": true,
        "omit_norms": true
      },
      "dynamic_templates": [
        {
          "string_fields": {
            "match": "*",
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "index": "analyzed",
              "omit_norms": true,
              "fields": {
                "raw": {
                  "type": "string",
                  "index": "not_analyzed"
                }
              }
            }
          }
        }
      ]
    }
  }
}
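With such a template in place (assuming the template pattern matches your index name and the index was created after the template was added), a terms query can target the not_analyzed .raw subfield and match the original casing; a sketch:

GET /index_profile/items/_search
{
  "query": {
    "terms": {
      "profile_topics.raw": ["Basketball"]
    }
  }
}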
