Set Index Key to Output Field Mapping - azure-cognitive-search

In my index, I have a field called id. During my enrichment pipeline I compute a value called /document/documentId, which I'm attempting to map to the id field. However, this mapping does not seem to work: the id always ends up being some long value that looks like a hash. All my other output field mappings work as expected.
Portion of the Index:
{
  "name": "id",
  "type": "Edm.String",
  "facetable": false,
  "filterable": true,
  "key": true,
  "retrievable": true,
  "searchable": true,
  "sortable": true,
  "analyzer": null,
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
}
Portion of the Indexer:
"outputFieldMappings": [
  {
    "sourceFieldName": "/document/documentId",
    "targetFieldName": "id"
  }
]
Expected Value: 4b160942-050f-42b3-bbbb-f4531eb4ad7c
Actual Value: aHR0cHM6Ly9zdGRvY3VtZW50c2Rldi5ibG9iLmNvcmUud2luZG93cy5uZXQvMDNiZTBmMzEtNGMyZC00NDRjLTkzOTQtODJkZDY2YTc4MjNmL29yaWdpbmFscy80YjE2MDk0Mi0wNTBmLTQyYjMtYmJiYi1mNDUzMWViNGFkN2MucGRm0
Any thoughts on how to fix this would be much appreciated!

TL;DR - Can't use output field mappings for Keys. Can only use source fields.
According to Microsoft, it's not possible to set the document key using an output field mapping. Apparently there is an issue when deleting documents, so the key has to come straight from the source document.
I ended up using a mapping function in the fieldMappings.
"fieldMappings": [
  {
    "sourceFieldName": "metadata_storage_name",
    "targetFieldName": "filename"
  },
  {
    "sourceFieldName": "metadata_storage_name",
    "targetFieldName": "id",
    "mappingFunction": {
      "name": "extractTokenAtPosition",
      "parameters": {
        "delimiter": ".",
        "position": 0
      }
    }
  }
]
Since my file name is something like 4b160942-050f-42b3-bbbb-f4531eb4ad7c.pdf, this ends up mapping correctly to my id.

You can use a regular field mapping rather than an output field mapping. If you created your indexer in the Azure portal, your key (which is "id", since "key" is true in your index definition of "id" above) was probably base64-encoded; that option is checked by default. You will need to base64-decode it to get your original value back, or you can store a second, unencoded copy of the original value (the key itself still has to be encoded). Here's how to do the latter; this can replace your output field mapping:
"fieldMappings": [
  {
    "sourceFieldName": "documentId",
    "targetFieldName": "documentId"
  },
  {
    "sourceFieldName": "documentId",
    "targetFieldName": "id",
    "mappingFunction": {
      "name": "base64Encode"
    }
  }
]
Note that you will also need to add a documentId field to your index, since you are storing the value in its original format as well.
{
  "name": "documentId",
  "type": "Edm.String",
  "facetable": false,
  "filterable": true,
  "key": false,
  "retrievable": true,
  "searchable": true,
  "sortable": true,
  "analyzer": null,
  "indexAnalyzer": null,
  "searchAnalyzer": null,
  "synonymMaps": [],
  "fields": []
}
Alternatively, you could just base64-encode the id value when storing it and decode it when retrieving it. The encoded value is safe to use as an Azure Cognitive Search document key. Check out https://learn.microsoft.com/azure/search/search-indexer-field-mappings for more info.
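If you go the decode-on-read route, a rough sketch of recovering the original value from the encoded key might look like this (this is my own illustration, not part of the original answer; it assumes Node.js and the default URL-safe base64 encoding, and note that keys produced with the older useHttpServerUtilityUrlTokenEncode option also append a digit for the padding count, which you would strip first):
// Hypothetical helper, assuming the key was produced by the base64Encode
// mapping function using URL-safe base64 ('-' and '_' in place of '+' and '/').
function decodeSearchKey(key) {
  const standard = key.replace(/-/g, '+').replace(/_/g, '/');
  return Buffer.from(standard, 'base64').toString('utf8');
}

// e.g. decodeSearchKey(result.id) would give back the original source value.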

Related

Null or empty values are not stored in solr

I have a Solr database where I added a string field like this:
{
  "add-field": [
    {
      "name": "string__single_line_text_field__LC",
      "type": "string",
      "stored": true,
      "indexed": true,
      "required": true,
      "default": ""
    }
  ]
}
I set the field to be required and defined its default value. The problem is that Solr doesn't store my default value as an empty string when the string is null or empty (the field simply doesn't exist in the stored document); it stores only non-null/non-empty values. Any idea how to solve this issue?

Mongo 4.2: Remove Null fields

Documents in my MongoDB collection look like this:
My Mongo version is 4.2.3
{
  "_id": "SAGE-UW-00005",
  "carriers": [
    {
      "bindable": true,
      "carrierCode": "LMICO",
      "mapped": true,
      "products": [
        {
          "industries": [
            {
              "industryCode": null,
              "states": "GA"
            }
          ],
          "isAllNCCIValid": null,
          "isAllstateValid": true
        }
      ],
      "questionCode": "LMGENRL17"
    }
  ],
  "column": 1,
  "dataType": null
}
This is my desired output:
{
  "_id": "SAGE-UW-00005",
  "carriers": [
    {
      "bindable": true,
      "carrierCode": "LMICO",
      "mapped": true,
      "products": [
        {
          "industries": [
            {
              "states": "GA"
            }
          ],
          "isAllstateValid": true
        }
      ],
      "questionCode": "LMGENRL17"
    }
  ],
  "column": 1
}
I am not sure the depth of nested subdocuments in the collection, but there should be a lot of null fields in the collection. My backend code uses $exists to query the fields in the collection, so null is creating a problem here.
Since the structure is dynamic, the best option would be to remove the null fields in code and then replace each document. As you have nested levels, I would suggest mapping your data to a POJO and checking whether each entry and field is null. Unless you know the fields in advance, it is not efficient to remove them directly in the database; a rough sketch of the code-side cleanup is below.
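Here is a minimal sketch of that cleanup (my own illustration, assuming Node.js and the official mongodb driver; the collection name "questions" is hypothetical):
// Recursively drop null fields, then replace each document with the cleaned copy.
function stripNulls(value) {
  if (Array.isArray(value)) return value.map(stripNulls);
  // Only descend into plain objects; BSON types (ObjectId, Date, ...) pass through untouched.
  if (value !== null && typeof value === 'object' && value.constructor === Object) {
    const cleaned = {};
    for (const [k, v] of Object.entries(value)) {
      if (v !== null) cleaned[k] = stripNulls(v);
    }
    return cleaned;
  }
  return value;
}

async function removeNullFields(db) {
  const coll = db.collection('questions'); // hypothetical collection name
  for await (const doc of coll.find()) {
    await coll.replaceOne({ _id: doc._id }, stripNulls(doc));
  }
}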

E11000 (DuplicateKey) error when using a partial multikey unique index

Consider a collection with the following documents:
{
  name: "John Doe",
  emails: [
    {
      value: "some#domain.com",
      isValid: true,
      isPreferred: true
    }
  ]
},
{
  name: "John Doe",
  emails: [
    {
      value: "john.doe#gmail.com",
      isValid: false,
      isPreferred: false
    },
    {
      value: "john.doe#domain.com",
      isValid: true,
      isPreferred: true
    }
  ]
}
There should be no users with the same valid and preferred emails, so there is a unique index for that:
db.users.createIndex( { "emails.value": 1 }, { name: "loginEmail", unique: true, partialFilterExpression: { "emails.isValid": true, "emails.isPreferred": true } } )
Adding the following email to the first document triggers the unique constraint violation:
{
  name: "John Doe",
  emails: [
    {
      value: "john.doe#gmail.com",
      isValid: false,
      isPreferred: false
    }
  ]
}
Caused by: com.mongodb.MongoCommandException: Command failed with
error 11000 (DuplicateKey): 'E11000 duplicate key error collection:
profiles.users index: loginEmail dup key: { emails.value:
"john.doe#gmail.com", emails.isValid: false, emails.isPreferred: false
}' on server profiles-db-mongodb.dev:27017. The full response is
{"ok": 0.0, "errmsg": "E11000 duplicate key error collection:
profiles.users index: loginEmail dup key: { emails.value:
"john.doe#gmail.com", emails.isValid: false, emails.isPreferred:
false }", "code": 11000, "codeName": "DuplicateKey", "keyPattern":
{"emails.value": 1, "emails.isValid": 1, "emails.isPreferred": 1},
"keyValue": {"emails.value": "john.doe#gmail.com", "emails.isValid":
false, "emails.isPreferred": false}}
As I can understand, this happens because the filter expression is applied to the collection, not to the embedded documents, so although being somewhat counterintuitive and unexpected, the index behaves as described.
My question is how can I ensure partial uniqueness without having false negatives?
TL;DR: You can't.
Let's understand why it's happening first; maybe then we'll see what can be done. The problem originates from a combination of two Mongo features.
First, the dot notation syntax. Dot notation allows you to query subdocuments in arrays with ease ("emails.isPreferred": true). However, when you want to apply multiple conditions to the same subdocument, as in your case, you need something like $elemMatch; sadly, the restrictions on partialFilterExpression are quite strict and do not give you that power.
Which means even documents with emails such as:
{
  "_id": ObjectId("5f106c0e823eea49427eea64"),
  "name": "John Doe",
  "emails": [
    {
      "value": "john.doe#gmail.com",
      "isValid": true,
      "isPreferred": false
    },
    {
      "value": "john.doe#domain.com",
      "isValid": false,
      "isPreferred": true
    }
  ]
}
will be indexed. OK, so we have some extra indexed documents in the collection; apart from (needlessly) increasing the index size, you might still hope it would work, but it doesn't, due to point 2.
Second, multikey indexes:
"MongoDB uses multikey indexes to index the content stored in arrays. [...] MongoDB creates separate index entries for every element of the array."
So when you create an index on an array, or on any field of a subdocument in an array, Mongo will "flatten" the array and create a separate index entry for each element; in this case it creates a unique entry for every email in the array.
So, due to all these "features" and the restrictions on the partial filter syntax, we can't really achieve what you want.
So what can you do? I'm sure you're already thinking of possible workarounds. A simple solution would be to maintain an extra field that contains only the valid and preferred email; a unique sparse index on that field will then do the trick (a rough sketch follows).
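A minimal sketch of that workaround in the shell (my own illustration; the loginEmail field name is hypothetical, and the pipeline-style update assumes MongoDB 4.2+):
// Copy the value of the valid + preferred email into its own top-level field.
db.users.updateMany({}, [
  {
    $set: {
      loginEmail: {
        $arrayElemAt: [
          {
            $map: {
              input: {
                $filter: {
                  input: "$emails",
                  cond: { $and: ["$$this.isValid", "$$this.isPreferred"] }
                }
              },
              in: "$$this.value"
            }
          },
          0
        ]
      }
    }
  }
]);

// Documents without such an email simply won't get the field, so a unique
// sparse index enforces uniqueness only where loginEmail exists.
db.users.createIndex({ loginEmail: 1 }, { unique: true, sparse: true });
The application would, of course, have to keep loginEmail in sync whenever the emails array changes.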

count number of rows in cloudant in response

I have the below response from my map/reduce.
Now I want to count the number of rows in the response. Can anyone help me with how I can do that in Cloudant? I need the response to give me the total count of distinct correlationid values in a period.
{
  rows: [
    { key: ["201705", "aws-60826346-"], value: null },
    { key: ["201705", "aws-60826348802-"], value: null },
    { key: ["201705", "aws-las97628elb"], value: null },
    { key: ["201705", "aws-ve-test"], value: null },
    { key: ["201705", "aws-6032dcbce"], value: null },
    { key: ["201705", "aws-60826348831d"], value: null },
    { key: ["201705", "aws-608263488833926e"], value: null },
    { key: ["201705", "aws-608263488a74f"], value: null }
  ]
}
You need to implement a slightly obscure concept called "chained map-reduce" to accomplish this. You can't do this in the Cloudant administrative GUI, so you'll have to write your design document by hand.
Have your map/reduce emit an array as the key: the first element will be the month and the second will be your correlationid. The value should be 1. Then specify the built-in _count as the reduce function.
Now you need to add the chaining part. Chaining basically involves automatically copying the result of a map/reduce into a new database, on which you can then run another map/reduce, thereby creating a chain of map/reduces...
Here's a tiny sample database using your example:
https://rajsingh.cloudant.com/so44106569/_all_docs?include_docs=true&limit=200
Here's the design document containing the map/reduce, along with the dbcopy command that updates a new database (in this case called sob44106569) with the results of the view called view:
{
  "_id": "_design/ddoc",
  "_rev": "11-88ff7d977dfff81a05c50b13d854a78f",
  "options": {
    "epi": {
      "dbcopy": {
        "view": "sob44106569"
      }
    }
  },
  "language": "javascript",
  "views": {
    "view": {
      "reduce": "_count",
      "map": "function (doc) {\n emit([doc.month, doc.machine], 1);\n}"
    }
  }
}
Here's the result of the map function (no reduce) showing 10 rows. Notice that there are two documents with month 201705 and machine aws-6032dcbce:
https://rajsingh.cloudant.com/so44106569/_design/ddoc/_view/view?limit=200&reduce=false
If you just do the built-in _count reduce on this view at group_level=1, you'll get a value of 9 for 201705, which is wrong for your purposes because you want to count that aws-6032dcbce only once, even though it shows up in the data twice:
https://rajsingh.cloudant.com/so44106569/_design/ddoc/_view/view?limit=200&reduce=true&group=true&group_level=1
So let's take a quick look at the map/reduce at group_level=2. This is what gets copied to the new database:
https://rajsingh.cloudant.com/so44106569/_design/ddoc/_view/view?limit=200&reduce=true&group=true&group_level=2
Here you see that aws-6032dcbce only shows up once (but with value=2), so this is a useful view. The dbcopy part of our map/reduce creates the database sob44106569 based on this view. Let's look at that:
https://rajsingh.cloudant.com/sob44106569/_all_docs?include_docs=true
Now we can run a very simple map/reduce on that database, emitting the month and machine again (now they are in an array so have different names), but this time the repeated values for machine have already been "reduced" away.
function (doc) {
  if (doc.key && doc.key.length == 2)
    emit(doc.key[0], doc.key[1]);
}
And finally here's the count of distinct "machines". Now we can finally see the desired value of 8 for 201705.
https://rajsingh.cloudant.com/sob44106569/_design/views/_view/counted?limit=200&reduce=true&group=true&group_level=1
response:
{
  "rows": [
    {
      "key": "201705",
      "value": 8
    },
    {
      "key": "201706",
      "value": 1
    }
  ]
}
Emit 1 instead of null and use the built-in reducer _count.
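For reference, a minimal sketch of that simpler approach (my own illustration; the month and machine field names mirror the sample design document above):
// map: emit 1 for every [month, machine] pair instead of null
function (doc) {
  emit([doc.month, doc.machine], 1);
}
// reduce: _count
// Querying with group=true&group_level=2 then returns one row per distinct
// machine within a month, with its occurrence count as the value.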

Set criteria in query for fields and fields in nested objects

I have a document like this:
{
  "InDate": "11.09.2015",
  "Kst2Kst": true,
  "OutDate": "11.09.2015",
  "__v": 0,
  "_id": ObjectId('55f2df2d7e12a9f1f52837e6'),
  "accepted": true,
  "inventar": [
    {
      "accepted": "1",
      "name": "AAAA",
      "isstammkost": true,
      "stammkost": "IWXI"
    },
    {
      "accepted": "1",
      "name": "BBBB",
      "isstammkost": false,
      "stammkost": "null"
    }
  ]
}
I want to select the data with "isstammkost": true in the inventar-array.
My query is:
Move.findOne({accepted : true, 'inventar.isstammkost' : true},
'OutDate InDate inventar.name', function(err, res)
It doesn't work: it selects everything, even entries with inventar.isstammkost: false.
The "normal" query works the way I want (without criteria on the sub-array). What's the right way to set criteria on a sub-array?
Of course it will return the "isstammkost": false part, because it belongs to the same document as the "isstammkost": true entry. They are both objects in the array "inventar", a top-level field of a single document. Without some sort of projection, the entire document is always returned by a MongoDB query, and Node.js will pass it on to you.
I'm not terribly up-to-speed on nodejs, but if this were the mongo shell it would look like this:
> db.MyDB.findOne({accepted: true, "inventar.isstammkost": true}, {"inventar.isstammkost.$": 1});
You will need to find out how to add that extra parameter to the nodejs function.
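For example, a rough Mongoose equivalent might look like this (my own sketch, assuming an older Mongoose version that still accepts callbacks; it uses the positional projection on the array path itself, 'inventar.$'):
// Pass the projection as the second argument so only the first matching
// "inventar" entry is returned alongside OutDate and InDate.
Move.findOne(
  { accepted: true, 'inventar.isstammkost': true },
  { OutDate: 1, InDate: 1, 'inventar.$': 1 },
  function (err, res) {
    if (err) return console.error(err);
    console.log(res);
  }
);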
