MongoDB - searching in array using $elemMatch slower with index than without - arrays

I have a collection with 500k documents with the following structure:
{
  "_id" : ObjectId("5f2d30b0c7cc16c0da84a57d"),
  "RecipientId" : "6a28d20f-4741-4c14-a055-2eb2593dcf13",
  ...
  "Actions" : [
    {
      "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
      "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
      "ActionDescription" : "OPEN"
    },
    ...
  ]
}
I need to count the top level documents whose subdocuments inside the "Actions" array meet certain criteria, and for this I've created the following Multikey index (taking only the "ActionDatetime" field as an example):
db.getCollection("recipients").createIndex( { "Actions.ActionDatetime": 1 } )
The problem is that when I write the query using an $elemMatch, the operation is much slower than when I don't use the Multikey index at all:
db.getCollection("recipients").count({
  "Actions": {
    $elemMatch: { ActionDatetime: { $gt: new Date("1950-08-04") } }
  }
})
The stats for this query:
{
  "executionSuccess" : true,
  "nReturned" : 0,
  "executionTimeMillis" : 13093,
  "totalKeysExamined" : 8706602,
  "totalDocsExamined" : 500000,
  "executionStages" : {
    "stage" : "COUNT",
    "nReturned" : 0,
    "executionTimeMillisEstimate" : 1050,
    "works" : 8706603,
    "advanced" : 0,
    "needTime" : 8706602,
    "needYield" : 0,
    "saveState" : 68020,
    "restoreState" : 68020,
    "isEOF" : 1,
    "nCounted" : 500000,
    "nSkipped" : 0,
    "inputStage" : {
      "stage" : "FETCH",
      "filter" : {
        "Actions" : {
          "$elemMatch" : {
            "ActionDatetime" : {
              "$gt" : ISODate("1950-08-04T00:00:00.000Z")
            }
          }
        }
      },
      "nReturned" : 500000,
      "executionTimeMillisEstimate" : 1040,
      "works" : 8706603,
      "advanced" : 500000,
      "needTime" : 8206602,
      "needYield" : 0,
      "saveState" : 68020,
      "restoreState" : 68020,
      "isEOF" : 1,
      "docsExamined" : 500000,
      "alreadyHasObj" : 0,
      "inputStage" : {
        "stage" : "IXSCAN",
        "nReturned" : 500000,
        "executionTimeMillisEstimate" : 266,
        "works" : 8706603,
        "advanced" : 500000,
        "needTime" : 8206602,
        "needYield" : 0,
        "saveState" : 68020,
        "restoreState" : 68020,
        "isEOF" : 1,
        "keyPattern" : {
          "Actions.ActionDatetime" : 1.0
        },
        "indexName" : "Actions.ActionDatetime_1",
        "isMultiKey" : true,
        "multiKeyPaths" : {
          "Actions.ActionDatetime" : [
            "Actions"
          ]
        },
        "isUnique" : false,
        "isSparse" : false,
        "isPartial" : false,
        "indexVersion" : 2,
        "direction" : "forward",
        "indexBounds" : {
          "Actions.ActionDatetime" : [
            "(new Date(-612576000000), new Date(9223372036854775807)]"
          ]
        },
        "keysExamined" : 8706602,
        "seeks" : 1,
        "dupsTested" : 8706602,
        "dupsDropped" : 8206602
      }
    }
  }
}
This query took 14sec to execute, whereas if I remove the index, the COLLSCAN takes 1 second.
I understand that I'd have a better performance by not using $elemMatch, and filtering by "Actions.ActionDatetime" directly, but in reality I'll need to filter by more than one field inside the array, so the $elemMatch becomes mandatory.
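For illustration, a multi-field filter of the kind I mean would look roughly like this (the ActionDescription value here is just a placeholder):
db.getCollection("recipients").count({
  "Actions": {
    $elemMatch: {
      ActionDatetime: { $gt: new Date("1950-08-04") },
      ActionDescription: "OPEN"
    }
  }
})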
I suspect that it's the FETCH phase which is killing the performance. I've noticed that when I filter on "Actions.ActionDatetime" directly, MongoDB is able to use a COUNT_SCAN instead of the FETCH, but the performance is still poorer than the COLLSCAN (4s).
I'd like to know if there's a better indexing strategy for indexing subdocuments with high cardinality inside an array, or if I'm missing something with my current approach.
As the volume grows, indexing this information will be a necessity and I don't want to rely on a COLLSCAN.

The problem here is twofold:
Every document matches your query
Consider the analogy of an index being the catalog in a library. If you want to find a single book, looking it up in the catalog lets you go straight to the shelf holding it, which is much faster than starting at the first shelf and searching through the books (unless of course it actually is on that first shelf). However, if you want to get all of the books in the library, it will be much faster to just start taking them off the shelf than to check the catalog for each one and then go get it.
While this analogy is far from perfect, it does show that a collection scan can be expected to be much more efficient than index lookups when a large percentage of the documents will be considered.
Multikey indexes have more than one entry for each document
When mongod builds an index on an array, it creates a separate entry in the index for each discrete element. When you match a value from an array element, the index can get you to a matching document quickly, but because a single document is expected to have multiple entries in the index, deduplication is required afterward.
Both issues are further exacerbated by the $elemMatch. Since the index stores the values of the indexed fields separately, it cannot determine from the index alone whether the values of different fields occur within the same array element, so it must load each document to check that.
Essentially, when using $elemMatch with the index and a query that matches every document, the mongod node will examine the index to identify matching values, deduplicate that list, then load each document (likely in the order encountered in the index) to see if a single array element satisfies the $elemMatch.
Compared with the non-indexed collection scan, where mongod must load each document in the order encountered on disk and check whether a single array element satisfies the $elemMatch, it should be apparent that the indexed query will perform worse when a large percentage of the documents match the query.
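If you want to see this for yourself, you can compare the two plans directly. A rough sketch (find() is used here just to compare plan shapes, and hint({ $natural: 1 }) forces the collection scan):
// Indexed plan (IXSCAN + FETCH):
db.getCollection("recipients").find(
  { "Actions": { $elemMatch: { ActionDatetime: { $gt: new Date("1950-08-04") } } } }
).explain("executionStats")

// Forced collection scan for comparison:
db.getCollection("recipients").find(
  { "Actions": { $elemMatch: { ActionDatetime: { $gt: new Date("1950-08-04") } } } }
).hint({ $natural: 1 }).explain("executionStats")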

TLDR: this is the expected behaviour of a multikey index combined with an $elemMatch.
Why is this happening?
So it is the FETCH stage that's ruining your query performance; unfortunately, this is the expected behaviour.
From the covered query section of the multikey index documents:
Multikey indexes cannot cover queries over array field(s).
Meaning a multikey index does not contain all of the information about a sub-document. A count on a single array field is a partial exception where Mongo can do better, but $elemMatch still forces a FETCH stage even when only a single field is used. Why?
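For reference, this is the single-field form from the question that can use a COUNT_SCAN (a sketch):
db.getCollection("recipients").count({
  "Actions.ActionDatetime": { $gt: new Date("1950-08-04") }
})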
Imagine this scenario:
//doc1
{
  "Actions" : [
    {
      "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
      "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
      "ActionDescription" : "OPEN"
    },
    ...
  ]
}

//doc2
{
  "Actions" : {
    "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
    "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
    "ActionDescription" : "OPEN"
  }
}
Because Mongo "flattens" the arrays it indexes, once the index is built Mongo cannot differentiate between these 2 documents. But $elemMatch requires an actual array element to match, so Mongo has to fetch these documents to determine which one qualifies.
This is the exact problem you're facing.
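A quick sketch of the difference, assuming both documents live in the same recipients collection: the dotted-path query matches both documents above, while the $elemMatch form matches only doc1, because doc2's Actions field is not an array.
// Matches doc1 and doc2 (the dotted path traverses arrays and embedded documents alike):
db.getCollection("recipients").find({ "Actions.ActionDatetime": { $gt: new Date("1950-08-04") } })

// Matches only doc1 ($elemMatch requires the field to be an array):
db.getCollection("recipients").find({ "Actions": { $elemMatch: { ActionDatetime: { $gt: new Date("1950-08-04") } } } })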
What can you do?
Well, not much, sadly. I'm not sure how dynamic your queries are, but the only way to solve this issue is to preprocess the documents so that they contain the "answers" to your queries.
I still find it hard to believe that a COLLSCAN is doing better than the indexed query; I'm assuming you're matching a large portion of your collection, combined with the fact that the Actions arrays are very large.
Because performance will keep being an issue, especially if your queries continue to match a large portion of the collection, what I would suggest is to restructure your data: just save each Actions entry as its own document.
{
  "Actions" : {
    "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
    "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
    "ActionDescription" : "OPEN"
  }
}
Then your queries will be able to use an index effectively. You will have to use a different query than count, though; a distinct on RecipientId sounds like a valid option.
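A rough sketch of what that might look like against a restructured collection (the "actions" collection name, the copied-down RecipientId field, and the compound index are all assumptions, not something from your current setup):
// One document per action, with the recipient id copied onto it:
// { "RecipientId": "...", "CampaignId": "...", "ActionDatetime": ISODate(...), "ActionDescription": "OPEN" }

db.getCollection("actions").createIndex({ ActionDatetime: 1, ActionDescription: 1, RecipientId: 1 })

// Count distinct recipients that have at least one matching action:
db.getCollection("actions").distinct("RecipientId", {
  ActionDatetime: { $gt: new Date("1950-08-04") },
  ActionDescription: "OPEN"
}).length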

Related

How do I choose between column-family and a document store database?

I'm working on a project, and I'm struggling to make a definitive decision on whether to use a column-family or a document store. My situation is as follows:
The project I am working on is a hass.io application that will visualize certain data for Tesla cars. My project will run on a Raspberry Pi (Pi 3), so database size is an issue.
My data will look something like this:
{
  "cars" : [
    {
      "car_id" : 3241123,
      "model" : "Tesla S",
      "data" : [
        {
          "timestamp": 23840923804982309,
          "temperature": 24.5,
          "battery_level" : 40,
          "is_charging" : true,
          "speed" : null
        },
        {
          "timestamp": 23840923804982333,
          "temperature": 26.0,
          "battery_level" : 35,
          "is_charging" : false,
          "speed" : 30
        }
      ]
    },
    {
      "car_id" : 3241157,
      "model" : "Renault Zoey",
      "data" : [
        {
          "timestamp": 23840923804982309,
          "temperature": 23.3,
          "battery_level" : 90,
          "is_charging" : true,
          "speed" : null
        },
        {
          "timestamp": 23840923804982350,
          "temperature": 23.0,
          "battery_level" : 92,
          "is_charging" : true,
          "speed" : null
        }
      ]
    }
  ]
}
My project HAS to use a NoSQL database.
This example is in JSON, but it's just to show the data. It doesn't have to be stored in the database as a JSON file per se.
It is expected that the number of cars will be low (2-4) and that the amount of data will grow quite large (a couple of new entries per minute).
I want to be able to plot the data in a graph, so most likely my queries will have to return the timestamp for every data point of every car and some other value, like for example speed or battery level. My database will have a very low amount of clients, and real-time data visualization is not required. Therefore read speed is not very important.
As far as my research has shown, given these requirements the column-family and document-store architectures don't differ too much, except for scalability; but I don't believe my database will grow to a scale where I'd have to start thinking about sharding, and if it does I will probably first have to think about vertical scaling. Am I right in believing this, or is there an actual difference?
On a side note: I am asking this question comparing column-families to document stores, but perhaps this comparison is futile at this level and I have to start looking at specific column stores and document stores. If so, any advice in this direction is also appreciated.

MongoDB arrays - atomic update or push element

I have following document in MongoDB.
{
  "_id" : ObjectId("521aff65e4b06121b688fabc"),
  "user" : "abc",
  "servers" : [
    {
      "name" : "server1",
      "cpu" : 4,
      "memory" : 4
    },
    {
      "name" : "server2",
      "cpu" : 6,
      "memory" : 6
    },
    {
      "name" : "server3",
      "cpu" : 8,
      "memory" : 8
    }
  ]
}
Based on certain events, I have to either update the cpu and memory fields of an existing server or add a new server to the array if it does not exist. Currently, I am performing this operation in two steps: first check whether the server already exists in the array; if yes, update its cpu and memory fields, else push a new sub-document into the array. Due to the multi-threaded nature of the application, sometimes the same server is added to the array multiple times. Is there any atomic operator to perform the following two operations (similar to the $setOnInsert operator):
If element exists in the array, update its field.
If element does not exist in the array, push new element.
Note: The $addToSet operator does not work in the above case, as the value of cpu or memory can be different.
I think you can use findAndModify() to do this, as it provides atomic updates.
But your document structure is not appropriate for it.
If you can change your document to this (i.e., the server ID becomes a key instead of an array element):
{
  "_id" : ObjectId("521aff65e4b06121b688fabc"),
  "user" : "abc",
  "servers" : {
    "server1" : {
      "cpu" : 4,
      "memory" : 4
    },
    "server2" : {
      "cpu" : 6,
      "memory" : 6
    },
    "server3" : {
      "cpu" : 8,
      "memory" : 8
    }
  }
}
Then you can use a single atomic findAndModify() command to update, without needing a separate find() and update():
db.collection.findAndModify({
  query: {"_id" : ObjectId("521aff65e4b06121b688fabc")},
  update: {$set: {"servers.server4": {"cpu": 5, "memory": 5}}},
  new: true
})
When using this, if servers.server4 does not exist, it will be inserted, otherwise updated.
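Updating an existing server works the same way with this structure; for example, a sketch:
db.collection.findAndModify({
  query: {"_id" : ObjectId("521aff65e4b06121b688fabc")},
  update: {$set: {"servers.server1.cpu": 8, "servers.server1.memory": 16}},
  new: true
})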
MongoDb already has atomic operations on individual documents. So as far as sending 2 commands to the DB at the same time goes, you're covered there out of the box.
Your issue arises when those two commands are contradictory anyway, i.e. if you send two updates for the same object, Mongo's per-document atomicity isn't going to help you. So what you need to do is manage your application's multi-threading such that it does not send Mongo multiple commands that may conflict.
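That said, if you keep the original array structure, one guarded two-update pattern (a sketch, not something described above) relies on each single update being atomic, so concurrent writers cannot insert the same server twice:
// 1) Update the element if a server with this name already exists:
db.collection.update(
  {"_id": ObjectId("521aff65e4b06121b688fabc"), "servers.name": "server4"},
  {$set: {"servers.$.cpu": 5, "servers.$.memory": 5}}
)

// 2) Push a new element only if no server with this name exists yet:
db.collection.update(
  {"_id": ObjectId("521aff65e4b06121b688fabc"), "servers.name": {$ne: "server4"}},
  {$push: {"servers": {"name": "server4", "cpu": 5, "memory": 5}}}
)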

MongoDb - Find a specific obj within nested arrays

I hope someone can shed some light on this issue; it's driving me crazy to the point that I have spent the past three days learning more and more about MongoDB and still can't figure out this simple query.
What I need to do is to get the object containing the "carId" = "3C".
In other words the object that I want the query to return is:
{
  "carId" : "3C",
  "_id" : ObjectId("51273329b64f07a40ef1c15e")
}
Here is the dataset (cars):
{
  "_id" : ObjectId("56223329b64f07a40ef1c15c"),
  "username" : "john",
  "email" : "john#john.com",
  "accounts" : [
    {
      "_id" : ObjectId("56322329b61f07a40ef1c15d"),
      "cars" : [
        {
          "carId" : "6A",
          "_id" : ObjectId("56323329b64f07a40ef1c15e")
        },
        {
          "carId" : "6B",
          "_id" : ObjectId("56323329b64f07a40ef1c15e")
        }
      ]
    }
  ]
},
{
  "_id" : ObjectId("56223125b64f07a40ef1c15c"),
  "username" : "paul",
  "email" : "paul#paul.com",
  "accounts" : [
    {
      "_id" : ObjectId("5154729b61f07a40ef1c15d"),
      "cars" : [
        {
          "carId" : "5B",
          "_id" : ObjectId("56323329854f07a40ef1c15e")
        }
      ]
    },
    {
      "_id" : ObjectId("56322117b61f07a40ef1c15d"),
      "cars" : [
        {
          "carId" : "6G",
          "_id" : ObjectId("51212929b64f07a40ef1c15e")
        },
        {
          "carId" : "3C",
          "_id" : ObjectId("51273329b64f07a40ef1c15e")
        },
        {
          "carId" : "4N",
          "_id" : ObjectId("51241279b64f07a40ef1c15e")
        }
      ]
    }
  ]
}
Please note that I have two nested arrays, and MongoDB is apparently limited when it comes to projections on deeply nested arrays; the $ operator can only be used once in a projection, leaving me with no clue how to achieve this simple task.
So again, I want to find only the document that has "carId" : "3C" and return only the immediate object containing "carId" : "3C", not the parent objects.
Any help would be so much appreciated. Possibly using either direct MongoDb or Mongoose. Mongoose would be preferred.
For reference, I have already gone through these other related questions but wasn't able to figure it out:
Updating a deep record in MongoDb
How to Update Multiple Array Elements in mongodb
Hope in the future, this question and your solutions will help others.
Amir,
You must use the Aggregation Framework. You can build a pipeline that processes a stream of documents through several building blocks: filtering, projecting, grouping, sorting, etc.
When dealing with nested arrays you will have to use the $unwind operator. You can get what you want by doing the following:
db.cars.aggregate([
  // De-normalize the nested array of accounts
  {"$unwind": "$accounts"},
  // De-normalize the nested array of cars
  {"$unwind": "$accounts.cars"},
  // Match carId to 3C
  {"$match": {"accounts.cars.carId" : "3C"}},
  // Project the accounts.cars object only
  {"$project" : {"accounts.cars" : 1}},
  // Group and return only the car object
  {"$group": {"_id": "$accounts.cars"}}
]).pretty();
You can use the aggregation framework for "array filtering" by using $unwind.
You can remove stages one at a time from the bottom of the pipeline above to observe each stage's behavior.
Here's an example without the aggregation framework. I don't think there's a way, purely from querying, to get just the individual nested object you're looking for, so you have to do a little post-processing work. Something like Mongoose may provide a way to do this, but I'm not really up to date on what the Mongoose APIs currently look like.
var doc = db.cars.findOne({"accounts.cars" : {$elemMatch: {"carId" : "3C"}}}, {"accounts.cars.$": 1, _id: 0})
var car = doc.accounts[0].cars[0]

MongoDB: updating an array in array

I seem to be having an issue accessing the contents of an array nested within an array in a mongodb document. I have no problems accessing the first array "groups" with a query like the following...
db.orgs.update({_id: org_id, "groups._id": group_id} , {$set: {"groups.$.name": "new_name"}});
Where I run into trouble is when I try to modify properties of an element in the array "features" nested within the "group" array.
Here is what an example document looks like
{
  "_id" : "v5y8nggzpja5Pa7YS",
  "name" : "Example",
  "display_name" : "EX1",
  "groups" : [
    {
      "_id" : "s86CbNBdqJnQ5NWaB",
      "name" : "Group1",
      "display_name" : "G1",
      "features" : [
        {
          _id : "bNQ5Bs8BWqJn6CdNa",
          type : "blog",
          name : "[blog name]",
          owner_id : "ga5YgvP5yza7pj8nS"
        }
      ]
    }
  ]
}
And this is the query I tried to use.
db.orgs.update({_id: "v5y8nggzpja5Pa7YS", "groups._id": "qBX3KDrtMeJGvZWXZ", "groups.features._id":"bNQ5Bs8BWqJn6CdNa" }, {$set: {"groups.$.features.$.name":"New Blog Name"}});
It returns with an error message:
WriteResult({
  "nMatched" : 0,
  "nUpserted" : 0,
  "nModified" : 0,
  "writeError" : {
    "code" : 2,
    "errmsg" : "Too many positional (i.e. '$') elements found in path 'groups.$.features.$.name'"
  }
})
It seems that mongo doesn't support modifying arrays nested within arrays via the positional element?
Is there a way to modify this array without taking the entire thing out, modifying it, and then putting it back in? With multiple nesting like this is it standard practice to create a new collection? (Even though the data is only ever needed when the parent data is necessary) Should I change the document structure so that the second nested array is an object, and access it via key? (Where the key is an integer value that can act as an "_id")
groups.$.features.[KEY].name
What is considered the "correct" way to do this?
After some more research, it looks like the only way to modify the array within an array would be with some outside logic to find the index of the element I want to change. Doing this would require every change to have a find query to locate the index, and then an update query to modify the array. This doesn't seem like the best way.
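A rough sketch of that two-step approach, under the assumption that the group and feature _ids are known up front (all identifiers here mirror the example document above). Note that the find and the update are two separate operations, so a concurrent write could still change the array in between.
// Step 1: find the document and locate the index of the feature to change
var doc = db.orgs.findOne({_id: "v5y8nggzpja5Pa7YS", "groups._id": "s86CbNBdqJnQ5NWaB"});
var group = null;
doc.groups.forEach(function (g) { if (g._id === "s86CbNBdqJnQ5NWaB") { group = g; } });
var featureIndex = -1;
group.features.forEach(function (f, i) { if (f._id === "bNQ5Bs8BWqJn6CdNa") { featureIndex = i; } });

// Step 2: build the dotted path with the numeric index and update
var setDoc = {};
setDoc["groups.$.features." + featureIndex + ".name"] = "New Blog Name";
db.orgs.update(
  {_id: "v5y8nggzpja5Pa7YS", "groups._id": "s86CbNBdqJnQ5NWaB"},
  {$set: setDoc}
);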
Link to a 2010 JIRA case requesting multiple positional elements...
Since I will always know the ID of the feature, I have opted to revise my document structure.
{
  "_id" : "v5y8nggzpja5Pa7YS",
  "name" : "Example",
  "display_name" : "EX1",
  "groups" : [
    {
      "_id" : "s86CbNBdqJnQ5NWaB",
      "name" : "Group1",
      "display_name" : "G1",
      "features" : {
        "1" : {
          type : "blog",
          name : "[blog name]",
          owner_id : "ga5YgvP5yza7pj8nS"
        }
      }
    }
  ]
}
With the new structure, changes can be made in the following manner:
db.orgs.update({_id: "v5y8nggzpja5Pa7YS", "groups._id": "s86CbNBdqJnQ5NWaB"}, {$set: {"groups.$.features.1.name":"Blog Test 1"}});

MongoDB - Using Aggregate to get more than one Matching Object in an Array

I'm trying to do exactly what the poster in this link was trying to accomplish. I have documents with the same structure as the poster; in my documents there is an array of objects, each with many keys. I want to bring back all objects (not just the first, as you can with an $elemMatch) in that array where a key's value matches my query. I want my query's result to simply be an array of objects, where there is a key in each object that matches my query. For example, in the case of the linked question, I would want to return an array of objects where "class":"s2". I would want returned:
"FilterMetric" : [
{
"min" : "0.00",
"max" : "16.83",
"avg" : "0.00",
"class" : "s2"
},
{
"min" : "0.00",
"max" : "16.83",
"avg" : "0.00",
"class" : "s2"
}
]
I tried all the queries in the answer. The first two queries bring back an empty array in Robomongo. In the shell, the command does nothing and returns me to the next line.
On the third query in the answer, I get an unexpected token for the line where "input" is.
I'm using MongoDB version 3.0.2. It appears as if the OP was successful with the answer, so I'm wondering if there is a version issue, or if I'm approaching this the wrong way.
The only problem with the answers in that question seems to be that they're using the wrong casing for FilterMetric. For example, this works:
db.sample.aggregate([
{ "$match": { "FilterMetric.class": "s2" } },
{ "$unwind": "$FilterMetric" },
{ "$match": { "FilterMetric.class": "s2" } },
{ "$group": {
"_id": "$_id",
"FilterMetric": { "$push": "$FilterMetric" }
}}
])
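On MongoDB 3.2 or newer (so not the 3.0.2 mentioned in the question), an alternative that avoids the $unwind/$group round trip is the $filter operator; a sketch, assuming the same sample collection and field names:
db.sample.aggregate([
  { "$match": { "FilterMetric.class": "s2" } },
  { "$project": {
    "FilterMetric": {
      "$filter": {
        "input": "$FilterMetric",
        "as": "metric",
        "cond": { "$eq": [ "$$metric.class", "s2" ] }
      }
    }
  }}
])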
