There is a query of the following type that takes a long time on a collection of millions of records; indexes exist on the _id, cpe_id, and state fields. As I understand it, the $in operator slows down as the array grows, and with a large collection the complexity is O(N * log M), where N is the length of the $in array and M is the number of documents in the collection. Are there any options to improve the performance of this query?
db.collection.aggregate([
  { $match: {
      "cpe_id": { $in: ["e389439e-bd04-f3fb-c512-00193b0c4385", "d389439e-bd04-f3fb-c512-00193b13d00c", ...] }
  } },
  { $sort: { state: 1, _id: 1 } },
  { $skip: 0 },
  { $limit: 100 }
])
The $in operator can be effectively serviced by an index on the queried field, i.e. {cpe_id: 1}.
In terms of performance, it will need to scan one region of the index for each value in the provided array. I would expect that part to scale linearly with the array size.
The sort can be accomplished using an index as well, but MongoDB can use only one index to perform the sort, and it must be the same index used to select the documents.
If there is no single index that can be used for both, it will first find all matching documents, load them into memory, sort them, and only then apply the skip and limit.
If there is an index on {cpe_id: 1, state: 1, _id: 1} or {state: 1, _id: 1, cpe_id: 1} there are several optimizations the query planner can use:
- documents are selected using the index, so no non-matching documents are loaded
- since the values in the index are already sorted in the desired order, it can omit the in-memory sort
- without the blocking sort, the execution can halt after (skip + limit) documents have been found
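For example, the first of those indexes could be created like this (a sketch using the collection name from the question's pipeline):

// compound index covering both the $in filter and the sort keys
db.collection.createIndex({ cpe_id: 1, state: 1, _id: 1 })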
You can use the db.collection.explain shell helper or explain command with the "allPlansExecution" option to see which indexes were considered, and how each performed.
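For example, for the pipeline above:

db.collection.explain("allPlansExecution").aggregate([
  { $match: { cpe_id: { $in: [ /* your ids */ ] } } },
  { $sort: { state: 1, _id: 1 } },
  { $skip: 0 },
  { $limit: 100 }
])
// inspect the winning and rejected plans and their execution stats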
Suppose that this is my compound index in my MongoDB collection:
{
  "age": 1,
  "income": 1
}
Will all returned records be sorted first by age and then by income?
That is not guaranteed. Each MongoDB node will return the documents in the order they are encountered. This means that if the query uses an index to select documents, they will be encountered in the order they appear in the index.
This will result in the results being sorted, as long as the deployment is a single node or a single replica set.
However, if this is run on a sharded cluster, each shard will return the results ordered by the index, but these per-shard result sets will be combined at the mongos in the order they happen to arrive.
This means you may be in for a nasty surprise: your results are no longer returned in a stable order once your application becomes popular enough that you have to scale up to a sharded cluster.
It would be better in the long run to explicitly add a sort to the request. In the case of the single-node or single-replica set it will not add any extra overhead to an index-serviced query, but it will help to future-proof your code.
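For example, with the compound index above, an explicit sort costs nothing extra on a single node (the filter here is a made-up placeholder):

// the sort matches the index order, so no in-memory sort is needed
db.collection.find({ age: { $gte: 21 } }).sort({ age: 1, income: 1 })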
Every document has a users array. I want to check in the db whether a document's users array contains the values ['123','456'] OR ['456','123']; the order doesn't matter, but I need that THESE AND ONLY THESE values are present in the users array.
I'm basically looking for something like set(users.array) == set(456, 123)
I tried using the $in operator, but it is not working properly because it returns true if just one of the values exists, and from what I have seen, $all doesn't check that these are the only values present.
You will need to test 2 things:
- the array contains all of the desired elements
- the array does not contain any undesired elements
The $all query operator handles the first test: {$all: ["123","456"]}
For the second test, you can use the $nin operator: {$nin: ["123","456"]}. Since that tests the entire array at once, by itself it will only match documents that don't contain any of the desired values.
Using the $elemMatch operator to apply the $nin test to each element separately matches the unwanted documents instead: {$elemMatch: {$nin: ["123","456"]}}
Invert the $elemMatch using $not to eliminate the undesired documents.
Putting that all together:
{$and: [
  {key: {$all: ["123","456"]}},
  {key: {$not: {$elemMatch: {$nin: ["123","456"]}}}}
]}
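As a quick sketch of the full query in use (the collection name here is hypothetical; the users field is from the question):

db.docs.find({
  $and: [
    // the array contains all of the desired values...
    { users: { $all: ["123", "456"] } },
    // ...and contains nothing else
    { users: { $not: { $elemMatch: { $nin: ["123", "456"] } } } }
  ]
})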
I'm trying to create an interface which gives our team the ability to build relatively simple queries to segment customers and aggregate data about those customers. In addition to this, I want to build a "List" feature that allows users to upload a CSV that can be separately generated for more complex one-off queries. In this scenario, I'm trying to figure out the best way to query the database.
My main question is how the $in operator works. The example below has an aggregate which tries to check if a primary key (object ID) is in an array of object IDs.
How well does the $in operator perform? I'm mostly wondering how this query will run – does it loop over the array and look for documents that match each value in the array for N lookups, or will it loop over all of the documents in the database and for each one, loop over the array and check if the field matches?
db.getCollection('customers').aggregate([
  {
    $match: {
      _id: { $in: [ObjectId("idstring1"), ObjectId("idstring2"), ..., ObjectId("idstring5000")] }
    }
  }
])
If that's not how it works, what's the most efficient way of aggregating a set of documents given a bunch of object IDs? Should I just do the N lookups manually and pipe the documents into the aggregation?
My main question is how the $in operator works. The example below has
an aggregate which tries to check if a primary key (object ID) is in
an array of object IDs.
Consider the code:
var OBJ_ARR = [ ObjectId("5df9b50e7b7941c4273a5369"), ObjectId("5df9b50f7b7941c4273a5525"), ObjectId("5df9b50e7b7941c4273a515f"), ObjectId("5df9b50e7b7941c4273a51ba") ]
db.test.aggregate( [
{
$match: {
_id: { $in: OBJ_ARR }
}
}
])
The query tries to match each of the array elements against the documents in the collection. Since there are four elements in OBJ_ARR, at most four documents will be returned, depending on the matches.
If you have N _ids in the lookup array, the operation will try to find a match for every element of the input array. The more values you have in the array, the more time it takes; the number of ObjectIds matters for query performance. If the array has a single element, it is treated as a plain equality match.
From the documentation, $in works like an $or operator with equality checks.
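That is, a filter like the first one below behaves like the explicit $or form under it:

// $in form
{ _id: { $in: [ObjectId("5df9b50e7b7941c4273a5369"), ObjectId("5df9b50f7b7941c4273a5525")] } }

// equivalent $or of equality checks
{ $or: [
    { _id: ObjectId("5df9b50e7b7941c4273a5369") },
    { _id: ObjectId("5df9b50f7b7941c4273a5525") }
] }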
... what's the most efficient way of aggregating a set of documents
given a bunch of object IDs?
The _id field of the collection has a unique index by default. The query you are running will use this index to match the documents. Running explain() on the query (with a small set of test data) confirms an index scan (IXSCAN) for the $match stage with $in in the aggregation. That makes the query perform well as it is, because of the index usage. But the aggregation's later stages, the size of the data set, the size of the input array, and other factors will all influence the overall performance and efficiency.
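For example, with the collection and array from above, you can confirm the index scan yourself ("executionStats" is one of several verbosity levels):

db.test.explain("executionStats").aggregate([
  { $match: { _id: { $in: OBJ_ARR } } }
])
// the winning plan should show an IXSCAN stage on the _id index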
Also, see:
Pipeline Operators and Indexes and Aggregation Pipeline Optimization.
Well, here's the DB schema/architecture problem.
Currently in our project we use MongoDB. We have one DB with one collection. Overall there are almost 4 billion documents in that collection (the count is roughly constant). Each document has a unique specific ID, and there is a lot of different information related to this ID (that's why MongoDB was chosen: the data is totally heterogeneous, so schemaless is perfect).
{
"_id": ObjectID("5c619e81aeeb3aa0163acf02"),
"our_id": 1552322211,
"field_1": "Here is some information",
"field_a": 133,
"field_с": 561232,
"field_b": {
"field_0": 1,
"field_z": [45, 11, 36]
}
}
The purpose of the collection is to store a lot of data that is easy to update (some data is updated every day, some once a month) and to search over different fields to retrieve the ID. We also store the "history" of each field (and we need to be able to search over the history as well). So once these ongoing updates were turned on, we hit MongoDB's 16MB maximum document size limit.
We've tried several workarounds (like splitting the document), but all of them involve either a $group or a $lookup stage in the aggregation (grouping by ID, see the example below), and neither stage can use indexes, which makes searching over several fields EXTREMELY slow.
{
"_id": ObjectID("5c619e81aeeb3aa0163acd12"),
"our_id": 1552322211,
"field_1": "Here is some information",
"field_a": 133
}
{
"_id": ObjectID("5c619e81aeeb3aa0163acd11"),
"our_id": 1552322211,
"field_с": 561232,
"field_b": {
"field_0": 1,
"field_z": [45, 11, 36]
}
}
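For illustration, a $group-based recombination of those split documents might look something like this sketch (assuming MongoDB 3.6+ for $mergeObjects):

db.collection.aggregate([
  // recombine all partial documents sharing the same our_id
  { $group: { _id: "$our_id", merged: { $mergeObjects: "$$ROOT" } } },
  { $replaceRoot: { newRoot: "$merged" } }
  // only after this point could a filter like field_1 = 'a' && field_c != 320 be applied
])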
Also, we can't use a $match stage before those, because the search can include logical operators (like field_1 = 'a' && field_c != 320, where field_1 is from one document and field_c is from another, so the search must be done after grouping/joining the documents together), and the logical expression can be VERY complex.
So, are there any tricky workarounds? If not, what other DBs can you suggest moving to?
Kind regards.
Okay, so after some time spent testing different approaches, I've finally ended up using Elasticsearch, because there is no way to perform the requested searches through MongoDB in an adequate amount of time.
Given the following Mongo document model:
Each document in the collection represents 1 hour of resource monitoring. In each document, there is a collection of summaries. There is also a count of the number of summaries as an integer, as it may make life easier.
Is there an efficient way to query the collection and return just the most recent 1000 summaries as an aggregated list?
Or an efficient way to query the collection and return the number of documents that contain the most recent 1000 summaries?
The number of summaries in each document will differ, but the number of summaries in a single document will never exceed 1000.
EDIT: I should mention I am using mongo with the .NET driver, so I have LINQ available to me.
Have you looked at Mongo aggregation? If you want to return the 1000 most recent summaries, you could go with an $unwind followed by a $replaceRoot operation. Here's the shell query I tried:
db.getCollection('test').aggregate([
  { $match: { /* your timestamp match query here */ } },
  { $sort: { "timestamp": -1 } },
  { $unwind: "$summaries" },
  { $limit: 1000 },
  { $replaceRoot: { newRoot: "$summaries" } }
])
The $match operation at the beginning of the aggregation pipeline is important, as indexes can only be used at the start of the pipeline. If you unwind your whole collection, performance might drop drastically.
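Assuming the timestamp field used above, a simple index would let that initial $match and the $sort run off the index instead of a collection scan; a minimal sketch:

// descending index matching the $sort direction
db.getCollection('test').createIndex({ "timestamp": -1 })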