I'm trying to create an interface which gives our team the ability to build relatively simple queries to segment customers and aggregate data about those customers. In addition, I want to build a "List" feature that allows users to upload a CSV, generated separately for more complex one-off queries. In this scenario, I'm trying to figure out the best way to query the database.
My main question is how the $in operator works. The example below has an aggregate which tries to check if a primary key (object ID) is in an array of object IDs.
How well does the $in operator perform? I'm mostly wondering how this query will run: does it loop over the array and look up the documents matching each value, for N lookups, or does it loop over all of the documents in the database and, for each one, loop over the array and check if the field matches?
db.getCollection('customers').aggregate([
    {
        $match: {
            _id: { $in: [ObjectId("idstring1"), ObjectId("idstring2"), ..., ObjectId("idstring5000")] }
        }
    }
])
If that's not how it works, what's the most efficient way of aggregating a set of documents given a bunch of object IDs? Should I just do the N lookups manually and pipe the documents into the aggregation?
My main question is how the $in operator works. The example below has an aggregate which tries to check if a primary key (object ID) is in an array of object IDs.
Consider the code:
var OBJ_ARR = [
    ObjectId("5df9b50e7b7941c4273a5369"),
    ObjectId("5df9b50f7b7941c4273a5525"),
    ObjectId("5df9b50e7b7941c4273a515f"),
    ObjectId("5df9b50e7b7941c4273a51ba")
]

db.test.aggregate([
    {
        $match: {
            _id: { $in: OBJ_ARR }
        }
    }
])
The query tries to match each of the array elements against the documents in the collection. Since there are four elements in OBJ_ARR, up to four documents can be returned, depending on the matches.
If you have N _ids in the lookup array, the operation will try to find a match for every element of the input array. The more values you have in the array, the more time it takes; the number of ObjectIds does matter for query performance. If the array holds a single element, it is treated as a plain equality match.
From the documentation: $in works like an $or operator with equality checks.
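For example, the $match above is equivalent to this $or form (same OBJ_ARR values as above):

db.test.aggregate([
    {
        $match: {
            $or: [
                { _id: ObjectId("5df9b50e7b7941c4273a5369") },
                { _id: ObjectId("5df9b50f7b7941c4273a5525") },
                { _id: ObjectId("5df9b50e7b7941c4273a515f") },
                { _id: ObjectId("5df9b50e7b7941c4273a51ba") }
            ]
        }
    }
])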
... what's the most efficient way of aggregating a set of documents given a bunch of object IDs?
The _id field of the collection has a unique index by default. The query you are running will use this index to match the documents. Running explain() on the query (with a small set of test data) confirms that there is an index scan (IXSCAN) on the match operation when $in is used within the aggregation query. That makes it a well-performing query (as it is) because of the index usage. But the aggregation's later stages, the size of the data set, the size of the input array, and other factors will influence the overall performance and efficiency.
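For example (a sketch, using the collection and array from the example above), you can verify the index scan with:

db.test.explain("executionStats").aggregate([
    { $match: { _id: { $in: OBJ_ARR } } }
])
// The winning plan should show an IXSCAN stage using the default "_id_" index.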
Also, see:
Pipeline Operators and Indexes and Aggregation Pipeline Optimization.
Related
Every document has a users array. I want to check in the db if a document has in its users array the values ['123','456'] OR ['456','123']; the order doesn't matter, but I need THESE AND ONLY THESE values to be present in the users array.
I'm basically looking for something like set(users.array) == set(456, 123)
I tried using the $in operator but it is not working properly, because it returns true if just one of the values exists, and $all, as far as I've seen, doesn't check that these are the only values present.
You will need to test 2 things:
the array contains all of the desired elements
the array does not contain any undesired elements
The $all query operator handles the first test: {$all: ["123","456"]}
For the second test, you can use the $nin operator: {$nin: ["123","456"]}. Since that tests the entire array at once, by itself it will only match documents that don't contain the desired values.
Using the $elemMatch operator to apply the $nin test to each element separately matches the unwanted documents: {$elemMatch: {$nin: ["123","456"]}}
Inverting the $elemMatch with $not eliminates the undesired documents.
Putting that all together:
{$and: [
    {key: {$all: ["123","456"]}},
    {key: {$not: {$elemMatch: {$nin: ["123","456"]}}}}
]}
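For instance, with these hypothetical documents (collection and field names are placeholders), only the first would match:

db.col.find({$and: [
    {key: {$all: ["123","456"]}},
    {key: {$not: {$elemMatch: {$nin: ["123","456"]}}}}
]})
// { key: ["456","123"] }        -- matches: both values present, nothing else
// { key: ["123"] }              -- fails the $all test
// { key: ["123","456","789"] }  -- fails the $not/$elemMatch test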
There is a query of the following type that takes a long time (on a collection with millions of records); indexes are set on the _id and cpe_id fields, and on state. As I understand it, the $in operator gets slower as the search array grows and the collection is large, giving a complexity of O(N * log M), where N is the length of the $in array and M is the number of elements in the collection. Are there any options to somehow improve the performance of the query?
db.collection.aggregate([
    {
        $match: {
            "cpe_id": {
                $in: ["e389439e-bd04-f3fb-c512-00193b0c4385", "d389439e-bd04-f3fb-c512-00193b13d00c", ....]
            }
        }
    },
    { $sort: { state: 1, _id: 1 } },
    { $skip: 0 },
    { $limit: 100 }
])
The $in operator can be effectively serviced by an index on the queried field, i.e. {cpe_id: 1}.
In terms of performance, it will need to scan one region of the index for each value in the provided array. I would expect that part to scale linearly with the array size.
The sort can be accomplished using an index as well, but MongoDB can use only one index to perform the sort, and it must be the same index used to select the documents.
If there is no single index that can be used for both, it will first find all matching documents, load them into memory, sort them, and only then apply the skip and limit.
If there is an index on {cpe_id: 1, state: 1, _id: 1} or {state: 1, _id: 1, cpe_id: 1} there are several optimizations the query planner can use:
documents are selected using the index, so no non-matching documents are loaded
since the values in the index are already sorted in the desired order, it can omit the in-memory sort
without the blocking sort, the execution can halt after (skip + limit) documents have been found.
You can use the db.collection.explain shell helper or the explain command with the "allPlansExecution" option to see which indexes were considered and how each performed.
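A minimal sketch of both steps (the exact field order of the index is an assumption based on the options listed above):

// A compound index that can serve both the $in match and the sort
db.collection.createIndex({ cpe_id: 1, state: 1, _id: 1 })

// Compare candidate plans; look for the in-memory SORT stage disappearing
db.collection.explain("allPlansExecution").aggregate([
    { $match: { "cpe_id": { $in: ["e389439e-bd04-f3fb-c512-00193b0c4385"] } } },
    { $sort: { state: 1, _id: 1 } },
    { $skip: 0 },
    { $limit: 100 }
])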
I have been searching through the MongoDB query syntax with various combinations of terms to see if I can find the right syntax for the type of query I want to create.
We have a collection containing documents with an array field. This array field contains ids of items associated with the document.
I want to be able to check if an item has been associated more than once. If it has, then more than one document will have that id present in its array field.
I don't know in advance the id(s) to check for as I don't know which items are associated more than once. I am trying to detect this. It would be comparatively straightforward to query for all documents with a specific value in their array field.
What I need is some query that can return all the documents where one of the elements of its array field is also present in the array field of a different document.
I don't know how to do this. In SQL it might have been possible with subqueries. In Mongo Query Language I don't know how to do this or even if it can be done.
In MongoDB 3.6, you can use $lookup to self-join the collection and keep a document when there is a match, then use $project with exclusion to drop the joined field.
Note that every document also joins with itself through its own array values, so the self-match has to be filtered out first; after $push-ing the joined rows, a non-equality match against the empty array keeps only the documents that matched some other document.
db.col.aggregate([
    {"$unwind":"$array"},
    {"$lookup":{
        "from":"col",
        "localField":"array",
        "foreignField":"array",
        "as":"jarray"
    }},
    // every row matches its own document, so filter the self-match out
    {"$addFields":{
        "jarray":{"$filter":{
            "input":"$jarray",
            "cond":{"$ne":["$$this._id","$_id"]}
        }}
    }},
    {"$group":{
        "_id":"$_id",
        "fieldOne":{"$first":"$fieldOne"},
        ... other fields
        "jarray":{"$push":"$jarray"}
    }},
    // keep documents where at least one array element matched another document
    {"$match":{"jarray":{"$elemMatch":{"$ne":[]}}}},
    {"$project":{"jarray":0}}
])
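If the goal is only to list the ids that are associated with more than one document, a shorter sketch (same hypothetical collection and array field as above) is to unwind and group by the id itself:

db.col.aggregate([
    {"$unwind":"$array"},
    // collect the set of documents referencing each id
    {"$group":{"_id":"$array","docs":{"$addToSet":"$_id"}}},
    // keep ids that appear in at least two different documents
    {"$match":{"docs.1":{"$exists":true}}}
])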
I have a records collection with the following indexes:
{"_id":1}
{"car.make":1,"city":1,"car.mileage":1}
And performing the following query:
db.records.aggregate([
{
"$match":{
"car.mileage":{"$in":[1000,2000,3000,4000]},
"car.make":"Honda",
"city":{"$in":["Miami","San Francisco","New York", "Chigaco", "Seattle", "Boston"]}
}
},
{
"$sort":{"_id":-1}
}
])
The query without the $sort clause finishes in a few milliseconds, but adding the $sort clause makes it take around 2 minutes. This query should return around 40 documents from a collection of 6m documents. Any clues about what could cause this huge difference in query time?
After additional testing, this problem goes away by sorting on a different field like creation_date, even if creation_date is not indexed. Any ideas why the _id field would perform so much worse than the unindexed creation_date field in this aggregation?
I ran into the same problem today, and I'm speculating here, but I believe that in this case sorting by _id sorts all the entries in the collection before the other operations. (I say speculating because if you omit the $match clause and keep only the $sort clause, you still get your data in milliseconds.)
The workaround that helped me was projection.
If you use a $project clause between $match and $sort, then you will get your data in milliseconds again. So you can either sort on a field like creation_date or, if you must use _id, add a $project stage before sorting on it.
I want to apply skip and limit for paging within a nested array of a document. How can I perform this efficiently?
My document records look like:
{
    "_id": "",
    "name": "",
    "ObjectArray": [{
        "url": "",
        "value": ""
    }]
}
I want to retrieve multiple documents, with every document containing only 'n' elements of the array.
I am using $in in the find query to retrieve multiple records on the basis of _id, but how can I get a certain number of elements of ObjectArray in every document?
You can try this:
db.collection.find({}, {ObjectArray:{$slice:[0, 3]}})
This will give you the first three elements of ObjectArray (skip 0, take 3).
The general form of the projection is $slice: [SKIP_VALUE, LIMIT_VALUE]
For your example:
db.collection.find({"_id":""}, {ObjectArray:{$slice:[0, 3]}})
Here is the reference for MongoDB Slice feature.
http://docs.mongodb.org/manual/reference/operator/projection/slice/
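Putting it together with the $in from the question (the ObjectIds are placeholders), page two with a page size of three would be:

db.collection.find(
    // select the wanted documents by id
    { "_id": { "$in": [ObjectId("5df9b50e7b7941c4273a5369"), ObjectId("5df9b50f7b7941c4273a5525")] } },
    // skip the first 3 array elements, return the next 3
    { "ObjectArray": { "$slice": [3, 3] } }
)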