MongoDB query to find document with duplicate value in array - arrays

tl;dr: I'm struggling to construct a query that makes an aggregation to get a count of values on a certain key ("original_text_source"), which is in a sub-document that is in an array.
Full description
I have embedded documents with arrays that are structured like this:
{
"_id" : ObjectId("0123456789"),
"type" : "some_object",
"relationships" : {
"x" : [ ObjectId("0123456789") ],
"y" : [ ObjectId("0123456789") ],
},
"properties" : [
{
"a" : "1"
},
{
"b" : "1"
},
{
"original_text_source" : "foo.txt"
},
]
}
The docs were created from exactly 10k text files, sorted in various folders. While inserting the documents into MongoDB (in batches) I messed up and moved a few files around, causing one file to be imported twice (my database has a count of exactly 10001 docs), but obviously I don't know which one it is. Since one of the "original_text_source" values has to have a count of 2, I was planning on just deleting one of the two.
I read up on solutions with $elemMatch, but since my array element is a document, I'm not sure how to proceed. Maybe with mapReduce? But I can't transfer the logic to my doc structure.
I also could just create a new collection and reupload all, but in case I mess up again, I'd rather like to learn how to query for duplicates. It seems more elegant :-)

You can find duplicates with a simple aggregation like this:
db.collection.aggregate([
{ $group: { _id: "$properties.original_text_source", docIds: { $push: "$_id" }, docCount: { $sum: 1 } } },
{ $match: { "docCount": { $gt: 1 } } }
])
which gives you something like this:
{
"_id" : [
"foo.txt"
],
"docIds" : [
ObjectId("59d6323613940a78ba1d5ffa"),
ObjectId("59d6324213940a78ba1d5ffc")
],
"docCount" : 2.0
}

Run the following:
db.collection.aggregate([
{ $group: {
_id: { name: "$properties.original_text_source" },
idsForDuplicatedDocs: { $addToSet: "$_id" },
count: { $sum: 1 }
} },
{ $match: {
count: { $gte: 2 }
} },
{ $sort : { count : -1} }
]);
Given a collection which contains two copies of the document you showed in your question, the above command will return:
{
"_id" : {
"name" : [
"foo.txt"
]
},
"idsForDuplicatedDocs" : [
ObjectId("59d631d2c26584cd8b7b3337"),
ObjectId("59d631cbc26584cd8b7b3333")
],
"count" : 2
}
Where ...
The attribute _id.name is the value of the duplicated properties.original_text_source
The attribute idsForDuplicatedDocs contains the _id values for each of the documents which have a duplicated properties.original_text_source
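To see why the group key comes back as an array (["foo.txt"]) rather than a plain string, here is a rough plain-JavaScript sketch of what the $group and $match stages do. It runs on hypothetical in-memory documents, not against MongoDB, and the names docs, sourceKey and findDuplicates are made up for illustration:

```javascript
// Hypothetical in-memory copies of documents; two share the same source file.
const docs = [
  { _id: "doc1", properties: [{ a: "1" }, { b: "1" }, { original_text_source: "foo.txt" }] },
  { _id: "doc2", properties: [{ a: "1" }, { original_text_source: "bar.txt" }] },
  { _id: "doc3", properties: [{ b: "1" }, { original_text_source: "foo.txt" }] },
];

// Mimics "$properties.original_text_source": the field is collected from every
// array element that has it, so the resulting key is itself an array.
function sourceKey(doc) {
  return doc.properties
    .filter((p) => "original_text_source" in p)
    .map((p) => p.original_text_source);
}

// Mimics $group followed by $match: bucket the ids by key, keep buckets with more than one id.
function findDuplicates(documents) {
  const groups = new Map();
  for (const doc of documents) {
    const key = JSON.stringify(sourceKey(doc));
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc._id);
  }
  return [...groups.entries()]
    .filter(([, ids]) => ids.length > 1)
    .map(([key, ids]) => ({ _id: JSON.parse(key), docIds: ids, docCount: ids.length }));
}

const duplicates = findDuplicates(docs);
console.log(duplicates);
```

Once you have the ids of the duplicated documents, you can remove one of them by its _id.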

"reviewAndRating": [
{
"review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
"productId": "5bd956f29fcaca161f6b7517",
"_id": "5bd9745e2d66162a6dd1f0ef",
"rating": "5"
},
{
"review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
"productId": "5bd956f29fcaca161f6b7518",
"_id": "5bd974612d66162a6dd1f0f0",
"rating": "5"
},
{
"review": "aksjdhfkashdfkashfdkjashjdkfhasdkjfhsafkjhasdkjfhasdjkfhsdakfj",
"productId": "5bd956f29fcaca161f6b7517",
"_id": "5bd974622d66162a6dd1f0f1",
"rating": "5"
}
]

Related

Mongo DB find value in array of multiple nested arrays

I need to check if an ObjectId exists in a non nested array and in multiple nested arrays, I've managed to get very close using the aggregation framework, but got stuck in the very last step.
My documents have this structure:
{
"_id" : ObjectId("605ce5f063b1c2eb384c2b7f"),
"name" : "Test",
"attrs" : [
ObjectId("6058e94c3994d04d28639616"),
ObjectId("6058e94c3994d04d28639627"),
ObjectId("6058e94c3994d04d28639622"),
ObjectId("6058e94c3994d04d2863962e")
],
"variations" : [
{
"varName" : "Var1",
"attrs" : [
ObjectId("6058e94c3994d04d28639616"),
ObjectId("6058e94c3994d04d28639627"),
ObjectId("6058e94c3994d04d28639622"),
ObjectId("60591791d4d41d0a6817d23f")
],
},
{
"varName" : "Var2",
"attrs" : [
ObjectId("60591791d4d41d0a6817d22a"),
ObjectId("60591791d4d41d0a6817d255"),
ObjectId("6058e94c3994d04d28639622"),
ObjectId("60591791d4d41d0a6817d23f")
],
},
],
"storeId" : "9acdq9zgke49pw85"
}
Let's say I need to check whether the _id "6058e94c3994d04d28639616" exists in all arrays named attrs.
My aggregation query goes like this:
db.product.aggregate([
{
$match: {
storeId,
},
},
{
$project: {
_id: 0,
attrs: 1,
'variations.attrs': 1,
},
},
{
$project: {
attrs: 1,
vars: '$variations.attrs',
},
},
{
$unwind: '$vars',
},
{
$project: {
attr: {
$concatArrays: ['$vars', '$attrs'],
},
},
},
]);
which results in this:
[
{
attr: [
6058e94c3994d04d28639616,
6058e94c3994d04d28639627,
6058e94c3994d04d28639622,
6058e94c3994d04d2863962e,
6058e94c3994d04d28639616,
6058e94c3994d04d28639627,
6058e94c3994d04d28639622,
60591791d4d41d0a6817d23f,
60591791d4d41d0a6817d22a,
60591791d4d41d0a6817d255,
6058e94c3994d04d28639622,
60591791d4d41d0a6817d23f
]
},
{
attr: [
60591791d4d41d0a6817d22a,
60591791d4d41d0a6817d255,
6058e94c3994d04d28639622,
60591791d4d41d0a6817d23f,
6058e94c3994d04d28639624,
6058e94c3994d04d28639627,
6058e94c3994d04d28639628,
6058e94c3994d04d2863963e
]
}
]
Assuming I have two products in my DB, I get this result. Each element in the outermost array is a different product.
The last bit, which is checking for the key "6058e94c3994d04d28639616", is where I got stuck. I could not find a way to do it with $group, since I don't have keys to group on.
Or with $match, adding this to the end of the aggregation:
{
$match: {
attr: "6058e94c3994d04d28639616",
},
},
But that results in an empty array. I know that $match does not query arrays like this, but could not find a way to do it with $in as well.
Is this too complicated of a Schema? I cannot have the original data embedded, since it is mutable and I would not be happy to change all products if something changed.
Will this be very expensive if I had like 10000 products?
Thanks in advance
You are trying to compare the string 6058e94c3994d04d28639616 with ObjectId values. Convert the string to an ObjectId using the $toObjectId operator when performing the $match, like this:
{
$match: {
$expr: {
$in: [{ $toObjectId: "6058e94c3994d04d28639616" }, "$attr"]
}
}
}
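The type mismatch can be illustrated without a database. The sketch below uses a made-up stand-in class, FakeObjectId (it is not the driver's ObjectId, just an assumption for illustration), to show that a plain string never equals an id object holding the same hex digits, while a converted value does:

```javascript
// Hypothetical stand-in for an ObjectId: an object wrapping a hex string.
class FakeObjectId {
  constructor(hex) {
    this.hex = hex;
  }
  equals(other) {
    return other instanceof FakeObjectId && other.hex === this.hex;
  }
}

// The concatenated attr array holds id objects, not strings.
const attrValues = ["6058e94c3994d04d28639616", "6058e94c3994d04d28639627"].map(
  (hex) => new FakeObjectId(hex)
);
const wantedHex = "6058e94c3994d04d28639616";

// Comparing the raw string against id objects: never matches.
const foundAsString = attrValues.some((value) => value === wantedHex);

// Converting the string first (the role $toObjectId plays) makes the comparison work.
const foundAsId = attrValues.some((value) => value.equals(new FakeObjectId(wantedHex)));

console.log({ foundAsString, foundAsId });
```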

mongodb - Get one array from two arrays in collection

In my mongodb collection, I have sometimes two, sometimes one and sometimes null arrays on a document. Now I'd like to get one array over the whole collection with the values of these arrays.
The document looks like this:
{
"title" : "myDocument",
"listOne" : [
"valueOne",
"valueTwo"
],
"listTwo" : [
"abc",
"qwer"
]
},
{
"title" : "myDocumentTwo",
"listTwo" : [
"321"
]
},
{
"title" : "myDocumentAlpha",
"listOne" : [
"alpha",
"beta"
]
},
{
"title" : "myDocumentbeta"
}
And I expect the following output:
"combinedList" : [
"valueOne",
"valueTwo",
"abc",
"qwer",
"321",
"alpha",
"beta"
]
In other words, every distinct value from these two arrays, across every document in the collection.
You can do this using aggregate and $concatArrays:
db.collection.aggregate([
{
$project: {
combinedList: {
$concatArrays: [{$ifNull: ["$listOne", []]}, {$ifNull: ["$listTwo", []]}]
}
}
},
{ $unwind: "$combinedList" },
{ $group: { _id: null, combinedList: { $addToSet: "$combinedList"}}},
{ $project: { _id: 0, combinedList: 1 }}
])
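If it helps to see the logic outside the pipeline, here is a plain-JavaScript sketch of the same idea on in-memory copies of the sample documents: treat a missing list as empty (the $ifNull part), concatenate, and keep distinct values (the $addToSet part). The names listDocs and combineLists are hypothetical. One difference to keep in mind: a JavaScript Set keeps insertion order, whereas $addToSet does not guarantee any particular order.

```javascript
// In-memory copies of the sample documents from the question.
const listDocs = [
  { title: "myDocument", listOne: ["valueOne", "valueTwo"], listTwo: ["abc", "qwer"] },
  { title: "myDocumentTwo", listTwo: ["321"] },
  { title: "myDocumentAlpha", listOne: ["alpha", "beta"] },
  { title: "myDocumentbeta" },
];

function combineLists(documents) {
  const seen = new Set();
  for (const doc of documents) {
    // Missing arrays default to [], mirroring $ifNull: ["$listOne", []].
    const merged = [...(doc.listOne ?? []), ...(doc.listTwo ?? [])];
    for (const value of merged) seen.add(value); // distinct values only
  }
  return [...seen];
}

const combinedList = combineLists(listDocs);
console.log(combinedList);
```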

Sort by deep document field in MongoDb

I have a collection called Visitor in which each document has an array of chats, and each chat has a sub-document called user.
I need to find some documents in this collection and sort them so that those with a specific user in their chats come first.
The path for the user id is:
chats.user._id
where:
chats // array
user // document
_id // ObjectId
The below script does sort the documents correctly, however, it expands the chats array and multiplies the document for each chat in the array.
I only need the sorting, so can I sort and not use the unwind pipeline or make it somehow not multiply the documents?
db.getCollection('Visitor').aggregate([
{$unwind: "$chats"},
{ $match: {'event._id':ObjectId('5c942a3591deb389bfd92579'), 'chats.enabled': {$exists: true}}},
{
"$project": {
"_id": 1,
"chats.user._id": 1,
"weight": {
"$cond": [
{ "$eq": [ "$chats.user._id", ObjectId("5c942a3591deb389bfd92579") ] },
10,
0
]
}
}
},
{ "$sort": { "weight": -1 } },
])
EDIT: I don't need to sort the inner array, but sort the find command by checking if a specific user is in the chats array.
Some sample of Visitor collection:
[
{
"_id" : ObjectId("5c9a3a1bd86e0ba64106e90e"),
"event" : {
"_id" : ObjectId("5c942a3591deb389bfd92579")
},
"chats" : [
{
"enabled" : false,
"user" : {
"_id" : ObjectId("5c81232f09a923b559763418")
},
"_id" : ObjectId("5c9a3a1bd86e0ba64106e915")
}
]
},
{
"_id" : ObjectId("5c9a3a35d86e0ba64106e950"),
"event" : {
"_id" : ObjectId("5c942a3591deb389bfd92579")
},
"chats" : [
{
"enabled" : true,
"user" : {
"_id" : ObjectId("5c81232f09a923b559763418")
},
"_id" : ObjectId("5c9a3a35d86e0ba64106e957")
},
{
"enabled" : true,
"user" : {
"_id" : ObjectId("5c942a3591deb389bfd92579")
},
"_id" : ObjectId("5c9a3a34d86e0ba64106e91d")
}
]
}
]
In the above sample, I need to make the second document to be sorted first because it has the user with the _id ObjectId("5c942a3591deb389bfd92579").
The problem here is that by using $unwind you modify the initial structure of your documents (you will get one document per chat). I would suggest using $map to get an array of weights based on the specified user id, and then using $max to get the final weight:
db.col.aggregate([
{ $match: {'event._id':ObjectId('5c942a3591deb389bfd92579'), 'chats.enabled': {$exists: true}}},
{
"$project": {
"_id": 1,
"chats.user._id": 1,
"weight": {
$max: { $map: { input: "$chats", in: { $cond: [ { $eq: [ "$$this.user._id", ObjectId("5c942a3591deb389bfd92579") ] }, 10, 0 ] } } }
}
}
},
{ "$sort": { "weight": -1 } },
])
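The weighting idea can be checked on plain objects. This is only a sketch on trimmed in-memory copies of the two sample documents, with ids kept as strings for simplicity; visitors, weightOf and sortedVisitors are made-up names. It assumes every document has at least one chat:

```javascript
const targetUserId = "5c942a3591deb389bfd92579";

// Trimmed copies of the sample Visitor documents (ids as plain strings).
const visitors = [
  { _id: "5c9a3a1bd86e0ba64106e90e", chats: [{ user: { _id: "5c81232f09a923b559763418" } }] },
  {
    _id: "5c9a3a35d86e0ba64106e950",
    chats: [
      { user: { _id: "5c81232f09a923b559763418" } },
      { user: { _id: "5c942a3591deb389bfd92579" } },
    ],
  },
];

// Mirrors $map + $max: one weight per chat, then the largest weight per document.
function weightOf(visitor) {
  const weights = visitor.chats.map((chat) => (chat.user._id === targetUserId ? 10 : 0));
  return Math.max(...weights);
}

// Mirrors $sort: { weight: -1 }. The chats array is never expanded, so each
// document appears exactly once in the result.
const sortedVisitors = visitors
  .map((visitor) => ({ ...visitor, weight: weightOf(visitor) }))
  .sort((a, b) => b.weight - a.weight);

console.log(sortedVisitors.map((v) => [v._id, v.weight]));
```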

In mongo DB, how do I grab multiple counts in a single query?

I'm currently trying to massage out counts from the mLab API for reasons I don't have control over. So I want to grab the data I need from there in one query so I can limit the amount of API calls.
Assuming that my data looks like this:
{
"_id": {
"$oid": "12345"
},
"dancer": "Beginner",
"pirate": "Advanced",
"chef": "Mid",
"beartamer": "Mid",
"swordsman": "Mid",
"total": "Mid"
}
I know I can do 6 queries with something similar to:
db.score.aggregate({"$group": { _id: {"total":"$total"}, count: {$sum:1} }} )
but how do I query to get the count for each key? I'd like to see something akin to:
{ "_id" : { "total" : "Advanced" }, "count" : 1 }
{ "_id" : { "total" : "Mid" }, "count" : 1 }
{ "_id" : { "total" : "Beginner" }, "count" : 4 }
{ "_id" : { "pirate" : "Advanced" }, "count" : 1 }
//...etc
The following should give you precisely what you want:
db.scores.aggregate([{
$project: {
"_id": 0 // get rid of the "_id" field since we do not want to count it
}
}, {
$project: {
"doc": {
$objectToArray: "$$ROOT" // transform all documents into key-value pairs
}
}
}, {
$unwind: "$doc" // flatten the resulting array into separate documents
}, {
$group: {
"_id": "$doc", // group by distinct key-value combination
"count": { $sum: 1 } // count documents per bucket
}
}, {
$project: {
"_id": { // some more transformation magic to recreate the desired output structure
$mergeObjects: [
{ $arrayToObject: [ [ "$_id" ] ] },
{ "count": "$count" }
]
},
}
}, {
$replaceRoot: {
"newRoot": "$_id" // this moves the contents of the "_id" field to the root of the documents
}
}])
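For intuition, the pipeline boils down to "turn every document into key-value pairs, then count each distinct pair". A plain-JavaScript sketch of that logic follows, using small made-up documents; scores, countKeyValuePairs and pairCounts are hypothetical names, and nothing here talks to MongoDB:

```javascript
// Hypothetical, shortened score documents.
const scores = [
  { _id: "1", dancer: "Beginner", pirate: "Advanced", total: "Mid" },
  { _id: "2", dancer: "Beginner", pirate: "Mid", total: "Mid" },
];

function countKeyValuePairs(documents) {
  const counts = new Map();
  // Dropping _id mirrors the first $project stage.
  for (const { _id, ...fields } of documents) {
    // Object.entries plays the role of $objectToArray plus $unwind.
    for (const [key, value] of Object.entries(fields)) {
      const bucket = JSON.stringify([key, value]);
      counts.set(bucket, (counts.get(bucket) ?? 0) + 1); // $group with $sum: 1
    }
  }
  // Rebuild one flat object per bucket, like $mergeObjects + $replaceRoot.
  return [...counts.entries()].map(([bucket, count]) => {
    const [key, value] = JSON.parse(bucket);
    return { [key]: value, count };
  });
}

const pairCounts = countKeyValuePairs(scores);
console.log(pairCounts);
```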

MongoDB get count of particular key in an array

In mongoDB, how can we get the count of particular key in an array
{
"_id" : ObjectId("52d9212608a224e99676d378"),
"business" : [
{
"name" : "abc",
"rating" : 4.5
},
{
"name" : "pqr"
},
{
"name" : "xyz",
"rating" : 3.6
}
]
}
In the above example, business is an array (with "name" and/or "rating" keys).
How can I get the count of elements in the business array that have the "rating" key?
Expected output is : 2
Looks like you have to use the Aggregation Framework. In particular, you need to $unwind your array, then match only the elements that include the rating field, then $group the documents back into the original format.
Try something like this:
db.test.aggregate([
{ $match: { /* your query criteria document */ } },
{ $unwind: "$business" },
{ $match: {
"business.rating": { $exists: 1 }
}
},
{ $group: {
_id: "$_id",
business: { $push: "$business" },
business_count: { $sum: 1 }
}
}
])
Result will look like the following:
{
_id: ObjectId("52d9212608a224e99676d378"),
business: [
{ name: "abc", rating: 4.5 },
{ name: "xyz", rating: 3.6 }
],
business_count: 2
}
UPD: Looks like OP doesn't want to group the results by the wrapping document's _id field. Unfortunately, the $group expression must specify an _id value, otherwise it fails with an exception. But this value can actually be a constant (e.g. plain null or 'foobar'), so there will be only one resulting group with a collection-wide aggregation.
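As a sanity check on the expected number, here is the same counting logic in plain JavaScript on an in-memory copy of the sample document. businessDocs and countRated are hypothetical names; summing across all documents corresponds to grouping with a constant _id such as null:

```javascript
// In-memory copy of the sample document (id kept as a plain string).
const businessDocs = [
  {
    _id: "52d9212608a224e99676d378",
    business: [{ name: "abc", rating: 4.5 }, { name: "pqr" }, { name: "xyz", rating: 3.6 }],
  },
];

// Count array elements that have a "rating" key, summed over every document.
function countRated(documents) {
  let total = 0;
  for (const doc of documents) {
    total += doc.business.filter((entry) => "rating" in entry).length;
  }
  return total;
}

const ratedCount = countRated(businessDocs);
console.log(ratedCount); // 2 for the sample document
```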

Resources