MongoDB remove duplicate subdocuments inside array based on a specific field

My documents have the following structure:
{
_id: ObjectId("59303aa1bad1081d4b98d636"),
clear_number: "83490",
items: [
{
name: "83490_1",
file_id: "e7209bbb",
hash: "2f568bb196f74263c64b7cf273f8ceaa",
},
{
name: "83490_2",
file_id: "9a56a935",
hash: "9c6230f7bf19d3f3186c6c3231ac2055",
},
{
name: "83490_2",
file_id: "ce5f6773",
hash: "9c6230f7bf19d3f3186c6c3231ac2055",
}
],
group_id: null
}
How can I remove one of the two subdocuments that share the same items.hash value?

The following should do the trick, if I understand your question correctly:
collection.aggregate([{
    $unwind: "$items" // flatten the items array
}, {
    $group: {
        // per document, group by hash value
        "_id": { "_id": "$_id", "clear_number": "$clear_number", "group_id": "$group_id", "hash": "$items.hash" },
        "items": { $first: "$items" } // keep only the first of all matching items per group
    }
}, {
    $group: {
        // now group everything again, this time without the hashes
        "_id": { "_id": "$_id._id", "clear_number": "$_id.clear_number", "group_id": "$_id.group_id" },
        "items": { $push: "$items" } // push all single items back into the "items" array
    }
}, {
    $project: { // this is just to restore the original document layout
        "_id": "$_id._id",
        "clear_number": "$_id.clear_number",
        "group_id": "$_id.group_id",
        "items": "$items"
    }
}])
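To make the effect of the pipeline easier to follow, here is a minimal plain-JavaScript sketch of the same idea applied to a single document in memory. It is not MongoDB code: dedupeItemsByHash is a hypothetical helper name, and it assumes every item carries a hash field.

```javascript
// Hypothetical helper (not a MongoDB API): keeps the first item seen
// for each hash, which is the role $first plays in the pipeline.
function dedupeItemsByHash(doc) {
  const seen = new Set();
  const items = [];
  for (const item of doc.items) {
    if (!seen.has(item.hash)) {
      seen.add(item.hash);
      items.push(item);
    }
  }
  // Return a copy so the input document is left untouched.
  return { ...doc, items };
}

const doc = {
  clear_number: "83490",
  items: [
    { name: "83490_1", file_id: "e7209bbb", hash: "2f568bb196f74263c64b7cf273f8ceaa" },
    { name: "83490_2", file_id: "9a56a935", hash: "9c6230f7bf19d3f3186c6c3231ac2055" },
    { name: "83490_2", file_id: "ce5f6773", hash: "9c6230f7bf19d3f3186c6c3231ac2055" }
  ],
  group_id: null
};

const deduped = dedupeItemsByHash(doc);
console.log(deduped.items.map(i => i.file_id)); // e7209bbb and 9a56a935 remain
```

Note that the aggregation only returns the de-duplicated documents; it does not modify the stored ones.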
In response to your comment I would suggest the following query to get the list of all document ids that contain duplicate hashes:
collection.aggregate([{
    $addFields: {
        "hashes": {
            $setUnion: [
                [ { $size: "$items.hash" } ], // total number of hashes
                [ { $size: { $setUnion: "$items.hash" } } ] // number of distinct hashes
            ]
        }
    }
}, {
    $match: {
        // find all documents where the distinct count differs from the total count
        "hashes.1": { $exists: true }
    }
}, {
    $project: { _id: 1 } // only return the _id field
}])
There might be different approaches, but this one seems pretty straightforward.
In the $addFields stage we first create, for each document, an array consisting of two numbers: the total number of hashes and the number of distinct hashes.
Then we run this array of two numbers through a $setUnion. After this step there are two possibilities: either two different numbers are left in the array, in which case the hash field does contain duplicates, or only one element is left, in which case the number of distinct hashes equals the total number of hashes (so there are no duplicates).
We can check whether there are two items in the array by testing if the element at position 1 (arrays are zero-based!) exists. That's what the $match stage does.
The final $project stage just limits the output to the _id field.
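The counting trick described above can be sketched in plain JavaScript as follows. This is only an in-memory illustration, not MongoDB code; hasDuplicateHashes is a hypothetical name and the sample documents are made up.

```javascript
// Hypothetical helper: mirrors the $addFields + $match logic.
function hasDuplicateHashes(doc) {
  const hashes = doc.items.map(item => item.hash);
  const total = hashes.length;              // total number of hashes
  const distinct = new Set(hashes).size;    // number of distinct hashes
  // Union of the two counts: two elements survive only if they differ.
  const counts = [...new Set([total, distinct])];
  // Same test as "hashes.1": { $exists: true }.
  return counts.length > 1;
}

const withDuplicate = { items: [{ hash: "a" }, { hash: "b" }, { hash: "b" }] };
const withoutDuplicate = { items: [{ hash: "a" }, { hash: "b" }] };

console.log(hasDuplicateHashes(withDuplicate));    // true
console.log(hasDuplicateHashes(withoutDuplicate)); // false
```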

Related

MongoDB Aggregation: How to return only the values that don't exist in all documents

Let's say I have an array ['123', '456', '789'].
I want to aggregate over every document that has the field books and return only the values that are NOT in any document. For example, if '123' is in a document and '456' is too, but '789' is not, it should return the array ['789'], as that value is not included in the books field of any document.
.aggregate( [
{
$match: {
books: {
$in: ['123', '456', '789']
}
}
},
I don't want the documents returned, but just the actual values that are not in any documents.
Here's one way to scan the entire collection to look for missing book values.
db.collection.aggregate([
{ // "explode" books array to docs with individual book values
"$unwind": "$books"
},
{ // scan entire collection creating set of book values
"$group": {
"_id": null,
"allBooksSet": {
"$addToSet": "$books" // <-- generate set of book values
}
}
},
{
"$project": {
"_id": 0, // don't need this anymore
"missing": { // use $setDifference to find missing values
"$setDifference": [
[ "123", "456", "789" ], // <-- your values go here
"$allBooksSet" // <-- the entire collection's set of book values
]
}
}
}
])
Example output:
[
{
"missing": [ "789" ]
}
]
Try it on mongoplayground.net.
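As a rough in-memory illustration of what the pipeline computes, here is a plain-JavaScript sketch. It is not MongoDB code; missingValues is a hypothetical name and the sample documents are invented. Documents without a books field are skipped, which matches the default behaviour of $unwind.

```javascript
// Hypothetical helper: collect every book value, then keep the wanted
// values that never occurred (the role of $setDifference).
function missingValues(wanted, docs) {
  const allBooksSet = new Set();
  for (const doc of docs) {
    for (const book of doc.books || []) {
      allBooksSet.add(book);
    }
  }
  return wanted.filter(value => !allBooksSet.has(value));
}

const sampleDocs = [
  { books: ["123", "000"] },
  { books: ["456"] },
  { title: "no books field here" }
];

const missing = missingValues(["123", "456", "789"], sampleDocs);
console.log(missing); // only "789" is left
```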
Based on @rickhg12hs's answer, here is another variation that replaces $unwind with $reduce, which is considered less costly. Two of the three steps are the same:
db.collection.aggregate([
{
$group: {
_id: null,
allBooks: {$push: "$books"}
}
},
{
$project: {
_id: 0,
allBooksSet: {
$reduce: {
input: "$allBooks",
initialValue: [],
in: {$setUnion: ["$$value", "$$this"]}
}
}
}
},
{
$project: {
missing: {
$setDifference: [["123","456", "789"], "$allBooksSet"]
}
}
}
])
Try it on mongoplayground.net.

MongoDB: How to take multiple fields within a document and output their values into an array (as a new field)?

MongoDB: 4.4.9, Mongosh: 1.0.4
I have a MongoDB collection full of documents with monthly production data as separate fields (monthlyProd1, monthlyProd2, etc.). Each field is one month's production data, and the values are an object data type.
Document example:
_id: ObjectId("314e0e088f183fb7e699d635")
name: "documentName"
monthlyProd1: Object
monthlyProd2: Object
monthlyProd3: Object
...
I want to take all the months and put them into a single new field (monthlyProd) -- a single array of objects.
I can't seem to access the fields with the different methods I've tried. For example, this gets close to doing what I want:
db.monthlyProdData.updateMany({},
{ $push: { "monthlyProd": { $each: [ "$monthlyProd1", "$monthlyProd2", "$monthlyProd3" ] } } }
)
...but instead of taking the value / object data from each field, like I had hoped, it just outputs a string into the monthlyProd array ("$monthlyProd1", "$monthlyProd2", ...):
Actual output:
monthlyProd: Array
0: "$monthlyProd1"
1: "$monthlyProd2"
2: "$monthlyProd3"
...
Desired output:
monthlyProd: Array
0: Object
1: Object
2: Object
...
I want the data, not a string! Lol. Thank you for your help!
Note: some months/fields may be an empty string ("") because there was no production. I want to make sure to not add empty strings into the array -- only months with production / fields that have an object data type. That being said, I can try figuring that out on my own, if I can just get access to these fields' data!
Try this one:
db.collection.updateMany({}, [
// convert to k-v Array
{ $set: { monthlyProd: { $objectToArray: "$$ROOT" } } },
{
$set: {
monthlyProd: {
// removed not needed objects
$filter: {
input: "$monthlyProd",
cond: { $not: { $in: [ "$$this.k", [ "name", "_id" ] ] } }
// or cond: { $in: [ "$$this.k", [ "monthlyProd1", "monthlyProd2", "monthlyProd3" ] ] }
}
}
}
},
// output array value
{ $project: { monthlyProd: "$monthlyProd.v" } }
])
Mongo playground
Thank you to @Wernfried for the original solution to this question. I have modified the solution to incorporate my "Note" about ignoring any empty monthlyProd# values (i.e. months that didn't have any production), so that they are not added to the final monthlyProd array.
To do this, I added an $and operator to the cond: within $filter, and added the following as the second expression of the $and operator (I used "" and {} to cover empty field values of either string or object data type):
{ $not: { $in: [ "$$this.v", [ "", {} ] ] } }
Final solution:
db.monthlyProdData2.updateMany({}, [
// convert to k-v Array
{ $set: { monthlyProd: { $objectToArray: "$$ROOT" } } },
{
$set: {
monthlyProd: {
// removed not needed objects
$filter: {
input: "$monthlyProd",
cond: { $and: [
{ $not: { $in: [ "$$this.k", [ "name", "_id" ] ] } },
{ $not: { $in: [ "$$this.v", [ "", {} ] ] } }
]}
}
}
}
},
// output array value
{ $project: { monthlyProd: "$monthlyProd.v", name: 1 } }
])
Thanks again @Wernfried and the Stack Overflow community!
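For readers who want to see what the final update pipeline does to one document, here is a plain-JavaScript sketch of the same filtering. It is not MongoDB code; collectMonthlyProd is a hypothetical name, and the oil field inside each month is invented for illustration.

```javascript
// Hypothetical helper: mirrors $objectToArray + $filter + "$monthlyProd.v".
function collectMonthlyProd(doc) {
  const excludedKeys = ["_id", "name"];
  // Treat "" and {} as "no production", as in the $in check of the pipeline.
  const isEmptyValue = v =>
    v === "" ||
    (typeof v === "object" && v !== null && !Array.isArray(v) && Object.keys(v).length === 0);

  return Object.entries(doc)                      // like $objectToArray
    .filter(([key, value]) => !excludedKeys.includes(key) && !isEmptyValue(value))
    .map(([, value]) => value);                   // keep only the values
}

const productionDoc = {
  _id: "314e0e088f183fb7e699d635",
  name: "documentName",
  monthlyProd1: { oil: 10 },
  monthlyProd2: "",
  monthlyProd3: { oil: 7 },
  monthlyProd4: {}
};

const monthlyProd = collectMonthlyProd(productionDoc);
console.log(monthlyProd); // the two non-empty months, in field order
```

Like the pipeline, this keeps every field that is not explicitly excluded, so any extra fields on the document would also end up in the array unless they are added to the exclusion list.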

Query for documents where array size inside array is greater than 1

I am trying to find documents in which a list nested inside two other lists has more than one element.
This is what my collection looks like:
{
"value": {
"items": [
{
"docs": [
{
"numbers": [
1,
2
]
},
{
"numbers": [
1
]
}
]
}
]
}
}
I tried to use this query and it did not work:
db.getCollection('MyCollection').find({"value.items.docs.numbers":{ $exists: true, $gt: {$size: 1} }})
What is the right query to check whether a list nested inside another list contains more than one item?
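You are checking a condition inside nested arrays, so nested $elemMatch conditions are needed.
$size accepts only an exact number, not a range, so $not is used to express the negative condition (the size is not 1).
$ne: [] ensures that the array is not empty.
Together, "not empty" and "size is not 1" mean the array has more than one element.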
db.getCollection('MyCollection').find({
"value.items": {
$elemMatch: {
docs: {
$elemMatch: {
numbers: {
$exists: true,
$ne: [],
$not: {
$size: 1
}
}
}
}
}
}
})
Playground
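The matching rule can be illustrated with a small plain-JavaScript predicate. This is not MongoDB code; hasNumbersLongerThanOne is a hypothetical name, and the sketch assumes numbers is always an array when it is present.

```javascript
// Hypothetical helper: true if any docs element of any items element
// has a numbers array with more than one entry.
function hasNumbersLongerThanOne(doc) {
  const items = (doc.value && doc.value.items) || [];
  return items.some(item =>
    (item.docs || []).some(d =>
      Array.isArray(d.numbers) &&   // $exists: true (array assumed)
      d.numbers.length !== 0 &&     // $ne: []
      d.numbers.length !== 1        // $not: { $size: 1 }
    )
  );
}

const matching = { value: { items: [{ docs: [{ numbers: [1, 2] }, { numbers: [1] }] }] } };
const notMatching = { value: { items: [{ docs: [{ numbers: [1] }, { numbers: [] }] }] } };

console.log(hasNumbersLongerThanOne(matching));    // true
console.log(hasNumbersLongerThanOne(notMatching)); // false
```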

MongoDb count percent of document with a certain field present

I have some MongoDB documents (representing orders) and their schema looks roughly like this:
{
id: ObjectID
exchange_order_products: Array
}
The exchange_order_products array is empty if the customer didn't exchange any of the items they ordered; otherwise it contains an object for each item exchanged.
I want to get the percent of orders in which the customer didn't exchange anything, i.e. the exchange_order_products array is empty.
So basically the formula is the following: (Number Of Orders With At Least One Exchange * 100) / Number of Orders With No Exchanges
I know that I can count the number of orders where the exchange_order_products array is empty like this:
[{$match: {
exchange_order_products: {$exists: true, $size: 0}
}}, {$count: 'count'}]
But how do I simultaneously get the number of all the documents in my collection?
You can use $group and $sum along with $cond to count empty and non-empty ones separately. Then you need $multiply and $divide to calculate the percentage:
db.collection.aggregate([
{
$group: {
_id: null,
empty: { $sum: { $cond: [ { $eq: [ { $size: "$exchange_order_products" }, 0 ] }, 1, 0 ] } },
nonEmpty: { $sum: { $cond: [ { $eq: [ { $size: "$exchange_order_products" }, 0 ] }, 0, 1 ] } },
}
},
{
$project: {
percent: {
$multiply: [
100, { $divide: [ "$nonEmpty", "$empty" ] }
]
}
}
}
])
Mongo Playground
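The arithmetic of the pipeline can be checked with a plain-JavaScript sketch. It is not MongoDB code; exchangeStats is a hypothetical name and the sample orders are invented. Besides the ratio from the question's formula, the sketch also reports the share of orders with no exchange out of all orders, since that is what the question describes in words. The guard for zero is there because $divide raises an error on a zero divisor.

```javascript
// Hypothetical helper: counts empty and non-empty arrays like the $group stage.
function exchangeStats(orders) {
  let empty = 0;
  let nonEmpty = 0;
  for (const order of orders) {
    if (order.exchange_order_products.length === 0) empty++;
    else nonEmpty++;
  }
  const total = empty + nonEmpty;
  return {
    // Formula from the question: exchanged * 100 / not exchanged.
    ratioPercent: empty === 0 ? null : (100 * nonEmpty) / empty,
    // Share of all orders that had no exchange.
    emptySharePercent: total === 0 ? null : (100 * empty) / total
  };
}

const orders = [
  { exchange_order_products: [] },
  { exchange_order_products: [] },
  { exchange_order_products: [] },
  { exchange_order_products: [] },
  { exchange_order_products: [{ sku: "hypothetical-item" }] }
];

const stats = exchangeStats(orders);
console.log(stats); // ratioPercent 25, emptySharePercent 80
```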

MongoDB sorting data fails

I'm trying to sort around 40k objects in MongoDB. I have two collections, one of comics and one of characters; each character has a field holding an array of the ids of the comics in which it appears. What I want is an aggregation pipeline that retrieves the comic with the strongest characters (the sum of the strength of each character). I can get the list of comics with the summed strength of their characters, but when I try to sort it, the database keeps waiting and everything ends in a timeout. What am I doing wrong?
Characters model:
{
_id: number,
name: string,
info: {
alignment: string // can be "good" or "bad"
},
stats: {
strength: number
},
comics: [] //array of numbers referencing the id of the comic
}
Comics model:
{
_id: number,
name: string
}
And here my query:
db.comics.aggregate(
{
$lookup: {
from: 'characters',
let: {
comic_id: '$_id',
},
as: 'total_comic_str',
pipeline: [
{
$match: {
$expr: {
$and: [
{$in: ['$$comic_id', '$comics']}, // the character is from this comic
{$eq: ['$info.alignment', 'good']} // the character is a hero
]
}
}
},
{
$group: { // group by comic id and accumulate strength of each hero
_id: '$$comic_id',
str: {
$sum: '$stats.strength'
}
}
}
]
}
},
{
$unwind: {
path: '$total_comic_str',
preserveNullAndEmptyArrays: false
}
},
{
$sort: {
'total_comic_str.str': -1
}
},
{
$limit: 1
}
)
You are facing a cursor timeout.
With a query cursor (such as the one returned by find()) you can call noCursorTimeout() to prevent the issue, although that is generally not good practice.
An aggregation returns a different cursor type, so there is no noCursorTimeout().
As a workaround, you can use the $out stage to store the aggregation result in a temporary collection and then work with the generated collection as you wish.
$lookup with a pipeline has been shown to have performance issues on large collections, so I would suggest using a plain $lookup without a pipeline. This should work for your particular dataset, which has a relatively large characters collection and presumably smaller comics arrays.
First, it's better to index what you are going to use in the $lookup, so you should add an index on the comics field for this to bring a meaningful improvement.
Since the looked-up characters will be an array of subdocuments, we are going to use $reduce instead of $group to calculate the total strength.
Your aggregation pipeline should look like this:
[
{
$lookup: {
from: "characters",
localField: "_id", // lookup with _id only we will filter out alignment later
foreignField: "comics",
as: "characters"
}
},
{
$project: {
name: true,
total_strength: {
$reduce: {
input: "$characters",
initialValue: 0,
in: {
$add: [
"$$value",
{
$cond: [
{ $eq: [ "$$this.info.alignment", "good"] }, // calculating only "good" character here
"$$this.stats.strength",
0
]
}
]
}
}
}
}
},
{
$sort: { total_strength: -1 }
},
{
$limit: 1
}
]
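To show what the pipeline above computes, here is a plain-JavaScript sketch over small in-memory arrays. It is not MongoDB code and says nothing about performance; strongestComic is a hypothetical name and all sample data is invented.

```javascript
// Hypothetical helper: for each comic, sum the strength of its "good"
// characters (the $reduce step), then keep the highest total
// (the $sort + $limit step).
function strongestComic(comics, characters) {
  let best = null;
  for (const comic of comics) {
    const total_strength = characters
      .filter(c => c.comics.includes(comic._id))   // like the $lookup join
      .reduce(
        (sum, c) => sum + (c.info.alignment === "good" ? c.stats.strength : 0),
        0
      );
    if (best === null || total_strength > best.total_strength) {
      best = { _id: comic._id, name: comic.name, total_strength };
    }
  }
  return best;
}

const comics = [
  { _id: 1, name: "Comic A" },
  { _id: 2, name: "Comic B" }
];
const characters = [
  { _id: 10, name: "Hero One", info: { alignment: "good" }, stats: { strength: 50 }, comics: [1, 2] },
  { _id: 11, name: "Hero Two", info: { alignment: "good" }, stats: { strength: 30 }, comics: [2] },
  { _id: 12, name: "Villain", info: { alignment: "bad" }, stats: { strength: 90 }, comics: [1] }
];

const strongest = strongestComic(comics, characters);
console.log(strongest); // Comic B with total_strength 80
```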
