Remove duplicates from a MongoDB 4.2 database

I am trying to remove duplicates from MongoDB, but all the solutions I have found fail.
My JSON structure:
{
"_id" : ObjectId("5d94ad15667591cf569e6aa4"),
"a" : "aaa",
"b" : "bbb",
"c" : "ccc",
"d" : "ddd",
"key" : "057cea2fc37aabd4a59462d3fd28c93b"
}
The key value is md5(a+b+c+d).
I already have a database with over 1 billion records. I want to remove all the duplicates according to key and then add a unique index, so that a record whose key is already in the database won't be inserted again.
I already tried
db.data.ensureIndex( { key:1 }, { unique:true, dropDups:true } )
But from what I understand, dropDups was removed in MongoDB 3.0 and later.
I also tried several JavaScript snippets, such as:
var duplicates = [];
db.data.aggregate([
{ $match: {
key: { "$ne": '' } // discard selection criteria
}},
{ $group: {
_id: { key: "$key"}, // can be grouped on multiple properties
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $match: {
count: { "$gt": 1 } // Duplicates considered as count greater than one
}}
],
{allowDiskUse: true} // Lets large stages write temporary files to disk instead of failing on the memory limit
).forEach(function(doc) {
doc.dups.shift(); // First element skipped for deleting
doc.dups.forEach( function(dupId){
duplicates.push(dupId); // Getting all duplicate ids
}
)
})
and it fails with:
QUERY [js] uncaught exception: Error: command failed: {
"ok" : 0,
"errmsg" : "assertion src/mongo/db/pipeline/value.cpp:1365",
"code" : 8,
"codeName" : "UnknownError"
} : aggregate failed
I haven't changed any MongoDB settings; I am working with the defaults.

This is my input collection dups, with some duplicate data (k with values 11 and 22):
{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
{ "_id" : 4, "k" : 44 }
{ "_id" : 5, "k" : 55 }
{ "_id" : 6, "k" : 66 }
{ "_id" : 7, "k" : 22 }
{ "_id" : 8, "k" : 88 }
{ "_id" : 9, "k" : 11 }
The following query returns a single document for each distinct value of k (it does not modify the collection):
db.dups.aggregate([
{ $group: {
_id: "$k",
dups: { "$addToSet": "$_id" },
count: { "$sum": 1 }
}},
{ $project: { k: "$_id", _id: { $arrayElemAt: [ "$dups", 0 ] } } }
] )
=>
{ "k" : 88, "_id" : 8 }
{ "k" : 22, "_id" : 7 }
{ "k" : 44, "_id" : 4 }
{ "k" : 55, "_id" : 5 }
{ "k" : 66, "_id" : 6 }
{ "k" : 11, "_id" : 9 }
As you can see, the following duplicate documents are left out of the result:
{ "_id" : 1, "k" : 11 }
{ "_id" : 2, "k" : 22 }
{ "_id" : 3, "k" : 11 }
Get the results in an array:
var arr = db.dups.aggregate([ ...] ).toArray()
arr now holds the array of documents:
[
{
"k" : 88,
"_id" : 8
},
{
"k" : 22,
"_id" : 7
},
{
"k" : 44,
"_id" : 4
},
{
"k" : 55,
"_id" : 5
},
{
"k" : 66,
"_id" : 6
},
{
"k" : 11,
"_id" : 9
}
]
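The aggregation above only reports one _id per key; nothing is deleted yet. A minimal sketch of the missing step, written in plain JavaScript so it runs without a server: walk the documents, keep the first one seen for each key, and collect the _id of every later one. The helper name idsToDelete and the in-memory array are assumptions for illustration. Note that it keeps the first document per key, whereas the output above happened to keep a different one.

```javascript
// Hypothetical helper: returns the _id of every document whose key value
// has already been seen, so that exactly one document per key survives.
function idsToDelete(docs, keyField) {
  const seen = new Set();
  const toDelete = [];
  for (const doc of docs) {
    const key = doc[keyField];
    if (seen.has(key)) {
      toDelete.push(doc._id); // a later duplicate of a key we already kept
    } else {
      seen.add(key); // first occurrence: keep this document
    }
  }
  return toDelete;
}

// Same sample data as the dups collection above.
const sampleDups = [
  { _id: 1, k: 11 }, { _id: 2, k: 22 }, { _id: 3, k: 11 },
  { _id: 4, k: 44 }, { _id: 5, k: 55 }, { _id: 6, k: 66 },
  { _id: 7, k: 22 }, { _id: 8, k: 88 }, { _id: 9, k: 11 },
];
console.log(idsToDelete(sampleDups, "k")); // [ 3, 7, 9 ]
```

In the shell, the collected ids could then be passed to something like db.dups.deleteMany({ _id: { $in: ids } }). For a collection with a billion records, deleting in batches is probably safer than building one huge array.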

Related

Update array at specific index by other field in MongoDB

I have a collection whose documents consist of name and data.
data is an array with 2 elements; each element is an object with code and qty.
{
"_id" : ObjectId("605c666a15d2612ed0afedd2"),
"name" : "Anna",
"data" : [
{
"code" : "a",
"qty" : 3
},
{
"code" : "b",
"qty" : 4
}
]
},
{
"_id" : ObjectId("605c666a15d2612ed0afedd3"),
"name" : "James",
"data" : [
{
"code" : "c",
"qty" : 5
},
{
"code" : "d",
"qty" : 6
}
]
}
I want to update the code of the first element to the name of its document. The result I want is:
{
"_id" : ObjectId("605c666a15d2612ed0afedd2"),
"name" : "Anna",
"data" : [
{
"code" : "Anna",
"qty" : 3
},
{
"code" : "b",
"qty" : 4
}
]
},
{
"_id" : ObjectId("605c666a15d2612ed0afedd3"),
"name" : "James",
"data" : [
{
"code" : "James",
"qty" : 5
},
{
"code" : "d",
"qty" : 6
}
]
}
I googled how to update an array at a specific index (https://stackoverflow.com/a/34177929/11738185):
db.Collection.updateMany(
{ },
{
$set:{
'data.0.code': '$name'
}
}
)
But then the code of the first element in the data array is the literal string '$name', not the value of the field (Anna, James):
{
"_id" : ObjectId("605c666a15d2612ed0afedd2"),
"name" : "Anna",
"data" : [
{
"code" : "$name",
"qty" : 3
},
{
"code" : "b",
"qty" : 4
}
]
},
{
"_id" : ObjectId("605c666a15d2612ed0afedd3"),
"name" : "James",
"data" : [
{
"code" : "$name",
"qty" : 5
},
{
"code" : "d",
"qty" : 6
}
]
}
I also looked up how to update a field with the value of another field. That led me to pipeline updates (https://stackoverflow.com/a/37280419/11738185), where the second parameter of updateMany is an array (a pipeline):
db.Collection.updateMany(
{ },
[{
$set:{
'data.0.code': '$name'
}
}]
)
and it adds a field named 0 to each element in the data array:
{
"_id" : ObjectId("605c666a15d2612ed0afedd2"),
"name" : "Anna",
"data" : [
{
"0" : {
"code" : "Anna"
},
"code" : "a",
"qty" : 3
},
{
"0" : {
"code" : "Anna"
},
"code" : "b",
"qty" : 4
}
]
},
{
"_id" : ObjectId("605c666a15d2612ed0afedd3"),
"name" : "James",
"data" : [
{
"0" : {
"code" : "James"
},
"code" : "c",
"qty" : 5
},
{
"0" : {
"code" : "James"
},
"code" : "d",
"qty" : 6
}
]
}
I can't find a solution for this case. Could anyone help me? How can I update an array at a fixed index with the value of another field? Thanks for reading!
1. Update array at a specific index
A regular update document cannot use the value of another field. It only works with an external value, for example { $set: { "data.0.code": "Anna" } }.
2. Update a field by the value of another field
An update with an aggregation pipeline does not treat data.0.code as an array index; it reads 0 as a field name.
You can try $reduce in an update with an aggregation pipeline:
$reduce iterates over the data array, starting with an empty array as initialValue. If the accumulated array is still empty, the current element is the first one, so its code is replaced with name by merging via $mergeObjects; otherwise the current element is returned unchanged.
$concatArrays appends the current element to the accumulated array.
db.collection.update({},
[{
$set: {
data: {
$reduce: {
input: "$data",
initialValue: [],
in: {
$concatArrays: [
"$$value",
[
{
$cond: [
{ $eq: [{ $size: "$$value" }, 0] },
{ $mergeObjects: ["$$this", { code: "$name" }] },
"$$this"
]
}
]
]
}
}
}
}
}],
{ multi: true }
)
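To make the $reduce logic easier to follow, here is a rough plain-JavaScript analogue that can be run outside MongoDB. It is only an illustration of what the pipeline computes, not server code; the function name setFirstCodeToName is an assumption.

```javascript
// Plain-JavaScript analogue of the $reduce pipeline: rebuild the array and
// override `code` with the document's `name` only for the first element.
function setFirstCodeToName(doc) {
  const data = doc.data.reduce((acc, item) => {
    // acc plays the role of $$value, item the role of $$this.
    const next = acc.length === 0 ? { ...item, code: doc.name } : item;
    return acc.concat([next]); // same idea as $concatArrays
  }, []);
  return { ...doc, data };
}

const anna = {
  name: "Anna",
  data: [{ code: "a", qty: 3 }, { code: "b", qty: 4 }],
};
console.log(JSON.stringify(setFirstCodeToName(anna).data));
```

The spread { ...item, code: doc.name } mirrors $mergeObjects: fields from the second object win, and all other fields of the element are kept.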
I think another way would be easier.
Just read the document first, then use its value in the update:
var annaModel = nameModel.findOne({_id: "605c666a15d2612ed0afedd2" })
nameModel.findOneAndUpdate({_id: "605c666a15d2612ed0afedd2"},{$set:{'data.0.code': annaModel.name}})

Need help in querying mongodb

I have a few documents with the structure shown below.
Each document includes an array of 'FileMeta' objects and each FileMeta object includes an array of 'StatusHistory' objects. I'm trying to get only the FileMetas that contain a StatusHistory entry with StatusCode equal to 4 and a TimeStamp greater than a certain datetime.
I tried the following query, but it only returns the first FileMeta element of each document.
db.getCollection('Collection').find({'ExternalParams.RequestingApplication':'aaa.bbb'},
{ "FileMeta": { $elemMatch: { "StatusHistory":{ $elemMatch:{ "StatusCode": 4, "TimeStamp": { $gt: ISODate("2020-06-28T11:02:26.542Z")} } } } }} )
What am I doing wrong?
Here is the document structure:
{
"_id" : ObjectId("5ef84e2ec08abf38b0043ab4"),
"FileMeta" : [
{
"StatusHistory" : [
{
"StatusCode" : 0,
"StatusDesc" : "New File",
"TimeStamp" : ISODate("2020-06-28T11:00:46.286Z")
},
{
"StatusCode" : 2,
"StatusDesc" : "stby",
"TimeStamp" : ISODate("2020-06-28T11:02:20.400Z")
},
{
"StatusCode" : 4,
"StatusDesc" : "success",
"TimeStamp" : ISODate("2020-06-28T11:02:26.937Z")
}
]
},
{
"StatusHistory" : [
{
"StatusCode" : 0,
"StatusDesc" : "New File",
"TimeStamp" : ISODate("2020-06-28T11:00:46.286Z")
},
{
"StatusCode" : 2,
"StatusDesc" : "stby",
"TimeStamp" : ISODate("2020-06-28T11:02:20.617Z")
},
{
"StatusCode" : 4,
"StatusDesc" : "success",
"TimeStamp" : ISODate("2020-06-28T11:02:26.542Z")
}
]
}
],
}
I want to return only the FileMeta objects that include a StatusHistory that match the following conditions: StatusCode = 4 and TimeStamp > SomeDateTime
Sorry for the delay, mate, I've been quite busy lately. Hope you already solved your problem. Anyway, I think that I found the solution.
The example in the documentation shows that a query using the $elemMatch operator returns the whole array when any element matches.
For instance, consider the following collection:
{ _id: 1, results: [ { product: "abc", score: 10 }, { product: "xyz", score: 5 } ] }
{ _id: 2, results: [ { product: "abc", score: 8 }, { product: "xyz", score: 7 } ] }
{ _id: 3, results: [ { product: "abc", score: 7 }, { product: "xyz", score: 8 } ] }
If you do the following query, for example:
db.survey.find(
{ results: { $elemMatch: { product: "xyz", score: { $gte: 8 } } } }
)
The output will be:
{ "_id" : 3, "results" : [ { "product" : "abc", "score" : 7 }, { "product" : "xyz", "score" : 8 } ] }
Not:
{ "_id" : 3, "results" : [{ "product" : "xyz", "score" : 8 }]}
That said, if you want to return only the elements of the array that match the specified query, one way is to use the db.collection.aggregate() function with the $unwind and $match operators.
The query below should give you what you want.
Query:
db.collection.aggregate([
{"$unwind" : "$FileMeta"},
{"$unwind" : "$FileMeta.StatusHistory"},
{
"$match" : {
"FileMeta.StatusHistory.StatusCode" : 4,
"FileMeta.StatusHistory.TimeStamp" : {"$gte" : ISODate("2020-06-28T11:02:26.937Z")}
}
}
]).pretty()
Result:
{
"_id" : ObjectId("5ef84e2ec08abf38b0043ab4"),
"FileMeta" : {
"StatusHistory" : {
"StatusCode" : 4,
"StatusDesc" : "success",
"TimeStamp" : ISODate("2020-06-28T11:02:26.937Z")
}
}
}
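Note that the $unwind approach returns one document per matching StatusHistory entry rather than whole FileMeta objects. As a sketch of the selection the question actually describes (keep a FileMeta if any of its StatusHistory entries matches), here is the same condition in plain JavaScript. The function name matchingFileMetas is an assumption; on the server, the aggregation $filter operator may be a closer fit, which is a suggestion and not part of the answer above.

```javascript
// Keep only the FileMeta objects that have at least one StatusHistory entry
// with the given status code and a timestamp strictly after `after`.
function matchingFileMetas(doc, statusCode, after) {
  return doc.FileMeta.filter((fileMeta) =>
    fileMeta.StatusHistory.some(
      (s) => s.StatusCode === statusCode && s.TimeStamp > after
    )
  );
}

// Trimmed version of the document from the question (Date instead of ISODate).
const sampleDoc = {
  FileMeta: [
    { StatusHistory: [
      { StatusCode: 2, TimeStamp: new Date("2020-06-28T11:02:20.400Z") },
      { StatusCode: 4, TimeStamp: new Date("2020-06-28T11:02:26.937Z") },
    ] },
    { StatusHistory: [
      { StatusCode: 2, TimeStamp: new Date("2020-06-28T11:02:20.617Z") },
      { StatusCode: 4, TimeStamp: new Date("2020-06-28T11:02:26.542Z") },
    ] },
  ],
};
const cutoff = new Date("2020-06-28T11:02:26.542Z");
console.log(matchingFileMetas(sampleDoc, 4, cutoff).length); // 1
```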
One last tip: consider changing your data model to something that looks like the unwound document, and remember that one document should be roughly equivalent to one row in a relational database. Avoid storing information that belongs on "several rows" in a single document.
Useful links:
The $elemMatch operator.
The $unwind operator.

Pushing objects on a specific multidimensional mongoDb collection

I'm fairly new to the MongoDB query language and I'm struggling with the following scenario.
We have a multidimensional dataset that is comprised of:
n users
n projects for each users
n time_entries for each project
What I am trying to achieve: I would like to push/update a time_entry of a specific project using collection.update.
Note: each pid should be unique for a user.
The collection structure I am using looks as follows:
{
"_id" : ObjectId("5d6e33987f8d7f00c063ceff"),
"date" : "2019-01-01",
"users" : [
{
"user_id" : 1,
"projects" : [
{
"pid" : 1,
"time_entries" : [
{
"duration" : 1,
"start" : "2019-08-29T09:54:56+00:00"
}
]
},
{
"pid" : 2,
"time_entries" : []
}
]
},
{
"user_id" : 2,
"projects" : [
{
"pid" : 3,
"time_entries" : []
}
]
}
]
}
I'm currently able to update all projects of a given user using:
"users.$.projects.$[].time_entries"
yet I'm not able to target a specific project, due to the fact the structure contains 2 nesting levels and using multiple $ positional operator is not yet permitted in MongoDb.
"users.$.projects.$.time_entries"
Below is my full query example:
db.times.update(
{ 'users' : { $elemMatch : { 'projects' : { $elemMatch : { 'pid' : 153446871 } } } } },
{ "$push":
{
"users.$.projects.$[].time_entries":
{
"duration" : 5,
"start" : "2019-08-29T09:54:56+00:00"
}
}
}
);
Are there other ways to achieve the same result?
Should I flatten the array so I only use 1 $ positional operator?
Are there other methods to push items on a multidimensional array?
Should this logic be handled on a code level and not a Database level?
You'll need to use the filtered positional operator $[<identifier>] together with arrayFilters to achieve that:
db.times.update(
{},
{
$push: {
"users.$[].projects.$[element].time_entries":{
"duration" : 5,
"start" : "2019-08-29T09:54:56+00:00"
}
}
},
{
arrayFilters: [{"element.pid":1}],
multi: true
}
)
This query will push data to the array time_entries for every pid = 1 it finds.
This will give you the result below:
{
"_id" : ObjectId("5d6e33987f8d7f00c063ceff"),
"date" : "2019-01-01",
"users" : [
{
"user_id" : 1,
"projects" : [
{
"pid" : 1,
"time_entries" : [
{
"duration" : 1,
"start" : "2019-08-29T09:54:56+00:00"
},
{
"duration" : 5.0,
"start" : "2019-08-29T09:54:56+00:00"
}
]
},
{
"pid" : 2,
"time_entries" : []
}
]
},
{
"user_id" : 2,
"projects" : [
{
"pid" : 3,
"time_entries" : []
}
]
}
]
}
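For readers who find the path "users.$[].projects.$[element].time_entries" hard to parse, the following plain-JavaScript sketch mimics what that update does to a single document. It is an illustration only; the function name pushTimeEntry is an assumption and the code mutates the object in memory rather than talking to MongoDB.

```javascript
// Mimics $push on "users.$[].projects.$[element].time_entries"
// with arrayFilters [{ "element.pid": pid }] for one document.
function pushTimeEntry(doc, pid, entry) {
  for (const user of doc.users) {            // users.$[]  -> every user
    for (const project of user.projects) {   // projects.$[element]
      if (project.pid === pid) {             // arrayFilters: element.pid
        project.time_entries.push(entry);
      }
    }
  }
  return doc;
}

const times = {
  date: "2019-01-01",
  users: [
    { user_id: 1, projects: [
      { pid: 1, time_entries: [{ duration: 1, start: "2019-08-29T09:54:56+00:00" }] },
      { pid: 2, time_entries: [] },
    ] },
    { user_id: 2, projects: [{ pid: 3, time_entries: [] }] },
  ],
};
pushTimeEntry(times, 1, { duration: 5, start: "2019-08-29T09:54:56+00:00" });
console.log(times.users[0].projects[0].time_entries.length); // 2
```

Because the filter is applied inside every user, a pid that appears under several users would receive the entry in each of them, which matches the "for every pid = 1 it finds" behaviour described above.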

Remove mongo specific nested documents in array for each document

{
"_id" : 123,
"a" : [
{
"b" : 1,
"bb" : 2
},
{
"c" : 2,
"cc" : 3
}
],
"ab" : [
{
"d" : 4,
"dd" : 5
},
{
"e" : 5,
"ee" : 6
}
]
}
I need to remove a specific nested document from an array in each matching document.
Based on the inputs _id: 123 and ab.d = 4, the output should look like this:
{
"_id" : 123,
"a" : [
{
"b" : 1,
"bb" : 2
},
{
"c" : 2,
"cc" : 3
}
],
"ab" : [
{
"e" : 5,
"ee" : 6
}
]
}
You are looking for an update with the $pull operator (https://docs.mongodb.com/manual/reference/operator/update/pull/).
In your case:
db.mycollection.update({"_id":123}, {$pull: {"ab":{"d":4}}})
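As a rough mental model of what $pull does with a condition like {"d": 4}, here is a plain-JavaScript sketch that removes every array element whose listed fields all equal the given values. The helper name pullMatching is an assumption, and it only covers simple equality conditions, not query operators.

```javascript
// Returns a copy of `array` without the elements that match every
// field/value pair in `condition` (simple equality only).
function pullMatching(array, condition) {
  return array.filter(
    (item) =>
      !Object.entries(condition).every(([field, value]) => item[field] === value)
  );
}

const ab = [{ d: 4, dd: 5 }, { e: 5, ee: 6 }];
console.log(JSON.stringify(pullMatching(ab, { d: 4 }))); // [{"e":5,"ee":6}]
```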

Mongo push objects

I want to push values into an object keyed by field names rather than into an array. I tried $push, but I lose the information about which field each value in the array came from.
My collection is :
/* 1 */
{
"_id" : ObjectId("57614a7bd75df17df3013903"),
"O":"aa",
"D":"bb",
"month":1,
"year":2015,
"freq":5
}
/* 2 */
{
"_id" : ObjectId("57614a7bd75df17df3013904"),
"O":"aa",
"D":"bb",
"month":2,
"year":2015,
"freq":5
}
/* 3 */
{
"_id" : ObjectId("57614a7bd75df17df3013905"),
"O":"aa",
"D":"bb",
"month":1,
"year":2016,
"freq":5
}
I want to store all freq corresponding to fields : O and D.
Here is my expected output:
{
"_id" : ...,
"O" : "aa",
"D" : "bb",
"freq" : {
"2015" : {
"1" : 5,
"2":5
},
"2016" : {
"1" : 5
}
}
}
I tried this :
db.collection.aggregate([
{
'$group':
{
_id:{"O":"$O","D":"$D","Y":"$year"},
"freq" :{$push: "$freq"}
}
},
{
'$group':
{
_id:{"O":"$O","D":"$D"},
"freq" :{$push: "$freq"}
}
}])
but I got an array without any information about year or month.
Thank you
You have used two $group in your query
Your First group query is enough to build the data which you are expecting.
If we are executing the first query
db.stackoverflow.aggregate([
{
'$group':
{
_id:{"O":"$O","D":"$D","Y":"$year"},
"freq" :{$push: "$freq"}
}
}]);
then the result is
{ "_id" : { "O" : "aa", "D" : "bb", "Y" : 2016 }, "freq" : [ 5 ] }
{ "_id" : { "O" : "aa", "D" : "bb", "Y" : 2015 }, "freq" : [ 5, 5 ] }
Now if you execute your second $group query
db.stackoverflow.aggregate([
{
'$group':
{
_id:{"O":"$O","D":"$D"},
"freq" :{$push: "$freq"}
}
}])
then the result is
{ "_id" : { "O" : "aa", "D" : "bb" }, "freq" : [ 5, 5, 5 ] }
Reason:
The second $group does not see the original fields: after the first $group, each document contains only _id and freq, so O, D and year have to be read from _id.
Solution:
Use $project available in the aggregation pipeline which passes along the documents with only the specified fields to the next stage in the aggregation pipeline. The specified fields can be existing fields from the input documents or newly computed fields.
https://docs.mongodb.com/manual/reference/operator/aggregation/project/
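Here is a query that keeps the year and month next to each freq value:
db.collection.aggregate([
{
// First pass: one group per (O, D, year), collecting month/freq pairs
'$group': {
_id: {
"O": "$O",
"D": "$D",
"year": "$year"
},
myArr: {
$push: {
month: "$month",
freq: "$freq"
}
}
}
},
{
// Second pass: one group per (O, D); the grouped values now live under _id
'$group': {
_id: {
"O": "$_id.O",
"D": "$_id.D"
},
myArr1: {
$push: {
year: "$_id.year",
freq: "$myArr"
}
}
}
}
],
{
allowDiskUse: true
})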
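The expected output in the question nests freq by year and then by month. As a rough sketch of that reshaping, here it is in plain JavaScript, for example applied client side to the raw documents. The function name nestFreq is an assumption, and this is one possible approach rather than the answer's method.

```javascript
// Groups rows by (O, D) and builds freq[year][month] = freq for each group.
function nestFreq(docs) {
  const byPair = new Map();
  for (const { O, D, year, month, freq } of docs) {
    const id = JSON.stringify([O, D]); // composite key for the (O, D) pair
    if (!byPair.has(id)) {
      byPair.set(id, { O, D, freq: {} });
    }
    const entry = byPair.get(id);
    if (!entry.freq[year]) {
      entry.freq[year] = {};
    }
    entry.freq[year][month] = freq; // numeric keys become strings, as in JSON
  }
  return [...byPair.values()];
}

const rows = [
  { O: "aa", D: "bb", month: 1, year: 2015, freq: 5 },
  { O: "aa", D: "bb", month: 2, year: 2015, freq: 5 },
  { O: "aa", D: "bb", month: 1, year: 2016, freq: 5 },
];
console.log(JSON.stringify(nestFreq(rows)));
```

If two rows share the same O, D, year and month, the later one overwrites the earlier; summing them instead would be a small change to the last assignment.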
