In MongoDB, I have a collection of documents with an array of records that I want to group by runs of the same tag, preserving the natural order:
{
"day": "2019-01-07",
"records": [
{
"tag": "ch",
"unixTime": ISODate("2019-01-07T09:06:56Z"),
"score": 1
},
{
"tag": "u",
"unixTime": ISODate("2019-01-07T09:07:06Z"),
"score": 0
},
{
"tag": "ou",
"unixTime": ISODate("2019-01-07T09:07:06Z"),
"score": 0
},
{
"tag": "u",
"unixTime": ISODate("2019-01-07T09:07:20Z"),
"score": 0
},
{
"tag": "u",
"unixTime": ISODate("2019-01-07T09:07:37Z"),
"score": 1
}
]
}
I want to group (and aggregate) the records by consecutive sequences of the same tag, NOT simply by grouping unique tags.
Desired output:
{
"day": "2019-01-07",
"records": [
{
"tag": "ch",
"unixTime": [ISODate("2019-01-07T09:06:56Z")],
"score": 1
"nbRecords": 1
},
{
"tag": "u",
"unixTime": [ISODate("2019-01-07T09:07:06Z")],
"score": 0,
"nbRecords":1
},
{
"tag": "ou",
"unixTime": [ISODate("2019-01-07T09:07:06Z")],
"score": 0
},
{
"tag": "u",
"unixTime: [ISODate("2019-01-07T09:07:20Z"),ISODate("2019-01-07T09:07:37Z")]
"score": 1
"nbRecords":2
}
]
}
Groupby
It seems that the $group aggregation stage in MongoDB groups by the unique values of the grouping key, regardless of the natural order of the array:
db.coll.aggregate(
[
{"$unwind":"$records"},
{"$group":
{
"_id":{
"tag":"$records.tag",
"day":"$day"
},
...
}
}
]
)
Returns
{
"day": "2019-01-07",
"records": [
{
"tag": "ch",
"unixTime": [ISODate("2019-01-07T09:06:56Z")],
"score": 1
"nbRecords": 1
},
{
"tag": "u",
"unixTime": [ISODate("2019-01-07T09:07:06Z"),ISODate("2019-01-07T09:07:20Z"),ISODate("2019-01-07T09:07:37Z")],
"score": 2,
"nbRecords":3
},
{
"tag": "ou",
"unixTime": [ISODate("2019-01-07T09:07:06Z")],
"score": 0
}
]
}
Map/reduce
As I'm currently using the pymongo driver, I implemented the solution back in Python
using itertools.groupby, which, as a generator, performs the grouping while respecting the natural order. But I ran into a server timeout (CursorNotFound error) because of the insane processing time.
Any idea how to use MongoDB's mapReduce directly
to perform the equivalent of itertools.groupby() in Python?
Help would be very appreciated: I'm using pymongo driver 3.8 and MongoDB 4.0.
Run through the array of records adding a new integer index that increments whenever the group-by target changes, then use the Mongo grouping operation on that index.
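A minimal pymongo sketch of that idea (untested; it assumes MongoDB >= 3.6 for $reduce and $mergeObjects, and the field names from the question). A $reduce pass stamps each record with a run index that only increments when the tag changes; $unwind/$group then aggregate per run, and a final $sort restores the run order, since $group guarantees none. A closing $group by day to rebuild the records array is omitted for brevity:

pipeline = [
    {"$addFields": {
        "runs": {
            "$reduce": {
                "input": "$records",
                "initialValue": {"out": [], "prev": None, "i": -1},
                "in": {
                    "$let": {
                        # bump the index only when the tag differs from the previous record's
                        "vars": {"j": {"$cond": [
                            {"$eq": ["$$this.tag", "$$value.prev"]},
                            "$$value.i",
                            {"$add": ["$$value.i", 1]},
                        ]}},
                        "in": {
                            "out": {"$concatArrays": [
                                "$$value.out",
                                [{"$mergeObjects": ["$$this", {"run": "$$j"}]}],
                            ]},
                            "prev": "$$this.tag",
                            "i": "$$j",
                        },
                    }
                },
            }
        }
    }},
    {"$unwind": "$runs.out"},
    {"$group": {
        "_id": {"day": "$day", "run": "$runs.out.run"},
        "tag": {"$first": "$runs.out.tag"},
        "unixTime": {"$push": "$runs.out.unixTime"},
        "score": {"$sum": "$runs.out.score"},
        "nbRecords": {"$sum": 1},
    }},
    {"$sort": {"_id.day": 1, "_id.run": 1}},
]

result = list(db.coll.aggregate(pipeline))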
With @Ale's recommendation, but without any concrete way to do that in MongoDB at the time, I switched back to a Python implementation that solves the CursorNotFound problem.
I imagine it could be done inside MongoDB, but this is working:
import itertools

for r in db.coll.find():
    session = []
    for tag, time_score in itertools.groupby(r["records"], key=lambda x: x["tag"]):
        time_score = list(time_score)
        session.append({
            "tag": tag,
            "start": time_score[0]["unixTime"],
            "end": time_score[-1]["unixTime"],
            "ca": sum(n["score"] for n in time_score),
            "nb_records": len(time_score),
        })
    # update_one replaces the deprecated Collection.update;
    # note the collection name "coll", matching the find() above.
    db.coll.update_one(
        {"_id": r["_id"]},
        {
            "$unset": {"records": ""},
            "$set": {"sessions": session},
        },
    )
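For reference, the CursorNotFound error seems to come from the cursor idling past the server's default 10-minute timeout while batches are being processed. Two pymongo knobs that can help (a sketch, not verified against this exact workload):

# Option 1: smaller batches, so each getMore round trip resets the idle timer.
cursor = db.coll.find(batch_size=100)

# Option 2: disable the server-side timeout (be sure to close the cursor).
cursor = db.coll.find(no_cursor_timeout=True)
try:
    for r in cursor:
        ...  # process as above
finally:
    cursor.close()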
Related
I have a document like this (this is the result after a few pipeline stages):
[
{
"_id": ObjectId("5e9d5785e4c8343bb2b455cc"),
"name": "Jenny Adams",
"report": [
{ "category":"Beauty", "status":"submitted", "submitted_on": [{"_id": "xyz", "timestamp":"2022-02-23T06:10:05.832+00:00"}, {"_id": "abc", "timestamp":"2021-03-23T06:10:05.832+00:00"}] },
{ "category":"Kitchen", "status":"submitted", "submitted_on": [{"_id": "mnp", "timestamp":"2022-05-08T06:10:06.432+00:00"}] }
]
},
{
"_id": ObjectId("5e9d5785e4c8343bb2b455db"),
"name": "Mathew Smith",
"report": [
{ "category":"Household", "status":"submitted", "submitted_on": [{"_id": "123", "timestamp":"2022-02-23T06:10:05.832+00:00"}, {"_id": "345", "timestamp":"2021-03-23T06:10:05.832+00:00"}] },
{ "category":"Garden", "status":"submitted", "submitted_on": [{"_id": "567", "timestamp":"2022-05-08T06:10:06.432+00:00"}] },
{ "category":"BakingNeeds", "status":"submitted", "submitted_on": [{"_id": "891", "timestamp":"2022-05-08T06:10:06.432+00:00"}] }
]
}
]
I have user input for the time period:
from - 2021-02-23T06:10:05.832+00:00
to - 2022-02-23T06:10:05.832+00:00
Now I want to filter the objects in report that lie within a certain time range: I only want to keep an object if its submitted_on[-1]["timestamp"] falls between the from and to timestamps.
I am struggling with accessing the timestamp because of the nesting.
I tried this
$project: {
"name": 1,
"report": {
"category": 1,
"status": 1,
"submitted_on": 1,
"timestamp": {
$arrayElemAt: ["$report.cataloger_submitted_on", -1]
}
}
}
But this gets the last object of the report array, {"_id": "bcd", "timestamp": "2022-05-08T06:10:06.432+00:00"}, for all the items inside the report. How can I select the last timestamp of each object?
You can replace that stage in your aggregation pipeline with two stages, $unwind and $addFields, in order to get what I think you want:
{
$unwind: "$report"
},
{
"$addFields": {
"timestamp": {
$arrayElemAt: [
"$report.submitted_on",
-1
]
}
}
},
The $unwind stage breaks the outer array into separate documents, since you want to perform an action on each one of them. See the playground here with your example. If you plan to continue the aggregation pipeline with more stages, you can probably skip the $addFields stage and include the condition inside your next $match stage.
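For instance, a sketch of that follow-up $match with the from/to values from the question, written pymongo-style (this assumes, as in your sample, that the timestamps are ISO-8601 strings with a fixed +00:00 offset, so plain string comparison orders correctly):

pipeline = [
    {"$unwind": "$report"},
    {"$addFields": {"timestamp": {"$arrayElemAt": ["$report.submitted_on", -1]}}},
    # keep only reports whose last submission falls inside the window;
    # "timestamp" now holds the last submitted_on object, hence the nested path
    {"$match": {"timestamp.timestamp": {
        "$gte": "2021-02-23T06:10:05.832+00:00",
        "$lte": "2022-02-23T06:10:05.832+00:00",
    }}},
]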
How can I get the data out of this array stored in a variant column in Snowflake? I don't care if it's a new table, a view, or a query. There is a second column of type varchar(256) that contains a unique ID.
If you can just help me read the "confirmed" data and the "editorIds" data, I can probably take it from there. Many thanks!
Output example would be
UniqueID ConfirmationID EditorID
u3kd9 xxxx-436a-a2d7 nupd
u3kd9 xxxx-436a-a2d7 9l34c
R3nDo xxxx-436a-a3e4 5rnj
yP48a xxxx-436a-a477 jTpz8
yP48a xxxx-436a-a477 nupd
[
{
"confirmed": {
"Confirmation": "Entry ID=xxxx-436a-a2d7-3525158332f0: Confirmed order submitted.",
"ConfirmationID": "xxxx-436a-a2d7-3525158332f0",
"ConfirmedOrders": 1,
"Received": "8/29/2019 4:31:11 PM Central Time"
},
"editorIds": [
"xxsJYgWDENLoX",
"JR9bWcGwbaymm3a8v",
"JxncJrdpeFJeWsTbT"
] ,
"id": "xxxxx5AvGgeSHy8Ms6Ytyc-1",
"messages": [],
"orderJson": {
"EntryID": "xxxxx5AvGgeSHy8Ms6Ytyc-1",
"Orders": [
{
"DropShipFlag": 1,
"FromAddressValue": 1,
"OrderAttributes": [
{
"AttributeUID": 548
},
{
"AttributeUID": 553
},
{
"AttributeUID": 2418
}
],
"OrderItems": [
{
"EditorId": "aC3f5HsJYgWDENLoX",
"ItemAssets": [
{
"AssetPath": "https://xxxx573043eac521.png",
"DP2NodeID": "10000",
"ImageHash": "000000000000000FFFFFFFFFFFFFFFFF",
"ImageRotation": 0,
"OffsetX": 50,
"OffsetY": 50,
"PrintedFileName": "aC3f5HsJYgWDENLoX-10000",
"X": 50,
"Y": 52.03909266409266,
"ZoomX": 100,
"ZoomY": 93.75
}
],
"ItemAttributes": [
{
"AttributeUID": 2105
},
{
"AttributeUID": 125
}
],
"ItemBookAttribute": null,
"ProductUID": 52,
"Quantity": 1
}
],
"SendNotificationEmailToAccount": true,
"SequenceNumber": 1,
"ShipToAddress": {
"Addr1": "Addr1",
"Addr2": "0",
"City": "City",
"Country": "US",
"Name": "Name",
"State": "ST",
"Zip": "00000"
}
}
]
},
"orderNumber": null,
"status": "order_placed",
"submitted": {
"Account": "350000",
"ConfirmationID": "xxxxx-436a-a2d7-3525158332f0",
"EntryID": "xxxxx-5AvGgeSHy8Ms6Ytyc-1",
"Key": "D83590AFF0CC0000B54B",
"NumberOfOrders": 1,
"Orders": [
{
"LineItems": [],
"Note": "",
"Products": [
{
"Price": "00.30",
"ProductDescription": "xxxxxint 8x10",
"Quantity": 1
},
{
"Price": "00.40",
"ProductDescription": "xxxxxut Black 8x10",
"Quantity": 1
},
{
"Price": "00.50",
"ProductDescription": "xxxxx"
},
{
"Price": "00.50",
"ProductDescription": "xxxscount",
"Quantity": 1
}
],
"SequenceNumber": "1",
"SubTotal": "00.70",
"Tax": "1.01",
"Total": "00.71"
}
],
"Received": "8/29/2019 4:31:10 PM Central Time"
},
"tracking": null,
"updatedOn": 1.598736670503000e+12
}
]
So, this is how I'd query that exact JSON assuming the data is in column var in table x:
SELECT x.var[0]:confirmed:ConfirmationID::varchar as ConfirmationID,
f.value::varchar as EditorID
FROM x,
LATERAL FLATTEN(input => var[0]:editorIds) f
;
Since your sample output doesn't match the JSON that you provided, I will assume that this is what you need.
Also, as a note, your JSON includes outer [ ], which indicates that the entire JSON string is inside an array; this is the reason for var[0] in my query. If you have multiple records inside that array, then you should remove the [0]. In general, you should exclude the outer brackets and instead load each record into the table separately. I wasn't sure whether you could make that change, so I just wanted to make note of it.
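For that multi-record case, a sketch that flattens the outer array instead of hard-coding var[0] (id_col is a placeholder for your varchar(256) unique-ID column; the real name comes from your table):

SELECT x.id_col                                   AS UniqueID,
       r.value:confirmed:ConfirmationID::varchar  AS ConfirmationID,
       e.value::varchar                           AS EditorID
FROM x,
     LATERAL FLATTEN(input => x.var) r,
     LATERAL FLATTEN(input => r.value:editorIds) e;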
I have been working on a Jolt transformation since yesterday and am unable to find a solution that achieves the following output.
I would like to insert "year", "month", "day" into each map (dictionary) of the "records" list.
Input
[{"root":
{"body":
{"year":2019,
"month":8,
"day":9,
"records":[
{"user":"a",
"item":"x",
"price":300,
"count":1},
{"user":"b",
"item":"y",
"price":100,
"count":3}]
}
}
}]
Desired Output
[{"user":"a",
"item":"x",
"price":300,
"count":1,
"year":2019,
"month":8,
"day":9},
{"user":"b",
"item":"y",
"price":100,
"count":3,
"year":2019,
"month":8,
"day":9}
]
I could insert the values of "year", "month", "day" outside of each map, as a list like {"record": [[2019,8,9], {map1}, {map2}]}, but this is not what I want.
I would appreciate any help or advice. Thank you in advance.
This should do what you want:
Solution (Explanation inline):
[
{
"operation": "shift",
"spec": {
"*": {
"root": {
"body": {
"records": {
//match records array
"*": {
//copy object user, item etc. to current position
"#": "[&1]",
//go up two levels and get year and place it in current array position
"#(2,year)": "[&1].year",
"#(2,month)": "[&1].month",
"#(2,day)": "[&1].day"
}
}
}
}
}
}
}
]
Output:
[
{
"user": "a",
"item": "x",
"price": 300,
"count": 1,
"year": 2019,
"month": 8,
"day": 9
},
{
"user": "b",
"item": "y",
"price": 100,
"count": 3,
"year": 2019,
"month": 8,
"day": 9
}
]
I am trying to filter a nested array using ?$filter in OData,
but it is not working properly:
the parent array gets filtered, but not the child one.
My Array
{
"value": [
{
"Id": 1,
"Country": "India",
"language": [
{
"Lid": 1,
"State": "telengana",
"Statuelanguage": "Telgu",
"Place to visit": [
"p3","p4"
]
},
{
"Lid": 2,
"State": "Delhi",
"Statuelanguage": "Hindi",
"Place to visit": [
"p5","p6"
]
},
{
"Lid": 3,
"State": "UP",
"Statuelanguage": "Hindi",
"Place to visit": [
"p7","p8"
]
}
]
}
]
}
Expected Response
{
"value": [
{
"Id": 1,
"Country": "India",
"language": [
{
"Lid": 1,
"State": "telengana",
"Statuelanguage": "Telgu",
"Place to visit": [
"p3","p4"
]
}
]
}
]
}
Filter query
?$filter=language/any(c: c/Lid eq 1)
But when I try to use the filter, it filters the parent, not the child: it returns all 3 children to me.
So it works as expected :)
The $filter parameter filters the collection that you're querying.
To filter an expanded/related collection (language in your case) you have to use the $expand filter feature:
...$expand=language($filter=Lid eq 1)
BUT: It is only possible in OData v4.
References: the Web API docs on $expand and the nested filter description.
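For example (a sketch; Countries is a hypothetical entity-set name), both filters can be combined so that non-matching parents are dropped and the language collection is trimmed at the same time:

GET /Countries?$filter=language/any(c: c/Lid eq 1)&$expand=language($filter=Lid eq 1)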
I've been looking at some StackOverflow cases, such as this one, but I cannot find an example with a document structure close to mine.
Below is an example of one document within my collection artistTags. All documents follow the same structure.
{
"_id": ObjectId("5500aaeaa7ef65c7460fa3d9"),
"toptags": {
"tag": [
{
"count": "100",
"name": "Hip-Hop"
},
{
"count": "97",
"name": "french rap"
},
...{
"count": "0",
"name": "seen live"
}
],
"#attr": {
"artist": "113"
}
}
}
1) How can I find() this document using the "artist" value (here "113")?
2) How can I retrieve all "artist" values having a specific "name" value (say "french rap")?
Referring to chridam's answer above:
db.collection.find({"toptags.#attr.artist": "113"})
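And for question 2), a pymongo sketch (untested; the collection name is taken from the question): match on the nested array field, then project out just the artist value.

# All "artist" values whose tag list contains a given name.
cursor = db.artistTags.find(
    {"toptags.tag.name": "french rap"},
    {"toptags.#attr.artist": 1, "_id": 0},
)
artists = [doc["toptags"]["#attr"]["artist"] for doc in cursor]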