Convert a dict of arrays of dicts to a DataFrame

Here is my time series; the timestamps may be either the same or different. I need to convert the data to a single DataFrame:
{
    'param1': [
        {'ts': 1669246574000, 'value': '6.06'},
        {'ts': 1669242973000, 'value': '6.5'},
    ],
    'param2': [
        {'ts': 1669246579000, 'value': '7'},
        {'ts': 1669242973000, 'value': '5'},
    ],
}
Update 1: the desired format of the DataFrame:
ts             param1  param2
1669246574000  6.06    1
1669242973000  6.5     2
1669246579000  7       3
1669242973000  5       4
Update 2: the timestamp (ts) should be the index:
ts             param1  param2
1669242973000  6.5     5
1669246574000  6.06    NaN
1669246579000  NaN     7
Update 3: my solution
data_frames = []
for key, values in data.items():
    df = pd.DataFrame(values).set_index('ts').rename(columns={'value': key})
    data_frames.append(df)
data_frame = pd.concat(data_frames, axis=1)

Try:
import pandas as pd
data = {
    "param1": [
        {"ts": 1669246574000, "value": "6.06"},
        {"ts": 1669242973000, "value": "6.5"},
    ],
    "param2": [
        {"ts": 1669246579000, "value": "7"},
        {"ts": 1669242973000, "value": "5"},
    ],
}
df = pd.concat([pd.DataFrame(v).assign(param=k) for k, v in data.items()])
print(df)
Prints:
              ts value   param
0  1669246574000  6.06  param1
1  1669242973000   6.5  param1
0  1669246579000     7  param2
1  1669242973000     5  param2
EDIT: With updated question:
import pandas as pd
from itertools import count
data = {
"param1": [
{"ts": 1669246574000, "value": "6.06"},
{"ts": 1669242973000, "value": "6.5"},
],
"param2": [
{"ts": 1669246579000, "value": "7"},
{"ts": 1669242973000, "value": "5"},
],
}
c = count(1)
df = pd.DataFrame(
    [
        {"ts": d["ts"], "param1": d["value"], "param2": next(c)}
        for v in data.values()
        for d in v
    ]
)
print(df)
Prints:
              ts param1  param2
0  1669246574000   6.06       1
1  1669242973000    6.5       2
2  1669246579000      7       3
3  1669242973000      5       4
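If the wide layout asked for in Update 2 (ts as the index, one column per parameter) is the real target, the long-format frame from the first snippet can be pivoted. A minimal sketch, assuming the same data dict defined above; row order and dtypes may differ slightly from the tables shown:
import pandas as pd

# Sketch only: build the long frame as in the first snippet, then pivot so
# that ts becomes the index and each parameter becomes its own column.
long_df = pd.concat(
    [pd.DataFrame(v).assign(param=k) for k, v in data.items()]
)
wide_df = long_df.pivot_table(
    index="ts", columns="param", values="value", aggfunc="first"
)
print(wide_df)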

Related

MuleSoft: Update the value in Array and replace the content based on condition

I have the payload below.
If the points are not divisible by 5, they should be rounded up to the nearest multiple of 5. So if the points are 250.2, the resulting whole number would be 250, which is divisible by 5,
so 250 would be the value returned.
If the value were 251.2, the resulting whole number would be 251, which is not divisible by 5, so it would be rounded up to 255.
{
"referenceID": "1001",
"Year": "2023",
"Type": "BK",
"contracts": [
{
"contractId": "1",
"contractType": "Booking",
"Points": "250.2",
"Reservations": {
"reservations": [
],
"reservationPoints": ""
}
},
{
"contractId": "1",
"contractType": "Booking",
"Points": "251.2",
"Reservations": {
"reservations": [
],
"reservationPoints": ""
}
}
]
}
Based on above conditions, output payload should be like below
{
"referenceID": "1001",
"Year": "2023",
"Type": "BK",
"contracts": [
{
"contractId": "1",
"contractType": "Booking",
"Points": "250",
"Reservations": {
"reservations": [
],
"reservationPoints": ""
}
},
{
"contractId": "1",
"contractType": "Booking",
"Points": "255",
"Reservations": {
"reservations": [
],
"reservationPoints": ""
}
}
]
}
To round Points up to the nearest number divisible by 5, I am using the logic below:
if ((payload.contracts[0].Points as Number mod 5) < 1)
    (round((payload.contracts[0].Points as Number) / 5) * 5)
else
    (ceil((payload.contracts[0].Points as Number) / 5) * 5)
This gets the updated value based on the condition, but I am not able to update the payload.
Use the update operator to update just the Points value in the payload object. Try the following:
%dw 2.0
output application/json
---
payload update {
    case c at .contracts -> c map ($ update {
        case p at .Points -> if ((p mod 5) < 1) round(p / 5) * 5
                             else ceil(p / 5) * 5
    })
}
Based on the above suggestion, I updated the payload like below:
%dw 2.0
output application/json
---
payload.contracts map ((item, index) -> item update {
    case Points at .Points -> if ((Points as Number mod 5) < 1) (round((Points as Number) / 5) * 5)
                              else (ceil((Points as Number) / 5) * 5)
})
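For a quick sanity check of the rounding rule itself, independent of DataWeave, the same arithmetic can be sketched in Python (round_points is a hypothetical helper name, not part of the Mule solution):
import math

def round_points(points: float) -> int:
    # If the remainder above a multiple of 5 is less than 1, round to the
    # nearest multiple of 5; otherwise round up to the next multiple of 5.
    if points % 5 < 1:
        return round(points / 5) * 5
    return math.ceil(points / 5) * 5

print(round_points(250.2))  # 250
print(round_points(251.2))  # 255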

How to extract data from a JSON key/value pair, if the key also has the actual value

I am trying to parse/flatten JSON data using the PySpark DataFrame API.
The challenge is that I need to extract one data element ('Id') from the actual key/attribute name, and also keep only the rows that have a 'statC' attribute. To begin with, I am trying to explode the JSON object, but I am getting an error. Is there a better way to extract such data?
JSON input file:
{
"changes": {
"1": [
{
"Name": "ABC-1",
"statC": {
"newValue": 10
},
"column": {
"notDone": true,
"newStatus": "10071"
}
}
],
"2": [
{
"Name": "ABC-2",
"added": true
}
],
"3": [
{
"Name": "ABC-3",
"column": {
"notDone": true,
"newStatus": "10071"
}
}
],
"4": [
{
"Name": "ABC-4",
"statC": {
"newValue": 40
}
}
],
"5": [
{
"Name": "ABC-5",
"statC": {
"newValue": 50
},
"column": {
"notDone": false,
"done": true,
"newStatus": "13685"
}
}
],
"6": [
{
"Name": "ABC-61",
"added": true
},
{
"Name": "ABC-62",
"statC": {
"oldValue": 60
}
}
],
"7": [
{
"Name": "ABC-70",
"added": true
},
{
"Name": "ABC-71",
"statC": {
"newValue": 70
}
},
{
"Name": "ABC-72",
"statC": {
"newValue": 75
}
}
]
},
"startTime": 1666188060000,
"endTime": 1667347140000,
"activatedTime": 1666188126953,
"now": 1667294686212
}
Desired output:
Id  Name    statC_NewValue
1   ABC-1   10
4   ABC-4   40
5   ABC-5   50
7   ABC-71  70
7   ABC-72  75
My PySpark code:
from pyspark.sql.functions import *
rawDF = spark.read.json([f"abfss://{pADLSContainer}@{pADLSGen2}.dfs.core.windows.net/{pADLSDirectory}/InputFile.json"], multiLine="true")
idDF = rawDF.select(explode("changes").alias("changes_json"))
Error:
AnalysisException: cannot resolve 'explode(changes)' due to data type mismatch: input to function explode should be array or map type, not struct.
There's quite a lot to it. First, create a list of the columns (1, 2, 3, etc.) that satisfy your conditions, select them as columns, modify the arrays, unpivot, and expand into columns.
from pyspark.sql import functions as F

df_raw = spark.read.json(["file.json"], multiLine=True)

# Flatten the top-level "changes" struct so each key ("1", "2", ...) becomes a column.
df = df_raw.select('changes.*')

# Inspect the element schema of each array column (via its first element) and
# keep only the columns whose elements contain a statC.newValue field.
df0 = df.select([F.col(c)[0].alias(c) for c in df.columns])
cols = [
    c for c in df0.columns
    if 'statC' in df0.schema[c].dataType.names
    and 'newValue' in df0.schema[c].dataType['statC'].dataType.names
]

# Reduce each array element to a struct of just Name and statC_NewValue.
df = df.select(
    [F.transform(c, lambda x: F.struct(
        x.Name.alias('Name'),
        x.statC.newValue.alias('statC_NewValue'))).alias(c)
     for c in cols]
)

# Unpivot the column names into an Id column, expand the structs into rows,
# and drop the rows without a statC_NewValue.
to_melt = [f"'{c}', `{c}`" for c in df.columns]
df = df.selectExpr(f"stack({len(to_melt)}, {','.join(to_melt)}) (Id, val)")
df = df.selectExpr("Id", "inline(val)")
df = df.filter('statC_NewValue is not null')
df.show()
# +---+------+--------------+
# | Id| Name|statC_NewValue|
# +---+------+--------------+
# | 1| ABC-1| 10|
# | 4| ABC-4| 40|
# | 5| ABC-5| 50|
# | 7|ABC-71| 70|
# | 7|ABC-72| 75|
# +---+------+--------------+
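If a numeric Id is preferred, note that the Id column produced by stack() above is a string; a small follow-up cast (a sketch, reusing the df and F names from the snippet above) would be:
df = df.withColumn("Id", F.col("Id").cast("int"))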

How to use jq to produce a cartesian product of two arrays present in the input JSON

I'd like to be able to use jq to output the 'product' of 2 arrays in the input JSON... for example, given the following input JSON:
{
"quantities": [
{
"product": "A",
"quantity": 30
},
{
"product": "B",
"quantity": 10
}
],
"portions": [
{
"customer": "C1",
"percentage": .6
},
{
"customer": "C2",
"percentage": .4
}
]
}
I'd like to produce the following output (or similar...):
[
{
"customer": "C1",
"quantities": [
{
"product": "A",
"quantity": 18
},
{
"product": "B",
"quantity": 6
}
]
},
{
"customer": "C2",
"quantities": [
{
"product": "A",
"quantity": 12
},
{
"product": "B",
"quantity": 4
}
]
}
]
So, in other words: for each portion, take its percentage and apply it to each product quantity. Given 2 quantities and 2 portions, this should yield 4 results; given 3 quantities and 2 portions, it should yield 6 results, and so on.
I've made some attempts using foreach filters, but to no avail...
I think this will do what you want.
[
  .quantities as $q
  | .portions[]
  | .percentage as $p
  | {
      customer,
      quantities: [
        $q[] | .quantity = .quantity * $p
      ]
    }
]
Since you indicated you want the Cartesian product, and that you only gave the sample output as being indicative of what you're looking for, it may be worth mentioning that one can obtain the Cartesian product very simply:
.portions[] + .quantities[]
This produces objects such as:
{
"product": "B",
"quantity": 10,
"customer": "C2",
"percentage": 0.4
}
You could then use reduce (or, less efficiently, group_by) to obtain the data in whatever form you really want.
For example, assuming .customer is always a string, we could transform
the input into the requested format as follows:
def add_by(f;g): reduce .[] as $x ({}; .[$x|f] += [$x|g]);
[.quantities[] + .portions[]]
| map( {customer, quantities: {product, quantity: (.quantity * .percentage)}} )
| add_by(.customer; .quantities)
| to_entries
| map( {customer: .key, quantities: .value })

How to return a row if a value is found in an array in Postgres

{
actName: null,
applicable: {
applicable: [ 5, 4, 1 ]
},
status: 1,
id: 2
}
{
actName: null,
applicable: {
applicable: [ 3, 2 ]
},
status: 1,
id: 1
}
Is it possible to find a value in the array? For example, if I search for the integer value 2 in the applicable array, it should return the single row with id 1.
with t(j) as (values
('{
"actName": null,
"applicable": {
"applicable": [ 5, 4, 1 ]
},
"status": 1,
"id": 2
}'::jsonb),
('{
"actName": null,
"applicable": {
"applicable": [ 3, 2 ]
},
"status": 1,
"id": 1
}')
)
select j ->> 'id' as id
from t
where exists (
select 1
from jsonb_array_elements_text(j -> 'applicable' -> 'applicable') s(i)
where i = '2'
)
;
id
----
1
With JSONB's @> containment operator you can query for any element by following the structure of your document, for example:
WITH data(d) AS (VALUES
('{
"actName": null,
"applicable": {
"applicable": [ 5, 4, 1 ]
},
"status": 1,
"id": 2
}'::JSONB),
('{
"actName": null,
"applicable": {
"applicable": [ 3, 2 ]
},
"status": 1,
"id": 1
}')
)
SELECT d ->> 'id' AS id
FROM data
WHERE d @> '{"applicable":{"applicable":[1]}}';
The assumption here is that you have a jsonb value containing an array of JSON documents in your database; the query may be the following:
WITH test_data AS (
SELECT '[{
"actName": "null",
"applicable": {
"applicable": [5,4,1]
},
"status":1,
"id":2
},
{
"actName": "null",
"applicable": {
"applicable": [3,2]
},
"status": 1,
"id": 1
}]'::JSONB AS jsonb_value
)
SELECT
jsonb_doc->>'id' AS id,
jsonb_doc->'applicable' #>'{applicable}' AS appl_array_result
FROM
test_data td,
jsonb_array_elements(td.jsonb_value) AS jsonb_doc
WHERE (jsonb_doc->'applicable' #>'{applicable}') @> '2'::JSONB;
Output:
id | appl_array_result
----+-------------------
1 | [3, 2]
(1 row)

Database design from a JSON file

I have a JSON file like this:
[
{
"topic": "Example1",
"ref": {
"1": "Example Topic",
"2": "Topic"
},
"contact": [
{
"ref": [
1
],
"corresponding": true,
"name": "XYZ"
},
{
"ref": [
1
],
"name": "ZXY"
},
{
"ref": [
1
],
"name": "ABC"
},
{
"ref": [
1,
2
],
"name":"BCA"
}
] ,
"type": "Presentation"
},
{
"topic": "Example2",
"ref": {
"1": "Example Topic",
"2": "Topic"
},
"contact": [
{
"ref": [
1
],
"corresponding": true,
"name": "XYZ"
},
{
"ref": [
1
],
"name": "ZXY"
},
{
"ref": [
1
],
"name": "ABC"
},
{
"ref": [
1,
2
],
"name":"BCA"
}
] ,
"type": "Poster"
}
]
I created 3 tables: Items, Reference, and Contact:
Items:
    Item_ID
    topic
    type
Reference:
    ref_ID
    content
Contact:
    ref_ID
    contact_ID
    Item_ID
    name
Relationships:
1) Items has many References
2) Items has many Authors
3) Authors have many References
Now, my questions are:
1) Am I doing anything wrong here?
2) Is there any way to improve my current implementation?
3) I am confused about how to implement corresponding (inside the contact array). How do I implement that in the design?
Thanks.
From your JSON above, what I could infer is the normalized schema below. You have two different "ref" fields in your JSON; could you clarify that?
Also, here is a useful link for you: http://jsonviewer.stack.hu/ (switch between the Viewer and Text tabs).
The actual example from your scenario is:
P   = Primary Key
Ref = Reference Key
Topic:
--------------------------------------------------
TopicID (P) | TopicName | TypeID (Ref)
----------------------------------------------------
0 Example1 0
1 Example2 1
TopicReferences :
----------------------------
TopicID (P) | ReferenceID (Ref)
--------------------------------
0 0
0 1
1 0
1 1
Reference :
------------------------------------
ReferenceID (P) | ReferenceName
------------------------------------
0 Example Topic
1 Topic
Presentation Type :
--------------------------
TypeID (P) | TypeName
--------------------------
0 Presentation
1 Poster
TopicContacts:
---------------------------------
TopicID | ContactID (Ref)
---------------------------------
0 0
0 1
0 2
0 3
1 0
1 1
1 2
1 3
Contact:
-------------------------------------------------------------------
ContactID(P) | ContactName | IsCorresponding ( Boolean, nullable)
------------------------------------------------------------------
0 XYZ YES
1 ZXY NULL
2 ABC NULL
3 BCA NULL
ContactsReference2:
--------------------------------------------
ContactID | Reference2ID (Ref)
--------------------------------------------
0 0
1 0
2 0
3 0
3 1
Reference2:
--------------------------------------------
Reference2ID(P) | Reference2Value (NUM)
--------------------------------------------
0 1
1 2
