I am writing BigQuery code to split a JSON data set into a more structured table.
The json_data_set table looks something like this:
Row | createdon | result
1 | 24022020 | {"searchResult": {"searchAccounts": [{"chainName": "xyxvjw", "address": {"name": "xyxvjw - ythji", "combined_city": "uptown", "combined_address": "1 downtown, uptown, 09728", "city": "uptown"}, "products": ["pin", "needle", "cloth"]},{"chainName": "pwiewhds", "address": {"name": "pwiewhds - oujsus", "combined_city": "over the river", "combined_address": "100 under bridge, over the river, 19920", "city": "over the river"}, "products": ["tape", "stapler"]}],"searchID": "3abci832832o0"}}
2 | 25020202 | {"searchResult": {"searchAccounts": [{"chainName": "xyxvjw2029", "address": {"name": "xyxvjw2029 - ythji", "combined_city": "uptown", "combined_address": "1 downtown, uptown, 09728", "city": "uptown"}, "products": ["pin", "needle", "cloth"]},{"chainName": "pwiewhds8972", "address": {"name": "pwiewhds8972 - oujsus", "combined_city": "over the river", "combined_address": "100 under bridge, over the river, 19920", "city": "over the river"}, "products": ["tape", "stapler"]}],"searchID": "3abci832832o0"}}
Each row's result column contains many more account entries like these.
I am able to unnest the data using the code below to get column data such as chain name and address. However, when I try to reference the broken-down field columns, I get the error: Cannot access field _field_1 on a value with type ARRAY<STRUCT<STRING, STRING>>
How can I separate the values extracted from the JSON into individual columns and rows, without being tied to the original JSON row column?
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  if (json !== null) {
    return JSON.parse(json).map(x => JSON.stringify(x));
  }
""";
SELECT * EXCEPT(chains),
  ARRAY(
    SELECT AS STRUCT
      JSON_EXTRACT_SCALAR(x, '$.chainName'),
      JSON_EXTRACT_SCALAR(x, '$.address.combined_address')
    FROM UNNEST(chains) x
    WHERE JSON_EXTRACT_SCALAR(x, '$.chainName') IS NOT NULL
  ) chain_names
FROM (
  SELECT *,
    json2array(
      JSON_EXTRACT(result, '$.searchResult.searchAccounts')
    ) chains
  FROM json_data_set
)
I just needed to write the query a different way to get individual columns. The first query left the extracted values in an array of anonymous structs (hence the _field_1 error when referencing them directly); unnesting that array and aliasing each extracted value as its own column avoids the problem:
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
  if (json !== null) {
    return JSON.parse(json).map(x => JSON.stringify(x));
  }
""";
WITH chain_name AS (
  SELECT
    *,
    json2array(
      JSON_EXTRACT(result, '$.searchResult.searchAccounts')
    ) chains
  FROM json_data_set
)
SELECT AS STRUCT
  JSON_EXTRACT_SCALAR(x, '$.chainName') chainName,
  JSON_EXTRACT_SCALAR(x, '$.address.combined_address') combined_address
FROM chain_name, UNNEST(chains) x
WHERE JSON_EXTRACT_SCALAR(x, '$.chainName') IS NOT NULL
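For reference, run against the two sample rows above, this should return one row per account, along the lines of:
chainName | combined_address
xyxvjw | 1 downtown, uptown, 09728
pwiewhds | 100 under bridge, over the river, 19920
xyxvjw2029 | 1 downtown, uptown, 09728
pwiewhds8972 | 100 under bridge, over the river, 19920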
I have a table 'animals' to store information
row 1 -> [{"type": "rabbit", "value": "carrot"}, {"type": "cat", "value": "milk"}]
row 2 -> [{"type": "cat", "value": "fish"}, {"type": "rabbit", "value": "leaves"}]
I need to query the value for type rabbit from all the rows.
I tried to use the #> operator with [{"type": "rabbit"}]:
select * from data where data #> '[{"type":"rabbit"}]';
but it doesn't work.
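One common way to write this kind of per-element filter is with jsonb_array_elements (a sketch, assuming a jsonb column named data on the animals table; adjust the names to the real schema):
-- Sketch: unpack each array element, keep the rabbit entries, return their value
select elem ->> 'value' as rabbit_value
from animals,
     jsonb_array_elements(data) as elem
where elem ->> 'type' = 'rabbit';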
I am trying to flatten a JSON file so it can be loaded into PostgreSQL, all in AWS Glue. I am using PySpark. Using a crawler, I crawl the S3 JSON and produce a table. I then use an ETL Glue script to:
read the crawled table
use the 'Relationalize' function to flatten the file
convert the dynamic frame to a dataframe
try to 'explode' the request.data field
Script so far:
from pyspark.sql.functions import col, explode

datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
df1 = df0.select(dfc_root_table_name)
df2 = df1.toDF()
df2 = df2.select(explode(col('`request.data`')).alias("request_data"))
<then I write df1 to a PostgreSQL database, which works fine>
Issues I face:
The 'Relationalize' function works well except for the request.data field, which becomes a bigint, so 'explode' doesn't work.
Explode cannot be done without using 'Relationalize' on the JSON first, due to the structure of the data. Specifically, the error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
If I try to make the dynamic frame a dataframe first then I get this issue: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc.
: java.lang.IllegalArgumentException: Can't get JDBC type for struct..."
I also tried to upload a classifier so that the data would be flattened during the crawl itself, but AWS confirmed this wouldn't work.
The JSON format of the original file, which I am trying to normalise, is as follows:
- field1
- field2
- {}
- field3
- {}
- field4
- field5
- []
- {}
- field6
- {}
- field7
- field8
- {}
- field9
- {}
- field10
# Flatten nested df
from pyspark.sql import functions as F

def flatten_df(nested_df):
    # Explode any array columns so each element gets its own row
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))
    # Collect the struct columns; if there are none left, we are done
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df
    # Pull each struct field up one level, renaming "struct.field" to "struct_field"
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    # Recurse until no array or struct columns remain
    return flatten_df(flat_df)

df = flatten_df(df)
It will replace all dots with underscores. Note that it uses explode_outer and not explode, so that null values are kept when the array itself is null. As stated in the original answer, explode_outer is available in Spark v2.4+ only.
Also remember that exploding arrays adds duplicate rows (the overall row count increases), while flattening structs adds columns. In short, your original df will grow both horizontally and vertically, which may slow down later processing.
Therefore my recommendation would be to identify the feature-related data, store only that in PostgreSQL, and keep the original JSON files in S3.
Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON, separated by periods.
Example:
Nested JSON:
{
"player": {
"username": "user1",
"characteristics": {
"race": "Human",
"class": "Warlock",
"subclass": "Dawnblade",
"power": 300,
"playercountry": "USA"
},
"arsenal": {
"kinetic": {
"name": "Sweet Business",
"type": "Auto Rifle",
"power": 300,
"element": "Kinetic"
},
"energy": {
"name": "MIDA Mini-Tool",
"type": "Submachine Gun",
"power": 300,
"element": "Solar"
},
"power": {
"name": "Play of the Game",
"type": "Grenade Launcher",
"power": 300,
"element": "Arc"
}
},
"armor": {
"head": "Eye of Another World",
"arms": "Philomath Gloves",
"chest": "Philomath Robes",
"leg": "Philomath Boots",
"classitem": "Philomath Bond"
},
"location": {
"map": "Titan",
"waypoint": "The Rig"
}
}
}
Flattened-out JSON after Relationalize:
{
"player.username": "user1",
"player.characteristics.race": "Human",
"player.characteristics.class": "Warlock",
"player.characteristics.subclass": "Dawnblade",
"player.characteristics.power": 300,
"player.characteristics.playercountry": "USA",
"player.arsenal.kinetic.name": "Sweet Business",
"player.arsenal.kinetic.type": "Auto Rifle",
"player.arsenal.kinetic.power": 300,
"player.arsenal.kinetic.element": "Kinetic",
"player.arsenal.energy.name": "MIDA Mini-Tool",
"player.arsenal.energy.type": "Submachine Gun",
"player.arsenal.energy.power": 300,
"player.arsenal.energy.element": "Solar",
"player.arsenal.power.name": "Play of the Game",
"player.arsenal.power.type": "Grenade Launcher",
"player.arsenal.power.power": 300,
"player.arsenal.power.element": "Arc",
"player.armor.head": "Eye of Another World",
"player.armor.arms": "Philomath Gloves",
"player.armor.chest": "Philomath Robes",
"player.armor.leg": "Philomath Boots",
"player.armor.classitem": "Philomath Bond",
"player.location.map": "Titan",
"player.location.waypoint": "The Rig"
}
Thus, in your case, request.data is already a new column flattened out from the request column, and its type is interpreted as bigint by Spark.
Reference: Simplify querying nested JSON with the AWS Glue Relationalize transform
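If it helps to see exactly what Relationalize produced before deciding what to write to PostgreSQL, here is a minimal sketch (reusing the variable names from the question's script; the exact keys depend on the data):
# Sketch: inspect the DynamicFrameCollection returned by Relationalize.
# Nested arrays are pivoted out into additional frames rather than kept inline.
dfc = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
print(dfc.keys())

# Pick the root frame and inspect its (now flat) schema before writing it out.
root_df = dfc.select(dfc_root_table_name).toDF()
root_df.printSchema()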
I'm trying to export a dataset to a JSON file. With PROC JSON every row in my dataset is exported nicely.
What I want to do is to add an array into each exported object with data from a specific column.
My dataset has structure like this:
data test;
input id $ amount $ dimension $;
datalines;
1 x A
1 x B
1 x C
2 y A
2 y X
3 z C
3 z K
3 z X
;
run;
proc json out='/MYPATH/jsontest.json' pretty nosastags;
export test;
run;
And the exported JSON object looks, obviously, like this:
[
{
"id": "1",
"amount": "x",
"dimension": "A"
},
{
"id": "1",
"amount": "x",
"dimension": "B"
},
{
"id": "1",
"amount": "x",
"dimension": "C"
},
...]
The result I want:
For each id I would like to insert all of the data from the dimension column into an array, so my output would look like this:
[
{
"id": "1",
"amount": "x",
"dimensions": [
"A",
"B",
"C"
]
},
{
"id": "2",
"amount": "y",
"dimensions": [
"A",
"X"
]
},
{
"id": "3",
"amount": "z",
"dimensions": [
"C",
"K",
"X"
]
}
]
I've not been able to find a scenario like this or some guidelines on how to solve my problem. I hope somebody can help.
/Crellee
There are other methods for JSON output, including:
a hand-coded emitter in a DATA step
the JSON package in PROC DS2
Here is an example of a hand-coded emitter for your data and desired mapping.
data _null_;
  file 'c:\temp\test.json';
  put '[';
  * one outer iteration per id/amount group;
  do group_counter = 1 by 1 while (not end_of_data);
    if group_counter > 1 then put @2 ',';
    put @2 '{';
    * inner loop reads every row of the current group;
    do dimension_counter = 1 by 1 until (last.amount);
      set test end=end_of_data;
      by id amount;
      if dimension_counter = 1 then do;
        q1 = quote(trim(id));
        q2 = quote(trim(amount));
        put
          @4 '"id":' q1 ","
        / @4 '"amount":' q2 ","
        ;
        put @4 '"dimensions":' / @4 '[';
      end;
      else do;
        put @6 ',' @;
      end;
      q3 = quote(trim(dimension));
      put @8 q3;
    end;
    * close the dimensions array, then the object for this group;
    put @4 ']';
    put @2 '}';
  end;
  put ']';
  stop;
run;
Such an emitter can be macro-ized and generalized to handle specifications of data=, by= and arrayify=. Not a path recommended for friends.
You can try concatenating / grouping the text before calling PROC JSON.
I don't have PROC JSON in my SAS environment, but try this step and see if it works for you:
data want;
  set test (rename=(dimension=old_dimension));
  length dimension $200;
  retain dimension;
  by id amount notsorted;
  if first.amount = 1 then do; dimension = ''; end;
  if last.amount = 1 then do; dimension = catx(',', dimension, old_dimension); output; end;
  else do; dimension = catx(',', dimension, old_dimension); end;
  drop old_dimension;
run;
Output:
id=1 amount=x dimension=A,B,C
id=2 amount=y dimension=A,X
id=3 amount=z dimension=C,K,X
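If PROC JSON is available, the grouped dataset can then be exported with the same call as in the question (a sketch; note that dimension comes out as a single comma-separated string, not a true JSON array):
/* Sketch: export the grouped dataset with the PROC JSON options from the question. */
proc json out='/MYPATH/jsontest_grouped.json' pretty nosastags;
   export want;
run;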
My json looks like this:
[
{
"blocked": 1,
"object": {
"ip": "abc",
"src_ip": "abc",
"lan_initiated": true,
"detection": "abc",
"src_port": ,
"src_mac": "abc",
"dst_mac": "abc",
"dst_ip": "abc",
"dst_port": "abc"
},
"object_type": "url",
"threat": "",
"threat_type": "abc",
"device_id": "abc",
"app_id": "abc",
"user_id": "abc",
"timestamp": 1520268249657,
"date": {
"$date": "Mon Mar 05 2018 16:44:09 GMT+0000 (UTC)"
},
"expire": {
"$date": "Fri May 04 2018 16:44:09 GMT+0000 (UTC)"
},
"_id": "abc"
}
]
I have tried:
CREATE EXTERNAL TABLE `table_name`(
reports array<struct<
user_id: string ,
device_id: string ,
app_id: string ,
blocked: string ,
object: struct<ip:string,src_ip:string,lan_initiated:string,detection:string,src_port:string,src_mac:string,dst_mac:string,dstp_ip:string,dst_port:string> ,
object_type: string ,
threat: string ,
threat_type: string ,
servertime:string,
date_t: struct<dat:string>,
expire: struct<dat:string>,
id: string >>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json'='false','mapping.dat'='$date', 'mapping.servertime'='timestamp','mapping.date'='date_t','mapping._id'='id')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'abc'
and after that
SELECT * FROM table_name
LATERAL VIEW outer explode(reports) exploded_table as rep;
but I get: Vertex did not succeed due to OWN_TASK_FAILURE - killed/failed due to:null.
I have read that because the JSON starts with '[' it cannot be parsed. Any ideas? Does the structure of the JSON need to be changed?
I believe you have made a mistake in specifying the location.
In the code, you have mentioned:
LOCATION
'abc'
LOCATION is expected to be a folder in which your JSON file should be present. You can have any name for the JSON file.
You also need to make sure that the JSON SerDe jar is on the classpath. If not, use the command below before trying to create the table:
hive> add jar /path/to/<json-serde-jar>;
I just started to play with Angular.js and have a question about ngOptions: Is it possible to label the optgroup?
Let's assume two objects - cars and garages.
cars = [
{"id": 1, "name": "Diablo", "color": "red", "garageId": 1},
{"id": 2, "name": "Countach", "color": "white", "garageId": 1},
{"id": 3, "name": "Clio", "color": "silver", "garageId": 2},
...
]
garages = [
{"id": 1, "name": "Super Garage Deluxe"},
{"id": 2, "name": "Toms Eastside"},
...
]
With this code I got nearly the result I want:
ng-options = "car.id as car.name + ' (' + car.color + ')' group by car.garageId for car in cars"
Result in the select:
-----------------
1
Diablo (red)
Countach (white)
Firebird (red)
2
Clio (silver)
Golf (black)
3
Hummer (silver)
-----------------
But I want to label the optgroups like "Garage 1", "Garage 2", ... or, even better, display the name of the garage and not just the garageId.
The angularjs.org documentation for select says nothing about labels for the optgroup, but I would like to extend the group by part of ngOptions, like group by car.garageId as 'Garage ' + car.garageId or group by car.garageId as getGarageName(car.garageId) - which sadly is not working.
My only solution so far is to add a new property "garageDisplayName" to the car objects, store the id + garage name there, and use that as the group by parameter. But I don't want to update all the cars whenever a garage name is changed.
Is there a way to label the optgroups with ngOptions, or should I use ngRepeat in that case?
You can just call getGarageName() in the group by without using an as...
ng-options="car.id as car.name + ' (' + car.color + ')' group by getGarageName(car.garageId) for car in cars"
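For completeness, a minimal sketch of such a lookup on the controller scope (hypothetical code; it assumes the garages array from the question is exposed on the scope as $scope.garages):
// Hypothetical helper: look up a garage's name by id in the garages array.
$scope.garages = garages;
$scope.getGarageName = function (garageId) {
  for (var i = 0; i < $scope.garages.length; i++) {
    if ($scope.garages[i].id === garageId) {
      return $scope.garages[i].name;
    }
  }
  return 'Garage ' + garageId; // fallback if the id is unknown
};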
Instead of storing the garage id in each car, you might want to consider storing a reference to the corresponding garage object from the garages array. That way you can change the garage name and there is no need to change each car. And the group by simply becomes...
group by car.garage.name