I am trying to parse an API response which is JSON. The JSON looks like this:
{
    'id': 112,
    'name': 'stalin-PC',
    'type': 'IP4Address',
    'properties': 'address=10.0.1.110|ipLong=277893412|state=DHCP Allocated|macAddress=41-1z-y4-23-dd-98|'
}
Its length is 1200, so if I convert it I should get 1200 rows. My goal is to parse this JSON like below:
id name type address iplong state macAddress
112 stalin-PC IP4Address 10.0.1.110 277893412 DHCP Allocated 41-1z-y4-23-dd-98
I am getting the first 3 elements, but I'm having an issue with the "properties" key, whose value is pipe-delimited. I have tried the code below:
for network in networks:  # here networks = response.json()
    network_id = network['id']
    network_name = network['name']
    network_type = network['type']
    print(network_id, network_name, network_type)
It works fine and gives me this result:
112 stalin-PC IP4Address
But when I tried to parse the properties key with the code below, it's not working.
for network in networks:
    network_id = network['id']
    network_name = network['name']
    network_type = network['type']
    for line in network['properties']:
        properties_value = line.split('|')
        network_address = properties_value[0]
    print(network_id, network_name, network_type, network_address)
How can I parse the pipe-delimited properties key? Would anyone help me, please?
Thank you
Using str methods
Ex:
network = {
    'id': 112,
    'name': 'stalin-PC',
    'type': 'IP4Address',
    'properties': 'address=10.0.1.110|ipLong=277893412|state=DHCP Allocated|macAddress=41-1z-y4-23-dd-98'
}

for n in network['properties'].split("|"):
    key, value = n.split("=")
    print(key, "-->", value)
Output:
address --> 10.0.1.110
ipLong --> 277893412
state --> DHCP Allocated
macAddress --> 41-1z-y4-23-dd-98
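To build the full row from the question, the same split can feed a dict inside the original loop. A minimal sketch (note that the question's properties string ends with a trailing '|', which produces an empty chunk that has to be skipped):

for network in networks:  # networks = response.json()
    # Split the pipe-delimited properties into a dict; skip the empty chunk
    # left by the trailing '|', and split on the first '=' only
    props = dict(p.split("=", 1) for p in network['properties'].split("|") if p)
    print(network['id'], network['name'], network['type'],
          props.get('address'), props.get('ipLong'),
          props.get('state'), props.get('macAddress'))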
I need help migrating a Postgres function to a Snowflake function.
Currently, I have the following function that takes an ip_address and returns the starting range of the ip_address:
...
begin
    if (p_ip_address is not null) then
        return p_ip_address::inet - '0.0.0.0';
    else
        return null;
    end if;
end
...
I know Snowflake has PARSE_IP, which returns a JSON object, but I need just one piece of that JSON (ipv4_range_start).
For Example:
Input = 192.168.242.188
Output = 3232297472
If you only want one piece of information, select it.
select parse_ip('127.0.0.0/24','INET') as a
,a:ipv4_range_end;
gives:
A A:IPV4_RANGE_END
{ "family": 4, "host": "127.0.0.0", "ip_fields": [ 2130706432, 0, 0, 0 ], "ip_type": "inet", "ipv4": 2130706432, "ipv4_range_end": 2130706687, "ipv4_range_start": 2130706432, "netmask_prefix_length": 24, "snowflake$type": "ip_address" } 2130706687
I am trying to flatten a JSON file in AWS Glue so that I can load it into PostgreSQL. I am using PySpark. Using a crawler, I crawl the S3 JSON and produce a table. I then use a Glue ETL script to:
read the crawled table
use the 'Relationalize' function to flatten the file
convert the dynamic frame to a dataframe
try to 'explode' the request.data field
Script so far:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
df1 = df0.select(dfc_root_table_name)
df2 = df1.toDF()
df2 = df2.select(explode(col('`request.data`')).alias("request_data"))
<then I write df1 to a PostgreSQL database, which works fine>
Issues I face:
The 'Relationalize' function works well except for the request.data field, which becomes a bigint, so 'explode' doesn't work.
Explode cannot be done without using 'Relationalize' on the JSON first due to the structure of the data. Specifically the error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
If I try to make the dynamic frame a dataframe first then I get this issue: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc.
: java.lang.IllegalArgumentException: Can't get JDBC type for struct..."
I also tried to upload a classifier so that the data would be flattened during the crawl itself, but AWS confirmed this wouldn't work.
The JSON format of the original file that I am trying to normalise is as follows:
- field1
- field2
- {}
- field3
- {}
- field4
- field5
- []
- {}
- field6
- {}
- field7
- field8
- {}
- field9
- {}
- field10
from pyspark.sql import functions as F

# Flatten nested df
def flatten_df(nested_df):
    # Explode array columns first; repeated passes so arrays of arrays also get exploded
    for _ in nested_df.columns:
        array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
        for col_name in array_cols:
            nested_df = nested_df.withColumn(col_name, F.explode_outer(nested_df[col_name]))

    # Then flatten struct columns, joining nested field names with '_'
    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df

    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)

df = flatten_df(df)
It will replace all dots with underscores. Note that it uses explode_outer rather than explode so that null values are kept when the array itself is null; explode_outer is only available in Spark 2.4+.
Also remember that exploding an array adds duplicate rows (the row count grows), while flattening a struct adds columns. In short, your original df will grow both horizontally and vertically, which may slow down later processing.
Therefore my recommendation would be to identify the feature-related data, store only that in PostgreSQL, and keep the original JSON files in S3.
Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON, separated by periods.
Example:
Nested JSON:
{
    "player": {
        "username": "user1",
        "characteristics": {
            "race": "Human",
            "class": "Warlock",
            "subclass": "Dawnblade",
            "power": 300,
            "playercountry": "USA"
        },
        "arsenal": {
            "kinetic": {
                "name": "Sweet Business",
                "type": "Auto Rifle",
                "power": 300,
                "element": "Kinetic"
            },
            "energy": {
                "name": "MIDA Mini-Tool",
                "type": "Submachine Gun",
                "power": 300,
                "element": "Solar"
            },
            "power": {
                "name": "Play of the Game",
                "type": "Grenade Launcher",
                "power": 300,
                "element": "Arc"
            }
        },
        "armor": {
            "head": "Eye of Another World",
            "arms": "Philomath Gloves",
            "chest": "Philomath Robes",
            "leg": "Philomath Boots",
            "classitem": "Philomath Bond"
        },
        "location": {
            "map": "Titan",
            "waypoint": "The Rig"
        }
    }
}
Flattened JSON after Relationalize:
{
    "player.username": "user1",
    "player.characteristics.race": "Human",
    "player.characteristics.class": "Warlock",
    "player.characteristics.subclass": "Dawnblade",
    "player.characteristics.power": 300,
    "player.characteristics.playercountry": "USA",
    "player.arsenal.kinetic.name": "Sweet Business",
    "player.arsenal.kinetic.type": "Auto Rifle",
    "player.arsenal.kinetic.power": 300,
    "player.arsenal.kinetic.element": "Kinetic",
    "player.arsenal.energy.name": "MIDA Mini-Tool",
    "player.arsenal.energy.type": "Submachine Gun",
    "player.arsenal.energy.power": 300,
    "player.arsenal.energy.element": "Solar",
    "player.arsenal.power.name": "Play of the Game",
    "player.arsenal.power.type": "Grenade Launcher",
    "player.arsenal.power.power": 300,
    "player.arsenal.power.element": "Arc",
    "player.armor.head": "Eye of Another World",
    "player.armor.arms": "Philomath Gloves",
    "player.armor.chest": "Philomath Robes",
    "player.armor.leg": "Philomath Boots",
    "player.armor.classitem": "Philomath Bond",
    "player.location.map": "Titan",
    "player.location.waypoint": "The Rig"
}
Thus in your case, request.data is already a new column flattened out from the request column, and its type is interpreted as bigint by Spark.
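So instead of exploding, the relationalized column can be selected directly. A minimal sketch against the script above (the dotted name must be escaped with backticks; adjust the column names to your schema):

from pyspark.sql import functions as F

# After Relationalize, nested keys become ordinary (dotted) column names on the root table
flat_df = df0.select(dfc_root_table_name).toDF()
flat_df.printSchema()                             # shows columns such as request.data
flat_df.select(F.col("`request.data`")).show(5)   # backticks escape the dot in the column name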
Reference: Simplify querying nested JSON with the AWS Glue Relationalize transform
I have a jsonlines file that looks like this:
{"id":123,"source":"this is a text string"}
{"id":456,"source":"this is another text string"}
{"id":789,"source":"yet another string"}
When I run a BlazingText Batch Transform job on a file that contains just the source, it works. But when I try to join the inputs and outputs, I get Customer Error: Unable to decode payload: Incorrect data format. (caused by AttributeError).
Any suggestions?
Code:
bt_transformer = bt_model.transformer(
    instance_count = 1,
    instance_type = "ml.m4.xlarge",
    assemble_with = "Line",
    output_path = s3_batch_out_data,
    accept = "application/jsonlines"
)
bt_transformer.transform(
    s3_batch_in_data,
    content_type = "application/jsonlines",
    split_type = "Line",
    input_filter = "$.source",
    join_source = "Input",
    output_filter = "$['id', 'SageMakerOutput']"
)
bt_transformer.wait()
When applying "$.source" on {"id":123,"source":"this is a text string"}, the output is "this is a text string" instead of {"source":"this is a text string"} which is probably why you got a format error. I'm wondering why you need to do such filtering on a JSON input - doesn't the algorithm ignore unrecognized JSON fields automatically?
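To make the difference concrete, here is a small plain-Python sketch (not SageMaker code) of the two payload shapes:

import json

record = {"id": 123, "source": "this is a text string"}

# What input_filter="$.source" selects: a bare JSON string per line
filtered = json.dumps(record["source"])               # '"this is a text string"'

# What the container presumably expects: a JSON object per line
expected = json.dumps({"source": record["source"]})   # '{"source": "this is a text string"}'

print(filtered)
print(expected)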
I am unable to parse a JSON array from a text file due to errors and my limited knowledge of JSON.
The file looks something like this:
[{"random":"fdjsf","random56":128,"name":"dsfjsd", "rid":1243,"rand":674,"name":"dsfjsd","random43":722, "rid":126},{"random":"fdfgfgjsf","random506":120,"name":"dsfjcvcsd", "rid":12403,"rando":670,"name":"dsfooojsd","random4003":720, "rid":120}]
It has more than one object ({}) in the entire array, but I did not want to include all 600. The layout shown above is basically how all of them look.
r = s.get(getAPI, headers=header, verify=False)
f = open('text.txt', 'w+')
f.write(r.text)
f.close()

output_file = open('text.txt', 'r')
json_array = json.load(output_file)
json_list = []
for item in json_array:
    name = "name"
    rid = "rid"
    json_items = {name: None, rid: None}
    json_items = [name] = item[name]
    json_items = [rid] = item[rid]
    json_list.append(json_items)
print(json_list)
I would like to loop through an array and find any time it says "name":... eventually followed by "rid":... and store those in a dictionary as key value pairs.
Errors:
ValueError: too many values to unpack (expected 1)
The error comes from the way you assign values to json_items; change those two lines to:
json_items[name] = item[name]
json_items[rid] = item[rid]
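With that fix applied, the whole loop becomes (a minimal sketch, keeping the rest of the original code):

import json

with open('text.txt', 'r') as output_file:
    json_array = json.load(output_file)

json_list = []
for item in json_array:
    json_items = {'name': None, 'rid': None}
    json_items['name'] = item['name']
    json_items['rid'] = item['rid']
    json_list.append(json_items)

print(json_list)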
I am trying to create a variable as a Python dictionary and post this variable to a server in JSON format.
The server isn't accepting a dict containing a list.
This is the code:
b9RuleData = {'hash': 'fea66a01c8091683539597d529557b8f2f21270e', 'fileState': 3, 'policyIds' : '11' }
b9RuleData2 = {'hash':'ff9586097b762b7d534cad008fc5f0382b5dcff8', 'fileState': 3, 'policyIds' : '12'}
b9RuleData3 = {'hash':'fea66a01c8091683539597d529557b8f2f21270e', 'fileState': 3, 'policyIds' : '13'}
So, what's the best way to write this? Let's assume that I have 100 file hashes. I can't keep adding each individual file hash.
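One option, sketched below under the assumption that the hashes and policy IDs are available as plain Python lists (the variable names and endpoint are hypothetical), is to build each payload dict in a loop and post them one at a time, since the server rejects a dict containing a list:

import requests

# Hypothetical inputs: the 100 hashes and their policy IDs as parallel lists
file_hashes = ['fea66a01c8091683539597d529557b8f2f21270e',
               'ff9586097b762b7d534cad008fc5f0382b5dcff8']
policy_ids = ['11', '12']

for file_hash, policy_id in zip(file_hashes, policy_ids):
    payload = {'hash': file_hash, 'fileState': 3, 'policyIds': policy_id}
    # Post each rule individually (hypothetical URL), since the server
    # does not accept a list of these dicts in one request
    requests.post('https://example.com/api/rules', json=payload)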