Formatting a JSON file into CSV using jq - arrays

I have some data in a file called myfile.json that I need to format using jq. In JSON it looks like this:
{
"result": [
{
"service": "ebsvolume",
"name": "gtest",
"resourceIdentifier": "vol-999999999999",
"accountName": "g-test-acct",
"vendorAccountId": "12345678912",
"availabilityZone": "ap-southeast-2c",
"region": "ap-southeast-2",
"effectiveHourly": 998.56,
"totalSpend": 167.7,
"idle": 0,
"lastSeen": "2018-08-16T22:00:00Z",
"volumeType": "io1",
"state": "in-use",
"volumeSize": 180,
"iops": 2000,
"throughput": 500,
"lastAttachedTime": "2018-08-08T22:00:00Z",
"lastAttachedId": "i-086f957ee",
"recommendations": [
{
"action": "Rightsize",
"preferenceOrder": 2,
"risk": 0,
"savingsPct": 91,
"savings": 189.05,
"volumeType": "gp2",
"volumeSize": 120,
},
{
"action": "Rightsize",
"preferenceOrder": 4,
"risk": 0,
"savingsPct": 97,
"savings": 166.23,
"volumeType": "gp2",
"volumeSize": 167,
},
{
"action": "Rightsize",
"preferenceOrder": 6,
"risk": 0,
"savingsPct": 91,
"savings": 111.77,
"volumeType": "gp2",
"volumeSize": 169,
}
]
}
]
}
I have it formatted better with the following:
jq '.result[] | [.service,.name,.resourceIdentifier,.accountName,.vendorAccountId,.availabilityZone,.region,.effectiveHourly,.totalSpend,.idle,.lastSeen,.volumeType,.state,.volumeSize,.iops,.throughput,.lastAttachedTime,.lastAttachedId] | @csv' ./myfile.json
This nets the following output:
"\"ebsvolume\",\"gtest\",\"vol-999999999999\",\"g-test-acct\",\"12345678912\",\"ap-southeast-2c\",\"ap-southeast-2\",998.56,167.7,0,\"2018-08-16T22:00:00Z\",\"io1\",\"in-use\",180,2000,500,\"2018-08-08T22:00:00Z\",\"i-086f957ee\""
I figured this out, but it's not exactly what I am trying to achieve. I want to have each recommendation listed underneath on a separate line, and not at the end of the same line.
jq '.result[] | [.service,.name,.resourceIdentifier,.accountName,.vendorAccountId,.availabilityZone,.region,.effectiveHourly,.totalSpend,.idle,.lastSeen,.volumeType,.state,.volumeSize,.iops,.throughput,.lastAttachedTime,.lastAttachedId,.recommendations[].action] | @csv' ./myfile.json
This nets:
"\"ebsvolume\",\"gtest\",\"vol-999999999999\",\"g-test-acct\",\"12345678912\",\"ap-southeast-2c\",\"ap-southeast-2\",998.56,167.7,0,\"2018-08-16T22:00:00Z\",\"io1\",\"in-use\",180,2000,500,\"2018-08-08T22:00:00Z\",\"i-086f957ee\",\"Rightsize\",\"Rightsize\",\"Rightsize\""
What I want is
"\"ebsvolume\",\"gtest\",\"vol-999999999999\",\"g-test-acct\",\"12345678912\",\"ap-southeast-2c\",\"ap-southeast-2\",998.56,167.7,0,\"2018-08-16T22:00:00Z\",\"io1\",\"in-use\",180,2000,500,\"2018-08-08T22:00:00Z\",\"i-086f957ee\",
\"Rightsize\",
\"Rightsize\",
\"Rightsize\""
So I'm not entirely sure how to deal with the array inside the "recommendations" section in jq - I think it might be called unflattening?

You can try this:
jq '.result[] | [ flatten[] | try(.action) // . ] | @csv' file
"\"ebsvolume\",\"gtest\",\"vol-999999999999\",\"g-test-acct\",\"12345678912\",\"ap-southeast-2c\",\"ap-southeast-2\",998.56,167.7,0,\"2018-08-16T22:00:00Z\",\"io1\",\"in-use\",180,2000,500,\"2018-08-08T22:00:00Z\",\"i-086f957ee\",\"Rightsize\",\"Rightsize\",\"Rightsize\""
flatten does what it says.
try(.action) attempts to extract .action; where that raises an error (on the scalar values), try suppresses it, and the alternative operator // then falls back to the value itself.
The filtered values are put into an array in order to get them converted with the @csv operator.

That didn't really work for me - it actually omitted all the data in the previous array - but thanks!
I ended up with the following; granted it doesn't put the Rightsize details on a separate line, but it will have to do:
jq -r '.result[] | [.service,.name,.resourceIdentifier,.accountName,.vendorAccountId,.availabilityZone,.region,.effectiveHourly,.totalSpend,.idle,.lastSeen,.volumeType,.state,.volumeSize,.iops,.throughput,.lastAttachedTime,.lastAttachedId,.recommendations[][]] | @csv' ./myfile.json
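For reference, one way to actually get each recommendation on its own line (a sketch reusing the field list above; the recommendation columns chosen here are illustrative) is to emit the volume row and the recommendation rows as separate CSV streams from the same result object:
jq -r '.result[]
| ([.service,.name,.resourceIdentifier,.accountName,.vendorAccountId,.availabilityZone,.region,.effectiveHourly,.totalSpend,.idle,.lastSeen,.volumeType,.state,.volumeSize,.iops,.throughput,.lastAttachedTime,.lastAttachedId] | @csv),
  (.recommendations[] | [.action,.volumeType,.volumeSize,.savingsPct,.savings] | @csv)' ./myfile.json
With -r, each @csv result prints on its own line, so the volume line is followed by one line per recommendation.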

Related

JQ if else then NULL Return

I'm trying to filter and output from JSON with jq.
The API will sometimes return an object and sometimes an array. I want to catch the result using an if statement and return an empty string when the object/array is not available.
{
"result":
{
"entry": {
"id": "207579",
"title": "Realtek Bluetooth Mesh SDK on Linux\/Android Segmented Packet reference buffer overflow",
"summary": "A vulnerability, which was classified as critical, was found in Realtek Bluetooth Mesh SDK on Linux\/Android (the affected version unknown). This affects an unknown functionality of the component Segmented Packet Handler. There is no information about possible countermeasures known. It may be suggested to replace the affected object with an alternative product.",
"details": {
"affected": "A vulnerability, which was classified as critical, was found in Realtek Bluetooth Mesh SDK on Linux\/Android (the affected version unknown).",
"vulnerability": "The manipulation of the argument reference with an unknown input leads to a unknown weakness. CWE is classifying the issue as CWE-120. The program copies an input buffer to an output buffer without verifying that the size of the input buffer is less than the size of the output buffer, leading to a buffer overflow.",
"impact": "This is going to have an impact on confidentiality, integrity, and availability.",
"countermeasure": "There is no information about possible countermeasures known. It may be suggested to replace the affected object with an alternative product."
},
"timestamp": {
"create": "1661860801",
"change": "1661861110"
},
"changelog": [
"software_argument"
]
},
"software": {
"vendor": "Realtek",
"name": "Bluetooth Mesh SDK",
"platform": [
"Linux",
"Android"
],
"component": "Segmented Packet Handler",
"argument": "reference",
"cpe": [
"cpe:\/a:realtek:bluetooth_mesh_sdk"
],
"cpe23": [
"cpe:2.3:a:realtek:bluetooth_mesh_sdk:*:*:*:*:*:*:*:*"
]
}
}
}
I would also like to use the statement globally for the whole array output so I can parse it to .csv and escape the null, since software name can also contain an array or an object. Having a global if statement will simplify the syntax and suppress the error with ?.
The error I received from bash:
jq -r '.result [] | [ "https://vuldb.com/?id." + .entry.id ,.software.vendor // "empty",(.software.name | if type!="array" then [.] | join (",") else . //"empty" end )?,.software.type // "empty",(.software.platform | if type!="array" then [] else . | join (",") //"empty" end )?] | @csv' > tst.csv
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 7452 0 7393 100 59 4892 39 0:00:01 0:00:01 --:--:-- 4935
jq: error (at <stdin>:182): Cannot iterate over null (null)
What I have tried is the following code, which I tried to demo at https://jqplay.org/ and which has incorrect syntax:
.result [] |( if .[] == null then // "empty" else . end
| ,software.name // "empty" ,.software.platform |if type!="array" then [.] // "empty" else . | join (",") end)
Current output
[
[
"Bluetooth Mesh SDK"
],
"Linux,Android"
]
Desired outcome
[
"Bluetooth Mesh SDK",
"empty"
]
After fixing your input JSON, I think you can get the desired output by using the following JQ filter:
if (.result | type) == "array" then . else (.result |= [.]) end \
| .result[].software | [ .name, (.platform // [ "Empty" ] | join(",")) ]
Where
if (.result | type) == "array" then . else (.result |= [.]) end
Wraps .result in an array if type isn't array
.result[].software
Loops through the software in each .result obj
[ .name, (.platform // [ "Empty" ] | join(",")) ]
Creates an array with .name and .platform (which is replaced by [ "Empty" ] when it doesn't exist). Then it's join()'d into a string.
Outcome:
[
"Bluetooth Mesh SDK",
"Linux,Android"
]
Online demo
Or
[
"Bluetooth Mesh SDK",
"Empty
]
Online demo
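To get from there to the CSV output the question is ultimately after (a sketch reusing the URL prefix and field names from the question's own attempt), each value can be normalised and arrays joined before piping to @csv:
jq -r 'if (.result | type) == "array" then . else (.result |= [.]) end
| .result[]
| [ "https://vuldb.com/?id." + .entry.id,
    (.software.vendor // "empty"),
    (.software.name | if type == "array" then join(",") else . // "empty" end),
    (.software.platform | if type == "array" then join(",") else . // "empty" end) ]
| @csv' input.json
Here // "empty" supplies the fallback whenever a field is missing or null, and join(",") collapses arrays such as platform into a single CSV cell.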

How to flatten an array in a nested JSON in AWS Glue using PySpark?

I am trying to flatten a JSON file to be able to load it into PostgreSQL all in AWS Glue. I am using PySpark. Using a crawler I crawl the S3 JSON and produce a table. I then use an ETL Glue script to:
read the crawled table
use the 'Relationalize' function to flatten the file
convert the dynamic frame to a dataframe
try to 'explode' the request.data field
Script so far:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
df1 = df0.select(dfc_root_table_name)
df2 = df1.toDF()
df2 = df2.select(explode(col('`request.data`')).alias("request_data"))
<then I write df1 to a PostgreSQL database which works fine>
Issues I face:
The 'Relationalize' function works well except the request.data field which becomes a bigint and therefore 'explode' doesn't work.
Explode cannot be done without using 'Relationalize' on the JSON first due to the structure of the data. Specifically the error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
If I try to make the dynamic frame a dataframe first then I get this issue: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc.
: java.lang.IllegalArgumentException: Can't get JDBC type for struct..."
I tried to also upload a classifier so that the data would flatten in the crawl itself but AWS confirmed this wouldn't work.
The JSON format of the original file, which I am trying to normalise, is as follows:
- field1
- field2
- {}
- field3
- {}
- field4
- field5
- []
- {}
- field6
- {}
- field7
- field8
- {}
- field9
- {}
- field10
# Flatten nested df
from pyspark.sql import functions as F

def flatten_df(nested_df):
    for col in nested_df.columns:
        # explode any array columns (explode_outer keeps rows whose array is null)
        array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
        for col in array_cols:
            nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))

    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df

    # promote each struct field to a top-level column, joining names with '_'
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)

df = flatten_df(df)
It will replace all dots with underscores. Note that it uses explode_outer and not explode, so that a row whose array is null is kept rather than dropped. explode_outer is available in Spark v2.4+ only.
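As a minimal standalone illustration of that difference (not part of the Glue job itself): explode drops rows whose array is null, while explode_outer keeps them with a null value:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, ["a", "b"]), (2, None)], ["id", "vals"])

df.withColumn("vals", F.explode("vals")).show()        # the id=2 row is dropped
df.withColumn("vals", F.explode_outer("vals")).show()  # the id=2 row is kept with vals = null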
Also remember that exploding an array duplicates the other column values across the new rows, so the overall row count will increase, and flattening a struct increases the column count. In short, your original df will explode both horizontally and vertically, which may slow down processing the data later.
Therefore my recommendation would be to identify the feature-related data and store only that data in PostgreSQL, keeping the original JSON files in S3.
Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON, separated by periods.
Example :
Nested json :
{
"player": {
"username": "user1",
"characteristics": {
"race": "Human",
"class": "Warlock",
"subclass": "Dawnblade",
"power": 300,
"playercountry": "USA"
},
"arsenal": {
"kinetic": {
"name": "Sweet Business",
"type": "Auto Rifle",
"power": 300,
"element": "Kinetic"
},
"energy": {
"name": "MIDA Mini-Tool",
"type": "Submachine Gun",
"power": 300,
"element": "Solar"
},
"power": {
"name": "Play of the Game",
"type": "Grenade Launcher",
"power": 300,
"element": "Arc"
}
},
"armor": {
"head": "Eye of Another World",
"arms": "Philomath Gloves",
"chest": "Philomath Robes",
"leg": "Philomath Boots",
"classitem": "Philomath Bond"
},
"location": {
"map": "Titan",
"waypoint": "The Rig"
}
}
}
Flattened out JSON after Relationalize:
{
"player.username": "user1",
"player.characteristics.race": "Human",
"player.characteristics.class": "Warlock",
"player.characteristics.subclass": "Dawnblade",
"player.characteristics.power": 300,
"player.characteristics.playercountry": "USA",
"player.arsenal.kinetic.name": "Sweet Business",
"player.arsenal.kinetic.type": "Auto Rifle",
"player.arsenal.kinetic.power": 300,
"player.arsenal.kinetic.element": "Kinetic",
"player.arsenal.energy.name": "MIDA Mini-Tool",
"player.arsenal.energy.type": "Submachine Gun",
"player.arsenal.energy.power": 300,
"player.arsenal.energy.element": "Solar",
"player.arsenal.power.name": "Play of the Game",
"player.arsenal.power.type": "Grenade Launcher",
"player.arsenal.power.power": 300,
"player.arsenal.power.element": "Arc",
"player.armor.head": "Eye of Another World",
"player.armor.arms": "Philomath Gloves",
"player.armor.chest": "Philomath Robes",
"player.armor.leg": "Philomath Boots",
"player.armor.classitem": "Philomath Bond",
"player.location.map": "Titan",
"player.location.waypoint": "The Rig"
}
Thus in your case, request.data is already a new column flattened out from the request column, and its type is interpreted as bigint by Spark.
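So rather than exploding it, the relationalized column can be referenced directly once the dynamic frame is converted (a sketch based on the frame names in the question; the backticks stop Spark from parsing the dot as struct access):
from pyspark.sql.functions import col

df2 = df0.select(dfc_root_table_name).toDF()
df2.select(col("`request.data`")).show()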
Reference: Simplify Querying Nested JSON with the AWS Glue Relationalize Transform

How can I create an external table in Hive from an HDFS file that contains a JSON array?

My json looks like this:
[
{
"blocked": 1,
"object": {
"ip": "abc",
"src_ip": "abc",
"lan_initiated": true,
"detection": "abc",
"src_port": ,
"src_mac": "abc",
"dst_mac": "abc",
"dst_ip": "abc",
"dst_port": "abc"
},
"object_type": "url",
"threat": "",
"threat_type": "abc",
"device_id": "abc",
"app_id": "abc",
"user_id": "abc",
"timestamp": 1520268249657,
"date": {
"$date": "Mon Mar 05 2018 16:44:09 GMT+0000 (UTC)"
},
"expire": {
"$date": "Fri May 04 2018 16:44:09 GMT+0000 (UTC)"
},
"_id": "abc"
}
]
I have tried:
CREATE EXTERNAL TABLE `table_name`(
reports array<struct<
user_id: string ,
device_id: string ,
app_id: string ,
blocked: string ,
object: struct<ip:string,src_ip:string,lan_initiated:string,detection:string,src_port:string,src_mac:string,dst_mac:string,dst_ip:string,dst_port:string> ,
object_type: string ,
threat: string ,
threat_type: string ,
servertime:string,
date_t: struct<dat:string>,
expire: struct<dat:string>,
id: string >>)
ROW FORMAT SERDE
'org.openx.data.jsonserde.JsonSerDe'
WITH SERDEPROPERTIES (
'ignore.malformed.json'='false','mapping.dat'='$date', 'mapping.servertime'='timestamp','mapping.date'='date_t','mapping._id'='id')
STORED AS INPUTFORMAT
'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
'abc'
and after that
SELECT * FROM table_name
LATERAL VIEW outer explode(reports) exploded_table as rep;
but I get: Vertex did not succeed due to OWN_TASK_FAILURE - killed/failed due to:null.
I have read that it cannot be parsed because the JSON starts with '['. Any ideas? Does the structure of the JSON have to be changed?
I believe you have made a mistake in specifying the location.
In the code, you have mentioned
LOCATION
'abc'
LOCATION is expected to be a folder in which your JSON file should be present. You can have any name for the JSON file.
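For example (a sketch; the HDFS path is hypothetical), point LOCATION at the directory holding the file rather than at the file itself:
LOCATION
'/user/hive/warehouse/json_reports/'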
You also need to make sure that the JSON SerDe jar is on the classpath. If not, use the command below before trying to create the table:
hive> add jar /path/to/<json-serde-jar>;

Attribute Syntax for JSON query in check_json.pl

So, I'm trying to set up check_json.pl in NagiosXI to monitor some statistics. https://github.com/c-kr/check_json
I'm using the code with the modification I submitted in pull request #32, so line numbers reflect that code.
The json query returns something like this:
[
{
"total_bytes": 123456,
"customer_name": "customer1",
"customer_id": "1",
"indices": [
{
"total_bytes": 12345,
"index": "filename1"
},
{
"total_bytes": 45678,
"index": "filename2"
}
],
"total": "765.43gb"
},
{
"total_bytes": 123456,
"customer_name": "customer2",
"customer_id": "2",
"indices": [
{
"total_bytes": 12345,
"index": "filename1"
},
{
"total_bytes": 45678,
"index": "filename2"
}
],
"total": "765.43gb"
}
]
I'm trying to monitor the size of specific files, so a check should look something like:
/path/to/check_json.pl -u https://path/to/my/json -a "SOMETHING" -p "SOMETHING"
...where I'm trying to figure out the SOMETHINGs so that I can monitor the total_bytes of filename1 in customer2 where I know the customer_id and index but not their position in the respective arrays.
I can monitor customer1's total bytes by using the string "[0]->{'total_bytes'}", but I need to be able to specify which customer and dig deeper into the file name (known) and file size (the stat to monitor), AND the working query only gives me the status (OK, WARNING, or CRITICAL). Adding -p, all I get are errors....
The error with -p no matter how I've been able to phrase it is always:
Not a HASH reference at ./check_json.pl line 235.
Even when I can get a valid OK from the example "[0]->{'total_bytes'}", using that in -p still gives the same error.
Links pointing to documentation on the format to use would be very helpful. Examples in the README for the script or in the -h output are failing me here. Any ideas?
I really have no idea what your question is. I'm sure I'm not alone, hence the downvotes.
Once you have the decoded json, if you have a customer_id to search for, you can do:
my ($customer_info) = grep {$_->{customer_id} eq $customer_id} @$json_response;
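Building on that (a sketch, assuming the JSON from the question has already been decoded into $json_response and $customer_info has been found as above), the known index name can be looked up the same way to reach its total_bytes:
my $index_name = "filename1";

# find the index entry by name and read the size to monitor
my ($index_info) = grep { $_->{index} eq $index_name } @{ $customer_info->{indices} };
my $total_bytes  = $index_info->{total_bytes};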
Regarding the error on line 235, this looks odd:
foreach my $key ($np->opts->perfvars eq '*' ? map { "{$_}"} sort keys %$json_response : split(',', $np->opts->perfvars)) {
# ....................................... ^^^^^^^^^^^^^
$perf_value = $json_response->{$key};
If perfvars eq "*", you appear to be looking for $json_response->{"{total}"}, for example. You might want to validate the user's input:
die "no such key in json data: '$key'\n" unless exists $json_response->{$key};
This entire business of stringifying the hash ref lookups just smells bad.
A better question would look like:
I have this JSON data. How do I get the sum of total_bytes for the customer with id 1?
See https://stackoverflow.com/help/mcve

Lucene Solr - how to know the numCount of each word in a query

I have a query string with 5 words, for example "cat dog fish bird animals".
I need to know how many matches each word has.
At this point I create 5 queries:
/q=name:cat&rows=0&facet=true
/q=name:dog&rows=0&facet=true
/q=name:fish&rows=0&facet=true
/q=name:bird&rows=0&facet=true
/q=name:animals&rows=0&facet=true
and get the match count of each word from each query.
But this method takes too much time.
So is there a way to get the numCount of each word with one query?
Any help appreciated!
In this case, functionQueries are your friends. In particular:
termfreq(field,term) returns the number of times the term appears in the field for that document. Example Syntax:
termfreq(text,'memory')
totaltermfreq(field,term) returns the number of times the term appears in the field in the entire index. ttf is an alias of
totaltermfreq. Example Syntax: ttf(text,'memory')
The following query for instance:
q=*%3A*&fl=cntOnSummary%3Atermfreq(summary%2C%27hello%27)+cntOnTitle%3Atermfreq(title%2C%27entry%27)+cntOnSource%3Atermfreq(source%2C%27activities%27)&wt=json&indent=true
returns the following results:
"docs": [
{
"id": [
"id-1"
],
"source": [
"activities",
"activities"
],
"title": "Ajones3 Activity Entry 1",
"summary": "hello hello",
"cntOnSummary": 2,
"cntOnTitle": 1,
"cntOnSource": 1,
"score": 1
},
{
"id": [
"id-2"
],
"source": [
"activities",
"activities"
],
"title": "Common activity",
"cntOnSummary": 0,
"cntOnTitle": 0,
"cntOnSource": 1,
"score": 1
}
]
Please note that while this works well on single-valued fields, it seems that for multivalued fields the functions consider just the first entry; for instance, in the example above termfreq(source,'activities') returns 1 instead of 2.
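Applied to the question's example (a sketch, assuming the name field mentioned in the question), a single request can return per-document counts for all five words at once; note these are per-document term frequencies, not document match counts, and the spaces in fl would need URL-encoding as in the example above:
q=*:*&rows=10&fl=cntCat:termfreq(name,'cat') cntDog:termfreq(name,'dog') cntFish:termfreq(name,'fish') cntBird:termfreq(name,'bird') cntAnimals:termfreq(name,'animals')&wt=json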
