Pass array into each object in JSON file (PROC JSON SAS)

I'm trying to export a dataset to a JSON file. With PROC JSON, every row in my dataset is exported nicely.
What I want to do is add an array to each exported object with data from a specific column.
My dataset has a structure like this:
data test;
input id $ amount $ dimension $;
datalines;
1 x A
1 x B
1 x C
2 y A
2 y X
3 z C
3 z K
3 z X
;
run;
proc json out='/MYPATH/jsontest.json' pretty nosastags;
export test;
run;
And the exported JSON object looks, obviously, like this:
[
{
"id": "1",
"amount": "x",
"dimension": "A"
},
{
"id": "1",
"amount": "x",
"dimension": "B"
},
{
"id": "1",
"amount": "x",
"dimension": "C"
},
...]
The result I want:
For each id I would like to insert all of the data from the dimension column into an array, so my output would look like this:
[
{
"id": "1",
"amount": "x",
"dimensions": [
"A",
"B",
"C"
]
},
{
"id": "2",
"amount": "y",
"dimensions": [
"A",
"X"
]
},
{
"id": "3",
"amount": "z",
"dimensions": [
"C",
"K",
"X"
]
}
]
I've not been able to find a scenario like this or some guidelines on how to solve my problem. I hope somebody can help.
/Crellee

There are other methods for JSON output, including:
a hand-coded emitter in a DATA step
the JSON package in PROC DS2
Here is an example of a hand-coded emitter for your data and desired mapping.
data _null_;
  file 'c:\temp\test.json';
  put '[';
  do group_counter = 1 by 1 while (not end_of_data);
    if group_counter > 1 then put @2 ',';
    put @2 '{';
    do dimension_counter = 1 by 1 until (last.amount);
      set test end=end_of_data;
      by id amount;
      if dimension_counter = 1 then do;
        q1 = quote(trim(id));
        q2 = quote(trim(amount));
        put
          @4 '"id":' q1 ","
        / @4 '"amount":' q2 ","
        ;
        put @4 '"dimensions":' / @4 '[';
      end;
      else do;
        put @6 ',' @;
      end;
      q3 = quote(trim(dimension));
      put @8 q3;
    end;
    put @4 ']';
    put @2 '}';
  end;
  put ']';
  stop;
run;
Such an emitter can be macro-ized and generalized to handle specifications of data=, by= and arrayify=. Not a path recommended for friends.
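For comparison only, the id/amount to dimensions grouping that the emitter performs can be sketched outside SAS in a few lines of Python; the hard-coded rows and output path below are hypothetical and simply mirror the sample data:
import json
from itertools import groupby

# sample rows in (id, amount, dimension) order, already sorted by id and amount
rows = [
    ("1", "x", "A"), ("1", "x", "B"), ("1", "x", "C"),
    ("2", "y", "A"), ("2", "y", "X"),
    ("3", "z", "C"), ("3", "z", "K"), ("3", "z", "X"),
]

# one object per (id, amount) group, with the dimensions collected into an array
objects = [
    {"id": key[0], "amount": key[1], "dimensions": [r[2] for r in grp]}
    for key, grp in groupby(rows, key=lambda r: (r[0], r[1]))
]

with open("jsontest.json", "w") as f:
    json.dump(objects, f, indent=2)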

You can try concatenating / grouping the text before calling proc json.
I don't have PROC JSON in my SAS environment, but try this step and see if it works for you:
data want;
set test (rename=(dimension=old_dimension));
Length dimension $200. ;
retain dimension ;
by id amount notsorted;
if first.amount = 1 then do; dimension=''; end;
if last.amount = 1 then do; dimension=catx(',',dimension,old_dimension); output; end;
else do; dimension=catx(',',dimension,old_dimension); end;
drop old_dimension;
run;
Output:
id=1 amount=x dimension=A,B,C
id=2 amount=y dimension=A,X
id=3 amount=z dimension=C,K,X

Related

how to query nested jsonb array for a given key

I have a table 'animals' that stores information like this:
row 1 -> [{"type": "rabbit", "value": "carrot"}, {"type": "cat", "value": "milk"}]
row 2 -> [{"type": "cat", "value": "fish"}, {"type": "rabbit", "value": "leaves"}]
I need to query the value for type rabbit from all the rows.
I tried the containment operator @> with '[{"type": "rabbit"}]':
select * from data where data @> '[{"type":"rabbit"}]';
but it doesn't work.

Split json data into separate row result in Bigquery

I am writing a BigQuery query to split a JSON data set into a more structured table.
The JSON_data_set looks something like this
Row | createdon | result
1 | 24022020 | {"searchResult": {"searchAccounts": [{"chainName": "xyxvjw", "address": {"name": "xyxvjw - ythji", "combined_city": "uptown", "combined_address": "1 downtown, uptown, 09728", "city": "uptown"}, "products": ["pin", "needle", "cloth"]}},{"chainName": "pwiewhds", "address": {"name": "pwiewhds - oujsus", "combined_city": "over the river", "combined_address": "100 under bridge, over the river, 19920", "city": "over the river"}, "products": ["tape", "stapler"]}}],"searchID": "3abci832832o0"}}
2 | 25020202 | {"searchResult": {"searchAccounts": [{"chainName": "xyxvjw2029", "address": {"name": "xyxvjw2029 - ythji", "combined_city": "uptown", "combined_address": "1 downtown, uptown, 09728", "city": "uptown"}, "products": ["pin", "needle", "cloth"]}},{"chainName": "pwiewhds8972", "address": {"name": "pwiewhds8972 - oujsus", "combined_city": "over the river", "combined_address": "100 under bridge, over the river, 19920", "city": "over the river"}, "products": ["tape", "stapler"]}}],"searchID": "3abci832832o0"}}
There are many subsequent account details in each row in the result column.
I am able to unnest the data using the code below to get columns such as chain name and address. However, when I try to call the broken-down field columns, it gives me the error: Cannot access field _field_1 on a value with type ARRAY<STRUCT<STRING, STRING>>
How can I separate the columns created from json data into individual columns and rows without being tied to the json row column?
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
if (json !== null) {
return JSON.parse(json).map(x=>JSON.stringify(x));
}
""";
SELECT * EXCEPT(chains),
ARRAY(SELECT AS STRUCT JSON_EXTRACT_SCALAR(x, '$.chainName'), JSON_EXTRACT_SCALAR(x, '$.address.combined_address') FROM UNNEST(chains) x WHERE JSON_EXTRACT_SCALAR(x, '$.chainName') IS NOT NULL) chain_names
FROM (
SELECT *,
json2array(
JSON_EXTRACT(result, '$.searchResult.searchAccounts')
) chains
FROM json_data_set
)
I just needed to write the query a different way to get individual columns:
CREATE TEMP FUNCTION json2array(json STRING)
RETURNS ARRAY<STRING>
LANGUAGE js AS """
if (json !== null) {
return JSON.parse(json).map(x=>JSON.stringify(x));
}
""";
WITH chain_name AS (
SELECT
*,
json2array(
JSON_EXTRACT(result, '$.searchResult.searchAccounts')
) chains
FROM json_data_set
)
SELECT AS STRUCT
JSON_EXTRACT_SCALAR(x, '$.chainName') chainName,
JSON_EXTRACT_SCALAR(x, '$.address.combined_address') combined_address
FROM chain_name, UNNEST(chains) x
WHERE JSON_EXTRACT_SCALAR(x, '$.chainName') IS NOT NULL
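For reference, this is roughly what the json2array UDF does to a single result value, sketched in plain Python on a simplified, hypothetical sample (the real rows carry more fields):
import json

# one simplified result value (hypothetical, trimmed down from the question's sample)
result = ('{"searchResult": {"searchAccounts": ['
          '{"chainName": "xyxvjw", "address": {"combined_address": "1 downtown, uptown, 09728"}}'
          '], "searchID": "3abci832832o0"}}')

# json2array: parse the array and re-serialize each element as its own JSON string
chains = [json.dumps(x) for x in json.loads(result)["searchResult"]["searchAccounts"]]

# the outer query then extracts scalars from each element
for c in chains:
    obj = json.loads(c)
    print(obj["chainName"], obj["address"]["combined_address"])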

How to flatten an array in a nested json in aws glue using pyspark?

I am trying to flatten a JSON file to be able to load it into PostgreSQL all in AWS Glue. I am using PySpark. Using a crawler I crawl the S3 JSON and produce a table. I then use an ETL Glue script to:
read the crawled table
use the 'Relationalize' function to flatten the file
convert the dynamic frame to a dataframe
try to 'explode' the request.data field
Script so far:
datasource0 = glueContext.create_dynamic_frame.from_catalog(database = glue_source_database, table_name = glue_source_table, transformation_ctx = "datasource0")
df0 = Relationalize.apply(frame = datasource0, staging_path = glue_temp_storage, name = dfc_root_table_name, transformation_ctx = "dfc")
df1 = df0.select(dfc_root_table_name)
df2 = df1.toDF()
df2 = df1.select(explode(col('`request.data`')).alias("request_data"))
<then I write df1 to a PostgreSQL database, which works fine>
Issues I face:
The 'Relationalize' function works well except for the request.data field, which becomes a bigint, so 'explode' doesn't work.
Explode cannot be done without using 'Relationalize' on the JSON first due to the structure of the data. Specifically the error is: "org.apache.spark.sql.AnalysisException: cannot resolve 'explode(request.data)' due to data type mismatch: input to function explode should be array or map type, not bigint"
If I try to make the dynamic frame a dataframe first then I get this issue: "py4j.protocol.Py4JJavaError: An error occurred while calling o72.jdbc.
: java.lang.IllegalArgumentException: Can't get JDBC type for struct..."
I tried to also upload a classifier so that the data would flatten in the crawl itself but AWS confirmed this wouldn't work.
The JSON format of the original file that I am trying to normalise is as follows:
- field1
- field2
- {}
- field3
- {}
- field4
- field5
- []
- {}
- field6
- {}
- field7
- field8
- {}
- field9
- {}
- field10
# Flatten nested df
from pyspark.sql import functions as F

def flatten_df(nested_df):
    # explode any array columns first (one output row per array element)
    array_cols = [c[0] for c in nested_df.dtypes if c[1][:5] == 'array']
    for col in array_cols:
        nested_df = nested_df.withColumn(col, F.explode_outer(nested_df[col]))

    nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
    if len(nested_cols) == 0:
        return nested_df

    # promote each struct field to a top-level column named parent_child
    flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
    flat_df = nested_df.select(flat_cols +
                               [F.col(nc + '.' + c).alias(nc + '_' + c)
                                for nc in nested_cols
                                for c in nested_df.select(nc + '.*').columns])
    return flatten_df(flat_df)

df = flatten_df(df)
It replaces all dots with underscores in the column names. Note that it uses explode_outer rather than explode so that a row is kept even when the array itself is null; this function is available in Spark v2.4+ only.
Also remember that exploding an array adds rows (the other columns are duplicated), while flattening a struct adds columns. In short, your original df will grow both horizontally and vertically, which may slow down processing later.
Therefore my recommendation would be to identify the feature-related data, store only that in PostgreSQL, and keep the original JSON files in S3.
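To illustrate the explode versus explode_outer point above, here is a minimal standalone PySpark sketch on a hypothetical two-row frame:
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# hypothetical toy frame: one populated array, one null array
df = spark.createDataFrame([(1, ["a", "b"]), (2, None)], ["id", "tags"])

# explode drops the row whose array is null ...
df.select("id", F.explode("tags").alias("tag")).show()

# ... while explode_outer keeps it, with tag = null
df.select("id", F.explode_outer("tags").alias("tag")).show()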
Once you have relationalized the JSON column, you don't need to explode it. Relationalize transforms the nested JSON into key-value pairs at the outermost level of the JSON document. The transformed data maintains a list of the original keys from the nested JSON, separated by periods.
Example :
Nested json :
{
"player": {
"username": "user1",
"characteristics": {
"race": "Human",
"class": "Warlock",
"subclass": "Dawnblade",
"power": 300,
"playercountry": "USA"
},
"arsenal": {
"kinetic": {
"name": "Sweet Business",
"type": "Auto Rifle",
"power": 300,
"element": "Kinetic"
},
"energy": {
"name": "MIDA Mini-Tool",
"type": "Submachine Gun",
"power": 300,
"element": "Solar"
},
"power": {
"name": "Play of the Game",
"type": "Grenade Launcher",
"power": 300,
"element": "Arc"
}
},
"armor": {
"head": "Eye of Another World",
"arms": "Philomath Gloves",
"chest": "Philomath Robes",
"leg": "Philomath Boots",
"classitem": "Philomath Bond"
},
"location": {
"map": "Titan",
"waypoint": "The Rig"
}
}
}
Flattened out JSON after Relationalize:
{
"player.username": "user1",
"player.characteristics.race": "Human",
"player.characteristics.class": "Warlock",
"player.characteristics.subclass": "Dawnblade",
"player.characteristics.power": 300,
"player.characteristics.playercountry": "USA",
"player.arsenal.kinetic.name": "Sweet Business",
"player.arsenal.kinetic.type": "Auto Rifle",
"player.arsenal.kinetic.power": 300,
"player.arsenal.kinetic.element": "Kinetic",
"player.arsenal.energy.name": "MIDA Mini-Tool",
"player.arsenal.energy.type": "Submachine Gun",
"player.arsenal.energy.power": 300,
"player.arsenal.energy.element": "Solar",
"player.arsenal.power.name": "Play of the Game",
"player.arsenal.power.type": "Grenade Launcher",
"player.arsenal.power.power": 300,
"player.arsenal.power.element": "Arc",
"player.armor.head": "Eye of Another World",
"player.armor.arms": "Philomath Gloves",
"player.armor.chest": "Philomath Robes",
"player.armor.leg": "Philomath Boots",
"player.armor.classitem": "Philomath Bond",
"player.location.map": "Titan",
"player.location.waypoint": "The Rig"
}
Thus in your case, request.data is already a new column flattened out of the request column, and its type is interpreted as bigint by Spark.
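As a side note, the backticks in the question's select are what make Spark treat such a dotted name as a single column rather than struct access; a minimal sketch with toy data and hypothetical column names:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# toy frame whose column names literally contain a dot, like Relationalize output
df = spark.createDataFrame([(1, 42)], ["request.id", "request.data"])

# backticks mark the dot as part of the name, so the flattened scalar column is selected directly
df.select(col("`request.data`").alias("request_data")).show()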
Reference : Simplify/querying nested json with the aws glue relationalize transform

Groovy, insert node after current node

I'll try my best to explain the situation.
I have the following db columns:
oid - task - start - end - realstart - realend
My requirement is to have an output like the following:
oid1 - task1 - start1 - end1
oid2 - task2 - start2 - end2
where task1 is task, task2 is task + "real", start1 is start, start2 is realstart, end1 is end, end2 is realend
BUT
The first row should always be created (those start/end fields are never empty); the second row should only be created if realstart and realend exist, which may not be the case.
Inputs are 6 arrays (one for each column); outputs must be 4 arrays, something like this:
#input oid,task,start,end,realstart,realend
#output oid,task,start,end
I was thinking about using something like oid.each but I don't know how to add nodes after the current one. Order is important in the requirement.
For any explanation please ask, thanks!
After your comment and understanding that you don't want (or cannot) change the input/output data format, here's another solution that does what you've asked using classes to group the data and make it easier to manage:
import groovy.transform.Canonical
@Canonical
class Input {
String[] oids = [ 'oid1', 'oid2' ]
String[] tasks = [ 'task1', 'task2' ]
Integer[] starts = [ 10, 30 ]
Integer[] ends = [ 20, 42 ]
Integer[] realstarts = [ 12, null ]
Integer[] realends = [ 21, null ]
List<Object[]> getEntries() {
// ensure all entries have the same size
def entries = [ oids, tasks, starts, ends, realstarts, realends ]
assert entries.collect { it.size() }.unique().size() == 1,
'The input arrays do not all have the same size'
return entries
}
int getSize() {
oids.size() // any field would do, they have the same length
}
}
@Canonical
class Output {
List oids = [ ]
List tasks = [ ]
List starts = [ ]
List ends = [ ]
void add( oid, task, start, end, realstart, realend ) {
oids << oid; tasks << task; starts << start; ends << end
if ( realstart != null && realend != null ) {
oids << oid; tasks << task + 'real'; starts << realstart; ends << realend
}
}
}
def input = new Input()
def entries = input.entries
def output = new Output()
for ( int i = 0; i < input.size; i++ ) {
def entry = entries.collect { it[ i ] }
output.add( *entry )
}
println output
Responsibility of arranging the data is on the Input class, while the responsibility of knowing how to organize the output data is in the Output class.
Running this code prints:
Output([oid1, oid1, oid2], [task1, task1real, task2], [10, 12, 30], [20, 21, 42])
You can get the arrays (Lists, actually, but call toArray() if on the List to get an array) from the output object with output.oids, output.tasks, output.starts and output.ends.
The @Canonical annotation just makes the class implement equals, hashCode, toString and so on...
If you don't understand something, ask in the comments.
If you need an "array" whose size you don't know from the start, you should use a List instead; in Groovy, that's very easy to use.
Here's an example:
final int OID = 0
final int TASK = 1
final int START = 2
final int END = 3
final int R_START = 4
final int R_END = 5
List<Object[]> input = [
//oid, task, start, end, realstart, realend
[ 'oid1', 'task1', 10, 20, 12, 21 ],
[ 'oid2', 'task2', 30, 42, null, null ]
]
List<List> output = [ ]
input.each { row ->
output << [ row[ OID ], row[ TASK ], row[ START ], row[ END ] ]
if ( row[ R_START ] && row[ R_END ] ) {
output << [ row[ OID ], row[ TASK ] + 'real', row[ R_START ], row[ R_END ] ]
}
}
println output
Which outputs:
[[oid1, task1, 10, 20], [oid1, task1real, 12, 21], [oid2, task2, 30, 42]]

lucene solr - how to know numCount of each word in query

I have a query string with 5 words, for example "cat dog fish bird animals".
I need to know how many matches each word has.
At this point I create 5 queries:
/q=name:cat&rows=0&facet=true
/q=name:dog&rows=0&facet=true
/q=name:fish&rows=0&facet=true
/q=name:bird&rows=0&facet=true
/q=name:animals&rows=0&facet=true
and get the match count of each word from each query.
But this method takes too much time.
So is there a way to get the count for each word with one query?
Any help appreciated!
In this case, functionQueries are your friends. In particular:
termfreq(field,term) returns the number of times the term appears in the field for that document. Example Syntax:
termfreq(text,'memory')
totaltermfreq(field,term) returns the number of times the term appears in the field in the entire index. ttf is an alias of
totaltermfreq. Example Syntax: ttf(text,'memory')
The following query for instance:
q=*%3A*&fl=cntOnSummary%3Atermfreq(summary%2C%27hello%27)+cntOnTitle%3Atermfreq(title%2C%27entry%27)+cntOnSource%3Atermfreq(source%2C%27activities%27)&wt=json&indent=true
returns the following results:
"docs": [
{
"id": [
"id-1"
],
"source": [
"activities",
"activities"
],
"title": "Ajones3 Activity Entry 1",
"summary": "hello hello",
"cntOnSummary": 2,
"cntOnTitle": 1,
"cntOnSource": 1,
"score": 1
},
{
"id": [
"id-2"
],
"source": [
"activities",
"activities"
],
"title": "Common activity",
"cntOnSummary": 0,
"cntOnTitle": 0,
"cntOnSource": 1,
"score": 1
}
}
]
Please note that while this works well on single-valued fields, it seems that for multivalued fields the functions consider just the first entry; for instance, in the example above termfreq(source,'activities') returns 1 instead of 2.
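As a rough illustration of issuing the single combined query from client code, here is a minimal Python sketch; the host, core, alias names and row count are hypothetical, and only the name field comes from the question:
import requests

params = {
    "q": "*:*",
    "fl": ("cntCat:termfreq(name,'cat') "
           "cntDog:termfreq(name,'dog') "
           "cntFish:termfreq(name,'fish') "
           "cntBird:termfreq(name,'bird') "
           "cntAnimals:termfreq(name,'animals')"),
    "wt": "json",
    "rows": 10,
}

# hypothetical Solr host and core
resp = requests.get("http://localhost:8983/solr/mycore/select", params=params)
for doc in resp.json()["response"]["docs"]:
    print({k: v for k, v in doc.items() if k.startswith("cnt")})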
