BigQuery Database
I've got a webhook that's pushing to my BigQuery table. The problem is that the payload has lots of nested JSON, which is brought in as strings. I ultimately want to turn each column holding these JSON strings into its own table, but I'm getting stuck because I can't figure out how to get them unnested and into an array.
[{"id":"63bddc8cfe21ec002d26b7f4","description":"General Admission", "src_currency":"USD","src_price":50.0,"src_fee":0.0,"src_commission":1.79,"src_discount":0.0,"applicable_pass_id":null,"seats_label":null,"seats_section_label":null,"seats_parent_type":null,"seats_parent_label":null,"seats_self_type":null,
"seats_self_label":null,"rate_type":"Rate","option_name":null,"price_level_id":null,"src_discount_price":50.0,"rate_id":"636d6d5cea8c6000222c640d","cost_item_id":"63bddc8cfe21ec002d26b7f4"}]
Here's the sample return from the original source, and below is a screenshot of what I'm working with.
[Screenshot: Current Database]
I've tried a number of things, but the multiple nestings and the string-to-array issue are really hampering everything I've tried.
I'm honestly not sure exactly what output/structure is best for this data set. I assume that each of the JSON returns probably just needs to be its own table, and I can reference or join them based on that first "id" value in the JSON strings, but I'm wide open to suggestions.
You can use a combination of JSON functions and array functions to manipulate this kind of data.
JSON_EXTRACT_ARRAY can convert the JSON-formatted string into an array, UNNEST can then turn each entry into a row, and finally JSON_EXTRACT_SCALAR can pull out individual columns.
So here's an example of what I think you're trying to accomplish:
with sampledata as (
  -- Two copies of the same JSON object, to show multiple rows being produced.
  select """[{"id":"63bddc8cfe21ec002d26b7f4","description":"General Admission", "src_currency":"USD","src_price":50.0,"src_fee":0.0,"src_commission":1.79,"src_discount":0.0,"applicable_pass_id":null,"seats_label":null,"seats_section_label":null,"seats_parent_type":null,"seats_parent_label":null,"seats_self_type":null,"seats_self_label":null,"rate_type":"Rate","option_name":null,"price_level_id":null,"src_discount_price":50.0,"rate_id":"636d6d5cea8c6000222c640d","cost_item_id":"63bddc8cfe21ec002d26b7f4"},{"id":"63bddc8cfe21ec002d26b7f4","description":"General Admission", "src_currency":"USD","src_price":50.0,"src_fee":0.0,"src_commission":1.79,"src_discount":0.0,"applicable_pass_id":null,"seats_label":null,"seats_section_label":null,"seats_parent_type":null,"seats_parent_label":null,"seats_self_type":null,"seats_self_label":null,"rate_type":"Rate","option_name":null,"price_level_id":null,"src_discount_price":50.0,"rate_id":"636d6d5cea8c6000222c640d","cost_item_id":"63bddc8cfe21ec002d26b7f4"}]""" as my_json_string
)
-- JSON_EXTRACT_ARRAY turns the string into an array of JSON fragments,
-- UNNEST turns each array entry into a row (aliased f),
-- and JSON_EXTRACT_SCALAR pulls individual fields out as columns.
select
  JSON_EXTRACT_SCALAR(f, '$.id') as id,
  JSON_EXTRACT_SCALAR(f, '$.rate_type') as rate_type,
  JSON_EXTRACT_SCALAR(f, '$.cost_item_id') as cost_item_id
from sampledata,
  UNNEST(JSON_EXTRACT_ARRAY(my_json_string)) as f
Which creates rows with specific columns from that data, like this:

id                       | rate_type | cost_item_id
-------------------------|-----------|-------------------------
63bddc8cfe21ec002d26b7f4 | Rate      | 63bddc8cfe21ec002d26b7f4
63bddc8cfe21ec002d26b7f4 | Rate      | 63bddc8cfe21ec002d26b7f4
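From there, if you do want each JSON column to end up as its own table keyed on that "id" value, one option is to materialize the unnested result. This is only a minimal sketch, assuming a source table mydataset.events with a JSON-string column cost_items; every name here is a placeholder for your actual schema:

create or replace table mydataset.cost_items as
select
  JSON_EXTRACT_SCALAR(f, '$.id') as id,
  JSON_EXTRACT_SCALAR(f, '$.description') as description,
  -- JSON_EXTRACT_SCALAR returns strings, so cast numeric fields explicitly.
  CAST(JSON_EXTRACT_SCALAR(f, '$.src_price') AS FLOAT64) as src_price,
  JSON_EXTRACT_SCALAR(f, '$.rate_id') as rate_id
from mydataset.events,
  UNNEST(JSON_EXTRACT_ARRAY(cost_items)) as f;

You can then join the resulting tables to each other on id.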
I have a large dataset to analyze, for which I need to look at the distinct values of multiple features (flags).
I am attempting to run a for loop as follows:
d = {}
name_list = ["ultfi_ind", "status"]  # Add names of columns here
for x in name_list:
    d[x] = test_df.select(x).distinct().collect()  # Please change df name
dist_val = pd.DataFrame.from_dict(d)
Here I am specifying the column names in the name_list list, and then in the for loop I am finding the distinct values in each of the columns and saving the output in a dictionary.
Finally, I am attempting to combine it all in a single dataframe, but that isn't possible as the lengths of the columns aren't the same.
I am aware that one way to do it is via padding, but I find that too complex a solution and am wondering if there's a smarter way to go about this.
Note that I am running this in a Spark environment, as my dataset is large.
I imagine the ultimate output to be a single CSV file/DataFrame wherein the header is the name of the column mentioned in name_list (above) and underneath that the distinct values are listed.
What you are talking about is data profiling. The pandas DataFrame has a describe() function to start your journey off.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#
If you want something a little more graphical, look at this article on towards data science.
https://towardsdatascience.com/3-tools-for-fast-data-profiling-5bd4e962e482
If you want to roll your own, you can. I am not going to code the whole thing for you, since you will not learn that way. But here is the algorithm I would use:
1 - Get a list of columns and types in the data set.
2 - If numeric, we can compute aggregations such as min, max, avg, etc.
3 - If non-numeric, we can group by the field and count() occurrences.
All of this data can be output as a data frame and saved to your favorite format; a minimal sketch is shown below.
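Here is that sketch in PySpark, assuming a Spark DataFrame named df (all names are placeholders, and this is a starting point rather than a finished solution):

# Profile each column of a Spark DataFrame according to its type.
from pyspark.sql import functions as F

# df.dtypes is a list of (column_name, type_string) pairs.
numeric_prefixes = ("int", "bigint", "smallint", "tinyint", "double", "float", "decimal")
for col_name, col_type in df.dtypes:
    if col_type.startswith(numeric_prefixes):
        # Numeric column: basic aggregations.
        df.agg(F.min(col_name), F.max(col_name), F.avg(col_name)).show()
    else:
        # Non-numeric column: distinct values with occurrence counts.
        df.groupBy(col_name).count().orderBy(F.desc("count")).show()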
Last but not least, there is the pandas API on Spark (pyspark.pandas), which replaces Koalas, and DataFrame.toPandas(), which converts a Spark DataFrame to a pandas DataFrame so you can use some of these prebuilt functions.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html
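For example, once you have reduced the data to something small, you can pull it to the driver and profile it with plain pandas (this assumes the distinct result actually fits in driver memory):

# Convert a small Spark result to pandas, then profile it with pandas tools.
pdf = test_df.select("ultfi_ind", "status").distinct().toPandas()
print(pdf.describe(include="all"))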
Could anyone tell me how to change the stored procedure in the article below to recursively expand all the attributes of a JSON file (multiple JSON document schemas)?
https://support.snowflake.net/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling-part-2
Craig Warman's stored procedure posted in that blog is a great idea. I asked him if it was okay to refactor his code, and he agreed. I've used the refactored version in the field, so I know the SP and how it works well.
It may be possible to modify the SP to work on your JSON. It will depend on whether or not Snowflake types the JSON in your variant column. The way you have it structured, it may not type everything. You can check by running this SQL and seeing if the result set includes all the columns you need:
set VARIANT_TABLE = 'WEATHER';
set VARIANT_COLUMN = 'V';
with MAIN_TABLE as
(
    select * from identifier($VARIANT_TABLE) sample (1000 rows)
)
select distinct
    -- Build a column-safe path name: strip bracket-enclosed array element
    -- references (like [0]) and replace non-alphanumeric characters with "_".
    REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\[(.+)\\]'), '[^a-zA-Z0-9]', '_') AS path_name,
    typeof(f.value) AS attribute_type,  -- Column datatypes.
    path_name AS alias_name             -- Column aliases based on the path.
from
    MAIN_TABLE,
    LATERAL FLATTEN(identifier($VARIANT_COLUMN), RECURSIVE=>true) f
where TYPEOF(f.value) != 'OBJECT'
  and not contains(f.path, '[');
Be sure to replace the variables with your table and column names. If this picks up the type information for the columns in your JSON, then it's possible to modify the SP to do what you need. If it doesn't, but there's a way to modify the query to get it to pick up the columns, that would work too.
If it doesn't pick up the columns: based on Craig's idea, I wrote type inference for non-variant data (such as strings from CSV log files without type information). Try the SQL above and see what it returns first.
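Once that query tells you a path and its type, pulling it out of the variant column is a simple cast. Here is a hypothetical example against the WEATHER table and V column set above; the paths are made up, so substitute paths the query actually returned:

select
    V:city.name::string as city_name,  -- hypothetical path
    V:main.temp::float  as main_temp   -- hypothetical path
from WEATHER;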
I have a Hive table in AVRO format, and one of the columns in the table contains a complex data type (nested arrays). One of the elements in that nested array contains data with newline characters. I am using multiple lateral view explodes to flatten the data, but because of the newline character in one of the columns the output is not good (it maps the wrong value to the wrong column). I tried to use the regex_replace function in my query, but it is reported as an invalid function. I am using "Hive 1.1.0-cdh5.12.2". Can you please tell me how I could handle this newline issue within the data while querying from the AVRO table using multiple lateral view explodes (as there are nested arrays)?
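For reference, the query is shaped roughly like this (table and column names here are illustrative, not my real schema):

SELECT t.id, inner_item
FROM avro_table t
LATERAL VIEW explode(t.outer_array) o AS outer_item
LATERAL VIEW explode(outer_item.inner_array) i AS inner_item;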
So I have a Hive external table whose schema looks like this:
{
.
.
`x` string,
`y` ARRAY<struct<age:string,cId:string,dmt:string>>,
`z` string
}
So basically I need to query a column (column "y") which is an array of nested JSON.
I can see the data in column "y" from Hive, but the data in that column seems invisible to Presto, even though Presto knows the schema of this field, like this:
array(row(age varchar,cid varchar,dmt varchar))
As you can see, Presto already knows this field is an array of rows.
Notes:
1. The table is a Hive external table.
2. I get the schema of field "y" by using the ODBC driver, but the data is just all empty; however, I can see something like this in Hive:
[{"age":"12","cId":"bx21hdg","dmt":"120"}]
3. Presto queries the Hive metastore for the schema.
4. The table is stored in Parquet format.
So how can I see my data in field "y", please?
Please try the below. This should work in Presto.
"If the array element is a row data type, the result is a table with one column for each row field in the element data type. The result table column data types match the corresponding array element row field data types"
select
    y, age, cid, dmt
from
    table  -- replace with your actual table name
cross join UNNEST(y) AS nested_data(age, cid, dmt)
Reference: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0055064.html
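With the sample row from the question, this returns age = 12, cid = bx21hdg, and dmt = 120 alongside the original y value.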