Convert PostgreSQL nested JSON to numeric array in Tableau

I have a PostgreSQL database containing a table test_table with individual records. The first column is a simple store_id, the second column measurement is a nested JSON object.
store_id | measurement
----------------------
0 | {...}
The format of the measurement column is as follows:
{
  "file_info": "xxxx",
  "data": {
    "contour_data": {
      "X": [-97.0, -97.0, -97.0, -97.0, -97.0, -97.0],
      "Y": [-43.0, -41.0, -39.0, -39.0, -38.0, -36.0]
    }
  }
}
I would like to plot Y vs. X in a scatter plot in Tableau, so I connected the database successfully with Tableau's PostgreSQL connector. From this page I learned that I have to use Custom SQL queries to extract data from the JSON object, since Tableau doesn't directly support the json datatype of Postgres. I have already tried the following Custom SQL query in Tableau:
select
    store_id as store_id,
    measurement #>> '{data, contour_data, X}' as contour_points_x,
    measurement #>> '{data, contour_data, Y}' as contour_points_y
from test_table
which successfully extracts the two arrays into two new columns, contour_points_x and contour_points_y. However, both new columns arrive in Tableau as type string, so I cannot use them as a data source for a plot.
How do I have to adjust the Custom SQL query to make the data arrays plottable in a Tableau scatter plot?

Looks like you need to split the columns. Check this https://help.tableau.com/current/pro/desktop/en-us/split.htm
EDIT - the linked approach works when you can reliably assume an upper bound for the number of points in each list. One way to split arbitrarily sized lists is described here https://apogeeintegration.com/blog/apogee-busts-out-multi-value-cells-using-tableau-prep-builder

The answer is a combination of several functions and syntax operations. One has to:
use the #> operator to dig into the JSON and return it as json type (not as text, which #>> returns),
use json_array_elements_text() to expand the JSON array to a set of text values, and
use the type cast operator :: to convert text to float.
/* custom SQL query in Tableau */
select
    store_id as store_id,
    json_array_elements_text(measurement #> '{data, contour_data, X}')::float as contour_points_x,
    json_array_elements_text(measurement #> '{data, contour_data, Y}')::float as contour_points_y
from test_table
Both resulting columns now appear in a Tableau sheet as discrete measures. Changing them to discrete dimensions allows plotting contour_points_y vs. contour_points_x as desired.
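One caveat worth noting: with two set-returning functions in the same select list, PostgreSQL 10 and later advance them in lockstep, so the rows pair up positionally; older versions behave differently. If you want the pairing to be explicit, here is a minimal sketch using WITH ORDINALITY (available since 9.4), assuming the same table and column names as above:
select
    t.store_id,
    x.val::float as contour_points_x,
    y.val::float as contour_points_y
from test_table t
cross join lateral json_array_elements_text(t.measurement #> '{data, contour_data, X}')
    with ordinality as x(val, idx)
join lateral json_array_elements_text(t.measurement #> '{data, contour_data, Y}')
    with ordinality as y(val, idx)
    on x.idx = y.idx
This joins each X element to the Y element at the same array index, rather than relying on the positional lockstep behavior.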

Related

BigQuery nested JSON strings to arrays, then new tables

I've got a webhook that's pushing to my BigQuery table. The problem is that it has lots of nested JSON which is brought in as strings. I ultimately want to turn each column holding these JSON strings into its own table, but I'm getting stuck because I can't figure out how to get them unnested and into an array.
[{"id":"63bddc8cfe21ec002d26b7f4","description":"General Admission", "src_currency":"USD","src_price":50.0,"src_fee":0.0,"src_commission":1.79,"src_discount":0.0,"applicable_pass_id":null,"seats_label":null,"seats_section_label":null,"seats_parent_type":null,"seats_parent_label":null,"seats_self_type":null,
"seats_self_label":null,"rate_type":"Rate","option_name":null,"price_level_id":null,"src_discount_price":50.0,"rate_id":"636d6d5cea8c6000222c640d","cost_item_id":"63bddc8cfe21ec002d26b7f4"}]
Here's the sample return from the original source and below is a screenshot of what I'm working with.
[screenshot: current database table]
I've tried a number of things, but the multiple nestings and the string-to-array issue are really hampering everything I've tried.
I'm honestly not sure exactly what output/structure is best for this data set. I assume that each of the JSON returns probably just needs to be its own table, and I can reference or join them based on that first "id" value in the JSON strings, but I'm wide open to suggestions.
You can use a combination of JSON functions and array functions to manipulate this kind of data.
JSON_EXTRACT_ARRAY can convert the JSON-formatted string into an array, UNNEST can then turn each entry into a row, and finally JSON_EXTRACT_SCALAR can pull out individual columns.
So here's an example of what I think you're trying to accomplish:
with sampledata as (
select """[{"id":"63bddc8cfe21ec002d26b7f4","description":"General Admission", "src_currency":"USD","src_price":50.0,"src_fee":0.0,"src_commission":1.79,"src_discount":0.0,"applicable_pass_id":null,"seats_label":null,"seats_section_label":null,"seats_parent_type":null,"seats_parent_label":null,"seats_self_type":null,"seats_self_label":null,"rate_type":"Rate","option_name":null,"price_level_id":null,"src_discount_price":50.0,"rate_id":"636d6d5cea8c6000222c640d","cost_item_id":"63bddc8cfe21ec002d26b7f4"},{"id":"63bddc8cfe21ec002d26b7f4","description":"General Admission", "src_currency":"USD","src_price":50.0,"src_fee":0.0,"src_commission":1.79,"src_discount":0.0,"applicable_pass_id":null,"seats_label":null,"seats_section_label":null,"seats_parent_type":null,"seats_parent_label":null,"seats_self_type":null,"seats_self_label":null,"rate_type":"Rate","option_name":null,"price_level_id":null,"src_discount_price":50.0,"rate_id":"636d6d5cea8c6000222c640d","cost_item_id":"63bddc8cfe21ec002d26b7f4"}]""" as my_json_string
)
select
    JSON_EXTRACT_SCALAR(f, '$.id') as id,
    JSON_EXTRACT_SCALAR(f, '$.rate_type') as rate_type,
    JSON_EXTRACT_SCALAR(f, '$.cost_item_id') as cost_item_id
from sampledata, UNNEST(JSON_EXTRACT_ARRAY(my_json_string)) as f
This creates rows with the specific columns pulled from that data.
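To materialize each JSON column as its own table, as the question proposes, the same pattern can feed a CREATE TABLE statement. A minimal sketch, where the dataset name mydataset, source table webhook_events, and column cost_items are hypothetical names (the extracted fields come from the sample JSON above):
-- hypothetical names: mydataset.webhook_events and its cost_items column
create table mydataset.cost_items as
select
    JSON_EXTRACT_SCALAR(f, '$.id') as id,
    JSON_EXTRACT_SCALAR(f, '$.description') as description,
    CAST(JSON_EXTRACT_SCALAR(f, '$.src_price') as FLOAT64) as src_price,
    JSON_EXTRACT_SCALAR(f, '$.rate_id') as rate_id
from mydataset.webhook_events, UNNEST(JSON_EXTRACT_ARRAY(cost_items)) as f
Each such table can then be joined back on the shared id value, as suggested in the question.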

How to find distinct values in a list of columns and print them in a single CSV

I have a large dataset to analyze, for which I need to look at the distinct values of multiple features (flags).
I am attempting to run a for loop as follows:
import pandas as pd

d = {}
name_list = ["ultfi_ind", "status"]  # add names of columns here
for x in name_list:
    d[x] = test_df.select(x).distinct().collect()  # change df name as needed
dist_val = pd.DataFrame.from_dict(d)
Here I am specifying the column names in the name_list list, and then in the for loop I am finding the distinct values in each of the columns and saving the output in a dictionary.
Finally, I am attempting to combine it all into a single dataframe, but that isn't possible as the lengths of the columns aren't the same.
I am aware that one way to do it is via padding, but I find that too complex a solution and am wondering if there's a smarter way to go about this.
Note that I am running this in a Spark environment, as my dataset is large.
I imagine the ultimate output as a single CSV file/dataframe in which each header is a column name from name_list (above), and underneath it the distinct values are listed.
What you are talking about is data profiling. The pandas DataFrame has a describe() function to start your journey off.
https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.describe.html#
If you want something a little more graphical, look at this article on Towards Data Science.
https://towardsdatascience.com/3-tools-for-fast-data-profiling-5bd4e962e482
If you want to roll your own, you can. I am not going to code it for you, since you would not learn. But here is the algorithm I would use:
1 - Get a list of columns and their types in the data set.
2 - If numeric, compute aggregations such as min, max, avg, etc.
3 - If non-numeric, group by the field and count() occurrences.
All this data can be output as a data frame and saved to your favorite format, as sketched below.
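A minimal PySpark sketch of that algorithm, assuming a live SparkSession and the dataframe test_df from the question (the dtype prefixes treated as numeric are an assumption):
from pyspark.sql import functions as F

# dtype prefixes treated as numeric; everything else gets a group-by count
NUMERIC_PREFIXES = ("tinyint", "smallint", "int", "bigint", "float", "double", "decimal")

def profile(df):
    """Rough profiler following the three steps above."""
    for name, dtype in df.dtypes:              # 1 - columns and their types
        if dtype.startswith(NUMERIC_PREFIXES):
            # 2 - numeric: min / max / avg aggregations
            df.select(
                F.min(name).alias(f"{name}_min"),
                F.max(name).alias(f"{name}_max"),
                F.avg(name).alias(f"{name}_avg"),
            ).show()
        else:
            # 3 - non-numeric: occurrences per distinct value
            df.groupBy(name).count().show()

profile(test_df)  # test_df as in the question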
Last but not least, there is the pyspark.pandas API that replaces Koalas. It allows you to convert a Spark dataframe to a pandas dataframe to use some of these prebuilt functions.
https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.toPandas.html
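For example, a quick sketch (only sensible when the result, or an aggregated subset of it, fits in driver memory):
# pull the small Spark result into pandas on the driver, then profile it
pdf = test_df.toPandas()
print(pdf.describe(include="all"))
pdf.to_csv("distinct_values.csv", index=False)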

How to query multiple JSON document schemas in Snowflake?

Could anyone tell me how to change the stored procedure in the article below to recursively expand all the attributes of a JSON file (multiple JSON document schemas)?
https://support.snowflake.net/s/article/Automating-Snowflake-Semi-Structured-JSON-Data-Handling-part-2
Craig Warman's stored procedure posted in that blog is a great idea. I asked him if it was okay to refactor his code, and he agreed. I've used the refactored version in the field, so I know the SP and how it works well.
It may be possible to modify the SP to work on your JSON. It will depend on whether or not Snowflake types the JSON in your variant column. The way you have it structured, it may not type everything. You can check by running this SQL and seeing if the result set includes all the columns you need:
set VARIANT_TABLE = 'WEATHER';
set VARIANT_COLUMN = 'V';
with MAIN_TABLE as
(
select * from identifier($VARIANT_TABLE) sample (1000 rows)
)
select distinct
    REGEXP_REPLACE(REGEXP_REPLACE(f.path, '\\[(.+)\\]'), '[^a-zA-Z0-9]', '_') AS path_name, -- strips bracket-enclosed array references (like [0]) and replaces non-alphanumeric characters with underscores to build a column name
    typeof(f.value) AS attribute_type, -- the column datatype
    path_name AS alias_name -- the column alias based on the path
from
MAIN_TABLE,
LATERAL FLATTEN(identifier($VARIANT_COLUMN), RECURSIVE=>true) f
where TYPEOF(f.value) != 'OBJECT'
AND NOT contains(f.path, '[');
Be sure to set the variables to your table and column names. If this picks up the type information for the columns in your JSON, then it's possible to modify this SP to do what you need. If it doesn't, but there's a way to modify the query to get it to pick up the columns, that would work too.
If it doesn't pick up the columns: based on Craig's idea, I have written type inference for non-variant data (such as strings from CSV log files without type information). Try the SQL above first and see what results you get.
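For reference, type inference over plain strings can be approximated with Snowflake's TRY_ conversion functions. A minimal sketch, where the table raw_logs and its all-string column val are hypothetical names:
-- hypothetical table raw_logs with a string column val
select
    count_if(try_to_double(val)    is not null) as numeric_hits,
    count_if(try_to_timestamp(val) is not null) as timestamp_hits,
    count_if(try_to_boolean(val)   is not null) as boolean_hits,
    count(*)                                    as total_rows
from raw_logs sample (1000 rows);
-- whichever conversion succeeds for (nearly) all sampled rows suggests the column's type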

Handling new line in column value of HIVE AVRO format table - contains complex data type & nested arrays

I have a Hive table in AVRO format, and one of the columns in the table contains a complex data type (nested arrays). One of the elements in that nested array contains data with newline characters. I am using multiple LATERAL VIEW explode clauses to flatten the data, but because of the newline character in one of the columns the output is not good (meaning it maps the wrong value against the wrong column). I tried to use a regex_replace function in my query, but it is reported as an invalid function. I am using Hive 1.1.0-cdh5.12.2. Can you please tell me how I could handle this newline issue within the data while querying the AVRO table using multiple LATERAL VIEW explode clauses (as there are nested arrays)?
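One observation, offered as a hedge rather than a verified fix: Hive's built-in function is spelled regexp_replace (with a "p"), which may be why regex_replace came back as invalid. A minimal sketch of stripping newlines from the exploded elements, with hypothetical table and column names:
-- hypothetical table: events(id string, items array<string>)
-- strip newline/carriage-return characters from each exploded element
select
    e.id,
    regexp_replace(item, '\n|\r', ' ') as item_clean
from events e
lateral view explode(e.items) t as item;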

Presto query Array of Rows

So I have a Hive external table whose schema looks like this:
{
.
.
`x` string,
`y` ARRAY<struct<age:string,cId:string,dmt:string>>,
`z` string
}
So basically I need to query a column (column "y") which is an array of nested JSON.
I can see the data of column "y" from Hive, but the data in that column seems invisible to Presto, even though Presto knows the schema of this field, like this:
array(row(age varchar,cid varchar,dmt varchar))
As you can see, Presto already knows this field is an array of rows.
Notes:
1. The table is a Hive external table.
2. I get the schema of field "y" by using the ODBC driver, but the data is just all empty; however, I can see something like this in Hive:
[{"age":"12","cId":"bx21hdg","dmt":"120"}]
3. Presto queries the Hive metastore for the schema.
4. The table was stored in Parquet format.
So how can I see my data in field "y" please?
Please try the below. This should work in Presto.
"If the array element is a row data type, the result is a table with one column for each row field in the element data type. The result table column data types match the corresponding array element row field data types"
select
    y, age, cid, dmt
from
    table
cross join UNNEST(y) AS nested_data(age, cid, dmt)
Reference: https://www.ibm.com/support/knowledgecenter/en/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0055064.html
