Snowflake: Parsing an Unnamed JSON Array in a Table

I am having great difficulty using Snowflake to parse some JSON data. I have an unnamed array in one of my tables and want to break it apart as part of a query:
[{"CodeName":"443","CodeQuantity":6}]
[{"CodeName":"550","CodeQuantity":4}]
[{"CodeName":"293","CodeQuantity":1},{"CodeName":"294","CodeQuantity":3}]
My query is this:
SELECT CODES
FROM CODETABLE
I am having problems parsing the JSON to split the CodeName/CodeQuantity pairs into individual elements and rows.

If each of those is a separate record stored as varchar, then you simply need to use the parse_json() function to turn them into JSON before flattening and parsing:
WITH x AS (
    SELECT *
    FROM (VALUES
        ('[{"CodeName":"443","CodeQuantity":6}]'),
        ('[{"CodeName":"550","CodeQuantity":4}]'),
        ('[{"CodeName":"293","CodeQuantity":1},{"CodeName":"294","CodeQuantity":3}]')
    ) x (varchar_data)
)
SELECT y.value:CodeName::varchar     AS code_name,
       y.value:CodeQuantity::number  AS code_quantity
FROM x,
     LATERAL FLATTEN (input => parse_json(varchar_data)) y;
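The same pattern applies to the actual table from the question; a minimal sketch, assuming CODES is stored as VARCHAR (if it is already a VARIANT, drop the parse_json() call):

-- One output row per array element, so the two-object record yields two rows.
SELECT y.value:CodeName::varchar     AS code_name,
       y.value:CodeQuantity::number  AS code_quantity
FROM CODETABLE,
     LATERAL FLATTEN (input => parse_json(CODES)) y;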

Is there a BigQuery function to extract a nested JSON?

[
  {
    "SnapshotDate": 20220224,
    "EquityUSD": 5530.22,
    "BalanceUSD": 25506.95,
    "jsonTransactions": "[{\"TransactionDate\":20220224,\"AccountTransactionID\":144155779,\"TransactionType\":\"Deposit\",\"AmountUSD\":2000},{\"TransactionDate\":20220224,\"AccountTransactionID\":144155791,\"TransactionType\":\"Deposit\",\"AmountUSD\":2000}]"
  }
]
Can someone help me extract this JSON string in BigQuery? I can't seem to get JSON_EXTRACT to work, as it does not have a root element.
The double quotes in jsonTransactions are making the JSON invalid. JSON_EXTRACT_SCALAR(json_data, "$[0].jsonTransactions") returns [{ because the first pair of double quotes encloses [{. To circumvent this, I used a regex to remove the double quotes around the jsonTransactions value, so the inner JSON string is treated as an array.
After the regex replacement, the outermost quotes have been removed, as shown below. I replaced "[ and ]" with [ and ] respectively in the JSON string.
"jsonTransactions": [{"TransactionDate":20220224,"AccountTransactionID":144155779,"TransactionType":"Deposit","AmountUSD":2000},{"TransactionDate":20220224,"AccountTransactionID":144155791,"TransactionType":"Deposit","AmountUSD":2000}]
Consider the query below for your requirement. The JSON path for AmountUSD will be "$[0].jsonTransactions[0].AmountUSD".
WITH
  sample_table AS (
    SELECT
      '[{"SnapshotDate": 20220224,"EquityUSD": 5530.22,"BalanceUSD": 25506.95,"jsonTransactions": "[{\"TransactionDate\":20220224,\"AccountTransactionID\":144155779,\"TransactionType\":\"Deposit\",\"AmountUSD\":2000},{\"TransactionDate\":20220224,\"AccountTransactionID\":144155791,\"TransactionType\":\"Deposit\",\"AmountUSD\":2000}]"}]'
        AS json_data
  )
SELECT
  JSON_EXTRACT(REGEXP_REPLACE(REGEXP_REPLACE(json_data, r'"\[', '['), r'\]"', ']'),
    '$[0].jsonTransactions') AS json_extracted
FROM
  sample_table;
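To pull a single scalar such as AmountUSD out of the repaired JSON, the same replacement can be combined with JSON_EXTRACT_SCALAR and the path mentioned above; a minimal sketch against the same sample_table:

SELECT
  JSON_EXTRACT_SCALAR(REGEXP_REPLACE(REGEXP_REPLACE(json_data, r'"\[', '['), r'\]"', ']'),
    '$[0].jsonTransactions[0].AmountUSD') AS amount_usd
FROM
  sample_table;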
As you mentioned in the comments, it is better to store the JSON itself in a more accessible format (one valid JSON object) instead of nesting JSON strings.
You might have to build a temp table to do this.
The first CREATE statement takes a denormalized table and converts it to a table with an array of structs.
The second CREATE statement takes that temp table and embeds the array into an (array of) struct(s).
You could remove the internal struct from the first query, and the array wrapper from the second query, to build a strict struct of arrays. But this should be flexible enough that you can create an array of structs, a struct of arrays, or any combination of the two, as many times as you want, up to the maximum of 15 levels of nesting that BigQuery allows.
The final outcome would be a table with one column (column1) of a standard datatype, as well as an array of structs called OutsideArrayOfStructs. That struct has two columns of "standard" datatypes, as well as an array of structs called InsideArrayOfStructs.
CREATE OR REPLACE TABLE dataset.tempTable AS (
  SELECT
    column1,
    column2,
    column3,
    ARRAY_AGG(
      STRUCT(
        ArrayObjectColumn1,
        ArrayObjectColumn2,
        ArrayObjectColumn3
      )
    ) AS InsideArrayOfStructs
  FROM
    sourceDataset.sourceTable
  GROUP BY
    column1,
    column2,
    column3
);

CREATE OR REPLACE TABLE dataset.finalTable AS (
  SELECT
    column1,
    ARRAY_AGG(
      STRUCT(
        column2,
        column3,
        InsideArrayOfStructs
      )
    ) AS OutsideArrayOfStructs
  FROM
    dataset.tempTable
  GROUP BY
    column1
);
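To read the nested result back, unnest each array level in turn; a minimal sketch, using the hypothetical column names from above:

SELECT
  t.column1,
  o.column2,
  o.column3,
  i.ArrayObjectColumn1
FROM dataset.finalTable t
CROSS JOIN UNNEST(t.OutsideArrayOfStructs) AS o
CROSS JOIN UNNEST(o.InsideArrayOfStructs) AS i;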

Redshift query array<varchar(128)> returning encoded values

We have an external table created in Redshift like this:
CREATE EXTERNAL TABLE spectrum.my_table(
  insert_id varchar(128),
  attribution_ids array<varchar(100)>)
PARTITIONED BY (
  event_date varchar(128))
STORED AS PARQUET
LOCATION
  's3://my_bucket/my_path';
We do everything perfectly, but when we query the array<varchar> field as the documentation describes:
SELECT c.insert_id, a FROM
spectrum.my_table c, c.attribution_ids a LIMIT 10
Redshift returns the insert_id correctly, but it returns the array encoded; please see below:
"insert_id", "o"
"0baed794-df11-4032-b13c-aac5d0deced7" "0b8ad4fd9af12804ffaea83f4886672b"
The source data should look like:
"0baed794-df11-4032-b13c-aac5d0deced7", [0baed794-df11-4032-b13c-aac5d0deced7, 0baed794-df11-4032-b13c-aac5d0deced7]
When we run the same query in Athena running as a SELECT * FROM my_table it returns the array with the correct data.
What should I do here?
Redshift itself does not support nested data types.
Redshift Spectrum has simple support for nested data types: collection types like array or map have to be unnested (exploded) before selecting.
Unnesting does essentially a CROSS JOIN of all collection items with the row that the collection belongs to.
Notice the syntax: ... FROM TABLE a, a.collection_column b ...; in classic queries the comma is a synonym for CROSS JOIN.
So what you are seeing in the "o" column is one of the items from the attribution_ids array.
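If you also need each element's position, Redshift Spectrum's nested-data syntax lets you attach an index with the AT keyword; a small sketch, assuming the table above:

-- One output row per (row, array element) pair; idx is the element's position.
SELECT c.insert_id, a AS attribution_id, idx
FROM spectrum.my_table c, c.attribution_ids a AT idx
LIMIT 10;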

Most efficient way to query data nested deep in JSON arrays?

Currently I'm writing queries against a JSONB table with 8 million+ rows. How can I query from the parent and the friends objects in the most efficient manner possible?
Query (Postgres 9.6):
select distinct id, data->>'_id' jsonID, data->>'email' email,
       friends->>'name' friend_name, parent->>'name' parent
from temp t
CROSS JOIN jsonb_array_elements(t.data->'friends') friends
CROSS JOIN jsonb_array_elements(friends->'parent') parent
where friends ->> 'name' = 'Chan Franco'
and parent->>'name' = 'Hannah Golden'
Example DDL (with data): https://pastebin.com/byN7uyKx
Your regularly structured data would be cleaner, smaller, and faster as a normalized relational design.
That said, to make the setup you have much faster (if not as fast as a normalized design with matching indexes), add a GIN index on the expression data->'friends':
CREATE INDEX tbl_data_friends_gin_idx ON tbl USING gin ((data->'friends'));
Then add a matching WHERE clause to the query with the containment operator @>:
SELECT DISTINCT -- why DISTINCT ?
       id, data->>'_id' AS json_id, data->>'email' AS email
     , friends->>'name' AS friend_name, parent->>'name' AS parent
FROM   tbl t
CROSS JOIN jsonb_array_elements(t.data->'friends') friends
CROSS JOIN jsonb_array_elements(friends->'parent') parent
WHERE  t.data->'friends' @> '[{"name": "Chan Franco", "parent": [{"name": "Hannah Golden"}]}]'
AND    friends->>'name' = 'Chan Franco'
AND    parent->>'name'  = 'Hannah Golden';
The huge difference: with the help of the index, Postgres can now identify matching rows before unnesting each and every nested "friends" array in the whole table. Only after matching rows have been identified in the underlying table is jsonb_array_elements() called, and only rows with qualifying array elements are kept.
Note that the search expression has to be valid JSON, matching the structure of the JSON array data->'friends', including the outer brackets []. But omit all key/value pairs that are not supposed to serve as filters.
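For instance, to filter on the friend's name alone, simply leave the parent part out of the containment value; a hypothetical variant of the predicate above:

WHERE t.data->'friends' @> '[{"name": "Chan Franco"}]'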
Related:
Index for finding an element in a JSON array
I avoided the table name temp, as this is an SQL keyword that might lead to confusing errors, and used the name tbl instead.

Postgresql jsonb set-union of lists

I am hoping it is straightforward to do the following:
Given rows containing jsonb of the form
{
  "a": "hello",
  "b": ["jim", "bob", "kate"]
}
I would like to be able to get all the 'b' fields from a table (as in select jsondata->'b' from mytable) and then form a list consisting of all strings which occur in at least one 'b' field. (Basically a set-union.)
How can I do this? Or am I better off using a python script to extract the 'b' entries, do the set-union there, and then store it back into the database somewhere else?
This gives you the set union of the elements in the 'b' lists of the JSON:
SELECT array_agg(a ORDER BY a)
FROM (
  SELECT DISTINCT unnest(txt_arr) AS a
  FROM (
    SELECT ARRAY(SELECT trim(elem::text, '"')
                 FROM jsonb_array_elements(jsondata->'b') elem) AS txt_arr
    FROM jtest1
  ) y
) z;
Query Explanation:
Gets the list from b as jsondata->'b'.
Expands the JSON array into a set of JSON values using the jsonb_array_elements() function.
Trims the enclosing " characters from the elements using the trim() function.
Converts the result back to an array using the ARRAY() constructor after trimming.
Gets the distinct values by unnesting with the unnest() function.
Finally, array_agg() is used to form the expected result.
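For what it's worth, on Postgres 9.4+ the trim step can be avoided entirely with jsonb_array_elements_text(), which returns the elements as plain text; a minimal equivalent sketch, assuming the same table and column names:

SELECT array_agg(DISTINCT elem ORDER BY elem) AS b_union
FROM jtest1, jsonb_array_elements_text(jsondata->'b') AS elem;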

Query to compare differences of two columns from two different tables

I am attempting to create a UNION ALL query on two differently named columns in two different tables.
I would like to take the "MTRL" column in the table USER_EXCEL and compare it against the "short_material_number" column from the IM_EXCEL table. I would then like the query to only return the differences between the two columns. Both columns house material numbers but are named differently (column wise) in the tables.
What I have so far is:
(SELECT [MTRL] FROM dbo.USER_EXCEL
EXCEPT
SELECT [short_material_number] FROM dbo.IM_Excel)
UNION ALL
(SELECT [short_material_number] FROM dbo.IM_Excel
EXCEPT
SELECT [MTRL] FROM dbo.USER_EXCEL)
However, when trying to run that query I receive an error message that states:
Msg 8114, Level 16, State 5, Line 22
Error converting data type varchar to float.
You're almost certainly getting this error because one of your two columns is a FLOAT data type while the other is VARCHAR. Reading up on implicit conversions, which happen when you try to compare two columns of different data types, would be a good place to start.
To get this working, you need to convert the float to a varchar, like in the example below.
(
SELECT [MTRL] FROM dbo.USER_EXCEL
EXCEPT
SELECT CAST([short_material_number] AS VARCHAR(18)) FROM dbo.IM_Excel
)
UNION ALL
(
SELECT CAST([short_material_number] AS VARCHAR(18)) FROM dbo.IM_Excel
EXCEPT
SELECT [MTRL] FROM dbo.USER_EXCEL
)
From your question I understand you are trying to compare two columns but returning only one column. I would recommend the following query to compare the differences side by side:
SELECT ue.[MTRL], ie.[short_material_number]
FROM dbo.IM_Excel ie
FULL OUTER JOIN
dbo.USER_EXCEL ue
ON CAST(ie.[short_material_number] AS VARCHAR(20)) = ue.[MTRL]
WHERE ie.[short_material_number] IS NULL
OR ue.[MTRL] IS NULL
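The FULL OUTER JOIN pairs up matching rows from both tables, and the WHERE ... IS NULL filter keeps only rows with no counterpart on the other side. If it helps to see which table each unmatched value came from, a small variant of the same idea (a sketch; difference_side is a hypothetical label column):

SELECT ue.[MTRL], ie.[short_material_number],
       CASE WHEN ue.[MTRL] IS NULL THEN 'IM_Excel only'
            ELSE 'USER_EXCEL only' END AS difference_side
FROM dbo.IM_Excel ie
FULL OUTER JOIN dbo.USER_EXCEL ue
  ON CAST(ie.[short_material_number] AS VARCHAR(20)) = ue.[MTRL]
WHERE ie.[short_material_number] IS NULL
   OR ue.[MTRL] IS NULL;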
