BigQuery: How to flatten repeated structured property imported from datastore - google-app-engine

Dear all,
I started to use BigQuery to analyze data in the GAE datastore this month. First, I export data via the "Datastore Admin" page of the GAE console to Google Cloud Storage; then I import the data from Google Cloud Storage into BigQuery. It works very smoothly except for repeated structured properties. I expected the imported record to be in this format:
parent:"James",
children: [{
name: "name1",
age: 5,
gender: "M"
}, {
name: "name2",
age: 50,
gender: "F"
}, {
name: "name3",
age: 33,
gender: "M"
},
]
I know how to flatten data in the above format, but the actual format in BigQuery appears to be the following:
parent: "James",
children.name:["name1", "name2", "name3"],
children.age:[5, 50, 33],
children.gender:["M", "F", "M"],
I'm wondering if it's possible to flatten the above data in BigQuery for further analysis. The ideal format of the result table in my mind is:
parentName, children.name, children.age, children.gender
James, name1, 5, "M"
James, name2, 50, "F"
James, name3, 33, "M"
Cheers!

With the recently introduced BigQuery Standard SQL, things are much nicer!
Try the query below (make sure to uncheck the "Use Legacy SQL" checkbox under Show Options):
WITH parents AS (
  SELECT
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children
)
SELECT
  parentName, childrenName, childrenAge, childrenGender
FROM
  parents,
  UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
  UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age,
  UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
WHERE
  pos_name = pos_age AND pos_name = pos_gender
Here the original table, parents, holds the following data, with the respective schema:
[{
  "parentName": "James",
  "children": {
    "name": ["name1", "name2", "name3"],
    "age": [5, 50, 33],
    "gender": ["M", "F", "M"]
  }
}]
and the output is:

parentName  childrenName  childrenAge  childrenGender
James       name1         5            M
James       name2         50           F
James       name3         33           M
Note: the above is based solely on what I see in the original question, and it will most likely need to be adjusted to your specific needs.
Hope this helps in terms of a direction to go and a place to start!
Added:
The query above uses row-based CROSS JOINs, meaning all combinations for the same parent are assembled first and then the WHERE clause filters out the "wrong" ones.
In contrast, the version below uses INNER JOINs to eliminate this "side effect":
WITH parents AS (
  SELECT
    "James" AS parentName,
    STRUCT(
      ["name1", "name2", "name3"] AS name,
      [5, 50, 33] AS age,
      ["M", "F", "M"] AS gender
    ) AS children
)
SELECT
  parentName, childrenName, childrenAge, childrenGender
FROM
  parents, UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age
  ON pos_name = pos_age
JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
  ON pos_age = pos_gender
Intuitively, I would expect the second version to be a little more efficient on bigger tables.

You should be able to use the 'large query results' feature to generate a new, flattened table. Unfortunately, the syntax is terrifying. The basic principle is that you want to flatten each of the fields, save off the position, and then filter where the positions are the same.
Try something like:
SELECT parentName, children.name, children.age, children.gender,
  POSITION(children.name) AS name_pos,
  POSITION(children.age) AS age_pos,
  POSITION(children.gender) AS gender_pos
FROM table
Chaining those steps together (flatten on name, then age, then gender, filtering positions at each level) gives:
SELECT
  parent,
  children.name,
  children.age,
  children.gender,
  pos
FROM (
  SELECT
    parent,
    children.name,
    children.age,
    children.gender,
    gender_pos,
    pos
  FROM (FLATTEN((
    SELECT
      parent,
      children.name,
      children.age,
      children.gender,
      pos,
      POSITION(children.gender) AS gender_pos
    FROM (
      SELECT
        parent,
        children.name,
        children.age,
        children.gender,
        pos
      FROM (FLATTEN((
        SELECT
          parent,
          children.name,
          children.age,
          children.gender,
          pos,
          POSITION(children.age) AS age_pos
        FROM (FLATTEN((
          SELECT
            parent,
            children.name,
            children.age,
            children.gender,
            POSITION(children.name) AS pos
          FROM table
        ), children.name))
      ), children.age))
      WHERE
        age_pos = pos
    )
  ), children.gender))
)
WHERE
  gender_pos = pos;
To allow large results, if you are using the BigQuery UI, you should click the 'advanced options' button, specify a destination table, and check the 'allow large results' flag.
Note that if your data is stored as an entity that has a nested record that looks like {name, age, gender}, we should be transforming this into a nested record in BigQuery instead of parallel arrays. I'll look into why this is happening.
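For reference, if the import had produced a true nested, repeated record (children as an ARRAY of STRUCTs rather than parallel arrays), the whole flattening exercise would collapse to a single UNNEST in standard SQL. A minimal sketch, with parents_nested as a hypothetical table in that shape:
SELECT parentName, child.name, child.age, child.gender
FROM parents_nested, UNNEST(children) AS child  -- one output row per child struct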

Related

Duplicate row detected during DML action Row Values

Each of our messages has a unique id and several attributes; the final result should combine all of these attributes into a single message. We tried using Snowflake MERGE, but it's not working as expected. In the first run, we used ROW_NUMBER with PARTITION BY to determine unique records and inserted them. In the second run, we tried updating more than one record, but we received the error "Error 3: Duplicate row detected during DML action Row Values: \n".
The session parameter ERROR_ON_NONDETERMINISTIC_MERGE=FALSE was tried, but the outcome might not be reliable or consistent.
We also tried a JavaScript deep merge, but the volume was very high and there were performance problems.
Sample code below:
create or replace table test1 (Job_Id VARCHAR, RECORD_CONTENT VARIANT);
create or replace table test2 like test1;
insert into test1(JOB_ID, RECORD_CONTENT) select 1, parse_json('{
  "customer": "Aphrodite",
  "age": 32,
  "orders": {"product": "socks", "quantity": 4, "price": "$6", "attribute1": "a1"}
}');
insert into test1(JOB_ID, RECORD_CONTENT) select 1, parse_json('{
  "customer": "Aphrodite",
  "age": 32,
  "orders": {"product": "shoe", "quantity": 2, "brand": "Woodland", "attribute2": "a2"}
}');
insert into test1(JOB_ID, RECORD_CONTENT) select 1, parse_json('{
  "customer": "Aphrodite",
  "age": 32,
  "orders": {"product": "shoe polish", "brand": "Helios", "attribute3": "a3"}
}');
merge into test2 t2 using (
  select * from (
    select
      row_number() over (partition by JOB_ID order by JOB_ID desc) as rno,
      JOB_ID,
      RECORD_CONTENT
    from test1
  )
  -- 1. first run: unique values inserted using "rno = 1" -> successfully inserted unique values
  -- 2. second run: updating attributes using "rno > 1" -> Duplicate row detected during DML action Row Values
  where rno > 1
) t1
on t1.JOB_ID = t2.JOB_ID
WHEN MATCHED THEN
  UPDATE SET
    t2.JOB_ID = t1.JOB_ID,
    t2.RECORD_CONTENT = t1.RECORD_CONTENT
WHEN NOT MATCHED THEN
  INSERT (JOB_ID, RECORD_CONTENT) VALUES (t1.JOB_ID, t1.RECORD_CONTENT)
Expected output:
select * from test2;
select parse_json('{
  "customer": "Aphrodite",
  "age": 32,
  "orders": {"product": "shoe polish", "quantity": 2, "brand": "Helios", "price": "$6",
             "attribute1": "a1", "attribute2": "a2", "attribute3": "a3"}
}');
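One possible direction for the combining step (a sketch, not the thread's resolution; it assumes a "latest row wins" rule per key, approximated via FLATTEN's seq output): explode each orders object into key/value pairs, keep one value per key, and rebuild a single object with OBJECT_AGG. On the sample data above, this reproduces the expected orders record exactly.
-- Sketch only: merges the "orders" keys across rows per JOB_ID.
-- Assumes later input rows win on key collisions; FLATTEN's seq column
-- reflects input row order, which may not match insertion order exactly.
with kv as (
  select
    t.JOB_ID,
    f.key,
    f.value,
    row_number() over (partition by t.JOB_ID, f.key order by f.seq desc) as rn
  from test1 t,
       lateral flatten(input => t.RECORD_CONTENT:orders) f
)
select JOB_ID, object_agg(key, value) as merged_orders
from kv
where rn = 1
group by JOB_ID;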

How Can I Calculate the Average of Floats in a Nested Array in a Variant Column

I have a VARIANT column that contains a JSON response from a web service. It contains a nested array with a float value that I would like to aggregate and return as an average. Here is an example SnowSQL command that I am using:
select
  value:disambiguated.id,
  value:mentions
from TABLE(
  FLATTEN(input =>
    PARSE_JSON('{ "entities": [{"count": 2,"disambiguated": {"id": 123},"label": "Coronavirus Disease 2019","mentions": [{"confidence": 0.5928,}, {"confidence": 0.5445,}],"type": "MEDICAL"}]}'):entities
  )
)
Which returns:
VALUE:DISAMBIGUATED.ID   VALUE:MENTIONS
123                      [ { "confidence": 0.5928 }, { "confidence": 0.5445 } ]
What I would like to return is something with the two "confidence" values averaged to 0.56865. I was able to add a second FLATTEN statement, which isolated the "mentions" array and allowed me to extract each "confidence" value, but I cannot figure out how to group the records to calculate the average. I would love to use the built-in AVG() function if possible. Thank you in advance for any help you can provide.
Using your example, you can use LATERAL FLATTEN to create the flattened fields you need and then aggregate as you normally would. In this example I'm grouping on the ID that is in the data, but you could also use y.index or z.index, depending on which of those you want to group on for your AVG().
WITH x AS (
  SELECT PARSE_JSON('{ "entities": [{"count": 2,"disambiguated": {"id": 123},"label": "Coronavirus Disease 2019","mentions": [{"confidence": 0.5928,}, {"confidence": 0.5445,}],"type": "MEDICAL"}]}') as json_str
)
SELECT
  y.value:disambiguated.id as id,
  avg(z.value:confidence)
from x,
  LATERAL FLATTEN(input => json_str:entities) y,
  LATERAL FLATTEN(input => y.value:mentions) z
GROUP BY id
;
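A small variation on the same query (a sketch; the grouping choice and explicit cast are assumptions, not from the answer above): grouping on FLATTEN's metadata instead of a data field guards against duplicate or missing ids, since y.seq identifies the source row and y.index the entity within it, and the ::float cast makes the VARIANT-to-number conversion explicit.
WITH x AS (
  SELECT PARSE_JSON('{"entities": [{"count": 2, "disambiguated": {"id": 123}, "label": "Coronavirus Disease 2019", "mentions": [{"confidence": 0.5928}, {"confidence": 0.5445}], "type": "MEDICAL"}]}') AS json_str
)
SELECT
  y.value:disambiguated.id AS id,
  AVG(z.value:confidence::float) AS avg_confidence  -- 0.56865 for the sample
FROM x,
  LATERAL FLATTEN(input => json_str:entities) y,
  LATERAL FLATTEN(input => y.value:mentions) z
GROUP BY y.seq, y.index, id;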

Query JSON Key:Value Pairs in AWS Athena

I have received a data set from a client that is loaded in AWS S3. The data contains unnamed JSON key:value pairs. This isn't my area of expertise, so I was looking for a little help.
The structure of JSON data that I've typically worked with in the past looks similar to this:
{ "name":"John", "age":30, "car":null }
The data that I have received from my client is formatted as such:
{
  "answer_id": "cc006",
  "answer": {
    "101086": 1,
    "101087": 2,
    "101089": 2,
    "101090": 7,
    "101091": 5,
    "101092": 3,
    "101125": 2
  }
}
This is survey data, where the key on the left is a numeric customer identifier and the value on the right is their response to a survey question; i.e., customer "101125" answered the survey with a value of "2". I need to be able to query the JSON data using Athena such that my result set looks similar to:

answer_id | key    | value
----------+--------+------
cc006     | 101086 | 1
cc006     | 101087 | 2
cc006     | 101089 | 2
cc006     | 101090 | 7
cc006     | 101091 | 5
cc006     | 101092 | 3
cc006     | 101125 | 2

Cross joining the unnested children against the parent node isn't an issue. What I can't figure out is how to select all of the keys from the "answer" object without specifying each actual key name, and similarly, how to select all of the values.
Is it possible to create a virtual table in Athena that would allow for these results, or do I need to convert the JSON to a format that looks more like the following:
{
  "answer_id": "cc006",
  "answer": [
    { "key": "101086", "value": 1 },
    { "key": "101087", "value": 2 },
    { "key": "101089", "value": 2 },
    { "key": "101090", "value": 7 },
    { "key": "101091", "value": 5 },
    { "key": "101092", "value": 3 },
    { "key": "101125", "value": 2 }
  ]
}
EDIT 6/4/2020
I was able to use the code that Theon provided below along with the following table structure:
CREATE EXTERNAL TABLE answer_example (
  answer_id string,
  answer string
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'
That allowed me to use the following query to generate the results that I needed.
WITH Data AS (
  SELECT
    answer_id,
    CAST(json_extract(answer, '$') AS MAP(VARCHAR, VARCHAR)) as answer
  FROM
    answer_example
)
SELECT
  answer_id,
  key,
  element_at(answer, key) AS value
FROM
  Data
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)
EDIT 6/5/2020
Taking additional advice from Theon's response below, the following DDL and Query simplify this quite a bit.
DDL:
CREATE EXTERNAL TABLE answer_example (
  answer_id string,
  answer map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION 's3://mybucket/'
Query:
SELECT
  answer_id,
  key,
  element_at(answer, key) AS value
FROM
  answer_example
CROSS JOIN UNNEST (map_keys(answer)) AS answer (key)
You can cross join with the keys of the answer property and then pick the corresponding value. Something like this:
WITH data AS (
  SELECT
    'cc006' AS answer_id,
    MAP(
      ARRAY['101086', '101087', '101089', '101090', '101091', '101092', '101125'],
      ARRAY[1, 2, 2, 7, 5, 3, 2]
    ) AS answers
)
SELECT
  answer_id,
  key,
  element_at(answers, key) AS value
FROM data
CROSS JOIN UNNEST (map_keys(answers)) AS answer (key)
You could probably do something with transform_keys to create rows of the key value pairs, but the SQL above does the trick.
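A slightly tighter variant (using map_entries() instead, which is my substitution rather than anything from the answer above): map_entries() turns the map into an array of key/value rows directly, so the element_at() lookup becomes unnecessary.
WITH data AS (
  SELECT
    'cc006' AS answer_id,
    MAP(
      ARRAY['101086', '101087', '101089', '101090', '101091', '101092', '101125'],
      ARRAY[1, 2, 2, 7, 5, 3, 2]
    ) AS answers
)
SELECT
  answer_id,
  key,
  value  -- comes straight from the unnested map entry
FROM data
CROSS JOIN UNNEST (map_entries(answers)) AS t (key, value)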

Postgresql: convert text representing key - value to json

I have a table with a string column containing values like this: 'ID: 1, name: john doe, occupation: salesmen'. I want to convert this into a column of JSON objects like this: {"ID" : "1", "name" : "john doe", "occupation" : "salesmen"}
For now my solution is:
WITH
  lv1 AS (SELECT regexp_split_to_table('ID: 1, name: john doe, occupation: salesmen', ', ') record),
  lv2 AS (SELECT regexp_split_to_array(record, ': ') arr FROM lv1)
SELECT
  json_object(
    array_agg(arr[1]),
    array_agg(arr[2])
  )
FROM lv2
The problem is that the string actually contains nearly 100 key-value pairs and the table has millions of rows, so using regexp_split_to_table makes the table explode. Is there an efficient way to do this in PostgreSQL?
You don't necessarily need regular expression functions here, e.g.:
db=# with c as (select unnest('{ID: 1, name: john doe, occupation: salesmen}'::text[]))
select string_to_array(unnest,': ') from c;
string_to_array
-----------------------
{ID,1}
{name,"john doe"}
{occupation,salesmen}
(3 rows)
Not sure which will be faster, though.
Regarding the built-in JSON formatting: I think you have to provide either a row or already-formatted JSON; no parsers are currently available...
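For what it's worth, a regex-free sketch (assuming keys and values never themselves contain ': ' or ', '): replacing the pair separator makes the whole string splittable into one flat array of alternating keys and values, which jsonb_object() accepts directly, avoiding the per-row set expansion entirely.
-- 'ID: 1, name: john doe, occupation: salesmen'
--   -> {ID,1,name,"john doe",occupation,salesmen} -> jsonb
SELECT jsonb_object(string_to_array(replace(txt, ', ', ': '), ': '))
FROM (VALUES ('ID: 1, name: john doe, occupation: salesmen')) AS t(txt);
-- {"ID": "1", "name": "john doe", "occupation": "salesmen"}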

postgresql json array query

I tried to query my json array using the example here: How do I query using fields inside the new PostgreSQL JSON datatype?
They use the example:
SELECT *
FROM json_array_elements(
'[{"name": "Toby", "occupation": "Software Engineer"},
{"name": "Zaphod", "occupation": "Galactic President"} ]'
) AS elem
WHERE elem->>'name' = 'Toby';
But my JSON looks more like this (still using the example):
{
  "people": [
    {
      "name": "Toby",
      "occupation": "Software Engineer"
    },
    {
      "name": "Zaphod",
      "occupation": "Galactic President"
    }
  ]
}
But I get an error: ERROR: cannot call json_array_elements on a non-array.
Is my JSON "array" not really an array? I have to use this JSON string because it's stored in a database, so I would have to ask for it to be fixed if it's not an array.
Or is there another way to query it?
I read the documentation, but nothing worked; I kept getting errors.
The JSON array is stored under the key people, so use my_json->'people' in the function:
with my_table(my_json) as (
  values(
    '{
      "people": [
        {
          "name": "Toby",
          "occupation": "Software Engineer"
        },
        {
          "name": "Zaphod",
          "occupation": "Galactic President"
        }
      ]
    }'::json
  )
)
select t.*
from my_table t
cross join json_array_elements(my_json->'people') elem
where elem->>'name' = 'Toby';
The function json_array_elements() unnests the json array and generates all its elements as rows:
select elem->>'name' as name, elem->>'occupation' as occupation
from my_table
cross join json_array_elements(my_json->'people') elem
name | occupation
--------+--------------------
Toby | Software Engineer
Zaphod | Galactic President
(2 rows)
If you are interested in Toby's occupation:
select elem->>'occupation' as occupation
from my_table
cross join json_array_elements(my_json->'people') elem
where elem->>'name' = 'Toby'
occupation
-------------------
Software Engineer
(1 row)
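On PostgreSQL 12 or later, the same lookup can also be written as a JSON path query (a sketch; note the cast to jsonb, since the jsonpath functions do not accept plain json):
select person ->> 'occupation' as occupation
from my_table,
     lateral jsonb_path_query(
       my_json::jsonb,
       '$.people[*] ? (@.name == "Toby")'  -- filter inside the path itself
     ) as person;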
