PostgreSQL: convert text representing key-value pairs to JSON objects

I have a table with a string column containing values like this: 'ID: 1, name: john doe, occupation: salesmen'. I want to convert this into a column of JSON objects like this: {"ID" : "1", "name" : "john doe", "occupation" : "salesmen"}
For now my solution is:
WITH
lv1 as(SELECT regexp_split_to_table('ID: 1, name: john doe, occupation: salesmen', ', ') record)
, lv2 as (SELECT regexp_split_to_array(record, ': ') arr from lv1)
SELECT
json_object(
array_agg(arr[1])
, array_agg(arr[2])
)
FROM lv2
The problem is that the string actually contains nearly 100 key-value pairs and the table has millions of rows, so using regexp_split_to_table will make this table explode. Is there any efficient way to do this in PostgreSQL?

You don't necessarily need regular expression functions here, e.g.:
db=# with c as (select unnest('{ID: 1, name: john doe, occupation: salesmen}'::text[]))
select string_to_array(unnest,': ') from c;
string_to_array
-----------------------
{ID,1}
{name,"john doe"}
{occupation,salesmen}
(3 rows)
Not sure which will be faster, though.
Regarding built-in JSON formatting - I think you have to provide either a row or already-formatted JSON - no parsers are currently available...
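To pin down what the SQL is supposed to produce, here is a minimal Python sketch of the same split logic (the helper name kv_string_to_json is made up for illustration): split on ", " between pairs, then on the first ": " within each pair, mirroring regexp_split_to_table / string_to_array above.

```python
import json

def kv_string_to_json(s: str) -> str:
    # Split on ", " between pairs, then on the first ": " within each
    # pair, mirroring regexp_split_to_table / string_to_array in the SQL.
    pairs = (item.split(": ", 1) for item in s.split(", "))
    return json.dumps({k: v for k, v in pairs})

print(kv_string_to_json("ID: 1, name: john doe, occupation: salesmen"))
# {"ID": "1", "name": "john doe", "occupation": "salesmen"}
```

Note that splitting on the first ": " only is what keeps a value like "john doe" intact even if it contained further colons.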

Related

Extracting data from JSON column defined as String

A table has a ports column (defined as VARCHAR) which has the following data:
[{u'position': 1, u'macAddress': u'00:8C:FA:C1:7C:88'}, {u'position':
2, u'macAddress': u'00:8C:FA:5E:98:81'}]
I want to extract the data from just the macAddress fields into separate rows. I tried to flatten the data in Snowflake but it is not working, as the column is not defined as VARIANT and the fields have a 'u' in front of them (this is my guess). Expected output:
00:8C:FA:C3:7C:84
00:5C:FA:7E:98:87
Could someone please help with the requirement.
The provided JSON is not valid JSON, but it is possible to treat it as one with text operations and PARSE_JSON:
SELECT s.value:macAddress::TEXT AS macAddress
FROM t
,LATERAL FLATTEN(INPUT => PARSE_JSON(REPLACE(REPLACE(col, 'u''', ''''), '''', '"')))
AS s;
For input:
CREATE OR REPLACE TABLE t(col TEXT)
AS
SELECT $$[{u'position': 1, u'macAddress': u'00:8C:FA:C1:7C:88'}, {u'position': 2, u'macAddress': u'00:8C:FA:5E:98:81'}]$$;
Output:
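As a cross-check outside Snowflake, the same two substitutions can be reproduced in a few lines of Python (a sketch of the cleanup logic, not the Snowflake query itself): strip the u prefix off each quoted string, then swap the remaining single quotes for double quotes, at which point the text parses as ordinary JSON.

```python
import json

raw = ("[{u'position': 1, u'macAddress': u'00:8C:FA:C1:7C:88'}, "
       "{u'position': 2, u'macAddress': u'00:8C:FA:5E:98:81'}]")

# The same two substitutions as the nested REPLACE calls:
# drop the u prefix, then turn single quotes into double quotes.
cleaned = raw.replace("u'", "'").replace("'", '"')
macs = [entry["macAddress"] for entry in json.loads(cleaned)]
print(macs)  # ['00:8C:FA:C1:7C:88', '00:8C:FA:5E:98:81']
```

This works here because the MAC addresses themselves contain no single quotes; values that embed quotes would need a more careful cleanup.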

Snowflake Retrieve value from Semi Structured Data

I'm trying to retrieve the health value from Snowflake semi structured data in a variant column called extra from table X.
An example of the code can be seen below:
[
{
"party":
"[{\"class\":\"Farmer\",\"gender\":\"Female\",\"ethnicity\":\"NativeAmerican\",\"health\":2},
{\"class\":\"Adventurer\",\"gender\":\"Male\",\"ethnicity\":\"White\",\"health\":3},
{\"class\":\"Farmer\",\"gender\":\"Male\",\"ethnicity\":\"White\",\"health\":0},
{\"class\":\"Banker\",\"gender\":\"Female\",\"ethnicity\":\"White\",\"health\":0}
}
]
I have tried reading the Snowflake documentation from https://community.snowflake.com/s/article/querying-semi-structured-data
I have also tried the following queries to flatten the query:
SELECT result.value:health AS PartyHealth
FROM X
WHERE value = 'Trail'
AND name = 'Completed'
AND PartyHealth > 0,
TABLE(FLATTEN(X, 'party')) result
AND
SELECT [0]['party'][0]['health'] AS Health
FROM X
WHERE value = 'Trail'
AND name = 'Completed'
AND PH > 0;
I am trying to retrieve the health value from table X, from the column extra, which contains the variant party with 4 repeating values [0-3]. I'm not sure how to do this; is someone able to tell me how to query semi-structured data in Snowflake, considering the documentation doesn't make much sense?
First, the JSON value you posted seems wrongly formatted (might be a copy-paste issue).
Here's an example that works:
First, your JSON properly formatted:
[{ "party": [ {"class":"Farmer","gender":"Female","ethnicity":"NativeAmerican","health":2}, {"class":"Adventurer","gender":"Male","ethnicity":"White","health":3}, {"class":"Farmer","gender":"Male","ethnicity":"White","health":0}, {"class":"Banker","gender":"Female","ethnicity":"White","health":0} ] }]
create a table to test:
CREATE OR REPLACE TABLE myvariant (v variant);
insert the JSON value into this table:
INSERT INTO myvariant SELECT PARSE_JSON('[{ "party": [ {"class":"Farmer","gender":"Female","ethnicity":"NativeAmerican","health":2}, {"class":"Adventurer","gender":"Male","ethnicity":"White","health":3}, {"class":"Farmer","gender":"Male","ethnicity":"White","health":0}, {"class":"Banker","gender":"Female","ethnicity":"White","health":0} ] }]');
Now, to select a value you start from the column name (in my case v), and since your JSON is wrapped in an array, you specify the first element [0] and expand from there, so something like this:
SELECT v[0]:party[0].health FROM myvariant;
Above gives me:
For the other rows you can simply do:
SELECT v[0]:party[1].health FROM myvariant;
SELECT v[0]:party[2].health FROM myvariant;
SELECT v[0]:party[3].health FROM myvariant;
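If it helps to see the path navigation outside Snowflake: a short Python sketch of the same document, where Snowflake's v[0]:party[i].health corresponds directly to v[0]["party"][i]["health"].

```python
import json

v = json.loads("""
[{"party": [
  {"class": "Farmer",     "gender": "Female", "ethnicity": "NativeAmerican", "health": 2},
  {"class": "Adventurer", "gender": "Male",   "ethnicity": "White",          "health": 3},
  {"class": "Farmer",     "gender": "Male",   "ethnicity": "White",          "health": 0},
  {"class": "Banker",     "gender": "Female", "ethnicity": "White",          "health": 0}
]}]
""")

# Snowflake's v[0]:party[i].health is v[0]["party"][i]["health"] here.
healths = [member["health"] for member in v[0]["party"]]
print(healths)  # [2, 3, 0, 0]
```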
Another option might be to make the data more like a table ... I find it easier to work with than the JSON :-)
Code at the bottom - just copy/paste and it runs in Snowflake, returning the screenshot below.
The key doc here is LATERAL FLATTEN.
SELECT d4.path, d4.value
from
lateral flatten(input=>PARSE_JSON('[{ "party": [ {"class":"Farmer","gender":"Female","ethnicity":"NativeAmerican","health":2}, {"class":"Adventurer","gender":"Male","ethnicity":"White","health":3}, {"class":"Farmer","gender":"Male","ethnicity":"White","health":0}, {"class":"Banker","gender":"Female","ethnicity":"White","health":0} ] }]') ) as d ,
lateral flatten(input=> value) as d2 ,
lateral flatten(input=> d2.value) as d3 ,
lateral flatten(input=> d3.value) as d4
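What the chained LATERAL FLATTEN calls do is peel off one level of nesting each, ending with path/value pairs for the leaves. A rough Python equivalent of that idea (a sketch, not Snowflake's exact path syntax) is a recursive walk:

```python
def flatten(node, path=""):
    # Recursively expand nested dicts/lists into (path, leaf-value) pairs,
    # roughly what chaining LATERAL FLATTEN calls produces as path/value.
    if isinstance(node, dict):
        for key, val in node.items():
            yield from flatten(val, f"{path}.{key}" if path else key)
    elif isinstance(node, list):
        for i, val in enumerate(node):
            yield from flatten(val, f"{path}[{i}]")
    else:
        yield path, node

doc = [{"party": [{"class": "Farmer", "health": 2},
                  {"class": "Banker", "health": 0}]}]
print(dict(flatten(doc)))
```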

Query for condition in array of JSON objects in PostgreSQL

Let's assume we have a PostgreSQL db with a table with rows of the following kind:
id | doc
---+-----------------
1 | JSON Object
2 | JSON Object
3 | JSON Object
...
The JSON has the following structure:
{
"header" : {
"info" : "foo"},
"data" :
[{"a" : 1, "b" : 123},
{"a" : 2, "b" : 234},
{"a" : 1, "b" : 543},
...
{"a" : 1, "b" : 123},
{"a" : 4, "b" : 452}]
}
with arbitrary values for 'a' and 'b' in 'data' in all rows of the table.
First question: how do I query for rows in the table where the following condition holds:
There exists a dictionary in the list/array with the key 'data', where a==i and b>j.
For example for i=1 and j=400 the condition would be fulfilled for the example above and the respective column would be returned.
Second question:
In my problem I have to deal with time series data in JSON. Every measurement is represented by one JSON document and therefore one row in the table. I want to identify measurements where certain events occurred. In case the above structure is unsuitable for easy querying: what could such a time series look like to be more easily queryable?
Thanks a lot!
I believe a query like this should answer your first question:
select distinct id, doc
from (
select id, doc, jsonb_array_elements(doc->'data') as elem
from docs
) as docelem
where (elem->>'a')::int = 4 and (elem->>'b')::int > 400
db<>fiddle here
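The predicate the subquery applies per unpacked element is easy to state in plain Python, which may help when checking edge cases before writing the SQL (the table below is a made-up stand-in, with the query's values a = 4, b > 400):

```python
docs = {
    1: {"header": {"info": "foo"},
        "data": [{"a": 1, "b": 123}, {"a": 2, "b": 234},
                 {"a": 1, "b": 543}, {"a": 4, "b": 452}]},
    2: {"header": {"info": "bar"},
        "data": [{"a": 4, "b": 10}]},
}

def matches(doc, i, j):
    # Same predicate the WHERE clause applies to each unpacked element:
    # some element of data has a == i and b > j.
    return any(e["a"] == i and e["b"] > j for e in doc["data"])

hits = [row_id for row_id, doc in docs.items() if matches(doc, 4, 400)]
print(hits)  # [1]
```

The distinct in the SQL plays the role of collapsing rows whose data array contains more than one matching element.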

How to delete array element in JSONB column based on nested key value?

How can I remove an object from an array, based on the value of one of the object's keys?
The array is nested within a parent object.
Here's a sample structure:
{
"foo1": [ { "bar1": 123, "bar2": 456 }, { "bar1": 789, "bar2": 42 } ],
"foo2": [ "some other stuff" ]
}
Can I remove an array element based on the value of bar1?
I can query based on the bar1 value using: columnname #> '{ "foo1": [ { "bar1": 123 } ]}', but I've had no luck finding a way to remove { "bar1": 123, "bar2": 456 } from foo1 while keeping everything else intact.
Thanks
Running PostgreSQL 9.6
Assuming that you want to search for a specific object with an inner object of a certain value, and that this specific object can appear anywhere in the array, you need to unpack the document and each of the arrays, test the inner sub-documents for containment and delete as appropriate, then re-assemble the array and the JSON document (untested):
SELECT id, jsonb_build_object(key, jarray)
FROM (
SELECT foo.id, foo.key, jsonb_build_array(bar.value) AS jarray
FROM ( SELECT id, key, value
FROM my_table, jsonb_each(jdoc) ) foo,
jsonb_array_elements(foo.value) AS bar (value)
WHERE NOT bar.value #> '{"bar1": 123}'::jsonb
GROUP BY 1, 2 ) x
GROUP BY 1;
Now, this may seem a little dense, so picked apart you get:
SELECT id, key, value
FROM my_table, jsonb_each(jdoc)
This uses a lateral join on your table to take the JSON document jdoc and turn it into a set of rows foo(id, key, value) where the value contains the array. The id is the primary key of your table.
Then we get:
SELECT foo.id, foo.key, jsonb_build_array(bar.value) AS jarray
FROM foo, -- abbreviated from above
jsonb_array_elements(foo.value) AS bar (value)
WHERE NOT bar.value #> '{"bar1": 123}'::jsonb
GROUP BY 1, 2
This uses another lateral join to unpack the arrays into bar(value) rows. These objects can now be searched with the containment operator to remove the objects from the result set: WHERE NOT bar.value #> '{"bar1": 123}'::jsonb. In the select list the arrays are re-assembled by id and key but now without the offending sub-documents.
Finally, in the main query the JSON documents are re-assembled:
SELECT id, jsonb_build_object(key, jarray)
FROM x -- from above
GROUP BY 1;
The important thing to understand is that PostgreSQL JSON functions only operate on the level of the JSON document that you can explicitly indicate. Usually that is the top level of the document, unless you have an explicit path to some level in the document (like {foo1, 0, bar1}, but you don't have that). At that level of operation you can then unpack to do your processing such as removing objects.
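The unpack-filter-reassemble shape of the query is perhaps clearer in a small Python sketch (the helper name drop_matching is invented for illustration): keep only the array elements that do not contain the probe object, then rebuild the document.

```python
doc = {
    "foo1": [{"bar1": 123, "bar2": 456}, {"bar1": 789, "bar2": 42}],
    "foo2": ["some other stuff"],
}

def drop_matching(doc, key, probe):
    # Rebuild the document, keeping only array elements under `key` that
    # do NOT contain every key/value pair of `probe` (the #> containment
    # test in the SQL, applied with NOT).
    out = dict(doc)
    out[key] = [e for e in doc[key]
                if not all(e.get(k) == v for k, v in probe.items())]
    return out

print(drop_matching(doc, "foo1", {"bar1": 123}))
```

The SQL has to do the same thing with jsonb_each / jsonb_array_elements and re-aggregation because, as noted above, the JSON functions only operate at one level of the document at a time.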

BigQuery: How to flatten repeated structured property imported from datastore

Dear all,
I started to use BigQuery to analyze data in the GAE datastore this month. First, I export data via the "Datastore Admin" page of the GAE console to Google Cloud Storage. Then I import the data from Google Cloud Storage into BigQuery. It works very smoothly except for repeated structured properties. I expected the imported record to be in the format of:
parent:"James",
children: [{
name: "name1",
age: 5,
gender: "M"
}, {
name: "name2",
age: 50,
gender: "F"
}, {
name: "name3",
age: 33,
gender: "M"
},
]
I know how to flatten data in the above format. But the actual data format in BigQuery seems to be the following:
parent: "James",
children.name:["name1", "name2", "name3"],
children.age:[5, 50, 33],
children.gender:["M", "F", "M"],
I'm wondering if it's possible to flatten the above data in BigQuery for further analysis. The ideal format of the result table in my mind is:
parentName, children.name, children.age, children.gender
James, name1, 5, "M"
James, name2, 50, "F"
James, name3, 33, "M"
Cheers!
With recently introduced BigQuery Standard SQL - things are so much nicer!
Try below (make sure to uncheck Use Legacy SQL checkbox under Show Options)
WITH parents AS (
SELECT
"James" AS parentName,
STRUCT(
["name1", "name2", "name3"] AS name,
[5, 50, 33] AS age,
["M", "F", "M"] AS gender
) AS children
)
SELECT
parentName, childrenName, childrenAge, childrenGender
FROM
parents,
UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age,
UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
WHERE
pos_name = pos_age AND pos_name = pos_gender
Here the original table - parents - has the below data,
with the respective schema as
[{
"parentName": "James",
"children": {
"name": ["name1", "name2", "name3"],
"age": ["5", "50", "33" ],
"gender": ["M", "F", "M"]
}
}]
and the output is
Note: the above is solely based on what I see in the original question and most likely needs to be adjusted to your specific needs.
Hope this helps in terms of direction to go and where to start with!
Added:
The above query uses row-based CROSS JOINs, meaning all the variations for the same parent are first assembled and then the WHERE clause filters out the "wrong" ones.
In contrast, the version below uses INNER JOINs to eliminate this side effect:
WITH parents AS (
SELECT
"James" AS parentName,
STRUCT(
["name1", "name2", "name3"] AS name,
[5, 50, 33] AS age,
["M", "F", "M"] AS gender
) AS children
)
SELECT
parentName, childrenName, childrenAge, childrenGender
FROM
parents, UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age
ON pos_name = pos_age
JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
ON pos_age = pos_gender
Intuitively, I would expect the second version to be a little more efficient for bigger tables.
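The core idea in both versions - align the three parallel arrays by position - is the same thing zip does in Python. A tiny sketch with the question's data:

```python
parent = {
    "parentName": "James",
    "children": {"name": ["name1", "name2", "name3"],
                 "age": [5, 50, 33],
                 "gender": ["M", "F", "M"]},
}

c = parent["children"]
# zip keeps entries aligned by position, which is what joining the three
# UNNEST ... WITH OFFSET results on equal offsets achieves.
rows = [(parent["parentName"], n, a, g)
        for n, a, g in zip(c["name"], c["age"], c["gender"])]
for row in rows:
    print(row)
```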
You should be able to use the 'large query results' feature to generate a new flattened table. Unfortunately, the syntax is terrifying. The basic principle is that you want to flatten each of the fields and save off the position, then filter where the position is the same.
Try something like:
SELECT parentName, children.name, children.age, children.gender,
position(children.name) as name_pos,
position(children.age) as age_pos,
position(children.gender) as gender_pos,
FROM table
SELECT
parent,
children.name,
children.age,
children.gender,
pos
FROM (
SELECT
parent,
children.name,
children.age,
children.gender,
gender_pos,
pos
FROM (
FLATTEN((
SELECT
parent,
children.name,
children.age,
children.gender,
pos,
POSITION(children.gender) as gender_pos
FROM (
SELECT
parent,
children.name,
children.age,
children.gender,
pos,
FROM (
FLATTEN((
SELECT
parent,
children.name,
children.age,
children.gender,
pos,
POSITION(children.age) AS age_pos
FROM (
FLATTEN((
SELECT
parent,
children.name,
children.age,
children.gender,
POSITION(children.name) AS pos
FROM table
),
children.name))),
children.age))
WHERE
age_pos = pos)),
children.gender)))
WHERE
gender_pos = pos;
To allow large results, if you are using the BigQuery UI, you should click the 'advanced options' button, specify a destination table, and check the 'allow large results' flag.
Note that if your data is stored as an entity that has a nested record that looks like {name, age, gender}, we should be transforming this into a nested record in BigQuery instead of parallel arrays. I'll look into why this is happening.