Snowflake aggregating table by a column

I have a Snowflake table like the one shown below.
I want all the values of NEW_COL captured per Name, as below.
The table needs to be sorted by Code.
Keep the first row per 'Name' group, capturing all the values of NEW_COL.
EXPECTED OUTPUT
I have tried the code below, but it does not give the combined value of NEW_COL; it only returns {"Code" : "A", "Dept" : "Dept Store"} for row 1, whereas the first row should get {"Code" : "A", "Dept" : "Dept Store"}, {"Code" : "B", "Dept" : "All other supplies"}, {"Code" : "C", "Dept" : "Rest"}.
select *,
       row_number() over (partition by Name order by Code) as row_number
from TEST
qualify row_number = 1
Here is what I am getting with my code:

It basically needs aggregation of NEW_COL by Name, and then selecting only the first row of each group.
Regarding the condition "The table needs to be sorted by Code": the expected output shown in the question is not actually ordered by Code.
The following approach can be used -
with data_cte (Name, Address, Code, Dept, NEW_COL) as
(
  -- sample data from the question
  select * from values
  ('XYZ','324 NW','A','Dept Store','{"Code" : "A", "Dept" : "Dept Store"}'),
  ('XYZ','324 NW','B','All other supplies','{"Code" : "B", "Dept" : "All other supplies"}'),
  ('XYZ','324 NW','C','Rest','{"Code" : "C", "Dept" : "Rest"}'),
  ('ABC','45 N Ave','C','Rest','{"Code" : "C", "Dept" : "Rest"}'),
  ('ABC','45 N Ave','B','All other supplies','{"Code" : "B", "Dept" : "All other supplies"}'),
  ('ZXC','12 SW st','A','Dept Store','{"Code" : "A", "Dept" : "Dept Store"}')
), agg_nc_cte as
(
  -- aggregate all NEW_COL values per Name
  select Name, listagg(NEW_COL, ',') as nc_agg
  from data_cte
  group by Name
), other_val_cte as
(
  -- keep only the first row per Name, ordered by Code
  select Name, Address, Code, Dept
  from data_cte
  qualify row_number() over (partition by Name order by Code asc) = 1
)
select a.Name, a.Address, a.Code, a.Dept, b.nc_agg
from other_val_cte a
join agg_nc_cte b on a.Name = b.Name
order by a.Code;
NAME | ADDRESS  | CODE | DEPT               | NC_AGG
XYZ  | 324 NW   | A    | Dept Store         | {"Code" : "A", "Dept" : "Dept Store"},{"Code" : "B", "Dept" : "All other supplies"},{"Code" : "C", "Dept" : "Rest"}
ZXC  | 12 SW st | A    | Dept Store         | {"Code" : "A", "Dept" : "Dept Store"}
ABC  | 45 N Ave | B    | All other supplies | {"Code" : "C", "Dept" : "Rest"},{"Code" : "B", "Dept" : "All other supplies"}

Related

Duplicate row detected during DML action Row Values

Each of our messages has a unique id and several attributes; the final result should combine all of these attributes into a single message. We tried using Snowflake MERGE, but it is not working as expected. In the first run, we used ROW_NUMBER with PARTITION BY to determine unique records and inserted them. In the second run, we tried to update more than one record per key, but we received the error "Duplicate row detected during DML action Row Values".
Setting the session parameter ERROR_ON_NONDETERMINISTIC_MERGE=FALSE was also tried, but the outcome may not be reliable or consistent.
We tried a JavaScript deep merge as well, but the volume was very high and there were performance problems.
Sample code below:
create or replace table test1 (Job_Id VARCHAR, RECORD_CONTENT VARIANT);
create or replace table test2 like test1;
insert into test1(JOB_ID, RECORD_CONTENT) select 1,parse_json('{
"customer": "Aphrodite",
"age": 32,
"orders": {"product": "socks","quantity": 4, "price": "$6", "attribute1" : "a1"}
}');
insert into test1(JOB_ID, RECORD_CONTENT) select 1,parse_json('{
"customer": "Aphrodite",
"age": 32,
"orders": {"product": "shoe", "quantity": 2, "brand" : "Woodland","attribute2" : "a2"}
}');
insert into test1(JOB_ID, RECORD_CONTENT) select 1,parse_json('{
"customer": "Aphrodite",
"age": 32,
"orders": {"product": "shoe polish","brand" : "Helios", "attribute3" : "a3" }
}');
merge into test2 t2 using (
    -- 1. first run:  unique values inserted using "rno = 1" -> succeeded
    -- 2. second run: updating attributes using "rno > 1"    -> Duplicate row detected during DML action
    select * from (
        select row_number() over (partition by JOB_ID order by JOB_ID desc) as rno,
               JOB_ID, RECORD_CONTENT
        from test1
    ) where rno > 1
) t1
on t1.JOB_ID = t2.JOB_ID
WHEN MATCHED THEN UPDATE
    SET t2.JOB_ID = t1.JOB_ID,
        t2.RECORD_CONTENT = t1.RECORD_CONTENT
WHEN NOT MATCHED THEN
    INSERT (JOB_ID, RECORD_CONTENT) VALUES (t1.JOB_ID, t1.RECORD_CONTENT)
Expected output:
select * from test2;
select parse_json('{
"customer": "Aphrodite",
"age": 32,
"orders": {"product": "shoe polish","quantity": 2, "brand" : "Helios","price": "$6",
"attribute1" : "a1","attribute2" : "a2","attribute3" : "a3" }
}');
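The error occurs because the USING subquery returns several rows for the same JOB_ID, so a single target row matches more than one source row. A possible workaround, sketched below under assumptions (the keys to combine live under orders, and with no timestamp in the sample data the survivor among duplicate keys is arbitrary), is to collapse the source to one row per JOB_ID before merging:
merge into test2 t2
using (
    select JOB_ID,
           object_construct(
               'customer', any_value(RECORD_CONTENT:customer),
               'age',      any_value(RECORD_CONTENT:age),
               'orders',   object_agg(key, val)  -- one combined orders object
           ) as RECORD_CONTENT
    from (
        select t.JOB_ID, t.RECORD_CONTENT, f.key, f.value as val
        from test1 t,
             lateral flatten(input => t.RECORD_CONTENT:orders) f
        -- OBJECT_AGG rejects duplicate keys, so keep one value per
        -- (JOB_ID, key); "last row wins" is arbitrary without a timestamp
        qualify row_number()
            over (partition by t.JOB_ID, f.key order by f.seq desc) = 1
    )
    group by JOB_ID
) t1
on t1.JOB_ID = t2.JOB_ID
when matched then update set t2.RECORD_CONTENT = t1.RECORD_CONTENT
when not matched then
    insert (JOB_ID, RECORD_CONTENT) values (t1.JOB_ID, t1.RECORD_CONTENT);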

Not able to transform data into the expected format in Snowflake

I have data in rows for a column like this:
[
{
"value": "A",
"path": "nth-child(1)"
},
{
"value": "K",
"path": "nth-child(2)"
},
{
"value": "C",
"path": "nth-child(3)"
}
]
I need help: I want to get the data in this format, in rows, from that column:
{
"A",
"K",
"C",
},
I have tried the following, but it combines all the rows of the table:
SELECT LISTAGG(f.value:value::STRING, ',') AS col
FROM tablename
,LATERAL FLATTEN(input => parse_json(column_name)) f
I have used a CTE just to provide fake data for the example:
WITH data(json) as (
select parse_json(column1) from values
('[{"value":"A","path":"nth-child(1)"},{"value":"K","path":"nth-child(2)"},{"value":"C","path":"nth-child(3)"}]'),
('[{"value":"B","path":"nth-child(1)"},{"value":"L","path":"nth-child(2)"},{"value":"D","path":"nth-child(3)"}]'),
('[{"value":"C","path":"nth-child(1)"},{"value":"M","path":"nth-child(2)"},{"value":"E","path":"nth-child(3)"}]')
)
SELECT LISTAGG(f.value:value::text,',') as l1
from data as d
,table(flatten(input=>d.json)) f
group by f.seq
order by f.seq;
gives:
L1
A,K,C
B,L,D
C,M,E
Thus, with some string concatenation via ||:
SELECT '{' || LISTAGG('"' ||f.value:value::text|| '"' , ',') || '}' as l1
from data as d
,table(flatten(input=>d.json)) f
group by f.seq
order by f.seq;
gives:
L1
{"A","K","C"}
{"B","L","D"}
{"C","M","E"}

Get value from JSON array in postgres

I have a strange JSON array in my Postgres database, without curly brackets. The column has the datatype text, so I guess I need to cast it to JSON.
It looks like this and can change from row to row:
[
[
"-",
"name",
"Gates"
],
[
"-",
"name_1",
null
],
[
"-",
"name_2",
null
],
[
"-",
"na_cd",
null
],
[
"-",
"class_cd",
null
],
[
"-",
"reference",
"190955"
],
[
"-",
"lang_cd",
"en"
],
[
"-",
"uid_nr",
null
],
[
"-",
"id",
19000
],
[
"-",
"firstname",
"Bill"
],
[
"-",
"spare",
null
]
]
What I need is to find and print the id if there is one; in this example, 19000.
Can someone please tell me how I can do this?
Basically, you should use jsonb_array_elements() twice: once for the main array and once for its filtered element (which is an array too).
select value::numeric as result
from (
select elem
from the_data
cross join jsonb_array_elements(col) as main(elem)
where elem ? 'id'
) s
cross join jsonb_array_elements(elem)
where jsonb_typeof(value) = 'number'
Try it in Db<>Fiddle.
However, if you want exactly the third value from the nested array, the query can be simpler (note that array elements are indexed from 0):
select (elem->2)::numeric as result
from the_data
cross join jsonb_array_elements(col) as main(elem)
where elem ? 'id'
Db<>Fiddle.
If you are using Postgres 12 or later and the value you are after is always the third element of the matching nested array, you can use a SQL/JSON path expression:
select jsonb_path_query_first(the_column, '$[*] ? (@[*] == "id")[2]')::int
from the_table
This assumes that the column is defined as jsonb (which it should be). If it's not, you need to cast it: the_column::jsonb
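For completeness, a self-contained sketch (the table and column names are made up to match the answer, and the column is text as in the question):
-- hypothetical setup mirroring the question's data
create table the_table (the_column text);
insert into the_table values
  ('[["-","name","Gates"],["-","id",19000],["-","firstname","Bill"]]');

select jsonb_path_query_first(
           the_column::jsonb,             -- the column is text, so cast first
           '$[*] ? (@[*] == "id")[2]'
       )::int as id
from the_table;
-- returns 19000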

MongoDB group-by count query

I have a MongoDB collection whose documents have fields like Id, EmployeeID, SiteID, and EmployeeAddress.
An employee can be present at a site more than once.
I want a group-by query, along with a count, that gives a result set of the form
EmployeeID SiteID Count EmployeeAddress
i.e., how many times an employee is present at a site.
I am using this query but not getting the desired data.
db.pnr_dashboard.aggregate([
  { "$group" : {
      "_id" : { "siteId" : "$siteId", "employeeId" : "$employeeId" },
      "count" : { "$sum" : 1 }
  }}
]);

BigQuery: How to flatten repeated structured property imported from datastore

Dear all,
I started to use BigQuery to analyze data in the GAE datastore this month. First, I export data via the "Datastore Admin" page of the GAE console to Google Cloud Storage, and then I import the data from Google Cloud Storage into BigQuery. It works very smoothly except for repeated structured properties. I expected the imported record to be in the format:
parent:"James",
children: [{
name: "name1",
age: 5,
gender: "M"
}, {
name: "name2",
age: 50,
gender: "F"
}, {
name: "name3",
age: 33,
gender: "M"
},
]
I know how to flatten data in the above format, but the actual data format in BigQuery seems to be the following:
parent: "James",
children.name:["name1", "name2", "name3"],
children.age:[5, 50, 33],
children.gender:["M", "F", "M"],
I'm wondering if it's possible to flatten the above data in BigQuery for further analysis. The ideal format of the result table in my mind is:
parentName, children.name, children.age, children.gender
James, name1, 5, "M"
James, name2, 50, "F"
James, name3, 33, "M"
Cheers!
With the recently introduced BigQuery Standard SQL, things are so much nicer!
Try the query below (make sure to uncheck the "Use Legacy SQL" checkbox under Show Options).
WITH parents AS (
SELECT
"James" AS parentName,
STRUCT(
["name1", "name2", "name3"] AS name,
[5, 50, 33] AS age,
["M", "F", "M"] AS gender
) AS children
)
SELECT
parentName, childrenName, childrenAge, childrenGender
FROM
parents,
UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name,
UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age,
UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
WHERE
pos_name = pos_age AND pos_name = pos_gender
Here the original table - parents - has the data below, with the respective schema:
[{
"parentName": "James",
"children": {
"name": ["name1", "name2", "name3"],
"age": ["5", "50", "33" ],
"gender": ["M", "F", "M"]
}
}]
and the output is:
parentName | childrenName | childrenAge | childrenGender
James      | name1        | 5           | M
James      | name2        | 50          | F
James      | name3        | 33          | M
Note: the above is based solely on what I see in the original question and most likely needs to be adjusted to your specific needs.
Hope this helps in terms of direction to go and where to start!
Added:
The query above uses row-based CROSS JOINs, meaning all combinations for the same parent are assembled first and the WHERE clause then filters out the "wrong" ones.
In contrast, the version below uses INNER JOINs to eliminate this "side effect":
WITH parents AS (
SELECT
"James" AS parentName,
STRUCT(
["name1", "name2", "name3"] AS name,
[5, 50, 33] AS age,
["M", "F", "M"] AS gender
) AS children
)
SELECT
parentName, childrenName, childrenAge, childrenGender
FROM
parents, UNNEST(children.name) AS childrenName WITH OFFSET AS pos_name
JOIN UNNEST(children.age) AS childrenAge WITH OFFSET AS pos_age
ON pos_name = pos_age
JOIN UNNEST(children.gender) AS childrenGender WITH OFFSET AS pos_gender
ON pos_age = pos_gender
Intuitively, I would expect the second version to be a little more efficient on bigger tables.
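A third variant, a minimal sketch assuming the three arrays always have the same length (which the Datastore export shown above appears to guarantee): unnest only one array and index the others by its offset, avoiding the joins entirely.
SELECT
  parentName,
  childName AS childrenName,
  -- index the sibling arrays by the offset of the unnested one
  children.age[OFFSET(pos)] AS childrenAge,
  children.gender[OFFSET(pos)] AS childrenGender
FROM
  parents,
  UNNEST(children.name) AS childName WITH OFFSET AS pos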
You should be able to use the 'large query results' feature to generate a new flattened table. Unfortunately, the syntax is terrifying. The basic principle is that you want to flatten each of the fields, save off the position, and then filter where the positions are the same.
Try something like:
SELECT parentName, children.name, children.age, children.gender,
  position(children.name) as name_pos,
  position(children.age) as age_pos,
  position(children.gender) as gender_pos,
FROM table
and then flatten on each repeated field in turn, filtering positions at each step:
SELECT
  parent,
  children.name,
  children.age,
  children.gender,
  pos
FROM (
  SELECT
    parent,
    children.name,
    children.age,
    children.gender,
    gender_pos,
    pos
  FROM (
    FLATTEN((
      SELECT
        parent,
        children.name,
        children.age,
        children.gender,
        pos,
        POSITION(children.gender) as gender_pos
      FROM (
        SELECT
          parent,
          children.name,
          children.age,
          children.gender,
          pos,
        FROM (
          FLATTEN((
            SELECT
              parent,
              children.name,
              children.age,
              children.gender,
              pos,
              POSITION(children.age) AS age_pos
            FROM (
              FLATTEN((
                SELECT
                  parent,
                  children.name,
                  children.age,
                  children.gender,
                  POSITION(children.name) AS pos
                FROM table
              ),
              children.name))),
            children.age))
        WHERE
          age_pos = pos)),
      children.gender)))
WHERE
  gender_pos = pos;
To allow large results in the BigQuery UI, click the 'advanced options' button, specify a destination table, and check the 'allow large results' flag.
Note that if your data is stored as an entity that has a nested record that looks like {name, age, gender}, we should be transforming this into a nested record in BigQuery instead of parallel arrays. I'll look into why this is happening.
