I want to create a hash of ('a', 'b', 'c', null) while ignoring the null. I used the statement below to do so, but it returns null.
I want the equivalent of (select SHA2_HEX('a|b|c')), whereas the statement below effectively does (select SHA2_HEX(null)):
(select SHA2_HEX(CONCAT_WS('|', 'a', 'b', 'c', null)))
CONCAT_WS will produce NULL as soon as one of the values is NULL. Try adding coalesce(your_column, ''); then at least the final output of CONCAT_WS is not NULL. But the result is still not correct, because you will have a|b|c| (note the trailing |).
select SHA2_HEX(CONCAT_WS('|', 'a', 'b', 'c', coalesce(null, '')))
Otherwise just build the string manually: CONCAT('a', '|', 'b', '|', 'c', coalesce(null, ''))
ARRAY_CONSTRUCT_COMPACT drops NULLs, and ARRAY_TO_STRING then gives the string you are looking for:
select
CONCAT_WS('|', 'a', 'b', 'c', null) as d1
,array_construct('a', 'b', 'c', null) as a1
,ARRAY_TO_STRING(a1, '|') as d2
,array_construct_compact('a', 'b', 'c', null) as a2
,ARRAY_TO_STRING(a2, '|') as d3
;
D1: null
A1: [ "a", "b", "c", undefined ]
D2: a|b|c|
A2: [ "a", "b", "c" ]
D3: a|b|c
Thus:
select SHA2_HEX(ARRAY_TO_STRING(array_construct_compact('a', 'b', 'c', null),'|'));
gives:
SHA2_HEX(ARRAY_TO_STRING(ARRAY_CONSTRUCT_COMPACT('A', 'B', 'C', NULL),'|'))
a52dd81bfd5e4e66d96b9f598382f6cbf8c5c3897654e6ae9055e03620fcf38e
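Applied to real columns the pattern is the same; a sketch, where my_table and the col_* names are placeholders:

select SHA2_HEX(ARRAY_TO_STRING(array_construct_compact(col_a, col_b, col_c), '|')) as row_hash
from my_table;

One caveat: because the NULLs are dropped entirely, ('a', NULL, 'c') and ('a', 'c') hash to the same value. If those must differ, COALESCE each column to a sentinel string instead of compacting.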
I'm looking for some Snowflake syntax assistance in how to generate a summary table or view from an existing table. My summary table should have 1 row per unique id from the existing table along with boolean values indicating if the various milestones (as per the summary column names) have been hit. Any help is appreciated as I am a Snowflake novice. Thanks.
(Existing table and desired summary table/view were shown as images in the original post.)
So using Himanshu's data, thank you:
WITH fake_data(id, updated, pipeline_id, stage_id) AS (
SELECT column1, to_date(column2,'mm/dd/yyyy hh24:mi:ss'), column3, column4
FROM VALUES
(1111, '02/01/2022 09:01:00', 'A', '1' ),
(1111, '02/01/2022 10:01:00', 'A', '2' ),
(1111, '02/01/2022 11:01:00', 'B', '5' ),
(2222, '02/02/2022 13:01:00', 'A', '1' ),
(2222, '02/03/2022 18:01:00', 'B', '5' ),
(2222, '02/04/2022 07:01:00', 'B', '6' ),
(3333, '02/02/2022 14:01:00', 'A', '1' ),
(3333, '02/03/2022 18:01:00', 'A', '2' ),
(3333, '02/03/2022 07:01:00', 'C', '7' ),
(3333, '02/03/2022 21:01:00', 'C', '8' ),
(3333, '02/05/2022 17:01:00', 'C', '9' )
)
We are doing an aggregation across each id, using COUNT_IF to count how many rows meet our criteria; if the count is > 0 we are happy. The SELECT below attaches to the WITH clause above:
SELECT
id,
count_if(pipeline_id='A')>0 AS hit_stage_a,
count_if(pipeline_id='B')>0 AS hit_stage_b,
count_if(pipeline_id='C')>0 AS hit_stage_c,
count_if(stage_id='4')>0 AS hit_stage_4,
count_if(stage_id='5')>0 AS hit_stage_5,
count_if(stage_id='6')>0 AS hit_stage_6
FROM fake_data
GROUP BY 1
ORDER BY 1;
gives:
ID   | HIT_STAGE_A | HIT_STAGE_B | HIT_STAGE_C | HIT_STAGE_4 | HIT_STAGE_5 | HIT_STAGE_6
1111 | TRUE        | TRUE        | FALSE       | FALSE       | TRUE        | FALSE
2222 | TRUE        | TRUE        | FALSE       | FALSE       | TRUE        | TRUE
3333 | TRUE        | FALSE       | TRUE        | FALSE       | FALSE       | FALSE
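If you'd rather have the aggregate produce the booleans directly, Snowflake's BOOLOR_AGG should be equivalent; a sketch of the same query:

SELECT
    id,
    boolor_agg(pipeline_id = 'A') AS hit_stage_a,
    boolor_agg(pipeline_id = 'B') AS hit_stage_b,
    boolor_agg(pipeline_id = 'C') AS hit_stage_c,
    boolor_agg(stage_id = '4') AS hit_stage_4,
    boolor_agg(stage_id = '5') AS hit_stage_5,
    boolor_agg(stage_id = '6') AS hit_stage_6
FROM fake_data
GROUP BY 1
ORDER BY 1;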
Try this and see if it helps you get what you want.
SELECT ID,
       decode(HIT_PIPELINE_A, NULL, FALSE, TRUE),
       decode(HIT_PIPELINE_B, NULL, FALSE, TRUE),
       decode(HIT_PIPELINE_C, NULL, FALSE, TRUE),
       decode(HIT_STAGE_4, NULL, FALSE, TRUE),
       decode(HIT_STAGE_5, NULL, FALSE, TRUE),
       decode(HIT_STAGE_6, NULL, FALSE, TRUE)
FROM (
    SELECT * FROM tab1
    PIVOT(MAX(PIPELINE_ID) FOR stage_id IN ('1','2','3','4','5','6'))
    AS P(ID, DT, HIT_PIPELINE_A, HIT_PIPELINE_B, HIT_PIPELINE_C, HIT_STAGE_4, HIT_STAGE_5, HIT_STAGE_6)
) ORDER BY ID;
The setup data used:
create or replace table Tab1 (ID varchar(100), updated date, pipeline_id varchar(100), stage_id varchar(10));
insert into tab1 values(1111, to_date('02/01/2022 09:01:00','mm/dd/yyyy hh24:mi:ss'), 'A', '1' );
insert into tab1 values(1111, to_date('02/01/2022 10:01:00','mm/dd/yyyy hh24:mi:ss'), 'A', '2' );
insert into tab1 values(1111, to_date('02/01/2022 11:01:00','mm/dd/yyyy hh24:mi:ss'), 'B', '5' );
insert into tab1 values(2222, to_date('02/02/2022 13:01:00','mm/dd/yyyy hh24:mi:ss'), 'A', '1' );
insert into tab1 values(2222, to_date('02/03/2022 18:01:00','mm/dd/yyyy hh24:mi:ss'), 'B', '5' );
insert into tab1 values(2222, to_date('02/04/2022 07:01:00','mm/dd/yyyy hh24:mi:ss'), 'B', '6' );
insert into tab1 values(3333, to_date('02/02/2022 14:01:00','mm/dd/yyyy hh24:mi:ss'), 'A', '1' );
insert into tab1 values(3333, to_date('02/03/2022 18:01:00','mm/dd/yyyy hh24:mi:ss'), 'A', '2' );
insert into tab1 values(3333, to_date('02/03/2022 07:01:00','mm/dd/yyyy hh24:mi:ss'), 'C', '7' );
insert into tab1 values(3333, to_date('02/03/2022 21:01:00','mm/dd/yyyy hh24:mi:ss'), 'C', '8' );
insert into tab1 values(3333, to_date('02/05/2022 17:01:00','mm/dd/yyyy hh24:mi:ss'), 'C', '9' );
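As a side note, DECODE(x, NULL, FALSE, TRUE) is just a null test, so the outer query can also be written with IS NOT NULL; a sketch over the same PIVOT subquery:

SELECT ID,
       HIT_PIPELINE_A IS NOT NULL AS hit_pipeline_a,
       HIT_PIPELINE_B IS NOT NULL AS hit_pipeline_b,
       HIT_PIPELINE_C IS NOT NULL AS hit_pipeline_c,
       HIT_STAGE_4 IS NOT NULL AS hit_stage_4,
       HIT_STAGE_5 IS NOT NULL AS hit_stage_5,
       HIT_STAGE_6 IS NOT NULL AS hit_stage_6
FROM (
    -- same PIVOT subquery as above
    SELECT * FROM tab1
    PIVOT(MAX(PIPELINE_ID) FOR stage_id IN ('1','2','3','4','5','6'))
    AS P(ID, DT, HIT_PIPELINE_A, HIT_PIPELINE_B, HIT_PIPELINE_C, HIT_STAGE_4, HIT_STAGE_5, HIT_STAGE_6)
) ORDER BY ID;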
This question already has answers here: Remove consecutive duplicate rows in Postgresql (2 answers)
In a PostgreSQL text array I want to remove consecutive identical values. A DISTINCT isn't enough because I can also have non-consecutive duplicate values, which I want to keep. The order of the values matters.
For example:
SELECT ARRAY['A', 'B', 'C', 'C', 'D', 'A'];
Should return {A,B,C,D,A}
-- Edit
As @MkSpring Mk said, this post suggests an answer. I tried to adapt it:
WITH q_array AS (
SELECT ARRAY['A', 'B', 'C', 'C', 'D', 'A'] AS full_array
), q_unnest AS (
SELECT
unnest(full_array) AS unnest_array
FROM q_array
), q_id AS (
SELECT
row_number() OVER () AS id,
unnest_array
FROM q_unnest
)
SELECT
array_agg(q_id.unnest_array ORDER BY q_id.id) AS array_logical
FROM (SELECT q_id.*, lag(q_id.unnest_array) OVER (ORDER BY q_id.id) AS unnest_array_logical FROM q_id) q_id
WHERE unnest_array_logical IS DISTINCT FROM q_id.unnest_array
I find this syntax very verbose; maybe my approach is not efficient enough. Is this syntax OK? Is it best practice to create a function, or can I write it directly in a query?
To improve query performance, I made this function:
CREATE OR REPLACE FUNCTION array_remove_consecutive_duplicates(
array_vals anyarray)
RETURNS TABLE(array_logical anyarray)
LANGUAGE plpgsql
COST 100
VOLATILE PARALLEL UNSAFE
ROWS 1000
AS $BODY$
BEGIN
RETURN QUERY
WITH q_array AS (
SELECT array_vals AS full_array
), q_unnest AS (
SELECT
unnest(full_array) AS unnest_array
FROM q_array
), q_id AS (
SELECT
row_number() OVER () AS id,
unnest_array
FROM q_unnest
)
SELECT
array_agg(q_id.unnest_array ORDER BY q_id.id) AS array_logical
FROM (SELECT q_id.*, lag(q_id.unnest_array) OVER (ORDER BY q_id.id) AS unnest_array_logical FROM q_id) q_id
WHERE unnest_array_logical IS DISTINCT FROM q_id.unnest_array;
END;
$BODY$;
My query will now write like this:
SELECT array_remove_consecutive_duplicates(ARRAY['A', 'B', 'C', 'C', 'D', 'A']);
It's a bit faster and easier to use in a query than my first approach. Maybe it's not the right way, or my syntax isn't efficient. If someone has a better way to suggest, I'll take it.
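For what it's worth, the same logic can be written more compactly with unnest ... WITH ORDINALITY, which yields the element positions without the row_number() CTE; a sketch on the example array:

SELECT array_agg(val ORDER BY idx) AS array_logical
FROM (
    SELECT t.val, t.idx,
           lag(t.val) OVER (ORDER BY t.idx) AS prev_val
    FROM unnest(ARRAY['A', 'B', 'C', 'C', 'D', 'A']) WITH ORDINALITY AS t(val, idx)
) s
WHERE s.prev_val IS DISTINCT FROM s.val;
-- returns {A,B,C,D,A}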
I've been struggling with this for a while now. Imagine I have these two tables:
CREATE TEMPORARY TABLE tmp_target AS (
SELECT * FROM VALUES
('John', 43, 'm', 17363)
, ('Mark', 21, 'm', 16354)
, ('Jean', 25, 'f', 74615)
, ('Sara', 63, 'f', 26531)
, ('Alyx', 32, 'f', 42365)
AS target (name, age, gender, zip)
);
and
CREATE TEMPORARY TABLE tmp_source AS (
SELECT * FROM VALUES
('Cory', 42, 'm', 15156)
, ('Fred', 51, 'm', 71451)
, ('Mimi', 22, 'f', 45624)
, ('Matt', 61, 'm', 12734)
, ('Olga', 19, 'f', 52462)
, ('Cleo', 29, 'f', 23352)
, ('Simm', 31, 'm', 62445)
, ('Mona', 37, 'f', 23261)
, ('Feng', 44, 'f', 64335)
, ('King', 57, 'm', 12225)
AS source (name, age, gender, zip)
);
I would like to update the tmp_target table by taking 5 rows at random from the tmp_source table for the column(s) I'm interested in. For example, maybe I want to replace all the names with 5 random names from tmp_source, or maybe I want to replace the names and the ages.
My first attempt was this:
UPDATE tmp_target t SET t.name = s.name FROM tmp_source s;
However, when I examine the target table, I notice that quite a few of the names are duplicated, usually in pairs. As well, Snowflake gives me "number of rows updated: 5" along with "number of multi-joined rows updated: 5". I believe this is due to the non-deterministic nature of what's happening, possibly as noted in the Snowflake documentation on updates. Not to mention I get the nagging feeling that this would be horribly inefficient if the tables had many records.
Then I tried something to grab 5 random rows from the source table:
UPDATE tmp_target t SET t.name = cte.name
FROM (
WITH upd AS (SELECT name FROM tmp_source SAMPLE ROW (5 ROWS))
SELECT name FROM upd
) AS cte;
But I seem to run into the exact same issue, both when I examine the target table, and as reported by the number of multi-joined rows.
I was wondering if I can use row numbering somehow, but while I can generate row numbers in the subquery, I don't know how to do that in the SET part of the outside query.
I want to add that neither table has any identifiers or indexes that can be used, and I'm looking for a solution that wouldn't require any.
I would very much appreciate it if anyone can provide solutions or ideas that are as clean and tidy as possible, with some consideration given to efficiency (imagine a target table of 100K rows and a source table of 10M rows).
Thank you!
I like the two answers already provided, but let me give you a simple answer to solve the simple case:
UPDATE tmp_target t
SET t.name = (
select array_agg(s.name) possible_names
from tmp_source s
)[uniform(0, 9, random())]
;
The secret of this solution is building an array of possible values, and choosing one at random for each updated row.
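One nitpick: the hardcoded 9 assumes exactly ten source rows. A sketch (untested) of deriving the bound from the data itself, using GET, Snowflake's function form of the [] accessor:

UPDATE tmp_target t
SET t.name = get(
    (select array_agg(s.name) from tmp_source s),
    abs(random()) % (select count(*) from tmp_source)  -- random index 0..count-1
)::string;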
Update: now with a JavaScript UDF that will help us choose each name from the source only once:
create or replace function incremental_thing()
returns float
language javascript
as
$$
if (typeof(inc) === "undefined") inc = 0;
return inc++;
$$
;
UPDATE tmp_target t
SET t.name = (
select array_agg(s.name) within group (order by random())
from tmp_source s
)[incremental_thing()::integer]
;
Note that the JS UDF returns an incremental value each time it’s called, and that helps me choose the next value from a sorted array to use on an update.
Since the value is incremented inside the JS UDF, this will work as long as there's only one JS environment involved. To force single-node processing and avoid parallelism, choose an XS warehouse and test.
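One way to probe that assumption, sketched here: call the UDF across a generated row set and check that the output is an unbroken 0..N-1 sequence.

select incremental_thing() as seq
from table(generator(rowcount => 10));
-- expect 0 through 9 in order on a single JS env; gaps or restarts suggest parallelism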
Two examples follow: the first uses a temporary table to house the joined data by a rownum, while the second includes everything in one query. Note I used UPPER and lower case strings to make sure the records were being updated the way I wanted.
CREATE OR REPLACE TEMPORARY TABLE tmp_target AS (
SELECT * FROM VALUES
('John', 43, 'm', 17363)
, ('Mark', 21, 'm', 16354)
, ('Jean', 25, 'f', 74615)
, ('Sara', 63, 'f', 26531)
, ('Alyx', 32, 'f', 42365)
AS target (name, age, gender, zip)
);
CREATE OR REPLACE TEMPORARY TABLE tmp_source AS (
SELECT * FROM VALUES
('CORY', 42, 'M', 15156)
, ('FRED', 51, 'M', 71451)
, ('MIMI', 22, 'F', 45624)
, ('MATT', 61, 'M', 12734)
, ('OLGA', 19, 'F', 52462)
, ('CLEO', 29, 'F', 23352)
, ('SIMM', 31, 'M', 62445)
, ('MONA', 37, 'F', 23261)
, ('FENG', 44, 'F', 64335)
, ('KING', 57, 'M', 12225)
AS source (name, age, gender, zip)
);
CREATE OR REPLACE TEMPORARY TABLE t1 as (
with src as (
SELECT tmp_source.*, row_number() over (order by 1) tmp_id
FROM tmp_source SAMPLE ROW (5 ROWS)),
tgt as (
SELECT tmp_target.*, row_number() over (order by 1) tmp_id
FROM tmp_target SAMPLE ROW (5 ROWS))
SELECT src.name as src_name,
src.age as src_age,
src.gender as src_gender,
src.zip as src_zip,
src.tmp_id as tmp_id,
tgt.name as tgt_name,
tgt.age as tgt_age,
tgt.gender as tgt_gender,
tgt.zip as tgt_zip
FROM src, tgt
WHERE src.tmp_id = tgt.tmp_id);
UPDATE tmp_target a
SET a.name = b.src_name,
a.gender = b.src_gender
FROM (SELECT * FROM t1) b
WHERE a.name = b.tgt_name
AND a.age = b.tgt_age
AND a.gender = b.tgt_gender
AND a.zip = b.tgt_zip;
UPDATE tmp_target a
SET a.name = b.src_name,
a.gender = b.src_gender
FROM (
with src as (
SELECT tmp_source.*, row_number() over (order by 1) tmp_id
FROM tmp_source SAMPLE ROW (5 ROWS)),
tgt as (
SELECT tmp_target.*, row_number() over (order by 1) tmp_id
FROM tmp_target SAMPLE ROW (5 ROWS))
SELECT src.name as src_name,
src.age as src_age,
src.gender as src_gender,
src.zip as src_zip,
src.tmp_id as tmp_id,
tgt.name as tgt_name,
tgt.age as tgt_age,
tgt.gender as tgt_gender,
tgt.zip as tgt_zip
FROM src, tgt
WHERE src.tmp_id = tgt.tmp_id) b
WHERE a.name = b.tgt_name
AND a.age = b.tgt_age
AND a.gender = b.tgt_gender
AND a.zip = b.tgt_zip;
At a first pass, this is all that came to mind. I'm not sure if it suits your example perfectly, since it involves reloading the table.
It should be comparably performant to any other solution that uses a generated rownum. At least to my knowledge, in Snowflake, an update is no more performant than an insert (at least in this case where you're touching every record, and every micropartition, regardless).
INSERT OVERWRITE INTO tmp_target
with target as (
select
age,
gender,
zip,
row_number() over (order by 1) rownum
from tmp_target
)
,source as (
select
name,
row_number() over (order by 1) rownum
from tmp_source
SAMPLE ROW (5 ROWS)
)
SELECT
s.name,
t.age,
t.gender,
t.zip
from target t
join source s on t.rownum = s.rownum;
As the title says, how can I swap all the values of a column that can be 'A' or 'B', making all rows with 'A' have a 'B' and all rows with a 'B' have an 'A'?
I'm not sure if doing it with UPDATE and SET will change all the A's into B's and then, once every row has a 'B', change them all back into A's.
You can use CASE like this; a single UPDATE reads each row's original value when evaluating the CASE, so the A's that become B's can't be flipped back:
update tablename
set col =
case col
when 'A' then 'B'
when 'B' then 'A'
end
where col in ('A', 'B')
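For instance, a minimal reproduction (table and column names are hypothetical):

create table demo (col varchar(1));
insert into demo values ('A'), ('B'), ('A');

update demo
set col = case col
    when 'A' then 'B'
    when 'B' then 'A'
end
where col in ('A', 'B');

select col from demo;  -- B, A, B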
I'm using SQL Server Express 2008 and I'm trying to add data to a field in a table which has a datatype of datetime2(7).
This is what I'm trying to add:
'2012-02-02 12:32:10.1234'
But I am getting the error
Msg 8152, Level 16, State 4, Line 1
String or binary data would be truncated.
The statement has been terminated.
Does this mean that it's too long to be added to the field and should be cut down a bit? If so, can you give me an example of how it should look?
Note - I've also tried it in this format:
'01/01/98 23:59:59.999'
Thanks
EDIT
The actual statement:
INSERT INTO dbo.myTable
(
nbr,
id,
name,
dsc,
start_date,
end_date,
last_date,
condition,
condtion_dsc,
crte_dte,
someting,
activation_date,
denial_date,
another_date,
a_name,
prior_auth_start_date,
prior_auth_end_date,
history_cmnt,
cmnt,
source,
program,
[IC-code],
[IC-description],
another_start_date,
another_end_date,
ver_nbr,
created_by,
creation_date,
updated_by,
updated_date)
VALUES
(
26,
'a',
'sometinh',
'c',
'01/01/98 23:59:59.999',
'01/01/98 23:59:59.999',
'01/01/98 23:59:59.999',
'as',
'asdf',
'01/01/98 23:59:59.999',
'lkop',
'01/01/98 23:59:59.999',
'01/01/98 23:59:59.999',
'01/01/98 23:59:59.999',
'a',
'01/01/98 23:59:59.999',
'01/01/98 23:59:59.999',
'b',
'c',
'd',
'b',
'c',
'd',
'01/01/98 23:59:59.999',
'01/01/98 23:59:59.999',
423,
'Monkeys',
'01/01/98 23:59:59.999',
'Goats',
'01/01/98 23:59:59.999'
);
Take a close look at the table you are trying to insert into. I bet one of the values you're trying to insert into a char/varchar/nchar/nvarchar column is too long.
SELECT
    name,
    -- max_length is in bytes; nvarchar (231) and nchar (239) store 2 bytes per character
    max_length / CASE WHEN system_type_id IN (231, 239)
        THEN 2 ELSE 1 END AS max_char_length
FROM sys.columns
WHERE [object_id] = OBJECT_ID('dbo.TargetTableName')
    -- 167 = varchar, 175 = char, 231 = nvarchar, 239 = nchar
    AND system_type_id IN (167, 175, 231, 239);
This will get you a list like:
name     max_char_length
-------- ---------------
col1     32
col5     64
col7     12
Now, compare this list to the literals you have in your VALUES clause. As I suggested in a comment, I bet one of these has more characters than the table allows.
There's a chance there are binary or varbinary columns, and the issue is there, but I strongly suspect this is a simple "string is too long" problem - and has absolutely nothing to do with your DATETIME2(7) value.
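To convince yourself the DATETIME2(7) value really is innocent, a quick sanity check against a scratch table (a sketch):

CREATE TABLE #dt_test (d datetime2(7));
INSERT INTO #dt_test VALUES ('2012-02-02 12:32:10.1234');
SELECT d FROM #dt_test;  -- 2012-02-02 12:32:10.1234000
DROP TABLE #dt_test;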