PostgreSQL array - Remove consecutive identical values [duplicate] - arrays

This question already has answers here:
Remove consecutive duplicate rows in Postgresql
(2 answers)
Closed 1 year ago.
In a PostgreSQL text array I want to remove consecutive identical values. DISTINCT isn't enough, because the array can contain duplicate values that are not consecutive, and I want to keep those. The order of the values matters.
For example
SELECT ARRAY['A', 'B', 'C', 'C', 'D', 'A'];
Should return {A,B,C,D,A}
-- Edit
As @MkSpring Mk said, this post suggests an answer. I tried to adapt it:
WITH q_array AS (
SELECT ARRAY['A', 'B', 'C', 'C', 'D', 'A'] AS full_array
), q_unnest AS (
SELECT
unnest(full_array) AS unnest_array
FROM q_array
), q_id AS (
SELECT
row_number() OVER () AS id,
unnest_array
FROM q_unnest
)
SELECT
array_agg(q_id.unnest_array) AS array_logical
FROM (SELECT q_id.*, lag(q_id.unnest_array) OVER (ORDER BY q_id.id) AS unnest_array_logical FROM q_id) q_id
WHERE unnest_array_logical IS DISTINCT FROM q_id.unnest_array
I find this syntax very verbose, and my approach may not be efficient enough.
Is this syntax OK? Is it best practice to create a function, or can I write it directly in a query?

To improve query performance, I made this function:
CREATE OR REPLACE FUNCTION array_remove_consecutive_duplicates(
array_vals anyarray)
RETURNS TABLE(array_logical anyarray)
LANGUAGE 'plpgsql'
COST 100
VOLATILE PARALLEL UNSAFE
ROWS 1000
AS $BODY$
BEGIN
RETURN QUERY
WITH q_array AS (
SELECT array_vals AS full_array
), q_unnest AS (
SELECT
unnest(full_array) AS unnest_array
FROM q_array
), q_id AS (
SELECT
row_number() OVER () AS id,
unnest_array
FROM q_unnest
)
SELECT
array_agg(q_id.unnest_array) AS array_logical
FROM (SELECT q_id.*, lag(q_id.unnest_array) OVER (ORDER BY q_id.id) AS unnest_array_logical FROM q_id) q_id
WHERE unnest_array_logical IS DISTINCT FROM q_id.unnest_array;
END;
$BODY$;
My query now looks like this:
SELECT array_remove_consecutive_duplicates(ARRAY['A', 'B', 'C', 'C', 'D', 'A']);
It's a bit faster and easier to use in a query than my first approach. It may not be the best way, or my syntax may not be efficient. If someone has a better way to suggest, I'll take it.
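To check the expected result outside the database, the same rule (keep a value only when it differs from the one just before it) can be sketched in Python. This is only a model of what the lag() / IS DISTINCT FROM query computes, not a replacement for it; the function name is hypothetical.

```python
from itertools import groupby


def remove_consecutive_duplicates(values):
    """Collapse each run of equal adjacent values into a single value."""
    # groupby() only groups items that are next to each other, so a value
    # that reappears later (the trailing 'A' below) is kept.
    return [key for key, _run in groupby(values)]


result = remove_consecutive_duplicates(['A', 'B', 'C', 'C', 'D', 'A'])
print(result)  # ['A', 'B', 'C', 'D', 'A']
```

Comparing this output with the array returned by the SQL function is a quick way to test edge cases such as an empty array or a long run of repeats.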

Related

Snowflake: How do I update a column with values taken at random from another table?

I've been struggling with this for a while now. Imagine I have these two tables:
CREATE TEMPORARY TABLE tmp_target AS (
SELECT * FROM VALUES
('John', 43, 'm', 17363)
, ('Mark', 21, 'm', 16354)
, ('Jean', 25, 'f', 74615)
, ('Sara', 63, 'f', 26531)
, ('Alyx', 32, 'f', 42365)
AS target (name, age, gender, zip)
);
and
CREATE TEMPORARY TABLE tmp_source AS (
SELECT * FROM VALUES
('Cory', 42, 'm', 15156)
, ('Fred', 51, 'm', 71451)
, ('Mimi', 22, 'f', 45624)
, ('Matt', 61, 'm', 12734)
, ('Olga', 19, 'f', 52462)
, ('Cleo', 29, 'f', 23352)
, ('Simm', 31, 'm', 62445)
, ('Mona', 37, 'f', 23261)
, ('Feng', 44, 'f', 64335)
, ('King', 57, 'm', 12225)
AS source (name, age, gender, zip)
);
I would like to update the tmp_target table by taking 5 rows at random from the tmp_source table for the column(s) I'm interested in. For example, maybe I want to replace all the names with 5 random names from tmp_source, or maybe I want to replace the names and the ages.
My first attempt was this:
UPDATE tmp_target t SET t.name = s.name FROM tmp_source s;
However, when I examine the target table, I notice that quite a few of the names are duplicated, usually in pairs. As well, Snowflake gives me number of rows updated: 5 as well as number of multi-joined rows updated: 5. I believe this is due to the non-deterministic nature of what's happening, possibly as noted in the Snowflake documentation on updates. Not to mention I get the nagging feeling that this is somehow horribly inefficient if the tables had many records.
Then I tried something to grab 5 random rows from the source table:
UPDATE tmp_target t SET t.name = cte.name
FROM (
WITH upd AS (SELECT name FROM tmp_source SAMPLE ROW (5 ROWS))
SELECT name FROM upd
) AS cte;
But I seem to run into the exact same issue, both when I examine the target table, and as reported by the number of multi-joined rows.
I was wondering if I can use row numbering somehow, but while I can generate row numbers in the subquery, I don't know how to do that in the SET part of the outside query.
I want to add that neither table has any identifiers or indexes that can be used, and I'm looking for a solution that wouldn't require any.
I would very much appreciate it if anyone can provide solutions or ideas that are as clean and tidy as possible, with some consideration given to efficiency (imagine a target table of 100K rows and a source table of 10M rows).
Thank you!
I like the two answers already provided, but let me give you a simple answer to solve the simple case:
UPDATE tmp_target t
SET t.name = (
select array_agg(s.name) possible_names
from tmp_source s
)[uniform(0, 9, random())]
;
The secret of this solution is building an array of possible values, and choosing one at random for each updated row.
Update: Now with a JavaScript UDF that will help us choose each name from source only once
create or replace function incremental_thing()
returns float
language javascript
as
$$
if (typeof(inc) === "undefined") inc = 0;
return inc++;
$$
;
UPDATE tmp_target t
SET t.name = (
select array_agg(s.name) within group (order by random())
from tmp_source s
)[incremental_thing()::integer]
;
Note that the JS UDF returns an incremental value each time it’s called, and that helps me choose the next value from a sorted array to use on an update.
Since the value is incremented inside the JS UDF, this will work as long as there's only one JS environment involved. To force single-node processing and avoid parallelism, choose an XS warehouse and test.
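The difference between the two updates above is sampling with replacement versus without. A minimal Python sketch of that difference, assuming the ten source names from the question; the fixed seed is a hypothetical choice made only so the run is repeatable:

```python
import random

source_names = ['Cory', 'Fred', 'Mimi', 'Matt', 'Olga',
                'Cleo', 'Simm', 'Mona', 'Feng', 'King']
target_row_count = 5

rng = random.Random(42)  # hypothetical seed, for repeatability only

# Like indexing the array with uniform(0, 9, random()): each target row
# draws independently, so the same name can be picked more than once.
with_replacement = [rng.choice(source_names) for _ in range(target_row_count)]

# Like ordering the array randomly and reading positions 0, 1, 2, ...:
# every target row receives a different source name.
shuffled = source_names[:]
rng.shuffle(shuffled)
without_replacement = shuffled[:target_row_count]

print(with_replacement)
print(without_replacement)
```

The second list never contains repeats, which is the property the incrementing UDF is meant to provide.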
Two examples follow. The first uses a temporary table to hold the data joined on a row number; the second does everything in one query. Note that I used UPPER and lower case strings to make sure the records were being updated the way I wanted.
CREATE OR REPLACE TEMPORARY TABLE tmp_target AS (
SELECT * FROM VALUES
('John', 43, 'm', 17363)
, ('Mark', 21, 'm', 16354)
, ('Jean', 25, 'f', 74615)
, ('Sara', 63, 'f', 26531)
, ('Alyx', 32, 'f', 42365)
AS target (name, age, gender, zip)
);
CREATE OR REPLACE TEMPORARY TABLE tmp_source AS (
SELECT * FROM VALUES
('CORY', 42, 'M', 15156)
, ('FRED', 51, 'M', 71451)
, ('MIMI', 22, 'F', 45624)
, ('MATT', 61, 'M', 12734)
, ('OLGA', 19, 'F', 52462)
, ('CLEO', 29, 'F', 23352)
, ('SIMM', 31, 'M', 62445)
, ('MONA', 37, 'F', 23261)
, ('FENG', 44, 'F', 64335)
, ('KING', 57, 'M', 12225)
AS source (name, age, gender, zip)
);
CREATE OR REPLACE TEMPORARY TABLE t1 as (
with src as (
SELECT tmp_source.*, row_number() over (order by 1) tmp_id
FROM tmp_source SAMPLE ROW (5 ROWS)),
tgt as (
SELECT tmp_target.*, row_number() over (order by 1) tmp_id
FROM tmp_target SAMPLE ROW (5 ROWS))
SELECT src.name as src_name,
src.age as src_age,
src.gender as src_gender,
src.zip as src_zip,
src.tmp_id as tmp_id,
tgt.name as tgt_name,
tgt.age as tgt_age,
tgt.gender as tgt_gender,
tgt.zip as tgt_zip
FROM src, tgt
WHERE src.tmp_id = tgt.tmp_id);
UPDATE tmp_target a
SET a.name = b.src_name,
a.gender = b.src_gender
FROM (SELECT * FROM t1) b
WHERE a.name = b.tgt_name
AND a.age = b.tgt_age
AND a.gender = b.tgt_gender
AND a.zip = b.tgt_zip;
UPDATE tmp_target a
SET a.name = b.src_name,
a.gender = b.src_gender
FROM (
with src as (
SELECT tmp_source.*, row_number() over (order by 1) tmp_id
FROM tmp_source SAMPLE ROW (5 ROWS)),
tgt as (
SELECT tmp_target.*, row_number() over (order by 1) tmp_id
FROM tmp_target SAMPLE ROW (5 ROWS))
SELECT src.name as src_name,
src.age as src_age,
src.gender as src_gender,
src.zip as src_zip,
src.tmp_id as tmp_id,
tgt.name as tgt_name,
tgt.age as tgt_age,
tgt.gender as tgt_gender,
tgt.zip as tgt_zip
FROM src, tgt
WHERE src.tmp_id = tgt.tmp_id) b
WHERE a.name = b.tgt_name
AND a.age = b.tgt_age
AND a.gender = b.tgt_gender
AND a.zip = b.tgt_zip;
At a first pass, this is all that came to mind. I'm not sure if it suits your example perfectly, since it involves reloading the table.
It should be comparably performant to any other solution that uses a generated rownum. At least to my knowledge, in Snowflake, an update is no more performant than an insert (at least in this case where you're touching every record, and every micropartition, regardless).
INSERT OVERWRITE INTO tmp_target
with target as (
select
age,
gender,
zip,
row_number() over (order by 1) rownum
from tmp_target
)
,source as (
select
name,
row_number() over (order by 1) rownum
from tmp_source
SAMPLE ROW (5 ROWS)
)
SELECT
s.name,
t.age,
t.gender,
t.zip
from target t
join source s on t.rownum = s.rownum;

SnowFlake function doesn't return data and errors out

I have a function which accepts 3 varchar inputs and returns 1 varchar output.
The function works quite well in SQL Server but not in Snowflake. I don't see what's wrong. Everything seems to be fine.
Note: it works fine when a string literal is passed, but not when a column is passed. I am not sure why it isn't accepting multiple rows.
SELECT Get_Status_Descriptiontest('can', '1', '2')
Below is the query:
with cte as (
select 'CAN' as a, 1 as b
union
select 'Ban' as a, 2 as b
)
SELECT Get_Status_Descriptiontest(a, '1', '2'), *
FROM cte;
Gives the error
SQL compilation error: Unsupported subquery type cannot be evaluated
This is the example function:
create or replace function Get_Status_Descriptiontest(Status_Code1 varchar, Status_Code2 varchar , Status_Code3 varchar )
returns varchar
as--to get sume of workdays in a given date range
$$
WITH cte AS (
SELECT 'can' DRKY, '1' DRSY, '2' DRRT, 'output1' DRDL01
)
SELECT TRIM(DRDL01)
FROM cte
WHERE TRIM(DRKY)= TRIM(Status_Code1) and TRIM(DRSY) = TRIM(Status_Code2) AND TRIM(DRRT) = TRIM(Status_Code3)
$$
;
with the real function looking like:
create or replace function Get_Status_Description(Status_Code1 varchar, Status_Code2 varchar , Status_Code3 varchar ,brand varchar)
returns varchar as
$$
SELECT TRIM(DRDL01)
FROM DB.Schema.F0005
WHERE TRIM(DRKY)= TRIM(Status_Code1) and TRIM(DRSY) = TRIM(Status_Code2)
AND TRIM(DRRT) = TRIM(Status_Code3) AND brand='TPH'
$$;
The error message is telling you that you are doing a correlated subquery of a type that Snowflake does not support, although SQL Server does.
Now that your question has been edited and the context is more visible, I can see the problem is as noted above.
Firstly, even if this worked you would get no results, because you are comparing 'can' to 'CAN' and in Snowflake string comparisons are case-sensitive.
Secondly your data CTE can be rewritten as
WITH cte AS (
SELECT * FROM VALUES ('CAN', 1), ('Ban', 2) v(a, b)
)
which makes adding more rows simpler.
A general rule is not to call functions in the filters, so the TRIMs should be done in your CTEs:
WITH cte AS (
SELECT TRIM(a) AS a, b FROM VALUES ('CAN', 1), ('Ban', 2) v(a, b)
)
and
WITH cte AS (
SELECT TRIM(column1) AS drky
,TRIM(column2) AS drsy
,TRIM(column3) AS drrt
,TRIM(column4) AS drdl01
FROM VALUES ('can', '1', '2', 'output1')
)
and thus this would work, if we ignore that there is no matching data:
WITH cte AS (
SELECT TRIM(a) AS a, b FROM VALUES ('CAN', 1), ('Ban', 2) v(a, b)
), Status_Descriptiontest AS (
SELECT TRIM(column1) AS drky
,TRIM(column2) AS drsy
,TRIM(column3) AS drrt
,TRIM(column4) AS drdl01
FROM VALUES ('can', '1', '2', 'output1')
)
SELECT sd.drdl01 as description,
c.*
FROM cte AS c
JOIN Status_Descriptiontest AS sd
ON sd.DRKY = c.a and sd.DRSY = '1' AND sd.DRRT = '2';
So, switching to ILIKE to get an exact match without case sensitivity:
WITH cte AS (
SELECT TRIM(a) AS a, b FROM VALUES ('CAN', 1), ('Ban', 2) v(a, b)
), Status_Descriptiontest AS (
SELECT TRIM(column1) AS drky
,TRIM(column2) AS drsy
,TRIM(column3) AS drrt
,TRIM(column4) AS drdl01
FROM VALUES ('can', '1', '2', 'output1')
)
SELECT sd.drdl01 as description,
c.*
FROM cte AS c
JOIN Status_Descriptiontest AS sd
ON sd.DRKY ilike c.a and sd.DRSY ilike '1' AND sd.DRRT ilike '2';
gives:
DESCRIPTION A B
output1 CAN 1
In effect, anything you want to do in the SQL function you can do in a CTE, so you might as well do that.
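The join above amounts to a keyed lookup with trimmed, case-insensitive keys. A rough Python model of it, using the sample rows from this answer, is below. It assumes the ILIKE patterns contain no wildcards, so ILIKE behaves like case-insensitive equality; the helper name is hypothetical.

```python
def normalise(value):
    """Mimic TRIM plus a case-insensitive comparison."""
    return str(value).strip().lower()


# (DRKY, DRSY, DRRT, DRDL01) rows of the lookup data.
status_rows = [('can', '1', '2', 'output1')]
# (a, b) rows that the function was being called on.
input_rows = [('CAN', 1), ('Ban', 2)]

# Build the lookup once instead of evaluating a subquery per input row.
lookup = {
    (normalise(drky), normalise(drsy), normalise(drrt)): drdl01.strip()
    for drky, drsy, drrt, drdl01 in status_rows
}

joined = []
for a, b in input_rows:
    key = (normalise(a), '1', '2')
    if key in lookup:  # inner join: rows without a match are dropped
        joined.append((lookup[key], a, b))

print(joined)  # [('output1', 'CAN', 1)]
```

Only the 'CAN' row survives, matching the single row in the result table above.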

sort float numbers as a natural numbers in SQL Server

I asked the same question for jQuery here; now I have the same question for a SQL Server query. This time the values are not comma-separated: each one is a separate row in the database.
I have separate rows containing float-like numbers:
Name
K1.1
K1.10
K1.2
K3.1
K3.14
K3.5
and I want to sort these numbers like this:
Name
K1.1
K1.2
K1.10
K3.1
K3.5
K3.14
In my case, the digits after the decimal point should be treated as a natural number, so 1.2 counts as '2' and 1.10 counts as '10'; that is why 1.2 comes before 1.10.
You can remove the 'K' because it is common to almost all values. A suggestion or example would be great, thanks.
You can use PARSENAME (which is more of a hack) or String functions like CHARINDEX , STUFF, LEFT etc to achieve this.
Input data
;WITH CTE AS
(
SELECT 'K1.1' Name
UNION ALL SELECT 'K1.10'
UNION ALL SELECT 'K1.2'
UNION ALL SELECT 'K3.1'
UNION ALL SELECT 'K3.14'
UNION ALL SELECT 'K3.5'
)
Using PARSENAME
SELECT Name,PARSENAME(REPLACE(Name,'K',''),2),PARSENAME(REPLACE(Name,'K',''),1)
FROM CTE
ORDER BY CONVERT(INT,PARSENAME(REPLACE(Name,'K',''),2)),
CONVERT(INT,PARSENAME(REPLACE(Name,'K',''),1))
Using String Functions
SELECT Name,LEFT(Name,CHARINDEX('.',Name) - 1), STUFF(Name,1,CHARINDEX('.',Name),'')
FROM CTE
ORDER BY CONVERT(INT,REPLACE((LEFT(Name,CHARINDEX('.',Name) - 1)),'K','')),
CONVERT(INT,STUFF(Name,1,CHARINDEX('.',Name),''))
Output
K1.1 K1 1
K1.2 K1 2
K1.10 K1 10
K3.1 K3 1
K3.5 K3 5
K3.14 K3 14
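Both queries above order by two integer keys: the number before the dot, then the number after it. The same idea as a Python sort key, assuming every value has an optional 'K' prefix and exactly one dot (the function name is hypothetical):

```python
def natural_key(name, prefix='K'):
    """Split 'K1.10' into the integer pair (1, 10) for sorting."""
    if name.startswith(prefix):
        name = name[len(prefix):]
    major, minor = name.split('.')
    return int(major), int(minor)


names = ['K1.1', 'K1.10', 'K1.2', 'K3.1', 'K3.14', 'K3.5']
ordered = sorted(names, key=natural_key)
print(ordered)  # ['K1.1', 'K1.2', 'K1.10', 'K3.1', 'K3.5', 'K3.14']
```

Comparing tuples of integers is what makes 2 sort before 10, whereas comparing the raw strings puts '10' before '2'.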
This works if there is always one char before the first number and the number is not higher than 9:
SELECT name
FROM YourTable
ORDER BY CAST(SUBSTRING(name,2,1) AS INT), --Get the number before dot
CAST(RIGHT(name,LEN(name)-CHARINDEX('.',name)) AS INT) --Get the number after the dot
Perhaps more verbose, but this should do the trick:
declare @source as table(num varchar(12));
insert into @source(num) values('K1.1'),('K1.10'),('K1.2'),('K3.1'),('K3.14'),('K3.5');
-- create helper table
with data as
(
select num,
cast(SUBSTRING(replace(num, 'K', ''), 1, CHARINDEX('.', num) - 2) as int) as [first],
cast(SUBSTRING(replace(num, 'K', ''), CHARINDEX('.', num), LEN(num)) as int) as [second]
from @source
)
-- Select and order accordingly
select num
from data
order by [first], [second]
sqlfiddle:
http://sqlfiddle.com/#!6/a9b06/2
A shorter solution, which orders only by the part after the dot:
Select Num
from yourtable
order by cast((Parsename(Num, 1) ) as Int)

Moving Average PER TICKER for each day

I am trying to calculate the, say, 3 day moving average (in reality 30 day) volume for stocks.
I'm trying to get the average of the last 3 date entries (rather than today minus 3 days). I've been trying to do something with ROW_NUMBER in SQL Server 2012, but with no success. Can anyone help out? Below is a template schema and my rubbish attempt at SQL. I have tried various incarnations of the SQL below with GROUP BYs, but it is still not working. Many thanks!
select dt_eod, ticker, volume
from
(
select dt_eod, ticker, avg(volume)
row_number() over(partition by dt_eod order by max_close desc) rn
from mytable
) src
where rn >= 1 and rn <= 3
order by dt_eod
Sample Schema:
CREATE TABLE yourtable
([dt_date] int, [ticker] varchar(1), [volume] int);
INSERT INTO yourtable
([dt_date], [ticker], [volume])
VALUES
(20121201, 'A', 5),
(20121201, 'B', 7),
(20121201, 'C', 6),
(20121202, 'A', 10),
(20121202, 'B', 8),
(20121202, 'C', 7),
(20121203, 'A', 10),
(20121203, 'B', 87),
(20121203, 'C', 74),
(20121204, 'A', 10),
(20121204, 'B', 86),
(20121204, 'C', 67),
(20121205, 'A', 100),
(20121205, 'B', 84),
(20121205, 'C', 70),
(20121206, 'A', 258),
(20121206, 'B', 864),
(20121206, 'C', 740);
Three day average for each row:
with top3Values as
(
select t.ticker, t.dt_date, top3.volume
from yourtable t
outer apply
(
select top 3 top3.volume
from yourtable top3
where t.ticker = top3.ticker
and t.dt_date >= top3.dt_date
order by top3.dt_date desc
) top3
)
select ticker, dt_date, ThreeDayVolume = avg(volume)
from top3Values
group by ticker, dt_date
order by ticker, dt_date
SQL Fiddle demo.
Latest value:
with tickers as
(
select distinct ticker from yourtable
), top3Values as
(
select t.ticker, top3.volume
from tickers t
outer apply
(
select top 3 top3.volume
from yourtable top3
where t.ticker = top3.ticker
order by dt_date desc
) top3
)
select ticker, ThreeDayVolume = avg(volume)
from top3Values
group by ticker
order by ticker
SQL Fiddle demo.
Realistically you wouldn't need to create the tickers CTE for the second query, as you'd be basing this on a [ticker] table, and you'd probably have some sort of date parameter in the query, but hopefully this will get you on the right track.
You mentioned SQL 2012, which means that you can leverage a much simpler paradigm.
select dt_date, ticker, avg(1.0*volume) over (
partition by ticker
order by dt_date
ROWS BETWEEN 2 preceding and current row
)
from yourtable
I find this much more transparent about what is actually being accomplished.
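To sanity-check the windowed query, here is a small Python model of "average of the current row and the two rows before it, per ticker". It is a sketch of what ROWS BETWEEN 2 PRECEDING AND CURRENT ROW computes, not of how SQL Server executes it; the function name is hypothetical and the data is a subset of the sample schema.

```python
from collections import defaultdict, deque


def moving_average(rows, window=3):
    """rows: (dt_date, ticker, volume) tuples.

    Returns (dt_date, ticker, average) with the average taken over the
    last `window` rows of the same ticker, current row included.
    """
    recent = defaultdict(lambda: deque(maxlen=window))  # per-ticker buffer
    result = []
    for dt_date, ticker, volume in sorted(rows):  # date order within ticker
        recent[ticker].append(volume)  # the oldest value drops out when full
        values = recent[ticker]
        result.append((dt_date, ticker, sum(values) / len(values)))
    return result


sample = [
    (20121201, 'A', 5), (20121202, 'A', 10), (20121203, 'A', 10),
    (20121204, 'A', 10), (20121205, 'A', 100), (20121206, 'A', 258),
    (20121201, 'B', 7), (20121202, 'B', 8), (20121203, 'B', 87),
]
averages = moving_average(sample)
for row in averages:
    print(row)
```

As in the SQL, the first two rows of each ticker are averaged over fewer than three values, because there are not yet two preceding rows.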
You may want to look at yet another technique that is presented here: SQL-Server Moving Averages set-based algorithm with flexible window-periods and no self-joins.
The algorithm is quite speedy (much faster than APPLY and does not degrade in performance like APPLY does as data-points-window expands), easily adaptable to your requirement, works with pre-SQL2012, and overcomes the limitations of SQL-2012's windowing functionality that requires hard-coding of window-width in the OVER/PARTITION-BY clause.
For a stock-market type application with moving price-averages, it is a common requirement to allow a user to vary the number of data-points included in the average (from a UI selection, like allowing a user to choose 7 day, 30 day, 60 day, etc), and SQL-2012's OVER clause cannot handle this variable partition-width requirement without dynamic SQL.

select statement to return additional row

I am creating reports in SSIS using datasets and have the following SQL requirement:
The sql is returning three rows:
a
b
c
Is there any way I can have the SQL return an additional row without adding data to the table?
Thanks in advance,
Bruce
select MyCol from MyTable
union all
select 'something' as MyCol
You can use a UNION ALL to include a new row.
SELECT *
FROM yourTable
UNION ALL
SELECT 'newRow'
The number of columns needs to be the same in the top query and the bottom one. So if your first query has one column, the second also needs one column.
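The column-count rule is easy to try out. The sketch below uses Python's built-in sqlite3 module with an in-memory database as a stand-in for SQL Server, which is an assumption: the UNION ALL behaviour shown is the same, but the error text differs between engines. The table and column names follow the first answer.

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE MyTable (MyCol TEXT)')
conn.executemany('INSERT INTO MyTable (MyCol) VALUES (?)',
                 [('a',), ('b',), ('c',)])

# One column on each side of UNION ALL: the literal row is appended.
rows = conn.execute(
    "SELECT MyCol FROM MyTable UNION ALL SELECT 'something' AS MyCol"
).fetchall()
print(rows)

# Two columns on the right-hand side: the statement is rejected.
mismatch_error = None
try:
    conn.execute("SELECT MyCol FROM MyTable UNION ALL SELECT 'x', 'y'")
except sqlite3.Error as exc:
    mismatch_error = exc
print(mismatch_error)
```

The table itself is unchanged afterwards; the extra row exists only in the query result.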
If you need to add multiple values there is a stranger syntax available:
declare @Footy as VarChar(16) = 'soccer'
select 'a' as Thing, 42 as Thingosity -- Your original SELECT goes here.
union all
select *
from ( values ( 'b', 2 ), ( 'c', 3 ), ( @Footy, Len( @Footy ) ) ) as Placeholder ( Thing, Thingosity )

Resources