Postgres arrays: update multiple non-contiguous array indices in one query

Consider a table temp1
create temporary table temp1 (
id integer,
days integer[]
);
insert into temp1 values (1, '{}');
And another table temp2
create temporary table temp2(
id integer
);
insert into temp2 values (2);
insert into temp2 values (5);
insert into temp2 values (6);
I want to use the temp2 id values as indices into the days array of temp1, i.e. I want to update
days[index] = 99 where index is the id value from temp2. I want to accomplish this in a single query or, if that's not possible, in the most optimal way.
Here is what I am trying, and it updates only one index, not all of them. Is it possible to update multiple indices of the array? I understand it can be done using a loop, but I was hoping a more optimized solution is possible:
update temp1
set days[temp2.id] = 99
from temp2;
select * from temp1;
id | days
----+------------
1 | [2:2]={99}
(1 row)

TL;DR: Don't use arrays for this. Really. Just because you can doesn't mean you should.
PostgreSQL's arrays are really not designed for in-place modification; they're data values, not dynamic data structures. I don't think what you're trying to do makes much sense, and suggest you re-evaluate your schema now before you dig yourself into a deeper hole.
You can't just construct a single null-padded array value from temp2 and do a slice-update because that'll overwrite values in days with nulls. There is no "update only non-null array elements" operator.
So we have to do this by decomposing the array into a set, modifying it, recomposing it into an array.
To solve that what I'm doing is:
Taking all rows from temp2 and adding the associated value, to produce (index, value) pairs
Doing a generate_series over the range from 1 to the highest index on temp2 and doing a left join on it, so there's one row for each index position
Left joining all that on the unnested original array and coalescing away nulls
... then doing an array_agg ordered by index to reconstruct the array.
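For intuition, the same decompose → pad → overwrite → recompose idea can be sketched in plain Python (purely illustrative; the function name and the dict-of-updates shape are my own, not part of the SQL solution):

```python
def apply_sparse_updates(days, new_vals):
    """Mimic the SQL approach: pad the original array out to the highest
    updated index with NULLs (None), then overwrite only the listed
    positions. SQL arrays are 1-based, so index 1 maps to list slot 0."""
    length = max(len(days), max(new_vals, default=0))
    result = list(days) + [None] * (length - len(days))  # pad with NULLs
    for index, value in new_vals.items():
        result[index - 1] = value  # 1-based -> 0-based
    return result

# days = {42,42,42}, set indices 2, 5, 6 to 99:
apply_sparse_updates([42, 42, 42], {2: 99, 5: 99, 6: 99})
# -> [42, 99, 42, None, 99, 99]
```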
With a more realistic/useful starting array state:
create temporary table temp1 (
id integer primary key,
days integer[]
);
insert into temp1 values (1, '{42,42,42}');
Development step 1: index/value pairs
First associate values with each index:
select id, 99 from temp2;
Development step 2: add nulls for missing indexes
then join on generate_series to add entries for missing indexes:
SELECT gs.i, temp2values.newval
FROM (
SELECT id AS newvalindex, 99 as newval FROM temp2
) temp2values
RIGHT OUTER JOIN (
SELECT i FROM generate_series(1, (select max(id) from temp2)) i
) gs
ON (temp2values.newvalindex = gs.i);
Development step 3: merge the original array values in
then join that on the unnested original array. You can use UNNEST ... WITH ORDINALITY for this in PostgreSQL 9.4, but I'm guessing you're not running that yet so I'll show the old approach with row_number. Note the use of a full outer join and the change to the outer bound of the generate_series to handle the case where the original values array is longer than the highest index in the new values list:
SELECT gs.i, coalesce(temp2values.newval, originals.val) AS val
FROM (
SELECT id AS newvalindex, 99 as newval FROM temp2
) temp2values
RIGHT OUTER JOIN (
SELECT i FROM generate_series(1, (select greatest(max(temp2.id), array_length(days,1)) from temp2, temp1 group by temp1.id)) i
) gs
ON (temp2values.newvalindex = gs.i)
FULL OUTER JOIN (
SELECT row_number() OVER () AS index, val
FROM temp1, LATERAL unnest(days) val
WHERE temp1.id = 1
) originals
ON (originals.index = gs.i)
ORDER BY gs.i;
This produces something like:
i | val
---+----------
1 | 42
2 | 99
3 | 42
4 |
5 | 99
6 | 99
(6 rows)
Development step 4: Produce the desired new array value
so now we just need to turn it back into an array by dropping the trailing ORDER BY clause and instead ordering inside an array_agg call:
SELECT array_agg(coalesce(temp2values.newval, originals.val) ORDER BY gs.i)
FROM (
SELECT id AS newvalindex, 99 as newval FROM temp2
) temp2values
RIGHT OUTER JOIN (
SELECT i FROM generate_series(1, (select greatest(max(temp2.id), array_length(days,1)) from temp2, temp1 group by temp1.id)) i
) gs
ON (temp2values.newvalindex = gs.i)
FULL OUTER JOIN (
SELECT row_number() OVER () AS index, val
FROM temp1, LATERAL unnest(days) val
WHERE temp1.id = 1
) originals
ON (originals.index = gs.i);
with a result like:
array_agg
-----------------------
{42,99,42,NULL,99,99}
(1 row)
Final query: Use it in an UPDATE
UPDATE temp1
SET days = newdays
FROM (
SELECT array_agg(coalesce(temp2values.newval, originals.val) ORDER BY gs.i)
FROM (
SELECT id AS newvalindex, 99 as newval FROM temp2
) temp2values
RIGHT OUTER JOIN (
SELECT i FROM generate_series(1, (select greatest(max(temp2.id), array_length(days,1)) from temp2, temp1 group by temp1.id)) i
) gs
ON (temp2values.newvalindex = gs.i)
FULL OUTER JOIN (
SELECT row_number() OVER () AS index, val
FROM temp1, LATERAL unnest(days) val
WHERE temp1.id = 1
) originals
ON (originals.index = gs.i)
) calc_new_days(newdays)
WHERE temp1.id = 1;
Note, however, that this only works for a single entry in temp1.id, and I've specified temp1.id twice in the query: once inside the query that generates the new array value, and once in the update predicate.
To avoid that, you'd need a key in temp2 that references temp1.id and you'd need to make some changes to allow the generated padding rows to have the correct id value.
I hope this convinces you that you should probably not be using arrays for what you're doing, because it's horrible.

Related

Getting non-deterministic results from WITH RECURSIVE cte

I'm trying to create a recursive CTE that traverses all the records for a given ID, and does some operations between ordered records. Let's say I have customers at a bank who get charged a uniquely identifiable fee, and a customer can pay that fee in any number of installments:
WITH recursive payments (
id
, index
, fees_paid
, fees_owed
)
AS (
SELECT id
, index
, fees_paid
, fee_charged
FROM table
WHERE index = 1
UNION ALL
SELECT t.id
, t.index
, t.fees_paid
, p.fees_owed - p.fees_paid
FROM table t
JOIN payments p
ON t.id = p.id
AND t.index = p.index + 1
)
SELECT *
FROM payments
ORDER BY 1,2;
The join logic seems sound, but when I join the output of this query to the source table, I'm getting non-deterministic and incorrect results.
This is my first foray into Snowflake's recursive CTEs. What am I missing in the intermediate result logic that is leading to the non-determinism here?
I assume this is edited code, because in the anchor of your CTE you select a fourth column, fee_charged, which does not exist in the CTE's column list, and then in the recursion you don't sum the fees paid; basically the logic seems rather strange.
So creating some random data, that has two different id streams to recurse over:
create or replace table data (id number, index number, val text);
insert into data
select * from values (1,1,'a'),(2,1,'b')
,(1,2,'c'), (2,2,'d')
,(1,3,'e'), (2,3,'f')
v(id, index, val);
Now, altering your CTE just a little bit to concatenate the strings together:
WITH RECURSIVE payments AS
(
SELECT id
, index
, val
FROM data
WHERE index = 1
UNION ALL
SELECT t.id
, t.index
, p.val || t.val as val
FROM data t
JOIN payments p
ON t.id = p.id
AND t.index = p.index + 1
)
SELECT *
FROM payments
ORDER BY 1,2;
we get:
ID INDEX VAL
1 1 a
1 2 ac
1 3 ace
2 1 b
2 2 bd
2 3 bdf
Which is exactly what I would expect. So how this relates to your "it gets strange when I join to other stuff" is: either the output of your CTE is not what you expect it to be, or your join to the other tables is not working as you expect, or there is a bug in Snowflake.
Which all comes down to this: if the CTE results are exactly what you expect, materialize them into a table and join that to your other table, both to rule out some CTE-vs-JOIN interaction and to debug why the join is not working.
But if the CTE output is not what you expect, then let's debug that instead.
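Since the point here is that the recursion itself is sound, the same walk-through can be reproduced with SQLite's recursive CTEs from Python (a sketch: the column is renamed to idx because INDEX is a reserved word in SQLite):

```python
import sqlite3

# Rebuild the demo data and run the string-concatenating recursive CTE.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE data (id INTEGER, idx INTEGER, val TEXT);
    INSERT INTO data VALUES
        (1,1,'a'),(2,1,'b'),(1,2,'c'),(2,2,'d'),(1,3,'e'),(2,3,'f');
""")
rows = conn.execute("""
    WITH RECURSIVE payments AS (
        SELECT id, idx, val FROM data WHERE idx = 1
        UNION ALL
        SELECT t.id, t.idx, p.val || t.val
        FROM data t
        JOIN payments p
          ON t.id = p.id AND t.idx = p.idx + 1
    )
    SELECT id, idx, val FROM payments ORDER BY 1, 2
""").fetchall()
# rows -> (1,1,'a'), (1,2,'ac'), (1,3,'ace'), (2,1,'b'), (2,2,'bd'), (2,3,'bdf')
```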

IN statement inconsistency with PRIMARY KEY

So I have a simple table called temp that can be created by:
CREATE TABLE temp (value int, id int not null primary key);
INSERT INTO temp
VALUES(0,1),
(0,2),
(0,3),
(0,4),
(1,5),
(1,6),
(1,7),
(1,8);
I have a second table temp2 that can be created by:
CREATE TABLE temp2 (value int, id int);
INSERT INTO temp2
VALUES(0,1),
(0,2),
(0,3),
(0,4),
(1,5),
(1,6),
(1,7),
(1,8);
The only difference between temp and temp2 is that the id field is the primary key in temp, and temp2 has no primary key. I'm not sure how, but I am getting differing results with the following query:
select * from temp
where id in (
select id
from (
select id, ROW_NUMBER() over (partition by value order by value) rownum
from temp
) s1
where rownum = 1
)
This is the result for temp:
value id
----------- -----------
0 1
0 2
0 3
0 4
1 5
1 6
1 7
1 8
and this is what I get when temp is replaced by temp2 (THE CORRECT RESULT):
value id
----------- -----------
0 1
1 5
When running the inner-most query (s1), the expected results are retrieved:
id rownum
----------- --------------------
1 1
2 2
3 3
4 4
5 1
6 2
7 3
8 4
When just running the in statement query on both, I also get the expected result:
id
-----------
1
5
I cannot figure out what the reason for this could possibly be. Is this a bug?
Notes: temp2 was created with a simple select * into temp2 from temp. I am running SQL Server 2008. My apologies if this is a known glitch. It is difficult to search for this since it requires an in statement. An "equivalent" query that uses a join does produce the correct results on both tables.
Edit: dbfiddle showing the differences:
Unexpected Results
Expected Results
I can't specifically answer your question, but changing the ORDER BY fixes the problem. partition by value order by value doesn't really make sense, and it looks like that is what is "fooling" SQL Server: as you're partitioning the rows by the same value you're ordering by, every row could be "row number 1", since any of them could be at the start. Don't forget, a table is an unordered heap, even when it has a primary key (clustered or not).
If you change your ORDER BY to id instead the problem goes away.
SELECT *
FROM temp2 t2
WHERE t2.id IN (SELECT s1.id
FROM (SELECT sq.id,
ROW_NUMBER() OVER (PARTITION BY sq.value ORDER BY sq.id) AS rownum
FROM temp2 sq) s1
WHERE s1.rownum = 1);
In fact, changing the ORDER BY clause to anything else fixes the problem:
SELECT *
FROM temp2 t2
WHERE t2.id IN (SELECT s1.id
FROM (SELECT sq.id,
ROW_NUMBER() OVER (PARTITION BY sq.value ORDER BY (SELECT NULL)) AS rownum
FROM temp2 sq) s1
WHERE s1.rownum = 1);
So the problem is that you are using the same expression (column) for both your PARTITION BY and ORDER BY clauses, meaning that any of those rows could be row number 1; thus all are returned. It doesn't make sense for the two to be the same, so they should be different.
Still, this problem does persist in SQL Server 2017 (and I suspect 2019) so you might want to raise a support ticket with them anyway (but as you're using 2008 don't expect it to get fixed, as your support is about to end).
As comments can be deleted without notice I wanted to add #scsimon's comment and my response:
scsimon: Interesting. Changing rownum = 2 gives expected results without changing order by. I think it's a bug.
Larnu: I agree at #scsimon. I suspect that changing the WHERE to s1.rownum = 2 effectively forces the data engine to actually determine the values of rownum, rather than assume every row is "equal"; as if that were the case none would be returned.
Even so, changing the WHERE to s1.rownum = 2 still amounts to "return a random row" if the PARTITION BY and ORDER BY clauses are the same.
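A quick sanity check of the corrected ORDER BY, run on an in-memory SQLite database from Python (window functions need SQLite 3.25+). This only demonstrates the intended result; it can't reproduce the SQL Server planner quirk itself:

```python
import sqlite3

# Same data as the question's temp2 (no primary key).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE temp2 (value INT, id INT);
    INSERT INTO temp2 VALUES (0,1),(0,2),(0,3),(0,4),(1,5),(1,6),(1,7),(1,8);
""")
rows = conn.execute("""
    SELECT value, id FROM temp2
    WHERE id IN (
        SELECT id FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY value ORDER BY id) AS rownum
            FROM temp2
        ) s1
        WHERE rownum = 1
    )
    ORDER BY id
""").fetchall()
# rows -> (0, 1) and (1, 5): one row per value group
```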

postgresql: Insert two values in table b if both values are not in table a

I'm doing an assignment where I am to build an SQL database of tournament results. Players can be added by name, and whenever the database has at least two players who have not already been assigned to a match, two players should be matched against each other.
For instance, if the tables are currently empty and I add Joe as a player, then add James, the table now has two players who are also not in the matches table, so a new row is created in matches with their p_ids as left_player_P_id and right_player_P_id.
I thought it would be a good idea to create a function and a trigger, so that every time a row is added to the players table, the SQL code would run and create the matches row as needed. I am open to other ways of doing this.
I've tried multiple approaches, including "SQL - Insert if the number of rows is greater than" and "Using IF ELSE statement based on Count to execute different Insert statements", but I am now at a loss.
Problematic code:
This approach returns a syntax error (IF is PL/pgSQL, not plain SQL, so it can't run as a standalone statement).
IF ((select count(*) from players_not_in_any_matches) >= 2)
begin
insert into matches values (
(select p_id from players_not_in_any_matches limit 1),
(select p_id from players_not_in_any_matches limit 1 offset 1)
)
end;
Alternative approach (still problematic code):
This approach seems more promising (but less readable). However, it inserts even when there are fewer than two unmatched players, because NOT EXISTS over the OFFSET 2 subquery is also true when the view returns zero or one rows.
insert into matches (left_player_p_id, right_player_p_id)
select
(select p_id from players_not_in_any_matches limit 1),
(select p_id from players_not_in_any_matches limit 1 offset 1)
where not exists (
select * from players_not_in_any_matches offset 2
);
Tables
CREATE TABLE players (
p_id serial PRIMARY KEY,
full_name text
);
CREATE TABLE matches(
left_player_P_id integer REFERENCES players,
right_player_P_id integer REFERENCES players,
winner integer REFERENCES players
);
Views
-- view for getting all players not currently assigned to a match
create view players_not_in_any_matches as
select * from players
where p_id not in (
select left_player_p_id from matches
) and
p_id not in (
select right_player_p_id from matches
);
Try:
insert into matches (left_player_p_id, right_player_p_id)
select p1.p_id, p2.p_id
from players p1
join players p2
on p1.p_id <> p2.p_id
and not exists(
select 1 from matches m
where p1.p_id in (m.left_player_p_id, m.right_player_p_id)
)
and not exists(
select 1 from matches m
where p2.p_id in (m.left_player_p_id, m.right_player_p_id)
)
limit 1
Anti joins (not-exists operators) in the above query could be further simplified a bit using LEFT JOINs:
insert into matches (left_player_p_id, right_player_p_id)
select p1.p_id, p2.p_id
from players p1
join players p2
on p1.p_id <> p2.p_id
left join matches m1
on p1.p_id in (m1.left_player_p_id, m1.right_player_p_id)
left join matches m2
on p2.p_id in (m2.left_player_p_id, m2.right_player_p_id)
where m1.left_player_p_id is null
and m2.left_player_p_id is null
limit 1
but in my opinion the former query is more readable, while the latter one looks tricky.
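A runnable sketch of the first query against SQLite from Python (assumptions on my part: serial becomes INTEGER PRIMARY KEY, and p1.p_id < p2.p_id replaces <> so each pair is generated only once):

```python
import sqlite3

# Two unmatched players -> exactly one pairing row should be inserted.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (p_id INTEGER PRIMARY KEY, full_name TEXT);
    CREATE TABLE matches (left_player_p_id INT, right_player_p_id INT,
                          winner INT);
    INSERT INTO players (full_name) VALUES ('Joe'), ('James');
""")
conn.execute("""
    INSERT INTO matches (left_player_p_id, right_player_p_id)
    SELECT p1.p_id, p2.p_id
    FROM players p1
    JOIN players p2 ON p1.p_id < p2.p_id
    WHERE NOT EXISTS (SELECT 1 FROM matches m
                      WHERE p1.p_id IN (m.left_player_p_id, m.right_player_p_id))
      AND NOT EXISTS (SELECT 1 FROM matches m
                      WHERE p2.p_id IN (m.left_player_p_id, m.right_player_p_id))
    LIMIT 1
""")
pairs = conn.execute("SELECT * FROM matches").fetchall()
# pairs -> one row: (1, 2, None)
```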

How to query number based SQL Sets with Ranges in SQL

What I'm looking for is a way in MSSQL to create a complex IN or LIKE clause that contains a SET of values, some of which will be ranges.
Sort of like this, there are some single numbers, but also some ranges of numbers.
EX: SELECT * FROM table WHERE field LIKE/IN '1-10, 13, 24, 51-60'
I need to find a way to do this WITHOUT having to specify every number in the ranges separately AND without having to say "field LIKE blah OR field BETWEEN blah AND blah OR field LIKE blah.
This is just a simple example but the real query will have many groups and large ranges in it so all the OR's will not work.
One fairly easy way to do this would be to load a temp table with your values/ranges:
CREATE TABLE #Ranges (ValA int, ValB int)
INSERT INTO #Ranges
VALUES
(1, 10)
,(13, NULL)
,(24, NULL)
,(51,60)
SELECT *
FROM Table t
JOIN #Ranges R
ON (t.Field = R.ValA AND R.ValB IS NULL)
OR (t.Field BETWEEN R.ValA and R.ValB AND R.ValB IS NOT NULL)
The BETWEEN won't scale that well, though, so you may want to consider expanding this to include all values and eliminating ranges.
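The range-table approach is easy to try out with SQLite from Python (the table and column names here are my own renamings of #Ranges/ValA/ValB, and `field` stands in for the filtered column):

```python
import sqlite3

# Ranges table: val_b IS NULL marks a single value, otherwise a range.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ranges (val_a INT, val_b INT);
    INSERT INTO ranges VALUES (1,10),(13,NULL),(24,NULL),(51,60);
    CREATE TABLE t (field INT);
    INSERT INTO t VALUES (5),(11),(13),(24),(25),(55),(61);
""")
matched = [r[0] for r in conn.execute("""
    SELECT t.field FROM t
    JOIN ranges r
      ON (t.field = r.val_a AND r.val_b IS NULL)
      OR (r.val_b IS NOT NULL AND t.field BETWEEN r.val_a AND r.val_b)
    ORDER BY t.field
""")]
# matched -> [5, 13, 24, 55]; 11, 25 and 61 fall outside every range
```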
You can do this with CTEs.
First, create a numbers/tally table if you don't already have one (it might be better to make it permanent instead of temporary if you are going to use it a lot):
;WITH Numbers AS
(
SELECT
1 as Value
UNION ALL
SELECT
Numbers.Value + 1
FROM
Numbers
)
SELECT TOP 1000
Value
INTO ##Numbers
FROM
Numbers
OPTION (MAXRECURSION 1000)
Then you can use a CTE to parse the comma delimited string and join the ranges with the numbers table to get the "NewValue" column which contains the whole list of numbers you are looking for:
DECLARE #TestData varchar(50) = '1-10,13,24,51-60'
;WITH CTE AS
(
SELECT
1 AS RowCounter,
1 AS StartPosition,
CHARINDEX(',',#TestData) AS EndPosition
UNION ALL
SELECT
CTE.RowCounter + 1,
EndPosition + 1,
CHARINDEX(',',#TestData, CTE.EndPosition+1)
FROM CTE
WHERE
CTE.EndPosition > 0
)
SELECT
u.Value,
u.StartValue,
u.EndValue,
n.Value as NewValue
FROM
(
SELECT
Value,
SUBSTRING(Value,1,CASE WHEN CHARINDEX('-',Value) > 0 THEN CHARINDEX('-',Value)-1 ELSE LEN(Value) END) AS StartValue,
SUBSTRING(Value,CASE WHEN CHARINDEX('-',Value) > 0 THEN CHARINDEX('-',Value)+1 ELSE 1 END,LEN(Value)- CHARINDEX('-',Value)) AS EndValue
FROM
(
SELECT
SUBSTRING(#TestData, StartPosition, CASE WHEN EndPosition > 0 THEN EndPosition-StartPosition ELSE LEN(#TestData)-StartPosition+1 END) AS Value
FROM
CTE
)t
)u INNER JOIN ##Numbers n ON n.Value BETWEEN u.StartValue AND u.EndValue
All you would need to do once you have that is query the results using an IN statement, so something like
SELECT * FROM MyTable WHERE Value IN (SELECT NewValue FROM (/*subquery from above*/)t)
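If the delimited string originates in application code anyway, the expansion can also be done before the query ever runs; a minimal Python sketch (assuming non-negative integers, so the - separator is unambiguous):

```python
def expand_ranges(spec):
    """Expand a spec like '1-10, 13, 24, 51-60' into the full set of
    integers, treating 'a-b' as an inclusive range."""
    values = set()
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            lo, hi = part.split("-")
            values.update(range(int(lo), int(hi) + 1))
        else:
            values.add(int(part))
    return values

expand_ranges("1-10, 13, 24, 51-60")
# -> 22 values: 1..10, 13, 24, 51..60
```

The resulting set can then be bound as parameters to a plain IN clause.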

Duplicated results when performing INNER JOIN

I have 2 simple tables that I would like to perform an INNER JOIN with, but the problem is that I'm getting duplicated (for the columns str1 and str2) results:
CREATE TABLE #A (Id INT, str1 nvarchar(50), str2 nvarchar(50))
insert into #A values (1, 'a', 'b')
insert into #A values (2, 'a', 'b')
CREATE TABLE #B (Id INT, str1 nvarchar(50), str2 nvarchar(50))
insert into #B values (7, 'a', 'b')
insert into #B values (8, 'a', 'b')
select * from #A a
INNER JOIN #B b ON a.str1 = b.str1 AND a.str2 = b.str2
It gave me 4 records when I really wanted 2.
What I got:
id | str1 | str2| id | str1 | str2
1 | a | b | 7 | a | b
2 | a | b | 7 | a | b
1 | a | b | 8 | a | b
2 | a | b | 8 | a | b
What I really wanted:
1 | a | b | 7 | a | b
2 | a | b | 8 | a | b
Can anyone help? I know this is achievable using a cursor and loop, but I'd like to avoid it and only use some type of JOIN if possible.
SELECT
a.id AS a_id, a.str1 AS a_str1, a.str2 AS a_str2,
b.id AS b_id, b.str1 AS b_str1, b.str2 AS b_str2
FROM
( SELECT *
, ROW_NUMBER() OVER (PARTITION BY str1, str2 ORDER BY id) AS rn
FROM #A
) a
INNER JOIN
( SELECT *
, ROW_NUMBER() OVER (PARTITION BY str1, str2 ORDER BY id) AS rn
FROM #B
) b
ON a.str1 = b.str1
AND a.str2 = b.str2
AND a.rn = b.rn ;
If you have more rows in one or the other tables for the same (str1, str2) combination, you can choose which ones will be returned by changing INNER join to either LEFT, RIGHT or FULL join.
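The ROW_NUMBER pairing join can be verified end-to-end with SQLite from Python (the temp tables #A/#B become ordinary tables a/b; window functions need SQLite 3.25+):

```python
import sqlite3

# Two rows per (str1, str2) group on each side -> pair them up by rank.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INT, str1 TEXT, str2 TEXT);
    CREATE TABLE b (id INT, str1 TEXT, str2 TEXT);
    INSERT INTO a VALUES (1,'a','b'),(2,'a','b');
    INSERT INTO b VALUES (7,'a','b'),(8,'a','b');
""")
rows = conn.execute("""
    SELECT a.id, b.id
    FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY str1, str2 ORDER BY id) rn
          FROM a) a
    JOIN (SELECT *, ROW_NUMBER() OVER (PARTITION BY str1, str2 ORDER BY id) rn
          FROM b) b
      ON a.str1 = b.str1 AND a.str2 = b.str2 AND a.rn = b.rn
    ORDER BY a.id
""").fetchall()
# rows -> (1, 7) and (2, 8): two rows instead of the 2 x 2 = 4 cross matches
```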
You can accomplish a sort of matching with a query like the following (SQL 2005 and up):
WITH A AS (
SELECT
Seq = Row_Number() OVER (PARTITION BY Str1, Str2 ORDER BY Id),
*
FROM #A
), B AS (
SELECT
Seq = Row_Number() OVER (PARTITION BY Str1, Str2 ORDER BY Id),
*
FROM #B
)
SELECT
A.Id, A.Str1, A.Str2, B.Id, B.Str1, B.Str2
FROM
A
FULL JOIN B
ON A.Seq = B.Seq AND A.Str1 = B.Str1 AND A.Str2 = B.Str2;
This joins the items between A and B on their Id-ordered position. But take note: if you have an unequal number of items for each set of Str1 and Str2, you may get unexpected results, since NULLs will appear for #A or #B.
I'm assuming here that you want the first row of table #A's "Str1 Str2", as ordered by #A.Id (1 being first), to correlate with the first row of table #B's "Str1 Str2", as ordered by #B.Id (7 being first), and so on and so forth for each successively numbered row. Is that right?
But what will you do if the number of rows does not match, and there are, for example, 3 rows in #A that have the same values as 2 rows in #B? Or the reverse? What do you want to see?
A mere DISTINCT will not do the job because the data is not duplicated. You are getting what is in effect a partial cross-join (resulting in a partial Cartesian product). That is, your join criteria do not ensure that there is a one-to-one correspondence of #A row to #B row. When that happens, for each row in #A, you will get an output row for each matching row in B. 2 x 2 = 4, not 2.
I think it would help if you were to be a little more concrete in your example. What things are you actually querying? Surely you've simplified for us, but that has also removed all context for us to know what you're trying to accomplish in the real world. If you are trying to line up sports teams, we might give a different answer than if you are trying to line up invoice line items or tardy occurrences or who knows what!
With that data, and just that data, you can't get the result you want, unless you can provide some way for each of #A's ID values to map to each of #B's ID values.
So, if you really have just 2 records in each table, it would go something like this:
SELECT *
FROM #A a
JOIN #B b
ON a.str1 = b.str1 -- actually, if you join by IDs this isn't necessary
AND a.str2 = b.str2 -- nor is this
AND
(
( a.ID = 1 and b.ID = 7 )
OR ( a.ID = 2 and b.ID = 8 )
)
What you're getting is called a Cartesian product, where each record in #A is paired with each matching record in #B. Since there is more than one matching record in each table, you get every possible combination of matching records from A and B.
Since the only other fields you have to work with are the ID fields, you need to use those to combine exactly one A record with one B record.
