Related
I have created below table with primary key in snowflake and whenever i am trying to insert data into this table, it's allow duplicate records also.
How to restrict duplicate id ?
create table tab11(id int primary key not null,grade varchar(10));
insert into tab11 values(1,'A');
insert into tab11 values(1,'B');
select * from tab11;
Output: Inserted duplicate records.
ID GRADE
1 A
1 B
Snowflake allows you to identify a column as a Primary Key but it doesn't enforce uniqueness on them. From the documentation here:
Snowflake supports defining and maintaining constraints, but does not enforce them, except for NOT NULL constraints, which are always enforced.
A Primary Key in Snowflake is purely for informative purposes. I'm not from Snowflake, but I imagine that enforcing uniqueness in Primary Keys does not really align with how Snowflake stores data behind the scenes and it probably would impact insertion speed.
You may want to look at using a merge statement to handle what happens when a row with a duplicate PK arrives:
create table tab1(id int primary key not null, grade varchar(10));
insert into tab1 values(1, 'A');
-- Try merging values 1, and 'B': Nothing will be added
merge into tab1 using
(select * from (values (1, 'B')) x(id, grade)) tab2
on tab1.id = tab2.id
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
-- Try merging values 2, and 'B': New row added
merge into tab1 using
(select * from (values (2, 'B')) x(id, grade)) tab2
on tab1.id = tab2.id
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
-- If instead of ignoring dupes, we want to update:
merge into tab1 using
(select * from (values (1, 'F'), (2, 'F')) x(id, grade)) tab2
on tab1.id = tab2.id
when matched then update set tab1.grade = tab2.grade
when not matched then insert (id, grade)
values (tab2.id, tab2.grade);
select * from tab1;
For more complex merges, you may want to investigate using Snowflake streams (change data capture tables). In addition to the documentation, I have created a SQL script walk through of how to use a stream to keep a staging and prod table in sync:
https://snowflake.pavlik.us/index.php/2020/01/12/snowflake-streams-made-simple
You could try to use SEQUENCE to fit your requirement
https://docs.snowflake.net/manuals/user-guide/querying-sequences.html#using-sequences
Snowflake does NOT enforce unique constraints, hence you can only mitigate the issue by:
using a SEQUENCE to populate the column you want to be unique;
defining the column as NOT NULL (which is effectively enforced);
using a stored procedure where you can programmatically ensure no duplicates are introduced;
using a stored procedure (which could be run by scheduled Task possibly) to de-duplicate on a regular basis;
You will have to check for duplicates yourself during the insertion (within your INSERT query).
Greg Pavlik's answer using a MERGE query is one way to do it, but you can also achieve the same result with an INSERT query (if you don't plan on updating the existing rows -- if you do, use MERGE instead)
The idea is to insert with a SELECT that checks for the existence of those keys first, along with a window function to qualify the records and remove duplicates from the insert data itself. Here's an example:
INSERT INTO tab11
SELECT *
FROM (VALUES
(1,'A'),
(1,'B')
) AS t(id, grade)
-- Make sure the table doesn't already contain the IDs we're trying to insert
WHERE id NOT IN (
SELECT id FROM tab11
)
-- Make sure the data we're inserting doesn't contain duplicate IDs
-- If it does, only the first record will be inserted (based on the ORDER BY)
-- Ideally, we would want to order by a timestamp to select the latest record
QUALIFY ROW_NUMBER() OVER (
PARTITION BY id
ORDER BY grade ASC
) = 1;
Alternatively, you can achieve the same result with a LEFT JOIN instead of a WHERE NOT IN (...) -- but it doesn't make a big difference unless your table is using a composite primary key (so that you can join on multiple keys).
INSERT INTO tab11
SELECT t.id, t.grade
FROM (VALUES
(1,'A'),
(1,'B')
) AS t(id, grade)
LEFT JOIN tab11
ON tab11.id = t.id
-- Insert only if no match is found in the join (i.e. ID doesn't exit)
WHERE tab11.id IS NULL
QUALIFY ROW_NUMBER() OVER (
PARTITION BY t.id
ORDER BY t.grade ASC
) = 1;
Side note: Snowflake is an OLAP database (as opposed to OLTP), and hence is designed for analytical queries & bulk operations (as opposed to operations on individual records). It's not a good idea to insert records one at a time in your table; instead, you should ingest data in bulk into a landing/staging table (possibly using Snowpipe), and use the data in that table to update your destination table (ideally using a table stream).
Snowflake documentation says it doesnt enforce the constraint.
https://docs.snowflake.com/en/sql-reference/constraints-overview.html
Instead of the load script process to fail, I would rather try and use merge. I have not used merge statements in snowflake yet. For other nosql databases, I have used merge statements instead of insert.
I have a two tables, a Customers and a Sales table.
I am trying to create a trigger to update the amount of sales in the customer table when the Sales table is updated.
CREATE TRIGGER salesUPDATE
ON SALES
AFTER INSERT
AS
UPDATE Customers
SET salesAmount = Sales.Amount
GO
But I get that sales does not exist. Should I be using a join?
Will this trigger update all columns or do I need to specify which column to update?
CREATE TRIGGER salesUpdate ON SALES
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
;WITH cteAffectedCustomers AS (
SELECT DISTINCT CustomerId
FROM
inserted
UNION
SELECT DISTINCT CustomerId
FROM
deleted
)
, cteAggregations AS (
SELECT
ca.CustomerId
,SUM(ISNULL(s.Amount,0)) as SalesAmount
,COUNT(s.SalesId) as NumOfSales
FROM
cteAffectedCustomers ca
INNER JOIN Customers c
ON ca.CustomerId = c.CustomerId
LEFT JOIN Sales s
ON ca.CustomerId = s.CustomerId
GROUP BY
ca.CustomerId
)
UPDATE c
SET SalesAmount = ca.SalesAmount
,NumOfSales = ca.NumOfSales
FROM
Customers c
INNER JOIN cteAggregations ca
ON c.CustomerId = ca.CustomerId
END
Here is an example of the type of logic you would need to create to maintain a pre-agregated value. If you want to SUM an Amount in the Sales table you will need an AFTER INSERT, UPDATE, and DELETE. Then you would need to:
determine all of the affected customers So you don't update the entire customer table
do the aggregation
update with an inner join to the aggregated data
A note about triggers, they are a set based operation NOT a scalar. That means they fire once for x# of rows NOT x# of times for x# of rows. So you have to account for multiple records during updates and do joins just like would outside of a trigger when updating one table with another.
This has a performance impact to write operations but does expedite your reads, however if you are not in an extremely extremely high read volume operation you would do better to use a view/query and optimize your indexes. There is less likely hood of the synchronization of the aggregate data getting messed up. If you do go the trigger route I suggest you also have a SQL job set up on some reasonable increment (nightly) that checks and rectifies any inconsistencies that may occur.
Use magic table inserted
CREATE TRIGGER salesUPDATE
ON SALES
AFTER INSERT
AS
BEGIN
Declare #Amount varchar(50) = (Select top 1 Amount from inserted)
UPDATE Customers
SET salesAmount = #Amount
END
GO
Note : top 1 for insertion of multiple records
Use the inserted table name which contains the new value:
CREATE TRIGGER salesUPDATE
ON SALES
AFTER INSERT
AS
UPDATE Customers
SET salesAmount = inserted.Amount
GO
I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.
So I'm thinking:
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
FROM sales
HAVING count = 1)
But my brain hurts going any farther than that.
SELECT DISTINCT a,b,c FROM t
is roughly equivalent to:
SELECT a,b,c FROM t GROUP BY a,b,c
It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
For your query, I'd do it like this:
UPDATE sales
SET status='ACTIVE'
WHERE id IN
(
SELECT id
FROM sales S
INNER JOIN
(
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(*) = 1
) T
ON S.saleprice=T.saleprice AND s.saledate=T.saledate
)
If you put together the answers so far, clean up and improve, you would arrive at this superior query:
UPDATE sales
SET status = 'ACTIVE'
WHERE (saleprice, saledate) IN (
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING count(*) = 1
);
Which is much faster than either of them. Nukes the performance of the currently accepted answer by factor 10 - 15 (in my tests on PostgreSQL 8.4 and 9.1).
But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:
UPDATE sales s
SET status = 'ACTIVE'
WHERE NOT EXISTS (
SELECT FROM sales s1 -- SELECT list can be empty for EXISTS
WHERE s.saleprice = s1.saleprice
AND s.saledate = s1.saledate
AND s.id <> s1.id -- except for row itself
)
AND s.status IS DISTINCT FROM 'ACTIVE'; -- avoid empty updates. see below
db<>fiddle here
Old sqlfiddle
Unique key to identify row
If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):
AND s1.ctid <> s.ctid
Every table should have a primary key. Add one if you didn't have one, yet. I suggest a serial or an IDENTITY column in Postgres 10+.
Related:
In-order sequence generation
Auto increment table column
How is this faster?
The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
Exclude empty updates
For rows that already have status = 'ACTIVE' this update would not change anything, but still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition like demonstrated above to avoid this and make it even faster:
If status is defined NOT NULL, you can simplify to:
AND status <> 'ACTIVE';
The data type of the column must support the <> operator. Some types like json don't. See:
How to query a json column for empty objects?
Subtle difference in NULL handling
This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):
(123, NULL)
(123, NULL)
Also passes in a unique index and almost anywhere else, since NULL values do not compare equal according to the SQL standard. See:
Create unique constraint with null columns
OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:
How to delete duplicate rows without unique identifier
If all columns being compared are defined NOT NULL, there is no room for disagreement.
The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.
This should work however (and doesn't need a join):
UPDATE sales
SET status='ACTIVE'
WHERE id IN (
SELECT MIN(id) FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(id) = 1
)
You could also use MAX or AVG instead of MIN, it is only important to use a function that returns the value of the column if there is only one matching row.
If your DBMS doesn't support distinct with multiple columns like this:
select distinct(col1, col2) from table
Multi select in general can be executed safely as follows:
select distinct * from (select col1, col2 from table ) as x
As this can work on most of the DBMS and this is expected to be faster than group by solution as you are avoiding the grouping functionality.
I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using
Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering
It will also give the column 'sortering' and because 'GrondOfLucht' AND 'sortering' is not unique, the result will be ALL rows.
use the GROUP to select the records of 'GrondOfLucht' in the order given by 'sortering
SELECT GrondOfLucht
FROM dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht, sortering
ORDER BY MIN(sortering)
I have an aplication that passes parameters to a procedure in SQL. One of the parameters is an table valued parameter containing items to include in a where clause.
Because the table valued parameter has no statistics attached to it when I join my TVP to a table that has 2 mil rows I get a very slow query.
What alternatives do I have ?
Again, the goal is to pass certain values to a procedure that will be included in a where clause:
select * from table1 where id in
(select id from #mytvp)
or
select * from table1 t1 join #mytpv
tvp on t1.id = tvp.id
although it looks like it would need to run the query once for each row in table1, EXISTS often optimizes to be more efficient than a JOIN or an IN. So, try this:
select * from table1 t where exists (select 1 from #mytvp p where t.id=p.id)
also, be sure that t.id is the same datatype as p.id and t.id has an index.
You can use a temp table with an index to boost performance....(assuming you have more than a couple of records in your #mytvp)
just before you join the table you could insert the data from the variable #mytvp to a temp table...
here's a sample code to create a temp table with index....The primary key and unique field determines which columns to index on..
CREATE TABLE #temp_employee_v3
(rowID int not null identity(1,1)
,lname varchar (30) not null
,fname varchar (30) not null
,city varchar (20) not null
,state char (2) not null
,PRIMARY KEY (lname, fname, rowID)
,UNIQUE (state, city, rowID) )
I had the same issue that table-valued parameters where very slow in my context. I came up with a solution that passed the list of values as a comma separated string to the stored procedure. the procedure then made a PATINDEX(...) > 0 comparision. This was about a factor of 1:6 faster.
As mentioned here and explained here you can have primary key and unique constraints on the table type. E.g.
CREATE TYPE IdList AS TABLE ( Id UNIQUEIDENTIFIER NOT NULL PRIMARY KEY )
However, check if it improves performance in your case as now, these indexes exist when the TVP is populated which might lead to a counter effect depending if your input is sorted and/or if you use more than one column.
In common with table variables, table-valued parameters have no statistics (see the section "restrictions"); the query optimiser works on the assumption that they contain only one row, which if your parameter contains a lot of rows is likely to result in an inappropriate query plan.
One way to improve your chances of a better plan is to add a statement level recompile; this should enable the optimiser to take the size of the TVP into account when selecting a plan.
select * from table1 t where exists (select 1 from #mytvp p where t.id=p.id) OPTION (RECOMPILE)
(incorporating KM's suggestion)
I have a stored procedure that is responsible for inserting or updating multiple records at once. I want to perform this in my stored procedure for the sake of performance.
This stored procedure takes in a comma-delimited list of permit IDs and a status. The permit IDs are stored in a variable called #PermitIDs. The status is stored in a variable called #Status. I have a user-defined function that converts this comma-delimited list of permit IDs into a Table. I need to go through each of these IDs and do either an insert or update into a table called PermitStatus.
If a record with the permit ID does not exist, I want to add a record. If it does exist, I'm want to update the record with the given #Status value. I know how to do this for a single ID, but I do not know how to do it for multiple IDs. For single IDs, I do the following:
-- Determine whether to add or edit the PermitStatus
DECLARE #count int
SET #count = (SELECT Count(ID) FROM PermitStatus WHERE [PermitID]=#PermitID)
-- If no records were found, insert the record, otherwise add
IF #count = 0
BEGIN
INSERT INTO
PermitStatus
(
[PermitID],
[UpdatedOn],
[Status]
)
VALUES
(
#PermitID,
GETUTCDATE(),
1
)
END
ELSE
UPDATE
PermitStatus
SET
[UpdatedOn]=GETUTCDATE(),
[Status]=#Status
WHERE
[PermitID]=#PermitID
How do I loop through the records in the Table returned by my user-defined function to dynamically insert or update the records as needed?
create a split function, and use it like:
SELECT
*
FROM YourTable y
INNER JOIN dbo.splitFunction(#Parameter) s ON y.ID=s.Value
I prefer the number table approach
For this method to work, you need to do this one time table setup:
SELECT TOP 10000 IDENTITY(int,1,1) AS Number
INTO Numbers
FROM sys.objects s1
CROSS JOIN sys.objects s2
ALTER TABLE Numbers ADD CONSTRAINT PK_Numbers PRIMARY KEY CLUSTERED (Number)
Once the Numbers table is set up, create this function:
CREATE FUNCTION [dbo].[FN_ListToTableAll]
(
#SplitOn char(1) --REQUIRED, the character to split the #List string on
,#List varchar(8000)--REQUIRED, the list to split apart
)
RETURNS TABLE
AS
RETURN
(
----------------
--SINGLE QUERY-- --this WILL return empty rows
----------------
SELECT
ROW_NUMBER() OVER(ORDER BY number) AS RowNumber
,LTRIM(RTRIM(SUBSTRING(ListValue, number+1, CHARINDEX(#SplitOn, ListValue, number+1)-number - 1))) AS ListValue
FROM (
SELECT #SplitOn + #List + #SplitOn AS ListValue
) AS InnerQuery
INNER JOIN Numbers n ON n.Number < LEN(InnerQuery.ListValue)
WHERE SUBSTRING(ListValue, number, 1) = #SplitOn
);
GO
You can now easily split a CSV string into a table and join on it:
select * from dbo.FN_ListToTableAll(',','1,2,3,,,4,5,6777,,,')
OUTPUT:
RowNumber ListValue
----------- ----------
1 1
2 2
3 3
4
5
6 4
7 5
8 6777
9
10
11
(11 row(s) affected)
To make what you need work, do the following:
--this would be the existing table
DECLARE #OldData table (RowID int, RowStatus char(1))
INSERT INTO #OldData VALUES (10,'z')
INSERT INTO #OldData VALUES (20,'z')
INSERT INTO #OldData VALUES (30,'z')
INSERT INTO #OldData VALUES (70,'z')
INSERT INTO #OldData VALUES (80,'z')
INSERT INTO #OldData VALUES (90,'z')
--these would be the stored procedure input parameters
DECLARE #IDList varchar(500)
,#StatusList varchar(500)
SELECT #IDList='10,20,30,40,50,60'
,#StatusList='A,B,C,D,E,F'
--stored procedure local variable
DECLARE #InputList table (RowID int, RowStatus char(1))
--convert input prameters into a table
INSERT INTO #InputList
(RowID,RowStatus)
SELECT
i.ListValue,s.ListValue
FROM dbo.FN_ListToTableAll(',',#IDList) i
INNER JOIN dbo.FN_ListToTableAll(',',#StatusList) s ON i.RowNumber=s.RowNumber
--update all old existing rows
UPDATE o
SET RowStatus=i.RowStatus
FROM #OldData o WITH (UPDLOCK, HOLDLOCK) --to avoid race condition when there is high concurrency as per #emtucifor
INNER JOIN #InputList i ON o.RowID=i.RowID
--insert only the new rows
INSERT INTO #OldData
(RowID, RowStatus)
SELECT
i.RowID, i.RowStatus
FROM #InputList i
LEFT OUTER JOIN #OldData o ON i.RowID=o.RowID
WHERE o.RowID IS NULL
--display the old table
SELECT * FROM #OldData order BY RowID
OUTPUT:
RowID RowStatus
----------- ---------
10 A
20 B
30 C
40 D
50 E
60 F
70 z
80 z
90 z
(9 row(s) affected)
EDIT thanks to #Emtucifor click here for the tip about the race condition, I have included the locking hints in my answer, to prevent race condition problems when there is high concurrency.
There are various methods to accomplish the parts you ask are asking about.
Passing Values
There are dozens of ways to do this. Here are a few ideas to get you started:
Pass in a string of identifiers and parse it into a table, then join.
SQL 2008: Join to a table-valued parameter
Expect data to exist in a predefined temp table and join to it
Use a session-keyed permanent table
Put the code in a trigger and join to the INSERTED and DELETED tables in it.
Erland Sommarskog provides a wonderful comprehensive discussion of lists in sql server. In my opinion, the table-valued parameter in SQL 2008 is the most elegant solution for this.
Upsert/Merge
Perform a separate UPDATE and INSERT (two queries, one for each set, not row-by-row).
SQL 2008: MERGE.
An Important Gotcha
However, one thing that no one else has mentioned is that almost all upsert code, including SQL 2008 MERGE, suffers from race condition problems when there is high concurrency. Unless you use HOLDLOCK and other locking hints depending on what's being done, you will eventually run into conflicts. So you either need to lock, or respond to errors appropriately (some systems with huge transactions per second have used the error-response method successfully, instead of using locks).
One thing to realize is that different combinations of lock hints implicitly change the transaction isolation level, which affects what type of locks are acquired. This changes everything: which other locks are granted (such as a simple read), the timing of when a lock is escalated to update from update intent, and so on.
I strongly encourage you to read more detail on these race condition problems. You need to get this right.
Conditional Insert/Update Race Condition
“UPSERT” Race Condition With MERGE
Example Code
CREATE PROCEDURE dbo.PermitStatusUpdate
#PermitIDs varchar(8000), -- or (max)
#Status int
AS
SET NOCOUNT, XACT_ABORT ON -- see note below
BEGIN TRAN
DECLARE #Permits TABLE (
PermitID int NOT NULL PRIMARY KEY CLUSTERED
)
INSERT #Permits
SELECT Value FROM dbo.Split(#PermitIDs) -- split function of your choice
UPDATE S
SET
UpdatedOn = GETUTCDATE(),
Status = #Status
FROM
PermitStatus S WITH (UPDLOCK, HOLDLOCK)
INNER JOIN #Permits P ON S.PermitID = P.PermitID
INSERT PermitStatus (
PermitID,
UpdatedOn,
Status
)
SELECT
P.PermitID,
GetUTCDate(),
#Status
FROM #Permits P
WHERE NOT EXISTS (
SELECT 1
FROM PermitStatus S
WHERE P.PermitID = S.PermitID
)
COMMIT TRAN
RETURN ##ERROR;
Note: XACT_ABORT helps guarantee the explicit transaction is closed following a timeout or unexpected error.
To confirm that this handles the locking problem, open several query windows and execute an identical batch like so:
WAITFOR TIME '11:00:00' -- use a time in the near future
EXEC dbo.PermitStatusUpdate #PermitIDs = '123,124,125,126', 1
All of these different sessions will execute the stored procedure in nearly the same instant. Check each session for errors. If none exist, try the same test a few times more (since it's possible to not always have the race condition occur, especially with MERGE).
The writeups at the links I gave above give even more detail than I did here, and also describe what to do for the SQL 2008 MERGE statement as well. Please read those thoroughly to truly understand the issue.
Briefly, with MERGE, no explicit transaction is needed, but you do need to use SET XACT_ABORT ON and use a locking hint:
SET NOCOUNT, XACT_ABORT ON;
MERGE dbo.Table WITH (HOLDLOCK) AS TableAlias
...
This will prevent concurrency race conditions causing errors.
I also recommend that you do error handling after each data modification statement.
If you're using SQL Server 2008, you can use table valued parameters - you pass in a table of records into a stored procedure and then you can do a MERGE.
Passing in a table valued parameter would remove the need to parse CSV strings.
Edit:
ErikE has raised the point about race conditions, please refer to his answer and linked articles.
If you have SQL Server 2008, you can use MERGE. Here's an article describing this.
You should be able to do your insert and your update as two set based queries.
The code below was based on a data load procedure that I wrote a while ago that took data from a staging table and inserted or updated it into the main table.
I've tried to make it match your example, but you may need to tweak this (and create a table valued UDF to parse your CSV into a table of ids).
-- Update where the join on permitstatus matches
Update
PermitStatus
Set
[UpdatedOn]=GETUTCDATE(),
[Status]=staging.Status
From
PermitStatus status
Join
StagingTable staging
On
staging.PermitId = status.PermitId
-- Insert the new records, based on the Where Not Exists
Insert
PermitStatus(Updatedon, Status, PermitId)
Select (GETUTCDATE(), staging.status, staging.permitId
From
StagingTable staging
Where Not Exists
(
Select 1 from PermitStatus status
Where status.PermitId = staging.PermidId
)
Essentially you have an upsert stored procedure (eg. UpsertSinglePermit)
(like the code you have given above) for dealing with one row.
So the steps I see are to create a new stored procedure (UpsertNPermits) which does
a) Parse input string into n record entries (each record contains permit id and status)
b) Foreach entry in above, invoke UpsertSinglePermit