I have a table with an "order" column which needs to maintain a contiguous range of unique numbers. What I'm trying to create is a trigger that fires after deleting rows which updates the "order" column so that the numbers remain contiguous.
I know a lot of people would argue that an "order" column only needs to be continuous, not contiguous; however, there is a lot of front-end JavaScript, and other SQL, for ordering/reordering these items that depends on the order being contiguous. I would prefer to simply get this trigger working rather than rewrite all of that, though of course I'm open to suggestions ;)
The trigger I have works fine for a single row delete, but when a multiple row delete occurs, only the first row gets deleted and the rest remain with no error thrown.
I thought the problem may have been recursion, as it updates the table it fired from, but it's only a delete trigger so I don't think that's the problem. Turning off RECURSIVE_TRIGGERS didn't fix the issue.
Here's the code:
CREATE TABLE [dbo].[Item]
(
[ItemID] INT NOT NULL IDENTITY(1, 1),
[ItemOrder] INT NOT NULL,
[ItemName] NVARCHAR (50) NOT NULL
)
GO
SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
CREATE TRIGGER [dbo].[trItem_odr] ON [dbo].[Item]
AFTER DELETE
AS
BEGIN
SET NOCOUNT ON ;
DECLARE @MinOrder INT
SELECT @MinOrder = MIN(ItemOrder)
FROM DELETED
DECLARE @UpdatedItems TABLE
(
ID INT IDENTITY(0, 1)
PRIMARY KEY,
ItemID INT
)
INSERT INTO @UpdatedItems (ItemID)
SELECT ItemID
FROM dbo.Item
WHERE ItemOrder > @MinOrder
AND ItemID NOT IN (SELECT ItemID
FROM DELETED)
ORDER BY ItemOrder
UPDATE dbo.Item
SET ItemOrder = (SELECT ID + @MinOrder
FROM @UpdatedItems
WHERE ItemID = Item.ItemID)
WHERE ItemID IN (SELECT ItemID
FROM @UpdatedItems)
END
GO
ALTER TABLE [dbo].[Item] ADD CONSTRAINT [PK_Item] PRIMARY KEY CLUSTERED ([ItemID])
GO
ALTER TABLE [dbo].[Item] ADD CONSTRAINT [IX_Item_1] UNIQUE NONCLUSTERED ([ItemName])
GO
CREATE UNIQUE NONCLUSTERED INDEX [IX_Item_2] ON [dbo].[Item] ([ItemOrder])
GO
INSERT INTO [dbo].[Item] ([ItemOrder], [ItemName])
SELECT 1, N'King Size Bed' UNION ALL
SELECT 2, N'Queen size bed' UNION ALL
SELECT 3, N'Double Bed' UNION ALL
SELECT 4, N'Single Bed' UNION ALL
SELECT 5, N'Filing Cabinet' UNION ALL
SELECT 6, N'Washing Machine' UNION ALL
SELECT 7, N'2 Seater Couch' UNION ALL
SELECT 8, N'3 Seater Couch' UNION ALL
SELECT 9, N'1 Seater Couch' UNION ALL
SELECT 10, N'Flat Screen TV' UNION ALL
SELECT 11, N'Fridge' UNION ALL
SELECT 12, N'Dishwasher' UNION ALL
SELECT 13, N'4 Seater couch' UNION ALL
SELECT 14, N'Lawn Mower' UNION ALL
SELECT 15, N'Dining table'
GO
Rewrite your front end. Prefer to trade more development time for less runtime.
Keeping the order column contiguous by updating every out-of-order row in the table is tremendously inefficient (remove item 1 => update 100,000,000 items), resulting in potentially huge update operations. It also generates tremendous contention, because updates that modify that many rows contend with almost any read. And it is ultimately incorrect under concurrency (you'll end up with gaps and overlaps anyway). Don't do it.
I have some data like
Id, GroupId, Whatever
1, 1, 10
2, 1, 10
3, 1, 10
4, 2, 10
5, 2, 10
6, 3, 10
And I need to add a "group row id" column such as
Id, GroupId, Whatever, GroupRowId
1, 1, 10, 1
2, 1, 10, 2
3, 1, 10, 3
4, 2, 10, 1
5, 2, 10, 2
6, 3, 10, 1
Ideally it would be computed and enforced by the database. So when I do
INSERT INTO Foos (GroupId, Whatever) VALUES (1, 20)
I'd get the correct GroupRowId. Continuing the example data above, this row would then look like
Id, GroupId, Whatever, GroupRowId
7, 1, 20, 4
This data is to be shared with a 3rd party and one of the requirements is for those GroupRowIds to be fixed regardless of any different ORDER BY or WHERE clauses.
I've considered a view with a ROW_NUMBER() OVER (PARTITION BY ...), but that view could still be modified in the future, breaking previously shared data.
Our business rules dictate that no rows will be deleted so the GroupRowId will never need to be recomputed in this respect and there will never** be missing values.
** in the perfect world of business rules.
My thinking is that it would be preferable for this to be a physical column so that it exists within the row. It can be queried and won't change based on an ORDER BY or WHERE clause.
You might try something along these lines:
--create a test database (will be dropped at the end! Careful with real data!!)
USE master;
GO
CREATE DATABASE GroupingTest;
GO
USE GroupingTest;
GO
--Your table, I use an IDENTITY column for your Id column
CREATE TABLE dbo.tbl(Id INT IDENTITY,GroupId INT,Whatever INT);
GO
--Insert your test values
INSERT INTO tbl(GroupId, Whatever)
VALUES
(1,10)
,(1,10)
,(1,10)
,(2,10)
,(2,10)
,(3,10);
GO
--This is necessary to add the new column and to fill it initially
ALTER TABLE tbl ADD GroupRowId INT;
GO
WITH cte AS
(
SELECT GroupRowId
,ROW_NUMBER() OVER(PARTITION BY GroupId ORDER BY Id) AS NewValue
FROM tbl
)
UPDATE cte SET GroupRowId=NewValue;
--check the result
SELECT * FROM tbl ORDER BY GroupId,Id;
GO
--Now we create a trigger, which does exactly the same for new rows
--Very important: This must work with single inserts and with multiple inserts as well!
CREATE TRIGGER dbo.SetNextGroupRowId ON dbo.tbl
FOR INSERT
AS
BEGIN
WITH cte AS
(
SELECT GroupRowId
,ROW_NUMBER() OVER(PARTITION BY GroupId ORDER BY Id) AS NewValue
FROM tbl
)
UPDATE cte
SET GroupRowId=NewValue
WHERE GroupRowId IS NULL; --<-- this ensures only new rows are changed
END
GO
--Now we can test this with a single value
INSERT INTO tbl(GroupId, Whatever)
VALUES(1,20);
SELECT * FROM tbl ORDER BY GroupId,Id;
--And we can test this with multiple inserts
INSERT INTO tbl(GroupId, Whatever)
VALUES
(1,30)
,(2,30)
,(2,30)
,(3,30)
,(4,30); --<-- the "4" is a new group
SELECT * FROM tbl ORDER BY GroupId,Id;
GO
--Cleaning
USE master;
GO
DROP DATABASE GroupingTest;
What you should keep in mind:
This might get into trouble with values inserted manually into GroupRowId, or with any manipulation of this column by any other statement.
This might get into trouble with deleted rows.
You can think about an approach selecting MAX(GroupRowId)+1 for the given group (see the sketch at the end of this answer). This depends on your needs.
You might add a unique index on GroupId, GroupRowId. This would at least avoid handing out the same number twice, but it would turn a duplicate into an error instead.
...but in your perfect world of business rules :-) this won't happen...
And to be honest: The whole issue has some smell...
How can I delete duplicate records from a Snowflake table? Thanks.
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Adding a solution here that doesn't recreate the table, because recreating a table can break a lot of existing configuration and history.
Instead, we delete only the duplicate rows and re-insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't modify the original table's structure
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction
If you have some primary key as such:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then you can delete like this:
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key then you cannot delete that way. At which point a
CREATE TABLE new_table_name AS
SELECT id, name FROM (
SELECT id
,name
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS rn
FROM table_name
)
WHERE rn = 1
and then swap them
ALTER TABLE table_name SWAP WITH new_table_name
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.
Snowflake does not enforce primary keys; their use is primarily with ERD tools.
Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.
It is possible to temporarily add an "is_duplicate" column, e.g. numbering all the duplicates with the ROW_NUMBER() function, then delete all records with "is_duplicate" > 1, and finally drop the utility column.
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of adding update timestamp columns in a Snowflake data warehouse.
This has been bothering me for some time as well. Since Snowflake added support for QUALIFY, you can now create a deduplicated table with a single statement, without subqueries:
CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;
Of course you are left with a new table and lose the table history, primary keys, foreign keys and such.
Based on the ideas above, the following query worked perfectly in my case.
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit WHERE ID = 1 AND Name = 'Apple';, in which case both rows will go away. Or you don't, and keep both.
For some databases, there are workarounds using internal row IDs, but there isn't one in Snowflake; see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes either, so your only option is to create a new table and swap.
An additional note on Hans Henrik Eriksen's remark about the importance of update timestamps: this is a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);
A simple UNION eliminates duplicates for the case where you are comparing all columns and have no PKs.
In any case, the problem should be solved as early as possible in the ingestion pipeline, and/or by using SCD, etc.
Asking for the one magic way to delete duplicates is the wrong question in principle; use SCD with a high-resolution timestamp and the problem largely goes away.
Want to fix a massive duplicate load? Then add a column like a batch ID and remove all records loaded by that batch (see the sketch below).
It's like staying healthy; you have two approaches:
eat a lot > get fat > go to a gym to burn it off
eat well > have a healthy lifestyle and no need for the gym.
So before discussing the best gym, try changing the lifestyle.
Hope this helps. Learn to put pressure upstream on data producers instead of spending your life cleaning up everyone else's mess.
The following solution is effective if you are treating one or a few columns as the primary key reference for the table.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]n = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
This is inspired by @Felipe Hoffa's answer:
-- create table with dupes and take the max id
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
)
-- join back to the original table on the field, excluding the ID in the duplicate table, and delete
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id
Not sure if people are still interested in this, but I've used the query below, which is more elegant and seems to have worked:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1
For example, I have two tables that I need for my query: Property, and Move (the history of moving properties).
I must create a query that returns all properties plus one additional boolean column, IsInService, which is true when the Move table has a record for the property with DateTo = null and MoveTypeID = 1 ("In service").
I have created this query:
SELECT
[ID], [Name],
(SELECT COUNT(*)
FROM [Move]
WHERE PropertyID = p.ID
AND DateTo IS NULL
AND MoveTypeID = 1) AS IsInService
FROM
[Property] as p
ORDER BY
[Name] ASC
OFFSET 100500 ROWS FETCH NEXT 50 ROWS ONLY;
I'm not so strong in SQL, but as far as I know, subqueries are evil :)
How to create high performance SQL query in my case, if it is expected that these tables will include millions of records?
I've updated the code based on your comment. If you need something else, please provide the expected input and output data. This is about all I can do based on inference from the existing comments. Further, this isn't intended to give you an exact working solution; my intention was to give you a prototype from which you can build your solution.
That said:
The code below is the basic join that you need. However, keep in mind that indexing is probably going to play as big a part in performance as the structure of the table and the query. It doesn't matter how you query the tables if the indexes aren't there to support the queries once you reach a certain size. There are a lot of resources online for indexing, but reviewing query plans should be at the top of your list.
As a note, your column [dbo].[Property] ([Name]) should probably be NVARCHAR to allow SQL to minimize data storage. Indexes on that column will then be smaller and searches/updates faster.
DECLARE @Property AS TABLE
(
[ID] INT
, [Name] NVARCHAR(100)
);
INSERT INTO @Property
([ID]
, [Name])
VALUES (1,N'A'),
(2,N'B'),
(3,N'C');
DECLARE @Move AS TABLE
(
[ID] INT
, [DateTo] DATE
, [MoveTypeID] INT
, [PropertyID] INT
);
INSERT INTO @Move
([ID]
, [DateTo]
, [MoveTypeID]
, [PropertyID])
VALUES (1,NULL,1,1),
(2,NULL,1,2),
(3,N'2017-12-07',1,2);
SELECT [Property].[ID] AS [property_id]
, [Property].[Name] AS [property_name]
, CASE
WHEN [Move].[DateTo] IS NULL
AND [Move].[MoveTypeID] = 1 THEN
N'true'
ELSE
N'false'
END AS [in_service]
FROM @Property AS [Property]
LEFT JOIN @Move AS [Move]
ON [Move].[PropertyID] = [Property].[ID]
AND [Move].[DateTo] IS NULL
AND [Move].[MoveTypeID] = 1;
The trigger below selects IDs from one table (employeeInOut), sums the ints in a column of that table matching those IDs, and is supposed to insert these into another table (monthlyHours). I can't figure out if this is a syntax problem (nothing shows up in IntelliSense); all it says is that the trigger executed successfully, and nothing is inserted.
Trigger ->
GO
CREATE TRIGGER empTotalsHoursWorked
ON employeeInOut
FOR INSERT, DELETE, UPDATE
AS
BEGIN
INSERT INTO monthlyHours(employeeID, monthlyHours)
SELECT (SELECT employeeID FROM employeeInOut),
SUM(dailyHours) AS monthlyHours
FROM employeeInOut
WHERE employeeInOut.employeeID=(SELECT employeeID FROM monthlyHours)
END
GO
I have reworked this trigger many times and this is the version with no errors; however, nothing is inserted and the results are empty. Any advice or answers appreciated.
Going with a couple of assumptions here, one being that the monthlyHours table contains employeeID and monthlyHours.
That being said, I think you are going to need multiple triggers, depending on the action. Below is an example based on an insert into the employeeInOut table.
GO
CREATE TRIGGER empTotalsHoursWorked
ON employeeInOut
AFTER INSERT
AS
BEGIN
DECLARE @employeeID INT
DECLARE @monthlyHours INT
SELECT @employeeID = INSERTED.employeeID
FROM INSERTED
SELECT @monthlyHours = SUM(dailyHours)
FROM employeeInOut
WHERE employeeInOut.employeeID = @employeeID
INSERT INTO monthlyHours(employeeID,monthlyHours)
VALUES (@employeeID, @monthlyHours)
END
GO
This will insert a new row, of course. You may want to modify this to update the row if one already exists in the monthlyHours table for that employee; a small upsert sketch follows.
I would really advise against a trigger for a simple running total like this; your best option would be to create a view. Something like:
CREATE VIEW dbo.MonthlyHours
AS
SELECT EmployeeID,
monthlyHours = SUM(dailyHours)
FROM dbo.employeeInOut
GROUP BY EmployeeID;
GO
Then you can access it in the same way as your table:
SELECT *
FROM dbo.MonthlyHours;
If you are particularly worried about performance, then you can always index the view:
CREATE VIEW dbo.MonthlyHours
WITH SCHEMABINDING
AS
SELECT EmployeeID,
monthlyHours = SUM(dailyHours),
RecordCount = COUNT_BIG(*)
FROM dbo.employeeInOut
GROUP BY EmployeeID;
GO
CREATE UNIQUE CLUSTERED INDEX UQ_MonthlyHours__EmployeeID ON dbo.MonthlyHours(EmployeeID);
Now whenever you add or remove records from employeeInOut SQL Server will automatically update the clustered index for the view, you just need to use the WITH (NOEXPAND) query hint to ensure that you aren't running the query behind the view:
SELECT *
FROM dbo.MonthlyHours WITH (NOEXPAND);
Finally, based on the fact the table is called monthly hours, I am guessing it should be by month, as such I assume you also have a date field in employeeInOut, in which case your view might be more like:
CREATE VIEW dbo.MonthlyHours
WITH SCHEMABINDING
AS
SELECT EmployeeID,
FirstDayOfMonth = DATEADD(MONTH, DATEDIFF(MONTH, 0, [YourDateField]), 0),
monthlyHours = SUM(dailyHours),
RecordCount = COUNT_BIG(*)
FROM dbo.employeeInOut
GROUP BY EmployeeID, DATEADD(MONTH, DATEDIFF(MONTH, 0, [YourDateField]), 0);
GO
CREATE UNIQUE CLUSTERED INDEX UQ_MonthlyHours__EmployeeID_FirstDayOfMonth
ON dbo.MonthlyHours(EmployeeID, FirstDayOfMonth);
And you can use the view in the same way described above.
ADDENDUM
For what it is worth, for your trigger to work properly you need to consider all cases:
Inserting a record where that employee already exists in MonthlyHours (Update existing).
Inserting a record where that employee does not exist in MonthlyHours (insert new).
Updating a record (update existing)
Deleting a record (update existing, or delete)
To handle all of these cases you can use MERGE:
CREATE TRIGGER empTotalsHoursWorked
ON employeeInOut
FOR INSERT, DELETE, UPDATE
AS
BEGIN
WITH ChangesToMake AS
( SELECT EmployeeID, SUM(dailyHours) AS MonthlyHours
FROM ( SELECT EmployeeID, dailyHours
FROM Inserted
UNION ALL
SELECT EmployeeID, -dailyHours
FROM deleted
) AS t
GROUP BY EmployeeID
)
MERGE INTO monthlyHours AS m
USING ChangesToMake AS c
ON c.EmployeeID = m.EmployeeID
WHEN MATCHED THEN UPDATE
SET MonthlyHours = m.MonthlyHours + c.MonthlyHours
WHEN NOT MATCHED BY TARGET THEN
INSERT (EmployeeID, MonthlyHours)
VALUES (c.EmployeeID, c.MonthlyHours);
END
GO
How can I create a Primary Key in SQL Server 2005/2008 with the format:
CurrentYear + auto-increment?
Example: the current year is 2010; in a new table, the ID should start at 1, so: 20101, 20102, 20103, 20104, 20105... and so on.
The cleaner solution is to create a composite primary key consisting of e.g. Year and Counter columns.
Not sure exactly what you are trying to accomplish by doing that, but it makes a lot more sense to do this with two fields.
If the combination of the two must be the PK for some reason, just span it across both columns. However, it seems unnecessary, since the identity part will be unique on its own, without the year.
This technically meets the needs of what you requested:
CREATE TABLE #test
( seeded_column INT IDENTITY(1,1) NOT NULL
, year_column INT NOT NULL DEFAULT(YEAR(GETDATE()))
, calculated_column AS CONVERT(BIGINT, CONVERT(CHAR(4), year_column, 120) + CONVERT(VARCHAR(MAX), seeded_column)) PERSISTED PRIMARY KEY
, test VARCHAR(MAX) NOT NULL);
INSERT INTO #test (test)
SELECT 'Badda'
UNION ALL
SELECT 'Cadda'
UNION ALL
SELECT 'Dadda'
UNION ALL
SELECT 'Fadda'
UNION ALL
SELECT 'Gadda'
UNION ALL
SELECT 'Hadda'
UNION ALL
SELECT 'Jadda'
UNION ALL
SELECT 'Kadda'
UNION ALL
SELECT 'Ladda'
UNION ALL
SELECT 'Madda'
UNION ALL
SELECT 'Nadda'
UNION ALL
SELECT 'Padda';
SELECT *
FROM #test;
DROP TABLE #test;
You have to write a trigger for this :)
Have a separate table for storing the last number used (I really don't know whether SQL Server has something similar to Oracle's sequences).
OR
You can get the last inserted item and extract the number from it.
THEN
You can get the current year from SELECT DATEPART(yyyy,GetDate());
The trigger would be an ON INSERT trigger where you combine the year and the last number and update the column. A rough sketch follows.