I do not want the table to contain more than 15 records.
Scenario:
A new record is saved. If it would be record number 16, the first (oldest) record should be deleted.
How do I remove the first record? Can it be done automatically?
If it is Entity Framework and you want a basic rule, here it is.
Suppose your entity is Person and its set is called People.
Before you do context.People.Add(new Person()), apply the following logic:
Obtain the count of people in the database with context.People.Count().
Check whether this count has already reached 15; you can do this in a single statement: if (context.People.Count() >= 15).
Inside the if you can write Person firstPerson = context.People.OrderBy(x => x.ID).First(), or if you have a date-inserted/date-added column you can use .OrderBy(x => x.DateAdded) and pick the first element. Make sure you order it the correct way, using OrderBy or OrderByDescending.
Place this record in a variable and call context.People.Remove(firstPerson) before you do context.People.Add(new Person()).
If you started with an empty table your IDs will simply increment, so you can safely delete in ID order and pick the least one every time you delete.
WITH A AS
(
SELECT TOP 1 *
FROM MyTable
)
DELETE FROM A
The rows referenced in the TOP expression used with INSERT, UPDATE, or DELETE are not arranged in any order.
Therefore, you are better off using a WITH (common table expression) that includes an ORDER BY clause, which lets you specify exactly which row you consider to be the first.
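For example, a sketch of the ordered version, assuming an identity column named rowID marks insertion order:
WITH A AS
(
    SELECT TOP 1 *
    FROM MyTable
    ORDER BY rowID ASC
)
DELETE FROM A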
This uses a trigger and an identity column to ensure only the 15 most-recently-inserted rows are kept in the table.
CREATE TABLE MyTable
(
rowID INT IDENTITY(1,1) PRIMARY KEY
,MyColumn VARCHAR(255) NOT NULL
)
GO
CREATE TRIGGER TG_MyTable_Only15
ON MyTable
AFTER INSERT
AS
BEGIN
WITH t1 (rowID) AS
(
    SELECT TOP 15 rowID
    FROM MyTable
    ORDER BY rowID DESC
)
DELETE FROM MyTable
WHERE rowID NOT IN (SELECT rowID FROM t1)
END
GO
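A quick smoke test of the trigger (sys.objects is used only as a convenient 16-row source; any row source works):
INSERT INTO MyTable (MyColumn)
SELECT 'row ' + CAST(n AS VARCHAR(3))
FROM (SELECT TOP 16 ROW_NUMBER() OVER (ORDER BY (SELECT 1)) AS n FROM sys.objects) t

SELECT COUNT(*) FROM MyTable -- returns 15; the trigger trimmed the oldest row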
I am trying to create a dimension table (NewTable) from an existing Data Warehouse table (OldTable) that doesn't have a primary key.
The OldTable holds distinct values in [IdentifierCode] and other values repeat around it. I also need to invoke 3 functions to add reporting context.
I want IdentifierCode_ID to be an INT column - as the [IdentifierCode] column is VARCHAR(6).
My question is this: is using ROW_NUMBER() (as shown below) producing a suitably unique value?
My concern is that the row order on the live table could change if other rows are inserted to remediate missed codes.
Edit: OldTable has 500k rows in all and 227k when filtered with the WHERE clause
SELECT
ROW_NUMBER() OVER (ORDER BY LoadDate, StartDate, Product, IdentifierCode) AS IdentifierCode_ID,
LoadDate,
StartDate,
EndDate,
Product,
IdentifierCode,
OtherField1, OtherField2, OtherField3, OtherField4,
Function1, Function2, Function3
INTO
NewTable
FROM
OldTable
WHERE
GETDATE() BETWEEN StartDate AND EndDate
First, unless you're either loading data once and never touching it again or are truncating NewTable before each load of a new date range, your approach will not work. ROW_NUMBER will restart at 1 and violate the primary key.
Even if you ARE truncating the table or only loading once ever, there is still a better way. Designate IdentifierCode_ID as an IDENTITY column and SQL will take care of it for you. If the type is INT and IDENTITY is set, SQL will automatically add 1 to the last value when inserting a new row; you don't even have to assign it!
CREATE TABLE dbo.NewTable(
[IdentifierCode_ID] int IDENTITY(1,1) NOT NULL,
[IdentifierCode] VARCHAR(6) NOT NULL,
...
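As a minimal, self-contained illustration of IDENTITY doing the numbering (IdentityDemo is a hypothetical table, not the schema above):
CREATE TABLE dbo.IdentityDemo (
    ID int IDENTITY(1,1) NOT NULL,
    Code VARCHAR(6) NOT NULL
)
INSERT INTO dbo.IdentityDemo (Code) VALUES ('ABC123'), ('DEF456')
SELECT * FROM dbo.IdentityDemo -- ID comes back as 1 and 2, assigned automatically
Note that once NewTable is pre-created like this, the load becomes INSERT INTO ... SELECT rather than SELECT ... INTO.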
Also, make sure you consider what you'll do if you accidentally select an overlapping date range for subsequent loads, or if values in OldTable change. For example, add a restriction to the WHERE clause to exclude existing IdentifierCode values from the insert, and add a second query to update existing IdentifierCode values that have a different LoadDate, StartDate, etc.
...
AND NOT EXISTS (SELECT * FROM NewTable as N WHERE N.IdentifierCode = OldTable.IdentifierCode)
For updating existing rows that changed, you can do an INNER JOIN to select only existing rows and a WHERE clause for only rows that changed.
UPDATE NewTable
SET LoadDate = O.LoadDate, StartDate = O.StartDate, ... --don't forget to recalculate the functions!
FROM NewTable as N INNER JOIN OldTable as O on N.IdentifierCode = O.IdentifierCode
WHERE GETDATE() between O.StartDate and O.EndDate
AND NOT (N.StartDate = O.StartDate and N.EndDate = O.EndDate ... )
How do I delete duplicate records from a Snowflake table? Thanks.
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Adding here a solution that doesn't recreate the table, because recreating a table can break a lot of existing configurations and history.
Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't drop or swap the original table
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction
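To verify afterwards, a check like this (assuming the same three-column layout as above) should return zero rows:
-- any remaining duplicates?
select $1, $2, $3, count(*)
from some_table
group by 1, 2, 3
having count(*) > 1;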
If you have some primary key as such:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then you can delete like this:
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key, you cannot delete that way. In that case, create a deduplicated copy:
CREATE TABLE new_table_name AS
SELECT id, name FROM (
SELECT id
,name
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS rn
FROM table_name
)
WHERE rn = 1
and then swap them
ALTER TABLE table_name SWAP WITH new_table_name
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.
Snowflake does not enforce primary keys; their use is primarily with ERD tools.
Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.
It is possible to temporarily add an "is_duplicate" column, e.g. numbering all the duplicates with the ROW_NUMBER() function, then delete all records with "is_duplicate" > 1 and finally drop the utility column, as sketched below.
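A minimal sketch of that idea, assuming the two-column fruit table from the question (INSERT OVERWRITE rewrites the table to populate the utility column, since there is no ROWID to correlate an UPDATE on):
-- number the copies of each (id, name), keep copy #1, then drop the helper column
alter table fruit add column is_duplicate int;
insert overwrite into fruit
    select id, name, row_number() over (partition by id, name order by id)
    from fruit;
delete from fruit where is_duplicate > 1;
alter table fruit drop column is_duplicate;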
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of adding update timestamp columns in a Snowflake data warehouse.
This has been bothering me for some time as well. Since Snowflake has added support for QUALIFY, you can now create a deduped table with a single statement, without subselects:
CREATE TABLE fruit (id number, name text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, name ORDER BY id, name) = 1;
SELECT * FROM fruit;
Of course you are left with a new table and lose table history, primary keys, foreign keys and such.
Based on the above ideas, the following query worked perfectly in my case.
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit WHERE ID = 1 AND Name = 'Apple';, and then both rows will go away. Or you don't, and keep both.
For some databases there are workarounds using internal row IDs, but there aren't any in Snowflake; see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes either, so your only option is to create a new table and swap.
Additional note on Hans Henrik Eriksen's remark on the importance of update timestamps: this is a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID, UPDATED_AT) IN (
    SELECT ID, UPDATED_AT
    FROM (
        SELECT ID
            , UPDATED_AT
            , ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
        FROM fruit
    )
    WHERE rn > 1
);
A simple UNION eliminates duplicates, for the use case of deduplicating on all columns with no PKs.
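For example, a sketch of ingest-time dedup, where new_load is a hypothetical staging table with the same columns as some_table:
-- UNION (unlike UNION ALL) removes duplicate rows across all columns
insert overwrite into some_table
select * from some_table
union
select * from new_load;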
In any case, the problem should be solved as early as possible in the ingestion pipeline, and/or by using SCD (slowly changing dimensions), etc.
Hunting for a raw, magic "best way to delete" is wrong in principle; SCD with a high-resolution timestamp solves any such problem.
Do you want to fix a massive duplicate load? Then add a column like a batch ID and remove all the records loaded in that batch.
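Sketched with a hypothetical batch_id column and value:
-- every load tags its rows with a batch_id, so a bad load is trivial to undo
delete from some_table where batch_id = 'load-2021-06-01';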
It's like staying healthy; you have two approaches:
eat a lot > get fat > go to the gym to burn it off
eat well > have a healthy lifestyle and no need for the gym.
So before discussing the best gym, try changing your lifestyle.
Hope this helps. Learn to put pressure upstream on the data producers instead of living like Jesus Christ, trying to clean up everyone else's mess.
The following solution is effective if you are using one or a few columns as the primary-key reference for the table.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]m = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
This is inspired by @Felipe Hoffa's answer:
-- create a table with the dupes and take the max ID
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
)
-- join back to the original table on the field, excluding the ID in the duplicate table, and delete
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id
Not sure if people are still interested in this, but I've used the query below, which is more elegant and seems to have worked:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1
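For instance, instantiated with the fruit table used above (id and name as the criteria columns, mirroring the template's order by 1):
create or replace table fruit as
select * from fruit
qualify row_number() over (partition by id, name order by 1) = 1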
I have a UI that allows a user to select one or more fields they want to add to a table. This data also has an orderID associated with it that determines the field order.
When the user adds new fields, I need to find the last orderID this user used and increment it by 1, submitting all of the new fields.
For example, if there is a single record that already exists in the database, it would have an orderID of 1. When I choose to add three more fields, it would check the last orderID I used (1) and then increment it for each of the new records it adds, so the records end up numbered 1-4.
-- Get the last orderID for this user and increment it by 1 as our starting point
DECLARE @lastID INT = (SELECT TOP 1 orderID FROM dbo.BS_ContentRequests_Tasks_User_Fields WHERE QID = @QID ORDER BY orderID DESC)
SET @lastID = @lastID + 1;
-- Create a temp table to hold our fields that we are adding
DECLARE @temp AS TABLE (fieldID int, orderID int)
-- Insert our fields and incremented numbers
INSERT INTO @temp ( fieldID, orderID )
SELECT ParamValues.x1.value('selected[1]', 'int'),
       @lastID++
FROM @xml.nodes('/root/data/fields/field') AS ParamValues(x1);
Obviously the @lastID++ part is where my issue is, but hopefully it helps to understand what I am trying to do.
What other method could be used to handle this?
ROW_NUMBER() ought to do it.
select x.Value,
       ROW_NUMBER() over (order by x.Value) + @lastID
from (
    select ParamValues.x1.value('selected[1]', 'int') Value
    from @xml.nodes('/root/data/fields/field') AS ParamValues(x1)
) x
You could use a column with IDENTITY(1,1)
If you want OrderID to be unique across the entire table then see below:
Take a look at another post that addresses this issue.
There are multiple ways to approach this issue, but in this case, the easiest, while reasonable, means may be to use an identity column. However, that is not as extensible as using a sequence. If you feel that you may need more flexibility in the future, then use a sequence.
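A rough sketch of the sequence option, reusing the table and column names from the question (the sequence and constraint names here are made up):
CREATE SEQUENCE dbo.OrderIDSequence
    START WITH 1
    INCREMENT BY 1;

ALTER TABLE dbo.BS_ContentRequests_Tasks_User_Fields
    ADD CONSTRAINT DF_OrderID
    DEFAULT (NEXT VALUE FOR dbo.OrderIDSequence) FOR orderID;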
If you want OrderID to be unique across the fields inserted in one batch then see below:
You should take a closer look at Chris Steele's answer.
I have a table, Core_Faculty with 4 fields: ID (PK - INT), InstitutionID (INT), PersonID (INT), DeprecatedDate (SMALLDATETIME)
What I'd like to do is delete all the records for institution/person combinations that have both deprecated records and a non-deprecated (DeprecatedDate IS NULL) record, but keep the non-deprecated record.
If an institution/person combination has only one record (whether deprecated or not), I'd like to keep it and leave it alone. I'm only acting on combinations that have both a DeprecatedDate IS NULL record and a DeprecatedDate IS NOT NULL record for the same unique institution/person combination.
End goal is to be left with one record per institution/person combination whether deprecated or not, but giving priority to the record that has a NULL deprecated date. These are the good, live records. However, if we are starting with only one record and it's deprecated, go ahead and keep it.
The database can currently have at most one of each, as institution/person/DeprecatedDate is a unique key on the table.
How would I go about solving this, and what methods can I use to find the appropriate records, while only considering records that have both deprecated and non-deprecated values for the combination?
DELETE f
FROM
Core_Faculty f
INNER JOIN
(
SELECT *,
ROW_NUMBER() OVER (
PARTITION BY
f.InstitutionID,
f.PersonID
ORDER BY
-- non-deprecated (NULL DeprecatedDate) rows sort first, so they survive the delete
CASE
WHEN f.DeprecatedDate IS NULL THEN 1
ELSE 2
END,
f.DeprecatedDate
) RowNum
FROM
Core_Faculty f
) d ON
f.ID = d.ID
WHERE
d.RowNum > 1;
In SQL Server you can use a common table expression with a ROW_NUMBER function to identify the rows you want to keep:
WITH cte AS (
SELECT [ID]
,[InstitutionID]
,[PersonID]
,[DeprecatedDate]
,ROW_NUMBER() OVER (PARTITION BY [InstitutionID], [PersonID]
ORDER BY CASE WHEN [DeprecatedDate] IS NULL THEN 0 ELSE 1 END,
[DeprecatedDate] DESC) as [RowNumber]
FROM [Blog].[dbo].[Core_Faculty]
)
SELECT [ID]
,[InstitutionID]
,[PersonID]
,[DeprecatedDate]
,[RowNumber]
FROM cte
--WHERE [RowNumber] = 1
The CASE expression ensures a row with a NULL [DeprecatedDate] is the 1st row in its [InstitutionID], [PersonID] grouping (with a plain ORDER BY [DeprecatedDate] DESC, SQL Server would sort NULLs last and the live record would be deleted); among deprecated rows, the latest comes first. If there is only one row, even if it is deprecated, it will be kept since it is the 1st row in the grouping.
You can then use
DELETE
FROM cte
WHERE [RowNumber] > 1
instead of the SELECT to remove the rest of the rows, leaving you with just one row per person/institution combo.
Suppose a table with two columns:
ParentEntityId int foreign key
Number int
ParentEntityId is a foreign key to another table.
Number is a local identity, i.e. it is unique within a single ParentEntityId.
Uniqueness is easily achieved via unique key over these two columns.
How can Number be automatically incremented in the context of the ParentEntityId on insert?
Addendum 1
To clarify the problem, here is an abstract.
ParentEntity has multiple ChildEntity rows, and each ChildEntity should have a unique incremental Number in the context of its ParentEntity.
Addendum 2
Treat ParentEntity as a Customer.
Treat ChildEntity as an Order.
So, orders for every customer should be numbered 1, 2, 3 and so on.
Well, there's no native support for this type of column, but you could implement it using a trigger:
CREATE TRIGGER tr_MyTable_Number
ON MyTable
INSTEAD OF INSERT
AS
SET TRANSACTION ISOLATION LEVEL SERIALIZABLE
BEGIN TRAN;
WITH MaxNumbers_CTE AS
(
    SELECT ParentEntityID, MAX(Number) AS Number
    FROM MyTable
    WHERE ParentEntityID IN (SELECT ParentEntityID FROM inserted)
    GROUP BY ParentEntityID
)
INSERT MyTable (ParentEntityID, Number)
SELECT
i.ParentEntityID,
ROW_NUMBER() OVER
(
PARTITION BY i.ParentEntityID
ORDER BY (SELECT 1)
) + ISNULL(m.Number, 0) AS Number
FROM inserted i
LEFT JOIN MaxNumbers_CTE m
ON m.ParentEntityID = i.ParentEntityID
COMMIT
Not tested, but I'm pretty sure it'll work. If you have a primary key, you could also implement this as an AFTER trigger (I dislike using INSTEAD OF triggers; they're harder to understand when you need to modify them 6 months later).
Just to explain what's going on here:
SERIALIZABLE is the strictest isolation mode; it guarantees that only one database transaction at a time can execute these statements, which we need in order to guarantee the integrity of this "sequence." Note that this irreversibly promotes the entire transaction, so you won't want to use this inside of a long-running transaction.
The CTE picks up the highest number already used for each parent ID;
ROW_NUMBER generates a unique sequence for each parent ID (PARTITION BY) starting from the number 1; we add this to the previous maximum if there is one to get the new sequence.
I probably should also mention that if you only ever need to insert one new child entity at a time, you're better off just funneling those operations through a stored procedure instead of using a trigger - you'll definitely get better performance out of it. This is how it's currently done with hierarchyid columns in SQL '08.
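For the single-row case, a rough sketch of that stored-procedure approach might look like this (the procedure name and the one-row-per-call assumption are mine):
CREATE PROCEDURE InsertChildEntity
    @ParentEntityID INT
AS
BEGIN
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRAN;

    -- next Number for this parent: previous max + 1 (or 1 if none yet)
    INSERT MyTable (ParentEntityID, Number)
    SELECT @ParentEntityID, ISNULL(MAX(Number), 0) + 1
    FROM MyTable
    WHERE ParentEntityID = @ParentEntityID;

    COMMIT;
END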
You need to add an OUTPUT clause to the trigger for LINQ to SQL compatibility.
For example:
INSERT MyTable (ParentEntityID, Number)
OUTPUT inserted.*
SELECT
i.ParentEntityID,
ROW_NUMBER() OVER
(
PARTITION BY i.ParentEntityID
ORDER BY (SELECT 1)
) + ISNULL(m.Number, 0) AS Number
FROM inserted i
LEFT JOIN MaxNumbers_CTE m
ON m.ParentEntityID = i.ParentEntityID
This solves the question as I understand it :-)
DECLARE @foreignKey int
SET @foreignKey = 1 -- or however you get this

INSERT Tbl (ParentEntityId, Number)
VALUES (@foreignKey, ISNULL((SELECT MAX(Number) FROM Tbl WHERE ParentEntityId = @foreignKey), 0) + 1)