T-SQL: using UNION to de-dupe, but retain an additional column - sql-server

I am using T-SQL to return records from the database. There are multiple criteria, but the list of IDs must be distinct. In short, the T-SQL looks like this:
SELECT
t1.ID,
[query1mark] = 1
WHERE criteria1 = 1
UNION
SELECT
t1.ID,
[query2mark] = 1
WHERE criteria2 = 1
I would like to be able to use UNION to de-dupe on the ID field (the data has to be unique on the ID field), whilst retaining the derived column "query1mark" or "query2mark" to highlight which query each row came from. In my real-world case, there are 5 queries that need to be de-duped against each other, so I need an efficient solution if possible.
EDIT: Additionally, the results from the first query need to be prioritised over those from the second query, and the results from the second query need to be prioritised over those from the third query. As I understand it, this behaviour is inherent when using UNION, as it will only add records from below the UNION statement.
Is Union the best solution for this, and if not, what can I use?
Thanks

What about this:
-- Sample data: one row per (ID, criteria) match
DECLARE @DataSource TABLE
(
    [ID] INT
   ,[criteria] INT
);
INSERT INTO @DataSource ([ID], [criteria])
VALUES (1, 1)
      ,(1, 2)
      ,(2, 1)
      ,(3, 1)
      ,(3, 2)
      ,(4, 2);
-- Number the rows within each ID, lowest criteria first
WITH DataSource ([ID], [query_mark], [RowID]) AS
(
    SELECT *
          ,ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [criteria] ASC)
    FROM @DataSource
)
-- Keep only the first row of each ID
SELECT [ID], [query_mark]
FROM DataSource
WHERE [RowID] = 1;
The idea is to create a sequence number for all duplicated elements within a particular group. The duplicates are ordered by the criteria field, but you can change the logic if you need to, for example to show the biggest criteria. The group is defined using the PARTITION BY [ID] clause, which means the items are ordered within each [ID] group. Then, in the final select, we only need to show one record per group: [RowID] = 1.
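Applied to the setup in the question, the same idea might look like the sketch below: tag each query's rows with a priority number, combine them with UNION ALL, and keep the best-ranked row per ID. This is only an illustration, written in Python with the standard-library sqlite3 module standing in for SQL Server. The table t1, its columns and the sample rows are hypothetical; the SELECT itself uses only UNION ALL and ROW_NUMBER(), which T-SQL also supports.

```python
import sqlite3


def dedupe_with_priority():
    """Return one (ID, query_mark) row per ID, preferring the earliest query."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical sample data: ID 1 matches both criteria, ID 4 matches none.
    conn.executescript("""
        CREATE TABLE t1 (ID INTEGER, criteria1 INTEGER, criteria2 INTEGER);
        INSERT INTO t1 VALUES (1, 1, 1), (2, 1, 0), (3, 0, 1), (4, 0, 0);
    """)
    rows = conn.execute("""
        WITH combined AS (
            -- query_mark doubles as the priority: lower wins
            SELECT ID, 1 AS query_mark FROM t1 WHERE criteria1 = 1
            UNION ALL
            SELECT ID, 2 AS query_mark FROM t1 WHERE criteria2 = 1
        ),
        ranked AS (
            SELECT ID, query_mark,
                   ROW_NUMBER() OVER (PARTITION BY ID ORDER BY query_mark) AS rn
            FROM combined
        )
        SELECT ID, query_mark FROM ranked WHERE rn = 1 ORDER BY ID
    """).fetchall()
    conn.close()
    return rows


result = dedupe_with_priority()
```

With five queries, the same pattern would simply get five UNION ALL branches, each with its own query_mark value.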

You can use TOP 1 WITH TIES:
SELECT TOP 1 WITH TIES * FROM yourtable
ORDER BY ROW_NUMBER() OVER (PARTITION BY [ID] ORDER BY [criteria]);

Related

How to delete Duplicate records in snowflake database table

how to delete the duplicate records from snowflake table. Thanks
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Adding here a solution that doesn't recreate the table. This because recreating a table can break a lot of existing configurations and history.
Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't modify the original table
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction
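For readers who want to try the delete-and-reinsert pattern locally, here is a rough sketch of the same steps in Python with the standard-library sqlite3 module. SQLite is only a stand-in for Snowflake, so the syntax differs from the statements above (named columns instead of $1, $2, $3, and an IN subquery instead of DELETE ... USING). The fruit table and the helper name dedupe_in_place are assumptions made for the example.

```python
import sqlite3


def dedupe_in_place(conn):
    """Replace every group of identical rows in fruit with a single copy."""
    # Hold one copy of each (id, name) that occurs more than once.
    conn.execute("""
        CREATE TEMP TABLE duplicate_holder AS
        SELECT id, name FROM fruit GROUP BY id, name HAVING COUNT(*) > 1
    """)
    # 'with conn' commits both statements together, or rolls both back on error.
    with conn:
        # Delete every copy of the duplicated rows.
        # Note: rows containing NULL would not match this comparison.
        conn.execute("""
            DELETE FROM fruit
            WHERE (id, name) IN (SELECT id, name FROM duplicate_holder)
        """)
        # Put back exactly one copy of each.
        conn.execute("INSERT INTO fruit SELECT id, name FROM duplicate_holder")
    conn.execute("DROP TABLE duplicate_holder")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fruit (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO fruit VALUES (?, ?)",
    [(1, "Apple"), (1, "Apple"), (2, "Apple"), (3, "Orange"), (3, "Orange")],
)
conn.commit()

dedupe_in_place(conn)
deduped = conn.execute("SELECT id, name FROM fruit ORDER BY id").fetchall()
```

Rows that were never duplicated, such as (2, 'Apple'), are not touched at all, which is the point of this approach.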
If you have a primary key, such as:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then:
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key, then you cannot delete that way. At that point, create a de-duplicated copy of the table:
CREATE TABLE new_table_name AS
SELECT id, name FROM (
    SELECT id
        ,name
        ,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS rn
    FROM table_name
)
WHERE rn = 1;  -- keep exactly one copy of each (id, name)
and then swap them:
ALTER TABLE table_name SWAP WITH new_table_name;
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.
Snowflake does not have effective primary keys; their use is primarily with ERD tools.
Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.
It is possible to temporarily add an "is_duplicate" column, e.g. numbering all the duplicates with the ROW_NUMBER() function, then delete all records with "is_duplicate" > 1, and finally drop the utility column.
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of adding update timestamp columns in a Snowflake data warehouse.
This has been bothering me for some time as well. As Snowflake has added support for QUALIFY, you can now create a de-duplicated table with a single statement, without subselects:
CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;
Of course, you are left with a new table and lose table history, primary keys, foreign keys and such.
Based on the ideas above, the following query worked perfectly in my case:
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit where ID = 1 and Name = 'Apple';, and then both rows will go away. Or you don't, and keep both.
For some databases, there are workarounds using internal row IDs, but there isn't one in Snowflake; see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes either, so your only option is to create a new table and swap.
Additional note on Hans Henrik Eriksen's remark on the importance of update timestamps: this is a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);
A simple UNION eliminates duplicates when the use case is just all columns with no primary keys.
In any case, the problem should be solved as early as possible in the ingestion pipeline, and/or by using SCD (slowly changing dimensions), etc.
Looking for a single magic "best way to delete" is wrong in principle; using SCD with a high-resolution timestamp solves this kind of problem.
Do you want to fix a massive load of duplicates? Then add a column such as a batch ID and remove all records loaded by that batch.
It's like being healthy. You have 2 approaches:
eat a lot > get fat > go to a gym to burn it off
eat well > have a healthy lifestyle and no need for a gym.
So before discussing the best gym, try changing your lifestyle.
Hope this helps. Learn to put pressure upstream on data producers instead of living like Jesus Christ, trying to clean up everyone's mess.
The following solution is effective if you are looking at one or few columns as primary key references for the table.
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]m = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
This is inspired by @Felipe Hoffa's answer:
-- create a table with the dupes and take the max id
create or replace transient table duplicate_holder as (
select max(S.ID) ID, some_field, count(some_field) numberAssets
from some_table S
group by some_field
having count(some_field)>1
);
-- join back to the original table on the field, excluding the ID kept in the duplicate table, and delete
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field=d.some_field
and t.id <> d.id;
Not sure if people are still interested in this, but I've used the query below, which is more elegant and seems to have worked:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1

How can I keep the order of column values in a union select?

I am doing a bulk insert into a table using SELECT and UNION. I need the order of the SELECT values to be unchanged when calling the INSERT, but it seems that the values are being inserted in an ascending order, rather than the order I specify.
For example, the below insert statement
declare @QuestionOptionMapping table
(
    [ID] [int] IDENTITY(1,1)
    , [QuestionOptionID] int
    , [RateCode] varchar(50)
)
insert into @QuestionOptionMapping (
    RateCode
)
select
    'PD0116'
union
select
    'PL0090'
union
select
    'PL0091'
union
select
    'DD0026'
union
select
    'DD0025'
SELECT * FROM @QuestionOptionMapping
renders the data as
(5 row(s) affected)
ID          QuestionOptionID RateCode
----------- ---------------- --------------------------------------------------
1           NULL             DD0025
2           NULL             DD0026
3           NULL             PD0116
4           NULL             PL0090
5           NULL             PL0091
(5 row(s) affected)
How can the select of the inserted data return the same order as when it was inserted?
SQL Server stores your rows as an unordered set. The data points may or may not be contiguous, and they may or may not be in the "order" the data was specified in your insert statements.
When you query the data, the engine will retrieve the rows in the most efficient order, as determined by the optimizer. There is no guarantee that the order will be the same every time you query the data.
The only way to guarantee the order of your result set is to include an explicit ORDER BY clause with your SELECT statement.
See this answer for a much more in-depth discussion of why this is the case: Default row order in SELECT query - SQL Server 2008 vs SQL 2012
By using the SELECT/UNION option for your INSERT statement, you're creating an unordered set that SQL Server ingests as a set, not as a series of inputs. Separate your inserts into discrete statements if you need them to have the IDENTITY values applied in order. Better yet, if the row numbering matters, don't leave it to chance. Explicitly number the rows on insert.
SQL tables do represent unordered sets. However, the identity column on an insert will follow the ordering of the order by.
Your data is getting out of order because of the duplicate elimination in the union. However, I would suggest writing the query to explicitly sort the data:
insert into @QuestionOptionMapping (RateCode)
select ratecode
from (values (1, 'PD0116'),
(2, 'PL0090'),
(3, 'PL0091'),
(4, 'DD0026'),
(5, 'DD0025')
) v(ord, ratecode)
order by ord;
Then be sure to use order by for the select:
select qom.*
from @QuestionOptionMapping qom
order by id;
Note that this also uses the values() table constructor, which is a very handy syntax.
If you're not selecting from tables, you could insert VALUES instead of a select with unions:
insert into @QuestionOptionMapping (RateCode) values
('PD0116')
,('PL0090')
,('PL0091')
,('DD0026')
,('DD0025')
Or, in your query, change every UNION to UNION ALL.
The difference between UNION and UNION ALL is that UNION removes duplicate rows,
while UNION ALL just stitches the result sets from the selects together.
To find those duplicates, UNION typically has to sort the rows internally.
UNION ALL doesn't care about uniqueness, so it doesn't need to sort.
A third option would be to simply change from one insert statement to multiple insert statements:
one insert per value, thus avoiding UNION completely.
But that anti-golf-coding method is also the most wordy.
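To make the UNION versus UNION ALL difference and the one-insert-per-value option concrete, here is a small sketch in Python using the standard-library sqlite3 module. SQLite stands in for SQL Server here, and the mapping table is a simplified, hypothetical version of the table variable in the question (AUTOINCREMENT plays the role of IDENTITY).

```python
import sqlite3

codes = ["PD0116", "PL0090", "PL0091", "DD0026", "DD0025"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mapping (id INTEGER PRIMARY KEY AUTOINCREMENT, rate_code TEXT)"
)

# One INSERT per value, executed in list order, so the ids follow that order.
for code in codes:
    conn.execute("INSERT INTO mapping (rate_code) VALUES (?)", (code,))
conn.commit()

# Read back with an explicit ORDER BY; without it, no order is guaranteed.
in_order = [
    row[0] for row in conn.execute("SELECT rate_code FROM mapping ORDER BY id")
]

# UNION removes the duplicate 'A'; UNION ALL keeps both copies.
union_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 'A' UNION SELECT 'A')"
).fetchone()[0]
union_all_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT 'A' UNION ALL SELECT 'A')"
).fetchone()[0]
```

The final read still uses ORDER BY id, since that is the only part that guarantees the order of the result set.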
Your problem is you are not putting them in in the order you think. UNION is distinct values only and it will typically sort the values to facilitate the distinct. Run the select statement alone and you will see.
If you insert using values then order is preserved:
insert into @QuestionOptionMapping (RateCode) values
('PD0116'), ('PL0090'), ('PL0091'), ('DD0026'), ('DD0025')
select * from @QuestionOptionMapping order by ID

Tree structure data query in SQL Server

I have a table Person that has 3 columns: Id, Name, ParentId where ParentId is the Id of the parent row.
Currently, to display the entire tree, it would have to loop through all child elements until there's no more child elements. It doesn't seem too efficient.
Is there a better and more efficient way to query this data?
Also, is there a better way to represent this tree like structure in a SQL Server database? An alternative design for my table/database?
I don't think there's anything wrong with the design, assuming you have a limited level of parent-child relationships. Here is a quick example of retrieving the relationship using a recursive CTE:
USE tempdb;
GO
CREATE TABLE dbo.tree
(
ID INT PRIMARY KEY,
name VARCHAR(32),
ParentID INT FOREIGN KEY REFERENCES dbo.tree(ID)
);
INSERT dbo.tree SELECT 1, 'grandpa', NULL
UNION ALL SELECT 2, 'dad', 1
UNION ALL SELECT 3, 'me', 2
UNION ALL SELECT 4, 'mom', 1
UNION ALL SELECT 5, 'grandma', NULL;
;WITH x AS
(
-- anchor:
SELECT ID, name, ParentID, [level] = 0
FROM dbo.tree WHERE ParentID IS NULL
UNION ALL
-- recursive:
SELECT t.ID, t.name, t.ParentID, [level] = x.[level] + 1
FROM x INNER JOIN dbo.tree AS t
ON t.ParentID = x.ID
)
SELECT ID, name, ParentID, [level] FROM x
ORDER BY [level]
OPTION (MAXRECURSION 32);
GO
Don't forget to clean up:
DROP TABLE dbo.tree;
This might be a useful article. An alternative is hierarchyid but I find it overly complex for most scenarios.
Aaron Bertrand's answer is very good for the general case. If you only ever need to display the whole tree at once, you can just query the whole table and perform the tree-building in memory. This is likely to be more convenient and flexible. Performance will also be slightly better (the whole table needs to be downloaded anyway, and C# is faster for such calculations than SQL Server).
If you only need a part of the tree this method is not recommended because you'd be downloading more data than needed.
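As a rough illustration of the in-memory approach, the sketch below builds the tree from rows of (Id, Name, ParentId) after a plain SELECT of the whole table. The answer mentions C#; Python is used here purely for illustration, the function names build_tree and render are made up for the example, and the sample rows mirror the data from the recursive CTE answer.

```python
def build_tree(rows):
    """Turn a list of (id, name, parent_id) rows into a list of root nodes."""
    # First pass: one node per row, indexed by id.
    nodes = {
        row_id: {"id": row_id, "name": name, "children": []}
        for row_id, name, _ in rows
    }
    # Second pass: attach each node to its parent, or collect it as a root.
    roots = []
    for row_id, _, parent_id in rows:
        if parent_id is None:
            roots.append(nodes[row_id])
        else:
            nodes[parent_id]["children"].append(nodes[row_id])
    return roots


def render(node, depth=0):
    """Return the subtree as indented lines, two spaces per level."""
    lines = ["  " * depth + node["name"]]
    for child in node["children"]:
        lines.extend(render(child, depth + 1))
    return lines


rows = [
    (1, "grandpa", None),
    (2, "dad", 1),
    (3, "me", 2),
    (4, "mom", 1),
    (5, "grandma", None),
]
roots = build_tree(rows)
outline = [line for root in roots for line in render(root)]
```

Both passes are linear in the number of rows, so the whole tree is built with a single query and no per-level round trips.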

Generate Row Serial Numbers in SQL Query

I have a customer transaction table. I need to create a query that includes a serial number pseudo column. The serial number should be automatically reset and start over from 1 upon change in customer ID.
Now, I am familiar with the row_number() function in SQL. This doesn't exactly solve my problem because, to the best of my knowledge, the serial number will not be reset when the order of the rows changes.
I want to do this in a single query (SQL Server) and without having to go through any temporary table usage etc. How can this be done?
Sometimes we don't want to apply any ordering to the result set just to add a serial number. But ROW_NUMBER() requires an ORDER BY clause, so we can apply a simple trick to avoid ordering the result set:
SELECT ROW_NUMBER() OVER(ORDER BY (SELECT 1)) AS ItemNo, ItemName FROM ItemMastetr
That way we don't need to apply an ORDER BY to the result set; we just add ItemNo to the given result set.
select
ROW_NUMBER() Over (Order by CustomerID) As [S.N.],
CustomerID ,
CustomerName,
Address,
City,
State,
ZipCode
from Customers;
I'm not certain, based on your question, whether you want numbered rows that remember their numbers even if the underlying data changes (and gives a different ordering). But if you just want numbered rows that reset on a change in customer ID, try using the PARTITION BY clause of row_number():
row_number() over(partition by CustomerID order by CustomerID)
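Here is a small sketch of that PARTITION BY behaviour, run through Python's standard-library sqlite3 module as a stand-in for SQL Server. The Transactions table and its TxnDate and Amount columns are hypothetical. Ordering inside the window by TxnDate (an assumption about what the serial number should follow) makes the numbering within each customer deterministic, which ordering by CustomerID alone would not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Transactions (CustomerID INTEGER, TxnDate TEXT, Amount REAL);
    INSERT INTO Transactions VALUES
        (10, '2020-01-05', 20.0),
        (10, '2020-01-02', 15.0),
        (20, '2020-01-03', 99.0),
        (10, '2020-01-09', 5.0),
        (20, '2020-01-01', 42.0);
""")

# SerialNo restarts at 1 for every CustomerID and follows TxnDate within it.
numbered = conn.execute("""
    SELECT CustomerID, TxnDate,
           ROW_NUMBER() OVER (PARTITION BY CustomerID ORDER BY TxnDate) AS SerialNo
    FROM Transactions
    ORDER BY CustomerID, SerialNo
""").fetchall()
```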
Implementing Serial Numbers Without Ordering Any of the Columns
Demo SQL Script-
IF OBJECT_ID('Tempdb..#TestTable') IS NOT NULL
DROP TABLE #TestTable;
CREATE TABLE #TestTable (Names VARCHAR(75), Random_No INT);
INSERT INTO #TestTable (Names,Random_No) VALUES
('Animal', 363)
,('Bat', 847)
,('Cat', 655)
,('Duet', 356)
,('Eagle', 136)
,('Frog', 784)
,('Ginger', 690);
SELECT * FROM #TestTable;
There are many ways to implement serial numbers in SQL Server. Here we use the simple ROW_NUMBER function to generate them.
ROW_NUMBER() is a window function that numbers all rows sequentially (for example 1, 2, 3, …). It is a temporary value that is calculated when the query is run. It must have an OVER clause with ORDER BY, so we cannot simply omit the ORDER BY clause. But we can use something like the following:
SQL Script
IF OBJECT_ID('Tempdb..#TestTable') IS NOT NULL
DROP TABLE #TestTable;
CREATE TABLE #TestTable (Names VARCHAR(75), Random_No INT);
INSERT INTO #TestTable (Names,Random_No) VALUES
('Animal', 363)
,('Bat', 847)
,('Cat', 655)
,('Duet', 356)
,('Eagle', 136)
,('Frog', 784)
,('Ginger', 690);
SELECT Names,Random_No,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS SERIAL_NO FROM #TestTable;
In the above query, we can also use SELECT 1, SELECT 'ABC', or SELECT '' instead of SELECT NULL. The result would be the same.
SELECT ROW_NUMBER() OVER (ORDER BY ColumnName1) As SrNo, ColumnName1, ColumnName2 FROM TableName
select ROW_NUMBER() over (order by pk_field ) as srno
from TableName
Using Common Table Expression (CTE)
WITH CTE AS(
SELECT ROW_NUMBER() OVER(ORDER BY CustomerId) AS RowNumber,
Customers.*
FROM Customers
)
SELECT * FROM CTE
I found a solution for MySQL. It's easy to add a new SrNo column, a kind of temporary auto-increment column, with this query:
SELECT @ab := @ab + 1 AS SrNo, tablename.* FROM tablename, (SELECT @ab := 0)
AS ab
ALTER function dbo.FN_ReturnNumberRows(@Start int, @End int) returns @Numbers table (Number int) as
begin
insert into @Numbers
select n = ROW_NUMBER() OVER (ORDER BY n)+@Start-1 from (
select top (@End-@Start+1) 1 as n from information_schema.columns as A
cross join information_schema.columns as B
cross join information_schema.columns as C
cross join information_schema.columns as D
cross join information_schema.columns as E) X
return
end
GO
select * from dbo.FN_ReturnNumberRows(10,9999)

Year Based Primary Key?

How can I create a Primary Key in SQL Server 2005/2008 with the format:
CurrentYear + auto-increment?
Example: The current year is 2010, in a new table, the ID should start in 1, so: 20101, 20102, 20103, 20104, 20105... and so on.
The cleaner solution is to create a composite primary key consisting of e.g. Year and Counter columns.
Not sure exactly what you are trying to accomplish by doing that, but it makes a lot more sense to do this with two fields.
If the combination of the two must be the PK for some reason, just span it across both columns. However, it seems unnecessary since the identity part will be unique exclusive of the year.
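A minimal sketch of the two-column idea, using Python's standard-library sqlite3 module as a stand-in for SQL Server. The orders table, the note column and the insert_order helper are hypothetical, and restarting the counter at 1 for each year is an assumption about what the question wants. Computing MAX(counter) + 1 like this is only safe with a single writer; a real SQL Server solution would need appropriate locking or a different counter mechanism.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE orders (
        year    INTEGER NOT NULL,
        counter INTEGER NOT NULL,
        note    TEXT,
        PRIMARY KEY (year, counter)   -- composite key: year + counter
    )
""")


def insert_order(conn, year, note):
    """Insert a row with the next counter for the year; return the display ID."""
    with conn:  # commit on success, roll back on error
        # Assumption: the counter restarts at 1 for each year (single writer only).
        (next_counter,) = conn.execute(
            "SELECT COALESCE(MAX(counter), 0) + 1 FROM orders WHERE year = ?",
            (year,),
        ).fetchone()
        conn.execute(
            "INSERT INTO orders VALUES (?, ?, ?)", (year, next_counter, note)
        )
    # The concatenated form is only built for display; the key stays two columns.
    return f"{year}{next_counter}"


ids = [
    insert_order(conn, 2010, "first"),
    insert_order(conn, 2010, "second"),
    insert_order(conn, 2011, "third"),
]
```

Keeping the year and the counter in separate columns means the formatted value such as 20101 never has to be parsed back apart.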
This technically meets the needs of what you requested:
CREATE TABLE #test
( seeded_column INT IDENTITY(1,1) NOT NULL
, year_column INT NOT NULL DEFAULT(YEAR(GETDATE()))
, calculated_column AS CONVERT(BIGINT, CONVERT(CHAR(4), year_column, 120) + CONVERT(VARCHAR(MAX), seeded_column)) PERSISTED PRIMARY KEY
, test VARCHAR(MAX) NOT NULL);
INSERT INTO #test (test)
SELECT 'Badda'
UNION ALL
SELECT 'Cadda'
UNION ALL
SELECT 'Dadda'
UNION ALL
SELECT 'Fadda'
UNION ALL
SELECT 'Gadda'
UNION ALL
SELECT 'Hadda'
UNION ALL
SELECT 'Jadda'
UNION ALL
SELECT 'Kadda'
UNION ALL
SELECT 'Ladda'
UNION ALL
SELECT 'Madda'
UNION ALL
SELECT 'Nadda'
UNION ALL
SELECT 'Padda';
SELECT *
FROM #test;
DROP TABLE #test;
You have to write a trigger for this :)
Have a separate table for storing the last digit used (I really don't know whether there is something similar to sequences in Oracle in SQL Server).
OR
You can get the last inserted item and extract the last number from it.
THEN
You can get the current year from SELECT DATEPART(yyyy,GetDate());
The trigger would be an ON INSERT trigger in which you combine the year and the last digit and update the column.
