Select latest row on duplicate values while transferring table? - sql-server

I have a live logging table that frequently saves values. My plan is to take those values and put them in a temporary table with
SELECT * INTO #temp FROM Block
From there, I guess my Block table is empty and the logger can keep on logging new values.
The next step is that I want to save them into an existing table. I wanted to use
INSERT INTO TABLENAME (COLUMN1, COLUMN2, ...) SELECT COLUMN1, COLUMN2, ... FROM #temp
The problem is that the #temp table has duplicate primary keys, and I only want to store the last ID.
I have tried DISTINCT but it didn't work, and I could not get ROW_COUNT to work either. Are there any ideas on how I should do it? I wish to do it with as few reads as possible.
Also, in the future I plan to send them to another database; how do I do that on SQL Server? I guess it's something like FROM Table [in database]?
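For reference, copying rows between databases on the same SQL Server instance generally works with three-part names; a minimal sketch (OtherDb is a placeholder database name):
-- copy into a table that lives in another database on the same instance
INSERT INTO OtherDb.dbo.Product_log (Grade, block_ID, Density, BatchNumber, BlockDateID)
SELECT Grade, block_ID, Density, BatchNumber, BlockDateID
FROM dbo.Product_log;
-- across servers, a linked server adds a fourth name part:
-- LinkedServerName.OtherDb.dbo.Product_log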
I couldn't get the code blocks to copy properly, but here goes:
CREATE TABLE Product_log (
    Grade char(64),
    block_ID char(64) PRIMARY KEY NOT NULL,
    Density char(64),
    BatchNumber char(64) NOT NULL,
    BlockDateID datetime
);
That is the table I want to store the data in, and there I do not want duplicate IDs. The problem is that I get duplicates while logging, because I log on change. Let's say the batch ID is 1 and it becomes 2 while logging: I will then get the same block ID twice, once with batch number 1 and once with batch number 2. How do I pick the latter?
Hope I explained enough for guidance. While logging, the rows look like this:
id SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE SiemensTiaV15_s71200_BatchTester_TestWriteValue_VALUE SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP SiemensTiaV15_s71200_MainTank_Density_VALUE SiemensTiaV15_s71200_MainTank_Grade_VALUE
1 00545 S0047782 2020-06-09 11:18:44.583 0 xxxxx
2 00545 S0047783 2020-06-09 11:18:45.800 0 xxxxx

Please use the query below:
select * from
(select id, SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE,
        SiemensTiaV15_s71200_BatchTester_TestWriteValue_VALUE,
        SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP,
        SiemensTiaV15_s71200_MainTank_Density_VALUE,
        SiemensTiaV15_s71200_MainTank_Grade_VALUE,
        row_number() over (partition by SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE
                           order by SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP desc) as rnk
 from table_name) qry
where rnk = 1;
SELECT * INTO #temp FROM Block;

INSERT INTO Product_log (Grade, block_ID, Density, BatchNumber, BlockDateID)
SELECT NewBatchIDValue_VALUE, TestWriteValue_VALUE, TestWriteValue_TIMESTAMP,
       Density_VALUE, Grade_VALUE
FROM (
    SELECT NewBatchIDValue_VALUE, TestWriteValue_VALUE,
           TestWriteValue_TIMESTAMP, Density_VALUE, Grade_VALUE,
           -- latest row per batch ID wins
           ROW_NUMBER() OVER (PARTITION BY NewBatchIDValue_VALUE
                              ORDER BY TestWriteValue_TIMESTAMP DESC) AS rnk
    FROM #temp
) qry
WHERE rnk = 1;

Related

Update using subquery not working in Snowflake

(Submitting on behalf of a Snowflake User)
We have wrong duplicate IDs loaded in the table and we need to correct them. The rule for updating an ID is: whenever there is a time difference of more than 30 minutes, the ID should be new/unique. I have written the query to filter that out; however, the update is not happening.
The query below finds the IDs to be updated. For testing I have used one particular ID.
select id,
       BEFORE_TIME,
       TIMESTAMP,
       datediff(minute, BEFORE_TIME, TIMESTAMP) time_diff,
       row_number() over (partition by id order by TIMESTAMP) rowno,
       concat(id, to_varchar(rowno)) newid
from (
    select id,
           TIMESTAMP,
           lag(TIMESTAMP) over (partition by id order by TIMESTAMP) as BEFORE_TIME
    from table_name
    where id = 'XX1X2375'
)
where BEFORE_TIME is not null and time_diff > 30
order by time_diff desc;
And I could see the 12 records with the same ID and a time difference of more than 30 minutes. However, when I try to update, the query is successful but nothing gets updated.
update table_name t
set t.id = c.newid
from (
    select id,
           BEFORE_TIME,
           TIMESTAMP,
           datediff(minute, BEFORE_TIME, TIMESTAMP) time_diff,
           row_number() over (partition by id order by TIMESTAMP) rowno,
           concat(id, to_varchar(rowno)) newid
    from (
        select id,
               TIMESTAMP,
               lag(TIMESTAMP) over (partition by id order by TIMESTAMP) as BEFORE_TIME
        from table_name
        where id = 'XX1X2375'
    )
    where BEFORE_TIME is not null and time_diff > 30
) c
where t.id = c.id
  and t.TIMESTAMP = c.BEFORE_TIME;
Please note: I even created a temp table t1 from the above subquery, and I can see the records in table t1.
When doing a select with a join to the main table, I can even see the records in the main table.
But again, when I try to update using the new t1, it just shows zero records updated.
I even tried MERGE, but same issue.
MERGE INTO snowplow_data_subset_temp t
USING t1
ON (trim(t.visit_id) = trim(t1.visit_id) and trim(t1.BEFORE_DATE) = trim(t.TIMESTAMP_EST))
WHEN MATCHED THEN UPDATE SET visit_number = t1.newid;
Any recommendations, ideas, or work-arounds? Thanks!
This looks like they may be running into two things:
The table that you created (t1): was it a transient or a cloned table? Check out
GET_DDL('table', 'schemaname.t1');
to check whether there are any constraints on the temp table in the session you are working in. Or you can query the table constraints view:
"Alternatively, retrieve a list of all table constraints by schema (or across all schemas in a database) by querying the TABLE_CONSTRAINTS view in the Information Schema." From: https://docs.snowflake.net/manuals/user-guide/table-considerations.html#referential-integrity-constraints
Since the subquery works just fine, the MERGE and UPDATE statements are the clues for what to look for. This is what I found in the documentation for more general info:
Limitations on subqueries:
https://docs.snowflake.net/manuals/user-guide/querying-subqueries.html#limitations
You can also check to see if there are any errors for the update query by altering the session: https://docs.snowflake.net/manuals/sql-reference/sql/update.html#usage-notes
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_UPDATE=TRUE;
Here is an example of how to use an update with a Temp table:
https://snowflakecommunity.force.com/s/question/0D50Z00008P7BznSAF/can-you-use-a-cte-or-temp-table-with-an-update-statement-to-update-a-table
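For comparison, a minimal working shape of an update from a temp table in Snowflake might look like this (a sketch; the table and column names are illustrative, not the OP's):
-- illustrative only: build the fixes, then join back on a stable key
CREATE TEMPORARY TABLE id_fixes AS
SELECT id, concat(id, '_2') AS newid
FROM target_table
WHERE id = 'XX1X2375';

UPDATE target_table t
SET id = f.newid
FROM id_fixes f
WHERE t.id = f.id;
-- if several source rows can match one target row, the update is
-- nondeterministic (see the usage notes linked above)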
I am looking forward to seeing how they ended up solving the issue.

How to delete Duplicate records in snowflake database table

How do I delete duplicate records from a Snowflake table? Thanks.
ID Name
1 Apple
1 Apple
2 Apple
3 Orange
3 Orange
Result should be:
ID Name
1 Apple
2 Apple
3 Orange
Adding here a solution that doesn't recreate the table, because recreating a table can break a lot of existing configurations and history.
Instead we are going to delete only the duplicate rows and insert a single copy of each, within a transaction:
-- find all duplicates
create or replace transient table duplicate_holder as (
select $1, $2, $3
from some_table
group by 1,2,3
having count(*)>1
);
-- time to use a transaction to insert and delete
begin transaction;
-- delete duplicates
delete from some_table a
using duplicate_holder b
where (a.$1,a.$2,a.$3)=(b.$1,b.$2,b.$3);
-- insert single copy
insert into some_table
select *
from duplicate_holder;
-- we are done
commit;
Advantages:
Doesn't recreate the table
Doesn't modify the original table's structure
Only deletes and inserts duplicated rows (good for time travel storage costs, avoids unnecessary reclustering)
All in a transaction
If you have some primary key as such:
CREATE TABLE fruit (key number, id number, name text);
insert into fruit values (1,1, 'Apple'), (2,1,'Apple'),
(3,2, 'Apple'), (4,3, 'Orange'), (5,3, 'Orange');
then you can delete the duplicates directly:
DELETE FROM fruit
WHERE key in (
SELECT key
FROM (
SELECT key
,ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY key) AS rn
FROM fruit
)
WHERE rn > 1
);
But if you do not have a unique key then you cannot delete that way. At that point you can build a deduplicated copy instead (note WHERE rn = 1, so one copy of each row is kept):
CREATE TABLE new_table_name AS
SELECT id, name FROM (
    SELECT id,
           name,
           -- ROW_NUMBER requires an ORDER BY; any order works within a partition here
           ROW_NUMBER() OVER (PARTITION BY id, name ORDER BY id) AS rn
    FROM table_name
)
WHERE rn = 1;
and then swap them:
ALTER TABLE table_name SWAP WITH new_table_name;
Here's a very simple approach that doesn't need any temporary tables. It will work very nicely for small tables, but might not be the best approach for large tables.
insert overwrite into some_table
select distinct * from some_table
;
The OVERWRITE keyword means that the table will be truncated before the insert takes place.
Snowflake does not enforce primary keys; they are primarily useful for ERD tools.
Snowflake does not have something like a ROWID either, so there is no way to identify duplicates for deletion.
It is possible to temporarily add an "is_duplicate" column, e.g. numbering all the duplicates with the ROW_NUMBER() function, then delete all records with "is_duplicate" > 1, and finally drop the utility column.
Another way is to create a duplicate table and swap, as others have suggested.
However, constraints and grants must be kept. One way to do this is:
CREATE TABLE new_table LIKE old_table COPY GRANTS;
INSERT INTO new_table SELECT DISTINCT * FROM old_table;
ALTER TABLE old_table SWAP WITH new_table;
The code above removes exact duplicates. If you want to end up with a row for each "PK" you need to include logic to select which copy you want to keep.
This illustrates the importance of adding update-timestamp columns in a Snowflake data warehouse.
This has been bothering me for some time as well. Since Snowflake added support for QUALIFY, you can now create a deduplicated table with a single statement, without subselects:
CREATE TABLE fruit (id number, nam text);
insert into fruit values (1, 'Apple'), (1,'Apple'),
(2, 'Apple'), (3, 'Orange'), (3, 'Orange');
CREATE OR REPLACE TABLE fruit AS
SELECT * FROM
fruit
qualify row_number() OVER (PARTITION BY id, nam ORDER BY id, nam) = 1;
SELECT * FROM fruit;
Of course you are left with a new table, and you lose the table history, primary keys, foreign keys and such.
Based on the above ideas, the following query worked perfectly in my case:
CREATE OR REPLACE TABLE SCHEMA.table
AS
SELECT
DISTINCT *
FROM
SCHEMA.table
;
Your question boils down to: how can I delete one of two perfectly identical rows? You can't. You can only do a DELETE FROM fruit WHERE id = 1 AND name = 'Apple';, and then both rows will go away. Or you don't, and keep both.
For some databases there are workarounds using internal row IDs, but there isn't one in Snowflake, see https://support.snowflake.net/s/question/0D50Z00008FQyGqSAL/is-there-an-internalmetadata-unique-rowid-in-snowflake-that-i-can-reference . You cannot limit deletes either, so your only option is to create a new table and swap.
Additional note on Hans Henrik Eriksen's remark on the importance of update timestamps: this is a real help when the duplicates were added later. If, for example, you want to keep the newer values, you can then do this:
-- setup
create table fruit (ID Integer, Name VARCHAR(16777216), "UPDATED_AT" TIMESTAMP_NTZ);
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (2, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- wait > 1 nanosecond
insert into fruit values (1, 'Apple', CURRENT_TIMESTAMP::timestamp_ntz)
, (3, 'Orange', CURRENT_TIMESTAMP::timestamp_ntz);
-- delete older duplicates (DESC)
DELETE FROM fruit
WHERE (ID
, UPDATED_AT) IN (
SELECT ID
, UPDATED_AT
FROM (
SELECT ID
, UPDATED_AT
, ROW_NUMBER() OVER (PARTITION BY ID ORDER BY UPDATED_AT DESC) AS rn
FROM fruit
)
WHERE rn > 1
);
A simple UNION eliminates duplicates in the use case of deduplicating on all columns with no PKs.
In any case, the problem should be solved as early as possible in the ingestion pipeline, and/or by using SCD etc.
Hunting for one magic way to delete duplicates after the fact is wrong in principle; SCD with a high-resolution timestamp solves any such problem.
Do you want to fix a massive duplicate load? Then add a column like a batch ID and remove all records loaded in that batch, as sketched below.
It's like being healthy; you have 2 approaches:
eat a lot > get fat > go to a gym to burn it off
eat well > have a healthy lifestyle and no need for a gym.
So before discussing the best gym, try changing your lifestyle.
Hope this helps; learn to put pressure upstream on the data producers instead of living like Jesus Christ trying to clean up everyone's mess.
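A rough sketch of that batch-ID idea (the table, column, and batch names are illustrative):
-- stamp every load with a batch id
ALTER TABLE some_table ADD COLUMN batch_id varchar;
-- on ingestion, something like:
-- INSERT INTO some_table SELECT ..., 'batch_2020_06_09' FROM staging_table;
-- a duplicated load can then be removed wholesale:
DELETE FROM some_table WHERE batch_id = 'batch_2020_06_09';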
The following solution is effective if you are looking at one or a few columns as the primary key reference for the table. Note that it assumes each duplicate occurs at most twice, since only the second occurrence is captured (duplicate_count = 2).
-- Create a temp table to hold our duplicates (only second occurrence)
CREATE OR REPLACE TRANSIENT TABLE temp_table AS (
SELECT [col1], [col2], .. [coln]
FROM (
SELECT *, ROW_NUMBER () OVER(
PARTITION BY [pk]1, [pk]2, .. [pk]m
ORDER BY [pk]1, [pk]2, .. [pk]m) AS duplicate_count
FROM [schema].[table]
) WHERE duplicate_count = 2
);
-- Delete all the duplicate records from the table
DELETE FROM [schema].[table] t1
USING temp_table t2
WHERE
t1.[pk]1 = t2.[pk]1 AND
t1.[pk]2 = t2.[pk]2 AND
..
t1.[pk]m = t2.[pk]m;
-- Insert single copy using the temp_table in the original table
INSERT INTO [schema].[table]
SELECT *
FROM temp_table;
This is inspired by @Felipe Hoffa's answer:
-- create a table with the dupes, keeping the max id of each
create or replace transient table duplicate_holder as (
    select max(S.ID) ID, some_field, count(some_field) numberAssets
    from some_table S
    group by some_field
    having count(some_field) > 1
);
-- join back to the original table on the field, excluding the max ID kept in the duplicate table, and delete
delete from some_table as t
USING duplicate_holder as d
WHERE t.some_field = d.some_field
  and t.id <> d.id;
Not sure if people are still interested in this, but I've used the query below, which is more elegant and seems to have worked:
create or replace table {{your_table}} as
select * from {{your_table}}
qualify row_number() over (partition by {{criteria_columns}} order by 1) = 1

Reverse of each value in a column

Suppose I have a table with an even number of rows, e.g. a table Employee with two columns, Name and EmpCode. The table looks like:
Name EmpCode
Ajay 7
Vikash 5
Shalu 4
Hari 8
Anu 1
Puja 9
Now, I want my output with the EmpCode column reversed, like:
Name EmpCode
Ajay 9
Vikash 1
Shalu 8
Hari 4
Anu 5
Puja 7
I need to run this query in SQL Server.
As the OP hasn't replied, I'll post a little explanation for them instead. As everyone has alluded to, tables in SQL Server have no built-in ordering. Your data is stored in what is known as a HEAP. This means that when you run a query without an ORDER BY, your data can be returned in any order the server feels like. With small datasets this might be the order you inserted it in, but that's just it: it might.
When you get to larger datasets, and when you have multiple cores running the operation, then the order of a SELECT * FROM [Table]; is more likely not to be the insertion order, and is more likely to be random with each run of the query. I have several tables where a SELECT TOP 1 *... returns a different row every time I run the query, even with a CLUSTERED INDEX.
The only, yes only, way to guarantee the order is by using ORDER BY. Now, you might have another column, which you haven't shared, that you can order by; but if not, perhaps this (very) simple example will at least assist you, if nothing else:
CREATE TABLE #Employee ([Name] varchar(10), EmpCode tinyint);
INSERT INTO #Employee
VALUES ('Ajay',7),
('Vikash',5),
('Shalu',4),
('Hari',8),
('Anu',1),
('Puja',9);
GO
--Just SELECT *. ORDER is NOT guaranteed but, due to the low volume of data, will probably be in the order of insertion
SELECT *
FROM #Employee;
--But, we want to reverse the order, so, let's add an ORDER BY
SELECT *
FROM #Employee
ORDER BY [Name];
--Oh! That didn't work (duh). Let's try again
SELECT *
FROM #Employee
ORDER BY Empcode;
--Nope, this isn't working. That's because your data has nothing related to its insertion order. So, let's give it one:
GO
DROP TABLE #Employee;
CREATE TABLE #Employee (ID int IDENTITY(1,1), --Oooo, what is this?
[Name] varchar(10),
EmpCode tinyint);
INSERT INTO #Employee
VALUES ('Ajay',7),
('Vikash',5),
('Shalu',4),
('Hari',8),
('Anu',1),
('Puja',9);
GO
--Now look
SELECT *
FROM #Employee;
--So, we can use an ORDER BY, and get the correct order too
SELECT [Name],
Empcode
FROM #Employee
ORDER BY ID;
--So, we got the right ORDER using an ORDER BY. Now we can do something about the ordering:
--We'll need a CTE for this:
WITH RNs AS(
SELECT *,
ROW_NUMBER() OVER (ORDER BY ID ASC) AS RN1,
ROW_NUMBER() OVER (ORDER BY ID DESC) AS RN2
FROM #Employee)
SELECT R1.[Name],
R2.EmpCode
FROM RNs R1
JOIN RNs R2 ON R1.RN1 = R2.RN2;
GO
DROP TABLE #Employee;

Using a running total calculated column in SQL Server table variable

I have inherited a stored procedure that uses a table variable to store data, then updates each row with a running-total calculation. The order of the records in the table variable is very important: we want the volume ordered highest to lowest (i.e. the running total gets increasingly larger as you go down the table).
My problem is that during the step where the table variable is updated, the running total is calculated, but not in the order the data in the table variable was previously sorted by (descending by highest volume).
-- (table variables use @, not #; column types below are assumed for illustration)
DECLARE @TableVariable TABLE ([ID] int, [Volume] decimal(18, 2), [SortValue] decimal(18, 2), [RunningTotal] decimal(18, 2))
--Populate table variable and order by the sort value...
INSERT INTO @TableVariable (ID, Volume, SortValue)
SELECT
    [ID], [Volume], ABS([Volume]) AS SortValue
FROM
    dbo.VolumeTable
ORDER BY
    SortValue DESC
--Set TotalVolume variable (declarations added here; types assumed)...
DECLARE @TotalVolume decimal(18, 2), @RunningTotal decimal(18, 2)
SELECT @TotalVolume = ABS(SUM([Volume]))
FROM @TableVariable
--Calculate running total, update rows in table variable...I believe this is where the problem occurs?
SET @RunningTotal = 0
UPDATE @TableVariable
SET @RunningTotal = RunningTotal = @RunningTotal + [Volume]
FROM @TableVariable
--Output...
SELECT
    ID, Volume, SortValue, RunningTotal
FROM
    @TableVariable
ORDER BY
    SortValue DESC
The result is that the record with the highest volume, which I would have expected the running total to be calculated on first (so that running total = [Volume]), somehow ends up much further down the list. The running total seems to be calculated in a random order.
Here is what I would expect to get, versus what the code actually generates (the original post's screenshots are not reproduced here).
Is there a way to get the UPDATE statement to run against the table variable in such a way that it is ordered by volume descending? From what I've read so far, it could be an issue with the sorting behavior of a table variable, but I'm not sure how to correct it. Can anyone help?
GarethD provided the definitive link to the multiple ways of calculating running totals and their performance. The correct one is both the simplest and the fastest: 300 times faster than the quirky update. That's because it can take advantage of any indexes that cover the sort column, and because it's a lot simpler.
I repeat it here to make clear how much simpler this is when the database provides the appropriate windowing functions:
SELECT
[Date],
TicketCount,
SUM(TicketCount) OVER (ORDER BY [Date] RANGE UNBOUNDED PRECEDING)
FROM dbo.SpeedingTickets
ORDER BY [Date];
The SUM line means: sum the ticket counts over all (UNBOUNDED) the rows that come before (PRECEDING) the current one, when ordered by date.
That ends up being 300 times faster than the quirky update.
The equivalent query for VolumeTable would be:
SELECT
ID,
Volume,
ABS(Volume) as SortValue,
SUM(Volume) OVER (ORDER BY ABS(Volume) DESC RANGE UNBOUNDED PRECEDING)
FROM
VolumeTable
ORDER BY ABS(Volume) DESC
Note that this will be a lot faster if there is an index on the sort column (Volume) and ABS isn't used. Applying any function to a column means the optimizer can't use any indexes that cover it, because the actual sort value is different from the one stored in the index.
If the table is very large and performance suffers, you could create a computed column and create an index on it, as sketched below.
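In T-SQL that could look like this (a sketch against the question's VolumeTable; the column and index names are illustrative):
-- persist the sort expression so it can be indexed
ALTER TABLE dbo.VolumeTable ADD AbsVolume AS ABS(Volume) PERSISTED;
CREATE INDEX IX_VolumeTable_AbsVolume ON dbo.VolumeTable (AbsVolume DESC);
-- the running-total query can then order by the indexed column:
-- SUM(Volume) OVER (ORDER BY AbsVolume DESC RANGE UNBOUNDED PRECEDING)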
Take a peek at the window functions offered in SQL Server.
For example
Declare #YourTable table (ID int,Volume int)
Insert Into #YourTable values
(100,1306489),
(125,898426),
(150,907404)
Select ID
,Volume
,RunningTotal = sum(Volume) over (Order by Volume Desc)
From #YourTable
Order By Volume Desc
Returns
ID Volume RunningTotal
100 1306489 1306489
150 907404 2213893
125 898426 3112319
To be clear, the @YourTable is for demonstrative purposes only. There should be no need to INSERT your actual data into a table variable.
EDIT to support 2008 (the good news is that ROW_NUMBER() is supported in 2008; SUM() OVER (ORDER BY ...) only arrived in 2012, hence the self-join below):
Select ID
,Volume
,RowNr=Row_Number() over (Order by Volume Desc)
Into #Temp
From #YourTable
Select A.ID
,A.Volume
,RunningTotal = sum(B.Volume)
From #Temp A
Join #Temp B on (B.RowNr<=A.RowNr)
Group By A.ID,A.Volume
Order By A.Volume Desc

TSQL Incrementing Count of Variable

I have a UI that allows a user to select one or more fields they want to add to a table. This data also has an orderID associated with it that determines the field order.
When the user adds new fields, I need to find the last orderID this user used, increment it by 1, and submit all of the new fields.
For example, if a single record already exists in the database, it would have an orderID of 1. When I choose to add three more fields, it would check the last orderID I used (1) and then increment it for each of the new records it adds, so the fields end up numbered 1-4.
-- Get the last orderID for this user and increment it by 1 as our starting point
DECLARE @lastID INT = (SELECT TOP 1 orderID FROM dbo.BS_ContentRequests_Tasks_User_Fields WHERE QID = @QID ORDER BY orderID DESC)
SET @lastID = @lastID + 1;
-- Create a table variable to hold the fields that we are adding
DECLARE @temp AS TABLE (fieldID int, orderID int)
-- Insert our fields and incremented numbers
INSERT INTO @temp (fieldID, orderID)
SELECT ParamValues.x1.value('selected[1]', 'int'),
       @lastID++
FROM @xml.nodes('/root/data/fields/field') AS ParamValues(x1);
Obviously the @lastID++ part is where my issue is, but hopefully it helps to show what I am trying to do.
What other method could be used to handle this?
ROW_NUMBER() ought to do it.
select x.Value,
       ROW_NUMBER() over (order by x.Value) + @lastID
from (
    select ParamValues.x1.value('selected[1]', 'int') Value
    from @xml.nodes('/root/data/fields/field') AS ParamValues(x1)
) x
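Wired back into the question's insert, that might look like the following sketch (it assumes @lastID was already incremented by 1 as in the question, hence the - 1; ORDER BY (SELECT NULL) just means "any order"):
INSERT INTO @temp (fieldID, orderID)
SELECT ParamValues.x1.value('selected[1]', 'int'),
       ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) + @lastID - 1
FROM @xml.nodes('/root/data/fields/field') AS ParamValues(x1);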
You could use a column with IDENTITY(1,1)
If you want OrderID to be unique across the entire table then see below:
There are multiple ways to approach this, but in this case the easiest reasonable option may be an identity column. However, that is not as extensible as using a sequence. If you feel that you may need more flexibility in the future, use a sequence, as sketched below.
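A sequence-based version might look like this (a sketch; the sequence name is a placeholder and the target column list is assumed):
-- one sequence shared by all inserts keeps orderID unique across the table
CREATE SEQUENCE dbo.OrderIdSequence START WITH 1 INCREMENT BY 1;

INSERT INTO dbo.BS_ContentRequests_Tasks_User_Fields (QID, fieldID, orderID)
SELECT @QID,
       ParamValues.x1.value('selected[1]', 'int'),
       NEXT VALUE FOR dbo.OrderIdSequence
FROM @xml.nodes('/root/data/fields/field') AS ParamValues(x1);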
If you want OrderID to be unique across the fields inserted in one batch then see below:
You should take a closer look at Chris Steele's answer.
