Update using subquery not working in Snowflake - snowflake-cloud-data-platform

(Submitting on behalf of a Snowflake User)
We have wrong duplicate ids loaded in the table and we need to correct them. The rule for updating the id is: whenever there is a time difference of more than 30 minutes, the id should be new/unique. I have written the query to filter that out; however, the update is not happening.
The query below finds the ids to be updated. For testing I have used a particular id.
select id,
       BEFORE_TIME,
       TIMESTAMP,
       datediff(minute, BEFORE_TIME, TIMESTAMP) time_diff,
       row_number() over (PARTITION BY id ORDER BY TIMESTAMP) rowno,
       concat(id, to_varchar(rowno)) newid
from
    (SELECT id,
            TIMESTAMP,
            LAG(TIMESTAMP_EST) OVER (PARTITION BY visit_id ORDER BY TIMESTAMP) as BEFORE_TIME
     FROM table_name t
     where id = 'XX1X2375'
     order by TIMESTAMP_EST)
where BEFORE_TIME is not NULL and time_diff > 30
order by time_diff desc;
And I can see 12 records with the same id and a time difference of more than 30 minutes. However, when I try to update, the query is successful but nothing gets updated.
update table_name t
set t.id = c.newid
from
    (select id,
            BEFORE_TIME,
            TIMESTAMP,
            datediff(minute, BEFORE_TIME, TIMESTAMP) time_diff,
            row_number() over (PARTITION BY id ORDER BY TIMESTAMP) rowno,
            concat(id, to_varchar(rowno)) newid
     from
         (SELECT id,
                 TIMESTAMP,
                 LAG(TIMESTAMP) OVER (PARTITION BY visit_id ORDER BY TIMESTAMP) as BEFORE_TIME
          FROM table_name t
          where id = 'XX1X2375'
          order by TIMESTAMP_EST)
     where BEFORE_TIME is not NULL and time_diff > 30
     order by time_diff desc) c
where t.id = c.id
and t.timestamp = c.BEFORE_TIME;
Please note: I even created a temp table t1 from the above subquery, and I can see the records in table t1. When doing a select joining t1 to the main table, I can even see the records in the main table. But again, when I try to update using the new t1, it just shows zero records updated. I even tried MERGE, but same issue:
MERGE INTO snowplow_data_subset_temp t
USING t1
ON (trim(t.visit_id) = trim(t1.visit_id) and trim(t1.BEFORE_DATE) = trim(t.TIMESTAMP_EST))
WHEN MATCHED THEN UPDATE SET visit_number = newid;
Any recommendations, ideas, or work-arounds? Thanks!

This looks like they may be running into two things:
Was the t1 table you created a transient or a cloned table? Check out
GET_DDL('table', 'schemaname.t1');
to check whether there are any constraints on the temp table in the session you work on this next. Or you can query the table constraints view:
"Alternatively, retrieve a list of all table constraints by schema (or across all schemas in a database) by querying the TABLE_CONSTRAINTS View in the Information Schema." from: https://docs.snowflake.net/manuals/user-guide/table-considerations.html#referential-integrity-constraints
Since the subquery works just fine, the MERGE and UPDATE statements are the clues for what to look for. This is what I found in the documentation for more general info:
Limitations on subqueries:
https://docs.snowflake.net/manuals/user-guide/querying-subqueries.html#limitations
You can also check to see if there are any errors for the update query by altering the session: https://docs.snowflake.net/manuals/sql-reference/sql/update.html#usage-notes
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_UPDATE=TRUE;
Here is an example of how to use an update with a Temp table:
https://snowflakecommunity.force.com/s/question/0D50Z00008P7BznSAF/can-you-use-a-cte-or-temp-table-with-an-update-statement-to-update-a-table
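In the same spirit, here is a minimal sketch of the temp-table update pattern for this case (column names are taken from the question; the 30-minute filter is omitted for brevity). One thing worth checking in the original statement is that it joins on t.timestamp = c.BEFORE_TIME, the previous row's time, rather than on the row's own TIMESTAMP:

-- materialize the new ids once, keyed by the row's own timestamp
create temporary table t1 as
select id,
       TIMESTAMP,
       concat(id, to_varchar(row_number() over (partition by id order by TIMESTAMP))) as newid
from table_name
where id = 'XX1X2375';

-- join back on the row's own timestamp so each matched row is the one being renumbered
update table_name t
set t.id = c.newid
from t1 c
where t.id = c.id
  and t.timestamp = c.TIMESTAMP;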
I am looking forward to seeing how they ended up solving the issue.

Related

Update Query with Order by while preventing two users updating the same row

I have a SQL Server table with an expirydate column, and I want to update the rows with the nearest expirydate. Running two queries (select then update) won't work because two users may update the same row at the same time, so it has to be one query.
The following query:
Update Top(5) table1
Set col1 = 1
Output deleted.* Into table2
This query runs fine but it doesn't sort by expirydate
This query:
WITH q AS
(
SELECT TOP 5 *
FROM table1
ORDER BY expirydate
)
UPDATE table1
SET col1 = 1
OUTPUT deleted.* INTO table2
WHERE table1.id IN (SELECT id FROM q)
It works but again I run the risk of two users updating the same row at the same time
What options do I have to make this work?
Thanks for the help
In these types of scenarios, if you want a more optimistic concurrency approach, you need to include an ORDER BY and/or a WHERE clause to filter the rows.
In application design it is common to use SELECT TOP (@count) FROM ... style queries to fill the interface; however, to execute DELETE or UPDATE statements you would use the primary key to specifically identify the rows to modify.
As long as you are not executing deletes, you could use a timestamp or other date-based discriminator column to ensure that your updates only affect the rows that haven't been changed since the last select.
So you could query the current time as part of the select query:
SELECT TOP 5 *, SYSDATETIMEOFFSET() as [Selected]
FROM table1
ORDER BY expirydate
Or query for the timestamp first and add a created column to the table to track new records so you do not include them in deletes. Either way, you need to ensure that the query selecting the rows will always return the same records, even if it is run tomorrow. That means no one can be allowed to modify the expirydate column; if it could be modified, you can't use it as your primary sort or filter key.
DECLARE @selectedTimestamp DateTimeOffset = (SELECT SYSDATETIMEOFFSET())
SELECT TOP 5 *, SYSDATETIMEOFFSET() as [Selected]
FROM table1
WHERE CREATED < @selectedTimestamp
ORDER BY expirydate
Then, in your update, make sure you only update the rows if they have not changed since the time you selected them. This will either require a standard audit trigger on the table to keep the created and modified columns up to date, or for you to manage them manually in your update statement:
WITH q AS
(
    SELECT TOP 5 *
    FROM table1
    WHERE CREATED < @selectedTimestamp
    ORDER BY expirydate
)
UPDATE table1
SET col1 = 1, MODIFIED = SYSDATETIMEOFFSET()
OUTPUT deleted.* INTO table2
WHERE table1.id IN (SELECT id FROM q)
AND MODIFIED < @selectedTimestamp
In this way we are effectively ignoring our change if another user has already updated records that were in the same or similar initial selection range.
Ultimately you could combine my initial advice to UPDATE based on the primary key AND the modified dates if you are genuinely concerned about the rows being updated twice.
If you need a more pessimistic approach, you could lock the rows with a specific user based flag so that other users cannot even select those rows, but that requires a much more detailed explanation.
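For completeness, a rough sketch of that pessimistic claim pattern (the LockedBy column and @userId variable are hypothetical, not from the question):

DECLARE @userId INT = 1;                      -- hypothetical current user

UPDATE t
SET LockedBy = @userId                        -- hypothetical per-user lock column
OUTPUT inserted.*                             -- the rows this session just claimed
FROM ( SELECT TOP 5 *
       FROM table1 WITH (UPDLOCK, READPAST)   -- skip rows other sessions currently hold
       WHERE LockedBy IS NULL
       ORDER BY expirydate
     ) t

Rows claimed this way are excluded from other sessions' claims until LockedBy is cleared again.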

Select latest row on duplicate values while transfering table?

I have a live logging table that saves my values frequently.
My plan is to take those values and put them on a temporary table with
SELECT * INTO #temp from Block
From there I guess my block table is empty and the logger can keep on logging new values.
The next step is that I want to save them in an existing table. I wanted to use
INSERT INTO TABLENAME (COLUMN1, COLUMN2, ...) SELECT COLUMN1, COLUMN2, ... FROM #temp
The problem is that the #temp table has duplicate primary keys, and I only want to store the last ID.
I have tried DISTINCT but it didn't work, and I could not get ROW_Count to work. Are there any ideas on how I should do it? I wish to do it with as few reads as possible.
Also, in the future I plan to send them to another database; how do I do that on SQL Server? I guess it's something like FROM Table [in database]?
I couldn't get the code blocks to copy properly, but here goes:
create TABLE Product_log (
Grade char(64),
block_ID char(64) PRIMARY KEY NOT NULL,
Density char(64),
BatchNumber char(64) NOT NULL,
BlockDateID Datetime
);
That is the table I want to store the data in, and there I do not wish to have duplicate ids. The problem is that while logging I get duplicates, since I log on change. Let's say the batch id is 1 and it becomes 2 while logging: I will get a block id twice, once with batch number 1 and once with batch number 2. How do I pick the latter?
Hope I explained enough for guidance. While logging they look like this:
id SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE SiemensTiaV15_s71200_BatchTester_TestWriteValue_VALUE SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP SiemensTiaV15_s71200_MainTank_Density_VALUE SiemensTiaV15_s71200_MainTank_Grade_VALUE
1 00545 S0047782 2020-06-09 11:18:44.583 0 xxxxx
2 00545 S0047783 2020-06-09 11:18:45.800 0 xxxxx
Please use the query below:
select * from
    (select id,
            SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE,
            SiemensTiaV15_s71200_BatchTester_TestWriteValue_VALUE,
            SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP,
            SiemensTiaV15_s71200_MainTank_Density_VALUE,
            SiemensTiaV15_s71200_MainTank_Grade_VALUE,
            row_number() over (partition by SiemensTiaV15_s71200_BatchTester_NewBatchIDValue_VALUE
                               order by SiemensTiaV15_s71200_BatchTester_TestWriteValue_TIMESTAMP desc) as rnk
     from table_name) qry
where rnk = 1;
SELECT * INTO #temp FROM Block;

INSERT INTO Product_log (Grade, block_ID, Density, BatchNumber, BlockDateID)
select NewBatchIDValue_VALUE, TestWriteValue_VALUE, TestWriteValue_TIMESTAMP,
       Density_VALUE, Grade_VALUE
from
    (select NewBatchIDValue_VALUE, TestWriteValue_VALUE,
            TestWriteValue_TIMESTAMP, Density_VALUE, Grade_VALUE,
            row_number() over (partition by BatchTester_NewBatchIDValue_VALUE
                               order by BatchTester_TestWriteValue_VALUE) as rnk
     from #temp) qry
where rnk = 1;
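As for the follow-up about sending the rows to another database: on the same SQL Server instance you can reference the target with a three-part name, and across servers a linked server extends this to a four-part name (server.database.schema.table). A sketch, where OtherDb is a placeholder database name:

INSERT INTO OtherDb.dbo.Product_log (Grade, block_ID, Density, BatchNumber, BlockDateID)
SELECT Grade, block_ID, Density, BatchNumber, BlockDateID
FROM dbo.Product_log;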

T-SQL: GROUP BY, but while keeping a non-grouped column (or re-joining it)?

I'm on SQL Server 2008, and having trouble querying an audit table the way I want to.
The table shows every time a new ID comes in, as well as every time an ID's Type changes:
Record #  ID     Type  Date
1         ae08k  M     2017-01-02:12:03
2         liei0  A     2017-01-02:12:04
3         ae08k  C     2017-01-02:13:05
4         we808  A     2017-01-03:20:05
I'd kinda like to produce a snapshot of the status for each ID, at a certain date. My thought was something like this:
SELECT
    ID,
    max(date) AS Max
FROM
    Table
WHERE
    Date < 'whatever-my-cutoff-date-is-here'
GROUP BY
    ID
But that loses the Type column. If I add the Type column to my GROUP BY, then I'd naturally get duplicate rows per ID, one for each Type it had before the date.
So I was thinking of running a second version of the table (via a common table expression), and left joining that in to get the Type.
On my query above, all I have to join to are the ID & Date. Somehow if the dates are too close together, I end up with duplicate results (like say above, ae08k would show up once for each Type). That or I'm just super confused.
Basically all I ever do in SQL are left joins, group bys, and common table expressions (to then left join). What am I missing that I'd need in this situation...?
Use row_number()
select *
from ( select *
       , row_number() over (partition by id order by date desc) as rn
       from table
       WHERE Date < 'whatever-my-cutoff-date-is-here'
     ) tt
where tt.rn = 1
I'd kinda like to know how many IDs are of each type, at a certain date.
Well, for that you use COUNT and GROUP BY on Type:
SELECT Type, COUNT(ID)
FROM Table
WHERE Date < 'whatever-your-cutoff-date-is-here'
GROUP BY Type
Based on your comment under Zohar Peled's answer, you are probably looking for something like this:
; with cte as (select distinct ID from Table where Date < '$param')
select [data].*, [data2].[count]
from cte
cross apply
    ( select top 1 *
      from Table
      where Table.ID = cte.ID
        and Table.Date < '$param'
      order by Table.Date desc
    ) as [data]
cross apply
    ( select count(1) as [count]
      from Table
      where Table.ID = cte.ID
        and Table.Date < '$param'
    ) as [data2]
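For what it's worth, the same snapshot-plus-count can also be produced in a single pass with window functions; a sketch using the same placeholder table and parameter:

select *
from ( select *
       , row_number() over (partition by ID order by Date desc) as rn  -- latest row per ID
       , count(*) over (partition by ID) as [count]                    -- rows per ID before the cutoff
       from Table
       where Date < '$param'
     ) t
where t.rn = 1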

How to remove duplicate rows in SQL Server?

Environment:
OS: Windows Server 2012 DataCenter
DBMS: SQL Server 2012
Hardware (VPS): Xeon E5530 4 cores + 4GB RAM
Question:
I have a large table with 140 million rows. Some rows are supposed to be duplicate so I want to remove such rows. For example:
id name value timestamp
---------------------------------------
001 dummy1 10 2015-7-27 10:00:00
002 dummy1 10 2015-7-27 10:00:00 <-- duplicate
003 dummy1 20 2015-7-27 10:00:00
The second row is deemed duplicate because it has identical name, value and timestamp regardless of different id with the first row.
Note: the first two rows are duplicate NOT because of all identical columns, but due to self-defined rules.
I tried to remove such duplication by using a window function:
select
    id, name, value, timestamp
from
    (select
         id, name, value, timestamp,
         DATEDIFF(SECOND, lag(timestamp, 1) over (partition by name order by timestamp),
                  timestamp) [TimeDiff]
     from table) tab
But after an hour of execution, the locks were used up and this error was raised:
Msg 1204, Level 19, State 4, Line 2
The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
How could I remove such duplicate rows in an efficient way?
What about using a cte? Something like this.
with DeDupe as
(
select id
, [name]
, [value]
, [timestamp]
, ROW_NUMBER() over (partition by [name], [value], [timestamp] order by id) as RowNum
from SomeTable
)
Delete DeDupe
where RowNum > 1;
If the only thing you need is to select the non-duplicate rows from the table, consider using this script:
SELECT MIN(id), name, value, timestamp FROM table GROUP BY name, value, timestamp
If you need to delete duplicate rows:
DELETE FROM table WHERE id NOT IN ( SELECT MIN(id) FROM table GROUP BY name, value, timestamp)
or
DELETE t FROM table t
INNER JOIN table t2
    ON t.name = t2.name
    AND t.value = t2.value
    AND t.timestamp = t2.timestamp
    AND t2.id < t.id
Try something like this: determine the lowest ID for each set of values, then delete the rows whose ID is not the lowest one.
Select Name, Value, TimeStamp, min(ID) as LowestID
into #temp1
From MyTable
group by Name, Value, TimeStamp
Delete a
from MyTable a
inner join #temp1 b
on a.Name = b.Name
and a.Value = b.Value
and a.Timestamp = b.timestamp
and a.ID <> b.LowestID
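Since the original error was lock exhaustion on a 140-million-row table, it may also help to delete in batches so that each statement stays below SQL Server's lock-escalation threshold (roughly 5,000 locks). A sketch combining batching with the CTE approach above; the batch size is an assumption to tune:

DECLARE @batch INT = 4000;  -- assumed batch size, kept under the ~5,000-lock escalation threshold
WHILE 1 = 1
BEGIN
    WITH DeDupe AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY [name], [value], [timestamp] ORDER BY id) AS RowNum
        FROM SomeTable
    )
    DELETE TOP (@batch) FROM DeDupe
    WHERE RowNum > 1;

    IF @@ROWCOUNT = 0 BREAK;  -- stop once no duplicates remain
END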

Using NEWID() with CTE to produce random subset of rows produces odd results

I'm writing some SQL in a stored procedure to reduce a dataset to a limited random number of rows that I want to report on.
The report starts with a Group of Users and a filter is applied to specify the total number of random rows required (#SampleLimit).
To achieve the desired result, I start by creating a CTE (temp table) with:
The top(#SampleLimit) applied
group by UserId (as the UserID appears multiple times)
order by NEWID() to put the results in a random order
SQL:
; with cte_temp as
(select top(@SampleLimit) UserId from QueryResults
 where (GroupId = @GroupId)
 group by UserId order by NEWID())
Once I have this result set, I then delete any results where the UserId is NOT IN the CTE created in the previous step.
delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from cte_temp))
The issue I'm having is that from time to time I get more results than specified in @SampleLimit, while other times it works exactly as expected.
I've tried breaking up the SQL and executing it outside the application and I cannot reproduce the issue.
Is there anything fundamentally wrong with what I am doing that could explain why I occasionally get more results than I request?
For completeness, here is my refactored solution based on the answer below:
select top(@SampleLimit) UserId into #T1
from QueryResults
where (GroupId = @GroupId)
group by UserId
order by NEWID()

delete QueryResults
where (GroupId = @GroupId) and (UserId not in(select UserId from #T1))
It is nondeterministic how many times the SELECT statement involving NEWID() will be executed.
If you get a nested loops anti semi join between QueryResults and cte_temp, and there is no spool in the plan, the CTE will likely be re-evaluated as many times as there are rows in QueryResults. This means that for each outer row, the set being compared against with NOT IN may be entirely different.
Instead of using a CTE you can materialize the results into a temporary table to avoid this.
CREATE TABLE #T (UserId INT);  -- assumed type; match QueryResults.UserId

INSERT INTO #T
SELECT TOP(@SampleLimit) UserId
FROM QueryResults
WHERE ( GroupId = @GroupId )
GROUP BY UserId
ORDER BY NEWID()
Then reference that in the DELETE
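That is, mirroring the asker's refactored version:

DELETE QueryResults
WHERE (GroupId = @GroupId)
  AND (UserId NOT IN (SELECT UserId FROM #T))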
