I have two processes that work with data in the same table.
One process inserts about 20,000 records daily, one by one (pure ADO.NET), into the target table.
The second process periodically (every 15 minutes) calls a stored procedure that:
1. Detects duplicates among those 20,000 records by looking at all records from the last 7 days and marks them as such.
2. Marks all records that are not duplicates with a 'ToBeCopied' flag.
3. Selects a number of columns from the records marked 'ToBeCopied' and returns the set.
Sometimes these two processes overlap (due to delays in data processing), and I suspect that if the first process inserts new records while the second process is somewhere between steps 1 and 2, those records will be marked 'ToBeCopied' without having gone through the duplicate sifting.
This means that the stored procedure is now returning some duplicates.
This is my theory, but in practice I have not been able to replicate it...
I am using LINQ to SQL to insert duplicates (40-50 or so a second), and while this is running I am manually calling the stored procedure and storing its results.
It appears that when the stored procedure is running the inserting pauses ... such that at the end no duplicates have made it to the final result set.
I am wondering if LINQ to SQL or SQL Server has a default mechanism that prevents concurrency and is pausing the inserting while the selecting or updating takes place.
What do you think?
EDIT 1:
The 'duplicates' are not identical rows. They are 'equivalent' given the business/logical entities these records represent. Each row has a unique primary key.
P.S. Selecting the result set takes place with NOLOCK. Trying to reproduce on SQL Server 2008. Problem is alleged to occur on SQL Server 2005.
What do I think?
Why do you have duplicates in the database? Data purity begins in the client at the app drawing board, which should have a data model that simply does not allow for duplicates.
Why do you have duplicates in the database? A unique constraint or index on the business key should prevent this from happening if the client app misbehaves.
If you have duplicates, the reader must be prepared to handle them.
You cannot detect duplicates in two stages (look, then mark); it has to be one single, atomic mark. In fact, you can do almost nothing in a database in two 'look and mark' stages. All 'look for records, then mark the records found' processes fail under concurrency.
NOLOCK will give you inconsistent reads. Records will be missing or read twice. Use SNAPSHOT isolation.
Linq-To-SQL has no pixie dust to replace bad design.
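For reference, SNAPSHOT isolation (suggested above instead of NOLOCK) has to be enabled at the database level before a session can request it; a minimal sketch, with an illustrative database name:
-- One-time, database-level setting (replace MyDb with the actual database name)
ALTER DATABASE MyDb SET ALLOW_SNAPSHOT_ISOLATION ON;
-- In the reading session, instead of WITH (NOLOCK):
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
BEGIN TRANSACTION;
    -- duplicate detection, ToBeCopied marking and the final SELECT run here
    -- against one consistent snapshot of the data
COMMIT;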
Update
Consider this for instance:
A staging table with a structure like:
CREATE TABLE T1 (
id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
date DATETIME NOT NULL DEFAULT GETDATE(),
data1 INT NULL,
data2 INT NULL,
data3 INT NULL);
Process A is doing inserts at leisure into this table. It does not do any validation, it just dumps raw records in:
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
...
Process B is tasked with extracting this staging table and moving cleaned up data into a table T2. It has to remove duplicates that, by business rules, mean records with same values in data1, data2 and data3. Within a set of duplicates, only the first record by date should be kept:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

DECLARE @maxid INT;

BEGIN TRANSACTION
-- Snap the current max (ID)
--
SELECT @maxid = MAX(id) FROM T1;

-- Extract the cleaned rows into T2 using ROW_NUMBER() to
-- filter out duplicates
--
WITH cte AS (
    SELECT date, data1, data2, data3,
        ROW_NUMBER() OVER
            (PARTITION BY data1, data2, data3 ORDER BY date) AS rn
    FROM T1
    WHERE id <= @maxid)
MERGE INTO T2
USING (
    SELECT date, data1, data2, data3
    FROM cte
    WHERE rn = 1
) s ON s.data1 = T2.data1
    AND s.data2 = T2.data2
    AND s.data3 = T2.data3
WHEN NOT MATCHED BY TARGET
    THEN INSERT (date, data1, data2, data3)
    VALUES (s.date, s.data1, s.data2, s.data3);

-- Delete the processed rows up to @maxid
--
DELETE FROM T1
WHERE id <= @maxid;

COMMIT;
Assuming Process A only inserts, this procedure would safely process the staging table and extract the deduplicated rows. Of course, this is just a skeleton; a true ETL process would have error handling via BEGIN TRY/BEGIN CATCH and transaction log size control via batching.
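A rough sketch of that TRY/CATCH wrapper around the skeleton above (illustrative only, not the original author's code; the error handling details will vary):
BEGIN TRY
    BEGIN TRANSACTION;
    -- the MAX(id) snapshot, the MERGE into T2 and the DELETE from T1 go here
    COMMIT;
END TRY
BEGIN CATCH
    IF @@TRANCOUNT > 0
        ROLLBACK;
    -- re-raise so the calling job sees the failure (RAISERROR works on 2005/2008)
    DECLARE @msg NVARCHAR(2048);
    SELECT @msg = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1);
END CATCH;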
When are you calling submit on your data context? I believe that this happens within a transaction.
As for your problem, what you are saying sounds plausible - would it maybe make more sense to do your load into a staging table (if it's slow) and then do a
INSERT INTO ProductionTable SELECT * FROM StagingTable
once your load is complete?
I have what feels like a complex data problem that I am trying to solve. I am more a developer than a SQL expert, but our DBA, who wrote the script a few months back, recently moved on, and I have been tasked with resolving this problem in the short term.
I am having some real issues with four tables, two of which receive real-time bulk inserts and are read by users, while a daily job copies their records into historic tables and then deletes the copied rows. The operations run 24/7 and there is no downtime for the data insertions or archiving.
A script that originally did this process started to fail. It was escalating from the database's Snapshot/Read Committed isolation to Serializable, and it did a full table scan, so it blocked all insert operations for 1-2 hours, which was not acceptable.
Question(s)
The pain point for the archiving is that we have to wrap the inserts and deletes together in one transaction. Is there a better way to do the below? Is it better to not have foreign keys and to use a trigger or constraint instead? The scripts below either lock the table too much or run for too long (1-4 hours).
Scenario
Four SQL tables
Order (Memory Optimized table)
OrderDetail (Memory Optimized table)
OrderHistory (File/normal table – ColumnStore index)
OrderDetailHistory (File/normal table – ColumnStore index)
The Order and OrderDetail tables receive bulk inserts averaging between 50 and 700 rows per second.
Updates are only performed during the batch insert operation.
During the day the Order table can be between 3-6 million records.
The OrderDetail table can be up to 2-3 times as large as the Order table.
OrderHistory and OrderDetailHistory can have 1-7 days data, so varies between 10-50 million records at any one point.
OrderDetail has a FK reference to Order on the Id column.
At one point of time during the day, the data is copied and inserted for each table into their respective 'history' tables, which are non-memory optimized tables.
Attempt 1
The script with the issue does this:
BEGIN TRAN
INSERT INTO OrderHistory
SELECT * FROM [Order] o
WHERE o.CreatedAt <= DATEADD(DAY, -1, GETDATE())
INSERT INTO OrderDetailHistory
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM [Order]
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM OrderDetail
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
COMMIT
The database is run at Snapshot Isolation level with Read Committed.
When run, we were originally getting serializable errors. After some review, we realised that the delete operation was escalating to the Serializable isolation level in order to lock the nonclustered index on the CreatedAt column while it performed the deletes, because it does a range scan on that index, and all of this happens inside the same transaction we used for selecting the data.
Attempt 2
So next I modified the script by creating two memory-optimized table variables (user-defined table types) to select the data into first, outside of the transaction. Then, in a separate transaction, I insert into our history tables, and finally delete in another transaction. The idea is that if the inserts succeed but the delete fails, the next run will not try to insert the same data twice. The downside is that there will be duplication in the history data until it runs again, and it ran for 2 hours before our scheduling tool timed out, so this doesn't seem ideal.
INSERT INTO @OrderLoadingUdfTable
SELECT * FROM [Order] o
WHERE o.CreatedAt <= DATEADD(DAY, -1, GETDATE())
INSERT INTO @OrderDetailLoadingUdfTable
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= DATEADD(DAY, -1, GETDATE())
BEGIN TRAN
INSERT INTO OrderHistory
SELECT * FROM @OrderLoadingUdfTable
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
AND Id NOT IN (SELECT Id FROM OrderHistory)
INSERT INTO OrderDetailHistory
SELECT * FROM @OrderDetailLoadingUdfTable
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
AND Id NOT IN (SELECT Id FROM OrderDetailHistory)
COMMIT
BEGIN TRAN
DELETE FROM [Order]
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM OrderDetail
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
COMMIT
I'm trying to archive many records in batches rather than in one shot.
Will T-SQL join the two tables, TeamRoster and @teamIdsToDelete, on every iteration of the batch loop? My concern is that if my temporary table is huge and I don't remove records from it as I go, the JOIN might be unnecessarily expensive. On the other hand, how expensive is it to delete from the temporary table as I go? Is it made up for by the (real? hypothetical?) smaller joins I'll have to do in each batch?
(Can provide more details/thoughts but will do so if helpful.)
DECLARE @teamIdsToDelete TABLE
(
    RosterID int PRIMARY KEY
)
-- collect the list of active teamIds. we will rely on the modified date to age them out.
INSERT INTO @teamIdsToDelete
SELECT DISTINCT tr.RosterID
FROM rosterload.TeamRoster tr WITH (NOLOCK)
WHERE tr.IsArchive = 0 AND tr.Loaded = 1
-- age out the remaining rosters (no cap - proved we can update more than 50k by modifying the test case)
WHILE (1 = 1)
BEGIN
BEGIN TRANSACTION
UPDATE TOP (1000) r
SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
FROM rosterload.TeamRoster r with(rowlock)
JOIN @teamIdsToDelete ttd ON ttd.RosterID = r.RosterID
WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc' -- predicate for filtering;
IF ##ROWCOUNT = 0 -- terminating condition;
BEGIN
COMMIT TRANSACTION
BREAK
END
COMMIT TRANSACTION
END
As I understand it, the goal of this query is to archive a huge number of rows without blocking other queries at the same time. The temp table helps you narrow down the subset of records to delete. Since it has a single column that is the clustered primary key, the join to the other PK will be blazingly fast. You will spend more effort on calculating and deleting the already-updated records from the temp table.
Also, there is no reason to use a transaction and do batches; you could just do one big update instead. The result is the same - the table will be locked once the first ~5,000 row locks are acquired (roughly after the first five batches) and stay locked until the COMMIT, and the ROWLOCK hint does not prevent lock escalation. On the other hand, running without the transaction gives other queries the opportunity to continue after each 1000-row batch. If you need to make sure that all records are archived in one go, add some retry logic to your query or application code for errors such as deadlocks or process interruption. And do you really need the NOLOCK hint?
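For example, the same batched update without the explicit transaction would look roughly like this (a sketch based on the query above; each batch commits on its own, so other sessions get a chance in between):
WHILE (1 = 1)
BEGIN
    UPDATE TOP (1000) r
    SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
    FROM rosterload.TeamRoster r WITH (ROWLOCK)
    JOIN @teamIdsToDelete ttd ON ttd.RosterID = r.RosterID
    WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc';

    IF @@ROWCOUNT = 0  -- nothing left to archive
        BREAK;
END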
We are using the technique outlined here to generate random record IDs without collisions. In short, we create a randomly-ordered table of every possible ID, and mark each record as 'Taken' as it is used.
I use the following Stored Procedure to obtain an ID:
ALTER PROCEDURE spc_GetId @retVal BIGINT OUTPUT
AS
DECLARE @curUpdate TABLE (Id BIGINT);
SET NOCOUNT ON;
UPDATE IdMasterList SET Taken = 1
OUTPUT DELETED.Id INTO @curUpdate
WHERE ID = (SELECT TOP 1 ID FROM IdMasterList WITH (INDEX(IX_Taken)) WHERE Taken IS NULL ORDER BY SeqNo);
SELECT TOP 1 @retVal = Id FROM @curUpdate;
RETURN;
The retrieval of the ID must be an atomic operation, as simultaneous inserts are possible.
For large inserts (10+ million), the process is quite slow, as I must pass through the table to be inserted via a cursor.
The IdMasterList has a schema:
SeqNo (BIGINT, NOT NULL) (PK) -- sequence of ordered numbers
Id (BIGINT) -- sequence of random numbers
Taken (BIT, NULL) -- 1 if taken, NULL if not
The IX_Taken index is:
CREATE NONCLUSTERED INDEX IX_Taken ON IdMasterList (Taken ASC)
I generally populate a table with Ids in this manner:
DECLARE @recNo BIGINT;
DECLARE @newId BIGINT;
DECLARE newAdds CURSOR FOR SELECT recNo FROM Adds;
OPEN newAdds;
FETCH NEXT FROM newAdds INTO @recNo;
WHILE @@FETCH_STATUS = 0 BEGIN
    EXEC spc_GetId @newId OUTPUT;
    UPDATE Adds SET id = @newId WHERE recNo = @recNo;
    FETCH NEXT FROM newAdds INTO @recNo;
END;
CLOSE newAdds;
DEALLOCATE newAdds;
Questions:
Is there any way I can improve the SP to extract Ids faster?
Would a conditional (filtered) index improve performance (I've yet to test, as IdMasterList is very big)?
Is there a better way to populate a table with these Ids?
As with most things in SQL Server, if you are using cursors, you are doing it wrong.
Since you are using SQL Server 2012, you can use a SEQUENCE to keep track of what random value you already used and effectively replace the Taken column.
CREATE SEQUENCE SeqNoSequence
AS bigint
START WITH 1 -- Start with the first SeqNo that is not taken yet
CACHE 1000; -- Increase the cache size if you regularly need large blocks
Usage:
CREATE TABLE #tmp
(
recNo bigint,
SeqNo bigint
)
INSERT INTO #tmp (recNo, SeqNo)
SELECT recNo,
NEXT VALUE FOR SeqNoSequence
FROM Adds
UPDATE a
SET id = m.Id
FROM Adds a
INNER JOIN #tmp tmp ON a.recNo = tmp.recNo
INNER JOIN IdMasterList m ON tmp.SeqNo = m.SeqNo
SEQUENCE is atomic. Subsequent calls to NEXT VALUE FOR SeqNoSequence are guaranteed to return unique values, even for parallel processes. Note that there can be gaps in SeqNo, but it's a very small trade off for the huge speed increase.
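If you need to reserve a very large block of values up front (for the 10+ million row loads), sys.sp_sequence_get_range can hand out a whole contiguous range in a single call; a hedged sketch, with an illustrative range size:
DECLARE @first SQL_VARIANT;
EXEC sys.sp_sequence_get_range
     @sequence_name     = N'dbo.SeqNoSequence',
     @range_size        = 1000000,
     @range_first_value = @first OUTPUT;
-- first SeqNo of the reserved block; the caller owns the whole contiguous range
SELECT CAST(@first AS BIGINT) AS FirstReservedSeqNo;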
Put a PK index of type BIGINT on each table, then:
INSERT INTO [user] (name)
VALUES (...);
UPDATE u
SET u.ID = id.ID
FROM [user] u
LEFT JOIN id ON id.PK = u.PK
WHERE u.ID IS NULL;
Or, one row at a time:
DECLARE @PK BIGINT;
INSERT INTO [user] (name) VALUES ('justsaynotocursor');
SET @PK = SCOPE_IDENTITY();
UPDATE [user] SET ID = (SELECT ID FROM id WHERE PK = @PK);
A few ideas that came to mind:
See whether removing the TOP, the inner select, etc. helps improve the performance of the ID fetching (look at STATISTICS IO and the query plan):
UPDATE TOP (1) IdMasterList
SET @retVal = Id, Taken = 1
WHERE Taken IS NULL
Change the index to be a filtered index, since I assume you don't need to fetch numbers that are taken. If I remember correctly, you can't do this for NULL values, so you would need to change the Taken to be 0/1.
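For example, after converting Taken to a NOT NULL bit with 0 = free, a filtered index could look like this (a sketch; the index name is made up):
-- covers only the rows that are still free, keyed to match the ORDER BY SeqNo
CREATE NONCLUSTERED INDEX IX_Taken_Free
ON IdMasterList (SeqNo)
INCLUDE (Id)
WHERE Taken = 0;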
What actually is your problem? Fetching single IDs or 10+ million IDs? Is the problem CPU / I/O etc. caused by the cursor & ID fetching logic, or are the parallel processes being blocked by other processes?
Use sequence object to get the SeqNo. and then fetch the Id from idMasterList using the value returned by it. This could work if you don't have gaps in IdMasterList sequences.
Using the READPAST hint could help with the blocking; for CPU / I/O issues, you should try to optimize the SQL (a sketch follows this list).
If the cause is purely the table being a hotspot, and no other easy solution seems to help, split it into several tables and use some simple logic (even @@SPID, RAND() or something similar) to decide which table the ID should be fetched from. You would need extra checking to make sure all the tables still have free numbers, but it shouldn't be that bad.
Create different procedures (or even tables) to handle fetching of single ID, hundreds of IDs and millions of IDs.
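Regarding the READPAST idea above, a rough sketch of how it could be applied inside the original spc_GetId (untested; it reuses the @curUpdate table variable from that procedure):
-- READPAST skips rows locked by other sessions, so concurrent callers
-- do not queue up behind the same candidate row
UPDATE IdMasterList SET Taken = 1
OUTPUT DELETED.Id INTO @curUpdate
WHERE ID = (SELECT TOP 1 ID
            FROM IdMasterList WITH (INDEX(IX_Taken), READPAST)
            WHERE Taken IS NULL
            ORDER BY SeqNo);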
I am currently performing analysis on a client's MSSQL Server. I've already fixed many issues (unnecessary indexes, index fragmentation, NEWID() being used for identities all over the shop etc), but I've come across a specific situation that I haven't seen before.
Process 1 imports data into a staging table, then Process 2 copies the data from the staging table using an INSERT INTO. The first process is very quick (it uses BULK INSERT), but the second takes around 30 mins to execute. The "problem" SQL in Process 2 is as follows:
INSERT INTO ProductionTable(field1,field2)
SELECT field1, field2
FROM SourceHeapTable (nolock)
The above INSERT statement inserts hundreds of thousands of records into ProductionTable, each row allocating a UNIQUEIDENTIFIER, and inserting into about 5 different indexes. I appreciate this process is going to take a long time, so my issue is this: while this import is taking place, a 3rd process is responsible for performing constant lookups on ProductionTable - in addition to inserting an additional record into the table as such:
INSERT INTO ProductionTable(fields...)
VALUES(values...)
SELECT *
FROM ProductionTable (nolock)
WHERE ID = #Id
For the 30 or so minutes that the INSERT...SELECT above is taking place, the INSERT INTO times out.
My immediate thought is that SQL Server is locking the entire table during the INSERT...SELECT. I did quite a lot of profiling on the server during my analysis, and there are definitely locks being allocated for the duration of the INSERT...SELECT, though I fail to remember what type they were.
Having never needed to insert records into a table from two sources at the same time - at least during an ETL process - I'm not sure how to approach this. I've been looking up INSERT table hints, but most are being made obsolete in future versions.
It looks to me like a CURSOR is the only way to go here?
You could consider BULK INSERT for Process-2 to get the data into the ProductionTable.
Another option would be to batch Process-2 into small batches of around 1000 records and use a Table Valued Parameter to do the INSERT. See: http://msdn.microsoft.com/en-us/library/bb510489.aspx#BulkInsert
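A minimal sketch of that TVP approach (the type and procedure names and column types here are made up for illustration; the client streams batches of ~1000 rows into the parameter):
-- table type describing one batch of rows to copy
CREATE TYPE dbo.ProductionRows AS TABLE
(
    field1 INT,
    field2 INT
);
GO
-- inserts one batch passed in as a table-valued parameter
CREATE PROCEDURE dbo.usp_InsertProductionBatch
    @rows dbo.ProductionRows READONLY
AS
BEGIN
    SET NOCOUNT ON;
    INSERT INTO ProductionTable (field1, field2)
    SELECT field1, field2
    FROM @rows;
END;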
It seems like a table lock.
Try inserting in portions (batches) in the ETL process. Something like:
while 1=1
begin
    INSERT INTO ProductionTable(field1,field2)
    SELECT top (1000) field1, field2
    FROM SourceHeapTable sht (nolock)
    where not exists (select 1 from ProductionTable pt where pt.id = sht.id)
    -- optional
    --waitfor delay '00:00:01.0'
    if @@rowcount = 0
        break;
end
Say I have a merge Statement that looks like this:
merge TableA as target
using (select Id, Description, UnitCost
from TableB)
as source (Id, Description, UnitCost)
on (target.Id = source.Id)
when MATCHED then
update set Id = source.Id,
Description = source.Description,
UnitCost = Source.UnitCost
when NOT MATCHED then
insert (Id, Description, UnitCost)
values (source.Id, source.Description, source.UnitCost);
When I run this it tells me how many rows were affected. If I run it and I know that the source and the destination are exactly the same I still get a message telling me that x number of rows were affected. In my case it is about 200 rows. Is SQL Server re-writing the same data to disk?
200 rows is nothing and can easily be rewritten without impacting SQL Server's performance. But if I have a merge statement with 500,000+ rows and lots of indexes, then re-updating all the data in the table is going to get expensive.
Do I need to check that the data has changed first (at least in the cases where performance could be an issue)?
If so, how do I do that in a merge statement (maybe using my example above)?
merge TableA as target
using (select Id, Description, UnitCost
from TableB)
as source (Id, Description, UnitCost)
on (target.Id = source.Id)
when MATCHED AND (target.Id <> source.Id OR target.Description <> source.Description OR target.UnitCost <> source.UnitCost) then
update set Id = source.Id,
Description = source.Description,
UnitCost = Source.UnitCost
when NOT MATCHED then
insert (Id, Description, UnitCost)
values (source.Id, source.Description, source.UnitCost);
You can add a conditional search clause to the MATCHED branch; this basically checks whether something actually has changed. I'm not sure it will necessarily be faster, but it won't update rows that don't need to be updated.
If you need more information, check the docs: MERGE (T-SQL).
SQL Server, and any buffer-pool, write-ahead-log based engine for that matter, will not do data-page IO for updates/deletes/inserts at the time they execute. It has been like this ever since the ARIES paper was published, and almost all modern relational databases trace their ancestry to System R and ARIES.
When a row is updated (and that includes inserting and deleting a row), a log record describing the change is appended to a log buffer in memory, and then the page containing the row is updated in memory. Nothing is written to disk. Execution continues. When the transaction commits, a new log record is generated, and the commit cannot proceed until all the log in memory, up to and including the commit log record, is flushed to disk. This is the only mandatory IO required for the update to be allowed to proceed. If you update 500k rows in one statement, the system only has to wait for the log flush after all 500k rows have been updated.
The data in memory is periodically written to disk during checkpoints.
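One rough way to observe this (a sketch, not from the original answer; it reuses the T1 table from the staging example earlier): the only IO the update waits on is the log flush at COMMIT, which shows up as WRITELOG waits.
-- snapshot WRITELOG waits before the update
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';

BEGIN TRANSACTION;
    UPDATE T1 SET data1 = data1 + 1;  -- pages dirtied in memory, log records buffered
COMMIT;                               -- the session waits here for the log flush

-- WRITELOG counters should have grown; the data pages are written later, at checkpoint
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = 'WRITELOG';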