I have data stored in an MS SQL database. I want to delete all records older than a certain date.
For this, a service is used that sends a request once a day, like:
delete from [log].[HttpRequestLogEntries] where DateTimeUtc < dateadd(day, -3, getutcdate())
and it works fine, but very slowly. My table can contain over 10 million rows, and the delete can take hours.
How to solve this problem in the best way?
If there is not already an index with [DateTimeUtc] as its first column, you might try adding one. Indexing the column in the search criteria has improved mass-delete performance on some of our databases. The trade-off is that inserts and updates may take additional time to maintain the index entries.
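For example, a nonclustered index along these lines (the index name here is just illustrative) lets the delete seek directly to the qualifying date range:

CREATE NONCLUSTERED INDEX IX_HttpRequestLogEntries_DateTimeUtc
    ON [log].[HttpRequestLogEntries] (DateTimeUtc);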
Consider deleting fewer rows at a time. If you delete more than 5,000 rows at once, the delete query may attempt to escalate to a table lock. If there is a lot of concurrent activity, the attempt to acquire a table lock may block while other requests complete.
For example, this loop deletes 4,000 rows maximum at a time:
declare @RowCount int = 1

while @RowCount > 0
begin
    delete top (4000)
    from [log].[HttpRequestLogEntries]
    where DateTimeUtc < dateadd(day, -3, getutcdate())

    select @RowCount = @@rowcount
end
Also, check for database triggers. If a trigger is firing when rows are deleted, it is possible code in the trigger is causing a long delay.
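One quick way to check is to query the catalog views for triggers defined on the table, for example:

SELECT t.name AS TriggerName, t.is_disabled
FROM sys.triggers AS t
WHERE t.parent_id = OBJECT_ID(N'[log].[HttpRequestLogEntries]');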
I have what feels like a complex data problem that I am trying to solve. I am more of a developer than a SQL expert, but our DBA, who wrote the script a few months back, recently moved on, and I have been tasked with resolving this problem in the short term.
I am having some real issues with four tables, two of which receive real-time bulk inserts and are read by users, while a daily job copies their records into historic tables and then deletes the copied rows. The operations run 24/7 and there is no downtime for the data insertions or archiving.
A script that originally did this process started to fail. It was escalating from Snapshot/Read Committed isolation to Serializable and did a full table scan, so it blocked all insert operations for 1-2 hours, which was not acceptable.
Question(s)
The pain point for the archiving is that we have to wrap the inserts and deletes together in one transaction. Is there a better way to do the below? Is it better to not have foreign keys and use a trigger or constraint instead? The scripts below either lock the table too much or run for too long (1-4 hours).
Scenario
Four SQL tables
Order (Memory Optimized table)
OrderDetail (Memory Optimized table)
OrderHistory (File/normal table – ColumnStore index)
OrderDetailHistory (File/normal table – ColumnStore index)
The Order and OrderDetail tables receive bulk inserts averaging between 50 and 700 inserts per second.
Updates are only performed during the batch insert operation.
During the day the Order table can be between 3-6 million records.
The OrderDetail table can be up to 2-3 times as large as the Order table.
OrderHistory and OrderDetailHistory hold 1-7 days of data, so they vary between 10-50 million records at any one point.
OrderDetail has a FK reference to Order on the Id column.
At one point during the day, the data from each table is copied and inserted into its respective 'history' table, which is a non-memory-optimized table.
Attempt 1
The script with the issue does this:
DECLARE @Cutoff datetime2 = DATEADD(DAY, -1, GETUTCDATE())  -- "older than 1 day" cutoff

BEGIN TRAN

INSERT INTO OrderHistory
SELECT * FROM [Order] o
WHERE o.CreatedAt <= @Cutoff

INSERT INTO OrderDetailHistory
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= @Cutoff

DELETE FROM [Order]
WHERE CreatedAt <= @Cutoff

DELETE FROM OrderDetail
WHERE CreatedAt <= @Cutoff

COMMIT
The database is run at Snapshot Isolation level with Read Committed.
When run, we were originally getting Serializable errors. After some review we realised that the delete operation ends up using the serializable isolation level to lock the nonclustered index on the CreatedAt column while it performs the deletes, because it is doing a range scan on that index, and this happens inside the same transaction that we also use for selecting the data.
Attempt 2
So next I modified the script by creating two memory-optimized user-defined table variables to select the data into first, outside of the transaction. Then, in a separate transaction, I insert into our history tables, and finally I delete in another separate transaction. The idea is that if the inserts succeed but the delete fails, the next run will not try to re-insert the same data twice. The downside is that there will be duplication in the history data until it runs again, and it ran for 2 hours before our scheduling tool timed out, so this doesn't seem ideal.
DECLARE @Cutoff datetime2 = DATEADD(DAY, -1, GETUTCDATE())  -- "older than 1 day" cutoff

INSERT INTO @OrderLoadingUdfTable
SELECT * FROM [Order] o
WHERE o.CreatedAt <= @Cutoff

INSERT INTO @OrderDetailLoadingUdfTable
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= @Cutoff

BEGIN TRAN

INSERT INTO OrderHistory
SELECT * FROM @OrderLoadingUdfTable
WHERE CreatedAt <= @Cutoff
AND Id NOT IN (SELECT Id FROM OrderHistory)

INSERT INTO OrderDetailHistory
SELECT * FROM @OrderDetailLoadingUdfTable
WHERE CreatedAt <= @Cutoff
AND Id NOT IN (SELECT Id FROM OrderDetailHistory)

COMMIT

BEGIN TRAN

DELETE FROM [Order]
WHERE CreatedAt <= @Cutoff

DELETE FROM OrderDetail
WHERE CreatedAt <= @Cutoff

COMMIT
I'm trying to archive many records in batches rather than in one shot.
Will T-SQL join the two tables, TeamRoster and @teamIdsToDelete, for every loop of the batch? My concern is that if my temporary table is huge and I don't remove records from it as I go, the JOIN might be unnecessarily expensive. On the other hand, how expensive is it to delete from the temporary table as I go? Is it made up for by the (real? hypothetical?) smaller joins I'll have to do in each batch?
(Can provide more details/thoughts but will do so if helpful.)
DECLARE @teamIdsToDelete TABLE
(
    RosterID int PRIMARY KEY
)

-- collect the list of active teamIds. we will rely on the modified date to age them out.
INSERT INTO @teamIdsToDelete
SELECT DISTINCT tr.RosterID
FROM rosterload.TeamRoster tr WITH (NOLOCK)
WHERE tr.IsArchive = 0 AND tr.Loaded = 1

-- age out remaining rosters (no cap - proved we can update more than 50k by modifying the test case)
WHILE (1 = 1)
BEGIN
    BEGIN TRANSACTION

    UPDATE TOP (1000) r
    SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
    FROM rosterload.TeamRoster r WITH (ROWLOCK)
    JOIN @teamIdsToDelete ttd ON ttd.RosterID = r.RosterID
    WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc' -- predicate for filtering

    IF @@ROWCOUNT = 0 -- terminating condition
    BEGIN
        COMMIT TRANSACTION
        BREAK
    END

    COMMIT TRANSACTION
END
As I understand it, the goal of this query is to archive a huge number of rows without blocking other queries at the same time. The temp table helps you narrow down the subset of records to delete. Since it has a single column that is a clustered primary key, the join to another PK will be blazingly fast. You would spend more effort calculating and deleting the already-processed records from the temp table than you would gain from slightly smaller joins.
Also, there is no reason to combine an explicit transaction with batching. You could just do one big update instead, because the result would be the same: the table gets locked once the first ~5,000 row locks are acquired (roughly after the first five batches) and stays locked until the COMMIT. The ROWLOCK hint does not prevent lock escalation. On the other hand, running without an enclosing transaction gives other queries an opportunity to continue after each 1,000-row batch. If you need to make sure that all records are archived in one go, add some retry logic to your query or your application code for errors like deadlocks or process interruption. And do you really need the NOLOCK hint?
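To illustrate the suggestion, here is a rough sketch of the batched update without an enclosing transaction, retrying a batch only on deadlock (error 1205). The table, column, and variable names are the ones from the question, and it assumes @teamIdsToDelete from the script above is still in scope:

DECLARE @done bit = 0;

WHILE @done = 0
BEGIN
    BEGIN TRY
        -- Each batch auto-commits on its own, so other sessions can proceed between batches.
        UPDATE TOP (1000) r
        SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
        FROM rosterload.TeamRoster r
        JOIN @teamIdsToDelete ttd ON ttd.RosterID = r.RosterID
        WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc';

        IF @@ROWCOUNT = 0
            SET @done = 1;
    END TRY
    BEGIN CATCH
        -- Retry the batch only on deadlock (1205); re-raise anything else.
        IF ERROR_NUMBER() <> 1205
            THROW;
    END CATCH
END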
I created a trigger to allow updates on the Customer table only between 8 am and 5 pm. But now I want to make sure that there is no update/insert within 5 seconds of the previous update/insert. Can I accommodate this condition in the same trigger, or do I need to create another one? Anyway, can you guide me on how to go about it? I am working on SQL Server.
Create trigger tr_delete_Cust
On dbo.Customer FOR Delete
AS
IF convert(datetime2, 'hh24') < '8' OR
Convert(datetime2, 'hh24') > '17'
BEGIN
RAISERROR('Data cannot be modified at this time!',1600,16,1)
END
That doesn't seem like a valid trigger for SQL Server; you're probably looking for something like this:
IF (datepart(hour, getdate()) < 8 OR
datepart(hour, getdate()) > 17) begin ...
Your trigger handles only deletes, so you'll either need to extend it to also cover updates and inserts, or create new triggers for those.
To check how much time has passed, I'd say the best option is to have a column in the table for the update/insert time and check the max value of it. If there are a lot of rows in the table, you might want to index that column too.
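A rough sketch of what such a combined trigger could look like, assuming the Customer table has a CustomerId key and a LastModified datetime2 column that is set on every insert/update (both column names are placeholders, not taken from the question):

CREATE TRIGGER tr_Customer_Guard
ON dbo.Customer
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Allow modifications only between 8 am and 5 pm.
    IF DATEPART(HOUR, GETDATE()) < 8 OR DATEPART(HOUR, GETDATE()) > 17
    BEGIN
        ROLLBACK TRANSACTION;
        RAISERROR('Data cannot be modified at this time!', 16, 1);
        RETURN;
    END

    -- Reject the change if any row other than the ones just touched
    -- was inserted/updated less than 5 seconds ago.
    -- Assumes LastModified is maintained on every insert/update (placeholder column).
    IF EXISTS (
        SELECT 1
        FROM dbo.Customer c
        WHERE NOT EXISTS (SELECT 1 FROM inserted i WHERE i.CustomerId = c.CustomerId)
          AND DATEDIFF(SECOND, c.LastModified, GETDATE()) < 5
    )
    BEGIN
        ROLLBACK TRANSACTION;
        RAISERROR('Please wait at least 5 seconds between modifications.', 16, 1);
        RETURN;
    END
END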
I need to keep a running count of rows in a very large database. The row count is needed often enough in my program that running Count(*) each time is too slow, so I will just keep a running count to get around this in SQLite.
CREATE TRIGGER RowCountUpdate AFTER INSERT ON LastSample
BEGIN
UPDATE BufferControl SET NumberOfSamples = NumberOfSamples +
(SELECT Count(*) FROM Inserted);
END;
So from here I want to take the current number of rows (NumberOfSamples) and increment it by however many rows were affected by the insert (and do the same with DELETE, decrementing). In the C API of SQLite this is done with sqlite3_changes(), but I cannot use that function here in this script. I looked around and saw that some were using SELECT Count(*) FROM Inserted, but I don't think SQLite supports that.
Is there any statement that Sqlite recognizes that holds the amount of rows that were affected by the INSERT and DELETE queries?
SQLite has the changes() SQL function, but like the sqlite3_changes() API function, it reports the number of rows changed by the last completed statement.
During trigger execution, the triggering statement is not yet completed.
Just use a FOR EACH ROW trigger, and add 1 for each row.
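For example, a pair of per-row triggers along these lines should keep the counter in sync (a sketch based on the table names from the question):

CREATE TRIGGER RowCountInsert AFTER INSERT ON LastSample
FOR EACH ROW
BEGIN
    -- Fired once per inserted row, so add 1 each time.
    UPDATE BufferControl SET NumberOfSamples = NumberOfSamples + 1;
END;

CREATE TRIGGER RowCountDelete AFTER DELETE ON LastSample
FOR EACH ROW
BEGIN
    -- Fired once per deleted row, so subtract 1 each time.
    UPDATE BufferControl SET NumberOfSamples = NumberOfSamples - 1;
END;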
I am updating 2 columns in a table that contains millions of rows (85 million). To update them I am using an update command like:
UPDATE Table1
SET Table1.column1 = Table2.column1 ,
Table1.column2 = Table2.column2
FROM
    -- tables and join conditions here
Now my problem is that it is taking 23 hours. Even after batching the updates there is not much change in the time taken.
But I need it to run in less than 5 hours. Is that possible? What approach should I take to achieve it?
SQL UPDATE statements have to keep all the affected rows in the log file so they can roll back on failure. As explained by this guy, the best way to handle millions of rows is to forget about atomicity and batch your updates into 50,000 rows (or whatever):
-- Declare variable for row count
Declare @rc int
Set @rc = 50000

While @rc = 50000
Begin
    Begin Transaction

    -- Use Top (50000) to limit the number of updates
    -- performed in each batch to 50K rows.
    -- Use tablockx and holdlock to obtain and hold
    -- an immediate exclusive table lock. This usually
    -- speeds up the update because only one lock is needed.
    Update Top (50000) mt
    Set UpdFlag = 0
    From MyTable mt With (tablockx, holdlock)
    Join ControlTable ct
        On mt.KeyCol = ct.PK
    -- Add criteria to avoid updating rows that
    -- were updated in a previous pass
    Where mt.UpdFlag <> 0

    -- Get the number of rows updated.
    -- The process will continue until fewer than 50000 rows are updated.
    Select @rc = @@rowcount

    -- Commit the transaction
    Commit
End
This still has some problems in that you need to know which rows you've already handled; perhaps someone smarter than this guy (and me!) can figure out something nicer with more MSSQL magic, but this should be a start.
I have used SSIS for this task.
First I took the source table in which the 2 columns have to be updated. Then I added a Lookup task, mapping the source columns to the columns of the destination table from which I get the data to update the source columns. Finally I added an OLE DB destination, which fills the table based on the join conditions from the lookup.
This process was really fast compared to executing an update script.