Preferred way to approach expired record deletion in an SQLite database

There are 3 tables in the database -- Students, Courses, Professors.
Each student has an activation_deadline column, which is set to NULL upon activation. I need some mechanism that will periodically delete students whose activation_deadline is overdue (and prevent already 'expired' students from being activated).
At the moment I do it via three separate triggers on the Students table (as there are no database- or server-level triggers in SQLite).
One for UPDATE:
CREATE TRIGGER Remove_Unactivated_Students_Update
BEFORE UPDATE
ON Students
FOR EACH ROW
BEGIN
DELETE FROM Students
WHERE (activation_deadline IS NOT NULL) AND (activation_deadline <= strftime('%s', 'now'));
END;
One for INSERT:
CREATE TRIGGER Remove_Unactivated_Students_Insert
BEFORE INSERT
ON Students
FOR EACH ROW
BEGIN
DELETE FROM Students
WHERE (activation_deadline IS NOT NULL) AND (activation_deadline <= strftime('%s', 'now'));
END;
And one for DELETE:
CREATE TRIGGER Remove_Unactivated_Students_Delete
AFTER DELETE
ON Students
FOR EACH ROW
BEGIN
DELETE FROM Students
WHERE (activation_deadline IS NOT NULL) AND (activation_deadline <= strftime('%s', 'now'));
END;
Another approach would be to add some code to the backend that checks for and removes expired records before any other query to the database is executed (though this would increase the number of database calls, which is not good).
Which approach (keeping the 'expired' record removal logic in database triggers or in the backend) is preferred in these circumstances, and why? What are the pitfalls and advantages of each?

SQLite is a serverless DBMS and you can't define/schedule tasks or jobs.
Your requirement should be taken care of at the application level, where you can define a daily or weekly job to delete expired students.
This involves the execution of only 1 very simple and fast DELETE statement, once per day/week:
DELETE FROM Students WHERE activation_deadline <= strftime('%s', 'now');
Note that the condition activation_deadline IS NOT NULL is covered by activation_deadline <= strftime('%s', 'now'), so it is not needed.
Any solution involving multiple triggers is out of the question, because it would add unnecessary overhead to any simple INSERT/DELETE/UPDATE operation on the table.
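If you also need to prevent an already 'expired' student from being activated, the activation statement itself can carry the deadline check, so no extra call is required. A minimal sketch, assuming activation is done with an UPDATE keyed on a hypothetical student_id column:
-- Activate a student only if the deadline has not passed yet
-- (student_id is an assumed key column; activation clears the deadline)
UPDATE Students
SET activation_deadline = NULL
WHERE student_id = :student_id
  AND activation_deadline > strftime('%s', 'now');
The application can then inspect the number of affected rows (e.g. via changes() or sqlite3_changes()) to distinguish a successful activation from an attempt on an expired record.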

Related

Long query running with Memory optimized tables with daily archiving with real time data

I have what feels like a complex data problem that I am trying to solve. I am more a developer than a SQL expert, but our DBA, who wrote the script a few months back, recently moved on, and I have been tasked with resolving this problem in the short term.
I am having some real issues with four tables, two of which receive real-time bulk inserts, are read by users, and have a daily job that copies records into historic tables and then deletes the copied rows. The operations run 24/7 and there is no downtime for the data insertions or archiving.
The script that originally did this process started to fail: it was escalating from Snapshot isolation under Read Committed to Serializable, and it did a full table scan, so it blocked all insert operations for 1-2 hours, which was not acceptable.
Question(s)
The pain point for the archiving is that we have to wrap the inserts and deletes together in one transaction. Is there a better way to do the below? Is it better to not have foreign keys and use a trigger or constraint instead? The scripts below either lock the table too much or run for too long (1-4 hours).
Scenario
Four SQL tables
Order (Memory Optimized table)
OrderDetail (Memory Optimized table)
OrderHistory (File/normal table – ColumnStore index)
OrderDetailHistory (File/normal table– ColumnStore index)
The Order and OrderDetail tables receive bulk inserts averaging between 50 and 700 inserts per second.
Updates are only performed during the batch insert operation.
During the day the Order table can hold between 3 and 6 million records.
The OrderDetail table can be up to 2-3 times as large as the Order table.
OrderHistory and OrderDetailHistory can hold 1-7 days of data, so they vary between 10 and 50 million records at any one point.
OrderDetail has a FK reference to Order on the Id column.
At one point during the day, the data for each table is copied and inserted into its respective 'history' table, which is a non-memory-optimized table.
Attempt 1
The script with the issue does this:
BEGIN TRAN
INSERT INTO OrderHistory
SELECT * FROM [Order] o
WHERE o.CreatedAt <= DATEADD(DAY, -1, GETDATE()) -- i.e. rows older than one day
INSERT INTO OrderDetailHistory
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM [Order]
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM OrderDetail
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
COMMIT
The database is run at Snapshot Isolation level with Read Committed.
When it ran, we were originally getting serializable errors, and after some review we realised that the delete operation was escalating to the serializable isolation level to lock the nonclustered index on the CreatedAt column while it performed the deletes, because it was doing a range scan on that index, all while still inside the same transaction we used for selecting the data.
Attempt 2
So next I modified the script to first select the data, outside of the transaction, into two memory-optimized user-defined tables. Then, in a separate transaction, insert into our history tables, and finally delete in another separate transaction. The idea is that if the inserts succeed but the delete fails, the next time it runs it will not try to insert the data twice. The downsides are that there will be duplication in the history data until it runs again, and that it ran for 2 hours before our scheduling tool timed out, so this doesn't seem ideal.
INSERT INTO #OrderLoadingUdfTable
SELECT * FROM [Order] o
WHERE o.CreatedAt <= DATEADD(DAY, -1, GETDATE()) -- i.e. rows older than one day
INSERT INTO #OrderDetailLoadingUdfTable
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= DATEADD(DAY, -1, GETDATE())
BEGIN TRAN
INSERT INTO OrderHistory
SELECT * FROM #OrderLoadingUdfTable
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
AND Id NOT IN (SELECT Id FROM OrderHistory)
INSERT INTO OrderDetailHistory
SELECT * FROM #OrderDetailLoadingUdfTable
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
AND Id NOT IN (SELECT Id FROM OrderDetailHistory)
COMMIT
BEGIN TRAN
DELETE FROM [Order]
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM OrderDetail
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
COMMIT

TSQL Large Update in Batches - Is Join Costing Me More Because it is Performed Each Time in a Loop

I'm trying to archive many records in batches rather than in one shot.
Will T-SQL join the two tables, TeamRoster and @teamIdsToDelete, for every loop in the batch? My concern is that if my temporary table is huge and I don't remove records from it as I go, the JOIN might be unnecessarily expensive. On the other hand, how expensive is it to delete from the temporary table as I go? Is it made up for by the (real? hypothetical?) smaller joins I'll have to do in each batch?
(Can provide more details/thoughts but will do so if helpful.)
DECLARE @teamIdsToDelete Table
(
RosterID int PRIMARY KEY
)
--collect the list of active teamIds. we will rely on the modified date to age them out.
INSERT INTO @teamIdsToDelete
SELECT DISTINCT tr.RosterID FROM
rosterload.TeamRoster tr WITH (NOLOCK)
WHERE tr.IsArchive=0 and tr.Loaded=1
--age out remaining rosters (no cap - proved we can update more than 50k by modifying the test case):
WHILE (1 = 1)
BEGIN
BEGIN TRANSACTION
UPDATE TOP (1000) r
SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
FROM rosterload.TeamRoster r with(rowlock)
JOIN @teamIdsToDelete ttd ON ttd.RosterID = r.RosterID
WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc' -- predicate for filtering;
IF @@ROWCOUNT = 0 -- terminating condition;
BEGIN
COMMIT TRANSACTION
BREAK
END
COMMIT TRANSACTION
END
As I understand it, the goal of this query is to archive a huge number of rows without blocking other queries at the same time. The temp table helps you narrow down the subset of records to delete. Since it has a single column that is a clustered primary key, the join to another PK will be blazingly fast; you would spend more effort calculating and deleting the already-processed records from the temp table than you would save.
Also, there is no reason to use a transaction and do batches; you could just do one big update instead. The result is the same: the table will be locked after the first ~5,000 row locks are acquired (roughly after the first five batches are updated) until the COMMIT statement, and the ROWLOCK hint does not prevent lock escalation. On the other hand, running without a transaction would give other queries the opportunity to continue after each 1000-row batch. If you need to make sure that all records are archived in one go, add some retry logic to your query or your application code for errors such as deadlocks or process interruption. And do you really need the NOLOCK hint?
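A minimal sketch of that per-batch retry idea, reusing the UPDATE from the question (with the table variable declared above) and treating only deadlocks as retryable; the ROWLOCK hint is dropped in line with the point above, and the retry limit of 3 is illustrative:
DECLARE @retries int = 0;
WHILE (1 = 1)
BEGIN
    BEGIN TRY
        UPDATE TOP (1000) r
        SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
        FROM rosterload.TeamRoster r
        JOIN @teamIdsToDelete ttd ON ttd.RosterID = r.RosterID
        WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc';

        IF @@ROWCOUNT = 0 BREAK;  -- nothing left to archive
        SET @retries = 0;         -- reset the counter after a successful batch
    END TRY
    BEGIN CATCH
        IF ERROR_NUMBER() = 1205 AND @retries < 3  -- deadlock victim: retry this batch
            SET @retries += 1;
        ELSE
            THROW;                -- anything else (or too many retries) is re-raised
    END CATCH
END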

Why is stored procedure unexpectedly getting an optimization timeout on simple DELETE statement?

There's a multi-step procedure in the Data Warehouse that generates a temp table with a list of Jobs that will be processed for each batch. Usually this is about 5,000 jobs. By the end of the financial aggregation we may be looking at about 500,000 records processed. I've noticed that one relatively small part of the stored procedure is getting an Early Timeout on Optimization:
DELETE jfs
FROM DataWarehouse.dbo.JobFinancialSummary jfs -- Financials table (> 3,000,000 records with indices)
INNER JOIN #JobList jl ON jfs.JobID = jl.JobID -- List of Jobs being processed (avg. of 5,000 records)
INNER JOIN FiscalPeriod fp ON fp.ID = jfs.FiscalPeriodID -- Month Reference Table (about 1,000 records)
WHERE fp.[Status] IN (1,2) -- Last 2 months
The most confusing thing is that this is a relatively simple part of the stored procedure and all of the JOINs are on indices. My only question is how it gets timed out when the optimizer evaluates this. My understanding is that the optimizer gives each statement its own "Budget" but perhaps I'm missing something. Why the timeout here?
There are many reasons for a delete to be very slow:
The table might have CDC or Change Tracking (CT) enabled
The table might have triggers that fire after the delete operation
The table might have FK references or constraints
There might be indexed views associated with the table
Above all, we need to check how transactional this table is going to be.
There are many ways to do the delete; obviously a batch delete is faster. The first and most common way is to mark the records with a soft-delete attribute and delete them offline during a nightly window.
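As a tiny illustration of that soft-delete pattern (the table, column and cutoff below are assumptions, not taken from the question):
-- during the day: only mark rows, which is cheap
UPDATE table1
SET IsDeleted = 1                                 -- hypothetical soft-delete flag
WHERE CreatedAt < DATEADD(DAY, -30, GETDATE());   -- hypothetical retention cutoff

-- in the nightly window: physically remove the marked rows (ideally in batches, as below)
DELETE FROM table1
WHERE IsDeleted = 1;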
If it is going to be an offline delete, to make that delete faster:
capture the clustered index keys of the table you are deleting from
disable indexes, FKs and constraints, as you will be taking care of these functionally
disable CT, CDC etc. on that table if they are not required
script out the indexed views and then drop the indexed views associated with this table
Then delete in batches; you can control this with @@ROWCOUNT or a TOP batch size, and any number of records can be deleted faster this way:
WHILE (1 = 1)
BEGIN
    DELETE TOP (50000) -- batch size based on scenario
    FROM table1;
    IF @@ROWCOUNT = 0 BREAK;
END
Call an explicit CHECKPOINT to make sure records are cleared from the transaction log. Also make sure your recovery model is Simple, not Full.
If it is going to be an online delete during the daytime, still mark the rows as soft-deleted and run the actual deletes in much smaller batches in the nightly job.
It may well be faster to isolate the rows/entries you want to delete first, rather than joining whilst deleting.
Something like this, assuming id is your primary key/identity:
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp
SELECT jfs.ID
INTO #tmp
FROM DataWarehouse.dbo.JobFinancialSummary jfs -- Financials table (> 3,000,000 records with indices)
INNER JOIN FiscalPeriod fp ON fp.ID = jfs.FiscalPeriodID -- Month Reference Table (about 1,000 records)
WHERE fp.[Status] IN (1,2)
/* EXISTS IS FASTER THAN A JOIN, AVOIDS FANNING */
AND EXISTS (SELECT 1 FROM #JobList jl where jfs.JobID = jl.JobID) -- List of Jobs being processed (avg. of 5,000 records)
Then issue a delete such as:
DELETE TOP(1000) jfs
FROM DataWarehouse.dbo.JobFinancialSummary jfs
WHERE EXISTS (SELECT 1 FROM #tmp t WHERE jfs.ID=t.ID)
From there, depending on how many rows you are deleting, you may wish to delete in a batch overnight -- anything over 5000 rows will escalate to table locks and is a prime candidate for batch deletion.
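A minimal sketch of repeating that DELETE until nothing matching remains (the batch size of 1000 is illustrative):
WHILE (1 = 1)
BEGIN
    DELETE TOP (1000) jfs
    FROM DataWarehouse.dbo.JobFinancialSummary jfs
    WHERE EXISTS (SELECT 1 FROM #tmp t WHERE jfs.ID = t.ID);

    IF @@ROWCOUNT = 0 BREAK;   -- stop once all targeted rows are gone
END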
I wrote a fairly popular answer on how to accomplish large batch deletes here:
Deleting 1 millions rows in SQL Server

Sqlite Trigger Number of Rows Affected

I need to keep a running count of rows in a very large database. The row count is needed often enough in my program that running Count(*) is too slow, so I will just keep a running count to get around this in SQLite.
CREATE TRIGGER RowCountUpdate AFTER INSERT ON LastSample
BEGIN
UPDATE BufferControl SET NumberOfSamples = NumberOfSamples +
(SELECT Count(*) FROM Inserted);
END;
So from here I want to take the current number of rows (NumberOfSamples) and increment it by how many rows were affected by the insert (and do the same for DELETE, decrementing). In the SQLite C API this is done with sqlite3_changes(). However, I cannot use that function here in this script. I looked around and saw that some people were using SELECT Count(*) FROM Inserted, but I don't think SQLite supports that.
Is there any statement that Sqlite recognizes that holds the amount of rows that were affected by the INSERT and DELETE queries?
SQLite has the changes() SQL function, but like the sqlite3_changes() API function, it reports the number of rows of the last completed statement.
During trigger execution, the triggering statement is not yet completed.
Just use a FOR EACH ROW trigger, and add 1 for each row.
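A minimal sketch of what that looks like, using the table and column names from the question and including the matching DELETE trigger:
-- every inserted row bumps the counter by one
CREATE TRIGGER RowCountInsert AFTER INSERT ON LastSample
FOR EACH ROW
BEGIN
    UPDATE BufferControl SET NumberOfSamples = NumberOfSamples + 1;
END;

-- every deleted row decrements it again
CREATE TRIGGER RowCountDelete AFTER DELETE ON LastSample
FOR EACH ROW
BEGIN
    UPDATE BufferControl SET NumberOfSamples = NumberOfSamples - 1;
END;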

Validating UPDATE and INSERT statements against an entire table

I'm looking for the best way to go about adding a constraint to a table that is effectively a unique index on the relationship between the record and the rest of the records in that table.
Imagine the following table describing the patrols of various guards (from the previous watchman scenario)
PK PatrolID Integer
FK GuardID Integer
Starts DateTime
Ends DateTime
We start with a constraint specifying that the start and end times must be logical:
Ends >= Starts
However I want to add another logical constraint: A specific guard (GuardID) cannot be in two places at the same time, meaning that for any record the period specified by Start/Ends should not overlap with the period defined for any other patrol by the same guard.
I can think of two ways of trying to approach this:
Create an INSTEAD OF INSERT trigger. This trigger would then use cursors to go through the INSERTED table, checking each record. If any record conflicted with an existing record, an error would be raised. The two problems I have with this approach are: I dislike using cursors in a modern version of SQL Server, and I'm not sure how to go about implementing the same logic for UPDATEs. There may also be the complexity of records within INSERTED conflicting with each other.
The second, seemingly better, approach would be to create a CONSTRAINT that calls a user defined function, passing the PatrolID, GuardID, Starts and Ends. The function would then do a WHERE EXISTS query checking for any records that overlap the GuardID/Starts/Ends parameters that are not the original PatrolID record. However I'm not sure of what potential side effects this approach might have.
Is the second approach better? Does anyone see any pitfalls, such as when inserting/updating multiple rows at once (here I'm concerned because rows within that group could conflict, meaning the order in which they are "inserted" makes a difference)? Is there a better way of doing this (such as some fancy INDEX trick)?
Use an after trigger to check that the overlap constraint has not been violated:
create trigger Patrol_NoOverlap_AIU on Patrol for insert, update as
begin
if exists (select *
from inserted i
inner join Patrol p
on i.GuardId = p.GuardId
and i.PatrolId <> p.PatrolId
where (i.Starts between p.starts and p.Ends)
or (i.Ends between p.Starts and p.Ends))
rollback transaction
end
NOTE: Rolling back a transaction within a trigger will terminate the batch. Unlike a normal constraint violation, you will not be able to catch the error.
You may want a different WHERE clause depending on how you define the time range and overlap. For instance, if you want to be able to say that Guard #1 is at X from 6:00 to 7:00 and then at Y from 7:00 to 8:00, the above would not allow it. You would want this instead:
create trigger Patrol_NoOverlap_AIU on Patrol for insert, update as
begin
if exists (select *
from inserted i
inner join Patrol p
on i.GuardId = p.GuardId
and i.PatrolId <> p.PatrolId
where (p.Starts <= i.Starts and i.Starts < p.Ends)
or (p.Starts <= i.Ends and i.Ends < p.Ends))
rollback transaction
end
Where Starts is the time the guarding starts and Ends is the infinitesimal moment after guarding ends.
The simplest way would be to use a stored procedure for the inserts. The stored procedure can do the insert in a single statement:
insert into YourTable
(GuardID, Starts, Ends)
select @GuardID, @Starts, @Ends
where not exists (
select *
from YourTable
where GuardID = @GuardID
and Starts <= @Ends
and Ends >= @Starts
)
if @@rowcount <> 1
return -1 -- Failure
In my experience triggers and constraints with UDF's tend to become very complex. They have side effects that can require a lot of debugging to figure out.
Stored procedures just work, and they have the added advantage that you can deny INSERT permissions to clients, giving you fine-grained control over what enters your database.
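For context, a minimal sketch of wrapping that body in a procedure; the procedure name and parameter types are assumptions:
CREATE PROCEDURE dbo.AddPatrol
    @GuardID int,
    @Starts  datetime,
    @Ends    datetime
AS
BEGIN
    -- insert only if no existing patrol for this guard overlaps the new range
    INSERT INTO YourTable (GuardID, Starts, Ends)
    SELECT @GuardID, @Starts, @Ends
    WHERE NOT EXISTS (
        SELECT *
        FROM YourTable
        WHERE GuardID = @GuardID
          AND Starts <= @Ends
          AND Ends >= @Starts
    );

    IF @@ROWCOUNT <> 1
        RETURN -1;  -- failure: an overlapping patrol already exists
END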
CREATE TRIGGER [dbo].[emaill] ON [dbo].[email]
FOR INSERT
AS
BEGIN
declare @email CHAR(50);
SELECT @email = i.email from inserted i;
IF @email NOT LIKE '%_@%_.__%'
BEGIN
print 'Trigger Fired';
Print 'Invalid Email....';
ROLLBACK TRANSACTION
END
END
Can be done with constraints too:
http://www2.sqlblog.com/blogs/alexander_kuznetsov/archive/2009/03/08/storing-intervals-of-time-with-no-overlaps.aspx
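For reference, a minimal sketch of the constraint-plus-UDF approach the question describes (the function and constraint names are assumptions; the Patrol table and columns are taken from the trigger answer above):
CREATE FUNCTION dbo.GuardOverlapCount
    (@PatrolID int, @GuardID int, @Starts datetime, @Ends datetime)
RETURNS int
AS
BEGIN
    -- count other patrols for the same guard whose period overlaps the given range
    RETURN (SELECT COUNT(*)
            FROM Patrol p
            WHERE p.GuardID = @GuardID
              AND p.PatrolID <> @PatrolID
              AND p.Starts < @Ends
              AND p.Ends > @Starts);
END
GO

ALTER TABLE Patrol
    ADD CONSTRAINT CK_Patrol_NoOverlap
    CHECK (dbo.GuardOverlapCount(PatrolID, GuardID, Starts, Ends) = 0);
As the question anticipates, this is evaluated per row, so it shares the multi-row caveats discussed above; the linked article covers the remaining subtleties.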
