Avoiding stale data when two cron jobs access the same DB

Avoiding stale data when two cron jobs access the same DB - database

I have two tables(let's say Table A and B) to store the status of two independent activity respectively. I have two cron jobs(Lets call it A and B again) which run 15 mins apart, every 30 mins. They both have the same logic, Updating the data in the table, the only difference is the table they work on.
Let me explain Cron A:
Cron A picks up users with "pending" status in Table A, performs some logic to filter out the now "active" users out of these. It then queries data from table B for these active users only. It then begins to update the status of users in Table A from pending to active, checks if this user is active in table B as well(from the data queried in table B) and stores such users in a Set(active in A and B) for further processing.
This DB update to table A happens for one record at a time and not Bulk update hence it consumes a lot of time.
Note: the users present in table A are also present in Table B.
Similar logic applies to cron B as well where it considers the other table.
You can see the problem. If there are too many records to update and, for example, say cron A is still running and cron B starts it's processing, Cron B will query users from cron A who are still being updated by cron A, i.e, cron B will end up having stale data.
I have one approach to solve this but wanted to know better or more practised solutions to such issues. One thing is, we currently don't have time or resource to completely optimize the legacy code in these cron jobs to improve its processing time.
I was thinking of having a redis key which will be set to True when one of the cron jobs begins to update its respective table's data. While the redis key is set, if it's time for the other cron job, it will first check the value of this redis key and if it is True it should stop processing for that schedule.
The same redis key logic goes for the other cron job as well.
There is not much impact in terms of TAT since the next cron will run in 30 mins. Moreover, I can setup an Elastalert saying the cron job had to be terminated and anyone can monitor and trigger the job manually once the other job is completed.
Wanted to check if this is a viable approach

Related

how are concurrent queries handled in snowflake?

For example, if I have a task that's inserting rows into a table while another task is truncating the same table, what happens?
I'm asking because I have a task that runs every minute which inserts rows into a table and then a lambda that reads and truncates the same table that runs every minute. I know snow tasks and event bridge don't run at every minute on the dot so I haven't really run into this issue yet but I'm thinking it'll happen eventually.
How does snowflake handle this?

It is the same concept in other SQL engines, that lock on resources will be placed.
In the Snowflake world, INSERT will have PARTITION level locking, because most of the INSERT statements write only new partitions.
Please see the below doc:
https://docs.snowflake.com/en/sql-reference/transactions.html#resource-locking
If the INSERT query is submitted before the TRUNCATE, then the TRUNCATE will have to wait until the INSERT query finishes. They can't be operated at the same time on the same resource.
See the screenshot below, the first query was the INSERT, which was HOLDING the PARTITION level lock, while the second query was the TRUNCATE, which was in the WAITING state:

The table will be locked by the first transaction that runs and subsequent transactions will be queued until the preceding transaction(s) complete.
BTW (and this may be the point of your question) having two processes like this operate independently doesn’t seem like a good design - as the lambda process seems to be logically dependent on the task.

SpringBatch application periodically pulling data from DB

I am working on a spring batch service that pulls data from a db on a schedule. (e.g. every day at 12pm)
I am using JdbcPagingItemReader to read the data and a scheduler (#Scheduled provided by spring batch) to launch the job. The problem that I have now is: every time the job runs, it will just pull all the data from the beginning and not from the "last read" row.
The data from the db is changing everyday(deleting old ones and adding new ones) and all I have is a timestamp column to track them.
Is there a way to "remember" the last row read from the last execution of the job and read data only later than that row?

Since you need to pull data on a daily basis, and your records have a timestamp, then you can design your job instances to be based on a given date (ie using the date as an identifying job parameter). With this approach, you do not need to "remember" the last processed record. All you need to do is process records for a given date by using the correct SQL query. For example:
Job instance ID
Date
Job parameter
SQL
1
2021-03-22
date=2021-03-22
Select c1, c2 from table where date = 2021-03-22
2
2021-03-23
date=2021-03-23
Select c1, c2 from table where date = 2021-03-23
...
...
...
...
With that in place, you can use any cursor-based or paging-based reader to process records of a given date. If a job instance fails, you can restart it without a risk to interfere with other job instances. The restart could be done even several days after the failure since the job instance will always process the same data set. Moreover, in case of failure and job restart, Spring Batch will reprocess records from the last check point in the previous (failed) run.

Just want to post an update to this question.
So in the end I created two more steps to achieve what I wanted to do initially.
Since I don't have the privilege to modify the table where I read the data from, I couldn't use the "process indicator pattern" which involves having a column to mark if a record is processed or not. I created another table to store the last-read record's timestamp, and use it to update the sql query.
step 0: a tasklet that reads the bookmark from a table, pass it in the job context
step 1: a chunk step, get the bookmark from the context, use jdbcPagingItemReader to read the data
step 2: a tasklet to update the bookmark
But doing this I have to be very cautious with the bookmark table. If I lose that I lose everything

Keep a look-out for passing dates, then take action "real-time"

Let's say I have a table with a date column; can I attach some sort of "watcher" that can take action if the date gets smaller than getdate()? Note that the date is larger than getdate() at the time of insertion.
Are there any tools that I might be unaware of in SQL Server 2008/2012?
Or would the best option be to poll the data from another application?
Edit: Note that there is no insertion/update taking place.

You could set up a SQL Job which runs periodically and executes a stored procedure which can then handle the logic around past dates.
https://msdn.microsoft.com/en-gb/library/ms187910.aspx
For example a SQL Job could be set up to run once daily to find out user's birthdays and send out an automated email.
In your case a job could be set up every minute (if required) which detects past dates and does something with those records. I would suggest adding some kind of flag to each record so that it isn't actioned the next time the job runs.
Alternatively if you have a lot of servers and databases, you could centralise your job scheduling using a third-party tool such as ActiveBatch.

Change Data Capture (CDC) cleanup job only removes a few records at a time

I'm a beginner with SQL Server. For a project I need CDC to be turned on. I copy the cdc data to another (archive) database and after that the CDC tables can be cleaned immediately. So the retention time doesn't need to be high, I just put it on 1 minute and when the cleanup job runs (after the retention time is already fulfilled) it appears that it only deleted a few records (the oldest ones). Why didn't it delete everything? Sometimes it doesn't delete anything at all. After running the job a few times, the other records get deleted. I find this strange because the retention time has long passed.
I set the retention time at 1 minute (I actually wanted 0 but it was not possible) and didn't change the threshold (= 5000). I disabled the schedule since I want the cleanup job to run immediately after the CDC records are copied to my archive database and not particularly on a certain time.
My logic for this idea was that for example there will be updates in the afternoon. The task to copy CDC records to archive database should run at 2:00 AM, after this task the cleanup job gets called. So because of the minimum retention time, all the CDC records should be removed by the cleanup job. The retention time has passed after all?
I just tried to see what happened when I set up a schedule again in the job, like how CDC is meant to be used in general. After the time has passed I checked the CDC table and turns out it also only deletes the oldest record. So what am I doing wrong?
I made a workaround where I made a new job with the task to delete all records in the CDC tables (and disabled the entire default CDC cleanup job). This works better as it removes everything but it's bothering me because I want to work with the original cleanup job and I think it should be able to work in the way that I want it to.
Thanks,
Kim

Rather than worrying about what's in the table, I'd use the helper functions that are created for each capture instance. Specifically, cdc.fn_cdc_get_all_changes_ and cdc.fn_cdc_get_net_changes_. A typical workflow that I've used wuth these goes something below (do this for all of the capture instances). First, you'll need a table to keep processing status. I use something like:
create table dbo.ProcessingStatus (
CaptureInstance sysname,
LSN numeric(25,0),
IsProcessed bit
)
create unique index [UQ_ProcessingStatus]
on dbo.ProcessingStatus (CaptureInstance)
where IsProcessed = 0
Get the current max log sequence number (LSN) using fn_cdc_get_max_lsn.
Get the last processed LSN and increment it using fn_cdc_increment_lsn. If you don't have one (i.e. this is the first time you've processed), use fn_cdc_get_min_lsn for this instance and use that (but don't increment it!). Record whatever LSN you're using in the table with, set IsProcessed = 0.
Select from whichever of the cdc.fn_cdc_get… functions makes sense for your scenario and process the results however you're going to process them.
Update IsProcessed = 1 for this run.
As for monitoring your original issue, just make sure that the data in the capture table is generally within the retention period. That is, if you set it to 2 days, I wouldn't even think about it being a problem until it got to be over 4 days (assuming that your call to the cleanup job is scheduled at something like every hour). And when you process with the above scheme, you don't need to worry about "too much" data being there; you're always processing a specific interval rather than "everything".

What factors that degrade the performance of a SQL Server 2000 Job?

We are currently running a SQL Job that archives data daily at every 10PM. However, the end users complains that from 10PM to 12, the page shows a time out error.
Here's the pseudocode of the job
while #jobArchive = 1 and #countProcecessedItem < #maxItem
exec ArchiveItems #countProcecessedItem out
if error occured
set #jobArchive = 0
delay '00:10'
The ArchiveItems stored procedure grabs the top 100 item that was created 30 days ago, process and archive them in another database and delete the item in the original table, including other tables that are related with it. finally sets the #countProcecessedItem with the number of item processed. The ArchiveItems also creates and deletes temporary tables it used to hold some records.
Note: if the information I've provide is incomplete, reply and I'll gladly add more information if possible.

Only thing not clear is it the ArchiveItems also delete or not data from database. Deleting rows in SQL Server is a very expensive operation that causes a lot of Locking condition on the database, with possibility to have table and database locks and this typically causes timeout.
If you're deleting data what you can do is:
Set a "logical" deletion flag on the relevant row and consider it in the query you do to read data
Perform deletes in batches. I've found that (in my application) deleting about 250 rows in each transaction gives the faster operation, taking a lot less time than issuing 250 delete command in a separate way
Hope this helps, but archiving and deleting data from SQL Server is a very tough job.

While the ArchiveItems process is deleting the 100 records, it is locking the table. Make sure you have indexes in place to make the delete run quickly; run a Profiler session during that timeframe and see how long it takes. You may need to add an index on the date field if it is doing a Table Scan or Index Scan to find the records.
On the end user's side, you may be able to add a READUNCOMMITTED or NOLOCK hint on the queries; this allows the query to run while the deletes are taking place, but with the possibility of returning records that are about to be deleted.
Also consider a different timeframe for the job; find the time that has the least user activity, or only do the archiving once a month during a maintenance window.

As another poster mentioned, slow DELETEs are often caused by not having a suitable index, or a suitable index needs rebuilding.
During DELETEs it is not uncommon for locks to be escalated ROW -> PAGE -> TABLE. You reduce locking by
Adding a ROWLOCK hint (but be aware
it will likely consume more memory)
Randomising the Rows that are
deleted (makes lock escalation less
likely)
Easiest: Adding a short WAITFOR in
ArchiveItems
WHILE someCondition
BEGIN
DELETE some rows
-- Give other processes a chance...
WAITFOR DELAY '000:00:00.250'
END
I wouldn't use the NOLOCK hint if the deletes are happening during periods with other activity taking place, and you want to maintain integrity of your data.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight