We use a table for a mail queue. When new mail needs to be sent, it is inserted into this table. There is a field in the table called status with an index on it.
A script runs every 10 seconds and checks whether there is new mail with status=0, sends it, and then updates the status to 1 (the actual mail content is stored in an nvarchar(max) column).
My question: is there any benefit to immediately "cleaning" the table, meaning once an email is sent, copying the record into a different "sent" table and deleting it from the mail queue table? Right now we are performing this cleaning process only once a month, removing about 500,000 emails each month.
You must consider the performance of a few operations here:
selects of e-mails to be sent
inserts of new e-mails
updates of sent e-mails
deletes of sent e-mails
To address the deletion issue right away: if you have no issues with disk space, the fastest way to delete the data is to truncate the whole table, e.g. once a month (note that TRUNCATE removes every row, so make sure any still-unsent mails are preserved or already gone before you run it).
Insert performance should not be affected by choosing a different approach: an insert's time depends more on the table's availability (in terms of locks) and physical structure than on the size of the table. So when it comes to locking, it would be better not to clean the rows right away (looking just at the inserts for now).
Updating the 'sent' flag is the trickiest part here. The size of the table plays a big role, because you first need to select the rows with new e-mails and then update each row once you actually send it. I assume you use a cursor to send the e-mails one by one in a loop, so all you need is a single select statement to find the IDs of all the items to be sent. That should not be resource-consuming if you have a non-clustered index on the flag column. You must remember to maintain the index, though, but that can be taken care of outside business hours. Once you have sent an e-mail, you can update the flag by accessing the row through its ID, so if you have a clustered index on the ID field (which I hope is the case), that is a fast operation (see the sketch below).
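For illustration, a minimal T-SQL sketch of that setup; the table and column names (MailQueue, MailID, Status) are assumptions, not taken from the question:

-- Hypothetical names: the queue table is dbo.MailQueue with a clustered primary key
-- on MailID and an int Status column (0 = new, 1 = sent).
CREATE NONCLUSTERED INDEX IX_MailQueue_Status
    ON dbo.MailQueue (Status);            -- supports the "find new mail" select

-- One select to find the IDs of all items to be sent (cheap seek on the Status index).
SELECT MailID
FROM dbo.MailQueue
WHERE Status = 0;

-- After sending one e-mail, flip the flag via the clustered key (fast single-row seek).
DECLARE @MailID bigint = 42;              -- ID of the mail that was just sent
UPDATE dbo.MailQueue
SET Status = 1
WHERE MailID = @MailID;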
All in all, I would say you would find no benefit in cleaning the table right away: you would lock it just for the sake of doing so, the deletes would fragment the indexes, and the selects and updates would not gain that much from the smaller table.
If 10 seconds is a justified interval for your script and the mail volume is high, then what you are doing is correct, because sending the mail matters more than cleaning the table.
The cleaning process requires a select + insert (into the archive table) and a select + delete (from the queue table).
You say about 500,000 records per month, which means you send roughly that many mails per month.
So the rate per 10 seconds is roughly:
500,000 / 30 ≈ 16,667 mails per day, and 16,667 / (24*60*60) * 10 ≈ 2 mails per 10 seconds.
So you send about 2 mails every 10 seconds.
I think you can do the whole operation in one go, e.g. by writing a trigger that moves the data to the archive whenever the status is updated to 1 (see the sketch below). Then there is no need for a second scheduler.
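A minimal sketch of such a trigger, reusing the same hypothetical MailQueue/MailArchive names and columns (none of these come from the question):

-- Hypothetical tables: dbo.MailQueue (the queue) and dbo.MailArchive (same columns).
CREATE TRIGGER trg_MailQueue_Archive
ON dbo.MailQueue
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Copy rows whose Status just changed to 1 into the archive...
    INSERT INTO dbo.MailArchive (MailID, [Message], Status)
    SELECT i.MailID, i.[Message], i.Status
    FROM inserted i
    JOIN deleted d ON d.MailID = i.MailID
    WHERE i.Status = 1 AND d.Status <> 1;

    -- ...and remove them from the queue.
    DELETE q
    FROM dbo.MailQueue q
    JOIN inserted i ON i.MailID = q.MailID
    WHERE i.Status = 1;
END;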
I am working on a spring batch service that pulls data from a db on a schedule. (e.g. every day at 12pm)
I am using JdbcPagingItemReader to read the data and a scheduler (@Scheduled, provided by Spring) to launch the job. The problem I have now is: every time the job runs, it just pulls all the data from the beginning, not from the "last read" row.
The data in the db changes every day (old records are deleted and new ones are added), and all I have is a timestamp column to track them.
Is there a way to "remember" the last row read from the last execution of the job and read data only later than that row?
Since you need to pull data on a daily basis and your records have a timestamp, you can design your job instances to be based on a given date (i.e. using the date as an identifying job parameter). With this approach, you do not need to "remember" the last processed record. All you need to do is process the records for a given date by using the correct SQL query. For example:
Job instance ID | Date       | Job parameter   | SQL
1               | 2021-03-22 | date=2021-03-22 | Select c1, c2 from table where date = 2021-03-22
2               | 2021-03-23 | date=2021-03-23 | Select c1, c2 from table where date = 2021-03-23
...             | ...        | ...             | ...
With that in place, you can use any cursor-based or paging-based reader to process the records of a given date. If a job instance fails, you can restart it without risking interference with other job instances. The restart can even be done several days after the failure, since the job instance will always process the same data set. Moreover, in case of failure and restart, Spring Batch will reprocess records from the last checkpoint of the previous (failed) run.
Just want to post an update to this question.
So in the end I created two more steps to achieve what I wanted to do initially.
Since I don't have the privilege to modify the table I read the data from, I couldn't use the "process indicator pattern", which involves having a column that marks whether a record has been processed or not. Instead I created another table to store the last-read record's timestamp, and I use it to update the SQL query.
step 0: a tasklet that reads the bookmark from a table, pass it in the job context
step 1: a chunk step, get the bookmark from the context, use jdbcPagingItemReader to read the data
step 2: a tasklet to update the bookmark
But doing this I have to be very cautious with the bookmark table; if I lose it, I lose everything.
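To make the idea concrete, here is a rough SQL-side sketch of those three steps; every name (batch_bookmark, source_table, the 'daily-pull' key, the bind parameters) is invented for illustration, and the actual paging query is of course assembled by the reader configuration:

-- Hypothetical bookmark table holding the timestamp of the last record read.
CREATE TABLE batch_bookmark (
    job_name     varchar(100) PRIMARY KEY,
    last_read_ts datetime     NOT NULL
);

-- step 0: read the bookmark and put it into the job context.
SELECT last_read_ts FROM batch_bookmark WHERE job_name = 'daily-pull';

-- step 1: the reader only fetches rows newer than the bookmark
-- (:lastReadTs is bound from the job context by the reader configuration).
SELECT id, payload, created_ts
FROM source_table
WHERE created_ts > :lastReadTs
ORDER BY created_ts, id;       -- a paging reader needs a deterministic sort key

-- step 2: move the bookmark forward to the newest timestamp that was read.
UPDATE batch_bookmark
SET last_read_ts = :maxReadTs
WHERE job_name = 'daily-pull';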
I have a database table with more than 1 million records, uniquely identified by a GUID column. I want to find out which of these rows were selected or retrieved in the last 5 years. The select can happen from multiple places: sometimes a row is returned on its own, sometimes as part of a set of rows. A select query fetches from the table over a JDBC connection from Java code, and a SQL procedure also fetches data from the table.
My intention is to clean up a database table. I want to delete all rows that were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata that can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But that is a costly operation in terms of the time taken for the whole process: at least 1,000 - 10,000 records are selected from the table in a single operation. Is there any more efficient way to do this than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waits for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization, which brings Heat Maps to track table access (modifications as well as read operations). Careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor at the individual block level, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. you create a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Note that Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion, you can then either use In-Database Archiving to mark rows as archived, or just go ahead and purge the rows. If you happen to have the data partitioned, you can use simple DROP PARTITION operations to purge one or many partitions rather than having to run conventional DELETE statements.
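Roughly what the purge side could look like in Oracle SQL; the table and partition names are made up, so treat this as a sketch rather than a recipe:

-- Purge a whole day at once when the table is partitioned by date (hypothetical names).
ALTER TABLE orders DROP PARTITION p_2017_01_15 UPDATE INDEXES;

-- Or, with In-Database Archiving, mark rows as archived instead of deleting them.
ALTER TABLE orders ROW ARCHIVAL;        -- enable row-level archiving once for the table
UPDATE orders
SET ORA_ARCHIVE_STATE = '1'             -- archived rows become invisible to normal queries
WHERE order_date < DATE '2017-01-01';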
I couldn't use any built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: auditing uses up a lot of space and has a performance hit, and the trigger likewise had a performance hit.
Finally I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting entries present in it.
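A rough SQL sketch of that approach; the names used_entries, main_table and the column names are invented for illustration:

-- Hypothetical tracking table: every code path that reads a row also records its GUID here.
CREATE TABLE used_entries (
    entry_guid  RAW(16) PRIMARY KEY,
    last_used   DATE    NOT NULL
);

-- Called from the read paths (JDBC code / stored procedure) after a row is fetched.
MERGE INTO used_entries u
USING (SELECT :guid AS entry_guid FROM dual) src
ON (u.entry_guid = src.entry_guid)
WHEN MATCHED THEN UPDATE SET u.last_used = SYSDATE
WHEN NOT MATCHED THEN INSERT (entry_guid, last_used) VALUES (src.entry_guid, SYSDATE);

-- Cleanup: delete only old rows that never show up in the tracking table.
DELETE FROM main_table m
WHERE m.created_date < ADD_MONTHS(SYSDATE, -60)
  AND NOT EXISTS (SELECT 1 FROM used_entries u WHERE u.entry_guid = m.guid);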
I am running SQL Server 2012 and this one query is killing my database performance.
My text message provider does not support scheduled text messages so I have a text message engine that picks up messages from the database and sends them at the scheduled time. I put this query together that gets the messages from the database and also changes their status so that they do not get picked up again.
The query works fine; it is just causing wait times on the CPU, especially since it runs every other second. I installed database performance monitoring software, and it says this query accounts for 92% of instance execution time. The software also says that every single execution does 347,267 logical reads.
Any ideas on how to make this perform better?
Should I maybe select into a temporary table and update those results before returning them?
Here is the current query:
UPDATE TOP (30) dbo.Outgoing
SET Status = 2
OUTPUT INSERTED.OutgoingID, INSERTED.[Message], n.PhoneNumber, c.OptInStatus
FROM dbo.Outgoing o
JOIN Numbers n on n.NumberID = o.NumberID
LEFT JOIN Contacts c on c.ContactID = o.ContactID
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1
Here is the execution plan
There are three tables involved in this query: Outgoing, Numbers, & Contacts
Outgoing is the main table that this query deals with. There are only two indexes right now, a clustered primary key index on OutgoingID [PK, bigint, not null] and a non-clustered, non-unique index on SmsId [varchar(255), null] which is an identifier sent back from our text message provider once the messages are successfully received in their system. The Status column is just an integer column that relates to a few different statuses (Scheduled, Queued, Sent, Failed, Etc)
Numbers is just a simple table where we store unique cell phone numbers, some different formats of that number, and some basic information identifying the customer such as First name, carrier, etc. It just has a clustered primary key index on NumberID [bigint]. The PhoneNumber column is just a varchar(15).
The Contacts table just connects the individual person (phone number) to one of our merchants and keeps up with the number's opt in status, and other information related to the customer/merchant relationship. The only columns related to this query are OptInStatus [bit, not null] and ContactID [PK, bigint, not null]
--UPDATE--
Added a non-clustered index on the Outgoing table with the columns (Scheduled, SmsId, Status), and that seems to have brought the execution time down from 2+ seconds to milliseconds. I will check in with my performance monitoring software tomorrow to see how it has improved. Thank you everyone for the help so far!
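For reference, a sketch of what that index definition might look like; the INCLUDE list and the note about key order are my assumptions, only the key columns come from the update above:

-- Supports: WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1
CREATE NONCLUSTERED INDEX IX_Outgoing_Scheduled_SmsId_Status
    ON dbo.Outgoing (Scheduled, SmsId, Status)
    INCLUDE (NumberID, ContactID);   -- covers the join keys without a key lookup
-- Note: an order such as (Status, SmsId, Scheduled), with the equality predicates first,
-- is often an even better fit for this WHERE clause.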
As several commenters have already pointed out, you need a new index on the dbo.Outgoing table. The server is struggling to find the rows to update/output. This is most probably where the problem is:
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1
To improve performance you should create an index on dbo.Outgoing that includes these columns. This will make it easier for SQL Server to find the correct rows. On the other hand, it will create some extra work for the actual update, since there will be a new index that needs attention when updating.
While you're working on this, it would likely be a good idea to shorten the SmsId column unless you actually need it to be 255 characters long, preferably before you create the index.
As an alternate solution you might think about having separate tables for the messages that are outgoing and those that are outgone. Then you can:
insert all records from Outgoing to Outgone
delete all records from Outgoing, with output clause like you are currently doing.
Make sure, though, that the insert and the delete are done in one transaction, or you will soon have weird inconsistencies in the database.
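A minimal sketch of that move-then-delete step in one transaction; it assumes Outgone mirrors Outgoing's columns and reuses the batch size and filter from the question:

BEGIN TRANSACTION;

-- Pin down exactly which rows this run will move (batch of 30, as in the original query).
SELECT TOP (30) OutgoingID
INTO #batch
FROM dbo.Outgoing WITH (UPDLOCK, ROWLOCK)
WHERE Scheduled <= GETUTCDATE() AND SmsId IS NULL AND Status = 1;

-- Copy them to the archive table...
INSERT INTO dbo.Outgone (OutgoingID, NumberID, ContactID, [Message], Scheduled, Status)
SELECT o.OutgoingID, o.NumberID, o.ContactID, o.[Message], o.Scheduled, o.Status
FROM dbo.Outgoing o
JOIN #batch b ON b.OutgoingID = o.OutgoingID;

-- ...then delete them from Outgoing, still returning what the sender needs.
DELETE o
OUTPUT DELETED.OutgoingID, DELETED.[Message], n.PhoneNumber, c.OptInStatus
FROM dbo.Outgoing o
JOIN #batch b ON b.OutgoingID = o.OutgoingID
JOIN Numbers n ON n.NumberID = o.NumberID
LEFT JOIN Contacts c ON c.ContactID = o.ContactID;

COMMIT TRANSACTION;
DROP TABLE #batch;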
it is just causing wait times on the CPU especially since it runs every other second.
Get rid of the TOP 30 and run it much less often than once every other second... maybe every two or three minutes.
You can also tune the max degree of parallelism setting of your SQL Server instance for faster processing.
We want to know which rows in a certain table are used frequently, and which are never used. We could add an extra column for this, but then we'd get an UPDATE for every SELECT, which sounds expensive? (The table contains 80k+ rows, some of which are used very often.)
Is there a better and perhaps faster way to do this? We're using some old version of Microsoft's SQL Server.
This kind of logging/tracking is classically the application server's task. If you want to build your own tracking architecture, do it in your own layer.
In any case you will need an application server for this. You are not going to update the tracking field in the same transaction as the select, are you? What about rollbacks? So you need some manager that first runs the select and then writes the tracking information. And what is the point of saving the tracking information together with the entity data by sending it back to the DB? Save it into a file on the application server.
You could update the column in the table as you suggested, but if it were me I'd log the event to another table, i.e. the id of the record, a datetime, a userid (maybe also ip address, browser version, etc.), and just about anything else I could capture that was even possibly relevant. (For example, 6 months from now your manager decides not only does s/he want to know which records were used the most, s/he wants to know which users are using the most records, or what time of day that usage happens, etc.)
This type of information can be useful for things you've never even thought of down the road, and if the log starts to grow large you can always roll it up and prune it to a smaller table if performance becomes an issue. When possible, I log everything I can. You may never use some of this information, but you'll never wish you didn't have it available down the road, and it is impossible to re-create historically.
In terms of making sure the application doesn't slow down, you may want to 'select' the data from within a stored procedure that also issues the logging command, so that the client is not doing two round trips (one for the select, one for the update/insert). For example:
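A minimal sketch of such a procedure; every table, column and parameter name here is hypothetical:

CREATE PROCEDURE dbo.GetRecordWithLogging
    @RecordID int,
    @UserID   int
AS
BEGIN
    SET NOCOUNT ON;

    -- One round trip: log the access...
    INSERT INTO dbo.RecordAccessLog (RecordID, AccessedAt, UserID)
    VALUES (@RecordID, GETUTCDATE(), @UserID);

    -- ...and return the data in the same call.
    SELECT r.*
    FROM dbo.Records r
    WHERE r.RecordID = @RecordID;
END;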
Alternatively, if this is a web application, you could use an async ajax call to issue the logging action which wouldn't slow down the users experience at all.
Adding a new column just to track SELECTs is not good practice, because it may affect database performance, and database performance is one of the major concerns in database server administration.
Instead, you can use a very good database feature called auditing; it is easy to set up and puts less stress on the database.
Search for "database auditing for SELECT statements" for more information.
Use another table as a key/value pair with two columns (e.g. id_selected, times) to store the ids of the records you select in your standard table, and increment the times value by 1 every time a record is selected.
To do this you'd have to do a mass insert/update of the selected ids from your select query into the counting table. E.g. as a quick example:
SELECT id, stuff1, stuff2 FROM myTable WHERE stuff1 = 'somevalue';

INSERT INTO countTable (id_selected, times)
SELECT id, 1 FROM myTable mt WHERE mt.stuff1 = 'somevalue'  # or just build a list of ids as values from your last result
ON DUPLICATE KEY UPDATE times = times + 1;
The ON DUPLICATE KEY syntax is right off the top of my head in MySQL. For conditionally inserting or updating in MSSQL you would need to use MERGE instead.
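A rough MERGE equivalent for SQL Server, reusing the hypothetical countTable/myTable names from the example above:

MERGE countTable AS tgt
USING (SELECT id FROM myTable WHERE stuff1 = 'somevalue') AS src
    ON tgt.id_selected = src.id
WHEN MATCHED THEN
    UPDATE SET times = tgt.times + 1
WHEN NOT MATCHED THEN
    INSERT (id_selected, times) VALUES (src.id, 1);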
We are currently running a SQL job that archives data daily at 10 PM. However, the end users complain that from 10 PM to 12 AM the page shows a timeout error.
Here's the pseudocode of the job
while #jobArchive = 1 and #countProcessedItem < #maxItem
    exec ArchiveItems #countProcessedItem out
    if error occurred
        set #jobArchive = 0
    delay '00:10'
The ArchiveItems stored procedure grabs the top 100 items that were created 30 days ago, processes and archives them in another database, and deletes the items from the original table, including related rows in other tables. Finally it sets #countProcessedItem to the number of items processed. ArchiveItems also creates and deletes temporary tables that it uses to hold some records.
Note: if the information I've provided is incomplete, reply and I'll gladly add more if possible.
The only thing that isn't clear is whether ArchiveItems also deletes data from the database. Deleting rows in SQL Server is a very expensive operation that causes a lot of locking on the database, possibly escalating to table and database locks, and this typically causes timeouts.
If you're deleting data, what you can do is:
Set a "logical" deletion flag on the relevant rows and take it into account in the queries you use to read data
Perform deletes in batches. I've found that (in my application) deleting about 250 rows in each transaction gives the fastest operation, taking a lot less time than issuing 250 separate delete commands
Hope this helps, but archiving and deleting data from SQL Server is a very tough job.
While the ArchiveItems process is deleting the 100 records, it is locking the table. Make sure you have indexes in place to make the delete run quickly; run a Profiler session during that timeframe and see how long it takes. You may need to add an index on the date field if it is doing a Table Scan or Index Scan to find the records.
On the end user's side, you may be able to add a READUNCOMMITTED or NOLOCK hint on the queries; this allows the query to run while the deletes are taking place, but with the possibility of returning records that are about to be deleted.
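For example, a hint on a hypothetical end-user query might look like this (the table and columns are made up):

-- Reads uncommitted data, so it will not be blocked by the archive job's deletes,
-- at the cost of possibly seeing rows that are about to be removed.
SELECT ItemID, Name, CreatedDate
FROM dbo.Items WITH (NOLOCK)
WHERE CreatedDate >= DATEADD(DAY, -30, GETUTCDATE());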
Also consider a different timeframe for the job; find the time that has the least user activity, or only do the archiving once a month during a maintenance window.
As another poster mentioned, slow DELETEs are often caused by not having a suitable index, or a suitable index needs rebuilding.
During DELETEs it is not uncommon for locks to be escalated ROW -> PAGE -> TABLE. You can reduce locking by:
Adding a ROWLOCK hint (but be aware it will likely consume more memory)
Randomising the rows that are deleted (makes lock escalation less likely)
Easiest: adding a short WAITFOR in ArchiveItems
WHILE someCondition
BEGIN
    DELETE some rows

    -- Give other processes a chance...
    WAITFOR DELAY '00:00:00.250'
END
I wouldn't use the NOLOCK hint if the deletes are happening during periods with other activity taking place, and you want to maintain integrity of your data.