I need to move data that is a month old from a logging table to a logging-archive table, and remove data older than a year from the latter.
There is a lot of data (600k inserts in 2 months).
I was considering simply calling a stored proc (as a scheduled batch job) every day or week.
I first thought about doing it with 2 stored procs:
Deleting from the archives what is older than 365 days
Moving the data that is older than 30 days from logging to archive (I suppose there's a way to do that with one SQL query; a rough sketch follows this list)
Removing from logging what is older than 30 days.
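Something like the following would cover those steps in a single pass (a rough, hedged sketch in SQL Server syntax; the table and column names Logging, LoggingArchive, LoggedAt and Message are assumptions):

DECLARE @cutoff DATETIME;
SET @cutoff = DATEADD(DAY, -30, GETDATE());

-- move rows older than 30 days into the archive
INSERT INTO LoggingArchive (Id, LoggedAt, Message)
SELECT Id, LoggedAt, Message
FROM   Logging
WHERE  LoggedAt < @cutoff;

-- then remove the same rows from the logging table
DELETE FROM Logging
WHERE  LoggedAt < @cutoff;

-- and purge archive rows older than a year
DELETE FROM LoggingArchive
WHERE  LoggedAt < DATEADD(DAY, -365, GETDATE());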
However, this solution seems quite inefficient and will probably lock the DB for a few minutes, which I do not want.
So, do I have any alternative and what are they?
None of this should lock the tables that you actually use. You are writing only to the logging table currently, and only to new records.
You are selecting from the logging table only OLD records, and writing to a table that you don't write to except for the archive process.
The steps you are taking sound fine. I would go one step further, and instead of deleting based on date, just do an INNER JOIN to your archive table on your id field - then you only delete the specific records you have archived.
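In other words, the cleanup delete could look something like this (a sketch in SQL Server syntax; the table names and the Id column are assumptions):

DELETE L
FROM   Logging AS L
INNER JOIN LoggingArchive AS A
        ON A.Id = L.Id;   -- only rows that actually made it into the archive are removed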
As a side note, 600k records is not very big at all. We have production DBs with tables over 2 billion rows, and I know some other folks here have DBs with millions of inserts a minute into transactional tables.
Edit:
I forgot to include originally, another benefit of your planned method is that each step is isolated. If you need to stop for any reason, none of your steps is destructive or depends on the next step executing immediately. You could potentially archive a lot of records, then run the deletes the next day or overnight without creating any issues.
What if you archived to a secondary database?
I.e.:
Primary database has the logging table.
Secondary database has the archive table.
That way, if you're worried about locking your archive table so you can do a batch on it, it won't take your primary database down.
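A cross-database archive insert would look roughly like this (a sketch in SQL Server syntax; the database, schema, table and column names are all assumptions):

INSERT INTO ArchiveDB.dbo.LoggingArchive (Id, LoggedAt, Message)
SELECT Id, LoggedAt, Message
FROM   PrimaryDB.dbo.Logging
WHERE  LoggedAt < DATEADD(DAY, -30, GETDATE());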
But in any case, I'm not sure you have to worry about locking; I guess it just depends on how you implement it.
Related
My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This runs for 4 hours but never seems to complete.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records, commit 1000 records, then commit the next 1000, and so on, normally breaking the data up based on the primary key.
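For reference, the kind of chunked-commit procedure being described usually looks something like this PL/SQL sketch (hedged; it simply mirrors the approach described above, committing every 1000 rows):

DECLARE
  CURSOR c IS SELECT ROWID rid FROM MYTABLE;
  TYPE t_rowids IS TABLE OF ROWID;
  l_rowids t_rowids;
BEGIN
  OPEN c;
  LOOP
    FETCH c BULK COLLECT INTO l_rowids LIMIT 1000;
    EXIT WHEN l_rowids.COUNT = 0;
    FORALL i IN 1 .. l_rowids.COUNT
      UPDATE MYTABLE SET COLUMN_A = COLUMN_B WHERE ROWID = l_rowids(i);
    COMMIT;  -- commit each 1000-row chunk
  END LOOP;
  CLOSE c;
END;
/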
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table or linking to another database; I'm simply copying data from one column to another. I don't consider 13 million records to be large, considering there are systems out there on the order of billions of records. I can't imagine it takes a computer hours and hours (only to fail) to copy a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE (which simply copies one column into another) run for 4 hours and fail, while the CREATE TABLE, which copies the entire table, takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you are updating the table, transaction (undo/redo) information must be written for every row in case a rollback is needed. When you are creating a table, that isn't necessary.
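So the "rebuild instead of update" route hinted at in the question would look roughly like this (a sketch; indexes, constraints, triggers and grants on MYTABLE would have to be recreated manually afterwards):

create table MYTABLE_2 as
  select COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE;

drop table MYTABLE;
alter table MYTABLE_2 rename to MYTABLE;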
I have a database with a table that has become overpopulated (465,025,579 records). What is the best way to delete records and keep only the last 3 months' worth, without the machine hanging?
Delete them in batches based on date, earliest first. Sure, it'll take some time, but it's safer (as you are defining exactly which rows to delete) and not so resource intensive. It also means you can shrink the database in batches too, instead of in one big hit (which is quite resource intensive).
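A date-ordered batch delete could look something like this (a hedged sketch in SQL Server syntax; the table and column names are assumptions and the batch size is arbitrary):

DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    ;WITH oldest AS (
        SELECT TOP (10000) *
        FROM   dbo.BigLogTable
        WHERE  LoggedAt < DATEADD(MONTH, -3, GETDATE())
        ORDER BY LoggedAt            -- earliest rows first
    )
    DELETE FROM oldest;

    SET @rows = @@ROWCOUNT;
END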
Yeah, it might fragment the database a little, but until you've got the actual data down to a manageable level, there isn't that much you can do.
To be fair, 200G of data isn't that much on a decent machine these days.
All this said, I'm presuming you want the database to remain 'online'.
If you don't need the database to be available whilst you're doing this, the easiest thing to do is usually to select the rows that you want to keep into a different table, run a TRUNCATE on the original table, and then copy the saved rows back in.
From TRUNCATE:
TRUNCATE TABLE is similar to the DELETE statement with no WHERE clause; however, TRUNCATE TABLE is faster and uses fewer system and transaction log resources.
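Put together, the keep/truncate/reload approach might look like this (a sketch; names are assumptions, and it assumes the table has no IDENTITY column and no foreign keys pointing at it, since TRUNCATE is not allowed on a table referenced by a foreign key):

SELECT *
INTO   dbo.BigLogTable_keep
FROM   dbo.BigLogTable
WHERE  LoggedAt >= DATEADD(MONTH, -3, GETDATE());

TRUNCATE TABLE dbo.BigLogTable;

INSERT INTO dbo.BigLogTable
SELECT * FROM dbo.BigLogTable_keep;

DROP TABLE dbo.BigLogTable_keep;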
I have reports that perform some time consuming data calculations for each user in my database, and the result is 10 to 20 calculated new records for each user. To improve report responsiveness, a nightly job was created to run the calculations and dump the results to a snapshot table in the database. It only runs for active users.
So with 50k users, 30k of which are active, the job "updates" 300k to 600k records in the large snapshot table. The method it currently uses is to delete all previous records for a given user, then insert the new set. There is no PK on the table; only a business key is used to group the sets of data.
So my question is, when removing and adding up to 600k records every night, are there techniques to optimize the table to handle this? For instance, since the data can be recreated on demand, is there a way to disable logging for the table as these changes are made?
UPDATE:
One issue is that I cannot do this in one batch because of the way the script works: it examines one user at a time, so it looks at a user, deletes the previous 10-20 records, and inserts a new set of 10-20 records, over and over. I am worried that the transaction log will run out of space or that other performance issues could occur. I would like to configure the table not to worry about data preservation or other things that could slow it down. I cannot drop the indexes and so on because people are accessing the table concurrently while it is being updated.
It's also worth noting that indexing could potentially speed up this bulk update rather than slow it down, because UPDATE and DELETE statements still need to be able to locate the affected rows in the first place, and without appropriate indexes they will resort to table scans.
I would, at the very least, consider a non-clustered index on the column(s) that identify the user, and (assuming you are using 2008) consider the MERGE statement, which can definitely avoid the shortcomings of the mass DELETE/INSERT method currently employed.
According to The Data Loading Performance Guide (MSDN), MERGE is minimally logged for inserts with the use of a trace flag.
I won't say too much more until I know which version of SQL Server you are using.
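Just to illustrate the shape of it (the snapshot/staging table and column names here are assumptions, not your schema), a single MERGE per run could replace the per-user DELETE/INSERT:

MERGE dbo.ReportSnapshot AS target
USING dbo.ReportStaging  AS source
      ON  target.UserId   = source.UserId
      AND target.MetricId = source.MetricId
WHEN MATCHED THEN
    UPDATE SET target.Value = source.Value
WHEN NOT MATCHED BY TARGET THEN
    INSERT (UserId, MetricId, Value)
    VALUES (source.UserId, source.MetricId, source.Value)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;   -- prunes snapshot rows missing from staging; omit this clause if old
              -- rows for users not in the staging set must be kept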
This is called bulk insert: you drop all indexes on the destination table and send the INSERT commands in large batches (hundreds of INSERT statements separated by semicolons).
Another way is to use the BULK INSERT statement (http://msdn.microsoft.com/en-us/library/ms188365.aspx), but it involves dumping the data to a file.
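A minimal example of that statement (the file path, delimiters and table name are assumptions):

BULK INSERT dbo.ReportSnapshot
FROM 'C:\loads\snapshot.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);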
See also: Bulk Insert Sql Server millions of record
It really depends upon many things:
speed of your machine
size of the records being processed
network speed
etc.
Generally it is quicker to add records to a "heap" or an un-indexed table. So dropping all of your indexes and re-creating them after the load may improve your performance.
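For example (the index and table names here are made up purely for illustration):

DROP INDEX IX_Snapshot_UserId ON dbo.ReportSnapshot;

-- ... perform the nightly load here ...

CREATE NONCLUSTERED INDEX IX_Snapshot_UserId
    ON dbo.ReportSnapshot (UserId);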
Partitioning the table may yield performance benefits if you partition by active and inactive users (although the data set may be a little small for this).
Make sure you measure how much time each tweak adds to or removes from your load, and work from there.
I'm writing an application which must log information pretty frequently, say, twice in a second. I wish to save the information to an sqlite database, however I don't mind to commit changes to the disk once every ten minutes.
Executing my queries against a file database takes too long and makes the computer lag.
One option is to use an in-memory database (it will fit, no worries) and synchronize it to disk from time to time.
Is it possible? Is there a better way to achieve that (can you tell sqlite to commit to disk only after X queries?).
Can I solve this with Qt's SQL wrapper?
Let's assume you have an on-disk database called 'disk_logs' with a table called 'events'. You could attach an in-memory database to your existing database:
ATTACH DATABASE ':memory:' AS mem_logs;
Create a table in that database (which would be entirely in-memory) to receive the incoming log events:
CREATE TABLE mem_logs.events(a, b, c);
Then transfer the data from the in-memory table to the on-disk table during application downtime:
INSERT INTO disk_logs.events SELECT * FROM mem_logs.events;
And then delete the contents of the existing in-memory table. Repeat.
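That last step is simply:

DELETE FROM mem_logs.events;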
This is pretty complicated though... If your records span multiple tables and are linked together with foreign keys, it might be a pain to keep these in sync as you copy from the in-memory tables to the on-disk tables.
Before attempting something (uncomfortably over-engineered) like this, I'd also suggest trying to make SQLite go as fast as possible. SQLite should be able to easily handle > 50K record inserts per second. A few log entries twice a second should not cause significant slowdown.
If you're executing each insert within its own transaction, that could be a significant contributor to the slowdowns you're seeing. Perhaps you could (a rough sketch follows these steps):
Count the number of records inserted so far
Begin a transaction
Insert your record
Increment count
Commit/end transaction when N records have been inserted
Repeat
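In plain SQLite terms the effect is just to wrap many inserts in one transaction, something like this (using the events(a, b, c) table from the earlier example; the values are made up):

BEGIN TRANSACTION;
INSERT INTO events (a, b, c) VALUES (1, 'startup', 0.5);
INSERT INTO events (a, b, c) VALUES (2, 'heartbeat', 0.5);
-- ... keep inserting until N rows have accumulated ...
COMMIT;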
The downside is that if the system crashes during that period you risk losing the uncommitted records (but if you were willing to use an in-memory database, then it sounds like you're OK with that risk).
A brief search of the SQLite documentation turned up nothing useful (it wasn't likely and I didn't expect it).
Why not use a background thread that wakes up every 10 minutes and copies all of the log rows from the in-memory database to the external database (deleting them from the in-memory database as it goes)? When your program is ready to exit, wake the background thread one last time to save the last logs, then close all of the connections.
We are currently running a SQL job that archives data daily at 10 PM. However, the end users complain that from 10 PM to 12 the page shows a time-out error.
Here's the pseudocode of the job
WHILE @jobArchive = 1 AND @countProcessedItem < @maxItem
BEGIN
    EXEC ArchiveItems @countProcessedItem OUTPUT;
    IF @@ERROR <> 0
        SET @jobArchive = 0;
    WAITFOR DELAY '00:10';
END
The ArchiveItems stored procedure grabs the top 100 items that were created more than 30 days ago, processes and archives them in another database, and deletes the items from the original table, including rows in other tables that are related to them. Finally it sets @countProcessedItem to the number of items processed. ArchiveItems also creates and deletes temporary tables it uses to hold some records.
Note: if the information I've provide is incomplete, reply and I'll gladly add more information if possible.
The only thing that isn't clear is whether ArchiveItems also deletes data from the database. Deleting rows in SQL Server is a very expensive operation that causes a lot of locking, possibly escalating to table and database locks, and this typically causes timeouts.
If you're deleting data, what you can do is:
Set a "logical" deletion flag on the relevant rows and take it into account in the queries you use to read data (a small sketch follows this list)
Perform the deletes in batches. I've found that (in my application) deleting about 250 rows per transaction gives the fastest operation, taking a lot less time than issuing 250 separate DELETE commands
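The logical-flag option might look like this (a sketch; the Items table, IsDeleted column and key values are assumptions):

ALTER TABLE dbo.Items ADD IsDeleted BIT NOT NULL DEFAULT 0;

-- "delete" a row logically instead of physically
UPDATE dbo.Items SET IsDeleted = 1 WHERE ItemId = 12345;

-- readers then exclude flagged rows
SELECT ItemId, CreatedAt FROM dbo.Items WHERE IsDeleted = 0;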
Hope this helps, but archiving and deleting data from SQL Server is a very tough job.
While the ArchiveItems process is deleting the 100 records, it is locking the table. Make sure you have indexes in place to make the delete run quickly; run a Profiler session during that timeframe and see how long it takes. You may need to add an index on the date field if it is doing a Table Scan or Index Scan to find the records.
On the end user's side, you may be able to add a READUNCOMMITTED or NOLOCK hint on the queries; this allows the query to run while the deletes are taking place, but with the possibility of returning records that are about to be deleted.
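For example (an illustrative query; the names are made up):

SELECT ItemId, CreatedAt
FROM   dbo.Items WITH (NOLOCK)   -- same effect as READUNCOMMITTED; may return dirty reads
WHERE  CreatedAt >= DATEADD(DAY, -7, GETDATE());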
Also consider a different timeframe for the job; find the time that has the least user activity, or only do the archiving once a month during a maintenance window.
As another poster mentioned, slow DELETEs are often caused by not having a suitable index, or by a suitable index that needs rebuilding.
During DELETEs it is not uncommon for locks to be escalated ROW -> PAGE -> TABLE. You reduce locking by:
Adding a ROWLOCK hint (but be aware it will likely consume more memory)
Randomising the rows that are deleted (makes lock escalation less likely)
Easiest: adding a short WAITFOR in ArchiveItems
DECLARE @rowsDeleted INT;
SET @rowsDeleted = 1;
WHILE @rowsDeleted > 0
BEGIN
    DELETE TOP (100) FROM dbo.Items       -- table/column names are illustrative
    WHERE CreatedAt < DATEADD(DAY, -30, GETDATE());
    SET @rowsDeleted = @@ROWCOUNT;
    -- Give other processes a chance...
    WAITFOR DELAY '000:00:00.250';
END
I wouldn't use the NOLOCK hint if the deletes are happening during periods with other activity taking place, and you want to maintain integrity of your data.