I have a Firebird table with 60 million rows and I need to delete roughly half of it.
The rows hold GPS positions of cars, a record timestamp and other data. The table has a primary key on IdVehicle + TimeStamp and one foreign key (to the Vehicle table). There is no other key, index or trigger. One vehicle has 100 000 - 500 000 records.
I need to delete the older data, e.g. delete data older than 1 March 2015 for all vehicles. I have tried different approaches, and the fastest one I currently use works with 'execute block' (using the primary key). First I read the records older than 1.3.2015 for one vehicle. Then I go through the individual records, build an EXECUTE BLOCK statement for every 50 of them and run it against Firebird:
EXECUTE BLOCK AS BEGIN
DELETE FROM RIDE_POS WHERE IdVehicle = 1547 and date = '4.5.2015 8:56:47';
DELETE FROM RIDE_POS WHERE IdVehicle = 1547 and date = '4.5.2015 8:56:59';
DELETE FROM RIDE_POS WHERE IdVehicle = 1547 and date = '4.5.2015 8:57:17';
/* ... a total of 50 such lines ... */
END
This deletes about 1 million rows per 800 seconds (roughly 1 record per millisecond).
Is there a quicker way to delete the records?
Additionally, this way I can only delete a few million rows, and then I have to restart Firebird, otherwise it starts to slow down and jam (there is no other database/application on the test server). The early records are deleted quickly, and then each batch gradually takes longer and longer.
For orientation: how quickly do you routinely delete records from large tables (not wiping the whole table, only deleting part of the records)?
If you want to delete all records older than a given date, no matter the vehicle, then there is no point including IdVehicle in the query; the date alone is enough. I.e. the following should do, as a plain straight query, with no need for an execute block either:
DELETE FROM RIDE_POS WHERE date < '2015-03-01'
If you have to delete many thousands (or millions) of records, do not do it in one single transaction. You had better do it in several steps: delete, for example, 1000 records and commit, then delete another 1000 and commit; it should be faster than deleting a million records in one transaction. 1000 is not a rule, it depends on your particular situation (how large your records are, how much linked data they have via foreign keys with "on delete cascade"). Also check whether you have "on delete" triggers and whether it is possible to temporarily deactivate them.
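As a rough illustration of the chunked approach (assuming Firebird 2.0 or later, where DELETE accepts a ROWS clause, and reusing the column names from the question), the client could repeat the following pair of statements until the DELETE affects no rows:

-- Hedged sketch: delete at most 1000 matching rows, then commit, then repeat
DELETE FROM RIDE_POS
WHERE date < '2015-03-01'
ROWS 1000;

COMMIT;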
Maybe a combined approach would help.
Add (temporarily) index on date:
CREATE INDEX IDX_RIDE_POS_date_ASC ON RIDE_POS (date)
Write an execute block:
EXECUTE BLOCK
AS
DECLARE VARIABLE V_ID_VEHICLE INTEGER;
BEGIN
  FOR SELECT DISTINCT IdVehicle
      FROM RIDE_POS
      INTO :V_ID_VEHICLE
  DO BEGIN
    DELETE FROM RIDE_POS WHERE IdVehicle = :V_ID_VEHICLE AND date < '1.3.2015';
  END
END
Drop the index if you don't want to keep it:
DROP INDEX IDX_RIDE_POS_date_ASC
I think that even taking into account the time needed to create the index, you would still save some time on deleting the records.
Finally, I found where the problem was. The main problem was that I was running the queries from a classic WinForms application (or IBExpert), and that was what caused the jams and slowed the queries down. I used an execute block and erased the data in portions, which solved the jamming problem, but it was slow.
The solution was to create a simple console application and run the query from it. I kept the primary key and deleted through it (no adding or dropping of indexes), and the deletion speed was about 65 rows per millisecond (1 million rows per 16 seconds).
When I tried dropping the primary key and adding an index on the datetime column, the deletion sped up only a little, about 5-10%.
Related
As part of a uni project, I am using MariaDB to cleanse some large CSVs with an algorithm, using MariaDB 10.5.9 because of the data size.
The data has 5 columns: date, time, PlaceID, ID (not unique, repeated), Location.
It is a large dataset with approx. 50+ million records per day; in total, over 1 week, 386 million records.
I started by running the algorithm over each day individually and this worked well, the whole process taking between 11 and 15 minutes.
When trying to run it over the 7 days combined, I see a significant impact on performance.
Most elements work, but I have one query which compares values in ID with a list of known good IDs and deletes any rows not in the known-good list:
DELETE quick FROM merged WHERE ID NOT IN (SELECT ID FROM knownID) ;
On a daily table, this query takes around 2 minutes (comparing 50 million rows against 125 million known-good rows; both tables have indexes on their ID columns to speed up the process).
Table size for the merged data is 24.5 GB and for the known-good data 4.7 GB.
When running across the whole week, I expected it to take around 7 times as long (plus a bit), but the query took just under 2 hours. How can I improve this performance? I am loading both tables into a MEMORY table when performing the work and then copying back to a disc-based table once complete to try to speed up the process; the server has 256 GB RAM, so there is plenty of room. Are there any other settings I can change/tweak?
my.ini is below:
innodb_buffer_pool_size=18G
max_heap_table_size=192G
key_buffer_size=18G
tmp_memory_table_size=64G
Many thanks
innodb_buffer_pool_size=18G -- too low; raise to 200G
max_heap_table_size=192G -- dangerously high; set to 2G
key_buffer_size=18G -- used only by MyISAM; set to 50M
tmp_memory_table_size=64G -- dangerously high; set to 2G
How many rows will be deleted by this?
DELETE quick FROM merged
WHERE ID NOT IN (SELECT ID FROM knownID) ;
Change to the "multi-table" syntax for DELETE and use LEFT JOIN ... IS NULL
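A minimal sketch of that rewrite, assuming the join column is simply ID in both tables:

-- Multi-table DELETE: remove merged rows that have no match in knownID
DELETE merged
FROM merged
LEFT JOIN knownID ON knownID.ID = merged.ID
WHERE knownID.ID IS NULL;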
If you are deleting more than, say, a thousand rows, do it in chunks. See http://mysql.rjweb.org/doc.php/deletebig
As discussed in that link, it may be faster to build a new table with just the rows you want to keep.
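If you go that route, a sketch (the new/old table names here are made up) might look like this:

-- Copy only the rows whose ID appears in knownID, then swap the tables
CREATE TABLE merged_new LIKE merged;

INSERT INTO merged_new
SELECT * FROM merged
WHERE ID IN (SELECT ID FROM knownID);

RENAME TABLE merged TO merged_old, merged_new TO merged;
DROP TABLE merged_old;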
DELETE must keep the old rows until the end of the statement and only then (in the background) do the actual delete. This is a lot of overhead.
For further discussion, please provide SHOW CREATE TABLE for both tables.
We have a process in our project where records in a table with a specific flag are deleted and the remaining records' flag is updated.
The table has approx. 45 million records; half of them have flag = 'C' and the remaining half flag = 'P'.
The process runs once a day to delete all the records with flag 'P' and then update all the remaining ones that have flag 'C'.
Below are the two statements that are run through an SSIS package.
DELETE FROM dbo.RTL_Valuation WITH (TABLOCK)
WHERE Valuation_Age_Flag = 'P';
UPDATE dbo.RTL_Valuation WITH (TABLOCK)
SET Valuation_Age_Flag = 'P'
WHERE Valuation_Age_Flag = 'C';
Currently the process takes 60 minutes to complete. Is there any way the process time could be improved?
Thanks
You need to do 10,000 rows at a time. You are creating one enormous transaction that takes up a lot of room in the transaction log (so that it can be rolled back).
set nocount on

DELETE TOP (10000) FROM dbo.RTL_Valuation WHERE Valuation_Age_Flag = 'P';

while @@rowcount > 0
begin
    DELETE TOP (10000) FROM dbo.RTL_Valuation WHERE Valuation_Age_Flag = 'P';
end
You can try 1,000, 5,000 or some other number to determine the best 'magic' number for quickly deleting rows from a large table on your install of SQL Server. But it will be a lot faster than doing one big delete. The same logic applies to the update.
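For completeness, a sketch of the same batching pattern applied to the UPDATE (same table and column as in the question, batch size to be tuned as above):

set nocount on

UPDATE TOP (10000) dbo.RTL_Valuation
SET Valuation_Age_Flag = 'P'
WHERE Valuation_Age_Flag = 'C';

while @@rowcount > 0
begin
    UPDATE TOP (10000) dbo.RTL_Valuation
    SET Valuation_Age_Flag = 'P'
    WHERE Valuation_Age_Flag = 'C';
end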
OK. I assume that when you perform your delete and update statements, it results in two scans of the entire table (one to identify the rows to delete and one to identify the rows to update), and then you have to perform fully logged delete and update operations on it.
There is a nice trick for situations like this if your database is in the simple recovery model. Whether it is suitable for you depends on other circumstances (e.g. how many indexes your table has, whether there are references, the data types, ...) that I am not able to assess from your description. It requires more coding, but it usually results in much better performance. You would have to test whether it works better for you than your original approach.
Anyway, the trick works like this:
Instead of the delete and update operations, just select the rows you want to keep (including the change of the flag) into a new table using the SELECT INTO construct. This results in a minimally logged insert operation and a single table scan. You can also use the INSERT INTO ... SELECT construct, but there you must fulfil some additional conditions to get a minimally logged insert.
Once the data is in the new table, you have to build all required indexes on it.
Once all indexes are built, you just truncate the original table and, using the SWITCH command, simply switch the data back into the original table and drop the "new table". It also works on the Standard edition of SQL Server.
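A minimal sketch of the pattern; only the flag column comes from the question, the other column names are assumed for illustration:

-- 1) Keep only the 'C' rows, already re-flagged as 'P' (minimally logged under simple recovery)
SELECT Valuation_Id, Valuation_Amount, 'P' AS Valuation_Age_Flag
INTO dbo.RTL_Valuation_New
FROM dbo.RTL_Valuation
WHERE Valuation_Age_Flag = 'C';

-- 2) Rebuild the required indexes on dbo.RTL_Valuation_New here

-- 3) Empty the original table and switch the new data back in
TRUNCATE TABLE dbo.RTL_Valuation;
ALTER TABLE dbo.RTL_Valuation_New SWITCH TO dbo.RTL_Valuation;
DROP TABLE dbo.RTL_Valuation_New;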
I have a database table which has more than 1 million records, uniquely identified by a GUID column. I want to find out which of these records were selected or retrieved in the last 5 years. The select query can happen from multiple places. Sometimes the row is returned as a single row, sometimes as part of a set of rows. A select query does the fetching over a JDBC connection from Java code; a SQL procedure also fetches data from the table.
My intention is to clean up the database table: I want to delete all rows which were never used (retrieved via a select query) in the last 5 years.
Does Oracle DB have any built-in metadata which can give me this information?
My alternative solution was to add a LAST_ACCESSED column and update it whenever I select a row from this table. But this is a costly operation for me in terms of the time taken for the whole process. At least 1,000 - 10,000 records will be selected from the table in a single operation. Is there any efficient way to do this rather than updating the table after reading it? Mine is a multi-threaded application, so updating such a large data set may result in deadlocks or long waiting periods for the next read query.
Any elegant solution to this problem?
Oracle Database 12c introduced a new feature called Automatic Data Optimization that brings you Heat Maps to track table access (modifications as well as read operations). Be careful: the feature currently has to be licensed under the Advanced Compression Option or the In-Memory Option.
Heat Maps track whenever a database block has been modified or whenever a segment, i.e. a table or table partition, has been accessed. They do not track select operations per individual row, nor per individual block, because the overhead would be too heavy (data is generally read often and concurrently; having to keep a counter for each row would quickly become a very costly operation). However, if you have your data partitioned by date, e.g. creating a new partition for every day, you can over time easily determine which days are still read and which ones can be archived or purged. Partitioning is also an option that needs to be licensed.
Once you have reached that conclusion, you can then either use In-Database Archiving to mark rows as archived, or just go ahead and purge the rows. If you happen to have the data partitioned, you can do easy DROP PARTITION operations to purge one or many partitions rather than having to run conventional DELETE statements.
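For illustration only (the table and partition names below are assumed, not from the question), purging one old daily partition can be a single metadata operation:

-- Drops one day's data and keeps any global indexes usable
ALTER TABLE access_log DROP PARTITION p_2015_02_28 UPDATE INDEXES;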
I couldn't use any of the built-in solutions. I tried the solutions below:
1) The DB audit feature for select statements.
2) Adding a trigger to update a date column whenever a select query is executed on the table.
Both were discarded: audit uses up a lot of space and has a performance hit, and similarly the trigger also had a performance hit.
Finally, I resolved the issue by maintaining a separate table into which entries older than 5 years that are still used or selected in a query are inserted. While deleting, I cross-check this table and avoid deleting the entries present in it.
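A rough sketch of that cleanup step; every table and column name here is made up for illustration, and the age condition stands in for however the 5-year-old rows are identified:

-- Delete old rows except those recorded in the "still used" table
DELETE FROM my_table t
WHERE t.created_date < ADD_MONTHS(SYSDATE, -60)  -- placeholder age condition
  AND NOT EXISTS (SELECT 1 FROM still_used_entries s WHERE s.guid_id = t.guid_id);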
Is it more efficient and ultimately FASTER to delete rows from a DB in blocks of 1000 or 10000? I am having to remove approx. 3 million rows from many tables. I first did the deletes in blocks of 100K rows, but the performance wasn't looking good. I changed to 10000 and seem to be removing them faster. I'm wondering whether even smaller blocks, like 1K per DELETE statement, would be better still.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'
Yes, it is. It all depends on your server, though. I mean, the last time I did that, I was using this approach to delete things in 64-million-row increments (on a table that at that point had around 14 billion rows, 80% of which ultimately got deleted). I got a delete through every 10 seconds or so.
It really depends on your hardware. Going more granular is more work, but it means less waiting on the transaction log for other things operating on the table. You have to try it out and find where you are comfortable; there is no ultimate answer, because it is totally dependent on the usage of the table and the hardware.
We used table partitioning to remove 5 million rows in less than a second, but this was from just one table. It took some work up-front but ultimately was the best way. This may not be the best way for you.
From our document about partitioning:
Let's say you want to add 5 million rows to a table but don't want to lock the table up while you do it. I ran into a case in an ordering system where I couldn't insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don't overlap the current data.
WHAT TO WATCH OUT FOR:
The data CANNOT overlap the current data. You have to partition the data on a value, and the new data cannot be intertwined within the currently partitioned data. If removing data, you have to remove an entire partition (or partitions); you will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
OVERVIEW OF STEPS:
FOR ADDING RECORDS
Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
Create a new table with the exact same structure (keys, data types, etc.).
Add a constraint to the new table to limit that data so that it would fit into the blank partition in the old table.
Insert new rows into new table.
Add indexes to match the old table.
Swap the new table with the blank partition of the old table.
Un-partition the old table if you wish.
FOR DELETING RECORDS
Partition the table into sets so that the data you want to delete sits in partitions of its own (this could be many different partitions).
Create a new table with the same partitions.
Swap the partitions containing the data you want to delete into the new table (a sketch of this step follows the list).
Un-partition the old table if you wish.
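For the swap step, a hedged T-SQL sketch; the table, columns and partition number below are assumed, not taken from the text above:

-- The staging table must be empty, match the source's structure (including indexes)
-- and live on the same filegroup as the partition being switched out
CREATE TABLE dbo.Orders_Purge
(
    OrderId   INT      NOT NULL,
    OrderDate DATETIME NOT NULL
) ON [PRIMARY];

-- Metadata-only operation: the old partition's rows move instantly to the staging table
ALTER TABLE dbo.Orders SWITCH PARTITION 1 TO dbo.Orders_Purge;

-- The unwanted rows can now be dropped without touching dbo.Orders
DROP TABLE dbo.Orders_Purge;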
Yes and no; it depends on how the table is used, because of locking. I would try to delete the records at a slower pace, so the opposite of the OP's question:
set rowcount 10000

delete
from table
where date < convert(datetime, '20120101', 112)

while @@rowcount > 0
begin
    waitfor delay '0:0:1'

    delete
    from table
    where date < convert(datetime, '20120101', 112)
end

set rowcount 0
My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seems to run for 4 hours but never completes.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks of data to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records, commit 1000 records, then commit the next 1000, and so on, normally breaking the data up based on the primary key.
This sounds somewhat silly to me coming from my experience with other database systems. I'm not joining another table or linking to another database; I'm simply copying data from one column to another. I don't consider 13 million records to be large, considering there are systems out there on the order of billions of records. I can't imagine it takes a computer hours and hours (only to fail) to copy a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new one).
Why does the UPDATE, which simply copies one column into another, run for 4 hours and fail, while the CREATE TABLE, which copies the entire table, takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me. However, this comes to mind:
When you update the table, rollback information must be recorded for every changed row in case a rollback is needed. When you create a new table, that isn't necessary.
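Building on the experiment in the question, the usual workaround is a create-table-as-select plus a rename; a sketch follows (the remaining columns would have to be listed explicitly, and constraints, indexes and grants recreated afterwards):

-- Rebuild the table with COLUMN_B copied into COLUMN_A, then swap the names
CREATE TABLE MYTABLE_NEW AS
SELECT COLUMN_B, COLUMN_B AS COLUMN_A /* plus the table's other columns */
FROM MYTABLE;

DROP TABLE MYTABLE;
RENAME MYTABLE_NEW TO MYTABLE;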