MariaDB Compare 2 tables and delete where not in 1st (Large Dataset) - database

As part of a uni project, I am using MariaDB 10.5.9 to cleanse some large CSVs with an algorithm, chosen because of the dataset size.
The data has 5 columns: date, time, PlaceID, ID (not unique; values repeat), and Location.
It is a large dataset with approximately 50 million records per day, 386 million records in total over one week.
I started to run the algorithm over each day individually and this worked well, the whole process taking between 11 and 15 minutes.
When trying to run over the 7 days combined, performance degrades significantly.
Most elements work, but I have one query which compares values in ID against a list of known-good IDs and deletes any rows not in the known-good set.
DELETE QUICK FROM merged WHERE ID NOT IN (SELECT ID FROM knownID);
On a daily table, this query takes around 2 minutes (comparing 50 million rows against 125 million known-good IDs; both tables have indexes on their ID columns to speed up the process).
Table size for the merged data is 24.5GB and for the known good is 4.7GB.
When running across the whole week I expected it to take around 7 times as long (plus a bit), but the query took just under 2 hours. How can I improve this performance? I am loading both tables into MEMORY tables when performing the work and then copying back to disc-based tables once complete to try and speed up the process; the server has 256GB RAM, so there is plenty of room. Are there any other settings I can change/tweak?
my.ini is below:
innodb_buffer_pool_size=18G
max_heap_table_size=192G
key_buffer_size=18G
tmp_memory_table_size=64G
Many thanks

innodb_buffer_pool_size=18G -- too low; raise to 200G
max_heap_table_size=192G -- dangerously high; set to 2G
key_buffer_size=18G -- used only by MyISAM; set to 50M
tmp_memory_table_size=64G -- dangerously high; set to 2G
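Putting those recommendations together, the relevant section of my.ini would look like this (a sketch based only on the values above; tune to your workload):

```ini
[mysqld]
# Give InnoDB most of the 256GB of RAM; it caches both data and indexes.
innodb_buffer_pool_size = 200G
# Cap per-connection MEMORY/temp table sizes to avoid runaway allocation.
max_heap_table_size     = 2G
tmp_memory_table_size   = 2G
# MyISAM-only cache; keep it minimal if all tables are InnoDB.
key_buffer_size         = 50M
```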
How many rows will be deleted by this?
DELETE QUICK FROM merged
WHERE ID NOT IN (SELECT ID FROM knownID);
Change to the "multi-table" syntax for DELETE and use LEFT JOIN ... IS NULL
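A sketch of that rewrite, using the table and column names from the question:

```sql
-- Delete rows of merged whose ID has no match in knownID.
-- The LEFT JOIN ... IS NULL form can use the indexes on both ID columns.
DELETE m
FROM merged AS m
LEFT JOIN knownID AS k ON k.ID = m.ID
WHERE k.ID IS NULL;
```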
If you are deleting more than, say, a thousand rows, do it in chunks. See http://mysql.rjweb.org/doc.php/deletebig
As discussed in that link, it may be faster to build a new table with just the rows you want to keep.
DELETE must keep the old rows until the end of the statement, and only then (in the background) perform the actual removal. This is a lot of overhead.
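The keep-the-good-rows variant could look roughly like this (a sketch; it assumes no other writers during the swap, and that ID is unique in knownID):

```sql
-- Build a table with only the rows to keep, then swap it into place.
CREATE TABLE merged_new LIKE merged;

INSERT INTO merged_new
SELECT m.*
FROM merged AS m
WHERE m.ID IN (SELECT ID FROM knownID);

-- RENAME TABLE swaps both names in one atomic step.
RENAME TABLE merged TO merged_old, merged_new TO merged;
DROP TABLE merged_old;
```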
For further discussion, please provide SHOW CREATE TABLE for both tables.

Related

Postgres check if any new rows were inserted

I have numerous quite large tables (300-400 tables, ~30 million rows each). Every day (once a day) I have to check whether any new rows were inserted into any of these tables. The number of rows inserted may vary from 0 to 30 million. Rows are never deleted.
At the moment, I check whether any new rows were inserted using an approximate count, and then compare it with the previous (yesterday's) result.
SELECT reltuples FROM pg_class WHERE oid='tablename'::regclass;
My main doubt: how soon will reltuples be updated if, for example, 3000 rows (or just 5 rows) are inserted? And is an approximate count a good solution for this case?
My config parameters are:
autovacuum_analyze_threshold: 50
autovacuum_analyze_scale_factor: 0.1
reltuples will be updated whenever VACUUM (or autovacuum) runs, so this number normally has an error margin of up to 20%.
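Concretely, with the settings from the question, autoanalyze (which also refreshes reltuples) only fires once the number of changed rows exceeds analyze_threshold + analyze_scale_factor × reltuples:

```sql
-- For a ~30-million-row table with the question's settings:
--   50 + 0.1 * 30,000,000 = 3,000,050 changed rows
-- so inserting 3000 (or 5) rows will not trigger an analyze by itself.
SELECT 50 + 0.1 * 30000000 AS autoanalyze_trigger_rows;
```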
You'll get a better estimate for the number of rows in the table from the table statistics view:
SELECT n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'myschema' AND relname = 'mytable';
This number is updated by the statistics collector, so it is not guaranteed to be 100% accurate (there is a UDP socket involved), and it may take a little while for the effects of a data modification to be visible there.
Still it is often a more accurate estimate than reltuples.

H2 database performance strangeness --- or how to efficiently `count(*)`

The setup could not be simpler:
H2 version 1.3.176
One table, 10 columns, two of which are a bit lengthy, with typical value lengths of 300 and 3500 characters
Simple query: select count(*) from requestrepository where request_type = 'ADD'
Index is on the queried column.
Queried column is just varchar(20) (i.e. not one of the longer ones)
Queried column contains just two different values, with one appearing 200k times and the other appearing 12 million times.
DB runs off an SSD, current server hardware, current Java 8 (varied a bit but no change in result)
What I do: (0) run analyze, (1) delete one row by a key field, (2) insert one row for the key just deleted, (3) run the query cited above, count to 10 and repeat.
What I see: The query cited above takes between 3 and 5 seconds each time and explain analyze says:
SELECT
COUNT(*)
FROM PUBLIC.REQUESTREPOSITORY
/* PUBLIC.IX_REQUESTS: REQUEST_TYPE = 'ADD' */
/* scanCount: 12098748 */
WHERE REQUEST_TYPE = 'ADD'
/*
REQUESTREPOSITORY.IX_REQUESTS read: 126700
*/
I tried the same DB on different machines, hardware/linux/ssd, VM/Windows/netapp, but the tendency is always the same: the count(*) takes too(?) long.
And this is what I am not sure about. Is it to be expected that this takes long? I would have thought that at least for the second round, caches are filled and this should be much faster, but the explain analyze always lists 126700 reads.
Any hints about H2 parameters or settings by which this may be improved are appreciated.
EDIT (not sure if this should rather go as an answer)
Meanwhile we tried a wide range of things, including mvstore, 1.4.x, parallel threads, and computers with different disks, on Linux and Windows. The situation is always the same. Take a table of 10 or 12 million rows, a varchar column with three status values (something like PROCESSING, ADD, DELETE), an index on the column, and one status grossly overrepresented: then something like count(*) where colname='ADD' takes between one and many seconds after each update of the table.
To prevent this from creating a problem, we finally fixed our own code, which did three count(*) queries, one per status, instead of one with a GROUP BY, and which ran every 5 seconds instead of just on demand. Certainly not the greatest design we had.
The only excuse I have is that I am still surprised that a count(*) takes that long in such a setup. My hunch is that the count must be computed on the index by really counting after an update, whereas I expected that the count could just be read off a data structure somewhere. (No criticism; I myself would certainly not be able to implement a DB.)
Not sure about H2, but have you tried COUNT(request_type) instead of COUNT(*)?
COUNT(*) can take a long time to compute, as the engine must visit every matching row rather than read off a stored total.
Using COUNT() on a single indexed column can speed things up: no table row needs to be read, as the index alone is sufficient to decide whether the column's value is NULL.
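That is, using the names from the question:

```sql
-- Count via the indexed column only; the index entry alone shows
-- whether request_type is NULL, so no table row needs to be read.
SELECT COUNT(request_type)
FROM requestrepository
WHERE request_type = 'ADD';
```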

How to optimize Firebird bulk delete with execute block

I have a Firebird table with 60 million rows and I need to delete about half of the table.
The rows hold GPS positions of cars, a record timestamp, and other data. The table has a primary key on IdVehicle+TimeStamp and one foreign key (to the Vehicle table). There are no other keys, indexes, or triggers. One vehicle has 100,000 - 500,000 records.
I need to delete older data, e.g. delete data older than 1 March 2015 from all vehicles. I tried different ways, and my fastest so far uses EXECUTE BLOCK (via the primary key). First I read the records older than 1.3.2015 for one vehicle, then I go through the individual records, prepare an EXECUTE BLOCK of 50 DELETE statements, and run it against Firebird.
EXECUTE BLOCK AS BEGIN
  DELETE FROM RIDE_POS WHERE IdVehicle = 1547 AND date = '4.5.2015 8:56:47';
  DELETE FROM RIDE_POS WHERE IdVehicle = 1547 AND date = '4.5.2015 8:56:59';
  DELETE FROM RIDE_POS WHERE IdVehicle = 1547 AND date = '4.5.2015 8:57:17';
  -- ... a total of 50 lines
END
This deletes 1 million rows in about 800 seconds (roughly 1 record per millisecond).
Is there another quicker way to delete records?
Additionally, this way I can delete only a few million rows before I have to restart Firebird, otherwise it starts to slow down and jam (on the test server there is no other database/application). Early records are cleared quickly, but it gradually takes longer and longer.
For orientation: how quickly do you routinely delete records from large tables (not truncating the whole table, just deleting part of the records)?
If you want to delete all records older than a given date, no matter the vehicle, then there is no point including IdVehicle in the query; just the date is enough. I.e. the following should do, as a straight query, with no need for EXECUTE BLOCK either:
DELETE FROM RIDE_POS WHERE date < '2015-03-01'
If you have to delete many thousands (or millions of) records, do not do it in one single transaction. You'd better do it in several steps - delete, for example, 1000 records and commit, then delete another 1000 and commit; it should be faster than deleting a million records in one transaction. 1000 is not a rule; it depends on your particular situation (how large your records are, how much linked data they have via foreign keys with "on delete cascade"). Also check whether you have "on delete" triggers, and whether it is possible to temporarily deactivate them.
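One way to sketch that chunking in Firebird 2.0 and later is the ROWS clause on DELETE, committing between batches (the repetition driven from the client):

```sql
-- Delete at most 1000 matching rows, commit, and repeat
-- until a run affects zero rows.
DELETE FROM RIDE_POS WHERE date < '2015-03-01' ROWS 1000;
COMMIT;
```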
Maybe a combined approach would help.
Add (temporarily) index on date:
CREATE INDEX IDX_RIDE_POS_date_ASC ON RIDE_POS (date)
Write an execute block:
EXECUTE BLOCK
AS
DECLARE VARIABLE V_ID_VEHICLE INTEGER;
BEGIN
  FOR SELECT DISTINCT IdVehicle
      FROM RIDE_POS
      INTO :V_ID_VEHICLE
  DO BEGIN
    DELETE FROM RIDE_POS WHERE IdVehicle = :V_ID_VEHICLE AND date < '1.3.2015';
  END
END
Drop index if you don't want to have it anymore.
DROP INDEX IDX_RIDE_POS_date_ASC
I think that even taking into account a time that is needed for creating index, you would still save some time on deleting records.
Finally, I found where the problem was. The main problem was that I was using a classic WinForms application (or IBExpert), and that was causing the jams and slowing the query. I used EXECUTE BLOCK to erase the data in portions, which solved the jams, but it was slow.
The solution was to create a simple console application and run the query from it. I kept the primary key and deleted through it (no adding or dropping indexes), and the deletion speed was about 65 records per millisecond (1 million rows per 16 seconds).
When I tried dropping the primary key and adding an index on the datetime column, deletion sped up only a little, about 5-10%.

Efficient DELETE TOP?

Is it more efficient, and ultimately FASTER, to delete rows from a DB in blocks of 1000 or 10000? I have to remove approx. 3 million rows from many tables. I first did the deletes in blocks of 100K rows, but the performance wasn't looking good. I changed to 10000 and seem to be removing them faster. I'm wondering whether even smaller blocks, like 1K per DELETE statement, would be better still.
Thoughts?
I am deleting like this:
DELETE TOP(10000)
FROM TABLE
WHERE Date < '1/1/2012'
Yes, it is. It all depends on your server, though. Last time I did that, I was using this approach to delete things in 64-million-row increments (on a table that at that point had around 14 billion rows, 80% of which ultimately got deleted). I got a delete through every 10 seconds or so.
It really depends on your hardware. Going more granular is more work, but it means less waiting on tx logs for other things operating on the table. You have to experiment and find where you are comfortable - there is no ultimate answer, because it is totally dependent on the usage of the table and the hardware.
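A common shape for such a chunked delete on SQL Server 2005 and later (table name, date, and batch size taken from the question; the batch size is the knob to experiment with):

```sql
WHILE 1 = 1
BEGIN
    DELETE TOP (10000)
    FROM dbo.[TABLE]
    WHERE [Date] < '20120101';

    IF @@ROWCOUNT = 0 BREAK;  -- nothing left to delete
END
```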
We used Table Partitioning to remove 5 million rows in less than a sec but this was from just one table. It took some work up-front but ultimately was the best way. This may not be the best way for you.
From our document about partitioning:
Let’s say you want to add 5 million rows to a table but don’t want to lock the table up while you do it. I ran into a case in an ordering system where I couldn’t insert the rows without stopping the system from taking orders. BAD! Partitioning is one way of doing it if you are adding rows that don’t overlap current data.
WHAT TO WATCH OUT FOR:
Data CANNOT overlap current data. You have to partition the data on a value. The new data cannot be intertwined within the currently partitioned data. If removing data, you have to remove an entire partition or partitions. You will not have a WHERE clause.
If you are doing this on a production database and want to limit the locking on the table, create your indexes with “ONLINE = ON”.
OVERVIEW OF STEPS:
FOR ADDING RECORDS
Partition the table you want to add records to (leave a blank partition for the new data). Do not forget to partition all of your indexes.
Create new table with the exact same structure (keys, data types, etc.).
Add a constraint to the new table to limit that data so that it would fit into the blank partition in the old table.
Insert new rows into new table.
Add indexes to match old table.
Swap the new table with the blank partition of the old table.
Un-partition the old table if you wish.
FOR DELETING RECORDS
Partition the table into sets so that the data you want to delete is all on partitions by itself (this could be many different partitions).
Create a new table with the same partitions.
Swap the partitions with the data you want to delete to the new table.
Un-partition the old table if you wish.
Yes and no; it depends on the usage of the table, because of locking. I would try to delete the records at a slower pace - the opposite of the OP's question.
set rowcount 10000
delete from table where date < convert(datetime, '20120101', 112)
while @@rowcount > 0
begin
    waitfor delay '0:0:1'
    delete from table where date < convert(datetime, '20120101', 112)
end
set rowcount 0

Copy data from one column to another in oracle table

My current project for a client requires me to work with Oracle databases (11g). Most of my previous database experience is with MSSQL Server, Access, and MySQL. I've recently run into an issue that seems incredibly strange to me and I was hoping someone could provide some clarity.
I was looking to do a statement like the following:
update MYTABLE set COLUMN_A = COLUMN_B;
MYTABLE has about 13 million rows.
The source column is indexed (COLUMN_B), but the destination column is not (COLUMN_A)
The primary key field is a GUID.
This seemed to run for 4 hours but never completed.
I spoke with a former developer who was more familiar with Oracle than I am, and they told me you would normally create a procedure that breaks this down into chunks to be committed (roughly 1000 records or so). This procedure would iterate over the 13 million records, commit 1000 records, then commit the next 1000, and so on, normally breaking the data up based on the primary key.
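That chunked procedure might be sketched as the following PL/SQL block (an illustration only; it assumes COLUMN_A starts out NULL so already-copied rows can be skipped - otherwise you would walk the primary key instead):

```sql
BEGIN
  LOOP
    UPDATE mytable
       SET column_a = column_b
     WHERE column_a IS NULL          -- skip rows already copied
       AND ROWNUM <= 1000;           -- one chunk per iteration
    EXIT WHEN SQL%ROWCOUNT = 0;      -- stop when nothing is left to copy
    COMMIT;                          -- keep undo small by committing chunks
  END LOOP;
  COMMIT;
END;
/
```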
This sounds somewhat silly to me, coming from my experience with other database systems. I'm not joining another table or linking to another database; I'm simply copying data from one column to another. I don't consider 13 million records to be large, considering there are systems out there on the order of billions of records. I can't imagine it takes a computer hours and hours (only to fail) to copy a simple column of data in a table that as a whole takes up less than 1 GB of storage.
In experimenting with alternative ways of accomplishing what I want, I tried the following:
create table MYTABLE_2 as (SELECT COLUMN_B, COLUMN_B as COLUMN_A from MYTABLE);
This took less than 2 minutes to accomplish the exact same end result (minus dropping the first table and renaming the new table).
Why does the UPDATE run for 4 hours and fail (which simply copies one column into another column), but the create table which copies the entire table takes less than 2 minutes?
And are there any best practices or common approaches used to do this sort of change? Thanks for your help!
It does seem strange to me too. However, this comes to mind:
When you update the table, undo (rollback) information must be written for every changed row in case a rollback is needed. When creating a table, that isn't necessary.
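Completing the CREATE TABLE route from the question would then look roughly like this (a sketch; indexes, constraints, grants, and triggers from the old table must be recreated by hand):

```sql
DROP TABLE mytable;
RENAME mytable_2 TO mytable;
-- recreate indexes, constraints, grants, and triggers here
```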
