There is a database with two namespaces (main and archive) located on the same disk.
Task: transfer records created before a certain point in time (the creation date is present in one of the tables) into identical tables in the archive namespace. (Oracle 12c; hundreds of billions of rows in each table.)
Because this will run on production, simply copying only the needed records into a new table is not an option: the data will be updated while the transfer runs (SELECT/INSERT/UPDATE will keep being executed against the same tables in the main namespace).
Currently the best option I have found:
CREATE TABLE MAIN_NAME_SPACE.TEMP AS
  SELECT ID
  FROM MAIN_NAME_SPACE.MAIN_TABLE
  WHERE CREATION_DATE < SYSDATE - 5; -- date column and ~5-day window are illustrative
creates a temp table with the IDs for some period (~5 days) (run as a separate procedure);
PROCEDURE ARCH AS
  TYPE t_id_tab IS TABLE OF MAIN_NAME_SPACE.MAIN_TABLE.ID%TYPE;
  v_ids t_id_tab;
BEGIN
  SELECT ID BULK COLLECT INTO v_ids FROM MAIN_NAME_SPACE.TEMP; -- load the IDs for this period
  FORALL i IN 1 .. v_ids.COUNT
    DELETE FROM ARCH_NAME_SPACE.TABLE_1 T WHERE T.ID = v_ids(i);
  FORALL i IN 1 .. v_ids.COUNT
    INSERT INTO ARCH_NAME_SPACE.TABLE_1
      SELECT * FROM MAIN_NAME_SPACE.TABLE_1 T WHERE T.ID = v_ids(i);
  ...
  FORALL i IN 1 .. v_ids.COUNT
    DELETE FROM ARCH_NAME_SPACE.TABLE_N T WHERE T.ID = v_ids(i);
  FORALL i IN 1 .. v_ids.COUNT
    INSERT INTO ARCH_NAME_SPACE.TABLE_N
      SELECT * FROM MAIN_NAME_SPACE.TABLE_N T WHERE T.ID = v_ids(i);
END ARCH;
we read the IDs into a collection variable and, for each table in the archive, delete the matching records via FORALL (in case they already exist there) and insert the fresh ones;
PROCEDURE DELETE_MAIN AS -- DELETE is a reserved word, so the procedure gets a different name here
  TYPE t_id_tab IS TABLE OF MAIN_NAME_SPACE.MAIN_TABLE.ID%TYPE;
  v_ids t_id_tab;
BEGIN
  SELECT ID BULK COLLECT INTO v_ids FROM MAIN_NAME_SPACE.TEMP;
  FORALL i IN 1 .. v_ids.COUNT
    DELETE FROM MAIN_NAME_SPACE.TABLE_1 T WHERE T.ID = v_ids(i);
  ...
  FORALL i IN 1 .. v_ids.COUNT
    DELETE FROM MAIN_NAME_SPACE.TABLE_N T WHERE T.ID = v_ids(i);
END DELETE_MAIN;
A second procedure deletes the data from the main namespace in the same way (if I leave this piece inside the same procedure, the runtime grows to 2.5 hours for reasons I cannot explain).
But the speed leaves much to be desired: transferring 10 million records to the archive takes 43 minutes, and deleting them from the original namespace takes 1 h 5 min.
Is there any other way to speed up this pleasure? (Earlier, before the upgrade to 12c, all of this ran through a cursor veryyyyyyy slowly and veryyyyyyy rarely even started.)
P.S.: tables are not partitioned.
Also, maybe somebody can explain why DELETE/INSERT operations work faster with INVISIBLE indexes? I don't understand how that works.
Thanks in advance.
I have what feels like a complex data problem that I am trying to solve. I am more a developer than a SQL expert, but our DBA, who wrote the script a few months back, recently moved on, and I have been tasked with resolving this problem in the short term.
I am having some real issues with four tables, two of which are receiving real-time bulk inserts and being read by users while a daily job copies and then deletes the copied records into historic tables. The operations run 24/7 and there is no downtime for the data insertions or archiving.
A script that originally did this process started to fail. It was escalating from Snapshot/Read Committed to Serializable isolation, and because it did a full table scan it blocked all insert operations for 1-2 hours, which was not acceptable.
Question(s)
The pain point for the archiving is that we have to wrap the inserts and deletes together in one transaction. Is there a better way to do the below? Is it better to not have foreign keys and use a trigger or constraint instead? The below scripts either lock the table too much or run for too long (1-4 hours).
Scenario
Four SQL tables
Order (Memory Optimized table)
OrderDetail (Memory Optimized table)
OrderHistory (file/normal table – ColumnStore index)
OrderDetailHistory (file/normal table – ColumnStore index)
The Order and OrderDetail tables receive bulk inserts averaging between 50 and 700 rows per second.
Updates are only performed during the batch insert operation.
During the day the Order table can contain between 3 and 6 million records.
The OrderDetail table can be up to 2-3 times as large as the Order table.
OrderHistory and OrderDetailHistory hold 1-7 days of data, so they vary between 10 and 50 million records at any one point.
OrderDetail has a FK reference to Order on the Id column.
At one point during the day, the data from each table is copied and inserted into its respective 'history' table, which is a non-memory-optimized table.
Attempt 1
The script with the issue does this:
BEGIN TRAN
INSERT INTO OrderHistory
SELECT * FROM [Order] o
WHERE o.CreatedAt <= DATEADD(DAY, -1, GETDATE()) -- "1 day" in the original: rows older than one day
INSERT INTO OrderDetailHistory
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM [Order]
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM OrderDetail
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
COMMIT
The database is run at Snapshot Isolation level with Read Committed.
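For reference, this is roughly the configuration meant (the database name is a placeholder):
ALTER DATABASE OrdersDb SET READ_COMMITTED_SNAPSHOT ON;
ALTER DATABASE OrdersDb SET ALLOW_SNAPSHOT_ISOLATION ON;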
When run, we were originally getting serializable errors, and after some reviewing we realised that the delete operation would escalate to the Serializable isolation level to lock the nonclustered index on the CreatedAt column while it performed the deletes, because it was doing a range scan on that index while still inside the same transaction we used for selecting the data.
Attempt 2
So next I modified the script by creating two memory-optimized user-defined table variables and selecting the data into them first, outside of the transaction. Then, in a separate transaction, we insert into our history tables, and finally delete in another separate transaction. The idea is that if the inserts succeed but the delete fails, then the next time it runs it will not try to insert the same data twice. The downside is that there will be duplication in the history data until it runs again, and it ran for 2 hours before our scheduling tool timed out, so this doesn't seem ideal.
INSERT INTO @OrderLoadingUdfTable
SELECT * FROM [Order] o
WHERE o.CreatedAt <= DATEADD(DAY, -1, GETDATE())
INSERT INTO @OrderDetailLoadingUdfTable
SELECT * FROM OrderDetail od
WHERE od.CreatedAt <= DATEADD(DAY, -1, GETDATE())
BEGIN TRAN
INSERT INTO OrderHistory
SELECT * FROM @OrderLoadingUdfTable
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
AND Id NOT IN (SELECT Id FROM OrderHistory)
INSERT INTO OrderDetailHistory
SELECT * FROM @OrderDetailLoadingUdfTable
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
AND Id NOT IN (SELECT Id FROM OrderDetailHistory)
COMMIT
BEGIN TRAN
DELETE FROM [Order]
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
DELETE FROM OrderDetail
WHERE CreatedAt <= DATEADD(DAY, -1, GETDATE())
COMMIT
I have a table (Table-A) that contains 300 million records. I want to do a data-retention activity based on some criteria, so I want to delete about 200M records from the table.
Concerning performance, I planned to create a new table (Table-B) with the oldest 10M records from Table-A. Then I can select the records from Table-B that match the criteria and delete them in Table-A.
Extracting 10M records from Table-A and loading them into Table-B using SQL*Loader takes ~5 hours.
I already created indexes and I use parallel 32 wherever applicable.
What I wanted to know is:
Is there any better way to extract from Table-A and load into Table-B?
Is there any better approach other than creating a temp table (Table-B)?
DBMS: Oracle 10g, PL/SQL and Shell.
Thanks.
If you want to delete 70% of the records of your table, the best way is to create a new table that contains the remaining 30% of the rows, drop the old table and rename the new table to the name of the old table. One possibility for creating the new table is a create-table-as-select statement (CTAS), but there are also approaches that make the impact on the running system much smaller, e.g. you can use a materialized view to select the remaining data and then convert the materialized view to a table. The details of the approach depend on the requirements.
This reading and writing is much more efficient than deleting the rows of the old table.
If you delete the rows of the old table, it will probably also be necessary to reorganize the table afterwards, which again ends up writing those remaining 30% of the data.
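A minimal sketch of the CTAS variant (table name, retention predicate and parallel degree are placeholders; indexes, constraints, grants and triggers have to be recreated on the new table):
CREATE TABLE table_a_new PARALLEL 8 NOLOGGING AS
  SELECT * FROM table_a WHERE created_date >= SYSDATE - 30;  -- the ~30% to keep
-- after validating the new table:
DROP TABLE table_a;
ALTER TABLE table_a_new RENAME TO table_a;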
Partitioning the table by your criteria may be an option.
Consider a case where the criterion is the month: all January data falls into the Jan partition, all February data falls into the Feb partition, and so on.
Then, when it comes time to drop all the old January data, you just drop the partition.
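Illustrative only (table, column and partition names are made up), a monthly range-partitioned layout where archiving becomes a metadata operation:
CREATE TABLE orders_part (
  id           NUMBER,
  created_date DATE
)
PARTITION BY RANGE (created_date) (
  PARTITION p_2018_01 VALUES LESS THAN (DATE '2018-02-01'),
  PARTITION p_2018_02 VALUES LESS THAN (DATE '2018-03-01')
);
-- dropping all of the old January data:
ALTER TABLE orders_part DROP PARTITION p_2018_01 UPDATE GLOBAL INDEXES;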
Using ROWID is the best option, but an inline cursor can help you even more.
Insert the rows you want to keep into another table (INSERT INTO table_b SELECT * FROM table_a WHERE <criteria to keep>), then truncate table_a.
Is there any better way to extract from Table-A and to load it in Table-B? You can use a parallel CTAS: create Table-B as a select from Table-A. You can use compression and parallel query in one step.
Is there any better approach other than creating a temp table (Table-B)? A better approach would be partitioning of Table-A.
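A rough sketch of the parallel, compressed CTAS mentioned above (degree, column and predicate are placeholders):
CREATE TABLE table_b COMPRESS PARALLEL 16 NOLOGGING AS
  SELECT /*+ PARALLEL(t, 16) */ *
  FROM   table_a t
  WHERE  t.created_date < SYSDATE - 365;  -- your retention criteria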
Probably a better approach would be partitioning of Table-A, but if that is not possible you can try something fast and simple:
declare
i pls_integer :=0 ;
begin
for r in
( -- select what you want to move to second table
SELECT
rowid as rid,
col1,
col2,
col3
FROM
table_a t
WHERE
t.col < SYSDATE - 30 --- or other criteria
)
loop
insert /*+ append */ into table_b values (r.col1, r.col2, r.col3 ); -- insert it into the second table (the APPEND hint is ignored for single-row VALUES inserts)
delete from table_a where rowid = r.rid; -- and delete it
if i < 500 -- check your best commit interval
then
i:=i+1;
else
commit;
i:=0;
end if;
end loop;
commit;
end;
In the above example you move your records in small 500-row transactions. You could optimize it further using collections and bulk inserts, but I wanted to keep the code simple.
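For completeness, a sketch of that bulk variant under the same assumptions (column and table names follow the example above; LIMIT 500 mirrors the commit interval):
declare
  cursor c is
    SELECT t.rowid AS rid, t.col1, t.col2, t.col3
    FROM   table_a t
    WHERE  t.col < SYSDATE - 30;  -- or other criteria

  TYPE t_rids  IS TABLE OF ROWID;
  TYPE t_col1s IS TABLE OF table_a.col1%TYPE;
  TYPE t_col2s IS TABLE OF table_a.col2%TYPE;
  TYPE t_col3s IS TABLE OF table_a.col3%TYPE;

  v_rids  t_rids;
  v_col1s t_col1s;
  v_col2s t_col2s;
  v_col3s t_col3s;
begin
  open c;
  loop
    fetch c bulk collect into v_rids, v_col1s, v_col2s, v_col3s limit 500;
    exit when v_rids.count = 0;

    forall i in 1 .. v_rids.count
      insert into table_b (col1, col2, col3)
      values (v_col1s(i), v_col2s(i), v_col3s(i));

    forall i in 1 .. v_rids.count
      delete from table_a where rowid = v_rids(i);

    commit; -- one commit per 500-row batch (same fetch-across-commit caveat as the loop above)
  end loop;
  close c;
end;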
I was missing an index on a column that I was using in the search criteria.
Apart from this, there were also some indexes missing on the referenced tables.
Apart from this, #miracle173's answer is also good, but we have some foreign keys that might create problems if we used that approach.
+1 to #miracle173
I have two tables with same column structure in the same database: TableA and TableB.
TableA doesn't have any indexes, but TableB has a non-clustered unique index.
TableA has 290 Million rows of data that needs to be copied to TableB.
As they both have same structure, I've tried
INSERT INTO TableB
SELECT *
FROM TableA;
It was executing for hours and produced a huge log file. As a result the disk ran out of space and the query was killed.
I can shrink the log file. How can I copy this many rows of data to another table efficiently?
First of all, disable the index on TableB before inserting the rows. You can do it using T-SQL:
ALTER INDEX IX_Index_Name ON dbo.TableB DISABLE;
Make sure to disable all the constraints (foreign keys, check constraints, unique indexes) on your destination table.
Re-enable (and rebuild) them after the load is complete.
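For example (names follow the snippet above), re-enabling afterwards looks roughly like this:
ALTER INDEX IX_Index_Name ON dbo.TableB REBUILD;
ALTER TABLE dbo.TableB WITH CHECK CHECK CONSTRAINT ALL;  -- re-validate FKs and check constraints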
Now, there's a couple of approaches to solve the problem:
If you're OK with a slight chance of data loss: use the INSERT INTO ... SELECT ... FROM ... syntax you have, but switch your database to the Bulk-logged recovery model first (read up before switching). This won't help if you're already in Bulk-logged or Simple.
With exporting the data first: you can use the BCP utility to export/import the data. It supports loading data in batches. Read more about using the BCP utility here.
Fancy, with exporting the data first: with SQL 2012+ you can try exporting the data into a binary file (using the BCP utility) and loading it with the BULK INSERT statement, setting the ROWS_PER_BATCH option (a rough sketch of the import half is at the end of this answer).
Old-school "I don't give a damn" method: to prevent the log from filling up you will need to perform the
inserts in batches of rows, not everything at once. If your database
is running in Full recovery mode you will need to keep log backups
running, maybe even trying to increase the frequency of the job.
To batch-load your rows you will need a WHILE (don't use them in
day-to-day stuff, just for batch loads), something like the
following will work if you have an identifier in the dbo.TableA
table:
DECLARE @RowsToLoad BIGINT;
DECLARE @RowsPerBatch INT = 5000;
DECLARE @LeftBoundary BIGINT = 0;
DECLARE @RightBoundary BIGINT = @RowsPerBatch;
SELECT @RowsToLoad = MAX(IdentifierColumn) FROM dbo.TableA;
WHILE @LeftBoundary < @RowsToLoad
BEGIN
    INSERT INTO TableB (Column1, Column2)
    SELECT
        tA.Column1,
        tA.Column2
    FROM
        dbo.TableA AS tA
    WHERE
        tA.IdentifierColumn > @LeftBoundary
        AND tA.IdentifierColumn <= @RightBoundary;
    SET @LeftBoundary = @LeftBoundary + @RowsPerBatch;
    SET @RightBoundary = @RightBoundary + @RowsPerBatch;
END
For this to work effectively you really want to consider creating an index on dbo.TableA (IdentifierColumn), just for the time you're running the load.
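For the BULK INSERT route mentioned above, a rough sketch of the import half, assuming the data was first exported with bcp in native format (the file path and batch size are placeholders):
BULK INSERT dbo.TableB
FROM 'C:\export\TableA.dat'
WITH (DATAFILETYPE = 'native', ROWS_PER_BATCH = 50000, TABLOCK);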
I'm trying to archive many records in batches rather than in one shot.
Will T-SQL join the two tables, TeamRoster and @teamIdsToDelete, for every loop in the batch? My concern is that if my temporary table is huge and I don't remove records from it as I go, the JOIN might be unnecessarily expensive. On the other hand, how expensive is it to delete from the temporary table as I go? Is it made up for by the (real? hypothetical?) smaller joins I'll have to do in each batch?
(Can provide more details/thoughts but will do so if helpful.)
DECLARE @teamIdsToDelete TABLE
(
RosterID int PRIMARY KEY
)
--collect the list of active teamIds. we will rely on the modified date to age them out.
INSERT INTO @teamIdsToDelete
SELECT DISTINCT tr.RosterID FROM
rosterload.TeamRoster tr WITH (NOLOCK)
WHERE tr.IsArchive=0 and tr.Loaded=1
-- age out the remaining rosters (no cap; proved we can update more than 50k by modifying the test case)
WHILE (1 = 1)
BEGIN
BEGIN TRANSACTION
UPDATE TOP (1000) r
SET [Status] = 'Delete', IsArchive = 1, ModifiedDate = GETDATE(), ModifiedBy = 'abc'
FROM rosterload.TeamRoster r with(rowlock)
JOIN @teamIdsToDelete ttd ON ttd.RosterID = r.RosterID
WHERE r.[Status] != 'Delete' AND r.IsArchive != 1 AND r.ModifiedBy != 'abc' -- predicate for filtering;
IF @@ROWCOUNT = 0 -- terminating condition;
BEGIN
COMMIT TRANSACTION
BREAK
END
COMMIT TRANSACTION
END
As I understand it, the goal of this query is to archive a huge number of rows without blocking other queries at the same time. The temp table helps you narrow down the subset of records to delete. Since it has one column, which is a clustered primary key, the join to another PK will be blazingly fast. You will spend more effort on calculating and deleting the updated records from the temp table.
Also, there is no reason to use a transaction and do batches; you could just do one big update instead. The result is the same: the table will be locked after the first 5,000 row locks are acquired (roughly after the first five batches are updated) until the COMMIT statement. The WITH (ROWLOCK) hint does not prevent lock escalation. On the other hand, running without the transaction would give other queries the opportunity to continue after each 1000-row batch. If you need to make sure that all records are archived in one go, add some retry logic to your query or application code for errors like deadlocks or process interruption. And do you really need the NOLOCK hint?
There's a multi-step procedure in the Data Warehouse that generates a temp table with the list of jobs that will be processed for each batch. Usually this is about 5,000 jobs. By the end of the financial aggregation we may be looking at about 500,000 records processed. I've noticed that a very small part of it is giving me an "Early Timeout on Optimization" warning for just this part of the stored procedure:
DELETE jfs
FROM DataWarehouse.dbo.JobFinancialSummary jfs -- Financials table (> 3,000,000 records with indices)
INNER JOIN #JobList jl ON jfs.JobID = jl.JobID -- List of Jobs being processed (avg. of 5,000 records)
INNER JOIN FiscalPeriod fp ON fp.ID = jfs.FiscalPeriodID -- Month Reference Table (about 1,000 records)
WHERE fp.[Status] IN (1,2) -- Last 2 months
The most confusing thing is that this is a relatively simple part of the stored procedure, and all of the JOINs are on indexes. My only question is how it gets timed out when the optimizer evaluates it. My understanding is that the optimizer gives each statement its own "budget", but perhaps I'm missing something. Why the timeout here?
There are many reasons for a delete to be very slow.
Reasons:
The table might have CDC or Change Tracking (CT) enabled
The table might have triggers that fire after the delete operation
The table might have FK references or constraints
The table might have indexed views associated with it
Above all, we need to check how transactional this table is, etc.
There are many ways to do the delete; obviously a batch delete is faster. The first and most common way is to mark the records with a soft-delete attribute and delete them offline during a nightly window.
If it is going to be an offline delete, to make that delete faster:
capture the clustered index keys for the table you are deleting from
disable indexes, FKs and constraints, since you will be taking care of all of these functionally
disable CT, CDC etc. on that table if they are not required
script out the indexed views and then drop the indexed views associated with this table
Then delete in batches; you can do this by checking @@ROWCOUNT or setting a TOP batch size. We can delete any number of records faster this way (see the sketch after this answer).
DELETE TOP (50000) -- batch size based on scenario
FROM table1
in a loop.
Call an explicit CHECKPOINT to make sure the records are cleared from the transaction log. Also make sure your recovery model is SIMPLE, not FULL.
If it is going to be an online delete during the daytime, still mark the rows as soft-deleted, and in the nightly job run the deletes in very small batches.
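A minimal sketch of that batched delete (table name, predicate and batch size are placeholders):
WHILE 1 = 1
BEGIN
    DELETE TOP (50000)
    FROM table1
    WHERE CreatedAt < DATEADD(DAY, -90, GETDATE());  -- illustrative retention criteria

    IF @@ROWCOUNT = 0 BREAK;

    CHECKPOINT;  -- under the SIMPLE recovery model this lets log space be reused between batches
END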
It may well be faster to isolate the rows/entries you want to delete first, rather than joining whilst deleting.
Something like this, assuming Id is your primary key/identity column:
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp
SELECT jfs.ID
INTO #tmp
FROM DataWarehouse.dbo.JobFinancialSummary jfs -- Financials table (> 3,000,000 records with indices)
INNER JOIN FiscalPeriod fp ON fp.ID = jfs.FiscalPeriodID -- Month Reference Table (about 1,000 records)
WHERE fp.[Status] IN (1,2)
/* EXISTS IS FASTER THAN A JOIN, AVOIDS FANNING */
AND EXISTS (SELECT 1 FROM #JobList jl where jfs.JobID = jl.JobID) -- List of Jobs being processed (avg. of 5,000 records)
Then issue a delete such as:
DELETE TOP(1000) jfs
FROM DataWarehouse.dbo.JobFinancialSummary jfs
WHERE EXISTS (SELECT 1 FROM #tmp t WHERE jfs.ID=t.ID)
From there, depending on how many rows you are deleting, you may wish to delete in batches overnight; anything over 5,000 rows will escalate to table locks and is a prime candidate for batch deletion.
I wrote a fairly popular answer on how to accomplish large batch deletes here:
Deleting 1 millions rows in SQL Server