Environment: Oracle Database 19c
The table in question has a few number data type columns and one column of CLOB data type. The table is properly indexed and there is a nightly gather stats job as well.
These are the operations on the table:
A PL/SQL batch procedure inserts 4 to 5 million records from a flat file presented as an external table.
After the insert operation, another batch process reads the rows and updates some of the columns.
A daily purge process deletes rows that are no longer needed.
My question is: should gather stats be triggered immediately after the insert and/or delete operations on the table?
Per this Oracle doc Online Statistics Gathering for Bulk Loads, bulk loads only gather online statistics automatically when the object is empty. My process will not benefit from it as the table is not empty when I load data.
However, online statistics gathering does work for INSERT INTO ... SELECT operations on empty segments using direct path, so next I am going to try the APPEND hint. Any thoughts?
Before Oracle 12c, it was best practise to gather statistics immediately after a bulk load. However, according to Oracle's SQL Tuning Guide, many applications failed to do so, therefore they automated this for certain operations.
I would recommend looking at the dictionary views DBA_TAB_STATISTICS, DBA_IND_STATISTICS and DBA_TAB_MODIFICATIONS to see how your table behaves:
CREATE TABLE t AS SELECT * FROM all_objects;
CREATE INDEX i ON t(object_name);
SELECT table_name, num_rows, stale_stats
FROM DBA_TAB_STATISTICS WHERE table_name='T'
UNION ALL
SELECT index_name, num_rows, stale_stats
FROM DBA_IND_STATISTICS WHERE table_name='T';
TABLE_NAME NUM_ROWS STALE_STATS
T 67135 NO
I 67135 NO
If you insert data, the statistics are marked as stale:
INSERT INTO t SELECT * FROM all_objects;
TABLE_NAME NUM_ROWS STALE_STATS
T 67138 YES
I 67138 YES
SELECT inserts, updates, deletes
FROM DBA_TAB_MODIFICATIONS
WHERE table_name='T';
INSERTS UPDATES DELETES
67140 0 0
Likewise for updates and deletes:
UPDATE t SET object_id = - object_id WHERE object_type='TABLE';
4,449 rows updated.
DELETE FROM t WHERE object_type = 'SYNONYM';
23,120 rows deleted.
INSERTS UPDATES DELETES
67140 4449 23120
When you gather statistics, stale_stats becomes 'NO' again, and DBA_TAB_MODIFICATIONS goes back to zero (or an empty row):
EXEC DBMS_STATS.GATHER_TABLE_STATS(NULL, 'T');
TABLE_NAME NUM_ROWS STALE_STATS
T 111158 YES
I 111158 YES
Please note that INSERT /*+ APPEND */ gathers statistics only if the table (or partition) is empty. The restriction is documented here.
So I would recommend that your code, after the inserts, updates and deletes are done, checks whether the table(s) appear in USER_TAB_MODIFICATIONS. If the statistics are stale, I'd gather statistics.
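A minimal sketch of that check, assuming the demo table T from above lives in the current schema. The call to DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO is an addition here, not part of the recommendation above: it writes the in-memory DML counters to the dictionary so that the views reflect the most recent batch, and it may need extra privileges in your environment.

```sql
-- Sketch only: gather statistics on table T if Oracle reports them as stale
-- (or if the table has no statistics at all). 'T' is the demo table name.
BEGIN
  -- Assumption: the session is allowed to flush monitoring info.
  DBMS_STATS.FLUSH_DATABASE_MONITORING_INFO;

  FOR r IN (SELECT table_name
              FROM user_tab_statistics
             WHERE table_name  = 'T'
               AND object_type = 'TABLE'
               AND (stale_stats = 'YES' OR stale_stats IS NULL))
  LOOP
    -- Runs at most once, and only when the statistics are stale or missing
    DBMS_STATS.GATHER_TABLE_STATS(ownname => USER, tabname => r.table_name);
  END LOOP;
END;
/
```

Calling a block like this at the end of each batch step keeps the decision with Oracle's own staleness tracking instead of gathering unconditionally.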
I would also look into partitioning. Check if you can insert, update and gather stats in a fresh new partition, which would be a bit faster. And check if you can purge your data by dropping a whole partition, which would be a lot faster.
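As a rough illustration of the partitioning idea, assuming a hypothetical table load_data with a load_date column (neither name comes from the question) and that the Partitioning option is available in your installation:

```sql
-- Hypothetical table: one partition per day, created automatically on insert.
CREATE TABLE load_data (
  id        NUMBER,
  load_date DATE NOT NULL,
  payload   CLOB
)
PARTITION BY RANGE (load_date)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(PARTITION p_initial VALUES LESS THAN (DATE '2024-01-01'));

-- Purge one day of data by dropping its partition instead of deleting rows.
-- The date is an example value; the partition for that day must exist.
ALTER TABLE load_data DROP PARTITION FOR (DATE '2024-01-15') UPDATE INDEXES;
```

Whether daily partitions fit depends on how the purge criteria line up with the load date, which the question does not say.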
I have two tables with same column structure in the same database: TableA and TableB.
TableA doesn't have any indexes, but TableB has a non-clustered unique index.
TableA has 290 Million rows of data that needs to be copied to TableB.
As they both have same structure, I've tried
INSERT INTO TableB
SELECT *
FROM TableA;
It was executing for hours and produced a huge log file that filled the disk. As a result the disk ran out of space and the query was killed.
I can shrink the log file. How can I copy this many rows of data to another table efficiently?
First of all, disable the index on TableB before inserting the rows. You can do it using T-SQL:
ALTER INDEX IX_Index_Name ON dbo.TableB DISABLE;
Make sure to disable all the constraints (foreign keys, check constraints, unique indexes) on your destination table.
Re-enable (and rebuild) them after the load is complete.
Now, there are a couple of approaches to solve the problem:
If you are OK with a slight chance of data loss: use the INSERT INTO ... SELECT ... FROM ... syntax you have, but switch your database to the Bulk-logged recovery model first (read before switching). This won't help if you're already in Bulk-logged or Simple.
With exporting the data first: you can use the BCP utility to export/import the data. It supports loading data in batches. Read more about using the BCP utility here.
Fancy, with exporting the data first: with SQL Server 2012+ you can try exporting the data into a binary file (using the BCP utility) and loading it with the BULK INSERT statement, setting the ROWS_PER_BATCH option.
Old-school "I don't give a damn" method: to prevent the log from filling up you will need to perform the
inserts in batches of rows, not everything at once. If your database
is running in Full recovery mode you will need to keep log backups
running, maybe even trying to increase the frequency of the job.
To batch-load your rows you will need a WHILE (don't use them in
day-to-day stuff, just for batch loads), something like the
following will work if you have an identifier in the dbo.TableA
table:
DECLARE @RowsToLoad BIGINT;
DECLARE @RowsPerBatch INT = 5000;
DECLARE @LeftBoundary BIGINT = 0;
DECLARE @RightBoundary BIGINT = @RowsPerBatch;
-- Upper bound of the identifier range to copy
SELECT @RowsToLoad = MAX(IdentifierColumn) FROM dbo.TableA;
WHILE @LeftBoundary < @RowsToLoad
BEGIN
    -- Copy one slice of the identifier range per iteration
    INSERT INTO TableB (Column1, Column2)
    SELECT
        tA.Column1,
        tA.Column2
    FROM
        dbo.TableA AS tA
    WHERE
        tA.IdentifierColumn > @LeftBoundary
        AND tA.IdentifierColumn <= @RightBoundary;
    -- Move the window forward by one batch
    SET @LeftBoundary = @LeftBoundary + @RowsPerBatch;
    SET @RightBoundary = @RightBoundary + @RowsPerBatch;
END
For this to work effectively you really want to consider creating an
index on dbo.TableA (IdentifierColumn) just for the time you're
running the load.
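For the first approach in the list (switching the recovery model), a hedged sketch could look like the following. The database name MyDatabase and the backup path are placeholders, and the TABLOCK hint is an assumption added here because minimal logging generally depends on it; whether the insert really is minimally logged depends on the table structure and SQL Server version.

```sql
-- Placeholder names: MyDatabase and the backup path are not from the question.
-- Take a log backup before switching so the log chain stays usable.
BACKUP LOG MyDatabase TO DISK = N'D:\Backup\MyDatabase_before.trn';

ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED;

-- TABLOCK is one of the usual prerequisites for a minimally logged insert.
INSERT INTO dbo.TableB WITH (TABLOCK)
SELECT *
FROM dbo.TableA;

-- Switch back and take another log backup straight away.
ALTER DATABASE MyDatabase SET RECOVERY FULL;
BACKUP LOG MyDatabase TO DISK = N'D:\Backup\MyDatabase_after.trn';
```

Point-in-time restore is not possible within a log backup that contains minimally logged operations, which is the "slight chance of data loss" mentioned in that approach.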
There's a multi-step procedure in the Data Warehouse that generates a temp table with a list of Jobs that will be processed for each batch. Usually this is about 5,000 jobs. By the end of financial aggregation we may be looking at about 500,000 records processed. I've noticed that a very small part of it is giving me an Early Timeout on Optimization for just this part of the stored procedure:
DELETE jfs
FROM DataWarehouse.dbo.JobFinancialSummary jfs -- Financials table (> 3,000,000 records with indices)
INNER JOIN #JobList jl ON jfs.JobID = jl.JobID -- List of Jobs being processed (avg. of 5,000 records)
INNER JOIN FiscalPeriod fp ON fp.ID = jfs.FiscalPeriodID -- Month Reference Table (about 1,000 records)
WHERE fp.[Status] IN (1,2) -- Last 2 months
The most confusing thing is that this is a relatively simple part of the stored procedure and all of the JOINs are on indices. My only question is how it gets timed out when the optimizer evaluates this. My understanding is that the optimizer gives each statement its own "Budget" but perhaps I'm missing something. Why the timeout here?
There are many reasons a delete can be very slow:
The table might have CDC (Change Data Capture) or CT (Change Tracking) enabled
The table might have triggers that fire after the delete
The table might have FK references or constraints
The table might have indexed views associated with it
Above all, we need to check how transactional the table is.
There are many ways to do the delete, and a batched delete is generally faster. The first and best-known approach is to mark the records with a soft-delete attribute and delete them offline at night.
If it is going to be an offline delete, then to make it faster:
Capture the clustered index keys of the rows to delete
Disable indexes, FKs and constraints, since you will be taking care of these functionally
Disable CT, CDC, etc. on that table if they are not required
Script out the indexed views associated with this table, then drop them
Then delete in batches, either by setting ROWCOUNT (SET ROWCOUNT) or by using TOP with a batch size. Any number of records can be deleted faster this way.
DELETE TOP (50000) -- batch size depends on the scenario
FROM table1
in a loop.
Call an explicit CHECKPOINT between batches so that the log space used by the deleted records can be reused. Also make sure your recovery model is Simple, not Full.
If it has to be an online delete during the daytime, still mark the rows as soft-deleted and run the actual deletes in much smaller batches in the nightly job.
It may likely be faster to isolate the rows/entries you want to delete first rather than joining whilst deleting.
Something like this assuming id is your primary key/identity:
IF OBJECT_ID('tempdb..#tmp') IS NOT NULL DROP TABLE #tmp
SELECT jfs.ID
INTO #tmp
FROM DataWarehouse.dbo.JobFinancialSummary jfs -- Financials table (> 3,000,000 records with indices)
INNER JOIN FiscalPeriod fp ON fp.ID = jfs.FiscalPeriodID -- Month Reference Table (about 1,000 records)
WHERE fp.[Status] IN (1,2)
/* EXISTS IS FASTER THAN A JOIN, AVOIDS FANNING */
AND EXISTS (SELECT 1 FROM #JobList jl where jfs.JobID = jl.JobID) -- List of Jobs being processed (avg. of 5,000 records)
Then issue a delete such as:
DELETE TOP(1000) jfs
FROM DataWarehouse.dbo.JobFinancialSummary jfs
WHERE EXISTS (SELECT 1 FROM #tmp t WHERE jfs.ID=t.ID)
From there, depending on how many rows you are deleting, you may wish to delete in batches overnight. A single statement that takes roughly 5,000 or more locks can trigger lock escalation to a table lock, so anything beyond that is a prime candidate for batch deletion.
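The DELETE TOP(1000) statement above removes only one batch, so it has to be repeated until nothing is left. A minimal loop, assuming #tmp has been populated as shown and that ID is the key of JobFinancialSummary:

```sql
-- Repeat the 1,000-row delete until no matching rows remain.
WHILE 1 = 1
BEGIN
    DELETE TOP (1000) jfs
    FROM DataWarehouse.dbo.JobFinancialSummary jfs
    WHERE EXISTS (SELECT 1 FROM #tmp t WHERE jfs.ID = t.ID);

    -- @@ROWCOUNT is read immediately after the DELETE
    IF @@ROWCOUNT = 0
        BREAK;
END
```

The batch size of 1,000 is taken from the statement above; it is a tuning knob, not a fixed rule.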
I wrote a fairly popular answer on how to accomplish large batch deletes here:
Deleting 1 millions rows in SQL Server
I am currently performing analysis on a client's MSSQL Server. I've already fixed many issues (unnecessary indexes, index fragmentation, NEWID() being used for identities all over the shop etc), but I've come across a specific situation that I haven't seen before.
Process 1 imports data into a staging table, then Process 2 copies the data from the staging table using an INSERT INTO. The first process is very quick (it uses BULK INSERT), but the second takes around 30 mins to execute. The "problem" SQL in Process 2 is as follows:
INSERT INTO ProductionTable(field1,field2)
SELECT field1, field2
FROM SourceHeapTable (nolock)
The above INSERT statement inserts hundreds of thousands of records into ProductionTable, each row allocating a UNIQUEIDENTIFIER, and inserting into about 5 different indexes. I appreciate this process is going to take a long time, so my issue is this: while this import is taking place, a 3rd process is responsible for performing constant lookups on ProductionTable - in addition to inserting an additional record into the table as such:
INSERT INTO ProductionTable(fields...)
VALUES(values...)
SELECT *
FROM ProductionTable (nolock)
WHERE ID = @Id
For the 30 or so minutes that the INSERT...SELECT above is taking place, the INSERT INTO times-out.
My immediate thought is that SQL Server is locking the entire table during the INSERT...SELECT. I did quite a lot of profiling on the server during my analysis, and there are definitely locks being allocated for the duration of the INSERT...SELECT, though I fail to remember what type they were.
Having never needed to insert records into a table from two sources at the same time - at least during an ETL process - I'm not sure how to approach this. I've been looking up INSERT table hints, but most are being made obsolete in future versions.
It looks to me like a CURSOR is the only way to go here?
You could consider BULK INSERT for Process-2 to get the data into the ProductionTable.
Another option would be to batch Process-2 into small batches of around 1000 records and use a Table Valued Parameter to do the INSERT. See: http://msdn.microsoft.com/en-us/library/bb510489.aspx#BulkInsert
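A sketch of the Table Valued Parameter option, with hypothetical names (dbo.ProductionRowList, dbo.InsertProductionRows) and assumed column types, since the question does not give the types of field1 and field2:

```sql
-- Hypothetical table type; the column types are assumptions.
CREATE TYPE dbo.ProductionRowList AS TABLE
(
    field1 INT,
    field2 NVARCHAR(100)
);
GO

-- Hypothetical procedure: inserts one small batch passed in by the caller.
CREATE PROCEDURE dbo.InsertProductionRows
    @Rows dbo.ProductionRowList READONLY
AS
BEGIN
    SET NOCOUNT ON;

    INSERT INTO dbo.ProductionTable (field1, field2)
    SELECT field1, field2
    FROM @Rows;
END
GO
```

The calling process would fill the parameter with around 1,000 rows per call, so each insert holds its locks only briefly and Process 3 gets a chance to run in between.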
It looks like a table lock.
Try inserting in portions in the ETL process, something like:
while 1=1
begin
    INSERT INTO ProductionTable(field1,field2)
    SELECT top (1000) field1, field2
    FROM SourceHeapTable sht (nolock)
    where not exists (select 1 from ProductionTable pt where pt.id = sht.id)
    -- optional
    --waitfor delay '00:00:01.0'
    if @@rowcount = 0
        break;
end
I am updating 2 columns in a table that contains 85 million rows. To update these I am using an UPDATE command like:
UPDATE Table1
SET Table1.column1 = Table2.column1 ,
Table1.column2 = Table2.column2
FROM
Tables and with a Join-conditions;
My problem is that it takes 23 hours. Even after using a batch size there is not much change in the time taken.
I need to update it in less than 5 hours. Is that possible? What approach should I take to achieve it?
An UPDATE statement has to record every changed row in the transaction log so that it can roll back on failure. As explained by this guy, the best way to handle millions of rows is to forget about atomicity and batch your updates into chunks of 50,000 rows (or whatever):
--Declare variable for row count
Declare @rc int
Set @rc=50000
While @rc=50000
Begin
Begin Transaction
--Use Top (50000) to limit number of updates
--performed in each batch to 50K rows.
--Use tablockx and holdlock to obtain and hold
--an immediate exclusive table lock. This usually
--speeds the update because only one lock is needed.
Update Top (50000) MyTable With (tablockx, holdlock)
Set UpdFlag = 0
From MyTable mt
Join ControlTable ct
On mt.KeyCol=ct.PK
--Add criteria to avoid updating rows that
--were updated in previous pass
Where mt.UpdFlag <> 0
--Get number of rows updated
--Process will continue until less than 50000
Select @rc=@@rowcount
--Commit the transaction
Commit
End
This still has some problems in that you need to know which rows you've already handled, perhaps someone smarter than this guy (and me!) can figure something nicer with more MSSQL magic; but this should be a start.
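One way around having to flag rows that are already handled is to walk the key range instead. This is a sketch under assumptions that are not in the snippet above: KeyCol is taken to be a positive integer key with an index on it, and @LastKey and @MaxKey are names introduced here.

```sql
-- Assumption: KeyCol is a positive, indexed integer key on MyTable.
DECLARE @BatchSize INT = 50000;
DECLARE @LastKey BIGINT = 0;
DECLARE @MaxKey BIGINT;

SELECT @MaxKey = MAX(KeyCol) FROM MyTable;

WHILE @LastKey < @MaxKey
BEGIN
    -- Each pass touches only keys in (@LastKey, @LastKey + @BatchSize],
    -- so no flag column is needed to remember what was already updated.
    UPDATE mt
    SET mt.UpdFlag = 0
    FROM MyTable AS mt
    JOIN ControlTable AS ct
        ON mt.KeyCol = ct.PK
    WHERE mt.KeyCol > @LastKey
      AND mt.KeyCol <= @LastKey + @BatchSize;

    SET @LastKey = @LastKey + @BatchSize;
END
```

With gaps in the key values some batches will update fewer than 50,000 rows, which costs a few extra iterations but does not affect correctness.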
I used SSIS for this task.
First I took the source table in which the 2 columns have to be updated. Then I added a Lookup task that maps the source columns to the columns of the table that supplies the new values. Finally I added an OLE DB destination, which fills the table based on the join conditions from the lookup.
This process was much faster than executing an update script.
Suppose that I have a table with 10,000,000 records. What is the difference between these two solutions?
Delete all the data in one statement:
DELETE FROM MyTable
Delete all of the data from an application, row by row:
DELETE FROM MyTable WHERE ID = @SelectedID
Does the first solution have the best performance?
What is the impact on the log and on performance?
If you need to restrict to what rows you need to delete and not do a complete delete, or you can't use TRUNCATE TABLE (e.g. the table is referenced by a FK constraint, or included in an indexed view), then you can do the delete in chunks:
DECLARE @RowsDeleted INTEGER
SET @RowsDeleted = 1
WHILE (@RowsDeleted > 0)
BEGIN
    -- delete 10,000 rows a time
    DELETE TOP (10000) FROM MyTable [WHERE .....] -- WHERE is optional
    SET @RowsDeleted = @@ROWCOUNT
END
Generally, TRUNCATE is the best way and I'd use that if possible. But it cannot be used in all scenarios. Also, note that TRUNCATE will reset the IDENTITY value for the table if there is one.
If you are using SQL 2000 or earlier, the TOP condition is not available, so you can use SET ROWCOUNT instead.
DECLARE @RowsDeleted INTEGER
SET @RowsDeleted = 1
SET ROWCOUNT 10000 -- delete 10,000 rows a time
WHILE (@RowsDeleted > 0)
BEGIN
    DELETE FROM MyTable [WHERE .....] -- WHERE is optional
    SET @RowsDeleted = @@ROWCOUNT
END
SET ROWCOUNT 0 -- restore the default so later statements are not limited
If you have that many records in your table and you want to delete them all, you should consider truncate <table> instead of delete from <table>. It will be much faster, but be aware that it cannot activate a trigger.
See for more details (this case sql server 2000):
http://msdn.microsoft.com/en-us/library/aa260621%28SQL.80%29.aspx
Deleting the table contents from the application row by row will take a very long time, because your DBMS cannot optimize anything: it doesn't know in advance that you are going to delete everything.
The first has clearly better performance.
When you specify DELETE [MyTable] it will simply erase everything without doing checks for ID. The second one will waste time and disk operation to locate a respective record each time before deleting it.
It also gets worse because every time a record disappears from the middle of the table, the engine may want to condense data on disk, thus wasting time and work again.
Maybe a better idea would be to delete data based on clustered index columns in descending order. Then the table will basically be truncated from the end at every delete operation.
Option 1 will create a very large transaction and have a big impact on the log / performance, as well as escalating locks so that the table will be unavailable.
Option 2 will be slower, although it will generate less impact on the log (assuming bulk / full mode)
If you want to get rid of all the data, Truncate Table MyTable would be faster than both, although it has no facility to filter rows. It does a metadata change at the back and basically drops the IAM on the floor for the table in question.
TRUNCATE TABLE MyTable would give the best performance for clearing a table. See http://msdn.microsoft.com/en-us/library/ms177570.aspx for a more detailed explanation.
Found this post on Microsoft TechNet.
Basically, it recommends:
By using SELECT INTO, copy the data that you want to KEEP to an intermediate table;
Truncate the source table;
Copy back with INSERT INTO from intermediate table, the data to the source table;
..
BEGIN TRANSACTION
SELECT *
INTO dbo.bigtable_intermediate
FROM dbo.bigtable
WHERE Id % 2 = 0;
TRUNCATE TABLE dbo.bigtable;
SET IDENTITY_INSERT dbo.bigTable ON;
INSERT INTO dbo.bigtable WITH (TABLOCK) (Id, c1, c2, c3)
SELECT Id, c1, c2, c3 FROM dbo.bigtable_intermediate ORDER BY Id;
SET IDENTITY_INSERT dbo.bigtable OFF;
ROLLBACK TRANSACTION
The first will delete all the data from the table and will perform better than your second, which deletes only the data for one specific key at a time.
Now, if you have to delete all the data from the table and you don't rely on being able to roll back, consider using TRUNCATE TABLE.