ORACLE table performance basics - sql-server

Complete newbie to Oracle DBA-ing, and yet trying to migrate a SQL Server DB (2008R2) to Oracle (11g - total DB size only ~20Gb)...
I'm having a major problem with my largest single table (~30 million rows). Rough structure of the table is:
CREATE TABLE TableW (
WID NUMBER(10,0) NOT NULL,
PID NUMBER(10,0) NOT NULL,
CID NUMBER(10,0) NOT NULL,
ColUnInteresting1 NUMBER(3,0) NOT NULL,
ColUnInteresting2 NUMBER(3,0) NOT NULL,
ColUnInteresting3 FLOAT NOT NULL,
ColUnInteresting4 FLOAT NOT NULL,
ColUnInteresting5 VARCHAR2(1024 CHAR),
ColUnInteresting6 NUMBER(3,0) NOT NULL,
ColUnInteresting7 NUMBER(5,0) NOT NULL,
CreatedDate DATE NOT NULL,
ModifiedDate DATE NOT NULL,
CreatedByUser VARCHAR2(20 CHAR),
ModifiedByUser VARCHAR2(20 CHAR)
);
ALTER TABLE TableW ADD CONSTRAINT WPrimaryKey PRIMARY KEY (WID)
ENABLE;
CREATE INDEX WClusterIndex ON TableW (PID);
CREATE INDEX WCIDIndex ON TableW (CID);
ALTER TABLE TableW ADD CONSTRAINT FKTableC FOREIGN KEY (CID)
REFERENCES TableC (CID) ON DELETE CASCADE
ENABLE;
ALTER TABLE TableW ADD CONSTRAINT FKTableP FOREIGN KEY (PID)
REFERENCES TableP (PID) ON DELETE CASCADE
ENABLE;
Running through some basic tests, it seems a simple 'DELETE FROM TableW WHERE PID=13455' is taking a huge amount of time (~880s) to execute what should be a quick delete (~350 rows). [query run via SQL Developer].
Generally, the performance of this table is noticeably worse than its SQL Server equivalent. There are no issues under SQL Server, and the structure of this table and the surrounding ones looks sensible for Oracle by comparison to SQL Server.
My problem is that I cannot find a useful set of diagnostics to start looking for where the problem lies. Any queries / links greatly appreciated.
[The above is a request for help based on the assumption it should not take anything like 10 minutes to delete 350 rows from a table with 30 million records, when it takes SQL Server <1s to do the same for an equivalent DB structure]
EDIT:
The migration is being performed thus:
1 In SQL developer:
- Create Oracle User, tablespace, grants etc AS Sys
- Create the tables, sequences, triggers etc AS New User
2 Via some Java:
- Check SQL-Oracle structure consistency
- Disable all foreign keys
- Move data (Truncate destination table, Select From Old, Insert Into New)
- Adjust sequences to correct starting value
- Enable foreign keys

If you are asking how to improve the performance, there are several ways to do it:
Parallel DML
Partitioning
Parallel DML uses all the resources you have to perform the operation. Oracle runs several threads to complete the operation. Other sessions have to wait for the operation to finish, because system resources are busy.
Partitioning lets you exclude old sections right away. For example, your table stores data from 2000 to 2014. Most likely you don't need the old records, so you can split your table into several partitions and exclude the oldest ones.

Check the wait events for your session that's doing the DELETE. That will tell you what your main bottleneck is.
And echoing Marco's comment above - Make sure your table stats are up to date - that will help the optimizer build a good plan to run those queries for you.

To update all (and in case anyone else finds this):
The correct question to find a solution was: what tables do you have referencing this one?
The problem was another table (let's call it TableV) using WID as a foreign key, but the WID column in TableV was not indexed. This means that for every record deleted from TableW, the whole of TableV had to be searched for associated records to delete. As TableV has >3 million rows, deleting the small set of 350 rows from TableW meant the Oracle server trying to read a total of >1 billion rows (350 x 3 million).
A single index added to WID in TableV, and the delete statement now takes <1s.
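A small runnable sketch of why the missing index hurt. It uses Python's built-in sqlite3 purely as a self-contained stand-in for Oracle, so the planner output is SQLite's, not Oracle's; the VID column and the index name TableV_WID_idx are made up for illustration. It asks the planner how it would find the TableV rows for one WID, before and after the index exists, and repeats the rough arithmetic from the explanation above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableW (WID INTEGER PRIMARY KEY, PID INTEGER NOT NULL);
    -- Child table: WID is a foreign key, but nothing indexes it yet.
    CREATE TABLE TableV (
        VID INTEGER PRIMARY KEY,  -- hypothetical key column
        WID INTEGER NOT NULL REFERENCES TableW (WID) ON DELETE CASCADE
    );
""")


def child_lookup_plan(conn):
    """Describe how the planner finds the TableV rows for a single WID."""
    rows = conn.execute(
        "EXPLAIN QUERY PLAN SELECT VID FROM TableV WHERE WID = ?", (1,)
    ).fetchall()
    # The human-readable detail is the last column of each plan row.
    return " | ".join(row[-1] for row in rows)


plan_without_index = child_lookup_plan(conn)  # a scan of the whole child table
conn.execute("CREATE INDEX TableV_WID_idx ON TableV (WID)")  # hypothetical name
plan_with_index = child_lookup_plan(conn)     # a search through the new index

# Rough cost of the original delete: every deleted parent row scans the child.
deleted_rows = 350
child_rows = 3_000_000
rows_read_without_index = deleted_rows * child_rows  # 1,050,000,000

print(plan_without_index)
print(plan_with_index)
print(f"{rows_read_without_index:,} child rows read without the index")
```

On Oracle itself the equivalent check would be done against the data dictionary (foreign key constraints whose columns have no matching index), which this sketch does not attempt.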
Thanks to all for the comments - a lot of Oracle inner working learnt!

Related

Is it safe to add IDENTITY PK Column to existing SQL SERVER table?

After rebuilding all of the tables in one of my SQL SERVER databases, into a new database, I failed to set the 'ID' column to IDENTITY and PRIMARY KEY for many of the tables. Most of them have data.
I discovered this T-SQL, and have successfully implemented it for a couple of the tables already. The new/replaced ID column contains the same values from the previous column (simply because they were from an auto-incremented column in the table I imported from), and my existing stored procedures all still work.
Alter Table ExistingTable
Add NewID Int Identity(1, 1)
Go
Alter Table ExistingTable Drop Column ID
Go
Exec sp_rename 'ExistingTable.NewID', 'ID', 'Column'
--Then open the table in Design View, and set the new/replaced column as the PRIMARY KEY
--I understand that I could set the PK when I create the new IDENTITY column
The new/replaced ID column is now the last column in the table, and so far, I haven't run into issues with the ASP.Net/C# data access objects that call the stored procedures.
As mentioned, each of these tables had no PRIMARY KEY (nor FOREIGN KEY) set. With that in mind, are there any additional steps I should take to ensure the integrity of the database?
I ran across this SO post, which suggests that I should run the 'ALTER TABLE REBUILD' statement, but since there was no PK already set, do I really need to do this?
Ultimately, I just want to be sure I'm not creating issues that won't appear until later in the game, and be sure the methods I'm implementing are sound, logical, and ensure data integrity.
I suppose it might be a better option to DROP/RECREATE the table with the proper PK/IDENTITY column, and I could write some T-SQL to dump the existing data into a TEMP table, then drop/recreate, and re-populate the new table with data from the TEMP table. I specifically avoided this option as it seems much more aggressive, and I don't fully understand what it means for the Stored Procedures/Functions, etc., that depend on these tables.
Here is an example of one of the tables I've performed this on. You can see the NewID values are identical to the original ID.
Give this a go; it's rummaged up from a script we used a few years ago in a similar situation, can't remember what version of SQLS it was used against.. If it works out for your scenario you can adapt it to your tables..
SELECT MAX(Id)+1 FROM causeCodes -- run and use value below
CREATE TABLE [dbo].[CauseCodesW](
[ID] [int] NOT NULL IDENTITY(put_maxplusone_here,1),
[Code] [varchar](50) NOT NULL,
[Description] [varchar](500) NULL,
[IsActive] [bit] NOT NULL
)
ALTER TABLE CauseCodes SWITCH TO CauseCodesW;
DROP TABLE CauseCodes;
EXEC sp_rename 'CauseCodesW','CauseCodes';
ALTER TABLE CauseCodes ADD CONSTRAINT PK_CauseCodes_Id PRIMARY KEY CLUSTERED (Id);
SELECT * FROM CauseCodes;
You can now find any tables that have FKs to this table and recreate those relationships..

Batching / Splitting a PostgreSQL database

I am working on a project which processes data in batches and fills up a PostgreSQL (9.6, but I could upgrade) database. The way it currently works is that the processing happens in separate steps, and each step adds data to a table that it owns (two processes rarely write to the same table; if they do, they write to different columns).
The way the data happens to be, the data tends to become more and more fine-grained with each step. As a simplified example I have one table defining the data sources. There are very few (in the tens/ low hundreds), but each of these data sources generate batches of data samples (batches and samples are separate tables, to store metadata). Each batch typically generates about 50k samples. Each of these data points then gets processed step-by-step and each data sample generates more data-points in the next table.
This worked fine until we got to 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now filtering for a batch is becoming slow (about 10 ms for each sample we retrieve), and it is becoming a major bottleneck: getting the data for a batch takes 5-10 minutes (the fetching itself takes milliseconds).
We have b-tree indices on all foreign keys that are involved for these queries.
Since our computations target the batches, I do normally not need to query across batches during the computation (this is when the query time hurts a lot at the moment). However for data-analysis reasons ad-hoc queries across batches need to remain possible.
So a very simple solution would be to generate an individual database for each batch, and somehow query across these databases when I need to. If I had only one batch in each database, obviously the filtering for a single batch would be instant and my problem would be solved (for now). However, then I would end up with thousands of databases and the data-analysis would be painful.
Within PostgreSQL, is there a way of pretending that I have separate databases for some queries? Ideally I would like to do that for each batch when I "register" a new batch.
Outside of the world of PostgreSQL, is there another database I should try for my usecase?
Edit: DDL / Schema
In our current implementation, sample_representation is the table that all processing results depend on. A batch is truly defined by a tuple of (batch.id, representation.id). The query I tried and described above as slow is (10ms for each sample, adding up to around 5 min for 50k samples)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
We currently have somewhere around 1.5 million samples, 2 representations, and 460 batches (of which 49 have been processed; the others do not have samples associated with them), which means each batch has 30k samples on average. Some have around 50k.
The schema is below. There is some meta-data associated with all tables, but I am not querying for it in this case. The actual sample-data are stored separately on disk and not in the database, in case that makes a difference.
create table batch
(
id uuid default uuid_generate_v1mc() not null
constraint batch_pk
primary key,
path text not null
constraint unique_batch_path
unique,
id_data_source uuid
)
;
create table sample
(
id uuid default uuid_generate_v1mc() not null
constraint sample_pk
primary key,
sample_pos integer,
id_batch uuid
constraint batch_fk
references batch
on update cascade on delete set null
)
;
create index sample_sample_pos_index
on sample (sample_pos)
;
create index sample_id_batch_sample_pos_index
on sample (id_batch, sample_pos)
;
create table representation
(
id uuid default uuid_generate_v1mc() not null
constraint representation_pk
primary key,
id_data_source uuid
)
;
create table data_source
(
id uuid default uuid_generate_v1mc() not null
constraint data_source_pk
primary key
)
;
alter table batch
add constraint data_source_fk
foreign key (id_data_source) references data_source
on update cascade on delete set null
;
alter table representation
add constraint data_source_fk
foreign key (id_data_source) references data_source
on update cascade on delete set null
;
create table sample_representation
(
id uuid default uuid_generate_v1mc() not null
constraint sample_representation_pk
primary key,
id_sample uuid
constraint sample_fk
references sample
on update cascade on delete set null,
id_representation uuid
constraint representation_fk
references representation
on update cascade on delete set null
)
;
create unique index sample_representation_id_sample_id_representation_uindex
on sample_representation (id_sample, id_representation)
;
create index sample_representation_id_sample_index
on sample_representation (id_sample)
;
create index sample_representation_id_representation_index
on sample_representation (id_representation)
;
After fiddling around, I found a solution. But I am still not sure why the original query really takes that much time:
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
Everything is indexed, but the tables are relatively big with 1.5 million rows in sample_representation and in sample. I guess what happens is that first the tables get joined and then filtered with WHERE. But even if creating a large view as a result of the join, it should not take that long?!
In any case, I tried to use a CTE instead of joining two "massive" tables. The idea was to filter early and then join afterwards:
WITH sel_samplerepresentation AS (
SELECT *
FROM sample_representation
WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS (
SELECT *
FROM sample
WHERE id_batch='75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample
This query also takes forever. Here the reason is clear. sel_samples and sel_samplerepresentation have 50k records each. The join happens on a non-indexed column of the CTEs.
Since there are no indices for CTEs, I reformulated them as materialized views for which I can add indices:
CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
SELECT *
FROM sample_representation
WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
);
CREATE MATERIALIZED VIEW sel_samples AS (
SELECT *
FROM sample
WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
);
CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_index ON sel_samples (id);
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;
DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;
This is more of a hack than a solution, but executing these queries takes 1s! (down from 8min)

Is there a way to update primary key Identity specification Increment 1 without dropping Foreign Keys?

I am trying to change a primary key Id to identity to increment 1 on each entry. But the column has been referenced already by other tables. Is there any way to set primary key to auto increment without dropping the foreign keys from other tables?
If the table isn't that large, generate a script to create an identical table, but change the schema it creates to:
CREATE TABLE MYTABLE_NEW (
PK INT PRIMARY KEY IDENTITY(1,1),
COL1 TYPEx,
COL2 TYPEx,
COLn
...)
1. Set your database to single-user mode or make sure no one is in the database or tables you're changing or change the table you need to change to READ/ONLY.
2. Import your data into MYTABLE_NEW from MYTABLE using SET IDENTITY_INSERT ON
3. Script your foreign key constraints and save them, in case you need to back out of your change later and/or re-implement them.
4. Drop all the constraints from MYTABLE
5. Rename MYTABLE to MYTABLE_SAV
6. Rename MYTABLE_NEW to MYTABLE
7. Run constraint scripts to re-implement constraints on MYTABLE
p.s.
You did ask if there was a way to avoid dropping the foreign key constraints. Here's something to try on your test system. In Step 4, run
ALTER TABLE MYTABLE NOCHECK CONSTRAINT ALL
and in Step 7, run ALTER TABLE MYTABLE CHECK CONSTRAINT ALL. I've not tried this myself; it would be interesting to see if this actually works on renamed tables.
You can script all this ahead of time on a test SQL Server or even a copy of the database staged on a production server--to make implementation day a no-brainer and gauge your SLAs for any change control procedures for your company.
You can follow a similar methodology by deleting the primary key and re-adding it, but you'll need to have the same data inserted in the new column before you delete the old column. So with this approach you'll be deleting and inserting schema as well as inserting primary key data. I like to avoid touching a production table if at all possible, and having MYTABLE_SAV around in case "anything" unexpected occurs is a comfort to me personally, as I can tell management "the production data was not touched". But some tables are simply too large for this approach to be worthwhile and, also, tastes and methodologies differ largely from DBA to DBA.

Converting int primary key to bigint in Sql Server

We have a production table with 770 million rows and change. We want(/need?) to change the Primary ID column from int to bigint to allow for future growth (and to avoid the sudden stop when the 32bit integer space is exhausted)
Experiments in DEV have shown that this is not as simple as altering the column as we would need to drop the index and then re-create it. So far in DEV (which is a bit humbler than PROD) the dropping of the index has not finished after 1 and a half hours. This table is hit 24/7 and having it offline for such a long time is not an option.
Has anyone else had to deal with a similar situation? How did you get it done?
Are there alternatives?
Edit: Additional Info:
The Primary key is clustered.
You could attempt a staged approach.
1. Create a new bigint column
2. Create an insert trigger to keep new entries in sync with the 2 columns
3. Execute an update to populate all the empty values in the bigint column with the converted value
4. Change the primary index on the table from your old id column to the new one
5. Point any FK's and queries to use the new column
6. Change the new column to become your identity column and remove the insert trigger from step 2
7. Delete the old ID column
You should end up spreading the pain out over these 7 steps instead of hitting it all at once.
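The first three steps of the staged approach can be sketched as follows. This is an illustration only: it runs on Python's built-in sqlite3 rather than SQL Server, the OrderHistory table, Payload column, BigID column and trigger name are hypothetical, and updating in batches is an assumption added here to keep each transaction short (the answer only says to run an update).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OrderHistory (ID INTEGER PRIMARY KEY, Payload TEXT)")
conn.executemany(
    "INSERT INTO OrderHistory (ID, Payload) VALUES (?, ?)",
    [(i, f"row {i}") for i in range(1, 1001)],
)

# Step 1: add the new column (NULL for existing rows until backfilled).
conn.execute("ALTER TABLE OrderHistory ADD COLUMN BigID INTEGER")

# Step 2: keep rows inserted from now on in sync with the old column.
conn.execute("""
    CREATE TRIGGER OrderHistory_sync AFTER INSERT ON OrderHistory
    BEGIN
        UPDATE OrderHistory SET BigID = NEW.ID WHERE ID = NEW.ID;
    END
""")
conn.commit()


def backfill(conn, batch_size):
    """Step 3: copy ID into BigID in small batches; return the batches run."""
    batches = 0
    while True:
        cur = conn.execute(
            """UPDATE OrderHistory SET BigID = ID
               WHERE ID IN (SELECT ID FROM OrderHistory
                            WHERE BigID IS NULL LIMIT ?)""",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch
        if cur.rowcount == 0:
            return batches
        batches += 1


# A row arriving mid-migration is filled in by the trigger, not the backfill.
conn.execute("INSERT INTO OrderHistory (ID, Payload) VALUES (1001, 'new row')")
batches_run = backfill(conn, batch_size=100)
out_of_sync = conn.execute(
    "SELECT COUNT(*) FROM OrderHistory WHERE BigID IS NULL OR BigID <> ID"
).fetchone()[0]
print(batches_run, out_of_sync)
```

Steps 4 to 7 (swapping the primary key, repointing foreign keys, moving the identity property) are engine-specific and are not covered by this sketch.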
Create a parallel table with the longer data type for new rows and UNION the results?
What I had to do was copy the data into a new table with the desired structure (primary/clustered key only, non-clustered/FK once complete). If you don't have the room, you could bcp out the data and back in. You may need an application outage to make this happen.
What doesn't work: alter table Orderhistory alter column ID bigint because of the primary key. Don't drop the key and alter column as you will just fill your log file and take much longer than copy/bcp.
Never use the SSMS table designer to change a column property: it copies the table into a temp table and then does a rename once done. Look up the ALTER TABLE ... ALTER COLUMN syntax and use it, and possibly defragment once complete if you widened a column that sits in the middle of the table.

Creating a Primary Key on a temp table - When?

I have a stored procedure that is working with a large amount of data. I have that data being inserted in to a temp table. The overall flow of events is something like
CREATE TABLE #TempTable (
Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
,Col2 INT NOT NULL
,Col3 BIGINT
,Col4 VARCHAR(25) NOT NULL
--Etc...
--
--Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because otherwise the index needs to be reorganized as the data is being inserted. But I realized that my underlying assumption might be wrong...
In case it is relevant, the data types I used are real. In the #TempTable table, Col1 and Col4 will be making up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, this aside, my question still stands as which is faster assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation. Presumably there is a best index for reading the data from it for the next step. This index will need to either be in place or created. This is where you have to make a tradeoff of speed here for reliability (apply the PK and any other constraints first) and speed later (have at least the clustered index in place if you are going to have one).
If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT .. INTO is a bulk operation and bulk operations are minimally logged.
eg:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED key word is optional, primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
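A minimal sketch of the fail-early versus fail-late difference described here. It uses Python's built-in sqlite3 instead of a SQL Server temp table, and simplifies the key to a single column (the question's key is Col1 plus Col4), so treat it as an illustration of the behaviour rather than of SQL Server itself.

```python
import sqlite3

rows = [(1, "a"), (2, "b"), (2, "c"), (3, "d")]  # key value 2 appears twice

# Variant A: key declared up front; the bad row is rejected as it arrives.
conn_a = sqlite3.connect(":memory:")
conn_a.execute(
    "CREATE TABLE TempTable (Col1 INTEGER NOT NULL PRIMARY KEY, Col4 TEXT)"
)
loaded_before_failure = 0
failed_on_insert = False
for row in rows:
    try:
        conn_a.execute("INSERT INTO TempTable VALUES (?, ?)", row)
        loaded_before_failure += 1
    except sqlite3.IntegrityError:
        failed_on_insert = True
        break  # stop at the first duplicate

# Variant B: load everything into a plain table, then add the unique key.
conn_b = sqlite3.connect(":memory:")
conn_b.execute("CREATE TABLE TempTable (Col1 INTEGER NOT NULL, Col4 TEXT)")
conn_b.executemany("INSERT INTO TempTable VALUES (?, ?)", rows)
try:
    conn_b.execute("CREATE UNIQUE INDEX TempTable_pk ON TempTable (Col1)")
    failed_on_index_build = False
except sqlite3.IntegrityError:
    failed_on_index_build = True  # the whole load finished before the error
rows_loaded_b = conn_b.execute("SELECT COUNT(*) FROM TempTable").fetchone()[0]

print(failed_on_insert, loaded_before_failure)
print(failed_on_index_build, rows_loaded_b)
```

In variant A the problem surfaces at the offending row; in variant B all the work of loading is done before the key creation fails, and the duplicates are already sitting in the table.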
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
I was wondering if I could improve a very very "expensive" stored procedure entailing a bunch of checks at each insert across tables and came across this answer. In the Sproc, several temp tables are opened and reference each other. I added the Primary Key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS statements to insert data and ensure uniqueness) and my execution time was cut down SEVERELY. I highly recommend using the primary keys. Always at least try it out even when you think you don't need it.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK on table creation, the total insert check cost is Tn (where Tn is the n-th triangular number, 1 + 2 + 3 + ... + n = n(n+1)/2), because the x-th inserted row is checked only against the rows inserted before it.
When you add the PK after inserting all the values, the check costs n^2, because each of the n rows is checked against all n existing rows.
The first one is cheaper, since Tn is roughly half of n^2. Both grow quadratically, though, so in big-O terms they are the same order and the saving is a constant factor of about 2.
P.S. Example: if you insert 5 rows it is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations
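The arithmetic in this cost model can be checked with a few lines of Python. The sketch takes the answer's simplified model at face value (one comparison per row checked); a real unique index does a lookup rather than comparing against every row, so these are not measurements of SQL Server. The function names are made up for illustration.

```python
def triangular(n):
    """n-th triangular number: 1 + 2 + ... + n."""
    return n * (n + 1) // 2


def checks_pk_first(n):
    """Cost under the answer's model when the PK exists before the inserts."""
    return triangular(n)


def checks_pk_after(n):
    """Cost under the answer's model when the PK is added after the inserts."""
    return n * n


for n in (5, 1_000, 1_000_000):
    first, after = checks_pk_first(n), checks_pk_after(n)
    # The ratio approaches 2 as n grows: a constant factor, not a different order.
    print(n, first, after, round(after / first, 3))
```

For n = 5 this reproduces the 15 versus 25 operations quoted in the example.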
