Batching / Splitting a PostgreSQL database

I am working on a project which processes data in batches and fills up a PostgreSQL database (9.6, but I could upgrade). The way it currently works is that the process happens in separate steps, and each step adds data to a table that it owns (rarely do two processes write to the same table; if they do, they write to different columns).
The data tends to become more and more fine-grained with each step. As a simplified example, I have one table defining the data sources. There are very few of them (in the tens / low hundreds), but each data source generates batches of data samples (batches and samples are separate tables, to store metadata). Each batch typically generates about 50k samples. Each of these samples then gets processed step by step, and each sample generates more data points in the next table.
This worked fine until we got to about 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now filtering for a batch is becoming slow (about 10 ms for each sample we retrieve), and it is becoming a major bottleneck, because the execution time to get the data for a batch is 5-10 minutes (the fetch itself is milliseconds).
We have B-tree indices on all foreign keys that are involved in these queries.
Since our computations target the batches, I normally do not need to query across batches during the computation (this is where the query time hurts a lot at the moment). However, for data-analysis reasons, ad-hoc queries across batches need to remain possible.
So a very simple solution would be to generate an individual database for each batch, and somehow query across these databases when I need to. If I had only one batch in each database, obviously the filtering for a single batch would be instant and my problem would be solved (for now). However, then I would end up with thousands of databases and the data-analysis would be painful.
Within PostgreSQL, is there a way of pretending that I have separate databases for some queries? Ideally I would like to do that for each batch when I "register" a new batch.
Outside of the world of PostgreSQL, is there another database I should try for my use case?
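For reference, one way to approximate the "separate database per batch" idea inside PostgreSQL is declarative list partitioning, which would require the upgrade mentioned above (11+ if the partitioned table should carry a primary key). This is only a hedged sketch using the column names from the schema below; the partitioned table and partition names are made up:

-- Hedged sketch, not a tested recommendation: list-partition the sample data by batch.
-- Requires PostgreSQL 11+ for a primary key on a partitioned table.
CREATE TABLE sample_partitioned (
    id         uuid DEFAULT uuid_generate_v1mc() NOT NULL,
    sample_pos integer,
    id_batch   uuid NOT NULL,
    PRIMARY KEY (id_batch, id)  -- the partition key has to be part of the primary key
) PARTITION BY LIST (id_batch);

-- Create one partition when a new batch is "registered"
-- ('batch-uuid' stands for the real uuid of that batch):
CREATE TABLE sample_batch_x PARTITION OF sample_partitioned
    FOR VALUES IN ('batch-uuid');

-- Filtering on id_batch only touches that one partition,
-- while ad-hoc queries across batches still see a single logical table:
SELECT count(*) FROM sample_partitioned WHERE id_batch = 'batch-uuid';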
Edit: DDL / Schema
In our current implementation, sample_representation is the table that all processing results depend on. A batch is truly defined by a tuple of (batch.id, representation.id). The query I tried, and described above as slow, is the following (10 ms for each sample, adding up to around 5 min for 50k samples):
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
We currently have somewhere around 1.5 million samples, 2 representations, and 460 batches (of which 49 have been processed; the others do not have samples associated with them), which means each processed batch has about 30k samples on average. Some have around 50k.
The schema is below. There is some meta-data associated with all tables, but I am not querying for it in this case. The actual sample-data are stored separately on disk and not in the database, in case that makes a difference.
create table batch
(
    id uuid default uuid_generate_v1mc() not null
        constraint batch_pk
            primary key,
    path text not null
        constraint unique_batch_path
            unique,
    id_data_source uuid
);

create table sample
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_pk
            primary key,
    sample_pos integer,
    id_batch uuid
        constraint batch_fk
            references batch
            on update cascade on delete set null
);

create index sample_sample_pos_index
    on sample (sample_pos);

create index sample_id_batch_sample_pos_index
    on sample (id_batch, sample_pos);

create table representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint representation_pk
            primary key,
    id_data_source uuid
);

create table data_source
(
    id uuid default uuid_generate_v1mc() not null
        constraint data_source_pk
            primary key
);

alter table batch
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null;

alter table representation
    add constraint data_source_fk
        foreign key (id_data_source) references data_source
            on update cascade on delete set null;

create table sample_representation
(
    id uuid default uuid_generate_v1mc() not null
        constraint sample_representation_pk
            primary key,
    id_sample uuid
        constraint sample_fk
            references sample
            on update cascade on delete set null,
    id_representation uuid
        constraint representation_fk
            references representation
            on update cascade on delete set null
);

create unique index sample_representation_id_sample_id_representation_uindex
    on sample_representation (id_sample, id_representation);

create index sample_representation_id_sample_index
    on sample_representation (id_sample);

create index sample_representation_id_representation_index
    on sample_representation (id_representation);

After fiddling around, I found a solution. But I am still not sure why the original query really takes that much time:
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
Everything is indexed, but the tables are relatively big, with 1.5 million rows in sample_representation and in sample. I guess what happens is that the tables first get joined and then filtered with the WHERE clause. But even if the join produces a large intermediate result, it should not take that long?!
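One way to check that guess is to run the exact query under EXPLAIN; with ANALYZE and BUFFERS it reports the chosen plan, the actual row counts, and where the time is spent (hedged sketch, with the same placeholder literals as above):

EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid'
  AND sample.id_batch = 'batch-uuid';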
In any case, I tried to use CTEs instead of joining two "massive" tables. The idea was to filter early and then join afterwards:
WITH sel_samplerepresentation AS (
    SELECT *
    FROM sample_representation
    WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS (
    SELECT *
    FROM sample
    WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample
This query also takes forever. Here the reason is clear. sel_samples and sel_samplerepresentation have 50k records each. The join happens on a non-indexed column of the CTEs.
Since there are no indices for CTEs, I reformulated them as materialized views for which I can add indices:
CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
    SELECT *
    FROM sample_representation
    WHERE id_representation = '1437a5da-e4b1-11e7-a254-7fff1955d16a'
);

CREATE MATERIALIZED VIEW sel_samples AS (
    SELECT *
    FROM sample
    WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
);
CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_index ON sel_samples (id);
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;
DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;
This is more of a hack than a solution, but executing these queries takes 1s! (down from 8min)

Related

How to find unused rows in a dimension table

I have a dimension table in my database that has grown too large. By that I mean it has too many records - over a million - because it grew at the same pace as the linked facts. This is mostly due to a bad design, and I'm trying to clean it up.
One of the things I try to do is to remove dimension records which are no longer used. The fact tables are regularly maintained and old snapshots are removed. Because the dimensions were not maintained like that, there are many rows in the table whose primary key value no longer appears in any of the linked fact tables.
All the fact tables have foreign key constraints.
Is there a way to locate table rows whose primary key value no longer appears in any of the tables which are linked with a foreign key constraint?
I tried writing a script to track this. Basically this:
select key from dimension
where not exists (select 1 from fact1 where fk = pk)
and not exists (select 1 from fact2 where fk = pk)
and not exists (select 1 from fact3 where fk = pk)
But with a lot of linked tables this query dies after some time - at least, my management studio crashed. So I'm not sure if there are any other options.
We had to do something similar to this at one of my clients. The query, like yours with "not exists... and not exists... and not exists...", was taking ~22 hours to run before we changed our strategy, which handles this in ~20 minutes.
As Nsousa suggests, you have to split the query so that SQL Server doesn't have to handle all the data in one shot, unnecessarily spilling into tempdb and so on.
First, create a new table with all the keys in it. The reason to create this table is to avoid a full table scan for every query, to fit more keys on each 8 KB page, and to deal with a smaller and smaller set of keys after each delete.
create table DimensionkeysToDelete (Dimkey char(32) primary key nonclustered);
insert into DimensionkeysToDelete
select key from dimension order by key;
Then, instead of deleting unused keys, delete the keys that exist in the fact tables, beginning with the fact table that has the fewest rows.
Make sure the fact tables have proper indexing for performance.
delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact1 f on f.fk = d.Dimkey;

delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact2 f on f.fk = d.Dimkey;

delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact3 f on f.fk = d.Dimkey;
Once all the fact tables are done, only unused keys remain in DimensionkeysToDelete. To answer your question, just perform a select on this table to get all the unused keys for that particular dimension, or join it with the dimension to get the data.
But, from what I understand of your need to clean up your warehouse, use this table to delete from the original dimension table. At this step you might also want to take some action for auditing purposes (i.e. insert into an audit table 'Key ' + key + ' deleted on ' + convert(varchar, getdate(), 121) + ' by script X'...).
I think this can be optimized further - take a look at the execution plan - but my client was happy with it, so we didn't have to put much effort into it.
You may want to split that into different queries. Check unused rows in fact1, then in fact2, etc., individually. Then intersect all those results to get the rows that are unused in all fact tables.
I would also suggest a left outer join instead of nested queries, counting rows in the fact table for each pk and filtering out of the result set those that have a non-zero count; a sketch follows below.
Your query will struggle as written because it scans every fact table at the same time.
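A hedged sketch of that approach, using the placeholder names from the question (dimension, key, fact1.fk, ...); one block per fact table, intersected at the end and repeated for fact3, etc.:

-- Keys with no rows in fact1 ...
SELECT d.[key]
FROM dimension d
LEFT JOIN fact1 f ON f.fk = d.[key]
GROUP BY d.[key]
HAVING COUNT(f.fk) = 0

INTERSECT

-- ... intersected with keys that have no rows in fact2
SELECT d.[key]
FROM dimension d
LEFT JOIN fact2 f ON f.fk = d.[key]
GROUP BY d.[key]
HAVING COUNT(f.fk) = 0;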

How to avoid table scan on SQL Server in this situation

There are two tables, Costs and Logs. The data in the Costs table can run to millions of rows, and in the Logs table to billions of rows.
I need to update the CostBy column in the Costs table from a service task in the production environment, 100 records per run.
CREATE TABLE Costs
(
    C_PK uniqueidentifier primary key not null,
    C_CostBy varchar(3) not null
)

CREATE TABLE Logs
(
    L_PK uniqueidentifier primary key not null,
    L_ParentTable varchar(255) not null, -- 'Costs' and other tables' names
    L_ParentID uniqueidentifier not null, -- Costs' PK and other tables' PKs
    L_Event varchar(3) not null, -- 'ADD' and other event types
    L_User varchar(3) not null
)

CREATE NONCLUSTERED INDEX [L_ParentID]
ON [dbo].[Logs] ([L_ParentID] ASC)
Here is the original update statement:
UPDATE TOP(100) Costs
SET CostBy = ISNULL(L_User, '~UK')
FROM Costs
LEFT JOIN Logs ON L_ParentID = C_PK AND L_Event = 'ADD'
WHERE CostBy = ''
However, the statement introduces a massive performance issue: the high cost of a table scan on the Costs table.
My question is how to avoid the table scan in Costs table or how to optimize the update statement?
Thanks in advance.
You may want to try the following.
First, create an index on Logs, including all the relevant columns:
CREATE INDEX ix ON Logs
(
    L_ParentID -- join condition, variable
)
INCLUDE
(
    L_User -- not a filter condition, but you use it in your update
)
WHERE
(
    L_Event = 'ADD' -- join condition, constant
)
If only a single row will ever exist with the ADD event for a given parent ID, make sure to create this as a unique index, as that can dramatically improve performance.
Second, and this is a hit-and-miss situation, you may try an index on Costs (CostBy), because you're only looking for empty CostBy values to update. This index will need to be maintained by your query, because the query updates that very column, so it may slow your query down instead of speeding it up. It depends on a number of factors.
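If you do experiment with that, a filtered index limited to the rows still waiting to be updated is one hedged option (the index name is made up, and the column names follow the update statement in the question rather than the DDL):

-- Hedged sketch: only index the rows the update is still looking for.
CREATE NONCLUSTERED INDEX IX_Costs_CostBy_Empty
ON Costs (CostBy)
WHERE CostBy = '';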
If you have an Enterprise license, try both indexes with WITH (DATA_COMPRESSION = PAGE); it can significantly improve I/O time at the expense of CPU. It depends on which one is your bottleneck.
Additionally, depending on the nature of your data, updating statistics may improve your queries. If there is a disproportionate number of rows with CostBy = '' compared to the other values in there, you may benefit from full statistics on that column. Consider NORECOMPUTE if you only need them for this specific query, this one time.
CREATE STATISTICS st_Costs_CostBy
ON Costs (CostBy)
WITH FULLSCAN, NORECOMPUTE;

SQL Server execution plan is suggesting to create an index containing all the columns in the table

I've got a key table with 2 columns: Key, Id.
In a stored procedure I've written, my code joins the Employee table to the Key column, then selects the Id - something like this:
SELECT
E.EmployeeName, K.Id
FROM
Employee E
JOIN
KeyTable K ON E.Key = K.Key
The execution plan is suggesting to create the following index:
[schema].[Employee] ([Key]) INCLUDE ([Id])
My question is why? If all the information is in the table to begin with why create an index and duplicate that information?
Just because all of the information is "in the table", that doesn't mean that searching the entire table is going to be the most efficient way of obtaining the results for this query.
Here, the server is saying that, if it had a way to quickly locate rows in this table, given a Key value, that the query should be able to be processed more quickly (not that it's 100% reliable in its suggestions, so you should test before implementing).
This can be true if the table is a heap (no clustered index) or for a clustered table where the clustering key(s) don't match the desired access order for the query.
Also, if you think about it, every (non-clustered) index duplicates information. It's just that usually it's a subset of the information rather than the whole set.
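For reference, taking the plan's suggestion at face value (and assuming the Employee table really does have Key and Id columns to index), the suggested index would be created roughly like this; the index name is made up:

CREATE NONCLUSTERED INDEX IX_Employee_Key
ON [schema].[Employee] ([Key])
INCLUDE ([Id]);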

ORACLE table performance basics

Complete newbie to Oracle DBA-ing, and yet trying to migrate a SQL Server DB (2008R2) to Oracle (11g - total DB size only ~20Gb)...
I'm having a major problem with my largest single table (~30 million rows). Rough structure of the table is:
CREATE TABLE TableW (
    WID NUMBER(10,0) NOT NULL,
    PID NUMBER(10,0) NOT NULL,
    CID NUMBER(10,0) NOT NULL,
    ColUnInteresting1 NUMBER(3,0) NOT NULL,
    ColUnInteresting2 NUMBER(3,0) NOT NULL,
    ColUnInteresting3 FLOAT NOT NULL,
    ColUnInteresting4 FLOAT NOT NULL,
    ColUnInteresting5 VARCHAR2(1024 CHAR),
    ColUnInteresting6 NUMBER(3,0) NOT NULL,
    ColUnInteresting7 NUMBER(5,0) NOT NULL,
    CreatedDate DATE NOT NULL,
    ModifiedDate DATE NOT NULL,
    CreatedByUser VARCHAR2(20 CHAR),
    ModifiedByUser VARCHAR2(20 CHAR)
);
ALTER TABLE TableW ADD CONSTRAINT WPrimaryKey PRIMARY KEY (WID)
ENABLE;
CREATE INDEX WClusterIndex ON TableW (PID);
CREATE INDEX WCIDIndex ON TableW (CID);
ALTER TABLE TableW ADD CONSTRAINT FKTableC FOREIGN KEY (CID)
REFERENCES TableC (CID) ON DELETE CASCADE
ENABLE;
ALTER TABLE TableW ADD CONSTRAINT FKTableP FOREIGN KEY (PID)
REFERENCES TableP (PID) ON DELETE CASCADE
ENABLE;
Running through some basic tests, it seems a simple 'DELETE FROM TableW WHERE PID=13455' takes a huge amount of time (~880 s) to execute what should be a quick delete (~350 rows). [Query run via SQL Developer.]
Generally, the performance of this table is noticeably worse than its SQL Server equivalent. There are no issues under SQL Server, and the structure of this table and the surrounding ones looks sensible for Oracle by comparison to the SQL Server version.
My problem is that I cannot find a useful set of diagnostics to start looking for where the problem lies. Any queries / links greatly appreciated.
[The above is a request for help based on the assumption it should not take anything like 10 minutes to delete 350 rows from a table with 30 million records, when it takes SQL Server <1s to do the same for an equivalent DB structure]
EDIT:
The migration is being performed thus:
1. In SQL Developer:
- Create Oracle User, tablespace, grants etc AS Sys
- Create the tables, sequences, triggers etc AS New User
2. Via some Java:
- Check SQL-Oracle structure consistency
- Disable all foreign keys
- Move data (Truncate destination table, Select From Old, Insert Into New)
- Adjust sequences to correct starting value
- Enable foreign keys
If you are asking how to improve the performance, then there are several ways to do it:
Parallel DML
Partitioning.
Parallel DML consumes all the resources you have to perform the operation. Oracle runs several threads to complete the operation, and other sessions have to wait for it to finish, because the system resources are busy.
Partitioning lets you exclude old sections right away. For example, if your table stores data from 2000 to 2014 and you most likely don't need the old records, you can split your table into several partitions and exclude (or drop) the oldest ones; see the sketch below.
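A hedged sketch of that date-based partitioning idea, with illustrative names only (note that partitioning is an extra-cost Enterprise Edition option in Oracle 11g):

CREATE TABLE TableW_part (
    WID NUMBER(10,0) NOT NULL,
    CreatedDate DATE NOT NULL
    -- ... remaining columns as in TableW ...
)
PARTITION BY RANGE (CreatedDate) (
    PARTITION p_old     VALUES LESS THAN (DATE '2014-01-01'),
    PARTITION p_current VALUES LESS THAN (MAXVALUE)
);

-- Dropping an old partition removes its rows without row-by-row deletes:
ALTER TABLE TableW_part DROP PARTITION p_old;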
Check the wait events for your session that's doing the DELETE. That will tell you what your main bottleneck is.
And echoing Marco's comment above - Make sure your table stats are up to date - that will help the optimizer build a good plan to run those queries for you.
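A hedged way to look at those wait events while the DELETE is running from another session (the bind value is a placeholder for the SID of the session executing the DELETE):

SELECT sid, event, state, seconds_in_wait
FROM   v$session_wait
WHERE  sid = :deleting_session_sid;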
To update all (and in case anyone else finds this):
The correct question to ask in order to find a solution was: which tables reference this one?
The problem was another table (let's call it TableV) using WID as a foreign key, where the WID column in TableV was not indexed. This means that for every record deleted from TableW, the whole of TableV had to be searched for associated records to delete. As TableV has >3 million rows, deleting the small set of 350 rows from TableW meant the Oracle server had to read a total of >1 billion rows.
A single index added to WID in TableV, and the delete statement now takes <1s.
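For clarity, the fix boils down to a single statement along these lines (the index name is made up to match the existing naming convention):

CREATE INDEX VWIDIndex ON TableV (WID);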
Thanks to all for the comments - a lot of Oracle inner working learnt!

Creating a Primary Key on a temp table - When?

I have a stored procedure that works with a large amount of data. I have that data being inserted into a temp table. The overall flow of events is something like:
CREATE TABLE #TempTable (
    Col1 NUMERIC(18,0) NOT NULL --This will not be an identity column.
    ,Col2 INT NOT NULL
    ,Col3 BIGINT
    ,Col4 VARCHAR(25) NOT NULL
    --Etc...
    --
    --Create primary key here?
)
INSERT INTO #TempTable
SELECT ...
FROM MyTable
WHERE ...
INSERT INTO #TempTable
SELECT ...
FROM MyTable2
WHERE ...
--
-- ...or create primary key here?
My question is: when is the best time to create a primary key on my #TempTable table? I theorized that I should create the primary key constraint/index after I insert all the data, because the index needs to be reorganized as the primary key info is being created. But then I realized that my underlying assumption might be wrong...
In case it is relevant, the data types I used are the real ones. In the #TempTable table, Col1 and Col4 will make up my primary key.
Update: In my case, I'm duplicating the primary key of the source tables. I know that the fields that will make up my primary key will always be unique. I have no concern about a failed alter table if I add the primary key at the end.
Though, that aside, my question still stands: which is faster, assuming both would succeed?
This depends a lot.
If you make the primary key index clustered after the load, the entire table will be re-written as the clustered index isn't really an index, it is the logical order of the data. Your execution plan on the inserts is going to depend on the indexes in place when the plan is determined, and if the clustered index is in place, it will sort prior to the insert. You will typically see this in the execution plan.
If you make the primary key a simple constraint, it will be a regular (non-clustered) index and the table will simply be populated in whatever order the optimizer determines and the index updated.
I think the overall quickest performance (of this process to load temp table) is usually to write the data as a heap and then apply the (non-clustered) index.
However, as others have noted, the creation of the index could fail. Also, the temp table does not exist in isolation. Presumably there is a best index for reading the data from it in the next step, and that index will need to either be in place or created. This is where you have to trade off speed now for reliability (apply the PK and any other constraints first) against speed later (have at least the clustered index in place if you are going to have one).
If the recovery model of your database is set to simple or bulk-logged, SELECT ... INTO ... UNION ALL may be the fastest solution. SELECT .. INTO is a bulk operation and bulk operations are minimally logged.
eg:
-- first, create the table
SELECT ...
INTO #TempTable
FROM MyTable
WHERE ...
UNION ALL
SELECT ...
FROM MyTable2
WHERE ...
-- now, add a non-clustered primary key:
-- this will *not* recreate the table in the background
-- it will only create a separate index
-- the table will remain stored as a heap
ALTER TABLE #TempTable ADD PRIMARY KEY NONCLUSTERED (NonNullableKeyField)
-- alternatively:
-- this *will* recreate the table in the background
-- and reorder the rows according to the primary key
-- CLUSTERED key word is optional, primary keys are clustered by default
ALTER TABLE #TempTable ADD PRIMARY KEY CLUSTERED (NonNullableKeyField)
Otherwise, Cade Roux had good advice re: before or after.
You may as well create the primary key before the inserts - if the primary key is on an identity column then the inserts will be done sequentially anyway and there will be no difference.
Even more important than performance considerations, if you are not ABSOLUTELY, 100% sure that you will have unique values being inserted into the table, create the primary key first. Otherwise the primary key will fail to be created.
This prevents you from inserting duplicate/bad data.
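A hedged illustration of that advice, using the columns the question says will form the key (Col1 and Col4); everything else about the table is unchanged:

CREATE TABLE #TempTable (
    Col1 NUMERIC(18,0) NOT NULL
    ,Col2 INT NOT NULL
    ,Col3 BIGINT
    ,Col4 VARCHAR(25) NOT NULL
    ,PRIMARY KEY (Col1, Col4) -- declared up front, so duplicates fail at insert time
)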
If you add the primary key when creating the table, the first insert will be free (no checks required.) The second insert just has to see if it's different from the first. The third insert has to check two rows, and so on. The checks will be index lookups, because there's a unique constraint in place.
If you add the primary key after all the inserts, every row has to be matched against every other row. So my guess is that adding a primary key early on is cheaper.
But maybe Sql Server has a really smart way of checking uniqueness. So if you want to be sure, measure it!
I was wondering if I could improve a very, very "expensive" stored procedure entailing a bunch of checks at each insert across tables, and came across this answer. In the sproc, several temp tables are opened and reference each other. I added the primary key to the CREATE TABLE statement (even though my selects use WHERE NOT EXISTS to insert data and ensure uniqueness) and my execution time was cut down SEVERELY. I highly recommend using the primary keys. Always at least try it out, even when you think you don't need it.
I don't think it makes any significant difference in your case:
either you pay the penalty a little bit at a time, with each single insert
or you'll pay a larger penalty after all the inserts are done, but only once
When you create it up front before the inserts start, you could potentially catch PK violations as the data is being inserted, if the PK value isn't system-created.
But other than that - no big difference, really.
Marc
I wasn't planning to answer this, since I'm not 100% confident on my knowledge of this. But since it doesn't look like you are getting much response ...
My understanding is a PK is a unique index and when you insert each record, your index is updated and optimized. So ... if you add the data first, then create the index, the index is only optimized once.
So, if you are confident your data is clean (without duplicate PK data) then I'd say insert, then add the PK.
But if your data may have duplicate PK data, I'd say create the PK first, so it will bomb out ASAP.
When you add the PK at table creation, the total insert-check work is the n-th triangular number, T_n = 1 + 2 + 3 + ... + n = n(n+1)/2, because the x-th row you insert is checked against the x - 1 previously inserted rows.
When you add the PK after inserting all the values, the check is n^2 operations, because every row is checked against all n existing rows.
The first is faster, since T_n is only about half of n^2.
P.S. Example: if you insert 5 rows it is 1 + 2 + 3 + 4 + 5 = 15 operations vs 5^2 = 25 operations
