I have a dimension table in my database that has grown too large. By that I mean it has too many records - over a million - because it grew at the same pace as the linked facts. This is mostly due to a bad design, and I'm trying to clean it up.
One of the things I'm trying to do is remove dimension records which are no longer used. The fact tables are regularly maintained and old snapshots are removed. Because the dimensions were not maintained like that, there are many rows in the table whose primary key value no longer appears in any of the linked fact tables.
All the fact tables have foreign key constraints.
Is there a way to locate table rows whose primary key value no longer appears in any of the tables which are linked with a foreign key constraint?
I tried writing a script to track this. Basically this:
select d.key from dimension d
where not exists (select 1 from fact1 f1 where f1.fk = d.key)
and not exists (select 1 from fact2 f2 where f2.fk = d.key)
and not exists (select 1 from fact3 f3 where f3.fk = d.key)
But with a lot of linked tables this query dies after some time - at least, my Management Studio crashed - so I'm not sure if there are any other options.
We had to do something similar to this at one of my clients. A query like yours, with "not exists... and not exists... and not exists...", was taking ~22 hours to run before we changed our strategy and got it down to ~20 minutes.
As Nsousa suggests, you have to split the work so SQL Server doesn't have to handle all the data in one shot, unnecessarily spilling to tempdb and so on.
First, create a new table with all the dimension keys in it. The point of this table is to avoid a full scan of the wide dimension table for every query, to fit more keys on each 8 KB page, and to work with a smaller and smaller set of keys after each delete.
create table DimensionkeysToDelete (Dimkey char(32) primary key nonclustered);
insert into DimensionkeysToDelete
select key from dimension order by key;
Then, instead of looking for unused keys, delete the keys that do exist in the fact tables, beginning with the fact table that has the fewest rows.
Make sure the fact tables have proper indexing on the foreign key columns for performance.
delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact1 f on f.fk = d.Dimkey;

delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact2 f on f.fk = d.Dimkey;

delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact3 f on f.fk = d.Dimkey;
Once all the fact tables are done, only unused keys remain in DimensionkeysToDelete. To answer your question, just select from this table to get all the unused keys for that particular dimension, or join it back to the dimension to get the full rows.
But, from what I understand of your need to clean up your warehouse, use this table to delete from the original dimension table. At this step you might also want to take some action for auditing purposes (i.e. insert into an audit table 'Key ' + key + ' deleted on ' + convert(varchar, getdate(), 121) + ' by script X'...).
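A rough sketch of that final step, assuming an audit table already exists (dbo.DimensionCleanupAudit and the dimension column names here are illustrative):

begin tran;

-- audit what is about to be removed (audit table is an assumption)
insert into dbo.DimensionCleanupAudit (Message, DeletedOn)
select 'Key ' + d.Dimkey + ' deleted by cleanup script', getdate()
from DimensionkeysToDelete d;

-- remove the now-unused rows from the dimension
delete dim
from dimension dim
inner join DimensionkeysToDelete d on d.Dimkey = dim.[key];

commit tran;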
I think this can be optimized further - take a look at the execution plan - but my client was happy with it, so we didn't put much more effort into it.
You may want to split that into different queries: check for unused rows in fact1, then in fact2, etc., individually, and then intersect all those results to get the rows that are unused in all fact tables.
I would also suggest a left outer join instead of nested queries, counting rows in the fact table for each pk and filtering out of the result set those that have a non-zero count - see the sketch below.
Your query will struggle as it’ll scan every fact table at the same time.
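For example, the left-join version of the check for a single fact table could look something like this (names follow the question and are placeholders; run one such query per fact table and intersect the results):

select d.[key]
from dimension d
left outer join fact1 f on f.fk = d.[key]
group by d.[key]
having count(f.fk) = 0;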
Just playing around in SQL Server to get better with query writing. I'm using the Northwind sample database from Microsoft.
I want to delete 'Robert King', EmployeeID = 7.
So normally I would do:
DELETE FROM Employees
WHERE EmployeeID = 7
but it's linked to another table and throws
The DELETE statement conflicted with the REFERENCE constraint "FK_Orders_Employees". The conflict occurred in database "Northwind", table "dbo.Orders", column 'EmployeeID'
So I have to delete the rows from the Orders table first, but then I get another error because the OrderIDs are linked to yet another table, [Order Details].
How can I delete everything at once?
I have a query that shows me everything for EmployeeID = 7, but how can I delete it all in one go?
Query to show all data for EmployeeID = 7:
SELECT
Employees.EmployeeID,
Orders.OrderID,
Employees.FirstName,
Employees.LastName
FROM
Employees
INNER JOIN
Orders on Employees.EmployeeID = Orders.EmployeeID
INNER JOIN
[Order Details] on orders.OrderID = [Order Details].orderID
WHERE
Employees.EmployeeID = 7
Can you change the design of the database?
If you have access to change it, the best way is to set the ON DELETE CASCADE action for the delete operation on the foreign keys that reference the Employees table.
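For example, something like this (a sketch only - verify the actual constraint names in your copy of Northwind; FK_Orders_Employees comes from your error message, the [Order Details] constraint name is assumed):

-- let deleting an employee cascade to their orders
ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Employees;
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Employees
    FOREIGN KEY (EmployeeID) REFERENCES dbo.Employees (EmployeeID)
    ON DELETE CASCADE;

-- and let deleting an order cascade to its order lines
ALTER TABLE dbo.[Order Details] DROP CONSTRAINT FK_Order_Details_Orders;
ALTER TABLE dbo.[Order Details] ADD CONSTRAINT FK_Order_Details_Orders
    FOREIGN KEY (OrderID) REFERENCES dbo.Orders (OrderID)
    ON DELETE CASCADE;

After that, DELETE FROM Employees WHERE EmployeeID = 7 removes the employee together with the dependent Orders and [Order Details] rows.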
Don't do Physical Deletes of important data on a Source Of Truth RDBMS
If this is an OLTP system, then what you are suggesting - deleting Order rows linked to an employee - looks dangerous, as it could break the data integrity of your system.
An OrderDetails row is likely also foreign keyed to a parent Orders table. Deleting an OrderDetails row will likely corrupt your Order processing data (since the Order table Totals will no longer match the cumulative line item rows).
By deleting what appears to be important transactional data, you may be destroying important business records, which could have dire consequences for both yourself and your company.
If the employee has left the company, physical deletion of data is NOT the answer. Instead, you should reconsider the table design, possibly by using a soft-delete pattern on the Employee (and potentially associated data, but not important transactional data like the Orders fulfilled by an employee). This way data integrity and the audit trail are preserved.
For important business data like Orders, if the order itself was placed in error, a compensating mechanism or a status indication on the order (e.g. a Cancelled status) should be used in preference to physical data deletion.
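A minimal sketch of the soft-delete idea on the Northwind Employees table (the IsDeleted column is illustrative - it does not exist in Northwind):

-- add the flag once
ALTER TABLE dbo.Employees ADD IsDeleted bit NOT NULL DEFAULT 0;

-- "delete" the employee without touching any transactional data
UPDATE dbo.Employees SET IsDeleted = 1 WHERE EmployeeID = 7;

-- queries that should only see active employees filter on the flag
SELECT EmployeeID, FirstName, LastName FROM dbo.Employees WHERE IsDeleted = 0;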
Cascading Deletes on non-critical data
In general, if DELETING data in cascading fashion is a deliberate, designed-for use case for the tables, the original design could include ON DELETE CASCADE definitions on the applicable foreign keys. To repeat the concerns others have mentioned, this decision should be taken at design time of the tables, not arbitrarily taken once the database is in Production.
If ON DELETE CASCADE is not defined, and your team is in agreement that a cascading delete is warranted, then an alternative is to run a script (or better, create a stored procedure) which simulates the cascading delete. This can be somewhat tedious, but provided all dependency tables have foreign keys ultimately dependent on your Employee row (@EmployeeId), the script is of the form below (note that you should define a transaction boundary around the deletions to ensure an all-or-nothing outcome):
BEGIN TRAN
-- Delete all Nth level nested dependencies via foreign keys
DELETE FROM [TableNth-Dependency]
WHERE ForeignKeyNId IN
(
SELECT PrimaryKey
FROM [TableNth-1 Dependency]
WHERE ForeignKeyN-1 IN
(
SELECT PrimaryKey
FROM [TableNth-2 Dependency]
WHERE ForeignKeyN-2 IN
(
... Innermost query is the first level foreign key
WHERE
ForeignKey = @PrimaryKey
)
)
);
-- Repeat the delete for all intermediate levels. Each level becomes one level simpler
-- Finally delete the root level object by its primary key
DELETE FROM dbo.SomeUnimportantTable
WHERE PrimaryKey = @PrimaryKey;
COMMIT TRAN
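Applied to the Northwind example from the question (and only if you are really sure physical deletion is what you want), that pattern boils down to something like:

BEGIN TRAN;

-- second-level dependency: line items of the employee's orders
DELETE FROM [Order Details]
WHERE OrderID IN (SELECT OrderID FROM Orders WHERE EmployeeID = 7);

-- first-level dependency: the employee's orders
DELETE FROM Orders
WHERE EmployeeID = 7;

-- finally the root row itself
DELETE FROM Employees
WHERE EmployeeID = 7;

COMMIT TRAN;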
I am working on a project which processes data in batches and fills a PostgreSQL (9.6, but I could upgrade) database. The way it currently works is that the process happens in separate steps and each step adds data to a table that it owns (rarely do two processes write to the same table; when they do, they write to different columns).
The data tends to become more and more fine-grained with each step. As a simplified example, I have one table defining the data sources. There are very few of those (in the tens / low hundreds), but each data source generates batches of data samples (batches and samples are separate tables, to store metadata). Each batch typically generates about 50k samples. Each of these samples then gets processed step by step, and each sample generates more data points in the next table.
This worked fine until we got to about 1.5 million rows in the sample table (which is not a lot of data from our point of view). Now filtering for a batch starts becoming slow (about 10 ms for each sample we retrieve), and it is becoming a major bottleneck, because getting the data for a batch takes 5-10 minutes (the fetch itself is milliseconds).
We have b-tree indices on all foreign keys involved in these queries.
Since our computations target the batches, I normally do not need to query across batches during the computation (which is where the query time hurts the most at the moment). However, for data-analysis reasons, ad-hoc queries across batches need to remain possible.
So a very simple solution would be to generate an individual database for each batch and somehow query across these databases when I need to. If I had only one batch in each database, filtering for a single batch would obviously be instant and my problem would be solved (for now). However, I would then end up with thousands of databases, and the data analysis would be painful.
Within PostgreSQL, is there a way of pretending that I have separate databases for some queries? Ideally I would like to do that for each batch when I "register" a new batch.
Outside of the world of PostgreSQL, is there another database I should try for my usecase?
Edit: DDL / Schema
In our current implementation, sample_representation is the table that all processing results depend on. A batch is truly defined by a tuple of (batch.id, representation.id). The query I tried and described above as slow is this (10 ms per sample, adding up to around 5 min for 50k samples):
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
We currently have somewhere around 1.5 million samples, 2 representations, and 460 batches (of which 49 have been processed; the others do not have samples associated with them yet), which means each batch has 30k samples on average. Some have around 50k.
The schema is below. There is some meta-data associated with all tables, but I am not querying for it in this case. The actual sample-data are stored separately on disk and not in the database, in case that makes a difference.
create table batch
(
id uuid default uuid_generate_v1mc() not null
constraint batch_pk
primary key,
path text not null
constraint unique_batch_path
unique,
id_data_source uuid
)
;
create table sample
(
id uuid default uuid_generate_v1mc() not null
constraint sample_pk
primary key,
sample_pos integer,
id_batch uuid
constraint batch_fk
references batch
on update cascade on delete set null
)
;
create index sample_sample_pos_index
on sample (sample_pos)
;
create index sample_id_batch_sample_pos_index
on sample (id_batch, sample_pos)
;
create table representation
(
id uuid default uuid_generate_v1mc() not null
constraint representation_pk
primary key,
id_data_source uuid
)
;
create table data_source
(
id uuid default uuid_generate_v1mc() not null
constraint data_source_pk
primary key
)
;
alter table batch
add constraint data_source_fk
foreign key (id_data_source) references data_source
on update cascade on delete set null
;
alter table representation
add constraint data_source_fk
foreign key (id_data_source) references data_source
on update cascade on delete set null
;
create table sample_representation
(
id uuid default uuid_generate_v1mc() not null
constraint sample_representation_pk
primary key,
id_sample uuid
constraint sample_fk
references sample
on update cascade on delete set null,
id_representation uuid
constraint representation_fk
references representation
on update cascade on delete set null
)
;
create unique index sample_representation_id_sample_id_representation_uindex
on sample_representation (id_sample, id_representation)
;
create index sample_representation_id_sample_index
on sample_representation (id_sample)
;
create index sample_representation_id_representation_index
on sample_representation (id_representation)
;
After fiddling around, I found a solution. But I am still not sure why the original query really takes that much time:
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid'
Everything is indexed, but the tables are relatively big, with 1.5 million rows each in sample_representation and sample. My guess is that the tables get joined first and only then filtered with the WHERE clause. But even if the join produces a large intermediate result, it should not take that long?!
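One way to check that guess, rather than speculate, would be to look at the actual plan for the query above, for example:

EXPLAIN (ANALYZE, BUFFERS)
SELECT sample_representation.id, sample.sample_pos
FROM sample_representation
JOIN sample ON sample.id = sample_representation.id_sample
WHERE sample_representation.id_representation = 'representation-uuid' AND sample.id_batch = 'batch-uuid';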
In any case, I tried to use a CTE instead of joining two "massive" tables. The idea was to filter early and then join afterwards:
WITH sel_samplerepresentation AS (
SELECT *
FROM sample_representation
WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
), sel_samples AS (
SELECT *
FROM sample
WHERE id_batch='75c04b9c-e4b9-11e7-a93f-132baa27ac91'
)
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample
This query also takes forever. Here the reason is clear. sel_samples and sel_samplerepresentation have 50k records each. The join happens on a non-indexed column of the CTEs.
Since there are no indices for CTEs, I reformulated them as materialized views for which I can add indices:
CREATE MATERIALIZED VIEW sel_samplerepresentation AS (
SELECT *
FROM sample_representation
WHERE id_representation='1437a5da-e4b1-11e7-a254-7fff1955d16a'
);
CREATE MATERIALIZED VIEW sel_samples AS (
SELECT *
FROM sample
WHERE id_batch = '75c04b9c-e4b9-11e7-a93f-132baa27ac91'
);
CREATE INDEX sel_samplerepresentation_sample_id_index ON sel_samplerepresentation (id_sample);
CREATE INDEX sel_samples_id_index ON sel_samples (id);
SELECT sel_samples.sample_pos, sel_samplerepresentation.id
FROM sel_samplerepresentation
JOIN sel_samples ON sel_samples.id = sel_samplerepresentation.id_sample;
DROP MATERIALIZED VIEW sel_samplerepresentation;
DROP MATERIALIZED VIEW sel_samples;
This is more of a hack than a solution, but executing these queries takes 1s! (down from 8min)
When I add new rows using an INSERT INTO ... SELECT statement, the new rows get added seemingly at random in between the already existing rows, instead of being added at the end of the table.
I'm using: Insert into Table1 (Name1) select Name from Table2.
SQL tables are modeled after unordered sets, and hence you should not assume that there is any order to your data in the table. The only order which exists is what you specify when you query using ORDER BY, e.g.
SELECT Name1
FROM Table1
ORDER BY Name1
An index can also be thought of as a way of ordering your records, but those are mostly distinct entities from your actual table.
I agree with Tim's answer. But if you still want the data inserted in a particular order, you can add the primary key yourself and make it incremental (like 1, 2, 3... or 10, 20, 30...).
Although I don't recommend it, I think the following can help if you don't want to handle the primary key yourself:
How do I add a auto_increment primary key in SQL Server database?
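A sketch of that idea, assuming Table1 does not already have a primary key (the Id column and PK_Table1 constraint names are illustrative):

-- IDENTITY generates an increasing value for every inserted row
ALTER TABLE Table1 ADD Id int IDENTITY(1,1) CONSTRAINT PK_Table1 PRIMARY KEY;

-- later inserts get ever-larger Id values, which you can then order by explicitly
INSERT INTO Table1 (Name1) SELECT Name FROM Table2;
SELECT Name1 FROM Table1 ORDER BY Id;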
I've got a key table with 2 columns: Key, Id.
In a stored procedure I've written, my code joins the Employee table to the key table on the Key column, then selects the Id - something like this:
SELECT
E.EmployeeName, K.Id
FROM
Employee E
JOIN
KeyTable K ON E.Key = K.Key
The execution plan is suggesting to create the following index:
[schema].[Employee] ([Key]) INCLUDE ([Id])
My question is: why? If all the information is in the table to begin with, why create an index and duplicate that information?
Just because all of the information is "in the table", that doesn't mean that searching the entire table is going to be the most efficient way of obtaining the results for this query.
Here, the server is saying that if it had a way to quickly locate rows in this table given a Key value, the query could be processed more quickly (it's not 100% reliable in its suggestions, though, so you should test before implementing).
This can be true if the table is a heap (no clustered index) or for a clustered table where the clustering key(s) don't match the desired access order for the query.
Also, if you think about it - every (non-clustered) index duplicates information. It's just that usually it's a subset of the information rather than the whole set.
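For reference, the suggested index written out as a full CREATE INDEX statement would be something like this (the index name is illustrative):

CREATE NONCLUSTERED INDEX IX_Employee_Key
ON [schema].[Employee] ([Key])
INCLUDE ([Id]);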
I have a table which depends on several other ones.
When I delete an entry in this table I should also delete entries in its "masters" (it's a 1-1 relation). But here is the problem: when I delete, I get unnecessary table scans, because SQL Server checks the references before deleting. I am sure that it's safe (because I get the ids from the OUTPUT clause):
DELETE TOP (@BatchSize) [doc].[Document]
OUTPUT DELETED.A, DELETED.B, DELETED.C, DELETED.D
INTO #DocumentParts
WHERE Id IN (SELECT d.Id FROM #DocumentIds d);
SET @r = @@ROWCOUNT;
DELETE [doc].[A]
WHERE Id IN (SELECT DISTINCT dp.A FROM #DocumentParts dp);
DELETE [doc].[B]
WHERE Id IN (SELECT DISTINCT dp.B FROM #DocumentParts dp);
DELETE [doc].[C]
WHERE Id IN (SELECT DISTINCT dp.C FROM #DocumentParts dp);
... several others
But here is what plan I get for each delete:
If I drop the constraints from the Document table, the plan changes:
But the problem is that I cannot drop the constraints, because inserts run in parallel in other sessions. I also cannot lock the whole table, because it's very large and that lock would block a lot of other transactions.
The only way I have found so far is to create an index on every foreign key column (which can be used instead of a PK scan), but I wanted to avoid this check altogether (indexed or not), because I am SURE that documents with such ids no longer exist, since I have just deleted them. Maybe there is some hint for SQL Server, or some way to disable the reference check for one transaction instead of for the whole database.
SQL Server is rather stubborn about preserving referential integrity, so no, you cannot "hint" it into skipping the check. The fact that you deleted the referencing rows doesn't matter at all (in a highly transactional environment, there was plenty of time for some process to modify the tables between the deletes).
Creating the proper indexes is the way to go.
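For example, one nonclustered index per referencing column on [doc].[Document] (index names are illustrative) lets the engine verify each constraint with a seek instead of a scan:

CREATE NONCLUSTERED INDEX IX_Document_A ON [doc].[Document] (A);
CREATE NONCLUSTERED INDEX IX_Document_B ON [doc].[Document] (B);
CREATE NONCLUSTERED INDEX IX_Document_C ON [doc].[Document] (C);
CREATE NONCLUSTERED INDEX IX_Document_D ON [doc].[Document] (D);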