How to speed up DELETE from a table with multiple references - sql-server

I have a table which depends on several other ones.
When I delete an entry in this table I should also delete the entries in its "masters" (it's a 1-1 relation). But here is the problem: when I delete, I get unnecessary table scans, because SQL Server checks the references before deleting. I am sure that it's safe (because I get the ids from the OUTPUT clause):
DELETE TOP (@BatchSize) [doc].[Document]
OUTPUT DELETED.A, DELETED.B, DELETED.C, DELETED.D
INTO #DocumentParts
WHERE Id IN (SELECT d.Id FROM #DocumentIds d);
SET @r = @@ROWCOUNT;
DELETE [doc].[A]
WHERE Id IN (SELECT DISTINCT dp.A FROM #DocumentParts dp);
DELETE [doc].[B]
WHERE Id IN (SELECT DISTINCT dp.B FROM #DocumentParts dp);
DELETE [doc].[C]
WHERE Id IN (SELECT DISTINCT dp.C FROM #DocumentParts dp);
... several others
But here is the plan I get for each delete:
If I drop the constraints from the Document table, the plan changes:
But the problem is that I cannot drop the constraints, because inserts are performed in parallel in other sessions. I also cannot lock the whole table, because it's very large and that lock would also block a lot of other transactions.
The only way I have found so far is to create an index on every foreign key (which can be used instead of a PK scan), but I wanted to avoid this scan altogether (indexed or not), because I am SURE that documents with such ids don't exist, because I have just deleted them. Maybe there is some hint for SQL Server, or some way to disable a reference check for one transaction instead of the whole database.

SQL Server is rather stubborn about preserving referential integrity, so no, you cannot "hint" away the check. The fact that you deleted the referencing rows doesn't matter at all (in a highly transactional environment, there is plenty of time for some other process to modify the tables between the deletes).
Creating the proper indexes is the way to go.
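As a sketch, using the column names from the question (adjust to your actual schema; the index names are illustrative), the indexes that let the constraint checks seek instead of scan would look like this:
-- Index each referencing column so the DELETE on the parent tables can seek rather than scan
CREATE NONCLUSTERED INDEX IX_Document_A ON [doc].[Document] (A);
CREATE NONCLUSTERED INDEX IX_Document_B ON [doc].[Document] (B);
CREATE NONCLUSTERED INDEX IX_Document_C ON [doc].[Document] (C);
CREATE NONCLUSTERED INDEX IX_Document_D ON [doc].[Document] (D);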

Related

SQL Server : delete everything from the database for a specific user

Just playing around in SQL Server to get better with query writing. I'm using the Northwind sample database from Microsoft.
I want to delete 'Robert King', EmployeeID = 7.
So normally I would do:
DELETE FROM Employees
WHERE EmployeeID = 7
but it's linked to another table and throws
The DELETE statement conflicted with the REFERENCE constraint "FK_Orders_Employees". The conflict occurred in database "Northwind", table "dbo.Orders", column 'EmployeeID'
So I have to delete the rows from the Orders table first, but then I also get an error because the order IDs are linked to yet another table, [Order Details].
How can I delete everything at once?
I have a query what shows me everything for EmployeeID = 7, but how can I delete it in one go?
Query to show all data for EmployeeID = 7:
SELECT
Employees.EmployeeID,
Orders.OrderID,
Employees.FirstName,
Employees.LastName
FROM
Employees
INNER JOIN
Orders ON Employees.EmployeeID = Orders.EmployeeID
INNER JOIN
[Order Details] ON Orders.OrderID = [Order Details].OrderID
WHERE
Employees.EmployeeID = 7
Can you change the design of the database?
If you have access to change it, the best way is to set the "cascade" option for the delete operation on the Employee table's foreign key.
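For illustration only (and see the caution in the next answer), enabling cascading deletes on the existing Northwind constraint would look roughly like this; the constraint name is taken from the error message above:
-- Drop and recreate the FK from Orders to Employees with ON DELETE CASCADE
ALTER TABLE dbo.Orders DROP CONSTRAINT FK_Orders_Employees;
ALTER TABLE dbo.Orders ADD CONSTRAINT FK_Orders_Employees
    FOREIGN KEY (EmployeeID) REFERENCES dbo.Employees (EmployeeID)
    ON DELETE CASCADE;
-- The [Order Details] -> Orders foreign key would need the same treatment for the full chain.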
Don't do Physical Deletes of important data on a Source Of Truth RDBMS
If this is an OLTP system, then what you are suggesting, i.e. deleting the Orders rows linked to an employee, looks dangerous, as it could break the data integrity of your system.
An OrderDetails row is likely also foreign keyed to a parent Orders table. Deleting an OrderDetails row will likely corrupt your Order processing data (since the Order table Totals will no longer match the cumulative line item rows).
By deleting what appears to be important transactional data, you may be destroying important business records, which could have dire consequences for both yourself and your company.
If the employee has left the service of the company, physical deletion of data is NOT the answer. Instead, you should reconsider the table design, possibly by using a soft-delete pattern on the Employee (and potentially on associated data, but not on important transactional data like Orders fulfilled by an employee). This way data integrity and the audit trail will be preserved.
For important business data like Orders, if the order itself was placed in error, a compensating mechanism or a status indication on the order (e.g. a Cancelled status) should be used in preference to physical data deletion.
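As a minimal sketch of that soft-delete idea on the Northwind Employees table (the IsDeleted column name is an assumption, not part of Northwind):
-- Add a flag instead of physically deleting employee rows
ALTER TABLE dbo.Employees ADD IsDeleted BIT NOT NULL DEFAULT 0;
-- "Delete" the employee without destroying order history
UPDATE dbo.Employees SET IsDeleted = 1 WHERE EmployeeID = 7;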
Cascading Deletes on non-critical data
In general, if DELETING data in cascading fashion is a deliberate, designed-for use case for the tables, the original design could include ON DELETE CASCADE definitions on the applicable foreign keys. To repeat the concerns others have mentioned, this decision should be taken at design time of the tables, not arbitrarily taken once the database is in Production.
If the ON DELETE CASCADE actions are not defined, and your team is in agreement that a cascading delete is warranted, then an alternative is to run a script (or better, create a stored procedure) which simulates the cascading delete. This can be somewhat tedious, but provided you cover all dependent tables whose foreign keys are ultimately dependent on your Employee row (@EmployeeId), the script is of the form below (note that you should define a transaction boundary around the deletions to ensure an all-or-nothing outcome):
BEGIN TRAN
-- Delete all Nth level nested dependencies via foreign keys
DELETE FROM [TableNth-Dependency]
WHERE ForeignKeyNId IN
(
SELECT PrimaryKey
FROM [TableNth-1 Dependency]
WHERE ForeignKeyN-1 IN
(
SELECT PrimaryKey
FROM [TableNth-2 Dependency]
WHERE ForeignKeyN-2 IN
(
... Innermost query is the first level foreign key
WHERE
ForeignKey = @PrimaryKey;
)
)
);
-- Repeat the delete for all intermediate levels. Each level becomes one level simpler
-- Finally delete the root-level object by its primary key
DELETE FROM dbo.SomeUnimportantTable
WHERE PrimaryKey = @PrimaryKey;
COMMIT TRAN
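Applied to the Northwind example from the question, the same pattern (innermost dependency first, inside one transaction) would look like this; again, think hard before actually running it:
BEGIN TRAN;
-- First the grandchild rows
DELETE FROM [Order Details]
WHERE OrderID IN (SELECT OrderID FROM Orders WHERE EmployeeID = 7);
-- Then the child rows
DELETE FROM Orders
WHERE EmployeeID = 7;
-- Finally the employee itself
DELETE FROM Employees
WHERE EmployeeID = 7;
COMMIT TRAN;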

How to find unused rows in a dimension table

I have a dimension table in my database that has grown too large. By that I mean it has too many records - over a million - because it grew at the same pace as the linked facts. This is mostly due to a bad design, and I'm trying to clean it up.
One of the things I am trying to do is to remove dimension records which are no longer used. The fact tables are regularly maintained and old snapshots are removed. Because the dimensions were not maintained like that, there are many rows in the table whose primary key value no longer appears in any of the linked fact tables.
All the fact tables have foreign key constraints.
Is there a way to locate table rows whose primary key value no longer appears in any of the tables which are linked with a foreign key constraint?
I tried writing a script to track this. Basically this:
select key from dimension
where not exists (select 1 from fact1 where fk = pk)
and not exists (select 1 from fact2 where fk = pk)
and not exists (select 1 from fact3 where fk = pk)
But with a lot of linked tables this query dies after some time - at least, my Management Studio crashed. So I'm not sure whether there are any other options.
We had to do something similar to this at one of my clients. The query, like yours with "not exists ... and not exists ... and not exists ...", was taking ~22 hours to run before we changed our strategy, which handles this in ~20 minutes.
As Nsousa suggests, you have to split the query so SQL Server doesn't have to handle all the data in one shot, unnecessarily spilling into tempdb and so on.
First, create a new table with all the keys in it. The reason to create this table is to avoid a full scan of the dimension for every query, to fit more keys on an 8 KB page, and to work with a smaller and smaller set of keys after each delete.
create table DimensionkeysToDelete (Dimkey char(32) primary key nonclustered);
insert into DimensionkeysToDelete
select key from dimension order by key;
Then, instead of looking for unused keys, delete the keys that exist in the fact tables, beginning with the fact table that has the fewest rows.
Make sure the fact tables have proper indexing for performance.
delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact1 f on f.fk = d.Dimkey;
delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact2 f on f.fk = d.Dimkey;
delete from DimensionkeysToDelete
from DimensionkeysToDelete d
inner join fact3 f on f.fk = d.Dimkey;
Once all fact tables are done, only unused keys remain in DimensionkeysToDelete. To answer your question, just perform a select on this table to get all unused keys for that particular dimension, or join it with the dimension to get the data.
But, from what I understand of your need to clean up your warehouse, use this table to delete from the original dimension table. At this step, you might also want to take some action for auditing purposes (e.g. insert into an audit table 'Key ' + key + ' deleted on ' + convert(varchar(23), getdate(), 121) + ' by script X' ...), as sketched below.
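A sketch of that final step, assuming the dimension's key column is the one called key from your query and that an audit table already exists (the audit table name and column are placeholders):
-- optional audit trail of what is about to be removed
insert into dbo.DimensionCleanupAudit (AuditText)
select 'Key ' + d.Dimkey + ' deleted on ' + convert(varchar(23), getdate(), 121) + ' by script X'
from DimensionkeysToDelete d;
-- remove the now-unused rows from the original dimension
delete dim
from dimension dim
inner join DimensionkeysToDelete d on dim.[key] = d.Dimkey;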
I think this can be optimized further (take a look at the execution plan), but my client was happy with it, so we didn't have to put much more effort into it.
You may want to split that into different queries. Check for unused rows in fact1, then in fact2, etc., individually. Then intersect all those results to get the rows that are unused in all fact tables.
I would also suggest a left outer join instead of nested queries, counting rows in the fact table for each pk and filtering out of the result set those that have a non-zero count (see the sketch below).
Your query will struggle, as it'll scan every fact table at the same time.
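A sketch of that approach for one fact table, using the column names from the pseudo-query above; repeat per fact table and intersect the results:
-- dimension keys with zero matching rows in fact1
select d.pk
from dimension d
left outer join fact1 f on f.fk = d.pk
group by d.pk
having count(f.fk) = 0;
-- then: query1 INTERSECT query2 INTERSECT query3 ... gives keys unused everywhere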

Triggers: Tracking ID Updates

For a trigger that is tracking UPDATEs to a table, two pseudo-tables may be referenced: deleted and inserted. Is there a way to cross-reference the two without using an INNER JOIN on their primary key?
I am trying to maintain referential integrity without foreign keys (don't ask), so I'm using triggers. I want UPDATEs to the primary key in table A to be reflected in the "foreign key" of look-up table B, and for this to happen when an UPDATE affects multiple records in table A.
All UPDATE trigger examples that I've seen hinge on joining the inserted and deleted tables to track changes; and they use the updated table's ID field (primary key) to set the join. But if that ID field (GUID) is the changed field in a record (or set of records), is there a good way to track those changes, so that I can enforce those changes in the corresponding look-up table?
I've just had this issue (or rather, a similar one) myself, hence the resurrection...
My eventual approach was to simply disallow updates to the PK field precisely because it would break the trigger. Thankfully, I had no business case to support updating the primary key column (these were surrogate IDs, anyway), so I could get away with it.
SQL Server offers the UPDATE() function, for use within triggers, to check for this edge case:
CREATE TRIGGER your_trigger
ON your_table
INSTEAD OF UPDATE
AS BEGIN
IF UPDATE(pk1) BEGIN
ROLLBACK
DECLARE @proc SYSNAME, @table SYSNAME
SELECT TOP 1
@proc = OBJECT_NAME(@@PROCID)
,@table = OBJECT_NAME(parent_id)
FROM sys.triggers
WHERE object_id = @@PROCID
RAISERROR ('Trigger %s prevents UPDATE of table %s due to locked primary key', 16, -1, @proc, @table) WITH NOWAIT
END
ELSE UPDATE t SET
col1 = i.col1
,col2 = i.col2
,col3 = i.col3
FROM your_table t
INNER JOIN inserted i ON t.pk1 = i.pk1
END
GO
(Note that the above is untested, and probably contains all manner of issues with regard to XACT_STATE or TRIGGER_NESTLEVEL -- it's just there to demonstrate the principle.)
It gets a bit messy, though, so I would definitely consider code generation for this, to handle changes to the table during development (maybe even done by a DDL trigger on CREATE/ALTER table).
If you have a composite primary key, you can use IF UPDATE(pk1) OR UPDATE(pk2)... or do some bitwise work with the COLUMNS_UPDATED function, which will give you a bitmask based on the column ordinal (but I'm not going to cover that here -- see MSDN/BOL).
The other (simpler) option is to DENY UPDATE ON your_table(pk) TO public (see the sketch below), but remember that any member of the sysadmin role (and probably dbo) will not honour this.
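A sketch of that column-level permission, using the same placeholder names as the trigger above:
-- Block UPDATEs to the PK column for everyone in public;
-- sysadmin members and the object owner can still bypass this.
DENY UPDATE ON dbo.your_table (pk1) TO public;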
I'm with @Aaron: without a primary key you're stuck. If you have DDL privileges to add a trigger, can't you add an auto-increment PK column while you're at it? If you like, it doesn't even need to be the PK.

Does a foreign key make sense when I do not use Referential Integrity with On Delete/Update Cascade

I do not want the orders to be deleted when a customer is deleted (ON DELETE CASCADE).
I use identity columns, so I do not need ON UPDATE CASCADE.
It should be possible to delete a customer even though there are orders pointing/referring to that customer. I do not care when the customer is gone, because I still need the order table for other tables.
Does a foreign key make sense in this scenario when I do not use referential integrity with ON DELETE/UPDATE CASCADE?
Yes. The foreign key is not in place only to clean up after yourself but primarily to make sure the data is right in the first place (it can also assist the optimizer in some cases). I use foreign keys all over the place but I have yet to find a need to implement on cascade actions. I do understand the purpose of cascade but I've always found it better to control those processes myself.
EDIT: even though I tried to explain already that you can work around the cascade issue (thus still satisfying your third condition), I thought I would add an illustration:
You can certainly still allow for orders to remain after you've deleted a customer. The key is to make the Orders.CustomerID column nullable, e.g.
CREATE TABLE dbo.Customers(CustomerID INT PRIMARY KEY);
CREATE TABLE dbo.Orders(OrderID INT PRIMARY KEY, CustomerID INT NULL
FOREIGN KEY REFERENCES dbo.Customers(CustomerID));
Now when you want to delete a customer, assuming you control these operations via a stored procedure, you can do it this way, first setting their Orders.CustomerID to NULL:
CREATE PROCEDURE dbo.Customer_Delete
@CustomerID INT
AS
BEGIN
SET NOCOUNT ON;
UPDATE dbo.Orders SET CustomerID = NULL
WHERE CustomerID = @CustomerID;
DELETE dbo.Customers
WHERE CustomerID = @CustomerID;
END
GO
If you can't control ad hoc deletes from the Customers table, then you can still achieve this with an instead of trigger:
CREATE TRIGGER dbo.Cascade_CustomerDelete
ON dbo.Customers
INSTEAD OF DELETE
AS
BEGIN
SET NOCOUNT ON;
UPDATE o SET CustomerID = NULL
FROM dbo.Orders AS o
INNER JOIN deleted AS d
ON o.CustomerID = d.CustomerID;
DELETE c
FROM dbo.Customers AS c
INNER JOIN deleted AS d
ON c.CustomerID = d.CustomerID;
END
GO
That all said, I'm not sure I understand the purpose of deleting a customer and keeping their orders (or any indication at all about who placed that order).
So to be clear: you currently have an FK on Orders referencing Customers. Cascade update/delete is not enabled on this relationship. Your plan is to delete customers but allow the orders to remain.
This would VIOLATE the foreign key constraint and prevent the delete from occurring.
If you disable the constraint, execute the delete, and then re-enable it, you could make it work (a sketch follows).
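A rough sketch of that sequence (the constraint name FK_Orders_Customers is a placeholder). Note that re-enabling without WITH CHECK leaves the constraint untrusted, because the orphaned orders would fail validation:
DECLARE @CustomerID INT = 42;  -- the customer to remove (example value)
-- Disable the FK, delete the customer, then turn the FK back on (untrusted)
ALTER TABLE dbo.Orders NOCHECK CONSTRAINT FK_Orders_Customers;
DELETE dbo.Customers WHERE CustomerID = @CustomerID;
ALTER TABLE dbo.Orders CHECK CONSTRAINT FK_Orders_Customers;  -- re-enabled but untrusted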
However, this will leave orphaned order records in the system, which might make it harder to support in the long run. What is the next person who has to support this going to think?
Wouldn't it be better to keep the records and add an Active/Inactive status, or created and inactivated dates?
I'm struggling with the idea of violating the integrity of the database just to reduce space. Or what is the main reason for removing the data?
If you don't want to have to filter out the no-longer-active records all the time, use a view or a package which produces the collection of active customers. Eliminating some but not all of the data just seems wrong to me.

INSERT INTO vs SELECT INTO

What is the difference between using
SELECT ... INTO MyTable FROM...
and
INSERT INTO MyTable (...)
SELECT ... FROM ....
?
From BOL [ INSERT, SELECT...INTO ], I know that using SELECT...INTO will create the insertion table on the default file group if it doesn't already exist, and that the logging for this statement depends on the recovery model of the database.
Which statement is preferable?
Are there other performance implications?
What is a good use case for SELECT...INTO over INSERT INTO ...?
Edit: I already stated that I know that SELECT INTO ... creates a table where one doesn't exist. What I want to know is: SQL includes this statement for a reason; what is it? Is it doing something different behind the scenes for inserting rows, or is it just syntactic sugar on top of a CREATE TABLE and INSERT INTO?
They do different things. Use INSERT when the table exists. Use SELECT INTO when it does not.
Yes. INSERT with no table hints is normally logged. SELECT INTO is minimally logged assuming proper trace flags are set.
In my experience, SELECT INTO is most commonly used with intermediate data sets, like #temp tables, or to copy out an entire table, for example as a backup. INSERT INTO is used when you insert into an existing table with a known structure.
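A minimal side-by-side sketch of the two forms (table and column names are only for illustration):
-- SELECT ... INTO: the target (#RecentOrders) must not exist; it is created by the statement
SELECT OrderID, OrderDate
INTO #RecentOrders
FROM dbo.Orders
WHERE OrderDate >= '20240101';
-- INSERT ... SELECT: the target must already exist with a compatible structure
CREATE TABLE #RecentOrders2 (OrderID INT, OrderDate DATETIME);
INSERT INTO #RecentOrders2 (OrderID, OrderDate)
SELECT OrderID, OrderDate
FROM dbo.Orders
WHERE OrderDate >= '20240101';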
EDIT
To address your edit: they do different things. If you are making a table and want to define the structure, use CREATE TABLE and INSERT. Example of an issue that can be created: you have a small table with a varchar field declared just wide enough for its current contents, say 12 characters, while your real data set will need up to 200. If you do a SELECT INTO from your small table to make a new one, the new table inherits that narrow column definition, and a later INSERT will fail with a truncation error because the field is too small.
Which statement is preferable? Depends on what you are doing.
Are there other performance implications? If the table is a permanent table, you can create indexes at the time of table creation, which has both negative and positive implications for performance. SELECT INTO does not recreate indexes that exist on the source table, and thus subsequent use of the new table may be slower than it needs to be.
What is a good use case for SELECT...INTO over INSERT INTO ...? SELECT INTO is used when you may not know the table structure in advance. It is faster to write than a CREATE TABLE plus an INSERT statement, so it is sometimes used to speed up development. It is often faster when you are creating a quick temp table to test things, or a backup table of a specific query (maybe records you are going to delete). It should be rare to see it used in production code that will run multiple times (except for temp tables), because it will fail if the table already exists.
It is sometimes used inappropriately by people who don't know what they are doing, and they can cause havoc in the db as a result. I strongly feel it is inappropriate to use SELECT INTO for anything other than a throwaway table (a temporary backup, a temp table that will go away at the end of the stored proc, etc.). Permanent tables need real thought as to their design, and SELECT INTO makes it easy to avoid thinking about anything, even something as basic as which columns and which datatypes to use.
In general, I prefer the use of CREATE TABLE and an INSERT statement - you have more control and it is better for repeatable processes. Further, if the table is a permanent table, it should be created from a separate CREATE TABLE script (one that is in source control), as creating permanent objects should not, in general, sit in code that inserts/deletes/updates or selects from a table. Object changes should be handled separately from data changes, because objects have implications beyond the needs of a specific insert/update/select/delete. You need to consider the best data types, think about FK constraints, PKs and other constraints, consider auditing requirements, think about indexing, etc.
Each statement has a distinct use case. They are not interchangeable.
SELECT...INTO MyTable... creates a new MyTable where one did not exist before.
INSERT INTO MyTable...SELECT... is used when MyTable already exists.
The primary difference is that SELECT INTO MyTable will create a new table called MyTable with the results, while INSERT INTO requires that MyTable already exists.
You would use SELECT INTO only in the case where the table didn't exist and you wanted to create it based on the results of your query. As such, these two statements really are not comparable. They do very different things.
In general, SELECT INTO is used more often for one off tasks, while INSERT INTO is used regularly to add rows to tables.
EDIT:
While you can use CREATE TABLE and INSERT INTO to accomplish what SELECT INTO does, with SELECT INTO you do not have to know the table definition beforehand. SELECT INTO is probably included in SQL because it makes tasks like ad hoc reporting or copying tables much easier.
Actually SELECT ... INTO not only creates the table, it will also fail if the table already exists, so basically the only time you would use it is when the table you are inserting into does not exist.
In regards to your EDIT:
I personally mainly use SELECT ... INTO when I am creating a temp table. That, to me, is the main use. However, I also use it when creating new tables with many columns that have structures similar to other tables, and then edit them, in order to save time.
I only want to cover the second point of the question, related to performance, because nobody else has covered this. SELECT INTO is a lot faster than INSERT INTO when it comes to tables with large data sets. I prefer SELECT INTO when I have to read a very large table. INSERT INTO for a table with 10 million rows may take hours, while SELECT INTO will do this in minutes, and as far as losing the indexes on the new table is concerned, you can recreate the indexes afterwards and still save a lot of time compared to INSERT INTO.
SELECT INTO is typically used to generate temp tables or to copy another table (data and/or structure).
In day to day code you use INSERT because your tables should already exist to be read, UPDATEd, DELETEd, JOINed etc. Note: the INTO keyword is optional with INSERT
That is, applications won't normally create and drop tables as part of normal operations unless it is a temporary table for some scope limited and specific usage.
A table created by SELECT INTO will have no keys or indexes or constraints unlike a real, persisted, already existing table
The 2 aren't directly comparable because they have almost no overlap in usage
SELECT INTO creates a new table for you at run time and then inserts records into it from the source table. The newly created table has the same structure as the source table. If you try to use SELECT INTO with an existing table, it will produce an error, because it will try to create a new table with the same name.
INSERT INTO requires the table to already exist in your database before you insert rows into it.
The simple difference between SELECT INTO and INSERT INTO is:
--> SELECT INTO doesn't need an existing table. If you want to copy table A's data, you just type SELECT * INTO [tablename] FROM A. Here, tablename must not already exist; a new table will be created with the same structure as table A.
--> INSERT INTO does need an existing table: INSERT INTO [tablename] SELECT * FROM A;
Here, tablename is an existing table.
SELECT INTO is usually more popular for copying data, especially backup data.
You can use either as per your requirement; it is totally the developer's choice which one to use in a given scenario.
Performance-wise, INSERT INTO is fast.
References :
https://www.w3schools.com/sql/sql_insert_into_select.asp
https://www.w3schools.com/sql/sql_select_into.asp
The other answers are all great/correct (the main difference is whether the DestTable exists already (INSERT), or doesn't exist yet (SELECT ... INTO))
You may prefer to use INSERT (instead of SELECT ... INTO) if you want to be able to COUNT(*) the rows that have been inserted so far.
Using SELECT COUNT(*) ... WITH (NOLOCK) is a simple/crude technique that may help you check the "progress" of the INSERT; helpful if it's a long-running insert.
[If you use...]
INSERT DestTable SELECT ... FROM SrcTable
...then your SELECT COUNT(*) from DestTable WITH (NOLOCK) query would work.
SELECT INTO for large data sets may be good only for a single user, on a single connection to the database, doing a bulk operation task. I do not recommend using
SELECT * INTO table
as this creates one big transaction and takes a schema lock to create the object, preventing other users from creating objects or accessing system objects until the SELECT INTO operation completes.
As a proof of concept, open 2 sessions; in the first session try to use
select into temp table from a huge table
and in the second session try to
create a temp table
and check the locks, the blocking, and how long it takes the second session to create its temp table object (a sketch of the experiment follows). My recommendation: it is always good practice to use CREATE TABLE and an INSERT statement, and if minimal logging is needed, use trace flag 610.
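A sketch of that two-session experiment (object names are placeholders; whether you actually see blocking depends on your SQL Server version):
-- Session 1: long-running SELECT INTO from a large table
SELECT * INTO #BigCopy FROM dbo.HugeTable;

-- Session 2, started while session 1 is still running:
CREATE TABLE #Probe (id INT);

-- In a third session (or afterwards), look for blocking:
SELECT session_id, wait_type, wait_time, blocking_session_id
FROM sys.dm_exec_requests
WHERE blocking_session_id <> 0;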
