Most efficient way to prevent duplicate rows in a large SQL Server table - sql-server

What would be the most efficient way of preventing insertion of duplicate rows in a SQL table that may contain up to 500 million rows?
I see two ways:
1) Create a composite primary key on the columns that define a duplicate record and catch the duplicate-key exceptions.
2) Use IF NOT EXISTS(SELECT ID FROM TABLE WHERE [MyCondition]) before each insert, but this requires indexing the columns that participate in the WHERE clause.

Use a UNIQUE or PRIMARY KEY constraint; the duplicate check is then done as part of the insert.
If you are using SSIS, do a Lookup (match) on the key and direct the duplicates to a hospital (error) table.

ALTER TABLE MyTable ADD CONSTRAINT UC_MyConstraintName UNIQUE (col1,col2,col3)

Option 1) is faster than 2), because with 2) you are just replicating in hand-written SQL what SQL Server already does internally in machine code.
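As a rough illustration of option 1), here is a minimal T-SQL sketch that lets the constraint reject duplicates and handles the resulting error at insert time; the table and column names are placeholders, and 2601/2627 are SQL Server's duplicate-key error numbers.
-- Minimal sketch of option 1): the UNIQUE/PRIMARY KEY constraint rejects duplicates,
-- and the caller decides what to do with the error. dbo.MyTable is a placeholder.
BEGIN TRY
    INSERT INTO dbo.MyTable (col1, col2, col3)
    VALUES ('a', 'b', 'c');
END TRY
BEGIN CATCH
    IF ERROR_NUMBER() IN (2601, 2627)   -- duplicate key in unique index / unique or primary key constraint
        PRINT 'Duplicate row skipped';  -- or log it, depending on requirements
    ELSE
        THROW;                          -- re-raise anything unexpected
END CATCH;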
For something a little out of the box: if insert performance is more critical than immediate correctness, then ignore the duplicates... at first.
You could flag or separately log unchecked rows and run a scheduled task to go back and recheck them. This might be OK where reads are infrequent, or where the duplicates aren't too much of a problem.
If you don't need the write to fail straight away on a duplicate, you could use Service Broker to do the duplicate checks asynchronously. This will take a bit of work, but start here for a primer.
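A minimal sketch of what such a scheduled recheck could look like, assuming the duplicate-defining columns are col1..col3 and the table has a surrogate ID column (all names hypothetical):
-- Scheduled dedup pass: keep the first copy of each duplicate group, delete the rest.
WITH Ranked AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY col1, col2, col3
                              ORDER BY ID) AS rn
    FROM dbo.MyTable
)
DELETE FROM Ranked
WHERE rn > 1;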

Related

(SQL Server) How to disable foreign key validation during a single INSERT query?

I'm trying to improve the performance of a multi-row INSERT query, and the biggest factor at the moment according to the query plan is a FK validation against a large parent table.
I know that the INSERT query will not be inserting data that violates the FK, because it is an INSERT INTO ... SELECT ... FROM query where the SELECT involves an INNER JOIN to the parent table on the key columns, so it's not possible that invalid values will be present in the inserted rows.
I do not want to disable the FK globally: I don't want to open a window where other queries could potentially insert bad data, and locking the table and disabling the FK before performing the INSERT doesn't help, because re-enabling the FK after the INSERT implies revalidating all the rows (WITH CHECK) before the engine will trust the FK again, and both tables are large (potentially tens of millions of rows, and it's a multi-column natural key).
Is there any way in MSSQL to disable the validation of a specific foreign key just during the scope of a single INSERT query? I'm sincerely hoping (without much hope[1]) that I've just missed the documentation where that option is explained.
[1] Why would the engine trust the user to not use that option on a query that might insert bad data? It seems like that would be little more than syntactic sugar for the LOCK TABLE - DISABLE FK - INSERT - ENABLE FK - UNLOCK TABLE approach. But I have to ask just in case...
Sometimes it's the best solution, though usually not. There is no way to do it other than to disable the constraint beforehand and re-enable it afterwards:
ALTER TABLE foo NOCHECK CONSTRAINT CK_foo_column
Then, afterwards:
ALTER TABLE foo CHECK CONSTRAINT CK_foo_column
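Note that a plain CHECK CONSTRAINT re-enables the constraint but leaves it marked untrusted, so the optimizer won't rely on it. Restoring trust requires a revalidation scan, which is exactly the cost the asker is trying to avoid (foo and CK_foo_column are the placeholder names used above):
ALTER TABLE foo WITH CHECK CHECK CONSTRAINT CK_foo_column  -- revalidates every row and marks the constraint trusted again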
You can, but it's a terrible, terrible, terrible decision. And there is no free lunch. At some point, you must validate the constraint - pay now or pay later. And pay later will eventually mean that your assumption (all the rows are valid) will be proved false.
https://learn.microsoft.com/en-us/sql/relational-databases/indexes/disable-indexes-and-constraints
As @Randeep mentions, SQL Server does not automatically create an index to support a FK. And this can't be done for a single statement or a single connection - disabling a constraint is global, affecting all users of that table.

SQL Server update query that updates only the table itself, not the indexes

I need to write a query that updates only the table, not the indexes,
because I want to update an int field and don't want 10 huge indexes to be updated along with it.
If the int field is included in any of the index definitions then they will have to be updated too.
SQL Server won't allow the base table to have one value and the indexes another for obvious data integrity reasons.
If the int field is not included in any of the index definitions then only the table will be updated anyway.
You can disable the indexes, but re-enabling them involves rebuilding each index from scratch.
It depends on what you really want to do.
Keeping the index consistent with the table data is the Consistency in ACID. This is how SQL Server and ACID-compliant RDBMSes work.
There are cases such as bulk loads where you want to delay this Consistency. So if you have this use case, DROP or DISABLE the indexes.
If you disable the indexes:
they will never be used for any query
all associated unique and foreign key constraints will be disabled too
they are not maintained
If you DROP them, of course they can't be used either.
After your bulk load is finished, you enable or create the indexes/constraints again.
If this is what you really want then read MSDN:
Disabling Indexes
Guidelines for Disabling Indexes and Constraints
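As a rough sketch of the disable-then-rebuild pattern for a bulk load (index and table names are placeholders; don't disable the clustered index, since that makes the table inaccessible):
ALTER INDEX IX_MyTable_MyIntField ON dbo.MyTable DISABLE;   -- nonclustered index only

-- ... perform the bulk load or mass UPDATE here ...

ALTER INDEX IX_MyTable_MyIntField ON dbo.MyTable REBUILD;   -- re-enables by rebuilding
-- or: ALTER INDEX ALL ON dbo.MyTable REBUILD;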
Perhaps Filtered Indexes are what you're looking for.
This is a SQL Server 2008 feature that lets you create an index that only applies to certain values in a column.
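A minimal filtered-index sketch (the table, columns, and predicate here are hypothetical; only rows matching the WHERE clause are indexed and maintained):
CREATE NONCLUSTERED INDEX IX_Orders_Open
ON dbo.Orders (CustomerID, OrderDate)
WHERE Status = 'Open';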

Integration services and identity columns

I am a bit of an SSIS newbie and while the whole system seems straightforward, I don't conceptually understand the process I need to go through in this scenario:
Need to map Invoice and InvoiceLine tables from a source database to two equivalent tables in a destination database - with different identity values.
For each invoice inserted across, I need to get the identity it was assigned and then insert all its lines referencing that new identity
There is a surrogate key on the invoices (the invoice number); however, these might also clash with invoice numbers in the target system, so they too would have to be renumbered.
This must be a common scenario in integration - is there a common solution?
Chris KL - you are correct that this is harder than one would expect. I have three methods for this, which work in different situations:
IF the data you are loading is small (hundreds or thousands, but not hundreds of thousands) then you can do this: use an OLE DB Command that performs one insert for each parent row and returns the identity value back; then, downstream from that, join that output to the child rows and insert them. Advantage: intuitive. Disadvantage: scales badly. This method is documented on the web and should be easy to Google.
If we are talking about a bigger system where you need bulk loading, then there are two other flavors:
a. If you have exclusive access to the table during the load (really exclusive, enforced in some way) then you can grab the max existing ID from the table, use an SSIS script task to number the rows starting above that max id, then Set Identity Insert On, stuff them in, and Set Identity Insert Off. You then have those script-generated keys in SSIS to assign to the child rows. Advantage: fast and simple, one trip to the DB. Disadvantage: possible errors if some other process inserts into your table at the same time. Brittle.
b. If you don't have exclusive access, then the only way I know of is with a round trip to the DB, thus: insert all parent rows but keep track of a key for them that is not the identity column (a business key, for example). In a second data flow, process the child records by using a Lookup transform that uses the business key to fetch the parent ID. Make sure the lookup is tuned appropriately with respect to caching, and that the business key is indexed.
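For reference, what the Lookup in option b accomplishes is essentially this set-based join (the table and column names here are hypothetical, with InvoiceNumber plus SourceSystemID playing the role of the business key):
SELECT p.InvoiceID AS NewParentID,   -- identity generated when the parent was inserted
       c.*
FROM   staging.InvoiceLine AS c
JOIN   dbo.Invoice         AS p
       ON p.InvoiceNumber  = c.InvoiceNumber
      AND p.SourceSystemID = c.SourceSystemID;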
OK, this is a good news / bad news situation I'm afraid. First the good news and a bit of background which you may know but I'll put it down in case you don't.
You generally can't insert anything into IDENTITY columns. Of course, like everything else in life there are times when you need to and that can be done with the IDENTITY_INSERT option.
SET IDENTITY_INSERT MyTable ON
INSERT INTO MyTable (
MyIdCol,
Etc…
)
SELECT SourceIdCol,
Etc…
FROM MySourceTable
SET IDENTITY_INSERT MyTable OFF
Now, you say that you have surrogate keys in the target but then you say that they may clash. So I'm a little confused… Are you using the keys from the source (e.g. IDENTITY columns) or are you generating new keys in the target? I would strongly advise against trying to merge the keyspaces in a single key column. If you need to retain the keys then I would suggest a multi-field key using something like SourceSystemId to keep them unique.
Finally the bad news: SSIS doesn't provide a simple means of using the IDENTITY_INSERT option. The only way I've been able to do it is by turning it on in an Execute SQL task that runs before the insert task. You should be able to pass the table name into the task as a variable. Make sure to include another SQL task afterwards to turn it off, because IDENTITY_INSERT can only be on for one table at a time.

Preventing Duplicate Inserts Into SQL With PHP

I'm going to be running thousands of queries against SQL and I need to prevent duplication of the field 'domain'. I've never had to do this before and any help would be appreciated.
You probably want to create a UNIQUE constraint on the field 'domain' - this constraint will raise an error if you try to create two rows that have the same domain in the database. For an explanation, see this tutorial at W3Schools:
http://www.w3schools.com/sql/sql_unique.asp
If this doesn't solve your problem, please clarify the database you have chosen to use (MySql?).
NOTE: This constraint is completely separate from your choice of PHP as a programming language, it is a SQL database definition thing. A huge advantage of expressing this constraint in SQL is that you can trust the database to preserve the constraint even when people import / export data from the database, your application is buggy or another application shares the database.
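A minimal sketch, assuming MySQL and a table called mydata with a varchar domain column (these names are guesses, not from the question):
ALTER TABLE mydata
    ADD CONSTRAINT uq_mydata_domain UNIQUE (domain);  -- duplicate inserts now fail with a duplicate-key error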
If this is an absolute database integrity requirement (It's not likely to change, nor does existing data have this problem), I would enforce it at the database with a unique constraint.
As far as detecting it before or after the attempt in order to notify the user, there are a number of techniques which could be used.
Where is the data coming from? Is this something you only want to run once, or a couple of times, or often? If the domain-value already exists, do you just want to skip the insert or do something else (ie increment a counter)?
Depending on your answers, there are many possible solutions:
Pre-sort your data, eliminate duplicates, then insert
(assumes relatively static data, empty table to begin with)
Use an associative array in PHP as a local domain-value cache
(if the table already contains data, start by reading the existing content;
not thread-safe, but works if only one instance runs at a time)
Make domain a UNIQUE column and write wrapper code to handle return errors
Make domain a UNIQUE or PRIMARY KEY column and use an ON DUPLICATE KEY clause:
INSERT INTO mydata ( domain, count ) VALUES
( 'firstdomain', 1 ),
( 'seconddomain', 1 ),
( 'thirddomain', 1 )
ON DUPLICATE KEY
UPDATE count = count+1
Insert all data into the table, then remove duplicates
Note that batching inserts (ie using multiple value clauses per statement) can be significantly faster.
I'm not really sure I understood your question, but perhaps you are looking for SQL's UNIQUE constraint. If a query tries to insert a value that already exists in the field, you (PHP) will be notified about the constraint violation.
There are a bunch of ways to approach this. You could set a unique constraint (like a primary key) on that column. This will cause the insert to fail if that domain has already been inserted. You could also insert all of the duplicate domains and just delete them later on. This will work well if not that many of the domains are duplicated. There are a few questions posted already on finding duplicate rows.
This can be done with SQL rather than with PHP.
I am assuming that you are using MySQL, but the same principles will work with other databases.
Make the domain column the primary key (which makes sense, as it has to be unique).
Rather than a plain INSERT, use an upsert: MySQL's REPLACE, or the INSERT ... ON DUPLICATE KEY UPDATE form shown above.
If the primary key you are trying to put into the table already exists, the existing row is updated (or replaced) rather than a new row being created.
So you will overwrite existing data if it is different, and if it is identical the update is effectively skipped.
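A minimal sketch of that overwrite-on-duplicate behaviour (MySQL; mydata and its columns are placeholder names):
REPLACE INTO mydata (domain, note)
VALUES ('example.com', 'latest crawl');
-- REPLACE deletes any existing row with the same primary key and inserts the new one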

Fastest way to delete all the data in a large table

I had to delete all the rows from a log table that contained about 5 million rows. My initial try was to issue the following command in query analyzer:
delete from client_log
which took a very long time.
Check out TRUNCATE TABLE, which is a lot faster.
I discovered TRUNCATE TABLE in the MSDN Transact-SQL reference. For all interested, here are the remarks:
TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. The counter used by an identity for new rows is reset to the seed for the column. If you want to retain the identity counter, use DELETE instead. If you want to remove table definition and its data, use the DROP TABLE statement.
You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint; instead, use DELETE statement without a WHERE clause. Because TRUNCATE TABLE is not logged, it cannot activate a trigger.
TRUNCATE TABLE may not be used on tables participating in an indexed view.
There is a common myth that TRUNCATE somehow skips the transaction log.
This is a misunderstanding, and it is clearly addressed in MSDN.
This myth is invoked in several comments here. Let's eradicate it together ;)
For reference, TRUNCATE TABLE also works on MySQL.
I use the following method to zero out tables, with the added bonus that it leaves me with an archive copy of the table.
CREATE TABLE `new_table` LIKE `table`;
RENAME TABLE `table` TO `old_table`, `new_table` TO `table`;
Forget TRUNCATE and DELETE. Keep your table definition (in case you want to recreate the table) and just use DROP TABLE.
truncate table client_log
is your best bet; TRUNCATE removes all content from the table and its indexes and resets any identity seeds you've got too.
TRUNCATE TABLE is not platform-independent. If you suspect that you might ever change database providers, be wary of using it.
On SQL Server you can use the Truncate Table command which is faster than a regular delete and also uses less resources. It will reset any identity fields back to the seed value as well.
The drawbacks of TRUNCATE are that it can't be used on tables that are referenced by foreign keys, and it won't fire any triggers. Also, you won't be able to roll back the data if anything goes wrong.
Note that TRUNCATE will also reset any auto incrementing keys, if you are using those.
If you do not wish to lose your auto-incrementing keys, you can speed up the delete by deleting in sets (e.g., DELETE FROM table WHERE id > 1 AND id < 10000). This speeds things up significantly and in some cases prevents the data from being locked up.
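One common way to batch this on SQL Server is a loop that deletes a chunk at a time until nothing is left (the table name and batch size are placeholders):
WHILE 1 = 1
BEGIN
    DELETE TOP (10000) FROM client_log;   -- small chunks keep the log and locks manageable
    IF @@ROWCOUNT = 0 BREAK;              -- stop once no rows remain
END;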
Yes, well, deleting 5 million rows is probably going to take a long time. The only potentially faster way I can think of would be to drop the table, and re-create it. That only works, of course, if you want to delete ALL data in the table.
The suggestion of "Drop and recreate the table" is probably not a good one because that goofs up your foreign keys.
You ARE using foreign keys, right?
If you cannot use TRUNCATE TABLE because of foreign keys and/or triggers, you can consider to:
drop all indexes;
do the usual DELETE;
re-create all indexes.
This may speed up DELETE somewhat.
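A rough sketch of that sequence on SQL Server (the index name and column are placeholders):
DROP INDEX IX_client_log_logdate ON client_log;

DELETE FROM client_log;   -- still fully logged, but with no index maintenance overhead

CREATE INDEX IX_client_log_logdate ON client_log (logdate);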
I am revising my earlier statement:
You should understand that by using TRUNCATE the data will be cleared but nothing will be logged to the transaction log. Writing to the log is why DELETE will take forever on 5 million rows. I use TRUNCATE often during development, but you should be wary about using it on a production database because you will not be able to roll back your changes. You should immediately make a full database backup after doing a TRUNCATE to establish a new basis for restoration.
The above statement was intended to prompt you to be sure that you understand there is a difference between the two. Unfortunately, it is poorly written and makes unsupported statements, as I have not actually done any testing myself between the two. It is based on statements that I have heard from others.
From MSDN:
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
I just wanted to say that there is a fundamental difference between the two, and because there is a difference, there will be applications where one or the other may be inappropriate.
DELETE FROM table_name;
Premature optimization may be dangerous. Optimizing may mean doing something weird, but if it works you may want to take advantage of it.
SELECT DbVendor_SuperFastDeleteAllFunction(tablename, BOZO_BIT) FROM dummy;
For speed I think it depends on...
The underlying database: Oracle, Microsoft, MySQL, PostgreSQL, others, custom...
The table, its content, and related tables:
There may be deletion rules. Is there an existing procedure to delete all content in the table? Can this be optimized for the specific underlying database engine? How much do we care about breaking things / related data? Performing a DELETE may be the 'safest' way assuming that other related tables do not depend on this table. Are there other tables and queries that are related / depend on the data within this table? If we don't care much about this table being around, using DROP might be a fast method, again depending on the underlying database.
DROP TABLE table_name;
How many rows are being deleted? Is there other information that is quickly gleaned that will optimize the deletion? For example, can we tell if the table is already empty? Can we tell if there are hundreds, thousands, millions, billions of rows?
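If it helps to know the scale before choosing an approach, on SQL Server an approximate row count can be read from metadata without scanning the table (client_log is just the table from the question):
SELECT SUM(p.rows) AS approx_rows
FROM   sys.partitions AS p
WHERE  p.object_id = OBJECT_ID('dbo.client_log')
  AND  p.index_id IN (0, 1);   -- heap or clustered index only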
