Bulk insert with data validation - sql-server

We have a requirement to insert a large number of records (about 2 to 3 million) into a table. However, we need to validate the data and segregate invalid records - primary key, foreign key and NOT NULL violations - into a separate error table for later reference. From what I've read, BULK INSERT in SQL Server works well for the inserting itself, but I'm not able to figure out the best way to filter out the bad records. Does having a staging table in between help? Although we could check for violations with queries against the staging table, we would then have to load the good records into the actual table with another insert - either INSERT...SELECT or MERGE - but is that an efficient approach? I'm concerned that it amounts to doing the insert twice.
I'm planning to use .NET SqlBulkCopy for the bulk insertions, and it doesn't provide clear error reporting either.
Can somebody point me to a more efficient solution?
EDIT: If this approach is the only solution, which method do you think is best for the second insert - INSERT...SELECT or MERGE? Would either of them match the efficiency and speed of BULK INSERT, or is there a better alternative?
Thanks!

Personally, I would not consider 2-3M records a large amount.
Unless you need the data loaded in seconds, a single (non-bulk) insert will perform adequately.
If I'm nervous about the source data quality, I like to load into a staging table first and then do "soft RI" - check for PKs, unique keys, FKs etc. using SQL.
If I'm worried about numeric/non-numeric or bad date values, I make the staging table VARCHAR(8000) for all columns and use TRY_CONVERT when reading from it.
Once the data is in staging you can easily filter out just the good rows and report in detail on the bad ones.
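For illustration, a minimal sketch of that staging approach. It assumes a target dbo.Orders (primary key OrderId, foreign key CustomerId referencing dbo.Customers) and an error table dbo.Orders_Errors already exist; all of these names are made up for the example, not taken from the question.

-- Wide-varchar staging table: everything is loaded as text first.
CREATE TABLE dbo.Stg_Orders
(
    OrderId    VARCHAR(8000),
    CustomerId VARCHAR(8000),
    OrderDate  VARCHAR(8000)
);

-- Bulk load the raw file into staging (BULK INSERT or SqlBulkCopy both work here).
BULK INSERT dbo.Stg_Orders
FROM 'C:\data\orders.csv'
WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2);

-- Capture bad rows: unusable keys, unparsable dates, missing parents, duplicate PKs.
INSERT INTO dbo.Orders_Errors (OrderId, CustomerId, OrderDate, ErrorReason)
SELECT s.OrderId, s.CustomerId, s.OrderDate,
       CASE
           WHEN TRY_CONVERT(int, s.OrderId) IS NULL    THEN 'Bad or NULL primary key'
           WHEN TRY_CONVERT(date, s.OrderDate) IS NULL THEN 'Bad date'
           WHEN c.CustomerId IS NULL                   THEN 'Missing foreign key'
           ELSE 'Duplicate primary key'
       END
FROM dbo.Stg_Orders s
LEFT JOIN dbo.Customers c ON c.CustomerId = TRY_CONVERT(int, s.CustomerId)
LEFT JOIN dbo.Orders    t ON t.OrderId    = TRY_CONVERT(int, s.OrderId)
WHERE TRY_CONVERT(int, s.OrderId) IS NULL
   OR TRY_CONVERT(date, s.OrderDate) IS NULL
   OR c.CustomerId IS NULL
   OR t.OrderId IS NOT NULL;

-- Move the clean rows; the conditions are simply the inverse of the error checks.
INSERT INTO dbo.Orders (OrderId, CustomerId, OrderDate)
SELECT TRY_CONVERT(int, s.OrderId), TRY_CONVERT(int, s.CustomerId), TRY_CONVERT(date, s.OrderDate)
FROM dbo.Stg_Orders s
JOIN dbo.Customers c ON c.CustomerId = TRY_CONVERT(int, s.CustomerId)
LEFT JOIN dbo.Orders t ON t.OrderId = TRY_CONVERT(int, s.OrderId)
WHERE TRY_CONVERT(int, s.OrderId) IS NOT NULL
  AND TRY_CONVERT(date, s.OrderDate) IS NOT NULL
  AND t.OrderId IS NULL;

This is still two writes (staging load plus the final INSERT...SELECT), but both are set-based, and that is usually the trade-off accepted for getting per-row error reporting.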

Related

Speed up table scans using a middle-man data dump table

I'm a dev (not a DBA) working on a project where we have been managing "updates" to data via bulk insert. However, we first insert into a non-indexed "pre-staging" table, because we need to normalize a lot of denormalized data and make sure it's properly split up across our schema.
Naturally this makes the update and insert processes slow, since we have to check whether the information already exists for each specific table using non-indexed codes or identifiers.
Since the "pre-staging" table is truncated, we didn't include auto-generated IDs either.
I'm looking into ways to speed up table scans on that particular table in our stored procedures. What would be the best approach? Indexes? Auto-generated IDs as clustered indexes? The last one is tricky because we cannot establish relationships with our "staging" data, since it is truncated with every data dump.
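As a rough illustration of the indexing option (all table and column names here are hypothetical), an index can be built right after each bulk load and the table simply truncated before the next dump:

-- Hypothetical pre-staging table, bulk loaded and truncated on every dump.
CREATE TABLE dbo.PreStaging
(
    SourceCode VARCHAR(50)  NOT NULL,
    Payload    VARCHAR(500) NULL
);

-- After the bulk load, index the column(s) the stored procedures scan on;
-- building the index once is usually cheaper than repeated full table scans.
CREATE NONCLUSTERED INDEX IX_PreStaging_SourceCode
    ON dbo.PreStaging (SourceCode);

-- Before the next dump: TRUNCATE removes the rows; the (now empty) index
-- definition stays and is maintained during the next load. Drop it first if
-- maintaining it noticeably slows down the bulk insert.
TRUNCATE TABLE dbo.PreStaging;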

SSIS Lookup Transform use Table or Query

I have a Lookup Transformation on a table with 30 columns, but I am only using two of them: the ID column for the join and the Update column as output.
On the connection, should I enter a query (Select ID, Update From T1) or use Table in the drop down?
If I use Table in the drop down, would that be like doing Select * From T1, or is SSIS clever enough to know I only need 2 columns?
I'm thinking I should go with the query Select ID, Update From T1.
On the connection should I enter a query Select ID, Update From T1 or Use Table in the drop down?
It is best to specify which columns you want.
Using table in Drop down, would this be like doing Select * From T1
Yes, it is a SELECT *.
or is SSIS clever enough to know I only need 2 columns?
Nope.
Keep in mind that Lookups are good for pulling data from Dimension Tables where the row count and record set is small. If you are dealing with large amounts of unique data, then it will be better to perform a MERGE JOIN, instead. The performance difference can be substantial. For example, when using a Lookup on 20K rows of data, you could experience run times in the tens of minutes. A MERGE JOIN, however, would run within seconds.
Lookups have the drawback of behaving like correlated sub-queries: they fire off a query to the server for every row passing through them. You can have the Lookup cache the data, which means SSIS stores the results in memory and checks that cache before going to the server for subsequent rows. As a result, caching is only effective when many incoming rows match a small cached set. In other words, Lookups are not optimal when there is a large number of distinct IDs to look up; at that point, caching the data is almost pointless.
This is where you would switch over to using a MERGE JOIN. Note: you will need to perform a SORT on both of the data flows before the MERGE JOIN because the MERGE JOIN component requires the incoming rows to be sorted.
When handled incorrectly, a single poorly placed Lookup can bring an entire package to its knees - Lookups can be huge performance bottlenecks. Handled correctly, though, a Lookup can simplify the design of the data flow and speed development by avoiding the extra work required to MERGE JOIN data flows.
The bottom line to all of this is that you want the Lookup performing the fewest number of queries against the server.
If you need only two columns from the lookup table, it is better to use a SELECT query than to pick the table from the drop-down list, but the columns you specify must include the primary key (ID). Reading all columns consumes more resources, even if the effect isn't noticeable on small tables.
You can refer to the following answer on the Database Administrators community for more information:
SSIS OLE DB Source Editor Data Access Mode: “SQL command” vs “Table or view”
Note that what @JWeezy mentioned above about lookups on large tables is right: Lookups are not designed for large tables, so I would use SQL JOINs instead.
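As a rough sketch of what "use SQL JOINs instead" could look like here (dbo.SourceRows is a hypothetical name for the incoming data set; T1, ID and Update come from the question):

-- Set-based alternative to a row-by-row Lookup: join the incoming rows to T1
-- once and carry only the two needed columns. [Update] is bracketed because
-- UPDATE is a reserved word.
SELECT src.ID,
       t1.[Update]
FROM dbo.SourceRows AS src
LEFT JOIN dbo.T1 AS t1
       ON t1.ID = src.ID;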

Preventing Duplicates in a large SQL Server table most effectively

What would be the most efficient way of preventing insertion of duplicate rows in a SQL table that may contain up to 500 million rows?
I see two ways:
1) Create a composite primary key on the columns that define a duplicate record and catch the duplicate-key exceptions.
2) Use IF NOT EXISTS(SELECT ID FROM TABLE WHERE [MyCondition]), but this will require indexing the columns that participate in the WHERE clause.
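For what it's worth, a set-based variant of option 2) might look like this (dbo.MyTable, dbo.IncomingRows and the column names are hypothetical):

-- Insert only rows whose key combination is not already in the target table.
-- col1, col2, col3 should be covered by an index for this to stay fast at 500M rows.
INSERT INTO dbo.MyTable (col1, col2, col3)
SELECT s.col1, s.col2, s.col3
FROM dbo.IncomingRows AS s
WHERE NOT EXISTS (SELECT 1
                  FROM dbo.MyTable AS t
                  WHERE t.col1 = s.col1
                    AND t.col2 = s.col2
                    AND t.col3 = s.col3);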
UNIQUE or PRIMARY KEY. The duplication check will be done on the insert.
If you are using SSIS, do a Lookup match on the key and direct the duplicates to a hospital (error) table.
ALTER TABLE MyTable ADD CONSTRAINT UC_MyConstraintName UNIQUE (col1,col2,col3)
1) is faster than 2), since with 2) you are just replicating in compiled SQL code what SQL Server already does in machine code when enforcing the constraint.
For something a little out of the box: If insert performance is more critical than immediate correctness then ignore the duplicates... at first.
You could flag or separately log unchecked rows and run a scheduled task to go back and re-check them. This might be acceptable where reads are infrequent or where the occasional duplicate isn't too much of a problem.
If you don't need the write to fail straight away on a duplicate, you could use Service Broker to do the duplicate checks asynchronously. This will take a bit of work, but start here for a primer.

Quickly update a large amount of rows in a table, without blocking inserts on referencing tables, in SQL Server 2008

Context:
I have a system that acts as a web UI for a legacy accounting system. This legacy system sends me a large text file several times a day so I can update a CONTRACT table in my database (the file can contain new contracts, or just updated values for existing contracts). This table currently has around 2M rows and about 150 columns. I can't have downtime during these updates, since they happen during the day and there are usually about 40 logged-in users at any given time.
My system's users can't update the CONTRACT table, but they can insert records in tables that reference the CONTRACT table (Foreign Keys to the CONTRACT table's ID column).
To update my CONTRACT table I first load the text file into a staging table using a bulk insert, and then I use a MERGE statement to create or update the rows in batches of 100k records. And here's my problem - during the MERGE statement, because I'm using READ COMMITTED SNAPSHOT isolation, users can keep viewing the data, but they can't insert anything - their transactions time out because the CONTRACT table is locked.
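(For reference, the batching loop described above looks roughly like this; every object and column name below is an assumption, not taken from the real system.)

DECLARE @BatchSize int = 100000,
        @RowsProcessed int = 1;

-- Merge the staged contracts in chunks so each transaction stays short and
-- locks on dbo.CONTRACT are released between batches.
WHILE @RowsProcessed > 0
BEGIN
    BEGIN TRAN;

    WITH batch AS
    (
        SELECT TOP (@BatchSize) ContractId, Amount
        FROM dbo.Contract_Staging
        ORDER BY ContractId
    )
    MERGE dbo.CONTRACT AS tgt
    USING batch AS src
        ON tgt.ID = src.ContractId
    WHEN MATCHED THEN
        UPDATE SET tgt.Amount = src.Amount      -- plus the remaining columns
    WHEN NOT MATCHED THEN
        INSERT (ID, Amount) VALUES (src.ContractId, src.Amount);

    -- Remove the rows that were just merged so the next pass picks a new chunk.
    DELETE s
    FROM dbo.Contract_Staging AS s
    WHERE s.ContractId IN (SELECT TOP (@BatchSize) ContractId
                           FROM dbo.Contract_Staging
                           ORDER BY ContractId);

    SET @RowsProcessed = @@ROWCOUNT;

    COMMIT;
END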
Question: does anyone know of a way to quickly update this large amount of rows, while enforcing data integrity and without blocking inserts on referencing tables?
I've thought about a few workarounds, but I'm hoping there's a better way:
Drop the foreign keys. - I'd like to enforce data consistency, so this doesn't sound like a good solution.
Decrease the batch size on the MERGE statement so that each transaction is fast enough not to cause timeouts for other transactions. - I have tried this, but the sync process becomes too slow; as I mentioned above, I receive the update files frequently and it's vital that the updated data is available shortly after.
Create an intermediate table with a single CONTRACTID column and have the other tables reference that table instead of the CONTRACT table. This would allow me to update it much faster while keeping decent integrity. - I guess it would work, but it sounds convoluted.
Update:
I ended up dropping my foreign keys. Since the system has been in production for some time and the logs don't ever show foreign key constraint violations, I'm pretty sure no inconsistent data will be created. Thanks to everyone who commented.

Fastest way to delete all the data in a large table

I had to delete all the rows from a log table that contained about 5 million rows. My initial try was to issue the following command in query analyzer:
delete from client_log
which took a very long time.
Check out TRUNCATE TABLE, which is a lot faster.
I discovered TRUNCATE TABLE in the MSDN Transact-SQL reference. For anyone interested, here are the remarks:
TRUNCATE TABLE is functionally identical to DELETE statement with no WHERE clause: both remove all rows in the table. But TRUNCATE TABLE is faster and uses fewer system and transaction log resources than DELETE.
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
TRUNCATE TABLE removes all rows from a table, but the table structure and its columns, constraints, indexes and so on remain. The counter used by an identity for new rows is reset to the seed for the column. If you want to retain the identity counter, use DELETE instead. If you want to remove table definition and its data, use the DROP TABLE statement.
You cannot use TRUNCATE TABLE on a table referenced by a FOREIGN KEY constraint; instead, use DELETE statement without a WHERE clause. Because TRUNCATE TABLE is not logged, it cannot activate a trigger.
TRUNCATE TABLE may not be used on tables participating in an indexed view.
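A quick way to see the identity behaviour described above (the throwaway table name is made up):

-- DELETE keeps the identity counter; TRUNCATE resets it to the seed.
CREATE TABLE dbo.IdentityDemo (Id int IDENTITY(1,1), Note varchar(20));

INSERT INTO dbo.IdentityDemo (Note) VALUES ('a'), ('b');
DELETE FROM dbo.IdentityDemo;
INSERT INTO dbo.IdentityDemo (Note) VALUES ('c');   -- gets Id = 3

TRUNCATE TABLE dbo.IdentityDemo;
INSERT INTO dbo.IdentityDemo (Note) VALUES ('d');   -- gets Id = 1 again

DROP TABLE dbo.IdentityDemo;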
There is a common myth that TRUNCATE somehow skips transaction log.
This is a misunderstanding, and it is clearly addressed in MSDN.
This myth is invoked in several comments here. Let's eradicate it together ;)
For reference TRUNCATE TABLE also works on MySQL
I use the following method to zero out tables, with the added bonus that it leaves me with an archive copy of the table.
CREATE TABLE `new_table` LIKE `table`;
RENAME TABLE `table` TO `old_table`, `new_table` TO `table`;
Forget truncate and delete: keep your table definition (in case you want to recreate the table) and just use DROP TABLE.
truncate table client_log
is your best bet; truncate removes all content from the table and its indexes, and resets any identity seeds you've got too.
TRUNCATE TABLE is not platform-independent SQL. If you suspect that you might ever change database providers, be wary of using it.
On SQL Server you can use the Truncate Table command which is faster than a regular delete and also uses less resources. It will reset any identity fields back to the seed value as well.
The drawbacks of truncate are that it can't be used on tables that are referenced by foreign keys and it won't fire any triggers. Also, you won't be able to roll back the data if anything goes wrong.
Note that TRUNCATE will also reset any auto incrementing keys, if you are using those.
If you do not wish to lose your auto-incrementing keys, you can speed up the delete by deleting in sets (e.g., DELETE FROM table WHERE id > 1 AND id < 10000). This speeds things up significantly and in some cases prevents rows from being locked for long periods.
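A rough sketch of that idea as a loop, using the client_log table from the question (the batch size is arbitrary):

-- Delete in small batches so each transaction is short; identity values are
-- preserved, unlike with TRUNCATE.
DECLARE @BatchSize int = 10000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize) FROM client_log;

    IF @@ROWCOUNT = 0
        BREAK;
END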
Yes, well, deleting 5 million rows is probably going to take a long time. The only potentially faster way I can think of would be to drop the table, and re-create it. That only works, of course, if you want to delete ALL data in the table.
The suggestion of "Drop and recreate the table" is probably not a good one because that goofs up your foreign keys.
You ARE using foreign keys, right?
If you cannot use TRUNCATE TABLE because of foreign keys and/or triggers, you can consider to:
drop all indexes;
do the usual DELETE;
re-create all indexes.
This may speed up DELETE somewhat.
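A minimal sketch of that sequence; the index and column names are hypothetical:

-- 1) Drop nonclustered indexes so the DELETE doesn't have to maintain them.
DROP INDEX IX_client_log_logdate ON client_log;

-- 2) The usual DELETE (still fully logged, row by row).
DELETE FROM client_log;

-- 3) Re-create the indexes afterwards.
CREATE NONCLUSTERED INDEX IX_client_log_logdate ON client_log (log_date);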
I am revising my earlier statement:
You should understand that by using TRUNCATE the data will be cleared but nothing will be logged to the transaction log. Writing to the log is why DELETE will take forever on 5 million rows. I use TRUNCATE often during development, but you should be wary about using it on a production database because you will not be able to roll back your changes. You should immediately make a full database backup after doing a TRUNCATE to establish a new basis for restoration.
The above statement was intended to prompt you to make sure you understand that there is a difference between the two. Unfortunately, it is poorly written and makes unsupported claims, as I have not actually done any testing myself; it is based on statements I have heard from others.
From MSDN:
The DELETE statement removes rows one at a time and records an entry in the transaction log for each deleted row. TRUNCATE TABLE removes the data by deallocating the data pages used to store the table's data, and only the page deallocations are recorded in the transaction log.
I just wanted to say that there is a fundamental difference between the two and because there is a difference, there will be applications where one or the other may be inappropriate.
DELETE FROM table_name;
Premature optimization may be dangerous. Optimizing may mean doing something weird, but if it works you may want to take advantage of it.
SELECT DbVendor_SuperFastDeleteAllFunction(tablename, BOZO_BIT) FROM dummy;
For speed I think it depends on...
The underlying database: Oracle, Microsoft, MySQL, PostgreSQL, others, custom...
The table, its content, and related tables:
There may be deletion rules. Is there an existing procedure to delete all content in the table? Can this be optimized for the specific underlying database engine? How much do we care about breaking things / related data? Performing a DELETE may be the 'safest' way assuming that other related tables do not depend on this table. Are there other tables and queries that are related / depend on the data within this table? If we don't care much about this table being around, using DROP might be a fast method, again depending on the underlying database.
DROP TABLE table_name;
How many rows are being deleted? Is there other information that is quickly gleaned that will optimize the deletion? For example, can we tell if the table is already empty? Can we tell if there are hundreds, thousands, millions, billions of rows?
