SSIS insert a large number of rows if not existing? - sql-server

I need to do an "insert/delete if not exists" for many very big tables from Server A to B. The Lookup component doesn't work well because it issues one query per row to check whether the row exists in the destination table.
What's a good option?
The tables all have primary keys, but individually they can be several hundred GB in size. The destination tables are probably missing fewer than 3% of the rows, so the Merge component may not be a good option?

You can use the SSIS Merge component. I personally found better performance by loading all the data into a staging table and running a MERGE T-SQL statement afterwards in a stored procedure.
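As a rough illustration of that staging-plus-MERGE pattern, here is a minimal sketch; the table and column names (dbo.TargetTable, dbo.StagingTable, a single-column key Id) are placeholders, not your actual schema:

-- Run after the SSIS data flow has fast-loaded the source rows into dbo.StagingTable.
-- One set-based statement inserts the missing rows and deletes rows that no longer
-- exist in the source, instead of issuing one lookup query per row.
MERGE dbo.TargetTable AS tgt
USING dbo.StagingTable AS src
    ON tgt.Id = src.Id
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Col1, Col2) VALUES (src.Id, src.Col1, src.Col2)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;

Wrapping this in a stored procedure and calling it from an Execute SQL Task keeps the whole comparison server-side.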

Related

Best way to handle updates on a table

I am looking for a much better way to update tables using SSIS. Specifically, I want to optimize the updates on tables (around 10 tables use the same logic).
The logic is:
Select the source data from staging, then insert it into a physical temp table in the DW (i.e. TMP_Tbl).
Update all rows in MyTbl that match TMP_Tbl on the customerId column.
Insert all rows from TMP_Tbl1 whose customerId does not yet exist in MyTbl.
With the above steps, populating TMP_Tbl takes some time. Hence, I planned to change the logic to delete-insert, but according to this:
In SQL, is UPDATE always faster than DELETE+INSERT? this would be a recipe for pain.
Given:
no index/keys used on the tables
some tables contain 5M rows, some contain 2k rows
each table update takes 2-3 minutes, which adds up to about 15 to 20 minutes in total
these updates run simultaneously in separate sequence containers
Does anyone know the best approach to use? It seems like the physical temp table needs to be removed; is this normal?
With SSIS you usually BULK INSERT, not INSERT. So if you do not mind DELETE, reinserting the rows should in general outperform UPDATE.
Considering this, the faster approach will be:
[Execute SQL Task] Delete all records which you need to update. (Depending on your DB design and queries, some index may help here; see the sketch after step 2.)
[Data Flow Task] Fast load (using an OLE DB Destination with Data access mode: "Table or view - fast load") both the updated and the new records from the source into MyTbl. No need for temp tables here.
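A rough sketch of the step-1 DELETE (all names are placeholders; it assumes the incoming customerIds are available somewhere you can join or filter on, for example a staged key list or a batch/date predicate):

-- Hypothetical Execute SQL Task statement: remove the rows that are about to be reloaded.
-- Here the incoming customerIds are assumed to sit in dbo.TMP_Tbl; a date/batch
-- predicate works just as well if you want to avoid staging entirely.
DELETE tgt
FROM dbo.MyTbl AS tgt
INNER JOIN dbo.TMP_Tbl AS src
    ON src.customerId = tgt.customerId;

The fast load in step 2 then reinserts both the updated and the new rows in one bulk operation.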
If you cannot/don't want to DELETE records - your current approach is OK too.
You just need to fix the performance of that UPDATE query (adding an index should help). 2-3 minutes per record updated is way too long.
If it is 2-3 minutes for updating millions of records though - then it's acceptable.
Adding the correct non-clustered index to a table should not result in "much more time on the updates".
There will be a slight overhead, but if it helps your UPDATE to seek instead of scanning a big table - it is usually well worth it.
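A minimal sketch of that, assuming customerId is the UPDATE's join column (every other name is a placeholder):

-- Hypothetical index on the join column so the UPDATE can seek instead of scan.
CREATE NONCLUSTERED INDEX IX_MyTbl_customerId
    ON dbo.MyTbl (customerId);

-- A single set-based UPDATE joined to the temp table, rather than row-by-row updates.
UPDATE tgt
SET    tgt.SomeColumn = src.SomeColumn
FROM   dbo.MyTbl AS tgt
INNER JOIN dbo.TMP_Tbl AS src
    ON src.customerId = tgt.customerId;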

SSIS merge join to update or delete rows locked forever

I followed this blog post to create an SSIS package for transferring data between 2 tables: http://radacad.com/insert-update-and-delete-destination-table-with-ssis.
Insert rows: not isnull(source_id) and isnull(dest_id)
Delete rows: isnull(source_id) and not isnull(dest_id)
Update rows: source_id = dest_id and source_row_version <> dest_row_version
It works well with a few records. However, when a lot of rows to update or delete are detected (thousands or hundreds of thousands of rows), it runs very slowly, the destination table gets locked, and the job never finishes. Another thing: this approach deletes and updates row by row.
Could you please help me overcome the table lock in this case? Is there any way we can update or delete rows in batches instead of doing it row by row?
Use a staging area, and use Execute SQL tasks to run SQL or execute stored procedures to carry out bulk operations based on comparing your staging area to your destination.
I realise this isn't a very satisfying answer if you were hoping to do everything in SSIS, but using blocking transformations (MERGE JOIN is semi-blocking), and row-by-row OLE command transforms generally won't scale well to large amounts of data.
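As an illustration of those bulk operations, here is a sketch only; the names (dbo.Dest_Tbl, dbo.Staging_Tbl, id, row_version, some_column) are placeholders standing in for the columns used in the conditions above:

-- Bulk-load the source rows into dbo.Staging_Tbl first, then run something like
-- this from an Execute SQL Task or a stored procedure.

-- Batch delete: rows that exist in the destination but no longer in the source.
DELETE d
FROM dbo.Dest_Tbl AS d
WHERE NOT EXISTS (SELECT 1 FROM dbo.Staging_Tbl AS s WHERE s.id = d.id);

-- Batch update: rows whose row_version differs, in one set-based statement.
UPDATE d
SET    d.row_version = s.row_version,
       d.some_column = s.some_column
FROM   dbo.Dest_Tbl AS d
INNER JOIN dbo.Staging_Tbl AS s
    ON s.id = d.id
WHERE  d.row_version <> s.row_version;

Each statement touches the whole batch at once, so the engine can take table or page locks for a short time instead of holding row locks for hours.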

finding duplicate records in tables between two oracle schemas

I have two database schemas containing millions of records (60-100 million records); let's assume student records.
The first schema is the staging schema, the second is the target prod schema.
I would like to check whether the same user in the staging schema already exists in the prod schema before I copy it over (if it exists, then apply some merge logic).
I have some PL/SQL code that runs sequentially and matches records, but the process is extremely slow, even with indexing and performance tuning.
Are there any matchers, or a way to multithread the PL/SQL function, that could be used? Is there a better alternative in Oracle that I might be missing?
One possible solution is to copy some of the data (the data participating in the duplication check) from the prod schema and perform the comparison in the staging schema, but the overhead of copying the data might be about the same as comparing it.
sample record:
Student_first_name, Student_Last_name, SSN
foo, bar, 123456
First, copying the data between schemas would not improve your performance; Oracle does not perform faster on intra-schema queries.
Second, using a single SQL statement to identify your duplicate records (or missing records, whichever is the smaller part of the table), and then running your PL/SQL code on those rows alone (by storing them in a cursor or flagging them with a dedicated column), may help greatly, especially if the amount of data added each day is negligible in comparison to the full prod table.
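A sketch of that single-SQL flagging step, assuming a hypothetical dup_flag column on the staging table and SSN as the matching key (adjust to whatever columns actually identify a student):

-- Flag, in one set-based pass, every staging row that already exists in prod,
-- so the PL/SQL merge logic only has to process the flagged subset.
UPDATE staging.student s
SET    s.dup_flag = 'Y'
WHERE  EXISTS (SELECT 1
               FROM   prod.student p
               WHERE  p.ssn = s.ssn);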

Tens of Millions inserts into an indexed table performance/strategy (Sql Server >= 2005)

I have to get data from many tables and combine them into a single one.
The final table will have about 120 million rows.
I'm planning to insert the rows in the exact order needed by the big table indexes.
My question is, in terms of performance:
Is it better to create the indexes of the new table from the start, or to do the inserts first and create the indexes at the end of the import?
Also, would it make a difference if, when building the indexes at the end, the rows are already sorted according to the index specifications?
I can't test both cases and get an objective comparison, since the database is on the main server, which is used for many other databases and applications that can be under heavy load or not at different moments in time. I can't restore the database to my local server either, since I don't have full access to the main server yet.
I suggest you copy the data in first and then create your indexes. If you insert records into a table that has indexes, SQL Server has to update the indexes for each insert; but when you create the indexes after inserting all the records, SQL Server doesn't need to maintain them during the load and builds each index only once.
You can use SSIS to copy the data from the source tables to the destination. SSIS uses bulk insert and has good performance. Also, if you have any triggers on the destination database, I suggest disabling them before starting your conversion.
When you create an index on your table, the rows are stored according to that index.
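A minimal sketch of the load-then-index sequence (table, column, and index names are placeholders):

-- 1. Bulk-load the ~120 million rows into dbo.BigTable with SSIS fast load,
--    with no indexes in place yet.
-- 2. Build the indexes once, after the load:
CREATE CLUSTERED INDEX IX_BigTable_Key
    ON dbo.BigTable (KeyColumn);

CREATE NONCLUSTERED INDEX IX_BigTable_Lookup
    ON dbo.BigTable (LookupColumn);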

Large Data Service Architecture

Everyday a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a windows service that runs early in the AM to read in the text file into our SQL Server 2005 DB tables. We don't do a BULK Insert because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table, which can be costly if it's tuned for selects. I believe with MERGE it's a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: it turns out you cannot MERGE in SQL Server 2005 (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again, I've seen a copy of the current database used in production to perform a long-running action (reporting and aggregation of data in this instance); however, this wasn't merged back in. I don't know what merging capabilities are available in SQL replication, but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update by joining onto this table to populate your main table. The difference now is that SQL is working with a set, so it can tune any index rebuilds accordingly; it should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (i.e., you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively, with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
-- Alias #Tempo so the NOT EXISTS check correlates the target row with the incoming row.
INSERT INTO MyTable (val1, val2, val3)
SELECT s.val1, s.val2, s.val3
FROM #Tempo AS s
WHERE NOT EXISTS
(
    SELECT *
    FROM MyTable t
    WHERE t.val1 = s.val1 AND t.val2 = s.val2 AND t.val3 = s.val3
)
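The matching set-based UPDATE against the temporary table could look like this (a sketch only; it assumes val1 is the business key and val2/val3 are the columns to refresh, which may not match the real schema):

UPDATE t
SET    t.val2 = s.val2,
       t.val3 = s.val3
FROM   MyTable AS t
INNER JOIN #Tempo AS s
    ON s.val1 = t.val1
WHERE  t.val2 <> s.val2 OR t.val3 <> s.val3;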
We do much larger imports than that all the time. Create an SSIS package to do the work. Personally, I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory before inserting, if you want.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.
