I followed this blog post to create an SSIS package for transferring data between 2 tables: http://radacad.com/insert-update-and-delete-destination-table-with-ssis. The rows to insert, delete, and update are detected with the following conditions:
Insert rows: not isnull(source_id) and isnull(dest_id)
Delete rows: isnull(source_id) and not isnull(dest_id)
Update rows: source_id = dest_id and source_row_version <> dest_row_version
It works well with a few records. However, when a lot of updated and deleted rows are detected (thousands or hundreds of thousands of rows), it runs very slowly, the destination table gets locked, and the job never finishes. Another issue is that this approach deletes and updates row by row.
Could you please help me overcome the table lock in this case? Is there any way to update or delete rows in batches instead of doing it row by row?
Use a staging area, and use Execute SQL tasks to run SQL or execute stored procedures to carry out bulk operations based on comparing your staging area to your destination.
I realise this isn't a very satisfying answer if you were hoping to do everything in SSIS, but blocking transformations (MERGE JOIN is semi-blocking) and row-by-row OLE DB Command transforms generally won't scale well to large amounts of data.
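For illustration, here is a rough sketch of what those Execute SQL statements could look like, using placeholder names - dbo.Staging for the staged source extract and dbo.Destination for the destination, keyed on id with a row_version column as in the question:

-- Batch delete: destination rows that no longer exist in the staged source extract
DELETE d
FROM dbo.Destination AS d
WHERE NOT EXISTS (SELECT 1 FROM dbo.Staging AS s WHERE s.id = d.id);

-- Batch update: rows whose row_version differs
UPDATE d
SET d.col1 = s.col1,                -- repeat for each non-key column
    d.row_version = s.row_version
FROM dbo.Destination AS d
INNER JOIN dbo.Staging AS s ON s.id = d.id
WHERE s.row_version <> d.row_version;

Each statement handles all affected rows in a single pass, so the engine works set-based instead of issuing one command per row.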
I want to perform a DELETE operation with a record limit in Snowflake DB. My requirement is to perform the DELETE in batches: first delete the top 1000 records, then the next 1000 records, and so on.
In SQL Server we can use a query like the one below:
DELETE TOP (1000) FROM TABLE_NAME;
I am looking for a similar query for Snowflake DB. I went through the Snowflake documentation for DELETE - https://docs.snowflake.com/en/sql-reference/sql/delete.html - but I didn't find matching syntax with TOP or LIMIT.
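To illustrate the batching pattern I mean on SQL Server - a rough sketch against the placeholder table:

DECLARE @rows int = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (1000) FROM TABLE_NAME;   -- delete one batch of up to 1000 rows
    SET @rows = @@ROWCOUNT;              -- loop ends once nothing is left to delete
END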
This might not be the answer you are looking for, but the reason you delete in chunks of 1000 on SQL Server is that SQL Server writes those deletes to the transaction log, deletes the records, updates indexes, etc., and it is more efficient to do things that way. In Snowflake, it is not.
Snowflake doesn't log deletes, doesn't have indexes, and in fact doesn't actually delete records; instead it recreates the micro-partitions that those records come from, minus the deleted records. So it is actually far less efficient to delete in smaller batches in Snowflake. You should instead simply issue a single DELETE statement with the records you wish to delete and let it do its thing.
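As a rough illustration (table and column names are made up), just express the full set of rows to remove in one DELETE - either with a plain filter or with a USING clause:

-- Delete everything matching a predicate in one statement
DELETE FROM sales WHERE sale_date < '2020-01-01';

-- Or delete all rows whose keys appear in another table
DELETE FROM sales
USING ids_to_delete
WHERE sales.id = ids_to_delete.id;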
Try this
DELETE FROM <TABLE>
WHERE <ID_COL> IN
(
    SELECT TOP 2 <ID_COL> FROM <TABLE>
);
I've created an SSIS package with a Script Component that pulls data from a JSON API and inserts it into a table in SQL Server. I've set up the logic to add new rows; however, I want to find the most appropriate way to delete/overwrite old rows. The data is fetched every 4 hours, so there's an overlap of approximately 1000 rows each time the package is run.
My first thought was to simply add a SQL Task after the Data Flow Task that deletes the duplicate rows (the ones with the smallest ID number). However, I was wondering how to do this inside the Data Flow Task. The API call fetches no more than 5000 rows each time, the destination table has around 1m rows, and the entire project runs in approx. 10 seconds.
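A rough sketch of what that SQL Task could run, with made-up names - dbo.ApiData as the destination table and BusinessKey as whatever identifies a logical record across loads:

WITH dupes AS
(
    SELECT ID,
           ROW_NUMBER() OVER (PARTITION BY BusinessKey ORDER BY ID DESC) AS rn
    FROM dbo.ApiData
)
DELETE FROM dupes   -- deletes the underlying duplicate rows with the smaller IDs
WHERE rn > 1;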
My simple Data Flow Task looks like this:
There are two main approaches you can try:
Run a Lookup on the row ID. If matched, run an OLE DB Command transformation for each row with an UPDATE statement. If not matched, direct the rows to an OLE DB Destination.
Easy to implement, with straightforward logic, but the multitude of UPDATE statements will create performance problems.
Create an intermediate table in the database, clean it before running the Data Flow Task, and store all rows from your Data Flow in this intermediate table. Then, in the next task, do either of the following:
MERGE the intermediate table with the main table. More info on MERGE.
In a transaction, delete rows from the main table which exist in the intermediate table, then do INSERT INTO <main table> SELECT ... FROM <intermediate table>.
I usually prefer the intermediate table approach with MERGE - it is performant, simple and flexible. The MERGE statement can have downsides when run in concurrent sessions or on clustered columnstore tables; in those cases I use the intermediate table with the DELETE...INSERT approach. Rough sketches of both options are below.
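All names here are hypothetical - dbo.ApiData as the main table keyed on RecordID, dbo.ApiData_Staging as the intermediate table loaded by the Data Flow:

-- Option 1: MERGE the intermediate table into the main table
MERGE dbo.ApiData AS tgt
USING dbo.ApiData_Staging AS src
    ON tgt.RecordID = src.RecordID
WHEN MATCHED THEN
    UPDATE SET tgt.Value = src.Value,
               tgt.LoadedAt = src.LoadedAt
WHEN NOT MATCHED BY TARGET THEN
    INSERT (RecordID, Value, LoadedAt)
    VALUES (src.RecordID, src.Value, src.LoadedAt);

-- Option 2: in a transaction, remove the overlapping rows and re-insert them
BEGIN TRANSACTION;

DELETE t
FROM dbo.ApiData AS t
WHERE EXISTS (SELECT 1 FROM dbo.ApiData_Staging AS s WHERE s.RecordID = t.RecordID);

INSERT INTO dbo.ApiData (RecordID, Value, LoadedAt)
SELECT RecordID, Value, LoadedAt
FROM dbo.ApiData_Staging;

COMMIT TRANSACTION;

The transaction around option 2 keeps readers from seeing the table after the overlapping rows have been deleted but before they have been re-inserted.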
So I figured out that the easiest solution in my case (where there are only relatively few rows to update) was to use the OLE DB Command component, as can be seen below.
In the component I added an Update SQL statement with logic such as the following
UPDATE [dbo].[table]
SET [value1] = ?,
    [value2] = ?
WHERE [value1] = ?
Then I mapped the parameters to their corresponding columns and made sure that my WHERE clause used the Lookup Match Output to update the correct rows. The component makes sure that the "Lookup Match Output" is updated using the columns I use in the Lookup component.
One Destination - All Merge Join Rows
Two Destinations - Fewer Merge Join Rows
Can anyone please explain this behavior for me?
I am generating a count field and then feeding that back into the main stream with the Merge Join, then performing a Conditional Split based on the count. It works fine without the update statement, but I get different results when I run it with an UPDATE statement fed from the Conditional Split. It may also be worth mentioning that there are no NULLs in the data, and both pictures are from the same file. Any thoughts are appreciated. Thanks.
When the OLE DB Command hasn't finished executing the current batch of rows, the preceding component (the Conditional Split) will not send more rows until it finishes processing, and so on. It also depends on the data flow's DefaultBufferSize and DefaultBufferMaxRows settings.
Read more on Data Flow Performance Features
I figured I'd update with what I learned. It appears that the issue with how many rows were loaded (49430 versus 52220) was due to the DefaultBufferSize and DefaultBufferMaxRows settings in SSIS. This did not improve performance, just how many records were loaded into memory.
As Martin suggested above, the delay in processing the update was due to inefficiency. For anyone wanting to know what a staging table is: it's just a generic term for a table you make in your database (or make in SSIS with a SQL command task), then you use a SQL command in SSIS to run an update statement against it. You can drop the staging table in an SSIS task after the update if you want. I cannot overstate how much of a performance increase this gives you for large updates.
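As a sketch of that pattern (all object names are made up): one Execute SQL Task creates the staging table, the Data Flow bulk-loads it, a second Execute SQL Task runs the set-based update, and an optional task drops the table again.

-- 1) Before the Data Flow: (re)create the staging table
IF OBJECT_ID('dbo.Stage_Orders') IS NOT NULL
    DROP TABLE dbo.Stage_Orders;
CREATE TABLE dbo.Stage_Orders (OrderID int PRIMARY KEY, Status varchar(20));

-- 2) After the Data Flow has bulk-loaded dbo.Stage_Orders, run the update in one statement
UPDATE o
SET o.Status = s.Status
FROM dbo.Orders AS o
INNER JOIN dbo.Stage_Orders AS s
    ON s.OrderID = o.OrderID;

-- 3) Optional cleanup
DROP TABLE dbo.Stage_Orders;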
I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes, because the table has more than 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
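The hashing idea, roughly: compute one MD5 value per row over the non-key columns so the comparison is a single column instead of every field. A sketch computed on the SQL side with made-up column names (the referenced article may compute the hash inside SSIS instead, and the delimiter/ISNULL scheme is just one way to build the input string):

SELECT
    CustomerID,
    HASHBYTES('MD5',
        CONCAT(ISNULL(FirstName, ''), '|',
               ISNULL(LastName, ''),  '|',
               ISNULL(Email, '')))    AS RowHash   -- one value to compare per row
FROM dbo.Customer;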
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records in our fact table that no longer exist in production. This task comes after all of the inserts and updates, as shown in my question above.
Then we can batch-delete the missing records in an Execute SQL Task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the IDs don't match, those rows will be redirected to the no-match output, which we set to go to our staging table. Then we join staging to the fact table and apply the deletions inside an Execute SQL Task, as shown below.
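The Execute SQL Task runs a single set-based delete along these lines (FactSales and staging.DeletedKeys are made-up names standing in for the fact table and the no-match staging table):

-- Remove fact rows whose keys were redirected to the staging table
DELETE f
FROM dbo.FactSales AS f
INNER JOIN staging.DeletedKeys AS d
    ON d.SalesKey = f.SalesKey;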
I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup.
That way you can see what's new/changed & deleted.
You'll need to sort both flows by the same joining key (in this case your hash column).
Personally, I'm not sure I'd bother; instead I'd simply stage all my prod data and then do a 3-way SQL MERGE statement to handle inserts, updates & deletes in one pass. You can keep your hash column as a joining key if you like.
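A rough sketch of such a 3-way MERGE, with made-up names (here joining on a business key and using the hash column only to detect changes):

MERGE dbo.FactInvoice AS tgt
USING stg.Invoice AS src
    ON tgt.InvoiceID = src.InvoiceID
WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN       -- changed rows
    UPDATE SET tgt.Amount  = src.Amount,
               tgt.RowHash = src.RowHash
WHEN NOT MATCHED BY TARGET THEN                        -- new rows
    INSERT (InvoiceID, Amount, RowHash)
    VALUES (src.InvoiceID, src.Amount, src.RowHash)
WHEN NOT MATCHED BY SOURCE THEN                        -- rows gone from prod
    DELETE;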
I need to do "insert/delete if not exists" for many very big tables from Server A to B. The lookup component doesn't work well because it issues one query for each row to check if the row exist in the destination database table.
What's the good option?
The tables all have PK but they may have the size of several hundred GB individually. The destination tables may only miss under 3% rows. So merge component may not be a good option?
You can use the Merge SSIS component. I personally found better performance by loading all the data into a staging table and running the MERGE T-SQL statement afterwards in a stored procedure.