One Destination - All Merge Join Rows
Two Destinations - Fewer Merge Join Rows
Can anyone please explain this behavior for me?
I am generating a count field, feeding it back into the main stream with a Merge Join, and then performing a Conditional Split based on the count. It works fine without the update statement, but I get different results when I run it with an update statement from the Conditional Split. It may also be worth mentioning that there are no nulls in the data, and both pictures show the same file. Any thoughts are appreciated. Thanks.
When the OLE DB Command has not finished executing the current batch of rows, its preceding component (the Conditional Split) will not send more rows until it finishes processing, and so on back up the flow. It also depends on the data flow's DefaultBufferSize and DefaultBufferMaxRows settings.
Read more on Data Flow Performance Features
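To make the buffer settings above a bit more concrete, here is a rough Python sketch of how the two properties interact. It is a simplified model, not the engine's exact algorithm: it assumes the documented defaults (10 MB for DefaultBufferSize, 10,000 for DefaultBufferMaxRows) and a hypothetical estimated row width that you would have to work out for your own data flow.

```python
# Simplified model of SSIS buffer sizing (an approximation for illustration,
# not the data flow engine's actual algorithm).
DEFAULT_BUFFER_SIZE = 10 * 1024 * 1024   # DefaultBufferSize default: 10 MB
DEFAULT_BUFFER_MAX_ROWS = 10_000         # DefaultBufferMaxRows default


def rows_per_buffer(estimated_row_bytes,
                    buffer_size=DEFAULT_BUFFER_SIZE,
                    max_rows=DEFAULT_BUFFER_MAX_ROWS):
    """Rows in one buffer: capped by both the row limit and the byte limit."""
    if estimated_row_bytes <= 0:
        raise ValueError("estimated_row_bytes must be positive")
    return max(1, min(max_rows, buffer_size // estimated_row_bytes))


def buffers_needed(total_rows, estimated_row_bytes):
    """How many buffers it takes to move total_rows through the data flow."""
    per_buffer = rows_per_buffer(estimated_row_bytes)
    return -(-total_rows // per_buffer)  # ceiling division


# Hypothetical row widths: narrow rows hit the row cap, wide rows hit the byte cap.
narrow = rows_per_buffer(200)
wide = rows_per_buffer(4096)
total_buffers = buffers_needed(52220, 200)
```

Rows travel downstream one buffer at a time, so a slow component at the end of the flow holds back whole buffers, which is why the row counts shown on upstream paths can lag while the package is still running.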
I figured I'd update with what I learned. It appears that the difference in how many rows were loaded (49430 versus 52220) was due to the DefaultBufferSize and DefaultBufferMaxRows settings in SSIS. Changing them did not improve performance; it only changed how many records were loaded into memory.
As Martin suggested above, the delay in processing the update was due to inefficiency. For anyone wanting to know what a staging table is: it's just a generic term for a table you make in your database (or create from SSIS with an Execute SQL Task) and load with the rows to apply; you then use an Execute SQL Task in SSIS to run a single update statement against it. You can drop the staging table in an SSIS task after the update if you want. I cannot overstate how much of a performance increase this gives you for large updates.
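For anyone who wants to see the staging idea in miniature, here is a small sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server so that it runs anywhere; the table and column names (target, staging, amount) are made up for illustration, and in SSIS the two steps would be a Data Flow into the staging table followed by an Execute SQL Task.

```python
import sqlite3


def bulk_update_via_staging(conn, changes):
    """Apply (id, new_amount) pairs with one set-based UPDATE, not one per row."""
    cur = conn.cursor()
    # Step 1: land the changed rows in a staging table (the Data Flow's job).
    cur.execute("CREATE TEMP TABLE staging (id INTEGER PRIMARY KEY, amount REAL)")
    cur.executemany("INSERT INTO staging (id, amount) VALUES (?, ?)", changes)
    # Step 2: a single UPDATE driven by the staging table (the Execute SQL Task's job).
    cur.execute("""
        UPDATE target
        SET amount = (SELECT s.amount FROM staging AS s WHERE s.id = target.id)
        WHERE id IN (SELECT id FROM staging)
    """)
    updated = cur.rowcount
    # Step 3: drop the staging table once the update is done.
    cur.execute("DROP TABLE staging")
    conn.commit()
    return updated


# Hypothetical data: three existing rows, two of which change.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany("INSERT INTO target VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 30.0)])
updated = bulk_update_via_staging(conn, [(1, 11.0), (3, 33.0)])
rows = conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall()
```

On SQL Server the UPDATE would more likely be written as an UPDATE with a join to the staging table, but the shape is the same: one statement for the whole batch.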
I have an SSIS package with a task to load data. For some reason I need to update and insert into the same destination table, and this causes a deadlock.
I use the SSIS Multicast transformation.
What should I do? How can I resolve this situation?
In your OLE DB Destination, change the access mode from "FastLoad" to "Table or View". The former takes a table lock, which is generally better for large inserts, but in your scenario you need the table to remain "unlocked." Your performance will suffer since you'll be issuing singleton inserts, but I guess that doesn't really matter since you'll also be doing singleton updates with your "OLE DB Command".
Finally, I think you're doing this wrong. The multicast essentially duplicates a row so that you can direct it to N components. I generally see people trying to detect whether a row exists in the target and then either insert or update it based on that lookup. But that's the lookup component, not a multicast. Maybe you're doing a type 2 dimension or something but even then, there will be better ways to accomplish this versus what you're showing in the picture.
Your approach seems strange. As billinkc said, you are effectively duplicating the data rows and performing INSERT and UPDATE actions against the same table concurrently from two different connections/contexts. This is bound to end in a deadlock.
I would use an alternative approach: do the required transforms on the data, and then write it to an intermediate table in the Data Flow. Then, in the next SSIS task, execute a SQL Server MERGE (the T-SQL table upsert) with OLE DB Command. This ensures you do not get a deadlock between concurrent operations, and the logic of the MERGE can be quite flexible.
Last but not least, use a dedicated table or a global ##temp table as the intermediate table; working with regular SQL Server #temp tables in SSIS is a little tricky. Do not forget to clean up the intermediate table before and after the MERGE, or to create and dispose of the ##temp table properly.
I followed this blog post to create an SSIS package for transferring data between two tables: http://radacad.com/insert-update-and-delete-destination-table-with-ssis.
Insert rows: not isnull(source_id) and isnull(dest_id)
Delete rows: isnull(source_id) and not isnull(dest_id)
Update rows: source_id = dest_id and source_row_version <> dest_row_version
It works well with few records. However, when a lot of update and delete rows are detected (thousands or hundreds of thousands of rows), it runs very slowly, the destination table gets locked, and the job never finishes. Another thing: this approach deletes and updates row by row.
Could you please help me overcome the table lock in this case? Is there any way to update or delete a batch of rows instead of going row by row?
Use a staging area, and use Execute SQL tasks to run SQL or execute stored procedures to carry out bulk operations based on comparing your staging area to your destination.
I realise this isn't a very satisfying answer if you were hoping to do everything in SSIS, but blocking transformations (MERGE JOIN is semi-blocking) and row-by-row OLE DB Command transforms generally won't scale well to large amounts of data.
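As a rough illustration of what "compare staging to destination in bulk" can look like, here is a sketch that applies the question's three conditions (insert, delete, update on a changed row version) as three set-based statements in one transaction. It uses Python's sqlite3 as a runnable stand-in for SQL Server, and the table and column names (dest, staging, payload, row_version) are assumptions; on SQL Server the same logic would live in an Execute SQL Task or a stored procedure.

```python
import sqlite3


def sync_from_staging(conn):
    """Bring dest in line with staging using three set-based statements."""
    cur = conn.cursor()
    # Delete rows: present in dest but no longer in the staged source.
    cur.execute("DELETE FROM dest WHERE id NOT IN (SELECT id FROM staging)")
    deleted = cur.rowcount
    # Update rows: ids match but the row version differs.
    cur.execute("""
        UPDATE dest
        SET payload = (SELECT s.payload FROM staging AS s WHERE s.id = dest.id),
            row_version = (SELECT s.row_version FROM staging AS s WHERE s.id = dest.id)
        WHERE EXISTS (SELECT 1 FROM staging AS s
                      WHERE s.id = dest.id AND s.row_version <> dest.row_version)
    """)
    updated = cur.rowcount
    # Insert rows: present in staging but not yet in dest.
    cur.execute("""
        INSERT INTO dest (id, payload, row_version)
        SELECT s.id, s.payload, s.row_version
        FROM staging AS s
        WHERE NOT EXISTS (SELECT 1 FROM dest AS d WHERE d.id = s.id)
    """)
    inserted = cur.rowcount
    conn.commit()  # all three changes become visible together
    return inserted, updated, deleted


# Hypothetical data: id 2 changed, id 3 was removed at the source, id 4 is new.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE dest (id INTEGER PRIMARY KEY, payload TEXT, row_version INTEGER)")
conn.execute("CREATE TABLE staging (id INTEGER PRIMARY KEY, payload TEXT, row_version INTEGER)")
conn.executemany("INSERT INTO dest VALUES (?, ?, ?)",
                 [(1, "a", 1), (2, "b", 1), (3, "c", 1)])
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)",
                 [(1, "a", 1), (2, "B", 2), (4, "d", 1)])
counts = sync_from_staging(conn)
final_rows = conn.execute("SELECT id, payload, row_version FROM dest ORDER BY id").fetchall()
```

Because every statement touches many rows at once and they all run on one connection, there is no second writer to deadlock against, and the number of round trips no longer grows with the row count.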
I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the IDs don't match, those rows will be redirected to the no match output, which we set to go to our staging table. Then we join staging to the fact table and apply the deletions as shown below inside an Execute SQL Task.
I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup.
That way you can see what's new, changed, and deleted.
You'll need to sort both flows by the same joining key (in this case your hash column).
Personally, I'm not sure I'd bother; instead, I'd simply stage all my prod data and then use a three-way SQL MERGE statement to handle inserts, updates, and deletes in one pass. You can keep your hash column as a joining key if you like.
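To show why a merge join over two sorted flows exposes deletes as well as inserts and updates, here is a small Python sketch of the comparison. One assumption to flag: it joins on a hypothetical business key and compares the hash as a second column, which is one way to tell "changed" apart from "new"; the function and variable names are made up for illustration.

```python
def classify_changes(source_rows, dest_rows):
    """Walk two lists of (key, row_hash) tuples, both sorted by key,
    the way a full outer merge join would, and bucket each key."""
    inserts, updates, deletes = [], [], []
    i = j = 0
    while i < len(source_rows) and j < len(dest_rows):
        s_key, s_hash = source_rows[i]
        d_key, d_hash = dest_rows[j]
        if s_key == d_key:
            if s_hash != d_hash:          # same key, different content
                updates.append(s_key)
            i += 1
            j += 1
        elif s_key < d_key:               # only in source: new row
            inserts.append(s_key)
            i += 1
        else:                             # only in destination: deleted at source
            deletes.append(d_key)
            j += 1
    # Whatever is left on either side has no partner on the other.
    inserts.extend(key for key, _ in source_rows[i:])
    deletes.extend(key for key, _ in dest_rows[j:])
    return inserts, updates, deletes


# Hypothetical data, already sorted by key.
source = [(1, "h1"), (2, "h2-changed"), (4, "h4")]
dest = [(1, "h1"), (2, "h2"), (3, "h3")]
result = classify_changes(source, dest)
```

Both inputs must be sorted on the join key for the single pass to work, which is the same requirement the Merge Join transformation places on its inputs.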
I need to perform a task on a table that has 19 columns of the text data type. I want to remove these columns from the source table and move them to a new table as varchar(max). The source table currently has 30k rows (with text data), and this will grow as the client uses the database for record storage. To transfer the old data I tried an "insert into..select.." query, but it takes around 25-30 minutes to transfer that many rows (30k). The same is true of a "Select from..insert.." query. I have also tried an SSIS data flow task with OLE DB as both source and destination, but it takes the same amount of time. I'm really confused, as all the posts on the internet suggest that SSIS is the fastest way to transfer data. Can you please suggest a better way to improve the performance of the data transfer, using any technique?
Thanks
SSIS probably isn't faster if the source and the destination are in the same database and the SSIS process is on the same box.
One approach might be to figure out where you are spending the time and optimise that. If you set Management Studio to "discard results after execution" and run just the select part of your query, how long does that take? If this is a substantial part of the 25-30 minutes then work on optimising that.
If the select statement turns out to be really fast, then all the time is being spent on the insert and you need to look at improving that part of the process instead. There are a couple of things you can try here before you go hardware shopping: are there any indexes or constraints (or triggers!) on the target table that you can drop for the duration of the insert and put back again at the end? Can you switch the database to the simple recovery model?
Apologies if I am enraging the forum with a repetitive question. I couldn't find the right solution in the forum, hence this post.
I need to fetch 129991763 rows into a cursor, temp table, or staging table quickly and process them into another table. The destination table is also huge.
Currently I am using INSERT with a SELECT statement (the SELECT is nested 4 levels deep), with hints such as OPTION (FAST 1000), MAXDOP 1, RECOMPILE, etc.
The procedure takes a lot of time and either shows no results or does not complete at all.
Previously I used a cursor with the same hints, but as that also ran for more than 22 hours, I switched to INSERT with SELECT.
I literally had to stop the execution of both of the above methods.
And to be honest, I am a beginner with SQL Server.
Even if I specifically filter the records in the SELECT based on criteria, the process still needs to be broken into 4 or 5 chunks, and each of these chunks also takes more than 4-5 hours to complete.
Please help.
Thanks
Pradyumna
In the past I've used BULK INSERT with reasonable success, but I suspect the suggestion of breaking it into chunks and dropping indexes would still be wise. You can find some details on it here
https://msdn.microsoft.com/en-GB/library/ms188365.aspx
Hope it helps, good luck.
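To illustrate the "break it into chunks" suggestion, here is a minimal sketch that copies a table in key ranges and commits after each range, so no single transaction has to cover the whole load. It uses Python's sqlite3 as a runnable stand-in; the table names (src, dst), the integer id key, and the chunk size are all assumptions, and on SQL Server the inner query would use TOP rather than LIMIT.

```python
import sqlite3


def copy_in_chunks(conn, chunk_size):
    """Copy src into dst one key range at a time, committing per chunk."""
    last_id = 0   # assumes positive integer ids
    chunks = 0
    while True:
        # Find the upper key bound of the next chunk.
        upper = conn.execute(
            "SELECT MAX(id) FROM "
            "(SELECT id FROM src WHERE id > ? ORDER BY id LIMIT ?)",
            (last_id, chunk_size),
        ).fetchone()[0]
        if upper is None:      # nothing left to copy
            break
        conn.execute(
            "INSERT INTO dst (id, val) "
            "SELECT id, val FROM src WHERE id > ? AND id <= ?",
            (last_id, upper),
        )
        conn.commit()          # keep each transaction small
        last_id = upper
        chunks += 1
    return chunks


# Hypothetical data: ten rows copied four at a time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, val TEXT)")
conn.execute("CREATE TABLE dst (id INTEGER PRIMARY KEY, val TEXT)")
conn.executemany("INSERT INTO src VALUES (?, ?)",
                 [(n, "row %d" % n) for n in range(1, 11)])
chunk_count = copy_in_chunks(conn, 4)
copied = conn.execute("SELECT COUNT(*) FROM dst").fetchone()[0]
```

Ranging on an indexed key means each chunk can seek straight to its starting point, and if the job is stopped partway, it can resume from the last committed key instead of starting over.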
Apologies, you will probably be best off using an SSIS package to pull it across. With this you can also transform the data if needed. I would still recommend keeping indexes off the table you are inserting into where possible. You'll need to do a bit of reading, as it's hard to explain here because so much of it is done through the GUI.
Good luck