Table A is imported from an Excel file into SQL Server. A series of UPDATE and DELETE operations is then performed on the table to replace certain columns with certain values. I have written a stored procedure for this, but since I'm beginning to use SSIS I need to know whether this can be accomplished via an Execute SQL Task or by using any other transformation.
Yes, it is achievable with either the Slowly Changing Dimension component or the Lookup component, but both of these require a RBAR (row by agonising row) update to be executed. If you have a small amount of data (< 50,000 records) this is probably OK, but it just doesn't work for larger datasets.
When using the Lookup or SCD components, the data flow is split into an insert stream and an update stream. The insert stream is reasonably fast, but the update stream has to be fed into an OLE DB Command transformation, which performs an update one data row at a time.
This is in contrast to the ELT approach, where you execute a single UPDATE statement that applies all updates in one batch. This is generally much faster.
If you decide not to use the SSIS RBAR approach, you can stage your Excel data and then use an Execute SQL Task (in the control flow) to call your existing stored procedure. That way a batch update is performed rather than an RBAR one.
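For comparison, here is a minimal sketch of such a set-based statement, assuming a hypothetical staging table dbo.StagingExcel and target table dbo.TableA joined on an Id key column:

-- One statement updates every matching row in a single batch
UPDATE t
SET    t.Col1 = s.Col1,
       t.Col2 = s.Col2
FROM   dbo.TableA AS t
JOIN   dbo.StagingExcel AS s
    ON s.Id = t.Id;

-- Optional: remove target rows no longer present in the staged data
DELETE t
FROM   dbo.TableA AS t
WHERE  NOT EXISTS (SELECT 1 FROM dbo.StagingExcel AS s WHERE s.Id = t.Id);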
I've created an SSIS package with a Script Component that calls data from a JSON API and inserts it into a table in SQL Server. I've set up the logic to add new rows; however, I want to find the most appropriate way to delete/overwrite old rows. The data is fetched every 4 hours, so there's an overlap of approximately 1,000 rows each time the package is run.
My first thought was to simply add an Execute SQL Task after the Data Flow Task that deletes the duplicate rows (those with the smallest ID number). However, I was wondering how to do this inside the Data Flow Task. The API call fetches no more than 5,000 rows each time, the destination table has around 1m rows, and the entire project runs in approx. 10 seconds.
My simple Data Flow Task looks like this:
There are two main approaches you can try:
Run a Lookup on the row ID. If matched, run an OLE DB Command transformation for each row with an UPDATE statement. If not matched, direct the rows to the OLE DB Destination.
Easy to implement, with straightforward logic, but the multitude of UPDATE statements will create performance problems.
Create an intermediate table in the DB, clean it before running the Data Flow Task, and store all rows from your Data Flow into this intermediate table. Then in the next task, do either of the following:
MERGE the intermediate table with the main table. More info on MERGE.
In a transaction, delete the rows from the main table which exist in the intermediate table, then do INSERT INTO <main table> SELECT ... FROM <intermediate table>.
I usually prefer the intermediate table approach with MERGE - performant, simple and flexible. The MERGE statement can have unwanted side effects when run in concurrent sessions or on clustered columnstore tables, in which case I use the intermediate table with the DELETE...INSERT approach.
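As a minimal sketch of the MERGE option, assuming a hypothetical intermediate table dbo.Staging and main table dbo.Main keyed on Id:

-- Upsert the staged rows into the main table in one set-based statement
MERGE dbo.Main AS tgt
USING dbo.Staging AS src
    ON tgt.Id = src.Id
WHEN MATCHED THEN
    UPDATE SET tgt.Value1 = src.Value1,
               tgt.Value2 = src.Value2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Value1, Value2)
    VALUES (src.Id, src.Value1, src.Value2);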
So I figured out that the easiest solution in my case (where there are only relatively few rows to update) was to use the OLE DB Command component, as can be seen below.
In the component I added an UPDATE SQL statement with logic such as the following:
UPDATE [dbo].[table]
SET [value1] = ?,
    [value2] = ?,
    [value3] = ?
WHERE [value1] = ?
Then I mapped the parameters to their corresponding columns and made sure that my WHERE clause used the Lookup match output, so that the correct rows are updated using the columns I matched on in the Lookup component.
I have an SSIS package with a task to load data. For some reason I need to both update and insert into the same destination table. This causes a deadlock.
I use the SSIS Multicast transformation.
What should I do? How can I resolve this situation?
In your OLE DB Destination, change the access mode from "Fast Load" to "Table or View". The former takes a table lock, which is generally better for large inserts, but in your scenario you need the table to remain "unlocked." Your performance will suffer since you'll be issuing singleton inserts, but I guess that doesn't really matter since you'll also be doing singleton updates with your OLE DB Command.
Finally, I think you're doing this wrong. The multicast essentially duplicates a row so that you can direct it to N components. I generally see people trying to detect whether a row exists in the target and then either insert or update it based on that lookup. But that's the lookup component, not a multicast. Maybe you're doing a type 2 dimension or something but even then, there will be better ways to accomplish this versus what you're showing in the picture.
Your approach seems strange; as billinkc said, you are effectively doubling the data rows and performing INSERT and UPDATE actions on the same table concurrently from two different connections/contexts. That is bound to end in a deadlock.
I would use an alternative approach: do the required transforms on the data, then write it to an intermediate table in the Data Flow. Then, in the next SSIS task, execute an MS SQL MERGE (Microsoft's table upsert) in an Execute SQL Task. This ensures you do not have a deadlock between concurrent operations, and the logic of the MERGE can be quite flexible.
Last but not least, use a dedicated staging table or a global ##temp table as the intermediate table; working with regular MS SQL #temp tables in SSIS is a little tricky. Do not forget to clean up the intermediate table before and after the MERGE, or create and dispose of the ##temp table properly.
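A rough sketch of that housekeeping, assuming a hypothetical global temp table ##Staging:

-- Before the Data Flow: (re)create the intermediate table
IF OBJECT_ID('tempdb..##Staging') IS NOT NULL
    DROP TABLE ##Staging;
CREATE TABLE ##Staging (Id INT PRIMARY KEY, Value1 NVARCHAR(100));

-- ...the Data Flow writes into ##Staging and the MERGE runs here...

-- After the MERGE: dispose of the intermediate table
DROP TABLE ##Staging;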
I have an SSIS job, and a relatively complex select, that use the same data. I have to make it so that my client doesn't have to call them separately, but use one thing to get the result of the select and call the job.
My original plan was to create a procedure which would take the necessary input and then output a table variable with the SELECT result.
However, after reading the Microsoft documentation, I found out that table variables might not be able to hold a result with more than 100 rows, while I might want to select ~10,000 rows. And now I'm stumped. What is the best way to call a job and select data from one component?
I have permissions to create views, procedures, and I can edit the SSIS job. The user will provide me with 2 parameters.
This is how I would suggest you do it in this scenario, to take the complexity away from SSIS.
Create the SP that you wanted to, but instead of a table variable, push your output into a table. This table can be added on the fly (dynamically, using a CREATE TABLE script) or can always exist in the DB as a buffer.
Call this SP in your control flow.
In the Data flow task, select from this buffer table.
After completing the SSIS work, flush the buffer table, i.e. truncate the table.
Caveat: you may face problems in concurrency scenarios. To eliminate that, have a column such as BatchID or BatchStartTimeStamp that stores a unique value for each run, as in the sketch below.
You can pass the value for BatchID or BatchStartTimeStamp from the SSIS package.
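A minimal sketch of this buffer-table pattern, assuming a hypothetical buffer table dbo.SelectBuffer, a source dbo.SomeSource, and a stored procedure dbo.usp_FillBuffer that takes the two user parameters plus a batch identifier:

-- Buffer table that holds the SELECT result between the SP call and the Data Flow
CREATE TABLE dbo.SelectBuffer
(
    BatchID UNIQUEIDENTIFIER NOT NULL,
    Col1    INT              NOT NULL,
    Col2    NVARCHAR(100)    NULL
);
GO

CREATE PROCEDURE dbo.usp_FillBuffer
    @Param1  INT,
    @Param2  NVARCHAR(100),
    @BatchID UNIQUEIDENTIFIER
AS
BEGIN
    SET NOCOUNT ON;
    -- The "relatively complex select" goes here
    INSERT INTO dbo.SelectBuffer (BatchID, Col1, Col2)
    SELECT @BatchID, t.Col1, t.Col2
    FROM   dbo.SomeSource AS t
    WHERE  t.Col1 = @Param1
      AND  t.Col2 = @Param2;
END;
GO

-- Data Flow source query (BatchID supplied by the SSIS package):
-- SELECT Col1, Col2 FROM dbo.SelectBuffer WHERE BatchID = ?;

-- Flush after the package finishes:
-- DELETE FROM dbo.SelectBuffer WHERE BatchID = ?;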
I am using Change Data Capture to capture data changes from a software application. I am trying to generate the SQL statements (insert, update, delete) from the captured data.
Is there any proper way to get this done?
The approach I have worked with is: get all the change records from the CDC tables along with the action (update/delete/insert), and pass the batch of records to a stored procedure that accepts a table type as an input parameter. In the stored procedure you can basically write a cursor / group-by-action logic to perform the operations on the destination table. This way you don't need to generate dynamic SQL queries and run them on the database, and we have found it much more efficient than generating dynamic SQL and running it on the DB.
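A rough sketch of that shape, assuming a hypothetical table type dbo.ChangeRowType and destination table dbo.Destination, with a simple action code passed along with each change record; this shows the group-by-action variant rather than a cursor:

CREATE TYPE dbo.ChangeRowType AS TABLE
(
    Action CHAR(1)       NOT NULL,  -- 'I' = insert, 'U' = update, 'D' = delete
    Id     INT           NOT NULL,
    Value1 NVARCHAR(100) NULL
);
GO

CREATE PROCEDURE dbo.usp_ApplyChanges
    @Changes dbo.ChangeRowType READONLY
AS
BEGIN
    SET NOCOUNT ON;

    -- Apply deletes
    DELETE d
    FROM   dbo.Destination AS d
    JOIN   @Changes AS c ON c.Id = d.Id
    WHERE  c.Action = 'D';

    -- Apply updates
    UPDATE d
    SET    d.Value1 = c.Value1
    FROM   dbo.Destination AS d
    JOIN   @Changes AS c ON c.Id = d.Id
    WHERE  c.Action = 'U';

    -- Apply inserts
    INSERT INTO dbo.Destination (Id, Value1)
    SELECT c.Id, c.Value1
    FROM   @Changes AS c
    WHERE  c.Action = 'I';
END;
GO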
I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the IDs don't match, those rows will be redirected to the no match output, which we set to go to our staging table. Then we join staging to the fact table and apply the deletions, as shown below, inside an Execute SQL Task.
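A minimal sketch of such a batch delete, assuming a hypothetical staging table dbo.DeletedKeysStaging and fact table dbo.FactTable keyed on Id:

-- Delete fact rows whose keys landed in the no-match staging table
DELETE f
FROM   dbo.FactTable AS f
JOIN   dbo.DeletedKeysStaging AS s
    ON s.Id = f.Id;

-- Clear the staging table for the next run
TRUNCATE TABLE dbo.DeletedKeysStaging;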
I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup.
That way you can see what's new, changed and deleted.
You'll need to sort both flows by the same joining key (in this case your hash column).
Personally, I'm not sure I'd bother; instead I'd simply stage all my prod data and then do a three-way SQL MERGE statement to handle inserts, updates and deletes in one pass. You can keep your hash column as a joining key if you like.
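A rough sketch of that three-way statement, assuming hypothetical tables dbo.StagedProd and dbo.FactTable keyed on Id, with a HashValue column used to detect changed rows:

MERGE dbo.FactTable AS tgt
USING dbo.StagedProd AS src
    ON tgt.Id = src.Id
WHEN MATCHED AND tgt.HashValue <> src.HashValue THEN
    UPDATE SET tgt.Value1    = src.Value1,
               tgt.HashValue = src.HashValue
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Value1, HashValue)
    VALUES (src.Id, src.Value1, src.HashValue)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;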