I've created a SSIS package with a script component that calls data from a JSON API and inserts it into a table in SQL Server. I've setup the logic to add new rows, however I want to find the most appropriate way to delete/overwrite old rows. The data is fetched every 4 hours, so there's an overlap of approximately 1000 rows each time the package is run.
My first thought was to simply add a SQL Task after the Data Flow Task that deletes the duplicate rows (with the smallest ID number). However, I was wondering how to do this inside the Data Flow Task? The API call fetches no more than 5000 rows each time, the destination table has around 1m rows, and the entire project runs in approx. 10 seconds.
My simple Data Flow Task looks like this:
There are two main approaches you can try:
Run Lookup on Row ID. If matched run OLEDB Command Transformation for each line with an UPDATE statement. If not matched - direct rows to OLE DB destination.
Easy to implement, straightforward logic, but multitude of UPDATE statements will create performance problems.
Create an intermediate table in DB, clean it before running Data Flow Task, and store all rows in your Data Flow into this intermediate table. Then on the next task - do either of following:
MERGE intermediate table with the main table. More info on MERGE.
In transaction - drop rows from the main table which exists on the intermediate, then do INSERT INTO <main table> SELECT ... FROM <intermediate table>
I usually prefer the intermediate table approach with MERGE - performant, simple and flexible. MERGE statement can have downside effects when run in concurrent sessions or on clustered columnstore tables, then I use the intermediate table with DELETE...INSERT command
So I figured out that the easiest solution in my case (the case where there's only relatively few rows to update) was to use the OLE DB Component as can be seen below.
In the component I added an Update SQL statement with logic such as the following
UPDATE [dbo].[table]
SET [value1]=?,
[value2]=?,
[value2]=?,
WHERE [value1]=?
Then I mapped the parameters to their corresponding columns, and made sure that my where clause used the lookup match output to update the correct rows. The component makes sure that the "Lookup Match Output" is updated using the columns I use in the Lookup component.
Related
I have a SSIS package with a task to load data. For some reason i need to update and insert same destination table. This happen deadlock
I use SSIS MULTI-CAST control.
What to do? how to resolve this situation?
In your OLE DB Destination, change the access mode from "FastLoad" to "Table or View". The former will take a table lock which is generally better for large inserts but in your scenario, you need the table to remain "unlocked." Your performance will suffer since you'll be issuing singleton inserts but I guess that doesn't really matter since you'll also be doing singleton updates with your "OLE DB Command"
Finally, I think you're doing this wrong. The multicast essentially duplicates a row so that you can direct it to N components. I generally see people trying to detect whether a row exists in the target and then either insert or update it based on that lookup. But that's the lookup component, not a multicast. Maybe you're doing a type 2 dimension or something but even then, there will be better ways to accomplish this versus what you're showing in the picture.
Your way seems strange, as billinkc said, you are effectively double data rows and perform INSERT and UPDATE actions with the same table concurrently from two different connections/contexts. This have to end in a deadlock.
I would use alternative approach - do required transforms with the data, and then write it to an intermediate table in the DataFlow. Then on the next SSIS task - execute MS SQL MERGE - Microsoft table upsert - with OLE DB Command. This will assure you do not have a deadlock between concurrent operations, logic of the MERGE could be quite flexible.
Last but not the least - use dedicated or global ##temp table for an intermediate table, Working with regular MS SQL #temp tables in SSIS is little tricky. Do not forget to clean up intermediate before and after MERGE, or create and dispose of ##temp table properly.
Need to loading a flat file with an SSIS Package executed in a scheduled job in SQL Server 2016 but it's taking TOO MUCH TIME (like 2/3 hours) just to load data in source then it’s need extra (2/3 hours) time for sort and filter then need similar time to load data in target, the file just has like million rows and it’s not less than 3 GB file approximately. This is driving me crazy, because is affecting the performance of my server.
SSIS package: -My package is just a Data Flow Task that has a Flat File Source and an OLE DB Destination, that’s all -The Data Access Mode is set to FAST LOAD. -Just have 1 indexes in the table. My destination table has 32 columns
Input file:
Input text file has more than 32 columns, surrogate key data may not unique , referenced columns date may not unique , Need to filter them.
Face two problems one is SSIS FlatFile-Source take huge time to load date another one is sort and filter. What to do?
If you want it to run fast use this pattern:
Load the data exactly as-is into a staging table
Optionally add indexes to the staging table afterwards
Use SQL to perform whatever processing you need (i.e. SELECT DISTINCT, GROUP BY into the final table)
You can do this kind of thing in SSIS but you need to tune it properly etc. it's just easier to do it inside a database which is already well optimised for this
Some Suggestions
1. DROP and Recreate indexes
Add 2 Execute SQL Task; before and after the Data Flow Task, the first drop the index and the second recreate it after that the Data Flow Task is executed successfully
2. Adjust the buffer size
You can read more about buffer sizing in the following Technet article
3. Remove duplicates in SQL server instead of Sort Components
Try to remove the Sort components, and add an Execute SQL Task after the Data Flow Task which run a similar query:
;WITH x AS
(
SELECT col1, col2, col3, rn = ROW_NUMBER() OVER
(PARTITION BY col1, col2, col3 ORDER BY id)
FROM dbo.tbl
)
DELETE x WHERE rn > 1;
4. Use a script component instead of Sort
You have to implement a similar logic to the following answer:
SSIS: Flat File Source to SQL without Duplicate Rows
Some helpful Links
Integration Services: Performance Tuning Techniques
Data Flow Performance Features
Can I delete database duplicates based on multiple columns?
Removing duplicate rows (based on values from multiple columns) from SQL table
Deleting duplicates based on multiple columns
Remove duplicates based on two columns
I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the ID's don't match those rows will be redirected to the no match output which we set to go to our staging table. Then, we will join staging to the fact table and apply the deletions as shown below inside an execute SQL task.
I think you'll need to adopt you dataflow to use a merge join instead of a lookup.
That way you can see whats new/changed & deleted.
You'll need to sort both Flows by the same joining key (in this case your hash column).
Personally i'm not sure I'd bother and Instead I'd simply stage all my prod data and then do a 3-way SQL merge statement to handle Inserts updates & deletes in one pass. You can keep your hash column as a joining key if you like.
Table A is imported from Excel file into SQL Server. There is serious of Update and Delete operations performed on the Table to replace certain columns with certain values. I have written a Proc for this but Since I'm beginning to use SSIS I need to know if this can be accomplished via Execute SQL Task or by using any other transformation.
Yes it is achievable by either the slowly changing dimension component, or lookup components but both of the these require a RBAR (row by agonising row) update to be executed. If you have a small amount of data < 50,000 records) this is probably OK but it just doesn't work for larger datasets
When using the lookup or SCD components, the dataflow is split into an insert and and update stream. The insert stream is reasonably fast, but the update stream needs to be fed into an execute SQL task, which performs an update one data row at a time.
This is in contrast to the ELT method where you execute one UPDATE statement which performs all updates in one batch. This is generally much faster.
If you decide not to use the SSIS RBAR approach, you can stage your Excel data then use an execute SQL task (in the control flow) to call your existing SP. In this way a batch update can be performed rather than an RBAR one.
Everyday a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a windows service that runs early in the AM to read in the text file into our SQL Server 2005 DB tables. We don't do a BULK Insert because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table - with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL Merge command to merge it into the main table. Merge will perform better than individual inserts/updates.
Doing it individually, you will incur maintenance of the indexes on the table - can be costly if its tuned for selects. I believe with merge its a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: turns out you cannot merge (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again I've seen in production a copy of the current database used to perform a long running action (reporting and aggregation of data in this instance), however this wasn't merged back in. I don't know what merging capabilities are available in SQL Replication - but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, then this will allow you to solve the issue stopping you from bulk inserting (ie, you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT val1, val2, val3 FROM #Tempo
WHERE NOT EXISTS
(
SELECT *
FROM MyTable t
WHERE t.val1 = val1 AND t.val2 = val2 AND t.val3 = val3
)
We do much larger imports than that all the time. Create an SSIS pacakge to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.