How to resolve a deadlock when inserting into and updating the same table - sql-server

I have an SSIS package with a task that loads data. For some reason I need to both update and insert into the same destination table, and this causes a deadlock.
I use the SSIS Multicast transformation.
What should I do? How can I resolve this situation?

In your OLE DB Destination, change the access mode from "FastLoad" to "Table or View". The former takes a table lock, which is generally better for large inserts, but in your scenario you need the table to remain "unlocked." Performance will suffer because you'll be issuing singleton inserts, but I guess that doesn't really matter since you'll also be doing singleton updates with your "OLE DB Command".
Finally, I think you're doing this wrong. The multicast essentially duplicates a row so that you can direct it to N components. I generally see people trying to detect whether a row exists in the target and then either insert or update it based on that lookup. But that's the lookup component, not a multicast. Maybe you're doing a type 2 dimension or something but even then, there will be better ways to accomplish this versus what you're showing in the picture.

Your approach seems strange. As billinkc said, you are effectively doubling the data rows and performing INSERT and UPDATE actions against the same table concurrently from two different connections/contexts. This is bound to end in a deadlock.
I would use an alternative approach: do the required transforms on the data, then write it to an intermediate table in the Data Flow. Then, in the next SSIS task, execute an MS SQL MERGE (the Microsoft table upsert) with OLE DB Command. This ensures you do not get a deadlock between concurrent operations, and the logic of the MERGE can be quite flexible.
Last but not least, use a dedicated table or a global ##temp table as the intermediate table; working with regular MS SQL #temp tables in SSIS is a little tricky. Do not forget to clean up the intermediate table before and after the MERGE, or to create and dispose of the ##temp table properly.
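A minimal sketch of the staging-then-MERGE idea described above, assuming SQL Server 2008 or later. The table and column names (dbo.Customer, dbo.Customer_Staging, CustomerId, Name, City) are hypothetical, and the change test assumes the compared columns are NOT NULL:

```sql
-- Hypothetical names: dbo.Customer_Staging is filled by the Data Flow,
-- dbo.Customer is the real destination table.
MERGE dbo.Customer AS tgt
USING dbo.Customer_Staging AS src
    ON tgt.CustomerId = src.CustomerId
WHEN MATCHED AND (tgt.Name <> src.Name OR tgt.City <> src.City) THEN
    -- Row exists and something changed: update it in place.
    UPDATE SET tgt.Name = src.Name,
               tgt.City = src.City
WHEN NOT MATCHED BY TARGET THEN
    -- Row is new: insert it.
    INSERT (CustomerId, Name, City)
    VALUES (src.CustomerId, src.Name, src.City);

-- Clean up the intermediate table once the MERGE has finished.
TRUNCATE TABLE dbo.Customer_Staging;
```

Because the insert and the update happen inside one statement on one connection, there is no second writer on the destination table to deadlock with.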

Related

Prevent deadlock in read-committed SELECT

I am extracting data from a business system supplied by a third party to use in reporting. I am using a single SELECT statement issued from an SSIS data flow task source component that joins across multiple tables in the source system to create the dataset I want. We are using the default read-committed isolation level.
To my surprise I regularly find this extraction query is deadlocking and being selected as the victim. I didn't think a SELECT in a read-committed transaction could do this, but according to this SO answer it is possible: Can a readcommitted isolation level ever result in a deadlock (Sql Server)?
Through the use of trace flags 1204 and 1222 I've identified the conflicting statement, and the object and index in question. Essentially, the contention is over a data page in the primary key of one of the tables. I need to extract from this table using a join on its key (so I'm taking out an S lock), while the conflicting statement is performing an INSERT and is requesting an IX lock on the index data page.
(Side note: the above SO talks about this issue occurring with non-clustered indexes, but this appears to be occurring in the clustered PK. At least, that is what I believe based on my interpretation of the deadlock information in the event log and the "associatedObjectId" property.)
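For anyone reproducing the diagnosis, this is roughly how the deadlock trace flags (1204 and 1222) can be switched on. It is only a sketch and assumes you have sysadmin rights on the instance:

```sql
-- Write deadlock details to the SQL Server error log for all sessions (-1 = global).
DBCC TRACEON (1204, 1222, -1);

-- Check which trace flags are currently active.
DBCC TRACESTATUS (-1);

-- Switch the flags off again once enough deadlocks have been captured.
DBCC TRACEOFF (1204, 1222, -1);
```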
Here are my constraints:
The conflicting statement is in an encrypted stored procedure supplied by a third party as part of off-the-shelf software. There is no possibility of getting the plaintext code or having it changed.
I don't want to use dirty-reads as I need my extracted data to maintain its integrity.
It's not clear to me how or if restructuring my extract query could prevent this. The lock is on the PK of the table I'm most interested in, and I can't see any alternatives to using the PK.
I don't mind my extract query being the victim as I prefer this over interrupting the operational use of the source system. However, this does cause the SSIS execution to fail, so if it must be this way I'd like a cleaner, more graceful way to handle this situation.
Can anyone suggest ways to, preferably, prevent the deadlock or, if not, handle the error better?
My assumption here is that you are attempting to INSERT into the same table that you are SELECTing from. If not, then a screenshot of the data flow tab would be helpful in determining the problem. If so, then you're in luck: I have had this problem before.
Add a sort to the data flow as this is a fully blocking transformation (see below regarding blocking transformations). What this means is that the SELECT will be required to complete loading all data into the pipeline buffer before any data is allowed to pass down to the destination. Otherwise, SSIS is attempting to INSERT data while there is a lock on the table/index. You might be able to get creative with your indexing strategies here (I have not tried this). But, a fully blocking transformation will do the trick and eliminates the need for any additional indexes to the table (and the overhead that entails).
Note: never use NOLOCK query hints when selecting data from a table as an attempt to get around this. I have never tried this nor do I intend to. You (the royal you) run the risk of ingesting uncommitted data into your ETL.
Reference:
https://jorgklein.com/2008/02/28/ssis-non-blocking-semi-blocking-and-fully-blocking-components/

SSIS - Fast way of detecting true deletes and updating data warehouse

I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the IDs don't match, those rows are redirected to the no match output, which we send to our staging table. Then we join staging to the fact table and apply the deletions as shown below inside an Execute SQL Task.
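The delete step could look something like the following. This is a sketch with hypothetical names (dbo.FactTransaction, dbo.Stage_DeletedIds, TransactionId), assuming the staging table holds only the IDs that no longer exist in production:

```sql
-- Remove fact rows whose IDs were redirected to the no match output.
DELETE f
FROM dbo.FactTransaction AS f
INNER JOIN dbo.Stage_DeletedIds AS d
    ON d.TransactionId = f.TransactionId;

-- Empty the staging table so the next run starts clean.
TRUNCATE TABLE dbo.Stage_DeletedIds;
```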
I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup.
That way you can see what's new, changed, and deleted.
You'll need to sort both flows by the same joining key (in this case your hash column).
Personally, I'm not sure I'd bother. Instead I'd simply stage all my prod data and then use a 3-way SQL MERGE statement to handle inserts, updates, and deletes in one pass. You can keep your hash column as a joining key if you like.
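A rough sketch of such a 3-way MERGE, assuming SQL Server 2008 or later and hypothetical names (dbo.FactTransaction, dbo.Stage_Transaction, TransactionId, Amount, RowHash). Here the ID is used as the join key and the hash only to detect changes, which is one possible choice rather than the only one:

```sql
-- dbo.Stage_Transaction must contain the FULL production extract,
-- otherwise the DELETE branch removes rows that still exist in production.
MERGE dbo.FactTransaction AS tgt
USING dbo.Stage_Transaction AS src
    ON tgt.TransactionId = src.TransactionId
WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN
    -- Changed in production: refresh the row and its hash.
    UPDATE SET tgt.Amount  = src.Amount,
               tgt.RowHash = src.RowHash
WHEN NOT MATCHED BY TARGET THEN
    -- New in production: insert.
    INSERT (TransactionId, Amount, RowHash)
    VALUES (src.TransactionId, src.Amount, src.RowHash)
WHEN NOT MATCHED BY SOURCE THEN
    -- Gone from production: a true delete.
    DELETE;
```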

Faster SQL Performance

I have to insert one record per table across 30 tables. The data comes from another system. I have to insert the data into the tables the first time, and then, if any update happens, I need to update the tables in SQL Server. I have two options:
a) I can check the timestamp of the individual table rows and update if the timestamp is greater.
b) Every time, I can straightaway delete the records and insert the data.
Which one will be faster in a SQL Server database? Is there any other option to address the situation?
If you are not changing the indexed fields of the record, the strategy of trying to update first and then insert is usually faster than drop/insert, as you don't force the database into updating a bunch of index info.
If using SQL Server 2008+ you should use the MERGE command, as it explicitly handles the update/insert condition cleanly and clearly.
ADDED
I should also add that if you know the usage pattern is rarely update (i.e., 90% insert), you may have a case where drop/insert is faster than update/insert; it depends on lots of details. Regardless, MERGE is the clear winner if using 2008+.
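The update-first-then-insert strategy, combined with the timestamp check from option a), can be sketched as follows. The table dbo.Customer and its columns are hypothetical, and the sketch assumes only one process writes a given key at a time:

```sql
DECLARE @Id int, @Payload nvarchar(100), @SourceTs datetime;
SET @Id = 42;
SET @Payload = N'new value';
SET @SourceTs = GETDATE();

-- Try the update first; rows that are already current are skipped.
UPDATE dbo.Customer
SET    Payload  = @Payload,
       SourceTs = @SourceTs
WHERE  Id = @Id
  AND  SourceTs < @SourceTs;

-- Nothing updated: insert only if the row really does not exist yet
-- (it may exist with an equal or newer timestamp).
IF @@ROWCOUNT = 0
   AND NOT EXISTS (SELECT 1 FROM dbo.Customer WHERE Id = @Id)
    INSERT INTO dbo.Customer (Id, Payload, SourceTs)
    VALUES (@Id, @Payload, @SourceTs);
```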
I generally like drop and re-insert. I find it to be cleaner and easier to code. However, if this is happening very frequently and you're worried about concurrency issues, you're probably better off with option a).
Another thing to factor in is how often the timestamp check fails (where you neither insert nor update). If 99% of the data is redundant or outdated, you're probably better off with option a) regardless.

Large Data Service Architecture

Every day a company drops a text file with potentially many records (350,000) onto our secure FTP. We've created a Windows service that runs early in the morning to read the text file into our SQL Server 2005 DB tables. We don't do a BULK INSERT because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem with this is that the service can take a very long time (hours). This is problematic because it is inserting and updating into tables that constantly need to be queried and scanned by our application which could affect the performance of the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table with the same schema as the target table. Null IDs signify new records and populated IDs signify updated records. Then use the SQL MERGE command to merge it into the main table. MERGE will perform better than individual inserts/updates.
Doing it row by row, you incur index maintenance on the table for every statement, which can be costly if it's tuned for selects. I believe that with MERGE it's a bulk action.
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: it turns out you cannot use MERGE in SQL Server 2005 (you can in 2008). Your idea of having another database is usually handled by SQL replication. I've seen a copy of the current database used in production to perform a long-running action (reporting and aggregation of data in that instance), but it wasn't merged back in. I don't know what merging capabilities are available in SQL replication, but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference is now that SQL is working with a set so can tune any index rebuilds accordingly - should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, this will allow you to solve the issue stopping you from bulk inserting (i.e., you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively, with the temporary table idea, you can add a WHERE condition to first see if the row exists in the database, something like:
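INSERT INTO MyTable (val1, val2, val3)
SELECT s.val1, s.val2, s.val3
FROM #Tempo AS s
WHERE NOT EXISTS
(
    -- Qualify both sides: an unqualified val1 in here resolves to MyTable's own column
    SELECT *
    FROM MyTable AS t
    WHERE t.val1 = s.val1 AND t.val2 = s.val2 AND t.val3 = s.val3
)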
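For the update half of the insert/update-join approach from Update 2, a set-based UPDATE against the temp table might look like this. It works on SQL Server 2005; treating val1 as the column that identifies a row is purely an assumption for illustration:

```sql
-- Update existing rows in one set-based statement instead of row by row.
UPDATE t
SET    t.val2 = s.val2,
       t.val3 = s.val3
FROM   MyTable AS t
INNER JOIN #Tempo AS s
        ON s.val1 = t.val1               -- assumed matching key
WHERE  t.val2 <> s.val2
   OR  t.val3 <> s.val3;                 -- touch only rows that actually changed
```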
We do much larger imports than that all the time. Create an SSIS package to do the work. Personally I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory if you want before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.

Error handling and data integrity when changing table schema

We have a few customers with large data sets and during our upgrade procedure we need to modify the schema of various tables (adding some columns, renaming others, occasionally changing data types, but that's rare).
Previously we've been going via a temporary table with the new schema, and then dropping the original and renaming the temp table, but I'm hoping to speed that up dramatically by using ALTER TABLE ... instead.
My question is what data integrity and error handling issues do I need to consider? Should I enclose all changes to a table in a transaction (and if so, how?) or will the DBMS guarantee atomicity and integrity over an ALTER operation?
We already strongly recommend that customers back up their data before starting the upgrade, so that should always be a fallback option.
We need to target SQLServer 2005 and Oracle, but obviously I can add conditional code if they require different approaches.
Comments for Oracle only:
Table alterations are DDL, so the concept of a transaction doesn't apply - every DDL statement locks the table for the duration of the operation and either succeeds or fails.
Adding (nullable!) columns or renaming existing columns is a relatively lightweight process and shouldn't present any problems if the table lock can be acquired.
If you're adding/modifying constraints (either NOT NULL or other more complex check constraints) Oracle will check existing data to validate the constraints unless you add the ENABLE NOVALIDATE clause to the constraint DDL. The validation of existing data can be a lengthy process for large tables.
If you're scripting the upgrade to be run as a SQL*Plus script, save yourself a lot of headaches by using the "whenever sqlerror exit sql.sqlcode" directive to abort the script on the first failure to make the review of partially implemented upgrades easier.
If the upgrade must be performed on a live system where you can neither control transactions nor afford to miss them, consider using the Oracle DBMS_REDEFINITION package, which automatically creates a temporary configuration of temp tables and triggers to capture in-flight transactions while redefining the table in the "background". Warning: lots of work and a steep learning curve for this option.
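As a rough illustration of the Oracle points above, a SQL*Plus upgrade script might start like this. The table, column, and constraint names are hypothetical:

```sql
-- Abort the whole script on the first failing statement.
WHENEVER SQLERROR EXIT SQL.SQLCODE

-- Lightweight changes: add a nullable column, rename an existing one.
ALTER TABLE orders ADD (notes VARCHAR2(200));
ALTER TABLE orders RENAME COLUMN cust_ref TO customer_ref;

-- Add a check constraint without validating the existing rows,
-- which avoids a lengthy scan on a large table.
ALTER TABLE orders
  ADD CONSTRAINT orders_amount_chk CHECK (amount >= 0) ENABLE NOVALIDATE;
```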
If you're using SQL Server, DDL statements are transactional, so wrap them in a transaction (I don't think this applies to Oracle though).
We split upgrades into individual patches that go with a particular feature. Which patches are applied go in a database_patch_history table, and it's easy to see which patches were applied and how to roll them back.
As you say, taking a backup before you start is important.
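One way to wrap the changes in a transaction on SQL Server 2005 is sketched below. The table and column names are hypothetical, and the error handling is deliberately minimal:

```sql
SET XACT_ABORT ON;   -- any run-time error dooms the transaction

BEGIN TRY
    BEGIN TRANSACTION;

    -- All schema changes for one table succeed or fail together.
    ALTER TABLE dbo.Orders ADD Notes nvarchar(200) NULL;
    EXEC sp_rename 'dbo.Orders.CustRef', 'CustomerRef', 'COLUMN';

    COMMIT TRANSACTION;
END TRY
BEGIN CATCH
    -- Undo everything if any statement failed.
    IF XACT_STATE() <> 0
        ROLLBACK TRANSACTION;

    -- Re-raise so the upgrade script can stop and report the failure.
    DECLARE @msg nvarchar(2048);
    SET @msg = ERROR_MESSAGE();
    RAISERROR(@msg, 16, 1);
END CATCH;
```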
I have had to do changes like this in the past and have always been very paranoid about data loss. To help mitigate that risk I have always done tons of testing against "sandbox" databases that mirrored the target databases in schema and data as closely as possible. Test out the process as much as possible before rolling it out, just like you would any other area of the application.
If you dramatically change the data type of a column, for instance changing a VARCHAR to an INT, the conversion may fail and you could lose that data. Luckily, DBMSs nowadays are intelligent enough to do some data type conversions without losing the data, but you don't want to run the risk of damaging any of it when making the alterations.
You shouldn't lose any data by renaming columns and definitely won't by adding new columns; it's when you move the data about that you have to be concerned.
Firstly, back up the entire table, both the schema and the data, so that at a second's notice you can roll back to the previous schema. Secondly, look at the alterations you are trying to make and see how drastic they are; try to figure out exactly what needs to change. If you're making data type conversions, push that data to an intermediary table first with 3 columns: the foreign key (id or whatever, so you can locate the row), the old data, and the new column. Then either push the old data to the new column directly, or convert it at the application level.
When it's all in the correct types and everything's been successful, run the ALTER statements and repopulate the database! It's simple enough to do, just needs a logical thought process.
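The intermediary-table step could be sketched in T-SQL like this, for a hypothetical dbo.Orders table whose Qty column is being changed from varchar to int. Only plain non-negative whole numbers are converted automatically; everything else is left for review:

```sql
CREATE TABLE dbo.Orders_QtyConversion
(
    OrderId int         NOT NULL PRIMARY KEY,  -- key used to locate the original row
    OldQty  varchar(20) NULL,                  -- value as currently stored
    NewQty  int         NULL                   -- converted value, NULL if not convertible
);

INSERT INTO dbo.Orders_QtyConversion (OrderId, OldQty, NewQty)
SELECT OrderId,
       Qty,
       -- Convert only strings of 1 to 9 digits, which always fit in an int.
       CASE WHEN Qty NOT LIKE '%[^0-9]%' AND LEN(Qty) BETWEEN 1 AND 9
            THEN CAST(Qty AS int)
       END
FROM   dbo.Orders;

-- Rows that need manual attention before the ALTER statements are run.
SELECT OrderId, OldQty
FROM   dbo.Orders_QtyConversion
WHERE  OldQty IS NOT NULL
  AND  NewQty IS NULL;
```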

Resources