Every day a company drops a text file with potentially many records (350,000) onto our secure FTP site. We've created a Windows service that runs early in the morning to read the text file into our SQL Server 2005 DB tables. We don't do a BULK INSERT because the data is relational and we need to check it against what's already in our DB to make sure the data remains normalized and consistent.
The problem is that the service can take a very long time (hours). This is problematic because it is inserting into and updating tables that constantly need to be queried and scanned by our application, which can hurt the performance of both the DB and the application.
One solution we've thought of is to run the service on a separate DB with the same tables as our live DB. When the service is finished we can do a BCP into the live DB so it mirrors all of the new records created by the service.
I've never worked with handling millions of records in a DB before and I'm not sure what a standard approach to something like this is. Is this an appropriate way of doing this sort of thing? Any suggestions?
One mechanism I've seen is to insert the values into a temporary table with the same schema as the target table. NULL IDs signify new records and populated IDs signify updated records. Then use the SQL MERGE command to merge it into the main table. MERGE will perform better than individual inserts/updates.
Doing it row by row, you will incur index maintenance on the table for every statement, which can be costly if it's tuned for selects. I believe MERGE treats it as one bulk action.
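A minimal sketch of that MERGE, with placeholder table and column names, assuming SQL Server 2008 or later (as the update below notes, MERGE isn't available on 2005):

MERGE dbo.MyTable AS tgt
USING #Staging AS src
    ON src.Id = tgt.Id   -- rows with NULL Ids never match, so they fall through to the INSERT
WHEN MATCHED THEN
    UPDATE SET Val1 = src.Val1, Val2 = src.Val2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Val1, Val2)
    VALUES (src.Val1, src.Val2);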
It's touched upon here:
What's a good alternative to firing a stored procedure 368 times to update the database?
There are MSDN articles about SQL merging, so Googling will help you there.
Update: it turns out you cannot use MERGE on SQL Server 2005 (you can in 2008). Your idea of having another database is usually handled by SQL replication. Again, I've seen a copy of the current database used in production to perform a long-running action (reporting and aggregation of data in this instance); however, this wasn't merged back in. I don't know what merging capabilities are available in SQL Replication, but it would be a good place to look.
Either that, or resolve the reason why you cannot bulk insert/update.
Update 2: as mentioned in the comments, you could stick with the temporary table idea to get the data into the database, and then insert/update join onto this table to populate your main table. The difference now is that SQL is working with a set, so it can tune any index rebuilds accordingly - it should be faster, even with the joining.
Update 3: you could possibly remove the data checking from the insert process and move it to the service. If you can stop inserts into your table while this happens, it will let you solve the issue that stops you from bulk inserting (i.e., you are checking for duplicates based on column values, as you don't yet have the luxury of an ID). Alternatively, with the temporary table idea, you can add a WHERE NOT EXISTS clause so that only rows not already in the database get inserted, something like:
INSERT INTO MyTable (val1, val2, val3)
SELECT tmp.val1, tmp.val2, tmp.val3
FROM #Tempo AS tmp
WHERE NOT EXISTS
(
    SELECT *
    FROM MyTable AS t
    WHERE t.val1 = tmp.val1 AND t.val2 = tmp.val2 AND t.val3 = tmp.val3
)
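For the update half of the insert/update join mentioned in Update 2, a set-based join back onto the temp table might look like this (assuming some natural key, here hypothetically val1, identifies the row):

UPDATE t
SET t.val2 = tmp.val2,
    t.val3 = tmp.val3
FROM MyTable AS t
INNER JOIN #Tempo AS tmp
    ON tmp.val1 = t.val1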
We do much larger imports than that all the time. Create an SSIS package to do the work. Personally, I prefer to create a staging table, clean it up, and then do the update or import. But SSIS can do all the cleaning in memory, if you want, before inserting.
Before you start mirroring and replicating data, which is complicated and expensive, it would be worthwhile to check your existing service to make sure it is performing efficiently.
Maybe there are table scans you can get rid of by adding an index, or lookup queries you can get rid of by doing smart error handling? Analyze your execution plans for the queries that your service performs and optimize those.
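As a purely illustrative example (hypothetical table and column names), a scan caused by a lookup like "does this product code already exist?" can often be turned into a seek with a narrow covering index:

CREATE NONCLUSTERED INDEX IX_Products_ProductCode
    ON dbo.Products (ProductCode)
    INCLUDE (ProductName, IsActive);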
Related
I have an SSIS package with a task to load data. I need to both update and insert into the same destination table, and this causes a deadlock.
I am using the SSIS Multicast transformation.
What should I do? How can I resolve this situation?
In your OLE DB Destination, change the access mode from "Fast Load" to "Table or View". The former takes a table lock, which is generally better for large inserts, but in your scenario you need the table to remain "unlocked". Your performance will suffer since you'll be issuing singleton inserts, but I guess that doesn't really matter since you'll also be doing singleton updates with your OLE DB Command.
Finally, I think you're doing this wrong. The multicast essentially duplicates a row so that you can direct it to N components. I generally see people trying to detect whether a row exists in the target and then either insert or update it based on that lookup. But that's the lookup component, not a multicast. Maybe you're doing a type 2 dimension or something but even then, there will be better ways to accomplish this versus what you're showing in the picture.
Your approach does seem strange. As billinkc said, you are effectively doubling the data rows and performing INSERT and UPDATE actions against the same table concurrently from two different connections/contexts. That is bound to end in a deadlock.
I would use an alternative approach: do the required transforms on the data, then write it to an intermediate table in the Data Flow. Then, in the next SSIS task, execute an MS SQL MERGE (Microsoft's table upsert) with an OLE DB command. This ensures you do not have a deadlock between concurrent operations, and the logic of the MERGE can be quite flexible.
Last but not least, use a dedicated table or a global ##temp table as the intermediate table; working with regular MS SQL #temp tables in SSIS is a little tricky. Do not forget to clean up the intermediate table before and after the MERGE, or to create and dispose of the ##temp table properly.
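A rough sketch of that lifecycle, split across Execute SQL Tasks on either side of the Data Flow (all names here are made up):

-- Execute SQL Task before the Data Flow: (re)create the intermediate table
IF OBJECT_ID('tempdb..##ImportRows') IS NOT NULL
    DROP TABLE ##ImportRows;
CREATE TABLE ##ImportRows (Id int NULL, Val1 varchar(50) NULL, Val2 varchar(50) NULL);

-- (the Data Flow then writes its transformed rows into ##ImportRows)

-- Execute SQL Task after the Data Flow: upsert into the destination, then clean up
MERGE dbo.Destination AS tgt
USING ##ImportRows AS src
    ON src.Id = tgt.Id
WHEN MATCHED THEN
    UPDATE SET Val1 = src.Val1, Val2 = src.Val2
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Val1, Val2) VALUES (src.Val1, src.Val2);

DROP TABLE ##ImportRows;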
I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the IDs don't match, those rows are redirected to the no-match output, which we set to go to our staging table. Then we join staging to the fact table and apply the deletions inside an Execute SQL Task, along the lines of the sketch below.
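(A sketch only; FactSales and DeleteStaging are placeholder names for the fact table and the staging table fed by the no-match output.)

DELETE f
FROM dbo.FactSales AS f
INNER JOIN dbo.DeleteStaging AS d
    ON d.TransactionID = f.TransactionID;

-- clear the staging table for the next run
TRUNCATE TABLE dbo.DeleteStaging;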
I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup.
That way you can see what's new, changed, and deleted.
You'll need to sort both flows by the same joining key (in this case your hash column).
Personally, I'm not sure I'd bother; instead I'd simply stage all my prod data and then do a three-way SQL MERGE statement to handle inserts, updates, and deletes in one pass. You can keep your hash column as the joining key if you like.
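A rough sketch of that three-way MERGE, with placeholder table and column names:

MERGE dbo.FactSales AS tgt
USING dbo.StagingSales AS src
    ON src.TransactionID = tgt.TransactionID
WHEN MATCHED AND src.RowHash <> tgt.RowHash THEN
    UPDATE SET Amount = src.Amount, RowHash = src.RowHash
WHEN NOT MATCHED BY TARGET THEN
    INSERT (TransactionID, Amount, RowHash)
    VALUES (src.TransactionID, src.Amount, src.RowHash)
WHEN NOT MATCHED BY SOURCE THEN
    DELETE;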
Here is my situation: my client wants to bulk insert 100,000+ rows into the database from a CSV file, which is simple enough, but the values need to be checked against data that is already in the database (does this product type exist? is this product still sold? etc.). To make things worse, these files will also be uploaded into the live system during the day, so I need to make sure I'm not locking any tables for long. The inserted data will also be spread across multiple tables.
I've been loading the data into a staging table, which takes seconds. I then tried creating a web service to start processing the table using LINQ, marking any erroneous rows with an invalid flag (this can take some time). Once the validation is done, I need to take the valid rows and update/add them to the appropriate tables.
Is there a process for this that I am unfamiliar with?
For a smaller dataset I would suggest:
IF EXISTS (SELECT blah FROM blah WHERE....)
    UPDATE (blah)
ELSE
    INSERT (blah)
You could do this in chunks to avoid server load, but this is by no means a quick solution, so SSIS would be preferable.
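A rough sketch of the chunked version, assuming a staging table with a Processed flag (StagingProducts, Products, and the columns are all placeholder names):

DECLARE @BatchSize int;
SET @BatchSize = 5000;

CREATE TABLE #Chunk (ProductCode varchar(50) PRIMARY KEY, Price money);

WHILE 1 = 1
BEGIN
    TRUNCATE TABLE #Chunk;

    -- grab the next unprocessed chunk from staging
    INSERT INTO #Chunk (ProductCode, Price)
    SELECT TOP (@BatchSize) ProductCode, Price
    FROM dbo.StagingProducts
    WHERE Processed = 0
    ORDER BY ProductCode;

    IF @@ROWCOUNT = 0 BREAK;

    -- update rows that already exist in the live table
    UPDATE p
    SET p.Price = c.Price
    FROM dbo.Products AS p
    INNER JOIN #Chunk AS c ON c.ProductCode = p.ProductCode;

    -- insert rows that don't
    INSERT INTO dbo.Products (ProductCode, Price)
    SELECT c.ProductCode, c.Price
    FROM #Chunk AS c
    WHERE NOT EXISTS (SELECT 1 FROM dbo.Products AS p WHERE p.ProductCode = c.ProductCode);

    -- mark the chunk as processed so the next iteration moves on
    UPDATE s
    SET s.Processed = 1
    FROM dbo.StagingProducts AS s
    INNER JOIN #Chunk AS c ON c.ProductCode = s.ProductCode;
END

DROP TABLE #Chunk;

Keeping each batch small keeps locks short, which matters if the live tables are being queried during the day.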
I have to insert one record per table across 30 tables. The data comes from another system. I have to insert the data into the tables the first time; after that, whenever anything changes, I need to update the tables in SQL Server. I have two options:
a) I can check the timestamp for individual table rows and update only if the incoming timestamp is greater.
b) Every time, I can simply delete the records and insert the data.
Which one will be faster in a SQL Server database? Is there any other option to address the situation?
If you are not changing the index fields of the record, the strategy of trying to update first and then insert is usually faster than drop/insert, as you don't force the database into updating a bunch of index info.
If you're using SQL 2008+, you should be using the MERGE command, as it explicitly handles the update/insert condition cleanly and clearly.
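On earlier versions, a minimal sketch of that update-first pattern (table and column names are placeholders):

UPDATE dbo.TargetTable
SET Value = @Value,
    ModifiedDate = GETDATE()
WHERE ExternalId = @ExternalId;

IF @@ROWCOUNT = 0
    INSERT INTO dbo.TargetTable (ExternalId, Value, ModifiedDate)
    VALUES (@ExternalId, @Value, GETDATE());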
ADDED
I should also add that if you know the usage pattern rarely updates (i.e., 90% inserts), you may have a case where drop/insert is faster than update/insert -- it depends on lots of details. Regardless, MERGE is the clear winner if you're on 2008+.
I generally like drop and re-insert. I find it to be cleaner and easier to code. However, if this is happening very frequently and you're worried about concurrency issues, you're probably better off with option 1.
Also, another thing to factor in is how often the timestamp check fails (where you have to neither insert nor update). If 99% of the data is redundant/outdated, you're probably better off with option 1 regardless.
I am looking for a way to quickly compare the state of a database table with the results of a Web service call.
I need to make sure that all records returned by the Web service call exist in the database, and any records in the database that are no longer in the Web service response are removed from the table.
I have two problems to solve:
1. How do I quickly compare a data structure with the results of a database table?
2. When I find a difference, how do I quickly add what's new and remove what's gone?
For number 1, I was thinking of doing an MD5 of a data structure and storing it in the database. If the MD5 is different, then I'd move to step 2. Are there better ways of comparing response data with the state of a database?
I need more guidance on number 2. I can easily retrieve all records from a table (SELECT * FROM users WHERE user_id = 1) and then loop through an array, adding what's not in the DB and building another array of items to be removed in a subsequent call, but I'm hoping for a better (faster) way of doing this. What is the best way to compare and sync a data structure with a subset of a database table?
Thanks for any insight into these issues!
I've recently been caught up in a similar problem. Our--very simple--solution was to load the web service data into a table with the same structure as the DB table. The DB table keeps a hash of its most important columns, and the same hash function is applied to the corresponding columns in the web service table.
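For example, a minimal sketch of populating that hash, assuming SQL Server's HASHBYTES (other engines have their own MD5/hash functions) and placeholder column names:

-- the same expression must be used on both tables so the hashes are comparable;
-- wrap nullable columns in ISNULL(...) so a single NULL doesn't blank the whole hash
UPDATE db_table
SET hash = HASHBYTES('MD5', col1 + '|' + col2 + '|' + CAST(col3 AS varchar(20)));

UPDATE ws_table
SET hash = HASHBYTES('MD5', col1 + '|' + col2 + '|' + CAST(col3 AS varchar(20)));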
The "sync" logic then goes like this:
1. Delete any rows from the web service table with hashes that do exist in the DB table. This is duplicate data that doesn't need synchronizing.
   DELETE FROM ws_table WHERE hash IN (SELECT hash FROM db_table);
2. Delete any rows from the DB table with hashes not found in the web service table.
   DELETE FROM db_table WHERE hash NOT IN (SELECT hash FROM ws_table);
3. Anything left over in the web service table is new data, and should now be inserted into the DB table.
   INSERT INTO db_table SELECT ... FROM ws_table;
It's a pretty brute-force approach, and if done transactionally (even just steps 2 and 3) locks up the DB table for the duration, but it's very simple.
One refinement would be to deal with changed records using UPDATE statements, but that adds a good deal of complexity, and may not be any faster than a DELETE followed by an INSERT.
Another possible optimization would be to set a flag instead of deleting rows. The rows could then be deleted later on. However, any logic using the DB table would have to ignore rows with a set flag.
Don't kill yourself doing premature optimization. Go with the simple approach of inserting each row one at a time. If you find you're having transactional issues, such as the table being locked for too long while looping, you could insert the rows into a temporary table first and then do a single insert into the real destination table.
If you were using SQL Server you could do bulk inserts or package the data into XML, but I'd still highly recommend implementing it the easy way first, then testing it (with production data, or the same quantity of data, if you can), and only looking to optimize if you need to.