SSIS Sequential Processing - sql-server

I have 5 independent data flows in the same Data Flow Task, each having a source and a destination. How can I make them run sequentially? They seem to run in parallel. I know I could split them into different Data Flow Tasks, but how can I do it within a single Data Flow Task?

Don't have independent data flows in the same task. I know the Import/Export Wizard will do that, but just because a team at Microsoft does something doesn't make it a best practice. The Data Flow gets its power and performance through "free" parallelization. If you don't want that, then please, for the sake of those who maintain your future code, create 4 additional data flows and copy/paste into them. There is absolutely no harm in doing this.
For the sake of actually answering the above question, you will have to introduce a dependency of some sort. In the pantheon of horrible ideas, the following is near the top.
I assume your data flow with multiple independent streams within it looks something like Source (doesn't matter which) to an OLE DB Destination. Modify your source query, or add a Derived Column after it, to create a column of type int (DT_I4), call it something unique like HackedSortKey, and assign it a value of 1.
Remove the existing OLE DB Destination on all but one of them. Replace it with an OLE DB Command instead. The value of using the OLE DB Command is that it allows rows to pass through; as the name implies, the OLE DB Destination is only a sink for data, and its only output is an error output. Write your INSERT queries for each. That's the design pain of the Command object, but you'll also experience its run-time pain, as it performs singleton operations against the database: "Oh, I have a row to insert. One moment while I issue the command. Oh, I have a row to insert. One moment please." Every single row gets this treatment.
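For illustration, the statement you put in each OLE DB Command would be a parameterized INSERT along these lines (the table and column names here are made up; each ? maps to a pipeline column on the component's column mappings page):

    -- One singleton INSERT is issued per row flowing through the OLE DB Command.
    INSERT INTO dbo.TargetTableA (CustomerId, OrderDate, Amount)
    VALUES (?, ?, ?);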
Take your first Source-to-Command stream and attach a fully blocking component to it. Use a Sort: order by the HackedSortKey column, remove duplicates, and allow no other columns through. The point of this is to force a wait. Only once all of the data has passed through the OLE DB Command above will the Sort release rows downstream (because it won't know what the sort order is until it has seen every row). By removing duplicates on that single constant value, the original rows are reduced to a single row.
Logjam in stream A, meet stream B. Stream B now looks like "Source B" -> "Sort B" -> "Merge Join AB" -> "OLE DB Command B" -> "Sort on HackedSortKey". "Sort B" is needed because the Merge Join requires sorted input. There will be a match, since the same value is used in our fake join columns; however, you will need to make sure it's a LEFT OUTER JOIN and not an INNER JOIN.
Lather, rinse, repeat this process for the remaining data flows. But really, you want to use different data flows and have the precedence constraint manage execution.

String the data flows together sequentially using the Completion constraint instead of the Success constraint. That way each will run regardless of the others' success or failure, but they will run one at a time.
To set the value of the constraint, double-click the line going from one task to another and change the value from Success to Completion.

Related

How to resolve a deadlock on insert and update against the same table

I have an SSIS package with a task to load data. For some reason I need to both update and insert into the same destination table, and this causes a deadlock.
I use the SSIS Multicast transformation.
What should I do? How can I resolve this situation?
In your OLE DB Destination, change the access mode from "Fast Load" to "Table or View". The former takes a table lock, which is generally better for large inserts, but in your scenario you need the table to remain "unlocked." Your performance will suffer since you'll be issuing singleton inserts, but I guess that doesn't really matter since you'll also be doing singleton updates with your OLE DB Command.
Finally, I think you're doing this wrong. The multicast essentially duplicates a row so that you can direct it to N components. I generally see people trying to detect whether a row exists in the target and then either insert or update it based on that lookup. But that's the lookup component, not a multicast. Maybe you're doing a type 2 dimension or something but even then, there will be better ways to accomplish this versus what you're showing in the picture.
Your approach does seem strange; as billinkc said, you are effectively doubling the data rows and performing INSERT and UPDATE actions against the same table concurrently from two different connections/contexts. That has to end in a deadlock.
I would use an alternative approach: do the required transforms on the data, then write it to an intermediate table in the Data Flow. Then, on the next SSIS task, execute an MS SQL MERGE (Microsoft's table upsert) over an OLE DB command. This ensures you do not have a deadlock between concurrent operations, and the logic of the MERGE can be quite flexible.
Last but not least, use a dedicated table or a global ##temp table as the intermediate table; working with regular MS SQL #temp tables in SSIS is a little tricky. Do not forget to clean up the intermediate table before and after the MERGE, or to create and dispose of the ##temp table properly.
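As a rough sketch of that intermediate-table plus MERGE approach (##StagingCustomer, dbo.Customer, and the columns are invented names; adjust keys and columns to your schema), the statement run in the follow-up task might be:

    -- A single MERGE, so inserts and updates never hit the target concurrently.
    MERGE dbo.Customer AS tgt
    USING ##StagingCustomer AS src
        ON tgt.CustomerId = src.CustomerId
    WHEN MATCHED THEN
        UPDATE SET tgt.Name  = src.Name,
                   tgt.Email = src.Email
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (CustomerId, Name, Email)
        VALUES (src.CustomerId, src.Name, src.Email);

    -- Clean up the intermediate table afterwards.
    TRUNCATE TABLE ##StagingCustomer;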

SSIS - Fast way of detecting true deletes and updating data warehouse

I'm looking for an efficient way of detecting deleted records in production and updating the data warehouse to reflect those deletes because the table is > 12M rows and contains transactional data used for accounting purposes.
Originally, everything was done in a stored procedure by somebody before me and I've been tasked with moving the process to SSIS.
Here is what my test pattern looks like so far:
Inside the Data Flow Task:
I'm using MD5 hashes to speed up the ETL process as demonstrated in this article.
This should give a huge speed boost to the process by not having to store so many rows in memory for comparison purposes and by removing the bulk of conditional split processing at the same time.
But the issue is it doesn't account for records that are deleted in production.
How should I go about doing this? It may be simple to you but I'm new to SSIS so I'm not sure how to ask correctly.
Thank you in advance.
The solution I ended up using was to add another Data Flow Task and use the Lookup transformation to find records that didn't exist in production when compared to our fact table. This task comes after all of the inserts and updates as shown in my question above.
Then we can batch delete missing records in an execute SQL task.
Inside Data Flow Task:
Inside Lookup Transformation:
(note the Redirect rows to no match output)
So, if the IDs don't match, those rows will be redirected to the no-match output, which we set to go to our staging table. Then we join staging to the fact table and apply the deletions, as shown below, inside an Execute SQL task.
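The delete in that Execute SQL task would be something along these lines (dbo.FactSales and dbo.StagingDeletes are placeholder names for the fact and staging tables, joined on whatever key you staged):

    -- Delete fact rows whose keys were redirected to staging,
    -- i.e. rows that no longer exist in production.
    DELETE f
    FROM dbo.FactSales AS f
    INNER JOIN dbo.StagingDeletes AS d
        ON f.TransactionId = d.TransactionId;

    -- Ready the staging table for the next run.
    TRUNCATE TABLE dbo.StagingDeletes;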
I think you'll need to adapt your data flow to use a Merge Join instead of a Lookup.
That way you can see what's new, changed, and deleted.
You'll need to sort both flows by the same joining key (in this case your hash column).
Personally, I'm not sure I'd bother; instead I'd simply stage all my prod data and then do a three-way SQL MERGE statement to handle inserts, updates, and deletes in one pass, as sketched below. You can keep your hash column as a joining key if you like.
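A sketch of that three-way MERGE, using invented dbo.FactSales / dbo.StagingSales names, joining on the business key and using the hash column to detect changes, might be:

    MERGE dbo.FactSales AS tgt
    USING dbo.StagingSales AS src
        ON tgt.TransactionId = src.TransactionId
    WHEN MATCHED AND tgt.RowHash <> src.RowHash THEN      -- changed rows
        UPDATE SET tgt.Amount  = src.Amount,
                   tgt.RowHash = src.RowHash
    WHEN NOT MATCHED BY TARGET THEN                       -- new rows
        INSERT (TransactionId, Amount, RowHash)
        VALUES (src.TransactionId, src.Amount, src.RowHash)
    WHEN NOT MATCHED BY SOURCE THEN                       -- true deletes
        DELETE;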

Having trouble with interface table structures in SQL Server

I'm currently working on a project that involves a third-party database and application. So far we have been able to successfully test and interface data between our databases. However, we run into trouble when extracting a large set of data (e.g. 100,000 rows with 10 columns per row) and the transfer suddenly stops mid-transaction for whatever reason (blackouts, forced exit, etc.); in that scenario we end up with missing or duplicated data.
Can you please give us some suggestions for handling these types of scenarios? Thank you!
Here's our current interface structure
OurDB -> Interface DB -> 3rdParty DB
OurDB: we extract records from OurDB (with a bit column set to false) into InterfaceDB
InterfaceDB: after inserting records from OurDB, we update the OurDB bit column to true
3rdPartyDB: they extract and delete all records from InterfaceDB (they assume all records are for extraction)
Well, you definitely need an ETL tool then, and preferably SSIS. First, it will drastically improve your transfer rates while also providing robust error handling. Additionally, you will have to use Lookup transforms to ensure duplicates do not enter the system. I would suggest using the Cache Connection Manager to perform the lookups.
In terms of design, if your source system (OurDB) has a primary key, say recId, then have a column, say source_rec_id, in your InterfaceDB table. Say your first run transferred 100 rows; your second run would then need to pick up from the 101st record and move on to the next rows. This way you have a tracking mechanism and a one-to-one correlation between the source and destination systems, so you can tell how many records have been transferred, how many are left, and so on.
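As a sketch of that pickup logic (the table names and the idea of persisting the highest transferred id in a watermark table are assumptions on my part; source_rec_id is the tracking value from above), the extraction query for each run could be:

    -- Pull only source rows beyond the last transferred id.
    -- InterfaceDB.dbo.TransferWatermark holds the highest source_rec_id
    -- copied so far, updated at the end of each successful run.
    SELECT o.recId, o.Col1, o.Col2
    FROM OurDB.dbo.SourceTable AS o
    WHERE o.recId > (SELECT ISNULL(MAX(w.last_source_rec_id), 0)
                     FROM InterfaceDB.dbo.TransferWatermark AS w);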
For best understanding of SSIS go to Channel 9 - msdn - SSIS. Very helpful resource.

Ignore duplicate records in SSIS' OLE DB destination

I'm using an OLE DB Destination to populate a table with values from a web service.
The package will be scheduled to run in the early AM for the prior day's activity. However, if this fails, the package can be executed manually.
My concern is that if the operator chooses a date range that overlaps existing data, the whole package will fail (verified).
I would like it to:
INSERT the missing values (this works as expected if there are no duplicates)
ignore the duplicates, not cause the package to fail, and raise an exception that can be captured by the Windows application log (logged as a warning)
collect the number of successfully inserted records and the number of duplicates
If it matters, I'm using Data access mode = Table or view - fast load.
Suggestions on how to achieve this are appreciated.
That's not a feature.
If you don't want the error (duplicates), then you need to defend against it, much as you'd do in your favorite language. Instead of relying on error handling, you test for the existence of the error-inducing thing (a Lookup transform to identify the existence of the row in the destination) and then filter the duplicates out (Redirect No Match Output).
The technical solution you absolutely should not implement
Change the access mode from "Table or View Name - Fast Load" to "Table or View Name". This changes the method of insert from a bulk/set-based operation to singleton inserts. By inserting one row at a time, the SSIS package can evaluate the success/failure of each row's save. You then need to go into the Advanced Editor (your screenshot) and change the error disposition from Fail Component to Ignore Failure.
This solution should not be used, as it yields poor performance, generates unnecessary workload, and has the potential to mask other save errors beyond just "duplicates" (referential integrity violations, for example).
Here's how I would do it:
1) Point your SSIS destination to a staging table that will be empty when the package is run.
2) Insert all rows into the staging table.
3) Run a stored procedure that uses SQL to import records from the staging table to the final destination table, where the records don't already exist in the destination table.
4) Collect the desired metadata and do whatever you want with it.
5) Empty the staging table for the next use.
(Those last three steps would all be done in the same stored procedure.)
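A minimal sketch of that stored procedure (dbo.Staging_Orders, dbo.Orders, and the OrderId key are placeholder names) could be:

    CREATE PROCEDURE dbo.usp_LoadOrdersFromStaging
    AS
    BEGIN
        SET NOCOUNT ON;

        DECLARE @staged INT = (SELECT COUNT(*) FROM dbo.Staging_Orders);

        -- Step 3: insert only rows not already in the destination.
        INSERT INTO dbo.Orders (OrderId, OrderDate, Amount)
        SELECT s.OrderId, s.OrderDate, s.Amount
        FROM dbo.Staging_Orders AS s
        WHERE NOT EXISTS (SELECT 1
                          FROM dbo.Orders AS o
                          WHERE o.OrderId = s.OrderId);

        DECLARE @inserted INT = @@ROWCOUNT;

        -- Step 4: report the metadata (inserted rows vs. skipped duplicates).
        SELECT @inserted           AS InsertedRows,
               @staged - @inserted AS DuplicateRows;

        -- Step 5: empty the staging table for the next use.
        TRUNCATE TABLE dbo.Staging_Orders;
    END;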

In SSIS, how do I get the number of rows returned from the Source that SHOULD be processed

I am working on a project to add logging to our SSIS packages. I am doing my own custom logging by implementing some of the event handlers. I have implemented the OnInformation event to write the time, source name, and message to the log file. When data is moved from one table to another, the OnInformation event will give me a message such as:
component "TABLENAME" (1)" wrote 87 rows.
In the event that one of the rows fails, and let's say only 85 rows were processed out of the expected 87, I would assume that the above line would read "wrote 85 rows". How do I track how many rows should have been processed in this case? I would like to see something like "wrote 85 of 87 rows". Basically, I think I need to know how to get the number of rows returned by the Source's query. Is there an easy way to do this?
Thank you
You can use the Row Count transformation after the data source and save the count to a variable; this is going to be the number of rows to be processed. Once the data is loaded into the destination, use an Execute SQL Task in the Control Flow with SELECT COUNT(*) FROM <<DestinationTable>> and save the count into another variable (you should use a WHERE clause in your query to identify the current load). So you will have the number of rows processed for logging.
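For example, the Execute SQL Task query might look like the following (dbo.DestinationTable and the LoadBatchId column are placeholder names; the batch column is whatever lets you identify the current load):

    -- Map the single-row result to an SSIS variable such as User::RowsLoaded,
    -- and map a package variable to the ? parameter.
    SELECT COUNT(*) AS RowsLoaded
    FROM dbo.DestinationTable
    WHERE LoadBatchId = ?;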
Hope this helps!
Not enough space in comments to provide feedback. Posting an incomplete answer as I need to leave for the day.
You are going to have trouble accomplishing what you are asking for. Based on your comments on Gowdhaman008's answer, the value of a variable is not visible outside of a Data Flow until after the finalizer event fires (OnPostExecute, I think). You can cheat and get that data out by making use of a Script task to count rows passing through and firing off events, custom or predefined, to report package progress. In fact, just capture the OnPipelineRowsSent event. That will record how many rows are passing through a particular juncture, and the time surrounding it (see the SSIS Performance Framework). Plus, you don't have to do any custom work or maintenance on your stuff. Out-of-the-box functionality is a definite win.
That said, you aren't really going to know how many rows are coming out of a source until it's finished. That sounds stupid and I completely agree but it's the truth. Imagine a simple case, an OLE DB Source that is going to send 1,000,000 rows straight into an OLE DB Destination. Most likely, not all 1M rows are going to start in the pipeline, maybe only 10k will be in the first buffer. Those buffers are pushed to the destination and now you know 10k rows out of 10k rows have been processed. Lather, rinse, repeat a few times and in this buffer, a row has a NULL where it shouldn't. Boom goes the dynamite and the process fails. We have had 60k rows flow into the pipeline and that's all we know about because of the failure.
The only way to ensure we have accounted for all the source rows is to put an asynchronous (fully blocking) transformation into the pipeline to block all downstream components until all the data has arrived. This will obliterate any chance you have of getting good performance out of your packages. You'd still be subject to the aforementioned restrictions on updating variables, but your FireXEvent message would accurately describe how many rows could have been processed in the queue.
If you started an explicit transaction, you could do something ugly like an Execute SQL Task just to get the expected count, write that to a variable, and then log rows processed, but then you're double-querying your data and you increase the likelihood of blocking on the source system because of the double pump. And that's only going to work for something database-like. The same concept would apply to a flat file, except now you'd need a Script task to read all the rows first.
Where this gets uglier is with a slow-starting data source, like a web service. The default buffer size might cause the entire package to run much longer than it needs to, simply because we are waiting on the data to arrive (see "Slow starts").
What I'd do
I'd record my starting and error counts (and more) using the Row Count. This will help you account for all the data that came in and where it went. I'd then turn on the OnPipelineRowsSent event to allow me to query the log and see how many rows are flowing through it RIGHT NOW.
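If it helps, here is a rough sketch of querying that log; it assumes the default SSIS log provider writing to dbo.sysssislog (the table name varies by SQL Server version and logging provider):

    -- OnPipelineRowsSent messages for the most recent package execution.
    SELECT [source], [message], starttime
    FROM dbo.sysssislog
    WHERE [event] = 'OnPipelineRowsSent'
      AND executionid = (SELECT TOP (1) executionid
                         FROM dbo.sysssislog
                         ORDER BY starttime DESC)
    ORDER BY starttime;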
What you want is the Row Count transformation. Just add that to your data flow after your source query, and assign its output to a variable. Then you can write that variable to your log file.
Here is what I currently do. It's super tedious, but it works.
1)
2) I have a constant "1" value on all of the records. They are literally all the same.
3) Using a multicast step, I send the data flow off in 2 directions. Despite all being the same, we still have to sort by that constant value.
4) Use an Aggregate step to aggregate on that constant and then re-sort it in order to join with the bottom data flow (which holds all of the actual data records with no aggregation).
Doing this allows me to have my initial row count.
Later on, shown below, I use a Conditional Split step and do the same thing again after applying your condition. If the row count is the same, everything is fine and there are no problems.
If the row count is not the same, something is wrong.
This is the general idea for the approach for solving your problem without having to use another data flow step.
TLDR:
Get a row count for one of the conditions by using a Multicast, a sort by some constant value, and an aggregation step.
Do a sort and merge to grab the row count.
Use a conditional split and do it again.
If the pre and post row counts are the same, do this.
If the pre and post row counts are not the same, do that.
This MAY help if you have a column which has no bad data. Add a second Flat File Source to the package. Use the same connection as your existing file source. Choose the first column only and direct the output to a Row Count.
