SSIS and CDC - Incorrect state at end of "Mark Processed Range"

The Problem
I currently have CDC running on a table named subscription_events. The corresponding CT table is being populated with new inserts, updates, and deletes.
I have two SSIS flows that move data from subscription_events into another table in a different database. The first flow is the initial flow and has the following layout:
The Import Rows Into Vertica step simply has a source and a destination and copies every row into another table. As a note, the source table is currently active, with new rows arriving every few minutes. The Mark Initial Load Start/End steps store the current state in a variable, which is then written to a separate table meant for storing CDC names and states.
The second flow is the incremental flow and has the following layout:
The Import Rows Into Vertica step uses a CDC source and should pull the latest inserts, updates, and deletes from the CT table so they can be applied to the destination. Here is where the problem resides: I never receive anything from the CDC source, even though new rows are being inserted into the subscription_events table and the corresponding CT table is growing with new change data.
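For reference, you can confirm that change rows really are accumulating by querying the CT table directly; this assumes the default capture instance name dbo_subscription_events:

-- Assumes the default capture instance 'dbo_subscription_events'; adjust if yours differs.
SELECT TOP (10) *
FROM cdc.dbo_subscription_events_CT
ORDER BY __$start_lsn DESC;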
To my understanding, this is how things should work:
Mark Initial Load Start
CDC State should be ILSTART
Data Flow
Mark Initial Load End
CDC State should be ILEND
Get Processing Range (First Run)
CDC State should be ILUPDATE
Data Flow
Mark Processed Range (First Run)
CDC State should be TFEND
Get Processing Range (Subsequent Runs)
CDC State should be TFSTART
Data Flow
Mark Processed Range (Subsequent Runs)
CDC State should be TFEND
Repeat the last three steps
This is not how my CDC states are being set, though... Here are my states through the same process.
Mark Initial Load Start
CDC State is ILSTART
Data Flow
Mark Initial Load End
CDC State is ILEND
Get Processing Range (First Run)
CDC State is ILUPDATE
Data Flow
Mark Processed Range (First Run)
CDC State is ILEND
Get Processing Range (Subsequent Runs)
CDC State is ILUPDATE
Data Flow
Mark Processed Range (Subsequent Runs)
CDC State is ILEND
Repeat the last three steps
I am never able to get out of the ILUPDATE/ILEND loop, so I am never able to get any new data from the CT table. Why is this happening and what can I do to fix this?
Thank you so much, in advance, for your help! :)
Edit 1
Here are a couple of articles that sort of describe my situation, though not exactly. They did not help me resolve this issue, but they might help you think of something I can try.
http://www.bradleyschacht.com/understanding-the-cdc-state-value/
http://msdn.microsoft.com/en-us/library/hh231087.aspx
The second article includes this image, which shows the ILUPDATE/ILEND loop I am trapped in.
Edit 2
Last week (May 26, 2014) I disabled then re-enabled CDC on the subscription_events table. This didn't change anything, so I then disabled CDC on the entire database, re-enabled CDC on the database, and then enabled CDC on the subscription_events table. This did make CDC work for a few days (and I thought the problem had been resolved by going through this process). However, at the end of last week (May 30, 2014) I needed to re-load the entire table via this process, and I ran into the same problem again. I'm still stuck in this loop and I'm not sure why or how to get out of it.
Edit 3
Before I was having this problem, I was having a separate issue which I posted about here:
CDC is enabled, but cdc.dbo<table-name>_CT table is not being populated
I am unsure if these are related, but figured it couldn't hurt to provide it.

I had the same problem.
I have an initial load package for kicking things off and a separate incremental load package for loading updates on a schedule.
You fix it by putting a "Mark CDC Start" CDC Control Task at the end of your Initial Load package only. That will leave the state value in a TFEND state, which is what you want for when your Incremental Load starts.
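To verify the fix, assuming your state table is the default [dbo].[cdc_states] that the CDC Control Task offers to create, you can check the stored state string after the initial load completes:

-- Adjust the table name if you store CDC states somewhere else.
SELECT [name], [state]
FROM [dbo].[cdc_states];
-- After the Initial Load package finishes, the state string should carry a TFEND marker.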

Related

SSIS ETL Pattern based on Rowversion occasionally missing rows, how to correct?

We have been using an SSIS pattern based around rowversion to synchronize records between two databases by looking at only rows in the source that have been inserted or updated since the last package run. Note data is never deleted from the source table, which is a prerequisite for this SSIS pattern.
However, we recently discovered that, despite running daily, our import has missed rows from last month, leaving them out of our data warehouse entirely!
This is what I'm seeking a solution to: how can we change our ETL pattern to avoid this problem, without going back to reading every row from the source every day?
From internet searching we found an explanation for why this might be happening, but not a solution. The flaw seems to be related to the fact that the SQL rowversion column gets its value when an insert/update starts, not when it commits. This can lead to rows not being available at package execution time but getting committed later with rowversion values less than your stored ETLRowversion value, so the next time your job runs they get skipped.
In brief, our pattern is currently like this (I've left out steps involving index maintenance, etc., for simplicity; a T-SQL sketch follows the list):
Get the last active rowversion from the source DB using MIN_ACTIVE_ROWVERSION(); call that @MaxRv.
Get the rowversion value as of the last successful execution of our SSIS task (stored in our data warehouse in a table called ETLRowversions). Call that @LSERV.
Read rows from the source table WHERE rowversion >= @LSERV AND rowversion <= @MaxRv.
For each row read, check whether the row exists in the target DB (if so, add the row to an update staging table; if not, insert it directly into the target table).
Update the target table using the update staging table.
Update the ETLRowversions table in our data warehouse with the @MaxRv value.
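A minimal T-SQL sketch of that pattern as described, flaw included (all table and column names are placeholders):

-- Step 1: the lowest rowversion not yet committed anywhere in the database.
DECLARE @MaxRv binary(8) = MIN_ACTIVE_ROWVERSION();
DECLARE @LSERV binary(8);

-- Step 2: the watermark stored by the last successful run.
SELECT @LSERV = LastRowversion
FROM dbo.ETLRowversions
WHERE SourceTableName = 'SourceTable';

-- Step 3: the incremental read. Per the explanation above, rows whose
-- transactions are still open here can commit later with values inside
-- this range and be skipped by the next run.
SELECT *
FROM dbo.SourceTable
WHERE RowVer >= @LSERV
  AND RowVer <= @MaxRv;

-- Step 6: after a successful load, advance the watermark.
UPDATE dbo.ETLRowversions
SET LastRowversion = @MaxRv
WHERE SourceTableName = 'SourceTable';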
Edit: Comments have suggested implementing Change Tracking and Snapshot Isolation as the best solution to this problem. Unfortunately, change tracking and allow_snapshot_isolation are both OFF for the source database, and I am pessimistic about my chances of getting these features turned on. For better or worse, our BI concerns carry far less weight than the performance concerns of the production application/DB that is our source.

How to control which rows were sent via SSIS

I'm trying to create an SSIS package which will periodically send data to another database. I want to send only new records (I need to keep track of which records were sent), so I created a status column in my source table.
I want my package to update this column after successfully sending the data, but I can't update all rows with "unsent" status because some rows may have been added during package execution. I also can't use transactions (I mean isolation levels that would solve my problem: I can't use Serializable because I mustn't prevent users from adding new rows, and the Sequence Container doesn't support Snapshot).
My next idea was to use a recordset and, after sending the data to the other DB, use it to get the IDs of the sent rows, but I couldn't find a way to use it as a data source.
I don't think I should set the status to "to send" and then update it to "sent"; I believe it would be too costly.
Now I'm thinking about using a temporary table, but I'm not convinced this is the right way to do it. Am I missing something?
The Recordset is a destination; you cannot use it as a source in a Data Flow task.
But since the data is saved to a variable, it is available in the Control Flow.
After completing the Data Flow, return to the Control Flow and create a Foreach Loop container that iterates over the recordset variable.
Read each recordset value into a variable and use it to run an update query.
Also, see if the Lookup Transform can be useful to you; it can generate rows that match or don't match.
I will improve the answer based on discussion.
What you have here is a very typical data mirroring problem. To start with, I would not simply have a boolean that signifies that a record was "sent" to the destination (mirror) database. At the very least, I would put a LastUpdated datetime column in the source table, and have triggers on that table, on insert and update, that put the system date into that column. Then, every day I would execute an SSIS package that reads the records updated in the last week, checks to see if those records exist in the destination, splitting the datastream into records already existing and records that do not exist in the destination. For those that do exist, if the LastUpdated in the destination is less than the LastUpdated in the source, then update them with the values from the source. For those that do not exist in the destination, insert the record from the source.
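A minimal sketch of the trigger half of that approach (the table dbo.SourceTable, its Id key, and the LastUpdated column are hypothetical):

-- Stamp LastUpdated on every insert/update; names are hypothetical.
-- Note: the trigger's own UPDATE will not re-fire it unless RECURSIVE_TRIGGERS is ON.
CREATE TRIGGER trg_SourceTable_LastUpdated
ON dbo.SourceTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE s
    SET LastUpdated = GETDATE()
    FROM dbo.SourceTable AS s
    INNER JOIN inserted AS i ON i.Id = s.Id;
END;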
It gets a little more interesting if you also have to deal with record deletions.
I know it may seem wasteful to read and check a week's worth, every day, but your database should hardly feel it, it provides a lot of good double checking, and saves you a lot of headaches by providing a simple, error tolerant algorithm. Some record does not get transferred because of some hiccup on the network, no worries, it gets picked up the next day.
I would still set up the SSIS package as a server task that sends me an email with any errors, so that I can keep track. Most days, you get no errors, and when there are errors, you can wait a day or resolve the cause and let the next days run pick up the problems.
I am doing a similar thing; in my case, I have a status on the source record.
Read in all records with a status of "New".
Then use an OLE DB Command to execute SQL on each row, changing the status to "In Progress" (in your WHERE clause, enter a ? as the value in the Component Properties tab, and you can configure it as a parameter from the table row, like an ID or some PK, in the Column Mappings tab).
Once the records are processed, you can change all "In Progress" records to "Success" or something similar using another OLE DB Command (a sketch of both statements follows below).
Depending on what you are doing, you can use the status to mark records that errored at some point and require further attention.
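A minimal sketch of those two statements, assuming a hypothetical dbo.SourceTable with Id and Status columns; the ? placeholder is bound per row in the OLE DB Command's Column Mappings tab:

-- Per-row statement run by the first OLE DB Command; ? is mapped to the row's Id.
UPDATE dbo.SourceTable
SET Status = 'In Progress'
WHERE Id = ?;

-- Run once at the end (second OLE DB Command or an Execute SQL Task).
UPDATE dbo.SourceTable
SET Status = 'Success'
WHERE Status = 'In Progress';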

Persist Data in SSIS for Next Execution

I have data to load where I only need to pull records since the last time I pulled this data. There are no date fields in my destination table to save this information, so I have to keep track of the maximum date that I last pulled. The problem is I can't see how to save this value in SSIS for the next time the project runs.
I saw this:
Persist a variable value in SSIS package
but it doesn't work for me because there is another process that purges and reloads the data separately from my process. This means that I have to do more than just know the last time my process ran.
The only solution I can think of is to create a table but it seems a bit much to create a table to hold one field.
This is a very common thing to do. You create an execution table that stores the package name, the start time, the end time, and whether the package failed or succeeded. You are then able to pull the max start time of the last successful execution.
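A minimal sketch of such an execution table and the lookup query (all names are hypothetical):

-- Hypothetical execution-log table.
CREATE TABLE dbo.PackageExecutionLog (
    ExecutionId int IDENTITY(1,1) PRIMARY KEY,
    PackageName nvarchar(260) NOT NULL,
    StartTime   datetime2 NOT NULL,
    EndTime     datetime2 NULL,
    Succeeded   bit NULL
);

-- Pull the max start time of the last successful run for a given package.
SELECT MAX(StartTime)
FROM dbo.PackageExecutionLog
WHERE PackageName = 'MyIncrementalLoad.dtsx'
  AND Succeeded = 1;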
You can't persist anything in a package between executions.
What you're talking about is a form of differential replication and this has been done many many times.
For differential replication it is normal to store some kind of state in the subscriber (the system reading the data) or the publisher (the system providing the data) that remembers what state you're up to.
So I suggest you:
Read up on differential replication design patterns
Absolutely put your mind at rest about writing data to a table
If you end up having more than one source system or more than one source table, your storage table is not going to have just one record. Have a think about that. I answered a question like this the other day; you'll find over time that you're going to add handy things like the last time the replication ran, how long it took, how many records were transferred, etc.
Is it viable to have a SQL table with only one row and one column?
TTeeple and Nick.McDermaid are absolutely correct, and you should follow their advice if humanly possible.
But if for some reason you don't have access to write to an execution table, you can always use a Script Task to read/write the last loaded date to a text file on whatever local file system you're running SSIS on.

How do you reload incremental data using SQL Server CDC?

I haven't been able to find documentation/an explanation on how you would reload incremental data using Change Data Capture (CDC) in SQL Server 2014 with SSIS.
Basically, on a given day, if your SSIS incremental processing fails and you need to start again, how do you stage the recently changed records a second time?
I suppose it depends on what you're doing with the data, eh? :) In the general case, though, you can break it down into three cases:
Insert - check if the row is there. If it is, skip it. If not, insert it.
Delete - assuming that you don't reuse primary keys, just run the delete again. It will either find a row to delete or it won't, but the net result is that the row with that PK won't exist after the delete.
Update - kind of like the delete scenario. If you reprocess an update, it's not really a big deal (assuming that your CDC process is the only thing keeping things up to date at the destination and there's no danger of overwriting someone/something else's changes).
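A hedged T-SQL sketch of that idempotent re-apply, assuming the change rows were staged with their CDC __$operation codes (1 = delete, 2 = insert, 4 = update after-image); all table names are hypothetical:

-- Insert: only if the row isn't already there.
INSERT INTO dbo.TargetTable (Id, Col1)
SELECT s.Id, s.Col1
FROM dbo.ChangeStaging AS s
WHERE s.__$operation = 2
  AND NOT EXISTS (SELECT 1 FROM dbo.TargetTable AS t WHERE t.Id = s.Id);

-- Delete: safe to run twice; the second pass simply finds no row.
DELETE t
FROM dbo.TargetTable AS t
INNER JOIN dbo.ChangeStaging AS s ON s.Id = t.Id
WHERE s.__$operation = 1;

-- Update: reprocessing just rewrites the same values.
UPDATE t
SET t.Col1 = s.Col1
FROM dbo.TargetTable AS t
INNER JOIN dbo.ChangeStaging AS s ON s.Id = t.Id
WHERE s.__$operation = 4;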
Assuming you are using the new SSIS 2012 CDC components, specifically the CDC Control Task at the beginning and end of the package: if the package fails for any reason before it runs the CDC Control Task at the end, those LSNs (Log Sequence Numbers) will NOT be marked as processed, so you can just restart the SSIS package from the top after fixing the issue and it will reprocess those records. You MUST use the CDC Control Task to make this work, though, or keep track of the LSNs yourself (before SSIS 2012, that was the only way to do it).
Matt Masson (Sr. Program Manager on MSFT SQL Server team) has a great post on this with a step-by-step walkthrough: CDC in SSIS for SQL Server 2012
Also, see Bradley Schacht's post: Understanding the CDC state Value
So I did figure out how to do this in SSIS.
I record the min and max LSN numbers every time my SSIS package runs, in a table in my data warehouse.
If I want to reload a set of data from the CDC source to staging, then in the SSIS package I use the CDC Control Task set to "Mark CDC Start", and in the text box labelled "SQL Server LSN to start...." I put the LSN value I want to use as a starting point.
I haven't figured out how to set the end point, but I can go into my staging table and delete any data with an LSN value greater than my endpoint.
You can only do this for CDC changes that have not been 'cleaned up', so only for data that has changed within the last 3 days.
As a side point, I also bring across the lsn_time_mapping table to my data warehouse, since I find this information historically useful and it gets 'cleaned up' every 4 days in the source database.
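One hedged way to capture the LSN bookends at run time (the logging table dw.EtlLsnLog and the capture instance name 'YourCapture' are hypothetical; the authoritative processed range lives in the CDC state, so treat this as an approximation):

-- Record the LSN window visible at run time; all names here are hypothetical.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('YourCapture'); -- oldest LSN still retained for the capture instance
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();              -- newest LSN in cdc.lsn_time_mapping

INSERT INTO dw.EtlLsnLog (CaptureInstance, RunTime, FromLsn, ToLsn)
VALUES ('YourCapture', SYSDATETIME(), @from_lsn, @to_lsn);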
To reload the same changes, you can use the following methods.
Method #1: Store the TFEND marker from the [cdc_states] table in another table or variable. Load the marker back into [cdc_states] from the "saved" value to process the same range again. This method, however, only pins the starting LSN: if the change table received more changes in the meantime, those will be captured as well, so you can potentially get more changes than the first data capture returned.
Method #2: In order to capture an exact range, record the TFEND markers before and after the range is processed. You can then use an OLE DB Source connection (SSIS) with the following CDC functions, and use the CDC Splitter as usual to direct the inserts, updates, and deletes.
DECLARE @start_lsn binary(10);
DECLARE @end_lsn binary(10);
SET @start_lsn = 0x004EE38E921A01000001; -- TFEND (1); if NULL, use sys.fn_cdc_get_min_lsn('YourCapture') to start from the beginning of the _CT table
SET @end_lsn   = 0x004EE38EE3BB01000001; -- TFEND (2)
SELECT *
FROM [cdc].[fn_cdc_get_net_changes_YOURTABLECAPTURE](
      @start_lsn
    , @end_lsn
    , N'all'              -- { all | all with mask | all with merge }
    --, N'all with mask'  -- shows values in the "__$update_mask" column
    --, N'all with merge' -- merges inserts and updates; meant for processing the results with a T-SQL MERGE statement
)
ORDER BY __$start_lsn;

data migration in informatica

A large amount of data is coming from the source to the target. After a successful insertion into the target, we have to change the status of every row to "committed". But how will we know whether all the data has arrived in the target without directly querying the source?
For example, suppose 10 records have migrated to the target from the source.
We cannot change the status of all the records to "committed" before the successful insertion of all records into the target.
So before changing the status of all the records, how will we know whether an 11th record is coming or not?
Is there anything that will give me information about the total number of records in the source?
I need a real-time answer.
We had the same scenario, and this is what we did:
First of all, to check whether the data has been loaded into the target, you can join the source and target tables. The update will lock the rows, so the commit must be fired at the database level on the target table (so that the lock for the update can happen).
After joining, update the loaded data based on the join with the target column (see the sketch below).
A few things:
You have to stop your session (we used pmcmd to stop the session in a Command task).
Update the data in your source table and restart the session.
Keep the load counter at 20k-30k rows so the update goes smoothly.
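A minimal T-SQL sketch of that join-based status update (database, table, and column names are hypothetical):

-- Mark only the source rows that verifiably landed in the target.
UPDATE s
SET s.status = 'committed'
FROM source_db.dbo.orders AS s
INNER JOIN target_db.dbo.orders AS t
    ON t.order_id = s.order_id
WHERE s.status <> 'committed';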
