How do you reload incremental data using SQL Server CDC? - sql-server

I haven't been able to find documentation/an explanation on how you would reload incremental data using Change Data Capture (CDC) in SQL Server 2014 with SSIS.
Basically, if your SSIS incremental processing fails on a given day and you need to start again, how do you stage the recently changed records again?

I suppose it depends on what you're doing with the data, eh? :) In the general case, though, you can break it down into three cases:
Insert - check if the row is there. If it is, skip it. If not, insert it.
Delete - assuming that you don't reuse primary keys, just run the delete again. It will either find a row to delete or it won't, but the net result is that the row with that PK won't exist after the delete.
Update - kind of like the delete scenario. If you reprocess an update, it's not really a big deal (assuming that your CDC process is the only thing keeping things up to date at the destination and there's no danger of overwriting someone/something else's changes).
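For example, all three cases can be folded into one idempotent apply step with a T-SQL MERGE. This is only a sketch with made-up names (DestTable, StagedChanges, Id, Col1, and the __operation flag are placeholders for whatever your package stages):

-- Re-apply a staged batch of changes; safe to run more than once.
MERGE dbo.DestTable AS tgt
USING dbo.StagedChanges AS src
    ON tgt.Id = src.Id
WHEN MATCHED AND src.__operation = 'D' THEN
    DELETE                                              -- reprocessed delete: row is still there, remove it
WHEN MATCHED AND src.__operation IN ('I', 'U') THEN
    UPDATE SET Col1 = src.Col1                          -- reprocessed insert/update: just overwrite
WHEN NOT MATCHED AND src.__operation IN ('I', 'U') THEN
    INSERT (Id, Col1) VALUES (src.Id, src.Col1);        -- genuinely new row
-- A delete whose row is already gone simply matches nothing, which is the desired net result.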

Assuming you are using the new CDC SSIS 2012 components, specifically the CDC Control Task at the beginning and end of the package: if the package fails for any reason before it runs the CDC Control Task at the end of the package, those LSNs (Log Sequence Numbers) will NOT be marked as processed, so you can just restart the SSIS package from the top after fixing the issue and it will reprocess those records again. You MUST use the CDC Control Task to make this work, though, or keep track of the LSNs yourself (before SSIS 2012 this was the only way to do it).
Matt Masson (Sr. Program Manager on MSFT SQL Server team) has a great post on this with a step-by-step walkthrough: CDC in SSIS for SQL Server 2012
Also, see Bradley Schacht's post: Understanding the CDC state Value

So I did figure out how to do this in SSIS.
Every time my SSIS package runs, I record the min and max LSN numbers in a table in my data warehouse.
If I want to reload a set of data from the CDC source to staging, in the SSIS package I use the CDC Control Task, set it to "Mark CDC Start", and in the text box labelled "SQL Server LSN to start...." I put the LSN value I want to use as a starting point.
I haven't figured out how to set the end point, but I can go into my staging table and delete any data with an LSN value greater than my endpoint.
You can only do this for CDC changes that have not been 'cleaned up' - so only for data that has been changed within the last 3 days.
As a side point, I also bring across the lsn_time_mapping table to my data warehouse since I find this information historically useful and it gets 'cleaned up' every 4 days in the source database.
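For reference, this is roughly how the LSN range for a run can be captured and mapped to times using the built-in CDC functions (just a sketch; the capture instance 'dbo_MyTable' and the logging table dw.EtlLsnLog are made-up names):

DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn(N'dbo_MyTable');  -- oldest LSN still available in the _CT table
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();                -- newest LSN in the database

INSERT INTO dw.EtlLsnLog (RunDate, MinLsn, MaxLsn, MinLsnTime, MaxLsnTime)
VALUES (SYSDATETIME(),
        @from_lsn,
        @to_lsn,
        sys.fn_cdc_map_lsn_to_time(@from_lsn),   -- the same mapping lsn_time_mapping provides
        sys.fn_cdc_map_lsn_to_time(@to_lsn));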

To reload the same changes, you can use the following methods.
Method #1: Store the TFEND marker from the [cdc_states] table in another table or variable. Load the marker back into [cdc_states] from the "saved" value to process the same range again. This method lets you start processing from the same LSN, but if in the meantime your change table got more changes, those changes will be captured as well. So you can potentially pick up additional changes that happened after you did the first data capture.
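As a sketch of Method #1, assuming the conventional state table and state-row name written by the CDC Control Task ([cdc_states] with [name] and [state] columns; substitute whatever you configured):

-- Save the current TFEND marker before the load
SELECT [name], [state]
INTO dbo.cdc_states_backup
FROM dbo.[cdc_states]
WHERE [name] = N'CDC_State';

-- Later, restore it to reprocess the same range (plus anything captured since)
UPDATE s
SET s.[state] = b.[state]
FROM dbo.[cdc_states] AS s
JOIN dbo.cdc_states_backup AS b
    ON b.[name] = s.[name];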
Method #2: In order to capture the specified range, record the TFEND markers before and after the range is processed. You can then use an OLE DB Source (SSIS) with the following CDC functions, and use the CDC Splitter as usual to direct the inserts, updates, and deletes.
DECLARE @start_lsn binary(10);
DECLARE @end_lsn binary(10);

SET @start_lsn = 0x004EE38E921A01000001; -- TFEND (1); if NULL, use sys.fn_cdc_get_min_lsn('YourCapture') to start from the beginning of the _CT table
SET @end_lsn   = 0x004EE38EE3BB01000001; -- TFEND (2)

SELECT *
FROM [cdc].[fn_cdc_get_net_changes_YOURTABLECAPTURE](
      @start_lsn
    , @end_lsn
    , N'all' -- { all | all with mask | all with merge }
    -- , N'all with mask'  -- shows values in the "__$update_mask" column
    -- , N'all with merge' -- merges inserts and updates together; meant for processing the results with a T-SQL MERGE statement
)
ORDER BY __$start_lsn;

Related

SSIS ETL Pattern based on Rowversion occasionally missing rows, how to correct?

We have been using an SSIS pattern based around rowversion to synchronize records between two databases by looking only at rows in the source that have been inserted or updated since the last package run. Note that data is never deleted from the source table, which is a prerequisite for this SSIS pattern.
However, lately we discovered that, despite running daily, our import has actually missed rows from last month, leaving them out of our data warehouse entirely!
This is what I'm seeking a solution to: how can we change our ETL pattern to avoid this problem, without going back to reading every row from the source every day?
From internet searching we found an explanation for why this might be happening, but not a solution. The flaw seems to be related to the fact that the SQL rowversion column gets its value when an insert/update starts, not when it commits. This can lead to rows not being available at package execution time but getting committed later with rowversion values less than your stored ETLRowversion value, so the next time your job runs they get skipped.
In brief, our pattern currently is like this (I've left out steps involving index maintenance, etc. for simplicity):
Get the minimum active rowversion from the source DB using MIN_ACTIVE_ROWVERSION(); call that @MaxRv.
Get the rowversion value as of the last successful execution of our SSIS task (stored in our data warehouse in a table called ETLRowversions). Call that @LSERV.
Read rows from the source table WHERE rowversion >= @LSERV and rowversion <= @MaxRv.
For each row read, check whether the row exists in the target DB; if so, add the row to an update staging table, and if not, insert it directly into the target table.
Update the target table using the update staging table.
Update the ETLRowversions table in our data warehouse with the @MaxRv value.
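For illustration, steps 1 through 3 correspond roughly to the following query (a sketch only; dbo.SourceTable, its RowVer column, and the dw.ETLRowversions columns are made-up names):

-- Step 1: upper bound for this run
DECLARE @MaxRv binary(10) = MIN_ACTIVE_ROWVERSION();

-- Step 2: lower bound, from the last successful run
DECLARE @LSERV binary(10) =
    (SELECT LastRowversion
     FROM dw.ETLRowversions
     WHERE TableName = N'SourceTable');

-- Step 3: candidate rows for this run
SELECT *
FROM dbo.SourceTable
WHERE RowVer >= @LSERV
  AND RowVer <= @MaxRv;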
Edit: Comments have suggested implementing Change Tracking and Snapshot Isolation as the best solution to this problem. Unfortunately, change tracking and allow_snapshot_isolation are both OFF for the source database, and I am pessimistic about my chances of getting these features turned on. For better or worse, our BI concerns carry far less weight than the performance concerns of the production application/DB that is our source.

SQL Server CDC: Track additional column after the fact

If CDC has been set up on a table, only tracking columns A, D, E instead of the entire table, is it possible to add a column Z to the source table, then add column Z to the list of tracked columns for CDC? Is it possible to do this without losing CDC data?
I've looked around and the only examples I find are for tracking the entire table, not for cherry-picking columns. I'm hoping for a way to update the table schema without losing CDC history, and without doing the whole copy-CDC-data-to-a-temp-table-and-back process.
Do you always have to create a new instance of CDC for schema changes?
SQL Server 2012
CDC allows two capture instances on a given table for exactly this reason. The idea is this:
You have a running instance tracking columns A, B, and C
You now want to start tracking column D
So you set up a second capture instance to track A, B, C, and D.
You then process everything from the original capture instance and note the last LSN that you process
Using the fn_cdc_increment_lsn() function, you set the start point to start processing the new capture instance
Once you're up and running on the new instance, you can drop the old one
Of course, if you're using CDC as a history of all changes ever made to the table… that's not what it was designed for. CDC was meant to be an ETL facilitator. That is, capturing changes to data so that they could be consumed and then ultimately discarded from the CDC system. If you're using it for a historical system of record, I'd suggest setting up a "dumb" ETL, meaning a straight copy out of the CDC tables into a user table. Once you do that, you can implement the above.
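A rough T-SQL outline of that sequence, with made-up schema, table, and capture instance names (the column list and the LSN value are placeholders):

-- 1) Set up a second capture instance that also tracks the new column Z
EXEC sys.sp_cdc_enable_table
      @source_schema        = N'dbo'
    , @source_name          = N'MyTable'
    , @role_name            = NULL
    , @capture_instance     = N'dbo_MyTable_v2'
    , @captured_column_list = N'A, D, E, Z';   -- the original instance keeps tracking A, D, E

-- 2) Drain the original instance, note the last LSN you processed,
--    then start reading the new instance one LSN later
DECLARE @last_processed_lsn binary(10) = 0x004EE38E921A01000001;  -- placeholder value
DECLARE @new_start_lsn      binary(10) = sys.fn_cdc_increment_lsn(@last_processed_lsn);

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_MyTable_v2(@new_start_lsn, sys.fn_cdc_get_max_lsn(), N'all')
ORDER BY __$start_lsn;

-- 3) Once the new instance is flowing, drop the original one
EXEC sys.sp_cdc_disable_table
      @source_schema    = N'dbo'
    , @source_name      = N'MyTable'
    , @capture_instance = N'dbo_MyTable';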

How to control which rows were sent via SSIS

I'm trying to create an SSIS package which will periodically send data to another database. I want to send only new records (I need to keep track of which records were sent), so I created a status column in my source table.
I want my package to update this column after successfully sending the data, but I can't update all rows with "unsent" status because new rows may have been added during package execution, and I also can't use transactions (I mean isolation levels that would solve my problem: I can't use Serializable because I mustn't prevent users from adding new rows, and the Sequence Container doesn't support Snapshot).
My next idea was to use a recordset and, after sending the data to the other DB, use it to get the IDs of the sent rows, but I couldn't find a way to use it as a data source.
I don't think I should set the status to "to send" and then update it to "sent"; I believe it would be too costly.
Now I'm thinking about using a temporary table, but I'm not convinced that this is the right way to do it. Am I missing something?
The Recordset is a destination; you cannot use it as a source in a Data Flow Task.
But since the data is saved to a variable, it is available in the Control Flow.
After the Data Flow completes, come back to the Control Flow and create a Foreach Loop container that runs over the recordset variable.
Read each recordset value into a variable and use it to run an update query.
Also, see if the Lookup Transform can be useful to you. You can generate rows that match or don't match.
I will improve the answer based on discussions
What you have here is a very typical data mirroring problem. To start with, I would not simply have a boolean that signifies that a record was "sent" to the destination (mirror) database. At the very least, I would put a LastUpdated datetime column in the source table, and have triggers on that table, on insert and update, that put the system date into that column. Then, every day I would execute an SSIS package that reads the records updated in the last week, checks to see if those records exist in the destination, splitting the datastream into records already existing and records that do not exist in the destination. For those that do exist, if the LastUpdated in the destination is less than the LastUpdated in the source, then update them with the values from the source. For those that do not exist in the destination, insert the record from the source.
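For example, the LastUpdated plumbing could look something like this (a sketch with made-up table and key column names):

-- Add the audit column (backfill/default strategy is up to you)
ALTER TABLE dbo.SourceTable ADD LastUpdated datetime NULL;
GO

-- Stamp the system date on every insert and update
CREATE TRIGGER dbo.trg_SourceTable_LastUpdated
ON dbo.SourceTable
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE s
    SET s.LastUpdated = GETDATE()
    FROM dbo.SourceTable AS s
    JOIN inserted AS i ON i.Id = s.Id;   -- Id is the assumed primary key
END;
GO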
It gets a little more interesting if you also have to deal with record deletions.
I know it may seem wasteful to read and check a week's worth every day, but your database should hardly feel it; it provides a lot of good double-checking and saves you a lot of headaches by giving you a simple, error-tolerant algorithm. If some record does not get transferred because of a hiccup on the network, no worries: it gets picked up the next day.
I would still set up the SSIS package as a server task that sends me an email with any errors, so that I can keep track. Most days you get no errors, and when there are errors, you can wait a day or resolve the cause and let the next day's run pick up the problems.
I am doing a similar thing; in my case, I have a status on the source record.
I read in all records with a status of "New".
Then I use an OLE DB Command to execute SQL on each row, changing the status to "In Progress" (in your WHERE clause, enter a ? as the value on the Component Properties tab, and on the Column Mappings tab you can configure it as a parameter from the table row, such as an ID or some PK).
Once the records are processed, you can change all "In Progress" records to "Success" or something similar using another OLE DB Command.
Depending on what you are doing, you can use the status to mark records that errored at some point, and require further attention.
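For reference, the SQL inside each OLE DB Command would be a parameterized update along these lines (status values and column names are only examples; the ? is mapped to a column from the data flow on the Column Mappings tab):

-- per-row command while processing
UPDATE dbo.SourceTable
SET Status = N'In Progress'
WHERE Id = ?;

-- final command once the batch has been sent
UPDATE dbo.SourceTable
SET Status = N'Success'
WHERE Status = N'In Progress';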

SSIS and CDC - Incorrect state at end of "Mark Processed Range"

The Problem
I currently have CDC running on a table named subscription_events. The corresponding CT table is being populated with new inserts, updates, and deletes.
I have two SSIS flows that move data from subscription_events into another table in a different database. The first flow is the initial flow and has the following layout:
The Import Rows Into Vertica step simply has a source and a destination and copies every row into another table. As a note, the source table is currently active and has new rows flowing into it every few minutes. The Mark Initial Load Start/End steps store the current state in a variable and that is stored in a separate table meant for storing CDC names and states.
The second flow is the incremental flow and has the following layout:
The Import Rows Into Vertica step uses a CDC source and should pull the latest inserts, updates, and deletes from the CT table and these should be applied to the destination. Here is where the problem resides; I never receive anything from the CDC source, even though there are new rows being inserted into the subscription_events table and the corresponding CT table is growing in size with new change data.
To my understanding, this is how things should work:
Mark Initial Load Start
CDC State should be ILSTART
Data Flow
Mark Initial Load End
CDC State should be ILEND
Get Processing Range (First Run)
CDC State should be ILUPDATE
Data Flow
Mark Processed Range (First Run)
CDC State should be TFEND
Get Processing Range (Subsequent Runs)
CDC State should be TFSTART
Data Flow
Mark Processed Range (Subsequent Runs)
CDC State should be TFEND
Repeat the last three steps
This is not how my CDC states are being set, though... Here are my states along the same process.
Mark Initial Load Start
CDC State is ILSTART
Data Flow
Mark Initial Load End
CDC State is ILEND
Get Processing Range (First Run)
CDC State is ILUPDATE
Data Flow
Mark Processed Range (First Run)
CDC State is ILEND
Get Processing Range (Subsequent Runs)
CDC State is ILUPDATE
Data Flow
Mark Processed Range (Subsequent Runs)
CDC State is ILEND
Repeat the last three steps
I am never able to get out of the ILUPDATE/ILEND loop, so I am never able to get any new data from the CT table. Why is this happening and what can I do to fix this?
Thank you so much, in advance, for your help! :)
Edit 1
Here are a couple of articles that sort of describe my situation, though not exactly. They did not help me resolve this issue, but they might help you think of something I can try.
http://www.bradleyschacht.com/understanding-the-cdc-state-value/
http://msdn.microsoft.com/en-us/library/hh231087.aspx
The second article includes this image, which shows the ILUPDATE/ILEND loop I am trapped in.
Edit 2
Last week (May 26, 2014) I disabled then re-enabled CDC on the subscription_events table. This didn't change anything, so I then disabled CDC on the entire database, re-enabled CDC on the database, and then enabled CDC on the subscription_events table. This did make CDC work for a few days (and I thought the problem had been resolved by going through this process). However, at the end of last week (May 30, 2014) I needed to re-load the entire table via this process, and I ran into the same problem again. I'm still stuck in this loop and I'm not sure why or how to get out of it.
Edit 3
Before I was having this problem, I was having a separate issue which I posted about here:
CDC is enabled, but cdc.dbo<table-name>_CT table is not being populated
I am unsure if these are related, but figured it couldn't hurt to provide it.
I had the same problem.
I have an initial load package for kicking things off and a separate incremental load package for loading updates on a schedule.
You fix it by putting a "Mark CDC start" CDC Control Task at the end of your Initial Load package only. That will leave the state value in a TFEND state, which is what you want for when your Incremental Load starts.
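If you want to confirm the fix took, you can peek at the state string the CDC Control Task writes (assuming the conventional [cdc_states] table and 'CDC_State' name; substitute whatever you configured):

SELECT [name], [state]
FROM dbo.[cdc_states]
WHERE [name] = N'CDC_State';
-- after the Initial Load package finishes, the [state] value should contain 'TFEND'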

After insert trigger - SQL Server 2008

I have data coming in from DataStage that is being put into our SQL Server 2008 database in a table: stg_table_outside_data. The outside source is putting the data into that table every morning. I want to move the data from stg_table_outside_data to table_outside_data, where I keep multiple days' worth of data.
I created a stored procedure that inserts the data from stg_table_outside_Data into table_outside_data and then truncates stg_table_outside_Data. The outside DataStage process is outside of my control, so I have to do this all within SQL Server 2008. I had originally planned on using a simple AFTER INSERT trigger, but DataStage is doing a commit after every 100,000 rows. The trigger would run after the first commit and cause a deadlock error for the DataStage process.
Is there a way to set up an AFTER INSERT trigger to wait 30 minutes and then make sure there wasn't a new commit within that time frame? Is there a better solution to my problem? The goal is to get the data out of the staging table and into the working table without duplications, and then truncate the staging table for the next morning's load.
I appreciate your time and help.
One way you could do this is take advantage of the new MERGE statement in SQL Server 2008 (see the MSDN docs and this blog post) and just schedule that as a SQL job every 30 minutes or so.
The MERGE statement allows you to easily just define operations (INSERT, UPDATE, DELETE, or nothing at all) depending on whether the source data (your staging table) and the target data (your "real" table) match on some criteria, or not.
So in your case, it would be something like:
MERGE table_outside_data AS target
USING stg_table_outside_data AS source
    ON (target.ProductID = source.ProductID)  -- whatever join makes sense for you
WHEN NOT MATCHED THEN
    INSERT VALUES(.......);
-- To "do nothing" when rows match, simply omit the WHEN MATCHED clause;
-- WHEN MATCHED THEN must otherwise be followed by an UPDATE or DELETE action.
You shouldn't be using a trigger to do this; you should use a scheduled job.
Maybe build a procedure that moves all data from stg_table_outside_Data to table_outside_data once a day, or use the job scheduler.
Do a row count in the trigger; if the count is less than 100,000, do nothing. Otherwise, run your process.
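A sketch of that idea (the 100,000 threshold is the batch size mentioned above; the procedure name is made up):

CREATE TRIGGER dbo.trg_stg_table_outside_data_move
ON dbo.stg_table_outside_data
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- do nothing until the full batch has landed in staging
    IF (SELECT COUNT(*) FROM dbo.stg_table_outside_data) < 100000
        RETURN;
    EXEC dbo.usp_move_staging_to_working;   -- hypothetical proc that moves staging rows into table_outside_data
END;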
