We are running SQL 2008 R2 and have started exploring change tracking as our method for identifying changes to export to our data warehouse. We are only interested in specific columns.
We are identifying the changes on a replicated copy of the source database. If we query the change table on the source server, any specific column update is available and the SYS_CHANGE_COLUMNS is populated.
However, on the replicated copy the changes are tracked, but the SYS_CHANGE_COLUMNS field is always NULL for an update.
Track columns updated is set to true on the subscriber.
Is this because replication performs whole-row updates, so column-level changes cannot be captured on a subscriber?
Any help or alternative approaches would be much appreciated.
Thanks
I realize this is an old question, but since I've happened across it I figure I may as well provide an answer for others who come later.
SYS_CHANGE_COLUMNS is null when every column is "updated". "Updated" here doesn't necessarily mean the value changed; it just means the column was touched by the DML statement. So "update t set c = c" counts as column c being "updated".
Inserts and deletes will therefore always have a SYS_CHANGE_COLUMNS value of null, since the whole row is affected by an insert or a delete. But most replication technologies apply an update by setting every column to the value of the column on the replication source. Therefore a replication "update" touches every column, and so the SYS_CHANGE_COLUMNS value will always be null.
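To make the effect on a consumer concrete, here is a minimal sketch of a change query that treats a null SYS_CHANGE_COLUMNS on an update as "any column may have changed". The table dbo.Orders, its key OrderID, the column Status, and the variable @last_sync_version are hypothetical names used only for illustration.

```sql
-- Hypothetical: dbo.Orders has change tracking enabled with TRACK_COLUMNS_UPDATED = ON.
DECLARE @last_sync_version bigint = 0;  -- version stored by the previous export run

SELECT ct.OrderID,
       ct.SYS_CHANGE_OPERATION,
       CASE
           -- Inserts and deletes carry no column mask at all.
           WHEN ct.SYS_CHANGE_OPERATION <> 'U' THEN NULL
           -- Null mask on an update: every column was touched, so assume a change.
           WHEN ct.SYS_CHANGE_COLUMNS IS NULL THEN 1
           -- Otherwise test the mask for the one column of interest.
           ELSE CHANGE_TRACKING_IS_COLUMN_IN_MASK(
                    COLUMNPROPERTY(OBJECT_ID('dbo.Orders'), 'Status', 'ColumnId'),
                    ct.SYS_CHANGE_COLUMNS)
       END AS StatusMayHaveChanged
FROM   CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct;
```

On a subscriber that receives whole-row updates, the second branch fires for every update, so the export would have to compare the column values itself to find real changes.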
If an ETL process attempts to detect data changes on system-versioned tables in SQL Server by including rows as defined by a rowversion column to be within a rowversion "delta window", e.g.:
where row_version >= @previous_etl_cycle_rowversion
and row_version < @current_etl_cycle_rowversion
.. and the values for @previous_etl_cycle_rowversion and @current_etl_cycle_rowversion are selected from a logging table whose newest rowversion gets appended to said logging table at the start of each ETL cycle via:
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select @@DBTS
... is it possible that a rowversion of a record falling within a given "delta window" (bounded by the 2 @@DBTS values) could be missed/skipped due to rowversion's behavior vis-à-vis transactional consistency? - i.e., is it possible that rowversion would be reflected on a basis of "eventual" consistency?
I'm thinking of a case where say, 1000 records are updated within a single transaction and somehow @@DBTS is "ahead" of the record's committed rowversion yet that specific version of the record is not yet readable...
(For the sake of scoping the question, please exclude any cases of deleted records or immediately consecutive updates on a given record within such a large batch transaction.)
If you make sure to avoid row versioning for the queries that read the change windows you shouldn't miss many rows. With READ COMMITTED SNAPSHOT or SNAPSHOT ISOLATION an updated but uncommitted row would not appear in your query.
But you can also miss rows that got updated after you query @@DBTS. That's not such a big deal usually as they'll be in the next window. But if you have a row that is constantly updated you may miss it for a long time.
But why use rowversion? If these are temporal tables you can query the history table directly. And Change Tracking is better and easier than using rowversion, as it tracks deletes and optionally column changes. The feature was literally built to replace the need to do this manually, which:
"usually involved a lot of work and frequently involved using a combination of triggers, timestamp columns, new tables to store tracking information, and custom cleanup processes."
Under SNAPSHOT isolation, it turns out the proper function to inspect rowversion, which will ensure contiguous delta windows while not skipping rowversion values attached to long-running transactions, is MIN_ACTIVE_ROWVERSION() rather than @@DBTS.
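A minimal sketch of that approach, reusing the logging table from the question: MIN_ACTIVE_ROWVERSION() returns the lowest rowversion still held by an open transaction, or @@DBTS + 1 when there is none, so using it as an exclusive upper bound leaves in-flight rows for the next window. The table dbo.source_table and the binary(8) type of cycle_start_row_version are assumptions.

```sql
-- Hypothetical names: dbo.source_table with a rowversion column row_version.
DECLARE @previous_etl_cycle_rowversion binary(8);
DECLARE @current_etl_cycle_rowversion  binary(8);

-- Lower bound: the marker logged by the previous cycle (zero on the first run).
SELECT @previous_etl_cycle_rowversion =
       ISNULL(MAX(cycle_start_row_version), 0x0000000000000000)
FROM   etl_cycle_logged_rowversion_marker;

-- Upper bound: nothing at or above this value is guaranteed to be committed yet.
SET @current_etl_cycle_rowversion = MIN_ACTIVE_ROWVERSION();

INSERT INTO etl_cycle_logged_rowversion_marker (cycle_start_row_version)
VALUES (@current_etl_cycle_rowversion);

-- Half-open window: the next cycle starts exactly where this one stopped.
SELECT s.*
FROM   dbo.source_table AS s
WHERE  s.row_version >= @previous_etl_cycle_rowversion
  AND  s.row_version <  @current_etl_cycle_rowversion;
```

One consequence of this design is that a single long-running transaction holds the upper bound back for every table in the database until it commits or rolls back.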
We have been using an SSIS pattern based around rowversion to synchronize records between two databases by looking at only rows in the source that have been inserted or updated since the last package run. Note data is never deleted from the source table, which is a prerequisite for this SSIS pattern.
However lately we discovered despite running daily our import has actually missed rows from last month, leaving them out of our data warehouse entirely!
This is what I'm seeking a solution to: how can we change our ETL pattern to avoid this problem, without going back to reading every row from the source every day?
From internet searching we found an explanation for why this might be happening, but not a solution. The flaw seems to be related to the fact that the SQL column rowversion gets its value when an insert/update starts, not when it commits, which can lead to rows not being available at package execution time, but getting committed later with rowversion values less than your stored ETLRowversion value, so next time your job runs they get skipped.
In brief our pattern currently is like this: (I've left out steps involving index maintenance, etc for simplicity.)
1. Get the last active rowversion from the source DB using MIN_ACTIVE_ROWVERSION(). Call that @MaxRv.
2. Get the rowversion value as of the last successful execution of our SSIS task (stored in our data warehouse in a table called ETLRowversions). Call that @LSERV.
3. Read rows from the source table WHERE rowversion is >= @LSERV and rowversion is <= @MaxRv.
4. For each row read, check whether the row exists in the target DB. If it does, add the row to an update staging table; if not, insert it directly into the Target table.
5. Update the Target table using the update staging table.
6. Update the ETLRowversions table in our data warehouse with the @MaxRv value.
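The last two steps can be sketched in T-SQL roughly as follows. This is only an illustration: dbo.TargetTable, dbo.UpdateStaging, the columns Id, Col1, Col2, and the LastRowversion and TableName columns of ETLRowversions are all hypothetical, and in the real package @MaxRv would come from an SSIS variable rather than a literal.

```sql
-- Placeholder value; in the package this is the value captured in the first step.
DECLARE @MaxRv binary(8);
SET @MaxRv = 0x00000000000007D1;

BEGIN TRANSACTION;

-- Apply the staged rows to the target in one set-based statement.
UPDATE t
SET    t.Col1 = s.Col1,
       t.Col2 = s.Col2
FROM   dbo.TargetTable   AS t
JOIN   dbo.UpdateStaging AS s ON s.Id = t.Id;

-- Advance the stored watermark in the same transaction as the load,
-- so a failed load never moves the window forward.
UPDATE dbo.ETLRowversions
SET    LastRowversion = @MaxRv
WHERE  TableName = N'SourceTable';

COMMIT TRANSACTION;
```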
Edit: Comments have suggested implementing Change Tracking and Snapshot Isolation as the best solution to this problem. Unfortunately, change tracking and allow_snapshot_isolation are both OFF for the source database, and I am pessimistic about my chances of getting these features turned on. For better or worse, our BI concerns carry far less weight than the performance concerns of the production application/DB that is our source.
If we receive an update statement that does not check if the value has changed in the where clause, what are the different ways to ignore that update inside a trigger?
I know we can do a comparison of each individual field (handling the ISNULL side as well), but where it's a table that has 50+ fields, is there a faster/easier way to do it?
Note: I want to save each and every event in logs for updated fields. For example, if I have 50 fields and one of them is updated (for a single row, not for the entire table), then I want to save only that updated field's old value and new value in the logs.
Thanks in Advance, RAHUL
If this is more about logging changes to tables, a simpler solution may be to use Change Data Capture (CDC) tables.
Every time a change is made to a table, a row is written to your CDC table. Then you could write a query over the CDC table to bring you back just the data that has changed.
More information on CDC tables is available here:
http://msdn.microsoft.com/en-us/library/bb522489(v=sql.105).aspx
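As a rough sketch of what that looks like in practice, the following enables CDC on a hypothetical table dbo.Orders and reads back the captured changes, including the before and after image of each update. The table name is an assumption; dbo_Orders is the default capture instance name that CDC derives from the schema and table.

```sql
-- One-time setup (requires appropriate permissions and a running SQL Server Agent).
EXEC sys.sp_cdc_enable_db;
EXEC sys.sp_cdc_enable_table
     @source_schema = N'dbo',
     @source_name   = N'Orders',   -- hypothetical table
     @role_name     = NULL;

-- Later: read every change captured so far.
DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Orders');
DECLARE @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

-- __$operation: 1 = delete, 2 = insert, 3 = update (old values), 4 = update (new values)
SELECT *
FROM   cdc.fn_cdc_get_all_changes_dbo_Orders(@from_lsn, @to_lsn, N'all update old')
ORDER BY __$start_lsn, __$seqval, __$operation;
```

Pairing the operation 3 and operation 4 rows for the same __$seqval gives the old and new value of each column, which can then be compared to log only the fields that differ.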
Instead of using a ton of or statements to check if a row has been altered, I was looking into checksum() or binary_checksum(). What is best practice for this situation? Is it using checksum(), binary_checksum() or some other method? I like the idea of using one of the checksum options so I don't have to build a massive or statement for my update.
EDIT:
Sorry everyone, I should have provided more detail. I need to pull in data from some outside sources, but because I am using merge replication I don't want to just blow out and rebuild the tables. I want to only update or insert the rows that really have changes or don't exist. I will have a pared-down version of the source data in my target db that will get sync'd down to clients. I was trying to find a good way to detect the row changes without having to look at every single column to perform the update.
Any suggestions are greatly appreciated.
Thanks,
S
First, if you are using actual Merge replication, it should take care of updating the proper rows for you.
Second, typically the way to determine if a row has changed is to use a column with a data type of timestamp, now called rowversion, which changes each time the row is updated. However, this type of column will only tell you if the value changed since the last time you read the value, which means you have to have read and stored the timestamps to use in comparison. Thus, this may not work for you.
Lastly, a solution which may work for you would be triggers on the table in question that update an actual DateTime (or better yet, DateTime2) column with the current date and time when an insert or update takes place. You would need to store the datetime you last synchronized the table and compare it to the last updated column to determine which rows had changed.
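A minimal sketch of such a trigger, assuming a hypothetical table dbo.Product with primary key ProductID and a LastUpdated datetime2 column added for this purpose:

```sql
CREATE TRIGGER dbo.trg_Product_SetLastUpdated
ON dbo.Product
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- Stamp only the rows touched by the statement that fired the trigger.
    -- With the default RECURSIVE_TRIGGERS OFF setting, this UPDATE does not
    -- fire the trigger again.
    UPDATE p
    SET    p.LastUpdated = SYSDATETIME()
    FROM   dbo.Product AS p
    JOIN   inserted    AS i ON i.ProductID = p.ProductID;
END;
```

The sync job would then select rows WHERE LastUpdated > the stored datetime of the previous run.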
It might help if we have a bit more info about what you are doing but in general the checksum() option does work well as long as you have access to the original checksum of the row to compare to.
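For illustration, here is how a checksum comparison could look when the original row is available to compare against, next to an alternative that avoids checksums altogether. The tables dbo.SourceProduct and dbo.TargetProduct and their columns are hypothetical. Keep in mind that two different rows can produce the same checksum, so the first form can occasionally skip a real change.

```sql
-- Option 1: compare checksums of the columns of interest.
UPDATE t
SET    t.Name  = s.Name,
       t.Price = s.Price
FROM   dbo.TargetProduct AS t
JOIN   dbo.SourceProduct AS s ON s.ProductID = t.ProductID
WHERE  BINARY_CHECKSUM(s.Name, s.Price) <> BINARY_CHECKSUM(t.Name, t.Price);

-- Option 2: EXCEPT returns a row only when the two column lists differ.
-- It treats NULLs as equal and cannot collide, at the cost of comparing real values.
UPDATE t
SET    t.Name  = s.Name,
       t.Price = s.Price
FROM   dbo.TargetProduct AS t
JOIN   dbo.SourceProduct AS s ON s.ProductID = t.ProductID
WHERE  EXISTS (SELECT s.Name, s.Price
               EXCEPT
               SELECT t.Name, t.Price);
```

Either form still lists the columns once, but it replaces the long chain of or conditions with per-column NULL handling.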
The table doesn't have a last updated field and I need to know when existing data was updated. So adding a last updated field won't help (as far as I know).
SQL Server 2000 does not keep track of this information for you.
There may be creative / fuzzy ways to guess what this date was depending on your database model. But, if you are talking about 1 table with no relation to other data, then you are out of luck.
You can't check for changes without some sort of audit mechanism. You are looking to extract information that has not been collected. If you just need to know when a record was added or edited, adding a datetime field that gets updated via a trigger when the record is updated would be the simplest choice.
If you also need to track when a record has been deleted, then you'll want to use an audit table and populate it from triggers with a row when a record has been added, edited, or deleted.
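A minimal sketch of that audit-table approach, using only features available in SQL Server 2000. The table dbo.Customer, its key CustomerID, and the audit table layout are hypothetical, and the sketch assumes the key column itself is never updated.

```sql
CREATE TABLE dbo.CustomerAudit (
    AuditID    int IDENTITY(1,1) PRIMARY KEY,
    CustomerID int      NOT NULL,
    Operation  char(1)  NOT NULL,              -- 'I', 'U' or 'D'
    AuditedAt  datetime NOT NULL DEFAULT (GETDATE())
);
GO

CREATE TRIGGER dbo.trg_Customer_Audit
ON dbo.Customer
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    SET NOCOUNT ON;

    -- A row only in inserted is an insert, only in deleted is a delete,
    -- and in both is an update.
    INSERT INTO dbo.CustomerAudit (CustomerID, Operation)
    SELECT COALESCE(i.CustomerID, d.CustomerID),
           CASE WHEN i.CustomerID IS NULL THEN 'D'
                WHEN d.CustomerID IS NULL THEN 'I'
                ELSE 'U'
           END
    FROM   inserted AS i
    FULL OUTER JOIN deleted AS d ON d.CustomerID = i.CustomerID;
END;
```

This only records changes made after the trigger is created; it cannot recover when existing rows were last modified.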
You might try a log viewer; this basically just lets you look at the transactions in the transaction log, so you should be able to find the statement that updated the row in question. I wouldn't recommend this as a production-level auditing strategy, but I've found it to be useful in a pinch.
Here's one I've used; it's free and (only) works w/ SQL Server 2000.
http://www.red-gate.com/products/SQL_Log_Rescue/index.htm
You can add a timestamp field to that table and update that timestamp value with an update trigger.
OmniAudit is a commercial package which implements auditing across an entire database.
A free method would be to write a trigger for each table which adds entries to an audit table when fired.