What is the main difference between the begin_version and min_valid_version columns for a change-tracking-enabled table in sys.change_tracking_tables?
I am seeing the same values in both columns for every table (1000+ tables) in my database.
I also tried truncate and insert operations, but couldn't see any change.
When can we see different values in these two columns? Please give an example.
From https://learn.microsoft.com/en-us/sql/relational-databases/system-catalog-views/change-tracking-catalog-views-sys-change-tracking-tables?view=sql-server-2017,
begin_version - Version of the database when change tracking began for the table. This version usually indicates when change tracking was enabled, but this value is reset if the table is truncated.
When you enable change tracking at the database level, the version of the database at that point is considered 1. Every transaction that gets committed on this database increases the version sequentially. So the begin_version for each table is basically the version of the database at the time change tracking was enabled on that particular table.
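To see the two columns side by side (together with the current database version), a query along these lines against the catalog views works; the join to sys.tables is only there for readable table names:
-- begin_version vs. min_valid_version for every change-tracked table
SELECT  t.name AS table_name,
        ctt.begin_version,
        ctt.min_valid_version,
        CHANGE_TRACKING_CURRENT_VERSION() AS current_db_version
FROM    sys.change_tracking_tables AS ctt
JOIN    sys.tables AS t ON t.object_id = ctt.object_id
ORDER BY t.name;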
min_valid_version - Minimum valid version of change tracking information that is available for the table.
Based on the retention period that you configured for change tracking auto cleanup, your min_valid_version gets updated every time the auto cleanup thread wakes up. The change tracking auto cleanup thread wakes up every 30 minutes in the background and updates the invalid cleanup version for all databases enabled for change tracking.
For example, if your retention period is 2 days (the default), whenever this thread wakes up it determines the maximum version of the database from the transactions that were committed up to 2 days before the current time. You can query sys.dm_tran_commit_table for more information about this; commit_ts is the version referred to above.
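For reference, the retention settings and the commit table mentioned above can both be inspected directly with something like the following (both views are documented):
-- Retention period and auto-cleanup setting per change-tracking-enabled database
SELECT database_id, retention_period, retention_period_units_desc, is_auto_cleanup_on
FROM sys.change_tracking_databases;

-- Most recent commits; commit_ts is the database version discussed above
SELECT TOP (10) commit_ts, xdes_id, commit_time
FROM sys.dm_tran_commit_table
ORDER BY commit_ts DESC;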
If an ETL process attempts to detect data changes on system-versioned tables in SQL Server by including rows whose rowversion column value falls within a rowversion "delta window", e.g.:
where row_version >= @previous_etl_cycle_rowversion
and row_version < @current_etl_cycle_rowversion
... and the values for @previous_etl_cycle_rowversion and @current_etl_cycle_rowversion are selected from a logging table, to which the newest rowversion gets appended at the start of each ETL cycle via:
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select @@DBTS
... is it possible that a rowversion of a record falling within a given "delta window" (bounded by the 2 @@DBTS values) could be missed/skipped due to rowversion's behavior vis-à-vis transactional consistency? - i.e., is it possible that rowversion would be reflected on a basis of "eventual" consistency?
I'm thinking of a case where, say, 1000 records are updated within a single transaction and somehow @@DBTS is "ahead" of the record's committed rowversion, yet that specific version of the record is not yet readable...
(For the sake of scoping the question, please exclude any cases of deleted records or immediately consecutive updates on a given record within such a large batch transaction.)
If you make sure to avoid row versioning for the queries that read the change windows, you shouldn't miss many rows. With READ COMMITTED SNAPSHOT or SNAPSHOT ISOLATION, an updated but uncommitted row would not appear in your query.
But you can also miss rows that got updated after you query ##dbts. That's not such a big deal usually as they'll be in the next window. But if you have a row that is constantly updated you may miss it for a long time.
But why use rowversion? If these are temporal tables you can query the history table directly. And Change Tracking is better and easier than using rowversion, as it tracks deletes and, optionally, column changes. The feature was literally built to replace the need to do this manually, which:
usually involved a lot of work and frequently involved using a
combination of triggers, timestamp columns, new tables to store
tracking information, and custom cleanup processes.
Under SNAPSHOT isolation, it turns out the proper function for inspecting rowversion, one that ensures contiguous delta windows while not skipping rowversion values attached to long-running transactions, is MIN_ACTIVE_ROWVERSION() rather than @@DBTS.
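For illustration, a minimal sketch of logging the cycle marker with MIN_ACTIVE_ROWVERSION() instead of @@DBTS; the etl_cycle_logged_rowversion_marker table comes from the question, everything else stays the same:
-- MIN_ACTIVE_ROWVERSION() returns the lowest rowversion value still active in the database,
-- so rowversions belonging to still-open transactions are never skipped past.
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select MIN_ACTIVE_ROWVERSION();
The delta-window query itself stays the same, using two consecutive logged markers as the inclusive lower bound and exclusive upper bound.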
I have noted the following statement in the Cassandra documentation on commit log archive configuration:
https://docs.datastax.com/en/cassandra/3.0/cassandra/configuration/configLogArchive.html
"Restore stops when the first client-supplied timestamp is greater than the restore point timestamp. Because the order in which the database receives mutations does not strictly follow the timestamp order, this can leave some mutations unrecovered."
This statement made us concerned about using point-in-time recovery based on Cassandra commit logs, since it indicates that a point-in-time recovery will not recover all mutations with a timestamp lower than the indicated restore point timestamp if we have mutations out of timestamp order (which we will have).
I tried to verify this behavior via some experiments but have not been able to reproduce it.
I did 2 experiments:
Simple row inserts
Set restore_point_in_time to 1 hour ahead in time.
insert 10 rows (using default current timestamp)
insert a row using a timestamp <2 hours ahead in time> (see the example below)
insert 10 rows (using default current timestamp)
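For concreteness, the kind of CQL statement used for the out-of-order insert would look roughly like the following; the keyspace, table, and column names are hypothetical placeholders, and USING TIMESTAMP takes microseconds since the epoch:
// Hypothetical table; only the USING TIMESTAMP clause matters for the experiment.
// The supplied timestamp just has to be later than the configured restore_point_in_time.
INSERT INTO test_ks.test_table (id, val)
VALUES (11, 'future row')
USING TIMESTAMP 1767225600000000;   // 2026-01-01 00:00:00 UTC, in microseconds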
Now I killed my Cassandra instance, making sure it was terminated without having a chance to flush to SSTables.
During startup I could see from the Cassandra logs that it was doing a CommitLog replay.
After the replay I queried my table and could see that 20 rows had been recovered, but the one with the timestamp ahead of time was not inserted. Based on the documentation, though, I would have expected that only the first 10 rows had been inserted. I verified in the Cassandra log that the CommitLog replay had been done.
Larger CommitLog split experiment
I wanted to see whether the documented behavior also held across a commitlog split/rollover.
So I set commitlog_segment_size_in_mb to 1 MB to make the commitlog roll over more frequently than with the 32 MB default.
I then ran a script to mass insert rows to force the commit log to split.
The result here was that I inserted 12000 records, then a record with a timestamp ahead of my restore_point_in_time, and then 8000 more records afterwards.
At about 13200 rows my commitlog rolled over to a new file.
I then killed my Cassandra instance again and restarted it. Again I could see in the log that CommitLog replay was being done, and after the replay I could see that all rows except the single row with the timestamp ahead of restore_point_in_time were recovered.
Notes
I did similar experiments using the commitlog_sync batch option, and to make sure my rows had not been flushed to SSTables I also tried restoring a snapshot with empty tables before starting Cassandra, forcing it to perform a commitlog replay. In all cases I got the same results.
I guess my question is whether the statement in the documentation is still valid, or whether I'm missing something in my experiments?
Any help would be greatly appreciated. I need an answer to this to be able to decide on a backup/recovery mechanism for a larger-scale Cassandra cluster setup.
All experiments were done using Cassandra 3.11 (single-node setup) in a Docker container (the official cassandra Docker image). I ran the experiments on the image "from scratch", so no config changes were made other than the ones described here.
I think this will be relatively hard to reproduce, as you'll need to make sure that some of the mutations arrive later than others, and this mostly happens when some clients have unsynchronized clocks, or when nodes are overloaded and hints are replayed some time later, etc.
But this parameter may not be required at all - if you look into CommitLogArchiver.java, you can see that if this parameter is not specified, it's set to Long.MAX, meaning that there is no upper bound and all commit logs will be replayed; Cassandra will then handle it the standard way: "the latest timestamp wins".
I haven't been able to find documentation or an explanation of how to reload incremental data using Change Data Capture (CDC) in SQL Server 2014 with SSIS.
Basically, on a given day, if your SSIS incremental processing fails and you need to start again, how do you stage the recently changed records again?
I suppose it depends on what you're doing with the data, eh? :) In the general case, though, you can break it down into three cases:
Insert - check if the row is there. If it is, skip it. If not, insert it.
Delete - assuming that you don't reuse primary keys, just run the delete again. It will either find a row to delete or it won't, but the net result is that the row with that PK won't exist after the delete.
Update - kind of like the delete scenario. If you reprocess an update, it's not really a big deal (assuming that your CDC process is the only thing keeping things up to date at the destination and there's no danger of overwriting someone/something else's changes).
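As a rough sketch of what those three cases can look like when re-applied idempotently, assuming a made-up destination table dbo.DimCustomer(CustomerID, Name):
-- Hypothetical destination; re-running any of these against the same staged change
-- leaves the destination in the same state.
declare @CustomerID int = 42, @Name nvarchar(100) = N'Example';

-- Insert: skip the row if it is already there
if not exists (select 1 from dbo.DimCustomer where CustomerID = @CustomerID)
    insert into dbo.DimCustomer (CustomerID, Name) values (@CustomerID, @Name);

-- Delete: deleting an already-deleted row is simply a no-op
delete from dbo.DimCustomer where CustomerID = @CustomerID;

-- Update: re-running the same update just rewrites the same values
update dbo.DimCustomer set Name = @Name where CustomerID = @CustomerID;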
This assumes you are using the new CDC SSIS 2012 components, specifically the CDC Control Task at the beginning and end of the package. If the package fails for any reason before it runs the CDC Control Task at the end of the package, those LSNs (Log Sequence Numbers) will NOT be marked as processed, so you can just restart the SSIS package from the top after fixing the issue and it will simply reprocess those records. You MUST use the CDC Control Task to make this work, though, or keep track of the LSNs yourself (before SSIS 2012 this was the only way to do it).
Matt Masson (Sr. Program Manager on MSFT SQL Server team) has a great post on this with a step-by-step walkthrough: CDC in SSIS for SQL Server 2012
Also, see Bradley Schacht's post: Understanding the CDC state Value
So I did figure out how to do this in SSIS.
Every time my SSIS package runs, I record the min and max LSN in a table in my data warehouse.
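A minimal sketch of that recording step; the dbo.EtlLsnLog table and the 'dbo_MyTable' capture instance name are assumptions for the example, while the fn_cdc functions are the documented ones:
-- Hypothetical logging table in the data warehouse
create table dbo.EtlLsnLog (
    RunDate datetime2  not null default sysdatetime(),
    MinLsn  binary(10) not null,
    MaxLsn  binary(10) not null
);

-- Record the LSN range available at the time of this run
insert into dbo.EtlLsnLog (MinLsn, MaxLsn)
select sys.fn_cdc_get_min_lsn('dbo_MyTable'),  -- capture instance name is an assumption
       sys.fn_cdc_get_max_lsn();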
If I want to reload a set of data from the CDC source to staging, then in the SSIS package I use the CDC Control Task, set it to "Mark CDC Start", and in the text box labelled "SQL Server LSN to start...." I enter the LSN value I want to use as the starting point.
I haven't figured out how to set the end point, but I can go into my staging table and delete any data with an LSN value greater than my endpoint.
You can only do this for CDC changes that have not been 'cleaned up' - so only for data that has been changed within the last 3 days.
As a side point, I also bring across the lsn_time_mapping table to my data warehouse since I find this information historically useful and it gets 'cleaned up' every 4 days in the source database.
To reload the same changes you can use the following methods.
Method #1: Store the TFEND marker from the [cdc_states] table in another table or variable, then load that "saved" value back into [cdc_states] to process the same range again. This method lets you start processing from the same LSN, but if your change table has received more changes in the meantime, those will be captured as well. So you can potentially pick up changes that happened after you did the first data capture.
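A minimal sketch of that save/restore, assuming the state table created by the CDC Control Task is dbo.cdc_states with (name, state) columns and that the state name configured in the task is 'CDC_State'; adjust to whatever your package actually uses:
-- Save the current state row (the TFEND marker is embedded in the state string)
select name, state
into   dbo.cdc_states_backup
from   dbo.cdc_states
where  name = 'CDC_State';

-- Later: put it back to re-process the same range
update s
set    s.state = b.state
from   dbo.cdc_states s
join   dbo.cdc_states_backup b on b.name = s.name;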
Method #2: In order to capture a specific range, record the TFEND markers before and after the range is processed. You can then use an OLE DB Source (SSIS) with the following cdc functions, and use the CDC Splitter as usual to direct Inserts, Updates, and Deletes.
DECLARE @start_lsn binary(10);
DECLARE @end_lsn binary(10);
SET @start_lsn = 0x004EE38E921A01000001; -- TFEND (1); if NULL, use sys.fn_cdc_get_min_lsn('YourCapture') to start from the beginning of the _CT table
SET @end_lsn   = 0x004EE38EE3BB01000001; -- TFEND (2)
SELECT * FROM [cdc].[fn_cdc_get_net_changes_YOURTABLECAPTURE](
     @start_lsn
    ,@end_lsn
    ,N'all' -- { all | all with mask | all with merge }
    --,N'all with mask'  -- shows values in the "__$update_mask" column
    --,N'all with merge' -- merges inserts and updates together; meant for processing the results with a T-SQL MERGE statement
)
ORDER BY __$start_lsn;
I'm a beginner with SQL Server. For a project I need CDC to be turned on. I copy the CDC data to another (archive) database, and after that the CDC tables can be cleaned immediately. So the retention time doesn't need to be high; I just set it to 1 minute, and when the cleanup job runs (after the retention time has already passed) it appears to delete only a few records (the oldest ones). Why doesn't it delete everything? Sometimes it doesn't delete anything at all. After running the job a few times, the other records do get deleted. I find this strange because the retention time has long passed.
I set the retention time to 1 minute (I actually wanted 0, but that was not possible) and didn't change the threshold (= 5000). I disabled the schedule, since I want the cleanup job to run immediately after the CDC records are copied to my archive database rather than at a particular time.
My reasoning was that, for example, there will be updates in the afternoon. The task that copies CDC records to the archive database runs at 2:00 AM, and after this task the cleanup job gets called. So, given the minimal retention time, all the CDC records should be removed by the cleanup job; the retention time has passed, after all.
I also tried setting up a schedule again in the job, the way CDC is meant to be used in general. After the time had passed I checked the CDC table, and it turns out this also only deletes the oldest records. So what am I doing wrong?
As a workaround I made a new job that deletes all records in the CDC tables (and disabled the entire default CDC cleanup job). This works better, as it removes everything, but it bothers me because I want to work with the original cleanup job, and I think it should be able to work the way I want it to.
Thanks,
Kim
Rather than worrying about what's in the table, I'd use the helper functions that are created for each capture instance, specifically cdc.fn_cdc_get_all_changes_<capture_instance> and cdc.fn_cdc_get_net_changes_<capture_instance>. A typical workflow that I've used with these goes something like the steps below (do this for all of the capture instances). First, you'll need a table to keep processing status. I use something like:
-- One status row per capture instance and run; the filtered unique index ensures
-- at most one unprocessed (IsProcessed = 0) row per capture instance at a time.
create table dbo.ProcessingStatus (
    CaptureInstance sysname,
    LSN numeric(25,0),
    IsProcessed bit
)
create unique index [UQ_ProcessingStatus]
    on dbo.ProcessingStatus (CaptureInstance)
    where IsProcessed = 0
Get the current max log sequence number (LSN) using fn_cdc_get_max_lsn.
Get the last processed LSN and increment it using fn_cdc_increment_lsn. If you don't have one (i.e. this is the first time you've processed), use fn_cdc_get_min_lsn for this capture instance and use that (but don't increment it!). Record whichever LSN you're using in the table, with IsProcessed = 0 (see the sketch after these steps).
Select from whichever of the cdc.fn_cdc_get… functions makes sense for your scenario and process the results however you're going to process them.
Update IsProcessed = 1 for this run.
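A rough sketch of that loop in T-SQL, assuming a hypothetical capture instance named dbo_MyTable and the dbo.ProcessingStatus table above; reading/writing the status row and the actual processing of the change rows are only indicated in comments:
declare @from_lsn binary(10), @to_lsn binary(10), @last_lsn binary(10);

-- Current max LSN available in the CDC log
set @to_lsn = sys.fn_cdc_get_max_lsn();

-- Start just after the last processed LSN, or at the minimum LSN on the very first run.
-- (Loading @last_lsn from dbo.ProcessingStatus is omitted; it depends on how you store the LSN.)
set @from_lsn = case
                    when @last_lsn is null then sys.fn_cdc_get_min_lsn('dbo_MyTable')
                    else sys.fn_cdc_increment_lsn(@last_lsn)
                end;

-- Pull the changes for this window and process them however you need
select *
from cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, N'all')
order by __$start_lsn;

-- Finally, record @to_lsn in dbo.ProcessingStatus and set IsProcessed = 1 for this run.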
As for monitoring your original issue, just make sure that the data in the capture table is generally within the retention period. That is, if you set it to 2 days, I wouldn't even think about it being a problem until it got to be over 4 days (assuming that your call to the cleanup job is scheduled at something like every hour). And when you process with the above scheme, you don't need to worry about "too much" data being there; you're always processing a specific interval rather than "everything".
I have an issue: a table in the development database has had more columns added, and now I need to add those columns to the production database.
I tried to add the column manually, via Add Column in the table tree, but when I try to save, it shows me a warning.
Here's the complete warning message:
Saving Definition Changes to tables with large amounts of data could take a
considerable amount of time. While changes are being saved,
table data will not be accessible.
Is this safe? Or is there another method to do this?
UPDATE: I tried clicking Yes, but after a few minutes it gave me another warning that said Time Out and the process was cancelled. And yes, the data in the production table is huge.
To fix it, go to the Tools menu and select Options, then under Designers select Table and Database Designers. The first option allows you to disable the timeout, or make it greater than the default 30 seconds.
Ref: http://surf11.com/entry/197/saving-table-times-out-in-sql-server-200
Basically this means that SQL Server is going to go over all the records in the table (on disk) and change them.
This is an operation that will take a lot of IO and will lock your table against reading/writing.
Something else to consider is your DB timeout as mentioned in
http://social.msdn.microsoft.com/Forums/sqlserver/en-US/7d46b158-a089-48c9-8153-53c5ed116d37/sql-2005-updating-table-definition-with-large-amounts-of-data-timeout
This is relatively safe (except for the lock issue) and will be the case with any method you use (maybe without the nice UI warning).
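As a possible alternative to the designer, the column can also be added with a plain ALTER TABLE statement; the table and column names below are placeholders. Adding a nullable column with no default is usually a quick metadata-only change, while the designer may drop and recreate the table behind the scenes for some edits:
-- Hypothetical names; adjust to the real schema
ALTER TABLE dbo.BigTable
    ADD NewColumn int NULL;  -- nullable, no default: typically metadata-only and fast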
There are two options for setting the execution time-out:
1. Tools > Options > Designers > Transaction time-out after
Set a value bigger than 30 sec. (max 65536); this should fix the issue.
But there is one more option:
2. Tools > Options > Query Execution > Execution time-out
Generally this does not cause a problem, because the default value is already 0 (unlimited), but check it and set it back to 0 if you see a different value there.
It is completely safe and there is no possibility of data corruption.