SQL Server vector clock

Is there a global sequence number in SQL Server that is guaranteed to increase monotonically (even when system time regresses) and can be accessed as part of an insert or update operation?

Yes: the rowversion data type and the @@DBTS function are what you're looking for.
This pattern of marking rows with a rowversion is implemented at a lower level by the Change Tracking feature, which tracks inserts, updates, and deletes, and doesn't require you to add a column to your table.
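As a minimal sketch (the database, table, and key column names here are hypothetical), enabling Change Tracking and reading changes since a saved version looks like this:
ALTER DATABASE MyDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 2 DAYS, AUTO_CLEANUP = ON);
ALTER TABLE dbo.Orders
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = ON);
-- @last_sync_version is the version you persisted after the previous sync
DECLARE @last_sync_version bigint = 0;
SELECT ct.OrderId, ct.SYS_CHANGE_OPERATION  -- I = insert, U = update, D = delete
FROM CHANGETABLE(CHANGES dbo.Orders, @last_sync_version) AS ct;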

I'm pretty sure ROWVERSION does what you want. A ROWVERSION-typed column is guaranteed to be unique within any single database and, per the SQL Server documentation, it is nothing more than an incrementing number. If you save MAX(ROWVERSION) each time you've finished updating your data, you can find updated or inserted rows in your next pass by looking for ROWVERSIONs that are bigger than the saved MAX(). Note that you cannot catch deletes in this fashion!
Another approach is to use LineageIDs and triggers. I'm happy to explain that approach if it would help, but I think ROWVERSION is a simpler solution.
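As a sketch of the ROWVERSION watermark approach (table and column names are hypothetical):
CREATE TABLE dbo.Orders
(
    OrderId int IDENTITY PRIMARY KEY,
    Amount  money NOT NULL,
    RowVer  rowversion NOT NULL  -- bumped automatically on every insert and update
);
-- @last_max is the MAX(RowVer) you saved at the end of the previous pass
DECLARE @last_max binary(8) = 0x0;
SELECT OrderId, Amount
FROM dbo.Orders
WHERE RowVer > @last_max;  -- rows inserted or updated since then; deletes stay invisible
-- persist this as the watermark for the next pass
SELECT MAX(RowVer) FROM dbo.Orders;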

Related

Is rowversion a transactionally-consistent value to capture table data changes?

If an ETL process attempts to detect data changes on system-versioned tables in SQL Server by selecting rows whose rowversion column falls within a rowversion "delta window", e.g.:
where row_version >= @previous_etl_cycle_rowversion
and row_version < @current_etl_cycle_rowversion
... and the values for @previous_etl_cycle_rowversion and @current_etl_cycle_rowversion are selected from a logging table whose newest rowversion gets appended to said logging table at the start of each ETL cycle via:
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select @@DBTS
... is it possible that a rowversion of a record falling within a given "delta window" (bounded by the two @@DBTS values) could be missed/skipped due to rowversion's behavior vis-à-vis transactional consistency? I.e., is it possible that rowversion would be reflected on an "eventually consistent" basis?
I'm thinking of a case where, say, 1000 records are updated within a single transaction and somehow @@DBTS is "ahead" of the records' committed rowversion, yet that specific version of the records is not yet readable...
(For the sake of scoping the question, please exclude any cases of deleted records or immediately consecutive updates on a given record within such a large batch transaction.)
If you make sure to avoid row-versioning isolation for the queries that read the change windows, you shouldn't miss many rows. Under READ COMMITTED SNAPSHOT or SNAPSHOT isolation, an updated but uncommitted row would not appear in your query.
But you can also miss rows that get updated after you query @@DBTS. That's usually not a big deal, as they'll show up in the next window. But if you have a row that is constantly updated, you may miss it for a long time.
But why use rowversion at all? If these are temporal tables you can query the history table directly. And Change Tracking is better and easier than using rowversion, as it tracks deletes and, optionally, column changes. The feature was literally built to replace the need to do this manually, which:
usually involved a lot of work and frequently involved using a combination of triggers, timestamp columns, new tables to store tracking information, and custom cleanup processes.
Under SNAPSHOT isolation, it turns out the proper function for reading the rowversion high-water mark, one that yields contiguous delta windows without skipping rowversion values belonging to long-running transactions, is MIN_ACTIVE_ROWVERSION() rather than @@DBTS.
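Applied to the question's marker insert, the sketch becomes:
insert into etl_cycle_logged_rowversion_marker (cycle_start_row_version)
select MIN_ACTIVE_ROWVERSION();
-- Unlike @@DBTS, MIN_ACTIVE_ROWVERSION() returns the lowest rowversion still
-- active in the database (or @@DBTS + 1 when none are active), so a row
-- committed later by an in-flight transaction cannot land below the lower
-- bound of a subsequent window and be skipped.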

Faster SQL Performance

I have to insert one record per table across 30 tables. The data comes from another system. I have to insert the data into the tables for the first time; then, if any update happens, I need to update the tables in SQL Server. I have two options:
a) I can check the timestamp of individual table rows and update if the source timestamp is greater.
b) Every time, I can delete the records straight away and insert the data.
Which one will be faster in a SQL Server database? Is there any other option to address the situation?
If you are not changing the indexed fields of the record, the strategy of trying to update first and then insert is usually faster than drop/insert, as you don't force the database into updating a bunch of index info.
If you're using SQL Server 2008+, you should be using the MERGE command, as it explicitly handles the update/insert condition cleanly and clearly.
ADDED
I should also add that if you know the usage pattern is rarely update (i.e., 90% insert), you may have a case where drop/insert is faster than update/insert; it depends on lots of details. Regardless, MERGE is the clear winner if you're using 2008+.
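A minimal MERGE sketch (table and column names are hypothetical; the LastModified comparison is one way to skip no-op updates):
MERGE dbo.Target AS t
USING dbo.Staging AS s
    ON t.Id = s.Id
WHEN MATCHED AND t.LastModified < s.LastModified THEN
    UPDATE SET t.Payload = s.Payload,
               t.LastModified = s.LastModified
WHEN NOT MATCHED BY TARGET THEN
    INSERT (Id, Payload, LastModified)
    VALUES (s.Id, s.Payload, s.LastModified);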
I generally like drop and re-insert. I find it cleaner and easier to code. However, if this is happening very frequently and you're worried about concurrency issues, you're probably better off with option a).
Also, another thing to factor in is how often the timestamp check fails (where you don't have to insert or update). If 99% of the data is redundant/outdated, you're probably better off with option a) regardless.

Difference between MIN(__$start_lsn) and fn_cdc_get_min_lsn?

Using CDC on SQL Server 2012.
I have a table (MyTable) which is CDC enabled. I thought the following two queries would always return the same value:
SELECT MIN(__$start_lsn) FROM cdc.dbo_MyTable_CT;
SELECT sys.fn_cdc_get_min_lsn('dbo_MyTable');
But they don't seem to do so: in my case the first one returns 0x00001EC6000000DC0003 and the second one 0x00001E31000000750001, so the absolute minimum in the table is actually greater than the value returned by fn_cdc_get_min_lsn.
My questions:
Why are the results different?
Is there any problem with using the value from the first query as the first parameter on fn_cdc_get_all_changes_dbo_MyTable? (all examples I've seen use the value from the second query)
My understanding is that the first one returns the oldest LSN for the data that's currently in the CDC table, while the latter reflects when the table was added to CDC. I will note, though, that you'll only want to use the minimum (whichever method you go with) once, so you don't process duplicate records. Also, since the second method gets its result from cdc.change_tables (which very likely has far fewer rows than your CDC table does), it's going to be more efficient.
sys.fn_cdc_get_min_lsn returns the minimum available lsn for a change captured table.
Like @Ben says, this can be different (earlier) from the earliest change actually captured, for example when a table is first added to CDC and there haven't been any changes yet.
As per the MSDN doco, you should always use this to validate your query ranges prior to execution, because change data will eventually get cleaned up. So you will not only use this once: you will check it every time.
You should use this rather than getting the min LSN other ways because
it'll be faster (as Ben pointed out). Much faster potentially.
it's the documented API for doing so. The implementation of the backing tables might change in future versions etc...
Workflow is generally as follows (a sketch appears below):
load your previous LSN from (your state)
query the current LSN
query the minimum available LSN for the table
if prev >= min available, load changes only
otherwise, load the whole table and handle it (somehow)
save the current LSN to (your state)
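A sketch of that workflow for the capture instance 'dbo_MyTable' from the question (the dbo.etl_state table and its columns are hypothetical):
DECLARE @prev_lsn binary(10), @cur_lsn binary(10), @min_lsn binary(10);
SELECT @prev_lsn = last_lsn FROM dbo.etl_state WHERE capture_instance = 'dbo_MyTable';
SET @cur_lsn = sys.fn_cdc_get_max_lsn();
SET @min_lsn = sys.fn_cdc_get_min_lsn('dbo_MyTable');
IF @prev_lsn >= @min_lsn  -- NULL on the first run also falls through to the full load
    SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(
        sys.fn_cdc_increment_lsn(@prev_lsn), @cur_lsn, 'all');
ELSE
    SELECT * FROM dbo.MyTable;  -- cleanup outran the state: reload the whole table
UPDATE dbo.etl_state SET last_lsn = @cur_lsn WHERE capture_instance = 'dbo_MyTable';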

Does every record have a unique field in SQL Server?

I'm working in Visual Studio - VB.NET.
My problem is that I want to delete a specific row in SQL Server but the only unique column I have is an Identity that increments automatically.
My process of work:
1. I add a row to the table (the identity is being incremented, but I don't know the number)
2. I want to delete the previous row
Is there a sort of unique ID that every new record has?
It's possible that my table has two records that are exactly the same; only the sequence (identity) is different.
Any ideas how to handle this problem?
SQL Server has a few functions that return the generated ID of the last inserted row(s), each with its own specific strengths and weaknesses.
Basically:
@@IDENTITY works if you do not use triggers.
SCOPE_IDENTITY() works for the code you explicitly called.
IDENT_CURRENT('tablename') works for a specific table, across all scopes.
In almost all scenarios SCOPE_IDENTITY() is what you need, and it's a good habit to use it as opposed to the other options.
A good discussion on the pros and cons of the approaches is also available here.
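A minimal sketch for the asker's scenario (table and column names are hypothetical): capture the generated identity at insert time, then target exactly that row later, even if duplicate data exists.
DECLARE @new_id int;
INSERT INTO dbo.MyTable (SomeColumn) VALUES ('value');
SET @new_id = SCOPE_IDENTITY();  -- the identity generated in this scope
-- later: delete precisely the row that was inserted
DELETE FROM dbo.MyTable WHERE Id = @new_id;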
I want to delete the previous row
And that is your problem. There is no such concept in SQL as a 'previous row'. The word previous implies order, and order applies only to queries, where it is achieved by adding an ORDER BY clause. Tables have no order. You need to rephrase this in terms of "I need to delete the record that satisfies <this> condition." This may sound like pedantic gibberish, but you will never find a solution until you acknowledge the problem.
Searching for a way to read the value of the inserted identity column and then subtracting 1 from it is riddled with problems. It is incorrect under concurrency. It is incorrect in the presence of rollbacks. It is incorrect after ETL jobs. Overall, never expect identities to increase without gaps; they're free to jump, and your code should be correct in the presence of gaps.

SQL Server 2008 - Check For Row Changes

Instead of using a ton of OR statements to check if a row has been altered, I was looking into CHECKSUM() or BINARY_CHECKSUM(). What is best practice for this situation? Is it using CHECKSUM(), BINARY_CHECKSUM(), or some other method? I like the idea of using one of the checksum options so I don't have to build a massive OR statement for my update.
EDIT:
Sorry everyone, I should have provided more detail. I need to pull in data from some outside sources, but because I am using merge replication I don't want to just blow away and rebuild the tables. I want to only update or insert the rows that really have changes or don't exist. I will have a pared-down version of the source data in my target DB that will get synced down to clients. I was trying to find a good way to detect row changes without having to look at every single column to perform the update.
Any suggestions are greatly appreciated.
Thanks,
S
First, if you are using actual Merge replication, it should take care of updating the proper rows for you.
Second, typically the way to determine whether a row has changed is to use a column with a data type of timestamp, now called rowversion, which changes each time the row is updated. However, this type of column will only tell you whether the value changed since the last time you read it, which means you have to have read and stored the rowversions to use in comparison. Thus, this may not work for you.
Lastly, a solution which may work for you would be triggers on the table in question that update an actual DateTime (or better yet, DateTime2) column with the current date and time when an insert or update takes place. Your comparison would need to store the datetime at which you last synchronized with the table and compare it to the datetime in the last-updated column to determine which rows have changed.
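A minimal sketch of that trigger approach (table and column names are hypothetical; it assumes RECURSIVE_TRIGGERS is OFF, the default, so the trigger doesn't re-fire itself):
ALTER TABLE dbo.Target ADD LastUpdated datetime2 NULL;
GO
CREATE TRIGGER trg_Target_Touch ON dbo.Target
AFTER INSERT, UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    UPDATE t
    SET    LastUpdated = SYSUTCDATETIME()
    FROM   dbo.Target AS t
    JOIN   inserted AS i ON i.Id = t.Id;  -- touch only the affected rows
END;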
It might help if we had a bit more info about what you are doing, but in general the CHECKSUM() option does work well, as long as you have access to the original checksum of the row to compare against.
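For example, a sketch of a checksum-guarded update (table and column names are hypothetical):
UPDATE t
SET    t.ColA = s.ColA,
       t.ColB = s.ColB
FROM   dbo.Target AS t
JOIN   dbo.Source AS s ON s.Id = t.Id
WHERE  BINARY_CHECKSUM(t.ColA, t.ColB) <> BINARY_CHECKSUM(s.ColA, s.ColB);
-- Caveat: checksums can collide, so a changed row can occasionally be missed;
-- compare the columns directly if you need a hard guarantee.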
