Not all LSNs map to dates - sql-server

I'm building an ETL that processes data from SQL Server's change data capture (CDC) feature. Part of the ETL is recording logs about the data that is processed, including the start and end of the data import window. To do this I use the function sys.fn_cdc_map_lsn_to_time() to map the LSNs used to import the data to the corresponding datetime values.
The function cdc.fn_cdc_get_all_changes_<capture_instance>() takes two LSN parameters that mark the start and end of the data import window. These parameters are inclusive, so the next run needs to increment the previous LSN to avoid re-importing rows that fall on the boundary.
The obvious answer is to use the function sys.fn_cdc_increment_lsn() to get the next LSN before bringing in the data. However, what I found is that this LSN does not always map to a datetime using sys.fn_cdc_map_lsn_to_time(). The LSN is valid for use in cdc.fn_cdc_get_all_changes_<capture_instance>(), but I would like to be able to easily and accurately log the dates that are being used.
For example:
DECLARE @state_lsn_str CHAR(22) = '0x0000EEE100003E16008F'; -- try using `sys.fn_cdc_get_min_lsn(<capture_instance>)` instead since this value won't work for anyone else
DECLARE @state_lsn BINARY(10) = CONVERT(BINARY(10), @state_lsn_str, 1);
DECLARE @incr_lsn BINARY(10) = sys.fn_cdc_increment_lsn(@state_lsn);
SELECT CONVERT(CHAR(22), @incr_lsn, 1) AS incremented_lsn,
       sys.fn_cdc_map_lsn_to_time(@incr_lsn) AS incremented_lsn_date;
This code returns an LSN value of 0x0000EEE100003E160090 and NULL for incremented_lsn_date.
Is there a way to force an LSN to be mapped to a time?
OR
Is there a way to get the next LSN that does map to a time without risking losing any data?

The reason the value returned from sys.fn_cdc_increment_lsn() doesn't map to a datetime is that no change was recorded for that specific LSN. The function increments the LSN by the smallest possible value, even if no change was recorded at that point in time.
To work around the issue I used the sys.fn_cdc_map_time_to_lsn() function. This function takes a relational-operator parameter; passing 'smallest greater than' returns the next datetime value that has a mapping. The following code returns the next LSN that maps to a datetime:
DECLARE @state_lsn_str CHAR(22) = '0x0000EEE100003E16008F'; -- try using `sys.fn_cdc_get_min_lsn(<capture_instance>)` instead since this value won't work for anyone else
DECLARE @state_lsn BINARY(10) = CONVERT(BINARY(10), @state_lsn_str, 1);
DECLARE @state_lsn_date DATETIME = sys.fn_cdc_map_lsn_to_time(@state_lsn);
DECLARE @next_lsn BINARY(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than', @state_lsn_date);
SELECT CONVERT(CHAR(22), @next_lsn, 1) AS next_lsn,
       sys.fn_cdc_map_lsn_to_time(@next_lsn) AS next_lsn_date;
This code returns what appears to be a logical datetime value for the next LSN, though I'm unsure how to verify with 100% certainty that no data in any other change table falls between the two LSNs.
The code above has a @state_lsn_date value of 2018-02-15 23:59:57.447, the value found for the next LSN is 2018-02-16 00:00:01.363, and the integration runs at midnight.
The functions sys.fn_cdc_map_lsn_to_time() and sys.fn_cdc_map_time_to_lsn() use the cdc.lsn_time_mapping table to return their results. The documentation for this table states:
Returns one row for each transaction having rows in a change table.
This table is used to map between log sequence number (LSN) commit
values and the time the transaction committed. Entries may also be
logged for which there are no change tables entries. This allows the
table to record the completion of LSN processing in periods of low or
no change activity.
Microsoft Docs - cdc.lsn_time_mapping (Transact-SQL)
As I understand it, that means every LSN value in any change table will be mapped here. There may be additional LSNs, but there won't be missing ones, which allows the code to map to the next valid change date.
Since all changes will have a mapping in the cdc.lsn_time_mapping table using this method shouldn't lose any data.
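One way to gain some confidence is to query cdc.lsn_time_mapping directly. A minimal sketch, assuming @incr_lsn from the first snippet, that finds the nearest mapped transaction at or after the incremented LSN (its start_lsn should normally agree with @next_lsn from the workaround):
SELECT TOP (1) start_lsn, tran_begin_time, tran_end_time
FROM cdc.lsn_time_mapping
WHERE start_lsn >= @incr_lsn
ORDER BY start_lsn;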
Do I sound a little unsure? Well, I am.
I'm hoping someone with a deeper knowledge of the SQL Server Change Data Capture system can confirm whether this is safe or not.

Related

How to filter records between timestamps from information_schema.warehouse_load_history()

I want to filter the records between timestamps from information_schema.warehouse_load_history(), but the query below returns an empty result.
Query
select date_part(epoch_millisecond, convert_timezone('UTC', END_TIME)),
       WAREHOUSE_NAME, AVG_RUNNING, AVG_QUEUED_LOAD,
       AVG_QUEUED_PROVISIONING, AVG_BLOCKED
from table(information_schema.warehouse_load_history())
where date_part(epoch_millisecond, convert_timezone('UTC', END_TIME)) >= 1668081337659
  and date_part(epoch_millisecond, convert_timezone('UTC', END_TIME)) <= 1668083015000
The important point here is that the filters in the WHERE clause are applied after the warehouse_load_history table function returns a result set. This rule is valid for any information schema table function (e.g. query_history).
The function accepts DATE_RANGE_START, DATE_RANGE_END and WAREHOUSE_NAME parameters.
If an end date is not specified, then CURRENT_DATE is used as the end of the range.
If a start date is not specified, then the range starts 10 minutes prior to the start of DATE_RANGE_END.
So your query only returns the last 10 minutes of data for all warehouses, and your WHERE filter is applied to that returned data.
In short, you should use the filters of the function first (as I said, it's the same for all information schema functions), and then you should use the WHERE clause for additional filters.
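A minimal sketch of that approach, pushing the range into the function's parameters (the timestamp literals are illustrative conversions of the epoch milliseconds in the question):
select END_TIME, WAREHOUSE_NAME, AVG_RUNNING, AVG_QUEUED_LOAD,
       AVG_QUEUED_PROVISIONING, AVG_BLOCKED
from table(information_schema.warehouse_load_history(
    date_range_start => to_timestamp_ltz('2022-11-10 11:15:37'),
    date_range_end   => to_timestamp_ltz('2022-11-10 11:43:35')));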
You might be using that wrong; the dates are a part of the table function itself, so there's no need to add a WHERE clause outside of the table function!
For reference: https://docs.snowflake.com/en/sql-reference/functions/warehouse_metering_history.html
Code from ref:
select *
from table(information_schema.warehouse_metering_history('2017-10-23', '2017-10-23', 'testingwh'));

How to fully Automate CDC in SQL Server?

Is there a way to 100% automate SQL Server CDC initialization in an active SQL Server database? I am trying to solve the problem of finding the from_lsn during the first CDC data capture.
Sequence of events:
Enable CDC on a given database/table
Copy the full table to the destination (data lake)
Use CDC to capture the first delta (I want to avoid duplicates without missing a transaction)
Problem:
How to get the from_lsn for fn_cdc_get_all_changes_Schema_Table(from_lsn, to_lsn, '<row_filter_option>') function
Note:
Need to automate 100%
Cannot stop transactions on the table
Cannot miss any data and cannot afford duplicate data
Before doing the initial load, get the value of fn_cdc_get_max_lsn() and store it. This function returns the highest LSN known to CDC across all capture instances. It's the high water mark for the whole database.
Copy the whole table.
Start your delta process. The first time you call the delta function, the value of the from_lsn argument will be the stored value previously retrieved from fn_cdc_get_max_lsn(), incremented by fn_cdc_increment_lsn(). Get the current value from fn_cdc_get_max_lsn() (not the stored one) and use it as the value of the to_lsn argument.
From here proceed as you expect: take the maximum LSN returned from the delta function and store it. Next time you pull a delta, use fn_cdc_increment_lsn() on the stored value, use the result as the from_lsn argument, and use the result of fn_cdc_get_max_lsn() as the to_lsn argument.
With this process you will never miss any data. (Not covered here: be sure to check that your boundary conditions fall within a valid lsn range)
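A hedged sketch of that bootstrap in T-SQL, assuming a capture instance named dbo_t (the name is illustrative, as is how you persist the LSN between runs):
-- 1. Before the initial load: store the database-wide high-water mark.
DECLARE @bootstrap_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();
-- ... copy the full table to the destination here ...
-- 2. First delta: start just past the stored LSN, end at the current max.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_increment_lsn(@bootstrap_lsn);
DECLARE @to_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();
SELECT * FROM cdc.fn_cdc_get_all_changes_dbo_t(@from_lsn, @to_lsn, 'all');
-- 3. Persist @to_lsn; the next run increments it and repeats.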
Now, you mentioned that you want to avoid "duplicates". But if you try to define what a "duplicate" is in this scenario, I think you'll find it difficult.
For example, suppose I have this table to begin with:
create table t(i int primary key, c char);
insert t(i, c) values (1, 'a');
I call fn_cdc_get_max_lsn() and get 0x01.
A user inserts a new row into the table: insert t(i, c) values (2, 'b');
The user operation is associated with an LSN value of 0x02.
I select all the rows in this table (getting two rows).
I write both rows to my destination table.
I start my delta process. My min_lsn argument will be 0x02.
I will therefore get the {2, 'b'} row in the delta.
But I already retrieved the row {2, 'b'} as part of my initial load. Is this a "duplicate"? No, this represents a change to the table. What will I do with this delta when I load it into my destination? There are really only two options.
Option 1: I am going to merge the delta into the destination table based on the primary key. In that case, when I merge the delta I will overwrite the already-loaded row {2, 'b'} with the new row {2, 'b'}, the outcome of which looks the same as not doing anything.
Option 2: I am going to append all changes to the destination. In that case my destination table will contain the row {2, 'b'} twice. Is this a duplicate? No, because the two rows represent how the data looked at different logical times: first when I did the initial load, and then when I did the delta.
If you try to argue that this is in fact a duplicate, then I counter by giving you this hypothetical scenario:
You do the initial load, receiving row {1, 'a'},
No users change any data.
You get your first delta, which is empty.
A user executes update T set c = 'b' where i = 1.
You get your second delta, which will include the row {1, 'b'}.
A user executes update T set c = 'a' where i = 1.
You get your third delta, which will include the row {1, 'a'}.
Question: Is the row you retrieved during your third delta a "duplicate"? It has the same values as a row we already retrieved previously.
If your answer is "yes", then you can never eliminate "duplicate" reads, because a "duplicate" will occur any time a row mutates to have the same values it had at some previous point in time, which is something over which you have no control. If this is a "duplicate" that you need to eliminate in the append scenario, then that elimination must be performed at the destination, by comparing the incoming values with the existing values.
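If you do decide to eliminate value-equal rows in the append scenario, a hypothetical sketch of that destination-side comparison (dest and staging_delta are illustrative names, not from the original post):
-- Append only delta rows whose values are not already present at the destination.
INSERT INTO dest (i, c)
SELECT s.i, s.c
FROM staging_delta AS s
WHERE NOT EXISTS (SELECT 1 FROM dest AS d
                  WHERE d.i = s.i AND d.c = s.c); -- same values already loaded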

DateTimeOffset value changing

I'm moving data that is currently stored as an int to DateTimeOffset.
For example, here is the starting value 1341190841, and when I use this query:
dateadd (s,Call.StartTime, '1970-01-01') AS StartTimeDate
It returns this value, 2012-07-02 01:00:41.000, which is correct. However, I'm using SSIS to move data from one db to another, and when the data is in the new table the StartTimeDate now looks like this: 2012-07-02 01:00:41.0000000 +01:00.
Anyone got any idea how to remove the +01:00? I want to keep the time as it is in the first query.
I wasn't able to reproduce that behaviour (even with two SQL Servers in different timezones), so this may not be exactly what you want, but you can "fix" the TZ offset (the "+01:00") after copying the data by updating the StartTimeDate column with the function ToDateTimeOffset like this:
UPDATE the_table SET StartTimeDate = TODATETIMEOFFSET(StartTimeDate, 0)
That will leave the date and time untouched while adjusting the offset to the specified one (0 since you want it to "adjust" the TZ from +1 to 0).
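A minimal sketch of that behaviour, using a value like the one in the question:
DECLARE @dto DATETIMEOFFSET = '2012-07-02 01:00:41.0000000 +01:00';
SELECT TODATETIMEOFFSET(@dto, 0) AS adjusted;
-- returns 2012-07-02 01:00:41.0000000 +00:00: same clock time, offset replaced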

How to force table select to go over blocks

How can I make Sybase's database engine return an unsorted list of records in non-numeric order?
~~~
I have an issue where I need to reproduce an application error: I select from a table where the ID is generated in sequence, but the ID returned is not the last one in the selection.
Let me explain.
ID STATUS
_____________
1234 C
1235 C
1236 O
Above are 3 IDs. I had code where these would be the results of a
select @p_id = ID from table where (conditions).
However, there wasn't a clause to check for status = 'O' (open). Remember Sybase saves the last returned record into a variable.
~~~~~
I'm being asked to give the testing team something that will make the results not work. If Sybase selects the above as an unordered list, it could appear in ascending order, or, if the database engine needs to change blocks of stored data or do some technical magic, the order could be messed up. The original error was when the procedure would return, say, 1234 instead of 1236.
Is there a way that I can have a 100% guarantee that Sybase will search over a block of data and have to double back, effectively 'breaking' the ascending search, and returning not the last record, but any other one? (all records except the maximum will end up erroring, because they are all 'Closed')
I want some sort of magical SQL code that will make sure things don't search the table in exactly numeric order. Ideally I'd like to not have to change the procedure, as the test team want to see the exact same procedure breaking (as easily as plonking an order by id desc in would fudge the results).
If you don't specify an order, there is no way to guarantee the return order of the results. It will be however the index is built, and can depend on the order of insertion, the type of index, and the content of the index keys.
It's generally a bad idea to do those sorts of singleton SELECTs. You should always specify a specific record with the WHERE clause, or use a cursor, or TOPn or similar. The problem comes when someone tries to understand your code, because some databases when they see multiple hits take the first value, some take the last value, some take a random value (they call that "implementation-defined"), and some throw an error.
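For instance, a hedged sketch of a deterministic version of the question's singleton select (assuming an ASE version that supports TOP; the table name is illustrative):
-- Pick exactly one row instead of relying on scan order:
select top 1 @p_id = ID
from the_table
where status = 'O' -- the status check that was missing in the question
order by ID desc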
Is this by any chance related to 1156837? :)

Oracle Sequences

I'm using a Sequence to generate an id for me, but I use that sequence from my Java application. So say for example my last Id is 200.
If I add an entry with .sql by using 201 as an id instead of doing seq.nextval, what would happen when my Java application calls seq.nextval? Is the sequence smart enough to check the max available number, or will it just return 201?
It will just return 201, as the sequence has no idea what the numbers are used for.
Note: It may return, say, 220 if you have specified that the sequence has to cache values for some session (see the Oracle manual about CREATE SEQUENCE for more details)
Sequences just provide a way to "select" numbers that auto increment.
You will get 201 because they don't check anything, they just store the last value retrieved and when you query it again, it will return the next value in the sequence.
It will return 201.
You could also use nextval from JDBC, and then use that value to do the insert:
Statement st = conn.createStatement();
ResultSet rs = st.executeQuery("select seq.nextval from dual");
rs.next();
int yourId = rs.getInt(1);
// then use yourId to do the insert
This way you can insert using a number, and also keep the sequence the way it should be.
What nextval returns on the next call from your Java app depends on a number of factors:
If you run in a clustered environment, which node you next speak to. Each node will preallocate a pool of sequence values;
Whether or not the node you're talking to has been restarted. If this happens the pool of sequence values will tend to be "lost" (meaning skipped);
The step value of the sequence; and
Whether other transactions have called nextval on the sequence.
Sequences are loosely ordered, not absolutely ordered.
But Oracle has no idea what you do with sequence values so if you insert a 201 into the database, the sequence will happily return 201 completely oblivious to the inserted value as the two are basically unrelated.
It is never a good idea to mix sequence-generated values with manual inserts because then everything gets mixed up.
Not sure if it helps in your situation, but remember that you can ask for the current value of the sequence (with seq.currval or similar) so that you can check if it already exists in the table due to a manual insert and, if necessary, ask for another nextval.
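A minimal sketch of that check (note that currval is only defined after nextval has been called in the same session):
-- only valid after seq.nextval has been called in this session
select seq.currval from dual;
-- if that id was already taken by a manual insert, draw another value
select seq.nextval from dual;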
