How to fully Automate CDC in SQL Server?

Is there a way to 100% automate SQL Server CDC initialization in an active SQL Server database? I am trying to solve the problem of finding the from_lsn for the first CDC data capture.
Sequence of events:
1. Enable CDC on the given database/table.
2. Copy the full table to the destination (data lake).
3. Use CDC to capture the first delta (I want to avoid duplicates without missing a transaction).
Problem:
How do I get the from_lsn for the fn_cdc_get_all_changes_Schema_Table(from_lsn, to_lsn, '<row_filter_option>') function?
Note:
- This needs to be 100% automated.
- I cannot stop transactions on the table.
- I cannot miss any data and cannot afford duplicate data.

Before doing the initial load, get the value of fn_cdc_get_max_lsn() and store it. This function returns the highest LSN known to CDC across all capture instances. It's the high water mark for the whole database.
Copy the whole table.
Start your delta process. The first time you call the delta function, the value of the min_lsn argument will be the stored value previously retrieved from fn_cdc_get_max_lsn() incremented by fn_cdc_increment_lsn. Get the current value from fn_cdc_get_max_lsn() (not the stored one) and use it as the value of the max_lsn argument.
From here proceed as you expect. Take the maximum LSN returned from the delta function, store it. Next time you pull a delta, use fn_cdc_increment_lsn on the stored value, use the result as the value of the min_lsn argument, and use the result of fn_cdc_get_max_lsn() as the max_lsn argument.
With this process you will never miss any data. (Not covered here: be sure to check that your boundary conditions fall within a valid lsn range)
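Putting those steps together, here is a minimal T-SQL sketch (the capture instance name dbo_MyTable and how you persist the stored LSN between runs are assumptions for illustration, not part of the answer above):

-- Before the initial load: capture and persist the high water mark.
DECLARE @last_lsn BINARY(10) = sys.fn_cdc_get_max_lsn();

-- ... copy the full table to the destination ...

-- Every delta pull (including the first): advance past the stored LSN.
DECLARE @from_lsn BINARY(10) = sys.fn_cdc_increment_lsn(@last_lsn);
DECLARE @to_lsn   BINARY(10) = sys.fn_cdc_get_max_lsn();

SELECT *
FROM cdc.fn_cdc_get_all_changes_dbo_MyTable(@from_lsn, @to_lsn, 'all');

-- Persist the maximum __$start_lsn returned (or simply @to_lsn) as the
-- stored LSN for the next pull.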
Now, you mentioned that you want to avoid "duplicates". But if you try to define what a "duplicate" is in this scenario, I think you'll find it difficult.
For example, suppose I have this table to begin with:
create table t(i int primary key, c char);
insert t(i, c) values (1, 'a');
I call fn_cdc_get_max_lsn() and get 0x01.
A user inserts a new row into the table: insert t(i, c) values (2, 'b');
The user operation is associated with an LSN value of 0x02.
I select all the rows in this table (getting two rows).
I write both rows to my destination table.
I start my delta process. My min_lsn argument will be 0x02.
I will therefore get the {2, 'b'} row in the delta.
But I already retrieved the row {2, 'b'} as part of my initial load. Is this a "duplicate"? No, this represents a change to the table. What will I do with this delta when I load it into my destination? There are really only two options.
Option 1: I am going to merge the delta into the destination table based on the primary key. In that case, when I merge the delta I will overwrite the already-loaded row {2, 'b'} with the new row {2, 'b'}, the outcome of which looks the same as not doing anything.
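To make Option 1 concrete, a hedged sketch of such a merge (the destination table dest_t, the staging table #delta holding the delta function's output under the 'all' row filter, and the column names are my assumptions; the delta must first be reduced to the latest change per key so MERGE sees at most one source row per target row):

MERGE dest_t AS d
USING #delta AS s
    ON d.i = s.i
WHEN MATCHED AND s.__$operation = 1      -- delete
    THEN DELETE
WHEN MATCHED AND s.__$operation = 4      -- update (after image)
    THEN UPDATE SET d.c = s.c
WHEN NOT MATCHED AND s.__$operation = 2  -- insert
    THEN INSERT (i, c) VALUES (s.i, s.c);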
Option 2: I am going to append all changes to the destination. In that case my destination table will contain the row {2, 'b'} twice. Is this a duplicate? No, because the two rows represent how the data looked at different logical times: first when I did the initial load, and then when I did the delta.
If you try to argue that this is in fact a duplicate, then I counter by giving you this hypothetical scenario:
You do the initial load, receiving row {1, 'a'},
No users change any data.
You get your first delta, which is empty.
A user executes update T set c = 'b' where i = 1.
You get your second delta, which will include the row {1, 'b'}.
A user executes update T set c = 'a' where i = 1.
You get your third delta, which will include the row {1, 'a'}.
Question: Is the row you retrieved during your third delta a "duplicate"? It has the same values as a row we already retrieved previously.
If your answer is "yes", then you can never eliminate "duplicate" reads, because a "duplicate" will occur any time a row mutates to have the same values it had at some previous point in time, which is something over which you have no control. If this is a "duplicate" that you need to eliminate in the append scenario, then that elimination must be performed at the destination, by comparing the incoming values with the existing values.
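If you do decide to eliminate such rows in the append scenario, a sketch of that destination-side comparison (the history table dest_t_history, the current-state table dest_t_current holding the latest values per key, and the staging table #delta are illustrative assumptions, not from the answer):

-- Append only the delta rows whose values differ from what the
-- destination currently holds for that key.
INSERT INTO dest_t_history (i, c)
SELECT s.i, s.c
FROM #delta AS s
WHERE NOT EXISTS (
    SELECT 1
    FROM dest_t_current AS d
    WHERE d.i = s.i AND d.c = s.c
);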

Not all LSNs map to dates

I'm building an ETL that processes data from SQL Server's change data capture feature. Part of the ETL is recording logs about the data that is processed, including the data import window start and end. To do this I use the function sys.fn_cdc_map_lsn_to_time() to map the LSNs used to import the data to the corresponding datetime values.
The function cdc.fn_cdc_get_all_changes_<capture_instance>() takes two parameters that are the start and end of the data import window. These parameters are inclusive, so the next run needs to increment the previous LSN to avoid re-importing rows that fall on the boundary.
The obvious answer is to use the function sys.fn_cdc_increment_lsn() to get the next LSN before bringing in the data. However, what I found is that this LSN does not always map to a datetime using sys.fn_cdc_map_lsn_to_time(). The LSN is valid for use in cdc.fn_cdc_get_all_changes_<capture_instance>(), but I would like to be able to easily and accurately log the dates that are being used.
For example:
DECLARE @state_lsn_str CHAR(22) = '0x0000EEE100003E16008F'; -- try using `sys.fn_cdc_get_min_lsn(<capture_instance>)` instead since this value won't work for anyone else
DECLARE @state_lsn BINARY(10) = CONVERT(BINARY(10), @state_lsn_str, 1);
DECLARE @incr_lsn BINARY(10) = sys.fn_cdc_increment_lsn(@state_lsn);
SELECT CONVERT(CHAR(22), @incr_lsn, 1) AS incremented_lsn,
       sys.fn_cdc_map_lsn_to_time(@incr_lsn) AS incremented_lsn_date;
This code returns an LSN value of 0x0000EEE100003E160090 and NULL for incremented_lsn_date.
Is there a way to force an LSN to be mapped to a time?
OR
Is there a way to get the next LSN that does map to a time without risking losing any data?
The reason the value returned from sys.fn_cdc_increment_lsn() doesn't map to a datetime is that there was no change recorded for that specific LSN. The function increments the LSN by the smallest possible value, even if no change was recorded at the resulting LSN.
To work around the issue I used the sys.fn_cdc_map_time_to_lsn() function. This function takes a relational-operator parameter. You can get the next datetime value by using 'smallest greater than' for this parameter. The following code returns the next LSN that maps to a datetime:
DECLARE @state_lsn_str CHAR(22) = '0x0000EEE100003E16008F'; -- try using `sys.fn_cdc_get_min_lsn(<capture_instance>)` instead since this value won't work for anyone else
DECLARE @state_lsn BINARY(10) = CONVERT(BINARY(10), @state_lsn_str, 1);
DECLARE @state_lsn_date DATETIME = sys.fn_cdc_map_lsn_to_time(@state_lsn);
DECLARE @next_lsn BINARY(10) = sys.fn_cdc_map_time_to_lsn('smallest greater than', @state_lsn_date);
SELECT CONVERT(CHAR(22), @next_lsn, 1) AS next_lsn,
       sys.fn_cdc_map_lsn_to_time(@next_lsn) AS next_lsn_date;
This code returns what appears to be a logical datetime value for the next LSN, though I'm unsure how to check with 100% certainty that there is no data in any other tables in between.
The code above has a @state_lsn_date value of 2018-02-15 23:59:57.447, the value found for the next LSN is 2018-02-16 00:00:01.363, and the integration runs at midnight.
The functions sys.fn_cdc_map_lsn_to_time() and sys.fn_cdc_map_time_to_lsn() use the cdc.lsn_time_mapping table to return their results. The documentation for this table states:
Returns one row for each transaction having rows in a change table. This table is used to map between log sequence number (LSN) commit values and the time the transaction committed. Entries may also be logged for which there are no change table entries. This allows the table to record the completion of LSN processing in periods of low or no change activity.
Microsoft Docs - cdc.lsn_time_mapping (Transact-SQL)
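For reference, you can inspect this table directly (a small illustrative query, not from the original post):

-- The most recent LSN-to-time mappings CDC has recorded.
SELECT TOP (10) start_lsn, tran_begin_time, tran_end_time, tran_id
FROM cdc.lsn_time_mapping
ORDER BY start_lsn DESC;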
As I understand it, that means every LSN value in any change table will be mapped here. There may be additional LSNs, but there won't be missing LSNs. This allows the code to map to the next valid change date.
Since all changes have a mapping in the cdc.lsn_time_mapping table, this method shouldn't lose any data.
Do I sound a little unsure? Well, I am.
I'm hoping someone with a deeper knowledge of the SQL Server Change Data Capture system can confirm whether this is safe or not.

Excel array function: return all fields found with INDEX and MATCH

Here is the array function I am using:
=IFERROR(INDEX('Master Data'!$D$2:$D$153,MATCH(1,(B9='Master Data'!$J$2:$J$153)*('Master Data'!$W$2:$W$153=1),FALSE)),"")
Where D is the name of a project, J is a person's name, and W is a flag (equal to 0 or 1) indicating whether they are assigned to a project. B is also an instance of the person's name, built up from a separate list.
It basically references the master data and returns any rows matching the criteria specified. However, a single person may have two instances where the assigned flag is equal to 1, and thus, as the master data is filtered, different results are given back by the function.
Another problem I have is that the person's name is not repeated either, so maybe the best way would be to start populating the names in the assigned table from the master data as well.
As requested, here is a small example of the data. On the left is the master data, in the middle is the assigned table that's being built, and on the right is the list of employees that builds up the names in the assigned table.
Please note that there are two instances of David Smith in the master data but only one in the assigned table, as it's being built from the employees list.
What I was thinking was to build up the names in the assigned table from the master data using an array where the assigned indicator is equal to 1, and to completely scrap the employees list, but I'm really unsure if this is possible or how to go about it.
Or is there even some way to select a few columns from the master data where the assigned indicator = 1?
I'm not sure if I understood it right, but the problem of how to list multiple lookup results can be solved with the SMALL function in combination with getting an array of ROW numbers.
For example have a sheet like this:
Then formulas are:
F4 downwards:
=COUNTIFS($B:$B,$E4,$C:$C,1)
G4 and then copied in G4:J8:
{=INDEX($A$1:$A$1000,SMALL(IF($B$1:$B$1000=$E4,IF($C$1:$C$1000=1,ROW($1:$1000))),COLUMN(A:A)))}
But if the goal is only to have a filtered list of all assigned resources, then the formulas could be
E13:
{=INDEX($B$1:$B$1000,MATCH(1,($C$1:$C$1000=1)*1,0))}
E14 downwards:
{=INDEX($B$1:$B$1000,MATCH(1,(COUNTIF($E$13:$E13,$B$1:$B$1000)=0)*($C$1:$C$1000=1),0))}
Formulas in {} are array formulas. These are entered into the cell without the curly brackets and then [Ctrl]+[Shift]+[Enter] is pressed to finish.
I have not handled the error values, for better understanding of the formulas. You can later hide the error values with IFERROR. You seem to know how to do this.

negative number in my datagridview

I'm having a problem with my DataGridView.
I'm using VB 2008 and an Access database. When I create a new record for my item, the No column (primary key and AutoNumber in Access) always shows a negative number.
How can I make that a positive number that follows the numbering in the DataGridView? Here's a screenshot of that:
Your "No" column will start out with an AutoIncrementSeed of -1 and an AutoIncrementStep of -1. The DataSet isn't smart enough to start with the Max value of the "No" column, so you need to programatically set it.
Me.MyDataSet.MyDataTable.Columns("No").AutoIncrementSeed = _
Me.MyDataSet.MyDataTable.Max(Function(Row) Row.No) + 1
Me.MyDataSet.MyDataTable.Columns("No").AutoIncrementStep = 1
The first line above finds the maximum value of the No column and sets the AutoIncrementSeed to 1 above the maximum value. The second line just sets the IncrementStep to 1.
Keep in mind that the No column in the DataGridView may not correspond to the actual value that the database creates. When your data is committed, the database will create a new AutoNumber value, ignoring any other values you may pass in. There are some pitfalls to doing this if your users expect the No value to stay the same after committing the new entry.
Just open the DataSet designer, look for the increment parameter (-1) of the applicable DataGridView for the table concerned, and change it to (+1). Then rebuild or recompile your application and the problem will be resolved.
If it's showing as negative, check the logic for why it's converted to a negative number. Simply changing - to + doesn't matter if the underlying logic is wrong. As a safeguard, you could add a condition such as column1 >= 0.

How to force table select to go over blocks

How can I make Sybase's database engine return an unsorted list of records in non-numeric order?
~~~
I have an issue where I need to reproduce an error in the application where I select from a table where the ID is generated in sequence, but the ID is not the last one in the selection.
Let me explain.
ID STATUS
_____________
1234 C
1235 C
1236 O
Above are 3 IDs. I had code where these would be the results of:
select @p_id = ID from table where (conditions)
However, there wasn't a clause to check for status = 'O' (open). Remember that Sybase saves the last returned record into the variable.
~~~
I'm being asked to give the testing team something that will make the results not work. If Sybase selects the above as an unordered list, it could appear in ascending order, or, if the database engine needs to change blocks of stored data or do some other technical magic, the order could be messed up. The original error was the procedure returning, say, 1234 instead of 1236.
Is there a way that I can have a 100% guarantee that Sybase will search over a block of data and have to double back, effectively 'breaking' the ascending search, and returning not the last record, but any other one? (all records except the maximum will end up erroring, because they are all 'Closed')
I want some sort of magical SQL code that will make sure things don't search the table in exactly numeric order. Ideally I'd like to not have to change the procedure, as the test team wants to see the exact same procedure breaking (something as easy as plonking an order by id desc in there would fudge the results).
If you don't specify an order, there is no way to guarantee the return order of the results. It will be however the index is built - and can depend on the order of insertion, the type of index, and the content of index keys.
It's generally a bad idea to do those sorts of singleton SELECTs. You should always specify a specific record with the WHERE clause, or use a cursor, or TOP n or similar. The problem comes when someone tries to understand your code, because when they see multiple hits, some databases take the first value, some take the last value, some take a random value (they call that "implementation-defined"), and some throw an error.
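For example, a sketch of making the pick explicit instead of relying on scan order (generic Sybase/T-SQL style; the_table is a placeholder name, the columns are from the question's example):

-- Deterministic: select the highest open ID explicitly.
select @p_id = max(ID)
from the_table
where status = 'O'

-- or, roughly equivalently:
-- select top 1 @p_id = ID from the_table where status = 'O' order by ID desc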
Is this by any chance related to 1156837? :)

Oracle Sequences

I'm using a sequence to generate an id for me, but I use that sequence from my Java application. So say, for example, my last id is 200.
If I add an entry with a .sql script using 201 as an id instead of doing seq.nextval, what would happen when my Java application calls seq.nextval? Is the sequence smart enough to check the max available number, or will it just return 201?
It will just return 201, as the sequence has no idea what the numbers are used for.
Note: it may return, say, 220 if you have specified that the sequence should cache values (see the Oracle manual on CREATE SEQUENCE for more details).
Sequences just provide a way to "select" numbers that auto increment.
You will get 201, because sequences don't check anything; they just store the last value retrieved, and when you query again, they return the next value in the sequence.
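A minimal illustration of that point (the table t and its unique id column are assumptions for the example; the last sequence value is 200, as in the question):

-- The sequence ignores table contents, so the manual insert collides
-- with the next sequence-based insert:
insert into t(id) values (201);          -- manual insert, bypassing the sequence
insert into t(id) values (seq.nextval);  -- nextval also returns 201 -> ORA-00001 if id is unique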
It will return 201.
You could also use nextval from JDBC, and then use that value to do the insert:
Statement st = conn.createStatement();
ResultSet rs = st.executeQuery("select seq.nextval from dual");
rs.next();
int yourId = rs.getInt(1);
// then use yourId to do the insert
This way you can insert using a number, and also keep the sequence the way it should be.
What nextval returns on the next call from your Java app depends on a number of factors:
If you run in a clustered environment, which node you next speak to. Each node will preallocate a pool of sequence values;
Whether or not the node you're talking to has been restarted. If this happens the pool of sequence values will tend to be "lost" (meaning skipped);
The step value of the sequence; and
Whether other transactions have called nextval on the sequence.
Sequences are loosely ordered, not absolutely ordered.
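For instance, the caching behaviour mentioned above is set at creation time (illustrative DDL; 20 is Oracle's default cache size, not something from the question):

-- Each instance preallocates CACHE values; a restart can skip whatever
-- remained unconsumed in the cache, leaving gaps.
create sequence seq start with 1 increment by 1 cache 20;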
But Oracle has no idea what you do with sequence values, so if you insert a 201 into the database, the sequence will happily return 201, completely oblivious to the inserted value, as the two are basically unrelated.
It is never a good idea to mix sequence-generated values with manual inserts because then everything gets mixed up.
Not sure if it helps in your situation, but remember that you can ask for the current value of the sequence (with seq.currval or similar) so that you can check whether it already exists in the table due to a manual insert and, if necessary, ask for another nextval.
