How to update the origin table in the CDC workflow (via SSIS)? - sql-server

I have a CDC process set up, whereby TableA's new or updated rows are automatically picked up by an ETL and put into TableB.
TableA >>CDC>> TableB
The CDC works fine, except I want to update the first table once the CDC process is finished. I want to update the table by populating it with the "extraction date". So my TableA has, let's say: Name, Age, OtherInfo, ExtractionDate. CDC is set up on the Name, Age and OtherInfo columns (the ExtractionDate column is excluded for obvious reasons).
Then, once CDC is performed on TableA and the changes are taken to TableB, I'd like to populate TableA's ExtractionDate with the current date. However, since I do not know which rows are being moved, I am having difficulty populating the column. Specifically, how can I write a "selective" WHERE clause that picks only the "changed" rows, when that's only known to SSIS?

In the TableA database there are system tables that were created as part of enabling CDC. You should be able to easily find the change table associated with TableA (it is named cdc.<capture_instance>_CT). This is where SQL Server keeps track of all the changes.
The __$start_lsn is the log sequence number (LSN) of the transaction that committed the change, not a timestamp, and your SSIS imports use this value to bring across a range of changes. The cdc.lsn_time_mapping table lets you look up the commit time for an LSN, so it is easier to understand.
In my processing I store the start and end LSN values so I know what was brought across with each SSIS run. I can then use these LSN values to go back to the CDC change table and see all the changes that SQL Server tracked during that time span.
Keep in mind that the CDC system tables are automatically cleaned out every few days (the default retention is three days) - so you wouldn't be able to apply this logic historically, only for recent imports.
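To make the LSN values easier to read, each change row can be listed next to the commit time it maps to. This is only a sketch: it assumes the default capture instance name dbo_TableA (so the change table is cdc.dbo_TableA_CT), which the question does not confirm.

```sql
-- Assumption: the capture instance is named dbo_TableA.
-- __$operation: 1 = delete, 2 = insert, 3 = update (old values), 4 = update (new values)
SELECT
    c.__$start_lsn,
    sys.fn_cdc_map_lsn_to_time(c.__$start_lsn) AS commit_time,
    c.__$operation,
    c.Name,
    c.Age,
    c.OtherInfo
FROM cdc.dbo_TableA_CT AS c
ORDER BY c.__$start_lsn, c.__$seqval;
```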
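Putting that together, the stored LSN range could drive the update of TableA after each run. The following is a minimal sketch, not a tested implementation. It assumes a capture instance named dbo_TableA, that Name uniquely identifies a row (the question does not say what the key is), and a hypothetical dbo.EtlRunLog table in which the SSIS package saves the LSN range of each run.

```sql
-- Hypothetical: dbo.EtlRunLog(RunId, StartLsn, EndLsn) is written by the SSIS package.
DECLARE @from_lsn binary(10), @to_lsn binary(10);

SELECT TOP (1) @from_lsn = StartLsn, @to_lsn = EndLsn
FROM dbo.EtlRunLog
ORDER BY RunId DESC;

-- Stamp every row that was inserted (2) or updated (4) inside that LSN range.
UPDATE a
SET a.ExtractionDate = GETDATE()
FROM dbo.TableA AS a
WHERE EXISTS (
    SELECT 1
    FROM cdc.fn_cdc_get_all_changes_dbo_TableA(@from_lsn, @to_lsn, N'all') AS c
    WHERE c.Name = a.Name              -- assumption: Name is the key
      AND c.__$operation IN (2, 4)
);
```

The function raises an error if @from_lsn is older than the oldest LSN still held for the capture instance, which is the cleanup limitation mentioned above.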

Related

How to apply ROW ID in snowflake? Oracle code conversion to Snowflake

Is there any way to convert the final a.ROWID > b.ROWID condition in the code below to Snowflake? The code below is Oracle. I need to carry the ROWID logic over to Snowflake, but Snowflake does not maintain a ROWID. Is there any way to achieve the same result and get around the ROWID issue?
DELETE FROM user_tag.user_dim_default a
WHERE EXISTS (SELECT 1
              FROM rev_tag.emp_site_weekly b
              WHERE a.number = b.ID
                AND a.accountno = b.account_no
                AND a.ROWID > b.ROWID)
This Oracle code seems badly broken, because ROWID is a table-specific pseudocolumn, so comparing its values between two tables is meaningless. The exception would be some aligned magic, for example if every insert into user_tag.user_dim_default is accompanied by a write to rev_tag.emp_site_weekly. But even then I can imagine data flows where this will not give you what you want.
As with most things in Snowflake, "there is no free lunch": the data life cycle that relies on ROWID needs to be implemented explicitly.
That implies that if you want to use two sequences, you should define one explicitly on each table. And if you want them to be related to each other, it sounds like a multi-table insert or MERGE should be used, so you can access the first table's sequence value and relate it in the second.
ROWID is an internal hidden column used by the database for specific DB operations. Depending on the vendor, you may have additional columns such as a transaction ID or a logical delete flag. Be very careful to understand the behavior of these columns and how they work. They may not be in order, they may not be sequential, and they may change in value when a DB maintenance job runs while your code is running, or when someone else runs an update on the table. Some of these internal columns may even have the same value for more than one row.
When joining tables, the ROWID on one table has no relation to the ROWID on another table. When writing dedup logic or delete-before-insert logic, you should use the primary key, combined with an audit column that holds the date of insert or date of last update. Check the data model or ERD diagram for the PK/FK relationships between the tables and for which audit columns are available.
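As a hedged illustration of that advice in Snowflake, the ROWID comparison could be replaced by a comparison on an audit column. The load_ts column below is hypothetical (it would have to exist on both tables), and the direction of the comparison is a guess at the original intent, since the cross-table ROWID comparison had no well-defined meaning to begin with.

```sql
-- Assumption: both tables carry a load_ts audit column (not in the original code).
DELETE FROM user_tag.user_dim_default
USING (
    SELECT ID, account_no, load_ts
    FROM rev_tag.emp_site_weekly
) AS b
WHERE user_tag.user_dim_default.number    = b.ID
  AND user_tag.user_dim_default.accountno = b.account_no
  AND user_tag.user_dim_default.load_ts   > b.load_ts;
```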

Using Dynamic SQL in a trigger to identify changes

I'm in the process of building a brand-new database, and want every table I create to have a corresponding audit table which would track any data changes.
In order to avoid having to hard-code every table column, what I would like to do is use Dynamic SQL to review each column in the table (with the exception of the Identity column) and work out whether or not the column has been changed, using the Inserted and Deleted tables to do so. By doing that, I could then theoretically add columns to the tables without having to re-create the triggers associated with the tables.
Is such a thing possible or am I running down a blind alley?

Synchronize table between two different databases

Once a day I have to synchronize a table between two databases.
Source: Microsoft SQL Server
Destination: PostgreSQL
Table contains up to 30 million rows.
The first time I will copy the whole table, but after that, for efficiency, my plan is to insert/update only the changed rows.
Done this way, if I delete a row from the source database, it will not be deleted from the destination database.
The problem is that I don't know which rows were deleted from the source database.
My dirty idea right now is to use a binary search - comparing the sum of the rows on each side and thus catching the deleted rows.
I’m at a dead end - please share your thoughts on this...
In SQL Server you can enable Change Tracking to track which rows are Inserted, Updated, or Deleted since the last time you synchronized the tables.
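A minimal sketch of what that could look like follows. The database name SourceDb, the table dbo.BigTable, its key column Id and the seven-day retention are all placeholders, and Change Tracking requires the table to have a primary key. The one-time setup would be run as its own batch:

```sql
-- One-time setup (placeholder names and retention).
ALTER DATABASE SourceDb
SET CHANGE_TRACKING = ON (CHANGE_RETENTION = 7 DAYS, AUTO_CLEANUP = ON);

ALTER TABLE dbo.BigTable
ENABLE CHANGE_TRACKING WITH (TRACK_COLUMNS_UPDATED = OFF);
```

Each daily run would then ask for everything that changed since the version saved by the previous run:

```sql
-- @last_sync_version is whatever the previous run stored; 0 is only a placeholder.
DECLARE @last_sync_version bigint = 0;
DECLARE @this_sync_version bigint = CHANGE_TRACKING_CURRENT_VERSION();

SELECT
    ct.Id,
    ct.SYS_CHANGE_OPERATION,   -- 'I' = insert, 'U' = update, 'D' = delete
    t.*                        -- current column values; NULL for deleted rows
FROM CHANGETABLE(CHANGES dbo.BigTable, @last_sync_version) AS ct
LEFT JOIN dbo.BigTable AS t
    ON t.Id = ct.Id;

-- Persist @this_sync_version somewhere for the next run.
```

Rows returned with 'D' are the deletions to replay on the PostgreSQL side.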
With a TDS FDW (foreign data wrapper), expose the source table in PostgreSQL as a foreign table, optionally copy it into a temp table, and use a join to find or exclude the rows that you need.
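On the PostgreSQL side, that join could be an anti-join that removes destination rows whose key no longer exists in the source. This sketch assumes a foreign table named mssql_big_table has already been defined through the FDW, that the destination table is big_table, and that both share a key column id. All three names are hypothetical.

```sql
-- Pull only the keys across the wire, then index them locally.
CREATE TEMP TABLE source_keys AS
SELECT id FROM mssql_big_table;

CREATE INDEX ON source_keys (id);

-- Delete destination rows that no longer exist in the source.
DELETE FROM big_table AS d
WHERE NOT EXISTS (
    SELECT 1 FROM source_keys AS s WHERE s.id = d.id
);
```

This still transfers every key on each run (up to 30 million here), so it trades network traffic for not having to change anything on the SQL Server side.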

Is using a wide temporal table with only one regularly updated column efficient?

I have been unable to pin down how temporal table histories are stored.
If you have a table with several columns of nvarchar data and one stock quantity column that is updated regularly, does SQL Server store copies of the static columns for each change made to stock quantity, or is there an object-oriented method of storing the data?
I want to include all columns in the history because it is possible there will be rare changes to the nvarchar columns, but I am wary of the history table size if millions of quantity updates duplicate the other columns.
I suggest that you use the SQL Server temporal table only for the values that need monitoring; otherwise the fixed, unchanging attribute values would get duplicated with every change. SQL Server stores a whole new row in the history table whenever a row is updated. See the docs:
UPDATES: On an UPDATE, the system stores the previous value of the row
in the history table and sets the value for the SysEndTime column to
the begin time of the current transaction (in the UTC time zone) based
on the system clock
You need to move your fixed varchar attributes/fields to another table and use a relation, 1:1 or whatever will be suitable.
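A sketch of that split, with hypothetical table and column names: the wide, rarely changing attributes live in one temporal table and the frequently updated quantity in a narrow one, so each quantity update adds only a small row to the history.

```sql
-- Wide, rarely updated attributes (hypothetical columns).
CREATE TABLE dbo.Product (
    ProductId   int            NOT NULL PRIMARY KEY CLUSTERED,
    Name        nvarchar(200)  NOT NULL,
    Description nvarchar(4000) NULL,
    ValidFrom   datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo     datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductHistory));

-- Narrow, frequently updated stock level, 1:1 with dbo.Product.
CREATE TABLE dbo.ProductStock (
    ProductId int NOT NULL PRIMARY KEY CLUSTERED
        REFERENCES dbo.Product (ProductId),
    Quantity  int NOT NULL,
    ValidFrom datetime2 GENERATED ALWAYS AS ROW START NOT NULL,
    ValidTo   datetime2 GENERATED ALWAYS AS ROW END   NOT NULL,
    PERIOD FOR SYSTEM_TIME (ValidFrom, ValidTo)
) WITH (SYSTEM_VERSIONING = ON (HISTORY_TABLE = dbo.ProductStockHistory));
```

Versioning both tables keeps the rare nvarchar changes auditable while the millions of quantity updates only copy the key, the quantity and the two period columns.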
Check also other relevant questions under the temporal-tables tag:
SQL Server - Temporal Table - Storage costs
SQL Server Temporal Table Creating Duplicate Records
Duplicates in temporal history table

CDC table not working after adding new columns to the source table

Two new columns were added to our source table while CDC was still enabled on the table. I need the new columns to appear in the CDC table but do not know what procedure should be followed to do this? I have already disabled CDC on the table, disabled CDC on the DB, added the new columns to the cdc.captured_columns table, and enabled CDC. But now I am getting no data in the CDC table!
Is there some other CDC table that must be updated after columns are added to the source table? These are all the CDC tables under the System Tables folder:
cdc.captured_columns <----- where I added the new columns
cdc.change_tables
cdc.dbo_myTable_CT <------ table where change data was being captured
cdc.ddl_history
cdc.index_columns
cdc.lsn_time_mapping
dbo.systranschemas
I recommend reading Tracking Changes in Your Enterprise Database. It is very detailed and deep. Among other extremely useful bits of info, it contains the following:
DDL changes are unrestricted while change data capture is enabled.
However, they may have some effect on the change data collected if
columns are added or dropped. If a tracked column is dropped, all
further entries in the capture instance will have NULL for that
column. If a column is added, it will be ignored by the capture
instance. In other words, the shape of the capture instance is set
when it is created.
If column changes are required, it is possible to create another capture instance for a table (to a maximum of two capture instances per table) and allow consumers of the change data to migrate to the new table schema.
This is a very sensible and well-thought-out design that accounts for schema drift (not all participants can have their schema updated simultaneously in a real online deployment). A multi-staged approach (deploy the DDL, create the new CDC capture instance, upgrade the subscribers, drop the old capture instance) is the only feasible approach, and you should follow suit.
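The staged approach could look roughly like this for the table in the question. The new capture instance name dbo_myTable_v2 is a placeholder, and the sketch assumes CDC is enabled on the database and that the original capture instance still has the default name dbo_myTable.

```sql
-- Step 1: after the DDL change, create a second capture instance
-- that picks up the table's current column list.
EXEC sys.sp_cdc_enable_table
    @source_schema    = N'dbo',
    @source_name      = N'myTable',
    @role_name        = NULL,
    @capture_instance = N'dbo_myTable_v2';   -- placeholder name

-- Step 2: point consumers at cdc.dbo_myTable_v2_CT.

-- Step 3: once every consumer has migrated, drop the old capture instance.
EXEC sys.sp_cdc_disable_table
    @source_schema    = N'dbo',
    @source_name      = N'myTable',
    @capture_instance = N'dbo_myTable';
```

This avoids editing cdc.captured_columns by hand, since the change table for the new instance is created with the added columns.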

Resources