Is using a wide temporal table with only one regularly updated column efficient? - sql-server

I have been unable to pin down how temporal table histories are stored.
If you have a table with several columns of nvarchar data and one stock quantity column that is updated regularly, does SQL Server store copies of the static columns for each change made to stock quantity, or is there an object-oriented method of storing the data?
I want to include all columns in the history because it is possible there will be rare changes to the nvarchar columns, but I am wary of the history table size if millions of qty updates are duplicating the other columns.

I suggest that you use the SQL Server temporal table only for the values that need monitoring; otherwise the fixed, unchanging attribute values get duplicated with every change. SQL Server stores a whole copy of the row in the history table whenever the row is updated. See the docs:
UPDATES: On an UPDATE, the system stores the previous value of the row
in the history table and sets the value for the SysEndTime column to
the begin time of the current transaction (in the UTC time zone) based
on the system clock
You need to move your static nvarchar columns to another table and link the two tables with a relation, 1:1 or whatever is suitable.
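As a rough back-of-the-envelope sketch of why the split helps, the Python below compares the two layouts. The byte sizes and update counts are made-up assumptions for illustration only, and per-row overhead and page compression are ignored, so real numbers will differ.

```python
def history_bytes(row_bytes: int, updates: int) -> int:
    """Approximate history size: one full copy of the row per update."""
    return row_bytes * updates


# Hypothetical figures, chosen only for illustration.
KEY_BYTES = 4             # int primary key
STATIC_BYTES = 2_000      # several nvarchar columns
QTY_BYTES = 4             # int stock quantity
PERIOD_BYTES = 16         # two datetime2 period columns, 8 bytes each
QTY_UPDATES = 1_000_000   # frequent stock quantity changes
STATIC_UPDATES = 10       # rare changes to the nvarchar columns

# One wide temporal table: every update copies all columns into history.
wide = history_bytes(
    KEY_BYTES + STATIC_BYTES + QTY_BYTES + PERIOD_BYTES,
    QTY_UPDATES + STATIC_UPDATES,
)

# Two temporal tables in a 1:1 relation: quantity updates only copy
# the narrow row, and the wide row is copied only on its rare changes.
split = (
    history_bytes(KEY_BYTES + QTY_BYTES + PERIOD_BYTES, QTY_UPDATES)
    + history_bytes(KEY_BYTES + STATIC_BYTES + PERIOD_BYTES, STATIC_UPDATES)
)

print(f"wide history:  {wide:,} bytes")
print(f"split history: {split:,} bytes")
print(f"ratio: {wide / split:.1f}x")
```

The exact ratio depends entirely on the assumed sizes; the point is only that history growth scales with the width of the table that receives the frequent updates.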
Check also other relevant questions under the temporal-tables tag:
SQL Server - Temporal Table - Storage costs
SQL Server Temporal Table Creating Duplicate Records
Duplicates in temporal history table

Related

Using Dynamic SQL in a trigger to identify changes

I'm in the process of building a brand-new database, and want every table I create to have a corresponding audit table which would track any data changes.
In order to avoid having to hard-code every table column, what I would like to do is use Dynamic SQL to review each column in the table (with the exception of the Identity column) and work out whether or not the column has been changed, using the Inserted and Deleted tables to do so. By doing that, I could then theoretically add columns to the tables without having to re-create the triggers associated with the tables.
Is such a thing possible or am I running down a blind alley?

Data Versioning/Auditing in SQL Database best patterns

I have a Job table where I post the job description, posted date, qualifications etc., with the schema below:
Job(Id ##Identity PK, Description varchar (200), PostedOn DateTime, Skills Varchar(50))
Other attributes of jobs that we would like to track, such as Department, team etc., will be stored in another table as attributes of the Job:
JobAttributesList(Id ##Identity PK, AttributeName varchar(50))
JobAttributes(JobID ##Identity PK, AttributeID FK REFERENCES JobAttributesList.Id, AttributeValue varchar(50))
Now if a job description has changed, we do not want to lose the old one and hence want to keep track of versions. What are the best practices? We may have to scale later by adding more versioning tables.
One strategy would be to use a History table for every table we want to version, but that would add more and more tables as we add versioning requirements, and I feel it's schema duplication.
There is a difference between versioning and auditing. Versioning only requires that you keep the old versions of the data somewhere. Auditing typically requires that you also know who made a change.
If you want to keep the old versions in the database, do create an "old versions" table for each table you want to version, but don't create a new table for every different column change you want to audit.
I mean, you can create a new table for every column, whose only columns are audit_id, key, old_column_value, created_datetime, and it can save disk space if the original table is very wide, but it makes reconstructing the complete row for a given date and time extraordinarily expensive.
You could also keep the old data in the same table, and always do inserts, but over time that becomes a performance problem as your OLTP table gets way, way too big.
Just have a single table with all the columns of the original table, which you always insert into, which you can do inside an update, delete trigger on the original table. You can tell which columns have changed either by adding a bit flag for every column, or just determine that at select time by comparing data in one row with data in the previously audited row for the given key.
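A minimal sketch of that single-history-table approach, using Python's built-in sqlite3 module as a stand-in for SQL Server so it can run anywhere. The JobHistory table, its AuditId and AuditedAt columns, and the trigger names are hypothetical. SQLite exposes the previous row as OLD, whereas a SQL Server trigger would read it from the deleted pseudo-table. Changed columns are worked out at select time by comparing each version with the next.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Job (
    Id INTEGER PRIMARY KEY,
    Description TEXT,
    Skills TEXT
);
-- Hypothetical history table: same columns plus audit bookkeeping.
CREATE TABLE JobHistory (
    AuditId INTEGER PRIMARY KEY AUTOINCREMENT,
    Id INTEGER,
    Description TEXT,
    Skills TEXT,
    AuditedAt TEXT DEFAULT CURRENT_TIMESTAMP
);
-- Always insert the previous version of the row on update or delete.
CREATE TRIGGER Job_Update AFTER UPDATE ON Job
BEGIN
    INSERT INTO JobHistory (Id, Description, Skills)
    VALUES (OLD.Id, OLD.Description, OLD.Skills);
END;
CREATE TRIGGER Job_Delete AFTER DELETE ON Job
BEGIN
    INSERT INTO JobHistory (Id, Description, Skills)
    VALUES (OLD.Id, OLD.Description, OLD.Skills);
END;
""")

conn.execute("INSERT INTO Job VALUES (1, 'Junior DBA', 'SQL')")
conn.execute("UPDATE Job SET Description = 'Senior DBA' WHERE Id = 1")
conn.execute("UPDATE Job SET Skills = 'SQL, SSIS' WHERE Id = 1")


def changed_columns(conn, job_id):
    """For each successive pair of versions, list the columns that differ."""
    cols = ("Description", "Skills")
    old = conn.execute(
        "SELECT Description, Skills FROM JobHistory "
        "WHERE Id = ? ORDER BY AuditId",
        (job_id,),
    ).fetchall()
    current = conn.execute(
        "SELECT Description, Skills FROM Job WHERE Id = ?", (job_id,)
    ).fetchall()
    versions = old + current  # oldest first, live row (if any) last
    return [
        [c for c, a, b in zip(cols, prev, nxt) if a != b]
        for prev, nxt in zip(versions, versions[1:])
    ]


print(changed_columns(conn, 1))
```

Comparing at select time keeps the trigger trivial; the alternative of a bit flag per column moves that work to write time.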
I would absolutely not recommend creating a trigger which concatenates all of the values cast to varchar and dumps it all into a single, universal audit table with an "audited_data" column. It will be slow to write, and impossible to usefully read.
If you want to use this for actual auditing, and not just versioning, then either the user making the change must be captured in the original table so it is available to the trigger, or you need people to be connecting with specific logins, in which case you can use transport information like original_login(), or you need to set a value like context_info or session_context on the client side.

SSIS only extract Delta changes

After some advice. I'm using SSIS\SQL Server 2014. I have a nightly SSIS package that pulls in data from non-SQL Server db's into a single table (the SQL table is truncated beforehand each time) and I then extract from this table to create a daily csv file.
Going forward, I only want to extract to csv on a daily basis the records that have changed i.e. the Deltas.
What is the best approach? I was thinking of using CDC in SSIS, but as I'm truncating the SQL table before the initial load each time, will this be the best method? Or will I need to have a master table in SQL with an initial load, then import into another table and just extract the rows where they are different? For info, the table in SQL contains a Primary Key.
I just want to double check as CDC assumes the tables are all in SQL Server, whereas my data is coming from outside SQL Server first.
Thanks for any help.
The primary key on that table is your saving grace here. Obviously enough, the SQL Server database that you're pulling the disparate data into won't know from one table flush to the next which records have changed, but if you add two additional tables, and modify the existing table with an additional column, it should be able to figure it out by leveraging HASHBYTES.
For this example, I'll call the new table SentRows, but you can use a more meaningful name in practice. We'll call the new column in the old table HashValue.
Add the column HashValue to your table as a varbinary data type. NOT NULL as well.
Create your SentRows table with columns for all the columns in the main table's primary key, plus the HashValue column.
Create a RowsToSend table that's structurally identical to your main table, including the HashValue.
Modify your queries to create the HashValue by applying HASHBYTES to all of the non-key columns in the table. (This will be horribly tedious. Sorry about that.)
Send out your full data set.
Now move all of the key values and HashValues to the SentRows table. Truncate your main table.
On the next pull, compare the key values and HashValues from SentRows to the new data in the main table.
Primary key match + hash match = Unchanged row
Primary key match + hash mismatch = Updated row
Primary key in incoming data but missing from existing data set = New row
Primary key not in incoming data but in existing data set = Deleted row
Pull out any changes you need to send to the RowsToSend table.
Send the changes from RowsToSend.
Move the key values and HashValues to your SentRows table. Update hashes for changed key values, insert new rows, and decide how you're going to handle deletes, if you have to deal with deletes.
Truncate the RowsToSend table to get ready for tomorrow.
If you'd like (and you'll thank yourself later if you do), add a column to the SentRows table with a default of GETDATE(), which will tell you when the row was added.
And away you go. Nothing but deltas from now on.
Edit 2019-10-31:
Step by step (or TL;DR):
1) Flush and Fill MainTable.
2) Compare keys and hashes on MainTable to keys and hashes on SentRows to identify new/changed rows.
3) Move new/changed rows to RowsToSend.
4) Send the rows that are in RowsToSend.
5) Move all the rows from RowsToSend to SentRows.
6) Truncate RowsToSend.
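The comparison in step 2 can be sketched in Python, with hashlib standing in for HASHBYTES and plain dictionaries standing in for the tables. The function names, the sample columns (Id, Name, Qty) and the data are assumptions made up for the illustration, not part of the original package.

```python
import hashlib


def row_hash(row: dict, key_columns: tuple) -> bytes:
    """Hash all non-key columns, always in the same column order."""
    parts = []
    for name in sorted(row):
        if name in key_columns:
            continue
        # repr keeps None distinct from the string 'None'; the length
        # prefix stops ('ab', 'c') from colliding with ('a', 'bc').
        text = repr(row[name])
        parts.append(f"{len(text)}:{text}")
    return hashlib.sha256("|".join(parts).encode("utf-8")).digest()


def classify(incoming: list, sent_rows: dict, key_columns: tuple):
    """Return (new, updated, deleted_keys) relative to what was sent before."""
    new, updated, seen = [], [], set()
    for row in incoming:
        key = tuple(row[c] for c in key_columns)
        seen.add(key)
        if key not in sent_rows:
            new.append(row)                        # key only in incoming data
        elif sent_rows[key] != row_hash(row, key_columns):
            updated.append(row)                    # key match, hash mismatch
        # key match + hash match: unchanged, nothing to send
    deleted = [k for k in sent_rows if k not in seen]
    return new, updated, deleted


KEY = ("Id",)
yesterday = [
    {"Id": 1, "Name": "Ann", "Qty": 5},
    {"Id": 2, "Name": "Bob", "Qty": 7},
    {"Id": 4, "Name": "Dee", "Qty": 2},
]
# SentRows: primary key -> hash of the row as it was last sent.
sent_rows = {(r["Id"],): row_hash(r, KEY) for r in yesterday}

today = [
    {"Id": 1, "Name": "Ann", "Qty": 5},
    {"Id": 2, "Name": "Bob", "Qty": 9},
    {"Id": 3, "Name": "Cy", "Qty": 1},
]
new, updated, deleted = classify(today, sent_rows, KEY)
print(new, updated, deleted)
```

Whatever hashing is used, NULLs and column boundaries need to be encoded unambiguously before hashing, otherwise two different rows can produce the same input string.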

SQL Server - Temporal Table - Storage costs

Is there any information online where I can verify how high the storage costs are for the temporal tables feature?
Will the server create a full hard copy of the row/tuple that was modified?
Or will the server use references/links to the original values of the master table that are not modified?
For example: I have a row with 10 columns = 100 KB of storage. I change one value of that row two times. I have two rows in the history table after those changes. Is the full storage cost for the master and history tables then ~300 KB?
Thanks for every hint!
Regards
Will the server create a full hard copy of the row/tuple that was
modified? Or will the server use references/links to the original
values of the master table that are not modified?
Here is a quote from the book Pro SQL Server Internals
by Dmitri Korotkevitch that answers your question:
In a nutshell, each temporal table consists of two tables — the
current table with the current data, and a history table that stores
old versions of the rows. Every time you modify or delete data in
the current table, SQL Server adds an original version of those rows
to the history table.
A current table should always have a primary key defined. Moreover,
both current and history tables should have two datetime2 columns,
called period columns, that indicate the lifetime of the row. SQL
Server populates these columns automatically based on transaction
start time when the new versions of the rows were created. When a row
has been modified several times in one transaction, SQL Server does
not preserve uncommitted intermediary row versions in the history
table.
SQL Server places the history tables in a default filegroup, creating
non-unique clustered indexes on the two datetime2 columns that
control row lifetime. It does not create any other indexes on the
table.
In both the Enterprise and Developer Editions, history tables use
page compression by default.
So it's not
reference/links to the original values of the master table
The previous row version is simply copied as it is into the history table on every modification.
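To make the copy semantics concrete, here is a toy in-memory model in Python. It is only an illustration of the behaviour described in the quote, not how SQL Server is implemented; the class, the column names (SysStartTime, SysEndTime) and the sample row are assumptions.

```python
from datetime import datetime, timezone

MAX_END = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)


class TemporalTable:
    """Toy model: a current table plus a history table of full row copies."""

    def __init__(self):
        self.current = {}   # primary key -> current row
        self.history = []   # old row versions, oldest first

    def insert(self, key, values, now):
        self.current[key] = {**values, "SysStartTime": now, "SysEndTime": MAX_END}

    def update(self, key, changes, now):
        old = self.current[key]
        # The whole previous row goes to history, unchanged columns included;
        # only its end time is set to the time of the change.
        self.history.append({**old, "SysEndTime": now})
        self.current[key] = {**old, **changes, "SysStartTime": now}


t0 = datetime(2024, 1, 1, tzinfo=timezone.utc)
t1 = datetime(2024, 1, 2, tzinfo=timezone.utc)
t2 = datetime(2024, 1, 3, tzinfo=timezone.utc)

table = TemporalTable()
table.insert(1, {"Name": "Widget", "Description": "x" * 50, "Qty": 10}, t0)
table.update(1, {"Qty": 9}, t1)
table.update(1, {"Qty": 8}, t2)

for version in table.history:
    print(version["Name"], version["Qty"], version["SysEndTime"].date())
```

Two updates leave two full copies in the history table, so together with the current row there are three row versions. For the 100 KB example in the question that is roughly 300 KB before any page compression is taken into account.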

How to update the origin table in the CDC workflow (via SSIS)?

I have a CDC process set up, whereby TableA's additional rows (or updates) are automatically picked up by an ETL and put into TableB:
TableA >>CDC>> TableB
The CDC works fine, except I want to update the first table once the CDC process is finished, by populating it with the "extraction date". So my TableA has, let's say: Name, Age, OtherInfo, ExtractionDate. CDC is set up on the Name, Age and OtherInfo columns (the ExtractionDate column is excluded for obvious reasons).
Then, once CDC is performed on TableA and the data is taken to TableB, I'd like to populate TableA's ExtractionDate with the current date. However, given I do not know which rows are being moved, I am having difficulty populating the column. Specifically, how can I write a "selective" WHERE clause to select the "changed" rows, when that's only known to SSIS?
In the TableA database there are system tables that were created as part of enabling CDC. You should be able to easily find the change table associated with TableA. This is where SQL Server keeps track of all the changes.
__$start_lsn is the log sequence number (LSN) of the transaction that made the change, and your SSIS imports use this value to bring across a range of changes. The cdc.lsn_time_mapping table lets you look up the time that corresponds to an LSN, so it is easier to understand.
In my processing I store the start and end lsn values so I know what was brought across with each SSIS run. I could then use these lsn values to go back to this CDC source table and see all the changes that MSSQL has tracked during that time-span.
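A minimal Python sketch of that bookkeeping, with an in-memory list standing in for the CDC change table. The Change class, the lsn helper, the run_log dictionary and the use of Name as the row key are all assumptions for illustration; real LSNs are binary(10) values read from the change table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    start_lsn: bytes   # stands in for __$start_lsn, a binary(10) value
    name: str          # key of the changed TableA row (hypothetical)


def lsn(n: int) -> bytes:
    """Build a fake 10-byte LSN; big-endian so bytes sort like numbers."""
    return n.to_bytes(10, "big")


def keys_changed_between(changes, from_lsn: bytes, to_lsn: bytes) -> set:
    """Keys of rows with at least one change in the inclusive LSN range."""
    return {c.name for c in changes if from_lsn <= c.start_lsn <= to_lsn}


# Stand-in for the change table of TableA.
change_table = [
    Change(lsn(100), "Ann"),
    Change(lsn(105), "Bob"),
    Change(lsn(105), "Ann"),
    Change(lsn(120), "Cy"),    # happened after this run's end LSN
]

# LSN range stored for one SSIS run.
run_log = {"from_lsn": lsn(100), "to_lsn": lsn(110)}

extracted = keys_changed_between(
    change_table, run_log["from_lsn"], run_log["to_lsn"]
)
print(sorted(extracted))
```

The resulting set of keys is what a selective WHERE clause on TableA would filter on when stamping the extraction date.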
Keep in mind that the CDC system tables are automatically cleaned out every few days - so you wouldn't be able to apply this logic historically - only for recent imports.
