Keeping update changes for each column - sql-server

I've been asked by my client to manage update history for each column/field in one of our SQL Server tables. I've managed "batch" versions of entire rows before, but haven't done this type of backup/history. They want to be able to keep track of changes to each column/field in a row of a table. I could use some help with the most efficient way to track this. Thanks!

The answer is triggers and history tables, but the best practice depends on what data your database holds and how often and by how much it is modified.
Essentially, every time a record in a table is updated, an update trigger (attached to the table) is told what the old record looked like and what the new record will look like. You can then write the change history as new records in another table (e.g. tblSomething_History). Note: if updates to your tables are done via stored procs, you could write the history from there, but the problem with this is that if anything else (another stored procedure, ad-hoc SQL) updates the table, that history won't be written.
Depending on the number of fields/tables you want history for, you may do as suggested by #M.Al and embed your history directly into the base table through versioning, you may create a history table for each individual table, or you may create a generic history table such as:
| TblName | FieldName | NewValue | OldValue | User | Date Time |
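For illustration, a per-column update trigger feeding a generic table like that might look roughly as follows in T-SQL. The table tblCustomer, its columns and the history column names are invented for the example, and only two audited columns are shown; in practice you would repeat (or generate) the block for every column you care about.

CREATE TRIGGER trg_tblCustomer_Update
ON tblCustomer
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- One INSERT per audited column, written only when the value actually changed
    -- (ISNULL makes NULL and '' compare as equal, purely to keep the example short).
    INSERT INTO tblChangeHistory (TblName, FieldName, NewValue, OldValue, ChangedBy, ChangedAt)
    SELECT 'tblCustomer', 'FirstName', i.FirstName, d.FirstName, SUSER_SNAME(), GETDATE()
    FROM inserted i
    JOIN deleted d ON d.CustomerID = i.CustomerID
    WHERE ISNULL(i.FirstName, '') <> ISNULL(d.FirstName, '');

    INSERT INTO tblChangeHistory (TblName, FieldName, NewValue, OldValue, ChangedBy, ChangedAt)
    SELECT 'tblCustomer', 'LastName', i.LastName, d.LastName, SUSER_SNAME(), GETDATE()
    FROM inserted i
    JOIN deleted d ON d.CustomerID = i.CustomerID
    WHERE ISNULL(i.LastName, '') <> ISNULL(d.LastName, '');
END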
Getting the modified time is easy, but determining which user changed what depends on your security setup. Keeping the history in a separate table means less impact on retrieving the current data: because it is already separated out, you do not need to filter it. But if you need to show history most of the time, this probably won't have the same benefit.
Unfortunately you cannot add a single trigger to all tables in a database, you need to create a separate trigger for each, but they can then call a single stored procedure to do the guts of the actual work.
Another word of warning: automatically loading all the history associated with a table can dramatically increase the load required. Depending on the type of data stored in your tables, the history can end up many times larger than the base table. I have encountered, and been affected by, numerous applications that became unusable because the history tables were needlessly loaded whenever the base table was; with a change history running into the hundreds of entries per item, the load time increased by a similar factor.
Final note: this is a strategy that is easy to absorb if built into your application from the ground up, but be careful bolting it onto an existing solution, as it can have a dramatic impact on performance if not tailored to your requirements, and can cost more than the client would expect it to.

I worked on a similar database not long ago, where no row is ever deleted: every time a record was updated, a new row was actually added and assigned a version number. Something like this:
Each table has two extra columns:
Original_ID | Version_ID
Each time a new record is added it is assigned a sequential Version_ID (a unique column) and its Original_ID column remains NULL. Every subsequent change to that row actually inserts a new row into the table with the next Version_ID, and the Version_ID that was assigned when the record was first created is stored in Original_ID.
If you have situations where records need deleting, use a BIT column Deleted and set its value to 1 when a record is supposed to be deleted. This is called soft deletion.
Also add a column such as LastUpdate (datetime) to keep track of when each change was made.
This way you end up with all versions of a row, from when it is first inserted until it is (soft) deleted.
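As a rough sketch of that layout (the table name and the Name column are made up for the example), the table and the "latest live version" query could look like this:

CREATE TABLE tblEmployee (
    Version_ID  INT IDENTITY(1,1) PRIMARY KEY,   -- unique, sequential version number
    Original_ID INT NULL,                        -- NULL on the first version, then the first Version_ID
    Name        NVARCHAR(100) NOT NULL,
    Deleted     BIT NOT NULL DEFAULT 0,          -- soft-delete flag
    LastUpdate  DATETIME NOT NULL DEFAULT GETDATE()
);

-- Latest live version of each logical record (the logical key is Original_ID,
-- falling back to Version_ID for the very first version):
SELECT e.*
FROM tblEmployee e
WHERE e.Deleted = 0
  AND e.Version_ID = (SELECT MAX(v.Version_ID)
                      FROM tblEmployee v
                      WHERE ISNULL(v.Original_ID, v.Version_ID) = ISNULL(e.Original_ID, e.Version_ID));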

Related

Concept of "version control" for database table rows (Not referring to storing scripts in GIT/SVN)

I require a data store that will maintain not only a history of changes made to data (easy to do) but also store any number of proposed changes to data, including chained proposals (ie. proposal-on-proposal).
Think of these "changes" as really long-running transactions which are saved to the database and have a lifespan of anywhere between minutes and years.
They are created (proposed) and then either rolled back (essentially deleted) or committed, when committed they become the effective data visible to 3rd parties.
Of course this all requires some form of conflict resolution as proposed changes can be in contradictory states (eg. Change A proposes to delete a record but change B proposes to update it - if change A is committed first then change B will have to revert)
I have found no off-the-shelf product that can do this. The closest was Oracle Workspace Manager but it did not provide for change-on-change or the ability to see proposed deletes. The only way I have been able to achieve this is to have a set of common columns on my versioned tables:
Root ID: Required - set once to the same value as the primary key when the first version of a record is created. This represents the primary key across all of time and is copied into each version of the record. You should consider the Root ID when naming relation columns (eg. PARENT_ROOT_ID instead of PARENT_ID). As the Root ID is also the primary key of the initial version, foreign keys can be created against the actual primary key - the actual desired row will be determined by the version filters defined below.
Change ID: Required - every record is created, updated, deleted via a change
Copied From ID: Nullable - null indicates newly created record, not-null indicates which record ID this row was cloned/branched from when updated/deleted
Effective From Date/Time: Nullable - null indicates a proposed record, not-null indicates when the record became current. Unfortunately a unique index cannot be placed on Root ID/Effective From, as there can be multiple rows with a null Effective From for the same Root ID. (Unless you want to restrict yourself to a single proposed change per record)
Effective To Date/Time: Nullable - null indicates current or proposed, not-null indicates when it became historical. Not technically required but helps speed up queries finding the current data. This field could be corrupted by hand-edits but can be rebuilt from the Effective From Date/Time if this occurs.
Delete Flag: Boolean - set to true when it is proposed that the record be deleted upon becoming current. When deletes are committed, their Effective To Date/Time is set to the same value as the Effective From Date/Time, filtering them out of the current data set.
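To make that concrete, a versioned table carrying these columns might be declared along these lines; the DDL is only illustrative, since the original design does not give exact names or types:

CREATE TABLE versioned_item (
    id             INT           NOT NULL PRIMARY KEY,
    root_id        INT           NOT NULL,   -- equals id on the first version, copied to every later version
    change_id      INT           NOT NULL,   -- the change that created this version
    copied_from_id INT           NULL,       -- NULL = newly created, else the version this was cloned from
    effective_from DATETIME      NULL,       -- NULL = still only proposed
    effective_to   DATETIME      NULL,       -- NULL = current or proposed
    delete_flag    BIT           NOT NULL DEFAULT 0,
    -- ...the actual business columns follow...
    name           NVARCHAR(100) NOT NULL
);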
The query to get the current state of data at a point in time would be;
SELECT * FROM table WHERE EFFECTIVE_FROM <= :Now AND (EFFECTIVE_TO IS NULL OR EFFECTIVE_TO > :Now)
The query to get the current state of data according to a change would be;
SELECT * FROM table
WHERE CHANGE_ID IN :ChangeIds
   OR (EFFECTIVE_FROM <= :Now
       AND (EFFECTIVE_TO IS NULL OR EFFECTIVE_TO > :Now)
       AND ROOT_ID NOT IN (SELECT ROOT_ID FROM table WHERE CHANGE_ID IN :ChangeIds))
Note that this 2nd query contains the 1st time-based query to overlay the current data with the proposed changed data.
The change ID column refers to the primary key of a change table which also contains a parent ID column (nullable) providing the change-on-change functionality. Hence the 2nd query refers to change IDs not a single change ID. I am filtering multiple versions in a change-on-change scenario in the client and not using SQL so it's not seen in those queries (The client has a linked list of change IDs in memory and if more than 1 version of a row is retrieved it uses the linked list to determine which version to use).
Does anybody know of an off-the-shelf product that I could use? It is a large amount of work handling this versioning myself and introduces all manner of issues.
There does not appear to be any off-the-shelf database or database plugin that does what I need. So I ended up utilising Oracle features to implement a solution.
The final table structure is slightly different - "Delete Flag" turned into "Change Action" which is either Add, Remove or Modify.
A global temporary table was used to store the current connection's change identifier/date-time settings, and a stored procedure was created to populate it after connecting. This is referred to as the 'context'.
Views joining versioned tables to this temporary, connection-specific context table are created programmatically for every versioned table, including instead-of insert/update/delete triggers which perform the required data versioning.
The result is that you treat the versioned tables like normal tables (and don't use the suffix _ROOT_ID for foreign keys) for select, insert, update and delete.
Only the Change Action is returned in the views and this is the only field that distinguishes a versioned table from a normal one.
Revert (which doesn't have a SQL keyword) is achieved by a double-delete. That is, if we update a record and then want to undo that update, we issue a delete command which deletes the proposed row and the record reverts to the current version. It's the most fitting SQL keyword - the alternative is to make a specific revert stored procedure.
A virtual Change Action of None exists in the views which indicates the record is not affected by the current context.
This all works quite effectively making the concept of versioning largely transparent, the only custom action required is setting the connection after connecting to the database.
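A very rough sketch of that arrangement, in Oracle syntax, might look like the following; every identifier here is invented, and the real views and triggers were generated programmatically per table:

-- Connection-specific context: which change the session is working in and the point in time it views.
CREATE GLOBAL TEMPORARY TABLE session_context (
    change_id  NUMBER,
    as_of_time TIMESTAMP
) ON COMMIT PRESERVE ROWS;

-- View overlaying the session's proposed rows on top of the current data,
-- exposing a virtual Change Action of 'None' for rows untouched by the context.
CREATE OR REPLACE VIEW item_v AS
SELECT t.root_id, t.name,
       CASE WHEN t.change_id = c.change_id THEN t.change_action ELSE 'None' END AS change_action
FROM   versioned_item t
CROSS JOIN session_context c
WHERE  t.change_id = c.change_id
   OR (t.effective_from <= c.as_of_time
       AND (t.effective_to IS NULL OR t.effective_to > c.as_of_time)
       AND t.root_id NOT IN (SELECT root_id
                             FROM   versioned_item
                             WHERE  change_id = c.change_id));

-- INSTEAD OF INSERT/UPDATE/DELETE triggers on item_v then write the appropriate
-- versioned rows, so callers can treat item_v like an ordinary table.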

What are the pros and cons of having history tables?

What is the better practice:
1. Keep history records in a separate history table, or
2. Keep history records in the active table with a different status?
In my opinion I'd rather keep a separate table, to avoid creating one huge table with duplicate records, which may cause unwanted lag when querying the table.
My preference has always been to have a separate table with history, purely because it removes the need for a "WHERE Status = 'LIVE'" or "WHERE CurrentRecord = 1" to get the latest record (I won't get into one design that required an inline query to get max(version) to get the latest). It also means the current records table stays smaller and access times may be improved, etc. In the worst case, I've seen an ad-hoc query against a table pick up the wrong version of a record, causing all sorts of problems later on.
Also, if you are already getting the history from another table, you could shard the data, so all history from one year is in one table/db and all history from another is in another table/db and so on.
Pro:
If you keep the history in a separate table then that data will be accessed only when you need to search for something from the past. Most of the time the main table will be used far more than the historical one, which means faster results.
Con:
In a project I worked on, I had one table with 350 columns (don't ask why...). The table became very large as data was entered. At a specific moment, records went from 'active' to 'closed' status. I was tempted to move all the closed records to a new (historical) table, but I realized that it was slower - in a lot of queries I had to use unions.
As a final opinion, I think it depends on the case, but I would always consider the separate table first.
I prefer to use one table and partitioning. I also would set up a view for the active records and use that instead of the base table when querying active records.
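For example (table and status names are placeholders), the active-records view can be as simple as:

CREATE VIEW dbo.ActiveOrders
AS
SELECT OrderId, CustomerId, Status, CreatedAt   -- list the real columns of the base table here
FROM   dbo.Orders
WHERE  Status = 'LIVE';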
I would go for a separate table; otherwise setting up UNIQUE and FK constraints may still be doable, but too involved.

Where should I break up my user records to keep track of revisions

I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but that I will need to be able to query for out of date values. So if Jenny Smith changes her name to Jenny James I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.
A simple way of doing this is to add a deleted flag: instead of updating records, you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active records, query with 'where deleted = 0'; the speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.
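A minimal sketch of that pattern in T-SQL, with invented table and column names, might be:

CREATE TABLE Staff (
    RecordId  INT IDENTITY(1,1) PRIMARY KEY,         -- surrogate key, one per revision
    StaffId   INT          NOT NULL,                 -- stable person id shared by all revisions
    LastName  NVARCHAR(50) NOT NULL,
    Revision  INT          NOT NULL,
    Deleted   BIT          NOT NULL DEFAULT 0,
    UpdatedAt DATETIME     NOT NULL DEFAULT GETDATE(),
    UpdatedBy SYSNAME      NOT NULL DEFAULT SUSER_SNAME()
);

-- "Updating" staff member 42's last name: retire the current row, insert a new revision.
-- (A real table would copy all the other columns across as well.)
BEGIN TRANSACTION;

UPDATE Staff SET Deleted = 1
WHERE StaffId = 42 AND Deleted = 0;

INSERT INTO Staff (StaffId, LastName, Revision)
SELECT StaffId, 'James', Revision + 1
FROM Staff
WHERE StaffId = 42
  AND Revision = (SELECT MAX(Revision) FROM Staff WHERE StaffId = 42);

COMMIT TRANSACTION;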
I would use the separate table because then you can have a unique identifier that points to all the other child records and is also the PK of the table, which I think makes data integrity issues less likely. For instance, you have Mary Jones, who has records in the address table, the email table, the performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have a non-autogenerated person ID and an autogenerated record ID.
You also have the possibility of people forgetting to use the where deleted = 0 clause that is needed for almost every query. (If you do use the deleted flag field, do yourself a favor and set up a view with where deleted = 0, and require developers to use the view in queries rather than the original table.)
With the deleted flag field you will also need a trigger to ensure one and only one record is marked as active.
#Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!

trigger insertions into same table

I have many interrelated tables in my database. I have a table (table one) into which data has been inserted and whose ID auto-increments. Once that row has an ID, I want to insert it into another table (table three) along with another set of IDs which comes from a form (this data will also be going into a table, so it could come from that table) - the same form that the data which went into the first table came from.
The two ID's together make the primary key of the third table.
How can I do this? It's to show that more than one ID is joined to a single ID for something else.
Thanks.
You can't do that through a trigger, as the trigger only has available to it the data that you already inserted, not data that is currently only residing in your user interface.
Normally how you handle this situation is that you write a stored proc that inserts the record and returns the id value (using scope_identity() in SQL Server, but I'm sure other databases have a method to return the auto-generated id as well). Then you use that value to insert into the other table, along with the other values you need for that table. You would of course want to wrap the whole thing in a transaction.
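A sketch of such a stored proc (object and parameter names are invented; THROW assumes SQL Server 2012 or later):

CREATE PROCEDURE dbo.InsertItemWithLink
    @Name    NVARCHAR(100),
    @OtherId INT
AS
BEGIN
    SET NOCOUNT ON;
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Insert into "table one"; its ID column is an IDENTITY.
        INSERT INTO dbo.TableOne (Name) VALUES (@Name);

        -- Capture the auto-generated id from this scope.
        DECLARE @NewId INT = SCOPE_IDENTITY();

        -- Use the new id together with the id supplied by the form
        -- to build the composite key row in "table three".
        INSERT INTO dbo.TableThree (TableOneId, OtherId) VALUES (@NewId, @OtherId);

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
        THROW;
    END CATCH
END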
I think you can probably do what you're describing (just write the INSERTs to table 3 in the table 1 trigger), but you'll have to put the additional info for the table 3 rows into your table 1 row, which isn't very smart.
I can't see why you would do that instead of writing the INSERTs in your code, where someone reading it can see what's happening.
The trouble with triggers is that they make it easy to hide business logic in the database. I think (and I believe I'm in the majority here) that it's easier to understand, manage, maintain and generally all-round deal with an application where all the business rules exist in the same general area.
There are reasons to use triggers (for propagating denormalised values, for example), just as there are reasons for using stored procedures. I'm going to assert that they are largely related to performance-critical areas. Or should be.

Maintaining audit log for entities split across multiple tables

We have an entity split across 5 different tables. Records in 3 of those tables are mandatory. Records in the other two tables are optional (based on sub-type of entity).
One of the tables is designated the entity master. Records in the other four tables are keyed by the unique id from master.
An AFTER UPDATE/DELETE trigger is present on each table, and a change to a record saves off its history (from the deleted table inside the trigger) into a related history table. Each history table contains the related entity fields + a timestamp.
So, live records are always in the live tables and history/changes are in history tables. Historical records can be ordered based on the timestamp column. Obviously, timestamp columns are not related across history tables.
Now, for the more difficult part.
1. Records are initially inserted in a single transaction. Either 3 or 5 records will be written in a single transaction.
2. Individual updates can happen to any or all of the 5 tables.
3. All records are updated as part of a single transaction. Again, either 3 or 5 records will be updated in a single transaction.
4. Number 2 can be repeated multiple times.
5. Number 3 can be repeated multiple times.
The application is supposed to display a list of point-in-time history entries based on records written as single transactions only (points 1, 3 and 5 only).
I'm currently having problems with an algorithm that will retrieve historical records based on timestamp data alone.
Adding a HISTORYMASTER table to hold the extra information about transactions seems to partially address the problem. A new record is added into HISTORYMASTER before every transaction. New HISTORYMASTER.ID is saved into each entity table during a transaction.
Point in time history can be retrieved by selecting the first record for a particular HISTORYMASTER.ID (ordered by timestamp)
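In outline, with invented names, the HISTORYMASTER approach looks something like this:

-- One row per transaction; every history row written during that transaction carries its id,
-- so the related entity history rows can be joined back to a single point-in-time entry.
CREATE TABLE HistoryMaster (
    HistoryMasterId INT IDENTITY(1,1) PRIMARY KEY,
    TransactionTime DATETIME NOT NULL DEFAULT GETDATE(),
    ChangedBy       SYSNAME  NOT NULL DEFAULT SUSER_SNAME()
);

-- The list of point-in-time history entries is then just the master rows in order:
SELECT HistoryMasterId, TransactionTime, ChangedBy
FROM   HistoryMaster
ORDER BY TransactionTime;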
Is there any more optimal way to manage audit tables based on AFTER (UPDATE, DELETE) TRIGGERs for entities spanning multiple tables?
Your HistoryMaster seems similar to how we have addressed the history of multiple related items in one of our systems. By having a single point to hang all the related changes from in the history table, it is easy to then create a view that uses the history master as the hub and attaches the related information. It also allows you to not create records in the history where an audit is not desired.
In our case the primary tables were called EntityAudit (where Entity was the "primary" item being retained) and all data was stored in EntityHistory tables related back to the audit. In our case we were using a data layer for business rules, so it was easy to insert the audit rules into the data layer itself. I feel that the data layer is an optimal point for such tracking if and only if all modifications use that data layer. If you have multiple applications using distinct data layers (or none at all), then I suspect that a trigger that creates the master record is pretty much the only way to go.
If you don't have additional information to track in the audit (we track the user who made the change, for example, something not on the main tables), then I would contemplate putting the extra Audit ID on the "primary" record itself. Your description does not seem to indicate you are interested in the minor changes to individual tables, but only in changes that update the entire entity set (although I may be misreading that). I would only do so if you don't care about the minor edits though. In our case, we needed to track all changes, even to the related records.
Note that the use of an Audit/Master table has an advantage in that you are making minimal changes to the history tables as compared to the source tables: a single AuditID (in our case a Guid, although autonumbers would be fine in non-distributed databases).
Can you add a TimeStamp / RowVersion datatype column to the entity master table, and associate all the audit records with that?
But an Update to any of the "child" tables will need to update the Master entity table to force the TimeStamp / RowVersion to change :(
Or stick a GUID in there that you freshen whenever one of the associated records changes.
Thinking that through, out loud, it may be better to have a table joined 1:1 to the Master Entity that only contains the Master Entity ID and the "version number" of the record - either TimeStamp / RowVersion, GUID, incremented number, or something else.
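Something like this, for example (names are placeholders; a rowversion column advances automatically whenever the row is updated):

CREATE TABLE EntityMasterVersion (
    EntityMasterId INT NOT NULL PRIMARY KEY,  -- 1:1 with the entity master table
    RowVer         ROWVERSION                 -- bumped automatically on every update of this row
);

-- A trigger on any of the child tables can then force the version to move on:
UPDATE EntityMasterVersion
SET    EntityMasterId = EntityMasterId        -- no-op assignment, but it still advances RowVer
WHERE  EntityMasterId = 42;                   -- the master id of the entity that changed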
I think it's a symptom of trying to capture "abstract" audit events at the lowest level of your application stack - the database.
If it's possible consider trapping the audit events in your business layer. This would allow you to capture the history per logical transaction rather than on a row-by-row basis. The date/time is unreliable for resolving things like this as it can be different for different rows, and the same for concurrent (or closely spaced) transactions.
I understand that you've asked how to do this in DB triggers though. I don't know about SQL Server, but in Oracle you can overcome this by using the DBMS_TRANSACTION.LOCAL_TRANSACTION_ID system package to return the ID for the current transaction. If you can retrieve an equivalent SQL Server value, then you can use this to tie the record updates for the current transaction together into a logical package.
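For what it's worth, SQL Server does expose a transaction identifier that could play the same role; for example:

-- Via the DMV:
SELECT transaction_id FROM sys.dm_tran_current_transaction;

-- Or, on SQL Server 2016 and later:
SELECT CURRENT_TRANSACTION_ID();

-- Stamping that value onto every history row written by the audit triggers lets all rows
-- from the same transaction be grouped back into one logical history entry.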
