Where should I break up my user records to keep track of revisions?

I am putting together a staff database and I need to be able to revise staff member information while keeping track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but for which I will need to be able to query out-of-date values. So if Jenny Smith changes her name to Jenny James, I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least two tables: one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or on the type of data? I am looking at about 40 fields per record, and only one or two fields will probably change per update. Also, I cannot remove any data from the database; I need to be able to look back on all previous records.

A simple way of doing this is to add a deleted flag: instead of updating records, you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table if you prefer, but if changes are infrequent and the table is not big, I would not bother.
To get the active record, query with 'where deleted = 0'; the speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields, like a revision number, when the record was last updated, and who updated it. The revision number is very useful for getting at previous versions and also for optimistic locking. The 'who updated this last and when' questions usually come up once the system is running rather than during requirements gathering, and these are useful fields to put in any table containing 'master' data.
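A minimal sketch of that shape in T-SQL (the table and column names here are invented for illustration, not taken from the question):

CREATE TABLE staff (
    record_id   INT IDENTITY PRIMARY KEY,   -- one row per revision
    person_id   INT NOT NULL,               -- stable id for the person
    last_name   VARCHAR(100) NOT NULL,
    -- ...the other ~40 fields...
    revision    INT NOT NULL,               -- increments on each update
    deleted     BIT NOT NULL DEFAULT 0,     -- 1 = superseded revision
    updated_at  DATETIME NOT NULL DEFAULT GETDATE(),
    updated_by  VARCHAR(50) NOT NULL
);

-- An "update" retires the current row and inserts the next revision:
UPDATE staff SET deleted = 1 WHERE person_id = 42 AND deleted = 0;
INSERT INTO staff (person_id, last_name, revision, updated_by)
VALUES (42, 'James', 2, 'hr_user');

-- Current data only:
SELECT * FROM staff WHERE deleted = 0;

-- Old names still resolve to the person:
SELECT DISTINCT person_id FROM staff WHERE last_name = 'Smith';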

I would use the separate table, because then you can have a unique identifier that points to all the other child records and that is also the PK of the table, which I think makes it less likely you will have data integrity issues. For instance, you have Mary Jones, who has records in the address table, the email table, the performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have a non-autogenerated person id and an autogenerated record id.
You also have the possibility of people forgetting the where deleted = 0 clause that is needed in almost every query. (If you do use the deleted flag field, do yourself a favor: set up a view with where deleted = 0 and require developers to use the view in queries, not the original table.)
With the deleted flag field you will also need a trigger to ensure that one and only one record is marked as active.
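Continuing the hypothetical staff table from the previous answer, the view and the single-active-row trigger might look like this (a sketch, not a drop-in implementation):

CREATE VIEW active_staff AS
SELECT * FROM staff WHERE deleted = 0;

-- Fail any statement that leaves a person with more than one active row:
CREATE TRIGGER trg_staff_one_active ON staff
AFTER INSERT, UPDATE AS
BEGIN
    IF EXISTS (
        SELECT 1 FROM staff s
        WHERE s.deleted = 0
          AND s.person_id IN (SELECT person_id FROM inserted)
        GROUP BY s.person_id
        HAVING COUNT(*) > 1)
    BEGIN
        RAISERROR('More than one active revision for a person.', 16, 1);
        ROLLBACK TRANSACTION;
    END
END;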

@Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate obsolete_employee table and store only the historical information that will need to be searched in the future. That way you can keep your real employee data table clean, keeping only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!
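A rough sketch of the split (obsolete_employee is the name suggested above; its columns are assumptions about what needs to stay searchable):

-- Only the old, searchable values move out of the clean employee table:
CREATE TABLE obsolete_employee (
    employee_id INT NOT NULL,           -- points back at employee
    last_name   VARCHAR(100) NOT NULL,  -- the superseded value
    valid_until DATETIME NOT NULL       -- when this value was replaced
);

-- Find someone's current record via a former name:
SELECT e.*
FROM employee e
JOIN obsolete_employee o ON o.employee_id = e.employee_id
WHERE o.last_name = 'Smith';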

Related

Keeping update changes for each column

I've been asked by my client to manage update history for each column/field in one of our SQL Server tables. I've managed "batch" versions of entire rows before, but haven't done this type of backup/history before. They want to be able to keep track of changes to each column/field in a row of a table. I could use some help with the most efficient way to track this. Thanks!
The answer is triggers and history tables, but the best practice depends on what data your database holds and how often and by how much it is modified.
Essentially, every time a record in a table is updated, an update trigger (attached to the table) gets notified of what the old record looked like and what the new record will look like. You can then write the change history as new records in another table (e.g. tblSomething_History). Note: if updates to your tables are done via stored procs, you could write the history from there, but the problem is that if anything else updates the table directly, the history won't be written.
Depending on the number of fields/tables you want history for, you may do as suggested by @M.Al, you may embed the history directly into the base table through versioning, you may create a history table for each individual table, or you may create a generic history table such as:
| TblName | FieldName | NewValue | OldValue | User | Date Time |
Getting the modified time is easy, but whether you can determine which user changed what depends on your security setup. Keeping the history in a separate table means less impact on retrieving the current data: since the history is already separated out, you do not need to filter it out. But if you need to show history most of the time, this probably won't have the same effect.
Unfortunately you cannot add a single trigger to all tables in a database; you need to create a separate trigger for each, but they can all call a single stored procedure to do the guts of the actual work.
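As a sketch, one such per-table trigger feeding the generic layout above could look like this (Employee and its columns are invented; inserted and deleted are SQL Server's pseudo-tables):

CREATE TABLE tblHistory (
    TblName    SYSNAME       NOT NULL,
    FieldName  SYSNAME       NOT NULL,
    OldValue   NVARCHAR(MAX) NULL,
    NewValue   NVARCHAR(MAX) NULL,
    [User]     SYSNAME       NOT NULL DEFAULT SUSER_SNAME(),
    [DateTime] DATETIME      NOT NULL DEFAULT GETDATE()
);

CREATE TRIGGER trg_Employee_History ON Employee
AFTER UPDATE AS
BEGIN
    -- One history row per changed field; repeat the block per column:
    INSERT INTO tblHistory (TblName, FieldName, OldValue, NewValue)
    SELECT 'Employee', 'LastName', d.LastName, i.LastName
    FROM inserted i
    JOIN deleted d ON d.EmployeeID = i.EmployeeID
    WHERE ISNULL(d.LastName, '') <> ISNULL(i.LastName, '');
END;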
Another word of warning: automatically loading all the history associated with a table can dramatically increase load time, and depending on the type of data stored in your tables, the history may become substantially larger than the base table. I have encountered, and been affected by, numerous applications that became unusable because the history tables were needlessly loaded along with the base table; given that the change history for a table can run into hundreds of rows per item, that's how much the load time increased.
Final note: this is a strategy that is easy to absorb if built into your application from the ground up, but be careful bolting it onto an existing solution, as it can have a dramatic impact on performance if not tailored to your requirements, and it can cost more than the client would expect it to.
I worked on a similar database not long ago, where no row was ever deleted; every time a record was updated, a new row was actually added and assigned a version number. Something like the following.
Each table will have two columns:
Original_ID | Version_ID
Each time a new record is added, it gets assigned a sequential Version_ID, which is a unique column, and its Original_ID column remains NULL. Every subsequent change to the row actually inserts a new row into the table with a higher Version_ID, and the Version_ID that was assigned when the record was first created is assigned to Original_ID.
If you have situations where records need deleting, use a BIT column Deleted and set its value to 1 when a record is supposed to be deleted; this is called soft deletion.
Also add a column like LastUpdate (datetime) to keep track of the time each change was made.
This way you end up with all versions of a row, starting from when the row was inserted until it is (soft) deleted.
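In outline (staff_version is a hypothetical table, T-SQL assumed):

CREATE TABLE staff_version (
    Version_ID  INT IDENTITY UNIQUE,      -- sequential, unique per row
    Original_ID INT NULL,                 -- NULL on the first version
    LastName    VARCHAR(100) NOT NULL,
    Deleted     BIT NOT NULL DEFAULT 0,   -- soft deletion
    LastUpdate  DATETIME NOT NULL DEFAULT GETDATE()
);

-- First version: Original_ID stays NULL (gets Version_ID 1 here):
INSERT INTO staff_version (LastName) VALUES ('Smith');

-- Later change: a new row carries the first Version_ID as Original_ID:
INSERT INTO staff_version (Original_ID, LastName) VALUES (1, 'James');

-- All versions of the record, oldest first:
SELECT * FROM staff_version
WHERE Version_ID = 1 OR Original_ID = 1
ORDER BY Version_ID;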

What are the pros and cons of having history tables?

Which is the better practice:
Keep history records in a separate history table, or
keep history records in the active table with a different status?
In my opinion, I would rather keep a separate table, to avoid creating one huge table with duplicate records, which may cause unwanted lag when querying.
My preference has always been a separate history table, purely because it removes the need for a "WHERE Status = 'LIVE'" or "WHERE CurrentRecord = 1" clause to get the latest record (I won't get into one design that required an inline query on max(version) to get the latest). It also means the current-records table stays smaller, so access times may improve, and so on. In the worst case, I've seen an ad-hoc query against such a table pick up the wrong version of a record, causing all sorts of problems later on.
Also, if you are already getting the history from another table, you can shard the data, so all history from one year is in one table/db, all history from another year is in another, and so on.
Pro:
If you keep the history in a separate table, that data will be accessed only when you need to search for something from the past. Most of the time the main table will be used far more than the historical one, so this means faster results.
Con:
In one project I worked on, I had a table with 350 columns (don't ask why...). The table became very large as data was entered, and at a specific moment records went from 'active' to 'closed' status. I was tempted to move all the closed records to a new (historical) table, but I realized that it was slower: in a lot of queries I had to make unions...
As a final opinion, I think it depends on the case, but I would always consider the separate table first.
I prefer to use one table and partitioning. I would also set up a view for the active records and use that instead of the base table when querying active records.
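Sketched in T-SQL (records, CurrentRecord, and DateCreated are placeholders; the table itself would then be created ON the partition scheme):

-- Historical rows land in their own partitions, split by date:
CREATE PARTITION FUNCTION pf_records_year (DATETIME)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2010-01-01');

CREATE PARTITION SCHEME ps_records_year
AS PARTITION pf_records_year ALL TO ([PRIMARY]);

-- Active-record queries go through a view, never the base table:
CREATE VIEW v_active_records AS
SELECT * FROM records WHERE CurrentRecord = 1;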
I would go for a separate table; otherwise setting up UNIQUE and FK constraints may still be doable, but too involved.

Are created and modified the two fields every database table should have?

I recently realized that I add some form of row-creation timestamp, and possibly an "updated on" field, to most of my tables. Suddenly I started thinking that perhaps every table in the database should have created and modified fields that are set in the model behind the scenes.
Does this sound correct? Are there any types of high-load tables (like sessions) or massive tables for which this wouldn't be a good idea?
I wouldn't put those fields (which I generally call audit fields) on every database table. If it's a low-traffic, high-value table (like Users, for instance), it goes on, no question. I'd also add creator and modifier. If it's a table that gets hit a lot (an operation history table, say), then maybe the benefit isn't worth the cost of increased insert time and storage space.
It's a call you'll need to make separately for each table.
Obviously, there isn't a single rule.
Most of my tables have date-related fields: DateCreated, DateModified, and occasionally a Revision to track changes, and so on. Do whatever makes sense. Clearly, you can invent cases where they're appropriate and cases where they're not. If you're asking whether you should add them "by default" to most tables, I'd say "probably".
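One common shape (Widget is a made-up table, and maintaining DateModified in a trigger is just one option; doing it in the model layer works too):

CREATE TABLE Widget (
    WidgetID     INT IDENTITY PRIMARY KEY,
    Name         VARCHAR(100) NOT NULL,
    DateCreated  DATETIME NOT NULL DEFAULT GETDATE(),
    DateModified DATETIME NOT NULL DEFAULT GETDATE(),
    Revision     INT NOT NULL DEFAULT 1
);

-- Keep DateModified and Revision current on every update:
CREATE TRIGGER trg_Widget_Modified ON Widget
AFTER UPDATE AS
    UPDATE w
    SET DateModified = GETDATE(), Revision = w.Revision + 1
    FROM Widget w
    JOIN inserted i ON i.WidgetID = w.WidgetID;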

Entity Deletion Strategy

Say you have a ServiceCall database table that records all the service calls made to you. Each of these records contains a many-to-one relationship to a Customer record, storing which customer made the service call.
Ok, suppose the Customer has stopped doing business with you and you no longer need the Customer's record in your database, nor the Customer's name appearing in the drop-down list when you create a new ServiceCall record.
What do you do?
Do you allow the user to delete the Customer's record from the database?
Do you set a special column IsDeleted to true for that Customer's record, then make sure drop-down lists do not load records that have IsDeleted set to true? Although this keeps the old records from breaking inner joins, it also prevents the user from adding a new record with the same name as the old Customer, doesn't it?
Do you disallow deletion altogether and just allow records to be 'disabled'?
Any other strategies you have used? I am guessing everyone has their own way; I just need to see your opinions.
Of course the above is quite simplified; usually a ServiceCall record will link to many other entity tables, all of which face the same problem when they need to be deleted.
I prefer to set an IsDeleted flag; one of the benefits is that you can still report on historical information (all the data is still there).
As to the issue of not being able to insert another customer with the same name, this isn't a problem if you use an ID column (e.g. CustomerId), which is generally auto-populated.
I agree with @Tetraneutron's answer.
Additionally, you can create a VIEW that lists only the active customers, to make it more convenient to populate drop-down lists and such.
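For instance (assuming the IsDeleted flag from the previous answer; CustomerName stands in for whatever columns the drop-down needs):

-- Drop-downs and new-record screens read from this view only:
CREATE VIEW ActiveCustomers AS
SELECT CustomerId, CustomerName
FROM Customer
WHERE IsDeleted = 0;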

How to separate automatically populated tables from manually populated tables, properly, in SQL Server?

Let's say I have the following two tables in a database:
[Movies] (Scheme: Automatic)
----------------------------
MovieID
Name
[Comments] (Scheme: Manual)
----------------------------
CommentID
MovieID
Text
The "Movies" table gets updated by a service every 10 minutes and the "Comments" table gets updated manually by the users of the database.
Normally you'd just create a simple foreign-key relationship between the two tables with cascading updates and deletes, but in this case I want to be able to keep the manually entered data even if the movie it refers to gets deleted (the update service isn't that reliable). This should only be a problem in one-to-many relationships from an automatic table to a manual table. How would you separate the manual and the automatically populated parts of the database?
I was planning to add a foreign key that doesn't maintain referential integrity and only cascades updates, not deletions. But are there any pitfalls I should be aware of in doing it this way? I mean, except the fact that I might end up with some manual data that doesn't actually reference anything.
Edit / Clarification:
Just to clarify. The example tables are totally made up. In reality the DB will contain objects like servers, applications, application notes, versions numbers etc. Server related information will be populated automatically but some application details will be filled in manually. It could be information like special configurations and such. Even if the server record gets deleted the application notes on that server are still valuable and shouldn't be deleted.
I'd suggest you use an import table that gets updated by the service and then populate the movies table from that. Then you get to keep rows in the movies table even after the service deletes them, possibly tagging them as deleted or obsolete, but you'd still have them for historical purposes.
I think you should use a soft delete for that scenario. I don't think you want to have comments without knowing which movie they belong to.
Agreed; an example route would be to copy the movies table and add a status field which indicates each record's present state (live/checking/deleted). The auto-import should go into a temporary table; set the status of all movies to 'checking', then use the temporary table to update the real movies table, setting each movie's status to 'live' when it's found in the temporary table. Once complete, set any movie which still has a status of 'checking' to 'deleted', since it wasn't found in the auto-import. At the application end, select any movie which doesn't have status = 'deleted'.
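Sketched out, that refresh might run like this (movies_import is the hypothetical staging table the service fills):

-- 1. Everything is provisionally 'checking' before the import runs:
UPDATE Movies SET Status = 'checking';

-- 2. Upsert from the staging table:
UPDATE m SET Name = i.Name, Status = 'live'
FROM Movies m JOIN movies_import i ON i.MovieID = m.MovieID;

INSERT INTO Movies (MovieID, Name, Status)
SELECT i.MovieID, i.Name, 'live'
FROM movies_import i
WHERE NOT EXISTS (SELECT 1 FROM Movies m WHERE m.MovieID = i.MovieID);

-- 3. Anything the import didn't mention is gone upstream: soft-delete it.
UPDATE Movies SET Status = 'deleted' WHERE Status = 'checking';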
"I was planning to add a foreign-key that isn't maintaining referential integrity and only cascades updates, not deletions."
Since you appear to be using surrogate keys, updates will not be relevant to foreign elements. And since you do not care about orphaned data, why use the referential constraint at all? You use constraints to ensure that something exists, which you do not appear to require in this situation.
