Which is the better practice?
Keep history records in a separate history table?
Keep history records in the active table with a different status?
In my opinion I'd rather keep a separate table, to avoid creating one huge table full of duplicate records, which may cause unwanted lag when querying it.
My preference has always been a separate history table, purely because it removes the need for a "WHERE Status = 'LIVE'" or "WHERE CurrentRecord = 1" clause to get the latest record (I won't get into one design that required an inline query on max(version) to find the latest). It also means the current-records table stays smaller and access times may improve. In the worst case, I've seen an ad-hoc query against a combined table pick up the wrong version of a record, causing all sorts of problems later on.
Also, if you are already getting the history from another table, you could shard the data, so that all history from one year is in one table/db, all history from another year is in another table/db, and so on.
Pro:
If you keep the history in a separate table, that data will be accessed only when you need to search something from the past. Most of the time the main table will be used far more than the historical one, which means faster results.
Con:
In a project I worked on, I had one table with 350 columns (don't ask why...). The table became very large as data was entered, and at a specific moment records went from 'active' to 'closed' status. I was tempted to move all the closed records to a new (historical) table, but I realized it was slower - in a lot of queries I had to make unions.
As a final opinion, I think it depends on the case, but I would always consider the separate table first.
I prefer to use one table and partitioning. I would also set up a view for the active records and use that instead of the base table when querying active records.
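For illustration, a minimal sketch of that layout in SQL Server syntax; the partition function, table, and column names here are hypothetical and would need adapting:

CREATE PARTITION FUNCTION pfByYear (DATETIME2)
    AS RANGE RIGHT FOR VALUES ('2023-01-01', '2024-01-01');

CREATE PARTITION SCHEME psByYear
    AS PARTITION pfByYear ALL TO ([PRIMARY]);

CREATE TABLE OrderRecord (
    OrderID    INT         NOT NULL,
    Status     VARCHAR(10) NOT NULL,   -- 'LIVE' or 'HISTORY'
    RecordedAt DATETIME2   NOT NULL
) ON psByYear (RecordedAt);
GO

-- Queries for active records go through the view, not the base table.
CREATE VIEW LiveOrderRecord
AS
SELECT OrderID, Status, RecordedAt
FROM OrderRecord
WHERE Status = 'LIVE';
GO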
I would go for a separate table; otherwise setting up UNIQUE and FK constraints may still be doable, but it gets too involved.
Related
I've been asked by my client to manage update history for each column/field in one of our SQL Server tables. I've managed "batch" versions of entire rows, but haven't done this type of backup/history before. They want to be able to keep track of changes for each column/field in a row of a table. I could use some help on the most efficient way to track this. Thanks!
The answer is triggers and history tables, but the best practice depends on what data your database holds and how often and by how much it is modified.
Essentially, every time a record in a table is updated, an update trigger (attached to the table) is notified of what the old record looked like and what the new record will look like. You can then write the change history as new records in another table (i.e. tblSomething_History). Note: if updates to your tables are done via stored procs you could write the history from there, but the problem with this is that if another stored procedure updates your table as well, the history won't be written.
Depending on the number of fields/tables you want history for, you may do as suggested by @M.Al and embed your history directly into the base table through versioning, or you may create a history table for each individual table, or you may create a generic history table such as:
| TblName | FieldName | NewValue | OldValue | User | Date Time |
Getting the modified time is easy, but it depends on the security setup whether you can determine which user changed what. Keeping the history in a separate table means less impact on retrieving the current data: it is already separated, so you do not need to filter it out. But if you need to show history most of the time, it probably won't have the same effect.
Unfortunately you cannot add a single trigger to all tables in a database; you need to create a separate trigger for each, but they can all call a single stored procedure to do the guts of the actual work.
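As a rough sketch only (SQL Server syntax): the Employee table, its columns, and the ChangeHistory layout below are hypothetical, and the per-column INSERTs could just as well live in the shared stored procedure mentioned above.

CREATE TABLE ChangeHistory (
    HistoryID INT IDENTITY(1,1) PRIMARY KEY,
    TblName   SYSNAME       NOT NULL,
    FieldName SYSNAME       NOT NULL,
    OldValue  NVARCHAR(MAX) NULL,
    NewValue  NVARCHAR(MAX) NULL,
    ChangedBy SYSNAME       NOT NULL DEFAULT SUSER_SNAME(),
    ChangedAt DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER trg_Employee_History
ON Employee
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;

    -- inserted/deleted hold the new/old versions of the updated rows;
    -- one INSERT per tracked column, logging only values that actually changed.
    INSERT INTO ChangeHistory (TblName, FieldName, OldValue, NewValue)
    SELECT 'Employee', 'LastName', d.LastName, i.LastName
    FROM inserted i
    JOIN deleted d ON d.EmployeeID = i.EmployeeID
    WHERE ISNULL(d.LastName, N'') <> ISNULL(i.LastName, N'');

    INSERT INTO ChangeHistory (TblName, FieldName, OldValue, NewValue)
    SELECT 'Employee', 'Salary',
           CAST(d.Salary AS NVARCHAR(50)), CAST(i.Salary AS NVARCHAR(50))
    FROM inserted i
    JOIN deleted d ON d.EmployeeID = i.EmployeeID
    WHERE ISNULL(d.Salary, 0) <> ISNULL(i.Salary, 0);
END;
GO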
Another word of warning: automatically loading all history associated with a table can dramatically increase the load, and depending on the type of data stored, the history can become considerably larger than the base table. I have encountered, and been affected by, numerous applications that became unusable because the history tables were needlessly loaded whenever the base table was; given that the change history could run into the hundreds of entries per item, load times increased accordingly.
Final note: this is a strategy that is easy to absorb if built into your application from the ground up, but be careful bolting it onto an existing solution, as it can have a dramatic impact on performance if not tailored to your requirements, and it can cost more than the client would expect.
I worked on a similar database not long ago, where no row is ever deleted: every time a record was updated, a new row was actually added and assigned a version number. Something like...
Each table will have two columns
Original_ID | Version_ID
Each time a new record is added, it gets a sequential Version_ID (a unique column) and the Original_ID column remains NULL. Every subsequent change to this row actually inserts a new row into the table with an increased Version_ID, and the Version_ID that was assigned when the record was first created is stored in Original_ID.
If you have situations where records need deleting, use a BIT column Deleted and set its value to 1 when a record is supposed to be deleted. This is called soft deletion.
Also add a column such as LastUpdate datetime, to keep track of when each change was made.
This way you end up with all versions of a row, from the moment it is inserted until it is (soft) deleted.
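A minimal sketch of that scheme in SQL Server syntax; the StaffMember table and its columns are hypothetical:

CREATE TABLE StaffMember (
    Version_ID  INT IDENTITY(1,1) PRIMARY KEY,   -- unique, sequential
    Original_ID INT NULL,                        -- NULL on first insert, else the first Version_ID
    LastName    NVARCHAR(50) NOT NULL,
    Deleted     BIT NOT NULL DEFAULT 0,          -- soft-deletion flag
    LastUpdate  DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME()
);

-- Latest live version of each logical record:
SELECT s.*
FROM StaffMember s
WHERE s.Deleted = 0
  AND s.Version_ID = (SELECT MAX(s2.Version_ID)
                      FROM StaffMember s2
                      WHERE COALESCE(s2.Original_ID, s2.Version_ID)
                          = COALESCE(s.Original_ID, s.Version_ID));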
I'm looking at the best-practice approach here. I have a web page that has several drop-down options. The drop-downs are not related; they are for misc. values (location, building codes, etc.). The database right now has a table for each set of options (e.g. a table for building codes, a table for locations, etc.). I'm wondering if I could just combine them all into one table (called listOptions) and then just query that one table.
Location Table
LocationID (int)
LocatValue (nvarchar(25))
LocatDescription (nvarchar(25))
BuildingCode Table
BCID (int)
BCValue (nvarchar(25))
BCDescription (nvarchar(25))
Instead of the above, is there any reason why I can't do this?
ListOptions Table
ID (int)
listValue (nvarchar(25))
listDescription (nvarchar(25))
groupID (int) //where groupid corresponds to Location, Building Code, etc
Now, when I query the table, I can pass the groupID to the query to pull back the values I need.
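Something along these lines, where the parameter name is just illustrative:

SELECT ID, listValue, listDescription
FROM ListOptions
WHERE groupID = @groupID      -- e.g. 1 = Location, 2 = Building Code
ORDER BY listValue;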
Putting them in one table is an antipattern. These are different lookups, and you cannot enforce referential integrity in the database (which is the correct place to enforce it, as applications are often not the only way data gets changed) unless they are in separate tables. Data integrity is FAR more important than saving a few minutes of development time when you need an additional lookup.
If you plan to use the values later in some referencing foreign keys, better to use separate tables.
But why do you need an "all in one" table? What problem does it solve?
You could do this.
I believe this is your master data, and it would not have the huge number of rows that might create performance problems.
Secondly, why would you want to do this once your app is up and running? It should have been thought about earlier. The tables might be used in a lot of places, and changing them might mean a lot of coding and, most importantly, testing.
Can you shed further light on your requirements?
You can keep them in separate tables and have your stored procedure return one set of data with a "datatype" key that signifies which set of values goes with which option.
However, I would urge you to consider a much different approach. This suggestion is based on years of building data-driven websites. If these drop-down options don't change very often, why not build server-side include files instead of querying the database? We did this with most of our websites. Think about it: each time the page is presented, you query the database for the same list of values, and that data hardly ever changes.
In cases where that data did tend to change, we simply added a routine to the back-end admin that rebuilt the server-side include file whenever an add, change, or delete was done to one of the lookup values. This reduced database I/O and sped up the load time of all our websites.
We had approximately 600 websites on the same server, all using the same instance of SQL Server (separate databases), and our total database I/O was drastically reduced.
Edit:
We simply built an SSI that looked like this...
<option value="1">Blue</option>
<option value="2">Red</option>
<option value="3">Green</option>
With a single table it would be easy to add new groups instead of creating new tables, but as a best-practice concern you should also have a group table, so you can name those groups in the db for future maintenance.
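For example, a possible shape for that, with hypothetical names:

CREATE TABLE OptionGroup (
    GroupID   INT IDENTITY(1,1) PRIMARY KEY,
    GroupName NVARCHAR(50) NOT NULL UNIQUE    -- e.g. 'Location', 'Building Code'
);

CREATE TABLE ListOptions (
    ID              INT IDENTITY(1,1) PRIMARY KEY,
    GroupID         INT NOT NULL REFERENCES OptionGroup(GroupID),
    listValue       NVARCHAR(25) NOT NULL,
    listDescription NVARCHAR(25) NULL
);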
The best practice depends on your requirements.
Do the values of location and building vary frequently? Where do the values come from? Are they imported from external data? Do other tables reference the single table (so that you need a two-field key to properly join them)?
For example, I use a single table with heterogeneous data for constants or configuration values.
But if the data vary often or are imported from an external source, I prefer to use separate tables.
I recently realized that I add some form of row-creation timestamp, and possibly an "updated on" field, to most of my tables. Suddenly I started thinking that perhaps every table in the database should have created and modified fields that are set in the model behind the scenes.
Does this sound correct? Are there any types of high-load tables (like sessions) or massive sized tables that this wouldn't be a good idea for?
I wouldn't put those fields (which I generally call audit fields) on every database table. If it's a low-traffic, high-value table (like Users, for instance), it goes on, no question. I'd also add creator and modifier. If it's a table that gets hit a lot (an operation history table, say), then maybe the benefit isn't worth the cost of increased insert time and storage space.
It's a call you'll need to make separately for each table.
Obviously, there isn't a single rule.
Most of my tables have date-related things, DateCreated, DateModified, and occasionally a Revision to track changes and so on. Do whatever makes sense. Clearly, you can invent cases where it's appropriate and cases where it is not. If you're asking whether you should add them "by default" to most tables, I'd say "probably".
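For what it's worth, a sketch of what such audit columns typically look like on a table where they earn their keep (the Users table and column names are just one convention):

ALTER TABLE Users ADD
    DateCreated  DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    DateModified DATETIME2 NULL,
    CreatedBy    SYSNAME   NOT NULL DEFAULT SUSER_SNAME(),
    ModifiedBy   SYSNAME   NULL;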
I am putting together a staff database and I need to be able to revise the staff member information, but also keep track of all the revisions. How should I structure the database so that I can have multiple revisions of the same user data but be able to query against the most recent revision? I am looking at information that changes rarely, like Last Name, but for which I will need to be able to query out-of-date values. So if Jenny Smith changes her name to Jenny James, I need to be able to find the user's current information when I search against her old name.
I assume that I will need at least 2 tables, one that contains the uid and another that contains the revisions. Then I would join them and query against the most recent revision. But should I break it out even further, depending on how often the data changes or the type of data? I am looking at about 40 fields per record and only one or two fields will probably change per update. Also I cannot remove any data from the database, I need to be able to look back on all previous records.
A simple way of doing this is to add a deleted flag; instead of updating records, you set the deleted flag on the existing record and insert a new record.
You can of course also write the existing record to an archive table, if you prefer. But if changes are infrequent and the table is not big I would not bother.
To get the active record, query with 'where deleted = 0'; the speed impact will be minimal when there is an index on this field.
Typically this is augmented with some other fields like a revision number, when the record was last updated, and who updated it. The revision number is very useful to get the previous versions and also to do optimistic locking. The 'who updated this last and when' questions usually come once the system is running instead of during requirements gathering, and are useful fields to put in any table containing 'master' data.
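A minimal sketch of that approach in SQL Server syntax; the Staff table, its columns, and the sample values are hypothetical:

CREATE TABLE Staff (
    RecordID  INT IDENTITY(1,1) PRIMARY KEY,
    StaffID   INT          NOT NULL,   -- stable business identifier
    LastName  NVARCHAR(50) NOT NULL,
    Revision  INT          NOT NULL DEFAULT 1,
    Deleted   BIT          NOT NULL DEFAULT 0,
    UpdatedAt DATETIME2    NOT NULL DEFAULT SYSUTCDATETIME(),
    UpdatedBy SYSNAME      NOT NULL DEFAULT SUSER_SNAME()
);

CREATE INDEX IX_Staff_Deleted ON Staff (Deleted) INCLUDE (StaffID, LastName);

-- An "update" flags the current row and inserts the next revision:
UPDATE Staff SET Deleted = 1 WHERE StaffID = 42 AND Deleted = 0;

INSERT INTO Staff (StaffID, LastName, Revision)
SELECT 42, N'James', MAX(Revision) + 1
FROM Staff
WHERE StaffID = 42;

-- Active records only:
SELECT * FROM Staff WHERE Deleted = 0;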
I would use the separate table, because then you can have a unique identifier that points to all the other child records and is also the PK of the table, which I think makes data integrity issues less likely. For instance, you have Mary Jones, who has records in the address table, the email table, the performance evaluation table, etc. If you add a change record to the main table, how are you going to relink all the existing information? With a separate history table, it isn't a problem.
With a deleted field in one table, you then have to have a non-autogenerated person ID and an autogenerated record ID.
You also have the possibility of people forgetting to use the 'where deleted = 0' clause that is needed for almost every query. (If you do use the deleted-flag field, do yourself a favor and set up a view with the 'where deleted = 0' filter, and require developers to use the view in queries rather than the original table.)
With the deleted-flag field you will also need a trigger to ensure that one and only one record is marked as active.
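For instance, a sketch of such a view, assuming a hypothetical Staff table with a deleted flag:

CREATE VIEW ActiveStaff
AS
SELECT RecordID, StaffID, LastName, UpdatedAt, UpdatedBy
FROM Staff
WHERE Deleted = 0;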
@Peter Tillemans' suggestion is a common way to accomplish what you're asking for. But I don't like it.
The structure of a database should reflect the real-world facts that are being modeled.
I would create a separate table for obsolete_employee, and just store the historical information that would need to be searched in the future. This way you can keep your real employee data table clean and keep only the old data that is necessary. This approach will also simplify reporting and other features of the application that are not related to searching historical data.
Just think of that warm feeling you'll get when you type select * from employee and nothing but current, correct goodness comes flowing back!
I have worked on a few projects (a CMS and an e-commerce system) that required some data to be versioned.
Usually I come up with this kind of schema:
+--------------+
+ foobar +
+--------------+
+ foobar_id +
+ version +
+--------------+
It worked great, but I am wondering if there is a better way to do it. The main problem with that solution is that you always have to use a subquery to get the latest version.
i.e.:
SELECT * FROM foobar WHERE foobar_id = 2 AND version = (SELECT MAX(version) FROM foobar f2 WHERE f2.foobar_id = 2)
This renders most of the queries more complicated and also has some performance drawbacks.
So it would be nice if you could share your experience creating versioned tables, and the pros and cons of each method.
Thanks
I prefer to have historical data in another table. I would make foobar_history or something similar and make an FK to foobar_id. This will stop you from having to use a subquery altogether. It has the added advantage of not polluting your primary data table with the tons of historical data you probably don't want to see 99% of the time you're accessing it.
You will likely want to make a trigger for maintaining this data, though, as it would require you to copy the current data into _history and then do the update.
The cleanest solution in my opinion would be to have a history table for each table that requires versioning. In other words, have a foobar table and a foobar_History table, with a trigger on foobar that writes the existing data to the history table with a timestamp and the user that changed the data. Older data is easily queryable, sorted by timestamp descending, and you know that the data in the main table is always the latest version.
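A rough sketch of that setup in SQL Server syntax; only foobar_id and a stand-in data column are shown, and the actual columns of foobar would need to be copied explicitly:

CREATE TABLE foobar_History (
    history_id INT IDENTITY(1,1) PRIMARY KEY,
    foobar_id  INT           NOT NULL,
    data       NVARCHAR(MAX) NULL,     -- stand-in for foobar's real columns
    changed_by SYSNAME       NOT NULL DEFAULT SUSER_SNAME(),
    changed_at DATETIME2     NOT NULL DEFAULT SYSUTCDATETIME()
);
GO

CREATE TRIGGER trg_foobar_history
ON foobar
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- deleted holds the pre-update rows: copy them into the history table.
    INSERT INTO foobar_History (foobar_id, data)
    SELECT d.foobar_id, d.data
    FROM deleted d;
END;
GO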
I used to work on a system with historical data, and we had a boolean to indicate which row was the latest version of the data. Of course you need to maintain the consistency of the flag at the application level. You can then create indexes that use the flag, and if you provide it in the WHERE clause it's fast.
Pro:
easy to understand
does not require major changes to your (existing) database schema
no need to copy old data into another table; only the flag is updated
Cons:
the flag needs to be maintained at the application level
Otherwise, you can rely on a separate history table, as suggested in several answers.
Pro:
clean separation of history from actual data
possible to have a db-level cascade delete between actual data and its history, in case the entity is removed
Cons:
needs 2 queries (or a union) if you want the complete history (that is, old data + current data)
the row that corresponds to the latest version of the data gets updated in place; I have heard that updates are slower than inserts, depending on the "size" of the data that changed
What is best will depend on your use case. I had to deal with a document management system where we wanted to be able to version documents, but we also had features like reverting to an old version. It was easier to use a boolean to speed up just the operations that required the latest version. If you have real historical data (which never changes), a dedicated history table is probably better.
Does the concept of history fit into your domain model? If not, then you have a db schema that differs from your conceptual domain model. If, at the domain level, the actual data and the old data need to be handled the same way, having two tables complicates the design; just consider the case where you need to return the complete history (old + new). The easiest solution would be to have one class for each table, but then you can't return a list as easily as if you had only one table. But if these are two distinct concepts, then it's fine for history to be first-class in your design.
I would also recommend this article by M. Fowler, which is interesting when it comes to dealing with temporal data: Patterns for things that change with time.
You can simplify the query by using a view over your table that filters to the latest version. This only makes the queries look nicer; you still have the performance overhead.
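For example, a sketch of such a view over the foobar table from the question:

CREATE VIEW foobar_current
AS
SELECT f.*
FROM foobar f
WHERE f.version = (SELECT MAX(f2.version)
                   FROM foobar f2
                   WHERE f2.foobar_id = f.foobar_id);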
A common technique is to add a version_status column for current/expired. Also note that if you keep new and old records in the same table, you should have a business (natural) key for your entity, something like name + pin, because the primary key will change (increment) with each row.
TABLE foobar(foobar_id PK, business_key, version, version_status, .....)
SELECT *
FROM foobar
WHERE business_key = 'myFoobar3' AND version_status = 'current'
When deciding whether to keep the record history in the same table -- or move it to a separate one -- consider other tables which have foobar_id as a foreign key. When issuing a new version, should existing foreign keys point to the new PK or to the old PK? If you want to keep the history of relationships, you would probably want to keep everything in the same table. If only the new version is important, you may consider moving expired rows to another table, though it is not necessary.
If you had used Oracle, you could use analytic functions:
SELECT *
FROM (
    SELECT a.*,
           ROW_NUMBER() OVER (PARTITION BY foobar_id ORDER BY version DESC) AS rn
    FROM foobar a
    WHERE foobar_id = 2
)
WHERE rn = 1
It depends on how many of your tables require versioning, and whether you've got a transactional or a reporting system.
If it's just a few transactional tables, the way you're doing it is fine, as long as the performance issues aren't too significant. You can make the querying easier by adding a current_row column and a trigger that updates the prior row to make it non-current.
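A sketch of such a trigger in SQL Server syntax, assuming a current_row column has been added to the versioned foobar table:

CREATE TRIGGER trg_foobar_current
ON foobar
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Mark every older version of the just-inserted rows as non-current.
    UPDATE f
    SET f.current_row = 0
    FROM foobar f
    JOIN inserted i ON i.foobar_id = f.foobar_id
    WHERE f.version < i.version;
END;
GO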
But if you've got a lot of tables, or the extra rows are slowing down some of your queries, then I'd do as others suggest and use history tables along with history triggers. Note that you can generate that code to make it easier to develop and maintain.
If you're in the reporting world, there are a lot of other options I won't address here; you can find them covered in detail in data warehousing and data modeling books.