Effective strategy for leaving an audit trail/change history for DB applications? - database

What are some strategies that people have had success with for maintaining a change history for data in a fairly complex database. One of the applications that I frequently use and develop for could really benefit from a more comprehensive way of tracking how records have changed over time. For instance, right now records can have a number of timestamp and modified user fields, but we currently don't have a scheme for logging multiple change, for instance if an operation is rolled back. In a perfect world, it would be possible to reconstruct the record as it was after each save, etc.
Some info on the DB:
Needs to have the capacity to grow by thousands of records per week
50-60 Tables
Main revisioned tables may have several million records each
Reasonable amount of foreign keys and indexes set
Using PostgreSQL 8.x

One strategy you could use is MVCC, Multi-Value Concurrency Control. In this scheme, you never do updates to any of your tables, you just do inserts, maintaining version numbers for each record. This has the advantage of providing an exact snapshot from any point in time, and it also completely sidesteps the update lock problems that plague many databases.
But it makes for a huge database, and selects all require an extra clause to select the current version of a record.

If you are using Hibernate, take a look at JBoss Envers. From the project homepage:
The Envers project aims to enable easy versioning of persistent JPA classes. All that you have to do is annotate your persistent class or some of its properties, that you want to version, with #Versioned. For each versioned entity, a table will be created, which will hold the history of changes made to the entity. You can then retrieve and query historical data without much effort.
This is somewhat similar to Eric's approach, but probably much less effort. Don't know, what language/technology you use to access the database, though.

In the past I have used triggers to construct db update/insert/delete logging.
You could insert a record each time one of the above actions is done on a specific table into a logging table that keeps track of the action, what db user did it, timestamp, table it was performed on, and previous value.
There is probably a better answer though as this would require you to cache the value before the actual delete or update was performed I think. But you could use this to do rollbacks.

The only problem with using Triggers is that it adds to performance overhead of any insert/update/delete. For higher scalability and performance, you would like to keep the database transaction to a minimum. Auditing via triggers increase the time required to do the transaction and depending on the volume may cause performance issues.
another way is to explore if the database provides any way of mining the "Redo" logs as is the case in Oracle. Redo logs is what the database uses to recreate the data in case it fails and has to recover.

Similar to a trigger (or even with) you can have every transaction fire a logging event asynchronously and have another process (or just thread) actually handle the logging. There would be many ways to implement this depending upon your application. I suggest having the application fire the event so that it does not cause unnecessary load on your first transaction (which sometimes leads to locks from cascading audit logs).
In addition, you may be able to improve performance to the primary database by keeping the audit database in a separate location.

I use SQL Server, not PostgreSQL, so I'm not sure if this will work for you or not, but Pop Rivett had a great article on creating an audit trail here:
Pop rivett's SQL Server FAQ No.5: Pop on the Audit Trail
Build an audit table, then create a trigger for each table you want to audit.
Hint: use Codesmith to build your triggers.

Related

Architecture recommendation using SQL Server for real-time aggregation and denormalization

We have an enterprise LOB application for managing millions of bibliographic (lots of text) records using SQLServer (2008). The database is very normalized (a complete record might easily be made of up ten joined tables plus nested collections). Write transactions are fine, and we have a very responsive search solution for now, which makes generous use of full-text indexing and indexed views.
The issue is that in reality, much of what the research users need could be better served by a read-only warehouse-type copy of the data, but it would need to be continually copied near real-time (latency of a few minutes is fine).
Our search is optimized by several calculated columns or composite tables already, and we would like to add more. Indexed views cannot cover all needs because of their constraints (such as no outer joins). There are dozens of 'aspects' to this data, much like a read-only data warehouse might provide, involving permissions, geography, category, quality, and counts of associated documents. We also compose complex xml representations of the records that are fairly static and could be composed and stored once.
The total amount of denormalization, calculation and search optimization provokes an unacceptable delay if done completely via triggers, and is also prone to lock conflicts.
I've researched some of Microsoft's SQL Server suggestions, and I would like to know if anyone having experience with similar requirements has can offer recommendation from the following three (or other suggestions that use the SQL Server/.Net stack):
Transactional replication to a read-only copy - but it is unclear from the documentation how much one can change the schema on the subscriber side and add triggers, calculated columns or composite tables;
Table partitioning - not to alter the data, but perhaps to segment large areas of data that currently are recalculated constantly, such as permissions, record type (60), geographical region, etc...would that allow triggers on the transactional side to run with less locks?
Offline batch processing - Microsoft uses that phrase often, but does not give great examples, except for 'checking for signs of credit card fraud' on the subscriber side of transaction replication...which would be a great sample, but how is that done exactly in practice? SSIS jobs that run every 5 minutes? Service Broker? External executables that poll continually? We want to avoid the 'run a long process at night' solution, and we also want to avoid locking up the transactional side of things by running an update-intensive aggregating/compositing routine every 5 minutes on the transactional server.
Update to #3: after posting, I found this SO answer with a link to Real Time Data Integration using Change Tracking, Service Broker, SSIS and triggers - looks promising - would that be a recommended path?
Another Update: which, in turn, has helped me find rusanu.com - all things ServiceBroker by SO user Remus Rusanu. The asyncrhonous messaging solutions seem to match our scenario much better than the Replication scenarios...
Service Broker technology is good for serving your task although there are maybe potential drawback depending on your particular system configuration. The most valuable feature IMO is ability to decouple two kind of processing - writing and aggregation. You will be able to do this even using different databases/SQL Server instances/physical servers in very reliable way. Of course you need to spend some time designing message exchange process - specifying message formats, planning conversations, etc., because this has huge influence on satisfaction from resulting system.
I've used SSBS for my task that was more or less similar - near real-time creation of analytic data warehouse based on regular data flow.

Altering database tables on updating website

This seems to be an issue that keeps coming back in every web application; you're improving the back-end code and need to alter a table in the database in order to do so. No problem doing manually on the development system, but when you deploy your updated code to production servers, they'll need to automatically alter the database tables too.
I've seen a variety of ways to handle these situations, all come with their benefits and own problems. Roughly, I've come to the following two possibilities;
Dedicated update script. Requires manually initiating the update. Requires all table alterations to be done in a predefined order (rigid release planning, no easy quick fixes on the database). Typically requires maintaining a separate updating process and some way to record and manage version numbers. Benefit is that it doesn't impact running code.
Checking table properties at runtime and altering them if needed. No manual interaction required and table alters may happen in any order (so a quick fix on the database is easy to deploy). Another benefit is that the code is typically a lot easier to maintain. Obvious problem is that it requires checking table properties a lot more than it needs to.
Are there any other general possibilities or ways of dealing with altering database tables upon application updates?
I'll share what I've seen work best. It's just expanding upon your first option.
The steps I've usually seen when updating schemas in production:
Take down the front end applications. This prevents any data from being written during a schema update. We don't want writes to fail because relationships are messed up or a table is suddenly out of sync with the application.
Potentially disconnect the database so no connections can be made. Sometimes there is code out there using your database you don't even know about!
Run the scripts as you described in your first option. It definitely takes careful planning. You're right that you need a pre-defined order to apply the changes. Also I would note often times you need two sets of scripts, one for schema updates and one for data updates. As an example, if you want to add a field that is not nullable, you might add a nullable field first, and then run a script to put in a default value.
Have rollback scripts on hand. This is crucial because you might make all the changes you think you need (since it all worked great in development) and then discover the application doesn't work before you bring it back online. It's good to have an exit strategy so you aren't in that horrible place of "oh crap, we broke the application and we've been offline for hours and hours and what do we do?!"
Make sure you have backups ready to go in case (4) goes really bad.
Coordinate the application update with the database updates. Usually you do the database updates first and then roll out the new code.
(Optional) A lot of companies do partial roll outs to test. I've never done this, but if you have 5 application servers and 5 database servers, you can first roll out to 1 application/1 database server and see how it goes. Then if it's good you continue with the rest of the production machines.
It definitely takes time to find out what works best for you. From my experience doing lots of production database updates, there is no silver bullet. The most important thing is taking your time and being disciplined in tracking changes (versioning like you mentioned).

Replication - syncronizing most of the data some of the time

I have some data that isn't properly "partitioned" (for lack of a better word).
All inserts, processing and reporting happen on the same table. The bulk of the processing happens not long after the insert and not long after that it becomes immutable (we're talking days).
I could do all inserts and processing on a new table that I replicate to the old table. When I detect that the data has become immutable I would delete the data from the new table, but I would edit the delete replication stored procedure so that the delete did not replicate.
How bad an idea is this? <edit1>That is, editing the replication stored procedure.</edit1>
It seems attractive at the moment (I haven't slept on it yet) because it might mitigate a performance problem with only very small changes to the application. It also seems like it might be a good way to shoot myself in the foot.
Edit1:
I like the idea of inserting into two tables because I can avoid the view and the maintenance window described in Jono's answer. No offense, Jono, I actually use this technique elsewhere.
I might want to use replication because one table might be in another database (I know, I didn't mention this) and that way I don't have to worry about committing to two tables, I just let replication handle that.
My actual concern (that I didn't make clear) is that editing the replication stored procedure could end up being a deployment/maintenance headache.
I wouldn't advocate replication to solve a performance issue (unless it's a problem of physical data distribution); if anything it's going to slow your system down as the changes are propagated to their destination. If you're using a single server, I'd suggest adding a second table with the same schema as the first, but with your indexes optimised for the kind of work you do in your processing phase. Then create a view that selects from both tables, and use that view in any query where you want the union of both tables. You could then throw more hardware at the second table (I'm thinking of a separate file group over more spindles) and then migrate the data on a weekly delay into the first table, during an available maintenance window.

History tables pros, cons and gotchas - using triggers, sproc or at application level [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 9 years ago.
Improve this question
I am currently playing around with the idea of having history tables for some of my tables in my database. Basically I have the main table and a copy of that table with a modified date and an action column to store what action was preformed e.g., Update, Delete and Insert.
So far I can think of three different places that you can do the history table work.
Triggers on the main table for update, insert and delete. (Database)
Stored procedures. (Database)
Application layer. (Application)
My main question is; what are the pros, cons and gotchas of doing the work in each of these layers?
One advantage I can think of by using the triggers way is that integrity is always maintained no matter what is implemented on top of the database.
I'd put it this way:
Stored procs: they're bypassed if you modify the table directly. Security on the database can control this
Application: same deal. Also if you have multiple applications, possibly in different languages, it needs to be implemented in each stack, which is somewhat redundant; and
Triggers: transparent to the application and will capture all changes. This is my preferred method.
Triggers are the quickest and easiest way to achieve simple history. The following information assumes a more complex example where history processing may include some business rules and may require logging information not found in the table being tracked.
To those that think that triggers are safer than sprocs because they cannot be bypassed I remind them that they are making the following assumption:
!) Permissions exist that stop users from executing DISABLE TRIGGER [but then permissions could too exist to limit all access to the database except for EXECUTE on sprocs which is a common pattern for enterprise applications] - therefore one must assume correct permissions and therefore sprocs equal triggers in terms of security and ability to be bypassed
!) Depending on the database it may be possible to execute update statements that do not fire triggers. I could take advantage of knowledge of nested trigger execution depth to bypass a trigger. The only sure solution includes security in database and limiting access to data using only approved mechanisms - whether these be triggers, sprocs or data access layers.
I think the choices are clear here. If the data is being accessed by multiple applications then you want to control the history from the lowest common layer and this will mean the database.
Following the above logic, the choice of triggers or stored procedures depends again on whether the stored procedure is the lowest common layer. You should prefer the sproc over the trigger as you can control performance, and side effects better and the code is easier to maintain.
Triggers are acceptable, but try to make sure that you do not increase locks by reading data outside of the tables being updated. Limit triggers to inserts into the log tables, log only what you need to.
If the application uses a common logical access layer and it is unlikely that this would change over time I would prefer to implement the logic here. Use a Chain Of Responsibility pattern and a plug-in architecture, drive this from Dependency Injection to allow for all manner of processing in you history module, including logging to completely different types of technology, different databases, a history service or anything else that you could imagine.
Have been using the trigger based approach for years and it has definitely worked well for us, but then you do have the following points to ponder over:
Triggers on a heavily used (say, a multi-tenant SaaS based application) could be extremely expensive
In some scenarios, a few fields can get redundant. Triggers are good only when you are crystal clear on the fields to be logged; though using an application you could have an interceptor layer which could help you log certain fields based on the "configuration"; though with it's own share of overheads
Without adequate database control, a person could easily disable the triggers, modify the data and enable the triggers; all without raising any alarms
In case of web applications, where the connections are established from a pool, tracking the actual users who made the changes can be tedious. A possible solution would be to have the "EditedBy" field in every transaction table.
Late one but it adds couple more options that can be considered.
Change Data Capture: This feature is available in SQL Server 2008 R2+ but only in enterprise edition. It allows you to select tables you want to track and SQL Server will do the job for you. It works by reading transaction log and populating history tables with data.
Reading transaction log: If database is in full recovery mode then transaction log can be read and details on almost transactions can be found.
Downside is that this is not supported by default. Options are to read transaction log using undocumented functions like fn_dblog or third party tools such as ApexSQL Log.
Triggers: Works just fine for small number of tables where there are not too many triggers to manage. If you have a lot of tables you want to audit then you should consider some third party tool for this.
All of these work at the database level and are completely transparent to application.
Triggers are the only reliable way to capture changes. If you do it in Stored Procs or the App, you can always go in and SQL away a change that you don't have a log for (inadvertantly). Of course, somebody who doesn't want to leave a log can disable triggers. But you'd rather force somebody to disable the logging than hope that they remember to include it.
Usually if you choose the application layer, you can design your app code to do the logging in a single point, that will handle consistenly all your historical table. differently triggers are a more complicated approach to maintain because they are (depending on the db technology) replicated for every table: in case of hundred of tables the amount of code for the trigger coud be a problem.
if you have a support organization that will maintain the code you are writing now, and you don't know who will maintain your code (tipical for big industries) you cannot assume which is the skill level of the person who will do fix on your application, in that case it is better in my opinion to make the historical table working principle as simple as possible, and the application layer is probably the best place for this purpose.

django AuditTrail vs Reversion

I am working on an new web app I need to store any changes in database to audit table(s). Purpose of such audit tables is that later on in a real physical audit we can asecertain what happened in a situation, who edited what and what was the state of db at the time of e.g. a complex calculation.
So mostly audit table will be written and not read. Report may be generated though sometimes.
I have looked for available solution
AuditTrail - simple and that is why I am inclining towards it, I can understand it single file code.
Reversion - looks simple enough to use but not sure how easy it would be to modify it if needed.
rcsField seems to be very complex and too much for my needs
I haven't tried anyone of these, so I wanted to know some real experiences and which one I should be using. e.g. which one is faster uses less space, easy to extend and maintain?
Personally I prefer to create audit tables in the database and populate through triggers so that any change even ad hoc queries from the query window are stored. I would never consider an audit solution that is not based in the database itself. This is important because people who are making malicious changes to the database or committing fraud are not likely to do so through the web interface but on the backend directly. Far more of this stuff happens from disgruntled or larcenous employees than outside hackers. If you are using an ORM already, your data is at risk because the permissions are at the table level rather than the sp level where they belong. Therefore it is even more important that you capture any possible change to the dat not just what was from the GUI. WE have a dynamic proc to create audit tables that is run whenever new tables are added to the database. Since our audit tables populate only the changes and not the whole record, we do not need to change them every time a field is added.
Also when evaluating possible solutions, make sure you consider how hard it will be to revert the data to undo a specific change. Once you have audit tables, you will find that this is one of the most important things you need to do from them. Also consider how hard it will be to maintian the information as the database schema changes.
Choosing a solution because it appears to be the easiest to understand, is not generally a good idea. That should be lowest of your selction criteria after meeting the requirements, security, etc.
I can't give you real experience with any of them but would like to make an observation.
I assume by AuditTrail you mean AuditTrail on the Django wiki. If so, I think you'll want to instead look at HistoricalRecords developed by the same author (Marty Alchin aka #gulopine) in his book Pro Django. It should work better with Django 1.x.
This is the approach I'll be using on an upcoming project, not because it necessarily beats the others from a technical standpoint, but because it matches the "real world" expectations of the audit trail for that application.
As i stated in my question rcField seems to be to much for my needs, which is simple that i want store any changes to my table, and may be come back later to those changes to generate some reports.
So I tested AuditTrail and Reversion
Reversion seems to be a better full blown application with many features(which i do not need), Also as far as i know it saves data in a single table in XML or YAML format, which i think
will generate too much data in a single table
to read that data I may not be able to use already present db tools.
AuditTrail wins in that regard that for each table it generates a corresponding audit table and hence changes can be tracked easily, per table data is less and can be easily manipulated and user for report generation.
So i am going with AuditTrail.

Resources