How to keep historic details of modifications in a database (audit trail)?

I'm a J2EE developer and we are using Hibernate mappings with a PostgreSQL database.
We have to keep track of any changes that occur in the database; in other words, all previous and current values of any field should be saved. Each field can be of any type (bytea, int, char...).
With a simple table it is easy, but with a graph of objects things are more difficult.
So we have, speaking from a UML point of view, a graph of objects to store in the database along with every change and the user who made it.
Any idea or pattern for how to do that?

A common way to do this is by storing versions of objects.
If you add a "version" and a "deleted" field to each table that you want to keep an audit trail on, then instead of doing normal updates and deletes, follow these rules:
Insert - Set the version number to 0 and insert as normal.
Update - Increment the version number and do an insert instead.
Delete - Increment the version number, set the deleted field to true and do an insert instead.
Retrieve - Get the record with the highest version number and return that.
If you follow this pattern, every time you update you will create a new record rather than overwriting the old data, so you will always be able to track back and see all the old objects.
This will work exactly the same for graphs of objects, just add the new fields to each table within the object graph, and handle each insert/update/delete for each table as described above.
If you need to know which user made the modification, you just add a "ModifiedBy" field as well.
(You can either do this processing in your DA layer code, or if you prefer you can use database triggers to catch your update/delete/retrieve calls and re-process them following the rules.)
Obviously, you need to consider space requirements, as every single update will result in a fully new record. If your application is update-heavy, you are going to generate a lot of data. It's common to also include a "last modified time" field so you can process the database offline and delete data older than required.
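A minimal sketch of the pattern in PostgreSQL (the customer table and its columns are illustrative, not from the question):

-- Every row carries a version, a deleted flag, and modification metadata
CREATE TABLE customer (
    id          integer   NOT NULL,
    version     integer   NOT NULL DEFAULT 0,
    deleted     boolean   NOT NULL DEFAULT false,
    modified_by text,
    modified_at timestamp NOT NULL DEFAULT now(),
    name        text,
    PRIMARY KEY (id, version)
);

-- "Update": insert a copy of the latest row with the version incremented
INSERT INTO customer (id, version, name, modified_by)
SELECT id, version + 1, 'New Name', 'alice'
FROM customer
WHERE id = 42
ORDER BY version DESC
LIMIT 1;

-- "Retrieve": the highest version per id is the current state,
-- and a deleted latest version means the record is gone
SELECT *
FROM (
    SELECT DISTINCT ON (id) *
    FROM customer
    WHERE id = 42
    ORDER BY id, version DESC
) latest
WHERE NOT deleted;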

Current RDBMS implementations are not very good at handling temporal data. That's one reason why maintaining separate journalling tables through triggers is the usual approach. (The other is that audit trails frequently have different use cases from regular data, and having them in separate tables makes it easier to manage access to them.) Oracle does a pretty slick job of hiding the plumbing in its Total Recall product, but being Oracle it charges $$$ for this.
Scott Bailey has published a presentation on temporal data in PostgreSQL. Alas, it won't help you right now, but it seems some features planned for 8.5 and 8.6 will enable transparent storage of time-related data.

Related

Insert New Record Before Existing Records in MS SQL Server

I'm working with an existing MS SQL database and ASP.NET web application. An update is needed in a table, but in order to add the new data and have it display correctly in the site, I need to be able to take a series of existing records, essentially "push them down", and then add the new data in the open space created.
Is there a cleaner, more efficient method than just creating a new record that's a copy of the last related record, and then essentially doing a copy-and-paste for the remaining records until I reach the insertion point? There are quite a few records to move and I'd prefer something that isn't as mind-numbing and potentially error-prone as that.
I know the existing site and database isn't designed optimally for inserting new data into this table unless it's added to the end, but reconfiguring the database and stored procedures is not an option I presently have.
-- EDIT --
For additional requested information...
Screenshot of the table definition: (image not reproduced)
Screenshot of some table data, filtered by TemplateID: (image not reproduced)
When looking at the table data, there are a couple of other template ID values that bring back somewhat more complex data. The issue is that this data needs to maintain its order, which happens to be the order in which it was entered, since it gets returned and displayed in the shown order. The new data needs to be entered prior to one of these lettered subject headers. Honestly, I think this is not the best way to do this, but I had no hand in the design. It was created by a different company, and mine was hired to handle updates and maintenance after the creators became unpleasant to work with. A different template ID value brings back two levels of headers, which doesn't make my task any easier or the alterations much cleaner, considering the C# code that calls the stored procedures is completely separated from the code that builds the contents of the pages, and the organizational structure is tough to follow. There are some very poor naming conventions in places.
At any rate, there needs to be an insertion into this group of data under the "A" header value. The same needs to occur with another chunk associated with a different template ID, and there is another main header below the insertion point.

Database design for application with wiki-like functions

I'm making an API for movies/TV/actors etc. with Web API 2 and SQL Server. The database now has >30 tables, most of them storing data users will be able to edit.
How should I store old version of entries?
Say someone edits the description, runtime, and tagline for an entry (movie) in the movies table.
I'll have a table (movies_old) where I store the editable fields from 'movies', plus who edited it and when.
All in the same database. The '???_old' tables have no relationships.
I'm very new to database design. Is there something obviously wrong with this?
To my mind, there are two issues here: what table you store the data in, and what goes in the "historical value" field.
On the first question, there are two obvious options: Store old and new records in the same table, with some sort of indication of which is "current" and which is "history", or have a separate table for history.
The main advantage of one table is that you have a simpler schema. This is especially true if the table contains many fields. If there are two tables, then all the field definitions are duplicated. When you move data from the current table to the history table, you have to copy every field, and if the list of fields changes, or their formats change, you have to remember to update the copy. Any queries that show the history have to read two tables. Etc. But with one table, all that goes away. Converting a record from current to history just means changing the setting of the "is_current" flag or however you indicate it.
The main advantages of two tables are, (a) Access is probably somewhat faster, as you don't have so many irrelevant records to skip over. (b) When reading the current table you don't have to worry about excluding the history records.
Oh, an annoying thing about SQL: In principle you could put a date on each record, and then the record with the latest date is the current one. In practice this is a pain: you usually have to have an inner query to find the latest date, and then feed this back into an outer query that re-reads the record with that date. (Some SQL engines have ways around this; Postgres, for example, as sketched below.) So in practice, you need an "is_current" flag, probably 1 for current and 0 for history or some such.
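For instance, with a hypothetical movies_history table keyed by movie_id and stamped with modified_at, the standard approach needs the inner query, while Postgres's DISTINCT ON collapses it into one pass:

-- Standard SQL: inner query finds the latest date, outer query re-reads the row
SELECT h.*
FROM movies_history h
JOIN (
    SELECT movie_id, MAX(modified_at) AS latest
    FROM movies_history
    GROUP BY movie_id
) m ON m.movie_id = h.movie_id AND h.modified_at = m.latest;

-- Postgres: DISTINCT ON keeps the newest row per movie directly
SELECT DISTINCT ON (movie_id) *
FROM movies_history
ORDER BY movie_id, modified_at DESC;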
The other issue is what to put in the contents. If you're dealing with short fields, customer number and amount billed and so forth, then the simple and easy thing to do is just store the complete old contents in one record and the complete new contents in the new record. But if you're dealing with a long text block, like a plot synopsis or a review, there could be many small editorial changes. If every time someone fixes a grammar or spelling error, we have a whole new record with the entire 1000 characters, of which 5 characters are different, this could really clutter up the database. If that's the case you might want to investigate ways to store changes more efficiently. It may or may not be an issue for you.

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let’s forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table, or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType)? Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter such somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
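A sketch of that layout, using the Client attributes from the question (the :valid_ts and :audit_ts query parameters are placeholders):

CREATE TABLE client (
    client_id           integer   NOT NULL,
    name                varchar(100),
    client_type         varchar(20),
    relationship_type   varchar(20),
    relationship_status varchar(20),
    valid_dt_start      timestamp NOT NULL,  -- when the fact became true in the real world
    valid_dt_end        timestamp,           -- NULL = still true
    audit_dt_start      timestamp NOT NULL,  -- when this row was recorded
    audit_dt_end        timestamp,           -- NULL = current belief
    PRIMARY KEY (client_id, valid_dt_start, audit_dt_start)
);

-- "As of" query: what did we believe at :audit_ts about the state at :valid_ts?
SELECT * FROM client
WHERE client_id = 1
  AND valid_dt_start <= :valid_ts AND (valid_dt_end IS NULL OR valid_dt_end > :valid_ts)
  AND audit_dt_start <= :audit_ts AND (audit_dt_end IS NULL OR audit_dt_end > :audit_ts);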

What will be the best way to keep track of modified tuples in a database?

I am currently working on a project in which I have to keep track of the tuples that are modified in a relational database. This should include updated tuples, but also inserted and deleted tuples. My question is what will be the best way to accomplish this? I have several ideas of my own, but maybe there are easier/better ways that I did not think of, or there already exists a project that exactly does this.
The final goal of the project is that it will work for relational databases of different vendors, but the first implementation will use a MySQL database. Other database systems can be supported later. But it would be nice if the solution that works for MySQL can be easily adapted to another database.
My first idea was to parse log files. However, I am not certain whether these logfiles contain the actual modified tuples, and furthermore I can imagine that these logfiles will not always be available (e.g. on shared hosting).
My second idea was to intercept the queries at the application level. When an INSERT, DELETE or UPDATE query is performed, these queries can be parsed, and the tuples that they will affect can be determined beforehand. For an INSERT operation this is simply the inserted tuple, and for a DELETE or UPDATE operation the tuples can be identified by applying the WHERE clause in a new SELECT statement.
As a last remark I want to add that performance is not an important factor at this stage of development.
If more details are needed I am happy to provide them.
Use triggers to capture the INSERT, UPDATE, and DELETE and log your entries to a new table. You can use a timestamp on that table to note when the transactions occurred. In the future you can query that table for your modification information.
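A minimal MySQL sketch of that approach (the products table, its name column, and the log table are all illustrative):

CREATE TABLE products_log (
    log_id     INT AUTO_INCREMENT PRIMARY KEY,
    product_id INT NOT NULL,
    operation  VARCHAR(6) NOT NULL,   -- 'INSERT' / 'UPDATE' / 'DELETE'
    old_name   VARCHAR(100),
    new_name   VARCHAR(100),
    changed_at TIMESTAMP NOT NULL DEFAULT CURRENT_TIMESTAMP
);

DELIMITER //
CREATE TRIGGER products_au AFTER UPDATE ON products
FOR EACH ROW
BEGIN
    INSERT INTO products_log (product_id, operation, old_name, new_name)
    VALUES (OLD.id, 'UPDATE', OLD.name, NEW.name);
END//
DELIMITER ;

-- Analogous AFTER INSERT (NEW values only) and AFTER DELETE (OLD values only)
-- triggers complete the picture.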
This will require some database-dependent features, but you can encapsulate them depending on your architecture. You could use database triggers, which I normally advise against except for this very thing: auditing. In each kind of trigger, you could simply write whatever info you need to a log table. Just one suggestion.

Versioning Database Persisted Objects, How would you? [closed]

(Not related to versioning the database schema)
Applications that interface with databases often have domain objects that are composed of data from many tables. Suppose the application were to support versioning, in the sense of CVS, for these domain objects.
For some arbitrary domain object, how would you design a database schema to handle this requirement? Any experience to share?
Think carefully about the requirements for revisions. Once your code-base has pervasive history tracking built into the operational system it will get very complex. Insurance underwriting systems are particularly bad for this, with schemas often running in excess of 1000 tables. Queries also tend to be quite complex and this can lead to performance issues.
If the historical state is really only required for reporting, consider implementing a 'current state' transactional system with a data warehouse structure hanging off the back for tracking history. Slowly Changing Dimensions are a much simpler structure for tracking historical state than trying to embed an ad-hoc history tracking mechanism directly into your operational system.
Also, Changed Data Capture is simpler for a 'current state' system with changes being done to the records in place - the primary keys of the records don't change so you don't have to match records holding different versions of the same entity together. An effective CDC mechanism will make an incremental warehouse load process fairly lightweight and possible to run quite frequently. If you don't need up-to-the-minute tracking of historical state (almost, but not quite, an oxymoron) this can be an effective solution with a much simpler code base than a full history tracking mechanism built directly into the application.
A technique I've used for this in the past has been to have a concept of "generations" in the database; each change increments the current generation number for the database - if you use Subversion, think revisions.
Each record has 2 generation numbers associated with it (2 extra columns on the tables) - the generation that the record starts being valid for, and the generation that it stops being valid for. If the data is currently valid, the second number would be NULL or some other generic marker.
So to insert into the database:
increment the generation number
insert the data
tag the lifetime of that data with valid from, and a valid to of NULL
If you're updating some data:
mark all data that's about to be modified as valid to the current generation number
increment the generation number
insert the new data with the current generation number
Deleting is just a matter of marking the data as terminating at the current generation.
To get a particular version of the data, find what generation you're after and look for data valid between those generation versions.
Example:
Create a person.
|Name|D.O.B |Telephone|From|To |
|Fred|1 april|555-29384|1 |NULL|
Update tel no.
|Name|D.O.B |Telephone|From|To |
|Fred|1 april|555-29384|1 |1 |
|Fred|1 april|555-43534|2 |NULL|
Delete Fred:
|Name|D.O.B |Telephone|From|To |
|Fred|1 april|555-29384|1 |1 |
|Fred|1 april|555-43534|2 |2 |
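In SQL, the update rules above might look like this (assuming a person table with from/to generation columns and a one-row counter table generation(gen); all names are hypothetical):

-- 1. Close off the row being modified at the current generation
UPDATE person
SET to_gen = (SELECT gen FROM generation)
WHERE name = 'Fred' AND to_gen IS NULL;

-- 2. Bump the generation counter
UPDATE generation SET gen = gen + 1;

-- 3. Insert the new version, valid from the new generation onward
INSERT INTO person (name, dob, telephone, from_gen, to_gen)
VALUES ('Fred', '1 april', '555-43534', (SELECT gen FROM generation), NULL);

-- Reading the data as of generation :g
SELECT * FROM person
WHERE from_gen <= :g AND (to_gen IS NULL OR to_gen >= :g);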
An alternative to strict versioning is to split the data into 2 tables: current and history.
The current table has all the live data and has the benefits of all the performance that you build in.
Any changes first write the current data into the associated "history" table along with a date marker which says when it changed.
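A sketch of that write path, assuming a movies table with a column-for-column movies_history twin that has one extra archived_at column (names hypothetical):

-- Archive the live row with a timestamp before touching it
INSERT INTO movies_history
SELECT m.*, now()
FROM movies m
WHERE m.id = :id;

-- Then update the live row in place as usual
UPDATE movies SET tagline = :tagline WHERE id = :id;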
If you are using Hibernate, JBoss Envers could be an option. You only have to annotate classes with @Audited to keep their history.
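By default, Envers then maintains a parallel audit table per audited entity; for a hypothetical Movie entity it is roughly shaped like this (the entity's own columns are illustrative):

CREATE TABLE movie_aud (
    id      bigint  NOT NULL,
    rev     integer NOT NULL,  -- revision number, joined to Envers' REVINFO table
    revtype smallint,          -- 0 = insert, 1 = update, 2 = delete
    title   varchar(255),
    PRIMARY KEY (id, rev)
);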
You'll need a master record in a master table that contains the information common among all versions.
Then each child table uses master record id + version no as part of the primary key.
It can be done without the master table, but in my experience it will tend to make the SQL statements a lot messier.
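A sketch of that master/child layout, using a hypothetical document object:

CREATE TABLE document (                -- master: one row per logical object
    document_id     integer PRIMARY KEY,
    current_version integer NOT NULL
);

CREATE TABLE document_version (        -- child: one row per version
    document_id integer NOT NULL REFERENCES document,
    version_no  integer NOT NULL,
    title       text,
    body        text,
    PRIMARY KEY (document_id, version_no)
);

-- Fetching the current version goes through the master record
SELECT v.*
FROM document d
JOIN document_version v
  ON v.document_id = d.document_id AND v.version_no = d.current_version;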
A simple fool-proof way is to add a version column to your tables, store the object's version, and choose the appropriate application logic based on that version number.
This way you also get backwards compatibility for little cost, which is always good.
ZODB + ZEO implements a revision-based database with complete rollback to any point in time. Go check it out.
Bad part: it's tied to Zope.
Once an object is saved in a database, it can be modified any number of times. If we want to know how many times an object has been modified, we can apply Hibernate's versioning concept.
When versioning is used, Hibernate inserts the version number as zero when an object is saved for the first time. Hibernate then increments that version number by one automatically whenever the object is modified.
In order to use this versioning concept, we need the following two changes in our application:
Add one property of type int to our POJO class.
In the Hibernate mapping file, add a version element right after the id element.
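At the SQL level, the effect is roughly the following (the movie table is hypothetical); the guarded UPDATE is also what gives Hibernate its optimistic locking:

-- First save: Hibernate writes version 0
INSERT INTO movie (id, title, version) VALUES (1, 'Alien', 0);

-- Each later modification bumps the version; if another transaction
-- already bumped it, 0 rows match and Hibernate raises a stale-object error
UPDATE movie SET title = 'Aliens', version = 1
WHERE id = 1 AND version = 0;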
I'm not sure if we have the same problem, but I required a large number of 'proposed' changes to the current data set (with chained proposals, i.e., proposals on proposals).
Think branching in source control but for database tables.
We also wanted a historical log but this was the least important factor - the main issue was managing change proposals which could hang around for 6 months or longer as the business mulled over change approval and got ready for the actual change to be implemented.
The idea is that users can load up a Change and start creating, editing, and deleting the current state of data without actually applying those changes. They can revert any changes they have made, or cancel the entire change.
The only way I have been able to achieve this is to have a set of common fields on my versioned tables:
Root ID: Required - set once to the primary key when the first version of a record is created. This represents the primary key across all of time and is copied into each version of the record. You should consider the Root ID when naming relation columns (eg. PARENT_ROOT_ID instead of PARENT_ID). As the Root ID is also the primary key of the initial version, foreign keys can be created against the actual primary key - the actual desired row will be determined by the version filters defined below.
Change ID: Required - every record is created, updated, deleted via a change
Copied From ID: Nullable - null indicates newly created record, not-null indicates which record ID this row was cloned from when updated
Effective From Date/Time: Nullable - null indicates proposed record, not-null indicates when the record became current. Unfortunately a unique index cannot be placed on Root ID/Effective From as there can be multiple null values for any Root ID. (Unless you want to restrict yourself to a single proposed change per record)
Effective To Date/Time: Nullable - null indicates current/proposed, not-null indicates when it became historical. Not technically required but helps speed up queries finding the current data. This field could be corrupted by hand-edits but can be rebuilt from the Effective From Date/Time if this occurs.
Delete Flag: Boolean - set to true when it is proposed that the record be deleted upon becoming current. When deletes are committed, their Effective To Date/Time is set to the same value as the Effective From Date/Time, filtering them out of the current data set.
The query to get the current state of data according to a change would be:

SELECT *
FROM table
WHERE CHANGE_ID IN :ChangeId
   OR (EFFECTIVE_FROM <= :Now
       AND (EFFECTIVE_TO IS NULL OR EFFECTIVE_TO > :Now)
       AND ROOT_ID NOT IN (SELECT ROOT_ID FROM table WHERE CHANGE_ID IN :ChangeId))
(The filtering of change-on-change multiples is done outside of this query).
The query to get the current state of data at a point in time would be:

SELECT *
FROM table
WHERE EFFECTIVE_FROM <= :Now
  AND (EFFECTIVE_TO IS NULL OR EFFECTIVE_TO > :Now)
Common indexes created on (ROOT_ID, EFFECTIVE_FROM), (EFFECTIVE_FROM, EFFECTIVE_TO) and (CHANGE_ID).
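Pulling the common fields together, each versioned table would look roughly like this (item and its name column are placeholders for a real table and its payload):

CREATE TABLE item (
    id             bigint PRIMARY KEY,  -- also the Root ID of version 1
    root_id        bigint NOT NULL,     -- stable identity across all versions
    change_id      bigint NOT NULL,     -- the change that created this row
    copied_from_id bigint,              -- NULL = newly created record
    effective_from timestamp,           -- NULL = still only proposed
    effective_to   timestamp,           -- NULL = current or proposed
    delete_flag    boolean NOT NULL DEFAULT false,
    name           text                 -- ...plus the table's real columns
);

CREATE INDEX item_root_eff ON item (root_id, effective_from);
CREATE INDEX item_eff ON item (effective_from, effective_to);
CREATE INDEX item_change ON item (change_id);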
If anyone knows a better solution I would love to hear about it.