Why do we need Audit Columns in Database Tables? - database

I have seen many database designs having following audit columns on all the tables...
Created By
Create DateTime
Updated By
Updated DateTime
From one perspective, I see tables as falling into the following categories...
Entity Tables:
(Good candidates for audit columns)
Reference Tables:
(Audit columns may or may not be required. In some cases last-update information is not required at all, because the record is never going to be modified.)
Reference Data Tables
Like Country Names, Entity State etc... Audit columns may not be required because this information is created only at system installation time and is never going to change.
I have seen many designers blindly put all audit columns on all tables. Is this good practice, and if so, what is the reason?
I just want to know because to me it seems illogical. It is difficult for me to figure out why they design their db this way. I am not saying they are wrong or right, I just want to know the WHY.
You can also suggest an alternative auditing pattern or solution, if one is available...
Thanks and Regards

Data auditing is a required internal control for many business systems (see Sarbanes-Oxley for reasons why). It must be at the database level to ensure that all changes are captured, especially unauthorized ones.
Even with lookup tables, an unauthorized change could wreak havoc in your system, so it is important to know who made the change and when. The "when" is especially important because it tells the DBAs how far back to go for a backup to restore information that was accidentally or maliciously changed.
We like to think all our employees are trustworthy, but many of the thefts of personal data and the malicious changes that destroy company data come from internal sources (this is why it is dangerous to have many disgruntled employees), as does almost all of the fraud. Yet most programmers seem to think that they only have to protect against outside threats.
Of course you are still going to have a few people who can make unauthorized changes; you can't prevent system admins from doing this. But with auditing you can at least limit the potential for data damage (and be especially careful when hiring DBAs, and allow no one else admin rights on your database servers).

These columns are for the benefit of the DBA and the database developers. They just provide a quick mechanism to answer questions like "When did this record last change?" "who changed it?" They are not robust enough or fine-grained enough to satisfy compliance with SOX, HIPAA or whatever.
It is simply easier to have these columns on every table. All data can change, so it is useful to know when changes happened, especially if that data isn't supposed to change. It is possible to automate the process of adding them, by using the data dictionary to generate scripts.
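As a rough sketch of the "use the data dictionary to generate scripts" idea, assuming SQLite (where the dictionary is sqlite_master plus PRAGMA table_info) and hypothetical column names that are not from any particular system:

```python
import sqlite3

# Hypothetical audit column names and types; adjust to your own conventions.
AUDIT_COLUMNS = {
    "created_by": "TEXT",
    "created_at": "TEXT",
    "updated_by": "TEXT",
    "updated_at": "TEXT",
}

def audit_column_script(conn):
    """Return ALTER TABLE statements for every audit column a table lacks."""
    statements = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info yields one row per column; index 1 is the column name.
        existing = {row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')}
        for column, sql_type in AUDIT_COLUMNS.items():
            if column not in existing:
                statements.append(
                    f'ALTER TABLE "{table}" ADD COLUMN {column} {sql_type}'
                )
    return statements

# Demo schema (made up for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT, updated_at TEXT)")

script = audit_column_script(conn)
for statement in script:
    conn.execute(statement)
```

Running the generator a second time returns nothing, so it can be re-run whenever new tables are added. Other engines expose the same information through their own catalog views.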
It is good practice for these columns to be populated independently of the application, by triggers or some similar mechanism. These columns are metadata, the application shouldn't really be aware of them.
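A minimal sketch of populating the timestamps without the application's involvement, assuming SQLite and a made-up customer table. SQLite has no notion of a session user, so the "by" columns are left out here; engines that do expose the current user could stamp those the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (
    id         INTEGER PRIMARY KEY,
    name       TEXT NOT NULL,
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    updated_at TEXT
);

-- Stamp the row whenever a business column changes.
CREATE TRIGGER customer_touch
AFTER UPDATE OF name ON customer
FOR EACH ROW
BEGIN
    UPDATE customer SET updated_at = CURRENT_TIMESTAMP WHERE id = NEW.id;
END;
""")

# The application statements never mention the audit columns.
conn.execute("INSERT INTO customer (name) VALUES ('Acme')")
conn.execute("UPDATE customer SET name = 'Acme Ltd' WHERE id = 1")

created_at, updated_at = conn.execute(
    "SELECT created_at, updated_at FROM customer WHERE id = 1"
).fetchone()
```

The column default covers creation and the trigger covers updates, so an ad hoc UPDATE from a query window gets stamped just like one issued by the application.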
Relying on a full-blown audit trail to provide this functionality is usually not an option. Audit data which is collected for compliance purposes usually has restricted access, and indeed may be stored in a separate physical location.

Many applications are developed in an OOP language in which there is generally a base class like BusinessObject that holds information perceived as generally helpful, such as these auditing fields. Not every subclass may need them, but they are there if it does. Since the overhead in the db is small, and there is a fair chance the client will request another odd statistic based on the audit fields, it's better to have them around than not to have them at all. If something represents a static list of information, such as country names, I generally wouldn't put it in the db at all; enumerated data types were created for just such purposes.
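For illustration only, here is what keeping such a static list in code might look like, using Python's Enum and made-up state names:

```python
from enum import Enum

class EntityState(Enum):
    # Hypothetical values; a real list would come from the domain.
    DRAFT = "draft"
    ACTIVE = "active"
    ARCHIVED = "archived"

def parse_state(raw):
    """Convert a stored string to an EntityState; unknown values raise ValueError."""
    return EntityState(raw)
```

The trade-off is that changing the list means shipping new code, and the database can no longer enforce it with a foreign key.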

I came across this thread by chance, as the same question popped into my mind this morning. Every answer has a point and I definitely agree with all of you. The need to safeguard business data and transaction data is undeniable. What the asker doubts, rather, is the value of audit fields on configuration or static data.
This kind of configuration data is not updatable by users. Usually it can be kept in other places as well, like properties, config files or even hard-coded constants. Of course putting configuration data in those places might be bad design or style, but from the perspective of auditing, does it matter? In addition, if this data is not updatable by users, then the only ones who can update it are DBAs or hackers. A truly malicious DBA or hacker will already know the rules before breaking them, and will find ways to circumvent them.
To me, the question has more to do with the environment in your company. Does your company have a culture of keeping track of every tiny bit of information? Does your company constantly enforce strict discipline, monitoring or auditing? Having these auditing fields on non-user data is more for their satisfaction than for any other purpose.

Related

business logic in database layer

I hate to ask the classic question of "business logic in database vs code" again, but I need some concrete reasons to convince an older team of developers that business logic in code is better, because it's more maintainable, above all else. I used to have a lot of business logic in the DB, because I believed it was the single point of access. Maintenance was easy as long as I was the only one changing it. In my experience, the problems came when the projects got larger and more complicated. Source control for DB stored procs is not as advanced as it is for newer IDEs, nor are the editors. Business logic in code scales much better than in the DB, is what I've found in my recent experience.
So, just searching around stackoverflow, I found quite the opposite philosophy from its esteemed members:
https://stackoverflow.com/search?q=business+logic+in+database
I know there is no absolute for any situation, but for a given asp.net solution, which will use either sql server or oracle, for a not particularly high-traffic site, why would I put the logic in the DB?
Depends on what you call business.
The database should do what is expected.
If the consumers and providers of data expect the database to make certain guarantees, then it needs to be done in the database.
Some people don't use referential integrity in their databases and expect the other parts of the system to manage that. Some people access tables in the database directly.
I feel that from a systems and component perspective, the database is like any other service or class/object. It needs to protect its perimeter, hide its implementation details and provide guarantees of integrity, from low-level integrity up to a certain level, which may be considered "business".
Good ways to do this are referential integrity, stored procedures, triggers (where necessary), views, hiding base tables, etc., etc.
The database does data things; why weigh down something that is already getting hit pretty hard just to give you data? It's a performance thing and a code thing. It's MUCH easier to maintain business logic code than to store it all in the database. Sprocs, Views and Functions can only go so far until you have Views of Views of Views with sprocs to fill that mess in. With business logic in code you separate your concerns. If you have a bug that's causing something to be calculated wrong, it's easier to check the business logic code than to go into the DB and see if someone messed something up in a Stored Procedure. This is highly opinionated and in some cases it's OK to put some logic in the database, but my thinking is that it's a database, not a logicbase; put things where they belong.
P.S: I might be catching some heat for this post. It's highly opinionated, and other than performance numbers there's no real evidence either way, so it comes down to what you're working with.
EDIT: Something that Cade mentioned that I forgot: referential integrity. By all means please have correct data integrity in your DB: no orphaned records, ON DELETE CASCADEs, checks and whatnot.
I have faced database logic on one huge project. It came from a decision by the main manager, who was a DBA specialist. He said that the application should be lightweight, that it should know nothing about the database schema, joined tables, etc., and that stored procs execute much faster than transaction scopes and queries from the client anyway.
On the other hand, we had too many bugs with database object mappings (a stored proc or view based on a view based on another view, etc.). It was impossible to understand what was happening with our data, because each button click called a huge stored proc with 70-90-120 parameters that updated several (10-15) tables. We had no ability to run a simple select query, so for one simple join we had to build a view or stored proc plus a class in code :-( Of course, when a table or view definition changes you have to recompile all the other DB objects based on the edited object, or else you get a runtime exception.
So I think that logic in the database is a horrible approach. Of course you can keep some pieces of code in stored procs where performance or security requires it, but you should not develop everything in the database. The logic should be flexible, testable and maintainable, and you cannot achieve that when the database is where the logic lives.

Database Designing: An art or headache (Managing relationships)

In my experience, most people don't use physical relationships in their tables; they try to remember them and apply them through code only.
Here 'Physical Relationships' refer to Primary Key, Foreign Key, Check constraints, etc.
While designing a database, people try to normalize the database on paper and keep things documented. Like, if I have to create a database for a marketing company, I will try to understand its requirements.
For example, what fields are mandatory, what fields will contain only (a or b or c) etc.
When all these things are clear, why are most people afraid of constraints?
Don't they want to manage things?
Do they lack knowledge (which I don't think is so)?
Are they not confident about future problems?
Is it really a tough job managing all these entities?
What is the reason in your opinion?
I always have the DBMS enforce both primary key and foreign key constraints; I often add check constraints too. As far as I am concerned, the data is too important to run the risk of inaccurate data being stored.
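A small sketch of what "have the DBMS enforce it" looks like in practice, assuming SQLite and a made-up customer/orders schema (the 'a'/'b'/'c' status values echo the question's example). Note that SQLite only enforces foreign keys when the pragma is switched on for the connection.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK enforcement off by default
conn.executescript("""
CREATE TABLE customer (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
CREATE TABLE orders (
    id          INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customer(id),
    status      TEXT NOT NULL CHECK (status IN ('a', 'b', 'c'))
);
INSERT INTO customer (id, name) VALUES (1, 'Acme');
""")

def try_insert(customer_id, status):
    """Attempt an insert and report whether the DBMS accepted it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO orders (customer_id, status) VALUES (?, ?)",
                (customer_id, status),
            )
        return "ok"
    except sqlite3.IntegrityError as exc:
        return f"rejected: {exc}"

results = [
    try_insert(1, "a"),   # valid row
    try_insert(99, "a"),  # no such customer: foreign key violation
    try_insert(1, "z"),   # status outside the allowed set: check violation
]
```

Only the first insert is stored; the other two never reach the table, regardless of which application or query window issued them.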
If you think of the database as a series of stored true logical propositions, you will see that if the database contains a false proposition - an error - then you can argue to any conclusion you want. Given a false premise, any conclusion is true.
Why don't other people use PK and FK constraints, etc?
Some are unaware of their importance (so lack of knowledge is definitely a factor, even a major factor). Others are scared that they will cost too much in performance, forgetting that one error that has to be fixed may easily use up all the time saved by not having the DBMS do the checking for you. I take the view that if the current DBMS can't handle them well, it might be (probably is) time to change DBMS.
Many developers will check the constraints in code above the database before they actually go to perform an operation. Sometimes, this is driven by user experience considerations (we don't want to present choices / options to users that can't be saved to the database). In other cases, it may be driven by the pain associated with executing a statement, determining why it failed, and then taking corrective action. Most people would consider code more maintainable if it did the check upfront, along with other business logic that might be at play, rather than taking corrective action through an exception handler. (Not that this is necessarily an ideal line of thinking, but it is a prevalent one.) In any case, if you are doing the check in advance of issuing the statement, and not particularly conscious of the fact that the database might get touched by applications / users who are not coming in through your integrity-enforcing code, then you might conclude that database constraints are unnecessary, especially with the performance hit that could be incurred from their use. Also, if you are checking integrity in the application code above the database, one might consider it a violation of DRY (Don't Repeat Yourself) to implement logically equivalent checks in the database itself. The two manifestations of integrity rules (those in database constraints and those in application code above the database) could in principle become out-of-sync if not managed carefully.
Also, I would not discount option 2, that many developers don't know much about database constraints, too readily.
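The two approaches need not be exclusive. A minimal sketch, assuming SQLite and a hypothetical account table, of checking upfront for the sake of the user while leaving the constraint in place as the final authority:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (email TEXT PRIMARY KEY)")

ALREADY_TAKEN = "That email address is already registered."

def register(email):
    """Return a user-facing message; the primary key remains the real guard."""
    # Upfront check: lets the UI report the problem without an exception.
    taken = conn.execute(
        "SELECT 1 FROM account WHERE email = ?", (email,)
    ).fetchone()
    if taken:
        return ALREADY_TAKEN
    try:
        with conn:
            conn.execute("INSERT INTO account (email) VALUES (?)", (email,))
    except sqlite3.IntegrityError:
        # Backstop: another writer got in between the check and the insert,
        # or the row arrived through some other application entirely.
        return ALREADY_TAKEN
    return "Account created."

first = register("a@example.com")
second = register("a@example.com")
```

The rule is stated twice here, which is exactly the DRY concern raised above; the sketch only shows that the duplication can be kept small.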
Well, I mean, everyone is entitled to their own opinion and development strategy I suppose, but in my humble opinion these people are almost certainly wrong :)
The reason someone may wish to avoid constraints, however, is efficiency. Not because constraints are slow, but because storing redundant data (i.e. caching) is a very effective way of speeding up (well, avoiding) an expensive calculation. This is an acceptable approach when implemented properly (i.e. the cache is updated at regular/appropriate intervals; generally I do this with a trigger).
As to the motivation to not use FKs without a caching motivation, I can't imagine it. Perhaps they aim to be 'flexible' in their DB structure. If so, fine, but then don't use a relational DB, because it's pointless. Non-relational DBs (OO dbs) certainly have their place, and may even arguably be better (quite arguable, but interesting to argue), but it's a mistake to use a relational DB and not use its core properties.
I would always define PK and FK constraints, especially when using an ORM. It really makes life easy for everybody to let the ORM reverse engineer the database instead of manually configuring it to use some PKs and FKs.
There are several reasons for not enforcing relationships, in descending order of importance:
People-friendly error handling.
Your program should check constraints and send an intelligible message to the user. For some reason normal people don't like "SQL exception code -100013 goble rule violated for table gook".
Operational flexibility.
You don't really want your operators trying to figure out which order they must load your tables in at 3 a.m., nor do you want your testers pulling their hair out because they cannot reset the database back to its starting position.
Efficiency.
Checking constraints does consume IO and CPU.
Functionality.
It's a cheap way to save details for later recovery. For instance, in an online order system you could leave the detail item rows in the table when the user kills a parent order; if he later reinstates the order, the details re-appear as if by a miracle. You achieve this extra feature by deleting lines of code. (Of course you need some housekeeping process, but it is trivial!)
As things get more complex and more tables and relationships are needed in the database, how can you ensure the database developer remembers to check all of them? When you make a change to the schema that adds a new "informal" relationship, how can you ensure all the application code which might be affected gets changed?
Suddenly you could be deleting records that should stay, because they have related data the developer forgot to check when writing the delete process, or because that process was in place before the last ten related tables were added to the schema.
It is foolhardy in the extreme to not formally set up PK/FK relationships. I process data received from many different vendors and databases. You can tell which ones have data integrity problems most likely caused by a failure to explicitly define relationships by the poor quality of their data.

django AuditTrail vs Reversion

I am working on a new web app and I need to store any changes in the database to audit table(s). The purpose of such audit tables is that later on, in a real physical audit, we can ascertain what happened in a situation, who edited what, and what the state of the db was at the time of e.g. a complex calculation.
So the audit tables will mostly be written to and not read. Reports may be generated from them sometimes, though.
I have looked at the available solutions:
AuditTrail - simple, and that is why I am inclined towards it; I can understand its single-file code.
Reversion - looks simple enough to use, but I am not sure how easy it would be to modify if needed.
rcsField - seems to be very complex and too much for my needs.
I haven't tried any of these, so I wanted to hear some real experiences and which one I should be using, e.g. which one is faster, uses less space, and is easy to extend and maintain?
Personally I prefer to create audit tables in the database and populate them through triggers, so that any change, even from ad hoc queries in the query window, is stored. I would never consider an audit solution that is not based in the database itself. This is important because people who are making malicious changes to the database or committing fraud are not likely to do so through the web interface but directly on the backend. Far more of this stuff comes from disgruntled or larcenous employees than from outside hackers. If you are using an ORM already, your data is at risk because the permissions are at the table level rather than the sp level where they belong. Therefore it is even more important that you capture any possible change to the data, not just what came from the GUI. We have a dynamic proc to create audit tables that is run whenever new tables are added to the database. Since our audit tables store only the changes and not the whole record, we do not need to change them every time a field is added.
Also, when evaluating possible solutions, make sure you consider how hard it will be to revert the data to undo a specific change. Once you have audit tables, you will find that this is one of the most important things you need to do from them. Also consider how hard it will be to maintain the information as the database schema changes.
Choosing a solution because it appears to be the easiest to understand is not generally a good idea. That should be the lowest of your selection criteria, after meeting the requirements, security, etc.
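To make the trigger-based approach concrete, here is a minimal sketch assuming SQLite and hypothetical item and item_audit tables. It is not the dynamic proc described above, only the general shape of a change-only audit table. SQLite cannot supply a session user, so "who" is omitted; an engine with a current-user function could record it in the same trigger.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE item (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL,
    price INTEGER NOT NULL
);

-- One row per changed value, not a copy of the whole record.
CREATE TABLE item_audit (
    audit_id    INTEGER PRIMARY KEY,
    item_id     INTEGER NOT NULL,
    column_name TEXT NOT NULL,
    old_value   TEXT,
    new_value   TEXT,
    changed_at  TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);

CREATE TRIGGER item_price_audit
AFTER UPDATE OF price ON item
FOR EACH ROW
WHEN OLD.price IS NOT NEW.price
BEGIN
    INSERT INTO item_audit (item_id, column_name, old_value, new_value)
    VALUES (OLD.id, 'price', OLD.price, NEW.price);
END;
""")

conn.execute("INSERT INTO item (name, price) VALUES ('widget', 60)")
conn.execute("UPDATE item SET price = 95 WHERE id = 1")
conn.execute("UPDATE item SET price = 95 WHERE id = 1")  # no real change, no audit row

audit_rows = conn.execute(
    "SELECT item_id, column_name, old_value, new_value FROM item_audit"
).fetchall()
```

Because each audit row carries the old value, undoing a specific change is a matter of writing old_value back, which speaks to the point about reverting.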
I can't give you real experience with any of them but would like to make an observation.
I assume by AuditTrail you mean AuditTrail on the Django wiki. If so, I think you'll want to instead look at HistoricalRecords developed by the same author (Marty Alchin aka #gulopine) in his book Pro Django. It should work better with Django 1.x.
This is the approach I'll be using on an upcoming project, not because it necessarily beats the others from a technical standpoint, but because it matches the "real world" expectations of the audit trail for that application.
As I stated in my question, rcsField seems to be too much for my needs, which are simple: I want to store any changes to my tables, and maybe come back to those changes later to generate some reports.
So I tested AuditTrail and Reversion.
Reversion seems to be the better full-blown application, with many features (which I do not need). Also, as far as I know, it saves data in a single table in XML or YAML format, which I think means:
it will generate too much data in a single table
to read that data I may not be able to use the db tools I already have.
AuditTrail wins in that regard: for each table it generates a corresponding audit table, so changes can be tracked easily, there is less data per table, and it can be easily manipulated and used for report generation.
So I am going with AuditTrail.

Users asking for denormalized database

I am in the early stages of developing a database-driven system and the largest part of the system revolves around an inheritance type of relationship. There is a parent entity with about 10 columns and there will be about 10 child entities inheriting from the parent. Each child entity will have about 10 columns. I thought it made sense to give the parent entity its own table and give each of the children their own tables - a table-per-subclass structure.
Today, my users requested to see the structure of the system I created. They balked at the idea of the table-per-subclass structure. They would prefer one big ~100 column table because it would be easier for them to perform their own custom queries.
Should I consider denormalizing the database for the sake of the users?
Absolutely not. You can always create a view later to show them what they want to see.
They are effectively asking for a report.
You could give them access to a view containing all the fields they require... that way you don't mess up your data model.
No. Structure the data properly, and if the users need a denormalized view of the data, create it as a VIEW in the database.
Alternatively, consider that perhaps an RDBMS is not the appropriate storage tool for this project.
They are the users and not the programmers of the system for a reason. Provide a separate interface for their queries. Power users like this can be both helpful and a pain to deal with. Just explain that you need the database designed a certain way so you can do your job, period. Once that is accomplished you can provide other means to make querying easier.
What do they know!? You could argue that users shouldn't even be having direct access to a database in the first place.
Doing that leaves you open to massive performance issues, just because a couple of users are running ridiculous queries.
How about if you created a VIEW in the format your users wanted while still maintaining a properly normalized table?
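A sketch of that idea, assuming SQLite and a made-up parent table with two child tables (the real system would have about ten). The users query the wide view while the tables stay table-per-subclass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hypothetical parent and child tables.
CREATE TABLE party   (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE person  (party_id INTEGER PRIMARY KEY REFERENCES party(id), birth_date TEXT);
CREATE TABLE company (party_id INTEGER PRIMARY KEY REFERENCES party(id), tax_number TEXT);

-- The "one big table" the users asked for, as a view.
CREATE VIEW party_wide AS
SELECT p.id, p.name, pe.birth_date, c.tax_number
FROM party AS p
LEFT JOIN person  AS pe ON pe.party_id = p.id
LEFT JOIN company AS c  ON c.party_id  = p.id;

INSERT INTO party   VALUES (1, 'Ada'), (2, 'Acme');
INSERT INTO person  VALUES (1, '1990-01-01');
INSERT INTO company VALUES (2, 'TX-42');
""")

wide_rows = conn.execute("SELECT * FROM party_wide ORDER BY id").fetchall()
```

Columns that belong to a different child come back as NULL, which is the same shape the single-table design would have stored.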
Aside from a lot of the technical reasons for or against your users' proposition, you need to be on the same page in communicating the consequences of the various scenarios and (more importantly) the costs of those consequences. If the users are your clients and they are paying you to do a job, explain that their awful "proposed" ideas may cost them more money in development time, additional hardware resources, etc.
Hopefully you can explain it in such a way that shows your expertise and why your idea is a much better value to your users in the long run.
As everyone more or less mentioned, that way lies madness, and you can always build a view.
If you just can't get them to come around on this point, consider showing them this thread and the number of pros who weighed in saying that the users are meddling with things that they don't fully understand, and the impact will be an undermined foundation.
A big part of the developer's craft is the feel for what won't work out long term, and the rules of normalization are almost canonical in that respect. There are situations where you need to denormalize (data warehouses, etc) but this doesn't sound like one of them!
It also sounds as though you may have a particularly troubling brand of user on your hands: the amateur developer who thinks they could do your job better themselves if only they had the time. This may or may not help, but I've found that those types respond well to presentation. A few times now I've found that if I dress sharp and show a little bit of force in my personality, it helps them feel like I'm an expert and prevents a bunch of problems before they start.
I would strongly recommend coming up with an answer that doesn't involve someone running direct reports against your database. The moment that happens, your DB structure is set in stone and you can basically consider it legacy.
A view is a good start, but later on you'll probably want to structure this as an export, to decouple further. Of course, then you'll encounter someone who wants "real time" data. Proper business analysis usually reveals this to be unnecessary. Actual real time requirements are not best handled through reporting systems.
Just to be clear: I'd personally favour the table per subclass approach, but I don't think it's actually as big an issue as the direct reporting off transaction tables is going to be.
I would first opt for a view (as others have suggested) or an inline table-valued function. The benefit of the latter is that you can require parameters, like a date range or a customer account, which helps to stop users from querying without any limits on the problem space. An inline TVF is really a parametrized view, and is far closer to a view in terms of how the engine treats it than it is to a multi-statement table-valued function or a scalar function, which can perform incredibly poorly.
However, in some cases, this can impact production performance if the view is complex or intensive. With poorly written ad hoc user queries, it can also cause locks to persist longer or be escalated further than they would on a better built query. It is also possible for users to misinterpret an E-R data model and produce multiplied numbers in cases where there are many-to-one or many-to-many relationships. The next option might be to materialize these views with indexes or make tables and keep them updated, which gets us closer to my next option...
So, given those drawbacks of the view option and already thinking of mitigating it by starting to make copies of data, the next option I would consider is to have a separate read-only (for these users) version of the data which is structured differently. Typically, I would first look at a Kimball-style star schema. You do not need to have a full-fledged time-consistent data warehouse. Of course, that's an option, but you could simply keep a reporting model up to date with data. Star-schemas are a special form of denormalization and are particularly good for numerical reporting, and a given star should not be able to be abused by users accidentally. You can keep the star up to date in a number of ways, including triggers, scheduled jobs, etc. They can be very fast for reporting needs and run on the same production installation - perhaps on a separate instance if not just a separate database.
Although such a solution may require you to effectively more than double your storage requirements, when compared with other practices it might be a really good option if you understand your data well and don't mind having two models, one for transactions and one for analysis (note that you will already start to have this logical separation anyway with the simplest first option, a view).
Some architects will double their servers and use the SAME model with some kind of replication in order to provide a reporting server which is indexed more heavily or differently. Such a second server keeps reporting requirements from impacting production transactions and can be kept up to date fairly easily. There will only be one model, but of course this has the same usability problems that come with giving users ad hoc access to the underlying model, just without the performance effects, since they get their own playground.
There are a lot of ways to skin these cats. Good luck.
The customer is always right. However, the customer is likely to back down when you convert their requirement into dollars and cents. A 100 column table will require extra dev time to write the code that does what the database would do automatically with the proper implementation. Further, their support costs will be higher since more code means more problems and lower ease of debugging.
I'm going to play devil's advocate here and say that both solutions sound like poor approximations of the actual data. There's a reason that object-oriented programming languages don't tend to be implemented with either of these data models, and it's not because Codd's 1970 ideas about relations were the ideal system for storing and querying object-oriented data structures. :-)
Remember that SQL was originally designed as a user interface language (that's why it looks vaguely like English and not at all like other languages of that era: Algol, C, APL, Prolog). The only reasons I've heard for not exposing a SQL database to users today are security (they could take down the server!) and usability (who wants to write SQL when you can clicky clicky?), but if it's their server and they want to, then why not let them?
Given that "the largest part of the system revolves around an inheritance type of relationship", then I'd seriously consider a database that lets me represent that natively, either Postgres (if SQL is important) or a native object database (which are awesome to work with, if you don't need SQL compatibility).
Finally, remember that every engineering decision is a tradeoff. By "sticking to your guns" (as somebody else proposed), you're implicitly saying the value of your users' desires are zero. Don't ask SO for a correct answer to this, because we don't know what your users want to do with your data (or even what your data is, or who your users are). Go tell them why you want a many-tables solution, and then work out a solution with them that's acceptable to both of you.
You've implemented Class Table Inheritance and they're asking for Single Table Inheritance. Both designs are valid in certain situations.
You might want to get a copy of Martin Fowler's Patterns of Enterprise Application Architecture to read more about the advantages and disadvantages of each design. That book is a classic reference to have on your bookshelf, in any case.

Audit trails and implementing SOX/HIPAA/etc, best practices for sensitive data

I consider myself to be relatively proficient in terms of application design, but I've never had to work with sensitive data. I've been wondering about what the best practices were for audit trails and how exactly one should implement them. I don't have to do it right now, but it'd be nice to be able to confidently talk with a medical company if they ask me to do some work for them.
Let's say we have a "school" database, with 'teachers', 'classes', 'students' all normalized in a many-to-many 'grades' table. What would you log? Every insert/update on the 'grades table'? Only updates (say, a kid breaks in and wants to change grades, this should send up redflags)? Does this vary entirely based on how paranoid one wants to be? Is there a best practice?
Is this something that should be done in the database? (A trigger on each sensitive SELECT which inserts a row to an 'audit' table logging each query?) What should be logged? Is there functionality automatically built into Oracle/DB2 that do it for you? Should this be application side logic?
If anyone has any formal documentation/books on how to deal with sensitive data (not quite DoD "Trusted Computing" spec, but something along the lines of that :P), I'd appreciate it. I'm sorry if this question is terribly vague. I realize that this varies from application to application. I just want to hear your detailed experiences with dealing with sensitive data.
The first thing to understand is the native auditing capabilities of your chosen DBMS. These vary in detail, but generally provide a way to configure which operations are audited, and provide secure storage for the audit records that they generate.
The next thing to understand is what you want to audit. In the case of HIPAA and SOX, for example, you are probably looking at PII (Personally Identifiable Information). Remember the fuss made about people accessing Obama's phone records, or various celebrities' medical records, or ... Those were caught because the system audited who read those records, and the audit analysis officer (AAO) spotted that the celebrity records were accessed by people who were not specifically authorized to do so. So, those systems must be logging who accesses each record, and spotting when the user who does so does not have an authentic business reason to do so. In these cases, it appears that the users had read authority for the records, so if their ordinary duties required them to look at the records, they could do so. But when they were not required to do so, they were abusing their power and were appropriately sanctioned (up to and including losing their jobs over it).
What this means is that you probably don't want to track who accesses the table of States which records the state code and full name (and assorted other bits of information about the state). There is nothing confidential about that list - it doesn't matter who reads it. Of course, almost no-one should write to it; the list of states does not change very often - but that can probably be handled by revoking update and delete permission on the table from everyone.
OTOH, you probably do want to record who accesses the records in medical histories (HIPAA), or who modifies the data in the accounting systems (SOX). You might or might not need to worry about who reads the accounting data; a lot of that can be dealt with by basic permissions (accounting staff have permission; IT staff do not). However, auditing is always an extra line of defense.
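Purely to illustrate what such a read-access record might contain, here is an application-side sketch assuming SQLite and hypothetical medical_history and access_log tables. It is no substitute for the DBMS's native auditing mentioned above, since it only sees reads that go through this function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE medical_history (patient_id INTEGER PRIMARY KEY, notes TEXT);
CREATE TABLE access_log (
    log_id      INTEGER PRIMARY KEY,
    username    TEXT NOT NULL,
    patient_id  INTEGER NOT NULL,
    accessed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP
);
INSERT INTO medical_history VALUES (1, 'example notes');
""")

def read_history(username, patient_id):
    """Record who is reading which record, then return the record."""
    with conn:  # the log row is committed together with the read
        conn.execute(
            "INSERT INTO access_log (username, patient_id) VALUES (?, ?)",
            (username, patient_id),
        )
        row = conn.execute(
            "SELECT notes FROM medical_history WHERE patient_id = ?",
            (patient_id,),
        ).fetchone()
    return row[0] if row else None

notes = read_history("dr_smith", 1)
log = conn.execute("SELECT username, patient_id FROM access_log").fetchall()
```

Someone still has to review access_log against who had a business reason to look, which is the part the audit analysis officer does.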
Bear in mind that audit records are no help whatsoever if they are never looked at. In general, auditing slows a system down (simply because it is doing more work when it writes audit records); it is important to understand how much it slows down before deciding to implement your auditing strategy. However, there are some things that are more important than application speed, and one of those is keeping yourself and other staff members out of jail. Auditing can be necessary to ensure that happens.
Oracle has a product called Oracle Audit Vault; DB2 probably has an equivalent.
You should start with prevention. The system should not allow invalid actions. Period. If the system allows 'dubious' actions that need to be monitored, that's "business logic", and you are probably better off implementing it like the rest of your business logic.
If you want to do something in your database, you can look into log shipping (terminology might differ from RDBMS to RDBMS). Basically, any DML operation is logged to a file. You can use this information for backups and point-in-time recovery, even for replication/HA/failover/etc. If you ship your logs to a separate, "trusted" system in an "append-only" (i.e. the log shipping process has privileges to create new log files, but not to modify existing information) fashion, you already have a primitive auditing functionality. If you do it in a secure way (i.e. authentication, non-repudiation), you probably even are quite close to "compliance" :-p
Of course, sifting through lots and lots of INSERT/UPDATE/DELETE statements is not the most sophisticated way to work.

Resources