Maintain history in a database

I am designing a database that must maintain a history of employee salaries and movements within the organization. Basically, my design has 3 tables (I mean, there are more tables, but for this question I'll mention 3, so bear with me): an Employee table (containing the most current salary, position data, etc.), a SalaryHistory table (salary, date, reason, etc.) and a MovementHistory table (title, dept., comments). I'll be using LINQ to SQL, so what I was thinking is that every time employee data is updated, the old values will be copied to their respective history tables. Is this a good approach? Should I just do it using LINQ to SQL, or triggers? Thanks for any help, suggestion or idea.

Have a look at http://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture .
Basically, the article suggests that you have the following columns in the tables you need to track history for -
* DateCreated – the actual date on which the given row was inserted.
* DateEffective – the date on which the given row became effective.
* DateEnd – the date on which the given row ceased to be effective.
* DateReplaced – the date on which the given row was replaced by another row.
* OperatorCode – the unique identifier of the person (or system) that created the row.
DateEffective and DateEnd together tell you the time for which the row was valid (or the time for which an employee was in a department, or the time for which he earned a particular salary).
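For illustration, here is a minimal sketch of what a SalaryHistory table carrying those columns might look like, assuming SQL Server style types (any column beyond the five listed above is just an example taken from the question):

CREATE TABLE SalaryHistory (
    SalaryHistoryID int IDENTITY(1,1) PRIMARY KEY,
    EmployeeID      int           NOT NULL,      -- FK to Employee
    Salary          decimal(12,2) NOT NULL,
    Reason          nvarchar(200) NULL,
    DateCreated     datetime      NOT NULL DEFAULT GETDATE(), -- when the row was inserted
    DateEffective   datetime      NOT NULL,      -- when this salary took effect
    DateEnd         datetime      NULL,          -- when it stopped being effective
    DateReplaced    datetime      NULL,          -- when a newer row superseded it
    OperatorCode    nvarchar(50)  NOT NULL       -- who (or what) created the row
);

-- Salary in effect for an employee at a given moment:
SELECT * FROM SalaryHistory
WHERE EmployeeID = @EmployeeID
  AND DateEffective <= @AsOf
  AND (DateEnd IS NULL OR DateEnd > @AsOf);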

It is a good idea to keep that logic internal to the database: that's basically why triggers exist. I say this carefully, however, as there are plenty of reasons to keep it external. Oftentimes - especially with a technology as easy as LINQ to SQL - it is easier to write the code externally. In my experience, more people could write that logic in C#/LINQ than could do it correctly using a trigger.
Triggers are fast - they're compiled! However, they're very easy to misuse, and they can make your logic so overcomplex that performance degrades rapidly. Considering how simple your use case is, I would opt for triggers, but that's me personally.
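If you do go the trigger route, a rough T-SQL sketch for the salary case might look like the following (table and column names are assumptions based on the question, not a definitive implementation):

CREATE TRIGGER trg_Employee_SalaryHistory
ON Employee
AFTER UPDATE
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy the pre-update salary into the history table whenever it actually changes.
    INSERT INTO SalaryHistory (EmployeeID, Salary, ChangeDate, Reason)
    SELECT d.EmployeeID, d.Salary, GETDATE(), 'Changed by trigger'
    FROM deleted d
    JOIN inserted i ON i.EmployeeID = d.EmployeeID
    WHERE d.Salary <> i.Salary;
END;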

Triggers will likely be faster, and don't require a "middle man" to get the job done, eliminating at least one chance for errors.
Depending on your database of choice, you can just use one table and enable OIDs on it, and add two more columns, "flag" and "previous". Never update this table, only insert. Add a trigger so that when a row is added for employee #id, all existing records for employee #id get a flag of "old" and the new row's "previous" value points to the previous row.
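A PostgreSQL-flavored sketch of that insert-only idea (the table, columns and trigger are hypothetical, and a surrogate id column stands in for OIDs):

CREATE TABLE employee_history (
    id       bigserial PRIMARY KEY,
    emp_id   integer   NOT NULL,
    salary   numeric(12,2),
    title    text,
    flag     text      NOT NULL DEFAULT 'current',   -- 'current' or 'old'
    previous bigint    REFERENCES employee_history (id)
);

CREATE OR REPLACE FUNCTION mark_previous() RETURNS trigger AS $$
BEGIN
    -- Point the new row at the most recent row for the same employee ...
    SELECT id INTO NEW.previous
    FROM employee_history
    WHERE emp_id = NEW.emp_id AND flag = 'current'
    ORDER BY id DESC
    LIMIT 1;

    -- ... and flag everything older as 'old'.
    UPDATE employee_history
    SET flag = 'old'
    WHERE emp_id = NEW.emp_id AND flag = 'current';

    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER employee_history_bi
BEFORE INSERT ON employee_history
FOR EACH ROW EXECUTE PROCEDURE mark_previous();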

I think this belongs in the database for two reasons.
First, middle tiers come and go, but databases are forever. This year Java EJBs, next year .NET, the year after that something else. The data remains, in my experience.
Second, if the database is shared at all it should not have to rely on every application that uses it to know how to maintain its data integrity. I would consider this an example of encapsulation of the database. Why force knowledge and maintenance of the history on every client?

Triggers make your front-end easier to migrate to something else and they will keep the database consistent no matter how data is inserted/updated/removed.
Besides, in your case I would write the salaries straight to the salary history; from your description I don't see a reason to go via an update trigger on the employee table.
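A minimal sketch of that approach (T-SQL, since the question mentions LINQ to SQL; table and column names are taken from the question and otherwise assumed): on every salary change the application writes the new value to the history table and keeps the current value on the employee row in sync, inside one transaction.

BEGIN TRANSACTION;

-- The new salary goes straight into the history table ...
INSERT INTO SalaryHistory (EmployeeID, Salary, DateEffective, Reason)
VALUES (@EmployeeID, @NewSalary, @EffectiveDate, @Reason);

-- ... and the current value on the employee row is updated to match.
UPDATE Employee
SET Salary = @NewSalary
WHERE EmployeeID = @EmployeeID;

COMMIT;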

Related

Database design for application with wiki-like functions

I'm making an API for movies/TV/actors etc. with Web API 2 and SQL Server. The database now has >30 tables, most of them storing data users will be able to edit.
How should I store old version of entries?
Say someone edits the description, runtime and tagline for an entry (movie) in the movies table.
I'll have a table (movies_old) where I store the editable fields from 'movies' plus who edited it and when.
All in the same database. The '???_old' tables have no relationships.
I'm very new to database design. Is there something obviously wrong with this?
To my mind, there are two issues here: what table you store the data in, and what goes in the "historical value" field.
On the first question, there are two obvious options: Store old and new records in the same table, with some sort of indication of which is "current" and which is "history", or have a separate table for history.
The main advantage of one table is that you have a simpler schema. This is especially true if the table contains many fields. If there are two tables, then all the field definitions are duplicated. When you move data from the current table to the history table, you have to copy every field, and if the list of fields changes, or their formats change, you have to remember to update the copy. Any queries that show the history have to read two tables. Etc. But with one table, all that goes away. Converting a record from current to history just means changing the setting of the "is_current" flag or however you indicate it.
The main advantages of two tables are: (a) access is probably somewhat faster, as you don't have so many irrelevant records to skip over, and (b) when reading the current table you don't have to worry about excluding the history records.
Oh, an annoying thing about SQL: In principle you could put a date on each record, and then the record with the latest date is the current one. In practice this is a pain: you usually have to have an inner query to find the latest date, and then feed this back in to an outer query that re-reads the record with that date. (Some SQL engines have ways around this. Postgres, for example.) So in practice, you need an "is_current" flag, probably 1 for current and 0 for history or some such.
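To make that concrete, here is the "latest date" pattern next to the flag-based one (generic SQL, with a hypothetical movies_history table):

-- Without a flag: an inner query finds the latest date, which feeds an outer query.
SELECT h.*
FROM movies_history h
WHERE h.movie_id = 42
  AND h.edited_at = (SELECT MAX(h2.edited_at)
                     FROM movies_history h2
                     WHERE h2.movie_id = h.movie_id);

-- With a flag: the current row is simply the one marked is_current = 1.
SELECT h.*
FROM movies_history h
WHERE h.movie_id = 42
  AND h.is_current = 1;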
The other issue is what to put in the contents. If you're dealing with short fields, customer number and amount billed and so forth, then the simple and easy thing to do is just store the complete old contents in one record and the complete new contents in the new record. But if you're dealing with a long text block, like a plot synopsis or a review, there could be many small editorial changes. If every time someone fixes a grammar or spelling error, we have a whole new record with the entire 1000 characters, of which 5 characters are different, this could really clutter up the database. If that's the case you might want to investigate ways to store changes more efficiently. May or may not be an issue to you.

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1) Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let's forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table, or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType)? Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving--for example, however many changes they make today, it only produces that one History row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
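For illustration, here is a sketch of that single-table layout and the kind of bitemporal query it supports (column names come from the answer above; types and the sentinel dates are assumptions):

CREATE TABLE Client (
    client_id           integer      NOT NULL,
    name                varchar(200) NOT NULL,
    client_type         varchar(50),
    relationship_type   varchar(50),
    relationship_status varchar(50),
    valid_dt_start      date      NOT NULL,  -- valid time: when the fact became true in the real world
    valid_dt_end        date      NOT NULL,  -- open-ended rows can use a sentinel such as 9999-12-31
    audit_dt_start      timestamp NOT NULL,  -- transaction time: when the row was recorded
    audit_dt_end        timestamp NOT NULL
);

-- "What did we believe on :audit_date about the client's state on :valid_date?"
SELECT *
FROM Client
WHERE client_id = :id
  AND valid_dt_start <= :valid_date AND :valid_date < valid_dt_end
  AND audit_dt_start <= :audit_date AND :audit_date < audit_dt_end;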

Database design question - which is the best solution?

I'm using Firebird 2.1 and I'm looking for the best way to solve this issue.
I'm writing a calendaring application. Different users' calendar entries are stored in a big Calendar table. Each calendar entry can have a reminder set - only one reminder/entry.
Statistically, the Calendar table could grow to hundreds of thousands of records over time, while there are going to be much less reminders.
I need to query the reminders on a constant basis.
Which is the best option?
A) Store the reminders' info in the Calendar table (in which case I'm going to query hundreds of thousands of records for IsReminder = 1)
B) Create a separate Reminders table which contains only the ID of calendar entries which have reminders set, then query the two tables with a JOIN operation (or maybe create a view on them)
C) I can store all information about reminders in the Reminders table, then query only this table. The downside is that some information needs to be duplicated in both tables, like in order to show the reminder, I'll need to know and store the event's starttime in the Reminders table - thus I'm maintaining two tables with the same values.
What do you think?
And one more question: The Calendar table will contain the calendars of multiple users, separated only by a UserID field. Since there can be only 4-5 users, even if I put an index on this field, its selectivity is going to be very bad - which is not good for a table with hundreds of thousands of records. Is there a workaround here?
Thanks!
There are advantages and drawbacks to all three choices. Which one is best depends on details you have not provided. In general, don't worry too much about selecting three or four entries out of a hundred thousand, provided the indexes you have set up allow the right retrieval strategy. If you don't understand indexing, you're likely to be in trouble no matter which of the three choices you make.
If it were me, I would go with choice B. I'd also store any attributes of a reminder in the table for reminders.
Be very careful about whether you identify an event by EventId alone or by (UserId, EventId). If you choose the latter, it behooves you to use a compound primary key for the Event table. Don't worry too much about compound primary keys, especially with Firebird.
If you declare a compound primary key, be aware that declaring (UserId, EventId) will not have the same consequences as declaring (EventId, UserId). They are logically equivalent, but the structure of the automatically generated index will be different in the two cases.
This in turn will affect the speed of queries like "find all the reminders for a given user".
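A small sketch of the difference (Firebird-style DDL; names are assumed for the example):

-- With (UserId, EventId), the automatically generated index is ordered by UserId first,
-- so a query for one user's entries can use it directly:
CREATE TABLE Calendar (
    UserId    integer NOT NULL,
    EventId   integer NOT NULL,
    StartTime timestamp,
    CONSTRAINT pk_calendar PRIMARY KEY (UserId, EventId)
);

SELECT *
FROM Calendar
WHERE UserId = :user_id;   -- satisfied by the leading column of the primary-key index

-- With PRIMARY KEY (EventId, UserId) the same query would need an extra index on UserId
-- (or a scan), because UserId is no longer the leading column.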
Again, if it were me, I'd avoid choice C. The introduction of harmful redundancy into a schema carries with it the responsibility for some very careful programming when you go to update the data. Otherwise, you can end up with a database that stores contradictory versions of the same fact in different places.
And, if you really want to know the effect on performance, try all three ways, load with test data, and do your own benchmarks.
I think you need to create realistic, fake user data and measure the difference with some typical queries you expect to run.
Indexing, query optimization and the types of query results you need can make a big difference,
so it's not easy to say what's best without knowing more.
When choosing Option (A) you should:
* provide an index on "IsReminder" (or a combined index on IsReminder and UserId, whichever fits your intended queries best), for example as sketched below
* make sure your queries actually use this index
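A minimal sketch of that combined index (Firebird syntax; the index name and query are just examples):

CREATE INDEX idx_calendar_reminder_user
    ON Calendar (IsReminder, UserId);

-- Queries should then filter on the leading column(s) so the index can be used:
SELECT *
FROM Calendar
WHERE IsReminder = 1
  AND UserId = :user_id;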
Option B is preferable over A if you have more than a boolean flag to store for each reminder (for example, the number of minutes before the event at which the user shall be notified). You should, however, make some estimate of how often your program will have to JOIN the two tables.
If you can, avoid option C. If you don't want to benchmark all three cases, I suggest starting with A or B, according to the circumstances described; quite probably the solution you choose will be fast enough, so you won't have to bother with the other cases.

How to keep historic details of modification in a database (Audit trail)?

I'm a J2EE developer & we are using hibernate mapping with a PostgreSQL database.
We have to keep track of any change that occurs in the database; in other words, all previous & current values of any field should be saved. Each field can be any type (bytea, int, char...).
With a simple table it is easy, but with a graph of objects things are more difficult.
So, speaking from a UML point of view, we have a graph of objects to store in the database with every change & the user who made it.
Any idea or pattern how to do that?
A common way to do this is by storing versions of objects.
If you add a "version" and a "deleted" field to each table that you want to store an audit trail on, then instead of doing normal updates and deletes, follow these rules:
Insert - Set the version number to 0 and insert as normal.
Update - Increment the version number and do an insert instead.
Delete - Increment the version number, set the deleted field to true and do an insert instead.
Retrieve - Get the record with the highest version number and return that.
If you follow this pattern, every time you update you will create a new record rather than overwriting the old data, so you will always be able to track back and see all the old objects.
This will work exactly the same for graphs of objects, just add the new fields to each table within the object graph, and handle each insert/update/delete for each table as described above.
If you need to know which user made the modification, you just add a "ModifiedBy" field as well.
(You can either do this processing in your DA layer code, or if you prefer you can use database triggers to catch your update/delete/retrieve calls and re-process them following the rules.)
Obviously, you need to consider space requirements, as every single update will result in a fully new record. If your application is update heavy, you are going to generate a lot of data. It's common to also include a "last modified time" field so you can process the database offline and delete data older than required.
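As a sketch of that pattern, assuming PostgreSQL and a made-up customer table (this is one way to express the rules above, not the only one):

CREATE TABLE customer (
    id          integer   NOT NULL,
    version     integer   NOT NULL DEFAULT 0,
    deleted     boolean   NOT NULL DEFAULT false,
    name        text,
    modified_by text,
    modified_at timestamp NOT NULL DEFAULT now(),
    PRIMARY KEY (id, version)          -- every change becomes a new (id, version) row
);

-- "Update": insert a copy of the latest row with the version incremented.
INSERT INTO customer (id, version, name, modified_by)
SELECT id, version + 1, 'New Name', 'alice'
FROM customer
WHERE id = 42
ORDER BY version DESC
LIMIT 1;

-- Retrieve: the current state is the row with the highest version;
-- if that row has deleted = true, the object counts as deleted.
SELECT *
FROM customer
WHERE id = 42
ORDER BY version DESC
LIMIT 1;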
Current RDBMS implementations are not very good at handling temporal data. That's one reason why maintaining separate journalling tables through triggers is the usual approach. (The other is that audit trails frequently have different use cases to regular data, and having them in separate tables makes it easier to manage access to them). Oracle does a pretty slick job of hiding the plumbing in its Total Recall product, but being Oracle it charges $$$ for this.
Scott Bailey has published a presentation on temporal data in PostgreSQL. Alas it won't help you right now, but it seems like some features planned for 8.5 and 8.6 will enable the transparent storage of time-related data.

Database design question. BIT column for deletions

Part of my table design is to include an IsDeleted BIT column that is set to 1 whenever a user deletes a record. Therefore all SELECTs are inevitably accompanied by a WHERE IsDeleted = 0 condition.
I read in a previous question (I cannot for the love of God re-find that post and reference it) that this might not be the best design and an 'Audit Trail' table might be better.
How are you guys dealing with this problem?
Update
I'm on SQL Server. Solutions for other DB's are welcome albeit not as useful for me but maybe for other people.
Update2
Just to encapsulate what everyone has said so far, there seem to be basically 3 ways to deal with this:
1) Leave it as it is
2) Create an audit table to keep track of all the changes
3) Use views with WHERE IsDeleted = 0
Therefore all SELECTs are inevitably accompanied by a WHERE IsDeleted = 0 condition.
This is not a really good way to do it; as you probably noticed, it is quite error-prone.
You could create a VIEW which is simply
CREATE VIEW myview AS SELECT * FROM yourtable WHERE deleted = 0;
Then you just use myview instead of mytable and you don't have to think about this damn column in SELECTs.
Or, you could move deleted records to a separate "archive" table, which, depending on the proportion of deleted versus active records, might make your "active" table a lot smaller, better cached in RAM, ie faster.
If you have to have this kind of Deleted Bit column, then you really should consider setting up some VIEWs with the WHERE clause in it, and use those rather than the underlying tables. Much less error prone.
For example, if you have this view:
CREATE VIEW [Current Product List] AS
SELECT ProductID,ProductName
FROM Products
WHERE Discontinued=No
Then someone who wants to see current products can simply write:
SELECT * FROM [Current Product List]
This is much less error prone than writing:
SELECT ProductID,ProductName
FROM Products
WHERE Discontinued=No
As you say, people will forget that WHERE clause, and get confusing and incorrect results.
P.S. the example SQL comes from Microsoft's Northwind database. Normally I would recommend NOT using spaces in column and table names.
We're actively using the "Deleted" column in our enterprise software. It is however a source of constant errors when forgetting to add "WHERE Deleted = 0" to an SQL query.
Not sure what is meant by "Audit Trail". You may wish to have a table to track all deleted records. Or there may be an option of moving the deleted content to paired tables (like Customer_Deleted) to remove the passive content from tables to minimize their size and optimize performance.
A while ago there was some blog uproar on this issue, Ayende and Udi Dahan both posted on this.
Nai, this is totally up to you.
Do you need to be able to see who has deleted / modified / inserted what and when? If so, you should design the tables for this and adjust your procs to write these values when they are called.
If you don't need an audit trail, don't waste time with one. Just do as you are with IsDeleted.
Personally, I flag things right now, as an audit trail wasn't specified in my spec; that said, I don't like to actually delete things, hence I chose to flag it. I'm not going to waste a client's time writing something they didn't request. I won't mess about with other tables because that's another thing for me to think about. I'd just make sure my indexes were up to the job.
Ask your manager or client. Plan out how long the audit trail would take so they can cost it and let them make the decision for you ;)
Udi Dahan said this:
Model the task, not the data
Looking back at the story our friend from marketing told us, his intent is to discontinue the product – not to delete it in any technical sense of the word. As such, we probably should provide a more explicit representation of this task in the user interface than just selecting a row in some grid and clicking the ‘delete’ button (and “Are you sure?” isn’t it).
As we broaden our perspective to more parts of the system, we see this same pattern repeating:
Orders aren’t deleted – they’re cancelled. There may also be fees incurred if the order is canceled too late.
Employees aren’t deleted – they’re fired (or possibly retired). A compensation package often needs to be handled.
Jobs aren’t deleted – they’re filled (or their requisition is revoked).
In all cases, the thing we should focus on is the task the user wishes to perform, rather than on the technical action to be performed on one entity or another. In almost all cases, more than one entity needs to be considered.
If you have an Oracle DB, then you can use an audit trail for auditing. Check the AUDIT VAULT tool from OTN. It even supports SQL Server.
Views (or stored procs) to get at the underlying table data are the best way. However, if you have the problem with "too many cooks in the kitchen" like we do (too many people have rights to the data and may just use the table without knowing enough to use the view/proc) you should try using another table.
We have a complete mimic of the base table with a few extra columns for tracking. So Employee table has an EmployeeDeleted table with the same schema but extra columns for when it was deleted and who deleted it and sometimes even the reason for deletion. You can even get fancy and have triggers do the insertion directly instead of going through applications/procs.
Biggest Advantage: no flag to worry about during selects
Biggest Disadvantage: any schema changes to the base table also have to be made on the "deleted" table
Best for: situations where for whatever reason (usually political with us) many not-as-experienced people have rights to the data but still expect it to be accurate without having to understand flags or schemas, etc
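The trigger-driven variant mentioned above might look roughly like this in T-SQL (the EmployeeDeleted columns and the trigger name are assumptions):

CREATE TABLE EmployeeDeleted (
    EmployeeID int           NOT NULL,
    Name       nvarchar(100) NULL,
    -- ... the rest of the Employee columns ...
    DeletedAt  datetime      NOT NULL DEFAULT GETDATE(),
    DeletedBy  nvarchar(128) NOT NULL
);

CREATE TRIGGER trg_Employee_Delete
ON Employee
AFTER DELETE
AS
BEGIN
    SET NOCOUNT ON;
    -- Copy every deleted row into the mirror table, stamped with who deleted it.
    INSERT INTO EmployeeDeleted (EmployeeID, Name, DeletedBy)
    SELECT d.EmployeeID, d.Name, SUSER_SNAME()
    FROM deleted d;
END;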
I've used soft deletes before on a number of applications I've worked on, and overall it's worked out quite well. Yes, there is the issue of always having to remember to add AND IsActive = 1 to all of your SELECT queries, but really that's not so bad. You can create views if you don't want to have to remember to always do that.
The reason we've done this is because we had very specific business needs to be able to report on records that have been deleted. The reporting needs varied widely - sometimes they'd need to see just the active records, or just the inactive records, or sometimes a mix of both - so pushing all the deleted records into an audit table wasn't a very good option.
So, depending on your particular business needs, I think this approach is certainly a viable option.
