Versioning Database Persisted Objects, How would you? [closed]

(Not related to versioning the database schema)
Applications that interface with databases often have domain objects that are composed of data from many tables. Suppose the application were to support versioning of these domain objects, in the sense of CVS.
For some arbitrary domain object, how would you design a database schema to handle this requirement? Any experience to share?

Think carefully about the requirements for revisions. Once your code-base has pervasive history tracking built into the operational system it will get very complex. Insurance underwriting systems are particularly bad for this, with schemas often running in excess of 1000 tables. Queries also tend to be quite complex and this can lead to performance issues.
If the historical state is really only required for reporting, consider implementing a 'current state' transactional system with a data warehouse structure hanging off the back for tracking history. Slowly Changing Dimensions are a much simpler structure for tracking historical state than trying to embed an ad-hoc history tracking mechanism directly into your operational system.
Also, Change Data Capture is simpler for a 'current state' system where changes are made to the records in place - the primary keys of the records don't change, so you don't have to match up records holding different versions of the same entity. An effective CDC mechanism will make an incremental warehouse load process fairly lightweight and possible to run quite frequently. If you don't need up-to-the-minute tracking of historical state (almost, but not quite, an oxymoron) this can be an effective solution with a much simpler code base than a full history tracking mechanism built directly into the application.
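As a rough illustration of a Type 2 Slowly Changing Dimension in the warehouse (table and column names here are hypothetical, not from the answer above):

-- Hypothetical Type 2 Slowly Changing Dimension for a customer entity.
-- The operational system keeps only the current row; the warehouse keeps history.
CREATE TABLE dim_customer (
    customer_sk     INTEGER PRIMARY KEY,   -- surrogate key, one per version
    customer_id     INTEGER NOT NULL,      -- natural/business key from the source system
    name            VARCHAR(100) NOT NULL,
    region          VARCHAR(50),
    effective_from  DATE NOT NULL,         -- when this version became current
    effective_to    DATE,                  -- NULL = still current
    is_current      CHAR(1) NOT NULL DEFAULT 'Y'
);

-- When the incremental load detects a changed source row, it closes the old
-- version and inserts a new one rather than updating in place:
UPDATE dim_customer
   SET effective_to = CURRENT_DATE, is_current = 'N'
 WHERE customer_id = 42 AND is_current = 'Y';

INSERT INTO dim_customer (customer_sk, customer_id, name, region, effective_from, effective_to, is_current)
VALUES (1001, 42, 'Fred', 'North', CURRENT_DATE, NULL, 'Y');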

A technique I've used for this in the past has been to have a concept of "generations" in the database: each change increments the current generation number for the database - if you use Subversion, think revisions.
Each record has two generation numbers associated with it (two extra columns on the tables) - the generation the record starts being valid for, and the generation it stops being valid for. If the data is currently valid, the second number would be NULL or some other generic marker.
So to insert into the database:
increment the generation number
insert the data
tag the lifetime of that data with valid from, and a valid to of NULL
If you're updating some data:
mark all data that's about to be modified as valid to the current generation number
increment the generation number
insert the new data with the current generation number
deleting is just a matter of marking the data as terminating at the current generation.
To get a particular version of the data, find which generation you're after and look for rows whose validity range includes that generation.
Example:
Create a person.
|Name|D.O.B |Telephone|From|To |
|Fred|1 april|555-29384|1 |NULL|
Update tel no.
|Name|D.O.B |Telephone|From|To |
|Fred|1 april|555-29384|1 |1 |
|Fred|1 april|555-43534|2 |NULL|
Delete Fred:
|Name|D.O.B |Telephone|From|To |
|Fred|1 april|555-29384|1 |1 |
|Fred|1 april|555-43534|2 |2 |
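A rough SQL sketch of this approach (the table names, the generation-counter table and the date of birth are made up for illustration):

CREATE TABLE generation_counter (
    current_generation INTEGER NOT NULL
);

CREATE TABLE person (
    person_id      INTEGER NOT NULL,
    name           VARCHAR(100),
    dob            DATE,
    telephone      VARCHAR(20),
    valid_from_gen INTEGER NOT NULL,   -- generation the row becomes valid in
    valid_to_gen   INTEGER             -- NULL = still valid
);

-- Updating Fred's telephone number, following the steps above:
UPDATE person
   SET valid_to_gen = (SELECT current_generation FROM generation_counter)
 WHERE person_id = 1 AND valid_to_gen IS NULL;

UPDATE generation_counter
   SET current_generation = current_generation + 1;

INSERT INTO person (person_id, name, dob, telephone, valid_from_gen, valid_to_gen)
SELECT 1, 'Fred', DATE '1980-04-01', '555-43534', current_generation, NULL
  FROM generation_counter;

-- Fetching the data as of a given generation (the boundary semantics must match
-- whichever convention you pick for updates and deletes):
SELECT *
  FROM person
 WHERE valid_from_gen <= :generation
   AND (valid_to_gen IS NULL OR valid_to_gen >= :generation);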

An alternative to strict versioning is to split the data into 2 tables: current and history.
The current table has all the live data and has the benefits of all the performance that you build in.
Before any change, the current data is first written into the associated "history" table along with a date marker recording when it changed.
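One possible way to automate that copy, sketched with a PostgreSQL-style trigger (hypothetical table names; other engines have equivalent trigger syntax):

-- Hypothetical current/history pair; the history row is written before each change.
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name        VARCHAR(100),
    telephone   VARCHAR(20)
);

CREATE TABLE customer_history (
    customer_id INTEGER NOT NULL,
    name        VARCHAR(100),
    telephone   VARCHAR(20),
    changed_at  TIMESTAMP NOT NULL
);

CREATE FUNCTION copy_customer_to_history() RETURNS trigger AS $$
BEGIN
    -- Snapshot the row as it was before the change.
    INSERT INTO customer_history (customer_id, name, telephone, changed_at)
    VALUES (OLD.customer_id, OLD.name, OLD.telephone, CURRENT_TIMESTAMP);
    IF TG_OP = 'DELETE' THEN
        RETURN OLD;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER customer_history_trg
BEFORE UPDATE OR DELETE ON customer
FOR EACH ROW EXECUTE FUNCTION copy_customer_to_history();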

If you are using Hibernate, JBoss Envers could be an option. You only have to annotate classes with @Audited to keep their history.

You'll need a master record in a master table that contains the information common among all versions.
Then each child table uses master record id + version no as part of the primary key.
It can be done without the master table, but in my experience it will tend to make the SQL statements a lot messier.
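A minimal sketch of the master/child layout (hypothetical table and column names):

CREATE TABLE document (                 -- master: one row per logical object
    document_id INTEGER PRIMARY KEY,
    created_at  TIMESTAMP NOT NULL
);

CREATE TABLE document_version (         -- one row per version of the object
    document_id INTEGER NOT NULL REFERENCES document (document_id),
    version_no  INTEGER NOT NULL,
    title       VARCHAR(200),
    body        TEXT,
    PRIMARY KEY (document_id, version_no)
);

-- Latest version of a given document:
SELECT dv.*
  FROM document_version dv
 WHERE dv.document_id = 42
   AND dv.version_no = (SELECT MAX(version_no)
                          FROM document_version
                         WHERE document_id = 42);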

A simple, fool-proof way is to add a version column to your tables, store the object's version, and choose the appropriate application logic based on that version number.
This way you also get backwards compatibility at little cost, which is always good.
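For example (hypothetical table and columns; ALTER syntax varies slightly between dialects), the version is just another column the application reads and branches on:

ALTER TABLE customer_profile ADD COLUMN object_version INTEGER NOT NULL DEFAULT 1;

-- The application reads the stored version and picks the matching logic:
SELECT object_version, name, address
  FROM customer_profile
 WHERE customer_id = 42;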

ZODB + ZEO implements a revision-based database with complete rollback to any point in time. Go check it out.
Bad part: it's tied to Zope.

Once an object is saved in the database, it can be modified any number of times. If we want to know how many times an object has been modified, we can apply this versioning concept.
Whenever versioning is used, Hibernate stores the version number as zero when the object is saved for the first time in the database. Hibernate then increments that version number by one automatically whenever a modification is made to that particular object.
In order to use this versioning concept, we need the following two changes in our application (a rough SQL-level sketch follows the list):
Add one property of type int to our POJO class.
In the Hibernate mapping file, add a version element immediately after the id element.
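At the SQL level, Hibernate's versioning boils down to roughly the following (hypothetical table; the actual statements are generated by Hibernate):

-- The mapped table simply carries an integer version column.
CREATE TABLE employee (
    id      INTEGER PRIMARY KEY,
    name    VARCHAR(100),
    version INTEGER NOT NULL      -- starts at 0, incremented by Hibernate on each update
);

-- On update, Hibernate bumps the version and checks the old one:
UPDATE employee
   SET name = ?, version = version + 1
 WHERE id = ? AND version = ?;
-- If no row is updated, another transaction changed it first and Hibernate
-- reports a stale object (StaleObjectStateException).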

I'm not sure if we have the same problem, but I required a large number of 'proposed' changes to the current data set (with chained proposals, i.e. proposals on top of proposals).
Think branching in source control, but for database tables.
We also wanted a historical log, but this was the least important factor - the main issue was managing change proposals which could hang around for six months or longer as the business mulled over change approval and got ready for the actual change to be implemented.
The idea is that users can load up a Change and start creating, editing and deleting the current state of data without actually applying those changes, revert any changes they may have made, or cancel the entire Change.
The only way I have been able to achieve this is to have a set of common fields on my versioned tables (a SQL sketch of these columns follows the list):
Root ID: Required - set once to the primary key when the first version of a record is created. This represents the primary key across all of time and is copied into each version of the record. You should consider the Root ID when naming relation columns (eg. PARENT_ROOT_ID instead of PARENT_ID). As the Root ID is also the primary key of the initial version, foreign keys can be created against the actual primary key - the actual desired row will be determined by the version filters defined below.
Change ID: Required - every record is created, updated, deleted via a change
Copied From ID: Nullable - null indicates newly created record, not-null indicates which record ID this row was cloned from when updated
Effective From Date/Time: Nullable - null indicates proposed record, not-null indicates when the record became current. Unfortunately a unique index cannot be placed on Root ID/Effective From as there can be multiple null values for any Root ID. (Unless you want to restrict yourself to a single proposed change per record)
Effective To Date/Time: Nullable - null indicates current/proposed, not-null indicates when it became historical. Not technically required but helps speed up queries finding the current data. This field could be corrupted by hand-edits but can be rebuilt from the Effective From Date/Time if this occurs.
Delete Flag: Boolean - set to true when it is proposed that the record be deleted upon becoming current. When deletes are committed, their Effective To Date/Time is set to the same value as the Effective From Date/Time, filtering them out of the current data set.
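A sketch of those common columns on one versioned table (column names beyond the ones listed above are hypothetical):

CREATE TABLE product_version (
    id              INTEGER PRIMARY KEY,    -- primary key of this particular version row
    root_id         INTEGER NOT NULL,       -- identity of the record across all of time
    change_id       INTEGER NOT NULL,       -- the Change this row was created/updated/deleted under
    copied_from_id  INTEGER,                -- NULL = newly created, else the row this was cloned from
    effective_from  TIMESTAMP,              -- NULL = still proposed
    effective_to    TIMESTAMP,              -- NULL = current or proposed
    delete_flag     BOOLEAN NOT NULL DEFAULT FALSE,  -- BIT/CHAR(1) in dialects without BOOLEAN
    name            VARCHAR(200),
    parent_root_id  INTEGER                 -- relations point at the parent's Root ID, as noted above
);

CREATE INDEX ix_pv_root_eff ON product_version (root_id, effective_from);
CREATE INDEX ix_pv_eff      ON product_version (effective_from, effective_to);
CREATE INDEX ix_pv_change   ON product_version (change_id);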
The query to get the current state of data according to a change would be;
SELECT * FROM table
WHERE (CHANGE_ID IN :ChangeId
   OR (EFFECTIVE_FROM <= :Now
       AND (EFFECTIVE_TO IS NULL OR EFFECTIVE_TO > :Now)
       AND ROOT_ID NOT IN (SELECT ROOT_ID FROM table WHERE CHANGE_ID IN :ChangeId)))
(The filtering of change-on-change multiples is done outside of this query).
The query to get the current state of data at a point in time would be;
SELECT * FROM table WHERE EFFECTIVE_FROM <= :Now AND (EFFECTIVE_TO IS NULL OR EFFECTIVE_TO > :Now)
Common indexes created on (ROOT_ID, EFFECTIVE_FROM), (EFFECTIVE_FROM, EFFECTIVE_TO) and (CHANGE_ID).
If anyone knows a better solution I would love to hear about it.

Related

Working with accumulated bucket values in Entity Framework

I'm attempting to find design patterns/strategies for working with accumulated bucket values in a database where concurrency can be a problem. I don't know the proper search terms to use to find information on the topic.
Here's my use case (I'm using code-first Entity Framework, so EF-specific advice is welcome):
I have a database table that contains a quantity value. This quantity value can be incremented or decremented by multiple clients at the same time (because of this, I call it a "bucket" value, as it is a bucket for a bunch of accumulated activity; this is in opposition to the other strategy where you keep all activity and calculate the value based on that activity). I am looking for strategies for ensuring the accuracy of this "bucket" value (within the context of EF) that take into consideration that multiple clients may attempt to change it simultaneously (concurrency).
The answer "you must track activity and derive your value from that activity" is acceptable, but I want to consider all bucket-centric solutions as well.
I am looking for advice on search terms to use to find good information on this topic as well as specific links.
Edit: You may assume that all activity is relative to the "bucket" value (no clients will be making an absolute change to the value; they will only increment or decrement).
Without directly coding the SQL Queries that update the buckets, you would have to use client-side Optimistic Concurrency. See Entity Framework Optimistic Concurrency Patterns. Clients whose update would overwrite a change will get an exception, after which you can reload with the current value and retry. This pattern requires a ROWVERSION column on the target table.
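At the SQL level the optimistic check looks roughly like this (SQL Server flavour, hypothetical table and parameter names; EF generates statements of this shape for you):

-- Hypothetical bucket table with a ROWVERSION column for optimistic concurrency.
CREATE TABLE bucket (
    id       INT PRIMARY KEY,
    quantity INT NOT NULL,
    rv       ROWVERSION
);

-- The update succeeds only if nobody has changed the row since it was read.
UPDATE bucket
   SET quantity = @newQuantity
 WHERE id = @id AND rv = @originalRv;

-- If @@ROWCOUNT = 0, the row was changed concurrently: reload and retry.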
If you code the updates in TSQL you can code an atomic update, something like
update foo with (updlock)
set bucket_a = bucket_a + 1
output inserted.*
where id = @id
(The 'updlock' isn't strictly necessary in this query, but is good form any time you want to ensure this kind of isolation)

Bitemporal Database Design Question

I am designing a database that needs to store transaction time and valid time, and I am struggling with how to effectively store the data and whether or not to fully time-normalize attributes. For instance I have a table Client that has the following attributes: ID, Name, ClientType (e.g. corporation), RelationshipType (e.g. client, prospect), RelationshipStatus (e.g. Active, Inactive, Closed). ClientType, RelationshipType, and RelationshipStatus are time varying fields. Performance is a concern as this information will link to large datasets from legacy systems. At the same time the database structure needs to be easily maintainable and modifiable.
I am planning on splitting out audit trail and point-in-time history into separate tables, but I’m struggling with how to best do this.
Some ideas I have:
1)Three tables: Client, ClientHist, and ClientAudit. Client will contain the current state. ClientHist will contain any previously valid states, and ClientAudit will be for auditing purposes. For ease of discussion, let’s forget about ClientAudit and assume the user never makes a data entry mistake. Doing it this way, I have two ways I can update the data. First, I could always require the user to provide an effective date and save a record out to ClientHist, which would result in a record being written to ClientHist each time a field is changed. Alternatively, I could only require the user to provide an effective date when one of the time varying attributes (i.e. ClientType, RelationshipType, RelationshipStatus) changes. This would result in a record being written to ClientHist only when a time varying attribute is changed.
2) I could split out the time varying attributes into one or more tables. If I go this route, do I put all three in one table or create two tables (one for RelationshipType and RelationshipStatus and one for ClientType). Creating multiple tables for time varying attributes does significantly increase the complexity of the database design. Each table will have associated audit tables as well.
Any thoughts?
A lot depends (or so I think) on how frequently the time-sensitive data will be changed. If changes are infrequent, then I'd go with (1), but if changes happen a lot and not necessarily to all the time-sensitive values at once, then (2) might be more efficient--but I'd want to think that over very carefully first, since it would be hard to manage and maintain.
I like the idea of requiring users to enter effective dates, because this could serve to reduce just how much detail you are saving - for example, however many changes they make today, it only produces that one history row that comes into effect tomorrow (though the audit table might get pretty big). But can you actually get users to enter what is somewhat abstract data?
You might want to try a single Client table with 4 date columns to handle the 2 temporal dimensions.
Something like (client_id, ..., valid_dt_start, valid_dt_end, audit_dt_start, audit_dt_end).
This design is very simple to work with, and I would try it and see how it scales before going with something more complicated.
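A sketch of that single-table layout (date column names as suggested above, the attribute columns taken from the question):

CREATE TABLE client (
    client_id           INTEGER NOT NULL,
    name                VARCHAR(200),
    client_type         VARCHAR(50),
    relationship_type   VARCHAR(50),
    relationship_status VARCHAR(50),
    valid_dt_start      DATE NOT NULL,        -- valid time: when the fact was true in the real world
    valid_dt_end        DATE,                 -- NULL = still valid
    audit_dt_start      TIMESTAMP NOT NULL,   -- transaction time: when the row was recorded
    audit_dt_end        TIMESTAMP             -- NULL = current belief
);

-- "What did we believe at :as_of about the client's state at :valid_point?"
SELECT *
  FROM client
 WHERE client_id = 42
   AND valid_dt_start <= :valid_point AND (valid_dt_end IS NULL OR valid_dt_end > :valid_point)
   AND audit_dt_start <= :as_of       AND (audit_dt_end IS NULL OR audit_dt_end > :as_of)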

How to keep historic details of modification in a database (Audit trail)?

I'm a J2EE developer & we are using hibernate mapping with a PostgreSQL database.
We have to keep track of any changes that occur in the database; in other words, all previous and current values of any field should be saved. Each field can be of any type (bytea, int, char...).
With a simple table it is easy, but with a graph of objects things are more difficult.
So we have, speaking from a UML point of view, a graph of objects to store in the database along with every change and the user who made it.
Any idea or pattern how to do that?
A common way to do this is by storing versions of objects.
If you add a "version" and a "deleted" field to each table that you want to store an audit trail on, then instead of doing normal updates and deletes, follow these rules:
Insert - Set the version number to 0 and insert as normal.
Update - Increment the version number and do an insert instead.
Delete - Increment the version number, set the deleted field to true and do an insert instead.
Retrieve - Get the record with the highest version number and return that.
If you follow this pattern, every time you update you will create a new record rather than overwriting the old data, so you will always be able to track back and see all the old objects.
This will work exactly the same for graphs of objects, just add the new fields to each table within the object graph, and handle each insert/update/delete for each table as described above.
If you need to know which user made the modification, you just add a "ModifiedBy" field as well.
(You can either do this processing in your DA layer code, or if you prefer you can use database triggers to catch your update/delete/retrieve calls and re-process them following the rules.)
Obviously, you need to consider space requirements, as every single update will result in a fully new record. If your application is update-heavy, you are going to generate a lot of data. It's common to also include a "last modified time" field so you can process the database offline and delete data older than required.
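A minimal sketch of that pattern on one table (hypothetical table, column names and values):

CREATE TABLE account (
    account_id   INTEGER NOT NULL,     -- business key, shared by all versions
    version      INTEGER NOT NULL,     -- 0 for the first insert, +1 per change
    balance      NUMERIC(12,2),
    deleted      BOOLEAN NOT NULL DEFAULT FALSE,
    modified_by  VARCHAR(50),
    modified_at  TIMESTAMP NOT NULL,
    PRIMARY KEY (account_id, version)
);

-- "Retrieve": the latest version of each account that isn't deleted.
SELECT a.*
  FROM account a
 WHERE a.version = (SELECT MAX(version) FROM account WHERE account_id = a.account_id)
   AND a.deleted = FALSE;

-- "Update": insert a new version instead of updating in place.
INSERT INTO account (account_id, version, balance, deleted, modified_by, modified_at)
SELECT account_id, MAX(version) + 1, 150.00, FALSE, 'jsmith', CURRENT_TIMESTAMP
  FROM account
 WHERE account_id = 42
 GROUP BY account_id;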
Current RDBMS implementations are not very good at handling temporal data. That's one reason why maintaining separate journalling tables through triggers is the usual approach. (The other is that audit trails frequently have different use cases to regular data, and having them in separate tables makes it easier to manage access to them). Oracle does a pretty slick job of hiding the plumbing in its Total Recall product, but being Oracle it charges $$$ for this.
Scott Bailey has published a presentation on temporal data in PostgreSQL. Alas it won't help you right now but it seems like some features planned for 8.5 and 8.6 will enable the transparent storage of time-related data. Find out more.

Adding relations to an Access Database

I have an MS Access database with plenty of data. It's used by an application my team and I are developing. However, we've never added any foreign keys to this database because we could control relations from the code itself. We've never had any problems with this, and probably never will.
However, as development has progressed, I fear there's a risk of losing sight of all the relationships between the 30+ tables, even though we use well-normalized data. So it would be a good idea to get at least the relations between the tables documented.
Altova has created DatabaseSpy, which can show the structure of a database, but without the relations there isn't much to display. I could still use it to add relations, but I don't want to modify the database itself.
Is there any software that can analyse a database by its structure and data and then make a best guess about its relations? (Just as documentation, not to modify the database.)
This application was created more than 10 years ago and has over 3000 paying customers who all use it. It's actually document-based, using an XML document for its internal storage. The database is just used as storage, and a single import/export routine converts it to and from XML. Unfortunately, the XML structure isn't very practical to use for documentation, and there's a second layer around this XML document to expose it as an object model. This object model is far from perfect too, but that's what 10 years of development can do to an application. We do want to improve it, but this takes time and we can't disappoint the current users by delaying new updates. Basically, we're stuck with its current design and to improve it, we need to make sure things are well-documented. That's what I'm working on now.
Only 30+ tables? Shouldn't take but a half hour or an hour to create all the relationships required. Which I'd urge you to do. Yes, I know that you state your code checks for those. But what if you've missed some? What if there are indeed orphaned records? How are you going to know? Or do you have bullet proof routines which go through all your tables looking for all these problems?
Use a largish 23" LCD monitor and have at it.
If your database does not have relationships defined somewhere other than code, there is no real way to guess how tables relate to each other.
Worse, you can't know the type of relationship and whether cascading of update and deletion should occur or not.
Having said that, if you followed some strict rules for naming your foreign key fields, then it could be possible to reconstruct the structure of the relationships.
For instance, I use a scheme like this one:
Table Product
- Field ID /* The Unique ID for a Product */
- Field Designation
- Field Cost
Table Order
- Field ID /* the unique ID for an Order */
- Field ProductID
- Field Quantity
The relationship is easy to detect when looking at the Order: Order.ProductID is related to Product.ID, and this can easily be ascertained from code, going through each field.
If you have a similar scheme, then how much you can get out of it depends on how well you follow your own convention, but it could approach 100% accuracy, although you'll probably have some exceptions (which you can build into your code or, better, look up somewhere).
The other solution is if each of your table's unique ID is following a different numbering scheme.
Say your Order.ID is in fact following a scheme like OR001, OR002, etc and Product.ID follows PD001, PD002, etc.
In that case, going through all fields in all tables, you can search for FK records that match each PK.
If you're following a sane convention for naming your fields and tables, then you can probably automate the discovery of the relations between them, store that in a table and manually go through to make corrections.
Once you're done, use that result table to actually build the relationships from code using the Database.CreateRelation() method (look up the Access documentation, there is sample code for it).
You can build a small piece of VBA code, divided in 2 parts:
Step 1 implements the database relations with the Database.CreateRelation method
Step 2 deletes all created relations with the database.delete command
As Tony said, 30 tables are not that many, and the script should be easy to set up. Once this is set up, stop the process after step 1, run the Access documenter (Tools\Analyse\Documenter) to get your documentation ready, then launch step 2. Your database will then be unchanged and your documentation ready.
I advise you to keep this code and run it regularly against your database to check that your relational model sticks to the data.
There might be a tool out there that is able to "guess" the relations, but I doubt it. Frankly, I am scared of databases without proper foreign keys in particular, and of multi-user apps that use Access as a DBMS as well.
I guess the app must be some sort of internal tool; otherwise I would suggest that you move to a proper DBMS (SQL Server Express is free) and add the foreign keys.

maintain history in a database

I am designing this database that must maintain a history of employee salary and the movements within the organization. Basically, my design has 3 tables (I mean, there are more tables, but for this question I'll mention 3, so bear with me): an Employee table (containing the most current salary, position data, etc.), a SalaryHistory table (salary, date, reason, etc.) and a MovementHistory table (title, dept., comments). I'll be using LINQ to SQL, so what I was thinking is that every time employee data is updated, the old values will be copied to their respective history tables. Is this a good approach? Should I just do it using LINQ to SQL or triggers? Thanks for any help, suggestion or idea.
Have a look at http://www.simple-talk.com/sql/database-administration/database-design-a-point-in-time-architecture .
Basically, the article suggests that you have the following columns in the tables you need to track history for -
* DateCreated – the actual date on which the given row was inserted.
* DateEffective – the date on which the given row became effective.
* DateEnd – the date on which the given row ceased to be effective.
* DateReplaced – the date on which the given row was replaced by another row.
* OperatorCode – the unique identifier of the person (or system) that created the row.
DateEffective and DateEnd together tell you the time for which the row was valid (or the time for which an employee was in a department, or the time for which he earned a particular salary).
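A sketch of a history table carrying those columns (the salary and title columns are hypothetical, standing in for whatever you track):

CREATE TABLE employee_history (
    employee_id    INTEGER NOT NULL,
    salary         NUMERIC(12,2),
    title          VARCHAR(100),
    date_created   TIMESTAMP NOT NULL,   -- when the row was physically inserted
    date_effective TIMESTAMP NOT NULL,   -- when the row became effective
    date_end       TIMESTAMP,            -- when it ceased to be effective (NULL = open)
    date_replaced  TIMESTAMP,            -- when it was superseded by another row
    operator_code  VARCHAR(30) NOT NULL  -- who or what created the row
);

-- The employee's state as of a given point in time:
SELECT *
  FROM employee_history
 WHERE employee_id = 42
   AND date_effective <= :point_in_time
   AND (date_end IS NULL OR date_end > :point_in_time)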
It is a good idea to keep that logic internal to the database: that's basically why triggers exist. I say this carefully, however, as there are plenty of reasons to keep it external. Often times - especially with a technology as easy as LINQ-to-SQL - it is easier to write the code externally. In my experience, more people could write that logic in C#/LINQ than could do it correctly using a trigger.
Triggers are fast - they're compiled! However, they're very easy to misuse and make your logic overcomplex to a degree that performance can degrade rapidly. Considering how simple your use case is, I would opt to use triggers, but that's me personally.
Triggers will likely be faster, and don't require a "middle man" to get the job done, eliminating at least one chance for errors.
Depending on your database of choice, you can just use one table and enable OIDs on it, and add two more columns, "flag" and "previous". Never update this table, only insert. Add a trigger so that when a row is added for employee #id, all existing records for employee #id get a flag of "old" and the new row's "previous" value is set to the previous row.
I think this belongs in the database for two reasons.
First, middle tiers come and go, but databases are forever. This year Java EJBs, next year .NET, the year after that something else. The data remains, in my experience.
Second, if the database is shared at all it should not have to rely on every application that uses it to know how to maintain its data integrity. I would consider this an example of encapsulation of the database. Why force knowledge and maintenance of the history on every client?
Triggers make your front-end easier to migrate to something else and they will keep the database consistent no matter how data is inserted/updated/removed.
Besides, in your case I would write the salaries straight to the salary history - from your description I don't see a reason why you should go via an update trigger on the employee table.
