Database versioning

I have built a few projects (a CMS and an e-commerce system) that required some data to be versioned.
Usually I come up with this kind of schema:
+------------+
| foobar     |
+------------+
| foobar_id  |
| version    |
+------------+
It worked great, but I am wondering if there is a better way to do it. The main problem with that solution is that you always have to use a subquery to get the latest version.
i.e.:
SELECT * FROM foobar WHERE foobar_id = 2 AND version = (SELECT MAX(version) FROM foobar f2 WHERE f2.foobar_id = 2)
This makes most of the queries more complicated and also has some performance drawbacks.
So it would be nice if you could share your experience creating versioned tables, and the pros and cons of each method.
Thanks

I prefer to have historical data in another table. I would make a foobar_history table (or something similar) with a FK to foobar_id. This stops you from having to use a subquery altogether. It has the added advantage of not polluting your primary data table with the tons of historical data you probably don't want to see 99% of the time you're accessing it.
You will likely want a trigger to maintain this data, though, as it requires copying the current data into _history and then doing the update.

The cleanest solution in my opinion is to have a history table for each table that requires versioning. In other words, have a foobar table and a foobar_history table, with a trigger on foobar that writes the existing data to the history table with a timestamp and the user that changed the data. Older data is easily queryable, sorted by timestamp descending, and you know that the data in the main table is always the latest version.
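A minimal sketch of that setup, using SQL Server syntax (table and column names are illustrative, not from the original answer):

-- Main table always holds only the latest version.
CREATE TABLE foobar (
    foobar_id INT PRIMARY KEY,
    name      VARCHAR(100) NOT NULL
);

-- History table mirrors the main table, plus audit columns.
CREATE TABLE foobar_history (
    foobar_id  INT          NOT NULL,   -- references foobar(foobar_id)
    name       VARCHAR(100) NOT NULL,
    changed_at DATETIME     NOT NULL,
    changed_by VARCHAR(100) NOT NULL
);

-- Copy the pre-update values away whenever a row changes.
CREATE TRIGGER trg_foobar_history
ON foobar
AFTER UPDATE
AS
BEGIN
    INSERT INTO foobar_history (foobar_id, name, changed_at, changed_by)
    SELECT d.foobar_id, d.name, GETDATE(), SUSER_SNAME()
    FROM deleted d;   -- "deleted" holds the rows as they were before the update
END;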

I used to work on a system with historical data, and we had a boolean flag to indicate which row was the latest version of the data. Of course you need to maintain the consistency of the flag at the application level. You can then create indexes that use the flag, and if you provide it in the WHERE clause the query is fast. (A minimal sketch of this layout follows the cons below.)
Pro:
easy to understand
does not require major changes to your (existing) database schema
no need to copy old data to another table; only the flag is updated
Cons:
the flag needs to be maintained at the application level
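A minimal sketch of that flag layout (SQL Server flavoured; names are illustrative):

-- One row per version; the flag marks the current one and is set by the application.
CREATE TABLE foobar (
    foobar_id INT NOT NULL,
    version   INT NOT NULL,
    is_latest BIT NOT NULL,
    name      VARCHAR(100) NOT NULL,
    PRIMARY KEY (foobar_id, version)
);

-- Index so that fetching the current row does not scan old versions.
CREATE INDEX ix_foobar_latest ON foobar (foobar_id, is_latest);

-- No subquery needed to get the latest version.
SELECT * FROM foobar WHERE foobar_id = 2 AND is_latest = 1;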
Otherwise, you can rely on a separate history table, as suggested in several answers.
Pro:
clean separation of history from actual data
possible to have a db-level cascade delete between actual data and its history, in case the entity is removed
Cons:
need 2 queries (or a union) if you want the complete history (that is, old data + current data)
the row that corresponds to the latest version of the data gets updated in place. I have heard that updates are slower than inserts, depending on the "size" of the data that changed.
What is best will depend on your use case. I had to deal with a document management system where we wanted to be able to version documents, but we also had features like reverting to an old version. It was easier to use a boolean to speed up just the operations that needed the latest version. If you have truly historical data (which never changes), a dedicated history table is probably better.
Does the concept of history fit into your domain model? If not, then you have a db schema that differs from your conceptual domain model. If, at the domain level, the actual data and the old data need to be handled the same way, having two tables complicates the design. Just consider the case where you need to return the complete history (old + new): the easiest solution would be to have one class per table, but then you can't return a single list as easily as if you had only one table. But if these are two distinct concepts, then it's fine for history to be first-class in your design.
I would also recommend this article by M. Fowler, which is interesting when it comes to dealing with temporal data: Patterns for things that change with time

You can simplify the query by using a view over your table that filters to the latest version. This only makes the queries look nicer; you still have the performance overhead.
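For example, a view of this kind might look like the following (a sketch based on the schema in the question):

-- Exposes only the latest version of each foobar row.
CREATE VIEW foobar_latest AS
SELECT f.*
FROM foobar f
WHERE f.version = (SELECT MAX(f2.version)
                   FROM foobar f2
                   WHERE f2.foobar_id = f.foobar_id);

-- Queries read naturally, but the correlated subquery still runs underneath.
SELECT * FROM foobar_latest WHERE foobar_id = 2;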

A common technique is to add a version_status column holding current/expired. Also note that if you keep new and old records in the same table, you should have a business (natural) key for your entity, something like name + pin, because the primary key will change (increment) with each row.
TABLE foobar(foobar_id PK, business_key, version, version_status, .....)
SELECT *
FROM foobar
WHERE business_key = 'myFoobar3' AND version_status = 'current'
When deciding whether to keep the record history in the same table or move it to a separate one, consider other tables which have foobar_id as a foreign key. When issuing a new version, should existing foreign keys point to the new PK or to the old PK? If you want to keep the history of relationships, you will probably want to keep everything in the same table. If only the new version is important, you may consider moving expired rows to another table, though it is not necessary.

If you were using Oracle, you could use analytic functions:
select *
from (
    select a.*,
           row_number() over (partition by foobar_id order by version desc) rn
    from foobar a
    where foobar_id = 2
)
where rn = 1

It depends on how many of your tables require versioning, and whether you've got a transactional or a reporting system.
If it's just a few transactional tables, the way you're doing it is fine as long as the performance issues aren't too significant. You can make the querying easier by adding a current_row column and a trigger that updates the prior row to mark it as non-current.
But if you've got a lot of tables, or the extra rows are slowing down some of your queries, then I'd do as others suggest and use history tables along with history triggers. Note that you can generate that code to make it easier to develop and maintain.
If you're in the reporting world then there are a lot of other options I won't address here. You can find them described in detail in data warehousing data modeling books.

Related

Database design best practice: one table for web form drop-down options, or a separate table for each drop-down?

I'm looking at the best practice approach here. I have a web page that has several drop-down options. The drop-downs are not related; they are for miscellaneous values (location, building codes, etc.). The database right now has a table for each set of options (e.g. a table for building codes, a table for locations, etc.). I'm wondering if I could just combine them all into one table (called listOptions) and then just query that one table.
Location Table
LocationID (int)
LocatValue (nvarchar(25))
LocatDescription (nvarchar(25))
BuildingCode Table
BCID (int)
BCValue (nvarchar(25))
BCDescription (nvarchar(25))
Instead of the above, is there any reason why I can't do this?
ListOptions Table
ID (int)
listValue (nvarchar(25))
listDescription (nvarchar(25))
groupID (int) //where groupid corresponds to Location, Building Code, etc
Now, when I query the table, I can pass the groupID to the query to pull back the values I need.
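For illustration, a sketch of the combined table and query (names follow the question; the groupID values are just examples):

CREATE TABLE ListOptions (
    ID              INT PRIMARY KEY,
    listValue       NVARCHAR(25) NOT NULL,
    listDescription NVARCHAR(25) NULL,
    groupID         INT NOT NULL          -- e.g. 1 = Location, 2 = Building Code
);

-- Fetch all options for one drop-down, e.g. locations.
SELECT ID, listValue, listDescription
FROM ListOptions
WHERE groupID = 1
ORDER BY listValue;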
Putting them all in one table is an antipattern. These are different lookups, and you cannot enforce referential integrity in the database (which is the correct place to enforce it, as applications are often not the only way data gets changed) unless they are in separate tables. Data integrity is FAR more important than saving a few minutes of development time when you need an additional lookup.
If you plan to use the values later in some referencing foreign keys, better use separate tables.
But why do you need an "all in one" table? What problem does it solve?
You could do this.
I believe this is your master data, and it would not have a huge number of rows, so it should not create any performance problems.
Secondly, why would you want to do this once your app is up and running? It should have been thought about earlier. The tables might be used in a lot of places, so it could mean a lot of coding and, most importantly, testing.
Can you shed further light on your requirements?
You can keep them in separate tables and have your stored procedure return one set of data with a "datatype" key that signifies which set of values goes with which option.
However, I would urge you to consider a much different approach. This suggestion is based on years of building data-driven websites. If these drop-down options don't change very often, then why not build server-side include files instead of querying the database? We did this with most of our websites. Think about it: each time the page is presented you query the database for the same list of values... that data hardly ever changes.
In cases where that data did have a tendency to change, we simply added a routine to the back-end admin that rebuilt the server-side include file whenever an add, change or delete was done to one of the lookup values. This reduced database I/Os and sped up the load time of all our websites.
We had approximately 600 websites on the same server, all using the same instance of SQL Server (separate databases), and our total server database I/Os were drastically reduced.
Edit:
We simply built an SSI that looked like this...
<option value="1">Blue</option>
<option value="2">Red</option>
<option value="3">Green</option>
With a single table it would be easy to add new groups instead of creating new tables, but for best-practice reasons you should also have a group table, so you can name those groups in the db for future maintenance.
The best practice depends on your requirements.
Do the values of location and building vary frequently? Where do the values come from? Are they imported from external data? Do other tables refer to the single table (so that you need a two-field key to properly join the tables)?
For example, I use a single table with heterogeneous data for constants or configuration values.
But if the data vary often or are imported from an external source, I prefer to use separate tables.

Data synchronization from an unreliable data source to a SQL table

I am looking for a pattern, framework or best practice to handle the generic problem of application-level data synchronisation.
Let's take an example with only 1 table to make it easier.
I have an unreliable data source for a product catalog. Data can occasionally be unavailable, incomplete or inconsistent (issues might come from manual data entry errors, ETL failures, ...).
I have a live copy in a MySQL table in use by a live system, let's say a website.
I need to implement safety mechanisms when updating the MySQL table to "synchronize" with the original data source. Here are the safety criteria and the solutions I am suggesting:
avoid deleting records when they temporarily disappear from the data source => use a "deleted" boolean/date column or an archive/history table
check for inconsistent changes => configure rules per column, such as: should never change, should only increment, ...
check for integrity issue => (standard problem, no point discussing approach)
ability to roll back the last sync => restore from a history table? use a version increment/date column?
What I am looking for are best practices and patterns/tools to handle such a problem. Even if you are not pointing to THE solution, I would be grateful for any keyword suggestions that would help me narrow down which field of expertise to explore.
We have the same problem importing data from web analytics providers - they suffer the same problems as your catalog. This is what we did:
Every import/sync is assigned a unique id (auto_increment int64)
Every table has a history table that is identical to the original, but with an additional "superseded_id" column, which gets the id of the import that changed the row (a deletion is a change); the primary key is (row_id, superseded_id)
Every UPDATE copies the row to the history table before changing it
Every DELETE moves the row to the history table
This makes rollback very easy:
Find out the import_id of the bad import
REPLACE INTO main_table SELECT <everything but superseded_id> FROM history_table WHERE superseded_id=<bad import id>
DELETE FROM history_table WHERE superseded_id>=<bad import id>
For databases where performance is a problem, we do this in a secondary database on a different server, then copy the found-to-be-good main table to the production database into a new table main_table_$id, with $id being the highest import id, and make main_table a trivial view doing SELECT * FROM main_table_$someid. Now, by redefining the view to SELECT * FROM main_table_$newid, we can atomically switch to the new table.
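A sketch of such a history table in MySQL (names follow the answer; the row layout of main_table is just an example):

-- Each import/sync gets its own id.
CREATE TABLE imports (
    import_id  BIGINT AUTO_INCREMENT PRIMARY KEY,
    started_at DATETIME NOT NULL
);

CREATE TABLE main_table (
    row_id BIGINT PRIMARY KEY,
    name   VARCHAR(100)  NOT NULL,
    price  DECIMAL(10,2) NOT NULL
);

-- Identical columns plus superseded_id: the import that replaced or deleted the row.
CREATE TABLE main_table_history (
    row_id        BIGINT        NOT NULL,
    name          VARCHAR(100)  NOT NULL,
    price         DECIMAL(10,2) NOT NULL,
    superseded_id BIGINT        NOT NULL,   -- references imports(import_id)
    PRIMARY KEY (row_id, superseded_id)
);

-- Before an import updates or deletes row 1, its old version is copied away
-- (42 stands for the id of the import doing the change).
INSERT INTO main_table_history (row_id, name, price, superseded_id)
SELECT row_id, name, price, 42
FROM main_table
WHERE row_id = 1;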
I'm not aware of a single solution to all this - probably because each project is so different. However, here are two techniques I've used in the past:
Embed the concept of version and validity into your data model
This is a way to deal with change over time without having to resort to history tables; it does complicate your queries, so you should use it sparingly.
For instance, instead of having a product table as follows
PRODUCTS
Product_ID primary key
Price
Description
AvailableFlag
In this model, if you want to delete a product, you execute "delete from products where product_id = ..."; modifying the price would be "update products set price = 1 where product_id = ...".
With the versioned model, you have:
PRODUCTS
product_ID primary key
valid_from datetime
valid_until datetime
deleted_flag
Price
Description
AvailableFlag
In this model, deleting a product requires you to run "update products set valid_until = getdate() where product_id = xxx and valid_until is null", and then insert a new row with deleted_flag = true.
Changing price works the same way.
This means that you can run queries against your "dirty" data and insert it into this table without worrying about deleting items that were accidentally missed off the import. It also allows you to see the evolution of the record over time, and roll-back easily.
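The day-to-day operations under this model might look roughly like this (SQL Server style, matching the getdate() call above; note that product_id alone can then no longer be the primary key, so something like (product_id, valid_from) is needed instead):

-- Rows that are currently valid and not deleted.
SELECT *
FROM products
WHERE valid_until IS NULL
  AND deleted_flag = 0;

-- "Deleting" product 42: close the current row...
UPDATE products
SET valid_until = GETDATE()
WHERE product_id = 42
  AND valid_until IS NULL;

-- ...then record the deletion as a new row (other values copied from the closed row).
INSERT INTO products (product_id, valid_from, valid_until, deleted_flag, Price, Description, AvailableFlag)
VALUES (42, GETDATE(), NULL, 1, 9.99, 'Widget', 0);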
Use a ledger-like mechanism for cumulative values
Where you have things like "number of products in stock", it helps to create transactions to modify the amount, rather than taking the current amount from your data feed.
For instance, instead of having an amount_in_stock column on your products table, have a "product_stock_transaction" table:
product_stock_transactions
product_id (FK) | transaction_date | transaction_quantity | transaction_source
1               | 1 Jan 2012       | 100                  | product_feed
1               | 2 Jan 2012       | -3                   | stock_adjust_feed
1               | 3 Jan 2012       | 10                   | product_feed
On 2 Jan, the quantity in stock was 97; on 3 Jan, 107.
This design allows you to keep track of adjustments and their source, and is easier to manage when moving data from multiple sources.
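The current stock level is then derived rather than stored, e.g. (a sketch against the table above):

-- Stock on hand per product = sum of all transactions so far.
SELECT product_id,
       SUM(transaction_quantity) AS quantity_in_stock
FROM product_stock_transactions
GROUP BY product_id;

-- Stock as of a given date, e.g. end of 2 Jan 2012 (97 for product 1).
SELECT SUM(transaction_quantity)
FROM product_stock_transactions
WHERE product_id = 1
  AND transaction_date <= '2012-01-02';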
Both approaches can create large amounts of data - depending on the number of imports and the amount of data - and can lead to complex queries to retrieve relatively simple data sets.
It's hard to plan for performance concerns up front - I've seen both "history" and "ledger" work with large amounts of data. However, as Eugen says in his comment below, if you get to an excessively large ledger, it may be necessary to clean up the ledger table by summarizing the current levels and deleting (or archiving) old records.

What are the pros and cons of having history tables?

What is the better practice:
keep history records in a separate history table, or
keep history records in the active table with a different status?
In my opinion I would rather keep a separate table, to avoid creating one huge table with duplicate records, which may cause unwanted lag when querying the table.
My preference has always been to have a separate table for history, purely because it removes the need for a "WHERE Status = 'LIVE'" or "WHERE CurrentRecord = 1" to get the latest record (I won't get into one design that required an inline query on max(version) to get the latest). It also means the current records table should stay smaller and access times may be improved, etc. In the worst-case scenario, I've seen an ad-hoc query against such a table pick up the wrong version of a record, causing all sorts of problems later on.
Also, if you are already getting the history from another table, you could shard the data, so all history from one year is in one table/db, all history from another year is in another table/db, and so on.
Pro:
If you keep the history in a separate table, then that data will be accessed only when you need to search for something from the past. Most of the time the main table will be used far more than the historical one, so this means faster results.
Con:
In a project I worked on, I had one table with 350 columns (don't ask why...). So this table became very large as data was entered. At a specific moment records went from 'active' to 'closed' status. I was tempted to move all the closed records to a new (historical) table, but I realized it was actually slower - in a lot of queries I had to do unions...
As a final opinion, I think it depends on the case, but I would always consider the separate table first.
I prefer to use one table and partitioning. I would also set up a view for the active records and use that instead of the base table when querying active records.
I would go for a separate table; otherwise setting up UNIQUE and FK constraints may still be doable, but too involved.

Database design - do I need one or two database fields for this?

I am putting together a schema for a database. The goal of the database is to track applications in our department. I have a repeated problem that I am trying to solve.
For example, I have an "Applications" table. I want to keep track of whether any application uses a database or a bug tracking system, so right now I have fields in the Applications table called:
Table: Applications
UsesDatabase (bit)
Database_ID (int)
UsesBugTracking (bit)
BugTracking_ID (int)
Table: Databases:
id
name
Table: BugTracking:
id
name
Should I consolidate the "uses" column with the respective ID columns so there is only one bug tracking column and only one database column in the applications table?
Any best practice here for database design?
NOTE: I would like to run reports like "Percent of applications that use bug tracking" (although I guess either approach could generate this data).
You could remove the "uses" fields and make the id columns nullable, and let a null value mean that it doesn't use the feature. This is a common way of representing a missing value.
Edit:
To answer your note, you can easily get statistics like this:
select
    count(*) as TotalApplications,
    count(Database_ID) as UsesDatabase,
    count(BugTracking_ID) as UsesBugTracking
from Applications
Why not get rid of the two "Uses" fields and simply let a NULL value in the _ID fields indicate that the record does not use that feature (bug tracking or database)?
Either solution works. However, if you think you may want to occasionally just get a list of applications which do / do not have databases / bugtracking consider that having the flag fields reduces the query by one (or two) joins.
Having the bit fields is slightly denormalized, as you have to keep two fields in sync to keep one piece of data updated, but I tend to prefer them for cases like this, for the reason given in the prior paragraph.
Another option would be to have the field nullable, and put null in it for those entries which do not have DBs / etc, but then you run into problems with foreign key constraints.
I don't think there is any one supreme right way, just consider the tradeoffs and go with what makes sense for your application.
I would use 3 tables for the objects: Applications, Databases, and BugTracking. Then I would use 2 join tables for the 1-to-many relationships: ApplicationDatabases and ApplicationBugTracking.
The 2 join tables would each have an application_id and the id of the other table. If an application used a single database, there would be a single ApplicationDatabases record joining them. With this setup, an application could have 0 databases (no records for this app in the ApplicationDatabases table) or many databases (multiple records for this app in the ApplicationDatabases table).
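A sketch of those join tables (assuming Applications also has an integer id primary key; types are illustrative):

CREATE TABLE ApplicationDatabases (
    application_id INT NOT NULL REFERENCES Applications(id),
    database_id    INT NOT NULL REFERENCES Databases(id),
    PRIMARY KEY (application_id, database_id)
);

CREATE TABLE ApplicationBugTracking (
    application_id INT NOT NULL REFERENCES Applications(id),
    bugtracking_id INT NOT NULL REFERENCES BugTracking(id),
    PRIMARY KEY (application_id, bugtracking_id)
);

-- "Percent of applications that use bug tracking"
SELECT 100.0 * COUNT(DISTINCT abt.application_id) / COUNT(DISTINCT a.id) AS pct_using_bugtracking
FROM Applications a
LEFT JOIN ApplicationBugTracking abt ON abt.application_id = a.id;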
"Should i consolidate the "uses" column"
If I look at your problem statement, then there either is no "uses" column at all, or there are two. In either case, it is wrong of you to speak of "THE" uses column.
May I politely suggest that you learn to be PRECISE when asking questions ?
Yes, using NULL in the foreign key fields should be fine - it seems superfluous to have the bit fields.
Another way of doing it (though it might be considered evil by database people ^^) is to default them to 0 and add an ID 0 row in both the BugTracking and Databases tables with a name of "None"... when you do the reports you'll have to do some more work, unless you present the "None" values as they are, along with a neat percentage...
To answer the edited question-
Yes, the fields should be combined, with NULL meaning that the application doesn't have a database (or bug tracker).

Generic version control strategy for select table data within a heavily normalized database

Sorry for the long winded title, but the requirement/problem is rather specific.
With reference to the following sample (but very simplified) structure (in pseudo-SQL), I hope to explain it a bit better.
TABLE StructureName {
Id GUID PK,
Name varchar(50) NOT NULL
}
TABLE Structure {
Id GUID PK,
ParentId GUID, -- FK to Structure
NameId GUID NOT NULL -- FK to StructureName
}
TABLE Something {
Id GUID PK,
RootStructureId GUID NOT NULL -- FK to Structure
}
As one can see, Structure is a simple tree structure (I'm not worried about the ordering of children for this problem). StructureName is a simplification of a translation system. Finally, Something simply references the tree's root structure.
This is just one of many tables that need to be versioned, but this one serves as a good example for most cases.
There is a requirement to version any changes to the name and/or the tree 'layout' of the Structure table. Previous versions should always be available.
There seem to be a few possibilities for tackling this issue, like copying the entire structure, but most approaches cause one to 'lose' referential integrity. For example, if one followed that approach, one would have to make a duplicate of the 'Something' record, given that the root structure will be a new record with a new ID.
Other avenues for possible solutions are looking into how wikis handle this, or going a lot further and looking at how proper version control systems work.
Currently, I feel a bit clueless about how to proceed on this in a generic way.
Any ideas will be greatly appreciated.
Thanks
leppie
Some quick ideas:
Full copy: Create a copy of the structure, but for every table add a version_id column to the PK and all FKs; thus you can create copies of the live data with complete referential integrity.
pro: easy to query the history
con: large amount of redundant data copied
Change copy: Only copy the stuff that actually changes, along with valid_from / valid_to data.
pro: low data volume copied
con: hard to query, because one has to join on intervals
Variation: This applies to both schemes. Instead of creating a copy of the structure, you might keep the current record in the same table as the old versions, but tag it as current.
pro: smaller number of tables, easier mixing of history and current information
con: normal operations run against much bigger tables, which will have a performance impact
Auditing log: Depending on your actual requirements, it may be sufficient to just create an audit trail like this (a table sketch follows the pro/con below):
id, timestamp, changed_table, changed_column, old_value, new_value, changed_by
You might extend that to a full table structure:
transaction, table_change, changed_column
pro: generic, hence easy to implement for a large number of tables
con: if you need to reconstruct the state of a set of records at a given time, querying will become a nightmare
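A sketch of such an audit table (generic SQL; changed_at stands for the "timestamp" column in the list above):

CREATE TABLE audit_log (
    id             BIGINT PRIMARY KEY,
    changed_at     TIMESTAMP    NOT NULL,
    changed_table  VARCHAR(64)  NOT NULL,
    changed_column VARCHAR(64)  NOT NULL,
    old_value      VARCHAR(4000),
    new_value      VARCHAR(4000),
    changed_by     VARCHAR(64)  NOT NULL
);

-- Replaying the trail for one table up to a point in time; without an extra
-- row-identifier column it is hard to isolate a single record's history,
-- which is exactly the querying nightmare mentioned above.
SELECT changed_at, changed_column, old_value, new_value
FROM audit_log
WHERE changed_table = 'Structure'
  AND changed_at <= '2012-06-01'
ORDER BY changed_at;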
I wrote a blog about various approaches to versioning, but be warned: it's in German.
The data warehousing folks have several algorithms for "slowly-changing dimensions".
The more sophisticated algorithms attach date ranges to a dimension value to indicate when it is valid.
Depending on your versioning requirements you could do one of these things, cribbed from Kimball's The Data Warehouse Toolkit.
Assign a version number to rows of the structure table. This means you have to do some reasoning to collect a complete structure: the rows with the selected version number, unioned with rows that are unchanged since an earlier version.
Assign a date range or version range to rows of the structure table. This means that some rows have start dates and end dates; some rows will have end dates at some epoch in the impossible future. Or, if you use version numbers, you'll have a start-end pair or a start-infinity pair that indicates the row is still current. You can then trivially query the rows that are valid "today" or that apply to the requested version (a query sketch follows this list).
Clone the structure for each version. This is unpleasant because the clone operation is costly. The queries, however, are trivial because the entire structure is available under a single, consistent version number.
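For the date-range / version-range option, the typical lookups might be sketched like this (generic SQL; valid_from / valid_to and start_version / end_version are assumed column names added to the Structure table):

-- Rows valid "today" (open-ended rows have valid_to = NULL).
SELECT *
FROM Structure
WHERE valid_from <= CURRENT_DATE
  AND (valid_to IS NULL OR valid_to > CURRENT_DATE);

-- Rows that apply to a requested version, e.g. version 5.
SELECT *
FROM Structure
WHERE start_version <= 5
  AND (end_version IS NULL OR end_version >= 5);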
