I was reading about temporal databases and it seems they have built-in time aspects. I wonder why we would need such a model?
How different is it from a normal RDBMS? Can't we have a normal database, i.e. an RDBMS, and, say, have a trigger which associates a timestamp with each transaction that happens? Maybe there would be a performance hit. But I'm still skeptical about temporal databases having a strong case in the market.
Do any current databases support such a feature?
Consider your appointment/journal diary - it goes from Jan 1st to Dec 31st. Now we can query the diary for appointments/journal entries on any day. This ordering is called the valid time. However, appointments/entries are not usually inserted in order.
Suppose I would like to know what appointments/entries were in my diary on April 4th. That is, all the records that existed in my diary on April 4th. This is the transaction time.
Appointments/entries can be created, deleted and so on, so a typical record has a beginning and end valid time that covers the period of the entry, and a beginning and end transaction time that indicates the period during which the entry appeared in the diary.
This arrangement is necessary when the diary may undergo historical revision. Suppose on April 5th I realise that the appointment I had on Feb 14th actually occurred on February 12th i.e. I discover an error in my diary - I can correct the error so that the valid time picture is corrected, but now, my query of what was in the diary on April 4th would be wrong, UNLESS, the transaction times for appointments/entries are also stored. In that case if I query my diary as of April 4th it will show an appointment existed on February 14th but if I query as of April 6th it would show an appointment on February 12th.
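The diary correction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Entry` class, the `as_of` helper, the year 2010 and the January entry date are all made up for the example, and the valid time is simplified to a single day rather than a begin/end period.

```python
from dataclasses import dataclass
from datetime import date

FOREVER = date.max  # stand-in for an open-ended ("until changed") period


@dataclass(frozen=True)
class Entry:
    what: str
    valid_on: date  # valid time: the day the appointment takes place
    tx_from: date   # transaction time: first day the row was in the diary
    tx_to: date     # transaction time: day the row was superseded (exclusive)


# Hypothetical diary contents; the year and the January entry date are assumed.
diary = [
    # Original (wrong) entry, superseded by the correction made on April 5th.
    Entry("appointment", date(2010, 2, 14), date(2010, 1, 10), date(2010, 4, 5)),
    # Corrected entry, recorded on April 5th and still current.
    Entry("appointment", date(2010, 2, 12), date(2010, 4, 5), FOREVER),
]


def as_of(entries, known_on):
    """Return the entries the diary contained on the day `known_on`."""
    return [e for e in entries if e.tx_from <= known_on < e.tx_to]


print([e.valid_on.isoformat() for e in as_of(diary, date(2010, 4, 4))])  # ['2010-02-14']
print([e.valid_on.isoformat() for e in as_of(diary, date(2010, 4, 6))])  # ['2010-02-12']
```

Note that the correction never overwrites the old row; it only closes its transaction period and adds a new row, which is what keeps the April 4th view reproducible.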
This time travel feature of a temporal database makes it possible to record information about how errors are corrected in a database. This is necessary for a true audit picture of data that records when revisions were made and allows queries relating to how data have been revised over time.
Most business information should be stored in this bitemporal scheme in order to provide a true audit record and to maximise business intelligence - hence the need for support in a relational database. Notice that each data item occupies a (possibly unbounded) rectangle in the two-dimensional time model, which is why people often use a GiST index to implement bitemporal indexing. The problem here is that a GiST index is really designed for geographic data and the requirements for temporal data are somewhat different.
PostgreSQL 9.0 exclusion constraints should provide new ways of organising temporal data e.g. transaction and valid time PERIODs should not overlap for the same tuple.
A temporal database efficiently stores a time series of data, typically by having some fixed timescale (such as seconds or even milliseconds) and then storing only changes in the measured data. A timestamp in an RDBMS is a discretely stored value for each measurement, which is very inefficient. A temporal database is often used in real-time monitoring applications like SCADA. A well-established system is the PI database from OSISoft (http://www.osisoft.com/).
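To make the "store only changes" idea concrete, here is a minimal Python sketch. It is not how PI or any particular product works; the `ChangeOnlySeries` class and its method names are assumptions made up for the illustration.

```python
from bisect import bisect_right


class ChangeOnlySeries:
    """Keeps a sample only when the measured value differs from the last one."""

    def __init__(self):
        self._times = []   # timestamps (seconds), strictly increasing
        self._values = []  # value that holds from the matching timestamp onward

    def record(self, t, value):
        """Record a reading taken at time t; readings must arrive in time order."""
        if self._values and self._values[-1] == value:
            return  # unchanged, so nothing new needs to be stored
        self._times.append(t)
        self._values.append(value)

    def value_at(self, t):
        """Return the value in effect at time t, or None before the first sample."""
        i = bisect_right(self._times, t)
        return self._values[i - 1] if i else None

    def __len__(self):
        return len(self._values)


series = ChangeOnlySeries()
for second, reading in enumerate([20.0, 20.0, 20.0, 21.5, 21.5, 20.0]):
    series.record(second, reading)

print(len(series))         # 3 stored samples for 6 readings
print(series.value_at(4))  # 21.5
```

The saving grows with how rarely the measured value changes relative to the sampling rate.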
As I understand it (and over-simplifying enormously), a temporal database records facts about when the data was valid as well as the data itself, and permits you to query on the temporal aspects. You end up dealing with 'valid time' and 'transaction time' tables, or 'bitemporal tables' involving both 'valid time' and 'transaction time' aspects. You should consider reading either of these two books:
Darwen, Date and Lorentzos "Temporal Data and the Relational Model" (out of print),
and (at a radically different extreme) "Developing Time-Oriented Database Applications in SQL", Richard T. Snodgrass, Morgan Kaufmann Publishers, Inc., San Francisco, July, 1999, 504+xxiii pages, ISBN 1-55860-436-7. That is out of print but available as PDF on his web site at cs.arizona.edu (so a Google search makes it pretty easy to find).
Temporal databases are often used in the financial services industry. One reason is that you are rarely (if ever) allowed to delete any data, so ValidFrom - ValidTo type fields on records are used to provide an indication of when a record was correct.
Besides "what new things can I do with it", it might be useful to consider "what old things does it unify?". The temporal database represents a particular generalization of the "normal" SQL database. As such, it may give you a unified solution to problems that previously appeared unrelated. For example:
Web Concurrency: When your database has a web UI that lets multiple users perform standard Create/Update/Delete (CRUD) modifications, you have to face the concurrent web changes problem. Basically, you need to check that an incoming data modification is not affecting any records that have changed since that user last saw those records. But if you have a temporal database, it quite possibly already associates something like a "revision ID" with each record (due to the difficulty of making timestamps unique and monotonically ascending). If so, then that becomes the natural, "already built-in" mechanism for preventing the clobbering of other users' data during database updates.
Legal/Tax Records: The legal system (including taxes) places rather more emphasis on historical data than most programmers do. Thus, you will often find advice about schemas for invoices and such that warns you to beware of deleting records or normalizing in a natural way--which can lead to an inability to answer basic legal questions like "Forget their current address, what address did you mail this invoice to in 2001?" With a temporal framework base, all the workarounds for those problems (they usually are halfway steps to having a temporal database) go away. You just use the most natural schema, and delete when it makes sense, knowing that you can always go back and answer historical questions accurately.
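As a rough illustration of the revision ID check described under Web Concurrency, here is a minimal sketch using Python's built-in sqlite3 module. The `customer` table, its `revision` column and the `update_name` helper are hypothetical names chosen for the example; a real temporal store would have its own way of exposing the revision.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT, revision INTEGER NOT NULL)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Alice', 1)")


def update_name(conn, customer_id, new_name, seen_revision):
    """Apply the update only if the row is still at the revision the user saw."""
    cur = conn.execute(
        "UPDATE customer SET name = ?, revision = revision + 1 "
        "WHERE id = ? AND revision = ?",
        (new_name, customer_id, seen_revision),
    )
    return cur.rowcount == 1  # False means someone else changed the row first


first = update_name(conn, 1, "Alicia", seen_revision=1)   # succeeds
second = update_name(conn, 1, "Alyssa", seen_revision=1)  # stale, rejected
print(first, second)  # True False
```

The second update is rejected because it was based on revision 1, which the first update had already replaced.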
On the other hand, the temporal model itself is half-way to complete revision control, which could inspire further applications. For example, suppose you roll your own temporal facility on top of SQL and allow branching, as in revision control systems. Even limited branching could make it easy to offer "sandboxing" -- the ability to play with and modify the database with abandon without causing any visible changes to other users. That makes it easy to supply highly realistic user training on a complex database.
Simple branching with a simple merge facility could also simplify some common workflow problems. For example, a non-profit might have volunteers or low-paid workers doing data entry. Giving each worker their own branch could make it easy to allow a supervisor to review their work or enhance it (e.g., de-duplication) before merging it into the main branch where it would become visible to "normal" users. Branches could also simplify permissions. If a user is only granted permission to use/see their unique branch, you don't have to worry about preventing every possible unwanted modification; you'll only merge the changes that make sense anyway.
Apart from reading the Wikipedia article? A database that maintains an "audit log" or similar transaction log will have some properties of being "temporal". If you need answers to questions about who did what to whom and when then you've got a good candidate for a temporal database.
You can imagine a simple temporal database that just logs your GPS location every few seconds. The opportunities for compressing this data are great, whereas in a normal database you would need to store a timestamp for every row. If you need a great deal of throughput, knowing that the data is temporal and that updates and deletes to a row will never be required permits the program to drop a lot of the complexity inherent in a typical RDBMS.
Despite this, temporal data is usually just stored in a normal RDBMS. PostgreSQL, for example, has some temporal extensions, which make this a little easier.
Two reasons come to mind:
Some are optimized for insert and read only and can offer dramatic perf improvements
Some have better understandings of time than traditional SQL - allowing for grouping operations by second, minute, hour, etc
Just an update, Temporal database is coming to SQL Server 2016.
To clear up any doubts about why one needs a temporal database rather than a custom-built solution, and how efficiently and seamlessly SQL Server configures it for you, check the in-depth video and demo on Channel9.msdn here: https://channel9.msdn.com/Shows/Data-Exposed/Temporal-in-SQL-Server-2016
MSDN link: https://msdn.microsoft.com/en-us/library/dn935015(v=sql.130).aspx
Currently with the CTP2 (beta 2) release of SQL Server 2016 you can play with it.
Check this video on how to use Temporal Tables in SQL Server 2016.
My understanding of temporal databases is that they are geared towards storing certain types of temporal information. You could simulate that with a standard RDBMS, but by using a database that supports it you have built-in idioms for a lot of concepts, and the query language might be optimized for these sorts of queries.
To me this is a little like working with a GIS-specific database rather than an RDBMS. While you could shove coordinates in a run-of-the-mill RDBMS, having the appropriate representations (e.g., via grid files) may be faster, and having SQL primitives for things like topology is useful.
There are academic databases and some commercial ones. Timecenter has some links.
Another example of where a temporal database is useful is where data changes over time. I spent a few years working for an electricity retailer where we stored meter readings for 30 minute blocks of time. Those meter readings could be revised at any point but we still needed to be able to look back at the history of changes for the readings.
We therefore had the latest reading (our 'current understanding' of the consumption for the 30 minutes) but could look back at our historic understanding of the consumption. When you've got data that can be adjusted in such a way temporal databases work well.
(Having said that, we hand carved it in SQL, but it was a fair while ago. Wouldn't make that decision these days.)
I am trying to collect information about temporal databases. I know it is not a modern technology, but I saw that many people who work with databases don't even know how the temporal approach works (I asked some senior programmers and system analysts about temporal databases and they answered something like "Huh?").
I know there are valid-time state tables and transaction-time state tables, along with bitemporal tables. I think that bitemporal tables are way too complex for most usages, because nowadays space is not a problem anymore, and it is more efficient to write the same information on 2 different tables, even if data is redundant. However, I made many searches online trying to see where bitemporal tables are actually used, but I didn't find anything useful.
Are there cases where using a bitemporal table is really more convenient than using separate valid-time and transaction-time state tables? Are there real-world examples?
Of course! Take for example, balance sheet data. You will find that this information will change from WD1 (Working Day) to WD x due to late arriving data, adjustments, manual errors and suchlike.
In order to enable repeatable reporting, audit trail and temporal comparisons, a record must be kept of 'old' (invalid?) results. Bitemporal is a great way to manage such updates, especially on an intraday basis. I don't think it's that complicated from a user perspective - just another filter on the where clause.
I admit that the loading process is complicated, but it's not that bad.. I literally just finished writing a generic transform (in SAS, coping with all scenarios for a unique business key) and it took a single day.
Coming back to use cases.. Having both valid (business) time and transaction (version) time on the same table enables:
Repeatable results (having separate tables and corresponding updates could mean getting different results for the same query on two different days)
Comparable results (can answer questions such as "what was the value of X, as we knew it at time Y?")
Rapid results (only dealing with a single table, updated in a transparent and easy-to-query way).
In this sense it is an appropriate structure to use on many, if not all tables in a DWh.
UPDATE 2020: A bitemporal data transform for SAS (both SAS 9 and Viya) is available with Data Controller for SAS. A demo version is available: https://docs.datacontroller.io/dcc-tables/#var_busfrom-var_busto
I think your question raises more issues but it all comes down to how much is enough.
I developed a Bi_Temporal SQL Server engine that supports object versioning and relationship by time as well as all the other beautiful parts of Temporal DB's.
This was because the project needed to be able to be rewound to a place in time and see everything as it was at that time.
I mean everything including data, relationships and User access.
It was the most complex thing I have built but in the end it was so complex no-one else could maintain it, or understand what was happening.
So there was a real world use case and a deliverable.
It is not everyone's cup of tea, as you have to be able to think in the time dimension as well as in terms of object version changes, as all DBs do.
Hope this helps someone. I know the post is old but as it was the first I found when searching Temporal DB's it might be of interest to someone.
I have worked with a few databases up to now and the philosophies were very different. It got me wondering,
Is it a good idea to duplicate tables for historic purposes in a business application?
By business application I mean:
software used by an enterprise to manage all of its data (e.g. invoices, clients, stock [if applicable], etc.)
By 'duplicating tables' I mean:
when, let's say, your invoices go out of date (e.g. a year after being invoiced and paid), you can store them in 'historic' tables, which keeps them available for consultation but means they shouldn't be modified. The same goes for clients who have been inactive for years.
Pros :
Using historic tables can speed up searches through the data in active use, since it makes your active tables smaller.
Better separation of historic and active data
Easier to remove data from the database and store it on hard media without affecting your database (more predictable because the data had no chance of being used, since it was in a historic table). This often happens after 10 years, when you have unused data.
Cons :
Makes your database have up to 2 times more tables.
Makes your database more complex
Makes your program more complex for reports, since you sometimes have to import twice the number of tables.
Archiving is a key aspect of enterprise applications, but in general, I'd recommend against it unless you really, really need it.
Archiving means you either accept you can't get at historical data before a specific date, or that you create some scheme for managing "current" and "historical" data; your solution (archive tables) is one solution to this problem.
Neither solution is all that nice - archive tables mean lots of duplicated code/data, complex archival procedures (esp. with foreign key relationships), lots of opportunity for errors.
I do believe the concept of "time" should be baked into the domain and data model for most business applications, along with mutability - you shouldn't be able to change an order once it's been confirmed, but you should be able to add products to a new order.
As for your pros:
In general, I don't think you'd notice the performance impact unless you're talking about very, very large scale businesses. I don't think - on modern SQL server solutions - you'd notice the speed difference between querying 10.000 customer records or 1.000.000 customer records.
The definition of "historic" is actually rather tricky - most businesses have to keep historical data around for regulatory and tax purposes, often for many years; they'll probably want to be able to analyse trends over several years, etc. If the business wants to see "how many widgets did we sell per month over the last 5 years", that means you have to keep 5 years of data around somehow (either "raw" or pre-aggregated).
Yes, separating out data would be easier. Building a feature today - which you have to maintain every time you change the application - for pay-off in 10 years seems a poor investment to me...
I would only have a "duplicate" type table to store historic VERSIONS of each record, like a change log. Even a change log is not a duplicate, as it would have to have info on when the record was changed, etc. As a general practice, I would not recommend migrating rows from an active to a historical table. You'd have to manage different versions of queries to find the data in two places! Use a status to control whether the data can be changed. I could see it being done if there are certain circumstances for a particular application. Once you start adding foreign keys, it becomes difficult to remove data. If you had a truly enterprise business application and you attempted to remove invoices, you would have all sorts of issues with FKs to other tables, accounts payable/receivable, costs of raw materials, profits from sales, shipping info, etc.
Our master's thesis project is creating a database schema analyzer. As a foundation for this, we are working on quantifying bad database design.
Our supervisor has tasked us with analyzing a real world schema, of our choosing, such that we can identify some/several design issues. These issues are to be used as a starting point in the schema analyzer.
Finding a good schema is a bit difficult because we do not want a schema which is well designed in all aspects, but a schema that is more "rare to medium".
We have already scheduled the following schemas for analysis: wikimedia, moodle and drupal. Not sure in which category each fit. It is not necessary that the schema is open source.
The database engine used is not important, though we would like to focus on SQL Server, PostgreSQL and Oracle.
For now literature will be deferred, as this task is supposed to give us real world examples which can be used in the thesis. i.e. "Design X is perceived by us as bad design, which our analyzer identifies and suggests improvements to", instead of coming up with contrived examples.
I will update this post when we have some kind of a tool ready.
Check the Dell-dvd-store, you can use it for free.
The Dell DVD Store is an open source simulation of an online ecommerce site with implementations in Microsoft SQL Server, Oracle and MySQL along with driver programs and web applications
Bill Karwin has written a great book about bad designs: SQL Antipatterns
I'm working on a project including a geographical information system. And in my opinion these designs are often "medium" to "rare".
Here are some examples:
1) Geonames.org
You can find the data and the schema here: http://download.geonames.org/export/dump/ (scroll down to the bottom of the page for the schema, it's in plain text on the site !)
It'd be interesting how this DB design performs with such a HUGE amount of data!
2) OpenGeoDB
This one is very popular in German-speaking countries (Germany, Austria, Switzerland) because it's a database containing nearly every city/town/village in the German-speaking region with zip code, name, hierarchy and coordinates.
This one comes with a .sql schema and the table fields are in english, so this shouldn't be a problem.
http://fa-technik.adfc.de/code/opengeodb/
The interesting thing in both examples is how they managed the hierarchy of entities like Country -> State -> County -> City -> Village etc.
PS: Maybe you could judge my DB design too ;) DB Schema of a Role Based Access Control
vBulletin has a really bad database schema.
"we are working on quantifying bad database design."
It seems to me like you are developing a model, or process, or apparatus, that takes a relational schema as input and scores it for quality.
I invite you to ponder the following:
Can a physical schema be "bad" while the logical schema is nonetheless "extremely good" ? Do you intend to distinguish properly between "logical schema" and "physical schema" ? How do you dream to achieve that ?
How do you decide that a certain aspect of physical design is "bad" ? Take for example the absence of some index. If the relvar that that "supposedly desirable index" is to be on, is itself constrained to be a singleton, then what detrimental effects would the absence of that index cause for the system ? If there are no such detrimental effects, then what grounds are there for qualifying the absence of such an index as "bad" ?
How do you decide that a certain aspect of logical design is "bad" ? Choices in logical design are done as a consequence of what the actual requirements are. How can you make any judgment whatsoever about a logical design, without a formalized and machine-readable way to specify what the actual requirements are ?
Wow - you have an ambitious project ahead of you. To determine what is a good database design may be impossible, except for broadly understood principles and guidelines.
Here are a few ideas that come to mind:
I work for a company that does database management for several large retail companies. We have custom databases designed for each of these companies, according to how they intend for us to use the data (for direct mail, email campaigns, etc.), and what kind of analysis and selection parameters they like to use. For example, a company that sells musical equipment in stores and online will want to distinguish between walk-in and online customers, categorize the customers according to the type of items they buy (drums, guitars, microphones, keyboards, recording equipment, amplifiers, etc.), and keep track of how much they spent, and what they bought, over the past 6 months or the past year. They use this information to decide who will receive catalogs in the mail. These mailings are very expensive; maybe one or two dollars per customer, so the company wants to mail the catalogs only to those most likely to buy something. They may have 15 million customers in their database, but only 3 million buy drums, and only 750,000 have purchased anything in the past year.
If you were to analyze the database we created, you would find many "work" tables, that are used for specific selection purposes, and that may not actually be properly designed, according to database design principles. While the "main" tables are efficiently designed and have proper relationships and indexes, these "work" tables would make it appear that the entire database is poorly designed, when in reality, the work tables may just be used a few times, or even just once, and we haven't gone in yet to clear them out or drop them. The work tables far outnumber the main tables in this particular database.
One also has to take into account the volume of the data being managed. A customer base of 10 million may have transaction data numbering 10 to 20 million transactions per week. Or per day. Sometimes, for manageability, this data has to be partitioned into tables by date range, and then a view would be used to select data from the proper sub-table. This is efficient for this huge volume, but it may appear repetitive to an automated analyzer.
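The partition-plus-view arrangement mentioned above might look roughly like this, sketched with Python's built-in sqlite3 module. The table names, the yearly split and the amounts are invented for illustration; real systems would normally use the partitioning features of their own database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One physical table per date range (hypothetical yearly split).
    CREATE TABLE sales_2023 (sold_on TEXT NOT NULL, amount_cents INTEGER NOT NULL);
    CREATE TABLE sales_2024 (sold_on TEXT NOT NULL, amount_cents INTEGER NOT NULL);

    -- The view presents the partitions as a single logical table.
    CREATE VIEW sales AS
        SELECT sold_on, amount_cents FROM sales_2023
        UNION ALL
        SELECT sold_on, amount_cents FROM sales_2024;

    INSERT INTO sales_2023 VALUES ('2023-12-31', 1500);
    INSERT INTO sales_2024 VALUES ('2024-01-01', 2500);
    """
)

(total,) = conn.execute("SELECT SUM(amount_cents) FROM sales").fetchone()
print(total)  # 4000
```

An automated analyzer that only looks at table definitions would see two near-identical tables here, which is exactly the kind of false positive described above.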
Your analyzer would need to be user configurable before the analysis began. Some items must be skipped, while others may be absolutely critical.
Also, how does one analyze stored procedures and user-defined functions, etc? I have seen some really ugly code that works quite efficiently. And, some of the ugliest, most inefficient code was written for one-time use only.
OK, I am out of ideas for the moment. Good luck with your project.
If you can get ahold of it, the project management system Clarity has a horrible database design. I don't know if they have a trial version you can download.
I want to design a database which will keep records of financial transactions. I want to design it as a product so that it can be used for any type of financial transaction. Are there some design principles specific to financial transaction database design that can help me make the database more durable in the long term, with minimal architectural-level changes? Some good examples would be a great help too.
Thanks
Some things particular to financial systems include internal controls (this is a critical accounting term; do some research to really think this one through). Things like the person entering the check value can't also approve it. Things like using stored procs and not SQL generated from the application, so that you can restrict rights to only the procs (no dynamic SQL at all - ever - in a financial system) and so users can only do what they are authorized to do. No rights to the tables for anyone except the production DBA and an alternate. Fraud is what you are trying to protect the system from, not just outside attacks. Security is critical to financial systems.
You also need audit tables to know who changed what data and when, and what the old value was. This is not only an additional way to help find problems if someone got around the internal controls (or the system forgot to implement some critical ones) and stole money, but it is often critical to be able to undo a mistake without having to restore. In general, accounting systems often have data fields that are not viewable by the user and that are generated through default values or in a way that the user doesn't see.
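One common way to fill such an audit table is a trigger. The sketch below uses Python's built-in sqlite3 module purely for illustration; the `payment` and `payment_audit` tables and the `changed_by` column are hypothetical, and a production system would take the user from the session rather than from a column supplied by the caller.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Amounts are kept as text here only to sidestep float rounding in the demo.
    CREATE TABLE payment (
        id INTEGER PRIMARY KEY,
        amount TEXT NOT NULL,
        changed_by TEXT NOT NULL
    );

    CREATE TABLE payment_audit (
        payment_id INTEGER NOT NULL,
        old_amount TEXT,
        new_amount TEXT,
        changed_by TEXT NOT NULL,
        changed_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP  -- filled in by default
    );

    -- Record who changed the amount, plus the old and the new value.
    CREATE TRIGGER payment_audit_update AFTER UPDATE OF amount ON payment
    BEGIN
        INSERT INTO payment_audit (payment_id, old_amount, new_amount, changed_by)
        VALUES (OLD.id, OLD.amount, NEW.amount, NEW.changed_by);
    END;
    """
)
conn.execute("INSERT INTO payment VALUES (1, '100.00', 'alice')")
conn.execute("UPDATE payment SET amount = '120.00', changed_by = 'bob' WHERE id = 1")

rows = conn.execute(
    "SELECT payment_id, old_amount, new_amount, changed_by FROM payment_audit"
).fetchall()
print(rows)  # [(1, '100.00', '120.00', 'bob')]
```

Because the old value is kept, a mistaken update can be reversed from the audit row without restoring a backup.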
Another thing is you need to view actions in time so things that might look like a natural relationship may need denormalizing to preserve what the cost was at the time the action happened. So if you have an hourly rate table, you would use that as a lookup to get the rate at the time of the action not join to it to get the rate when you query.
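A minimal sketch of that lookup-and-copy approach, again using Python's sqlite3 module. The `hourly_rate` and `work_entry` tables, the `log_work` helper and the rates are all made-up examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE hourly_rate (role TEXT PRIMARY KEY, rate_cents INTEGER NOT NULL);
    CREATE TABLE work_entry (
        id INTEGER PRIMARY KEY,
        role TEXT NOT NULL,
        hours INTEGER NOT NULL,
        rate_cents INTEGER NOT NULL  -- copied from hourly_rate when the work is logged
    );
    INSERT INTO hourly_rate VALUES ('consultant', 10000);
    """
)


def log_work(conn, role, hours):
    """Look the rate up once, at entry time, and store it with the work entry."""
    (rate,) = conn.execute(
        "SELECT rate_cents FROM hourly_rate WHERE role = ?", (role,)
    ).fetchone()
    conn.execute(
        "INSERT INTO work_entry (role, hours, rate_cents) VALUES (?, ?, ?)",
        (role, hours, rate),
    )


log_work(conn, "consultant", 3)
conn.execute("UPDATE hourly_rate SET rate_cents = 12000 WHERE role = 'consultant'")
log_work(conn, "consultant", 2)

costs = conn.execute("SELECT hours * rate_cents FROM work_entry ORDER BY id").fetchall()
print(costs)  # [(30000,), (24000,)]
```

Had the report joined to `hourly_rate` at query time instead, the first entry would have been repriced at the new rate.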
Financial systems have private data in them, almost always, think how you are going to protect this data. You will need to be encrypting and decrypting data. You probably want an encrypted backup as well.
This data is the lifeblood of a company, it is critical that you have a good backup plan and much practice restoring. Off-site backups are critical.
Data integrity is critical. You need the correct datatypes and you need pk/fk relationships, constraints and triggers to enforce the rules. A financial system can't afford to have orphaned records.
You need to consider deletes very carefully. Financial systems often do soft deletes (mark records as deleted) to avoid losing historical data. Yes, XYZ company is no longer a customer, but you don't want to lose the financial history of the orders they had in the past. I would not even consider using cascade delete in a financial system.
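A soft delete can be as simple as a nullable marker column plus a view for day-to-day use. This is a sketch under assumed names (`customer`, `customer_order`, `deleted_at`, `active_customer`), using Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customer (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        deleted_at TEXT  -- NULL while the customer is active
    );
    CREATE TABLE customer_order (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(id),
        total_cents INTEGER NOT NULL
    );
    -- Everyday queries use the view and never see soft-deleted rows.
    CREATE VIEW active_customer AS
        SELECT id, name FROM customer WHERE deleted_at IS NULL;

    INSERT INTO customer (id, name) VALUES (1, 'XYZ');
    INSERT INTO customer_order VALUES (1, 1, 5000);
    """
)

# "Delete" the customer by marking the row instead of removing it.
conn.execute("UPDATE customer SET deleted_at = CURRENT_TIMESTAMP WHERE id = 1")

active = conn.execute("SELECT name FROM active_customer").fetchall()
history = conn.execute(
    "SELECT c.name, o.total_cents FROM customer_order o "
    "JOIN customer c ON c.id = o.customer_id"
).fetchall()
print(active, history)  # [] [('XYZ', 5000)]
```

The customer disappears from the active list, but the order history still joins to a real row.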
Don't just talk to accountants in designing the system, talk to financial people who will run the system and auditors who will audit the results. Read and know thoroughly the published accounting standard for the country you are designing for. Look at tax implications. This is complex stuff.
Think about data warehousing and archiving data. Financial systems often query old data for reports, reporting is big, big, big for financial systems. Think how to do it effectively without affecting day-to-day data entry.
Depending on what you are actually trying to achieve, for you to create a "financial transaction" system that is useful you will need to teach yourself about journals, ledgers and other details of accounting. It isn't as simple as logging the actual transactions in a table...
Really, I don't think you will find database design principles for financial systems that are all that different from those for any database system that needs its information to be 100% correct.
Hence, reading the following when working with databases never hurt anyone:
Database Design Best Practices
Do you source control your databases?
Database Development Mistakes Made by App Developers
What are some of your most useful database standards?
I have a project involving a web voting system. The current values and related data are stored in several tables. Historical data will be an important aspect of this project, so I've also created audit tables to which current data will be moved on a regular basis.
I find this strategy highly inefficient. Even if I only archive data on a daily basis, the number of rows will become huge even if only 1 or 2 users make updates on a given day.
The next alternative I can think of is only storing entries that have changed. This will mean having to build logic to automatically create a view of a given day. This means less stored rows, but considerable complexity.
My final idea is a bit less conventional. Since the historical data will be for reporting purposes, there's no need for web users to have quick access. I'm thinking that my db could have no historical data in it. DB only represents current state. Then, daily, the entire db could be loaded into objects (number of users/data is relatively low) and then serialized to something like XML or JSON. These files could be diffed with the previous day and stored. In fact, SVN could do this for me. When I want the data for a given past day, the system has to retrieve the version for that day and deserialize into objects. This is obviously a costly operation but performance is not so much a concern here. I'm considering using LINQ for this which I think would simplify things. The serialization procedure would have to be pretty organized for the diff to work well.
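To get a feel for the serialize-and-diff idea, here is a small Python sketch using only the standard json and difflib modules (Python rather than LINQ, purely for illustration). The `snapshot` and `daily_diff` helpers and the sample vote data are hypothetical; the point is that sorted keys and fixed indentation keep the diff limited to real changes.

```python
import difflib
import json


def snapshot(state):
    """Serialize deterministically so that equal states give identical text."""
    return json.dumps(state, indent=2, sort_keys=True)


def daily_diff(previous, current):
    """Return a unified diff between two snapshots as a list of lines."""
    return list(
        difflib.unified_diff(
            previous.splitlines(),
            current.splitlines(),
            fromfile="previous",
            tofile="current",
            lineterm="",
        )
    )


monday = snapshot({"votes": {"eyes": 3, "smile": 5}})
tuesday = snapshot({"votes": {"smile": 5, "eyes": 4}})  # key order differs on purpose

# Keep only the added/removed lines, skipping the "---"/"+++" file headers.
changes = [
    line
    for line in daily_diff(monday, tuesday)
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]
print(changes)
```

Only the changed `eyes` count shows up in `changes`; the reordered keys produce no noise because of `sort_keys=True`.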
Which approach would you take?
Thanks
If you're basically wondering how revisions of data are stored in relational databases, then I would look into how wikis do it.
Wikis are all about keeping detailed revision history. They use simple relational databases for storage.
Consider Wikipedia's database schema.
All you've told us about your system is that it involves votes. As long as you store timestamps for when votes were cast you should be able to generate a report describing the vote state tally at any point in time... no?
For example, say I have a system that tallies favorite features (eyes, smile, butt, ...). If I want to know how many votes there were for a particular feature as of a particular date, then I would simply tally all the votes for the feature with a timestamp smaller or equal to that date.
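In code, such an as-of tally is a single filtered count. The sketch below uses Python's sqlite3 module; the `vote` table, its `cast_at` column and the sample timestamps are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE vote (feature TEXT NOT NULL, cast_at TEXT NOT NULL)")
conn.executemany(
    "INSERT INTO vote VALUES (?, ?)",
    [
        ("eyes", "2010-03-01T09:00:00"),
        ("smile", "2010-03-02T12:30:00"),
        ("eyes", "2010-03-05T18:45:00"),
    ],
)


def tally_as_of(conn, feature, moment):
    """Count votes for `feature` cast at or before `moment` (ISO 8601 text)."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM vote WHERE feature = ? AND cast_at <= ?",
        (feature, moment),
    ).fetchone()
    return count


print(tally_as_of(conn, "eyes", "2010-03-03T00:00:00"))  # 1
print(tally_as_of(conn, "eyes", "2010-03-31T00:00:00"))  # 2
```

ISO 8601 timestamps in a fixed format compare correctly as plain text, which is why the `<=` filter works here.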
If you want to have a history of other things, then you would follow a similar approach.
I think this is the way it is done.
Have you considered using a real version control system rather than trying to shoehorn a database in its place? I myself am quite partial to git, but there are many options. They all have good support for differences between versions, and they tend to be well optimised for this kind of workload.