Temporal database modeling and normalisation

Should the dates for a temporal database be stored in one table or in two? If one, doesn't this violate normalisation?
PERSON1 DATE11 DATE12 INFO11 INFO12 DEPRECATED
PERSON2 DATE21 DATE22 INFO21 INFO22 CURRENT
PERSON1 DATE31 DATE32 INFO31 INFO32 CURRENT
The DATE1 and DATE2 columns indicate that INFO1 and INFO2 are true for the period between DATE1 and DATE2. If DATE2 < TODAY, the facts are deprecated and should no longer be shown in the user interface, but they shouldn't be deleted, for historical purposes. For example, INFO11 and INFO12 are now deprecated.
Should I split this table? Should I store the state (deprecated or current) in the table?
To clarify the question further: "deprecated" is the term used by the business; if you prefer, read it as "not current". The problem is not one of semantics, and it is not about SQL queries either. I just want to know which design violates, or best suits, normalisation rules (I know normalisation is not always the way to go; that is not my question either).

"I want to know which design violates Normalisation rules"
Depends on which set of normalization rules you want to go by.
The first and most likely violation of normal forms (in Date's book it is a violation of first NF) is the end date in the rows that hold "current" information (leaving aside the possibility of future-dated information): you violate 1NF if you make that attribute nullable.
Violations of BCNF may obviously occur as a consequence of your choice of keys (as is the case in nontemporal database designs too - the temporal aspect makes no difference here). Regarding "choice of keys": if you use separate start and end dates (and SQL pretty much leaves you no other choice), then most likely you should declare TWO keys: one that includes the start date, and one that includes the end date.
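A minimal sketch of what those two keys might look like in SQL, with illustrative names (person_facts, valid_from and valid_to stand in for the question's DATE1 and DATE2; they are assumptions, not part of the original design):

    -- Illustrative only. A sentinel "end of time" date keeps valid_to
    -- non-nullable (the 1NF concern above), and the two keys cover the
    -- start date and the end date respectively.
    CREATE TABLE person_facts (
        person_id   INTEGER      NOT NULL,
        valid_from  DATE         NOT NULL,
        valid_to    DATE         NOT NULL DEFAULT DATE '9999-12-31',
        info1       VARCHAR(100) NOT NULL,
        info2       VARCHAR(100) NOT NULL,
        CONSTRAINT pk_person_period PRIMARY KEY (person_id, valid_from),
        CONSTRAINT uq_person_period UNIQUE      (person_id, valid_to),
        CONSTRAINT ck_period        CHECK       (valid_from < valid_to)
    );

Note that none of these constraints prevents overlapping periods for the same person; in most SQL products that still requires a trigger or application logic.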
Another design issue is the multiple data columns. This issue is discussed at length in "Temporal Data and the Relational Model": if INFO1 and INFO2 can change independently of one another, it might be better to decompose your tables to hold just one attribute each, in order to avoid the "explosion of the row count" that might otherwise occur if you have to create a complete new row every time a single attribute in the row changes. In that case, the design as you gave it violates SIXTH normal form, as that normal form is defined in "Temporal Data and the Relational Model".
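To make the decomposition concrete, here is a rough 6NF-style sketch under the same illustrative naming (an assumption on my part, not the book's exact example): each independently changing attribute lives in its own table with its own validity period, so a change to INFO1 no longer forces a new row carrying an unchanged INFO2.

    CREATE TABLE person_info1 (
        person_id   INTEGER      NOT NULL,
        valid_from  DATE         NOT NULL,
        valid_to    DATE         NOT NULL,
        info1       VARCHAR(100) NOT NULL,
        CONSTRAINT pk_person_info1 PRIMARY KEY (person_id, valid_from)
    );

    CREATE TABLE person_info2 (
        person_id   INTEGER      NOT NULL,
        valid_from  DATE         NOT NULL,
        valid_to    DATE         NOT NULL,
        info2       VARCHAR(100) NOT NULL,
        CONSTRAINT pk_person_info2 PRIMARY KEY (person_id, valid_from)
    );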

Normalization is a Relational database concept - it does not apply as well to temporal databases. That's not to say that you cannot store temporal data in a relational database. You definitely can.
But if you are going with Temporal Database Design, then the concepts of Temporal Normalization apply rather than Relational normalization.

You have not indicated the meaning of the dates. Do they refer to (a) the period when the stated fact was true in real life, or (b) the period when the stated fact was believed to be true by the holder of the database? If (b), then I would never do it this way: move the updated row to an archive table/log immediately when the update is done. If (a), then the following statement is questionable:
"the facts are deprecated and shouldn't show any more in the user interface"
If a fact doesn't "need to show up in the user interface" anymore, then it doesn't need to be in the database anymore either. Keeping such facts there achieves only one thing: it degrades performance for everything else.
If you really do need these historical statements of fact to meet your requirements, then chances are that your so-called "deprecated facts" are still very much relevant to the business, and therefore not "deprecated" at all. Assuming that, for this reason, there are very few "genuinely deprecated" facts in your database, your design is good. Just keep the number of "genuinely deprecated" facts small by periodically removing them from the operational database.
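If you go down that road, the housekeeping can be as simple as a periodic job along these lines (a sketch only; the table names are illustrative and the date arithmetic syntax varies by DBMS):

    -- Move facts whose period ended before some retention cutoff into an
    -- archive table, then remove them from the operational table.
    -- Run both statements in a single transaction.
    INSERT INTO person_facts_archive
    SELECT *
    FROM   person_facts
    WHERE  valid_to < CURRENT_DATE - INTERVAL '2' YEAR;

    DELETE FROM person_facts
    WHERE  valid_to < CURRENT_DATE - INTERVAL '2' YEAR;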
(PS) To say that your design is good doesn't mean you won't run into any problems. SQL is extremely ill-suited to handling this kind of information elegantly. "Temporal Data and the Relational Model" is an excellent treatment of the subject. Another book, the one by Snodgrass, is often praised too, though not by me. That one is something of a cookbook with recipes for dealing with these problems in SQL, as illustrated by the following exchange on SO about it:
(Q) "Why would I read that ?"
(A) "Because the trigger you asked for is on page 135."

Related

ERD without relationships

I haven't touched databases since I graduated from school, so please forgive me if my question is too entry-level.
I still remember how to draw an ERD with UML. Recently, my boss asked me to create a database for an inventory system with a frontend. I googled some similar systems and found that, in the backend, their databases don't have any relationships between tables (I reverse-engineered the UML from the DB).
So I thought about it: the application seems to work fine even without relationships (no foreign keys), so what reasons do we still have to define relationships between tables?
This is one of the areas where there is often a noticeable disparity between the theory that is taught in CS courses and the reality of what happens in practice.
Often what you'll run into is a mash-up between the two: an ERD model that shows all the proper relationships and keys, and the "reality" of what actually gets implemented in the database.
The implementation side is probably the part that catches people by surprise, as you have seen: no relationships defined, and foreign keys are simply implied by the matching column names across different tables. This is a tradeoff.
On one hand, managing foreign keys in a database has overhead. Every time a row is added or modified, the database will need to examine those foreign keys and make sure that the change will preserve relational integrity. After all, that's what you are asking for when you define those relationships, right? And in an ideal world where that overhead is negligible, this is probably a good thing, because as DBAs we like it when our physical implementation matches the idealized model we spent all that time creating. We sleep better knowing that every entry in the customer table references a valid location in the company_location table.
On the other hand, there is reality. That overhead is not something we can easily ignore. Not when that nightly batch load is 4 hours late, and some marketing manager is asking you every 10 minutes for an estimate on when his data will be available. So we cut some corners and make some compromises. And hey, we're pretty good programmers, right? Certainly we can code the application in a way that will always maintain the referential integrity of the database without having to spend all that extra time dealing with foreign keys in the database... well, maybe. The truth is that it is really hard to be sure that RI will always be preserved by an application that is already implementing some potentially complex business logic.
There are, of course, many other reasons for using explicit RI, and plenty of good reasons for ignoring it in the physical implementation. You are right, at the end of the day applications often do work OK without relationships being defined. And at the end of the day, I will probably get home safe even if I don't put on my seatbelt for the drive. But having the relationships implemented in the database is a pretty solid insurance policy when it comes to guaranteeing the integrity of our data. Analysts that use the database to generate business insights like consistent data. And transactional applications might depend on the assumption that the data is relationally consistent.
I guess my point is that there is no "always right" answer here, and it really is a case-by-case thing. I would just suggest starting from the assumption that you'll physically implement the model, complete with RI, as faithfully as possible. Then, if you find hot spots, carefully and conservatively relax those constraints as needed.
Foreign keys take care of referential integrity.
To explain this a bit more: by adding a foreign key you are saying "what is in this column must also be in the column I am pointing to". This makes sure your naming stays consistent.
If you did not do this, mistakes could be introduced when adding redundant information, like calling "James" "Jamed" by mistake.
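To make that concrete, here is a minimal sketch using the customer/company_location example mentioned earlier (the column names are made up for illustration):

    CREATE TABLE company_location (
        location_id   INTEGER     NOT NULL PRIMARY KEY,
        location_name VARCHAR(80) NOT NULL
    );

    CREATE TABLE customer (
        customer_id   INTEGER     NOT NULL PRIMARY KEY,
        customer_name VARCHAR(80) NOT NULL,
        location_id   INTEGER     NOT NULL,
        CONSTRAINT fk_customer_location
            FOREIGN KEY (location_id) REFERENCES company_location (location_id)
    );

    -- This insert is rejected unless location 999 actually exists:
    INSERT INTO customer (customer_id, customer_name, location_id)
    VALUES (1, 'Acme Ltd', 999);

That rejection is exactly the guarantee you give up when the relationship lives only in application code.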

Low normal form database design on a relational RDBMS

A friend of mine is creating an application after she designed the database. However, she just put everything in two 1NF tables, which have a few functional dependencies on non-PK columns. Indeed, she only added the PKs after I suggested it to her. A 3rd+ NF design of her system might require 7 to 15 tables.
I pointed out that there will be a couple of possible update anomalies. However, she said she has stored procedures for insertion and updates, so the update anomalies will never happen, since she will force the users/applications to insert/update the data via the stored procedures.
Is there any other good reason to persuade her to design the system with higher normal form? Or is her solution good enough practically?
She is probably unskilled and unaware of it. It has already been demonstrated at large that there is no way to help such people.
She probably thinks that she can get all the code in those SPs right, without making a single mistake. That is more than likely a ludicrous overestimation of her own abilities.
I advise you to simply let her make her mistakes. For YOU, it will save the time of trying and the frustration of not succeeding in getting the message through, and for HER, it will give her the opportunity to learn from her own mistakes (the only option left, since she's apparently not really willing to learn from other people's mistakes), and it will save her the frustration of having you come round later with that glorious "I told you so".
But if you really insist on giving it a try, then you might point out to her that:
her 1NF tables are actually views (materialized views at that) on the 3/BC NF tables that she is willingly not implementing.
the updates that her end-user probably expects to be doing, are most likely, however, updates to those 7-15 3/BC NF tables. (I say this because you indicated as her sole reason for the lower NF that she wants to avoid such a "huge" number of tables - the cynicism in "huge" is intentional.)
but the updates that the DBMS is expecting, are updates to the 1NF tables (which are views).
Therefore, her problem boils down to "distilling the appropriate updates to 3NF tables from updates that are specified as updates to views on those tables".
Therefore, getting all her SP code to be correct, is a problem of how to do view updating.
And guess what, after more than 40 years of research on the relational model, by thousands of researchers whose intellectual abilities exceed hers probably by orders of magnitude, it is exactly this problem of view updating that still stands unsolved ... (although granted, her problem is not "view updating in general", but "view updating in some given specific situation". Nonetheless, I very much doubt that she has the skill to spot all the intricacies involved.)
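A hypothetical illustration of why this is hard: suppose the wide table she wants to keep is really a view over two of the 3NF tables (all names here are invented for the example).

    CREATE TABLE city (
        city_id   INTEGER     NOT NULL PRIMARY KEY,
        city_name VARCHAR(80) NOT NULL
    );

    CREATE TABLE customer (
        customer_id   INTEGER     NOT NULL PRIMARY KEY,
        customer_name VARCHAR(80) NOT NULL,
        city_id       INTEGER     NOT NULL REFERENCES city (city_id)
    );

    CREATE VIEW customer_flat AS
    SELECT c.customer_id, c.customer_name, t.city_name
    FROM   customer c
    JOIN   city     t ON t.city_id = c.city_id;

An update against customer_flat that changes city_name is ambiguous: rename the city for every customer located there, or move this one customer to a different (possibly new) city? Every stored procedure she writes has to answer that kind of question correctly, for every combination of columns a user may touch.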

Alternatives to Entity-Attribute-Value (EAV)?

Our database is designed based on the EAV (Entity-Attribute-Value) model. Those who have worked with EAV models know all the crap that comes with it for the purpose of flexibility.
I asked my client about the reasons for using the EAV model (flexibility), and their response was: their entities change over time. So, today they may have a table with a few attributes, but in a month's time a few new attributes may be added, or an existing attribute may be renamed. They need to produce reports to get back to any stage in time and query the data based on the shape of the entities at that stage.
I understand this is not feasible with a conventional relational model, but I personally see EAV as an anti-pattern. Are there any alternative models that enable us to capture the time dimension in changes to the entities and instances?
Cheers,
Mosh
There is a difference between EAV done faithfully and EAV done badly, just as there is between 5NF done by skilled people and 5NF done by those who are clueless.
Sixth Normal Form is the Irreducible Normal Form (no further normalisation is possible). It eliminates many of the problems that are common, such as the Null Problem, and provides the ultimate method of identifying missing values. It is the academically and technically robust NF. There are no products to support it, and it is not commonly used. To be implemented properly and consistently, it requires a catalogue of metadata to be implemented. Of course, the SQL required to navigate it becomes even more cumbersome (SQL already being cumbersome regarding joins), but this is easily overcome by automating the production of the SQL from the metadata.
EAV is a partial set or a subset of 6NF. The problem is, usually it is done for a purpose (to allow columns to be added without having to make DDL changes), and by people who are not aware of the 6NF, and who do not implement metadata. The point is, 6NF and EAV as principles and concepts offer substantial benefits, and performance increases; but commonly it is not implemented properly, and the benefits are not realised. Quite a few EAV implementations are disasters, not because EAV is bad, but because the implementation is poor.
Eg. Some people think that the SQL required to construct the 3NF rows from the 6NF/EAV database is complex: no, it is cumbersome but not complex. More important, an ordinary SQL VIEW can be provided, so that all users and report tools see only the straight 3NF VIEW, and the 6NF/EAV issues are transparent to them. Last, the SQL required can be automated, so the labour cost that many people endure is quite unnecessary.
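As a rough sketch of that arrangement (the names are invented for illustration): one narrow table per attribute, plus an ordinary VIEW that re-presents the familiar 3NF row shape.

    CREATE TABLE product (
        product_id INTEGER NOT NULL PRIMARY KEY
    );

    CREATE TABLE product_name (
        product_id INTEGER     NOT NULL PRIMARY KEY REFERENCES product (product_id),
        name       VARCHAR(80) NOT NULL
    );

    CREATE TABLE product_price (
        product_id INTEGER        NOT NULL PRIMARY KEY REFERENCES product (product_id),
        price      DECIMAL(10, 2) NOT NULL
    );

    -- Users and report tools query only this view.
    CREATE VIEW product_3nf AS
    SELECT p.product_id, n.name, pr.price
    FROM   product p
    LEFT JOIN product_name  n  ON n.product_id  = p.product_id
    LEFT JOIN product_price pr ON pr.product_id = p.product_id;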
So the answer really is, Sixth Normal Form, being the father of EAV, and a purer form, is the replacement for it. The Caveat is, ensure it is done properly. I have one large 6NF db, and it suffers none of the problems people post about, it performs beautifully, the customer is very happy (no further work is a sign of complete functional satisfaction).
I have already posted a very detailed answer to another question which applies to your question as well, which you may be interested in.
Other EAV Question
Regardless of the kind of relational model you use, tracking field name changes requires a lot of metadata, which you must keep track of in either transaction logs or audit tables. Unfortunately, querying either of those for the state at a particular date is very complicated. If your client only requires the state at a particular date, however (meaning the entire state, not just with respect to name changes), you can duplicate the database, roll back the transaction log to the particular time required, and run your queries on the new instance. If entities added after the specified date need to show up in the query with the old field names, however, you have a very large engineering problem ahead of you. In that case, with the information you provided in your question, I would suggest either negotiating alternatives with the client or getting more information about the use of the reports to find alternative solutions.
You could move to a document based datastore, but that still wouldn't solve the problem in the second case. Sorry this isn't really an answer, but having worked through similar situations, the client likely needs a more realistic reporting solution or a number of other investors willing to front the capital for the engineering.
When this problem came up for us, we kept the db schema constant and implemented an entity mapping factory based on a timestamp. In the end, the client continually changed requirements (on a weekly to monthly basis) as to how aggregate fields were calculated and were never fully satisfied.
To add to the answers from @NickLarsen and @PerformanceDBA:
If you need to track historical changes to things like field name, you may want to look into something like Slowly Changing Dimensions. It appears to me like you are using the EAV to model dynamic dimensional models (probably lookup lists).
The simplest (and probably least efficient) way of achieving this would be to include an "as of" date field on the EAV tables, and whenever a change occurs, insert a new record (instead of updating an existing record) with the current date. This means that you need to alter your queries to always include or look for an "as of" date, or default to "now" if none is provided. Your base entity that joins to the EAV objects would then have to query "top 1" from the EAV table where the "as of" date is less than or equal to the 'last updated' date of the row, ordered by "as of" descending. Worst case, if you need to track the most recent change to a given row where both the name (stored in the 'attribute' table) and the value have changed, you would chain this logic to the value table using the 'last modified' date of the row to find the appropriate value for that particular date.
This obviously has the potential to generate LARGE amounts of data if there are a lot of changes. That's why this approach is referred to as "slowly" changing. It's intended for dimensional values that may change, but not very often. To help with query performance, indexes on the "as of" and "last modified" fields should help.
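An illustrative "as of" lookup against such a table might look like this (the table and column names are assumptions): for each entity/attribute pair, take the newest version whose as_of date is on or before the reporting date.

    SELECT v.entity_id, v.attribute_id, v.value
    FROM   entity_attribute_value v
    WHERE  v.as_of = (SELECT MAX(v2.as_of)
                      FROM   entity_attribute_value v2
                      WHERE  v2.entity_id    = v.entity_id
                      AND    v2.attribute_id = v.attribute_id
                      AND    v2.as_of       <= DATE '2010-06-30');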
If your client needs such flexibility, then a relational database might not be the right match.
Consider MongoDB where JSON structures are stored. You can add or not add fields without limitations. You can even use nesting.
Create a new table for each version of the entity description, and one additional table that tells you which table corresponds to which version.
The query system would have to be updated as well. I think creating a script that generates the tables and queries is your best shot.
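A minimal sketch of the bookkeeping that approach needs (all names invented): a registry recording which physical table holds which version of each entity, and for what period.

    CREATE TABLE entity_version (
        entity_name VARCHAR(80) NOT NULL,
        version_no  INTEGER     NOT NULL,
        table_name  VARCHAR(80) NOT NULL,   -- e.g. 'customer_v3'
        valid_from  DATE        NOT NULL,
        valid_to    DATE,                   -- NULL while the version is current
        CONSTRAINT pk_entity_version PRIMARY KEY (entity_name, version_no)
    );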

Have address columns in each table, or an address table that is referenced by the other tables?

Say I had three tables: Accommodation, Train Stations and Airports. Would I have address columns in each table or an address table that is referenced by the other tables? Is there such a thing as over-normalization?
Database Normalization is all about constructing relations (tables) that maintain certain functional dependencies among the facts (columns) within the relation (table) and among the various relations (tables) making up the schema (database). Bit of a mouthful, but that is what it is all about.
A Simple Guide to Five Normal Forms in Relational Database Theory is the classic reference for normal forms. This paper defines in simple terms what the essence of each normal form is and its significance with respect to database table design. It is a very good "touchstone" reference.
To answer your specific question properly requires additional information. Some critical questions you have to ask are:
- Is an Address a simple fact (e.g. a blob of text) or a composite fact (e.g. composed of multiple attributes: address line, city name, postal code, etc.)?
- What are the other "facts" relating to "Accommodation", "Airport" and "Train Station"?
- What sets of "facts" uniquely and minimally identify an "Airport", an "Accommodation" and a "Train Station"? (These facts are typically called a key, or candidate key.)
- What functional dependencies exist among the Address facts and the facts composing each relation's key?
All this is to say that the answer to your question is not as straightforward as one might hope!
Is there such a thing as "over-normalization"? Maybe. This depends on whether the functional dependencies you have identified and used to build your tables are of significance to your application domain.
For example, suppose it was determined that an address was composed of multiple attributes, one of which is postal code. Technically a postal code is a composite item too (at least Canadian postal codes are). Further normalizing your database to recognize these facts would probably be an over-normalization. This is because the components of a postal code are irrelevant to your application, and therefore factoring them into the database design would be an over-normalization.
For addresses, I would almost always create a separate address table. Not only for normalization but also for consistency in fields stored.
As for such a thing as over-normalization, absolutely there is! It's hard to give you guidance on what is and isn't over-normalization, as I think it mostly comes from experience. However, follow the books on each level of normalization, and once it starts to get difficult to see where things are, you've probably gone too far.
Look at all the sample/example databases you can as well. They will give you a good indication on when you should be splitting out data and when you shouldn't.
Also, be well aware of the type and amount of data you're storing, along with the speed of access, etc. A lot of modern web software is going fully denormalized for performance and scalability reasons. It's worth looking into those for the reasons why, and when, you should and shouldn't denormalize.
Would I have address columns in each table or an address table that is referenced by the other tables?
Can airports, train stations and accommodation each have a different address format?
A single ADDRESS table minimizes the work necessary dealing with addresses - suite, RR, postal/zip code, state/province...
Is there such a thing as over-normalization?
There are different levels of normalization. I've only encountered what I'd consider poor design rather than over-normalization.
Personally I'd go for another table.
I think it makes the design cleaner, makes reporting on addresses much simpler and will make any changes you need to make to the address schema easier.
If you need to have it denormalized later on you can always create two views that contain the Train station and airport information along with any address information you need.
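For illustration, the separate table and one of those denormalizing views might look roughly like this (all names are assumptions, not a prescription):

    CREATE TABLE address (
        address_id  INTEGER      NOT NULL PRIMARY KEY,
        line1       VARCHAR(100) NOT NULL,
        line2       VARCHAR(100),
        city        VARCHAR(60)  NOT NULL,
        region      VARCHAR(60),
        postal_code VARCHAR(20),
        country     VARCHAR(60)  NOT NULL
    );

    CREATE TABLE train_station (
        station_id INTEGER     NOT NULL PRIMARY KEY,
        name       VARCHAR(80) NOT NULL,
        address_id INTEGER     NOT NULL REFERENCES address (address_id)
    );

    -- "Denormalized" view for reporting; an airport or accommodation
    -- equivalent would follow the same pattern.
    CREATE VIEW train_station_with_address AS
    SELECT s.station_id, s.name,
           a.line1, a.line2, a.city, a.region, a.postal_code, a.country
    FROM   train_station s
    JOIN   address       a ON a.address_id = s.address_id;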
This isn't really what I understand by normalisation. You don't seem to be talking about removing redundancy, just how to partition the storage or data model. I'm assuming that the example of addresses for Accommodation, Train Stations and Airports will all be disjoint?
As far as I know, it would only be normalisation if you started thinking along the lines of: postcode is functionally dependent upon street address, so it should be factored out into its own table.
In which case this could be either desirable or undesirable depending on context. Perhaps desirable if you administer the records and can ensure correctness, and less desirable if users can update their own records.
A related question is Is normalizing a person’s name going too far?
If you have a project/piece of functionality that is very performance-sensitive, it may be smart to denormalize the database in some cases. However, this can lead to maintenance issues for various reasons. You may instead want to duplicate the data with cache tables, but there are drawbacks to this as well. It's really decided on a case-by-case basis, but in normal practice database normalization is a good thing. 99% of the non-normalized databases I've seen are not that way by design, but rather through a misunderstanding/mistake by the developer.
Would I have address columns in each table or an address table that is referenced by the other tables?
As others have alluded to, this is not really a question of normalization because you're not attempting to reduce redundancy or organize dependencies. Either way is perfectly acceptable. Moving the addresses to a separate table might make sense if you are going to have centralized validation or business logic specific to addresses.
Is there such a thing as over-normalization?
Yes. As has been mentioned, in large systems (lots of data, lots of transactions, or both) you can normalize to the point where performance becomes an issue. This is why lots of systems use denormalized database for reporting and querying.
In addition to performance though, there is also the issue of how easy the data is to query. In systems where there will be a lot of end-user querying of the data (can be dangerous!), a denormalized structure is easier for most non-technical or non-database people to understand.
Like most things we deal with, it's a trade-off between understanding, performance, and future maintainability and there is rarely a clear-cut answer to where you draw the line in any given system.
With experience, you will learn where the line is best drawn for the systems you write.
With that said, my preference is to err on the side of more vs less normalization.
If you are using Oracle 9i, you could store address objects in your tables. That would remove the (justified) concerns about address formats.
I agree with S.Lott, and would like to add:
A good answer depends on what you know already. The basic "math" of relational database theory, however, defines very well-defined, distinct levels of normalization. You cannot normalize anymore when you've reached the ultimate normal form.
Depending on what you want to model with your three entities, and how you identify them, you can come up with very different conceptual data models, all of which can be represented in a mix of normal forms -- or unnormalized at all (like 1 table for all data with descriptors and NULL holes all over the place...).
Consider that you normalize your three entities to the ultimate normal form. I can now introduce a new requirement, or use case, or extension, which gives an up-to-now descriptive attribute a somehow ordered, or referencing, or structured nature if you look at its content. Then the model should represent this behaviour, and what used to be an attribute will perhaps be better off as a separate entity referenced by other entities.
Over-normalization? Only in the sense that you can normalize a given model so far that it gets inefficient to store, or process, on a given DB platform. Depending on what can be handled efficiently there, you might want to de-normalize certain aspects, trading off redundancy for speed (data warehouse DBs do this all the time), and insight, or vice versa.
All (working) db designs I've seen so far either have a rather normalized conceptual data model, with quite some denormalization done at the logical and/or physical data model level (speaking in Sybase PowerDesigner terms) to make the model "manageable" -- either that, or they were not working, i.e. failed because the maintenance problems became kingsize real quick.
When you say "address", I presume you mean a complete address, like street, city, state/province, maybe country, and zip/postal code. That's 4 or 5 fields, maybe more if you allow for "address line 1" and "address line 2", care-ofs, etc. That should definitely be in a separate table, with an "addressid" to link to the Station, etc. tables. Otherwise, you are creating 3 separate copies of the same set of field definitions. That's bad news because it creates extra effort to keep them consistent. Like, what if initially you are only dealing with U.S. addresses (I'm an American, so I'll assume U.S.), but later you find you also need to allow for Canadians? You'll need to expand the size of the postal code field and add a country code. If there's a common table, then you only have to do this once. If there isn't, then you have to do this three times. And it's likely that the "three times" is not just changing the database schema, but changing every place in your programs that processes an address.
One of the benefits of normalization is to minimize the impact of changes.
There are times when you want to denormalize to make queries more efficient. But this should be done very cautiously, only after you have good reason to believe that the fully normalized model creates serious inefficiency problems. In my humble experience, most programmers are far too quick to denormalize, usually with a quick "oh, breaking that out into a separate table is too much trouble".
I think in this situation it is OK to have address columns in each table. You'll hardly ever have an address which is used more than twice. Most of the addresses will be used just once per entity.
But what could be in an extra table are names of streets, cities, countries...
And most importantly, every train station, accommodation and airport will probably have just one address, so it's an n:1 relation.
I can only add one more constructive note to the answers already posted here. However you choose to normalize your database, that very process becomes almost trivial when the addresses are standardized (look the same). This is because as you endeavor to prevent duplicates, all the addresses that are actually the same do look the same.
Now, standardizing addresses is not trivial. There are CASS services which do this for you (for US addresses) which have been certified by the USPS. I actually work for SmartyStreets where this is our expertise, so I'd suggest you start your search there. You can either perform batch processing or use the API to standardize the addresses as you receive them.
Without something like this, your database may be normalized, but duplicate address data (whether correct, or incomplete and invalid, etc.) will still seep in because of the many, many forms an address can take. If you have any further questions about this, I'll personally assist you.

Can you have 2 tables with identical structure in a good DB schema?

2 tables:
- views
- downloads
Identical structure:
item_id, user_id, time
Should I be worried?
I don't think that there is a problem, per se.
When designing a DB there are lots of different parameters, and some (e.g.: performance) may take precedence.
Case in point: even if the structures (and I suppose indexing) are identical, maybe "views" has more records and will be accessed more often.
This alone could be a good reason not to burden it with records from the downloads.
Also, the fact that they are identical now does not mean they will be in the future: views and downloads are different things, after all, so sooner or later one or both could grow an extra field or two.
These tables are the same NOW, but the schemas may diverge in the future. If they represent 2 different concepts, it is good to keep them separate. What if you wanted to have a foreign key from another table to the downloads table but not the views table? If they were the same table, you could not do this.
I think the answer has to be "it depends". As someone else pointed out, if the schema of one or both tables is likely to evolve, then no. I can think of other cases as well (simplifying the security model by allowing apps/users access to one or the other).
Having said this, I work with a legacy DB where this is a problem. We have multiple identical tables for customer invoices. Data is actually moved between them at different stages in the processing life-cycle. It makes for a complicated mess when trying to access data. It would have been easily solved by a state flag in the original schema, but we now have 20+ years of code written against the multi-table version.
Short answer: depends on why they are the same schema :).
From a E/R modelling point of view I don't see a problem with that, as long as they represent two semantically different entities.
From an implementation point of view, it really depends on how you plan to query that data:
If you plan to query those tables independently from each other, keeping them separate is a good choice
If you plan to query those tables together (maybe with a UNION or a JOIN operation), you should consider storing them in a single table with a discriminator column to distinguish their type (see the sketch after this list)
When considering whether to consolidate them into a single table you should also take into account other factors like:
The amount of data stored in each table
The rate at which data grows in each table
The ratio of read/write operations executed on each table
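A sketch of the single-table alternative mentioned above (the column names beyond the question's item_id/user_id/time are assumptions):

    CREATE TABLE item_event (
        item_id    INTEGER   NOT NULL,
        user_id    INTEGER   NOT NULL,
        event_time TIMESTAMP NOT NULL,
        event_type CHAR(1)   NOT NULL CHECK (event_type IN ('V', 'D')),  -- V = view, D = download
        PRIMARY KEY (item_id, user_id, event_time, event_type)
    );

    -- Querying both kinds together no longer needs a UNION:
    SELECT event_type, COUNT(*) AS how_many
    FROM   item_event
    GROUP  BY event_type;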
Chris Date and Dave McGoveran formalised the "Principle of Orthogonal Design". Roughly speaking it means that in database design you should avoid the possibility of allowing the same tuple in two different relvars. The aim being to avoid certain types of redundancy and ambiguity that could result.
Arguably it isn't always totally practical to do that and it isn't necessarily clear cut exactly when the principle is being broken. However, I do think it's a good guiding rule, if only because it avoids the problem of duplicate logic in data access code or constraints, i.e. it's a good DRY principle. Avoid having tables with potentially overlapping meanings unless there is some database constraint that prevents duplication between them.
It depends on the context - what is a View and what is a Download? Does a Download imply a View (how else would it be downloaded)?
It's possible that you have well-defined, separate concepts there - but it is a smell I'd want to investigate further. It seems likely that a View and a Download are related somehow, but your model doesn't show anything.
Are you saying that both tables have an 'item_id' primary key? In that case, the fields have the same name, but do not have the same meaning: one is a 'view_id', and the other one is a 'download_id'. You should rename your fields accordingly to avoid this kind of misunderstanding.

Resources