Database normalization first normal form confusion - when should tables be separated out?

Please consider this from an academic view, not a practical engineering view. This is about 1NF and 1NF only.
Considering the unnormalized form below, where the primary key is {trainingDateTime, employeeNumber}, how would you bring it to first normal form?
If we separate course, instructor and employee out as separate tables, it will automatically become 3NF.
[image of the unnormalized table]
If I split it into different rows (one row per skill), the problem is obvious - the primary key is no longer valid.
Changing the primary key to {trainingDateTime, employeeNumber, employeeSkill} doesn't seem like a sensible solution.

Just to make it satisfy 1NF, you need separate rows for the individual teaching skills. But you should also be ensuring that the higher normal forms are satisfied by splitting out tables.
So one row should have the teaching skill Advanced PHP, a second row Advanced Java, a third row Advanced SQL, and so on for the same employee.
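As a minimal sketch of that 1NF shape (the original table isn't shown, so the column names here are assumptions based on the question), the multi-valued skill attribute becomes one atomic value per row and joins the key:

-- assumed 1NF version of the training table
CREATE TABLE employee_training (
  trainingDateTime TIMESTAMP    NOT NULL,
  employeeNumber   INTEGER      NOT NULL,
  employeeName     VARCHAR(100),            -- repeated on every skill row (to be fixed by higher NFs)
  employeeSkill    VARCHAR(100) NOT NULL,   -- one atomic skill per row
  PRIMARY KEY (trainingDateTime, employeeNumber, employeeSkill)
);

INSERT INTO employee_training VALUES
  (TIMESTAMP '2024-01-10 09:00:00', 1001, 'Jane Doe', 'Advanced PHP'),
  (TIMESTAMP '2024-01-10 09:00:00', 1001, 'Jane Doe', 'Advanced Java'),
  (TIMESTAMP '2024-01-10 09:00:00', 1001, 'Jane Doe', 'Advanced SQL');

The repetition of the non-key columns is exactly what splitting out course, instructor and employee tables then removes on the way to the higher normal forms.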

Together with your other question (database normalization - merge/combine tables) it seems you are looking for an answer to a question you did not ask.
With regard to your comment "in practice i can't imagine anyone would start from a complete unormalized form", I would think your question is more: why do we need those normalization rules, formulated the way they are, in order to produce normalized designs efficiently? Something like that. I guess your real motivation/question plays a role here.
Normalization is typically perceived as a process or a methodology, and there is no harm in that. However, the formulation of those normalization rules also allows for checklist-like usage: you can double-check an arbitrary set of tables of arbitrary size against the normalization rules and confirm or reject compliance. So even if you can probably find thousands of examples where those rules confirm compliance for the very first, natural schema version, you could also find thousands of other examples that would fail compliance against those same rules.
In fact, trying to squeeze multiple somehow-coupled pieces of information into a historically grown collection of MS Excel tables across several sheets is usually an extraordinary source of conflicts with any set of normalization rules (e.g. rendering a business case and connecting it with planning aspects and resource planning).

Related

How do I know which is the more appropriate database design? (Authors, Articles & Comments)

Let's assume a database with three tables: Author, Articles, Comments
Assuming the relationship is as follows:
Author has many Articles
Article belongs to one Author
Article has many Comments
Comment belongs to one Article
If I want to know which Author writes the most commented article, I need to select all Articles that belong to a specific Author first. Then I can count the number of comments that were posted under each of those articles, which in general leads to more complex queries.
If the relationships were as follows:
Author has many Articles
Article belongs to one Author
Article has many Comments
Comment belongs to one Article
**Comment belongs to one Author of the relevant Article**
then I could directly select and count all comments that were posted under the articles of a specific Author, without having to include the articles in the query.
But it implies a redundant relationship.
In view of performance, usability and coding best practices, which approach is the better one?
I remember having read somewhere that one should only use the first approach and avoid redundant relationships, but I don't remember where or why. Is there a link to a scientific approach to answering this question?
"But I don't remember where or why? Please link to a scientific approach to answer this question."
The "scientific approach" is the entire body of normalization theory.
The "redundant relationship" creates an additional problem in integrity enforcement. The system must make sure that the comment/author relationship as specified by a user updating the db, is the same as the one implied by the comment/article and article/author relationships.
That is a problem of additional complexity for the system when enforcing data integrity, and is a problem of additional complexity for the users doing the updating to ensure that they won't be specifying invalid updates.
So your "second approach" might make querying "simpler" indeed, but only at the expense of creating additional complexities on the "updating" side.
Your first approach is a normalized design. It should be the default - it's more maintainable, less error-prone, and requires less code overall.
The second option is a denormalized design. If you think it through, it would require you to find the author of the article every time someone posts a comment and increment the "comments" field; that's probably more code, and makes writing the comment slower. It also means a simple bug in your "create comment" code could break the application logic, and you probably need to wrap each comment "write" action in a transaction so you can guarantee that the comment insert and the update to "authors.comment_count" succeed or fail together.
So, the second option is definitely more complex, and slower for writing comments. It may be faster for querying, but as you'll be joining on primary keys, you will almost certainly not be able to measure that performance impact until you get to a database size of hundreds of millions of records.
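For reference, a sketch of that query against the normalized first approach (assuming tables articles(id, author_id) and comments(id, article_id); the names are illustrative) - it is one join and a group by, on indexed keys:

-- comments received per author, most commented first
SELECT ar.author_id, COUNT(c.id) AS comment_count
FROM articles ar
JOIN comments c ON c.article_id = ar.id
GROUP BY ar.author_id
ORDER BY comment_count DESC;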
In general, I recommend the following approach; take each step only if the previous steps haven't given you enough performance.
design a relational model.
tune that relational database (indexes, etc. - see the sketch below this list)
improve the hardware - RAM, CPU, SSD disks etc.
create a measurement rig so you can identify the performance challenges and run experiments. Create benchmarks based on current and expected data sizes; find a way to fill your test rig with dummy data until you have the data volume you need to scale to.
run your queries on the test rig. Make sure there are no further performance tweaks from indexing or query optimization.
introduce application-level caching. In your example, caching the number of comments for an author for 1 hour may be acceptable.
de-normalize your schema. Use your test rig to prove it gives you the performance you expect.
look at more exotic data solutions - sharding, data partitioning etc.
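As a small illustration of the tuning step (index and table names are assumptions matching the articles/comments example above, not a prescription):

-- indexes to support counting comments per article and finding articles per author
CREATE INDEX idx_comments_article_id ON comments (article_id);
CREATE INDEX idx_articles_author_id  ON articles (author_id);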
Denormalization is so far down the line because it introduces real maintenance risks, makes your code much more complex, and is nowhere near as effective as adding an extra 4GB to your server in most cases.
Tables represent business/application relation(ship)s/associations, as in the relational model & entity-relationship modeling. Every query result holds the rows of values that are related by some business relationship expressed by the query expression.
Your "relationships" [sic] are FKs (foreign keys). Those are constraints--statements true in every business situation & its database state--saying that if some values are related by a certain business relationship then they are also related by a certain other one. But FKs are neither necessary nor sufficient for using the database--for interpreting it or updating it. They constrain the database state, but they don't tell you what's in it.
Your business relationships & corresponding tables are actually like:
Author authored Article
Commenter commented Comment re Article
Such a statement template denoting a business relationship is its (characteristic) predicate. To query using these it does not matter what the constraints are--if you want the authors who commented on articles authored by themselves that's
/* rows where
FOR SOME a.* & cr.*,
Author = a.Author
AND a.Author authored a.Article
AND cr.Commenter commented cr.Comment re cr.Article
AND a.Author = cr.Commenter
*/
select Author
from authored a join commented_re cr on a.Author = cr.Commenter
regardless of whether an author can author multiple articles, or multiple authors can author an article, or multiple authors can author multiple articles, or commenters can comment re multiple comments, etc, or commenters can comment re multiple articles, etc, or a comment can be re multiple articles, etc, or authors can comment, or commenters can author, or commenters can only comment on articles they authored (a FK constraint) or authors named 'henk' can comment re at most 7 articles, or any constraint whatsoever.
Normalization replaces a table by projections of it that join back to it, which is the same as saying it replaces a business relationship that is expressible via an AND by others that are expressible by the expressions that were ANDed. It happens that if an author can only write one article and an article can only be written by one author then the AND/join table above might (depending on other things) be a good design, but otherwise it would not be, and it should be replaced by the separate tables. FDs & other constraints are the post-design, table-based expression of corresponding business rules that follow from the chosen business relationships & from what business situations can arise.
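A minimal sketch of that replacement (the AuthorEmail attribute is hypothetical, added only to give an AND to split): a table stating "Author authored Article AND Author has AuthorEmail" is replaced by projections for each conjunct, which join back to it without loss:

-- the AND-table
CREATE TABLE authored_with_email (
  Article     VARCHAR(100) PRIMARY KEY,
  Author      VARCHAR(100) NOT NULL,
  AuthorEmail VARCHAR(100) NOT NULL      -- depends on Author alone
);

-- its projections, one per conjunct
CREATE TABLE authored (
  Article VARCHAR(100) PRIMARY KEY,
  Author  VARCHAR(100) NOT NULL
);
CREATE TABLE author_email (
  Author      VARCHAR(100) PRIMARY KEY,
  AuthorEmail VARCHAR(100) NOT NULL
);

-- joining the projections reproduces the original AND-table
SELECT a.Article, a.Author, e.AuthorEmail
FROM authored a JOIN author_email e ON a.Author = e.Author;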
So your "scientific approach" is proper relational information modeling and database design, including normalization.

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide which would be the better option, I asked my Databases teacher; unfortunately he couldn't tell me which of the two options is better than the other and why.
I designed the database for a dummy data generator. Since I want to generate multilingual data I thought of these tables (this is a simplification):
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language; sometimes a name can have multiple origins (e.g. Nick is both a Dutch and an English name).
Each language has multiple names and streets.
These two rules result in a Many-to-Many relationship. At the moment I've got only two tables, but I know I will end up with between 10 and 20 of these kinds of tables.
The regular way one would do this is to just make 10 to 20 Many-to-Many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create enough good, realistic-looking working data for projects. This application will generate this data for you and return the needed SQL so you only have to run the queries.
The user comes to the site to get the data. He states his table name and his column names, and then he can link the column names to types of data, such as:
* Firstname
* Lastname
* Email address (which will be randomly generated from the name of the person)
* Address details (street, house number, zip code, place, country)
* A lot more
Then, after linking the columns with types, the user can set the number of rows he wants to generate. The application will then choose a country at random and generate realistic-looking data according to the country they live in.
That's actually an excellent question. This sort of thing leads to a genuine problem in database design and there is a real tradeoff. I don't know what rdbms you are using but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with one foreign key column per potential table (besides language), plus check constraints ensuring that only one of those columns is filled in. Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time, especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features in your rdbms for a solution. For example in postgresql this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
Unfortunately there is no single definitive answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I were working with just one RDBMS, I would do the last one; if not, I would probably do one table per relationship and focus on tooling to manage the problems that come up. The former preference reflects my level of knowledge and confidence, while the latter is a bit more of a personal opinion.
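A rough sketch of options 2 and 3 (the base tables follow the question's description; everything here is illustrative, not a recommendation):

-- base tables as described in the question (simplified)
CREATE TABLE names     (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL);
CREATE TABLE streets   (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL);
CREATE TABLE languages (id INTEGER PRIMARY KEY, name VARCHAR(100) NOT NULL);

-- option 2: one junction table per relationship, full referential integrity
CREATE TABLE name_language (
  name_id     INTEGER NOT NULL REFERENCES names(id),
  language_id INTEGER NOT NULL REFERENCES languages(id),
  PRIMARY KEY (name_id, language_id)
);
CREATE TABLE street_language (
  street_id   INTEGER NOT NULL REFERENCES streets(id),
  language_id INTEGER NOT NULL REFERENCES languages(id),
  PRIMARY KEY (street_id, language_id)
);

-- option 3: one polymorphic junction table; item_id cannot be a real foreign key,
-- because which table it points to depends on item_type
CREATE TABLE item_language (
  item_type   VARCHAR(20) NOT NULL,   -- e.g. 'name' or 'street'
  item_id     INTEGER     NOT NULL,   -- no REFERENCES possible here without triggers
  language_id INTEGER     NOT NULL REFERENCES languages(id),
  PRIMARY KEY (item_type, item_id, language_id)
);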
So I hope this helps you look at the tradeoffs and select what is right for you.

What is the best way to realize this database?

I have to realize a system with different kinds of users and I am thinking of realizing it in this way:
A user table with only id, email and password.
Two different tables correlated to the user table in a 1-to-1 relation. Each table defines the specific attributes of one kind of user.
Is this the best way to realize it? Should I use the InnoDB storage engine?
If I realize it in this way, how can I handle the tables in the Zend Framework?
I can't answer the second part of your question, but the pattern you describe is called supertype/subtype in data modelling. Whether this is the right choice can't be answered without knowing more about the differences between these user types and how they will be used in the application. There are different approaches when converting logical super/subtypes into physical tables.
Here are some relevant links:
http://www.sqlmag.com/article/data-modeling/implementing-supertypes-and-subtypes
and the next one about pitfalls and (mis)use of subtyping
http://www.ocgworld.com/doc/OCG_Subtyping_Techniques.pdf
In general I am, from a pragmatic point of view, very reluctant to follow your choice and most often opt to create one table containing all columns. In most cases there are a number of places where the application needs to show all users in some sort of listing, with specific columns for specific types (left empty if not applicable for that type). Separate tables quickly lead to non-straightforward queries and all sorts of extra code to deal with the different tables, to the point that it's just not worth being 'conceptually correct'.
Two reasons for me to still split the subtypes into different tables: the subtypes are so truly different that it makes no logical sense to have them in one table, or the number of rows is so enormous that the overhead of the 'unneeded' columns in a single table actually starts to matter.
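For reference, a minimal sketch of the supertype/subtype layout the question describes (the two user kinds and their columns are hypothetical, since the question doesn't name them):

-- supertype: attributes shared by every user
CREATE TABLE users (
  id       INTEGER PRIMARY KEY,
  email    VARCHAR(255) NOT NULL UNIQUE,
  password VARCHAR(255) NOT NULL
);

-- subtypes: 1-to-1 with users, each holding the attributes of one kind of user
CREATE TABLE customer_profiles (
  user_id      INTEGER PRIMARY KEY REFERENCES users(id),
  company_name VARCHAR(255)
);
CREATE TABLE admin_profiles (
  user_id      INTEGER PRIMARY KEY REFERENCES users(id),
  access_level INTEGER NOT NULL
);

The single-table alternative argued for above would instead fold the subtype columns into users as nullable columns, typically with a user_type discriminator column.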
On the PHP side you can use the Doctrine 2 ORM. It's easy to integrate with ZF, and you can easily implement this table structure as inheritance in your Doctrine mapping.

What is data normalization? [duplicate]

Possible Duplicate:
What exactly does database normalization do?
Can someone please clarify data normalization? What are the different levels? When should I "de-normalize"? Can I over normalize? I have a table with millions of records, and I believe I over-normalized it, but I'm not sure.
If you have a million columns you probably under-normalized it.
What normalizing means is that every non-key attribute "must provide a fact about the key, the whole key, and nothing but the key."
If you have a column that depends on anything but the key, you should normalize your table.
see here.
Added in reply to a comment:
If you have ProductID | ProductType | ProductTypeID, where ProductTypeID depends only on ProductType, you should make a new table for that:
ProductID | ProductTypeID, and in the other table: ProductTypeID | ProductTypeName.
So to answer your question: "pertaining to Product" isn't accurate enough - in the first case of my example the columns also pertained to the Product. All columns should pertain only to ProductID (you could say you describe only the product and don't describe anything else, even if it's related to the product - that is accurate).
The number of rows, generally speaking, isn't relevant.
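A minimal sketch of that split (column types are assumed):

-- type attributes live in their own table
CREATE TABLE product_types (
  ProductTypeID   INTEGER PRIMARY KEY,
  ProductTypeName VARCHAR(100) NOT NULL
);

-- products now reference the type instead of repeating its attributes
CREATE TABLE products (
  ProductID     INTEGER PRIMARY KEY,
  ProductTypeID INTEGER NOT NULL REFERENCES product_types(ProductTypeID)
);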
Normalization is about reducing data duplication in a relational database. The most popular level is third normal form (it's the one described by "the key, the whole key, and nothing but the key"), but there are a lot of different levels, see the Wikipedia entry for a list of the main ones. (In practice people seem to think they're doing well to achieve third normal form.) Denormalizing means accepting more data duplication, typically in exchange for better performance.
As others said, database normalization is about reducing data duplication and having more generic data models (ones that can easily answer queries unexpected at design time). Normalization of a database follows a formal enough process. When you are experienced you mostly follow data analysis methods and end up with a normalized database.
Normalizing a database is usually a good idea, but there is a catch. In many cases it involves creating new tables and JOIN relationships between them. JOINs can have a high performance cost at runtime, hence for big volumes of data you may want to denormalize.
Another cost may be the need to write more complex queries to access the data you need, which can be a problem for SQL beginners. The best idea is probably to stick with normalization anyway (third normal form is usually enough, and as others said there are several levels of normalization) and to become more skilled with SQL.
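As a small, self-contained example of the kind of query that gets slightly more complex after normalizing (table and column names are hypothetical):

-- normalized: customer names stored once
CREATE TABLE customers (
  customer_id   INTEGER PRIMARY KEY,
  customer_name VARCHAR(100) NOT NULL
);
CREATE TABLE orders (
  order_id    INTEGER PRIMARY KEY,
  customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);

-- listing orders with their customer name now needs a join
-- (in a denormalized table it would be a single-table select)
SELECT o.order_id, c.customer_name
FROM orders o
JOIN customers c ON c.customer_id = o.customer_id;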

Database Normalization Vocabulary

There is a lot of material on database normalization available in Steve's Class and on the Web. However, I still seem to lack very definite reasons for explaining normalization.
For example, for a simple design such as a table Item with a Type field, it makes sense to have the Type as a separate table. The reason I put forward for that was that if in future any need arose to add properties to the Type, it would be much easier with a separate table already existing.
Are there more reasons which can be shown to be obvious?
Check these out too:
An Introduction to Database Normalization
A Simple Guide to Five Normal Forms in Relational Database Theory
This article says it better than I can:
There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations. A customer address change is much easier to implement if that data is stored only in the Customers table and nowhere else in the database.
What is an "inconsistent dependency"? While it is intuitive for a user to look in the Customers table for the address of a particular customer, it may not make sense to look there for the salary of the employee who calls on that customer. The employee's salary is related to, or dependent on, the employee and thus should be moved to the Employees table. Inconsistent dependencies can make data difficult to access because the path to find the data may be missing or broken.
The following links can be useful:
http://support.microsoft.com/kb/283878
http://neerajtripathi.wordpress.com/2010/01/12/normalization-of-data-base/
Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization. In his own words:
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
— E.F. Codd, "Further Normalization of the Data Base Relational Model"
Taken word-for-word from Wikipedia:Database normalization

Resources