Tips while 'normalizing' databases - database

I would be grateful if someone wrote how I should look for databases normalization errors in databases AND in entity classes in any language.
I just would like to know what is the most important and where should I look for possible errors in classes - in DAOs, BEANS or wherever. What should I take into account - any conventions, schemes etc?
For any answer, thanks in advance! :)

I guess you've read something about the normal forms, e.g. on wikipedia. Then I guess you know something, but you are not sure why should you do that or what is really important.
For example, if you have a table that contains relations between persons, it should not contain names, just IDs. If you have e.g. a table of patients where there are columns father_name and mother_name, it's an example of non-normalized table causing troubles.
Let's say the mother changes her name - from this moment on, your database is in inconsistent state. You decide to add some cascade/trigger on this change and you get into even worse problem: You realize several people can have the same name.
That is basically the main reason for using IDs as keys, not some column that is not a unique identifier. There is much more to learn, I hope someone provides you a link to some tutorial, as this is not really Q/A stuff.
Another good reason for normalizing a table are sparse tables - tables where some columns rarely contain anything else than null. E.g., there are four types of some device, each has different properties that are left null on the other types. In this case, creating a table that holds the specific properties of each device type (even though it's just {0,1}:1 relation) is advisable.

Related

Proper name for an intermediate table between others two intermediate

I have 4 entities: Event, Message, Flow and Document.
Event table stores a limited (seeded) number of records. Message has many events and each event can be related to many messages. The name event_message was given for the intermediate table.
As you can see, the convention for intermediate tables are: {tablename}_{tablename}.
Flow table stores a limited (seeded) number of records. Message has many flows and each flow can be related to many messages. The name flow_message was given for the intermediate table.
A document is created on each relation between Flow and Message (each record on flow_message).
The issue starts here:
Each event on a message has different documents by flow. It means: for each new record on intermediate table flow_message, each record on intermediate event_message has a new document related.
To solve this, I created an intermediate table between event_message and flow_message named: event_message_flow_message.
Is this correct (in some conventional way)? Is this modeling correct?
How to proper model and naming the intermediate table derivative by two others intermediate tables?
I also wish there was some convention. Since I do not know any official convention, I invented mine. The important thing is to respect the convention you choose.
So I would change the event_message_flow_message to rel_eventmessage_flowmessage.
But for me your convention is pretty nice.
It's hard to make a recommendation because your model seems a bit odd to me. You have 1:1 relationships between both DOCUMENT and FLOW_MESSAGE and DOCUMENT and EVENT_MESSAGE_FLOW_MESSAGE. It's hard to reconcile this in my mind with the many to one relationships to EVENT_MESSAGE_FLOW_MESSAGE. If you're relationships to DOCUMENT are really 1:1 (mandatory), then why keep documents in a separate table?
To address your question about table naming: I would argue that the {table}_{table} convention for naming intersection tables is not a best practice but rather a fallback for cases where you can't think of a better name.
The best practice is for names of tables to reflect the business name of the thing which is recorded / described by the data in the table. It's not always possible to do this, especially for intersection tables. Intersection tables represent many-to-many relationships, and relationships are often difficult to describe with a noun.
In your case, I don't think that your convention is actually making things especially easy to understand. I'd probably try to simplify with something like MESSAGE_DOCUMENT or even just DOCUMENT - since these seem to be 1:1 related in any case.

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide what would be a better option I asked my teacher Databases, unfortunately he couldn't tell me which of the two options is better than the other and why.
I designed the database for a dummy data generator. Since I want to generate multilangual data I thought of these tables. (But its a simplification of the tables).
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language, sometimes a name can have multiple origins (ex: Nick is both a Dutch as an English name).
Each language has multiple names and streets.
These two rules result in a Many-to-Many relationship. At the moment I've got only two tables, but I know I will get between 10 and 20 of these kind of tables.
The regular way one would do this is just make 10 to 20 Many-to-Many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create good and enough realistic looking working data for projects. This application will generate this data for you and return the needed SQL so you only have to run the queries.
The user comes to the site to get the data. He states his tablename, his columnnames and then he can link the columnnames to types of data, think of:
* Firstname
* Lastname
* Email adress (which will be randomly generated from the name of the person)
* Adress details (street, housenumber, zipcode, place, country)
* A lot more
Then, after linking columns with the types the user can set the number of rows he wants to make. The application will then choose a country at random and generate realistic looking data according to the country they live in.
That's actually an excellent question. This sort of thing leads to a genuine problem in database design and there is a real tradeoff. I don't know what rdbms you are using but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with check constraints that only one fkey can be filled in besides language and one column per potential table. Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features in your rdbms for a solution. For example in postgresql this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
Unfortunately there is no single definite answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I was just working with one RDBMS, I would do the last one. But if not, I would probably do one table per relationship and focus on tooling to manage the problems that come up. But the former preference is about my level of knowledge and confidence, and the latter is a bit more of a personal opinion.
So I hope this helps you look at the tradeoffs and select what is right for you.

Naming database table fields to designate relationships?

Lets say I have tables Student and Mentor
Does anyone use naming convention for relationship as below? I think this way is good to see the relationships quickly. Would anyone suggest a better way?
Student
StudentID
StudentName
Student2MentorID
To start from scratch, - you probably know this already - there are several ways to represent your database schema, I mean, by using diagrams, for example ER-diagrams that helps you (and your team) stay up to date with your database's design and thus making it simpler to understand.
Now, personally when it comes to implementation, I do use some kind of naming-convention. For example:
For large projects, I use double underscores to split between table categories, (ie. hr__personnel, hr__clocks, hr__timetable, vehicles__cars, vehicles__trips) and so on.
Now, having a relationship between two tables, I do Include both (or all) of the involved table names. (ie. hr__personnel_timetable, vehicles__cars_trips, etc)
Sometimes, (as we all know), we cannot follow strictly a standard, so in those cases I use my own criteria when choosing large relationships' names.
As a rule, I also name table attributes by a three-letter preffix. For example, in my table trips, my fields will be tri_id,tri_distance, tri_elapsed
Note also, that in the above item, I didn't include a Foreign Key. So here I go then. When it comes to FK's, It's easy for me (and my team) to realize that the field IS a FK.
If we follow the previous example, I would like to know who drives in each trip (to make it easier, we assume that only one person drives one trip). So my table now is something like this: tri_id, per_id, tri_distance, tri_elapsed. Now you can easily realize that per_id is just a foreign field of the table. Just, another hint to help.
Just by following these simple steps, you will save hours, and probably some headaches too.
Hope this helps.
I think: you can add prefix (3 letters) to table depending that module represents (scholar,sales,store)
module: scholar ->sc
table: scStudent ( IdStudent,nameStudent..)
table: scMentor(IdMentor,nameMentor...)
relationship
scMentorStudent (IdMentorStudent pk..)
You can use Microsoft's EF notation :
http://weblogs.asp.net/jamauss/pages/DatabaseNamingConventions.aspx
It is better to use underscores...
I suggest to simply use existing naming convention rules such as this one:
http://www.oracle-base.com/articles/misc/naming-conventions.php

Database Design: Explain this schema

Full disclosure...Trying feverishly here to learn more about databases so I am putting in the time and also tried to get this answer from the source to no avail.
Barry Williams from databaseanswers has this schema posted.
Clients and Fees Schema
I am trying to understand the split of address tables in this schema. Its clear to me that the Addresses table contains the details of a given address. The Client_Addresses and Staff_Addresses tables are what gets me.
1) I understand the use of Primary Foreign Keys as shown but I was under the assumption that when these are used you don't have a resident Primary Key in that same table (date_address_from in this case). Can someone explain the reasoning for both and put it into words how this actually works out?
2) Why would you use date_address_from as the primary key instead of something like client_address_id as the PK? What if someone enters two addresses in one day would there be conflicts in his design? If so or if not, what?
3) Along the lines of normalization...Since both date_address_from and date_address_to are the same in the Client_Addresses and Staff_Addresses table should those fields just not be included in the main Address table?
Evaluation
First an Audit, then the specific answers.
This is not a Data Model. This is not a Database. It is a bucket of fish, with each fish drawn as a rectangle, and where the fins of one fish are caught in the the gills of another, there is a line. There are masses of duplication, as well as masses of missing elements. It is completely unworthy of using as an example to learn anything about database design from.
There is no Normalisation at all; the files are very incomplete (see Mike's answer, there are a hundred more problem like that). The other_details and eg.s crack me up. Each element needs to be identified and stored: StreetNo, ApartmentNo, StreetName, StreetType, etc. not line_1_number_street, which is a group.
Customer and Staff should be normalised into a Person table, with all the elements identified.
And yes, if Customer can be either a Person or an Organisation, then a supertype-subtype structure is required to support that correctly.
So what this really is, the technically accurate terms, is a bunch of flat files, with descriptions for groups of fields. Light years distant from a database or a relational one. Not ready for evaluation or inspection, let alone building something with. In a Relational Data Model, that would be approximately 35 normalised tables, with no duplicated columns.
Barry has (wait for it) over 500 "schemas" on the web. The moment you try to use a second "schema", you will find that (a) they are completely different in terms of use and purpose (b) there is no commonality between them (c) let's say there was a customer file in both; they would be different forms of customer files.
He needs to Normalise the entire single "schema" first,
then present the single normlaised data model in 500 sections or subject areas.
I have written to him about it. No response.
It is important to note also, that he has used some unrecognisable diagramming convention. The problem with these nice interesting pictures is that they convey some things but they do not convey the important things about a database or a design. It is no surprise that a learner is confused; it is not clear to experienced database professionals. There is a reason why there is a standard for modelling Relational databases, and for the notation in Data Models: they convey all the details and subtleties of the design.
There is a lot that Barry has not read about yet: naming conventions; relations; cardinality; etc, too many to list.
The web is full of rubbish, anyone can "publish". There are millions of good- and bad-looking "designs" out there, that are not worth looking at. Or worse, if you look, you will learn completely incorrect methods of "design". In terms of learning about databases and database design, you are best advised to find someone qualified, with demonstrated capability, and learn from them.
Answer
He is using composite keys without spelling it out. The PK for client_addresses is client_id, address_id, date_address_from). That is not a bad key, evidently he expects to record addresses forever.
The notion of keeping addresses in a separate file is a good one, but he has not provided any of the fields required to store normalised addresses, so the "schema" will end up with complete duplication of addresses; in which case, he could remove addresses, and put the lines back in the client and staff files, along with their other_details, and remove three files that serve absolutely no purpose other than occupying disk space.
You are thinking about Associative Tables, which resolve the many-to-many relations in Databases. Yes, there, the columns are only the PKs of the two parent tables. These are not Associative Tables or files; they contain data fields.
It is not the PK, it is the third element of the PK.
The notion of a person being registered at more than one address in a single day is not reasonable; just count the one address they slept the most at.
Others have answered that.
Do not expect to identify any evidence of databases or design or Normalisation in this diagram.
1) In each of those tables the primary key is a compound key consisting of three attributes: (staff_id, address_id, date_address_from) and (client_id, address_id, date_address_from). This presumably means that the mapping of clients/staff to addresses is expected to change over time and that the history of those changes is preserved.
2) There's no obvious reason to create a new "id" attribute in those tables. The compound key does the job adequately. Why would you want to create the same address twice for the same client on the same date? If you did then that might be a reason to modify the design but that seems like an unlikely requirement.
3) No. The apparent purpose is that they are the applicable dates for the mapping of address to client/staff - not dates applicable to the address alone.
3) Along the lines of
normalization...Since both
date_address_from and date_address_to
are the same in the Client_Addresses
and Staff_Addresses table should those
fields just not be included in the
main Address table?
No. But you did find a problem.
The designer has decided that clients and staff are two utterly different things. By "utterly different", I mean they have no attributes in common.
That's not true, is it? Both clients and staff have addresses. I'm sure most of them have telephones, too.
Imagine that someone on staff is also a client. How many places is that person's name stored? That person's address? Can you hear Mr. Rogers in the background saying, "Can you spell 'update anomaly'? . . . I knew you could."
The problem is that the designer was thinking of clients and staff as different kinds of people. They're not. "Client" describes a business relationship between a service provider (usually, that is, not a retailer) and a customer, which might be either a person or a company. "Staff" describes a employment relationship between a company and a person. Not different kinds of people--different kinds of relationships.
Can you see how to fix that?
This 2 extra tables enables you to have address history per one person.
You can have them both in one table, but since staff and client are separated, it is better to separate them as well (b/c client id =1 and staff id =1 can't be used on the same table of address).
there is no "single" solution to a design problem, you can use 1 person table and then add a column to different between staff and client. BUT The major Idea is that the DB should be clear, readable and efficient, and not to save tables.
about 2 - the pk is combined, both clientID, AddressID and from.
so if someone lives 6 month in the states, then 6 month in Israel, and then back to the states, to the same address - you need only 2 address in address table, and 3 in the client_address.
The idea of heaving the from_Date as part of the key is right, although it doesn't guaranty data integrity - as you also need manually to check that there isn't overlapping dates between records of the same person.
about 3 - no (look at 2).
Viewing the data model, i think:
1) PF means that the field is both part of the primary key of the table and foreign key with other table.
2) In the same way, the primary key of Staff_Addresses is {staff_id,address_id,date_adderess_from} not just date_adderess_from
3) The same that 2)
In reference to Staff_Addresses table, the Primary Key on date_address_from basically prevents a record with the same staff_id/address_id entered more than once. Now, i'm no DBA, but i like my PKs to be integers or guids for performance reasons/faster indexing. If i were to do this i would make a new column, say, Staff_Address_Id and make it the PK column and put a unique constraint on staff_id/address_id/date_address_from.
As for your last concern, Addresses table is really a generic address storage structure. It shouldn't care about date ranges during which someone resided there. It's better to be left to specific implementations of an address such as Client/Staff addresses.
Hope this helps a little.

Person name structure in separate database table

I am wondering when and when not to pull a data structure into a separate database table when it appears in several tables.
I have pulled the 12 attribute address structure into a separate table because I have a couple of different entities containing a single address in this format.
But how about my 3 attribute person name structure (given, middle, surname)?
Should this be put into its own table referenced with a foreign key for all the entities containing a name... e.g. the company table has a contact person name, the citizen table has a person name etc.
Are these best left as attributes in the main tables or should they be extracted?
I would usually keep the address on the Person table, unless there was an unusual need for absolutely uniform addresses on each entity, or if an entity could have an arbitrary number of addresses, or if addresses need to be shared between entities, or if it was a large enterprise product where I know I have to invest in infrastructure all over the place or I will end up gutting everything down the road.
Having your addresses in a seperate table is interesting because it's flexible, but in the context of a small project lacking a special need like the ones mentioned above, it's probably a slight waste. Always be aware of the balance between complexity and flexibility. Flexibility is important, but be discriminating... It's easy to invest way too much there!
In concrete terms, the times that I experimented with (for instance) one-to-one relationships for things like addresses, I ended up refactoring them back into the table because it introduced a bunch of headaches including more complex queries, dealing with situations where the address does not exist, etc. More entities also increases your cognitive load -- it makes the project harder to think about. In my case, it was an unecessary cost because there was no concrete need and, in truth, not even a gain in flexibility.
So, based on my experiences, I would "try" to keep the addresses in the same table, and I would definitely keep the names on them - again, unless there was a special need.
So to paraphrase Einstein, make it as simple as possible and no simpler. But in the short term, experiment. It's the best way to learn these lessons.
It's about not repeating information, so you don't want to store the same information in two places when one will do.
Another useful rule of thumb is one entity per table. If you find that one table contains, say, "person" AND "order" then you probably should split those into two tables.
And (putting myself at risk of repeating information...) you might find it helpful to review some database design basics, there are plenty of related questions here on stackoverflow.
Start with these...
What is normalisation?
What is important to keep in mind when designing a database
How many fields is 'too many'?
More tables or more columns?
Creating a person entity across your data model will give you this present and future advantages -
The same person occurring as a contact, or individual in different contexts. Saves redundancy.
Info can be maintained and kept current with far-less effort.
Easier to search for a person and identify them - i.e. is it the same John Smith?
You can expand the information - i.e. maintain addresses for this person far more easily.
Programming will be more consistent and debugging will be easier as well.
Moves you closer to a 'self-documenting' system.
As a counterpoint to the other (entirely valid) replies: within your application's current structure, how likely will it be for a given individual (not just name, the actual "person" -- multiple people could be "John Smith") to appear in more than one table? The less likely this is to happen, the less likely you are to get benefits from normalization.
Another way to think of it is entities. Outside of labels (names), is their any overlap between "customer" entity and an "employee" entity?
Extract them. Your aim should be to have no repeating data in your database.
Read about Normalization
It really depends on the problem you are trying to solve. In general it is probably a good idea to have some sort of 'person' table which holds details of people. However, there are occasions where that is potentially a very bad idea.
One example would be if you are holding details of prescriptions written out to people by a doctor. In some countries it is a legal requirment that the prescription details are held with the name in which they were prescribed NOT the name the person is going under currently. For instance a woman might be prescribed a drug as miss X, but then she gets married and becomes Mrs Y. If you had a person table that was linked to the prescriptions table you would now have the wrong details and would possibly face legal consequences. In that case you would need to probably copy the relevant details of the person into the prescription table, even though this would be duplicating data.
So again - it depends on the problem you are trying to solve. Don't just blindly follow what people consider to be best practices. Understand your data and any issues surrounding it, then try to follow best practices that fit.
Depends on what you're using the database for.
If you want fast queries on your tables you should de-normalize your tables. Having to run multiple JOIN's will take longer and make your queries more complex.
On the other hand if your intention is to have a flexible storage database which is not meant to be hit with a ton of fast-response queries, then normalizing the tables by splitting them out into multiple xref'ed tables will provide more flexibility in your design and reduce the need for submitting duplicated data.
Since de-normalization is "optimization", I would suggest you normalize the tables first, index them properly and see if you're getting any bottlenecks on your queries. If so, flatten the affected tables where needed.
You should really consider your whole database structure and do a ER diagram (entity relationship diagram) first. OF COURSE there should be another table called "Person" where the concept of a person is stored...

Resources