Person name structure in separate database table - database

I am wondering when and when not to pull a data structure into a separate database table when it appears in several tables.
I have pulled the 12 attribute address structure into a separate table because I have a couple of different entities containing a single address in this format.
But how about my 3 attribute person name structure (given, middle, surname)?
Should this be put into its own table referenced with a foreign key for all the entities containing a name... e.g. the company table has a contact person name, the citizen table has a person name etc.
Are these best left as attributes in the main tables or should they be extracted?

I would usually keep the address on the Person table, unless there was an unusual need for absolutely uniform addresses on each entity, or if an entity could have an arbitrary number of addresses, or if addresses need to be shared between entities, or if it was a large enterprise product where I know I have to invest in infrastructure all over the place or I will end up gutting everything down the road.
Having your addresses in a seperate table is interesting because it's flexible, but in the context of a small project lacking a special need like the ones mentioned above, it's probably a slight waste. Always be aware of the balance between complexity and flexibility. Flexibility is important, but be discriminating... It's easy to invest way too much there!
In concrete terms, the times that I experimented with (for instance) one-to-one relationships for things like addresses, I ended up refactoring them back into the table because it introduced a bunch of headaches including more complex queries, dealing with situations where the address does not exist, etc. More entities also increases your cognitive load -- it makes the project harder to think about. In my case, it was an unecessary cost because there was no concrete need and, in truth, not even a gain in flexibility.
So, based on my experiences, I would "try" to keep the addresses in the same table, and I would definitely keep the names on them - again, unless there was a special need.
So to paraphrase Einstein, make it as simple as possible and no simpler. But in the short term, experiment. It's the best way to learn these lessons.

It's about not repeating information, so you don't want to store the same information in two places when one will do.
Another useful rule of thumb is one entity per table. If you find that one table contains, say, "person" AND "order" then you probably should split those into two tables.
And (putting myself at risk of repeating information...) you might find it helpful to review some database design basics, there are plenty of related questions here on stackoverflow.
Start with these...
What is normalisation?
What is important to keep in mind when designing a database
How many fields is 'too many'?
More tables or more columns?

Creating a person entity across your data model will give you this present and future advantages -
The same person occurring as a contact, or individual in different contexts. Saves redundancy.
Info can be maintained and kept current with far-less effort.
Easier to search for a person and identify them - i.e. is it the same John Smith?
You can expand the information - i.e. maintain addresses for this person far more easily.
Programming will be more consistent and debugging will be easier as well.
Moves you closer to a 'self-documenting' system.

As a counterpoint to the other (entirely valid) replies: within your application's current structure, how likely will it be for a given individual (not just name, the actual "person" -- multiple people could be "John Smith") to appear in more than one table? The less likely this is to happen, the less likely you are to get benefits from normalization.
Another way to think of it is entities. Outside of labels (names), is their any overlap between "customer" entity and an "employee" entity?

Extract them. Your aim should be to have no repeating data in your database.
Read about Normalization

It really depends on the problem you are trying to solve. In general it is probably a good idea to have some sort of 'person' table which holds details of people. However, there are occasions where that is potentially a very bad idea.
One example would be if you are holding details of prescriptions written out to people by a doctor. In some countries it is a legal requirment that the prescription details are held with the name in which they were prescribed NOT the name the person is going under currently. For instance a woman might be prescribed a drug as miss X, but then she gets married and becomes Mrs Y. If you had a person table that was linked to the prescriptions table you would now have the wrong details and would possibly face legal consequences. In that case you would need to probably copy the relevant details of the person into the prescription table, even though this would be duplicating data.
So again - it depends on the problem you are trying to solve. Don't just blindly follow what people consider to be best practices. Understand your data and any issues surrounding it, then try to follow best practices that fit.

Depends on what you're using the database for.
If you want fast queries on your tables you should de-normalize your tables. Having to run multiple JOIN's will take longer and make your queries more complex.
On the other hand if your intention is to have a flexible storage database which is not meant to be hit with a ton of fast-response queries, then normalizing the tables by splitting them out into multiple xref'ed tables will provide more flexibility in your design and reduce the need for submitting duplicated data.
Since de-normalization is "optimization", I would suggest you normalize the tables first, index them properly and see if you're getting any bottlenecks on your queries. If so, flatten the affected tables where needed.

You should really consider your whole database structure and do a ER diagram (entity relationship diagram) first. OF COURSE there should be another table called "Person" where the concept of a person is stored...

Related

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide what would be a better option I asked my teacher Databases, unfortunately he couldn't tell me which of the two options is better than the other and why.
I designed the database for a dummy data generator. Since I want to generate multilangual data I thought of these tables. (But its a simplification of the tables).
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language, sometimes a name can have multiple origins (ex: Nick is both a Dutch as an English name).
Each language has multiple names and streets.
These two rules result in a Many-to-Many relationship. At the moment I've got only two tables, but I know I will get between 10 and 20 of these kind of tables.
The regular way one would do this is just make 10 to 20 Many-to-Many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create good and enough realistic looking working data for projects. This application will generate this data for you and return the needed SQL so you only have to run the queries.
The user comes to the site to get the data. He states his tablename, his columnnames and then he can link the columnnames to types of data, think of:
* Firstname
* Lastname
* Email adress (which will be randomly generated from the name of the person)
* Adress details (street, housenumber, zipcode, place, country)
* A lot more
Then, after linking columns with the types the user can set the number of rows he wants to make. The application will then choose a country at random and generate realistic looking data according to the country they live in.
That's actually an excellent question. This sort of thing leads to a genuine problem in database design and there is a real tradeoff. I don't know what rdbms you are using but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with check constraints that only one fkey can be filled in besides language and one column per potential table. Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features in your rdbms for a solution. For example in postgresql this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
Unfortunately there is no single definite answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I was just working with one RDBMS, I would do the last one. But if not, I would probably do one table per relationship and focus on tooling to manage the problems that come up. But the former preference is about my level of knowledge and confidence, and the latter is a bit more of a personal opinion.
So I hope this helps you look at the tradeoffs and select what is right for you.

Performance in database design

I have to implement a testing platform. My database needs the following tables: Students, Teachers, Admins, Personnel and others. I would like to know if it's more efficient to have the FirstName and LastName in each of these tables, or to have another table, Persons, and each of the other table to be linked to this one with PersonID.
Personally, I like it this way, although trickier to implement, because I think it's cleaner, especially if you look at it from the object-oriented point of view. Would this add an unnecessary overhead to the database?
Don't know if it helps to mention I would like to use SQL Server and ADO.NET Entity Framework.
As you've explicitly mentioned OO and that you're using EntityFramework, perhaps its worth approaching the problem instead from how the framework is intended to work - rather than just building a database structure and then trying to model it?
Entity Framework Code First Inheritance : Table Per Hierarchy and Table Per Type is a nice introduction to the various strategies that you could pick from.
As for the note on adding unnecessary overhead to the database - I wouldn't worry about it just yet. EF is generally about getting a product built more rapidly and as it has to cope with a more general case, doesn't always produce the most efficient SQL. If the performance is a problem after your application is built, working and correct you can revisit and fix up the most inefficient stuff then.
If there is a person overlap between the mentioned tables, then yes, you should separate them out into a Persons table.
If you are only tracking what role each Person has (i.e. Student vs. Teacher etc) then you might consider just having the following three tables: Persons, Roles, and a bridge table PersonRoles.
On the other hand, if each role has it's own unique fields, then you should carry on as you are and leave each of these tables separate with a foreign key of PersonID.
If the attributes (i.e. First Name, Last Name, Gender etc) of these entities (i.e. Students, Teachers, Admins and Personnel) are exactly the same then you could just make a single table for all the entities with PersonType or Role attribute added to distinguish each person's role. However, if the entities has a lot of different attributes then it would be better that you create separate tables otherwise you will have normalization problem.
Yes that is a very bad way of structuring a DB. The DB structure should be designed based on the Normalizations.
Please check the normalization forms.
U should avoid the duplicate data as much as possible, else the queries will become slower.
And the main problem is when u r trying to get data that is associated with more than one or two tables.

Onetomany with parent database design

Below is a database design which represents my problem(it is not my actual database design). For each city I need to know which restaurants, bars and hotels are available. I think the two designs speak for itself, but:
First design: create one-to-many relations between city and restaurants, bars and hotels.
Second design: only create an one-to-many relation between city and place.
Which design would be best practice? The second design has less relations, but would I be able to get all the restaurants, bars and hotels for a city and there own data (property_x/y/z)?
Update: this question is going wrong, maybe my fault for not being clear.
the restaurant/bar/hotel classes are subclasses of "place" (in both
designs).
the restaurant/bar/hotel classes must have the parent "place"
the restaurant/bar/hotel classes have there own specific data (property_X/Y/X)
Good design first
Your data, and the readability/understandability of your SQL and ERD, are the most important factors to consider. For the purpose of readability:
Put city_id into place. Why: Places are in cities. A hotel is not a place that just happens to be in a city by virtue of being a hotel.
Other design points to consider are how this structure will be extended in the future. Let's compare adding a new subtype:
In design one, you need to add a new table, relationship to 'place' and a relationship to city
In design two, you simply add a new table and relationship to 'place'.
I'd again go with the second design.
Performance second
Now, I'm guessing, but the reason for putting city_id in the subtype is probably that you anticipate that it's more efficient or faster in some specific use cases and this may be a very good reason to ignore readability/understandability. However, until you measure performance on the actual hardware you'll deploy on, you don't know:
Which design is faster
Whether the difference in performance would actually degrade the overall system
Whether other optimization approaches (tuning SQL or database parameters) is actually a better way to handle it.
I would argue that design one is an attempt to physically model the database on an ERD, which is a bad practice.
Premature optimization is the root of a lot of evil in SW Engineering.
Subtype approaches
There are two solutions to implementing subtypes on an ERD:
A common-properties table, and one table per subtype, (this is your second model)
A single table with additional columns for subtype properties.
In the single-table approach, you would have:
A subtype column, TYPE INT NOT NULL. This specifies whether the row is a restaurant, bar or hotel
Extra columns property_X, property_Y and property_Z on place.
Here is a quick table of pros and cons:
Disadvantages of a single-table approach:
The extension columns (X, Y, Z) cannot be NOT NULL on a single table approach. You can implement row-level constraints, but you lose the simplicity and visibility of a simple NOT NULL
The single table is very wide and sparse, especially as you add additional subtypes. You may hit the max. number of columns on some databases. This can make this design quite wasteful.
To query a list of a specific subtype, you have to filter using a WHERE TYPE = ? clause, whereas the table-per-subtype is a much more natural `FROM HOTEL INNER JOIN PLACE ON HOTEL.PLACE_ID = PLACE.ID"
IMHO, mapping into classes in an object-oriented languages is harder and less obvious. Consider avoiding if this DB is going to be mapped by Hibernate, Entity Beans or similar
Advantages of a single-table approach:
By consolidating into a single table, there are no joins, so queries and CRUD operations are more efficient (but is this small difference going to cause problems?)
Queries for different types are parameterized (WHERE TYPE = ?) and therefore more controllable in code rather than in the SQL itself (FROM PLACE INNER JOIN HOTEL ON PLACE.ID = HOTEL.PLACE_ID).
There is no best design, you have to pick based on the type of SQL and CRUD operations you are doing most frequently, and possibly on performance (but see above for a general warning).
Advice
All things being equal, I would advise the default option is your second design. But, if you have an overriding concern such as those I listed above, do choose another implementation. But don't optimize prematurely.
Both of them and none of them at all.
If I need to choose one, I would keep the second one, because of the number of foreign keys and indexes needed to be created after. But, a better approach would be: create a table with all kinds of places (bars, restaurants, and so on) and assign to each row a column with a value of the type of the place (apply a COMPRESS clause with the types expected at the column). It would improve both performance and readability of the structure, plus being more easier to maintain. Hope this helps. :-)
you do not show alternate columns in any of the sub-place tables. i think you should not split type data into table names like 'bar','restaurant', etc - these should be types inside the place table.
i think further you should have an address table - one column of which is city. then each place has an address and you can easily group by city when needed. (or state or zip code or country etc)
I think the best option is the second one. In the first design, there is a possibility of data errors as one place can be assigned to a particular restaurant (or any other type) in one city (e.g. A) and at the same time can be assigned to another restaurant in a different city (e.g. B). In the second design, a place is always bound to a particular city.
First:
Both designs can get you all the appropriate data.
Second:
If all extending classes are going to implement the location (which sound obvious for your implementation) then it would be a better practice to include it as part of the parent object. This would suggest option 2.
Thingy:
The thing is that even-tough you can find out the type of each particular PLACE, it is easier to just know that a type (CHILD) is always a place (PARENT). You can think of that while you visualize the result-set of option 2. With that in mind, I recommend the first approach.
NOTE:
First one doesn't have more relations, it just splits them.
If bar, restaurant and hotel have different sets of attributes, then they are different entities and should be represented by 3 different tables. But why do you need the place table? My advice it to ditch it and have 3 tables for your 3 entities and that's that.
In code, collecting common attributes into a parent class is more organised and efficient than repeating them in each child class - of course. But as spathirana comments above, database design is not like OOP. Sure, you'll save on the repetition of column names by sticking common attributes of places into a "place" table. But it will also add complication:
- you have to join on that table whenever you want to reference a bar, restaurant or hotel
- you have to insert into two tables whenever you want to add a new bar, restaurant or hotel
- you have to update two tables when ... etc.
Having 3 tables without the place table is ALSO, PROBABLY, the most performance-optimal design. But that's not where I'm coming from. I'm thinking of clean, simple database design where a single entity means a single row in a single table. There are no "is-a" relationships in a relational DB. Foreign key relationships are "has-a". OK, there are exceptions I'm sure, but your case is not exceptional.

Database Normalization Vocabulary

There is lot or material on database normalization available on Steve's Class and the Web. However, I still seem to lack on very definite reasons on explaining normalization.
For example, for a simple design such as a table Item with a Type field, it makes sense to have the Type as a separate table. The reason I forwarded for that was if in future any need arose to add properties to the Type, it would be much easier with a separate table already existing.
Are there more reasons which can be shown to be obvious?
Check these out too:
An Introduction to Database Normalization
A Simple Guide to Five Normal Forms
in Relational Database Theory
This article says it better than I can:
There are two goals of the normalization process: eliminating redundant data (for example, storing the same data in more than one table) and ensuring data dependencies make sense (only storing related data in a table). Both of these are worthy goals as they reduce the amount of space a database consumes and ensure that data is logically stored.
Normalization is the process of organizing data in a database. This includes creating tables and establishing relationships between those tables according to rules designed both to protect the data and to make the database more flexible by eliminating redundancy and inconsistent dependency.
Redundant data wastes disk space and creates maintenance problems. If data that exists in more than one place must be changed, the data must be changed in exactly the same way in all locations. A customer address change is much easier to implement if that data is stored only in the Customers table and nowhere else in the database.
What is an "inconsistent dependency"? While it is intuitive for a user to look in the Customers table for the address of a particular customer, it may not make sense to look there for the salary of the employee who calls on that customer. The employee's salary is related to, or dependent on, the employee and thus should be moved to the Employees table. Inconsistent dependencies can make data difficult to access because the path to find the data may be missing or broken.
following links can be useful:
http://support.microsoft.com/kb/283878
http://neerajtripathi.wordpress.com/2010/01/12/normalization-of-data-base/
Edgar F. Codd, the inventor of the relational model, introduced the concept of normalization. In his own words:
To free the collection of relations from undesirable insertion, update and deletion dependencies;
To reduce the need for restructuring the collection of relations as new types of data are introduced, and thus increase the life span of application programs;
To make the relational model more informative to users;
To make the collection of relations neutral to the query statistics, where these statistics are liable to change as time goes by.
— E.F. Codd, "Further Normalization of the Data Base Relational Model"
Taken word-for-word from Wikipedia:Database normalization

Is it a bad idea to make a generic link table?

Imagine a meta database with a high degree of normalization. It would blow up this input field if I would attempt to describe it here. But imagine, every relationship through the entire database, through all tables, go through one single table called link. It has got these fields: master_class_id, master_attr_id, master_obj_id, class_id2, obj_id2. This table can easily represent all kinds of relationships: 1:1, 1:n, m:n, self:self.
I see the problem that this table is going to get HUUUUGE. Is that bad practice?
That is wrong on two accounts:
It'll be a tremendous bottleneck for all your queries and it'll kill any chance of throughput.
It reeks of bad design: you should be able to describe things more concisely and closer to reality. If this is really the best way to store the data you can consider partitioning or even another paradigm instead of the relational
In a word, yes, this is a bad idea
Without going into too many details, I would offer the following:
for a meta database, the link table should be split by (high level) entity : that is, you should have a separate link table for each entity
another link table is required for the between-entities links
Normally the high-level entities are fairly easy to identify, like Customer.
It is usually bad practice but not because the table is huge. The problem is that you are mixing unrelated data in one table.
The reason to keep the links in separate tables, is because you won't need to use them together.
It is a common mistake that is also done with data itself: You should not mix two sets of data in one table only because the fields are similar if the data itself is unrelated.
Relational databases don't actually fit for this model.
It's possible to implement it but it will be quite slow. The main drawback is that you won't be able to index the links efficiently.
However, this design can be useful in two cases:
This only stores the metadata: declared relationships between the entities. The actual data are stored in the plain relational tables, so this links are only used to show the structure but not in the actual queries.
This stores some structures which are complex but contain few data, so that the ease of development overweights the performance drawbacks.
This design can be seen in several ORMs (one of which I even developed).
I don't see the purpose of this type of table anyway. If you have table A that is one-to-many to table B then A is going to still have a PK and B will still have a PK. A would normally contain a FK to B.
So in the Master_Table you will have to store A PK, B FK which is just a duplicate of what is already there. The only thing you will 'lose' is the FK in table A but you just migrated it into a giant table that is hard to deal with by the database, the dba, and anyone coding using the db.
Those table appear in Access most frequently and show up on the DailyWTF because they are insanely hard to read and understand.
Oh! And a main problem is that to make the table ubiquitous you will have to make generic columns which will probably end up destroying data integrity.

Resources