Performance in database design - sql-server

I have to implement a testing platform. My database needs the following tables: Students, Teachers, Admins, Personnel and others. I would like to know if it's more efficient to have the FirstName and LastName in each of these tables, or to have another table, Persons, and each of the other table to be linked to this one with PersonID.
Personally, I like it this way, although trickier to implement, because I think it's cleaner, especially if you look at it from the object-oriented point of view. Would this add an unnecessary overhead to the database?
Don't know if it helps to mention I would like to use SQL Server and ADO.NET Entity Framework.

As you've explicitly mentioned OO and that you're using EntityFramework, perhaps its worth approaching the problem instead from how the framework is intended to work - rather than just building a database structure and then trying to model it?
Entity Framework Code First Inheritance : Table Per Hierarchy and Table Per Type is a nice introduction to the various strategies that you could pick from.
As for the note on adding unnecessary overhead to the database - I wouldn't worry about it just yet. EF is generally about getting a product built more rapidly and as it has to cope with a more general case, doesn't always produce the most efficient SQL. If the performance is a problem after your application is built, working and correct you can revisit and fix up the most inefficient stuff then.

If there is a person overlap between the mentioned tables, then yes, you should separate them out into a Persons table.
If you are only tracking what role each Person has (i.e. Student vs. Teacher etc) then you might consider just having the following three tables: Persons, Roles, and a bridge table PersonRoles.
On the other hand, if each role has it's own unique fields, then you should carry on as you are and leave each of these tables separate with a foreign key of PersonID.

If the attributes (i.e. First Name, Last Name, Gender etc) of these entities (i.e. Students, Teachers, Admins and Personnel) are exactly the same then you could just make a single table for all the entities with PersonType or Role attribute added to distinguish each person's role. However, if the entities has a lot of different attributes then it would be better that you create separate tables otherwise you will have normalization problem.

Yes that is a very bad way of structuring a DB. The DB structure should be designed based on the Normalizations.
Please check the normalization forms.
U should avoid the duplicate data as much as possible, else the queries will become slower.
And the main problem is when u r trying to get data that is associated with more than one or two tables.

Related

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide what would be a better option I asked my teacher Databases, unfortunately he couldn't tell me which of the two options is better than the other and why.
I designed the database for a dummy data generator. Since I want to generate multilangual data I thought of these tables. (But its a simplification of the tables).
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language, sometimes a name can have multiple origins (ex: Nick is both a Dutch as an English name).
Each language has multiple names and streets.
These two rules result in a Many-to-Many relationship. At the moment I've got only two tables, but I know I will get between 10 and 20 of these kind of tables.
The regular way one would do this is just make 10 to 20 Many-to-Many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create good and enough realistic looking working data for projects. This application will generate this data for you and return the needed SQL so you only have to run the queries.
The user comes to the site to get the data. He states his tablename, his columnnames and then he can link the columnnames to types of data, think of:
* Firstname
* Lastname
* Email adress (which will be randomly generated from the name of the person)
* Adress details (street, housenumber, zipcode, place, country)
* A lot more
Then, after linking columns with the types the user can set the number of rows he wants to make. The application will then choose a country at random and generate realistic looking data according to the country they live in.
That's actually an excellent question. This sort of thing leads to a genuine problem in database design and there is a real tradeoff. I don't know what rdbms you are using but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with check constraints that only one fkey can be filled in besides language and one column per potential table. Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features in your rdbms for a solution. For example in postgresql this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
Unfortunately there is no single definite answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I was just working with one RDBMS, I would do the last one. But if not, I would probably do one table per relationship and focus on tooling to manage the problems that come up. But the former preference is about my level of knowledge and confidence, and the latter is a bit more of a personal opinion.
So I hope this helps you look at the tradeoffs and select what is right for you.

How to support variable number of one-to-many relationships in database

I am trying to design a Person database. The requirement is that a Person can have one or more varying number of children, cars, jobs, and homes.
So, currently, the way I have designed this is:
Person {
CharField name
DateField dob
CharField city
...
# Some standard base person data
}
Since I want to support variable number of associations, I create separate tables with one-to-many relationships. For example, I have
Home {
ForeignKey Person
CharField home_address
...
}
Job {
ForeignKey Person
CharField company_nme
CharField office_address
...
}
And so on for other fields.
This works fine because I can have as many or as few entries per person.
The downside is that for each Person, I do lookup on 5-6 tables. I am going to need more fields, so the lookups will increase.
Is there a paradigm to efficiently design this kind of scenario?
If it is of interest, I use Django with PostGreSql.
Edit:
The server is mostly making REST API responses off the database. The browser client needs the entire data for one Person at one go (to reduce multiple requests over network). So I will have to do the multiple joins together.
Actually, for my Person table, I really do not need any relational-stuff. Other tables in my DB are heavily relational. The reason I am thinking of this now is because I suspect that the lot of joins will result in slower performance, and changing the design later will be difficult.
I also came across JSONField for PostGreSql and I was wondering whether I should use those to save the "hanging-off" data so that the REST calls do not result in a multitude of JOINS. Since this is design level, I am thinking of the issue now because I am not sure changing this going ahead will be feasible.
Thanks a lot for your inputs.
Your design is correct. The number of tables is a reflection of the complexity (or not) of the application.
The "paradigm to efficiently design this kind of scenario" is the relational model and you are designing in terms of tables because you are working within that paradigm.
Your notions about "the downside" and "lookups" and "efficiency" presume implementation aspects without justification. The DBMS takes your declarations and updates and answers your queries and hides how. Implementation issues do arise, but far from the level of experience and knowledge suggested by your question.
Just make a staightforward design.

Database Is-a relationship

My problem relates to DB schema developing and is as follows.
I am developing a purchasing module, in which I want to use for purchasing items and SERVICES.
Following is my EER diagram, (note that service has very few specialized attributes – max 2)
My problem is to keep products and services in two tables or just in one table?
One table option –
Reduces complexity as I will only need to specify item id which refers to item table which will have an “item_type” field to identify whether it’s a product or a service
Two table option –
Will have to refer separate product or service in everywhere I want to refer to them and will have to keep “item_type” field in every table which refers to either product or service?
Currently planning to use option 1, but want to know expert opinion on this matter. Highly appreciate your time and advice. Thanks.
I'd certainly go to the "two tables" option. You see, you have to distinguish Products and Services, so you may either use switch(item_type) { ... } in your program or entirely distinct code paths for Product and for Service. And if a need for updating the DB schema arises, switch is harder to maintain.
The second reason is NULLs. I'd advise avoid them as much as you can — they create more problems than they solve. With two tables you can declare all fields non-NULL and forget about NULL-processing. With one table option, you have to manually write code to ensure that if item_type=product, then Product-specific fields are not NULL, and Service-specific ones are, and that if item_type=service, then Service-specific fields are not NULL, and Product-specific ones are. That's not quite pleasant work, and the DBMS can't do it for you (there is no NOT NULL IF another_field = value column constraint in SQL or anything like this).
Go with two tables. It's easier to support. I once saw a DB where everything, every single piece of data went in just two tables — there were pages and pages of code to make sure that necessary fields are not NULL.
If I were to implement I would have gone for the Two table option, It's kinda like the first rule of normalization of the schema. To remove multi-valued attributes. Using item_type is not recommended. Once you create separate tables you dont need to use the item_type you can just use the foreign key relationship.
Consider reading this article :
http://en.wikipedia.org/wiki/Database_normalization
It should help.

Person name structure in separate database table

I am wondering when and when not to pull a data structure into a separate database table when it appears in several tables.
I have pulled the 12 attribute address structure into a separate table because I have a couple of different entities containing a single address in this format.
But how about my 3 attribute person name structure (given, middle, surname)?
Should this be put into its own table referenced with a foreign key for all the entities containing a name... e.g. the company table has a contact person name, the citizen table has a person name etc.
Are these best left as attributes in the main tables or should they be extracted?
I would usually keep the address on the Person table, unless there was an unusual need for absolutely uniform addresses on each entity, or if an entity could have an arbitrary number of addresses, or if addresses need to be shared between entities, or if it was a large enterprise product where I know I have to invest in infrastructure all over the place or I will end up gutting everything down the road.
Having your addresses in a seperate table is interesting because it's flexible, but in the context of a small project lacking a special need like the ones mentioned above, it's probably a slight waste. Always be aware of the balance between complexity and flexibility. Flexibility is important, but be discriminating... It's easy to invest way too much there!
In concrete terms, the times that I experimented with (for instance) one-to-one relationships for things like addresses, I ended up refactoring them back into the table because it introduced a bunch of headaches including more complex queries, dealing with situations where the address does not exist, etc. More entities also increases your cognitive load -- it makes the project harder to think about. In my case, it was an unecessary cost because there was no concrete need and, in truth, not even a gain in flexibility.
So, based on my experiences, I would "try" to keep the addresses in the same table, and I would definitely keep the names on them - again, unless there was a special need.
So to paraphrase Einstein, make it as simple as possible and no simpler. But in the short term, experiment. It's the best way to learn these lessons.
It's about not repeating information, so you don't want to store the same information in two places when one will do.
Another useful rule of thumb is one entity per table. If you find that one table contains, say, "person" AND "order" then you probably should split those into two tables.
And (putting myself at risk of repeating information...) you might find it helpful to review some database design basics, there are plenty of related questions here on stackoverflow.
Start with these...
What is normalisation?
What is important to keep in mind when designing a database
How many fields is 'too many'?
More tables or more columns?
Creating a person entity across your data model will give you this present and future advantages -
The same person occurring as a contact, or individual in different contexts. Saves redundancy.
Info can be maintained and kept current with far-less effort.
Easier to search for a person and identify them - i.e. is it the same John Smith?
You can expand the information - i.e. maintain addresses for this person far more easily.
Programming will be more consistent and debugging will be easier as well.
Moves you closer to a 'self-documenting' system.
As a counterpoint to the other (entirely valid) replies: within your application's current structure, how likely will it be for a given individual (not just name, the actual "person" -- multiple people could be "John Smith") to appear in more than one table? The less likely this is to happen, the less likely you are to get benefits from normalization.
Another way to think of it is entities. Outside of labels (names), is their any overlap between "customer" entity and an "employee" entity?
Extract them. Your aim should be to have no repeating data in your database.
Read about Normalization
It really depends on the problem you are trying to solve. In general it is probably a good idea to have some sort of 'person' table which holds details of people. However, there are occasions where that is potentially a very bad idea.
One example would be if you are holding details of prescriptions written out to people by a doctor. In some countries it is a legal requirment that the prescription details are held with the name in which they were prescribed NOT the name the person is going under currently. For instance a woman might be prescribed a drug as miss X, but then she gets married and becomes Mrs Y. If you had a person table that was linked to the prescriptions table you would now have the wrong details and would possibly face legal consequences. In that case you would need to probably copy the relevant details of the person into the prescription table, even though this would be duplicating data.
So again - it depends on the problem you are trying to solve. Don't just blindly follow what people consider to be best practices. Understand your data and any issues surrounding it, then try to follow best practices that fit.
Depends on what you're using the database for.
If you want fast queries on your tables you should de-normalize your tables. Having to run multiple JOIN's will take longer and make your queries more complex.
On the other hand if your intention is to have a flexible storage database which is not meant to be hit with a ton of fast-response queries, then normalizing the tables by splitting them out into multiple xref'ed tables will provide more flexibility in your design and reduce the need for submitting duplicated data.
Since de-normalization is "optimization", I would suggest you normalize the tables first, index them properly and see if you're getting any bottlenecks on your queries. If so, flatten the affected tables where needed.
You should really consider your whole database structure and do a ER diagram (entity relationship diagram) first. OF COURSE there should be another table called "Person" where the concept of a person is stored...

Separating user table from people table in a relational database

I've done many web apps where the first thing you do is make a user table with usernames, passwords, names, e-mails and all of the other usual flotsam. My current project presents a situation where non-users records need to function similarly to users, but do not need to the ability to be a first order user.
Is it reasonable to create a second table, people_tb, that is the main relational table and data store, and only use the users_tb for authentication? Does separating user_tb from people_tb present any problems? If this is commonly done, what are some strategies and solutions as well as drawbacks?
This is certainly a good idea, as you are normalizing the database. I have done a similar design in an app that I am writing, where I have an employee table and a user table. Users may a from an external company or an employee, so I have separate tables because an employee is always a user, but a user may not be an employee.
The issues that you'll run into is that whenever you use the user table, you'll nearly always want the person table to get the name or other common attributes you would want to show up.
From a coding standpoint, if you're using straight SQL, it will take a little more effort to mentally parse the select statement. It may be a little more complicated if you're using an ORM library. I don't have enough experience with those.
In my application, I'm writing it in Ruby on Rails, so I'm constantly doing things like employee.user.name, where if I kept them together, it would be just employee.name or user.name.
From a performance standpoint, you are hitting two tables instead of one, but given proper indexes, it should be negligible. If you had an index that contained the primary key and the person name, for instance, the database would hit the user table, then the index for the person table (with a nearly direct hit), so the performance would be nearly the same as having one table.
You could also create a view in the database to keep both tables joined together to give you additional performance enhancements. I know in the later versions of Oracle you can even put an index on a view if needed to increase performance.
I routinely do that because for me the concept of "user" (username, password, create date, last login date) is different from "person" (name, address, phone, email). One of the drawbacks that you may find is that your queries will often require more joins to get the info you're looking for. If all you have is a login name, you'll need to join the "people" table to get the first and last name for example. If you base everything around the user id primary key, this is mitigated a bit, but still pops up.
If user_tb has auth info, I would very much keep it separate from people_tb. I would however keep a relationship between the two, and most of users' info would be stored in people_tb except all of the info needed for auth (which i guess will not be used for much else) Its a nice tradeoff between design and efficiency i think.
That is definitely what we do as we have millions of people records and only thousands of users. We also separate address, phones and emails into relational tables as many people have more than one of each of these things. Critial is to not rely on name as the identifier as name is not unique. Make sure the tables are joined through some type of surrogate key (an integer or a GUID is preferable) not name.
I always try to avoid as much data repetition as possible. If not all people need to login, you can have a generic people table with the information that applies to both people and users (eg. firstname, lastname, etc).
Then for people that login, you can have a users table that has a 1~1 relationship with people. This table can store the username and password.
I'd say go for the normalized design (two tables) and only denormalize (go down to one user/person table) if it will really make your life easier down the line. If however practically all people are also users it may be simpler to denormalize up front. Its up to you; I have used the normalized approach without problems.
Very reasonable.
As an example, take a look at the aspnet_* services tables here.
Their built in schema has a aspnet_Users and aspnet_Membership with the later table having more extended information about a given user (hashed passwords, etc) but the aspnet_User.UserID is used in the other portions of the schema for referential integrity etc.
Bottom line, it's very common, and good design, to have attributes in a separate table if they are different entities, as in your case.

Resources