Relational database design: standard row values in one table vs. separate tables - database

Note: I've seen a few related questions about similar issues; however, none of them fully answers my question.
I have exam data for schools. There are around 500 schools and around 12 subject exams in my dataset (each school has data for each exam). Each exam has 6 attributes (columns). After the initial data is loaded into the database, no modifications are expected. With respect to SELECT queries, I imagine that queries over a single exam would run about as often as queries spanning several exams. However, the database would be used by a website visualizing the data, so those SELECT queries might have to run rather often. With that in mind, I can think of three ways of organizing the data, each of which (apparently) produces BCNF tables.
First schema:
school
exam1_attr1
exam1_attr2
...
exam12_attr6
This schema feels wrong, though I do not have strong arguments against it. As I said, my data would not change, so having the exams baked into attribute names is not that much of an issue. However, such a setup would pose some aggregation difficulties over the entire dataset (i.e., the resulting queries could become unnecessarily complicated).
Second schema:
school
examID
attr1
attr2
...
attr6
While this schema looks attractive, I find it hard to convince myself that it is a good idea to represent exams as values rather than columns or separate tables. That is, the set of exams is known, finite, and final, and each exam has exactly the same properties, which sounds like a prime candidate for a separate table. On the other hand, under such an arrangement, both aggregation and single-exam queries are very clean and straightforward.
The third schema would be 12 separate exam tables, each with identical structure:
school
attr1
attr2
...
attr6
Conceptually, I feel that this schema represents my data best: each exam is logically separated into its own table. However, any query requiring aggregate data over all exams would then involve 12 tables, and that makes me rather uneasy.
Thus, my question: which database design would be best in my case? While I am looking for an answer, I am also very interested in reasons for choosing one schema over the other. Specifically, I wonder:
how efficiency of running queries changes with each database design,
how important in real life is the ease of writing queries (given that the data would be primarily used by a website - I would seldom write queries over the data after the website has been finished),
which design is better if potential future changes to the data of the website are taken into account,
whether your answer would be different if the number of schools were not 500, but 50,000.
In short, I am interested in any opinions that would help me understand why one design is better than the other. Any database design theories are welcome as well. Thanks!

In an operational relational database, the speed of changes is more important than the speed of selects. In a data warehouse, the speed of selects is more important than the speed of changes.
You have a data warehouse.
Operational relational databases are normalized.
Data warehouses use some variation of a star schema.
Your second schema is a good schema for the reason you stated. Both aggregation and single-exam queries are very clean and straight-forward. However, you should put the school information in a separate school table, and reference the school table ID (primary key field, auto-increment integer) as a foreign key in the exam table. This allows you to scale from 500 to 50,000 schools more easily.
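A minimal DDL sketch of that recommendation, with assumed names and types (attr1..attr6 are placeholders for the six real exam attributes):
create table school (
    id integer primary key,  -- auto-increment integer via your RDBMS's mechanism
    name varchar(200) not null
);
create table exam_result (
    school_id integer not null references school(id),
    exam_id   integer not null,  -- 1..12: the known, fixed set of exams
    attr1 numeric, attr2 numeric, attr3 numeric,
    attr4 numeric, attr5 numeric, attr6 numeric,
    primary key (school_id, exam_id)
);
With this layout, both per-exam filters (where exam_id = 7) and dataset-wide aggregates (group by school_id) stay one-table simple.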

Related

Database design, multiple M-M tables or just one?

Today I was designing a database for a potential personal project of mine. Since I couldn't decide which option would be better, I asked my Databases teacher; unfortunately, he couldn't tell me which of the two options is better or why.
I designed the database for a dummy data generator. Since I want to generate multilingual data, I thought of these tables (a simplification):
(first and last)names: id, name
streets: id, name
languages: id, name
Each names.name and streets.name originates from a language; sometimes a name can have multiple origins (e.g., Nick is both a Dutch and an English name).
Each language has multiple names and streets.
These two rules result in a many-to-many relationship. At the moment I've got only two tables, but I know I will end up with between 10 and 20 of these kinds of tables.
The regular way one would do this is to just make 10 to 20 many-to-many relationship tables.
Another idea I came up with was just one Many-to-Many table with a third column which specifies which table the id relates to.
At the moment I've got the design on my other PC so I will update it with my ideas visualized after dinner (2 hours or so).
Which idea is better and why?
To make the project idea a bit clearer:
It is always a hassle to create enough good, realistic-looking working data for projects. This application will generate this data for you and return the needed SQL, so you only have to run the queries.
The user comes to the site to get the data. He states his table name and his column names, and then he can link the column names to types of data, such as:
* Firstname
* Lastname
* Email address (which will be randomly generated from the name of the person)
* Address details (street, house number, zip code, place, country)
* A lot more
Then, after linking the columns with types, the user can set the number of rows he wants. The application will then choose a country at random and generate realistic-looking data according to the country the people live in.
That's actually an excellent question. This sort of thing leads to a genuine problem in database design, and there is a real tradeoff. I don't know what RDBMS you are using, but....
Basically you have four choices, all of them with serious downsides:
1. One M-M table with one foreign-key column per potential table, plus check constraints ensuring that exactly one of those fkeys is filled in besides language (see the sketch after this list). Ick....
2. One M-M table per relationship. This makes things quite hard to manage over time, especially if you need to change something from an int to a bigint at some point.
3. One M-M table with a polymorphic relationship. You lose a lot of referential integrity checks when you do this, and to make it safe, have fun coding (and testing!) triggers.
4. Look carefully at the advanced features of your RDBMS for a solution. For example, in PostgreSQL this can be solved with table inheritance. The downside is that you lose portability and end up in advanced territory.
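To make option 1 concrete, here is a minimal sketch, assuming PostgreSQL syntax and the names/streets/languages tables from the question:
create table language_links (
    language_id integer not null references languages(id),
    name_id     integer references names(id),
    street_id   integer references streets(id),
    -- exactly one of the optional foreign keys must be filled in
    check ((name_id is not null)::int + (street_id is not null)::int = 1)
);
Every additional target table means another nullable column and a wider check constraint, which is exactly why this gets ugly at 10 to 20 tables.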
Unfortunately there is no single definitive answer. You need to consider the tradeoffs carefully and decide what makes sense for your project. If I were working with just one RDBMS, I would take the last option. If not, I would probably use one table per relationship and focus on tooling to manage the problems that come up. But the former preference reflects my level of knowledge and confidence, while the latter is more of a personal opinion.
So I hope this helps you look at the tradeoffs and select what is right for you.

Are there situations where it makes sense to store redundant data in a database, to increase speed?

In the following situation:
user has many bags
bag has many items
users
-id
bags
-id
-user_id
items
-id
-bag_id
There are 2 ways to get access to a user's items.
1) An instance method can be added to a user, which iterates over each of the user's bags, and collects the bag's items into an array to return. In Ruby on Rails, something like:
# in user.rb
def items
  items = []
  bags.includes(:items).each { |bag| items += bag.items }
  items # return the array; the each call alone would return the bags relation
end
2) Add a user_id attribute directly to the items table, and add an additional relationship, so that user has many items. Then just do user.items.
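In SQL terms, option 2 is a sketch like the following: it denormalizes by duplicating bags.user_id onto items, which then has to be kept in sync on every write (index name is an assumption):
alter table items add column user_id integer references users(id);
create index items_user_id_idx on items (user_id);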
The 2nd method would be faster, but involves storing redundant data. Are there situations where it makes sense to implement it?
Yes, there are certain circumstances where it does make sense to introduce some controlled redundancy for the sake of performance. Generally, this should only be done when a database is unable to meet its performance requirements. This is called "denormalisation" and what you have to consider is that it:
Can make implementation more complex
Often sacrifices flexibility 
May speed up retrievals but slow down updates (as you now have to update more than one location)
So it's something to consider in cases where performance is unsatisfactory and a relation has a low update rate and high query rate.
There are also some denormalised database designs, such as the star schema, which are used in data warehousing.
Yes. In particular, reporting databases, data marts and data warehouses often use design principles that knowingly depart from some of the normalization rules. The result is a database that has some redundancy in it, but is not only faster to query, but easier as well.
Ease of query is particularly important when there is an analytic GUI between the database and the database user. These analytic tools are quite a bit easier to master if certain design principles are followed in database design. Normalization isn't particularly helpful in this regard.
Unnormalized design need not mean undisciplined design. In particular, it's worth boning up on star schema and snowflake schema designs if you plan on building a reporting database, a data mart, or a data warehouse. The process by which a star or snowflake schema is kept up to date, sometimes called ETL (extract-transform-load), has to be carefully written so as to prevent controlled redundancy from resulting in self-contradictory data.
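For illustration, here is a hedged star-schema fragment (all names hypothetical): the fact table stays skinny, holding only keys and measures, while the deliberate redundancy is confined to a dimension table that the ETL process keeps consistent.
create table dim_product (
    product_key   integer primary key,
    product_name  varchar(100) not null,
    category_name varchar(100) not null  -- denormalized: repeated for every product in a category
);
create table fact_sales (
    product_key integer not null references dim_product(product_key),
    sale_date   date not null,
    quantity    integer not null,
    amount      numeric(10,2) not null
);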
In transaction oriented databases, normalized is generally better, although many experts don't try to push it beyond Boyce-Codd normal form.
Unless there's something you haven't told us, a table like this
create table bagged_items (
    user_id integer not null,
    bag_id  integer not null,
    item_id integer not null,
    primary key (user_id, bag_id, item_id)
);
is in at least 5NF. It's all key. There's not a bit of redundant data there.
What you've done isn't normalization; normalization is based on identifying certain kinds of dependencies, and reducing their effects by projection. And what you've done isn't denormalization, either; denormalization is an undoing of normalization.
You've simply split a primary key into pieces. I don't pretend to know what principle you followed in order to justify that. It looks a little like "no table may have more than one foreign key" normal form. (But, of course, there's no such thing.)
For combining records from two SQL tables, databases implement efficient JOIN methods, which can be used from Ruby on Rails. For almost all applications this is fast enough. That being said, for certain high-performance stores you might want to store redundant data as you suggested, but this comes at the cost of having to keep the data in sync on writes.
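For example, with the tables from the question, fetching a user's items without any redundant user_id column is a single two-table JOIN (the literal 42 is just a stand-in user id):
select items.*
from items
join bags on bags.id = items.bag_id
where bags.user_id = 42;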

Table Design - Wide Table vs. Columns as Properties

I'm part of a team architecting an Operational Data Store (ODS) database, using SQL Server 2012, that will be used by some of our analysts to do predictive modeling. The ODS will contain manufacturing production data for a single product we make.
We will have hundreds of tables in the ODS. However, we will have a single core table that will contain critical information (lifecycle info) about each item manufactured (tens of millions each year). Our product is manufactured in a manufacturing plant and spends roughly 2.5 hours moving through various processes along a production line. We want to store various individual pieces of manufacturing and post-manufacturing information in this core table. An example piece of data might be the time the product entered a particular oven.
We have a decision to make on how to architect this table. We can create a wide table (many columns) or a narrow table where most of the would-be columns are stored as rows (property/value pairs). I have never designed or worked with a table structure that is very narrow, with columns treated as rows.
I'd like some feedback on the pros and cons of a wide table vs. a narrow table. The following might be useful in helping with this discussion:
Number of products produced each year: Several million (each of these product instances will be a row in the core table)
Will this table be queried often: Yes, very often. It will be the parent to many child tables.
Potential number of columns (or row properties): 75 to 150+
If more information would be useful, I'd be glad to provide it.
Wide tables, static properties
You are tracking a single product through a well-defined manufacturing process. This data model sounds very static, and would lend itself to a wide table with many columns that are consistently populated with data.
Narrow tables, dynamic properties
If you had many, many products with lots of variation in the manufacturing process, it would be better suited for a narrow table, where you could easily add new properties for tracking.
Difficult to query a narrow table
However, even simple querying of a narrow table can be extremely difficult. For example, what if you needed to sort the data by a certain property when that property is shuffled amongst 100+ other property rows? How would you get all the rows together to form a single "record" and then sort the record groups within your result set?
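As an illustration, suppose the properties lived in a hypothetical item_properties(item_id, property, value) table; even sorting filtered items by one property already requires a self-join to line the rows back up:
select p1.item_id, p1.value as oven_entry_time
from item_properties p1
join item_properties p2
  on p2.item_id = p1.item_id
where p1.property = 'oven_entry_time'
  and p2.property = 'production_line'
  and p2.value = 'LINE_A'
order by p1.value;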
Flat tables simpler to query
Depending on how you need to view and analyze the data, you may find yourself constantly using pivot or crosstab queries. If that's the case, then why not flatten out the storage table to begin with?
Or do both
Another option is to do both: Store the data narrowly, and use a transformation process to flatten it out for ease of reporting. That way you can quickly begin tracking new properties (just by adding rows), and then you can work on getting your reporting tables and transformation process updated to utilize the new data.
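A sketch of that flattening step, reusing the hypothetical item_properties table from above (item_report and the property names are likewise assumptions): a pivot that collapses the property rows into one wide reporting row per item.
insert into item_report (item_id, oven_entry_time, production_line)
select item_id,
       max(case when property = 'oven_entry_time' then value end),
       max(case when property = 'production_line' then value end)
from item_properties
group by item_id;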
How wide is too wide? Well, there can be several problems with wide tables.
One problem is that wide tables tend to deviate from the rules for normalizing data. This in turn can result in tricky update problems where you have to be careful to prevent the database from entering a self-contradictory state. There's no particular answer to how wide is too wide here. Just apply the normalization rules, and you'll end up decomposing the table.
However, some databases are not built with normalization as the guiding principle. In particular, consider fact tables in star schemas. There are times when some of the columns are determined by a subset of the FKs, and this can violate 3NF or even 2NF. Keeping fact tables skinny is still important in star schemas, but it's for a different reason, namely speed. Sometimes a fact table can be made skinnier by pushing data out to one of the dimension tables. Sometimes you can decompose a star into two or more related stars.
Your case sounds like the second reason given above, even though your design probably isn't a star schema. Still, star schema design principles might help you improve your design.

What is data normalization? [duplicate]

Possible Duplicate:
What exactly does database normalization do?
Can someone please clarify data normalization? What are the different levels? When should I "de-normalize"? Can I over normalize? I have a table with millions of records, and I believe I over-normalized it, but I'm not sure.
If you have a million columns, you probably under-normalized it.
What normalizing means is that every non-key attribute "must provide a fact about the key, the whole key, and nothing but the key."
If you have a column that depends on anything but the key, you should normalize your table.
Added to reply to comment:
If you have ProductID | ProductType | ProductTypeID, where ProductTypeID depends only on ProductType, you should make a new table for that:
ProductID | ProductTypeID, and in the other table: ProductTypeID | ProductTypeName.
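As DDL, the split described above might look like this (table and column spellings, and the types, are assumptions):
create table product_types (
    product_type_id   integer primary key,
    product_type_name varchar(100) not null
);
create table products (
    product_id      integer primary key,
    product_type_id integer not null references product_types(product_type_id)
);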
So, to answer your question: "pertaining to Product" isn't accurate enough; in the first case of my example, the columns also pertained to the Product. All columns should pertain only to ProductID (you could say: they describe the product itself and nothing else, not even things merely related to the product - that is accurate).
The number of rows, generally speaking, isn't relevant.
Normalization is about reducing data duplication in a relational database. The most popular level is third normal form (it's the one described by "the key, the whole key, and nothing but the key"), but there are a lot of different levels, see the Wikipedia entry for a list of the main ones. (In practice people seem to think they're doing well to achieve third normal form.) Denormalizing means accepting more data duplication, typically in exchange for better performance.
As others said, database normalization is about reducing data duplication and building more generic data models (ones that can easily answer queries unanticipated at design time). Normalising a database is a fairly formal process. Once you are experienced, you mostly follow data analysis methods and end up with a normalized database.
Normalizing a database is usually a good idea, but there is a catch. In many cases it involves creating new tables and JOIN relationships between tables. JOINs are known to have a (very) high performance cost at runtime, hence for big volumes of data you may want to denormalize.
Another cost may be the need to write more complex queries to access the data you need, which can be a problem for SQL beginners. The best idea is probably to stick with normalization anyway (Third Normal Form is usually enough; as others said, there are several levels of normalization) and to become more skilled with SQL.

Database schema design

I'm quite new to database design and have some questions about best practices and would really like to learn.
I am designing a database schema. I have a good idea of the requirements, and now it's a matter of getting it down in black and white.
In this pseudo-database-layout, I have a table of customers, table of orders and table of products.
TBL_PRODUCTS:
ID
Description
Details
TBL_CUSTOMER:
ID
Name
Address
TBL_ORDER:
ID
TBL_CUSTOMER.ID
prod1
prod2
prod3
etc
Each 'order' has only one customer, but can have any number of 'products'.
The problem is, in my case, an order can contain any number of products (hundreds for a single order). On top of that, each product on an order needs more than just a 'quantity': it can carry values spanning pages of text, specific to that product on that order.
My question is, how can I store that information?
Assuming I can't store a variable-length array as a single field value, the other option is a delimited string that is split by code in the application.
An order could have, say, 100 products, each product carrying anything from a small int to 5000 characters of free text (or anything in between), unique to that order.
On top of that, each order must have its own audit trail, as many things can happen to it throughout its lifetime.
An audit trail would contain the usual information - user, time/date, action and can be any length.
Would I store an audit trail for a specific order in its own table (as they could become quite lengthy), created as the order is created?
Are there any places where I could learn more about techniques for database design?
The most common way would be to store the order items in another table.
TBL_ORDER:
ID
TBL_CUSTOMER.ID
TBL_ORDER_ITEM:
ID
TBL_ORDER.ID
TBL_PRODUCTS.ID
Quantity
UniqueDetails
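Rendered as DDL, that layout might look like the following (the dotted column names replaced with conventional foreign-key names, and the types assumed):
create table order_item (
    id             integer primary key,
    order_id       integer not null references orders(id),
    product_id     integer not null references products(id),
    quantity       integer not null,
    unique_details text  -- the free text unique to this order/product pair
);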
The same can apply to your Order audit trail. It can be a new table such as
TBL_ORDER_AUDIT:
ID
TBL_ORDER.ID
AuditDetails
First of all, Google Third Normal Form. In most cases, your tables should be 3NF, but there are cases where they shouldn't be, for performance or ease of use, and only experience can really teach you that.
What you have is not normalized. You need a "join table" to implement the many-to-many relationship.
TBL_ORDER:
ID
TBL_CUSTOMER.ID
TBL_ORDER_PRODUCT_JOIN:
ID
TBL_ORDER.ID
TBL_Product.ID
Quantity
TBL_ORDER_AUDIT:
ID
TBL_ORDER.ID
Audit_Details
The basic conventional name for the ID column in the Orders table (plural, because ORDER is a keyword in SQL) is "Order Number", with the exact spelling varying (OrderNum, OrderNumber, Order_Num, OrderNo, ...).
The TBL_ prefix is superfluous; it is doubly superfluous since it doesn't always mean table, as for example in the TBL_CUSTOMER.ID column name used in the TBL_ORDER table. Also, it is a bad idea, in general, to try using a "." in the middle of a column name; you would have to always treat that name as a delimited identifier, enclosing it in either double quotes (standard SQL and most DBMS) or square brackets (MS SQL Server; not sure about Sybase).
Joe Celko has a lot to say about things like column naming. I don't agree with all he says, but it is readily searchable. See also Fabian Pascal's 'Practical Issues in Database Management'.
The other answers have suggested that you need an 'Order Items' table - they're right; you do. The answers have also talked about storing the quantity in there. Don't forget that you'll need more than just the quantity. For example, you'll need the price prevailing at the time of the order. In many systems, you might also need to deal with discounts, taxes, and other details. And if it is a complex item (like an airplane), there may be only one 'item' on the order, but there will be an enormous number of subordinate details to be recorded.
While not a reference on how to design database schemas, I often use the schema library at DatabaseAnswers.org. It is a good jumping off location if you want to have something that is already roughed in. They aren't perfect and will most likely need to be modified to fit your needs, but there are more than 500 of them in there.
Learn Entity-Relationship (ER) modeling for database requirements analysis.
Learn relational database design and some relational data modeling for the overall logical design of tables. Data normalization is an important part of this piece, but by no means all there is to learn. Relational database design is pretty much DBMS independent within the main stream DBMS products.
Learn physical database design. Learn index design as the first stage of designing for performance. Some index design is DBMS independent, but physical design becomes increasingly dependent on special features of your DBMS as you get more detailed. This can require a book that's specifically tailored to the DBMS you intend to use.
You don't have to do all the above learning before you ever design and build your first database. But what you don't know WILL hurt you. Like any other skill, the more you do it, the better you'll get. And learning what other people already know is a lot cheaper than learning by trial and error.
Take a look at Agile Web Development with Rails, it's got an excellent section on ActiveRecord (an implementation of the same-named design pattern in Rails) and does a really good job of explaining these types of relationships, even if you never use Rails. Here's a good online tutorial as well.
