Star Schema design from 3NF - database

I'm a newbie to data warehousing and I've been reading articles and watching videos on the principles but I'm a bit confused as to how I would take the design below and convert it into a star schema.
In all the examples I've seen the fact table references the dim tables, so I'm assuming the questionId and responseId would be part of the fact table? Any advice would be much appreciated.

I can't see the image at the moment (blocked by my firewall # the office). but I'll try to give you some ideas.
The general idea is to organize your measurable 'facts' into what are called fact tables. There are 3 main types of facts, but that is a topic for a different day (but I'd be happy to go into this if needed). Each of these facts are what you'd see in the center of typical 'star schema'. The other attributes within the fact tables are typically FK references to the dimension tables.
Regarding dimensions, these are groups of attributes that share commonality (the most notable being a calendar dimension). This is important because when you're doing analysis across multiple facts the dimensions are what you use to connect them.
If you consider this simple example: A product is ordered and then shipped. We could have 2 transaction facts (one that contains the qty ordered - measure, type of product ordered - dimension, and transaction date - dimension). We'd also have a transaction fact for the product shipping ( qty shipped - measure, product type - dimension, and ship date - dimension). This simple schema could be used to answer questions like 'how many products by product type last quarter were ordered but not shipped'.
Hopefully this helps you get started.

Usually a fact table is used to aggregate measures - which are always numeric. Examples would be: sales dollars, distances, weights, number of items sold.
The type of data you drew here doesn't have any cut and dry "measure" so you need to decide what you want to measure. Is the number of answers per question? Is it how many responses per sample?
This is often called an Event Fact table (if you want to search for other examples). And you need some sort of reporting requirements before you can turn it into a star schema. So it isn't an easy answer...

It's so easy :) Responses is fact, all other is dimensions. And your schema is now star designed, because you can directly connect fact with all dimensions. Example, when you need to redesign its structure where addresses stored in separate table and related with sample. You must add address table id into responses table for get star schema.

Related

How to move from Excel to designing a Data Warehouse Model

I just started in Data Warehouse modeling and I need help for the modeling of a problem.
Let me tell you the facts: I work on flight data (aeronautical data),
so I have two Excel (fact) files, linked together, one file 'order' and the other 'services'.
the 'order' file sets out a summary of each flight (orderId, departure date, arrival date, City of departure, City of arrival, total amount collected, etc.)
the 'services' file lists the services provided by flight (orderId, service name, quantity, amount / qty, etc.)
with a 1-n relationship (order-services) each order has n services
I already see some dimensions (Time, Location, etc ...). However, I would like to know how I could design my Data Warehouse, knowing that I have two fact files linked together by orderId.
I thought about it, and the star and snowflake schema do not work in my case (since I have two fact tables) and the galaxy schema requires to have dimensions in common, but I block it, is that I put the order table as a dimension and not as a fact table or I should rather put the services table as a dimension, but these are fact tables. I get a little confused.
How can I design my model?
First of all realize that in a star schema it is not a problem to have more fact tables that are connected - see the discussion here.
So the first draw will simple follow your two fact tables with the native provided dimensions.
Order is in one context a fact table, in other context a dimensional table for the service table.
Dependent on your expected queries you could find useful to denormalize some dimensions of the order table in the service table. So the service will have defined the departure date, arrival date etc. dimensions.
This will be done at the load time in the ETL job.
I will be somehow careful to denormalize the measures from order to service - which will basically eliminate the whole order table.
There will be no problem with the measure total amount collected if this is a redundant sum of the service amounts - you may safely get rid of it.
But you will need for sure the number of flights or number of people transported - those measure are better defined in the order fact table; you can not simple replicate them in the N rows for each service.
A workaround is possible, if you define a main service for each order and those measures are defined only in this row - in other rows the value is NULL. This could lead to unexpected results if queried naively, e.g. for number of flights per service.
So basically I'd start with the two fact tables and denormalize some dimensions to the services if this would help to optimize the queries.
I would start with one fact table of Services. This fact would include all of the dimensions you might associate with the Order including a degenerated dimension of OrderId.
Once this fact is built out and some information products are consuming it, return to the Order and re-evaluate it to see if there are any reporting needs which are not being served, or questions which are difficult to answer with the Services fact.
Joining two facts together is always a bad idea. Performance is terrible. You are always better off bring the dimensions from, in your case, Order to Services. Don't forget to include the context of the dimension in the column name and a corresponding role-playing dimension view for this context. E.G. OrderArrivalCity, OrderDepartureDate, OrderDepartureTime.
You can also get yourself a copy of Ralph Kimball's The Data Warehouse Toolkit

I'm unable to normalize my Product table as I have 4 different product types

So because I have 4 different product types (books, magazines, gifts, food) I can't just put all products in one "products" table without having a bunch of null values. So I decided to break each product up into their own tables but I know this is just wrong (https://c1.staticflickr.com/1/742/23126857873_438655b10f_b.jpg).
I also tried creating an EAV model for this (https://c2.staticflickr.com/6/5734/23479108770_8ae693053a_b.jpg), but I got stuck as I'm not sure how to link the publishers and authors tables.
I know this question has been asked a lot but I don't understand ANY of the answer's I've seen. I think this is because I'm a very visual learner and this makes it hard to understand what's being talked about when not a lot of information is given.
Your model is on the right track, except that the product name should be sufficient you don't need Gift name, book name etc. What you put in those tables is the information that is specific to the type of product that the other products don't need. The Product table contains all the common fields. I would use productid in the child tables rather than renaming it giftID, magazineID etc. It is easier to remember what things are celled when you are consistent in nameing them.
Now to be practical, you put as much as you can into the product table especially if you are going to do calculations. I prefer the child tables in this specific case to have what is mostly display information. So product contains the product name, the cost, the type of product, the units the product is sold in etc. The stuff that generally is needed to calculate the cost of an order or to have a report of what was ordered. There may be one or two fields that can contain nulls, but it simplifies the calculation type queries so much it might be worth it.
The meat of the descriptive details though would go in the child table for the type of product. These would usually only be referenced when displaying the product in the shopping area and only one at a time, so you can use the product type to let you only join to the one child table you need for display. So while the order cares about the product number and name and cost calculations, it probably doesn't need to go line by line describing the book ISBN number or the megapixels in a camera. But the description page of the product does need those things.
This approach is not purely relational, although it mostly is, but it does group the information by the meanings of the data and how they will be used which will make the database easier to understand and query. I am a big fan of relational tables because database just work better when they hit at least the third normal form but sometimes you can go too far for practicality, so the meaning of the data and the way you are grouping to use the data (and not just for the user interface, but for later reporting as well) is almost always one of my considerations in design.
Breaking each product type into its own table is fine - let the child tables use the same id as the parent Product table, and create views for the child tables that join with Product
Your case is a classic case of types and subtypes. This is often called class/subclass in object modeling and generalization/specialization in ER modeling. It's a well understood pattern. There are known techniques for dealing with this pattern.
Visit the following tabs, and read the description under the info tab (presented as "learn more"). Also look over the questions grouped under these tags.
single-table-inheritance class-table-inheritance shared-primary-key
If you want to rean in more depth use these buzzwords to search for articles on the web.
You've already discovered and discarded single table inheritance on your own. Other answers have pointed you at shared primary key. Class table inheritance involves a single table for generalized data as well as the four specialized tables. Shared primary key is generally used in conjunction with class table inheritance.

Relational database design: standard row values in one table vs. separate tables

Note: I've seen a few related question about similar issues; however, none of them would fully answer my question.
I have exam data for schools. There are around 500 schools, and around 12 subject exams in my dataset (each school has data for each exam). Each exam has 6 attributes (columns). After the initial data is loaded to the database, no modifications are expected. With respect to SELECT queries, I imagine that separate exam data is used as often as queries over a number of exams. However, the database would be used by a website visualizing the data, thus those SELECT queries might have to be run rather often. With that in mind, I can think of three ways of organizing that data, with each way producing (apparently) BCNF tables.
First scema:
school
exam1_attr1
exam1_attr2
...
exam12_attr6
This schema feels wrong, though I do not have strong arguments against it. As I said, my data would not change, thus having exams carved into attribute names is not that much of an issue. However, such a setup would pose some aggregation difficulties over the entire dataset (i.e., resulting queries would possibly be unnecessarily complicated).
Second schema:
school
examID
attr1
attr2
...
attr6
While this schema looks attractive, I find it hard to convince myself that it is a good idea to represent exams as values rather than columns or separate tables. That is, the set of exams is known, finite and final, and each exam has exact same properties - sounds like a primary candidate for a separate table. On the other hand, under such an arrangement, both aggregation and single-exam queries are very clean and straight-forward.
Third schema would be identical for 12 separate exam tables:
school
attr1
attr2
...
attr6
Conceptually, I would feel that this schema represents my data best: each exam is logically separated into its own table. However, any queries requiring aggregate data over all exams would then include 12 tables, and that makes me feel rather uneasy.
Thus, my question: which database design would be best in my case? While I am looking for an answer, I am also very interested in reasons for choosing one schema over the other. Specifically, I wonder:
how efficiency of running queries changes with each database design,
how important in real life is the ease of writing queries (given that the data would be primarily used by a website - I would seldom write queries over the data after the website has been finished),
which design is better if potential future changes to the data of the website are taken into account,
whether your answer would be different if the number of schools was not 500, but 50,000.
In short, I am interested in any opinions that would help me understand why one design is better than the other. Any database design theories are welcome as well. Thanks!
In an operational relational database, the speed of changes is more important than speed of selects. In a data warehouse, the speed of selects is more important than the speed of changes.
You have a data warehouse.
Operational relational databases are normalized.
Data warehouses use some variation of a star schema.
Your second schema is a good schema for the reason you stated. Both aggregation and single-exam queries are very clean and straight-forward. However, you should put the school information in a separate school table, and reference the school table ID (primary key field, auto-increment integer) as a foreign key in the exam table. This allows you to scale from 500 to 50,000 schools more easily.

What is data normalization? [duplicate]

This question already has answers here:
Closed 10 years ago.
Possible Duplicate:
What exactly does database normalization do?
Can someone please clarify data normalization? What are the different levels? When should I "de-normalize"? Can I over normalize? I have a table with millions of records, and I believe I over-normalized it, but I'm not sure.
If you have million columns you probably under-normalized it.
What normalizing means is that
every non-key attribute "must provide
a fact about the key, the whole key,
and nothing but the key."
If you have a column that depends on anything but the key, you should normalize your table.
see here.
Added to reply to comment:
If you have ProductID | ProductType | ProductTypeID, where ProdcutTypeID depends only on ProductType, you should make a new table for that:
ProductID | ProductTypeID and on the other table: ProductTypeID | ProductTypeName .
So to answer your question, pertaining to Product isn't accurate enough, in my example at the first case, I was pertaining to the Product as well. All columns should pertain only to ProductID (you may say you only describe product, but not describing anything else, even if it's related to product - that's accurate).
Number of rows, generally speaking isn't relevent.
Normalization is about reducing data duplication in a relational database. The most popular level is third normal form (it's the one described by "the key, the whole key, and nothing but the key"), but there are a lot of different levels, see the Wikipedia entry for a list of the main ones. (In practice people seem to think they're doing well to achieve third normal form.) Denormalizing means accepting more data duplication, typically in exchange for better performance.
As others said Database normalization is about reduction of data duplication and more generic data models (that can easily answer to queries unexpected at design time). Normalisation of a database is allow a formal enough process. When you are experimented you mostly follow data analysis methods and get a normalized database at the end.
Normalizing database is usually a good idea, but there is a catch. In many case it involve creation of new tables and JOIN relationships between tables. JOIN is known to have a (very) high performance cost at runtime, henceforth for big volumes of data you may want to denormalize.
Another cost may also be the need to write more complex requests to access to the data you need and that can be a problem for SQL beginners. The best idea is probably to stick with normalization anyway (Third Normal Form is usually enough, as there is several levels of normalization as others said) and to become more skilled with SQL.

Database schema design

I'm quite new to database design and have some questions about best practices and would really like to learn.
I am designing a database schema, I have a good idea of the requirements and now its a matter of getting it into black and white.
In this pseudo-database-layout, I have a table of customers, table of orders and table of products.
TBL_PRODUCTS:
ID
Description
Details
TBL_CUSTOMER:
ID
Name
Address
TBL_ORDER:
ID
TBL_CUSTOMER.ID
prod1
prod2
prod3
etc
Each 'order' has only one customer, but can have any number of 'products'.
The problem is, in my case, the products for a given order can be any amount (hundreds for a single order) on top of that, each product for an order needs more than just a 'quantity' but can have values that span pages of text for a specific product for a specific order.
My question is, how can I store that information?
Assuming I can't store a variable length array as single field value, the other option is to have a string that is delimited somehow and split by code in the application.
An order could have say 100 products, each product having either only a small int, or 5000 characters or free text (or anything in between), unique only to that order.
On top of that, each order must have it's own audit trail as many things can happen to it throughout it's lifetime.
An audit trail would contain the usual information - user, time/date, action and can be any length.
Would I store an audit trail for a specific order in it's own table (as they could become quite lengthy) created as the order is created?
Are there any places where I could learn more about techniques for database design?
The most common way would be to store the order items in another table.
TBL_ORDER:
ID
TBL_CUSTOMER.ID
TBL_ORDER_ITEM:
ID
TBL_ORDER.ID
TBL_PRODUCTS.ID
Quantity
UniqueDetails
The same can apply to your Order audit trail. It can be a new table such as
TBL_ORDER_AUDIT:
ID
TBL_ORDER.ID
AuditDetails
First of all, Google Third Normal Form. In most cases, your tables should be 3NF, but there are cases where this is not the case because of performance or ease of use, and only experiance can really teach you that.
What you have is not normalized. You need a "Join table" to implement the many to many relationship.
TBL_ORDER:
ID
TBL_CUSTOMER.ID
TBL_ORDER_PRODUCT_JOIN:
ID
TBL_ORDER.ID
TBL_Product.ID
Quantity
TBL_ORDER_AUDIT:
ID
TBL_ORDER.ID
Audit_Details
The basic conventional name for the ID column in the Orders table (plural, because ORDER is a keyword in SQL) is "Order Number", with the exact spelling varying (OrderNum, OrderNumber, Order_Num, OrderNo, ...).
The TBL_ prefix is superfluous; it is doubly superfluous since it doesn't always mean table, as for example in the TBL_CUSTOMER.ID column name used in the TBL_ORDER table. Also, it is a bad idea, in general, to try using a "." in the middle of a column name; you would have to always treat that name as a delimited identifier, enclosing it in either double quotes (standard SQL and most DBMS) or square brackets (MS SQL Server; not sure about Sybase).
Joe Celko has a lot to say about things like column naming. I don't agree with all he says, but it is readily searchable. See also Fabian Pascal 'Practical Issues in Database Management'.
The other answers have suggested that you need an 'Order Items' table - they're right; you do. The answers have also talked about storing the quantity in there. Don't forget that you'll need more than just the quantity. For example, you'll need the price prevailing at the time of the order. In many systems, you might also need to deal with discounts, taxes, and other details. And if it is a complex item (like an airplane), there may be only one 'item' on the order, but there will be an enormous number of subordinate details to be recorded.
While not a reference on how to design database schemas, I often use the schema library at DatabaseAnswers.org. It is a good jumping off location if you want to have something that is already roughed in. They aren't perfect and will most likely need to be modified to fit your needs, but there are more than 500 of them in there.
Learn Entity-Relationship (ER) modeling for database requirements analysis.
Learn relational database design and some relational data modeling for the overall logical design of tables. Data normalization is an important part of this piece, but by no means all there is to learn. Relational database design is pretty much DBMS independent within the main stream DBMS products.
Learn physical database design. Learn index design as the first stage of designing for performance. Some index design is DBMS independent, but physical design becomes increasingly dependent on special features of your DBMS as you get more detailed. This can require a book that's specifically tailored to the DBMS you intend to use.
You don't have to do all the above learning before you ever design and build your first database. But what you don't know WILL hurt you. Like any other skill, the more you do it, the better you'll get. And learning what other people already know is a lot cheaper than learning by trial and error.
Take a look at Agile Web Development with Rails, it's got an excellent section on ActiveRecord (an implementation of the same-named design pattern in Rails) and does a really good job of explaining these types of relationships, even if you never use Rails. Here's a good online tutorial as well.

Resources