How are fact tables formed in relation to the dimension tables? - database

I am trying to understand how fact tables are formed in relation to the dimension tables.
E.g. Sale Fact Table
If there is a query for sales of a product by year/month/week/day, do I create a dimension for each type of period (Dim_Year, Dim_Month, Dim_Week and Dim_Day), each with its own key?
Or is it possible to just use one dimension for all periods: Dim_Date and only have one date key?
Another area I am confused about is why some fact tables do not contain their own ID. E.g. the Sale fact table does not include a SaleID.
Sale Fact Table Textbook Example

DATES
Your date dimension needs to correspond to the grain of your fact table. So if you had daily sales you would have a Dim_Day, weekly sales you would have a Dim_Week, etc.
You would normally have multiple date dimensions (at different grains) in your data warehouse as you would have facts at different date grains.
Each date dimension would hold attributes applicable to levels higher up in the date hierarchy. So a Dim_Day might hold day, week, month, year attributes; Dim_Month might hold month, quarter and year attributes, etc.
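As a minimal sketch of the point above, assuming a day-grain sales fact: a single Dim_Day carrying week, month and year attributes lets the same fact table be rolled up to any higher level. The table and column names (dim_day, fact_sale, week_of_year, and so on) and the helper sales_by are made up for illustration, and SQLite via Python is used only so the example runs anywhere.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_day (
    day_key      INTEGER PRIMARY KEY,  -- hypothetical surrogate key, e.g. 20240115
    full_date    TEXT    NOT NULL,
    week_of_year INTEGER NOT NULL,     -- attributes of higher hierarchy levels
    month        INTEGER NOT NULL,
    year         INTEGER NOT NULL
);
CREATE TABLE fact_sale (
    day_key     INTEGER NOT NULL REFERENCES dim_day(day_key),
    product_key INTEGER NOT NULL,
    amount      REAL    NOT NULL
);
INSERT INTO dim_day VALUES
    (20240115, '2024-01-15', 3, 1, 2024),
    (20240116, '2024-01-16', 3, 1, 2024),
    (20240201, '2024-02-01', 5, 2, 2024);
INSERT INTO fact_sale VALUES
    (20240115, 1, 10.0),
    (20240116, 1, 15.0),
    (20240201, 1, 20.0);
""")

ALLOWED_LEVELS = {"full_date", "week_of_year", "month", "year"}

def sales_by(level):
    """Total sales grouped by one attribute of the day dimension."""
    if level not in ALLOWED_LEVELS:
        raise ValueError(f"unknown level: {level}")
    sql = f"""
        SELECT d.{level}, SUM(f.amount)
        FROM fact_sale f
        JOIN dim_day d ON d.day_key = f.day_key
        GROUP BY d.{level}
        ORDER BY d.{level}
    """
    return conn.execute(sql).fetchall()

by_month = sales_by("month")
by_year = sales_by("year")
```

A fact recorded at a coarser grain (say, monthly targets) would still need its own Dim_Month, since it has no day to point at.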
PRIMARY KEYS
Primary keys are rarely (never?) a technical requirement when creating tables in a database i.e. you can create a table without defining a PK. So you need to consider why we normally (at least in OLTP DBs) include PKs. Common reasons include:
To easily identify an individual record
To ensure that duplicate records (those with the same PK value) are not created
So there are good reasons for creating PKs, however there are cost overheads e.g. the PK needs to be checked every time a new record is inserted into the table.
In a dimensional model where you are performing bulk inserts/updates, having PKs would cause a significant performance hit. Additionally, the insert logic/checks should always be implemented in your ETL processes so there is no need to include these types of checks/constraints in the DB itself.
Fact tables do have a primary key but it is often implicit rather than explicit, i.e. a group of the FKs in the fact table uniquely identifies each record. This compound PK may be documented but is never enabled/implemented.
Occasionally a fact table will have an explicit, single column, PK. This is normally used when the fact table needs to be updated and its implicit PK involves a large number of columns. There is normally logic required to identify the record to be updated using its FKs but this returns the PK; then the update statement just has a clause like this:
WHERE table_pk = 12345678
rather than having to include all the columns in the implicit PK:
WHERE table_sk1 = 1234
AND table_sk2 = 5678
AND table_sk3 = 9876
....
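The two-step update described above could look roughly like this. It is only a sketch: the fact table, its columns and the explicit key name sale_pk are assumptions, not taken from any particular warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact_sale (
    sale_pk     INTEGER PRIMARY KEY,  -- explicit single-column key (hypothetical)
    day_key     INTEGER NOT NULL,     -- the three FKs below form the implicit key
    product_key INTEGER NOT NULL,
    store_key   INTEGER NOT NULL,
    amount      REAL    NOT NULL
);
INSERT INTO fact_sale (day_key, product_key, store_key, amount) VALUES
    (20240115, 1, 7, 10.0),
    (20240115, 2, 7, 12.5);
""")

# Step 1: identify the record through its FKs; this returns the explicit PK.
(sale_pk,) = conn.execute(
    "SELECT sale_pk FROM fact_sale"
    " WHERE day_key = ? AND product_key = ? AND store_key = ?",
    (20240115, 2, 7),
).fetchone()

# Step 2: the update itself only needs the single-column key.
conn.execute("UPDATE fact_sale SET amount = ? WHERE sale_pk = ?", (13.0, sale_pk))

updated_amount = conn.execute(
    "SELECT amount FROM fact_sale WHERE sale_pk = ?", (sale_pk,)
).fetchone()[0]
```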
Hope this helps?

Related

Where is the limit to what would be considered data duplication in a database?

What is the line that you should draw when normalising data, in terms of data duplication? i.e would you say that 2 employees who share the same birthday or have the same timestamp for a shift is data duplication? and therefore should be placed into another data table?
A birth date has a full, non-transitive dependency on a person, which means it should be stored in the same table as your employees; doing so complies with third normal form (3NF).
Work shifts are not an attribute of an employee, which means they are a different entity that stands in a relationship with the employee entity.
There is no particular 'limit' when normalising data, since the main restriction on every relational database table is to have a unique primary key. Hence, if all other columns contain the same data but the primary key is different, it is a different row of the table.
The actual restrictions can come in two forms. One is the programmatic or systematic approach, where the restriction on what kind of data may be entered comes from a program that interacts with the database, or from a predefined script handed to the database admin.
The other, more database-oriented approach is to create primary keys composed of multiple columns. That way a row is unique as long as the combination of values in those columns is unique. It should be noted that a primary key is not necessarily the same as a unique key, which should be different for every instance.
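To illustrate the multi-column key, here is a small sketch using the shift example from the question. The shift table and its columns are hypothetical. Two employees may share a timestamp, but the same employee cannot be given the same shift twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE shift (
    employee_id INTEGER NOT NULL,
    shift_start TEXT    NOT NULL,
    PRIMARY KEY (employee_id, shift_start)  -- uniqueness applies to the pair
)
""")

conn.execute("INSERT INTO shift VALUES (1, '2024-01-15 09:00')")
# Same timestamp, different employee: a different key, so this is accepted.
conn.execute("INSERT INTO shift VALUES (2, '2024-01-15 09:00')")

try:
    # Same employee and same timestamp: the combination already exists.
    conn.execute("INSERT INTO shift VALUES (1, '2024-01-15 09:00')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM shift").fetchone()[0]
```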
You have misunderstood what normalization does.
Two attributes having the same value (i.e. two employees having the same birthday) is not redundancy.
Rather, having the same attribute in two tables (i.e. two tables each having a birthday column, thereby repeating every employee's birthday information) is.
Normalization is a quality decision and denormalization is a performance decision. For my school projects, my teachers recommended that I normalize at least to 3NF. So that may be a good guideline.

What are the points considered when a database table is created?

I know it is a big and general question. Let me describe what I am looking for.
In big projects, we have some entities with many properties. (Many means over 100 properties for a single entity.) These properties have a one-to-one relation. As time goes by, these tables with many columns become a real problem for maintenance and further development.
As you might guess, these 90 columns were created over time across many projects, not in a single project. Therefore, changing requirements have affected the table design over a long period.
For example: there is a table that stores information about global payments between banks.
Some columns are foreign keys to other tables (Customer, TransferType etc.).
Some columns are parameters of the current payment (IsActive, IsLoaded, IsOurCustomer etc.).
Some columns are fields of the payment (Information Bank, Receiver Bank etc.).
and so on.
The number of fields keeps growing, and we now have about 90 columns with a one-to-one relation.
What are the concerns when dividing a table into smaller tables? I know the normalization rules and I am not asking about them. (Duplicated columns are already normalized.)
I am trying to find patterns or rules for dividing a table whose columns have a one-to-one relation.
If all of the columns are only dependent on the primary table key and are not repeating (phone1, phone2) they should be part of the same table. If you split a table you will have to do joins when you need all the columns of the table. If many of the values are null you may investigate the use of sparse columns (which don't take up any room if they have a null value).
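A rough sketch of the join cost mentioned above, assuming the payment table were split one-to-one into a core table and a hypothetical payment_bank_info side table (all names invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment (
    payment_id  INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    is_active   INTEGER NOT NULL
);
-- One-to-one split: the side table reuses the same primary key.
CREATE TABLE payment_bank_info (
    payment_id    INTEGER PRIMARY KEY REFERENCES payment(payment_id),
    receiver_bank TEXT
);
INSERT INTO payment VALUES (1, 42, 1), (2, 43, 0);
INSERT INTO payment_bank_info VALUES (1, 'Example Bank');
""")

# Reading "all the columns" now needs a join; LEFT JOIN keeps payments
# that have no row in the side table.
full_rows = conn.execute("""
    SELECT p.payment_id, p.customer_id, p.is_active, b.receiver_bank
    FROM payment p
    LEFT JOIN payment_bank_info b ON b.payment_id = p.payment_id
    ORDER BY p.payment_id
""").fetchall()
```

Whether the extra join is worth it depends on how often the split-off columns are actually read together with the core ones.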

Should I normalize a database with a column for each day of the week?

Designing an oracle database for an ordering system. Each row will be a schedule that stores can be assigned that designates if/when they will order from a specific vendor for each day of the week.
It will be keyed by vendor id and a unique schedule id. Started out with those columns, and then a column for each day of the week like TIME_SUN, TIME_MON, TIME_TUE... to contain the order time for each day.
I'm normally inclined to try and normalize data and have another table referencing the schedule id, with a column like DAY_OF_WEEK and ORDER_TIME, so potentially 7 rows for the same data.
Is it really necessary for me to do this, or is it just over complicating what can be handled as a simple single row?
Normalization is the best way. Reasons:
The table will act as a master table
The table can be used for reference in future needs
It will be costly to normalize later
If there is a huge number of rows with many repeated column values, the database grows unnecessarily
Using master table will limit redundant data only to the foreign key
Normalization would be advisable. If in future you are required to store two or more order times for the same day, you will only need to add rows to your vendor_day_order table. If you go with the first approach, you will have to modify your table structure.
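A minimal sketch of the normalized layout, reusing the vendor_day_order name from the answer above. The schedule table, the key choice and the sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schedule (
    schedule_id INTEGER PRIMARY KEY,
    vendor_id   INTEGER NOT NULL
);
CREATE TABLE vendor_day_order (
    schedule_id INTEGER NOT NULL REFERENCES schedule(schedule_id),
    day_of_week TEXT    NOT NULL
        CHECK (day_of_week IN ('SUN','MON','TUE','WED','THU','FRI','SAT')),
    order_time  TEXT    NOT NULL,
    PRIMARY KEY (schedule_id, day_of_week, order_time)
);
INSERT INTO schedule VALUES (1, 42);
INSERT INTO vendor_day_order VALUES (1, 'MON', '08:00'), (1, 'WED', '08:00');
-- A second order time on Monday is just another row, not a schema change.
INSERT INTO vendor_day_order VALUES (1, 'MON', '15:30');
""")

monday_times = [
    t for (t,) in conn.execute(
        "SELECT order_time FROM vendor_day_order"
        " WHERE schedule_id = 1 AND day_of_week = 'MON'"
        " ORDER BY order_time"
    )
]
```

With the TIME_SUN ... TIME_SAT layout, the same change would mean adding columns such as a second time per day.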

Difference between Fact table and Dimension table?

When reading a book for business objects, I came across the term- fact table and dimension table.
I am trying to understand the difference between a dimension table and a fact table.
I read a couple of articles on the internet but I was not able to understand it clearly.
Any simple example will help me to understand better?
In Data Warehouse Modeling, a star schema and a snowflake schema consist of Fact and Dimension tables.
Fact Table:
It contains foreign keys to the primary keys of all the dimensions, plus the associated facts or measures (properties on which calculations can be made) such as quantity sold, amount sold and average sales.
Dimension Tables:
Dimension tables provide descriptive information for all the measurements recorded in the fact table.
Dimensions are relatively small in comparison to the fact table.
Commonly used dimensions are people, products, place and time.
This appears to be a very simple answer on how to differentiate between fact and dimension tables!
It may help to think of dimensions as things or objects. A thing such
as a product can exist without ever being involved in a business
event. A dimension is your noun. It is something that can exist
independent of a business event, such as a sale. Products, employees,
equipment, are all things that exist. A dimension either does
something, or has something done to it.
Employees sell, customers buy. Employees and customers are examples of
dimensions, they do.
Products are sold, they are also dimensions as they have something
done to them.
Facts, are the verb. An entry in a fact table marks a discrete event
that happens to something from the dimension table. A product sale
would be recorded in a fact table. The event of the sale would be
noted by what product was sold, which employee sold it, and which
customer bought it. Product, Employee, and Customer are all dimensions
that describe the event, the sale.
In addition fact tables also typically have some kind of quantitative
data. The quantity sold, the price per item, total price, and so on.
Source:
http://arcanecode.com/2007/07/23/dimensions-versus-facts-in-data-warehousing/
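The noun/verb distinction above can be sketched as a tiny star schema. All table names, columns and sample rows here are illustrative assumptions. Note that the product "Gadget" exists as a dimension row without ever appearing in a sale.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Dimensions: the nouns, which exist independently of any sale.
CREATE TABLE dim_product  (product_key  INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dim_employee (employee_key INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL);

-- Fact: the verb. One row per sale event, plus quantitative data.
CREATE TABLE fact_sale (
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    employee_key INTEGER NOT NULL REFERENCES dim_employee(employee_key),
    customer_key INTEGER NOT NULL REFERENCES dim_customer(customer_key),
    quantity     INTEGER NOT NULL,
    unit_price   REAL    NOT NULL
);

INSERT INTO dim_product  VALUES (1, 'Widget'), (2, 'Gadget');
INSERT INTO dim_employee VALUES (1, 'Alice');
INSERT INTO dim_customer VALUES (1, 'Bob'), (2, 'Carol');
INSERT INTO fact_sale VALUES (1, 1, 1, 3, 2.0), (1, 1, 2, 1, 2.0);
""")

# Revenue per product; products with no sales still show up with 0.0.
revenue_by_product = conn.execute("""
    SELECT p.name, COALESCE(SUM(f.quantity * f.unit_price), 0.0)
    FROM dim_product p
    LEFT JOIN fact_sale f ON f.product_key = p.product_key
    GROUP BY p.product_key, p.name
    ORDER BY p.product_key
""").fetchall()
```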
This is to answer the part:
I was trying to understand whether dimension tables can be fact table
as well or not?
The short answer (IMO) is no. That is because the two types of tables are created for different reasons. However, from a database design perspective, a dimension table could have a parent table, as is the case with the fact table, which always has one or more dimension tables as a parent. Also, fact tables may be aggregated, whereas dimension tables are not. Another reason is that fact tables are not supposed to be updated in place, whereas dimension tables could be updated in place in some cases.
More details:
Fact and dimension tables appear in what is commonly known as a Star Schema. A primary purpose of a star schema is to simplify a complex normalized set of tables and consolidate data (possibly from different systems) into one database structure that can be queried in a very efficient way.
In its simplest form, it contains a fact table (example: StoreSales) and one or more dimension tables. Each dimension entry has 0, 1 or more fact table entries associated with it (examples of dimension tables: Geography, Item, Supplier, Customer, Time, etc.). It is also valid for a dimension to have a parent, in which case the model is of type "Snowflake". However, designers try to avoid this kind of design since it causes more joins, which slow performance. In the StoreSales example, the Geography dimension could be composed of the columns (GeoID, ContinentName, CountryName, StateProvName, CityName, StartDate, EndDate).
In a Snowflake model, you could have 2 normalized tables for Geo information, namely a Continent table and a Country table.
You can find plenty of examples of the Star Schema. Also, check out Inmon vs. Kimball for an alternative view on the star schema model. Kimball has a good forum you may also want to check out: Kimball Forum.
Edit: To answer comment about examples for 4NF:
Example for a fact table violating 4NF:
Sales Fact (ID, BranchID, SalesPersonID, ItemID, Amount, TimeID)
Example for a fact table not violating 4NF:
AggregatedSales (BranchID, TotalAmount)
Here the relation is in 4NF
The last example is rather uncommon.
Super simple explanation:
Fact table: a data table that maps lookup IDs together. Is usually one of the main tables central to your application.
Dimension table: a lookup table used to store values (such as city names or states) that are repeated frequently in the fact table.
Dimension table
A dimension table is a table which contains attributes of the measurements stored in fact tables. It consists of hierarchies, categories and logic that can be used to traverse nodes.
A fact table contains the measurements of business processes, and it contains foreign keys to the dimension tables.
Example – If the business process is manufacturing of bricks
Average number of bricks produced by one person/machine – measure of the business process
a Fact = an action: a sale, a transaction, an access
a Dimension = an object: a seller, a customer, a date, a price
Then...
Facts reference dimensions for: when, where, what, who, how
The really interesting thing is deciding whether an attribute should be a dimension or a fact. For example, the price of each item in an order, or the maximum amount of insurance recorded in a contract. There is no generally correct way to approach these, only ones that make sense in the context.
PS: If I were to coin these terms I would prefer Log table and Object table.
In the simplest form, I think a dimension table is something like a 'Master' table - that keeps a list of all 'items', so to say.
A fact table is a transaction table which describes all the transactions. In addition, aggregated (grouped) data like total sales by sales person, total sales by branch - such kinds of tables also might exist as independent fact tables.
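As a sketch of an aggregated fact table existing alongside the transactional one, assuming hypothetical fact_sale and fact_sales_by_branch tables:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Transaction-level fact table: one row per sale.
CREATE TABLE fact_sale (
    branch_id       INTEGER NOT NULL,
    sales_person_id INTEGER NOT NULL,
    amount          REAL    NOT NULL
);
INSERT INTO fact_sale VALUES (1, 10, 100.0), (1, 11, 50.0), (2, 12, 70.0);

-- Aggregated fact table, materialized from the transactions.
CREATE TABLE fact_sales_by_branch AS
    SELECT branch_id, SUM(amount) AS total_amount
    FROM fact_sale
    GROUP BY branch_id;
""")

branch_totals = conn.execute(
    "SELECT branch_id, total_amount FROM fact_sales_by_branch ORDER BY branch_id"
).fetchall()
```

The aggregate has to be rebuilt or incrementally maintained whenever the underlying transactions change.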
From my point of view,
Dimension table : Master Data
Fact table : Transactional Data
The fact table mainly consists of business facts and foreign keys that refer to primary keys in the dimension tables. A dimension table consists mainly of descriptive attributes that are textual fields.
A dimension table contains a surrogate key, a natural key, and a set of attributes. In contrast, a fact table contains foreign keys, measurements, and degenerate dimensions.
Dimension tables provide descriptive or contextual information for the measurement of a fact table. On the other hand, fact tables provide the measurements of an enterprise.
When comparing the size of the two tables, a fact table is bigger than a dimension table. In a typical schema there are more dimension tables than fact tables, and a fact table holds relatively few facts (measures).
The dimension table has to be loaded first. While loading the fact tables, one has to look up the dimension table. This is because the fact table has measures, facts, and foreign keys that are the primary keys in the dimension table.
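The load order described above could be sketched as follows. The source data, the natural key product_code and the lookup dictionary are assumptions made for the example; a real ETL tool would do the lookup differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key  INTEGER PRIMARY KEY,   -- surrogate key
    product_code TEXT NOT NULL UNIQUE,  -- natural key from the source system
    name         TEXT NOT NULL
);
CREATE TABLE fact_sale (
    product_key INTEGER NOT NULL REFERENCES dim_product(product_key),
    amount      REAL    NOT NULL
);
""")

# Hypothetical source extracts, keyed by the natural key.
source_products = [("P-100", "Widget"), ("P-200", "Gadget")]
source_sales = [("P-100", 10.0), ("P-200", 5.0), ("P-100", 2.5)]

# 1. Load the dimension first so that surrogate keys exist.
conn.executemany(
    "INSERT INTO dim_product (product_code, name) VALUES (?, ?)", source_products
)

# 2. While loading facts, look up each natural key in the dimension.
key_by_code = dict(conn.execute("SELECT product_code, product_key FROM dim_product"))
conn.executemany(
    "INSERT INTO fact_sale VALUES (?, ?)",
    [(key_by_code[code], amount) for code, amount in source_sales],
)

totals = conn.execute("""
    SELECT d.product_code, SUM(f.amount)
    FROM fact_sale f
    JOIN dim_product d ON d.product_key = f.product_key
    GROUP BY d.product_code
    ORDER BY d.product_code
""").fetchall()
```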
Read more: Dimension Table and Fact Table | Difference Between | Dimension Table vs Fact Table http://www.differencebetween.net/technology/hardware-technology/dimension-table-and-fact-table/#ixzz3SBp8kPzo
For relational database users, a dimension is equivalent to a master table.
Fact is equivalent to Transaction table.
Dimension table: a table that maintains descriptive (characteristic) data.
Example: Time Dimension, Product Dimension.
Fact table: a table that maintains metrics or precalculated data.
Example: Sales Fact, Order Fact.
Star schema: one fact table linked with dimension tables forms a star schema.

Database: Should tables always be normalized and have primary keys?

I have a database storing customer enquiries about products.
The enquiry reference (text), product number (int) and revision number (int) together uniquely identify a single discussion between sales and customer.
As a result, there are many tables, each for a specific detail about a single enquiry, uniquely identified by the enq, pdt and rev values combined.
The CREATE TABLE does not use any AUTO INCREMENT UNIQUE PRIMARY KEY for any field.
My question is, is this database design acceptable?
Should tables always be normalized?
Thanks for any advice.
There's no need to use AUTOINCREMENT, but every table should have a PRIMARY KEY of some kind. A primary key can be a combination of several fields that together identify the record uniquely.
Based on what you've told us, yes, the design is acceptable, provided you explicitly declare the combination of the enquiry reference (text), product number (int) and revision number (int) as a primary key that together uniquely identifies a single discussion.
People sometimes denormalize a database for performance reasons. If select queries are far more frequent than inserts and updates, and the select query of interest is slow to return because of the number of tables it has to join, then consider denormalizing.
If you supply a specific query that is running slow for you, you'll get lots of specific advice.
Having a PRIMARY KEY (or a UNIQUE constraint) will, first, ensure that these values are really unique, and, second, will greatly improve the searches for a given enquiry.
A PRIMARY KEY implies creating an index over (enq, pdt, rev), and this query:
SELECT *
FROM enquiries
WHERE enq = 'enquiry'
AND pdt = 'product'
AND rev = 'revision'
will complete in a single index seek.
Without the index, this query will require scanning the whole table, and there is no guarantee that you won't end up with the duplicates.
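A quick way to see the difference is to ask the database for its query plan. The sketch below uses SQLite through Python purely for convenience (the question does not say which engine is in use); enquiries_nokey is a hypothetical copy of the table without the key, and the exact plan wording varies between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE enquiries (
    enq  TEXT    NOT NULL,
    pdt  INTEGER NOT NULL,
    rev  INTEGER NOT NULL,
    note TEXT,
    PRIMARY KEY (enq, pdt, rev)   -- implies an index over (enq, pdt, rev)
);
CREATE TABLE enquiries_nokey (    -- same columns, no key at all
    enq  TEXT    NOT NULL,
    pdt  INTEGER NOT NULL,
    rev  INTEGER NOT NULL,
    note TEXT
);
""")

def plan(table):
    """Return the planner's description of the lookup on the given table."""
    rows = conn.execute(
        f"EXPLAIN QUERY PLAN SELECT * FROM {table}"
        " WHERE enq = ? AND pdt = ? AND rev = ?",
        ("ENQ-1", 10, 2),
    ).fetchall()
    return " ".join(row[-1] for row in rows)  # last column is the detail text

keyed_plan = plan("enquiries")          # an index SEARCH
unkeyed_plan = plan("enquiries_nokey")  # a full table SCAN

# The keyed table also refuses a second row with the same key.
conn.execute("INSERT INTO enquiries VALUES ('ENQ-1', 10, 2, 'first')")
try:
    conn.execute("INSERT INTO enquiries VALUES ('ENQ-1', 10, 2, 'again')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```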
Except under very special conditions (like heavily inserted log tables), you should always have a PRIMARY KEY on your tables.
Personally, I ALWAYS have some sort of primary key on all tables, even if it is an auto-increment number used for nothing else.
As to normalization, I think one should strive for normalized tables, but in reality there are many good reasons when a table design is good, but not normalized. This is where the 'theory' of DB design meets the reality - but it is good to know what normalization is, strive for it, and have good reasons when you are deviating from the rules (as opposed to just being ignorant of the rules or worse ignoring good design rules).
These are two questions.
(1) It is not required to have an auto increment key always. It is practical though, since you can use it for easy manipulation of your data. Also having no duplicates is not a must.
(2) Normalization is a must when you do homework for school, but if things get tough you can break it in order to make your life easier if you do not endanger your data integrity.
I am splitting from the herd on this one. Do NOT make your enquiry reference (text), product number (int) and revision number (int) the primary key. You indicated the enquiry reference is a text type; would it be 25, 50 or 500 characters wide? If the primary key is made from those fields it will be too wide in my view: it will be appended to every index created for that table, increasing the size of every index row by the size of the three fields, and any table that needs a foreign key back to this table will also need the three fields.
Make the three fields a unique index. Place an auto-increment value as the primary key and make it the clustered index. The tables which will link back to this master table will have a small footprint in memory to link the data from table one to table two.
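A minimal sketch of that layout, assuming hypothetical enquiry and enquiry_note tables. SQLite is used only to keep the example runnable; it has no explicit clustered-index option, so that part of the advice is not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE enquiry (
    enquiry_id INTEGER PRIMARY KEY AUTOINCREMENT,  -- narrow surrogate key
    enq        TEXT    NOT NULL,
    pdt        INTEGER NOT NULL,
    rev        INTEGER NOT NULL,
    UNIQUE (enq, pdt, rev)      -- the natural key is still enforced
);
CREATE TABLE enquiry_note (
    note_id    INTEGER PRIMARY KEY AUTOINCREMENT,
    enquiry_id INTEGER NOT NULL REFERENCES enquiry(enquiry_id),  -- one small FK
    body       TEXT    NOT NULL
);
""")

cur = conn.execute(
    "INSERT INTO enquiry (enq, pdt, rev) VALUES (?, ?, ?)", ("ENQ-1", 10, 2)
)
enquiry_id = cur.lastrowid

# The child table links back with a single integer instead of three columns.
conn.execute(
    "INSERT INTO enquiry_note (enquiry_id, body) VALUES (?, ?)",
    (enquiry_id, "Customer asked for a quote"),
)

try:
    conn.execute(
        "INSERT INTO enquiry (enq, pdt, rev) VALUES (?, ?, ?)", ("ENQ-1", 10, 2)
    )
    natural_key_enforced = False
except sqlite3.IntegrityError:
    natural_key_enforced = True
```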
As far as normalized goes it does not matter, normalized or not, if your data is only a few thousand rows, or even 50,000 or 500,000. When the data starts getting bigger than the available RAM cache then it is an issue.
Design a view to present the data to the application to fulfill the business rule. Design stored procedures to accept data to store. Design the table structure to meet the response time in the SLA. If you have to normalize or denormalize or partition or index or get a bigger server to meet the SLA, the app will never know, because you are always supplying the data via the view which meets the business rule.
There is nothing in normalization theory that deals with whether a table should have a simple or compound primary key. Believe it or not, the concept of "primary key" is not a component of the relational model of data.
Having said that, tables should nearly always be defined with a primary key. The primary key need not be a single column, and it need not be filled in by an autoincrement. In your case, it could be the three columns that taken together uniquely identify an enquiry.
If a table has no declared primary key, it could end up with duplicate rows. A table with duplicate rows represents a bag of tuples, not a set of tuples. Once you are dealing with bags instead of sets, the results predicted by the relational model need not apply. That is why preventing duplicate rows is so important.
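A small sketch of why a bag behaves differently from a set: in a table with no key (the payment_nokey table and its rows are made up), an accidental duplicate silently changes aggregate results.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE payment_nokey (
    customer TEXT NOT NULL,
    amount   REAL NOT NULL
);
INSERT INTO payment_nokey VALUES ('alice', 100.0);
INSERT INTO payment_nokey VALUES ('alice', 100.0);  -- accidental reload, accepted
""")

# The duplicate is counted twice in the total.
total = conn.execute("SELECT SUM(amount) FROM payment_nokey").fetchone()[0]

# Treating the rows as a set shows there is only one distinct tuple.
distinct_rows = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM payment_nokey)"
).fetchone()[0]
```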
