Fact table in dimensional modeling has primary key duplicates

Fact table in dimensional modeling has primary key duplicates - database

I have a fact table at the grain of sales order. The issue is that people want to see the invoice number but each sales order can have multiple invoice numbers. Which causes my table to have duplicates.
What I did is to create a new table with sales order and invoice numbers with a one to many relationship.
Is this the right approach? I sense having a dimension with more rows than your fact table is an indication of a problem.

Related

How are fact tables formed in relation to the dimension tables?

I am trying to understand how fact tables are form in relation to the dimension tables.
E.g. Sale Fact Table
For there is a query for Sale of product by year/month/week/day, do I create a dimension for each type of period: Dim_Year, Dim_Month, Dim_Week and Dim_Day, each with their own respective keys?
Or is it possible to just use one dimension for all periods: Dim_Date and only have one date key?
Another area I am confused about is that why do some fact tables not contain their own ID? E.g. Sale fact table does not have SaleID included in the fact table.
Sale Fact Table Textbook Example

DATES
Your date dimension needs to correspond to the grain of your fact table. So if you had daily sales you would have a Dim_Day, weekly sales you would have a Dim_Week, etc.
You would normally have multiple date dimensions (at different grains) in your data warehouse as you would have facts at different date grains.
Each date dimension would hold hold attributes applicable to levels higher up in the date hierarchy. So a Dim_Day might hold day, week, month, year attributes; Dim_Month might hold month, quarter and year attributes, etc.
PRIMARY KEYS
Primary keys are rarely (never?) a technical requirement when creating tables in a database i.e. you can create a table without defining a PK. So you need to consider why we normally (at least in OLTP DBs) include PKs. Common reasons include:
To easily identify an individual record
To ensure that duplicate records (those with the same PK value) are
not created
So there are good reasons for creating PKs, however there are cost overheads e.g. the PK needs to be checked every time a new record is inserted into the table.
In a dimensional model where you are performing bulk inserts/updates, having PKs would cause a significant performance hit. Additionally, the insert logic/checks should always be implemented in your ETL processes so there is no need to include these types of checks/constraints in the DB itself.
Fact tables do have a primary key but it is often implicit rather than explicit - so a group of the FKs in the fact table uniquely identify each record. This compound PK may be documented but is is never enabled/implemented.
Occasionally a fact table will have an explicit, single column, PK. This is normally used when the fact table needs to be updated and its implicit PK involves a large number of columns. There is normally logic required to identify the record to be updated using its FKs but this returns the PK; then the update statement just has a clause like this:
WHERE table_pk = 12345678
rather than having to include all the columns in the implicit PK:
WHERE table_sk1 = 1234
AND table_sk2 = 5678
AND table_sk3 = 9876
....
Hope this helps?

What kinds of indexes would you create to speed up queries to this table?

I had a question that I could really use someone's help with.
So suppose I have the following huge table with about one million rows:
ORDER (Order#, OrderDate, Customer#, OrderAmount, Product#, DiscountAmount, OrderStatus, OrderFullfillmentDate)
In this table, Order# is PK, and Customer# is a FK to the Customer Table and Product is a FK to the Product table. What kinds of indexes could I create to speed up queries to this table?
Thanks.

Depends what you need to do with this table.
1. Apply index on all fields
2. Pay attention on query because query are prepare relative to where close and you can ask in a query, that is not optimised, to load hole table in memory even the final result contain a few rows.
3. Create many tables with less fields (cols) instead few tables with many cols
I can help you if you can give me more detail and example how you extract data from this table. I am curios where is the unique Order_id and how you query a specific order number.
There are many methods to optimize tables, queries and quick output the results.

MS access tables relationships trouble

I have a database with three tables: Employees, courses and instructors.
I am having toubles making the relationships because some of the employees can be instructors. So how should I link my tables so that I can add some of the employees to be instructors. Also what would be the the primary and foreign keys that I should use?
Thanks

If I understand the problem right, some but not all of your instructors are also employees. There are two ways to go about it:
No data duplication: No Instructors or Employees tables, just a Person table with Yes/No fields for IsInstructor and IsEmployee.
Data duplication (if current scheme is fixed or the remaining info for instructors and employees is very different): Add an EmployeeID field to the Instructors table, leaving it Null if the instructor is not also an employee.
In all cases the primary key is an auto-increment number for each table (PersonID, EmployeeID, InstructorID, CourseID) and that is the only field used in the various relationships.

Difference between Fact table and Dimension table?

When reading a book for business objects, I came across the term- fact table and dimension table.
I am trying to understand what is the different between Dimension table and Fact table?
I read couple of articles on the internet but I was not able to understand clearly..
Any simple example will help me to understand better?

In Data Warehouse Modeling, a star schema and a snowflake schema consists of Fact and Dimension tables.
Fact Table:
It contains all the primary keys of the dimension and associated
facts or measures(is a property on which calculations can be made) like quantity sold, amount sold and average sales.
Dimension Tables:
Dimension tables provides descriptive information for all the measurements recorded in fact table.
Dimensions are relatively very small as comparison of fact table.
Commonly used dimensions are people, products, place and time.
image source

This appears to be a very simple answer on how to differentiate between fact and dimension tables!
It may help to think of dimensions as things or objects. A thing such
as a product can exist without ever being involved in a business
event. A dimension is your noun. It is something that can exist
independent of a business event, such as a sale. Products, employees,
equipment, are all things that exist. A dimension either does
something, or has something done to it.
Employees sell, customers buy. Employees and customers are examples of
dimensions, they do.
Products are sold, they are also dimensions as they have something
done to them.
Facts, are the verb. An entry in a fact table marks a discrete event
that happens to something from the dimension table. A product sale
would be recorded in a fact table. The event of the sale would be
noted by what product was sold, which employee sold it, and which
customer bought it. Product, Employee, and Customer are all dimensions
that describe the event, the sale.
In addition fact tables also typically have some kind of quantitative
data. The quantity sold, the price per item, total price, and so on.
Source:
http://arcanecode.com/2007/07/23/dimensions-versus-facts-in-data-warehousing/

This is to answer the part:
I was trying to understand whether dimension tables can be fact table
as well or not?
The short answer (INMO) is No.That is because the 2 types of tables are created for different reasons. However, from a database design perspective, a dimension table could have a parent table as the case with the fact table which always has a dimension table (or more) as a parent. Also, fact tables may be aggregated, whereas Dimension tables are not aggregated. Another reason is that fact tables are not supposed to be updated in place whereas Dimension tables could be updated in place in some cases.
More details:
Fact and dimension tables appear in a what is commonly known as a Star Schema. A primary purpose of star schema is to simplify a complex normalized set of tables and consolidate data (possibly from different systems) into one database structure that can be queried in a very efficient way.
On its simplest form, it contains a fact table (Example: StoreSales) and a one or more dimension tables. Each Dimension entry has 0,1 or more fact tables associated with it (Example of dimension tables: Geography, Item, Supplier, Customer, Time, etc.). It would be valid also for the dimension to have a parent, in which case the model is of type "Snow Flake". However, designers attempt to avoid this kind of design since it causes more joins that slow performance. In the example of StoreSales, The Geography dimension could be composed of the columns (GeoID, ContenentName, CountryName, StateProvName, CityName, StartDate, EndDate)
In a Snow Flakes model, you could have 2 normalized tables for Geo information, namely: Content Table, Country Table.
You can find plenty of examples on Star Schema. Also, check this out to see an alternative view on the star schema model Inmon vs. Kimball. Kimbal has a good forum you may also want to check out here: Kimball Forum.
Edit: To answer comment about examples for 4NF:
Example for a fact table violating 4NF:
Sales Fact (ID, BranchID, SalesPersonID, ItemID, Amount, TimeID)
Example for a fact table not violating 4NF:
AggregatedSales (BranchID, TotalAmount)
Here the relation is in 4NF
The last example is rather uncommon.

Super simple explanation:
Fact table: a data table that maps lookup IDs together. Is usually one of the main tables central to your application.
Dimension table: a lookup table used to store values (such as city names or states) that are repeated frequently in the fact table.

Dimension table
Dimension table is a table which contain attributes of measurements stored in fact tables. This table consists of hierarchies, categories and logic that can be used to traverse in nodes.
Fact table contains the measurement of business processes, and it contains foreign keys for the dimension tables.
Example – If the business process is manufacturing of bricks
Average number of bricks produced by one person/machine – measure of the business process

a Fact = an action: a sale, a transaction, an access
a Dimension = an object: a seller, a customer, a date, a price
Then...
Facts references dimensions for: when, where, what, who, how
The real interesting thing is deciding whether an attribute should be a dimension or a fact. For example, the price of each item in an order, or, the maximum amount of a insurance recorded in a contract. There are no generally correct way to approach these, only ones that make sense in the context.
PS: If I were to create those jargons I would prefer Log table and Object table.

In the simplest form, I think a dimension table is something like a 'Master' table - that keeps a list of all 'items', so to say.
A fact table is a transaction table which describes all the transactions. In addition, aggregated (grouped) data like total sales by sales person, total sales by branch - such kinds of tables also might exist as independent fact tables.

From my point of view,
Dimension table : Master Data
Fact table : Transactional Data

The fact table mainly consists of business facts and foreign keys that refer to primary keys in the dimension tables. A dimension table consists mainly of descriptive attributes that are textual fields.
A dimension table contains a surrogate key, natural key, and a set of attributes. On the contrary, a fact table contains a foreign key, measurements, and degenerated dimensions.
Dimension tables provide descriptive or contextual information for the measurement of a fact table. On the other hand, fact tables provide the measurements of an enterprise.
When comparing the size of the two tables, a fact table is bigger than a dimensional table. In a comparison table, more dimensions are presented than the fact tables. In a fact table, less numbers of facts are observed.
The dimension table has to be loaded first. While loading the fact tables, one should have to look at the dimension table. This is because the fact table has measures, facts, and foreign keys that are the primary keys in the dimension table.
Read more: Dimension Table and Fact Table | Difference Between | Dimension Table vs Fact Table http://www.differencebetween.net/technology/hardware-technology/dimension-table-and-fact-table/#ixzz3SBp8kPzo

For Relation database users, Dimension is equivalent to Master Table.
Fact is equivalent to Transaction table.

Dimension table : It is nothing but we can maintains information about the characterized date called as Dimension table.
Example : Time Dimension , Product Dimension.
Fact Table : It is nothing but we can maintains information about the metrics or precalculation data.
Example : Sales Fact, Order Fact.
Star schema : one fact table link with dimension table form as a Start Schema.
enter image description here

Airline database prototype

I am designing an airline database (the outline of one anyway) for an assignment and seem to be running around in circles.
Three tables are concerned:
Customer Booking_Reference Flight
cust_id(pk) reference_id(pk) Flight_id(pk)
cust_id(fk)
A booking reference can have many flights.
A flight will have many booking references.
I am trying to break up the many to many relationship. Is it possible to have a relational table with the flight_id as the attributes (columns) and the booking_reference as the rows (data)? If so there can be no primary key, which is a no-go as I understand.
Alternatively I could make the booking_reference/flight relational table with 2 attributes and a compound primary key of booking_reference/flight, which would result in both entities being duplicated but the primary key being unique (half of it anyway). Is this acceptable design practice?
I was going to just list a max number of 8 flights as columns in the booking reference table (with NULL for the entries where there is less than 8 flights) and give customers with more than 8 flights a new reference_id, but this seems to be more ridiculous as i learn more about databases, resulting in more reference ids and more NULL data.
Any ideas on which route to take?

Rather than having eight (or any arbitrary number of) columns, create what's sometimes called a join table, with three columns:
Table: references_flights
id (Primary key)
reference_id (fk)
flight_id (fk)
You should then be able to query data across them with the right JOINs, but I'll leave that for someone with more database expertise.

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight