Database table normalization (Many relations)

I have the following tables :
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, warehouse_name,........}
WarehousePartitions { partition_id, partition_name,........}
WarehouseRacks { rack_id, rack_name,.......}
Now, in order to track an item's location, I have the following table (a relation).
itemLocation { item_id, division_id, warehouse_id, partition_id, rack_id, floor_number}
It accurately tracks an item's location, but to get an item's location info I have to join five tables, which I worry could cause performance issues.
Also, the table doesn't have any primary key unless the key spans all of its columns. Will this cause any issues, and is there a better way to accomplish this?
Thanks.

Think in terms of relationships, since you're putting information in a relational database.
Here are my relationship guesses. Feel free to correct them.
A Division has one or more Warehouses.
A Warehouse has one or more Warehouse partitions.
A Warehouse partition has one or more Warehouse Racks.
A Warehouse rack has one or more items.
An item is located in a Warehouse rack.
A Warehouse rack is located in a Warehouse partition.
A Warehouse partition is located in a Warehouse.
A Warehouse is located in a Division.
I hope this helps with your database design.
Edited to add: I'll lay out the keys for the tables. You should be able to create the rest of the columns.
Division
--------
Division ID
...
Warehouse
---------
Warehouse ID
Division ID
...
Warehouse Partition
-------------------
Warehouse Partition ID
Warehouse ID
...
Warehouse Rack
--------------
Warehouse Rack ID
Warehouse Partition ID
...
Item
----
Item ID
Warehouse Rack ID
Item Type ID
...
Item Type
---------
Item Type ID
Item name
Item Price
Each table has a surrogate ("blind") primary key, probably an auto-incrementing integer or an auto-incrementing long.
All of the tables except Division have a foreign key that points back to the parent table.
A row in the Item table represents one item. One item can only be in one Warehouse Rack.
Modern relational databases should have no performance problems joining five tables; I've seen 30-table joins.
Build your system, and solve the actual performance problems that come up, rather than spending any time worrying about hypothetical performance problems.
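To make that concrete, here is a rough MySQL-style sketch of the top of the hierarchy (the column types and lengths are assumptions, not taken from your actual schema):
CREATE TABLE Division (
    division_id   INT AUTO_INCREMENT PRIMARY KEY,
    division_name VARCHAR(100) NOT NULL
    -- ... plus any other Division columns
);

CREATE TABLE Warehouse (
    warehouse_id   INT AUTO_INCREMENT PRIMARY KEY,
    division_id    INT NOT NULL,
    warehouse_name VARCHAR(100) NOT NULL,
    -- ... plus any other Warehouse columns
    FOREIGN KEY (division_id) REFERENCES Division (division_id)
);

-- WarehousePartition, WarehouseRack and Item follow the same pattern:
-- a surrogate primary key plus a foreign key back to the parent table.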

As Gilbert Le Blanc writes, you probably don't need to join to five tables - you may only need to join to "WarehouseRacks".
However, you write that you need to "keep track of" - this suggests that there's a time aspect involved.
That gives you the following schema:
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, division_id, warehouse_name,........}
WarehousePartitions { partition_id, warehouse_id, partition_name,........}
WarehouseRacks { rack_id, partition_id, rack_name,.......}
ItemLocation (rack_id, item_id, entry_time, quantity, floor_number)
In ItemLocation, rack_id, item_id and entry_time together form a composite primary key - you're effectively saying "there can only be one instance of an item in a given place at any one time".
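A minimal sketch of that table in MySQL terms (the column types are assumptions):
CREATE TABLE ItemLocation (
    rack_id      INT      NOT NULL,
    item_id      INT      NOT NULL,
    entry_time   DATETIME NOT NULL,
    quantity     INT      NOT NULL,
    floor_number INT,
    PRIMARY KEY (rack_id, item_id, entry_time),
    FOREIGN KEY (rack_id) REFERENCES WarehouseRacks (rack_id),
    FOREIGN KEY (item_id) REFERENCES Items (item_id)
);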
You still have to join five tables to retrieve an item's full location (at least if you want the addresses and names and such). Assuming you have modern hardware and database software, this should be fine - unless you're working with huge amounts of data, a 5-way join on foreign/primary key relationships is unlikely to cause performance issues. Given the quantities you mention in the comment, and the fact you'll be running this on MySQL, I don't think you need to worry about the number of joins.
The benefit of this model is that you simply cannot insert invalid data into the item location table - you can't say that the item is in a rack which doesn't exist in the partition, or a warehouse that doesn't exist in the division; if a warehouse changes division, you don't have to update all the item_location records.
I've created a SQLFiddle to show how it might work.
The "item_location" table is the biggest concern in this - you have to choose whether to store a snapshot (which is what this design does), or a transaction table. With "snapshot" views, your code always updates the "quantity" column, effectively saying "as of entry_time, there are x items in this floor in this rack".
The "transaction" model allows you to insert multiple records - typically positive quantities when adding items, and negative quantities when removing them. The items in that location at any point in time are the SUM of those quantities up to the desired time.

Related

At what point does a normalized schema vs. a star schema help performance?

Let's say I have an ordering system which has a table size of around 50,000 rows and grows by about 100 rows a day. Also, say that once an order is placed, I need to store metrics about that order for the next 30 days and report on those metrics on a daily basis (e.g. on day 2, this order had X activations and Y deactivations). If I modeled this in a normalized format, I'd have:
1 table called products, which holds the details of the product listing
1 table called orders, which holds the order data and product id
1 table called metrics, which holds a date field, and order id, and metrics associated.
If I modeled this in a star schema format, I'd design it like this:
FactOrders table, which has 30 days * X orders rows and stores all metadata around the orders, product id, and metrics (each row represents the metrics of a product on a particular day).
DimProducts table, which stores the product metadata
Does my performance gain from a huge FactOrders table only needing one join to get all relevant information outweigh the fact that I've increased my table size by 30x and have an incredible amount of repeated data, vs. the truly normalized model that has one extra join but much smaller tables? Or am I designing this incorrectly for a star schema format?
Do not denormalize something this small to get rid of joins. Index properly instead. Joins are not bad, joins are good. Databases are designed to use them.
Denormalizing is risky for data integrity and may not even be faster due to the much wider tables. In tables this tiny, it is very unlikely that denormalizing would help.
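For instance, a few well-chosen indexes on the normalized model are usually all it takes. The index names and the metrics date column below are assumptions, since the question doesn't give exact column names:
-- speed up the orders-to-products join
CREATE INDEX ix_orders_product ON orders (product_id);
-- speed up the metrics-to-orders join and the daily reporting filter
CREATE INDEX ix_metrics_order  ON metrics (order_id);
CREATE INDEX ix_metrics_date   ON metrics (metric_date, order_id);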

Alternative/Faster way to model this data

Our SaaS app helps eCommerce stores create more effective promotions. In order to do this, we need to retain quite a bit of data - promotional data plus all the products these promotions are associated with across each customer group that these promotions are applied to. I've normalized this data but I feel one of these tables, promotion_products, might grow too fast and slow down our most critical queries.
Here's how this table gets populated so quickly: If a store has 25,000 products and runs 1 promotion per week across 10 customer groups, then the promotion_products table would have 250,000 new entries/week (1 million/mo). Since this is a SaaS product, we have many customers creating the same amount of data or more.
How can I improve this schema so that the data in the promotion_products can be queried quickly?
products
- product_id
- customer_id
- name
promotions
- promotion_id
- customer_id
- promotion (i.e. 20% off etc)
promotion_products (need fast read access)
- product_id
- promotion_id
- customer_id
- group_id (each offer is associated w/ one or many customer groups)
It looks like you've denormalized it a bit by adding customer_id to the promotion_products table; that should help, especially if it's indexed. You could also consider clustering the promotion_products table by customer_id and promotion_id.
I'm wondering why both group_id and customer_id are in promotion_products, as it seems conflicting: a group_id looks like it would relate to multiple customers.
Make sure to index all the fields that you intend on querying by.
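As a sketch (the exact column order should match your most critical query pattern), that could look like:
-- Composite index covering the common lookup:
-- "all promoted products for this customer and promotion"
CREATE INDEX ix_promo_products_lookup
    ON promotion_products (customer_id, promotion_id, group_id, product_id);

-- In MySQL/InnoDB, for example, the primary key is the clustered index, so
-- making this combination the primary key would also store the rows physically
-- ordered by customer and promotion:
-- ALTER TABLE promotion_products
--     ADD PRIMARY KEY (customer_id, promotion_id, group_id, product_id);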

How can I create a hierarchy in SSAS?

I have the table order with the following fields:
ID
Serial
Visitor
Branch
Company
Assume there are relations between Visitor, Branch and Company in the database, but every visitor can belong to more than one Branch. How can I create a hierarchy between these three fields for my order table?
You would need to create a denormalised dimension table containing the distinct results of denormalising the order table. In this case, you would have many rows for the same visitor - one for each branch.
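For example (the dimension name below is just illustrative), in T-SQL that denormalised dimension could be built with something like:
-- One distinct row per Company/Branch/Visitor combination, so a visitor
-- who appears in three branches gets three rows in the dimension.
SELECT DISTINCT Company, Branch, Visitor
INTO   DimCompanyBranchVisitor
FROM   [order];

-- Add a surrogate key to use as the dimension key in SSAS.
ALTER TABLE DimCompanyBranchVisitor
    ADD CompanyBranchVisitorKey INT IDENTITY(1,1) NOT NULL;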
In your fact table, the activity record (which would have BranchKey as part of its primary key) would reference this dimension, together with the VisitorKey...
Then in SSAS you would build the hierarchy and set the relationships between the keys... When displaying this data in a client such as Excel, you would drag the hierarchy into the rows, and when expanding, data from your fact table would roll up according to the visitor's branch...
With regards to dimensions, it's important to set relationships between the attributes, as this will give you a massive performance gain when processing the dimension and the cube. Take a look at this article for help on that: http://www.bidn.com/blogs/DevinKnight/ssis/1099/ssas-defining-attribute-relationships-in-2005-and-2008. The same approach applies to 2012 as well.

Building data warehouse dimension

I'm trying to create a data warehouse from which we will create all business reports. I've already learned quite a lot about this and have a general idea of how to build the data warehouse. However, I ran into a problem when I started to wonder how I could combine product and sales information from two separate OLTP databases into a single data store.
The ETL process looks like this:
1. Transfer product data from the first OLTP database into the staging table stgProducts.
2. Merge product data from stgProducts into dimProducts - if a product has changed, its record is updated; new products are added as new records.
3. Transfer product data from the second OLTP database into stgProducts.
4. Merge product data from stgProducts into dimProducts - if a product has changed, its record is updated; new products are added as new records.
Sales data is transferred in a similar way.
If I have one table with products, how do I connect it to the sales data from the two different databases?
By the two databases I mean two different ERP systems: one manages online sales, the other handles all other sales. The product's SKU is the same in both, but the product ID is different in each system.
Assuming that you're using a star schema (not always the best approach, BTW), your product dimension table should have a key that is unique to the DW. Thus, SKU 186 might have a DW-specific key of 1 and SKU 294 might have a DW-specific key of 2. Your fact table, which holds 'transaction records' (sales records?), will have a compound key composed of multiple foreign key columns (e.g. product_key, date_key, location_key, etc.).
The foreign key to the product table in this case points to the DW-specific product key, NOT to the source system SKU.
The ETL to populate your fact table should 'translate' the source system product key into the DW-specific product key as you bring the data in.
NOTE: This is the general approach to populating fact tables. There can be variations based on specific requirements.
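A sketch of that translation step for this case, where the shared SKU is the matching attribute (the staging/fact table and column names here are made up for illustration):
-- Translate the source system's product identifier into the DW surrogate key
-- while loading the fact table.
INSERT INTO FactSales (product_key, date_key, sale_amount)
SELECT p.product_key,
       s.date_key,
       s.amount
FROM   stgSales s
JOIN   dimProducts p
       ON p.sku = s.sku;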
Expanding Ben's answer a bit, with the caveat that there is no right answer in data warehousing - it is an expansive, mungeable area of IT with lots of schools of thought. Here is one possible direction you can pursue.
Assumptions:
1) You have 2 separate source databases, both with 2 tables: Product and Sales.
2) The separate source databases are isolated and may have conflicting primary key data.
3) You want to version[1] both the product and the sales tables. This is an important assumption, as fact tables are mostly not updated, and the sales table sounds like it belongs as a nice static fact table. Your question is unclear on whether you expect changes to sales, so I will assume you do.
4) A sales record can only ever be for 1 product (this is unlikely, but your question only mentions the 2 tables, so I'll work from there; a 1-to-many relationship will involve more tweaking around the bridging table).
Warehouse Design:
You will need 3 tables with the following columns:
PRODUCT_DIM
- PRODUCT_SK (Surrogate primary key, data warehouse database generated)
- SOURCE_SYSTEM_ID (DW-specific indicator of which source OLTP database the record came from - can be a string if you like)
- PRODUCT_NK (PK of the product from the source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- PRODUCT_NAME (Product name from the source table)
- Other columns (Any other product columns you may need)
SALES_DIM
- SALES_SK (Surrogate primary key, data warehouse database generated)
- SOURCE_SYSTEM_ID (DW-specific indicator of which source OLTP database the record came from - can be a string if you like)
- SALES_NK (PK of the sales record from the source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- SALE_AMOUNT (Sale amount from the source table)
- Other columns (Any other sales columns you may need)
PRODUCT_SALES_BRIDGE
- PRODUCT_SK (composite primary key)
- SALES_SK (composite primary key)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
The main things to note are the identifiers in the SALES and PRODUCT dim tables.
There is a natural key column for storing each record's primary key value exactly as it appears in the source system.
Because you have stated you have multiple source systems, the additional SOURCE_SYSTEM_ID column is required so you can match records from your multiple source systems to their equivalent records in your warehouse. Otherwise you might have a product EGGS with an ID of 13 in your first source system and a product called MILK with an ID also of 13 in your second system. Without the additional SOURCE_SYSTEM_ID you would be forever cutting new records for PRODUCT_DIM natural key 13. It will look something like this in your warehouse:
PRODUCT_SK  SOURCE_SYSTEM_ID  PRODUCT_NK  ..  PRODUCT_NAME
..
14          1                 13          ..  EGGS
15          2                 13          ..  MILK
..
The bridge table exists to prevent cutting new SALES or PRODUCT records each time their related record changes. Consider a sale of $10 worth of Red Eggs. The next day, the Red Eggs product is renamed to "Super Red Eggs". This will result in a new PRODUCT record for Red Eggs in the warehouse. If the SALES table included a direct link to PRODUCT_SK, a new SALES record would have to be cut solely because there was a new product SK for our Red Eggs. The bridging table moves the referential-integrity-induced cutting of new records from the DIMENSION/FACT tables into the bridge table. This also has the added benefit of making newcomers to the data warehouse very aware that they are operating in a different mindset from the traditional RDBMS.
The 2 natural key columns should help you solve your original question; the bridge table is just personal preference, added for completeness - if you already have a DW design that works for you, stick with it.
[1] I'll use "version" to mean whatever slowly changing dimension methodology you choose. Most people cheap out and just Type 2 their entire tables 'just in case'.
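To make the bridge table concrete, a point-in-time query joining sales to the product version that was current on a given date might look like the sketch below (written against the schema above; the date literal is arbitrary):
SELECT s.SALES_NK, s.SALE_AMOUNT, p.PRODUCT_NAME
FROM   SALES_DIM s
JOIN   PRODUCT_SALES_BRIDGE b
       ON  b.SALES_SK = s.SALES_SK
       AND '2014-06-01' BETWEEN b.DATE_FROM AND COALESCE(b.DATE_TO, '9999-12-31')
JOIN   PRODUCT_DIM p
       ON  p.PRODUCT_SK = b.PRODUCT_SK
WHERE  '2014-06-01' BETWEEN s.DATE_FROM AND COALESCE(s.DATE_TO, '9999-12-31');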

Athletics Ranking Database - Number of Tables

I'm fairly new to this so you may have to bear with me. I'm developing a database for a website with athletics rankings on it, and I was curious how many tables would be the most efficient way of achieving this.
I currently have 2 tables: a table called 'athletes', which holds the details of all my runners (potentially around 600 people/records) and contains the following fields:
mid (member id - primary key)
firstname
lastname
gender
birthday
nationality
And a second table, 'results', which holds all of their performances and has the following fields:
mid
eid (event id - primary key)
eventdate
eventcategory (road, track, field etc)
eventdescription (100m, 200m, 400m etc)
hours
minutes
seconds
distance
points
location
The second table has around 2000 records in it already, and potentially this will quadruple over time, mainly because there are around 30 track events, 10 field, 10 road, plus cross country, relays, multi-events, etc. - and with 600 athletes in my first table, that equates to a large number of records in my second table.
So what I was wondering is: would it be cleaner/more efficient to have multiple tables to separate track, field, cross country, etc.?
I want to use the database to order people's results based on their performance. If you would like to understand better what I am trying to emulate, take a look at this website: http://thepowerof10.info
Changing the schema won't change the number of results. Even if you split the venue into a separate table, you'll still have one result per participant at each event.
The potential benefit of having a separate venue table would be better normalization. A runner can have many results, and a given venue can have many results on a given date. You won't have to repeat the venue information in every result record.
You'll want to pay attention to indexes. Every table must have a primary key. Add additional indexes for columns you use in WHERE clauses when you select.
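For instance, a couple of indexes on the results table might look like this (the index names and column choices are just suggestions):
-- eid is already the primary key; add indexes for the columns used to
-- filter and rank results
CREATE INDEX ix_results_athlete ON results (mid);
CREATE INDEX ix_results_ranking ON results (eventcategory, eventdescription, points);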
Here's a discussion about normalization and what it can mean for you.
PS - Thousands of records won't be an issue. Large databases are on the order of gigabytes or terabytes.
My thought --
Don't break your events table into separate tables for each type (track, field, etc.). You'll have a much easier time querying the data back out if it's all there in the same table.
Otherwise, your two tables look fine -- it's a good start.
