Our SaaS app helps eCommerce stores create more effective promotions. To do this, we need to retain quite a bit of data: the promotions themselves, plus every product each promotion is associated with, for each customer group the promotion applies to. I've normalized this data, but I feel one of these tables, promotion_products, might grow too fast and slow down our most critical queries.
Here's how this table gets populated so quickly: If a store has 25,000 products and runs 1 promotion per week across 10 customer groups, then the promotion_products table would have 250,000 new entries/week (1 million/mo). Since this is a SaaS product, we have many customers creating the same amount of data or more.
How can I improve this schema so that the data in promotion_products can be queried quickly?
products
- product_id
- customer_id
- name
promotions
- promotion_id
- customer_id
- promotion (e.g. 20% off, etc.)
promotion_products (need fast read access)
- product_id
- promotion_id
- customer_id
- group_id (each offer is associated w/ one or many customer groups)
It looks like you've denormalized it a bit by adding customer_id to the promotion_products table; that should help, especially if it's indexed. You could also consider clustering the promotion_products table by customer_id and promotion_id.
I'm wondering why both group_id and customer_id are in promotion_products; it seems conflicting, since a group_id sounds like it would relate to multiple customers.
Make sure to index all the fields you intend to query by.
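As a rough sketch of that idea (assuming generic SQL DDL and the column names above; the types are just placeholders):

-- Composite primary key ordered by the columns queried most often;
-- on engines like InnoDB this also clusters each customer's rows together.
CREATE TABLE promotion_products (
    customer_id   BIGINT NOT NULL,
    promotion_id  BIGINT NOT NULL,
    group_id      BIGINT NOT NULL,
    product_id    BIGINT NOT NULL,
    PRIMARY KEY (customer_id, promotion_id, group_id, product_id)
);

-- Secondary index for lookups that start from the product side.
CREATE INDEX idx_promo_products_product
    ON promotion_products (product_id, promotion_id);

Putting customer_id first in the key keeps each customer's rows physically close together, which is what your most critical queries appear to filter on.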
Related
First of all I have to mention that I am modernising our ERP system, which was built in-house. It handles everything from purchasing of parts, sale orders, quotes and inventory to invoicing and statistical data. The system is web based and heavily dependent on an ORM; EloquentORM will be used in the redesign.
My main question is about the data model of certain entities that are very similar. Currently three of most widely interconnected entities in the app are: Orders, Products and Invoices.
1. Orders
In the current DB design I have one big orders table with an order_type attribute to distinguish between the different order types: Purchase orders, Sale orders, Quotes and Service orders. About 80% of the fields are common to every order type, and there are some fields specific to each order type. Currently at ~15k records.
2. Products
Similarly, I have one big products table with a product_type attribute to distinguish between the different product types: Finished products, Services, Assemblies and Parts. Again, a fair percentage of the fields are common to all product types and some are specific to a particular product type. Currently at ~7k records.
3. Invoices
Again, one invoices table with an invoice_type attribute to distinguish between 4 invoice types: Issued invoices (for things we sell), Received invoices (for things we buy), Credit notes and Advance invoices. More or less all invoice types use the same fields. Currently at ~15k records.
I am now trying to decide which is the optimal way for this kind of DB model. I see three options:
1. Single Table Inheritance
Leave everything as is, in the same table. It feels kind of awkward to always have to filter records like where order_type = 'Sale order' to display the right orders in the right place in the GUI... Also, when doing sales and purchase analytics I need to include the same where condition to fetch the right orders. Seems wrong!
2. Class Table Inheritance
Have master tables orders, products and invoices with the set of fields common to each entity type, and then one-to-one child tables for every different type of each entity: sales_orders, purchase_orders, quote_orders, finished_products, reseller_products, part_products, assembly_products, received_invoices and issued_invoices, with a FK in each child table pointing to its master table... This seems like a good idea, but handling it with the ORM brings in a little more complexity...
With this method I have a question about which FK should be used where. For example, each invoice can belong to one order: a received invoice goes with a purchase order and an issued invoice goes with a sale order. Should the master orders table's PK be used as a FK in the master invoices table to relate these entities, or should the child sale_orders PK be used in the child issued_invoices table? (See the sketch at the end of this question.)
3. Concrete Table Inheritance
Have completely separate tables for every type of each entity. This would avoid the parent->child relationship with a master table, but would result in a lot of similar attributes in each table...
What would be the best approach? I am aiming at ease of use in EloquentORM and also speed and scalability for the future.
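To make option 2 and the FK question concrete, here is a rough sketch of what I have in mind (table and column names are illustrative only, not the final design):

CREATE TABLE orders (
    order_id   BIGINT PRIMARY KEY,
    order_date DATE NOT NULL
    -- ...the ~80% of fields common to every order type
);

CREATE TABLE sale_orders (
    -- Shares its PK with the master row (one-to-one child table).
    order_id      BIGINT PRIMARY KEY REFERENCES orders (order_id),
    delivery_date DATE
    -- ...fields specific to sale orders
);

CREATE TABLE invoices (
    invoice_id BIGINT PRIMARY KEY,
    -- The open question: reference the master orders table here,
    -- or have the child issued_invoices reference the child sale_orders instead?
    order_id   BIGINT REFERENCES orders (order_id)
    -- ...the fields common to every invoice type
);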
Let's say I have an ordering system which has a table size of around 50,000 rows and grows by about 100 rows a day. Also, say once an order is placed, I need to store metrics about that order for the next 30 days and report on those metrics on a daily basis (i.e. on day 2, this order had X activations and Y deactivations).
1 table called products, which holds the details of the product listing
1 table called orders, which holds the order data and product id
1 table called metrics, which holds a date field, and order id, and metrics associated.
If I modeled this in a star schema format, I'd design like this:
FactOrders table, which has 30 days * X orders rows and stores all metadata around the orders, product id, and metrics (each row represents the metrics of a product on a particular day).
DimProducts table, which stores the product metadata
Does the performance gain from a huge FactOrders table needing only one join to get all relevant information outweigh the fact that I've increased my table size by 30x and have an incredible amount of repeated data, vs. the truly normalized model that has one extra join but much smaller tables? Or am I designing this incorrectly for a star schema format?
Do not denormalize something this small to get rid of joins. Index properly instead. Joins are not bad, joins are good. Databases are designed to use them.
Denormalizing is risky for data integrity and may not even be faster, because the tables become much wider. In tables this tiny, it is very unlikely that denormalizing would help.
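For example, a sketch against the normalized three-table layout from the question (column names are assumed, not taken from your actual schema):

-- Assumed column names; adjust to the real schema.
CREATE INDEX idx_metrics_order_date ON metrics (order_id, metric_date);
CREATE INDEX idx_orders_product     ON orders  (product_id);

-- A typical daily report then needs only two indexed joins:
SELECT m.metric_date,
       p.product_name,
       SUM(m.activations)   AS activations,
       SUM(m.deactivations) AS deactivations
FROM metrics  m
JOIN orders   o ON o.order_id   = m.order_id
JOIN products p ON p.product_id = o.product_id
GROUP BY m.metric_date, p.product_name;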
My new employer runs a standard ecommerce website with about 14k products. They create "microsites" that a person can log into and see custom pricing and limited products (hide some/show some products) and custom categories.
The current system just has a huge relational database. So there is a products table, a sites table, and then a sites_products table with the following columns:
site_id
product_id
product_price
If the microsite is supposed to show a product, it is simply stored in this table. This table is currently at 2 million rows and growing. The custom categories use a similar relational table setup, but the numbers are much lower so I am not as worried about that.
I would appreciate any help/ideas you could provide to decrease this table size. I am confident that in the next couple years it will be at 20 million at this rate.
-Justin
I'm trying to create a data warehouse from which we will create all business reports. I've already learned quite a lot about this and have a general idea of how to build the data warehouse. However, I ran into a problem when I started to wonder how I could combine information about products and sales from two separate OLTP databases into a single data store.
The ETL process looks like this:
1. Transfer product data from the first OLTP database into the staging table stgProducts.
2. Merge product data from table stgProducts into table dimProducts - if a product has changed, its record is updated; new products are added as new records.
3. Transfer product data from the second OLTP database into the staging table stgProducts.
4. Merge product data from table stgProducts into table dimProducts - if a product has changed, its record is updated; new products are added as new records.
The sales data is transferred in a similar way.
If I have one table with products how do I connect to the sales data from two different databases?
By the two databases I mean two different ERP systems: one manages the online sales, the other handles all other sales. The SKU of a product is the same in both, but the product ID is different in each system.
Assuming that you're using a star schema (not always the best approach, BTW), your product dimension table should have a key that is unique to the DW. Thus, SKU 186 might have a DW-specific key of 1 and SKU 294 might have a DW-specific key of 2. Your fact table, which holds 'transaction records' (sales records?), will have a compound key composed of multiple foreign key columns (e.g. product_key, date_key, location_key, etc.).
The foreign key to the product table in this case references the DW-specific product key, NOT the source system SKU.
The ETL to populate your fact table should 'translate' the source system product key into the DW-specific product key as you bring the data in.
NOTE: This is the general approach to populating fact tables. There can be variations based on specific requirements.
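As a sketch of that translation step (table and column names here are illustrative, not prescriptive):

-- Illustrative only: resolve the DW surrogate key while loading the fact table.
INSERT INTO fact_sales (product_key, date_key, sale_amount)
SELECT dp.product_key,        -- DW-specific surrogate key, not the source system's ID
       s.sale_date_key,
       s.sale_amount
FROM   stg_sales    s
JOIN   dim_products dp
       ON dp.sku = s.sku;     -- match on the shared SKU (or source ID + source system)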
Expanding Ben's answer a bit, with the caveat that there is no single right answer in data warehousing - it is an expansive, mungeable area of IT with lots of schools of thought. Here is one possible direction you could pursue.
Assumptions:
1) You have 2 separate source databases, each with 2 tables: Product and Sales.
2) The separate source databases are isolated and may have conflicting primary key data.
3) You want to version[1] both the product and the sales tables. This is an important assumption, as fact tables are mostly not updated, and the sales table sounds like it belongs as a nice static fact table. Your question is unclear on whether you are expecting changes to sales, so I will assume you are.
4) A sales record can only ever be for 1 product (this is unlikely, but your question only mentions the 2 tables so I'll work from there; a 1-to-many relation would involve more tweaking around the bridge table).
Warehouse Design:
You will need 3 tables with the following columns:
PRODUCT_DIM
- PRODUCT_SK (Surrogate primary key, generated by the data warehouse database)
- SOURCE_SYSTEM_ID (DW-specific indicator of which source OLTP database the record came from - can be a string if you like)
- PRODUCT_NK (PK of the product in the source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- PRODUCT_NAME (Product name from the source table)
- Other Columns (Any other product columns you may need)
SALES_DIM
- SALES_SK (Surrogate primary key, generated by the data warehouse database)
- SOURCE_SYSTEM_ID (DW-specific indicator of which source OLTP database the record came from - can be a string if you like)
- SALES_NK (PK of the sales record in the source system, used for SCD operations)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
- SALE_AMOUNT (Sale amount from the source table)
- Other Columns (Any other sales columns you may need)
PRODUCT_SALES_BRIDGE
- PRODUCT_SK (composite primary key)
- SALES_SK (composite primary key)
- DATE_FROM (Record active from)
- DATE_TO (Record active to (null for current))
The main thing to note is the identifiers in the SALES and PRODUCT dim tables.
There is a Natural Key column for storing each record's primary key value exactly as it is in the source system.
Because you have stated you have multiple source systems, the additional SOURCE_SYSTEM_ID column is required so you can match records from your multiple source systems to their equivalent record in your warehouse. Otherwise you might have a product called EGGS with an ID of 13 in your first source system and a product called MILK with an ID also of 13 in your second system. Without the additional SOURCE_SYSTEM_ID you would be forever cutting records for PRODUCT_DIM natural key 13. This will look something like this in your warehouse:
PRODUCT_SK SOURCE_SYSTEM_ID PRODUCT_NK .. PRODUCT_NAME
..
14 1 13 .. EGGS
15 2 13 .. MILK
..
The bridge table exists to prevent cutting new SALES or PRODUCT records each time a related record changes. Consider a $10 sale of Red Eggs. The next day, the Red Eggs product is renamed to "Super Red Eggs". This results in a new PRODUCT record for Red Eggs in the warehouse. If the SALES table included a direct link to PRODUCT_SK, a new SALES record would have to be cut solely because there is a new product SK for our Red Eggs. The bridge table moves the referential-integrity-induced cutting of new records from the DIMENSION/FACT tables into the bridge table. This also has the added benefit of making newcomers to the data warehouse very aware that they are operating in a different mindset from a traditional RDBMS.
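To illustrate, a sketch of how the bridge resolves the current product version for each current sale, using the column layout above:

SELECT s.SALES_NK,
       s.SALE_AMOUNT,
       p.PRODUCT_NAME
FROM   SALES_DIM s
JOIN   PRODUCT_SALES_BRIDGE b
       ON  b.SALES_SK = s.SALES_SK
       AND b.DATE_TO IS NULL          -- current bridge row
JOIN   PRODUCT_DIM p
       ON  p.PRODUCT_SK = b.PRODUCT_SK
       AND p.DATE_TO IS NULL          -- current product version
WHERE  s.DATE_TO IS NULL;             -- current sales version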
The 2 Natural Key columns should help you solve your original question; the bridge table is just personal preference, added for completeness - if you already have a DW design that works for you, stick with it.
[1] I'll use "version" to mean whatever slowly changing dimension methodology you choose. Most people cheap out and just Type 2 their entire tables 'just in case'.
I have the following tables :
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, warehouse_name,........}
WarehousePartitions { partition_id, partition_name,........}
WarehouseRacks { rack_id, rack_name,.......}
Now, in order to track an item's location, I have the following table (a relation).
itemLocation { item_id, division_id, warehouse_id, partition_id, rack_id, floor_number}
It accurately tracks an item's location, but in order to get an item's location info I have to join five tables, which could cause performance issues.
Also, the table doesn't have a primary key unless we use all of the fields. Will this cause any issues, and is there a better way to accomplish this?
Thanks.
Think in terms of relationships, since you're putting information in a relational database.
Here are my relationship guesses. Feel free to correct them.
A Division has one or more Warehouses.
A Warehouse has one or more Warehouse partitions.
A Warehouse partition has one or more Warehouse Racks.
A Warehouse rack has one or more items.
An item is located in a Warehouse rack.
A Warehouse rack is located in a Warehouse partition.
A Warehouse partition is located in a Warehouse.
A Warehouse is located in a Division.
I hope this helps with your database design.
Edited to add: I'll lay out the indexes for the tables. You should be able to create the rest of the columns.
Division
--------
Division ID
...
Warehouse
---------
Warehouse ID
Division ID
...
Warehouse Partition
-------------------
Warehouse Partition ID
Warehouse ID
...
Warehouse Rack
--------------
Warehouse Rack ID
Warehouse Partition ID
...
Item
----
Item ID
Warehouse Rack ID
Item Type ID
...
Item Type
---------
Item Type ID
Item name
Item Price
Each table has a primary ID blind key, probably an auto incrementing integer or an auto incrementing long.
All of the tables except Division have a foreign key that points back to the parent table.
A row in the Item table represents one item. One item can only be in one Warehouse Rack.
Modern relational databases should have no performance problems joining five tables. I've seen 30 table joins.
Build your system, and solve the actual performance problems that come up, rather than spending any time worrying about hypothetical performance problems.
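For example, fetching an item's full location with this layout is just a chain of foreign-key joins (a sketch; snake_case table and column names are assumed from the layout above):

SELECT i.item_id,
       it.item_name,
       wr.rack_name,
       wp.partition_name,
       w.warehouse_name,
       d.division_name
FROM   item                i
JOIN   item_type           it ON it.item_type_id            = i.item_type_id
JOIN   warehouse_rack      wr ON wr.warehouse_rack_id       = i.warehouse_rack_id
JOIN   warehouse_partition wp ON wp.warehouse_partition_id  = wr.warehouse_partition_id
JOIN   warehouse           w  ON w.warehouse_id             = wp.warehouse_id
JOIN   division            d  ON d.division_id              = w.division_id;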
As Gilbert Le Blanc writes, you probably don't need to join to five tables - you may only need to join to "WarehouseRacks".
However, you write that you need to "keep track of" - this suggests that there's a time aspect involved.
That gives you the following schema:
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, division_id, warehouse_name,........}
WarehousePartitions { partition_id, warehouse_id, partition_name,........}
WarehouseRacks { rack_id, partition_id, rack_name,.......}
ItemLocation (rack_id, item_id, entry_time, quantity, floor_number)
In ItemLocation, the rack_id, item_id and entry_time columns form a composite primary key - you're effectively saying "there can only be one instance of an item in a given place at any one time".
You still have to join five tables to retrieve an item's location info (at least if you want the addresses and names and such). Assuming you have modern hardware and database software, this should be fine - unless you're working with huge amounts of data, a 5-way join on foreign/primary key relationships is unlikely to cause performance issues. Given the quantities you mention in the comment, and the fact that you'll be running this on MySQL, I don't think you need to worry about the number of joins.
The benefit of this model is that you simply cannot insert invalid data into the item location table - you can't say that the item is in a rack which doesn't exist in the partition, or a warehouse that doesn't exist in the division; if a warehouse changes division, you don't have to update all the item_location records.
I've created a SQLFiddle to show how it might work.
The "item_location" table is the biggest concern in this - you have to choose whether to store a snapshot (which is what this design does), or a transaction table. With "snapshot" views, your code always updates the "quantity" column, effectively saying "as of entry_time, there are x items in this floor in this rack".
The "transaction" model allows you to insert multiple records - typically positive quantities when adding items, and negative quantities when removing them. The items in that location at any point in time are the SUM of those quantities up to the desired time.