Let's say I have an ordering system with a table of around 50,000 rows that grows by about 100 rows a day. Also, say that once an order is placed, I need to store metrics about that order for the next 30 days and report on those metrics on a daily basis (i.e. on day 2, this order had X activations and Y deactivations). In a normalized model, I would have:
1 table called products, which holds the details of the product listing
1 table called orders, which holds the order data and product id
1 table called metrics, which holds a date field, an order id, and the associated metrics.
If I modeled this in a star schema format, I'd design it like this:
FactOrders table, which has 30 days * X orders rows and stores all metadata around the orders, product id, and metrics (each row represents the metrics of a product on a particular day).
DimProducts table, which stores the product metadata
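Roughly, in DDL terms (the metric and metadata columns are only placeholders, not an exact schema):

-- one row per order per day; order metadata and metrics repeated on every row
CREATE TABLE DimProducts (
    product_id INT PRIMARY KEY,
    product_name VARCHAR(100)
    -- other product metadata
);

CREATE TABLE FactOrders (
    order_id INT,
    metric_date DATE,
    product_id INT REFERENCES DimProducts (product_id),
    activations INT,
    deactivations INT,
    -- other order metadata, repeated for each of the 30 days
    PRIMARY KEY (order_id, metric_date)
);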
Does my performance gain from a huge FactOrders table, which only needs one join to get all relevant information, outweigh the fact that I've increased my table size by 30x and have an incredible amount of repeated data, versus the truly normalized model that has one extra join but much smaller tables? Or am I designing this incorrectly for a star schema format?
Do not denormalize something this small to get rid of joins. Index properly instead. Joins are not bad, joins are good. Databases are designed to use them.
Denormalizing is risky for data integrity and may not even be faster due to the much wider rows. In tables this tiny, it is very unlikely that denormalizing would help.
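As a rough sketch of what "index properly" could mean for the normalized model (column names like metric_date, activations and deactivations are only placeholders, since the actual columns weren't posted):

-- index the join/filter columns of the normalized tables
CREATE INDEX idx_orders_product_id ON orders (product_id);
CREATE INDEX idx_metrics_order_id_date ON metrics (order_id, metric_date);

-- a daily report then needs only two indexed joins
SELECT o.order_id, p.product_name, m.metric_date, m.activations, m.deactivations
FROM metrics m
JOIN orders o ON o.order_id = m.order_id
JOIN products p ON p.product_id = o.product_id
WHERE m.metric_date = '2024-01-02';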
I have a fairly standard database structure for storing invoices and their line items. We have an AccountingDocuments table for the header details and an AccountingDocumentItems child table. The child table has a foreign key of AccountingDocumentId with an associated non-clustered index.
Most of our Sales reports need to select almost all the columns in the child table, based on a passed date range and branch. So the basic query would be something like:
SELECT
adi.TotalInclusive,
adi.Description, etc.
FROM
AccountingDocumentItems adi
LEFT JOIN AccountingDocuments ad on ad.AccountingDocumentId=adi.AccountingDocumentId
WHERE
ad.DocumentDate > '1 Jan 2022' and ad.DocumentDate < '5 May 2022'
AND
ad.SiteId='x'
This results in an index seek on the AccountingDocuments table, but due to the select requirements on AccountingDocumentItems we're seeing a full index scan here. I'm not certain how to index the child table given the filtering is taking place on the parent table.
At present there are only around 2m rows in the line items table, and the query performs quite well. My concern is how this will work once the table is significantly larger and what strategies there are to ensure good performance on these reports going forward.
I'm aware the answer may be to simply keep adding hardware, but I want to investigate alternative strategies first.
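One strategy that could be sketched here (column names taken from the query above; SQL Server INCLUDE syntax assumed) is a covering index on the child table's foreign key that includes the columns the report selects, so the join from the filtered parent rows can seek on the child rather than scan:

CREATE NONCLUSTERED INDEX IX_AccountingDocumentItems_DocumentId
ON AccountingDocumentItems (AccountingDocumentId)
INCLUDE (TotalInclusive, Description);

Whether this actually helps depends on how many child rows the date and branch filter selects, so it would need to be tested against the real workload.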
I'm wondering what the best practices are around recording stocks and flows. Do you store only the flows and calculate the stocks, or do you store both?
It seems like the important things to persist are the flows (for example, in a bank database, the debits and credits to an account); the stocks (funds remaining) can be calculated from these. But if there are lots of bank accounts and I want a table of multiple bank accounts with their funds remaining, I would have to recalculate this amount for each one, which seems quite slow.
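For example, computing the funds remaining for every account might look like this (table and column names are made up):

-- flows only: amount is positive for credits, negative for debits
SELECT account_id, SUM(amount) AS funds_remaining
FROM account_transactions
GROUP BY account_id;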
On the other hand, I thought one of the main goals of databases is to not have duplicated data.
Is there a general practice around storing stocks? Should this be a calculated field, or rather inserted by program logic?
In database design, we have Derived Data:
A table can have derived columns: columns whose values are computed from the values of other columns in the table. If all of its columns are derived, it is said to be a derived table.
For example:
Student Age,
Account Balance,
Number of up-votes or likes on posts and comments (as on Stack Overflow).
In this case we have 2 options with pros and cons:
Do not store the derived data; calculate it when needed
pros: we do not have any redundancy in our database design.
cons: we have to calculate the aggregate data (Count, Sum, Avg, ...) in most queries
Store the derived data instead of calculating it
pros: we have all the aggregate data ready and do not need to calculate it
cons: we have a little redundancy.
cons: we have to update the derived data whenever the original data changes.
Therefore we have a trade-off between option 1 and option 2. We should estimate their costs in our application and choose one of them.
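As a rough sketch of the two options, using the up-votes example (the table and column names are just illustrative):

-- Option 1: no derived column; aggregate on every read
SELECT p.post_id, COUNT(v.vote_id) AS vote_count
FROM posts p
LEFT JOIN votes v ON v.post_id = p.post_id
GROUP BY p.post_id;

-- Option 2: store a derived vote_count column and keep it in sync on every write,
-- in the same transaction as the INSERT INTO votes
UPDATE posts
SET vote_count = vote_count + 1
WHERE post_id = 123;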
First: Redundancy
In my opinion, redundancy is not the important factor in this trade-off, because there is not much duplicated data: we only add an extra field (an integer or big integer).
Second: Cost
I think we should compare the costs of these two options:
If we do not store the derived data:
the cost of retrieving the aggregate data in queries
If we use aggregate columns:
the cost of updating the aggregate columns
So, how can we estimate these costs in our application? There are some parameters that are directly related to the cost:
the number of records in the original table (and the secondary table).
the number of inserts into the original tables in a given period of time.
the number of updates (updates and deletes) on the original table in a given period of time.
the number of selects of aggregate data from the original table (or the secondary table) in a given period of time.
and many other parameters.
Finally, for more formal approaches, I recommend reading DAX Patterns.
I have the following tables:
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, warehouse_name,........}
WarehousePartitions { partition_id, partition_name,........}
WarehouseRacks { rack_id, rack_name,.......}
Now, in order to track an item's location, I have the following table (a relation).
itemLocation { item_id, division_id, warehouse_id, partition_id, rack_id, floor_number}
It accurately tracks an item's location, but in order to get an item's location info I have to join five tables, which can cause performance issues.
Also, the table doesn't have any primary key unless we use all of the fields together. Will this cause any issues, and is there a better way to accomplish this?
Thanks.
Think in terms of relationships, since you're putting information in a relational database.
Here are my relationship guesses. Feel free to correct them.
A Division has one or more Warehouses.
A Warehouse has one or more Warehouse partitions.
A Warehouse partition has one or more Warehouse Racks.
A Warehouse rack has one or more items.
An item is located in a Warehouse rack.
A Warehouse rack is located in a Warehouse partition.
A Warehouse partition is located in a Warehouse.
A Warehouse is located in a Division.
I hope this helps with your database design.
Edited to add: I'll lay out the indexes for the tables. You should be able to create the rest of the columns.
Division
--------
Division ID
...
Warehouse
---------
Warehouse ID
Division ID
...
Warehouse Partition
-------------------
Warehouse Partition ID
Warehouse ID
...
Warehouse Rack
--------------
Warehouse Rack ID
Warehouse Partition ID
...
Item
----
Item ID
Warehouse Rack ID
Item Type ID
...
Item Type
---------
Item Type ID
Item name
Item Price
Each table has a blind (surrogate) primary key, probably an auto-incrementing integer or long.
All of the tables except Division have a foreign key that points back to the parent table.
A row in the Item table represents one item. One item can only be in one Warehouse Rack.
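A rough DDL sketch of that layout, assuming MySQL (mentioned later in this thread), with the table names condensed and the column types guessed:

CREATE TABLE Division (
    DivisionId INT AUTO_INCREMENT PRIMARY KEY
    -- other Division columns
);

CREATE TABLE Warehouse (
    WarehouseId INT AUTO_INCREMENT PRIMARY KEY,
    DivisionId INT NOT NULL,
    FOREIGN KEY (DivisionId) REFERENCES Division (DivisionId)
    -- other Warehouse columns
);

CREATE TABLE WarehousePartition (
    WarehousePartitionId INT AUTO_INCREMENT PRIMARY KEY,
    WarehouseId INT NOT NULL,
    FOREIGN KEY (WarehouseId) REFERENCES Warehouse (WarehouseId)
);

CREATE TABLE WarehouseRack (
    WarehouseRackId INT AUTO_INCREMENT PRIMARY KEY,
    WarehousePartitionId INT NOT NULL,
    FOREIGN KEY (WarehousePartitionId) REFERENCES WarehousePartition (WarehousePartitionId)
);

CREATE TABLE ItemType (
    ItemTypeId INT AUTO_INCREMENT PRIMARY KEY,
    ItemName VARCHAR(100),
    ItemPrice DECIMAL(10, 2)
);

CREATE TABLE Item (
    ItemId INT AUTO_INCREMENT PRIMARY KEY,
    WarehouseRackId INT NOT NULL,
    ItemTypeId INT NOT NULL,
    FOREIGN KEY (WarehouseRackId) REFERENCES WarehouseRack (WarehouseRackId),
    FOREIGN KEY (ItemTypeId) REFERENCES ItemType (ItemTypeId)
);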
Modern relational databases should have no performance problems joining five tables. I've seen 30 table joins.
Build your system, and solve the actual performance problems that come up, rather than spending any time worrying about hypothetical performance problems.
As Gilbert Le Blanc writes, you probably don't need to join to five tables - you may only need to join to "WarehouseRacks".
However, you write that you need to "keep track of" - this suggests that there's a time aspect involved.
That gives you the following schema:
Items { item_id, item_name, item_price,......}
Divisions { division_id, division_name, division_address,.......}
Warehouses { warehouse_id, division_id, warehouse_name,........}
WarehousePartitions { partition_id, warehouse_id, partition_name,........}
WarehouseRacks { rack_id, partition_id, rack_name,.......}
ItemLocation (rack_id, item_id, entry_time, quantity, floor_number)
In ItemLocation, the first three columns (rack_id, item_id and entry_time) are part of a composite primary key - you're effectively saying "there can only be one instance of an item in a given place at any one time".
You still have to join five tables to retrieve an item's full location info (at least if you want the addresses and names and such). Assuming you have modern hardware and database software, this should be fine - unless you're working with huge amounts of data, a 5-way join on foreign/primary key relationships is unlikely to cause performance issues. Given the quantities you mention in the comment, and the fact that you'll be running this on MySQL, I don't think you need to worry about the number of joins.
The benefit of this model is that you simply cannot insert invalid data into the item location table - you can't say that the item is in a rack which doesn't exist in the partition, or a warehouse that doesn't exist in the division; if a warehouse changes division, you don't have to update all the item_location records.
I've created a SQLFiddle to show how it might work.
The "item_location" table is the biggest concern in this - you have to choose whether to store a snapshot (which is what this design does), or a transaction table. With "snapshot" views, your code always updates the "quantity" column, effectively saying "as of entry_time, there are x items in this floor in this rack".
The "transaction" model allows you to insert multiple records - typically positive quantities when adding items, and negative quantities when removing them. The items in that location at any point in time are the SUM of those quantities up to the desired time.
I have about 10 tables that contain records with date ranges and some value belonging to each range.
Each table has its own meaning.
For example
rates
start_date DATE
end_date DATE
price DOUBLE
availability
start_date DATE
end_date DATE
availability INT
and then a dates table
day DATE
which holds one row per day for 2 years ahead.
The final result is produced by joining these 10 tables to the dates table.
The query takes a while to run, because there are also some other joins and subqueries.
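The join is along these lines (simplified; only two of the ten range tables are shown, and the real query has more joins and subqueries on top):

SELECT d.day, r.price, a.availability
FROM dates d
LEFT JOIN rates r ON d.day BETWEEN r.start_date AND r.end_date
LEFT JOIN availability a ON d.day BETWEEN a.start_date AND a.end_date
WHERE d.day BETWEEN '2015-01-01' AND '2015-01-31';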
I have been thinking about creating one bigger table containing the data of all 10 tables for each day, but the final table would have about 1.5M-2M records.
From testing, it seems quicker (0.2s instead of about 1s) to search this table instead of joining the tables and searching the joined result.
Is there any real reason why it would be a bad idea to have a table with that many records?
The final table would look like
day DATE
price DOUBLE
availability INT
Thank you for your comments.
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that the dates from one table don't line up with the dates from another table, which forces extra boundaries onto some attributes: once everything is in one table, all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with, and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date - my "super" table was recalculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours, except I had only 3 tables: availability, pricing and margin. The fact was that the 3 were unrelated, so the date ranges never aligned, leading to lots of artificial rows in the big table.
I'm fairly new to this so you may have to bear with me. I'm developing a database for a website with athletics rankings on them and I was curious as to how many tables would be the most efficient way of achieving this.
I currently have 2 tables: a table called 'athletes', which holds the details of all my runners (potentially around 600 people/records) and contains the following fields:
mid (member id - primary key)
firstname
lastname
gender
birthday
nationality
And a second table, 'results', which holds all of their performances and has the following fields:
mid
eid (event id - primary key)
eventdate
eventcategory (road, track, field etc)
eventdescription (100m, 200m, 400m etc)
hours
minutes
seconds
distance
points
location
The second table has around 2000 records in it already, and potentially this will quadruple over time, mainly because there are around 30 track events, 10 field, 10 road, plus cross country, relays, multi-events and so on; with 600 athletes in my first table, that equates to a large number of records in my second table.
So what I was wondering is would it be cleaner/more efficient to have multiple tables to separate track, field, cross country etc?
I want to use the database to order people's results based on their performance. If you would like to understand better what I am trying to emulate, take a look at this website: http://thepowerof10.info
Changing the schema won't change the number of results. Even if you split the venue into a separate table, you'll still have one result per participant at each event.
The potential benefit of having a separate venue table would be better normalization. A runner can have many results, and a given venue can have many results on a given date. You won't have to repeat the venue information in every result record.
You'll want to pay attention to indexes. Every table must have a primary key. Add additional indexes for columns you use in WHERE clauses when you select.
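For example (index names and column choices are only a sketch based on the fields you listed):

-- look up all results for one athlete
CREATE INDEX idx_results_mid ON results (mid);
-- rankings filtered by category and event, narrowed by date
CREATE INDEX idx_results_event ON results (eventcategory, eventdescription, eventdate);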
Here's a discussion about normalization and what it can mean for you.
PS - Thousands of records won't be an issue. Large databases are on the order of giga- or tera-bytes.
My thought --
Don't break your events table into separate tables for each type (track, field, etc.). You'll have a much easier time querying the data back out if it's all there in the same table.
Otherwise, your two tables look fine -- it's a good start.