Database table structure for price list

I have about 10 tables that hold records with date ranges and some value belonging to each date range.
Each table has some meaning.
For example:

rates
  start_date DATE
  end_date DATE
  price DOUBLE

availability
  start_date DATE
  end_date DATE
  availability INT

and then a dates table

dates
  day DATE

which contains one row for each day for 2 years ahead.
The final result comes from joining these 10 tables to the dates table.
The query takes a while because there are some other joins and subqueries as well.
I have been thinking about creating one bigger table containing all 10 tables' data for each day, but the final table would have about 1.5M - 2M records.
From testing, it seems quicker (0.2s instead of about 1s) to search this one table than to join the tables and search the joined result.
Is there any real reason why it would be a bad idea to have a table with that many records?
The final table would look like:

  day DATE
  price DOUBLE
  availability INT
Thank you for your comments.

This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
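
As a rough illustration, here is a minimal sketch of how such a per-day table could be materialized from the range tables (table and column names taken from the question; the syntax is generic SQL and may need adjusting for your database):

    -- Build one row per day by joining each range table against the dates table
    CREATE TABLE daily_prices AS
    SELECT d.day,
           r.price,
           a.availability
    FROM   dates d
    LEFT JOIN rates        r ON d.day BETWEEN r.start_date AND r.end_date
    LEFT JOIN availability a ON d.day BETWEEN a.start_date AND a.end_date;

    -- Index the day column so point-in-time lookups stay fast
    CREATE INDEX idx_daily_prices_day ON daily_prices (day);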

I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that the dates in one table don't line up with the dates in another, which forces you to create extra boundaries for some attributes, because once everything is in one table all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with, and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date: my "super" table had to be recalculated from the separate tables whenever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours, except I had only 3 tables: availability, pricing and margin. The fact was that the 3 were unrelated, so the date ranges never aligned, leading to lots of artificial rows in the big table.


Why is datekey in fact tables always INT?

I'm looking at the datekey column from the fact tables in AdventureWorksDW and they're all of type int.
Is there a reason for this and not of type date?
I understand that creating a clustered index composed of an INT would optimize query speed. But let's say I want to get data from this past week. I can subtract 6 from the date 20170704 and I'll get 20170698, which is not a valid date. So I have to cast everything to DATE, subtract, and then cast back to INT.
Right now I have a foreign key constraint to make sure that something besides 'YYYYMMDD' isn't inserted. It wouldn't be necessary with a Date type. Just now, I wanted to get some data between 6/28 and 7/4. I can't just subtract six from 20170703; I have to cast from INT to DATE.
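
For example, getting the past week means something like this (a sketch against AdventureWorksDW's FactInternetSales and its OrderDateKey column; the query itself is just an illustration):

    -- Cast the key to DATE, do the date math, then cast back to the INT key format
    SELECT *
    FROM   FactInternetSales
    WHERE  OrderDateKey >= CAST(CONVERT(CHAR(8),
                                DATEADD(DAY, -6, CAST('20170704' AS DATE)),
                                112) AS INT);
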
It seems like a lot of hassle and not many benefits.
Thanks.
Yes, you could use a Date data type and have that as your primary key in the fact and the dimension tables, and you're going to save yourself a byte in the process.
And then you're going to have to deal with a sale that is recorded and we didn't know the date. What then? In a "normal" dimensional model, you define Unknown surrogate values so that people know there is data and it might be useful but it's incomplete. A common convention is to make it zero or in the negative realm. Easy to do with integers.
Dates are a little weird in that we typically use smart keys - yyyymmdd. From a debugging perspective, it's easy to quickly identify what the date is without having to look up against your dimension.
With a date type you can't make an invalid date. Soooo what then? Everyone "knows" that 1899-12-31 is the "fake" date (or whatever tickles your fancy), and that's all well and good until someone fat-fingers a date and magically hits your sentinel date, and now you've got valid unknowns mixed with merely bad data entry.
If you're doing date calculations against a smart key, you're doing it wrong. You need to go to your date dimension to properly resolve the value and use methods that are aware of date logic, because it's ugly and nasty beyond just simple things like month lengths and leap year calculations.
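
A minimal sketch of that "unknown member" convention (hypothetical column names, not necessarily those of AdventureWorksDW's DimDate):

    -- Reserve a sentinel key that can never collide with a real YYYYMMDD value
    INSERT INTO DimDate (DateKey, FullDate, DateDescription)
    VALUES (-1, NULL, 'Unknown');
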
Actually that fact table has a relationship to a table DimDate, and if you join that table you get many more options for point-in-time searches than you would get by adding and removing days/months.
Say you need a list of all orders on the second Saturday of May, or all orders in the last week of December?
Also, some businesses define their fiscal year differently. Some start in June, some start in January.
In summary, DimDate is there to give you flexibility when you need to do complicated date searches without doing any calculations, using a simple index seek on DimDate.
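
For instance, the "second Saturday of May" search becomes a filter on calendar attributes (a sketch using column names modeled on AdventureWorksDW's DimDate; adjust to your own date dimension):

    SELECT f.SalesOrderNumber, d.FullDateAlternateKey
    FROM   FactInternetSales AS f
    JOIN   DimDate AS d ON d.DateKey = f.OrderDateKey
    WHERE  d.CalendarYear = 2017
      AND  d.MonthNumberOfYear = 5                  -- May
      AND  d.EnglishDayNameOfWeek = 'Saturday'
      AND  d.DayNumberOfMonth BETWEEN 8 AND 14;     -- the second Saturday always falls on day 8-14
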
It's a good question, but the answer depends on what kind of datawarehouse you're aiming for. SSAS, for instance, covers tabular and multi-dimensional.
In multi-dimensional, you would never be querying the fact table itself through SQL, so the problem you note with e.g. subtracting 6 days from 20170704 would actually never arise. Because in MD SSAS you'd use MDX on the dimension itself to implement date logic (as suggested in @S4V1N's answer above): Calendar.Date.CurrentMember.Lag(6). And for more complicated stuff, you can build all kinds of date hierarchies and get into MDX ParallelPeriod and FirstChild and that kind of thing.
For a datawarehouse that you're intending to use with SQL, your question has more urgency. I think that in that case @S4V1N's answer still applies: restrict your date logic to the dimension side, both because that's where it's already implemented (possibly with pre-built calendar and fiscal hierarchies) and because your logic will operate on an order of magnitude fewer rows.
I'm perfectly happy to have fact tables keyed on an INT-style date: but that's because I use MD SSAS. It could be that AdventureWorksDW was originally built with MD SSAS in mind (where whether the key used in fact tables is amenable to SQL is irrelevant), even though MS's emphasis seems to have switched to Tabular SSAS recently. Or the use of INTs for date keys could have been a "developer-nudging" design decision, meant to discourage date operations on the fact tables themselves, as opposed to on the Date dimension.
The thread is pretty old, but my two cents.
At one of the clients I worked at, the design chosen was an int column. The reason given (by someone before I joined) was that there were imports from different sources - some that included time information and some that only provided the date information (both strings, to begin with).
By having an int key, we could retain the date/datetime information in a DATETIME column in the fact table while, at the same time, having a second column with just the date portion (data type: date/datetime) and using that to join to the Dim table. This way (a) the aggregations/measures would be less involved, (b) we wouldn't prematurely discard time information, which may be of value at some point, and (c) at that point, if required, the Date dimension could be refactored to include time OR a new DateTime dimension could be created.
That said, this was the accepted trade-off there, but might not be a universal recommendation.
This is now a very old thread, but:
For non-date columns a sequential integer key is considered best practice, because it is fast and reasonably small. A natural key that encapsulates business logic could change over time, and for a slowly changing dimension you may also need some method of identifying which version of that dimension row a fact refers to.
https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/dimension-surrogate-key/
Ideally, for consistency, a date dimension should also have a sequential integer key, so why is it different? After all, the debugging argument could also be applied to other (non-date) dimensions. From The Data Warehouse Toolkit, 3rd Edition, Kimball & Ross, page 49 (Calendar Date Dimension), comes this comment:
    "To facilitate partitioning, the primary key of a date dimension can be more meaningful, such as an integer representing YYYYMMDD, instead of a sequentially-assigned surrogate key."
Although I think this refers to partitioning of the fact table. So I argue that the date key is an integer to allow for consistency with the other dimensions, but not a sequential key, to allow for easier table partitioning.
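
As a rough sketch of that partitioning argument (SQL Server syntax, hypothetical object names), a YYYYMMDD integer key lets you define range partitions on the fact table directly:

    CREATE PARTITION FUNCTION pf_DateKey (INT)
        AS RANGE RIGHT FOR VALUES (20160101, 20170101, 20180101);

    CREATE PARTITION SCHEME ps_DateKey
        AS PARTITION pf_DateKey ALL TO ([PRIMARY]);

    CREATE TABLE FactSales
    (
        DateKey     INT   NOT NULL,   -- YYYYMMDD, references DimDate
        ProductKey  INT   NOT NULL,
        SalesAmount MONEY NOT NULL
    ) ON ps_DateKey (DateKey);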

PostgreSQL correct number of fields

I would like your opinion. I have a table with 120 VARCHAR fields into which I will have to insert about 1,000 records per month for at least 10 years, for a total of about 240,000 records.
I could divide the fields into multiple tables, but I'd rather keep it that way. Do you think I will have problems in the future?
Thank you
Well, if the data in the columns follows a certain logic, keep it flat; in that case I would leave it that way. Otherwise, separate it into multiple tables. It depends on your data.
I once worked with medical data where one table contained over 100 columns, but all of those columns were needed to get a diagnostic result. I don't remember exactly what it was, because I worked with that data set some years ago, but in that case it would have been more complicated if the columns had been separated into multiple tables. Logically, the data in each column served a certain purpose, so it was easier to have them all in the same place (the table).
If you put the columns all together just to be lazy, so that you only have to query one table, I would recommend separating the columns into different tables to make them more comfortable to work with and to make the database schema more understandable.
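
If you do decide to split, a minimal sketch (hypothetical names) is a set of one-to-one tables sharing the same primary key:

    CREATE TABLE record (
        id          bigserial PRIMARY KEY,
        recorded_at date NOT NULL
    );

    CREATE TABLE record_details_a (
        id      bigint PRIMARY KEY REFERENCES record (id),
        field_1 varchar(255),
        field_2 varchar(255)
        -- ... further logically related columns
    );

    CREATE TABLE record_details_b (
        id      bigint PRIMARY KEY REFERENCES record (id),
        field_3 varchar(255)
    );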

Database design for a yearly updated database (once a year)

I have a large database which will only be updated once a year. Every year of data will use the same schema (the data will not be adding any new variables). There's a 'main' table where most of the customer information lives. To keep track of what happens from year to year, is it better design to put a field in the main customer table that says what year it is, or have a 'year' table that relates to the main customer table?
I recommend having a year field in the customer table; that way it is all together. You could even use a timestamp to automatically record the date of user sign-up.
To really answer, we'd need to see your schema, but it is almost never the right choice to make a new table for a new year. You probably want to relate years to customers.
Usually you would split off your archive data because you are doing OLTP work on your current data: you mostly want to work on current data and only sometimes look at the old stuff. But it seems you have very few updates. I guess the main driver is your queries, what they 'usually' do, and what performance you need to get out of them. It's probably easier for you to have everything in one table, with a year column. But if most of your queries are for the current year and they are tight on performance, you may want to look at splitting the current data out, either using physical tables or partitioning of the table (depending on the DB, some can do this for you whilst still presenting a single table).
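
For example, if your database supports declarative partitioning, one sketch (PostgreSQL syntax, hypothetical names) keeps a single logical table while splitting storage by year:

    CREATE TABLE customer_history (
        customer_id bigint   NOT NULL,
        data_year   smallint NOT NULL,
        details     text
    ) PARTITION BY LIST (data_year);

    CREATE TABLE customer_history_2023 PARTITION OF customer_history FOR VALUES IN (2023);
    CREATE TABLE customer_history_2024 PARTITION OF customer_history FOR VALUES IN (2024);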

Database Optimization - Store each day in a different column to reduce rows

I'm writing an application that stores different types of records by user and day. These records are divided in categories.
When designing the database, we create a table User and then for each record type we create a table RecordType and a table Record.
Example:
To store data related to user events we have the following tables:
Event            EventType
-----            ---------
UserId           Id
EventTypeId      Name
Value
Day
Our boss pointed out (with some reason) that we're gonna store a lot of rows ( Users * Days ) and suggested an idea that seems a little crazy to me: Create a table with a column for each day of the year, like so:
EventTypeId | UserId | Year | 1 | 2 | 3 | 4 | ... | 365 | 366
This way we only have 1 row per user per year, but we're gonna get pretty big rows.
Since most ORMs (we're going with Rails 3 for this project) use select * to get the database records, aren't we optimizing one thing only to "deoptimize" another?
What are the community's thoughts on this?
This is a violation of First Normal Form. It's an example of repeating groups across columns.
Example of why this is bad: Write a query to find which day a given event occurred. You'll need a WHERE clause with 366 terms, separated by OR. This is tedious to write, and impossible to index.
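
To make that concrete, here is a sketch of the two shapes of that search (hypothetical column names for the wide design):

    -- Wide design: one predicate per day column, continuing up to day_366
    SELECT * FROM Event_Wide
    WHERE  day_1 = 42 OR day_2 = 42 OR day_3 = 42 /* ... OR day_366 = 42 */;

    -- One row per day: a single, indexable predicate
    SELECT Day FROM Event
    WHERE  UserId = 1 AND EventTypeId = 2 AND Value = 42;
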
Relational databases are designed to work well even if you have a lot of rows. Say you have 10000 users, and on average every user generates 10 events every day. After 10 years, you will have 10000*366*10*10 rows, or 366,000,000 rows. That's a fairly large database, but not uncommon.
If you design your indexes carefully to match the queries you run against this data, you should be able to have good performance for a long time. You should also have a strategy for partitioning or archiving old data.
That breaks the database normal form principles.
http://databases.about.com/od/specificproducts/a/normalization.htm
If it's applicable, why don't you replace the day columns with a DATETIME column in your event table with a default value (GETDATE(), assuming SQL Server)?
Then you could group by date.
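
A rough sketch of that suggestion (SQL Server syntax; the Event table is from the question, the CreatedAt column is assumed):

    -- Add a datetime column that defaults to the insert time
    ALTER TABLE Event ADD CreatedAt DATETIME NOT NULL DEFAULT GETDATE();

    -- Then group per user, event type and calendar day
    SELECT UserId, EventTypeId, CAST(CreatedAt AS DATE) AS Day, COUNT(*) AS Occurrences
    FROM   Event
    GROUP BY UserId, EventTypeId, CAST(CreatedAt AS DATE);
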
I wouldn't do it. As long as you take the time to index the table appropriately, the database server should work well with tables that have lots of rows. If it's significantly slowing down your database performance, I'd start by making sure your queries aren't forcing a lot of full table scans.
Some other potential problems I see:
It probably will hurt ORM performance.
It's going to create maintainability problems down the road. You probably don't want to be working with objects that have 366 fields for every day of the year, so there's probably going to have to be a lot of boilerplate glue code to keep track of it all.
Any query that wants to search against a range of dates is going to be an unholy mess.
It could be even more wasteful of space. These rows are big, and the number of rows you have to create for each customer is going to be the sum of the maximum number of times each different kind of event happened in a single day. Unless the rate at which all of these events happens is very constant and regular, those rows are likely to be mostly empty.
If anything, I'd suggest sharding the table based on some other column instead if you really do need to get the table size down. Perhaps by UserId or year?

Athletics Ranking Database - Number of Tables

I'm fairly new to this so you may have to bear with me. I'm developing a database for a website with athletics rankings on them and I was curious as to how many tables would be the most efficient way of achieving this.
I currently have 2 tables: a table called 'athletes', which holds the details of all my runners (potentially around 600 people/records) and contains the following fields:
mid (member id - primary key)
firstname
lastname
gender
birthday
nationality
And a second table, 'results', which holds all of their performances and has the following fields:
mid
eid (event id - primary key)
eventdate
eventcategory (road, track, field etc)
eventdescription (100m, 200m, 400m etc)
hours
minutes
seconds
distance
points
location
The second table has around 2,000 records in it already, and potentially this will quadruple over time, mainly because there are around 30 track events, 10 field, 10 road, plus cross country, relays, multi-events etc. If there are 600 athletes in my first table, that equates to a large number of records in my second table.
So what I was wondering is would it be cleaner/more efficient to have multiple tables to separate track, field, cross country etc?
I want to use the database to order people's results based on their performance. If you would like to understand better what I am trying to emulate, take a look at this website: http://thepowerof10.info
Changing the schema won't change the number of results. Even if you split the venue into a separate table, you'll still have one result per participant at each event.
The potential benefit of having a separate venue table would be better normalization. A runner can have many results, and a given venue can have many results on a given date. You won't have to repeat the venue information in every result record.
You'll want to pay attention to indexes. Every table must have a primary key. Add additional indexes for columns you use in WHERE clauses when you select.
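
For example (column names taken from the question; the index choices are illustrative):

    -- Supports ranking queries filtered by event type and date
    CREATE INDEX idx_results_event ON results (eventcategory, eventdescription, eventdate);

    -- Supports looking up all results for one athlete
    CREATE INDEX idx_results_athlete ON results (mid);
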
Here's a discussion about normalization and what it can mean for you.
PS - Thousands of records won't be an issue. Large databases are on the order of giga- or tera-bytes.
My thought --
Don't break your events table into separate tables for each type (track, field, etc.). You'll have a much easier time querying the data back out if it's all there in the same table.
Otherwise, your two tables look fine -- it's a good start.
