SQL Schema - merge data from two tables - sql-server

I'm trying to create a schema that will allow me to define times, when a supplier website is non operational (planned not unplanned).
I've gone for non-operational as opposed to operational because many suppliers work 24/7, so non-operting times represent the least number of rows.
For example, a supplier might not work:
On a Sunday
On a recognised holiday date - '1/1/2015'
On a Saturday after 5pm
I'm not overly confident with SQL Server, but have come up with a schema that 'does the job'. However, as we all know, there are good ways, not so good ways, and bad ways, that all work in a fashion, so would appreciate comments and advice on what I have to date.
One of the key features is to use data from WorkingDays and Holidays together to represent a WorkingPeriod entity.
I would appreciate coments no matter how small.
Holiday
Contains all recognised holidays - Easter Monday, Good Friday etc.
HolidayDate
Contains dates of holidays. For instance, this year Easter Monday is 6th Apr 2015.
WorkingDay
Sunday through to Monday, mapped to Asp.Net day of week enums.
WorkingPeriodType
A lookup table containing 2 rows - Holiday, or Day of Week
WorkingPeriod
Merges the Holiday table and the WorkingDay table to represent a single WorkingPeriod entity that can be used in the SupplierNonWorkingTimes table.
SupplierNonWorkingTimes
Contains the ID representing the WorkingDay/Holiday and the times of non- operation.

This is a very subjective question, as you've already observed there's no right and wrong, just different ways. I'm a database guy but I don't know your specific circumstances, so this is just some observations - you'll have to judge for yourself whether any of them are appropriate to you.
I like my naming to be crystal clear, it saves all the
misunderstanding by other people later on. If [WorkingDay] holds the
7 days of the week I would call it [WeekDay]. If you intend
[Holiday] to hold whole-day holidays I would call it [HolidayDay].
The main table [SupplierNonWorkingTime] is about 'non-working' so I
would call the [WorkingPeriod] table [NonWorkingPeriod]. The term
'period' always refers to a whole day, so I would replace 'period'
with 'day' (let's ignore start/stop time for now).
My first impression was that your design is over-normalised. The
[WorkingPeriodType] table has 2 rows that will never change,
[WorkingDay] has 7. For these very low numbers I sometimes prefer a
char(1) with a check constraint. Normalisation is generally good,
but lots of JOINs for trivial queries is not so good. You could
eliminate [WorkingPeriodType] and [WorkingDay] but you've mentioned
.Net enums in your question so if you've got some sort of ORM in
your .Net code this level of normalisation might be right for you.
I'd add a Year field to the [HolidayDate] table, then the PK
becomes a better HolidayID+Year - unless you know somewhere that has
lots of Christmas' :)
I'd add an IsAllDay field to the [SupplierNonWorkingTime] table,
otherwise you have to use 'magic values' to represent 'all day' and
magic values are bad. There should be a check constraint to enforce
start/stop times can only be entered if IsAllDay = false.
Like I said, just my thoughts, hope it's helpful.

Related

DynamoDB table design for business hours

I am trying to create a business hours application using DynamoDB.
I saw lots of examples and schema designs for different databases but just can't find the right table design for DynamoDB.
Here are my requirements:
every business should have default working hours (Monday 08:00 - 14:00, 16:00 - 20:00) and special events (26/11/2020 shop is closed / opened between 10:00 - 14:00 due to Thanksgiving)
every day can have multiple work durations (08:00 - 14:00, 16:00 - 20:00)
Those are the operations I need to allow:
Create / edit working hours for each business (including special events)
Check whether a business (or a list of businesses) are open right now - by providing a list of business ids.
Get business (or list of businesses) working hours between 2 dates (for example 23/04/2020 - 25/04/2020) by providing a list of business ids and date range for each id
What I've tried:
Defined a table where business id is the partition key (HASH) and special dates / day of week is the sort key (RANGE).
The problem with this approach is that I cannot query by multiple business hours unless I use the scan api which is not recommended due to expensive operations.
Please advice what kind of table design I should use for this application.
You probably need to first construct your overarching logic outside of DynamoDB, do decide if a business is working or not, and only use quarries in Dynamo for a subset of that logic.
Lets say though we use DynamoDB for querying in regards to normal working hours, and not include logic like holidays and special cases, you can use that to filter after you access Dynamo. You can't construct one query in Dynamo to answer all your questions that is more like what you can do in SQL.
So lets say we have a Table/Subset of values which relate to the normal working day. So you have something like this:
Partition Key (PK): business, Range Key (RK): dayOfWeek, and attributes, opens & closes.
We can then create 2 GSIs:
PK dayOfWeek RK opens
PK dayOfWeek RK closes
Now we can do two queries if a store is open between 3-4pm on Monday:
PK == MONDAY & opens < 2 pm
PK == MONDAY & closes > 4 pm
And collect only the values which appear in both queries.
Obviously though, having a PK of day, is probably not a great idea, as you will only have 7 partitions. So what do you do? Well you probably have more criteria in your query than simply day, for example, the type of store, the city the store is located it, etc. That would mean then you would have a PK of something like: city-category-dayOfWeek.
Similarly on the sorting side, you might want higher rated stores to be the first option, so you might have something like: {rating}-{open} & {rating}-{closes}.
You will just have to get creative, firstly layout all the queries you have before you design your tables. I really like this video on table design, it's terrific.

Why is datekey in fact tables always INT?

I'm looking at the datekey column from the fact tables in AdventureWorksDW and they're all of type int.
Is there a reason for this and not of type date?
I understand that creating a clustered index composed of an INT would optimize query speed. But let's say I want to get data from this past week. I can subtract 6 from date 20170704 and I'll get 20170698 which is not a valid date. So I have to cast everything to date, subtract, and then cast as int.
Right now I have a foreign key constraint to make sure that something besides 'YYYYMMDD' isn't inserted. It wouldn't be necessary with a Date type. Just now, I wanted to get some data between 6/28 and 7/4. I can't just subtract six from `20170703'; I have to cast from int to date.
It seems like a lot of hassle and not many benefits.
Thanks.
Yes, you could be using a Date data type and have that as your primary key in the Fact and the dimension and you're going to save yourself a byte in the process.
And then you're going to have to deal with a sale that is recorded and we didn't know the date. What then? In a "normal" dimensional model, you define Unknown surrogate values so that people know there is data and it might be useful but it's incomplete. A common convention is to make it zero or in the negative realm. Easy to do with integers.
Dates are a little weird in that we typically use smart keys - yyyymmdd. From a debugging perspective, it's easy to quickly identify what the date is without having to look up against your dimension.
You can't make an invalid date. Soooo what then? Everyone "knows" that 1899-12-31 is the "fake" date (or whatever tickles your fancy) and that's all well and good until someone fat fingers a date and magically hit your sentinel date and now you've got valid unknowns mixed with merely bad data entry.
If you're doing date calculations against an smart key, you're doing it wrong. You need to go to your data dimension to properly resolve the value and use methods that are aware of date logic because it's ugly and nasty beyond just simple things like month lengths and leap year calculations.
Actually that fact table has a relationship to a table DimDate, and if you join that table you would get many more options for point in time search, then if you would`ve got by adding and removing days/months.
Say you need list of all orders on second Saturday of May? Or all orders on last week of december?
Also some business regulate their fiscal year different. Some start in June, some start in January..
In summary, DimDate is there to provide you with flexibility when you need to do complicated date searches without doing any calculations, and using a simple index seek on DimDate
It's a good question, but the answer depends on what kind of datawarehouse you're aiming for. SSAS, for instance, covers tabular and multi-dimensional.
In multi-dimensional, you would never be querying the fact table itself through SQL, so the problem you note with e.g. subtracting 6 days from 20170704 would actually never arise. Because in MD SSAS you'd use MDX on the dimension itself to implement date logic (as suggested in #S4V1N's answer above). Calendar.Date.PrevMember(6). And for more complicated stuff, you can build all kinds of date hierarchies and get into MDX ParallelPeriod and FirstChild and that kind of thing.
For a datawarehouse that you're intending to use with SQL, your question has more urgency. I think that in that case #S4V1N's answer still applies: restrict your date logic to the dimension side
because that's where it's already implemented (possibly with pre-built calendar and fiscal hierarchies).
Because your logic will operate on an order of magnitude less rows.
I'm perfectly happy to have fact tables keyed on an INT-style date: but that's because I use MD SSAS. It could be that AdventureWorksDW was originally built with MD SSAS in mind (where whether the key used in fact tables is amenable to SQL is irrelevant), even though MS's emphasis seems to have switched to Tabular SSAS recently. Or the use of INTs for date keys could have been a "developer-nudging" design decision, meant to discourage date operations on the fact tables themselves, as opposed to on the Date dimension.
The thread is pretty old, but my two cents.
At one of the clients I worked at, the design chosen was an int column. The reason given (by someone before I joined) was that there were imports from different sources - some that included time information and some that only provided the date information (both strings, to begin with).
By having an int key, we could then retain the date/datetime information in a datetime column in the Fact table, while at the same time, have a second column with just the date portion (Data type: date/datetime) and use this to join to Dim table. This way the (a) aggregations/measures would be less involved (b) we wouldn't prematurely discard time information, which may be of value at some point and (c) at that point, if required the Date dimension could be refactored to include time OR a new DateTime dimension could be created.
That said, this was the accepted trade-off there, but might not be a universal recommendation.
Now a very old thread,
For non-date columns a sequential integer key is considered best practice, because it is fast, and reasonably small. A natural key which encapsulates business logic could change overtime and also may need some method of identifying which version of that dimension it is for a slowly changing dimension.
[https://www.kimballgroup.com/data-warehouse-business-intelligence-resources/kimball-techniques/dimensional-modeling-techniques/dimension-surrogate-key/][1]
Ideally for consistency a date dimension should also have a sequential integer key, so why is it different? After all the theory of debugging could be also applied to other (non-date) dimensions. From The Data Warehouse Toolkit, 3rd Edition, Kimball & Ross, page 49 (Calendar Date Dimension) is this comment
To facilitate partitioning, the primary key of a date dimension can be
more meaningful, such as an integer representing YYYYMMDD, instead of
a sequentially-assigned surrogate key.
Although I think this means partitioning of a fact table. I argue that the datekey is an integer to allow for consistency with other dimensions but not a sequential key to allow for easier table partitioning.

Can a join table be combined with another table?

The first image below is my database schema for a project that will use psql, ruby and active record.
While writing my schema, things got a bit complex. My "special_days" table ended up becoming a join table for "days_of_week" and "organizations". I'm assuming that this is not best practice and will end up causing me trouble.
In the second schema below, I made a separate join table for "days of week" and "organizations". My special_days table still needs to be associated with a day_of_week and an organization, so I think I have to keep the joining information in the special_days table. Is there a better way to do this? It seems that my second attempt is too repetitive.
These are my relationships:
days of week & organizations | many to many
city & organizations | one to many
organization & special days | one to many
day of week & special days | one to many
Some more information about the requirements might be useful. Are you required to do something like define which days are holidays, which days are paydays, etc. (e.g. many organizations would define Sat and Sun as non-work days and many U.S. organization would define July 4 as a non-work day)? I'm not sure what days_of_the week represents. Is this a table with 7 records in it (M,T,W,Th,F,S,Sun)? If I'm guessing correctly at the requirements it might be better to do something like have a table called special_day that has a date column and a recurrence column (e.g. weekly, monthly, yearly, etc.). You could then have a organization_special_day table that is a many-to-many join table on organization and special_day.

Database table structure for price list

I have like about 10 tables where are records with date ranges and some value belongin to the date range.
Each table has some meaning.
For example
rates
start_date DATE
end_date DATE
price DOUBLE
availability
start_date DATE
end_date DATE
availability INT
and then table dates
day DATE
where are dates for each day for 2 years ahead.
Final result is joining these 10 tables to dates table.
The query takes a bit longer, because there are some other joins and subqueries.
I have been thinking about creating one bigger table containing all the 10 tables data for each day, but final table would have about 1.5M - 2M records.
From testing it seems to be quicker (0.2s instead of about 1s) to search in this table instead of joining tables and searching in the joined result.
Is there any real reason why it should be bad idea to have a table with that many records?
The final table would look like
day DATE
price DOUBLE
availability INT
Thank you for your comments.
This is a complicated question. The answer depends heavily on usage patterns. Presumably, most of the values do not change every day. So, you could be vastly increasing the size of the database.
On the other hand, something like availability may change every day, so you already have a large table in your database.
If your usage patterns focused on one table at a time, I'd be tempted to say "leave well-enough alone". That is, don't make a change if it ain't broke. If your usage involved multiple updates to one type of record, I'd be inclined to leave them in separate tables (so locking for one type of value does not block queries on other types).
However, your usage suggests that you are combining the tables. If so, I think putting them in one row per day per item makes sense. If you are getting successive days at one time, you may find that having separate days in the underlying table greatly simplifies your queries. And, if your queries are focused on particular time frames, your proposed structure will keep the relevant data in the cache, giving room for better performance.
I appreciate what Bohemian says. However, you are already going to the lowest level of granularity and seeing that it works for you. I think you should go ahead with the reorganization.
I went down this road once and regretted it.
The fact that you have a projection of millions of rows tells me that dates from one table don't line up with dates from another table, leading to creating extra boundaries for some attributes because being in one table all attributes must share the same boundaries.
The problem I encountered was that the business changed and suddenly I had a lot more combinations to deal with and the number of rows blew right out, slowing queries significantly. The other problem was keeping the data up to date - my "super" table was calculated from the separate tables when ever they changed.
I found that keeping them separate and moving the logic into the app layer worked for me.
The data I was dealing with was almost exactly the same as yours except I had only 3
tables: I had availability, pricing and margin. The fact was that the 3 were unrelated, so date ranges never aligned, leasing to lots of artificial rows in the big table.

Best practice database schema for property booking calendar

I am working on a multiple properties booking system and making me headache about the best practice schema design. Assume the site hosts for example 5000 properties where each of it is maintained by one user. Each property has a booking calendar. My current implementation is a two-table-system with one table for the available dates and the other for the unavailable dates, with a granularity of 1 day each.
property_dates_available (property_id, date);
property_dates_booked (property_id, date);
However, i feel unsure if this is a good solution. In another question i read about a single table solution with both states represented. But i wonder if it is a good idea to mix them up. Also, should the booking calendar be mapped for a full year with all its 365 days per year into the database table or was it better to map only the days a property is available for booking? I think of the dramatically increasing number of rows every year. Also i think of searching the database lately for available properties and am not sure if looking through 5000 * 365 rows might be a bad idea compared to i.e. only 5000 * av. 100 rows.
What would you generally recommend? Is this aspect ignorable? How to best practice implement this?
I don't see why you need a separate table for available dates. If you have a table for booked dates (property_id, date), then you can easily query this table to find out which properties are available for a given date
select properties.property_name
from properties where not exists
(select 1 from property_dates_booked
where properties.property_id = property_dates_booked
and property_dates_booked.date = :date)
:date being a parameter to the query
Only enter actual bookings into the property_dates_booked table (it would be easier to rename the table 'bookings'). If a property is not available for certain dates because of maintenance, then enter a booking for those dates where the customer is 'special' (maybe the 'customer' has a negative id).

Resources