How to design a schema for periods of dates with exceptions?

The site is about special discount events. Each event has a period of dates during which it is valid. However, there will often be a constraint that the deal is not valid on, say, Saturdays and Sundays (or even on one specific day).
Currently my rough design would be to have two tables:
The Event table stores EventID, the start and end dates of the event, and everything else.
The EventInvalidDate table stores EventID and the specific dates on which the deal is not valid. This requires the application code to calculate the invalid dates up front.
Does anyone know of a better pattern to fit this requirement, or possible pitfalls in my design? This requirement is like a subset of a general calendar model, because it does not require events that repeat indefinitely into the future (i.e. each event has a definite duration).
UPDATE
My co-worker suggested having a periods table with start and end dates. If the period runs from 1/Jan to 7/Jan, with 3/Jan being an exception, the table would record two rows: 1/Jan~2/Jan and 4/Jan~7/Jan.
Does anyone know whether this is better than, or the same as, the answer's approach in terms of SQL performance? Thanks.

Specifying which dates are not included might keep the number of database rows down, but it makes calculations, queries and reports more difficult.
I'd turn it upside down.
Have a master Event table that lists the first and last date of the event.
Also have a detail EventDates table that gets populated with every date on which the event is available.
Taking this approach makes things easier to use, especially when writing queries and reports.
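A minimal sketch of that layout (T-SQL syntax; the table and column names are illustrative, not taken from the question):
-- Master table: one row per event, bounded by its first and last date.
CREATE TABLE Event (
    EventID   int IDENTITY PRIMARY KEY,
    Name      nvarchar(100) NOT NULL,
    StartDate date NOT NULL,
    EndDate   date NOT NULL
);
-- Detail table: one row per date on which the event is actually valid.
-- Excluded days (weekends, specific dates) simply never get a row.
CREATE TABLE EventDates (
    EventID   int NOT NULL REFERENCES Event (EventID),
    EventDate date NOT NULL,
    PRIMARY KEY (EventID, EventDate)
);
Populating EventDates is a one-time expansion per event, and every subsequent query becomes an exact-match join.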
Update
Having a row per date allows you to do exact joins on dates to other tables, and allows you to aggregate per day for reporting purposes.
select ...
from sales
inner join eventDates
on sales.saleDate = eventDates.date
If your eventDates table uses start and end dates, the joins become harder to write:
select ...
from sales
inner join eventDates
on sales.saleDate >= eventDates.start and sales.saleDate < eventDates.finish
Exact-match joins are definitely done by index, if one is available, in every RDBMS I've checked; range matches, as in the second example, I'm not sure about. They're probably OK from a performance perspective, unless you end up with a metric ton of data.

Related

Database Design: How do I handle tracking goals vs. actuals over time?

This isn't exactly a programming question, as I don't have an issue writing the code; it's a database design question. I need to create an app that tracks sales goals vs. actual sales over time. The thing is, a person's goal can change (let's say daily at most).
Also, a location can have multiple agents with different goals that need to be added together for the location.
I've considered basically running a timed task to save daily goals per agent into a field. It seems that over the years that will be a lot of data, but it would let me simply query a date range and add up all the daily goals to get a goal for that date range.
Otherwise, I guess I could simply record the changes (e.g. March 2nd: 15 sales/week; April 12th: 16 sales/week), which would be less data but much more programming work to figure out goals from a time-based query.
I'm assuming there is probably a best practice for this - anyone?
Put a date range on your goals. The start of the range is when you set that goal. The end of the range starts off as the maximum-collating date (often 9999-12-31, depending on your database).
Treat this as "until forever" or "until further notice".
When you want to know what goals were in effect as of a particular date, you would have something like this in your WHERE clause:
...
WHERE effective_date <= #AsOfDate
AND expiry_date > #AsOfDate
...
When you change a goal, you need two operations: first you update the existing record (if it exists), setting its expiry_date to the new as-of date; then you insert a new record with an effective_date of the new as-of date and an expiry_date of forever (e.g. '9999-12-31'), as sketched below.
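A sketch of those two steps, assuming a Goals table whose date columns match the WHERE clause above (agent_id and goal are hypothetical names):
-- 1. Expire the current goal record, if one exists.
UPDATE Goals
SET    expiry_date = @AsOfDate
WHERE  agent_id = @AgentID
  AND  expiry_date = '9999-12-31';
-- 2. Insert the new goal, effective "until further notice".
INSERT INTO Goals (agent_id, goal, effective_date, expiry_date)
VALUES (@AgentID, @NewGoal, @AsOfDate, '9999-12-31');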
This gives you the following benefits:
Minimum number of rows
No scheduled processes to take daily snapshots
Easy retrieval of effective records as of a point in time
Ready-made audit log of changes

Hit Counter: Separate Date + Time Fields vs one DateTime2 field

I am planning out a hit counter, and I plan to make many report queries to show the total number of hits in a day, the past week, the past month, etc., as well as one that would feed a chart showing what time of day was most popular, within a specific date range, for a specific page.
With this in mind, would it be beneficial to store the DATE in a separate field from the TIME that the hit occurred, and then add indexes? I would be using a WHERE clause with a range (greater than x and less than y) for some of these queries. I do expect to have queries that ask about both the date and the time, such as "within the past 6 months, show me the number of hits grouped by hour of the day."
Am I overcomplicating it? Should I just use a single DateTime2(0) field, or is there some advantage to using two fields for this?
I think you are bordering on premature optimization with this approach.
Use Datetime. In due time (i.e. after your application has reached production and you have a better idea of the actual requirements and how it performs), you can, for example, introduce views to aggregate your data in a way that proves more useful for any reporting/querying you have to perform frequently.
In the most extreme case you can even refactor your schema and migrate everything from Datetime to two distinct fields, but I doubt this will prove necessary.
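For example, the "hits grouped by hour of day within the past 6 months" report from the question needs nothing beyond a single column; a sketch assuming SQL Server and a hypothetical Hits table with a HitTime datetime2 column:
SELECT DATEPART(hour, HitTime) AS HourOfDay,
       COUNT(*)                AS Hits
FROM   Hits
WHERE  HitTime >= DATEADD(month, -6, SYSDATETIME())
GROUP BY DATEPART(hour, HitTime)
ORDER BY HourOfDay;
If the range predicate ever becomes a bottleneck, an index on HitTime (or an indexed view over the aggregation) can be added at that point, which is exactly the benefit of deferring the decision.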

Deriving and saving the historical values into a separate table, or calculate the historical values from the existing data only when they're needed?

tl;dr general question about handling database data and design:
Is it ever acceptable (and are there any downsides) to derive data from other data at some point in time and store that derived data in a separate table, in order to keep a history of the values at that certain time? Or should you never store data derived from other data, and instead compute it from the existing data only when it's needed?
My specific scenario:
We have a database where we record people's vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
I see two ways we could do this:
1. A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time and inserts them into a table like "YearEndHistories", which would have EmployeeID, Year, and DaysRemaining at that time.
2. We don't keep a YearEndHistories table; instead, when we want to find out the number of days possessed at a certain time, we work through all the vacation days added and subtracted UP TO that specific time.
I like the feeling of certainty that comes with #1: the recorded values would be reviewed by administration, and there would be no arguing about or possibility of that number changing. With #2, I like the efficiency: one less table to maintain, and no derived data present in the actual tables. But I have a weird fear of some unseen bug slipping by and people's historical value calculations getting screwed up. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to a previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether there was a mistake in the past. If a mistake is discovered, it will be balanced out by adding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex or frequently called, or if the result doesn't change very often (e.g. a high-score table in a game: it's viewed very often, but the content only changes on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation the calculation seems relatively simple: per employee, the sum of vacation days granted minus the days taken. But that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" where data is concerned; try to think about the data as a whole, as a set. Something like:
-- Days allocated minus days taken, per employee, up to the cut-off date.
SELECT StaffID, SUM(Vacation) AS DaysRemaining
FROM
(
    -- Days granted up to the cut-off.
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM   Allocations
    WHERE  AllocationDate <= convert(datetime, '2010-12-31', 120)
    GROUP BY StaffID
    UNION ALL  -- UNION ALL, so coincidentally equal rows aren't collapsed
    -- Days taken, negated, up to the same cut-off.
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM   HolidayTaken
    WHERE  HolidayDate <= convert(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you better "auditability" without a performance penalty.
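Option #1 then amounts to a single INSERT ... SELECT, run once a year by the SQL Server Agent job, reusing the same set-based calculation (a sketch; the YearEndHistories columns are as described in the question):
-- Run on December 31st to snapshot each employee's remaining days.
INSERT INTO YearEndHistories (EmployeeID, [Year], DaysRemaining)
SELECT StaffID, 2010, SUM(Vacation)
FROM (
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM   Allocations
    WHERE  AllocationDate <= convert(datetime, '2010-12-31', 120)
    GROUP BY StaffID
    UNION ALL
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM   HolidayTaken
    WHERE  HolidayDate <= convert(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID;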

Date / Time reference table needed for Analytic?

Is it better to keep day of month, month, year, day of week, and week of year as separate reference tables or in a common Answer table? The goal is to allow user content searches and action analytics to be filtered by all the various date-time values (there will be custom reporting for users based on their shared content). I am trying to ensure data accuracy by using IDs, and also to report on the number of shares, etc. by time and date for system reporting, comparing various user groups. If we keep these in separate tables, what about time? Is a table with each hour, minute and second also needed?
Most databases support some sort of TIMESTAMP data type plus associated DAY(), MONTH(), DAYOFWEEK() functions.
The only valid reason for separate DAY or HOUR columns in a separate table is if you have precomputed totals and averages for each timeslot.
Even then it's only worth it if you expect a lot of filtering based on these values, as the cost of building these tables is high, and for most queries the standard SQL "GROUP BY ... HAVING ..." will perform well enough.
It sounds like you may be interested in a "star schema" (see Wikipedia), a common method in data warehousing for speeding up queries; but be warned that designing and building a star schema is not a trivial exercise.
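For reference, the date dimension at the heart of such a star schema is usually a single table with one row per calendar date; a sketch (the column choices vary from shop to shop):
-- One row per calendar date; fact tables reference DateKey.
CREATE TABLE DimDate (
    DateKey    int      PRIMARY KEY,  -- e.g. 20101231
    FullDate   date     NOT NULL,
    [Year]     smallint NOT NULL,
    [Month]    tinyint  NOT NULL,
    DayOfMonth tinyint  NOT NULL,
    DayOfWeek  tinyint  NOT NULL,
    WeekOfYear tinyint  NOT NULL
);
Time of day, if needed at all, typically goes in a second, much smaller dimension (86,400 rows at one-second grain) rather than into extra columns here.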

SSAS cube design, semi-additive measures, and running totals

I have what is to me a bit of a tricky design issue in my SSAS cube. The question is related to general accounting practices: I have a fact table containing financial transactions (i.e. a ledger), and each of those transactions is tagged with a transaction date and a period. The period does NOT relate directly to a day or a series of days. Users may close a period in the middle of a day if that is when they have finished their month's work.
I need to be able to report on Accounts Receivable (AR) by both date and period. I am not using the Enterprise Edition of SSAS, so the time-intelligence semi-additive options are not available to me, and even if they were, they would only allow one time dimension to use non-standard aggregation, and I believe in this case I need two that allow it.
Accounts Receivable is a running total: it should be the sum of the latest ledger item selected and everything that came before it. I know how to do this calculation in MDX for a single time dimension, but how can I make it work with two time dimensions, transaction date and period close? Is period close even considered a "time" dimension in this case? It does have a temporal aspect to it, and I do want the sums from all periods up to the current one.
I am stumped on how to relate the two time dimensions to a single fact table and use different aggregation for each. Maybe the best solution here is to have two periodic snapshot tables (instead of trying to aggregate this info from the FactLedger table), one aggregated by transaction date and one by period, which is the solution I am currently leaning towards, but I would love a second opinion.
You can most certainly have more than one time dimension in a cube, and in this case I would just create one common time dimension and have it role-play as two: transaction date and period close. To role-play a dimension, add it to the cube again in the Dimension Usage tab of the cube designer and rename it. Set up your references appropriately to key off the two different fact columns.
Or maybe I'm not understanding the issue correctly. This sounds pretty straightforward.
You can create your own time table with periods, and alter your fact table's datetime format to match it. Then one dimension would be enough.
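A sketch of that idea, with illustrative names: a time table that carries both the date and the period, with the fact table's datetime rounded to the same grain so one dimension serves both roles. The caveat from the question (periods closing mid-day) means a per-date grain may prove too coarse in practice.
CREATE TABLE DimDatePeriod (
    DateKey  int  PRIMARY KEY,  -- e.g. 20101231
    FullDate date NOT NULL,
    PeriodID int  NOT NULL      -- accounting period this date falls in
);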
