I have a cube that shows order volume history, but showing the month to date numbers isn't that useful (since they're way out of whack with other full months). I want to include that data in some kind of display, but in a way that makes sense.
What's the best way to go about showing the last 30 days (or last month) over a previous period? If there's some way I can do that with my time dimension, great, but I figured I might need a calculation of some kind.
For example, since today is 7/12, I'd want to show the data for 6/13-7/12 as the most recent period, and then compare it to 5/13-6/12, and so on. That way, it would be easy to see where the current month is trending, but with values that are in line, size-wise, with previous periods. I also figure that would make an easier KPI value, since I could compare movement from last month's values using the rolling one-month period for comparison.
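If the time dimension can't express this directly, the window arithmetic itself is easy to do in a calculation or in application code. A minimal sketch under that assumption (Python, hypothetical names; the one-month-back shift clamps the day for short months):

```python
from datetime import date, timedelta

def month_back(d, n):
    """Shift a date back n calendar months, clamping the day
    for months that are too short (e.g. 3/31 -> 2/28)."""
    y, m = divmod(d.year * 12 + d.month - 1 - n, 12)
    m += 1
    last_day = (date(y + (m == 12), m % 12 + 1, 1) - timedelta(days=1)).day
    return date(y, m, min(d.day, last_day))

def rolling_periods(end, periods=3):
    """(start, end) pairs for consecutive one-month rolling windows,
    newest first: for end = 7/12 that's 6/13-7/12, then 5/13-6/12, ..."""
    return [(month_back(end, i + 1) + timedelta(days=1), month_back(end, i))
            for i in range(periods)]
```

Each window then becomes a simple range filter on the date key, and the newest window can be compared against the prior ones for the KPI.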
Related
In StackOverflow, you can earn a few badges which are based on streaks, and from those you can imagine some other use cases of tracking streaks:
Visit the site every day for 30 days.
Vote on 30 posts per day, for 7 days in a row.
Etc..
How would you keep track of and implement this in any kind of database? (Doesn't really matter which one, SQL, NoSQL, etc..)
The first hard question/problem is, do you do it in cron/background jobs/tasks, or on every change to the database (every page view or vote, for example)?
The second hard problem is, depending on the previous solution, how do you somewhat efficiently query the state of the database to figure out if the conditions are met, and how do you not count previous "object actions" in your new calculation?
So let's focus on the votes one, since that is a little more complicated. Say we are tracking "UTC days".
If we go with the "check on every database change" solution (to part 1), then on the next vote, after saving to the DB, we query all votes since the start of the current UTC day for that user. If the vote count >= 30, we create a new "streak" record and count the day as 1. But on the next vote (same day), we somehow need to know we already counted it, so the approach needs modifying. Maybe we track "last_vote_id" for the current day, so we have (pseudocode):
table vote_streaks {
    date last_date;
    int last_vote_id;
    int last_date_count;
    int total_streak_count;
}
Then on the next vote on the same day, we check the last_date we tracked (today) and the last_date_count (say we had 30 votes at this point); since the total streak is already 1, we know we should ignore the 31st vote for that day. Then on day 2, last_date doesn't match today, so we reset last_date_count to 0 before setting it to 1. I think something like that might work, but I can't quite tell.
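Something like that does work. A minimal sketch of that table's update logic (Python, hypothetical names; it tracks the per-day count instead of last_vote_id, which turns out to be enough to ignore the 31st vote, and it resets the streak when a day is skipped or misses the target):

```python
from datetime import date

class VoteStreak:
    """Sketch of one vote_streaks row, updated on every vote.
    Assumes votes arrive in date order and days are UTC days."""
    TARGET = 30  # votes per day needed to keep the streak alive

    def __init__(self):
        self.last_date = None
        self.last_date_count = 0
        self.total_streak_count = 0

    def record_vote(self, today):
        if self.last_date != today:
            # New UTC day. The streak breaks if a day was skipped
            # or if the previous day never reached the target.
            if (self.last_date is None
                    or (today - self.last_date).days > 1
                    or self.last_date_count < self.TARGET):
                self.total_streak_count = 0
            self.last_date = today
            self.last_date_count = 0
        self.last_date_count += 1
        if self.last_date_count == self.TARGET:
            # Exactly the 30th vote: count the day once; votes 31+
            # fall through without touching the streak.
            self.total_streak_count += 1
```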
If we went with the cron/background job approach, then we would query all votes (limit 30) for each user sometime after the start of the UTC day. If we can build a streak out of that (which is more straightforward to do), then we can get the streak. But what if some problem occurs in the middle of the job and it cuts out and has to restart? I can't see how to solve that yet, and it seems far more complicated than the real-time approach.
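One standard answer to the restart problem is to make the job idempotent: derive the streak from the raw vote log rather than updating it incrementally, so a crashed run can simply be re-run from the top. A sketch under that assumption (Python, hypothetical names):

```python
from collections import Counter
from datetime import date, timedelta

def streak_from_votes(vote_dates, as_of, target=30):
    """Recompute a user's streak from scratch out of the raw vote
    log (a list of UTC dates, one entry per vote). Because nothing
    incremental is stored, a batch job that dies mid-run can simply
    start over and get the same result."""
    per_day = Counter(vote_dates)
    streak = 0
    day = as_of - timedelta(days=1)  # walk back over completed days
    while per_day[day] >= target:
        streak += 1
        day -= timedelta(days=1)
    return streak
```

The trade-off is re-reading each user's recent votes every run, which is usually acceptable since the walk stops at the first day that misses the target.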
How is this generally solved? Again, don't need to get into the weeds of the actual SQL/NoSQL tables (unless you'd like, pseudocode is fine). Or instead of a general solution, if you know of a specific solution (like how StackOverflow implements it), that would work too.
The smallest data you need to keep track of here is going to be the date of the most recent action by the user and the current streak count.
When the user performs the streaked action, check the date of their most recent action. If that action was yesterday (a different day, but less than two days ago), increment the streak count; if it was two or more days ago, reset the streak count; otherwise (it was earlier today) just update the most-recent-action timestamp.
When reading a streak, you need to check the most recent action timestamp as well. If it was last updated within the last day, the stored streak count is valid; otherwise the real streak count is zero.
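A minimal sketch of that state machine (Python, hypothetical names; one deliberate variant: a reset sets the streak to 1 rather than 0, since the action being recorded itself starts a new streak):

```python
from datetime import date, timedelta

def update_streak(last_action, streak, today):
    """Apply today's action to the stored (last_action, streak) pair."""
    if last_action == today:
        return last_action, streak       # already counted today
    if last_action == today - timedelta(days=1):
        return today, streak + 1         # consecutive day
    return today, 1                      # first action, or streak broken

def current_streak(last_action, streak, today):
    """The stored count is only trustworthy if the user acted today
    or yesterday; otherwise the real streak is zero."""
    if last_action in (today, today - timedelta(days=1)):
        return streak
    return 0
```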
Alternatively, you can simply use a document database and do the full streak calculation from transaction logs. This gets expensive for users with long streaks, though it is the simplest to implement. Depending on how often you expect people to have long streaks, this might or might not be acceptable.
Let's say we are storing data for 1000s of devices that collect a single type of data every 10s. Each device can be located in a different timezone. The ability to query quickly to visualize the data is important. We can ask the system questions such as the following:
1. For a specific device, I want the last 7 days of data grouped by day totals for my local timezone.
2. For a specific device, I want the last year's data grouped by month totals for my local timezone.
Storing all the data in UTC seems like the cleanest approach; however, it becomes tricky when asking for local groupings of the data. For example, a day grouping has a different offset for each timezone. So if we were to store, say, day, month, and year "buckets", they would all be grouped relative to UTC, which would not be useful for asking questions about timezones other than UTC itself.
If we were to group the data in minute and hour "buckets" (ignoring timezones that are off by less than an hour, e.g. IST +5:30) we could use the hour "buckets" to construct the answers to the above questions. For question 2, there would be 12 groupings of up to 744 hour "buckets" for each grouping.
Does the minute-and-hour "bucket" approach (ignoring timezones that are off by less than an hour, e.g. IST +5:30) seem like a decent design? Has anyone designed something similar, or would you suggest something different?
Yes, it's a reasonable design to create buckets by offset, and this occurs often in data warehousing (for example).
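Concretely, the roll-up the question describes can be sketched as follows (Python, hypothetical names; whole-hour offsets assumed, naive datetimes treated as UTC hour starts):

```python
from collections import defaultdict
from datetime import date, datetime, timedelta

def local_day_totals(hour_buckets, offset_minutes):
    """Roll pre-aggregated hour buckets up into local-day totals.
    `hour_buckets` maps a naive datetime (the UTC start of the hour)
    to a total; shifting each bucket by the device's UTC offset
    before truncating to a date produces the local grouping."""
    offset = timedelta(minutes=offset_minutes)
    days = defaultdict(float)
    for hour_start, total in hour_buckets.items():
        days[(hour_start + offset).date()] += total
    return dict(days)
```

The same shift-then-truncate step answers the month grouping too: truncate to (year, month) instead of the date.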
Though bucketing by 1 hour increments means ignoring many real places. As you pointed out, India is one location that uses a :30 offset. If you want to cover every modern time zone in the world, you actually need to bucket by 15 minute segments, as there are several zones with :30 or :45 offsets.
Of course, if you find it acceptable to have a margin of error, then you can use whatever granularity you can tolerate. In theory, you could go larger than an hour - you'd just have a larger margin of error.
If you want to consider a different approach, you can store the value in a date-time-offset form, using the local time of the device. Most databases will convert to UTC when indexing such a value, so you may also need a computed column that extracts and indexes just the local time portion. Then you can group by the day in local time without having to necessarily be aware of how that ties back to UTC. The downside of this approach is that the data is fixed to its original time zone. You can't easily regroup to infer a different time zone. Though if these are actual devices in the real world, that is usually not a concern.
I am planning out a hit counter, and I plan to make many report queries to show number of hits total in a day, the past week, the past month, etc, as well as one that would feed a chart that shows what time of day was most popular, within a specific date range, for a specific page.
With this in mind, would it be beneficial to store the DATE in a separate field from the TIME that the hit occurred, then add indexes? I would be using a WHERE clause with a range (greater than x and less than y) for some of these queries. I do expect to have queries that ask about both the date and the time, such as "within the past 6 months, show me the number of hits grouped by hour of the day."
Am I overcomplicating it? Should I just use a single DateTime2(0) field, or is there some advantage to using two fields for this?
I think you are bordering on premature optimization with this approach.
Use Datetime. In due time (i.e. after your application has reached production and you have a better idea of the actual requirements and how it performs), you can, for example, introduce views to aggregate your data in a way that proves more useful for any reporting/querying you have to perform frequently.
In the most extreme case you can even refactor your schema and migrate everything from Datetime to two distinct fields, but I doubt this will prove necessary.
There is a similar topic: Daylight saving time and time zone best practices.
What I am trying to ask is related, but different.
What is the suggested practice for Date handling?
I am looking for a more 'logical' date - for example, the business date in our application, or a person's date of birth.
Normally I store it as a Date (in Oracle), with the time set to 0:00:00. This works fine IF all components of my application are in the same timezone. Because that date in the DB means 0:00:00 in the DB's timezone, presenting the data to a user in another timezone easily causes problems: for example, 2012-12-25 0:00:00 Hong Kong time is in fact 2012-12-24 16:00:00 London time.
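A quick check with Python's zoneinfo shows that day boundary shift - since Hong Kong is eight hours ahead of London, midnight on the 25th in Hong Kong is still the afternoon of the 24th in London:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Midnight "2012-12-25" as stored in a Hong Kong-based system,
# viewed by a user in London (GMT in December):
midnight_hk = datetime(2012, 12, 25, tzinfo=ZoneInfo("Asia/Hong_Kong"))
in_london = midnight_hk.astimezone(ZoneInfo("Europe/London"))
print(in_london)  # 2012-12-24 16:00:00+00:00 -- a different calendar day
```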
I have thought of two ways to solve this, but both have their deficiencies.
First, store it as a String. The drawback is obvious: I need to do a lot of conversions in the app or in queries, and I lose most date arithmetic.
Second, store it as a Date, but with a pre-defined timezone (e.g. UTC). When the application displays the date, it has to display it in the UTC timezone. However, I will need a lot of timezone manipulation in my application code.
What is the suggested way of handling Date? Or do most people simply use one of the above 3 (including the assume-to-be-same-timezone one) approaches?
A date is a way of identifying a day, and a day is relative to the local time zone, that is, the sun. A day is a 24 hour period (although because of leap seconds and other sidereal corrections, that 24 hours is only a very close approximation). So the date December 5 in London names a different 24 hour period from the date December 5 in New York. One consequence is that if you want to do arithmetic on dates between different time zones, you can only do so to an accuracy of +/- 1 day. As a data structure, this is a conventional date (say, year and day offset) plus a time zone identified by an hour offset from UTC (beware, there are some half-hour offsets out there).
It should be clear, then, that converting dates in one time zone to dates in another is not possible, because they represent different intervals. (Mostly; there can be exceptions for adjacent time zones, one on daylight saving time and one not.) Date arithmetic between different time zones can't ordinarily be done either, for the same reason. Sometimes there's simply not enough data captured to get a perfect answer.
But the full answer to your concern behind the question you asked depends on what the dates mean. If they are, for example, legal dates for things like deadlines, then those dates are conventionally taken with respect to the location of an office building, say, where the deadline is clocked. In that case the day boundary follows a single time zone, and there would be no sense in storing it redundantly. Times would be converted to dates in that time zone when they are stored.
Using UTC everywhere makes things easy and consistent. Keeping dates (as points in time) stored as UTC in the DB, doing math on them in UTC, performing explicit conversions to local time only in the view layer, and converting user-input dates to UTC gives you quite a stable base for any action or computation you need. You don't really need to show the dates to the user in UTC - showing them in local time, with a hint that you can also show UTC, gives more useful information.
If you need to keep only the date (like the birthday you mentioned in a comment), explicitly cut the time information away (e.g. a conversion from DateTime to Date at the DB level, or whatever the code-level equivalent is).
This is an example of good normalization - the same thing you do when using UTF over codepages, or keeping to consistent units in physical computations.
Using that approach, your code for comparing and differencing dates will be much simpler. As for displaying dates and converting between UTC and local time, many frameworks (or even the languages themselves) give you tools to deal with local time while working in UTC.
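For instance, in Python the whole policy boils down to two small helpers (hypothetical names; zoneinfo is available from Python 3.9):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def to_utc(naive_local, tz_name):
    """Normalize user input (naive local wall time) to UTC for storage."""
    return naive_local.replace(tzinfo=ZoneInfo(tz_name)).astimezone(timezone.utc)

def to_local(utc_instant, tz_name):
    """Convert a stored UTC instant to local time, in the view layer only."""
    return utc_instant.astimezone(ZoneInfo(tz_name))
```

Everything between input and display stays in UTC, so comparisons and differences never have to think about DST.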
I once designed and led the build of a real-time transaction system that handled customers in multiple timezones, who were all invoiced on the time periods of a single timezone.
The solution that worked well for me was to store both the UTC time and the local time on each record. This came after using the system for a few years and realising there were really two separate uses for date columns, so the data was stored that way.
Although it used up a few more bytes on disk (big deal - disk is cheap), it made things so simple when querying: "casual" queries (e.g. the help desk searching for a customer transaction) used the local time column, and "formal" queries (e.g. accounting department invoice batch runs) used the UTC column.
It also dealt with issues of the local time "reliving" an hour of transactions every time daylight saving went back one hour, which can make using just the local time a real pain if you're a 24 hour business.
tl;dr general question about handling database data and design:
Is it ever acceptable (and are there any downsides) to derive data from other data at some point in time and then store that derived data in a separate table, in order to keep a history of values at that time? Or should you never store data derived from other data, and instead derive the required data from the existing data only when you need it?
My specific scenario:
We have a database where we record peoples' vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
We could do this two ways I see:
A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time, and inserts them into a table like "YearEndHistories", which would have your EmployeeID, Year, and DaysRemaining at that time.
We don't keep a YearEndHistories table, but instead, when we want to find out the number of days possessed at a certain time, we loop through all vacation days added and subtracted that exist UP TO that specific time.
I like the feeling of certainty that comes with #1 --- the recorded values would be reviewed by administration, and there would be no arguing or possibility about that number changing. With #2, I like the efficiency --- one less table to maintain, and there's no derived data present in the actual tables. But I have a weird fear about some unseen bug slipping by and peoples' historical value calculation start getting screwed up or something. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to the previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether or not there was a mistake in the past. If a mistake is discovered, it will be balanced out by rewarding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex or frequently called, or if the underlying data doesn't change very often (e.g. a high score table in a game - it's viewed very often, but its content only changes on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation, the calculation seems relatively simple - a sum of employee vacation days granted minus days taken - but that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" when data is concerned - try to think about the data as a whole, as a set. Something like
SELECT StaffID, SUM(Vacation)
FROM
(
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM Allocations
    WHERE AllocationDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID

    UNION ALL

    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM HolidayTaken
    WHERE HolidayDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you a better "auditability", without performance penalty.