Hit Counter: Separate Date + Time Fields vs one DateTime2 field - sql-server

I am planning out a hit counter, and I plan to write many report queries to show the total number of hits in a day, the past week, the past month, etc., as well as one that would feed a chart showing which time of day was most popular for a specific page within a specific date range.
With this in mind, would it be beneficial to store the DATE in a separate field from the TIME the hit occurred, and then add indexes? I would be using a WHERE clause with a range (greater than x and less than y) for some of these queries. I do expect to have queries that ask about both the date and the time, such as "within the past 6 months, show me the number of hits grouped per hour of the day."
Am I overcomplicating it? Should I just use a single DateTime2(0) field, or is there some advantage to using two fields for this?

I think you are bordering on premature optimization with this approach.
Use a single datetime field. In due time (i.e., once your application has reached production and you have a better idea of the actual requirements and how it performs) you can, for example, introduce views to aggregate your data in a way that proves more useful for whatever reporting/querying you have to perform frequently (a sketch of such a view is below).
In the most extreme case you can even refactor your schema and migrate everything from Datetime to two distinct fields, but I doubt this will prove necessary.
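For instance, a minimal sketch of such a reporting view, assuming a hypothetical dbo.Hits table with a PageId column and a single HitTime DateTime2(0) column:

CREATE VIEW dbo.HitsByHour AS
SELECT PageId,
       CAST(HitTime AS date)   AS HitDate,   -- day portion, derived on the fly
       DATEPART(hour, HitTime) AS HitHour,   -- hour-of-day portion
       COUNT(*)                AS Hits
FROM dbo.Hits
GROUP BY PageId, CAST(HitTime AS date), DATEPART(hour, HitTime);
GO

-- e.g. hits per hour of day for one page over the past 6 months
SELECT HitHour, SUM(Hits) AS Hits
FROM dbo.HitsByHour
WHERE PageId = 42
  AND HitDate >= DATEADD(month, -6, CAST(GETDATE() AS date))
GROUP BY HitHour
ORDER BY HitHour;

With a view like this, the date-range and hour-of-day reports stay simple even though the table stores only a single datetime column.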

Related

Deriving and saving historical values into a separate table, or calculating the historical values from the existing data only when they're needed?

tl;dr general question about handling database data and design:
Is it ever acceptable (and are there any downsides) to derive data from other data at some point in time and then store that derived data in a separate table, in order to keep a history of the values at that time? Or should you never store data that is derived from other data, and instead derive the required data from the existing data only when you need it?
My specific scenario:
We have a database where we record people's vacation days and vacation day statuses. We track how many days they have left, how many days they've taken, and things like that.
One design requirement has changed and now asks that I be able to show how many days a person had left on December 31st of any given year. So I need to be able to say, "Bob had 14 days left on December 31st, 2010".
I see two ways we could do this:
1. A SQL Server Agent job that, on December 31st, captures the days remaining for everyone at that time and inserts them into a table like "YearEndHistories", which would have your EmployeeID, Year, and DaysRemaining at that time.
2. We don't keep a YearEndHistories table; instead, if we want to find out the number of days someone had at a certain time, we loop through all the vacation days added and subtracted that exist UP TO that specific time.
I like the feeling of certainty that comes with #1: the recorded values would be reviewed by administration, and there would be no arguing or possibility of that number changing. With #2, I like the efficiency: one less table to maintain, and no derived data present in the actual tables. But I have a weird fear of some unseen bug slipping by and people's historical value calculations getting screwed up. In 2020 I don't want to deal with, "I ended 2012 with 9.5 days, not 9.0! Where did my half day go?!"
One thing we have decided on is that it will not be possible to modify values in previous years. That means it will never be possible to go back to a previous calendar year and add a vacation day or anything like that. The value at the end of the year is THE value, regardless of whether or not there was a mistake in the past. If a mistake is discovered, it will be balanced out by awarding or subtracting vacation time in the current year.
Yes, it is acceptable, especially if the calculation is complex, is frequently called, or doesn't change very often (e.g., a high-score table in a game: it's viewed very often, but the content only changes on the increasingly rare occasions when a player does very well).
As a general rule, I would normalise the data as far as possible, then add in derived fields or tables where necessary for performance reasons.
In your situation the calculation seems relatively simple: the sum of vacation days granted minus days taken, but that's up to you.
As an aside, I would encourage you to get out of thinking about "loops" where data is concerned; try to think about the data as a whole, as a set. Something like:
SELECT StaffID, SUM(Vacation) AS DaysRemaining
FROM
(
    -- days granted up to the cut-off date
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM Allocations
    WHERE AllocationDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
    UNION ALL
    -- days taken up to the cut-off date, as a negative number
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM HolidayTaken
    WHERE HolidayDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID
Derived data seems to me like a transitive dependency, which is avoided in normalisation.
That's the general rule.
In your case I would go for #1, which gives you better auditability without a performance penalty.
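For what it's worth, a rough sketch of what #1 could look like (the table and column names are placeholders, not your actual schema), with the Agent job reusing the set-based calculation above:

CREATE TABLE YearEndHistories (
    EmployeeID    int          NOT NULL,
    [Year]        smallint     NOT NULL,
    DaysRemaining decimal(4,1) NOT NULL,  -- half days are possible per the question
    CONSTRAINT PK_YearEndHistories PRIMARY KEY (EmployeeID, [Year])
);

-- Run by the SQL Server Agent job on December 31st
INSERT INTO YearEndHistories (EmployeeID, [Year], DaysRemaining)
SELECT StaffID, 2010, SUM(Vacation)
FROM (
    SELECT StaffID, SUM(VacationAllocated) AS Vacation
    FROM Allocations
    WHERE AllocationDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
    UNION ALL
    SELECT StaffID, -COUNT(DISTINCT HolidayDate)
    FROM HolidayTaken
    WHERE HolidayDate <= CONVERT(datetime, '2010-12-31', 120)
    GROUP BY StaffID
) totals
GROUP BY StaffID;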

Show data over rolling months instead of calendar months in SSAS

I have a cube that shows order volume history, but showing the month-to-date numbers isn't that useful (since they're way out of whack with the other, full months). I want to include that data in some kind of display, but in a way that makes sense.
What's the best way to go about showing the last 30 days (or last month) over a previous period? If there's some way I can do that with my time dimension, great, but I figured I might need a calculation of some kind.
For example, since today is 7/12, I'd want to show the data for 6/13-7/12 as the most recent period, then compare it to 5/13-6/12, and so on. That way it would be easy to see where the current month is trending, but with values that are in line, size-wise, with previous periods. I also figure that would make an easier KPI value, since I can compare movement against last month's values using the rolling one-month period for comparison.

MySQL Database Table Structure Efficiency Advice

We are designing a MySQL table to track the number of followers on a daily basis for tens of thousands of Twitter accounts. We've been struggling to figure out the most efficient way to store this data. The two options we are considering are:
1) OPTION 1 - A table with the columns: Twitter ID, Month, Day1, Day2, Day3, etc., where each Day column would contain the number of followers for that account on that day of the specified month
2) OPTION 2 - A table with the columns: Twitter ID, Day, Followers, i.e. one row per account per day
Option 1 would result in about 30x fewer rows than Option 2. What I'm not sure about, from a performance perspective, is whether it's preferable to have fewer columns or fewer rows.
In terms of the queries we will be using, we just want to be able to query the data to get the number of followers for a specific Twitter account for arbitrary time ranges.
I would appreciate suggestions on which approach is better and why. Also, if there is a much better option than the ones I present please feel free to suggest it.
Thanks in advance for your help!
Option 2, no question.
Imagine trying to write a query using each option. Let's give the best case for option 1: we know we want the total for all 31 days of the month. Then with option 1 the query is:
select twitterid, day1+day2+day3+day4+day5+day6+day7+day8+day9+day10
+day11+day12+day13+day14+day15+day16+day17+day18+day19+day20
+day21+day22+day23+day24+day15+day26+day27+day28+day29+day30
+day31 as total
from table1
where month='2010-12';
With option 2 the query is:
select twitterid, sum(followers) as total
from table2
where day between '2010-12-01' and '2010-12-31'
group by twitterid;
The second looks way easier to me. If you don't think so, tell me if you immediately noticed the error in the option 1 version, and if you're confident that no programmer would ever make such an error.
Now imagine that the requirements change just slightly, and someone wants the total for one week. With the second version, that's easy: give a date range that describes that week. This could easily be done when building a query on the fly: just ask for the start date and add 6 days to it for the end date. But with the first version, what are you going to do? You'd have to figure out which days of the month fall in that week and change the list of fields retrieved. The week might span two calendar months. This would be a giant pain.
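For example (against the same hypothetical table2 as above), the one-week query is just a start date plus six days:

select twitterid, sum(followers) as total
from table2
where day between '2010-12-06' and date_add('2010-12-06', interval 6 day)
group by twitterid;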
As to performance: sure, more rows take more time to retrieve. But longer rows also take more time to retrieve. Lesson 1 of database design: don't throw out normalization to do a micro-optimization when you don't even have a good reason to believe there's a problem. Build a normalized database first. Then, if it turns out that there are performance problems, tune it afterwards. Odds are that you can buy a faster hard drive for a whole lot less than the cost of one day of a programmer's time spent finding a mistake in an unnecessarily complex query.
Of course it depends on what queries you are going to run, but unless every query requires all 31 days of a month, use Option 2 for your operational data.
It's better from a logical perspective (say later on you don't want queries per "30 calendar days", but per "last X days").
It's better for writes, too (you only update one row with two fields instead of overwriting all the fields).
You can always optimize later (partitioning comes to mind).
Your data warehouse can still be optimized for long-term aggregate statistics.
Use Option 2. Option 1 would be a nightmare for queries.
MySQL has good support for date ranges in queries, so it is easiest to just have a row per day.
I would say option 2, but you would probably want to add a field for a primary key to speed up queries. And if that primary key is an integer value, even better.
Option 2 definitely (with a two-column unique key/constraint on Twitter ID and Day).
Option 1 will just be regrettable.
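A rough sketch of Option 2 with the keys suggested above (the table and column names are placeholders):

CREATE TABLE follower_counts (
    twitter_id BIGINT UNSIGNED NOT NULL,
    day        DATE            NOT NULL,
    followers  INT UNSIGNED    NOT NULL,
    PRIMARY KEY (twitter_id, day)  -- also covers the per-account date-range queries
);

-- followers for one account over an arbitrary range
SELECT day, followers
FROM follower_counts
WHERE twitter_id = 123456789
  AND day BETWEEN '2010-12-01' AND '2010-12-31'
ORDER BY day;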

Date / Time reference table needed for Analytic?

Is it better to keep days of month, months, years, days of week, and weeks of year as separate reference tables, or combined in one common table? The goal is to allow user content searches and action analytics to be filtered by all the various date-time values (there will be custom reporting for users based on their shared content). I am trying to ensure data accuracy by using IDs, and also to report on numbers of shares, etc., by time and date for system reporting, comparing various user groups. If we keep them in separate tables, what about time? Would a table with each hour, minute, and second also be needed?
Most databases support some sort of TIMESTAMP data type plus associated DAY(), MONTH(), DAYOFWEEK() functions.
The only valid reason for keeping separate DAY or HOUR columns in a separate table is if you have precomputed totals and averages for each timeslot.
Even then it's only worth it if you expect a lot of filtering based on these values, as the cost of building these tables is high, and for most queries the standard SQL "GROUP BY ... HAVING ..." will perform well enough.
It sounds like you may be interested in a "star schema" (see Wikipedia), a common method in data warehousing to speed up queries -- but be warned, designing and building a star schema is not a trivial exercise.
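For example, a sketch in MySQL-flavoured SQL of the "just use the date functions" approach, assuming a hypothetical shares table with a shared_at DATETIME column:

-- shares bucketed by day of week and hour of day for 2011,
-- using built-in date-part functions instead of reference tables
SELECT DAYOFWEEK(shared_at) AS day_of_week,
       HOUR(shared_at)      AS hour_of_day,
       COUNT(*)             AS share_count
FROM shares
WHERE shared_at >= '2011-01-01'
  AND shared_at <  '2012-01-01'
GROUP BY DAYOFWEEK(shared_at), HOUR(shared_at);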

SSAS cube design, semi-additive measures, and running totals

I have what is to me a bit of a tricky design issue in my SSAS cube. The question is related to general accounting practices: I have a fact table containing financial transactions (i.e., a ledger), and each of those transactions is tagged with a transaction date and a period. The period does NOT relate directly to a day or a series of days. Users may close a period in the middle of a day if that is when they have finished their month's work.
I need to be able to report on Accounts Receivable (AR) by both date and period. I am not using the Enterprise Edition of SSAS, so the time-intelligence semi-additive options are not available to me, and even if they were, they would only allow one time dimension to use non-standard aggregation, and I believe in this case I need two that allow it.
Accounts Receivable is a running total: it should be the sum of the latest ledger item selected and everything that came before it. I know how to do this calculation in MDX for a single time dimension, but how can I make it work with two time dimensions, transaction date and period close? Is period close even considered a "time" dimension in this case? It does have a temporal aspect to it, and I do want the sums from all periods up to the current one.
I am stumped on how to relate the two time dimensions to a single fact table and use different aggregation for each. Maybe the best solution here is to have two periodic snapshot tables (instead of trying to aggregate this info from the FactLedger table), one aggregated by transaction date and one by period, which is the solution I am currently leaning towards, but I would love a second opinion.
You can most certainly have more than one time dimension in a cube, and in this case I would actually just create one common time dimension and have it role-play as two: transaction date and period close. To role-play a dimension, just add it to the cube again on the Dimension Usage tab of the cube designer and rename it. Set up your references appropriately to key off the two different fact columns.
Or maybe I'm not understanding the issue correctly. This sounds pretty straightforward.
You can create your own time table with periods and alter your fact table's datetime format to match it. Then one dimension would be enough.
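And if you do end up going the periodic-snapshot route mentioned in the question, a minimal sketch of the period-close snapshot might look like this (all table and column names are assumptions, not the actual FactLedger schema):

-- AR balance as of each period close: sum every ledger amount posted in any
-- period that closed on or before the period in question
INSERT INTO FactARSnapshotByPeriod (PeriodKey, ARBalance)
SELECT p.PeriodKey,
       SUM(l.Amount) AS ARBalance
FROM DimPeriod p
JOIN DimPeriod earlier
  ON earlier.PeriodCloseDate <= p.PeriodCloseDate
JOIN FactLedger l
  ON l.PeriodKey = earlier.PeriodKey
GROUP BY p.PeriodKey;

The transaction-date snapshot would be built the same way against the date dimension.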
