Date / Time reference table needed for Analytic? - database

Is it better to keep Days of month, Months, Year, Day of week and week of year as separate reference tables or in a common Answer table? Goal is allow user content searches and action analytic to be filtered by all the various date-time values (There will be custom reporting for users based on their shared content). I am trying to ensure data accuracy by using IDs, and also report out on numbers of shares, etc by time and date for system reporting by comparing various user groups. If we keep in separate tables, what about time? A table with each hour, minute and second also needed?

Most databases support some sort of TIMESTAMP data type plus assciated DAY(), MONTH(), DAYOFWEEK() functions.
The only valid reason for separate DAY or HOUR columns in a separate table is if you have procomputed totals and averages for each timeslot.
Even then its only worth it if you expect a lot of filtering based on these values, as the cost of building these tables is high, and, for most queries the standard SQL "GROUP BY ... HAVING .. " will perform well enough.
It sounds like you may be interested in a "STAR SCHEMA" wikipedia a common method in data warehosing to speed up queries -- but be warned designing and building a Star Schem is not a trivial exercise.

Related

Hit Counter: Separate Date + Time Fields vs one DateTime2 field

I am planning out a hit counter, and I plan to make many report queries to show number of hits total in a day, the past week, the past month, etc, as well as one that would feed a chart that shows what time of day was most popular, within a specific date range, for a specific page.
With this in mind, would it be beneficial to store the DATE in a separate field from the TIME that the hit occurred, then add indexes? I would be using a where clause with a range (greater than x and less than y) for some of these queries. I do expect to have queries that ask about both the Date and the Time, such as "within the past 6 months, show me number of hit grouped per hour of the day."
Am I over complicating it? should I just use a single DateTime2(0) field or is there some advantage to using two fields for this?
I think you are bordering premature optimization with this approach.
Use Datetime. In due time (i.e. after your application has reached Production and you have a better idea of the actual requirements and how it performs) you can for example introduce views to aggregate your data in a way that proves more useful for any reporting/querying you have to perform frequently.
In the most extreme case you can even refactor your schema and migrate everything from Datetime to two distinct fields, but I doubt this will prove necessary.

Difference between sql query aggregation and aggegration and querying an OLAP cube

I have a query with respect to the advantages of building a OLAP cube vs aggregating data in database table for querying ,data of say 6 months and then archiving the sql table later for analytics purpose.
Which one is better, table or OLAP cube? and why since I can aggregate and keep data in my tables also and query the aggregated data as and when needed.
Short version: Like many development decisions, it depends.
Long version: I wouldn't say that one is "better" than the other - it's just that the two have separate uses and one or the other might be the better solution depending on what the requirements are.
If you have a few specific reports which require specific aggregations, then it might be simpler and easier for everyone involved to just aggregate that information in a table or a view, and point your reports at that.
As an example, if you know your users only want reports at a monthly level for a particular set of parameters - maybe your sales department want the monthly value of each salesperson's sales, for example - then your best bet might be to aggregate this up and pop it into a report where they can select the month and the salesperson, and get the number that they want.
The benefits of this might be that it's quick to develop and provide to your users, there's not too much time spent testing as only a few figures need checking, etc. Your users also don't need to spend time being trained/learning to use a cube - reports are generally pretty easy for people to pick up and use.
But if your users want to be able to carry out much more open-ended analysis on their own terms then it's not much use if you need to go away and develop a report every time they have a new requirement. Your database might start getting very full of similar-but-different tables full of aggregated amounts. You could run into issues where one report ends up not agreeing with another for some reason - you might find you're dealing with the same data quality issues over and over again in each report.
In this case, it might make more sense to develop a cube over the top of data held at the lowest grain which your users want to analyse. In this way, they can essentially self-serve, rather than getting back in touch with you every time they need a new set of aggregated data. They can slice and dice through the data using multiple different "parameters" (dimensions in the OLAP world), rather than being limited by the nature of the reports.
Aggregated data still sometimes plays a role even when you have a cube in place, though. Sometimes performance gains can be found by aggregating data up to certain levels and holding it in a physical table, and getting your OLAP tool to use the physically aggregated data at that level instead of using its own aggregations - but this is an optimization step which would need careful consideration to see whether it's beneficial in terms of performance, whether the space vs. performance payoff is worthwhile, etc. I wouldn't worry about this aspect if you're just starting to look at OLAP, but wanted to note it for the sake of completeness.
To add to Jo's great answer, consider the grain of the facts that need to be aggregated and compared. If you have daily sales by product, but budgets by month and product category, you're going to need an aggregate fact table based on sales in order to compare budgets. That would be further represented as two cubes in your OLAP database - Sales cube, and Budget cube.
If there are very regular use cases which involve specific aggregated data, and this aggregated data would take a while to return from sql database tables then a cube might help.
If there are lots of potential ways in which your db table data needs to be sliced and diced at an aggregated level then there is definitely a good argument to start playing around with olap cubes.
In terms of sums of data olap is a great aggregation tool. I'm not convinced that it is the best tool for distinct counts though, so if your requirements includes lots of distinct counts then maybe look elsewhere. Do you have the option of Tabular/PowerPivot/DAX ?

Database design: storing many large reports for frequent historical analysis

I'm a long time programmer who has little experience with DBMSs or designing databases.
I know there are similar posts regarding this, but am feeling quite discombobulated tonight.
I'm working on a project which will require that I store large reports, multiple times per day, and have not dealt with storage or tables of this magnitude. Allow me to frame my problem in a generic way:
The process:
A script collects roughly 300 rows of information, set A, 2-3 times per day.
The structure of these rows never change. The rows contain two columns, both integers.
The script also collects roughly 100 rows of information, set B, at the same time. The
structure of these rows does not change either. The rows contain eight columns, all strings.
I need to store all of this data. Set A will be used frequently, and daily for analytics. Set B will be used frequently on the day that it is collected and then sparingly in the future for historical analytics. I could theoretically store each row with a timestamp for later query.
If stored linearly, both sets of data in their own table, using a DBMS, the data will reach ~300k rows per year. Having little experience with DBMSs, this sounds high for two tables to manage.
I feel as though throwing this information into a database with each pass of the script will lead to slow read times and general responsiveness. For example, generating an Access database and tossing this information into two tables seems like too easy of a solution.
I suppose my question is: how many rows is too many rows for a table in terms of performance? I know that it would be in very poor taste to create tables for each day or month.
Of course this only melts into my next, but similar, issue, audit logs...
300 rows about 50 times a day for 6 months is not a big blocker for any DB. Which DB are you gonna use? Most will handle this load very easily. There are a couple of techniques for handling data fragmentation if the data rows exceed more than a few 100 millions per table. But with effective indexing and cleaning you can achieve the performance you desire. I myself deal with heavy data tables with more than 200 million rows every week.
Make sure you have indexes in place as per the queries you would issue to fetch that data. Whats ever you have in the where clause should have an appropriate index in db for it.
If you row counts per table exceed many millions you should look at partitioning of tables DBs store data in filesystems as files actually so partitioning would help in making smaller groups of data files based on some predicates e.g: date or some unique column type. You would see it as a single table but on the file system the DB would store the data in different file groups.
Then you can also try table sharding. Which actually is what you mentioned....different tables based on some predicate like date.
Hope this helps.
You are over thinking this. 300k rows is not significant. Just about any relational database or NoSQL database will not have any problems.
Your design sounds fine, however, I highly advise that you utilize the facility of the database to add a primary key for each row, using whatever facility is available to you. Typically this involves using AUTO_INCREMENT or a Sequence, depending on the database. If you used a nosql like Mongo, it will add an id for you. Relational theory depends on having a primary key, and it's often helpful to have one for diagnostics.
So your basic design would be:
Table A tableA_id | A | B | CreatedOn
Table B tableB_id | columns… | CreatedOn
The CreatedOn will facilitate date range queries that limit data for summarization purposes and allow you to GROUP BY on date boundaries (Days, Weeks, Months, Years).
Make sure you have an index on CreatedOn, if you will be doing this type of grouping.
Also, use the smallest data types you can for any of the columns. For example, if the range of the integers falls below a particular limit, or is non-negative, you can usually choose a datatype that will reduce the amount of storage required.

How to design a schema for periods of dates with exceptions?

The site is about special discount events. Each event contains a period of time (dates to be more precise) that it is valid. However there will often be a constrain that the deal is not valid in say Saturdays and Sundays (or even a specific day).
Currently my rough design would be to have two tables:
Event table store EventID, start and end date of the duration and all other things.
EventInvalidDate table stores EventID, and specific dates which the deals are not valid. This requires the application code to calculate invalid dates upfront.
Does anyone know of a better pattern to fit this requirement, or possible pitfall for my design? This requirement is like a subset of a general calender model, because it does not require infinite repeating events in the future (i.e. each event has a definite duration).
UPDATE
My co-worker suggested to have a periods table with start and end dates. If the period is between 1/Jan and 7/Jan, with 3/Jan being an exception, the table would record: 1/Jan~2/Jan, 4/Jan~7/Jan.
Does anyone know if this is better the same as the answer's approach, in terms of SQL performance. Thanks
Specifying which dates are not included might keep the number of database rows down, but it makes calculations, queries and reports more difficult.
I'd turn it upside down.
Have a master Event table that lists the first and last date of the event.
Also have a detail EventDates table that gets populated with all the dates where the event is available.
Taking this approach makes things easier to use, especially when writing queries and reports.
Update
Having a row per date allows you to do exact joins on dates to other tables, and allows you to aggregate per day for reporting purposes.
select ...
from sales
inner join eventDates
on sales.saleDate = eventDates.date
If your eventDates table uses start and end dates, the joins become harder to write:
select ...
from sales
inner join eventDates
on sales.saleDate >= eventDates.start and sales.SaleDate < eventDates.finish
Exact match joins are definately done by index, if available, in every RDBMS I've checked; range matches, as in the second example, I'm not sure. They're probably Ok from a performance perspective, unless you end up with a metric ton of data.

SSAS cube design, semi-additive measures, and running totals

I have what is to me a bit of a tricky design issue in my SSAS cube. The question is related to general accounting practices, I have a fact table containing financial transactions (i.e. a ledger) and each of those transactions is tagged with a transaction date and a period. The period does NOT related directly to a day, or a series of days. Users may close a period in the middle of a day if that is when they have finished their months work.
I need to be able to report on Accounts Receivable (AR) by both date and period. I am not using Enterprise Edition of SSAS so the time intelligence semi-additive options are not availabe to me, and even if they were they would only allow one time dimension to use non-standard aggregation and I believe in this case I need two that allow this.
Accounts Receivable is a running total, it should be the sum of the latest ledger item selected and everything that came before it. I know how do do this calculation in MDX for a single time dimension, but how can I allow this to work with two time dimensions, transaction date, and period close? Is period close even considered a "time" dimension in this case? It does have a temporal aspect to it, and I do want the sums from all periods up to the current.
I am stumped on how to related the two time dimensions to a single fact table and use different aggregation for each. Maybe the best solution here is to have two periodic snapshot tables (instead of trying to aggregate this info from the FactLedger table), one aggregated by transaction date and one by period which is the solution I am currently leaning towards but I would love a second opinion.
You can most certainly have more than one time dimension in a cube, and in this case I would actually just create one common time dimension and have it role play as two, transaction date and period close. To role play a dimension, just add it to the cube again in the Dimension Usage tab of the cube designer and rename it. Set up your references appropriately to key off of the two different fact columns.
Or maybe I'm not understanding the issue correctly. This sounds pretty straight-forward.
You can create your own time-table with periods and you can alter your fact_table's datetime format to match your time-table. Then 1 dimension would be enough.

Resources