I have what is, to me, a somewhat tricky design issue in my SSAS cube. The question relates to general accounting practice: I have a fact table containing financial transactions (i.e. a ledger), and each of those transactions is tagged with a transaction date and a period. The period does NOT relate directly to a day or a series of days. Users may close a period in the middle of a day if that is when they have finished their month's work.
I need to be able to report on Accounts Receivable (AR) by both date and period. I am not using the Enterprise Edition of SSAS, so the time intelligence semi-additive options are not available to me. Even if they were, they would only allow one time dimension to use non-standard aggregation, and I believe in this case I need two.
Accounts Receivable is a running total: it should be the sum of the latest ledger item selected and everything that came before it. I know how to do this calculation in MDX for a single time dimension, but how can I make it work with two time dimensions, transaction date and period close? Is period close even considered a "time" dimension in this case? It does have a temporal aspect to it, and I do want the sums from all periods up to the current one.
I am stumped on how to relate the two time dimensions to a single fact table and use a different aggregation for each. Maybe the best solution here is to have two periodic snapshot tables (instead of trying to aggregate this info from the FactLedger table), one aggregated by transaction date and one by period. That is the solution I am currently leaning towards, but I would love a second opinion.
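To make the two-snapshot idea concrete, here is a minimal sketch in plain Python rather than MDX. The ledger rows, the column order and the `running_balance` helper are all made up for illustration; the point is only that the same ledger produces two independent running totals, one ordered by transaction date and one ordered by period.

```python
from collections import defaultdict
from itertools import accumulate

# Hypothetical ledger rows: (transaction_date, period, amount).
# Period 1 is closed partway through 2010-02-01, so that day spans two periods.
ledger = [
    ("2010-01-15", 1, 100.0),
    ("2010-02-01", 1, 50.0),
    ("2010-02-01", 2, -30.0),
    ("2010-02-10", 2, 20.0),
]

def running_balance(rows, key_index):
    """Sum amounts per key, then accumulate those sums in key order."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[2]
    keys = sorted(totals)
    return dict(zip(keys, accumulate(totals[k] for k in keys)))

# Two snapshot "tables" built from the same ledger.
ar_by_date = running_balance(ledger, 0)
ar_by_period = running_balance(ledger, 1)
```

Both snapshots end on the same closing balance, but the intermediate values differ because 2010-02-01 belongs to two periods.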
You can most certainly have more than one time dimension in a cube, and in this case I would actually just create one common time dimension and have it role play as two, transaction date and period close. To role play a dimension, just add it to the cube again in the Dimension Usage tab of the cube designer and rename it. Set up your references appropriately to key off of the two different fact columns.
Or maybe I'm not understanding the issue correctly. This sounds pretty straight-forward.
You can create your own time-table with periods and you can alter your fact_table's datetime format to match your time-table. Then 1 dimension would be enough.
I have a query about the advantages of building an OLAP cube versus aggregating data in a database table for querying, for data covering, say, 6 months, and then archiving the SQL table later for analytics purposes.
Which one is better, a table or an OLAP cube, and why? I can also aggregate and keep data in my tables and query the aggregated data as and when needed.
Short version: Like many development decisions, it depends.
Long version: I wouldn't say that one is "better" than the other - it's just that the two have separate uses and one or the other might be the better solution depending on what the requirements are.
If you have a few specific reports which require specific aggregations, then it might be simpler and easier for everyone involved to just aggregate that information in a table or a view, and point your reports at that.
As an example, if you know your users only want reports at a monthly level for a particular set of parameters - maybe your sales department wants the monthly value of each salesperson's sales - then your best bet might be to aggregate this up and pop it into a report where they can select the month and the salesperson, and get the number that they want.
The benefits of this might be that it's quick to develop and provide to your users, there's not too much time spent testing as only a few figures need checking, etc. Your users also don't need to spend time being trained/learning to use a cube - reports are generally pretty easy for people to pick up and use.
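As a rough sketch of that kind of fixed aggregation, the snippet below rolls made-up sales rows up to month and salesperson and exposes a single lookup. The data, the `monthly_sales` structure and the `report` function are hypothetical names used only for illustration.

```python
from collections import defaultdict
from datetime import date

# Hypothetical detail rows: (sale_date, salesperson, amount).
sales = [
    (date(2010, 1, 5), "alice", 120.0),
    (date(2010, 1, 20), "alice", 80.0),
    (date(2010, 1, 7), "bob", 200.0),
    (date(2010, 2, 3), "alice", 50.0),
]

# The "aggregate table": one total per (year, month, salesperson).
monthly_sales = defaultdict(float)
for sale_date, salesperson, amount in sales:
    monthly_sales[(sale_date.year, sale_date.month, salesperson)] += amount

def report(year, month, salesperson):
    """Return the monthly total the user selected, or 0.0 if there were no sales."""
    return monthly_sales.get((year, month, salesperson), 0.0)
```

This answers exactly one question well. Any new slice (by product, by region, by week) would need another aggregate.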
But if your users want to be able to carry out much more open-ended analysis on their own terms then it's not much use if you need to go away and develop a report every time they have a new requirement. Your database might start getting very full of similar-but-different tables full of aggregated amounts. You could run into issues where one report ends up not agreeing with another for some reason - you might find you're dealing with the same data quality issues over and over again in each report.
In this case, it might make more sense to develop a cube over the top of data held at the lowest grain which your users want to analyse. In this way, they can essentially self-serve, rather than getting back in touch with you every time they need a new set of aggregated data. They can slice and dice through the data using multiple different "parameters" (dimensions in the OLAP world), rather than being limited by the nature of the reports.
Aggregated data still sometimes plays a role even when you have a cube in place, though. Sometimes performance gains can be found by aggregating data up to certain levels and holding it in a physical table, and getting your OLAP tool to use the physically aggregated data at that level instead of using its own aggregations - but this is an optimization step which would need careful consideration to see whether it's beneficial in terms of performance, whether the space vs. performance payoff is worthwhile, etc. I wouldn't worry about this aspect if you're just starting to look at OLAP, but wanted to note it for the sake of completeness.
To add to Jo's great answer, consider the grain of the facts that need to be aggregated and compared. If you have daily sales by product, but budgets by month and product category, you're going to need an aggregate fact table based on sales in order to compare budgets. That would be further represented as two cubes in your OLAP database - Sales cube, and Budget cube.
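A small sketch of that grain mismatch, with invented products, categories and amounts: daily sales by product have to be rolled up to month and category before they line up with the budget rows.

```python
from collections import defaultdict

# Hypothetical product-to-category mapping (the product dimension).
product_category = {"widget": "hardware", "gadget": "hardware", "manual": "books"}

# Sales fact at daily/product grain: (ISO date, product, amount).
daily_sales = [
    ("2010-01-05", "widget", 100.0),
    ("2010-01-09", "gadget", 40.0),
    ("2010-01-12", "manual", 15.0),
    ("2010-02-02", "widget", 60.0),
]

# Budget fact at monthly/category grain: (month, category) -> amount.
budget = {
    ("2010-01", "hardware"): 150.0,
    ("2010-01", "books"): 10.0,
    ("2010-02", "hardware"): 50.0,
}

# Aggregate sales up to the budget's grain.
monthly_sales = defaultdict(float)
for day, product, amount in daily_sales:
    monthly_sales[(day[:7], product_category[product])] += amount

# Only now can actuals be compared with budget, key by key.
variance = {key: monthly_sales.get(key, 0.0) - amount for key, amount in budget.items()}
```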
If there are very regular use cases which involve specific aggregated data, and this aggregated data would take a while to return from SQL database tables, then a cube might help.
If there are lots of potential ways in which your database table data needs to be sliced and diced at an aggregated level, then there is definitely a good argument to start playing around with OLAP cubes.
In terms of sums of data, OLAP is a great aggregation tool. I'm not convinced that it is the best tool for distinct counts though, so if your requirements include lots of distinct counts then maybe look elsewhere. Do you have the option of Tabular/PowerPivot/DAX?
I have a database that grows every month. The schema remains the same, so I think I could use one of these two methods:
Use only one table. New data will be appended to this table and will be identified by a date column. The data grows by about 20,000 rows every month, but in the long term I think it could become a problem to search and analyze this data.
Dynamically create one table per month, where the table name indicates which data it contains (for example, Usage-20101125). This will force us to use dynamic SQL, but in the long term it seems fine.
I must confess that I have no experience designing this kind of database. Which one should I use in the real world?
Thank you so much
20,000 rows per month is not a lot. Go with your first option. You didn't mention which database you'll be using, but SQL Server, Oracle, Sybase and PostgreSQL, to name just a few, can handle millions of rows comfortably.
You will need to investigate a proper maintenance plan, including indexing and statistics, but that will come with lots of reading and experience.
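Here is a minimal sketch of the single-table option, using SQLite through Python's standard `sqlite3` module purely because it needs no setup. The table name `usage_log`, its columns and the generated rows are assumptions for illustration, and the indexing details will differ on whichever database you actually use.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table for all months; the date column identifies each month's data.
conn.execute("CREATE TABLE usage_log (usage_date TEXT NOT NULL, amount INTEGER NOT NULL)")
# An index on the date column keeps month-range queries from scanning everything.
conn.execute("CREATE INDEX ix_usage_log_date ON usage_log (usage_date)")

# Hypothetical data: 1,000 rows for each month of 2010.
rows = [(f"2010-{month:02d}-15", month) for month in range(1, 13) for _ in range(1000)]
conn.executemany("INSERT INTO usage_log VALUES (?, ?)", rows)

# Querying one month is a plain parameterised range filter, no dynamic SQL needed.
november = conn.execute(
    "SELECT COUNT(*), SUM(amount) FROM usage_log "
    "WHERE usage_date >= ? AND usage_date < ?",
    ("2010-11-01", "2010-12-01"),
).fetchone()
```

With one table per month, the same query would need the table name built into the SQL string for every month touched.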
Look into partitioning your table.
That way you can physically store the data on different disks for performance while logically it would be one table so your database stays well designed.
Is it better to keep days of month, months, year, day of week and week of year as separate reference tables or in a common Answer table? The goal is to allow user content searches and action analytics to be filtered by all the various date-time values (there will be custom reporting for users based on their shared content). I am trying to ensure data accuracy by using IDs, and also to report on numbers of shares, etc. by time and date for system reporting by comparing various user groups. If we keep them in separate tables, what about time? Is a table with each hour, minute and second also needed?
Most databases support some sort of TIMESTAMP data type plus associated DAY(), MONTH(), DAYOFWEEK() functions.
The only valid reason for separate DAY or HOUR columns in a separate table is if you have precomputed totals and averages for each timeslot.
Even then it's only worth it if you expect a lot of filtering based on these values, as the cost of building these tables is high and, for most queries, the standard SQL "GROUP BY ... HAVING ..." will perform well enough.
It sounds like you may be interested in a "star schema" (see Wikipedia), a common method in data warehousing to speed up queries, but be warned: designing and building a star schema is not a trivial exercise.
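To illustrate the point about deriving date parts from a single timestamp instead of storing them in reference tables, here is a small Python sketch. The share timestamps are invented, and the Python `datetime` attributes stand in for the database's DAY(), MONTH() and DAYOFWEEK() functions.

```python
from collections import Counter
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Hypothetical "share" events, each stored as one timestamp.
share_times = [
    datetime(2011, 3, 7, 9, 15),   # a Monday
    datetime(2011, 3, 7, 17, 40),  # same Monday
    datetime(2011, 3, 9, 9, 5),    # a Wednesday
    datetime(2011, 4, 4, 12, 0),   # a Monday
]

# Every grouping is derived from the timestamp on the fly.
shares_by_weekday = Counter(WEEKDAY_NAMES[t.weekday()] for t in share_times)
shares_by_month = Counter((t.year, t.month) for t in share_times)
shares_by_hour = Counter(t.hour for t in share_times)
```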
I'm still learning the ropes of OLAP, cubes, and SSAS, but I'm hitting a performance barrier and I'm not sure I understand what is happening.
So I have a simple cube, which defines two simple dimensions (type and area), a third Time dimension hierarchy (goes Year->Quarter->Month->Day->Hour->10-Minute), and one measure (sum on a field called Count). The database tracks events: when they occur, what type they are, and where they occurred. The fact table is a precalculated summary of events for each 10 minute interval.
So I set up my cube and I use the browser to view all my attributes at once: total counts per area per type over time, with drill down from Year down to the 10 Minute Interval. Reports are similar in performance to the browse.
For the most part, it's snappy enough. But as I get deeper into the drill-tree, it takes longer to view each level. Finally at the minute level it seems to take 20 minutes or so before it displays the mere 6 records. But then I realized that I could view the other minute-level drilldowns with no waiting, so it seems like the cube is calculating the entire table at that point, which is why it takes so long.
I don't understand. I would expect that going to Quarters or Years would take longest, since it has to aggregate all the data up. Going to the lowest metric, filtered down heavily to around 180 cells (6 intervals, 10 types, 3 areas), seems like it should be fastest. Why is the cube processing the entire dataset instead of just the visible sub-set? Why is the highest level of aggregation so fast and the lowest level so slow?
Most importantly, is there anything I can do by configuration or design to improve it?
Some additional details that I just thought of which may matter: This is SSAS 2005, running on SQL Server 2005, using Visual Studio 2005 for BI design. The Cube is set (as by default) to full MOLAP, but is not partitioned. The fact table has 1,838,304 rows, so this isn't a crazy enterprise database, but it's no simple test db either. There's no partitioning and all the SQL stuff runs on one server, which I access remotely from my work station.
When you are looking at the minute level - are you talking about all events from 12:00 to 12:10 regardless of day?
I would think if you need that to go faster (because obviously it would be scanning everything), you will need to make the two parts of your "time" dimension orthogonal - make a date dimension and a time dimension.
If you are getting 1/1/1900 12:00 to 1/1/1900 12:10, I'm not sure what it could be then...
Did you verify the aggregations of your cube to ensure they were correct? An easy way to tell is to check whether you get the same number of records no matter which drill-tree you go down.
Assuming this is not the case, what Cade suggests about making a Date dimension AND a Time dimension would be the most obvious approach, but it is one of the bigger no-nos in SSAS. See this article for more information: http://www.sqlservercentral.com/articles/T-SQL/70167/
Hope this helps.
I would also check to ensure that you are running the latest service pack for SQL Server 2005.
The RTM version had some SSAS performance issues.
Also check to ensure that you have correctly defined attribute relationships on your time dimension and other dimensions as well.
Not having these relationships defined will cause the SSAS storage engine to scan more data than necessary.
More info: http://ms-olap.blogspot.com/2008/10/attribute-relationship-example.html
As stated above, splitting out the date and time will significantly decrease the cardinality of your date dimension, which should increase performance and allow a better analytic experience.
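The cardinality argument can be checked with a quick back-of-the-envelope sketch. It assumes one year of 10-minute slots, which is an invented figure and not the asker's actual data range.

```python
from datetime import datetime, timedelta

# Hypothetical range: every 10-minute slot in 2009 (365 days x 144 slots per day).
start = datetime(2009, 1, 1)
slots = [start + timedelta(minutes=10 * i) for i in range(365 * 144)]

# One combined date-time dimension needs a member for every slot.
combined_members = set(slots)

# Split dimensions only need the distinct dates and the distinct times of day.
date_members = {s.date() for s in slots}
time_members = {s.time() for s in slots}

print(len(combined_members), len(date_members), len(time_members))
```

Under these assumptions the combined dimension has 52,560 leaf members, while the split version has 365 dates plus 144 times of day.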
In a data warehouse, I want to have a fact table which tracks certain metrics of a university application (average score on a standardized test, for example) and also the status of applications during different times of the year. For simplicity, let's say a given application progresses through 3 states:
New
Being Assessed
Assessed
and these states change over time.
I believe I want to use a slowly changing dimension here, but I can't figure out how to get it to work properly.
Can someone give me an example of a fact table and dimension table which tracks two applications as they progress through these states?
I'm using SQL Server Analysis Services 2005.
The goal is to be able to do year on year analysis for the number of applications in each state.
It sounds like a classic example of where you would use an accumulating snapshot type fact table more than slowly changing dimensions. Accumulating snapshots are the standard way of modeling business processes that have a defined lifecycle when you want to be able to analyze your progress of applications through the pipeline.
Google "accumulating snapshot" fact tables and you will find many good articles on their usage, but here is one you may find helpful: http://blog.oaktonsoftware.com/2007/03/accumulating-snapshot-use-accumulating.html
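Since the question asked for an example with two applications, here is a rough sketch of what accumulating snapshot rows could look like, written as Python data rather than SSAS objects. The column names, dates and scores are made up. Each application has one row, with one date column per milestone that is filled in as the application moves along.

```python
from datetime import date

# Hypothetical accumulating snapshot: one row per application,
# one date column per state; None means "not reached yet".
applications = [
    {"application_id": 1, "test_score": 1850,
     "new_date": date(2008, 1, 10),
     "being_assessed_date": date(2008, 2, 1),
     "assessed_date": date(2008, 3, 15)},
    {"application_id": 2, "test_score": 2100,
     "new_date": date(2008, 2, 20),
     "being_assessed_date": date(2008, 3, 5),
     "assessed_date": None},
]

# Latest state first, so the first milestone already reached wins.
STATE_COLUMNS = [
    ("Assessed", "assessed_date"),
    ("Being Assessed", "being_assessed_date"),
    ("New", "new_date"),
]

def state_as_of(app, as_of):
    """Return the state the application was in on the given date, or None."""
    for state, column in STATE_COLUMNS:
        reached = app[column]
        if reached is not None and reached <= as_of:
            return state
    return None

def counts_as_of(as_of):
    """Count applications per state on the given date."""
    counts = {}
    for app in applications:
        state = state_as_of(app, as_of)
        if state is not None:
            counts[state] = counts.get(state, 0) + 1
    return counts
```

Calling `counts_as_of` for the same calendar day in two different years gives the year-on-year comparison of applications per state.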
Your question mentioned standardized test score and assessment status. Those would be two of your dimensions, along with the omnipresent time, of course. Ralph Kimball has a nice example of a good time dimension. If your test score dimension is the SAT, it would have at most 2400 - 600 + 1 = 1801 rows, because each of the three sections is scored from 200 to 800, so totals run from 600 to 2400 (and only 181 rows if you keep just the multiples of 10 that the SAT actually reports). Your assessment dimension could be three rows, as you described.
So you'd have one record in your fact table for every time a score or assessment changed, with a key to the time dimension to tell you when the change occurred.
We've got a couple of articles on slowly changing dimensions over on SQLServerPedia:
http://sqlserverpedia.com/wiki/SSIS_-_Slowly_Changing_Dimension_Wizard
http://sqlserverpedia.com/wiki/Data_Warehousing_-_Slowly_Changing_Dimensions
Those may help bring you up to speed.