In a data warehouse, I want to have a fact table which tracks certain metrics of a university application (average score on a standardized test, for example) and also the status of applications at different times of the year. For simplicity, let's say a given application progresses through 3 states:
New
Being Assessed
Assessed
and these states change over time.
I believe I want to use a slowly changing dimension here, but I can't figure out how to get it to work properly.
Can someone give me an example of a fact table and dimension table which tracks two applications as they progress through these states?
I'm using SQL Server Analysis Services 2005.
The goal is to be able to do year on year analysis for the number of applications in each state.
This sounds like a classic example of where you would use an accumulating snapshot fact table rather than slowly changing dimensions. Accumulating snapshots are the standard way of modeling business processes that have a defined lifecycle, when you want to be able to analyze the progress of applications through the pipeline.
Google "accumulating snapshot" fact tables and you will find many good articles on their usage but here is one you may find helpful. http://blog.oaktonsoftware.com/2007/03/accumulating-snapshot-use-accumulating.html
Your question mentioned standardized test score and assessment status. Those would be two of your dimensions, along with the omnipresent time, of course. Ralph Kimball has a nice example of a good time dimension. If your test score dimension is the SAT, it would have roughly 2400 - 600 = 1800 rows, because you get 600 points just for signing your name and there are three sections scored from 200 to 800 each. Your assessment dimension could be three rows, as you described.
So you'd have one record in your fact table for every time a score or assessment changed, with a key to the time dimension to tell you when the change occurred.
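With that transaction-grain design (one fact row per change), the year-on-year analysis from the question could be approximated with something like the query below, which counts how many applications entered each state in each year. The object names are illustrative, not a prescribed schema.

    -- Hypothetical: FactApplicationStatusChange(ApplicationKey, StatusKey, DateKey),
    -- DimDate(DateKey, CalendarYear, ...), DimStatus(StatusKey, StatusName).
    SELECT  d.CalendarYear,
            s.StatusName,
            COUNT(DISTINCT f.ApplicationKey) AS ApplicationsEnteringState
    FROM    FactApplicationStatusChange AS f
    JOIN    DimDate   AS d ON d.DateKey   = f.DateKey
    JOIN    DimStatus AS s ON s.StatusKey = f.StatusKey
    GROUP BY d.CalendarYear, s.StatusName
    ORDER BY d.CalendarYear, s.StatusName;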
We've got a couple of articles on slowly changing dimensions over on SQLServerPedia:
http://sqlserverpedia.com/wiki/SSIS_-_Slowly_Changing_Dimension_Wizard
http://sqlserverpedia.com/wiki/Data_Warehousing_-_Slowly_Changing_Dimensions
Those may help bring you up to speed.
Related
I have a question about the advantages of building an OLAP cube versus aggregating data in a database table for querying, say, 6 months of data and then archiving the SQL table later for analytics purposes.
Which one is better, a table or an OLAP cube, and why? I can aggregate and keep data in my tables as well, and query the aggregated data as and when needed.
Short version: Like many development decisions, it depends.
Long version: I wouldn't say that one is "better" than the other - it's just that the two have separate uses and one or the other might be the better solution depending on what the requirements are.
If you have a few specific reports which require specific aggregations, then it might be simpler and easier for everyone involved to just aggregate that information in a table or a view, and point your reports at that.
As an example, if you know your users only want reports at a monthly level for a particular set of parameters - maybe your sales department want the monthly value of each salesperson's sales, for example - then your best bet might be to aggregate this up and pop it into a report where they can select the month and the salesperson, and get the number that they want.
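As a rough sketch of that idea, a simple aggregating view could feed the report; the Sales table and its columns here are assumptions rather than a known schema.

    -- T-SQL sketch: monthly sales per salesperson, rolled up once and reused by the report.
    CREATE VIEW vw_MonthlySalesBySalesperson AS
    SELECT  SalesPersonID,
            YEAR(SaleDate)  AS SaleYear,
            MONTH(SaleDate) AS SaleMonth,
            SUM(SaleAmount) AS TotalSales
    FROM    dbo.Sales
    GROUP BY SalesPersonID, YEAR(SaleDate), MONTH(SaleDate);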
The benefits of this might be that it's quick to develop and provide to your users, there's not too much time spent testing as only a few figures need checking, etc. Your users also don't need to spend time being trained/learning to use a cube - reports are generally pretty easy for people to pick up and use.
But if your users want to be able to carry out much more open-ended analysis on their own terms then it's not much use if you need to go away and develop a report every time they have a new requirement. Your database might start getting very full of similar-but-different tables full of aggregated amounts. You could run into issues where one report ends up not agreeing with another for some reason - you might find you're dealing with the same data quality issues over and over again in each report.
In this case, it might make more sense to develop a cube over the top of data held at the lowest grain which your users want to analyse. In this way, they can essentially self-serve, rather than getting back in touch with you every time they need a new set of aggregated data. They can slice and dice through the data using multiple different "parameters" (dimensions in the OLAP world), rather than being limited by the nature of the reports.
Aggregated data still sometimes plays a role even when you have a cube in place, though. Sometimes performance gains can be found by aggregating data up to certain levels and holding it in a physical table, and getting your OLAP tool to use the physically aggregated data at that level instead of using its own aggregations - but this is an optimization step which would need careful consideration to see whether it's beneficial in terms of performance, whether the space vs. performance payoff is worthwhile, etc. I wouldn't worry about this aspect if you're just starting to look at OLAP, but wanted to note it for the sake of completeness.
To add to Jo's great answer, consider the grain of the facts that need to be aggregated and compared. If you have daily sales by product, but budgets by month and product category, you're going to need an aggregate fact table based on sales in order to compare budgets. That would be further represented as two cubes in your OLAP database - Sales cube, and Budget cube.
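A hedged illustration of that point, with entirely hypothetical object names: the daily, product-level sales have to be rolled up to the budget's month/category grain before the two can be compared.

    -- Roll daily product-level sales up to the budget grain (month + product category).
    SELECT  d.CalendarYear,
            d.CalendarMonth,
            p.ProductCategory,
            SUM(f.SalesAmount) AS SalesAmount
    FROM    FactDailySales AS f
    JOIN    DimDate    AS d ON d.DateKey    = f.DateKey
    JOIN    DimProduct AS p ON p.ProductKey = f.ProductKey
    GROUP BY d.CalendarYear, d.CalendarMonth, p.ProductCategory;
    -- The result can then be joined to the budget fact on year, month and category.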
If there are very regular use cases which involve specific aggregated data, and that aggregated data would take a while to return from SQL database tables, then a cube might help.
If there are lots of potential ways in which your table data needs to be sliced and diced at an aggregated level, then there is definitely a good argument for starting to play around with OLAP cubes.
In terms of sums of data, OLAP is a great aggregation tool. I'm not convinced that it is the best tool for distinct counts, though, so if your requirements include lots of distinct counts then maybe look elsewhere. Do you have the option of Tabular/PowerPivot/DAX?
I think the question in the title says it all and is general.
I can give a concrete example as well:
I have tagged articles and want to find similar articles with the tags associated with them.
The score function will look at two articles and count the number of tags in common.
Since the score is not stored anywhere, I'll have to calculate it every time I need to find similar articles for a given article.
But this is too expensive.
What is the common work-around to this kind of problem in general?
Is there a better approach for my specific tag problem? (e.g. solr's moreLikeThis)
Edit:
I'm using postgres, if that matters.
I'm looking for a general solution that people have used successfully, such as "batch-calculate the scores and save them somewhere", etc.
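For concreteness, the per-request calculation I'm describing looks roughly like this (assuming a simple article_tags(article_id, tag_id) join table; the names are just for illustration):

    -- Postgres: articles sharing the most tags with article 42, computed on the fly.
    SELECT  other.article_id,
            COUNT(*) AS shared_tags
    FROM    article_tags AS target
    JOIN    article_tags AS other
            ON  other.tag_id = target.tag_id
            AND other.article_id <> target.article_id
    WHERE   target.article_id = 42
    GROUP BY other.article_id
    ORDER BY shared_tags DESC
    LIMIT 10;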
The answer will vary wildly by database product and version. For example, in some database products a view or an indexed view may be faster than the more common solutions described below...
Typically the way to handle a situation like this is by precalculating the result. You can do that in a handful of ways:
a. You can use something like triggers (added in the SQL 99 standard) that update the counts as rows are added, updated or removed from the source table. In this solution, you are making a (presumably) small sacrifice on inserts, updates and deletes of the source table in order to make significant gains in retrieving the information.
b. You can use a data warehouse, where you accept some level of latency between the live data and the reported data. That means accepting that the data queried from the warehouse will be stale by some agreed number of minutes, hours, days, or weeks. The data warehouse works by periodically querying the live OLTP (Online Transaction Processing) data and updating the OLAP (Online Analytical Processing) database which contains the precalculated results. You then run your reports off the OLAP data, or a combination of OLTP and OLAP data. A formal data warehouse isn't required to achieve equivalent results: you could write a procedure, executed on a timer, that periodically updates a table with the precalculated results (sketched below).
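For the tagged-articles example specifically, the "procedure on a timer" variant might look something like the sketch below (PostgreSQL syntax; article_tags and article_similarity are assumed names, and the refresh would be driven by whatever scheduler you have, e.g. cron). Option (a) would instead maintain article_similarity incrementally with triggers on article_tags.

    -- Precalculated pairwise shared-tag counts, rebuilt periodically.
    CREATE TABLE IF NOT EXISTS article_similarity (
        article_a   INT NOT NULL,
        article_b   INT NOT NULL,
        shared_tags INT NOT NULL,
        PRIMARY KEY (article_a, article_b)
    );

    BEGIN;
    TRUNCATE article_similarity;
    INSERT INTO article_similarity (article_a, article_b, shared_tags)
    SELECT  a.article_id, b.article_id, COUNT(*)
    FROM    article_tags a
    JOIN    article_tags b ON  b.tag_id = a.tag_id
                           AND b.article_id <> a.article_id
    GROUP BY a.article_id, b.article_id;
    COMMIT;

    -- Finding similar articles is then a cheap indexed read:
    -- SELECT article_b, shared_tags FROM article_similarity
    -- WHERE article_a = 42 ORDER BY shared_tags DESC LIMIT 10;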
I'm still learning the ropes of OLAP, cubes, and SSAS, but I'm hitting a performance barrier and I'm not sure I understand what is happening.
So I have a simple cube, which defines two simple dimensions (type and area), a third Time dimension hierarchy (Year -> Quarter -> Month -> Day -> Hour -> 10-Minute), and one measure (a sum on a field called Count). The database tracks events: when they occur, what type they are, and where they occurred. The fact table is a precalculated summary of events for each 10-minute interval.
So I set up my cube and I use the browser to view all my attributes at once: total counts per area per type over time, with drill down from Year down to the 10 Minute Interval. Reports are similar in performance to the browse.
For the most part, it's snappy enough. But as I get deeper into the drill-tree, it takes longer to view each level. Finally at the minute level it seems to take 20 minutes or so before it displays the mere 6 records. But then I realized that I could view the other minute-level drilldowns with no waiting, so it seems like the cube is calculating the entire table at that point, which is why it takes so long.
I don't understand. I would expect that going to Quarters or Years would take longest, since it has to aggregate all the data up. Going to the lowest metric, filtered down heavily to around 180 cells (6 intervals, 10 types, 3 areas), seems like it should be fastest. Why is the cube processing the entire dataset instead of just the visible sub-set? Why is the highest level of aggregation so fast and the lowest level so slow?
Most importantly, is there anything I can do by configuration or design to improve it?
Some additional details that I just thought of which may matter: This is SSAS 2005, running on SQL Server 2005, using Visual Studio 2005 for BI design. The Cube is set (as by default) to full MOLAP, but is not partitioned. The fact table has 1,838,304 rows, so this isn't a crazy enterprise database, but it's no simple test db either. There's no partitioning and all the SQL stuff runs on one server, which I access remotely from my work station.
When you are looking at the minute level - are you talking about all events from 12:00 to 12:10 regardless of day?
I would think if you need that to go faster (because obviously it would be scanning everything), you will need to make the two parts of your "time" dimension orthogonal - make a date dimension and a time dimension.
If you are getting 1/1/1900 12:00 to 1/1/1900 12:10, I'm not sure what it could be then...
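A sketch of what splitting the dimension might look like (names are illustrative): one row per calendar day in the date dimension and only 144 rows in the time-of-day dimension, instead of one member per 10-minute interval across all of history.

    -- T-SQL sketch of separate date and time-of-day dimensions.
    CREATE TABLE DimDate (
        DateKey         INT PRIMARY KEY,  -- e.g. 20090215
        CalendarYear    INT NOT NULL,
        CalendarQuarter INT NOT NULL,
        CalendarMonth   INT NOT NULL,
        CalendarDay     INT NOT NULL
    );

    CREATE TABLE DimTime (
        TimeKey       INT PRIMARY KEY,    -- 0..143, one per 10-minute slot in a day
        HourOfDay     INT NOT NULL,
        TenMinuteSlot INT NOT NULL        -- 0..5 within the hour
    );

    -- The fact table then carries both keys instead of a single fine-grained datetime key:
    -- FactEvents(DateKey, TimeKey, TypeKey, AreaKey, EventCount)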
Did you verify the aggregations of your cube to ensure they were correct? An easy way to tell is whether you get the same number of records no matter which drill-tree you go down.
Assuming this is not the case, what Cade suggests about making a Date dimension AND a Time dimension would be the most obvious approach, but it is one of the bigger no-no's in SSAS. See this article for more information: http://www.sqlservercentral.com/articles/T-SQL/70167/
Hope this helps.
I would also check that you are running the latest service pack for SQL Server 2005.
The RTM version had some SSAS performance issues.
Also check that you have correctly defined attribute relationships on your time dimension, and on your other dimensions as well.
Not having these relationships defined will cause the SSAS storage engine to scan more data than necessary.
More info: http://ms-olap.blogspot.com/2008/10/attribute-relationship-example.html
As stated above, splitting out the date and time will significantly decrease the cardinality of your date dimension, which should increase performance and allow a better analytic experience.
We're inheriting a project at work from another office that has closed down. The production database is around 150GB and we're shying away from copying this to 4 dev machines to work from. Are there any scripts, utilities, or suggestions on how we can go about capturing a small subset of this data, say 5%, to work with in development, while maintaining the integrity of the relationships, key tables, etc.?
I guess what I mean by that last part is that if I had an orders table of 500 rows and took a random sampling of 25 rows, I would need to make sure that the 5% of products I took from the products table included any products needed to satisfy those orders, exceeding 5% if necessary.
I hope I explained that well enough. Anyone have any thoughts?
I suppose the first step would be to map out what the dependencies / relationships between tables are, and how you find all the dependencies of a given row in a given table.
Once you've done that, you could just take a random sampling of one of your high-level tables (e.g. "Customers") and recursively fetch any dependent rows from the dependent tables.
Rinse and repeat for any tables that didn't appear in the "dependency hierarchy" of the first table you chose, until you have a sampling from all tables (something along the lines of the sketch below).
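Purely as an illustration of the shape of it, with a made-up Customers/Orders/OrderLines/Products schema (T-SQL):

    -- 1. Random ~5% sample of the top-level table.
    SELECT  *
    INTO    Sample_Customers
    FROM    Customers
    WHERE   CustomerID IN (SELECT TOP 5 PERCENT CustomerID FROM Customers ORDER BY NEWID());

    -- 2. Pull every dependent row of the sampled parents.
    SELECT  o.*
    INTO    Sample_Orders
    FROM    Orders o
    JOIN    Sample_Customers c ON c.CustomerID = o.CustomerID;

    SELECT  ol.*
    INTO    Sample_OrderLines
    FROM    OrderLines ol
    JOIN    Sample_Orders o ON o.OrderID = ol.OrderID;

    -- 3. Include any rows the children reference, even if that pushes past 5%.
    SELECT  p.*
    INTO    Sample_Products
    FROM    Products p
    WHERE   p.ProductID IN (SELECT ProductID FROM Sample_OrderLines);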
There certainly isn't going to be a generic script to do this, but I'd say that time spent mapping out the dependencies in the database in this way is time well spent understanding the structure of the database.
To be honest, I'd probably do the reverse instead: empty the database and add records to the relevant tables as you find it necessary. There isn't really any need for developers to always run against a representative sampling of data, and you should really make sure that you regularly test against the full data set anyway, just in case the 95% of the database that's left behind contains the rows that cause problems.
At the risk of sounding like a pimp for third-party products, have you thought about using something like Hyperbac? It allows you to restore the database onto your dev machine in a compressed, but still performant, form.
It's Hyperbac Online that is probably most relevant:
http://www.hyperbac.com/online/overview.asp
I have what is to me a bit of a tricky design issue in my SSAS cube. The question is related to general accounting practices: I have a fact table containing financial transactions (i.e. a ledger) and each of those transactions is tagged with a transaction date and a period. The period does NOT relate directly to a day, or a series of days. Users may close a period in the middle of a day if that is when they have finished their month's work.
I need to be able to report on Accounts Receivable (AR) by both date and period. I am not using Enterprise Edition of SSAS, so the time intelligence semi-additive options are not available to me, and even if they were, they would only allow one time dimension to use non-standard aggregation, and I believe in this case I need two that allow it.
Accounts Receivable is a running total: it should be the sum of the latest ledger item selected and everything that came before it. I know how to do this calculation in MDX for a single time dimension, but how can I make it work with two time dimensions, transaction date and period close? Is period close even considered a "time" dimension in this case? It does have a temporal aspect to it, and I do want the sums from all periods up to the current one.
I am stumped on how to relate the two time dimensions to a single fact table and use different aggregation for each. Maybe the best solution here is to have two periodic snapshot tables (instead of trying to aggregate this info from the FactLedger table), one aggregated by transaction date and one by period, which is the solution I am currently leaning towards, but I would love a second opinion.
You can most certainly have more than one time dimension in a cube, and in this case I would actually just create one common time dimension and have it role play as two, transaction date and period close. To role play a dimension, just add it to the cube again in the Dimension Usage tab of the cube designer and rename it. Set up your references appropriately to key off of the two different fact columns.
Or maybe I'm not understanding the issue correctly. This sounds pretty straight-forward.
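A sketch of what that looks like at the relational level, with illustrative column names: one physical date dimension, referenced twice from the ledger fact, and then added twice (renamed) in the Dimension Usage tab.

    -- T-SQL sketch: the ledger fact carries two keys into the same DimDate table.
    CREATE TABLE FactLedger (
        LedgerEntryKey     INT   NOT NULL,
        AccountKey         INT   NOT NULL,
        TransactionDateKey INT   NOT NULL,  -- role 1: Transaction Date
        PeriodCloseDateKey INT   NOT NULL,  -- role 2: Period Close Date
        Amount             MONEY NOT NULL
    );
    -- In SSAS, DimDate is added to the cube twice (e.g. "Transaction Date" and
    -- "Period Close Date") and each role is keyed on its own fact column.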
You can create your own time table with periods and alter your fact table's datetime format to match it. Then one dimension would be enough.
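A minimal sketch of that suggestion, with assumed names: a dedicated period table that the ledger fact references directly, so the period is just another dimension rather than something derived from the transaction date.

    -- T-SQL sketch of a standalone period dimension.
    CREATE TABLE DimPeriod (
        PeriodKey       INT PRIMARY KEY,  -- e.g. 200901 for the first period of 2009
        FiscalYear      INT NOT NULL,
        PeriodNumber    INT NOT NULL,
        PeriodCloseDate DATETIME NULL     -- when the period was actually closed
    );
    -- FactLedger would then carry a PeriodKey alongside its TransactionDateKey.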