My ETL process collects and transforms data into a database with facts and dimensions. So why should I build cubes out of this? Is there more to it than the speed benefit of queries and the pre-aggregation of values?
Thank you for helping
Facts and dimensions are tables that can be built in almost any relational database. To improve performance you can build aggregate fact tables in the same relational database. These tables are generic in that, with very little effort, you could move them from Oracle to SQL Server, as an example.
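As a rough sketch of what such an aggregate fact table looks like (the table and column names here are made up for illustration, not taken from your schema):

    -- Illustrative names only: a base fact at transaction grain rolled up to (date, product) grain.
    CREATE TABLE agg_sales_by_day_product (
        date_key      INT            NOT NULL,  -- FK to the date dimension
        product_key   INT            NOT NULL,  -- FK to the product dimension
        sales_amount  DECIMAL(18,2)  NOT NULL,
        sales_count   INT            NOT NULL,
        PRIMARY KEY (date_key, product_key)
    );

    INSERT INTO agg_sales_by_day_product (date_key, product_key, sales_amount, sales_count)
    SELECT date_key, product_key, SUM(sales_amount), COUNT(*)
    FROM   fact_sales
    GROUP BY date_key, product_key;

Because it is plain SQL, the same pattern ports between Oracle and SQL Server with only minor datatype tweaks.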
At the risk of over simplifying, a cube is a type of aggregate fact table but is built in a multi-dimensional database and is, normally, specific to that flavour of database. So if you build a cube in SSAS you couldn't move it to Hyperion Essbase.
For a simple query, such as sum transaction amount by date, cubes would not give you much/any benefit over facts. For complex queries, the performance tends to be significantly better than with facts.
Cubes normally support their own query languages (e.g. MDX and DAX) that allow much more complex queries than can normally be written in SQL (without a lot of effort).
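To make that concrete, here is the sort of question that is a one-liner with MDX/DAX time intelligence but takes an extra aggregation step and a self-join (or window functions) in SQL; the fact_sales/dim_date names are assumptions for the sketch:

    -- "Monthly sales compared with the same month last year" in plain SQL.
    WITH monthly AS (
        SELECT d.month_start, SUM(f.amount) AS sales
        FROM   fact_sales f
        JOIN   dim_date   d ON d.date_key = f.date_key
        GROUP BY d.month_start
    )
    SELECT cur.month_start,
           cur.sales,
           prev.sales AS sales_same_month_last_year
    FROM   monthly cur
    LEFT JOIN monthly prev
           ON prev.month_start = DATEADD(YEAR, -1, cur.month_start);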
So whether you should build cubes depends on a lot of factors, such as:
Are you running a lot of complex queries that would perform better in a cube?
Would the improved performance be worth the effort/cost involved?
Is there a cost/benefit case for deploying a MOLAP (cube) database as well as your dimensional database? Cubes are normally populated from facts/dimensions so they are an addition to, rather than a replacement for, facts/dimensions
I want to process my data into QlikView, but I am confused about whether to process the data through a cube or directly from SQL.
Can anyone tell me which gives better performance: a cube or SQL?
Note: I have millions of rows in the database.
Generally, as the volume of data grows, the advantages of SSAS over using SQL Server as the source tend to become more apparent. How will the data be used? When it comes to large-scale aggregations, SSAS becomes very beneficial. SSAS will also force a structured layout, as the relationships are predefined in the cube as opposed to joins. Some additional features that SSAS brings are hierarchical analysis (hierarchies) as well as ease of use with tools such as Excel and SSRS, although it sounds like you're only looking to use QlikView for this. However, your best option would be to baseline both SSAS and SQL Server in your environment with queries that best represent what would be run when this is implemented, and assess the results from there.
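As a hedged sketch of the relational side of such a baseline (table names assumed), you could time a representative aggregation like this and compare it with the equivalent MDX against the cube:

    SET STATISTICS TIME ON;

    SELECT d.calendar_year, d.calendar_month, SUM(f.sales_amount) AS total_sales
    FROM   fact_sales f
    JOIN   dim_date   d ON d.date_key = f.date_key
    GROUP BY d.calendar_year, d.calendar_month;

    SET STATISTICS TIME OFF;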
From the BI tool's perspective it doesn't matter, as you can connect to both sources (SQL is more common, but it depends on your expertise). Regarding performance, the best strategy is to have a separate extract layer and store data incrementally as QVD files (for example, loading the previous day every night), so performance is not as critical: with incremental reloads, even big data sets should load quickly.
If your original source of data is SQL, in my opinion it doesn't make sense to replicate the data in three places (SQL, cube, and QlikView). It is better to connect directly to the source, save the raw data incrementally as QVDs, and then have a transformer layer that models that data.
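The QVD write itself happens in the QlikView load script, but the relational side of that nightly incremental pull might look something like this (table and column names are assumptions):

    -- Pull only yesterday's rows; QlikView then appends them to the stored QVD.
    SELECT *
    FROM   fact_sales
    WHERE  transaction_date >= DATEADD(DAY, -1, CAST(GETDATE() AS date))
    AND    transaction_date <  CAST(GETDATE() AS date);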
I am trying to do a case study on migrating our data from Sybase to Teradata. We would then try to implement the same in our organization. Based on my study, what I know is that Teradata is excellent when it comes to allowing a large number of concurrent queries, more active sessions per system, and a high volume of transactions, including ETL jobs. But are there any other advantages we can get (where Sybase lacks) by moving to Teradata? Also, what key factors or constraints should I consider while performing such a migration?
Actually, Michael, Sybase IQ can be compared to TD as an OLAP solution.
To the OP: cost comparison would be a huge advantage for Sybase. Also, you can maintain Sybase yourself, whilst for Teradata you're likely to contract Teradata staff to maintain the hardware.
But no matter the size, you can bring Teradata to its knees just as easily with quite a few concurrent queries or poorly written ETLs. Teradata does perform very well, though, on heavy aggregation of huge dimensions, if the data is spread well (not skewed) across controllers.
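For what it's worth, distribution (and therefore skew) in Teradata is driven by the PRIMARY INDEX you pick; a hedged, names-assumed sketch:

    -- Teradata hashes rows across AMPs by the PRIMARY INDEX, so a high-cardinality
    -- key spreads a huge fact table evenly; a low-cardinality key such as region_id
    -- would pile rows onto a few AMPs and skew the table.
    CREATE TABLE fact_sales (
        transaction_id BIGINT NOT NULL,
        region_id      INTEGER,
        sales_amount   DECIMAL(18,2)
    )
    PRIMARY INDEX (transaction_id);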
I am accessing OLAP SSAS Cubes on a 2005 SQL Server using Excel 2007 pivot tables and finding that refreshing some of the tables is taking >10 minutes. My coworkers seem to think it is a sad reality, but I am wondering if there are alternatives I should be looking into.
Some thoughts I have had:
Obviously if I could upgrade the server hardware I would, but I am merely an analyst with no such powers, so I don't think hardware improvements are a great option. The same is true of moving to a newer SQL server, which I imagine would also speed up the process.
Would updating to a newer version of Excel speed up the process?
I came across this: http://olappivottableextend.codeplex.com/, which gives me access to the MDX, which is apparently comically inefficient (sounds like the VBA macro recorder to me). So would changing the MDX around be an option? (I know a bit of MDX, and the queries it generates for the pivot tables don't seem that complicated.)
Would running MDX outside of Excel be an option? I can write the queries, but I imagine it would not be as simple as the pivot table is.
It just seems like OLAP Cubes are a great solution in a lot of ways and these are some massive pivot tables processing quite a bit of information, but if there is a reasonable way to speed up the whole process I would love to know more about it.
Thanks for your thoughts SO.
There are many ways to access SSAS cubes, but it depends on what you are trying to achieve.
Excel tends to be used by business because
It's already installed
It is a familiar business tool
Easy to use
Requires no developer intervention
Other alternatives to Excel to access the cube include
SQL Server Analysis Services (management studio) via cube browser or mdx directly
SQL Server Reporting Services
Bespoke development (such as c#) utilising AdomdConnection
SQL Server (Management Studio) via OpenQuery (see the sketch after this list)
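As an example of the OpenQuery route, a hedged sketch: it assumes a linked server (called SSAS_CUBE here) has already been set up against the Analysis Services instance, and the cube, measure, and hierarchy names are all placeholders:

    SELECT *
    FROM OPENQUERY(SSAS_CUBE,
        'SELECT [Measures].[Internet Sales Amount] ON COLUMNS,
                [Date].[Calendar Year].MEMBERS     ON ROWS
         FROM   [Adventure Works]');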
If you have been using Excel to access the cube so far, you will probably decide that none of the other tools quite cover your needs and you will end up sticking with it.
Assuming that Excel is the right tool for you, you should then move on to why it is slow. The list of possibilities (not including hardware / software) is long, but here are some:
There could be contention (external to your project) on network / database / disk resources. The volume of data may be accumulating over time.
The cube may not be partitioned.
The questions you ask of it may be getting more complex.
The cube's aggregations may not be utilised by your queries.
The cube structure may be inefficient because it supports many-to-many relationships.
User / query volume may have increased
To try to address the problem, I would:
Assess the data that you require within the cube (and maybe limit the cube to a rolling x month window)
Log your queries and apply Usage Based Optimisation (see the sketch after this list)
Monitor cube usage via SQL Server Profiler
Review the structure of your cube design
Attempt similar queries with other tools (both across the network and local to the cube) to establish where the issue lies
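For the query-logging step above, a hedged sketch: it assumes the SSAS query log has been enabled (via the QueryLog server properties), which writes to a relational table, OlapQueryLog by default. The slowest or most frequently hit subcubes it records are the candidates for Usage Based Optimisation aggregations.

    SELECT   MSOLAP_ObjectPath,
             COUNT(*)      AS query_count,
             AVG(Duration) AS avg_duration_ms
    FROM     dbo.OlapQueryLog
    GROUP BY MSOLAP_ObjectPath
    ORDER BY avg_duration_ms DESC;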
These two resources may help you if you establish that Excel is the weak point: Excel, Cube Formulas, Analysis Services, Performance, Network Latency, and Connection Strings, or the same article on page 57 of SQLCAT's Guide to BI and Analytics.
Assume that the tables in the source data are clean and in a state where they can be used directly.
I am trying to understand whether building views based off the raw tables is better than creating cubes. To make the views dynamic, we can have a .NET application which would take parameters for the view, execute a view with parameters, and get the data for reporting and analysis.
For example, if I want to view the sales of a product for the United States in the month of February, I can create a view joining Product and Customer to get the sales for a particular day in February.
This is instead of forming a star schema with Product, Date, and Customer dimensions. I am really trying to understand what standard a company should go with.
I have folks telling me cubes are only good for analysis, not good for reporting, and that whatever information we want, we can get by creating dynamic views.
Any advice or ideas on this?
Thanks!!
As the name suggests, SSAS (SQL Server Analysis Services) is indeed built for analysis. The reason for this is the dimensional table structure (e.g., the star schema), which allows for very efficient indexing combined with the pre-processing of aggregated values.
Views are a great way to take data that already exists within your OLTP (as compared to OLAP) database and transform it in a manner that better fits your querying needs. This works in the same manner as "get" stored procedures.
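For the scenario in the question (sales of a product for the United States in February), such a view might look like the following; all object names are made up for the sketch, and the year in the filter is just a placeholder:

    CREATE VIEW dbo.vw_product_sales AS
    SELECT p.product_name,
           c.country,
           s.sale_date,
           s.sales_amount
    FROM   dbo.Sales    s
    JOIN   dbo.Product  p ON p.product_id  = s.product_id
    JOIN   dbo.Customer c ON c.customer_id = s.customer_id;
    GO

    -- "Sales of a product for the United States in the month of February":
    SELECT product_name, SUM(sales_amount) AS total_sales
    FROM   dbo.vw_product_sales
    WHERE  country   = 'United States'
    AND    sale_date >= '20230201' AND sale_date < '20230301'
    GROUP BY product_name;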
Now for my opinion:
If you have a small amount of data (relative to the power of your server, as well as many other factors) and you're not performing intense aggregations of the data, consider using stored procedures to access your database. You can specify the parameters in .NET like any other function call, making this method super easy.
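A hedged sketch of that parameterised "get" procedure (object names assumed), which a .NET caller would invoke with SqlCommand parameters:

    CREATE PROCEDURE dbo.usp_GetProductSales
        @Country    NVARCHAR(60),
        @MonthStart DATE
    AS
    BEGIN
        SET NOCOUNT ON;

        SELECT p.product_name,
               SUM(s.sales_amount) AS total_sales
        FROM   dbo.Sales    s
        JOIN   dbo.Product  p ON p.product_id  = s.product_id
        JOIN   dbo.Customer c ON c.customer_id = s.customer_id
        WHERE  c.country    = @Country
        AND    s.sale_date >= @MonthStart
        AND    s.sale_date <  DATEADD(MONTH, 1, @MonthStart)
        GROUP BY p.product_name;
    END;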
If you have a lot of data (like, over 100 million rows), consider creating a cube. This will allow your queries to fly. There's a lot more work that goes into these, but the speed payoff is huge.
End note:
If the data in your reports is pretty similar to the data you already have in your database (including JOINing the tables) and you have under half a billion rows, just use a stored proc, and look into using SSRS (or not). If you have a ton of data that needs to be aggregated and transformed, look into SSAS OLAP cubes.
From my limited experience with Microsoft's Analysis Services, I would agree with Norla. If the execution time of the view is reasonable, that would be the way to go. Cubes can certainly be reported against, as SQL Reporting Services accommodates them fairly well, but the development process can often be much more involved when using a cube as your data source.
Building views can be an alternative for small datasets. You could consider going that route, BUT:
1) once the reports start taking a long time to load, or
2) the reporting starts slowing the transactional systems,
then you'll have to consider cubes.
This question is related to another:
Will having multiple filegroups help speed up my database?
The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.
Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this provides us enormous performance improvements over referring back to the source data.
The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places a pretty enormous theoretical limit on the size of sysobjects (~2 billion rows), but once our database grows beyond 100,000 tables or so, simple operations like CREATE TABLE and DROP TABLE can slow down dramatically.
We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?
Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, disk speed). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.
Without first seeing the entire system, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key -- a dimensional model may also be relevant here. This table can be partitioned for improved performance, which will also allow you to spread the table across other filegroups.
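A hedged sketch of that combined, RunID-keyed table (all names and boundary values are placeholders, and the filegroups would have to exist first):

    CREATE PARTITION FUNCTION pf_RunId (INT)
        AS RANGE RIGHT FOR VALUES (100, 200, 300);   -- placeholder boundaries

    CREATE PARTITION SCHEME ps_RunId
        AS PARTITION pf_RunId TO (fg_runs_1, fg_runs_2, fg_runs_3, fg_runs_4);

    CREATE TABLE dbo.AnalysisResult (
        RunId       INT            NOT NULL,
        RowId       BIGINT         NOT NULL,
        MetricName  VARCHAR(100)   NOT NULL,
        MetricValue DECIMAL(38,10) NULL,
        CONSTRAINT PK_AnalysisResult PRIMARY KEY CLUSTERED (RunId, RowId)
    ) ON ps_RunId (RunId);

Old runs then sit in their own partitions (and filegroups) and can be switched out or archived without thousands of CREATE/DROP operations against the system catalog.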
Another possibility is to put each run in its own database and then detach them, only attaching them as needed (and in read-only form).
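A minimal sketch of the detach/attach approach (database name and file paths are placeholders):

    -- Archive a finished run database...
    EXEC sp_detach_db @dbname = N'AnalysisRun_42';

    -- ...and attach it read-only later, only when that run is recalled.
    CREATE DATABASE AnalysisRun_42
        ON (FILENAME = N'D:\Runs\AnalysisRun_42.mdf'),
           (FILENAME = N'D:\Runs\AnalysisRun_42_log.ldf')
        FOR ATTACH;

    ALTER DATABASE AnalysisRun_42 SET READ_ONLY;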
CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.
I also recommend talking to Microsoft about your choice of database design.
Are the tables all different structures? If they are the same structure you might get away with a single partitioned table.
If they are different structures, but just subsets of the same set of dimension columns, you could still store them in partitions in the same table with nulls in the non-applicable columns.
If this is analytic (derivative pricing computations perhaps?) you could dump the results of a computation run to flat files and reuse your computations by loading from the flat files.
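A hedged sketch of the flat-file round trip (paths, table names, and the RunId filter are placeholders); the export uses the bcp command line, shown here as a comment:

    -- bcp "SELECT * FROM AnalysisDB.dbo.AnalysisResult WHERE RunId = 42" queryout D:\Runs\run42.dat -n -T -S MYSERVER

    -- Reload the saved run instead of recomputing it.
    BULK INSERT dbo.AnalysisResult
    FROM 'D:\Runs\run42.dat'
    WITH (DATAFILETYPE = 'native');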
This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)
You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that are not evident in your post. For example, how long do you plan to keep the run analysis tables? There are a LOT of other questions that need to be asked.
You are going to need a combination of serious data warehousing, and data/table partitioning. Depending on how much data you want to keep and archive you may need to start de-normalizing and flattening the tables.
This would be a pretty good case where contacting Microsoft directly could be mutually beneficial. Microsoft gets a good case to show other customers, and you get help directly from the vendor.
We ended up splitting our database into multiple databases. So the main database contains a "databases" table that refers to one or more "run" databases, each of which contains distinct sets of analysis results. Then the main "run" table contains a database ID, and the code that retrieves a saved result includes the relevant database prefix on all queries.
This approach allows the system catalog of each database to be more reasonable, it provides better separation between the core/permanent tables and the dynamic/run tables, and it also makes backups and archiving more manageable. It also allows us to split our data across multiple physical disks, although using multiple filegroups would have done that too. Overall, it's working well for us now given our current requirements, and based on expected growth we think it will scale well for us too.
We've also noticed that SQL 2008 tends to handle large system catalogs better than SQL 2000 and SQL 2005 did. (We hadn't upgraded to 2008 when I posted this question.)