SSAS - should I use views or underlying tables? - sql-server

I have a set of views set up in SQL Server which output exactly the results that I would like to include in a SQL Server Analysis Services cube, including the calculation of a number of dimensions (such as Age using DATEDIFF, business quarter using DATENAME etc.). What I would like to know is whether it makes sense to use these views as the data source for a cube, or whether I should use the underlying tables to reproduce the logic in SSAS. What are the implications of going either route?
My concerns are:
the datasets are massive, but we need quick access to the results, so I would like as many as possible of the calculations done in the views to be persisted within the SSAS data warehouse
Again, because the datasets are massive, I want the reprocessing of any cubes to be as fast as possible

Many experts actually recommend using views in your data source view (DSV) in SSAS. John Welch (Pragmatic Works, Microsoft MVP, Analysis Services Maestro) spoke about how he prefers using views in the DSV this year at SQL Rally Dallas. The reason is that it creates a layer between the cube and the physical tables.
Calculating columns in the view will take a little extra time and resources during cube processing. If processing time is ok, leave the computations in the view. If it's an issue, you can always add a persisted computed column directly to the fact table so that the calculation is done during the insert / update of the fact table. The disadvantage of this is that you'll have to physically store the columns in the fact table. The advantage is that they don't have to be computed every time the cube gets processed. These are the tradeoffs that you'll need to weigh to decide which way to go.
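If you go the computed-column route, a minimal sketch looks like the following (the fact table and column names are assumptions for illustration, not from the question); PERSISTED makes SQL Server store the result at insert/update time instead of computing it during cube processing:

    -- Hypothetical fact table and columns. DATEDIFF over two stored columns
    -- is deterministic, which is required for a PERSISTED computed column.
    ALTER TABLE dbo.FactSales
        ADD CustomerAge AS DATEDIFF(YEAR, BirthDate, OrderDate) PERSISTED;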
Just make sure you tune the source queries to be as efficient as possible. Views are fine for DSVs.

Views, always! The only advantage of using tables in the DSV is that it will map your keys automatically :) which saves you 5 minutes of development time haha.
Also, by "use the underlying tables to reproduce the logic in SSAS" do you mean creating calculated columns in your SSAS DSV? That is an option too, but I'd rather add the calculations to the views because, in case I have to update them, it is MUCH easier (and less subject to failure) to re-deploy a view than to redeploy a full cube.
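As a rough sketch of keeping the calculations in the view (all table and column names here are assumed for illustration):

    -- The cube's DSV points at this view; changing the logic later is just
    -- an ALTER VIEW, with no cube redeployment as long as the shape is stable.
    CREATE VIEW dbo.vwCustomerFacts
    AS
    SELECT f.OrderID,
           DATEDIFF(YEAR, c.BirthDate, f.OrderDate) AS Age,
           'Q' + DATENAME(QUARTER, f.OrderDate)     AS BusinessQuarter,
           f.SalesAmount
    FROM dbo.FactOrders AS f
    JOIN dbo.DimCustomer AS c
        ON c.CustomerKey = f.CustomerKey;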

Related

Database table creation for large data

I am making a client management application that stores data for employees, admins, and companies. In the future the database will have hundreds of companies registered, so I am trying to choose the best approach to the database design.
I can think of 2 approaches:
Creating a separate set of tables for each company
Storing all companies' data in one shared set of application tables
Can you suggest the best way to do that?
Please note that all 3 tables are linked by IDs; there will be hundreds of companies, each company will have many admins, and each admin will have hundreds of employees. What would be the best approach with respect to security and query performance?
With the partial information you provided, it looks like 3 normalized tables are what you need, plus auxiliary data like lookups and other stuff.
But when you design a database you need to consider many more points: security, visibility, client access methods, etc.
For example, if you want to ensure isolation and not allow users any visibility into other companies' data, you could dynamically create a schema per company and create users and access rights for each schema on the fly. You'll then need to support all this in the DAL, which in fact will be quite fat.
Another approach for the DAL could be exposing views that always return subsets for one company, as in the sketch below.
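A minimal sketch of that idea, with hypothetical names (one schema per tenant, plus a view scoped to that tenant's rows):

    CREATE SCHEMA AcmeCorp;   -- generated per company
    GO
    CREATE VIEW AcmeCorp.Employees
    AS
    SELECT EmployeeID, Name, AdminID
    FROM dbo.Employee
    WHERE CompanyID = 42;     -- AcmeCorp's key, baked in at creation time
    GO
    -- Assumes a per-company login/user has already been created
    GRANT SELECT ON AcmeCorp.Employees TO AcmeCorpUser;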
A big reason that I would suggest going with the normalized approach is that maintenance will be much easier this way.
From a SQL point of view I don't see any performance advantage in having many tables rather than just 3; efficient indexes and a smart DAL will make the difference.
The performance of a query doesn't depend much on the size of the table; it depends more on the indexes you have on that table. So you need to put clustered and nonclustered indexes in place per your requirements (see the sketch below), and I can guarantee that up to 10 GB of data you will not face any problem.
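Something like this (names and key choices are illustrative only):

    -- Cluster on the company key so each company's rows are stored together,
    -- then cover a frequent lookup with a nonclustered index.
    CREATE CLUSTERED INDEX IX_Employee_Company
        ON dbo.Employee (CompanyID, EmployeeID);

    CREATE NONCLUSTERED INDEX IX_Employee_Admin
        ON dbo.Employee (AdminID) INCLUDE (Name);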
This is a classic problem shared by most web business services: for discussion of the factors involved, Google "multi-tenant architecture."
You almost certainly want to put all companies into a common set of tables: each data table should reference the company key, and all queries should join on that key, among their other criteria. This allows the best overall performance, and saves you the potential maintenance nightmare of duplicating views, stored procedures and so on hundreds of times, or of having to apply the same structural changes to hundreds of tables should you wish to add a field or a table.
To help assure that you don't inadvertently intermingle data from different customers, it might be useful to do all data access through a validated set of stored procedures (all of which take the company ID as a parameter).
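A minimal sketch of such a procedure, with hypothetical names:

    -- Every data table carries the company key and every procedure filters
    -- on it, so no code path can return another tenant's rows.
    CREATE PROCEDURE dbo.GetEmployeesForCompany
        @CompanyID INT
    AS
    BEGIN
        SET NOCOUNT ON;
        SELECT e.EmployeeID, e.Name, e.AdminID
        FROM dbo.Employee AS e
        WHERE e.CompanyID = @CompanyID;
    END;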
Hundreds of parallel databases will not scale very well: the DB server will constantly be pushing tables and indexes out of memory to accommodate the next query, resulting in disk thrashing and poor performance, as well. There is only pain down that path.
There is no single "best" way; it depends on the use cases of your application.
Please explain the operations your application will provide so we can get further insight into your problem.
The data to be stored seems to be structured, so at first glance a relational database would work out well, but stick to the point I marked above.
You have not said how this data links together, or whether there are even any links between them. However, at a guess, you need 3 tables:
EmployeeTable
AdminTable
CompanyTable
Each with the required properties in there; without additional information I'm not able to provide more guidance than a skeleton like the one below.
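A minimal sketch, with assumed columns for the links:

    CREATE TABLE CompanyTable (
        CompanyID INT PRIMARY KEY,
        Name      NVARCHAR(200) NOT NULL
    );

    CREATE TABLE AdminTable (
        AdminID   INT PRIMARY KEY,
        CompanyID INT NOT NULL REFERENCES CompanyTable (CompanyID)
    );

    CREATE TABLE EmployeeTable (
        EmployeeID INT PRIMARY KEY,
        AdminID    INT NOT NULL REFERENCES AdminTable (AdminID)
    );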

Parallel bulk loading using partition switching of indexed table in SQL Server 2008

This is a follow up to a previous question of mine after definitely deciding on partition switching as the best way to quickly get data into a heavily indexed fact type table that needs to remain available to readers.
While it seems to be the best way, it is not quite good enough to really satisfy the requirement to allow several (< 5) users to bulk insert at the same time, have the new data indexed, and have it appear in the indexed views (not necessarily real indexed views, just selects that rely on indexes).
The idea of partitioning was that each partition, and the index subtree rooted at it, could in parallel be locked as read-only, copied into a working table, have new data inserted/updated and the indexes rebuilt, and then be switched back into the main table so readers aren't affected.
The problem is the single working table. Each parallel bulk insert needs its own copy, with the same constraints as the main table to allow switching.
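For reference, the single-writer version of the pattern looks roughly like this (table names and partition numbers are placeholders):

    -- Switch the target partition out to the working table (metadata-only)
    ALTER TABLE dbo.FactSales SWITCH PARTITION 3 TO dbo.FactSales_Work;

    -- Load and reindex the working table while readers keep using FactSales
    -- (dbo.StagingRows is a hypothetical source of the new data)
    INSERT INTO dbo.FactSales_Work SELECT * FROM dbo.StagingRows;
    ALTER INDEX ALL ON dbo.FactSales_Work REBUILD;

    -- Switch it back in; readers see only a brief schema lock
    ALTER TABLE dbo.FactSales_Work SWITCH TO dbo.FactSales PARTITION 3;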
So far I've hit several walls trying to get around this bottleneck:
I tried partitioning the working table using the same partition function. This doesn't work because you can't disable the indexes on a partition basis to insert into one while rebuilding the index on another.
Creating a temporary table as the working table. This doesn't work because, while you can use the same index names, you can't easily dynamically create the constraints and can't switch that in anyway.
Have a fixed set of named working tables? How can I select one and work with it under an alias so I have just one stored proc?
Dynamic SQL? I've tried very hard to avoid going that route. It's complicated as it is.
Big challenge, but has anyone got any ideas before I accept the bottleneck? Would SQL 2012 help? How do proper data warehouses cope with this?
How do proper data warehouses cope with this? Compromise and set realistic goals for the EDW. The data warehouse can't be everything to everyone. Make sure that what you're implementing is the best solution for the business (not just the techies/analysts). Are your goals realistic if you cannot find solutions from experienced peers and experts?
Associate a cost with all of the hoops you jump through. Does the data really need to be up to the minute? What if I told you that we needed to spend another $200,000 on storage because we're constantly duplicating partitions and rebuilding indexes and the current solution can't keep up with the IOPS demand? At some point, they're going to figure out that it's not free. While you don't need to just say no, you do need to be realistic and up-front about the cost associated. Additionally, your storage admin will thank you.
As for 2012, there is a new columnstore index which can reduce or replace all of the current nonclustereds you're using to cover all your analysts' search requests. It's highly compressed, covers a very wide variety of search arguments, and utilizes the new Batch execution mode. It performs best on low-selectivity queries like the ones frequently performed on fact tables. The one catch is that you can't directly do updates. You'll have to switch the partition out to a staging table, drop the columnstore on the staging table, update the staging table, add the columnstore back, then switch the partition back into the fact table. It sounds like a lot, but it could be significantly faster and require less IO than maintaining all of those nonclustereds.
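A rough sketch of that cycle (all names are placeholders; in 2012 only the nonclustered columnstore exists, and it makes the table read-only, hence the drop/recreate):

    -- 1. Switch the partition out to a staging table with matching structure
    ALTER TABLE dbo.FactSales SWITCH PARTITION 5 TO dbo.FactStage;

    -- 2. Drop the columnstore so the staging table becomes writable
    DROP INDEX CSI_FactStage ON dbo.FactStage;

    -- 3. Apply the updates (illustrative statement)
    UPDATE dbo.FactStage SET Status = 'Adjusted' WHERE OrderID = 1001;

    -- 4. Recreate the columnstore so the index structures match again
    CREATE NONCLUSTERED COLUMNSTORE INDEX CSI_FactStage
        ON dbo.FactStage (OrderID, Status, Amount);

    -- 5. Switch the partition back into the fact table
    ALTER TABLE dbo.FactStage SWITCH TO dbo.FactSales PARTITION 5;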
My question has always been "Is it really a fact table if it is constantly changing?". This is not OLTP, is it? Try offsetting transactions, or at least push all updates to a scheduled off-peak time. Updating fact tables is becoming a thing of the past. All of the big boys are moving toward the "update frowned upon" column-oriented architecture for data warehousing. PowerPivot and the Analysis Services Tabular Model are built on the columnstore technology.
Finally, review Kimball's DW Toolkit books. He has several that lay out best practices and cover edge-case scenarios. What I learned from them was that data warehouse development is not just database development on steroids. It also involves politics and focusing resources on what's best for the business.

Business Intelligence - analyzing events rather than aggregates? What's the right approach

I currently analyze our customer data and trends with a number of SQL queries, and testing a hypothesis can be time-expensive.
For instance, we have a table of our customer info and a table of our customer service calls, indexed by customer. I'd like to find out if a particular cohort of customers had more CS issues than another; and if there is any correlation between customer service calls and increased cancel rates.
I was looking into MS's BI studio, as we're running MSSQL 2008 already; but most of what I've read focuses on carefully constructed MDX cubes that aggregate numerical data; so in the above model, I'd have to build a cube of facts (number of CS calls and types) and then use the customer data as dimensions. Fair enough, but in the time it'd take me to do that, I could just write the query manually in TSQL.
My DB is small enough that the speed gains from a separate datamart aren't necessary -- what I'm looking for is a flexible way of looking at my data, by creating a Customer 'Object' and tying all sorts of data, actions and numerical values to them. And I'd rather have the data extracted from my existing tables rather than having to ETL to a separate table.
Ideally at some point, I'd be able to use Data Mining tools for predictive analysis, but right now I'm going after low hanging fruit -- do customers from this ad campaign cancel more than the other one; etc.
Am I barking up the wrong tree with SQL Analysis Services/MDX cubes? Or does what I'm talking about not exist easily to begin with? Any advice, directions to products, or insight greatly appreciated.
It depends on who you want to do the analysis. If you are the one who is going to do the analysis, you know SQL, and you understand the structure of your data, then there's no real benefit to doing extra work to simply change the structure of the data. You want to use BI tools when you want to make that data available to others who don't know SQL, and don't necessarily know the relationships between different tables of data that are out there. You're in essence adding an abstraction layer to hide all this complexity from them, but still allow them to do the analysis. Of course the side effect of the abstraction is that you end up adding some limitations, but the trade-off is that the information is available to more people.
Don't waste your time with SSAS/cubes. Your dataset is small and the scope of your problem is narrow... so there's no need for you to build a cube. Instead, you should give the Excel Data Mining add-in a test run. It's pretty powerful and works well with small datasets. It is the low-hanging fruit I believe you are looking for. Plus, users feel comfortable using Excel.
SSAS is not necessary for creating data mining structures/models; it is only necessary if you want to automate the process.
Building a cube first only helps when you have a very large dataset: because of the cube's speed, it will allow the data mining algorithms to run faster. Even if you use SSAS to build your data mining structure/model(s), you still don't need a cube... you can build the structure/model(s) off of relational tables, provided your database tables are designed correctly.

How to improve ESRI/ArcGIS database performance while maintaining normalization?

I work with databases containing spatial data. Most of these databases are in a proprietary format created by ESRI for use with their ArcGIS software. We store our data in a normalized data model within these geodatabases.
We have found that the performance of this database is quite slow when dealing with relationships (i.e. relating several thousand records to several thousand records can take several minutes).
Is there any way to improve performance without completely flattening/denormalizing the database or is this strictly limited by the database platform we are using?
There is only one way: measure. Try to obtain a query plan, and try to read it. Try to isolate a query from the logfile, edit it into an executable (non-parameterised) form, and submit it manually (in psql). Try to tune it, and see where it hurts.
Geometry joins can be costly in terms of CPU if many (big) polygons have to be joined and their bounding boxes are likely to overlap. In the extreme case, you'll have to do a preselection on other criteria (e.g. zipcode, if available) or maintain cache tables of matching records.
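A hypothetical prefilter along those lines (PostGIS-style SQL, since this answer assumes a Postgres backend; table and column names are made up):

    -- Join on the cheap attribute first so the expensive geometry
    -- predicate only runs on candidate pairs
    SELECT p.id, z.id
    FROM parcels p
    JOIN zones   z
      ON p.zipcode = z.zipcode
     AND ST_Intersects(p.geom, z.geom);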
EDIT:
BTW: do you have statistics and autovacuum? IIRC, ESRI is still tied to postgres-8.3-something, where these were not run by default.
UPDATE 2014-12-11
ESRI does not interfere with non-GIS stuff. It is perfectly OK to add PK/FK relations or additional indexes to your schema. The DBMS will pick them up if appropriate, and ESRI will ignore them. (ESRI only uses its own meta-catalogs, ignoring the system catalogs.)
When I had to deal with spatial data, I tended to precalculate the values and store them. Yes, that makes for a big table, but it is much faster to query when you only do the complex calculation once, on data entry. Data entry does take longer though. I was in a situation where all my spatial data came from a monthly load, so precalculating wasn't too bad.

What should I have in mind when building OLAP solution from scratch?

I'm working for a company running a software product based on a MS SQL database server, and through the years I have developed 20-30 quite advanced reports in PHP, taking data directly from the database. This has been very successful, and people are happy with it.
But it has some drawbacks:
For new changes, it can be quite development intensive
The user can't experiment much with the data - it is locked to a hard-coded view
It can be slow for big reports
I am considering gradually moving to an OLAP-based approach, which can be queried from Excel or some web-based service. But I would like to do this in a way that introduces the least amount of new complexity in the IT environment - the fewest different services, synchronization jobs, etc.!
I have some questions in this regard:
1) Workflow-related:
What is a good development route from "black box SQL server" to "OLAP ready to use"?
Which servers and services should be set up, and which scripts should be written?
Which are the hardest/most critical/most time-intensive parts?
2) ETL:
I suppose it is best to have separate servers for the data warehouse and the production SQL Server?
How are these kept in sync (push/pull)? Using which technologies/languages?
For me SSIS looks overly complicated, and the graphical workflow doesn't appeal much to me -- I would rather have a text-based script that does the job. Is this feasible?
Or is it advantageous to use the graphical client with only one source and one destination?
3) Development:
How much of this (data integration, analysis services) can be efficiently maintained from a CLI tool?
Can the setup be transferred back and forth between production and development easily?
I'm happy with any answer that covers just some of this - and even though it is a MS environment, I'm also interested to hear about advantages in other technologies.
I only have experience with Microsoft OLAP, so here are my two cents regarding what I know:
If you are implementing cubes, then separate the production SQL Server from the source for the cubes. Cubes require a lot of SELECT DISTINCT column_name FROM source.table. You don't want cube processing to block your mission critical production system.
Although you can implement OLAP cubes with standard relation tables, you will quickly find that unless your data is a ledger-style system you will probably need to fully reprocess your fact and dimension tables and this will require requerying the source database over and over again. That's a large argument for building a separate data warehouse that uses ledger-style transactions for the fact tables. For instance, if a customer orders something and then cancels it, your source system may track this as a status change. In your fact table, you probably need to show this as a row for ordering that has a positive quantity and revenue stream and a row for cancelling that has a negative quantity and revenue stream.
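For example, in a ledger-style fact table a cancellation is a new offsetting row rather than an update (hypothetical table and values):

    INSERT INTO dbo.FactOrders (OrderID, EventType, Quantity, Revenue)
    VALUES (1001, 'Ordered',    2,  59.98),
           (1001, 'Cancelled', -2, -59.98);  -- offsets the order; nothing is updated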
OLAP may be overkill for your environment. The main issue you appeared to raise was that your reports are static and users want access to the data directly. You could build a data model and give users Report Builder access in SSRS, or report writing access in some other BI suite like Cognos, Business Objects, etc. I don't generally recommend this approach since it is way beyond what most users should have to know to get data, but in a small shop this may be sufficient and it is easy to implement. Let's face it -- users generally just want to get the data into Excel to manipulate it further. So if you don't want to give them a web front-end and you just want them to get to the data from Excel, you could give them direct database access to a copy of the production data. The downside of this approach is that users don't generally understand SQL or database relationships. OLAP helps you avoid forcing users to learn SQL or relationships, but it isn't easy to implement on your end. If you only have a couple of power users who need this kind of access, it could be easy enough to teach the few power users how to do basic queries in Excel against the database and they will be happy to get this tomorrow. OLAP won't be ready by tomorrow.
If you only have a few kinds of source data systems, you could get away with building a super-dynamic static report. For instance, I have a report that was written in C# that basically allows users to select as many columns as they want from a list of 30 columns and filter the data on a few date range fields and field filter lists. This simple report covers about 40% of all ad hoc report requests from end-users since it covers all the basic, core customer metrics and fields. We recently moved this report to SSRS and that allowed us to up the number of fields to about 100 and improved the overall user experience. Regardless of the reporting platform, it is possible to give users some dynamic flexibility even in the confines of a static reporting system.
If you only have a couple of databases, you can probably backup and restore the databases as your ETL. However, if you want to do anything beyond that, then you might as well bite the bullet and use SSIS (or some other ETL tool). Once you get into ETL for data warehousing, you are going to use a graphic-oriented design tool. Coding works well for applications, but ETL is more about workflows and that's why the tools tend to converge on a graphical UI. You can work around this and try to code a data warehouse from a text editor, but in the end you are going to lose out on a lot. See this post for more details on the differences between loading data from code and loading data from SSIS.
FEEDBACK ON HOW TO USE CUBES WITH A RELATIONAL DATA STORE
It is possible to implement a cube over a relational data store, but there are some major problems with using this approach. The main reason it is technically feasible has to do with how you configure your DSV. The DSV is essentially a logical layer between the physical database and the cube/dimension definitions. Instead of importing the relational tables into the DSV, you could define Named Queries or create views in the database that flatten the data.
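As a sketch, a flattening view for the DSV might look like this (source table and column names are assumed); it derives a surrogate date key and scrubs NULLs so the cube never sees them:

    CREATE VIEW dbo.vwFactOrders
    AS
    SELECT o.OrderID,
           CONVERT(INT, CONVERT(CHAR(8), o.OrderDate, 112)) AS OrderDateKey, -- YYYYMMDD
           COALESCE(o.Quantity, 0)   AS Quantity,
           COALESCE(o.Revenue, 0.00) AS Revenue
    FROM dbo.Orders AS o;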
The advantages of this approach are as follows:
It is relatively easy to implement since you don't have to build an entire ETL subsystem to get started with OLAP.
This approach works well for prototyping how you want to build a more long-term solution. You can prototype it in 1-2 days and show some of the benefits of OLAP today.
Some very, very large tables don't have to be completely duplicated just to support an OLAP cube. I have several multi-billion row tables that are almost completely standardized fact tables. The only columns they don't have are date keys, and they also contain some NULL values on fields that shouldn't have nulls at all. Instead of duplicating these very massive tables, you can create the surrogate date keys and set values for the nulls in the view or named query. If you aren't going to see a huge performance boon from duplicating the table, then this may be a candidate for leaving in a more raw format in the database itself.
The disadvantages of this approach are as follows:
If you haven't built a true Kimball method data warehouse, then you probably aren't tracking transactions in a ledger-style. Kimball method fact tables (at least as I understand them) always change values by adding and subtracting rows. If someone cancels part of an order, you can't update the value in the cube for the single transaction. Instead, you have to balance out the transaction with a negative value. If you have to update the transaction, then you will have to fully reprocess the partition of the cube to replace the value which can be a very expensive operation. Unless your source system is a ledger-style transaction system, you will probably have to build a ledger-style copy in your ETL subsystem.
If you don't build a Kimball method data warehouse, then you are probably using unobscured and possibly non-integer primary keys in your database. This directly impacts query performance inside the cube. It also sets you up for having a theoretically inflexible data warehouse. For instance, if you have a product ordering system that uses an integer key and you start using a second product ordering system either as a replacement for the legacy system or in tandem with the legacy system, you may struggle to combine the data together merely through the DSV since each system has different data points, metrics, workflows, data types, etc. Worse, if they have the same data types for the order id and the order id values overlap between systems, then you must declare a surrogate key that you can use across both systems. This can be difficult, but not impossible, to implement without using a flattened data warehouse.
You may have to build the system twice if you start with the relational data store and then move to a flattened database. Frankly, I think the amount of duplicated work is trivial. Most of what you learned building the cube off a relational data store will translate to setting up the new OLAP cube. The main problem, though, is that you will probably create a new cube altogether and then any users of the old cube will have to migrate to the new cube. Any reports built in SSRS or Excel will probably break at that point and need to be rewritten from the ground up. So the main cost of rebuilding the cube is really on rebuilding dependent objects -- not on rebuilding the cube itself.
Let me know if you want me to expand on any of the above points. Good luck.
You're basically asking the million dollar question of "How do I build a DWH". This is not really a question that can decisively be answered.
Nevertheless, here is a kickstart:
If you are looking for a minimum viable product, be aware that you are in a data environment, and not a pure software one. In data-heavy environments, it is much harder to incrementally build a product, because the amount of effort to introduce changes in the system is much greater. Think about it as if every change you make in a piece of software has to be somehow backwards-compatible with anything you've ever done. Now you understand the hell Microsoft are in :-).
Also, data systems involve many third-party tools such as DBs, ETL tools and reporting platforms. The choices you make should be viable for the expected development of your system, else you might have to completely replace these tools down the road.
While you can start with DB cloning based on simple copy SQL and then aggregate it or push it into an OLAP store, I would recommend getting your hands dirty with a real ETL tool from the start. This is especially true if you foresee the need to grow. 9 times out of 10, the need will grow.
MS-SQL is a good choice for a DB if you don't mind the cost. The natural ETL tool would be SSIS, and it's a solid tool as well.
Even if your first transformations are merely "take this table and dump it in there", you still gain a lot in terms of process management (has the job run? What happens if it fails? etc) and debugging. Also, it is easier to organically grow as requirements and/or special cases have to be dealt with.
