Help figuring out approaches to (near) real time multi dimensional data querying - sql-server

I have a system that involves numerous related tables. Think of a standard category/product/order/customer/orderitem scenario. Some tables are self referencing (like Categories). None of the tables are particularly large (around 100k rows with an estimated scale to around 1 million rows). There are a lot of dimensions to this data I need to consider, but must be queried in a near real time way. I also don't know which dimensions a particular user is interested in- it can be one or many criteria across numerous tables. Things can range from
Give me everything with a category of Jackets
Give me everything with a category of Jackets->Parkas having a red color purchased in the last month in New York
Give me everything which wasn't purchased in New York which costs over $100.
Currently we have a very long SP which uses a "cascading data" approach- we go table by table, filtering everything into a temp table using whatever criteria was specified for that table. For the next table, we join the current temp table to whatever table we're using and apply a new filter set into a new temp table. It works, but manageability and performance is slow. I need something better.
I need a new approach to this problem. It's clearly a need for OLAP, possibly using a star schema. Does this work in real time? Can it be configured to work in real time? Should I use indexed views to create a set of denormalized tables? Should I offload this outside of the database completely?
FYI We're using Sql Server.

As you say, this is perfect for OLAP.
With Sql Server 2005 and 2008 you can set up an almost real time solution. You should:
Create a denormalized star schema
Build an OLAP cube using that schema
Enable proactive caching to update the cube when the underlying data source changes.
It's not a trivial job, and you need the Enterprise version of Sql Server to use proactive caching. You also need some front-end tool (maybe excel would do) to consume the cube.

It would probably be better to build a dynamic query in your code with all the joins you need, customized to each individual request. (properly parameterized for security of course).
You would use much of the same cascading logic you have now but you move it to to the code instead of the database. Then you only submit the exact query you need.
The performance would beat using all of the temp tables and you might get some caching benefit after a few queries were run.

Your dilemma sounds to me like "Is it better to achieve the same result by performing complex processing every time I need it, or should I do it once only for each new piece of data?".

Related

DB Aggregation and BI Best Practices

We have a fair amount of aggregating queries in our db for use on making (sometimes real-time) business decisions. Unfortunately the pages that present these aggregates are some of the most frequently called, and the SPs are passed parameters by the page. The queries themselves have been tuned, but Unfortunately each SP is generation a handful of aggregate fields.
We're working on some performance tuning, and one of the tasks is to re-work how/where these aggregations are done.
The thoughts we have are to possibly create SPs that do some of these and store them in a table. Then the page could run a more simple select query on the table still using a parameter to limit it to the correct data-set. It wouldn't be as "real time", but could be frequently enough.
The other suggested solution was to perform the aggregation queries in our DWH and (through) SSIS pass the data back to a table in the Prod db. There is significantly less traffic on our DWH db, so it could easily handle the heavy lifting.
What are the thoughts on ways to streamline querying and presenting what would typically be considered "reporting" data in a Prod environment? The current SPs are called probably a couple thousand times a day. Is pushing DWH data back to a Prod db against BI best practices? Is this something better done in a CLR Proc (not that familiar with CLR)?
To make use of pre-aggregated calculations and still answer all questions in NRT, this is what you could do:
Calculate an aggregate every minute/hour/day and store this in the database as an aggregated value. Then, if a query is done on the database you use these aggregates. For the time that isn't in the aggregate, you use the raw data that was inserted after the final timestamp of the last aggregate. It requires a bit more coding, but this is the ultimate solution.
I don't see any problem with pushing aggregate reporting data back to a prod DB from a data warehouse for delivery, but as a commenter mentioned, you would inherently introduce a delay which might not be acceptable for your 'real-time' decisions.
Another option, as long as the nature of your aggregation is relatively straightforward, is indexed views. Essentially you create a view of the source table with the level of aggregation required for reporting, and then put an index on it. The index means that SQL server actually precalculates the view and stores it on disk like a real table, so that when you do a reporting query, it doesn't actually need to read the full detail of the source table and aggregate it.
Another neat trick that SQL can do with indexed views is 'aggregate awareness', which means that even if you query the source table, if the planner realises it can get the answer from the indexed view more quickly, then it will. You wouldn't even need to update your existing procs for them to gain a benefit.
It probably sounds too good to be true, and in some ways it is, since there are are couple of massive downsides:
There are a large number of limitations on the SQL syntax you can use in the view, for example while you can use sum and count_big, you can't use max/min (which seems odd to be honest). You can do inner but not outer joins. You can't use window functions. The list goes on...
Also, there will obviously be a performance cost when writing to the tables the view references, as the view has to be updated. So you would want to test out performance in your environment with a typical workload before you turn something like this on in prod.

Architecture for MVC Application for running web reports on data across 2 database servers

I have an ASP.NET MVC 4 application that uses a database in US or Canada, depending on which website you are on.
This program lets you filter job data on various filters and the criteria gets translated to a SQL query with a good number of table joins. The data is filtered then grouped/aggregated.
However, now I have a new requirement: query and do some grouping and aggregation (avg salary) on data in both the Canada Server and US server.
Right now, the lookup tables are duplicated on both database servers.
Here's the approach I was thinking:
Run the query on the US server, run the query again on the Canada server and then merge the data in memory.
Here is one use case: rank the companies by average salary.
In terms of the logic, I am just filtering and querying a job table and grouping the results by company and average salary.
Would are some other ways to do this? I was thinking of populating a reporting view table with a nightly job and running the queries against that reporting table.
To be honest, the queries themselves are not that fast to begin with; running, the query again against the Canada database seems like it would make the site much slower.
Any ideas?
Quite a number of variables here. If you don't have too much data then doing the queries on each DB and merging is fine so long as you get the database to do as much of the work as it is able to (i.e. the grouping, averaging etc.).
Other options include linking your databases and doing a single query but there are a few downsides to this including
Having to link databases
Security associated with a linked database
A single query will require both databases to be online, whereas you can most likely work around that with two queries
Scheduled, prebuilt tables have some advantages & disadvantages but probably not really relevant to the root problem of you having 2 databases where perhaps you should have one (maybe, maybe not).
If the query is quite slow and called many times, the a single snapshot once could save you some resources provided the data "as at" the time of the snapshot is relevant and useful to your business need.
A hybrid is to create an "Indexed View" which can let the DB create a running average for you. That should be fast to query and relatively unobtrusive to keep up to date.
Hope some of that helps.

Need for OLAP cubes if we can Build views based directly off the RAW table

Assume that the table in the source data is is clean and in a state where they can be used directly.
I am trying to understand whether building views based off the RAW table is better than creating cubes. To make the VIEWS dynamic, we can have .NET application which would take paramteres for the view and execute a View with Parameters and get the data for Reporting and analysis.
If I want to view the Sales of a Product for United states in the Month of Februaray. So, I can create a view joining Product, Customer get the sales for a particular day in the month of February.
Instead of forming a Star Schema with Product, Date, Customer dimension. I am really trying to understand what is the standarad a company should go with.
I have folks telling Cubes are only good for analysis not good for reporting . Whatever information we want we can get it by creating a DYNAMIC Views
Any advise or ideas on this ?
Thanks!!
As the name suggest SSAS (SQL Server Analysis Services) is indeed built for analysis. The reason for this is the highly normalized table structure (e.g., the star schema) which allows for super efficient indexing combined with the pre-processing of aggregated values.
Views are a great way to take data that already exists within your OLTP (as compared to OLAP) database and transform it in a manner that better fits your querying needs. This works in the same manner as "get" stored procedures.
Now for my opinion:
If you have a small amount of data (relative to the power of your server, as well as many other factors) and you're not performing intense aggregations of the data, consider using stored procedures to access your database. You can specify the parameter in .NET like any other function, making this method super easy.
If you have a lot of data (like, over 100 million rows), consider creating a cube. This will allow your queries to fly. There's a lot more work that goes into these, but the speed payoff is huge.
End note:
If the data in your reports is pretty similar to the data you already have in your database (including JOINing the tables) and you have under half a billion rows, just use a stored proc, and look into using SSRS (or not). If you have a ton of data that needs to be aggregated and transformed, look into SSAS OLAP cubes.
From my limited experience with Microsoft's Analysis Services, I would agree with Norla. If the execution time of the view is reasonable, that would be the way to go. Cubes can certainly be reported against, as SQL Reporting Services accommodates them fairly well, but the development process can often be much more involved when using a cube as your data source.
Building views can be an alternative for small datasets. You could consider going that route BUT:
1) once the reports are taking a lot of time to load
2) It slows the transactional systems
Then you'll have to consider cubes.

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried with the fact that one table in particular will be getting very big. Each customer has approximately 20.000.000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be done on this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advices below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code) on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently, as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indices on the temp table and run the query. This is not how it should be. Therefore, we are considering to buy Enterprise Edition for use of partitioned tables. If the purchase cannot go through I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
Datawarehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question as 'datawarehouse' as well I assume you know some things about the subject. Depending on your goals you could go for a star-schema (a multidemensional model with a fact and dimensiontables). Store all fastchanging data in 1 table (per subject) and the slowchaning data in another dimension/'snowflake' tables.
An other option is the DataVault method by Dan Lindstedt. Which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge anmout of records and SQl server should handle with ease.
A partioned single table is usually the best way to go. Trying to maintain separate indivudal customer tables is very costly in termas of time and effort and far more probne to errors.
Also examine you current queries if you are experiencing performance issues. If you don't have proper indexing (did you for instance index the foreign key fields?) queries will be slow, if you don't have sargeable queries they will be slow if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is striclty needed? If you have select * anywhere in your production code, get rid of it and only return the fields you need. If you used views that call views that call views or if you used EAV table, you willhave performance iisues at this level. If you allowed a framework to autogenerate SQl code, you may well have badly perforimng queries. Remember Profiler is your friend. Of course you could also have a hardware issue, you need a pretty good sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you need to hire a professional dba with performance tuning experience. It is quite complex stuff. Databases desigend by application programmers often are bad performers when they get a real number of users and records. Database MUST be designed with data integrity, performance and security in mind. If you didn't do that the changes of having them are slim indeed.
Partioning is definately something to look into. I had a database that had 2 tables sharded. Each table contained around 30-35million records. I have since merged this into one large table and assigned some good indexes. So far, I've not had to partition this table as it's working a treat, but I'm keep partitioning in mind. One thing that I have noticed, compared to when the data was sharded, and that's the data import. It is now slower, but I can live with that as the Import tool can be re-written ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries to perform (but partitioning is not usually about performance but about maintenance being able to add and drop partitions quickly) over very large data sets - but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of table that you can easily 'archive off', and the aggrevation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at BTree searching than your own invention is)
You will need to look into the performance and locking issues however - this will prevent your db from scaling.
You can also create supplemental tables that hold already calculated details on historical information if there are common queries.

What's the best way to manage a large number of tables in MS SQL Server?

This question is related to another:
Will having multiple filegroups help speed up my database?
The software we're developing is an analytical tool that uses MS SQL Server 2005 to store relational data. Initial analysis can be slow (since we're processing millions or billions of rows of data), but there are performance requirements on recalling previous analyses quickly, so we "save" results of each analysis.
Our current approach is to save analysis results in a series of "run-specific" tables, and the analysis is complex enough that we might end up with as many as 100 tables per analysis. Usually these tables use up a couple hundred MB per analysis (which is small compared to our hundreds of GB, or sometimes multiple TB, of source data). But overall, disk space is not a problem for us. Each set of tables is specific to one analysis, and in many cases this provides us enormous performance improvements over referring back to the source data.
The approach starts to break down once we accumulate enough saved analysis results -- before we added more robust archive/cleanup capability, our testing database climbed to several million tables. But it's not a stretch for us to have more than 100,000 tables, even in production. Microsoft places a pretty enormous theoretical limit on the size of sysobjects (~2 billion), but once our database grows beyond 100,000 or so, simple queries like CREATE TABLE and DROP TABLE can slow down dramatically.
We have some room to debate our approach, but I think that might be tough to do without more context, so instead I want to ask the question more generally: if we're forced to create so many tables, what's the best approach for managing them? Multiple filegroups? Multiple schemas/owners? Multiple databases?
Another note: I'm not thrilled about the idea of "simply throwing hardware at the problem" (i.e. adding RAM, CPU power, disk speed). But we won't rule it out either, especially if (for example) someone can tell us definitively what effect adding RAM or using multiple filegroups will have on managing a large system catalog.
Without first seeing the entire system, my first recommendation would be to save the historical runs in combined tables with a RunID as part of the key - a dimensional model may also be relevant here. This table can be partitioned for improvement, which will also allow you to spread the table into other filegroups.
Another possibility it to put each run in its own database and then detach them, only attaching them as needed (and in read-only form)
CREATE TABLE and DROP TABLE are probably performing poorly because the master or model databases are not optimized for this kind of behavior.
I also recommend talking to Microsoft about your choice of database design.
Are the tables all different structures? If they are the same structure you might get away with a single partitioned table.
If they are different structures, but just subsets of the same set of dimension columns, you could still store them in partitions in the same table with nulls in the non-applicable columns.
If this is analytic (derivative pricing computations perhaps?) you could dump the results of a computation run to flat files and reuse your computations by loading from the flat files.
This seems to be a very interesting problem/application that you are working with. I would love to work on something like this. :)
You have a very large problem surface area, and that makes it hard to start helping. There are several solution parameters that are not evident in your post. For example, how long do you plan to keep the run analysis tables? There's a LOT other questions that need to be asked.
You are going to need a combination of serious data warehousing, and data/table partitioning. Depending on how much data you want to keep and archive you may need to start de-normalizing and flattening the tables.
This would be pretty good case where contacting Microsoft directly can be mutually beneficial. Microsoft gets a good case to show other customers, and you get help directly from the vendor.
We ended up splitting our database into multiple databases. So the main database contains a "databases" table that refers to one or more "run" databases, each of which contains distinct sets of analysis results. Then the main "run" table contains a database ID, and the code that retrieves a saved result includes the relevant database prefix on all queries.
This approach allows the system catalog of each database to be more reasonable, it provides better separation between the core/permanent tables and the dynamic/run tables, and it also makes backups and archiving more manageable. It also allows us to split our data across multiple physical disks, although using multiple filegroups would have done that too. Overall, it's working well for us now given our current requirements, and based on expected growth we think it will scale well for us too.
We've also noticed that SQL 2008 tends to handle large system catalogs better than SQL 2000 and SQL 2005 did. (We hadn't upgraded to 2008 when I posted this question.)

Resources