Indexed Views in OLTPs? - sql-server

I'm familiar with SQL Server indexed views (or Oracle materialized views); we use them in our OLAP applications. They have the really cool feature of being able to usurp an execution plan and remap it to the indexed view without having to change existing code.
I.e., let's say I had a sproc that ran a really expensive join.
SELECT [SOME COLUMNS]
FROM Table1 INNER JOIN Table2 [DETAILS]
INNER JOIN Table3 [BUNCH MORE JOINS]
...
If I authored an indexed view that held a similar result set, the query optimizer will very likely route the sproc to my indexed view instead of the base tables, and I get a big performance increase.
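Concretely, authoring such a view looks roughly like this (the table and column names are made up; the view has to be schema-bound, and the unique clustered index is what actually materializes it):
-- Hypothetical tables and columns; indexed views require SCHEMABINDING
-- and two-part names for every referenced table.
CREATE VIEW dbo.vw_ExpensiveJoin
WITH SCHEMABINDING
AS
SELECT t1.OrderId, t1.CustomerId, t2.ProductId, t3.RegionName
FROM dbo.Table1 AS t1
INNER JOIN dbo.Table2 AS t2 ON t2.OrderId = t1.OrderId
INNER JOIN dbo.Table3 AS t3 ON t3.RegionId = t2.RegionId;
GO

-- The unique clustered index is what materializes the view on disk.
CREATE UNIQUE CLUSTERED INDEX IX_vw_ExpensiveJoin
ON dbo.vw_ExpensiveJoin (OrderId, ProductId);
(On editions below Enterprise the optimizer won't do that matching automatically and you'd have to reference the view with the NOEXPAND hint, but that's beside the point here.)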
Now say I wanted to use indexed views in an OLTP!? I mean, most OLTPs (like this site) are relatively read-heavy; if they have expensive joins then we could speed them up a ton AND potentially reduce locking contention (http://www.codinghorror.com/blog/archives/001166.html). Even better, you wouldn't have to change any code, just author the indexed view.
But this also means the database gets bigger, since we need to keep a copy of that data in the indexed view...
Has anyone ever used indexed views to solve contention or speed issues in an OLTP? How come I've never seen this in use?

Materialized views can be useful for reporting against OLTP, especially if large numbers of rows are aggregated to get the results. The space requirements are completely dependent on how much data you are saving. Think of it as a cache.
The tricky balance is between how recent the data needs to be for the reports, and how much of a hit you can take on OLTP performance. If somewhat stale data is OK, you may be able to schedule the updates to the views during a time when system activity is low.
The one time I could not, and needed very current data, I ended up doing some custom development. Each update to the base table fired a trigger which wrote a record to a transaction table. The view looked at a cached aggregate, plus the delta stored in the transaction table. As system resources allowed, the transactions were applied to the aggregate table as delta transactions. This gave me up-to-the-second data, good performance on reporting (the only aggregation happening was over recent transactions), and fairly little load on the database (only doubling the size of every write, not re-calculating a huge aggregate every time).
Unfortunately, it was complex to maintain and did not use simple built-in tools. If you can wait on your reporting data, it is often best to use the built-in materialized views and defer the refresh.
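A rough sketch of that pattern, with made-up table names (the trigger that populates the delta table is omitted; the key piece is the view that combines the cached aggregate with the deltas that haven't been rolled up yet):
-- Hypothetical schema: a cached aggregate table plus a delta table fed by a trigger.
CREATE TABLE dbo.SalesAggregate (
    ProductId   int            NOT NULL PRIMARY KEY,
    TotalAmount decimal(18, 2) NOT NULL
);

CREATE TABLE dbo.SalesDelta (
    DeltaId   int IDENTITY(1, 1) PRIMARY KEY,
    ProductId int            NOT NULL,
    Amount    decimal(18, 2) NOT NULL
);
GO

-- Reporting view: the cached aggregate plus whatever hasn't been applied yet.
CREATE VIEW dbo.vw_SalesCurrent
AS
SELECT ProductId, SUM(TotalAmount) AS TotalAmount
FROM (
    SELECT ProductId, TotalAmount FROM dbo.SalesAggregate
    UNION ALL
    SELECT ProductId, Amount AS TotalAmount FROM dbo.SalesDelta
) AS combined
GROUP BY ProductId;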

We use materialized views to speed things up where I work, most often for reports against the OLTP system. Many of our reports run from a data warehouse, but since we refresh the warehouse overnight, up-to-the-moment data has to come from the OLTP tables.

Related

SQL Server In-Memory table use case with huge data

I have a SQL Server table with 160+ million records that sees continuous CRUD operations from the UI, batch jobs, etc., basically from multiple sources.
Currently I have partitioned the table on a column to get better performance out of it.
I came across In-Memory tables, which can be used for tables with frequent updates; also, if updates are happening from multiple sources, it won't take locks but will instead maintain row versioning, so concurrent updates work better with this approach.
So what are my options in this case?
Partition the table or create an In-Memory table?
From what I have read, SQL Server does not support In-Memory tables when the table is partitioned.
Which is the better option in this case, an In-Memory table or a partitioned table?
It depends.
In-memory tables look great in theory, but you really need to spend time learning the details in order to get the implementation right. You may find some of the details disturbing. For example:
there are no parallel inserts into in-memory tables, which makes creating rows slower compared to parallel inserts into a traditional table stored on SSD
not all index operations supported by disk-based indexes are available on in-memory table indexes
not all data types are supported
there are both unsupported features and unsupported T-SQL constructs
you may need more RAM than you think
If you are ready to pay the price for using Hekaton, you may start by reading its white paper.
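If you do go that way, here is a bare-bones sketch of a memory-optimized table, just to show the moving parts (the database and object names are hypothetical; the database first needs a memory-optimized filegroup):
-- One-time setup: the database needs a filegroup for memory-optimized data.
ALTER DATABASE CurrentDb
    ADD FILEGROUP imoltp_fg CONTAINS MEMORY_OPTIMIZED_DATA;
ALTER DATABASE CurrentDb
    ADD FILE (NAME = 'imoltp_data', FILENAME = 'C:\Data\imoltp_data')
    TO FILEGROUP imoltp_fg;
GO

-- A durable memory-optimized table with a hash index on the key.
CREATE TABLE dbo.OrderEvents (
    OrderEventId bigint    NOT NULL
        PRIMARY KEY NONCLUSTERED HASH WITH (BUCKET_COUNT = 1000000),
    OrderId      int       NOT NULL,
    EventTime    datetime2 NOT NULL
) WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);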
Partitioning itself comes with benefits, but there is no guarantee it will heal your system. Only particular queries and scenarios can benefit from it. For example, if 99% of your workload touches the data in one partition, you may see no improvement at all. On the other hand, if your reports are based on historical data and your inserts/updates/deletes touch another partition, it will work out better.
Both of these technologies are good, but they need to be examined in detail and applied carefully. Often, folks believe that using some new tech will solve their problems, when the problems could be solved just by applying some basic concepts.
For example, you said that you are performing CRUD over 160+ million records. Ask yourself:
is my table normalized - when data is stored in a normalized way you gain two things: first, you will perform CRUD only on part of the data; second, the engine may read only the data that is needed for a particular query (without the need for a supporting index)
are my T-SQL statements written well - row-by-agonizing-row processing, calling stored procedures in loops, and not processing the data in batches are common sources of slow queries
which are the blocking and deadlocking queries - for example, a single long-running query can block all your inserts; identify these types of issues first and try to resolve them with data pre-calculation (an indexed view) or by creating covering indexes (which can also be filtered and have included columns)
are readers and writers blocking each other - you can try different isolation levels to solve this type of issue; RCSI is the default isolation level in Azure SQL Database. You may need to add more RAM to the RAM disk used by your TempDB, but since you are already looking at Hekaton, RCSI will be easier to test (and roll back) than In-Memory tables or partitioning (a sketch of enabling it follows this list)
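For reference, a minimal sketch of switching a database to RCSI (the database name is hypothetical; WITH ROLLBACK IMMEDIATE kicks out other sessions so the change can apply):
-- Turn on read committed snapshot isolation for the database.
ALTER DATABASE CurrentDb
    SET READ_COMMITTED_SNAPSHOT ON
    WITH ROLLBACK IMMEDIATE;

-- Rolling it back is just as simple, which is part of what makes it easy to test:
-- ALTER DATABASE CurrentDb SET READ_COMMITTED_SNAPSHOT OFF WITH ROLLBACK IMMEDIATE;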

DB Aggregation and BI Best Practices

We have a fair number of aggregating queries in our db used to make (sometimes real-time) business decisions. Unfortunately, the pages that present these aggregates are some of the most frequently called, and the SPs are passed parameters by the page. The queries themselves have been tuned, but each SP is generating a handful of aggregate fields.
We're working on some performance tuning, and one of the tasks is to re-work how/where these aggregations are done.
The thought we have is to possibly create SPs that do some of these aggregations and store the results in a table. The page could then run a simpler select query against that table, still using a parameter to limit it to the correct data set. It wouldn't be as "real time", but it could be refreshed frequently enough.
The other suggested solution was to perform the aggregation queries in our DWH and, through SSIS, pass the data back to a table in the prod db. There is significantly less traffic on our DWH db, so it could easily handle the heavy lifting.
What are the thoughts on ways to streamline querying and presenting what would typically be considered "reporting" data in a Prod environment? The current SPs are called probably a couple thousand times a day. Is pushing DWH data back to a Prod db against BI best practices? Is this something better done in a CLR Proc (not that familiar with CLR)?
To make use of pre-aggregated calculations and still answer all questions in NRT (near real time), this is what you could do:
Calculate an aggregate every minute/hour/day and store it in the database as an aggregated value. Then, when a query hits the database, you use these aggregates. For the time range that isn't covered by an aggregate yet, you use the raw data that was inserted after the final timestamp of the last aggregate. It requires a bit more coding, but this is the ultimate solution.
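A rough sketch of that query shape, using made-up tables (HourlySales holds the pre-calculated aggregates, Sales the raw rows):
-- Combine the pre-aggregated rows with the raw tail since the last aggregate.
DECLARE @LastAggregated datetime2 =
    (SELECT MAX(PeriodEnd) FROM dbo.HourlySales);

SELECT ProductId, SUM(Amount) AS TotalAmount
FROM (
    SELECT ProductId, Amount FROM dbo.HourlySales    -- already rolled up
    UNION ALL
    SELECT ProductId, Amount FROM dbo.Sales          -- raw rows not yet rolled up
    WHERE SaleTime > @LastAggregated
) AS combined
GROUP BY ProductId;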
I don't see any problem with pushing aggregate reporting data back to a prod DB from a data warehouse for delivery, but as a commenter mentioned, you would inherently introduce a delay which might not be acceptable for your 'real-time' decisions.
Another option, as long as the nature of your aggregation is relatively straightforward, is indexed views. Essentially you create a view of the source table with the level of aggregation required for reporting, and then put an index on it. The index means that SQL server actually precalculates the view and stores it on disk like a real table, so that when you do a reporting query, it doesn't actually need to read the full detail of the source table and aggregate it.
Another neat trick that SQL can do with indexed views is 'aggregate awareness', which means that even if you query the source table, if the planner realises it can get the answer from the indexed view more quickly, then it will. You wouldn't even need to update your existing procs for them to gain a benefit.
It probably sounds too good to be true, and in some ways it is, since there are a couple of massive downsides:
There are a large number of limitations on the SQL syntax you can use in the view, for example while you can use sum and count_big, you can't use max/min (which seems odd to be honest). You can do inner but not outer joins. You can't use window functions. The list goes on...
Also, there will obviously be a performance cost when writing to the tables the view references, as the view has to be updated. So you would want to test out performance in your environment with a typical workload before you turn something like this on in prod.
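For illustration, a minimal sketch of an aggregating indexed view over a hypothetical dbo.OrderLines table (the non-nullable Amount column and all the names here are assumptions; COUNT_BIG(*) and SCHEMABINDING are mandatory):
CREATE VIEW dbo.vw_DailyOrderTotals
WITH SCHEMABINDING
AS
SELECT
    ProductId,
    OrderDate,
    SUM(Amount)  AS TotalAmount,   -- Amount must be non-nullable for SUM in an indexed view
    COUNT_BIG(*) AS RowCnt         -- COUNT_BIG(*) is required when the view uses GROUP BY
FROM dbo.OrderLines
GROUP BY ProductId, OrderDate;
GO

CREATE UNIQUE CLUSTERED INDEX IX_vw_DailyOrderTotals
ON dbo.vw_DailyOrderTotals (ProductId, OrderDate);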

Transactional table and Reporting table within same database

I did read posts about transactional and reporting databases.
We have a single table which is used for both reporting (historical) purposes and transactions,
e.g., an order table with fields
orderid, ordername, orderdesc, datereceived, dateupdated, confirmOrder
Is it a good idea to split this table into neworder and orderhistory?
The neworder table would record the current day's transactions (select, insert and update activity every ms for the orders received that day); later we would merge this table with orderhistory.
Is this a recommended approach?
Do you think this would minimize the load and processing time on the database?
PostgreSQL supports basic table partitioning which allows splitting what is logically one large table into smaller physical pieces. More info provided here.
To answer your second question: no. Moving data from one place to another is an extra load that you otherwise wouldn't have if you used the transactional table for reporting. But there are some other questions you need to ask before you make this decision.
How often are these reports run?
If you are running these reports once an hour, it may make sense to keep them in the same table. However, if this report takes a while to run, you'll need to take care not to tie up resources for the other clients using it as a transactional table.
How up-to-date do these reports need to be?
If the reports are run less than daily or weekly, it may not be critical to have up to the minute data in the reports.
And this is where the reporting table comes in. The approaches I've seen typically involve having a "data warehouse," whether that be implemented as a single table or an entire database. This warehouse is filled on a schedule with the data from the transactional table, which subsequently triggers the generation of a report. This seems to be the approach you are suggesting, and it is a completely valid one. Ultimately, the one question you need to answer is when you want your server to handle the load. If this can be done on a schedule during non-peak hours, I'd say go for it. If it needs to be run at any given time, then you may want to keep the single-table approach.
Of course there is nothing saying you can't do both. I've seen a few systems that have small on-demand reports run on transactional tables, scheduled warehousing of historical data, and then long-running reports against that historical data. It's really just a matter of how real-time you want the data to be.

Database design: one huge table or separate tables?

Currently I am designing a database for use in our company. We are using SQL Server 2008. The database will hold data gathered from several customers. The goal of the database is to acquire aggregate benchmark numbers over several customers.
Recently, I have become worried that one table in particular will be getting very big. Each customer has approximately 20,000,000 rows of data, and there will soon be 30 customers in the database (if not more). A lot of queries will be run against this table. I am already noticing performance issues and users being temporarily locked out.
My question, will we be able to handle this table in the future, or is it better to split this table up into smaller tables for each customer?
Update: It has now been about half a year since we first created the tables. Following the advice below, I created a handful of huge tables. Since then, I have been experimenting with indexes and decided on a clustered index on the first two columns (Hospital code and Department code), on which we would have partitioned the table had we had Enterprise Edition. This setup worked fine until recently; as Galwegian predicted, performance issues are springing up. Rebuilding an index takes ages, users lock each other out, queries frequently take longer than they should, and for most queries it pays off to first copy the relevant part of the data into a temp table, create indexes on the temp table, and run the query there. This is not how it should be. Therefore, we are considering buying Enterprise Edition for the use of partitioned tables. If the purchase cannot go through, I plan to use a workaround to accomplish partitioning in Standard Edition.
Start out with one large table, and then apply 2008's table partitioning capabilities where appropriate, if performance becomes an issue.
Data warehouses are supposed to be big (the clue is in the name). Twenty million rows is about medium by warehousing standards, although six hundred million can be considered large.
The thing to bear in mind is that such large tables have a different physics, like black holes. So tuning them takes a different set of techniques. The other thing is, users of a datawarehouse must understand that they are dealing with huge amounts of data, and so they must not expect sub-second response (or indeed sub-minute) for every query.
Partitioning can be useful, especially if you have clear demarcations such as, as in your case, CUSTOMER. You have to be aware that partitioning can degrade the performance of queries which cut across the grain of the partitioning key. So it is not a silver bullet.
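For reference, a bare-bones sketch of partitioning along a customer key in SQL Server 2008 Enterprise (boundary values, names, and the single PRIMARY filegroup are all assumptions for illustration):
-- Partition function: one partition per range of customer ids.
CREATE PARTITION FUNCTION pf_CustomerRange (int)
AS RANGE RIGHT FOR VALUES (10, 20, 30);

-- Partition scheme: map every partition to PRIMARY for simplicity.
CREATE PARTITION SCHEME ps_CustomerRange
AS PARTITION pf_CustomerRange ALL TO ([PRIMARY]);

-- Create the big table on the partition scheme, partitioned by CustomerId.
CREATE TABLE dbo.BenchmarkData (
    CustomerId int            NOT NULL,
    MetricId   int            NOT NULL,
    MetricDate date           NOT NULL,
    Value      decimal(18, 4) NOT NULL
) ON ps_CustomerRange (CustomerId);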
Splitting tables for performance reasons is called sharding. Also, a database schema can be more or less normalized. A normalized schema has separate tables with relations between them, and data is not duplicated.
I am assuming you have your database properly normalized. It shouldn't be a problem to deal with the data volume you refer to on a single table in SQL Server; what I think you need to do is review your indexes.
Since you've tagged your question 'datawarehouse' as well, I assume you know some things about the subject. Depending on your goals you could go for a star schema (a multidimensional model with fact and dimension tables). Store all fast-changing data in one table (per subject) and the slow-changing data in other dimension/'snowflake' tables.
Another option is the Data Vault method by Dan Linstedt, which is a bit more complex but provides you with full flexibility.
http://danlinstedt.com/category/datavault/
In a properly designed database, that is not a huge amount of records and SQL Server should handle it with ease.
A partitioned single table is usually the best way to go. Trying to maintain separate individual customer tables is very costly in terms of time and effort and far more prone to errors.
Also examine your current queries if you are experiencing performance issues. If you don't have proper indexing (did you, for instance, index the foreign key fields?), queries will be slow. If you don't have sargeable queries, they will be slow; if you used correlated subqueries or cursors, they will be slow. Are you returning more data than is strictly needed? If you have SELECT * anywhere in your production code, get rid of it and only return the fields you need. If you use views that call views that call views, or if you use an EAV table, you will have performance issues at this level. If you allowed a framework to autogenerate SQL code, you may well have badly performing queries. Remember, Profiler is your friend. Of course you could also have a hardware issue; you need a pretty good-sized dedicated server for that number of records. It won't work to run this on your web server or a small box.
I suggest you hire a professional DBA with performance tuning experience. This is quite complex stuff. Databases designed by application programmers are often bad performers once they get a real number of users and records. A database MUST be designed with data integrity, performance and security in mind, and if you didn't do that, the chances of having them are slim indeed.
Partitioning is definitely something to look into. I had a database with two sharded tables, each containing around 30-35 million records. I have since merged them into one large table and assigned some good indexes. So far I've not had to partition this table as it's working a treat, but I'm keeping partitioning in mind. One thing I have noticed, compared to when the data was sharded, is the data import: it is now slower, but I can live with that as the import tool can be re-written ;o)
One table and use table partitioning.
I think the advice to use NOLOCK is unjustified based on the information given. NOLOCK means you will get inaccurate and unreliable results from your queries (dirty and phantom reads). Before using NOLOCK you need to be sure that's not going to be a problem for your customers.
Is this a single flat table (no particular model)? Typically in data warehouses, you either have a normalized data model (third normal form at least - usually in an entity-relationship-model) or you have dimensional data (Kimball method or variations - usually fact tables with associated dimension tables in a set of stars).
In both cases, indexes play a large part, and partitioning can also play a part in getting queries to perform over very large data sets (though partitioning is not usually about performance but about maintenance, being able to add and drop partitions quickly) - but it really depends on the order of aggregation and the types of queries.
One table, then worry about performance. That is, assuming you are collecting the exact same information for each customer. That way, if you have to add/remove/modify a column, you are only doing it in one place.
If you're on MS SQL server and you want to keep the single table, table partitioning could be one solution.
Keep one table - 20M rows isn't huge, and customers aren't exactly the kind of table that you can easily 'archive off', and the aggravation of searching multiple tables to find a customer isn't worth the effort (SQL is likely to be much more efficient at B-tree searching than your own invention is).
You will need to look into the performance and locking issues, however - these will prevent your db from scaling.
You can also create supplemental tables that hold already calculated details on historical information if there are common queries.

Why use NoSQL over Materialized Views?

There has been a lot of talk recently about NoSQL.
The #1 reason I hear for people using NoSQL is that they de-normalize their DBMS data so much, to increase performance, that they end up with just one table holding all of their data.
With Materialized Views however, you can keep your data normalized, yet have it stored as a single table view for the same reasons why you'd use NoSQL.
As such, why would someone use NoSQL over Materialized Views?
One reason is that materialized views will perform poorly in an OLTP situation where there is a heavy amount of INSERTs vs. SELECTs.
Every time data is inserted, the materialized view's indexes must be updated, which not only slows down inserts but selects as well. The primary reason for using NoSQL is performance. By being basically a hash-key store, you get insanely fast reads/writes, at the cost of less control over constraints, which typically must be enforced at the application layer.
So, while materialized views may help reads, they do nothing to speed up writes.
NoSQL is not about getting better performance out of your SQL database. It is about considering options other than the default SQL storage when there is no particular reason for the data to be in SQL at all.
If you have an established SQL Database with a well designed schema and your only new requirement is improved performance, adding indexes and views is definitely the right approach.
If you need to save a user profile object that you know will only ever need to be accessed by its key, SQL may not be the best option - you gain nothing from a system with all sorts of query functionality you won't use, but being able to leave out the ORM layer while improving the performance of the queries you will be using is quite valuable.
Another reason is the dynamic nature of NoSQL. Each view you create needs to be created beforehand, with a "guess" as to how an application might use it.
With NoSQL you can change as the data changes; dynamically varying your data to suit the application.
