Reliability of Indexed views - sql-server

I just found out that a report I quickly threw together years ago has been the sole means of collecting millions of dollars, and there isn't anything being done to check if it is correct.
For performance reasons, the report makes heavy use of indexed views. This concerns me, since while I have used indexed views a lot, I tend not to use them for anything this critical.
Is it possible that the indexed views can fail to update or otherwise return information different from the data in the tables? How real of a risk is this? Is there a good SQL script I can run periodically to check for errors?

There is zero risk for inconsistency according to the docs.
In practice, you have to deal with product bugs. They are not a realistic concern.
Indexed view maintenance is based on the exact same mechanism that indexes are based on: They are updated as part of the DML query plan. I guess you wouldn't expect indexes to become corrupt so you should trust indexes views with about the same strength.

Related

Is a strongly normalized relational database not efficient?

I was reading this question https://meta.stackexchange.com/questions/26398/stackoverflow-database-design-join-issues and I got the following question: using a very normalized db is not efficient?
How should be found the right compromise?
I'm not sure if this question better fits here or on programmer. Here there are some similar but if I should move, just ask me.
Whether it will speed it up or slow it down depends strongly on the nature of the data, the size of the tables, the type of querying, the indexing. I have seen it go both ways although, more often than not in my experience, normalization to the third normal form speeds things up. Relational databases are built to be normalized and designed so that those things are expected.
One thing the denormalization advocates often forget is that speed is critical to transactions (possibly more critical due to blocking potential) and that denormalization often slows down updates. You can't measure performance just on select statements. Denormalized database tables are often wider and wider tables can often cause slowdowns too.
Denormalized databases are a major problem to keep the data integrity in and a change of a company name in a normalized database might result in one record needing to be updated and in a denormalized one might result in 100,000,000 records needing to be updated. That is why denormalization is generally only preferred for databases (like data warehouses) where the data is loaded through an ETL process but the database itself is frequently queried for complex reporting scenarios. Transactional databases that have a lot of user updates and deletions and inserts are often much faster if they are normalized to the third normal forma at least. Now you can go crazy with normalization too, don't get me wrong. I shouldn't have to join to 10 tables to get a simple address especially if I get them often. Data that is often used together often belongs together especially if the items are unlikely to change a million records if a change is made. For instance in the address, it would require a large update if Chicago changed it's name to New Chicago, but those types of massive address changes are pretty rare in my part of the world. On the other hand, company name changes are frequent and could cause massive disruption if they needed to be made to millions of denormalized records.
If you are not designing a data warehouse, then normalize your data. Never denormalize unless you are a database specialist with at least 5 years experience in large systems. You can harm things tremendously if you don't know what you are doing. If things are slow denormalization is one of the last performance improvements to try. Generally, the problem is fixed by writing better queries that are sargable and which do not use poorly performing techniques like correlated subqueries or by getting the correct indexing applied.
Normalization optimizes storage requirements and data consistency. As a tradeoff, it can make queries more complex and slow.
How should be found the right compromise?
Unfortunately, that cannot be answered with generality.
It all depends on your application and its requirements.
If your queries run too slow, and indexing or caching or query rewriting or database parameter tuning don't cut it, denormalization may be appropriate for you.
(OTOH, if your queries run just fine, or can be made to run just fine, there is probably no need to go there).
It depends. Every time I've worked to normalize a database, it has radically sped up. But, the performance problems with the non-normalized DBs were that they needed many indices, most of which were not used for any particular query, having too many columns, forced DISTINCT constraints on queries that wouldn't have been needed with a normalized DB, and inefficient table searching.
If common queries need to perform many joins on large tables for the simplest of lookups, or hit many tables for writes to update what the user/application sees as an atomic update of a single entity, then as traffic grows, so will that burden, at a rate higher than with lower/no normalization. Typically what happens is that everything runs OK until either the database and application are put on different production servers, while they were on the same dev server, or when the data gets big enough to start hitting the disks all the time.
DBMS products couple logical layout and physical storage, so while it may be as likely to increase speed as decrease it, normalization of base tables will in some way affect performance of the system.
Usually, the right compromise is views, with an SQL DBMS. If you are using any variation of design by contract, views are likely the correct design decision even without any concerns for normalization or performance, so that the application gets a model fitting its needs. Scalability concerns, like for major websites, create problems that don't have quick and easy solutions, at this point in time.
Additionally to Thilo's post:
normalizing on SAP HANA is wrong due to the fact the db normalize the data itself. If you do it anyway you will slow down the database.

What are the first issues to check while optimizing an existing database?

What are the top issues and in which order of importance to look into while optimizing (performance tuning, troubleshooting) an existing (but unknown to you) database?
Which actions/measures in your previous optimizations gave the most effect (with possibly the minimum of work) ?
I'd like to partition this question into following categories (in order of interest to me):
one needs to show the performance boost (improvements) in the shortest time. i.e. most cost-effective methods/actions;
non-intrusive or least-troublesome most effective methods (without changing existing schemas, etc.)
intrusive methods
Update:
Suppose I have a copy of a database on dev machine without access to production environment to observe stats, most used queries, performance counters, etc. in real use.
This is development-related but not DBA-related question.
Update2:
Suppose the database was developed by others and was given to me for optimization (review) before it was delivered to production.
It is quite usual to have outsourced development detached from end-users.
Besides, there is a database design paradigm that a database, in contrast to application data storage, should be a value in itself independently on specific applications that use it or on context of its use.
Update3: Thanks to all answerers! You all pushed me to open subquestion
How do you stress load dev database (server) locally?
Create a performance Baseline (non-intrusive, use performance counters)
Identify the most expensive queries (non-intrusive, use SQL Profiler)
Identify the most frequently run queries (non-intrusive, use SQL Profiler)
Identify any overly complex queries, or those using slowly performing constructs or patterns. (non-intrusive to identify, use SQL Profiler and/or code inspections; possibly intrusive if changed, may require substantial re-testing)
Assess your hardware
Identify Indexes that would benefit the measured workload (non-intrusive, use SQL Profiler)
Measure and compare to your baseline.
If you have very large databases, or extreme operating conditions (such as 24/7 or ultra high query loads), look at the high end features offered by your RDBMS, such as table/index partitioning.
This may be of interest: How Can I Log and Find the Most Expensive Queries?
If the database is unknown to you, and you're under pressure, then you may not have time for Mitch's checklist which is good best practice to monitor server health.
You also need access to production to gather real info from assorted queries you can run. Without this, you're doomed. The server load pattern is important: you can't reproduce many issue yourself on a development server because you won't use the system like an end user.
Also, focus on "biggest bang for the buck". An expensive query running once daily at 3am can be ignored. A not-so-expensive one running every second is well worth optimising. However, you may not know this without knowing server load pattern.
So, basic steps..
Assuming you're firefighting:
server logs
SQL Server logs
sys.sysprocesses eg ASYNC_NETWORK_IO waits
Slow response:
profiler, with a duration filter. What runs often and is lengthy
most expensive query, weighted for how often used
open transaction with plan
weighted missing index
Things you should have:
Backups
Tested restore of aforementioned backups
Regular index and statistic maintenance
Regular DBCC and integrity checks
Edit: After your update
Static analysis is best practices only: you can't optimise for usage. This is all you can do. This is marc_s' answer.
You can guess what the most common query may be, but you can't guess how much data will be written or how badly a query scales with more data
In many shops developers provide some support, either directly or as *3rd line"
If you've been given a DB for review by another team that you hand over to another team to deploy: that's odd.
If you're not interested in the runtime behavior of the database, e.g. what are the most frequently executed queries and those that consume the most time, you can only do a "static" analysis of the database structure itself. That has a lot less value, really, since you can only check for a number of key indicators of bad design - but you cannot really tell much about the "dynamics" of the system being used.
Things I would check for in a database that I get as a .bak file - without the ability to collect live and actual runtime performance statistics - would be:
normalization - is the table structure normalized to third normal form? (at least most of the time - there might be some exceptions)
do all tables have a primary key? ("if it doesn't have a primary key, it's not a table", after all)
For SQL Server: do all the tables have a good clustering index? A unique, narrow, static, and preferably ever-increasing clustered key - ideally an INT IDENTITY, and most definitely not a large compound index of many fields, no GUID's and no large VARCHAR fields (see Kimberly Tripp's excellent blog posts on the topics for details)
are there any check and default constraints on the database tables?
are all the foreign key fields backed up by a non-clustered index to speed up JOIN queries?
are there any other, obvious "deadly sins" in the database, e.g. overly complicated views, or really badly designed tables etc.
But again: without actual runtime statistics, you're quite limited in what you can do from a "static analysis" point of view. The real optimization can only really happen when you have a workload from a regular day of operation, to see what queries are used frequently and put the most stress on your database --> use Mitch's checklist to check those points.
The most important thing to do is collect up-to-date statistics. Performance of a database depends on:
the schema;
the data in the database; and
the queries being executed.
Looking at any of those in isolation is far less useful than the whole.
Once you have collected the statistics, then you start identifying operations that are sub-par.
For what it's worth, the vast majority of performance problems we've fixed have been by either adding indexes, adding extra columns and triggers to move the cost of calculations away from the select to the insert/update, and tactfully informing the users that their queries are, shall we say, less than optimal :-)
They're usually pleased that we can just give them an equivalent query that runs much faster.

Optimize database for web usage (lots more reading than writing)

I am trying to layout the tables for use in new public-facing website. Seeing how there will lots more reading than writing data (guessing >85% reading) I would like to optimize the database for reading.
Whenever we list members we are planning on showing summary information about the members. Something akin to the reputation points and badges that stackoverflow uses. Instead of doing a subquery to find the information each time we do a search, I wanted to have a "calculated" field in the member table.
Whenever an action is initiated that would affect this field, say the member gets more points, we simply update this field by running a query to calculate the new values.
Obviously, there would be the need to keep this field up to date, but even if the field gets out of sync, we can always rerun the query to update this field.
My question: Is this an appropriate approach to optimizing the database? Or are the subqueries fast enough where the performance would not suffer.
There are two parts:
Caching
Tuned Query
Indexed Views (AKA Materialized views)
Tuned table
The best solution requires querying the database as little as possible, which would require caching. But you still need a query to fill that cache, and the cache needs to be refreshed when it is stale...
Indexed views are the next consideration. Because they are indexed, querying against is faster than an ordinary view (which is equivalent to a subquery). Nonclustered indexes can be applied to indexed views as well. The problem is that indexed views (materialized views in general) are very constrained to what they support - they can't have non-deterministic functions (IE: GETDATE()), extremely limited aggregate support, etc.
If what you need can't be handled by an indexed view, a table where the data is dumped & refreshed via a SQL Server Job is the next alternative. Like the indexed view, indexes would be applied to make fetching data faster. But data change means cleaning up the indexes to ensure the query is running as best it can, and this maintenance can take time.
The least expensive database query is the one that you don't have to run against the database at all.
In the scenario you describe, using a high-performance cache technology (example: memcached) to store query results in your application can be a lot better strategy than trying to trick out the database to be highly scalable.
The First Rule of Program Optimization: Don't do it.
The Second Rule of Program Optimization (for experts only!): Don't do it yet.
Michael A. Jackson
If you are just designing the tables, I'd say, it's definitely premature to optimize.
You might want to redesign your database a few days later, you might find out that things work pretty fast without any clever hacks, you might find out they work slow, but in a different way than you expected. In either case you would waste your time, if you start optimizing now.
The approach you describe is generally fine; you could get some pre-computed values, either using triggers/SPs to preserve data consistency, or running a job to update these values time-to-time.
All databases are more than 85% read only! Usually high nineties too.
Tune it when you need to and not before.

Do database views affect query performance?

Are database views only a means to simplify the access of data or does it provide performance benefits when accessing the views as opposed to just running the query which the view is based on? I suspect views are functionally equivalent to just the adding the stored view query to each query on the view data, is this correct or are there other details and/or optimizations happening?
I have always considered Views to be like a read-only Stored Procedures. You give the database as much information as you can in advance so it can pre-compile as best it can.
You can index views as well allowing you access to an optimised view of the data you are after for the type of query you are running.
Although a certain query running inside a view and the same query running outside of the view should perform equivalently, things get much more complicated quickly when you need to join two views together. You can easily end up bringing tables that you don't need into the query, or bringing tables in redundantly. The database's optimizer may have more trouble creating a good query execution plan. So while views can be very good in terms of allowing more fine grained security and the like, they are not necessarily good for modularity.
It depends on the RDBMS, but usually there isn't optimization going on, and it's just a convenient way to simplify queries. Some database systems use "materialized views" however, which do use a caching mechanism.
Usually a view is just a way to create a common shorthand for defining result sets that you need frequently.
However, there is a downside. The temptation is to add in every column you think you might need somewhere sometime when you might like to use the view. So YAGNI is violated. Not only columns, but sometimes additional outer joins get tacked on "just in case". So covering indexes might not cover any more, and the query plan may increase in complexity (and drop in efficiency).
YAGNI is a critical concept in SQL design.
Generally speaking, views should perform equivalently to a query written directly on the underlying tables.
But: there may be edge cases, and it would behoove you to test your code. All modern RDBMS systems have tools that will let you see the queryplans, and monitor execution. Don't take my (or anybody else's) word for it, when you can have the definitive data at your fingertips.
I know this is an old thread. Discussion is good, but I do want to throw in one more thought. Performance also depends on what you are using to pull data with. For example, if you are front-ending with something like Microsoft Access you can definately gain performance for some complex queries by using a view. This is because Access does not always pull from the SQL server as we would like -- in some cases it would pull entire tables across then try to process locally from there! Not so if you use a view.
Yes, in all modern RDBMS's (MSSQL after 2005? etc) view's query plans are cached removing the overhead of planning the query and speeding up performance over the same SQL performed in-line. Previously to this (and it applies to parameterized SQL/Prepared Statements as well) people correctly thought stored procedures performed better.
Many still hang onto this today making it a modern DB myth. Ever since Views/PS's got the cached query planning of SPs they've been pretty much even.

Query vs. View

I want to know what is the difference between a query and a view in terms of performance. And if a view is costly, what else besides a query could I do to improve performance?
I can't speak for all databases, but in SQL Server you cannot index views unless you have an Enterprise version. An unindexed view can be significantly poorer in terms of performance than a query especially if you are writing a query against it to add some where conditions. Indexed views generally can perform fairly well. An indexed view can also be against multiple fields which are in differnt tables and that may imporve performance over the ad hoc query. (It may not too, in performance tuning, you must always test against your particular circumstances.)
One point against views is that they do not allow for run-time selection of where criteria. So often you end up with both a view and a query.
Views can be more easily maintained (Just add that new table in a join and everything accessing financial reports has it available) but they are much more difficult to performance tune. This is in part because they tend to be over generalized and thus are slower than their counterparts which only return the minimum necessary. And yes as Jonathan said, you can far too easily get into joining together views for a report into a mess which joins to the same large tables many more times than need be and is very slow.
Two places where views shine though is:
Making sure that complex relationships are always correctly described. This is one reason why report writers tend to favor them.
Limiting access to a subset of records
There are also limitations on the type of queries that can be done for a view vice an ad hoc query or a stored proc. For instance you can't use an if statement (or other procedural type code such as looping) or as noted above you cannot provide run-time values for the where criteria.
One place where views are often significantly slower is when they call other views. The underlying views need to be fully realized in some databases and thus you might need to callup 4,459,203 records to see the 10 you are ultimately interested in. Start to layer this more than once and it can get very slow, very fast; views that call views are simply a poor practice.
Views and ad-hoc queries, in the simple case, are nearly identical in terms of performance. So much so that when you program with a view, you should think of it as though the text of the view definition were being cut and pasted into your parent query.
HLGEM points out in his answer that certain editions of SQL Server allows you to "index" views -- in this case, behind the scenes SQL Server maintains the same structures that underlie a table, making an indexed view and a table very similar in terms of performance.
In SQL Server, though you can generally nest views fairly liberally without running into performance problems, it can make things more difficult to understand and debug.
In SQL Server I believe that the performance difference between views and queries is negligible. What I would recommend doing to improve performance is to create another table that holds the results of the view. You could perhaps create a staging table where new data is held and then a stored procedure can be run at some interval that populates the working table with the new information. A trigger might be good for this purpose. Depending on the requirements of your application this design may or may not be suitable. If you are working with near real-time data, this approach will lead to concurrency issues...
One other thing to look into, is to make absolutely sure that the base tables you are using to construct your view are indexed correctly, and that the query itself is optimized. Finally, I believe it is possible in SQL Server enterprise to create indexed views although I have not used them before.
If they do exactly the same thing a view might be slightly faster on first execution as the database server will have a precompiled execution plan for it. Depends on your server though.
Empasis on might and slightly...
Views promote code reuse and can abstract away database complexity to give a more coherent 'business' model of data. However they are not nearly as tunable. You may find yourself in a position where you need to provide join hints or other low level optimisations and many DBA's that i have worked with do not like them being applied to views as they may then be reused across many queries, the opinion being that these types of hints should be employed as sparingly as possible. I like using views myself.
A view is barely more expensive to the computer than writing out the query longhand. A view can save the programmer/user a lot of time writing the same query out time after time, and getting it wrong, and so on. The view may also be the only way to access the data if views are also used to enforce authorization (access control) on the underlying tables.
If the query does not perform well, you need to review how the query is formed, and whether the tables all have the appropriate indexes on them. If your system needs accurate statistics for the optimizer to perform well, have you updated those statistics sufficiently recently?
Once upon a long time ago, I came across a system where a query generator had created one query that listed seventeen tables in a single FROM clause, including several LEFT OUTER JOIN of a table with itself. And, in fact, closer scrutiny revealed that several of the 'tables' were in fact multi-table views, and some of these also involved self outer joins, and were themselves involved in self outer joins of the view. To say "ghastly" is an understatement. There was a lot of cleanup possible to improve the performance of that query - eliminating unnecessary outer joins, self joins, and so on. (It actually pre-dated the explicit join notation of SQL-92 - I said a long time ago - so the outer join syntax was DBMS-specific.)
If you mean network performance then working from a local cache (as with ADO.Net DataSets) would reduce network traffic- but could cause problems with locking. Just a thought.
A view is still a query, it just abstracts certain parts of it so that your queries can be simplified (if they do similar things) and to maximize reuse.

Resources