I'm in a situation where I have to improve the performance of about 75 stored procedures (created by someone else) used for reporting. The first part of my solution was creating about 6 denormalized tables that will be used for the bulk of the reporting. Now that I've created the tables I have the somewhat daunting task of determining what Indexes I should create to best improve the performance of these stored procs.
I'm curious whether anyone has suggestions for finding which columns would make sense to include in the indexes. I've contemplated using Profiler/DTA, or possibly fashioning some sort of query like the one below to figure out the popular columns.
SELECT so.name, COUNT(so.name) AS hits, so.xtype
FROM syscomments AS sc
INNER JOIN sysobjects AS so ON sc.id = so.id
WHERE sc.text LIKE '%ColumnName%'
AND so.xtype = 'P'
GROUP BY so.name, so.xtype
ORDER BY hits DESC
Let me know if you have any ideas that would help me not have to dig through these 75 procs by hand.
Also, inserts are only performed on this DB once per day so insert performance is not a huge concern for me.
Any suggestions for identifying what indexes need to be created?
Yes! Ask SQL Server to tell you.
SQL Server automatically keeps track of the indexes its optimizer would have liked to use to improve performance (the "missing index" feature). This is already going on in the background for you. See this link:
http://msdn.microsoft.com/en-us/library/ms345417.aspx
Try running a query like this (taken right from msdn):
SELECT mig.*, statement AS table_name,
column_id, column_name, column_usage
FROM sys.dm_db_missing_index_details AS mid
CROSS APPLY sys.dm_db_missing_index_columns (mid.index_handle)
INNER JOIN sys.dm_db_missing_index_groups AS mig ON mig.index_handle = mid.index_handle
ORDER BY mig.index_group_handle, mig.index_handle, column_id;
Just be careful. I've seen people take the missing index views as Gospel, and use them to push out a bunch of indexes they don't really need. Indexes have costs, in terms of upkeep at insert, update, and delete time, as well as disk space and memory use. To make real, accurate use of this information you want to profile actual execution times of your key procedures both before and after any changes, to make sure the benefits of an index (singly or cumulative) aren't outweighed by the costs.
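If it helps, here is a minimal sketch of that before/after measurement, assuming SQL Server 2008 or later (where sys.dm_exec_procedure_stats is available); note the figures only reflect what is currently in the plan cache:

-- Snapshot per-procedure timings before an index change, then again after, and compare.
SELECT  OBJECT_NAME(ps.object_id, ps.database_id) AS proc_name,
        ps.execution_count,
        ps.total_worker_time   / ps.execution_count AS avg_cpu_microsec,
        ps.total_elapsed_time  / ps.execution_count AS avg_elapsed_microsec,
        ps.total_logical_reads / ps.execution_count AS avg_logical_reads
FROM    sys.dm_exec_procedure_stats AS ps
WHERE   ps.database_id = DB_ID()              -- current database only
ORDER BY ps.total_elapsed_time DESC;          -- most cumulative time first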
If you know all of the activity is coming from the 75 stored procedures, then I would use Profiler to track which stored procedures take the longest and are called the most. Once you know which ones they are, look at those procs and see which columns are used most often in the WHERE clauses and JOIN conditions. Most likely, those are the columns you will want to put non-clustered indexes on. If a set of columns is often used together, there is a good chance you will want to make one non-clustered index for the group. You can have many non-clustered indexes on a table (249 in SQL Server 2005, 999 from SQL Server 2008 onwards), but you probably don't want to put more than a handful on it. I think you will find the data is being searched and joined on the same columns over and over.

Remember the 80/20 rule: you will probably get 80% of your speed increase from the first 20% of the work you do. There will be a point where you get very little speed increase for the added indexes; that is when you want to stop.
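To make the "group of columns" idea concrete, a composite non-clustered index over a purely hypothetical reporting table might look like this (table and column names are invented for illustration):

-- Key columns come from the WHERE and JOIN predicates the procs share;
-- INCLUDE carries the extra columns the SELECTs need, to avoid key lookups.
CREATE NONCLUSTERED INDEX IX_ReportSales_SaleDate_RegionId_ProductId
    ON dbo.ReportSales (SaleDate, RegionId, ProductId)
    INCLUDE (Quantity, Amount);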
I concur with bechbd - use a good sample of your database traffic (by running a server trace on a production system during real office hours, to get the best snapshot), and let the Database Tuning Advisor analyze that sampling.
I agree with you - don't blindly rely on everything the Database Tuning Advisor tells you to do - it's just a recommendation, but the DTA can't take everything into account. Sure - by adding indices you can speed up querying - but you'll slow down inserts and updates at the same time.
Also - to really find out if something helps, you need to implement it, measure again, and compare - that's really the only reliable way. There are just too many variables and unknowns involved.
And of course, you can use the DTA to fine-tune a single query to perform outrageously well - but that might neglect the fact that this query is only ever called once per week, or that by tuning this one query and adding an index, you hurt other queries.
Index tuning is always a balance, a tradeoff, and a trial-and-error kind of game - it's not an exact science with a formula and a recipe book to strictly determine what you need.
You can use SQL Server Profiler (launched from the Tools menu in SSMS) to see what is being run against your tables and how, and then feed that trace to the Database Engine Tuning Advisor to at least start you down the correct path. I know most DBAs will probably scream at me for recommending this, but for us non-DBA types such as myself it at least gives us a starting point.
If this is strictly a reporting database and you need performance, consider moving to a data warehouse design. A star or snowflake schema will outperform even a denormalized relational design when it comes to reporting.
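As a rough, hypothetical sketch of what that means (every table and column name below is invented), the reporting data would be split into a central fact table surrounded by slim dimension tables:

-- Hypothetical star-schema sketch for illustration only.
CREATE TABLE dbo.DimDate    (DateKey    int PRIMARY KEY, CalendarDate date, CalendarMonth tinyint, CalendarYear smallint);
CREATE TABLE dbo.DimProduct (ProductKey int PRIMARY KEY, ProductName nvarchar(100), Category nvarchar(50));

CREATE TABLE dbo.FactSales
(
    DateKey    int   NOT NULL REFERENCES dbo.DimDate    (DateKey),
    ProductKey int   NOT NULL REFERENCES dbo.DimProduct (ProductKey),
    Quantity   int   NOT NULL,
    Amount     money NOT NULL
);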
I've looked for tips on how to speed up a SQL-intensive application and found this particularly useful link.
In point 6 he says:
Do pre-stage data. This is one of my favorite topics because it's an old technique that's often overlooked. If you have a report or a procedure (or better yet, a set of them) that will do similar joins to large tables, it can be a benefit for you to pre-stage the data by joining the tables ahead of time and persisting them into a table. Now the reports can run against that pre-staged table and avoid the large join.

You're not always able to use this technique, but when you can, you'll find it is an excellent way to save server resources.

Note that many developers get around this join problem by concentrating on the query itself and creating a view around the join so that they don't have to type the join conditions again and again. But the problem with this approach is that the query still runs for every report that needs it. By pre-staging the data, you run the join just once (say, 10 minutes before the reports) and everyone else avoids the big join. I can't tell you how much I love this technique; in most environments, there are popular tables that get joined all the time, so there's no reason why they can't be pre-staged.
From what I understood, you join the tables once and several SQL queries can "benefit" from it. That looks extremely interesting for the application I'm working on.
The thing is, I've been searching around for "pre-staging data" and couldn't find anything that seems to be related to this technique. Maybe I'm missing a few keywords?
I'd like to know how to use the described technique within SQL Server. The link says it's an old technique, so it shouldn't be a problem that I'm using SQL Server 2008.
What I would like is the following: I have several SELECT queries that run in a row. All of them join the same 7-8 tables, and they're all really heavy, which impacts performance. So I'm thinking of joining the tables once, running the queries against the result, and then dropping the intermediate table. How can it be done?
If your query meets the requirements for an indexed view then you can just create such an object materialising the result of your query. This means that it will always be up-to-date.
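For instance, here is a minimal sketch of an indexed view, assuming the underlying query qualifies (SCHEMABINDING, inner joins only, non-nullable aggregated expressions, COUNT_BIG(*) alongside any aggregates, and so on); dbo.Orders and dbo.OrderLines are hypothetical tables:

CREATE VIEW dbo.vw_OrderTotals
WITH SCHEMABINDING
AS
SELECT  o.OrderId,
        o.CustomerId,
        SUM(ol.Quantity * ol.UnitPrice) AS OrderTotal,
        COUNT_BIG(*)                    AS LineCount    -- required when the view aggregates
FROM    dbo.Orders     AS o
JOIN    dbo.OrderLines AS ol ON ol.OrderId = o.OrderId
GROUP BY o.OrderId, o.CustomerId;
GO

-- The unique clustered index is what materialises the view and keeps it up to date.
CREATE UNIQUE CLUSTERED INDEX IX_vw_OrderTotals ON dbo.vw_OrderTotals (OrderId);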
Otherwise you would need to write code to materialise it yourself. Either eagerly on a schedule or potentially on first request and then cached for some amount of time that you deem acceptable.
The second approach is not rocket science and can be done with a TRUNCATE ... INSERT ... SELECT for moderate amounts of data or perhaps an ALTER TABLE ... SWITCH if the data is large and there are concurrent queries that might be accessing the cached result set.
Regarding your edit, it seems like you just need to create a #temp table and insert the results of the join into it, then reference the #temp table in the several SELECTs and drop it afterwards. There is no guarantee that this will improve performance, though, and there are insufficient details in the question to even hazard a guess.
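A minimal sketch of that #temp-table approach for the scenario in your edit (all object names are placeholders for your 7-8 real tables):

-- Run the big join once and persist the result for the duration of the session/proc.
SELECT  o.OrderId, o.CustomerId, ol.ProductId, ol.Quantity,
        ol.Quantity * ol.UnitPrice AS LineAmount
INTO    #Staged
FROM    dbo.Orders     AS o
JOIN    dbo.OrderLines AS ol ON ol.OrderId = o.OrderId;
-- ...add the remaining joins from your real query here...

-- Optionally index the temp table on whatever the later SELECTs filter or group on.
CREATE CLUSTERED INDEX IX_Staged_CustomerId ON #Staged (CustomerId);

-- The several heavy SELECTs now read the pre-joined rows instead of repeating the join.
SELECT CustomerId, SUM(LineAmount) AS Total FROM #Staged GROUP BY CustomerId;
SELECT ProductId,  SUM(Quantity)   AS Units FROM #Staged GROUP BY ProductId;

DROP TABLE #Staged;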
What's a fast way to query large amounts of data (between 10,000 and 100,000 rows now; it will get bigger in the future, maybe 1,000,000+) spread across multiple tables (20+), in queries that involve LEFT JOINs and aggregate functions (SUM, MAX, COUNT, etc.)?
My solution would be to make one table that contains all the data I need and have triggers that update this table whenever one of the other tables gets updated. I know that triggers aren't really recommended, but this way I take the load off the querying. Or do one big update every night.
I've also tried views, but once they start involving left joins and calculations they're way too slow and time out.
Since your question is too general, here's a general answer...
The path you're taking right now is optimizing a single query for a single issue. Sure, it might solve the issue you have right now, but it's usually not very good in the long run (not to mention the cumulative cost of maintenance of such a thing).
The common path to take is to create an "analytics" database - a copy of your production database that you query for all your reports. This analytics database can eventually become a full-blown DWH, but you're probably going to start with simple real-time replication (or replicate nightly, or whatever) and work from there...
As I said, the question/problem is too broad to be answered in a couple of paragraphs; these are only some of the guidelines...
Need a bit more details, but I can already suggest this:
Use "with(nolock)", this will slightly improve the speed.
Reference: Effect of NOLOCK hint in SELECT statements
Add indexes on the table columns you filter and join on, to fetch data faster.
Reference: sql query to select millions record very fast
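A hedged illustration of both suggestions above (dbo.Orders and its columns are placeholder names; NOLOCK means dirty reads are possible, so only use it where that is acceptable for reporting):

-- Read without taking shared locks (uncommitted data may be returned).
SELECT  o.OrderId, o.OrderDate, o.Amount
FROM    dbo.Orders AS o WITH (NOLOCK)
WHERE   o.OrderDate >= '20130101';

-- Index the column the query filters on so the scan can become a seek.
CREATE NONCLUSTERED INDEX IX_Orders_OrderDate
    ON dbo.Orders (OrderDate)
    INCLUDE (Amount);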
As I understand it, most query optimizers are "cost-based". Others are "rule-based", or I believe they call it "Syntax Based". So, what's the best way to optimize the syntax of SQL statements to help an optimizer produce better results?
Some cost-based optimizers can be influenced by "hints" like FIRST_ROWS(). Others are tailored for OLAP. Is it possible to know more detailed logic about how Informix IDS and SE's optimizers decide what's the best route for processing a query, other than SET EXPLAIN? Is there any documentation which illustrates the ranking of SELECT statements as to what's the fastest way to access rows, assuming it's indexed?
I would imagine that "SELECT col FROM table WHERE ROWID = n" is the fastest (rank 1).
If I'm not mistaken, Informix SE's ROWID is a SERIAL (INT), which allows for a maximum of about 2 billion rows - or maybe it uses INT8 for terabyte-scale row counts? SE's optimizer is cost-based when it has enough data, but it does not use distributions like the IDS optimizer does.
IDS' ROWID isn't an INT; it is the logical address of the row's page, left-shifted 8 bits, plus the slot number on the page that contains the row's data. IDS' optimizer is a cost-based optimizer that uses data about the index depth and width, the number of rows, the number of pages, and the data distributions created by UPDATE STATISTICS MEDIUM and HIGH to decide which query path is the least expensive - but there's no ranking of statements?
I think Oracle uses hex values for ROWID. Too bad ROWID can't be used more often, since a row's ROWID can change. So maybe ROWID could be used by the optimizer as a counter to report a query's progress - an idea I mentioned in my "Begin viewing query results before query completes" question? I feel it wouldn't be that difficult to report a query's progress while it is being processed, perhaps at the expense of some slight overhead, but it would be nice to have: a "Google-like" estimate of how many rows meet the query's criteria, progress displayed every 100, 200, 500 or 1,000 rows, the ability for users to cancel at any time, and the qualifying rows displayed as they are added to the result list while the search continues. This is just one example; perhaps we could think of other neat/useful features, since the ingredients are more or less there.
Perhaps we could fine-tune each query with more granularity than is currently available? OLTP queries tend to be mostly static and pre-defined; the "what-ifs" are more OLAP, so let's try to add more control and intelligence there. What's needed, therefore, is the ability to precisely control the optimizer, not just hint at or influence it. We could then have more dynamic SELECT statements for specific situations, and maybe even tell IDS to read blocks of index nodes at a time instead of one by one, etc.
I'm not really sure what you are after, but here is some info on the SQL Server query optimizer which I've recently read:
13 Things You Should Know About Statistics and the Query Optimizer
SQL Server Query Execution Plan Analysis
and one for Informix that I just found using google:
Part 1: Tuning Informix SQL
For Oracle, your best resource would be Cost-Based Oracle Fundamentals. It's about 500 pages (and billed as Volume 1, but there haven't been any follow-ups yet).
For a (very) simple full-table scan, progress can sometimes be monitored through v$session_longops. Oracle knows how many blocks it has to scan, how many blocks it has scanned, how many it has to go, and reports on progress.
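For example, a query along these lines (using standard v$session_longops columns; the percentage is just for readability) lists operations still in progress:

-- Long-running operations that have not yet completed (Oracle).
SELECT sid, opname, target, sofar, totalwork,
       ROUND(100 * sofar / totalwork, 1) AS pct_done,
       time_remaining
FROM   v$session_longops
WHERE  totalwork > 0
AND    sofar < totalwork;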
Indexes are a different matter. If I search for records for a client 'Frank', and use the index, the database will make a guess at how many 'Frank' entries are in the table, but that guess can be massively off. It may be that you have 1000 'Frankenstein' and just 1 'Frank' or vice versa.
It gets even more complicated as you add in other filter and access predicates (e.g. where multiple indexes can be chosen), and makes another leap as you include table joins. And that's without getting into the complex stuff about remote databases and domain indexes like Oracle Text and Locator.
In short, it is very complicated. It is stuff that can be useful to know if you are responsible for tuning a large application. Even for basic development you need some grounding in how the database can physically retrieve the data you are interested in.
But I'd say you are going the wrong way here. The point of an RDBMS is to abstract the details so that, for the most part, they just happen. Oracle employs smart people to write query-transformation logic into the optimizer so that we developers can move away from 'syntax fiddling' to get the best plans (it's not all the way there yet, but it is getting better).
I've been hearing a lot lately that I ought to take a look at the execution plan of my SQL to make a judgment on how well it will perform. However, I'm not really sure where to begin with this feature or what exactly it means.
I'm looking for either a good explanation of what the execution plan does, what its limitations are, and how I can utilize it or direction to a resource that does.
It describes actual algorithms which the server uses to retrieve your data.
An SQL query like this:
SELECT *
FROM mytable1
JOIN mytable2
ON …
GROUP BY
…
ORDER BY
…
, describes what should be done but not how it should be done.
The execution plan shows how: which indexes are used, which join methods are chosen (nested loops or hash join or merge join), how the results are grouped (using sorting or hashing), how they are ordered etc.
Unfortunately, even modern SQL engines cannot automatically find the optimal plans for more or less complex queries; it still takes an SQL developer to reformulate the queries so that they are performant (even though the reformulated queries do the same thing as the original).
A classical example would be these two queries:
SELECT  (
        SELECT  COUNT(*)
        FROM    mytable mi
        WHERE   mi.id <= mo.id
        )
FROM    mytable mo
ORDER BY
        id
and
SELECT RANK() OVER (ORDER BY id)
FROM mytable
, which do the same thing and in theory should be executed using the same algorithms.
However, no actual engine will optimize the former query into the latter's algorithm, i.e. keep a counter in a variable and increment it.
It will do what it's told to do: count the rows over and over and over again.
To optimize the queries you need to actually see what's happening behind the scenes, and that's what the execution plans show you.
You may want to read this article in my blog:
Double-thinking in SQL
Here and here are some articles - check them out. Execution plans let you identify the parts of a query that are time-consuming and therefore allow you to improve it.
An execution plan shows exactly how SQL Server processes a query.
It is produced as part of the query optimisation process that SQL Server performs. It is not something that you create directly.
It will show which indexes SQL Server has decided are best to use, and basically is a plan for how it will process the query.
The query optimiser will take a query, analyse it and potentially come up with a number of different execution plans. It's a cost-based optimisation process, and it will choose the one that it feels is the best.
Once an execution plan has been generated, it goes into the plan cache, so that subsequent calls for the same query can reuse it and save having to redo the optimisation work (a query for peeking at that cache is sketched after this list).
Execution plans automatically get dropped from the cache, depending on their value (low-value plans get removed before high-value plans in order to provide maximum performance gain).
Execution plans help you spot performance issues such as missing indexes.
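A quick sketch of peeking at that plan cache (standard DMVs, available from SQL Server 2005 onwards):

-- Cached plans and how many times each has been reused.
SELECT  cp.usecounts, cp.objtype, cp.size_in_bytes, st.text
FROM    sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text(cp.plan_handle) AS st
ORDER BY cp.usecounts DESC;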
A way to ease into this is simply to use Ctrl+L (Query | Display Estimated Execution Plan) for some of your queries in SQL Server Management Studio.
This will show a graphical view of the execution plan, which at first is easier to "decode" than the text version.
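If you prefer to script it rather than use the menu or shortcut, the same estimated plan can be requested in T-SQL (the SELECT below is just a placeholder query):

SET SHOWPLAN_XML ON;   -- return the estimated plan instead of executing the query
GO
SELECT * FROM dbo.SomeTable WHERE SomeColumn = 42;   -- placeholder query
GO
SET SHOWPLAN_XML OFF;
GO

-- Or execute the query and capture the actual plan alongside the results:
SET STATISTICS XML ON;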
Query plans in a tiny nutshell:
Essentially, the query plan shows the way SQL Server intends to resolve a query.
There are indeed many options, even with simple queries.
For example, when dealing with a JOIN, one needs to decide whether to loop through the [filtered] rows of "table A" and look up the matching rows of "table B", or to loop through "table B" first instead (this is a simplified example, as there are many other tricks which can be used in dealing with JOINs). Typically, SQL Server will estimate the number of [filtered] rows produced by either table and pick the one with the smallest count for the outer loop (as this will reduce the number of lookups in the other table).
Another example is deciding which indexes to use (or not to use).
There are many online resources as well as books which describe query plans in more detail. The difficulty is that SQL performance optimization is a very broad and complex problem, and many such resources tend to go into too much detail for the novice; one first needs to understand the fundamental principles and structures which underlie SQL Server (the way indexes work, the way the data is stored, the difference between clustered indexes and heaps...) before diving into many of the [important] details of query optimization. It is a bit like baseball: first you need to know the rules before understanding all the subtle [and important] concepts related to the game strategy.
See this related SO Question for additional pointers.
Here's a great resource to help you understand them
http://downloads.red-gate.com/ebooks/HighPerformanceSQL_ebook.zip
This is from Red Gate, a company that makes great SQL Server tools; it's free and well worth the time to download and read.
It is a very serious area of knowledge, and I highly recommend dedicated training courses on it. As for me, after spending a week on such courses I boosted the performance of my queries by a factor of about 1000 (nostalgia).
The Execution Plan shows you how the database is fetching, sorting and filtering the data required for your query.
For example:
SELECT
*
FROM
TableA
INNER JOIN
TableB
ON
TableA.Id = TableB.TableAId
WHERE
TableB.TypeId = 2
ORDER BY
TableB.Date ASC
Would result in an execution plan showing the database getting records from TableA and TableB, matching them to satisfy the JOIN, filtering to satisfy the WHERE and sorting to satisfy the ORDER BY.
From this, you can work out what is slowing down the query, whether it would be beneficial to review your indexes or if you can speed things up in another way.
Is there an SQL query that can tell me which fields are used in the most stored procedures, or updated and selected most often in a given table? I am asking because I want to figure out which fields to put indexes on.
Take a look at the missing indexes article on SQLServerPedia http://sqlserverpedia.com/wiki/Find_Missing_Indexes
I think you are looking at the problem the wrong way around.
What you first need to identify are the most expensive (cumulative: so both single-run high cost, and many-runs lower cost) queries in your normal workload.
Once you have identified those queries, you can analyse their query plans and create appropriate indexes.
This SO Answer might be of use: How Can I Log and Find the Most Expensive Queries? (title and tags say SQL Server 2008, but my accepted answer applies to any version).
The most-used fields are by no means automatically good index candidates. Good index candidates are those that correctly balance the extra storage requirements with SARGability and query projection coverage, as described in Index Design Basics. You should follow the advice the engine itself is giving you, using the Missing Indexes feature:
sys.dm_db_missing_index_group_stats
sys.dm_db_missing_index_groups
sys.dm_db_missing_index_details
sys.dm_db_missing_index_columns
A good action plan is to start from the most expensive queries by IO obtained from sys.dm_exec_query_stats, then open the plan of each query with sys.dm_exec_query_plan in Management Studio; at the top of the query plan view there will be a proposed index, with the CREATE INDEX statement ready to copy, paste and execute. In fact, you don't even have to run the queries yourself to find the most expensive ones in the plan cache; there are already SSMS reports that can find them for you.
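A sketch of that starting point: the most expensive statements by IO from sys.dm_exec_query_stats, joined to their cached plans so the proposed index at the top of each plan can be inspected:

-- Top statements by cumulative logical reads, with their cached plans.
SELECT TOP (20)
        qs.total_logical_reads,
        qs.execution_count,
        SUBSTRING(st.text, qs.statement_start_offset / 2 + 1,
                  (CASE WHEN qs.statement_end_offset = -1
                        THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset END
                   - qs.statement_start_offset) / 2 + 1) AS statement_text,
        qp.query_plan                                    -- click in SSMS to open the graphical plan
FROM    sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text  (qs.sql_handle)  AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
ORDER BY qs.total_logical_reads DESC;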
Wish it was that easy.... You need to do a search for SQL Query Optimization. There are lots of tools to use for what you need.
And knowing which fields are used often does not tell you much about what you need to do to optimize access.
You might have some luck by searching your code for all the queries and then running EXPLAIN on them. This should give you some glaring numbers when a query is poorly indexed. Unfortunately it won't give you an idea of which statements are called most frequently.
Also, I've noticed some issues with type mismatches in queries. When you pass a string containing a number and the query hits an index based on integers, the database can end up scanning the entire table.