Does the order of the columns in a WHERE clause effect performance?
e.g.
Say I put a column that has a higher potential for uniqueness first or visa versa?
With a decent query optimiser: it shouldn't.
But in practice, I suspect it might.
You can only tell for your cases by measuring. And the measurements will likely change as the distribution of data changes in the database.
For Transact-SQL there is a defined precedence for operators in the condition of the WHERE clause. The optimizer may re-order this evaluation, so you shouldn't rely on short-circuiting behavior for correctness. The order is generally left to right, but selectivity/availability of indexes probably also matters. Simplifying your search condition should improve the ability of the optimizer to handle it.
Ex:
WHERE (a OR b) AND (b OR c)
could be simplified to
WHERE b OR (a AND c)
Clearly in this case if the query can be constructed to find if b holds first it may be able to skip the evaluation of a and c and thus would run faster. Whether the optimizer can do this simple transformation I can't answer (it may be able to), but the point is that it probably can't do arbitrarily complex transformations and you may be able to effect query performance by rearranging your condition. If b is more selective or has an index, the optimizer would likely be able to construct a query using it first.
EDIT: With regard to your question about ordering based on uniqueness, I would assume that the any hints you can provide to the optimizer based on your knowledge (actual, not assumed) of the data couldn't hurt. Pretend that it won't do any optimization and construct your query as if you needed to define it from most to least selective, but don't obsess about it until performance is actually a problem.
Quoting from the reference above:
The order of precedence for the logical operators is NOT (highest),
followed by AND, followed by OR. Parentheses can be used to override
this precedence in a search condition. The order of evaluation of
logical operators can vary depending on choices made by the query
optimizer.
For SQL Server 2000 / 20005 / 2008, the query optimizer usually will give you identical results no matter how you arrange the columns in the WHERE clause. Having said this, over the years of writing thousands of T-SQL commands I have found a few corner cases where the order altered the performance. Here are some characteristics of the queries that appeared to be subject to this problem:
If you have a large number of tables in your query (10 or more).
If you have several EXISTS, IN, NOT EXISTS, or NOT IN statements in your WHERE clause
If you are using nested CTE's (common-table expressions) or a large number of CTE's.
If you have a large number of sub-queries in your FROM clause.
Here are some tips on trying to evaluate the best way to resolve the performance issue quickly:
If the problem is related to 1 or 2, then try reordering the WHERE clause and compare the sub-tree cost of the queries in the estimated query plans.
If the problem is related to 3 or 4, then try moving the sub-queries and CTE's out of the query and have them load temporary tables. The query plan optimizer is FAR more efficient at estimating query plans if you reduce the number of complex joins and sub-queries from the body of the T-SQL statement.
If you are using temporary tables, then make certain you have specified primary keys for the temporary tables. This means avoid using SELECT INTO FROM to generate the table. Instead, explicitly create the table and specify a primary KEY before using an INSERT INTO SELECT statement.
If you are using temporary tables and MANY processes on the server use temporary tables as well, then you may want to make a more permanent staging table that is truncated and reloaded during the query process. You are more likely to encounter disk contention issues if you are using the TempDB to store your working / staging tables.
Move the statements in the WHERE clause that will filter the most data to the beginning of the WHERE clause. Please note that if this is your solution to the problem, then you will probably have poor performance again down the line when the query plan gets confused again about generating and picking the best execution plan. You are BEST off finding a way to reduce the complexity of the query so that the order of the WHERE clause is no longer relevant.
I hope you find this information helpful. Good luck!
It all depends on the DBMS, query optimizer and rules, but generally it does affect performance.
If a where clause is ordered such that the first condition reduces the resultset significantly, the remaining conditions will only need to be evaluated for a smaller set. Following that logic, you can optimize a query based on condition order in a where clause.
In theory any two queries that are equivalent should produce identical query plans. As the order of WHERE clauses has no effect on the logical meaning of the query, this should mean that the order of the WHERE clause should have no effect.
This is because of the way that the query optimiser works. In a vastly simplified overview:
First SQL Server parses the query and constructs a tree of logical operators (e.g JOIN or SELECT).
Then it translates these logical operators into a "tree of physcial operations" (e.g. "Nested Loops" or "Index scan", i.e. an execution plan)
Next it permutates through the set of equivalent "trees of physcial operations" (i.e. execution plans) by swapping out equivalent operations, estimating the cost of each plan until it finds the optimal one.
The second step is done is a completely nieve way - it simply chooses the first / most obvious physical tree that it can, however in the 3rd step the query optimiser is able to look through all equivalent physical trees (i.e. execution plans), and so as long as the queries are actually equivalent it doesn't matter what initial plan we get in step 2, the set of plans all plans to be considered in step 3 is the same.
(I can't remember the real names for the logical / physical trees, they are in a book but unfortunately the book is the other side of the world from me right now)
See the following series of blog articles for more detail Inside the Optimizer: Constructing a Plan - Part 1
In reality however often the query optimiser doesn't have the chance to consider all equivalent trees in step 3 (for complex queries there can be a massive number of possible plans), and so after a certain cutoff time step 3 is cut short and the query optimiser has to choose the best plan that it has found so far - in this case not all plans will be considered.
There is a lot of behind the sceene magic that goes on to ensure that the query optimiser selectively and inteligently chooses plans to consider, and so most of the time the plan choses is "good enough" - even if its not the absolute fastest plan, its probably not that much slower than the theoretical fastest,
What this means however is that if we have a different starting plan in step 2 (which might happen if we write our query differently), this potentially means that a different subset of plans is considered in step 3, and so in theory SQL Server can come up with different query plans for equivalent queries depending on the way that they were written.
In reality however 99% of the time you aren't going to notice the difference (for many simple plans there wont be any difference as the optimiser will actually consider all plans). Also you can't predict how any of this is going to work and so things that might seem sensible (like putting the WHERE clauses in a certain order), might not have anything like the expected effect.
In the vast majority of cases the query optimizer will determine the most efficient way to select the data you have requested, irrespective of the ordering of the SARGS defined in the WHERE clause.
The ordering is determined by factors such as the selectivity of the column in question (which SQL Server knows based on statistics) and whether or not indexes can be used.
If you are ANDing conditions the first not true will return false, so order can affect performance.
Related
There is a rather complex SQL Server query I have been attempting to optimize for some months now which takes a very long time to execute despite multiple index additions (adding covering, non-clustered indexes) and query refactoring/changes. Without getting into the full details, the execution plan is below. Is there anything here which jumps out to anyone as particularly inefficient or bad? I got rid of all key lookups and there appears to be heavy use of index seeks which is why I am confused that it still takes a huge amount of time somehow. When the query runs, the bottleneck is clearly CPU (not disk I/O). Thanks much for any thoughts.
OK so I made a change based on Martin's comments which have seemingly greatly helped the query speed. I'm not 100% positive this is the solution bc I've been running this a lot and it's possible that so much underlying data has been put into memory that it is now fast. But I think there is actually a true difference.
Specifically, the 3 scans inside of the nested loops were being caused by sub-queries on very small tables that contain a small set of records to be completely excluded from the result set. Conceptually, the query was something like:
SELECT fields
FROM (COMPLEX JOIN)
WHERE id_field NOT IN (SELECT bad_ID_field FROM BAD_IDs)
the idea being that if a record appears in BAD_IDs it should never be included in the results.
I tinkered with this and changed it to something like:
SELECT fields
FROM (COMPLEX JOIN)
LEFT JOIN BAD_IDs ON id_field = bad_ID_field
WHERE BAD_IDs.bad_ID_field IS NULL
This is logically the same thing - it excludes results for any ID in BAD_IDs - but it uses a join instead of a subquery. Even the execution plan is almost identical; a TOP operation gets changed to a FILTER elsewhere in the tree, but the clustered index scan is still there.
But, it seems to run massively faster! Is this to be expected? I have always assumed that a subquery used in the fashion I did was OK and that the server would know how to create the fastest (and presumably identical, which it almost is) execution plan. Is this not correct?
Thx!
As I know, from the relational database theory, a select statement without an order by clause should be considered to have no particular order. But actually in SQL Server and Oracle (I've tested on those 2 platforms), if I query from a table without an order by clause multiple times, I always get the results in the same order. Does this behavior can be relied on? Anyone can help to explain a little?
No, that behavior cannot be relied on. The order is determined by the way the query planner has decided to build up the result set. simple queries like select * from foo_table are likely to be returned in the order they are stored on disk, which may be in primary key order or the order they were created, or some other random order. more complex queries, such as select * from foo where bar < 10 may instead be returned in order of a different column, based on an index read, or by the table order, for a table scan. even more elaborate queries, with multipe where conditions, group by clauses, unions, will be in whatever order the planner decides is most efficient to generate.
The order could even change between two identical queries just because of data that has changed between those queries. a "where" clause may be satisfied with an index scan in one query, but later inserts could make that condition less selective, and the planner could decide to perform a subsequent query using a table scan.
To put a finer point on it. RDBMS systems have the mandate to give you exactly what you asked for, as efficiently as possible. That efficiency can take many forms, including minimizing IO (both to disk as well as over the network to send data to you), minimizing CPU and keeping the size of its working set small (using methods that require minimal temporary storage).
without an ORDER BY clause, you will have not asked exactly for a particular order, and so the RDBMS will give you those rows in some order that (maybe) corresponds with some coincidental aspect of the query, based on whichever algorithm the RDBMS expects to produce the data the fastest.
If you care about efficiency, but not order, skip the ORDER BY clause. If you care about the order but not efficiency, use the ORDER BY clause.
Since you actually care about BOTH use ORDER BY and then carefully tune your query and database so that it is efficient.
No, you can't rely on getting the results back in the same order every time. I discovered that when working on a web page with a paged grid. When I went to the next page, and then back to the previous page, the previous page contained different records! I was totally mystified.
For predictable results, then, you should include an ORDER BY. Even then, if there are identical values in the specified columns there, you can get different results. You may have to ORDER BY fields that you didn't really think you needed, just to get a predictable result.
Tom Kyte has a pet peeve about this topic. For whatever reason, people are fascinated by this, and keep trying to come up with cases where you can rely upon a specific order without specifying ORDER BY. As others have stated, you can't. Here's another amusing thread on the topic on the AskTom website.
The Right Answer
This is a new answer added to correct the old one. I've got answer from Tom Kyte and I post it here:
If you want rows sorted YOU HAVE TO USE AN ORDER. No if, and, or buts about it. period. http://tkyte.blogspot.ru/2005/08/order-in-court.html You need order by on that IOT. Rows are sorted in leaf blocks, but leaf blocks are not stored sorted. fast full scan=unsorted rows.
https://twitter.com/oracleasktom/status/625318150590980097
https://twitter.com/oracleasktom/status/625316875338149888
The Wrong Answer
(Attention! The original answer on the question was placed below here only for the sake of the history. It's wrong answer. The right answer is placed above)
As Tom Kyte wrote in the article mentioned before:
You should think of a heap organized table as a big unordered
collection of rows. These rows will come out in a seemingly random
order, and depending on other options being used (parallel query,
different optimizer modes and so on), they may come out in a different
order with the same query. Do not ever count on the order of rows from
a query unless you have an ORDER BY statement on your query!
But note he only talks about heap-organized tables. But there is also index-orgainzed tables. In that case you can rely on order of the select without ORDER BY because order implicitly defined by primary key. It is true for Oracle.
For SQL Server clustered indexes (index-organized tables) created by default. There is also possibility for PostgreSQL store information aligning by index. More information can be found here
UPDATE:
I see, that there is voting down on my answer. So I would try to explain my point a little bit.
In the section Overview of Index-Organized Tables there is a phrase:
In an index-organized table, rows are stored in an index defined on the primary key for the table... Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order.
http://docs.oracle.com/cd/E25054_01/server.1111/e25789/indexiot.htm#CBBJEBIH
Because of index, all data is stored in specific order, I believe same is true for Pg.
http://www.postgresql.org/docs/9.2/static/sql-cluster.html
If you don't agree with me please give me a link on the documenation. I'll be happy to know that there is something to learn for me.
When we add or remove a new index to speed up something, we may end up slowing down something else.
To protect against such cases, after creating a new index I am doing the following steps:
start the Profiler,
run a SQL script which contains lots of queries I do not want to slow down
load the trace from a file into a table,
analyze CPU, reads, and writes from the trace against the results from the previous runs, before I added (or removed) an index.
This is kind of automated and kind of does what I want. However, I am not sure if there is a better way to do it. Is there some tool that does what I want?
Edit 1 The person who voted to close my question, could you explain your reasons?
Edit 2 I googled up but did not find anything that explains how adding an index can slow down selects. However, this is a well known fact, so there should be something somewhere. If nothing comes up, I can write up a few examples later on.
Edit 3 One such example is this: two columns are highly correlated, like height and weight. We have an index on height, which is not selective enough for our query. We add an index on weight, and run a query with two conditions: a range on height and a range on weight. because the optimizer is not aware of the correlation, it grossly underestimates the cardinality of our query.
Another example is adding an index on increasing column, such as OrderDate, can seriously slow down a query with a condition like OrderDate>SomeDateAfterCreatingTheIndex.
Ultimately what you're asking can be rephrased as 'How can I ensure that the queries that already use an optimal, fast, plan do not get 'optimized' into a worse execution plan?'.
Whether the plan changes due to parameter sniffing, statistics update or metadata changes (like adding a new index) the best answer I know of to keep the plan stable is plan guides. Deploying plan guides for critical queries that already have good execution plans is probably the best way to force the optimizer into keep using the good, validated, plan. See Applying a Fixed Query Plan to a Plan Guide:
You can apply a fixed query plan to a plan guide of type OBJECT or
SQL. Plan guides that apply a fixed query plan are useful when you
know about an existing execution plan that performs better than the
one selected by the optimizer for a particular query.
The usual warnings apply as to any possible abuse of a feature that prevents the optimizer from using a plan which may be actually better than the plan guide.
How about the following approach:
Save the execution plans of all typical queries.
After applying new indexes, check which execution plans have changed.
Test the performance of the queries with modified plans.
From the page "Query Performance Tuning"
Improve Indexes
This page has many helpful step-by-step hints on how to tune your indexes for best performance, and what to watch for (profiling).
As with most performance optimization techniques, there are tradeoffs. For example, with more indexes, SELECT queries will potentially run faster. However, DML (INSERT, UPDATE, and DELETE) operations will slow down significantly because more indexes must be maintained with each operation. Therefore, if your queries are mostly SELECT statements, more indexes can be helpful. If your application performs many DML operations, you should be conservative with the number of indexes you create.
Other resources:
http://databases.about.com/od/sqlserver/a/indextuning.htm
However, it’s important to keep in mind that non-clustered indexes slow down the data modification and insertion process, so indexes should be kept to a minimum
http://searchsqlserver.techtarget.com/tip/Stored-procedure-to-find-fragmented-indexes-in-SQL-Server
Fragmented indexes and tables in SQL Server can slow down application performance. Here's a stored procedure that finds fragmented indexes in SQL servers and databases.
Ok . First off, index's slow down two things (at least)
-> insert/update/delete : index rebuild
-> query planning : "shall I use that index or not ?"
Someone mentioned the query planner might take a less efficient route - this is not supposed to happen.
If your optimizer is even half-decent, and your statistics / parameters correct, there is no way it's going to pick the wrong plan.
Either way, in your case (mssql), you can hardly trust the optimizer and will still have to check every time.
What you're currently doing looks quite sound, you should just make sure the data you're looking at is relevant, i.e. real use case queries in the right proportion (this can make a world of difference).
In order to do that I always advise to write a benchmarking script based on real use - through logging of production-env. queries, a bit like I said here :
Complete db schema transformation - how to test rewritten queries?
I've heard that using an IN Clause can hurt performance because it doesn't use Indexes properly. See example below:
SELECT ID, Name, Address
FROM people
WHERE id IN (SELECT ParsedValue FROM UDF_ParseListToTable(#IDList))
Is it better to use the form below to get these results?
SELECT ID,Name,Address
FROM People as p
INNER JOIN UDF_ParseListToTable(#IDList) as ids
ON p.ID = ids.ParsedValue
Does this depend on which version of SQL Server you are using? If so which ones are affected?
Yes, assuming relatively large data sets.
It's considered better to use EXISTS for large data sets. I follow this and have noticed improvements in my code execution time.
According to the article, it has to do with how the IN vs. EXISTS is internalized. Another article: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
It's very simple to find out - open Management studio, put both versions of the query in, then run with the Show Execution plan turned on. Compare the two execution plans. Often, but not always, the query optimizer will make the same exact plan / literally do the same thing for different versions of a query that are logically equivalent.
In fact, that's its purpose - the goal is that the optimizer would take ANY version of a query, assuming the logic is the same, and make an optimal plan. Alas, the process isn't perfect.
Here's one scientific comparison:
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
IN can hurt performance because SQL Server must generate a complete result set and then create potentially a huge IF statement, depending on the number of rows in the result set. BTW, calling a UDF can be a real performance hit as well. They are very nice to use but can really impact performance, if you are not careful. You can Google UDF and Performance to do some research on this.
More than the IN or the Table Variable, I would think that proper use of an Index would increase the performance of your query.
Also, from the table name, it does not seem like you are going to have a lot of entries in it so which way you go may be moot point in this particular example.
Secondly, IN will be evaluated only once since there is no subquery. In your case, the #IDList variable is probably going to cause mistmatches you will need #IDList1, #IDList2, #IdList3.... because IN demands a list.
As a general rule of thumb, you should avoid IN with subqueries and use EXISTS with a join - you will get better performance more often than not.
Your first example is not the same as your second example, because WHERE X IN (#variable) is the same as WHERE X = #variable (i.e. you cannot have variable lists).
Regarding performance, you'll have to look at the execution plans to see what indexes are chosen.
I've been hearing a lot lately that I ought to take a look at the execution plan of my SQL to make a judgment on how well it will perform. However, I'm not really sure where to begin with this feature or what exactly it means.
I'm looking for either a good explanation of what the execution plan does, what its limitations are, and how I can utilize it or direction to a resource that does.
It describes actual algorithms which the server uses to retrieve your data.
An SQL query like this:
SELECT *
FROM mytable1
JOIN mytable2
ON …
GROUP BY
…
ORDER BY
…
, describes what should be done but not how it should be done.
The execution plan shows how: which indexes are used, which join methods are chosen (nested loops or hash join or merge join), how the results are grouped (using sorting or hashing), how they are ordered etc.
Unfortunately, even modern SQL engines cannot automatically find the optimal plans for more or less complex queries, it still takes an SQL developer to reformulate the queries so that they are performant (even they do what the original query does).
A classical example would be these too queries:
SELECT (
SELECT COUNT(*)
FROM mytable mi
WHERE mi.id <= mo.id
)
FROM mytable mo
ORDER BY
id
and
SELECT RANK() OVER (ORDER BY id)
FROM mytable
, which do the same and in theory should be executed using the same algorithms.
However, no actual engine will optimize the former query to implement the same algorithms, i. e. store a counter in a variable and increment it.
It will do what it's told to do: count the rows over and over and over again.
To optimize the queries you need to actually see what's happening behind the scenes, and that's what the execution plans show you.
You may want to read this article in my blog:
Double-thinking in SQL
Here and Here are some article check it out. Execution plans lets you identify the area which is time consuming and therefore allows you to improve your query.
An execution plan shows exactly how SQL Server processes a query
it is produced as part of the query optimisation process that SQL Server does. It is not something that you directly create.
it will show what indexes it has decided are best to be used, and basically is a plan for how SQL server processes a query
the query optimiser will take a query, analyse it and potentially come up with a number of different execution plans. It's a cost-based optimisation process, and it will choose the one that it feels is the best.
once an execution plan has been generated, it will go into the plan cache so that subsequent calls for that same query can reuse the same plan again to save having to redo the work to come up with a plan.
execution plans automatically get dropped from the cache, depending on their value (low value plans get removed before high value plans do in order to provide maximum performance gain)
execution plans help you spot performance issues such as where indexes are missing
A way to ease into this, is simply by using "Ctrl L" (Query | Display Estimated Execution Plan) for some of your queries, in SQL Management Studio.
This will result in showing a graphic view of Execution Plan, which, at first are easier to "decode" than the text version thereof.
Query plans in a tiny nutshell:
Essentially the query plan show the way SQL Server intends to use in resolving a query.
There are indeed many options, even with simple queries.
For example when dealing with a JOIN, one needs to decide whether to loop through the [filtered] rows of "table A" and to lookup the rows of "table B", or to loop through "table B" first instead (this is a simplified example, as there are many other tricks which can be used in dealing with JOINs). Typically, SQL will estimate the number of [filtered] rows which will be produced by either table and pick the one which the smallest count for the outer loop (as this will reduce the number of lookups in the other table)
Another example, is to decide which indexes to use (or not to use).
There are many online resources as well as books which describe the query plans in more detail, the difficulty is that SQL performance optimization is a very broad and complex problem, and many such resources tend to go into too much detail for the novice; One first needs to understand the fundamental principles and structures which underlie SQL Server (the way indexes work, the way the data is stored, the difference between clustered indexes and heaps...) before diving into many of the [important] details of query optimization. It is a bit like baseball: first you need to know the rules before understanding all the subtle [and important] concepts related to the game strategy.
See this related SO Question for additional pointers.
Here's a great resource to help you understand them
http://downloads.red-gate.com/ebooks/HighPerformanceSQL_ebook.zip
This is from red-gate which is a company that makes great SQL server tools, it's free and it's well worth the time to download and read.
it is a very serious part of knowledge. And I highly to recommend special training courses about that. As for me after spent week on courses I boosted performance of queries about 1000 times (nostalgia)
The Execution Plan shows you how the database is fetching, sorting and filtering the data required for your query.
For example:
SELECT
*
FROM
TableA
INNER JOIN
TableB
ON
TableA.Id = TableB.TableAId
WHERE
TableB.TypeId = 2
ORDER BY
TableB.Date ASC
Would result in an execution plan showing the database getting records from TableA and TableB, matching them to satisfy the JOIN, filtering to satisfy the WHERE and sorting to satisfy the ORDER BY.
From this, you can work out what is slowing down the query, whether it would be beneficial to review your indexes or if you can speed things up in another way.