Can it make any difference to query optimisation to have WHERE clauses in a different order for SQL Server?
For example, would the query plan for this:
select * from table where col1 = @var1 and col2 = @var2
be different from this?:
select * from table where col2 = @var2 and col1 = @var1
Of course this is a contrived example, and I have tried more complex ones out. The query plan was the same for both, but I've always wondered whether it was worth ordering WHERE clauses such that the most specific clauses come first, in case the optimiser somehow "prunes" results and could end up being faster.
This is really just a thought experiment, I'm not looking to solve a specific performance problem.
What about other RDBMSs, too?
Every modern RDBMS has a query optimizer which is responsible, among other things, for reordering the conditions. Some optimizers use pretty sophisticated statistics to do this, and they often beat human intuition about what will be a good ordering and what will not. So my guess would be: if you can definitely say "This ordering is better than the other one", so can the optimizer, and it will figure it out by itself.
Conclusion: Don't worry about such things. It is rarely worth the time.
Readability for humans should be your only goal when you determine the order of the conditions in the where clause. For example, if you have a join on two tables, A and B, write all conditions for A, then all conditions for B.
No, the query optimizer will work out which index or statistics to use anyway. I'm not entirely sure, but I even think that boolean expressions in SQL are not evaluated from left to right but can be evaluated in any order by the query optimizer.
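One way to check this claim for yourself is to compare the plans for both orderings. The sketch below uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made purely so the example is self-contained); the table t and the index idx_t_col1_col2 are hypothetical names. On SQL Server you would compare the two execution plans in Management Studio instead.

```python
import sqlite3

# SQLite stands in for SQL Server here; table t and the index name are
# hypothetical and exist only for this illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, col1 INTEGER, col2 INTEGER)")
conn.execute("CREATE INDEX idx_t_col1_col2 ON t (col1, col2)")
conn.executemany(
    "INSERT INTO t (col1, col2) VALUES (?, ?)",
    [(i % 10, i % 7) for i in range(1000)],
)


def plan(sql, params):
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]


# Same two predicates, written in opposite order.
q1 = "SELECT * FROM t WHERE col1 = ? AND col2 = ?"
q2 = "SELECT * FROM t WHERE col2 = ? AND col1 = ?"

plan_1 = plan(q1, (3, 4))
plan_2 = plan(q2, (4, 3))
rows_1 = conn.execute(q1, (3, 4)).fetchall()
rows_2 = conn.execute(q2, (4, 3)).fetchall()

print(plan_1)
print(plan_2)
```

Both plans should come out identical, because the planner matches predicates to index columns rather than reading them in the order they were written.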
I don't think it'll make much of a difference.
What does make a difference in all the sql languages is the order in which you use sql functions.
For example:
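When you do something like this:
select title, date FROM sometable WHERE to_char(date, 'DD-MM-YYYY') > '01-01-1960'
it will run slower than this:
select title, date FROM sometable WHERE date > to_date(%USERVALUE%, 'DD-MM-YYYY')
This is because of the number of times the function needs to be evaluated. In the first query it runs once for every row, and wrapping the column in a function also stops the database from using an ordinary index on date. In the second query it runs once, on the user-supplied value. (Comparing 'DD-MM-YYYY' strings also does not sort chronologically, so the first form is wrong as well as slow.)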
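The effect can be seen in a query plan. This is a minimal sketch using Python's sqlite3 module rather than the Oracle-style to_char above, so strftime stands in for the conversion function; the column is renamed published and the index name is made up for the example.

```python
import sqlite3

# SQLite stands in for the original database; "published" replaces the
# column called "date" in the text, and the index name is hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sometable (title TEXT, published TEXT)")
conn.execute("CREATE INDEX idx_sometable_published ON sometable (published)")
conn.executemany(
    "INSERT INTO sometable VALUES (?, ?)",
    [(f"book {year}", f"{year}-06-15") for year in range(1900, 2000)],
)


def plan(sql, params):
    """Join the detail column of EXPLAIN QUERY PLAN into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Function applied to the column: evaluated for every row.
wrapped = "SELECT title, published FROM sometable WHERE strftime('%Y', published) > ?"
# Column left bare: the index on published can be used.
bare = "SELECT title, published FROM sometable WHERE published > ?"

plan_wrapped = plan(wrapped, ("1960",))
plan_bare = plan(bare, ("1960-12-31",))
rows_wrapped = conn.execute(wrapped, ("1960",)).fetchall()
rows_bare = conn.execute(bare, ("1960-12-31",)).fetchall()

print(plan_wrapped)
print(plan_bare)
```

The wrapped form is reported as a full scan while the bare form is an index search, even though both return the same rows.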
Also, the ordering might make a difference if you are using nested queries. The smaller the result set of the inner query, the less scanning the outer query will need to do over it.
Of course, having said that, I second what rsp said in the comment above: indexes are the main factor in determining how long a query will take. If we have an index on the column, SQL Server can do a SEEK directly instead of SCANning the values, and hence the ordering is rendered irrelevant.
SQL is "declarative" so it makes no difference. You tell the DBMS what you want, it figures out the best way (subject to cost, time etc).
In .net, it would make a difference because it's procedural and executed in order.
Related
I have two tables. In one table we enter all types of Models, each model with around 100 rows. The second table has sales data about the first item. I need to produce a result like this:
Date        Model  Total(WE BOUGHT)  Sold
----------  -----  ----------------  ----
2011-01-21  M34R   300               200
2011-01-21  M71S   250               22
My query looks like this:
select distinct
CONVERT(varchar(10),x.Scantime,120) as ScanDate,
x.ModelNumber,
( Select count(*)
from micro_model z
where
z.ModelNumber=x.ModelNumber
and CONVERT(varchar(10),z.scantime,101)
= CONVERT(varchar(10),x.Scantime,101)
) as Total,
( select COUNT(*)
from
micro_Model m
inner join micro_model_sold y on m.IDNO=y.IDNO
where
CONVERT(varchar(10),m.scantime,101)
= CONVERT(varchar(10),x.Scantime,101)
and x.ModelNumber=m.ModelNumber
) as Sold
from maxis.dbo.maxis_IMEI_Model x
where
CONVERT(varchar(10),x.scantime,101) between '01/01/2011' and '01/25/2011'
I am able to achieve that from the above query but it is taking more than 2 minutes to execute. Please suggest how I can improve the performance. I have heard about pivot tables and indexed views but have never done them.
There are a great many things going on in your query that could be causing problems. There are also some areas of uncertainty that should probably be ironed out. For starters, try out this query:
SELECT
DateAdd(Day, DateDiff(Day, 0, X.ScanTime), 0) ScanDate,
X.ModelNumber,
Coalesce(Z.Total, 0) Total,
Coalesce(Z.Sold, 0) Sold
FROM
maxis.dbo.maxis_IMEI_Model X
LEFT JOIN (
SELECT
Z.ModelNumber,
DateAdd(Day, DateDiff(Day, 0, Z.ScanTime), 0) ScanDate,
Count(DISTINCT Z.IDNO) Total,
Count(Y.IDNO) Sold
FROM
micro_model Z
LEFT JOIN micro_model_sold Y
ON Z.IDNO = Y.IDNO
GROUP BY
DateDiff(Day, 0, Z.ScanTime),
Z.ModelNumber
) Z
ON X.ModelNumber = Z.ModelNumber
AND X.ScanTime >= Z.ScanDate
AND X.ScanTime < Z.ScanDate + 1
WHERE
X.ScanTime >= '20110101'
AND X.ScanTime < '20110126'
Converting to character in order to do whole date comparisons (by chopping off the characters that represent the time) is very inefficient. The best practice way is to do as I have shown in the WHERE clause. Notice that I incremented the final date by one day, then made that point exclusive using less-than instead of less-than-or-equal-to (which is what BETWEEN does). All the joins also needed to change. Finally, when it is necessary to remove the time portion of a date, the DateDiff method I show here is best (there is a slightly faster method that is much harder to understand so I can't recommend it, but if you're using SQL Server 2008 you can just do Convert(date, DateColumn) which is the fastest of all).
Using the date format '01/01/2011' is not region safe. If your query is ever used on a machine where the language is changed to one that has a default date format of DMY, your dates will be interpreted incorrectly, swapping the month and day and generating errors. Use the format yyyymmdd to be safe.
Using correlated subqueries (your SELECT statements inside parentheses to pull in column values from other tables) is awkward and in some cases yields very bad execution plans. Even though the optimizer can often convert these to proper joins, there is no guarantee. It also becomes very hard for other people looking at the query to understand what it is doing. It is better to express such things using outer joins as shown. I converted the correlated subqueries to derived tables.
Using DISTINCT is troubling. Your query shouldn't be returning multiple rows for each model. If so, something is logically wrong with the query and you're probably getting incorrect data.
I think I've combined the two correlated subqueries in my derived tables correctly. But I don't have example data and all the schema information so it is my best guess. In any case, my query should give you ideas.
I completely reformatted your query because it was nearly impossible to see what it was doing. I encourage you to do a little more formatting in your own code. This will aid you and anyone who comes after you to understand what is going on much more quickly. If you ask more SQL questions on this site you need to format your own code better. Please do so and also use the "code block" button or simply indent all the code lines by 4 spaces manually so it will get formatted by the web page as a code block.
You know, staring at my query a little more, it's clear that I don't understand the relationship between maxis_IMEI_Model and the other tables. Please explain a bit more what the tables mean and what result you want to see.
It's possible the problems in my query can be solved with a simple GROUP BY and throwing some SUMs on the number columns, but I am not 100% sure. It may be that the maxis_IMEI_Model table needs to go away completely, or to move into its own derived table where it is grouped separately before being joined.
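To make the correlated-subquery versus grouped-join idea concrete, here is a small sketch in Python's sqlite3 (standing in for SQL Server). The table contents are invented, the date dimension is left out to keep it short, and it assumes each IDNO appears at most once in micro_model_sold.

```python
import sqlite3

# Invented sample data; the real schema and row counts are unknown.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE micro_model (idno INTEGER PRIMARY KEY, modelnumber TEXT);
    CREATE TABLE micro_model_sold (idno INTEGER);
    INSERT INTO micro_model VALUES
        (1, 'M34R'), (2, 'M34R'), (3, 'M34R'), (4, 'M71S'), (5, 'M71S');
    INSERT INTO micro_model_sold VALUES (1), (2), (4);
""")

# Shape of the original query: one correlated subquery per output column,
# with DISTINCT collapsing the repeated rows.
correlated = conn.execute("""
    SELECT DISTINCT x.modelnumber,
           (SELECT COUNT(*) FROM micro_model z
             WHERE z.modelnumber = x.modelnumber) AS total,
           (SELECT COUNT(*) FROM micro_model m
              JOIN micro_model_sold y ON m.idno = y.idno
             WHERE m.modelnumber = x.modelnumber) AS sold
      FROM micro_model x
     ORDER BY x.modelnumber
""").fetchall()

# Same numbers from one outer join and one GROUP BY.
grouped = conn.execute("""
    SELECT z.modelnumber,
           COUNT(DISTINCT z.idno) AS total,
           COUNT(y.idno) AS sold
      FROM micro_model z
      LEFT JOIN micro_model_sold y ON z.idno = y.idno
     GROUP BY z.modelnumber
     ORDER BY z.modelnumber
""").fetchall()

print(correlated)
print(grouped)
```

COUNT(y.idno) only counts rows where the outer join found a match, which is what lets a single pass produce both the total and the sold figure.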
I'm not a SQL expert by any means, but you've got a lot of conversions in there. Why? Why do you need to convert these datetime columns (which is the type I'm assuming for scantime etc) into strings before comparing them?
I strongly suspect that the conversions are removing any benefit you're getting from what indexes you've got present. (You do have indexes for all the columns involved in the join, right?) In fact, both of your joins look to me like they should be joins on multiple columns without any where clauses... although I'd expect the query optimizer to treat them equivalently if possible.
Look at each and every conversion, and check whether you really need it. I suspect you don't actually need any of them - and the final "between" may even be doing the wrong thing at the moment, given that you're converting into a non-sortable format.
In general - not even just in SQL - it's always worth trying to deal with data in its natural form wherever possible. You're dealing with dates/times - so why treat them as strings for comparison? Conversions are a source of both performance problems and correctness problems.
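The correctness problem with comparing formatted strings is easy to reproduce. In this sketch (Python's sqlite3 standing in for SQL Server, with an invented scans table and ISO-8601 text in place of a real datetime column), a scan from 2010 slips through the MM/DD/YYYY BETWEEN, while a half-open range on the unconverted values behaves as intended.

```python
import sqlite3

# Invented rows; scantime is stored as sortable ISO-8601 text because
# SQLite has no dedicated datetime type.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scans (idno INTEGER PRIMARY KEY, scantime TEXT)")
conn.executemany(
    "INSERT INTO scans (scantime) VALUES (?)",
    [
        ("2010-01-15 09:30:00",),  # wrong year, should be excluded
        ("2011-01-10 14:00:00",),
        ("2011-01-25 23:59:00",),  # last day of the range, late evening
        ("2011-01-26 00:00:00",),  # first instant outside the range
    ],
)

# MM/DD/YYYY strings compare by month, then day, then year.
as_strings = conn.execute(
    "SELECT scantime FROM scans "
    "WHERE strftime('%m/%d/%Y', scantime) BETWEEN '01/01/2011' AND '01/25/2011' "
    "ORDER BY scantime"
).fetchall()

# Half-open range on the raw values: inclusive start, exclusive end.
half_open = conn.execute(
    "SELECT scantime FROM scans "
    "WHERE scantime >= '2011-01-01' AND scantime < '2011-01-26' "
    "ORDER BY scantime"
).fetchall()

print(as_strings)
print(half_open)
```

The string version returns the 2010 row because '01/15/2010' sorts between '01/01/2011' and '01/25/2011' character by character.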
I have a table sorted by id and I want to retrieve n rows at an offset of m rows. When I run the query without ORDER BY it produces unpredictable results, and with ORDER BY id it takes too much time. Since my table is already ordered, I just want to retrieve n rows, skipping the first m rows.
The documentation says:
The query planner takes LIMIT into account when generating a query plan, so you are very likely to get different plans (yielding different row orders) depending on what you use for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.
So what you are trying cannot really be done.
I suppose you could play with the planner options or rearrange the query in a clever way to trick the planner into using a plan that suits you, but without actually showing the query, it's hard to say.
SELECT * FROM mytable LIMIT 100 OFFSET 0
You really shouldn't rely on implicit ordering though, because you may not be able to predict the exact order of how data goes into the database.
As pointed out above, SQL does not guarantee anything about order unless you have an ORDER BY clause. LIMIT can still be useful in such a situation, but I can't think of any use for OFFSET. It sounds like you don't have an index on id, because if you do, the query should be extremely fast, clustered or not. Take another look at that. (Also check CLUSTER, which may improve your performance at the margin.)
To repeat: this is not specific to PostgreSQL. Its behavior here conforms to the SQL standard.
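As a rough illustration of why an index on id should make the ordered version cheap, here is a sketch using Python's sqlite3 in place of PostgreSQL (an assumption for the sake of a runnable example; mytable and its contents are invented). Rows are inserted out of order, yet the paged query is both deterministic and sort-free because the primary key already provides the order.

```python
import sqlite3

# SQLite stands in for PostgreSQL; the rows are inserted in shuffled order
# on purpose to show that insertion order is not what the query relies on.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?)",
    [(i, f"row {i}") for i in (7, 2, 9, 4, 1, 8, 3, 10, 6, 5)],
)

n, m = 3, 4  # fetch n rows after skipping the first m
sql = "SELECT id FROM mytable ORDER BY id LIMIT ? OFFSET ?"

page = conn.execute(sql, (n, m)).fetchall()
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, (n, m))]

print(page)
print(plan)
```

If the plan on your own database shows a separate sort step for ORDER BY id, that is a hint the index on id is missing or not being used.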
We are trying to optimize some of our queries.
One query is doing the following:
SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
INTO [#Gadget]
FROM task t
SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID) as Client
FROM [#Gadget]
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END , Date ASC
DROP TABLE [#Gadget]
(I have removed the complex subquery. I don't think it's relevant other than to explain why this query has been done as a two stage process.)
I thought it would be far more efficient to merge this down into a single query using subqueries as:
SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID)
FROM
(
SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
FROM task t
) as sub
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END , Date ASC
This would give the optimizer better information to work out what was going on and avoid any temporary tables. I assumed it should be faster.
But it turns out it is a lot slower. 8 seconds vs. under 5 seconds.
I can't work out why this would be the case, as all my knowledge of databases implies that subqueries would always be faster than using temporary tables.
What am I missing?
Edit --
From what I have been able to see from the query plans, both are largely identical, except for the temporary table which has an extra "Table Insert" operation with a cost of 18%.
Obviously as it has two queries the cost of the Sort Top N is a lot higher in the second query than the cost of the Sort in the Subquery method, so it is difficult to make a direct comparison of the costs.
Everything I can see from the plans would indicate that the subquery method would be faster.
"should be" is a hazardous thing to say of database performance. I have often found that temp tables speed things up, sometimes dramatically. The simple explanation is that it makes it easier for the optimiser to avoid repeating work.
Of course, I've also seen temp tables make things slower, sometimes much slower.
There is no substitute for profiling and studying query plans (read their estimates with a grain of salt, though).
Obviously, SQL Server is choosing the wrong query plan. Yes, that can happen, I've had exactly the same scenario as you a few times.
The problem is that optimizing a query (you mention a "complex subquery") is a non-trivial task: If you have n tables, there are roughly n! possible join orders -- and that's just the beginning. So, it's quite plausible that doing (a) first your inner query and (b) then your outer query is a good way to go, but SQL Server cannot deduce this information in reasonable time.
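To get a feel for how quickly n! grows, a few lines of Python are enough. This only counts orderings of the tables; it ignores the choice of join algorithm and index per step, which multiplies the search space further.

```python
import math


def join_orders(n_tables):
    """Number of ways to order n tables in a join: n!."""
    return math.factorial(n_tables)


for n in (3, 5, 10):
    print(n, join_orders(n))
```

Ten tables already give over 3.6 million orderings, which is why an optimizer cannot cost every candidate and has to rely on heuristics and time limits.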
What you can do is to help SQL Server. As Dan Tow writes in his great book "SQL Tuning", the key is usually the join order, going from the most selective to the least selective table. Using common sense (or the method described in his book, which is a lot better), you could determine which join order would be most appropriate and then use the FORCE ORDER query hint.
Anyway, every query is unique, there is no "magic button" to make SQL Server faster. If you really want to find out what is going on, you need to look at (or show us) the query plans of your queries. Other interesting data is shown by SET STATISTICS IO, which will tell you how much (costly) HDD access your query produces.
I have re-iterated this question here: How can I force a subquery to perform as well as a #temp table?
The nub of it is, yes, I get that sometimes the optimiser is right to meddle with your subqueries as if they weren't fully self contained but sometimes it makes a bad wrong turn when it tries to be clever in a way that we're all familiar with. I'm saying there must be a way of switching off that "cleverness" where necessary instead of wrecking a View-led approach with temp tables.
I've heard that using an IN Clause can hurt performance because it doesn't use Indexes properly. See example below:
SELECT ID, Name, Address
FROM people
WHERE id IN (SELECT ParsedValue FROM UDF_ParseListToTable(@IDList))
Is it better to use the form below to get these results?
SELECT ID,Name,Address
FROM People as p
INNER JOIN UDF_ParseListToTable(@IDList) as ids
ON p.ID = ids.ParsedValue
Does this depend on which version of SQL Server you are using? If so which ones are affected?
Yes, assuming relatively large data sets.
It's considered better to use EXISTS for large data sets. I follow this and have noticed improvements in my code execution time.
According to the article, it has to do with how the IN vs. EXISTS is internalized. Another article: http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
It's very simple to find out - open Management studio, put both versions of the query in, then run with the Show Execution plan turned on. Compare the two execution plans. Often, but not always, the query optimizer will make the same exact plan / literally do the same thing for different versions of a query that are logically equivalent.
In fact, that's its purpose - the goal is that the optimizer would take ANY version of a query, assuming the logic is the same, and make an optimal plan. Alas, the process isn't perfect.
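One caveat on "logically equivalent": IN and EXISTS return each outer row at most once, while an INNER JOIN returns it once per matching row. The sketch below shows this with Python's sqlite3 standing in for SQL Server; the id_list table is a hypothetical stand-in for the output of UDF_ParseListToTable, and the value 4 is deliberately listed twice.

```python
import sqlite3

# id_list plays the role of the parsed ID list; note the duplicate 4.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO people VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy'), (4, 'Di'), (5, 'Ed');
    CREATE TABLE id_list (parsed_value INTEGER);
    INSERT INTO id_list VALUES (2), (4), (4);
""")


def ids(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]


with_in = ids(
    "SELECT id FROM people "
    "WHERE id IN (SELECT parsed_value FROM id_list) ORDER BY id"
)
with_exists = ids(
    "SELECT p.id FROM people p "
    "WHERE EXISTS (SELECT 1 FROM id_list l WHERE l.parsed_value = p.id) "
    "ORDER BY p.id"
)
with_join = ids(
    "SELECT p.id FROM people p "
    "INNER JOIN id_list l ON p.id = l.parsed_value ORDER BY p.id"
)

print(with_in, with_exists, with_join)
```

So the join form is only interchangeable with IN when the parsed list is known to contain no duplicates, or when the duplicates are removed first.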
Here's one scientific comparison:
http://sqlinthewild.co.za/index.php/2010/01/12/in-vs-inner-join/
http://sqlinthewild.co.za/index.php/2009/08/17/exists-vs-in/
IN can hurt performance because SQL Server must generate a complete result set and then create potentially a huge IF statement, depending on the number of rows in the result set. BTW, calling a UDF can be a real performance hit as well. They are very nice to use but can really impact performance, if you are not careful. You can Google UDF and Performance to do some research on this.
More than the IN or the Table Variable, I would think that proper use of an Index would increase the performance of your query.
Also, from the table name, it does not seem like you are going to have a lot of entries in it, so which way you go may be a moot point in this particular example.
Secondly, IN will be evaluated only once since there is no subquery. In your case, the @IDList variable is probably going to cause mismatches; you will need @IDList1, @IDList2, @IDList3 and so on, because IN demands a list.
As a general rule of thumb, you should avoid IN with subqueries and use EXISTS with a join - you will get better performance more often than not.
Your first example is not the same as your second example, because WHERE X IN (@variable) is the same as WHERE X = @variable (i.e. you cannot have variable lists).
Regarding performance, you'll have to look at the execution plans to see what indexes are chosen.
I have a stored procedure which filters based on the result of the DATEADD function - My understanding is that this is similar to using user defined functions in that because SQL server cannot store statistics based on the output of that function it has trouble evaluating the cost of an execution plan.
The query looks a little like this:
SELECT /* Columns */ FROM
TableA JOIN TableB
ON TableA.id = TableB.join_id
WHERE DATEADD(hour, TableB.HoursDifferent, TableA.StartDate) <= @Now
(So it's not possible to pre-calculate the outcome of the DATEADD)
What I'm seeing is a terrible execution plan, which I believe is due to SQL Server incorrectly estimating the number of rows returned from part of the tree as 1 when in fact it's ~65,000. I have, however, seen the same stored procedure execute in a fraction of the time when different (not necessarily less) data is present in the database.
My question is - in cases like these how does the query optimiser estimate the outcome of the function?
UPDATE: FYI, I'm more interested in understanding why some of the time I get a good execution plan and why the rest of the time I don't - I already have a pretty good idea of how I'm going to fix this in the long term.
It's not the costing of the plan that's the problem here. The function on the columns prevents SQL Server from doing index seeks. You're going to get an index scan or a table scan.
What I'd suggest is to see if you can get one of the columns out of the function: basically, see if you can move the function to the other side of the inequality. It's not perfect, but it means that at least one column can be used for an index seek.
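Something like this (rough idea, not tested), with an index on TableB.HoursDifferent, then an index on the join column in TableA. Note that DATEDIFF counts the hour boundaries crossed rather than whole elapsed hours, so this is only approximately equivalent to the original predicate:
DATEDIFF(hour, TableA.StartDate, @Now) >= TableB.HoursDifferent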
On the costing side, I suspect that the optimiser will use the 30% of the table 'thumb-suck' because it can't use statistics to get an accurate estimate and because it's an inequality. Meaning it's going to guess that 30% of the table will be returned by that predicate.
It's really hard to say anything for sure without seeing the execution plans. You mention an estimate of 1 row and an actual of 65000. In some cases, that's not a problem at all.
http://sqlinthewild.co.za/index.php/2009/09/22/estimated-rows-actual-rows-and-execution-count/
It would help to see the function, but one thing I have seen is burying functions like that in queries can result in poor performance. If you can evaluate some of it beforehand you might be in better shape. For example, instead of
WHERE MyDate < GETDATE()
Try
DECLARE @Today DATETIME
SET @Today = GETDATE()
...
WHERE MyDate < @Today
This seems to perform better.
@Kragen,
Short answer: If you are doing queries with ten tables, get used to it. You need to learn all about query hints, and a lot more tricks besides.
Long answer:
SQL server generally generates excellent query plans for up to about three to five tables only. Once you go beyond that in my experience you are basically going to have to write the query plan yourself, using all the index and join hints. (In addition, Scalar functions seem to get estimated at Cost=Zero, which is just mad.)
The reason is it is just too damn complicated after that. The query optimiser has to decide what to do algorithmically, and there are too many possible combinations for even the brightest geniuses on the SQL Server team to create an algorithm which works truly universally.
They say the optimiser is smarter than you. That may be true. But you have one advantage. That advantage is if it doesn't work, you can throw it out and try again! By about the sixth attempt you should have something acceptable, even for a ten-table join, if you know the data. The query optimiser cannot do that, it has to come up with some sort of plan instantly, and it gets no second chances.
My favourite trick is to force the order of the where clause by converting it to a case statement. Instead of:
WHERE
predicate1
AND predicate2
AND....
Use this:
WHERE
case
when not predicate1 then 0
when not predicate2 then 0
when not .... then 0
else 1 end = 1
Order your predicates cheapest to most expensive, and you get an outcome which is logically the same but which SQL server doesn't get to mess around with - it has to do them in the order you say.
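One thing to watch with this rewrite is NULL handling. Under SQL's three-valued logic, a predicate that evaluates to UNKNOWN makes the whole AND chain reject the row, but inside the CASE the branch "when not predicate" is simply skipped, so the row can fall through to "else 1". The sketch below demonstrates this with Python's sqlite3 standing in for SQL Server; the table t and its columns are invented for the example.

```python
import sqlite3

# Three rows: one that passes, one that fails, one with a NULL in column a.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (a INTEGER, b INTEGER);
    INSERT INTO t VALUES (1, 1), (0, 1), (NULL, 1);
""")

# Plain AND chain: the NULL row is rejected because a = 1 is UNKNOWN.
with_and = conn.execute("SELECT a, b FROM t WHERE a = 1 AND b = 1").fetchall()

# CASE rewrite: NOT (a = 1) is UNKNOWN for the NULL row, so that branch is
# skipped and the row reaches ELSE 1.
with_case = conn.execute("""
    SELECT a, b FROM t
     WHERE CASE
             WHEN NOT (a = 1) THEN 0
             WHEN NOT (b = 1) THEN 0
             ELSE 1
           END = 1
""").fetchall()

print(with_and)
print(with_case)
```

If any of the columns involved are nullable, the two forms are only equivalent once the NULL case is handled explicitly, for example by writing each branch as "when predicate then" and returning 0 from the ELSE.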