I have two tables. In one table we enter all types of models, each model with around 100 rows. The second table has sales data about the items in the first table. I need to produce a result like this:
Date       Model Total(WE BOUGHT) Sold
---------- ----- ---------------- ----
2011-01-21 M34R  300              200
2011-01-21 M71S  250              22
My query looks like this:
select distinct
    CONVERT(varchar(10), x.Scantime, 120) as ScanDate,
    x.ModelNumber,
    ( select count(*)
      from micro_model z
      where z.ModelNumber = x.ModelNumber
        and CONVERT(varchar(10), z.scantime, 101)
          = CONVERT(varchar(10), x.Scantime, 101)
    ) as Total,
    ( select COUNT(*)
      from micro_Model m
      inner join micro_model_sold y on m.IDNO = y.IDNO
      where CONVERT(varchar(10), m.scantime, 101)
          = CONVERT(varchar(10), x.Scantime, 101)
        and x.ModelNumber = m.ModelNumber
    ) as Sold
from maxis.dbo.maxis_IMEI_Model x
where CONVERT(varchar(10), x.scantime, 101) between '01/01/2011' and '01/25/2011'
I am able to achieve that with the above query, but it takes more than 2 minutes to execute. Please suggest how I can improve the performance. I have heard about pivot tables and indexed views but have never used them.
There are a great many things going on in your query that could be causing problems. There are also some areas of uncertainty that should probably be ironed out. For starters, try out this query:
SELECT
    DateAdd(Day, DateDiff(Day, 0, X.ScanTime), 0) ScanDate,
    X.ModelNumber,
    Coalesce(Z.Total, 0) Total,
    Coalesce(Z.Sold, 0) Sold
FROM
    maxis.dbo.maxis_IMEI_Model X
    LEFT JOIN (
        SELECT
            Z.ModelNumber,
            DateAdd(Day, DateDiff(Day, 0, Z.ScanTime), 0) ScanDate,
            Count(DISTINCT Z.IDNO) Total,
            Count(Y.IDNO) Sold
        FROM
            micro_model Z
            LEFT JOIN micro_model_sold Y
                ON Z.IDNO = Y.IDNO
        GROUP BY
            DateDiff(Day, 0, Z.ScanTime),
            Z.ModelNumber
    ) Z
        ON X.ModelNumber = Z.ModelNumber
        AND X.ScanTime >= Z.ScanDate
        AND X.ScanTime < Z.ScanDate + 1
WHERE
    X.ScanTime >= '20110101'
    AND X.ScanTime < '20110126'
Converting to character in order to do whole-date comparisons (by chopping off the characters that represent the time) is very inefficient. The best-practice way is what I have shown in the WHERE clause. Notice that I incremented the final date by one day, then made that endpoint exclusive by using less-than instead of less-than-or-equal-to (which is what BETWEEN does). All the joins also needed to change. Finally, when it is necessary to remove the time portion of a date, the DateDiff method I show here is best (there is a slightly faster method that is much harder to understand, so I can't recommend it, but if you're using SQL Server 2008 you can just do Convert(date, DateColumn), which is the fastest of all).
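For reference, the approaches mentioned here look like this side by side (a sketch only, using the ScanTime column of your micro_model table):
SELECT
    DateAdd(Day, DateDiff(Day, 0, ScanTime), 0) AS ViaDateDiff, -- works on any version
    CONVERT(date, ScanTime)                     AS ViaDateType, -- SQL Server 2008+, fastest
    CONVERT(varchar(10), ScanTime, 120)         AS ViaVarchar   -- slowest, avoid
FROM micro_model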
Using the date format '01/01/2011' is not region safe. If your query is ever used on a machine where the language is changed to one that has a default date format of DMY, your dates will be interpreted incorrectly, swapping the month and day and generating errors. Use the format yyyymmdd to be safe.
Using correlated subqueries (your SELECT statements inside parentheses to pull in column values from other tables) is awkward and in some cases yields very bad execution plans. Even though the optimizer can often convert these to proper joins, there is no guarantee. It also becomes very hard for other people looking at the query to understand what it is doing. It is better to express such things using outer joins as shown. I converted the correlated subqueries to derived tables.
Using DISTINCT is troubling. Your query shouldn't be returning multiple rows for each model; if it is, something is logically wrong with the query and you're probably getting incorrect data.
I think I've combined the two correlated subqueries in my derived tables correctly. But I don't have example data and all the schema information so it is my best guess. In any case, my query should give you ideas.
I completely reformatted your query because it was nearly impossible to see what it was doing. I encourage you to do a little more formatting in your own code; it will help you, and anyone who comes after you, understand what is going on much more quickly. If you ask more SQL questions on this site, please format your code better - use the "code block" button or simply indent all the code lines by 4 spaces so the page renders them as a code block.
You know, staring at my query a little more, it's clear that I don't understand the relationship between maxis_IMEI_Model and the other tables. Please explain a bit more what the tables mean and what result you want to see.
It's possible the problems in my query can be solved with a simple GROUP BY and throwing some SUMs on the number columns, but I am not 100% sure. It may be that the maxis_IMEI_Model table needs to go away completely, or to move into its own derived table where it is grouped separately before being joined.
I'm not a SQL expert by any means, but you've got a lot of conversions in there. Why? Why do you need to convert these datetime columns (which is the type I'm assuming for scantime etc) into strings before comparing them?
I strongly suspect that the conversions are removing any benefit you're getting from what indexes you've got present. (You do have indexes for all the columns involved in the join, right?) In fact, both of your joins look to me like they should be joins on multiple columns without any where clauses... although I'd expect the query optimizer to treat them equivalently if possible.
Look at each and every conversion, and check whether you really need it. I suspect you don't actually need any of them - and the final "between" may even be doing the wrong thing at the moment, given that you're converting into a non-sortable format.
In general - not even just in SQL - it's always worth trying to deal with data in its natural form wherever possible. You're dealing with dates/times - so why treat them as strings for comparison? Conversions are a source of both performance problems and correctness problems.
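To illustrate the point about the non-sortable format: style 101 turns the column into an mm/dd/yyyy string, and strings compare character by character, not chronologically. A quick sketch of the pitfall:
-- '01/15/2025' is January 2025, yet as a string it sorts between
-- '01/01/2011' and '01/25/2011', so the original filter would include it.
SELECT CASE WHEN '01/15/2025' BETWEEN '01/01/2011' AND '01/25/2011'
            THEN 'included' ELSE 'excluded' END;    -- returns 'included'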
Related
Table of People (name, dob, ssn etc)
Table of NewRecords (name, dob, ssn)
I want to write a query that determines which NewRecords do not in any way match any of the People (it's an update query that sets a flag in the NewRecords table).
Specifically, I want to find the NewRecords for which the Levenshtein distance between the first name, last name and ssn is greater than 2 for all records in People. (i.e. the person has a first, last and ssn that are all different from those in People and so is likely not a match).
I added a user-defined Levenshtein function (Levenshtein distance in T-SQL) and have already added an optimization that adds an extra parameter for the maximal allowed distance. (If the calculated Levenshtein distance climbs above the maximum allowed, the function exits early.) But the query is still taking an unacceptably long time because the tables are large.
What can I do to speed things up? How do I start thinking about optimization and performance? At what point do I need to start digging into the internals of SQL Server?
update NewRecords
set notmatchflag = 1
from newRecords elr
inner join People pro
    on  dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2
    and dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2
    and elr.ssn is null
    and elr.dob <> pro.dob
As I don't know exactly the table structure and the types of data, I'm not 100% sure that this would work, but give it a go anyway!
I would first check the SQL execution plan when you test it; usually there will be some sections that take up most of the time. From there you should be able to gauge where, or if, any indexes would help.
My gut feeling, though, is that your function is being called a lot, by the looks of things, but hopefully the execution plan will determine whether that is the case. If it is, then a CLR stored procedure might be the way to go.
There seems to be nothing wrong with your query (except maybe for the fact that you want to find all possible combinations of differing values, which in most scenarios will give a lot of results :)).
Anyway, the problem is your Levenshtein functions - I assume they're written in T-SQL. Even if you optimized them, they're still weak. You really should compile them to CLR (the link that you posted already contains an example) - this will be an order of magnitude faster.
Another idea I'd try with what you've got is to somehow decrease the number of Levenshtein comparisons. Maybe find other conditions, or reverse the query: find all MATCHING records, and then select what's left (it may enable you to introduce those additional conditions).
But Levenshtein compiled to CLR is your best option.
For one thing, skip the rows that are already flagged, so that if you run it again it will not process those.
The distance calculation is expensive - try to eliminate the pairs that don't have a chance first.
If the lengths differ by more than 2 then I don't think you can have a distance <= 2.
update NewRecords
set notmatchflag = 1
from newRecords elr
inner join People pro
    on  elr.notmatchflag = 0
    and elr.ssn is null
    and elr.dob <> pro.dob
    and dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2
    and dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2
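Building on that, here is a hedged sketch of the length check mentioned above. ABS(LEN(a) - LEN(b)) is a cheap lower bound on the Levenshtein distance (the distance can never be smaller than the difference in length), so the OR gives SQL Server a chance to decide the predicate without calling the expensive UDF; it is logically equivalent to the original conditions.
update elr
set notmatchflag = 1
from newRecords elr
inner join People pro
    on  elr.notmatchflag = 0
    and elr.ssn is null
    and elr.dob <> pro.dob
    -- cheap length test first; only fall back to the UDF when the lengths are close
    and (ABS(LEN(elr.last) - LEN(pro.last)) > 2
         or dbo.Levenshtein_withThreshold(elr.last, pro.last, 2) > 2)
    and (ABS(LEN(elr.first) - LEN(pro.first)) > 2
         or dbo.Levenshtein_withThreshold(elr.first, pro.first, 2) > 2)
Note that SQL Server does not guarantee short-circuit evaluation of OR, so this is a hint to the optimizer rather than a promise, but in practice it usually avoids most of the UDF calls.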
I am having an issue with a query where the query plan says that 15% of the execution cost is for one table. However, this table is very small (only 9 rows).
Clearly there is a problem if the smallest table involved in the query has the highest cost.
My guess is that the query keeps on looping over the same table again and again, rather than caching the results.
What can I do about this?
Sorry, I can't paste the exact code (which is quite complex), but here is something similar:
SELECT Foo.Id
FROM Foo
-- Various other joins have been removed for the example
LEFT OUTER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable as st_2 ON st_2.Id = Foo.SmallTableId2
WHERE (
-- various where clauses removed for the example
)
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
Take these execution-plan statistics with a wee grain of salt. If this table is "disproportionately small," relative to all the others, then those cost-statistics probably don't actually mean a hill o' beans.
I mean... think about it ... :-) ... if it's a tiny table, what actually is it? Probably, "it's one lousy 4K storage-page in a file somewhere." We read it in once, and we've got it, period. End of story. Nothing (actually...) there to index; no (actual...) need to index it; and, at the end of the day, the DBMS will understand this just as well as we do. Don't worry about it.
Now, having said that ... one more thing: make sure that the "cost" which seems to be attributed to "the tiny table" is not actually being incurred by very-expensive access to the tables to which it is joined. If those tables don't have decent indexes, or if the query as-written isn't able to make effective use of them, then there's your actual problem; that's what the query optimizer is actually trying to tell you. ("It's just a computer ... backwards things says it sometimes.")
Without the query plan it's difficult to solve your problem here, but there is one glaring clue in your example:
AND (st_1.Id is null OR st_1.Code = 7)
AND (st_2.Id is null OR st_2.Code = 4)
This is going to be incredibly difficult for SQL Server to optimize because it's nearly impossible to accurately estimate the cardinality. Hover over the elements of your query plan and look at EstimatedRows vs. ActualRows and EstimatedExecutions vs. ActualExecutions. My guess is these are way off.
Not sure what the whole query looks like, but you might want to see if you can rewrite it as two queries with a UNION operator rather than using the OR logic.
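A rough sketch of that idea, using the simplified query from the question and splitting only the st_1 condition to keep it short (the real query would need the same treatment for st_2):
SELECT Foo.Id
FROM Foo
LEFT OUTER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable as st_2 ON st_2.Id = Foo.SmallTableId2
WHERE st_1.Id IS NULL                          -- branch 1: no matching st_1 row
  AND (st_2.Id is null OR st_2.Code = 4)
UNION                                          -- UNION ALL also works if the branches
                                               -- are known to be mutually exclusive
SELECT Foo.Id
FROM Foo
INNER JOIN SmallTable as st_1 ON st_1.Id = Foo.SmallTableId1
LEFT OUTER JOIN SmallTable as st_2 ON st_2.Id = Foo.SmallTableId2
WHERE st_1.Code = 7                            -- branch 2: st_1 exists and has Code 7
  AND (st_2.Id is null OR st_2.Code = 4)
Each branch now has a simple predicate on st_1 that the optimizer can estimate far more accurately than the original OR.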
Well, with the limited information available, all I can suggest is that you ensure all columns being used for comparisons are properly indexed.
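For instance, a minimal sketch of what that could look like for the simplified query above (the index names are made up, and whether a 9-row SmallTable even needs an index is debatable):
CREATE NONCLUSTERED INDEX IX_Foo_SmallTableId1 ON Foo (SmallTableId1);
CREATE NONCLUSTERED INDEX IX_Foo_SmallTableId2 ON Foo (SmallTableId2);
CREATE NONCLUSTERED INDEX IX_SmallTable_Id_Code ON SmallTable (Id) INCLUDE (Code);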
In addition, you haven't stated if you have an actual performance problem. Even if those table accesses took up 90% of the query time, it's most likely not a problem if the query only takes (for example) a tenth of a second.
We are trying to optimize some of our queries.
One query is doing the following:
SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
INTO [#Gadget]
FROM task t

SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID) as Client
FROM [#Gadget]
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END, Date ASC

DROP TABLE [#Gadget]
(I have removed the complex subquery. I don't think it's relevant other than to explain why this query has been done as a two stage process.)
I thought it would be far more efficient to merge this down into a single query using subqueries as:
SELECT TOP 500 TaskID, Task, Tracker, ClientID, dbo.GetClientDisplayName(ClientID)
FROM
(
    SELECT t.TaskID, t.Name as Task, '' as Tracker, t.ClientID, (<complex subquery>) Date
    FROM task t
) as sub
order by CASE WHEN Date IS NULL THEN 1 ELSE 0 END, Date ASC
This would give the optimizer better information to work out what was going on and avoid any temporary tables. I assumed it should be faster.
But it turns out it is a lot slower. 8 seconds vs. under 5 seconds.
I can't work out why this would be the case, as all my knowledge of databases implies that subqueries would always be faster than using temporary tables.
What am I missing?
Edit --
From what I have been able to see from the query plans, both are largely identical, except for the temporary table which has an extra "Table Insert" operation with a cost of 18%.
Obviously, as the temp-table version is two queries, the cost of the Sort Top N in its second query is a lot higher than the cost of the Sort in the subquery method, so it is difficult to make a direct comparison of the costs.
Everything I can see from the plans would indicate that the subquery method would be faster.
"should be" is a hazardous thing to say of database performance. I have often found that temp tables speed things up, sometimes dramatically. The simple explanation is that it makes it easier for the optimiser to avoid repeating work.
Of course, I've also seen temp tables make things slower, sometimes much slower.
There is no substitute for profiling and studying query plans (read their estimates with a grain of salt, though).
Obviously, SQL Server is choosing the wrong query plan. Yes, that can happen, I've had exactly the same scenario as you a few times.
The problem is that optimizing a query (you mention a "complex subquery") is a non-trivial task: If you have n tables, there are roughly n! possible join orders -- and that's just the beginning. So, it's quite plausible that doing (a) first your inner query and (b) then your outer query is a good way to go, but SQL Server cannot deduce this information in reasonable time.
What you can do is to help SQL Server. As Dan Tow writes in his great book "SQL Tuning", the key is usually the join order, going from the most selective to the least selective table. Using common sense (or the method described in his book, which is a lot better), you could determine which join order would be most appropriate and then use the FORCE ORDER query hint.
Anyway, every query is unique, there is no "magic button" to make SQL Server faster. If you really want to find out what is going on, you need to look at (or show us) the query plans of your queries. Other interesting data is shown by SET STATISTICS IO, which will tell you how much (costly) HDD access your query produces.
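To make those two suggestions concrete, here is a minimal, hypothetical sketch (FilteredClients is a placeholder for whatever your most selective table actually is):
SET STATISTICS IO ON;               -- reports logical reads per table

SELECT TOP 500 t.TaskID, t.Name
FROM FilteredClients fc             -- most selective table listed first
INNER JOIN task t ON t.ClientID = fc.ClientID
ORDER BY t.Name
OPTION (FORCE ORDER);               -- keep the join order as written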
I have re-iterated this question here: How can I force a subquery to perform as well as a #temp table?
The nub of it is: yes, I get that sometimes the optimiser is right to meddle with your subqueries as if they weren't fully self-contained, but sometimes it takes a wrong turn when it tries to be clever, in a way that we're all familiar with. I'm saying there must be a way of switching off that "cleverness" where necessary, instead of wrecking a view-led approach with temp tables.
I have a stored procedure which filters based on the result of the DATEADD function. My understanding is that this is similar to using user-defined functions: because SQL Server cannot store statistics based on the output of that function, it has trouble evaluating the cost of an execution plan.
The query looks a little like this:
SELECT /* Columns */
FROM TableA
JOIN TableB
    ON TableA.id = TableB.join_id
WHERE DATEADD(hour, TableB.HoursDifferent, TableA.StartDate) <= @Now
(So it's not possible to pre-calculate the outcome of the DATEADD.)
What I'm seeing is a terrible, terrible execution plan, which I believe is due to SQL Server incorrectly estimating the number of rows returned from one part of the tree as 1, when in fact it's ~65,000. I have, however, seen the same stored procedure execute in a fraction of the time when different (not necessarily less) data is present in the database.
My question is - in cases like these how does the query optimiser estimate the outcome of the function?
UPDATE: FYI, I'm more interested in understanding why some of the time I get a good execution plan and why the rest of the time I don't - I already have a pretty good idea of how I'm going to fix this in the long term.
It's not the costing of the plan that's the problem here. The function on the columns prevents SQL Server from doing index seeks. You're going to get an index scan or a table scan.
What I'd suggest is to see if you can get one of the columns out of the function - basically, see if you can move the function to the other side of the comparison. It's not perfect, but it means that at least one column can be used for an index seek.
Something like this (rough idea, not tested), with an index on TableB.HoursDifferent and then an index on the join column in TableA:
DATEDIFF(hour, TableA.StartDate, @Now) >= TableB.HoursDifferent
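A sketch of how that could look in the full query from the question, together with the indexes mentioned above (index names are made up, and this follows the rough idea rather than a tested fix):
CREATE INDEX IX_TableB_HoursDifferent ON TableB (HoursDifferent);
CREATE INDEX IX_TableA_id ON TableA (id);

SELECT /* Columns */
FROM TableA
JOIN TableB
    ON TableA.id = TableB.join_id
WHERE DATEDIFF(hour, TableA.StartDate, @Now) >= TableB.HoursDifferent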
On the costing side, I suspect that the optimiser will use the 30% of the table 'thumb-suck' because it can't use statistics to get an accurate estimate and because it's an inequality. Meaning it's going to guess that 30% of the table will be returned by that predicate.
It's really hard to say anything for sure without seeing the execution plans. You mention an estimate of 1 row and an actual of 65000. In some cases, that's not a problem at all.
http://sqlinthewild.co.za/index.php/2009/09/22/estimated-rows-actual-rows-and-execution-count/
It would help to see the function, but one thing I have seen is that burying functions like that in queries can result in poor performance. If you can evaluate some of it beforehand, you might be in better shape. For example, instead of
WHERE MyDate < GETDATE()
Try
DECLARE @Today DATETIME
SET @Today = GETDATE()
...
WHERE MyDate < @Today
This seems to perform better.
@Kragen,
Short answer: If you are doing queries with ten tables, get used to it. You need to learn all about query hints, and a lot more tricks besides.
Long answer:
SQL Server generally generates excellent query plans for up to about three to five tables only. Once you go beyond that, in my experience, you are basically going to have to write the query plan yourself, using all the index and join hints. (In addition, scalar functions seem to get estimated at Cost = Zero, which is just mad.)
The reason is it is just too damn complicated after that. The query optimiser has to decide what to do algorithmically, and there are too many possible combinations for even the brightest geniuses on the SQL Server team to create an algorithm which works truly universally.
They say the optimiser is smarter than you. That may be true. But you have one advantage. That advantage is if it doesn't work, you can throw it out and try again! By about the sixth attempt you should have something acceptable, even for a ten-table join, if you know the data. The query optimiser cannot do that, it has to come up with some sort of plan instantly, and it gets no second chances.
My favourite trick is to force the order of the where clause by converting it to a case statement. Instead of:
WHERE
    predicate1
    AND predicate2
    AND ....
Use this:
WHERE
    case
        when not predicate1 then 0
        when not predicate2 then 0
        when not .... then 0
        else 1
    end = 1
Order your predicates cheapest to most expensive, and you get an outcome which is logically the same but which SQL server doesn't get to mess around with - it has to do them in the order you say.
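For instance, a hypothetical illustration (IsArchived and dbo.ExpensiveCheck are made-up names, standing in for a cheap column test and a costly scalar UDF):
WHERE
    case
        when t.IsArchived = 1 then 0                -- cheap column test runs first
        when dbo.ExpensiveCheck(t.Notes) = 0 then 0 -- expensive UDF only for the survivors
        else 1
    end = 1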
Can it make any difference to query optimisation to have WHERE clauses in a different order for SQL Server?
For example, would the query plan for this:
select * from table where col1 = @var1 and col2 = @var2
be different from this?:
select * from table where col2 = @var2 and col1 = @var1
Of course this is a contrived example, and I have tried more complex ones out. The query plan was the same for both, but I've always wondered whether it was worth ordering WHERE clauses such that the most specific clauses come first, in case the optimiser somehow "prunes" results and could end up being faster.
This is really just a thought experiment, I'm not looking to solve a specific performance problem.
What about other RDBMS's, too?
Every modern RDBMS has a query optimizer which is responsible, among other things, for reordering the conditions. Some optimizers use pretty sophisticated statistics to do this, and they often beat human intuition about what will be a good ordering and what will not. So my guess would be: if you can definitely say "this ordering is better than the other one", so can the optimizer, and it will figure it out by itself.
Conclusion: Don't worry about such things. It is rarely worth the time.
Readability for humans should be your only goal when you determine the order of the conditions in the where clause. For example, if you have a join on two tables, A and B, write all conditions for A, then all conditions for B.
No, the query optimizer will figure out which index or statistics to use anyway. I'm not entirely sure, but I even think that boolean expressions in SQL are not evaluated from left to right but can be evaluated in any order by the query optimizer.
I don't think it'll make much of a difference.
What does make a difference, in all SQL dialects, is the order in which you use SQL functions.
For example:
Something like this:
select title, date FROM sometable WHERE to_char(date, 'DD-MM-YYYY') > '01-01-1960'
will go slower than this:
select title, date FROM sometable WHERE date > to_date(%USERVALUE%, 'DD-MM-YYYY')
This is because of the number of times the function needs to be evaluated.
Also, the ordering might make a difference if you are using nested queries: the smaller the result set of the inner query, the less scanning the outer query will need to do over it.
Of course, having said that, I second what rsp said in the comment above - indexes are the main factor in determining how long a query will take. If we have an index on the column, SQL Server will do a SEEK instead of SCANning the values, and hence the ordering is rendered irrelevant.
SQL is "declarative" so it makes no difference. You tell the DBMS what you want, it figures out the best way (subject to cost, time etc).
In .NET it would make a difference, because it's procedural and executed in order.