Unexpected estimated rows in query execution plan (SQL Server 2000) - sql-server

If I run this query:
select user from largetable where largetable.user = 1155
(note I'm querying user just to reduce this to its simplest case)
And look at the execution plan, an index seek is planned [largetable has an index on user], and the estimated row count is the correct 29.
But if I do
select user from largetable where largetable.user = (select user from users where externalid = 100)
[with the result of the subquery being the single value 1155, just like above when I hard-code it]
The query optimizer estimates 117,000 rows in the result. There are about 6,000,000 rows in largetable, 1700 rows in users. When I run the query of course I get back the correct 29 rows despite the huge estimated rows.
I have updated stats with fullscan on both tables on the relevant indexes, and when I look at the stats, they appear to be correct.
Of note, for any given user, there are no more than 3,000 rows in largetable.
So, why would the estimated execution plan show such a large number of estimated rows? Shouldn't the optimizer know, based on the stats, that it's looking for a result with 29 corresponding rows, or a MAXIMUM of 3,000 rows even if it doesn't know which user the subquery will return? Why this huge estimate? The problem is that this large estimate then influences another join in a larger query to do a scan instead of a seek. If I run the larger query with the subquery, it takes 1 min 40 secs; if I run it with 1155 hard-coded, it takes 2 seconds. This is very unusual to me...
Thanks,
Chris

The optimizer does the best it can, but statistics and row count estimations only go so far (as you're seeing).
I'm assuming that your more complex query can't easily be rewritten as a join without a subquery. If it can be, you should attempt that first.
Failing that, it's time to use your additional knowledge about the nature of your data to help out the optimizer with hints. Specifically, look at the FORCESEEK option among the table hints. Note that hints can hurt you if your data changes later, so be aware.
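For illustration, a rough sketch of the hint syntax (the index name IX_largetable_user is an assumption, not from the question; note also that FORCESEEK requires SQL Server 2008 or later, so on SQL Server 2000 an INDEX hint is the closest equivalent):
-- Hedged sketch: substitute the actual index on largetable.user.
-- On SQL Server 2008+ you could write WITH (FORCESEEK) instead.
SELECT lt.[user]
FROM largetable AS lt WITH (INDEX(IX_largetable_user))
WHERE lt.[user] = (SELECT u.[user] FROM users AS u WHERE u.externalid = 100)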

Did you try this?
SELECT lt.user
FROM Users u
INNER JOIN largeTable lt
ON u.User = lt.User
WHERE u.externalId = 100
Please see this: subqueries-vs-joins

Related

SQL Server : How to detect appropriate timing to update table/index statistics

May I ask if there is any way to know the appropriate timing for updating table/index statistics?
Recently, performance has been getting worse on one of the major data mart tables in our BI-DWH (SQL Server 2012).
All indexes are maintained every weekend (reorganized/rebuilt according to their fragmentation percentage), and they are now all under 5% avg_fragmentation_in_percent.
So we suspect the slowdown is caused by obsolete table/index statistics, table fragmentation, or the like.
We generally keep auto-update statistics on, and the table/index stats were last updated in July 2018; maybe the optimizer still doesn't consider it time to update them, since the table is huge: the total record count is around 0.7 billion, with a daily increase of about 0.5 million records.
Here are the PK statistics and the actual record count of that table.
-- statistics
dbcc show_statistics("DM1","PK_DM1")
Name    Updated           Rows       Rows Sampled  Steps  Density  Average key length  String Index  Filter Expression  Unfiltered Rows
------  ----------------  ---------  ------------  -----  -------  ------------------  ------------  -----------------  ---------------
PK_DM1  07 6 2018 2:54PM  661696443  1137887       101    0        28                  NO            NULL               661696443
-- actual row count
select count(*) row_cnt from DM1;
row_cnt
-------------
706723646
-- Current index fragmentation
SELECT a.index_id, name, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats (DB_ID(N'DM1'),
OBJECT_ID(N'dbo.DM1'), NULL, NULL, NULL) AS a
JOIN sys.indexes AS b
ON a.object_id = b.object_id AND a.index_id = b.index_id;
GO
index_id  name        avg_fragmentation_in_percent
--------  ----------  ----------------------------
1         PK_DM1      1.32592173128252
7         IDX_DM1_01  1.06209021193359
9         IDX_DM1_02  0.450888386865285
10        IDX_DM1_03  4.78448190118396
So fragmentation is under 10%, but there is a difference of over 45 million rows between the statistics row count and the actual record count.
I'm wondering whether it would be worth updating the table/index stats manually in this case.
If there is any other information you use to decide the appropriate timing for updating stats, any advice would be much appreciated.
Thank you.
-- Result
Thanks to @scsimon's advice, I checked all index statistics in detail, and the main index (based on the registration date) had no RANGE_HI_KEY values after the last statistics update in July 2018.
(The slowness was reported by a user searching for September 2018 records.)
So I decided to update the table/index statistics and confirmed that the same query improved from 1 hour 45 mins to 3.5 mins.
Deeply appreciate all of the advice on my question.
Best Regards.
Well, you have auto-update statistics on, so that's good. Also, each time an index is rebuilt, its statistics are recomputed. From SQL Server 2008 R2 onward you can enable trace flag 2371 (its behavior is the default from 2016), which means a large table needs proportionally fewer changed rows to trigger an automatic statistics update. Read more here on that.
Also, you are showing stats for a single index, not the whole table, and that index could be filtered. Remember, too, that Rows Sampled is the total number of rows sampled for the statistics calculations: if Rows Sampled < Rows, the displayed histogram and density results are estimates based on the sampled rows. You can read more on that here.
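As a rough sketch of how to check staleness yourself (assuming SQL Server 2012 SP1 or later, where sys.dm_db_stats_properties is available):
-- How many rows have changed since each statistics object on DM1 was last updated.
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID(N'dbo.DM1');
-- If modification_counter is large relative to rows, a manual update such as
-- UPDATE STATISTICS dbo.DM1 WITH FULLSCAN; is usually worthwhile.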
Back to the core problem of performance... you are focusing on statistics and indexes, which isn't a terrible idea, but they aren't necessarily the root cause. You need to identify which query is running slow. Then get help with that slow query by following the steps in that blog and others. The big one here is to ask a question about that query along with its execution plan. The problem could be the indexes, or it could be:
Memory contention / misallocation
CPU bottleneck
Parallelism (maybe you have MAXDOP set to 0)
Slow disks
Little memory, causing physical reads
The execution plan isn't optimal anymore and perhaps you need to recompile that query
etc, etc etc... this is where the execution plan and wait stats will shed light
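As a starting point for the wait stats side, a minimal sketch (sys.dm_os_wait_stats is cumulative since the last restart, so treat it only as a rough indicator):
-- Top waits by total wait time; for example, high PAGEIOLATCH_* values suggest
-- slow disks, and RESOURCE_SEMAPHORE suggests memory pressure.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms,
       signal_wait_time_ms
FROM sys.dm_os_wait_stats
ORDER BY wait_time_ms DESC;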

SQL Server : wrong index is used when filter value exceeds the index histogram range

We have a very large table, where every day 1-2 million rows are being added to the table.
In this query:
SELECT jobid, exitstatus
FROM jobsData
WHERE finishtime >= {ts '2012-10-04 03:19:26'} AND task = 't1_345345_454'
GROUP BY jobid, exitstatus
Indexes exist for both Task and FinishTime.
We expected the task index to be used, since it matches far fewer rows. The problem we see is that SQL Server creates a bad query execution plan that uses the FinishTime index instead of the task index, and the query takes a very long time.
This happens when the finish time value is outside the FinishTime index histogram.
Statistics are updated every day / several hours, but there are still many cases where the queries are for recent values.
The question: we can see clearly in the estimated execution plan that the estimated number of rows for FinishTime is 1 in this case, so the FinishTime index is selected. Why does SQL Server assume this is 1 if there is no data? Is there a way to tell it to use something more reasonable?
When we replace the date with a slightly earlier one, statistics exist in the histogram and the estimated number of rows is ~7,000.
You can use a Plan Guide to instruct the optimizer to use a specific query plan for you. This fits well for generated queries that you cannot modify to add hints.
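A minimal sketch of what such a plan guide could look like (the index name IX_jobsData_task is hypothetical, and the statement text must match the application's query exactly; for generated queries whose literals change, a TEMPLATE plan guide or forced parameterization is usually needed as well; assumes SQL Server 2008 or later):
-- Hedged sketch: force a seek on the (assumed) task index via a plan guide.
EXEC sp_create_plan_guide
    @name = N'PG_jobsData_task_seek',
    @stmt = N'SELECT jobid, exitstatus
FROM jobsData
WHERE finishtime >= {ts ''2012-10-04 03:19:26''} AND task = ''t1_345345_454''
GROUP BY jobid, exitstatus',
    @type = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints = N'OPTION (TABLE HINT(jobsData, INDEX(IX_jobsData_task)))';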

Why is performance increased when moving from a derived table to a temp table solution?

I'm reading "Dissecting SQL Server Execution Plans" from Grant Fritchey and it's helping me a lot to see why certain queries are slow.
However, I am stumped with this case where a simple rewrite performs quite a lot faster.
This is my first attempt and it takes 21 secs. It uses a derived table:
-- 21 secs
SELECT *
FROM Table1 AS o
JOIN (
    SELECT col1
    FROM Table1
    GROUP BY col1
    HAVING COUNT(*) > 1
) AS i ON i.col1 = o.col1
My second attempt simply moves the derived table out into a temp table, and it is 3 times faster:
-- 7 secs
SELECT col1
INTO #doubles
FROM Table1
GROUP BY col1
HAVING COUNT(*) > 1
SELECT *
FROM Table1 AS o JOIN #doubles AS i ON i.col1 = o.col1
My main interest is in why moving from a derived table to a temp table improves performance so much, not in how to make it even faster.
I would be grateful if someone could show me how I can diagnose this issue using the (graphical) execution plan.
Xml Execution plan:
https://www.sugarsync.com/pf/D6486369_1701716_16980
Edit 1
When I created statistics on the 2 columns specified in the GROUP BY, the optimizer started doing "the right thing" (after clearing the procedure cache; don't forget that if you are a beginner!). I simplified the query in the question, which in retrospect was not a good simplification. The attached sqlplan shows the 2 columns, but this was not obvious.
The estimates are now a lot more accurate, as is the performance, which is up to par with the temp table solution. As you know, the optimizer creates stats on single columns automatically (if not disabled), but 2-column statistics have to be created by the DBA.
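For reference, a minimal sketch of creating such a multi-column statistic (col1 and col2 stand in for the two GROUP BY columns, which aren't named in the simplified query):
-- Hedged sketch: multi-column statistics on the two grouping columns.
CREATE STATISTICS ST_Table1_col1_col2
ON Table1 (col1, col2)
WITH FULLSCAN;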
A (non-clustered) index on these 2 columns made the query perform the same, but in this case a statistic is just as good and it doesn't suffer the downside of index maintenance.
I'm going forward with the 2-column stat to see how it performs. @Grant, do you know if the stats on an index are more reliable than those of a column statistic?
Edit 2
I always follow up once a problem is solved on how a similar problem can be diagnosed faster in the future.
The problem here was that the estimated row counts were way off. The graphical execution plan shows these when you hover over an operator, but that's about it.
Some tools that can help:
SET STATISTICS PROFILE ON
I heard this one will become obsolete and be replaced by its XML variant, but I still like the output, which is in grid format.
Here, the big difference between the "Rows" and "EstimateRows" columns would have shown the problem.
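A minimal usage sketch (using the temp-table query from above):
SET STATISTICS PROFILE ON;

SELECT *
FROM Table1 AS o
JOIN #doubles AS i ON i.col1 = o.col1;

SET STATISTICS PROFILE OFF;
-- Compare the Rows and EstimateRows columns in the grid output.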
External Tool: SQL Sentry Plan Explorer
http://www.sqlsentry.net/
This is a nice tool, especially if you are a beginner. It highlights problems.
External Tool: SSMS Tools Pack
http://www.ssmstoolspack.com/
A more general-purpose tool, but again it directs the user to potential problems.
Kind Regards, Tom
Looking at the values in the first execution plan, it looks like a statistics problem. You have an estimated number of rows of 800 and an actual of 1.2 million. I think you'll find that updating the statistics will change the way the first query's plan is generated.
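For example, a minimal sketch of refreshing statistics on the table from the question (FULLSCAN is optional but gives the most accurate histogram):
-- Hedged sketch: refresh all statistics on Table1.
UPDATE STATISTICS Table1 WITH FULLSCAN;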

SQL query gets too slow

A few days ago I wrote a query and it executed quickly, but nowadays it takes 1 hour.
This query runs on my SQL 7 server and takes about 10 seconds.
The same query exists on another SQL 7 server, and until last week it also took about 10 seconds.
The configuration of both servers is the same; only the hardware is different.
Now, on the second server, this query takes about 30 minutes to extract the same details, but nobody has changed anything.
If I execute this query without the WHERE clause, it shows me the details in 7 seconds.
The query only takes this long when the WHERE clause is included, so is the WHERE clause the problem?
Without seeing the query (and probably the data) I can't do a lot other than offer tips.
Can you put more constraints on the query? If you can reduce the amount of data involved, this will speed up the query.
Look at the columns used in your joins, WHERE and HAVING clauses, and ORDER BY. Check that the tables those columns belong to have indexes on those columns.
Do you need to use the user-defined function, or can it be done another way?
Are you using subqueries? If so, can they be pulled out into separate views?
Hope this helps.
Without knowing how much data is going into your tables, and without knowing your schema, it's hard to give a definitive answer, but here are some things to look at:
Try running UPDATE STATISTICS or DBCC DBREINDEX.
Do you have any indexes on the tables? If not, try adding indexes to columns used in WHERE clauses and JOIN predicates.
Avoid cross-table OR clauses (i.e., where you do WHERE table1.col1 = @somevalue OR table2.col2 = @someothervalue). SQL Server can't use indexes effectively with this construct, and you may get better performance by splitting the query into two and UNION'ing the results (see the sketch after these tips).
What do your functions (UDFs) do and how are you using them? It's worth noting that dropping them in the columns part of a query gets expensive as the function is executed per row returned: thus if a function does a select against the database, then you end up running n + 1 queries against the database (where n = number of rows returned in the main select). Try and engineer the function out if possible.
Make sure your JOINs are correct: where you're using a LEFT JOIN, revisit the logic and see if it needs to be a LEFT JOIN or whether it can be turned into an INNER JOIN. Sometimes people use LEFT JOINs, but when you examine the logic in the rest of the query, it can become apparent that the LEFT JOIN gives you nothing (because, for example, someone may have added a WHERE col IS NOT NULL predicate against the joined table). INNER JOINs can be faster, so it's worth reviewing all of these.
It would be a lot easier to suggest things if we could see the query.
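Regarding the cross-table OR tip above, a rough sketch of the split-and-UNION rewrite (the table and column names are placeholders, since the original query isn't shown):
-- Hedged sketch with placeholder names; adapt to the real tables and columns.
DECLARE @somevalue int, @someothervalue int
SET @somevalue = 1          -- placeholder value
SET @someothervalue = 2     -- placeholder value

SELECT t1.col1, t2.col2
FROM table1 AS t1
JOIN table2 AS t2 ON t2.table1_id = t1.id
WHERE t1.col1 = @somevalue

UNION   -- or UNION ALL if duplicates are impossible or acceptable

SELECT t1.col1, t2.col2
FROM table1 AS t1
JOIN table2 AS t2 ON t2.table1_id = t1.id
WHERE t2.col2 = @someothervalue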

SQL Query Costing, aggregating a view is faster?

I have a table, Sheet1$, that contains 616 records. I have another table, Rates$, that contains 47,880 records. Rates contains a response rate for a given record in the sheet for each of the 90 days from a mailing date. Across all 90 days of a record's Rates rows, the total response is ALWAYS 1 (100%).
Example:
Sheet1$: Record 1, 1000 QTY, 5% Response, Mail 1/1/2009
Rates$: Record 1, Day 1, 2% Response
Record 1, Day 2, 3% Response
Record 1, Day 90, 1% Response
Record N, Day N, N Response
Given that, I've written a view that takes these tables and right-joins them on the rates to expand the data, so I can perform some math to get a return per day for any given record.
SELECT s.[Mail Date] + r.Day AS Mail_Date,
       s.Quantity * s.[Expected Response Rate] * r.Response AS Pieces,
       s.[Bounce Back Card], s.Customer, s.[Point of Entry]
FROM Sheet1$ AS s
RIGHT OUTER JOIN Rates$ AS r
    ON s.[Appeal Code] = r.Appeal
WHERE s.[Mail Date] IS NOT NULL
  AND s.Quantity <> 0
  AND s.[Expected Response Rate] <> 0
  AND s.Quantity IS NOT NULL
  AND s.[Expected Response Rate] IS NOT NULL
So I save this as a view called Test_Results. Using SQL Server Management Studio I run this query and get a result of 211,140 records. Elapsed time was 4.121 seconds, Est. Subtree Cost was 0.751.
Now I run a query against this view to aggregate a piece count on each day.
SELECT Mail_Date, SUM(Pieces) AS Piececount
FROM Test_Results
GROUP BY Mail_Date
That returns 773 rows and took only 0.452 seconds to execute, with an Est. Subtree Cost of 1.458.
My question is: with a higher estimated cost, how did this execute SO much faster than the original view itself?! I would assume part of it might be that the first query returns far more rows to Management Studio. If that is the case, how would I go about viewing the true cost of this query without having to account for the time spent returning results to the client?
SELECT * FROM view1 will have a plan
SELECT * FROM view2 (where view2 is based on view1) will have its own complete plan
The optimizer is smart enough to make the plan for view2 combine/collapse the operations into the most efficient operation. It only has to observe the semantics of view1's design; it is not required to use the plan for SELECT * FROM view1 and then apply another plan on top of it for view2. In general this will be a completely different plan, and it will do whatever it can to get the most efficient results.
Typically, it's going to push the aggregation down to improve the selectivity, and reduce the data requirements, and that's going to speed up the operation.
I think that Cade has covered the most important part - selecting from a view doesn't necessarily entail returning all of the view rows and then selecting against that. SQL Server will optimize the overall query.
To answer your question though, if you want to avoid the network and display costs then you can simply select each query result into a table. Just add "INTO Some_Table" after the column list in the SELECT clause.
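For instance, using the aggregate query from the question (Some_Table is just a scratch table name):
-- Hedged sketch: materialize the results instead of returning them to the client.
SELECT Mail_Date, SUM(Pieces) AS Piececount
INTO Some_Table
FROM Test_Results
GROUP BY Mail_Date;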
You should also be able to separate things out by showing client statistics or by using Profiler, but the SELECT...INTO method is quick and easy.
Query costs are unitless and are just used by the optimizer to choose what it thinks is the most efficient execution path for a particular query; they can't really be compared between queries. This, although old, is a good quick read. Then you'll probably want to look around for some books or articles on the MSSQL optimizer and on reading query plans if you're really interested.
(Also, make sure you're viewing the actual execution plan and not the estimated execution plan ... they can be different.)
