SQL Server - wrong execution plan?

I have a very big table with a lot of rows and a lot of columns (I know that's bad, but let's leave it aside).
Specifically, I have two columns - FinishTime and JobId. The first is the finish time of the row and the second is its id (not unique, but almost unique - only a few records exist with the same id).
I have an index on JobId and an index on FinishTime.
We insert rows all the time, mostly ordered by the finish time. We also update the statistics of each index periodically.
Now to the problem:
When I run a query with the filter jobid == <some id> AND finishtime > <now minus 1 hour>, it takes a lot of time, and the estimated execution plan shows the plan going over the finishtime index, even though going over the jobid index should be a lot better. Looking at the index statistics, I see that the server "thinks" the number of jobs in the last hour is 1, because we didn't update the statistics of this index.
When I run a query with the filter jobid == <some id> AND finishtime > <now minus 100 days>, it works great, because SQL Server knows to use the correct index - the jobid index.
So basically my question is: if we don't update index statistics all the time (which is time-consuming), why does the server assume that the number of records past the last histogram bucket is 1?
Thanks very much

You can get the histogram that the statistics contain for an index using DBCC SHOW_STATISTICS, e.g.
DBCC SHOW_STATISTICS ( mytablename , myindexname )
Queries over date-based columns will always be prone to stale statistics like this. Running the command above should show that the last bucket in the histogram covers barely any of the records added since the statistics were built. However, all else being equal, SQL Server should still prefer the job_id index over the finishtime index if both are single-column indexes with no included columns; job_id (an int) is faster to look up than finishtime (a datetime).
Note: if your finishtime index covers the query, that will heavily influence the query optimizer toward selecting it, since it eliminates the bookmark lookup operation.
To combat this, you can (a rough sketch of all three options follows this list):
update statistics regularly
create filtered indexes (a 2008+ feature) on the data, with the tail-end index rebuilt far more frequently than the rest
use index hints on sensitive queries
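For example, a minimal sketch of all three options (the table, column, and index names here are assumptions, not taken from the question):

-- option 1: refresh just the statistics on the hot index
UPDATE STATISTICS dbo.mytable IX_mytable_finishtime WITH FULLSCAN;

-- option 2: a filtered "tail" index over recent rows (SQL Server 2008+);
-- rebuild it periodically with a newer cutoff date
CREATE NONCLUSTERED INDEX IX_mytable_finishtime_tail
ON dbo.mytable (finishtime)
WHERE finishtime > '20121001';

-- option 3: force the index on a sensitive query with a hint
SELECT jobid, finishtime
FROM dbo.mytable WITH (INDEX (IX_mytable_jobid))
WHERE jobid = 12345
AND finishtime > DATEADD(HOUR, -1, GETDATE());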

Related

Two apparently identical queries using the same query plan, return different number of rows from an index scan

I'm trying to understand why two queries using the same query plan, querying the same data, are behaving differently.
I was investigating the following query.
select T1.id, T2.Num
from DB1.dbo.Table1 T1, DB1.dbo.Table2 T2
where T1.DataDate = (select MAX(datadate) from DB1.dbo.Table1)
and T1.col2 = T2.col2
and T1.col3 = 1
and T2.col4 = 'A string'
order by T1.id
Table1 is a heap with a single unique nonclustered index based on datadate, col2 and col3. The table has 150 million rows. The datadate column is a char(8) nullable column holding date values (yyyymmdd).
Table2 has around 300 rows.
The data in both tables has been static for the last few weeks. Stats have not been updated on either table for the last few days.
The issue centred around the subquery
(select MAX(datadate) from DB1.dbo.Table1)
which generates the following branch in the estimated query plan.
Query Plan for large index scan
This includes an index scan against the Table1 unique index. This passes all 150 million rows through to the next step (Distribute & Gather Streams). This version of the query took on average 45 seconds to complete.
I found that by adding an "IS NOT NULL" where clause as follows -
(select MAX(datadate) from DB1.dbo.Table1 where datadate is not null)
the query plan branch changed to the one shown in the link below: an index seek against the Table1 unique index, passing a single row through to the "Top" step, with no intermediate Distribute or Gather Streams steps. This version of the query completed in around 1 second.
Query plan for index seek
So far so good. I can't say I understand exactly why the addition of the IS NOT NULL clause should have such a dramatic effect, but the query plans of the two different queries made sense to me.
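For what it's worth, here is a sketch of the logical shape of that seek plan - an illustration of the equivalence, not a claim about why the optimizer prefers it. MAX ignores NULLs, so once NULLs are excluded the aggregate can be answered by reading a single row from the top end of the index:

SELECT TOP (1) datadate
FROM DB1.dbo.Table1
WHERE datadate IS NOT NULL  -- skips the NULL range at the low end of the index
ORDER BY datadate DESC;     -- the MAX is simply the first row in descending index order
-- (modulo the empty-result case: MAX returns NULL, TOP (1) returns no rows)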
Where I started to get confused, was when I tried repeatedly running the original query, just to confirm that the difference in duration between the two queries was not a fluke. What I found was that running the first version of the query (without the "IS NOT NULL" where clause) took over 40 seconds most times it ran, but every so often it would complete in under 1 second.
When I looked at the actual execution plan XML of the two runs of the same query (the "1 second version" and the "45 second version") I found the following.
The two queries were using the same query plan.
I based this on the QueryHash and QueryPlanHash values, which were identical for both runs:
QueryHash="0xCF7F3761DC77476E" QueryPlanHash="0x55A6B0D6E3D73607"
QueryHash="0xCF7F3761DC77476E" QueryPlanHash="0x55A6B0D6E3D73607"
However, when it came to the number of rows read by the index scan against the Table1 unique index, the query that took over 40 seconds read all 150 million rows.
Actual Execution Plan XML - long duration query
However the query that took 1 second only scanned a fraction of those rows.
Actual Execution Plan XML - short duration query
This can also be seen in the graphical actual execution plan, with the long duration query passing 150 million rows from the index scan
Graphical Actual Execution Plan - long duration query
And the "1 second version" passing just 14,000 rows from the index scan.
Graphical Actual Execution Plan - short duration query
This different behaviour explained the difference in the duration of the two queries.
So after this very long explanation, my question is:
How can two identical queries, using the same query plan, querying the same data (no updates to the tables involved between the two queries, no update stats between the two queries) return a different number of rows from the same index scan?

Actual Number of Rows in Execution Plan is different than actual rows returned

I've studied a few questions that were already asked about "Actual Number of Rows", but none matched my problem, so I'm posting this one.
Also, I studied https://www.sqlpassion.at/archive/2018/05/28/actual-number-of-rows-are-not-always-accurate/ on why Actual Rows can differ, but that blog post is about nonclustered columnstore indexes, which aren't relevant to my table.
Using https://data.stackexchange.com/stackoverflow/, I queried the Users table with "Actual Execution Plan" included:
SELECT TOP 10 Location FROM Users WHERE Location = 'Hyderabad'
SELECT TOP 10 Location FROM Users WHERE Location LIKE 'Hyderabad'
The Results are attached.
StackOver_Top10HydUsers_ActualExecPlanIssue.Jpg
My questions (based on my understanding of how WHERE operators work) are:
Both queries yielded the same type of execution plan, but they showed different "Actual Number of Rows" in the plan. How and why?
The "Actual Number of Rows" of both plans was wrong, since both queries returned 10 records. So is "Actual Number of Rows" a misnomer? I have studied why Estimated Rows can differ based on statistics, but "Actual Rows" too!?
UPDATE #1:
Actually, I was intending to understand how ActualRows could differ from what is returned, even though the PhysicalOp Clustered Index Scan happened and actually counted the records that satisfied the WHERE clause.
Your statistics (which determine what index is to be used and how many rows are likely to be read) are off. Please don't set up a job to rebuild all statistics every hour or every day (it's very likely unnecessary and can be expensive), but do rebuild the statistics on this table with:
update statistics schemaname.tablename with fullscan;
Usually this will bring the estimated plan back in line with the actual plan. Rebuilding the indexes on a table will also update all of its index statistics, but that may not be necessary.
To rebuild all the indexes on a given table you can issue:
ALTER INDEX ALL ON schemaname.tablename REBUILD;
The advantage of doing this is that all fragmentation will be resolved and up-to-date statistics will be available for the table. On a large table this could take a while.
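If only the statistics are stale, a narrower alternative avoids paying for a full rebuild (the index name here is a placeholder):

-- refresh a single index's statistics only
UPDATE STATISTICS schemaname.tablename myindexname WITH FULLSCAN;
-- or let SQL Server update only the statistics it considers out of date, database-wide
EXEC sp_updatestats;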

Is there any performance difference between these two queries?

Query 1 - UserId is the main identifier, non-clustered index
update myTable set
CurrentHp=MaximumHp,
SelectedAttack1RemainingPP=SelectedAttack1MaximumPP,
SelectedAttack2RemainingPP=SelectedAttack2MaximumPP,
SelectedAttack3RemainingPP=SelectedAttack3MaximumPP,
SelectedAttack4RemainingPP=SelectedAttack4MaximumPP where UserId=1001695
Query 2
update myTable set
CurrentHp=MaximumHp,
SelectedAttack1RemainingPP=SelectedAttack1MaximumPP,
SelectedAttack2RemainingPP=SelectedAttack2MaximumPP,
SelectedAttack3RemainingPP=SelectedAttack3MaximumPP,
SelectedAttack4RemainingPP=SelectedAttack4MaximumPP
where UserId=1001695
and
(
SelectedAttack1RemainingPP!=SelectedAttack1MaximumPP
or
SelectedAttack2RemainingPP!=SelectedAttack2MaximumPP
or
SelectedAttack3RemainingPP!=SelectedAttack3MaximumPP
or
SelectedAttack4RemainingPP!=SelectedAttack4MaximumPP
or
CurrentHp!=MaximumHp
)
When I check via SQL Server Management Studio and compare with "Include Actual Execution Plan", their cost is the same.
However, when I check via "Include Client Statistics", I see that the first query shows 1900 rows updated while the second one shows 0 rows updated.
So here is my question: when the column values are already equal, does SQL Server still perform the update?
I also think that logically both queries should behave the same, but I would like to hear your opinion.
execution plan same performance image
client statistics query 1
client statistics query 2
The two execution plans are the same because your first filter condition (UserId = 1001695) selects rows through the index the table has on that field.
If you change your queries to use a range condition such as (UserId > 100), the costs in the execution plans change and they are no longer the same; likewise, if you filter on another field that the table has no index on, the structure of the execution plans changes and they are no longer the same.
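If you want to verify this yourself, a quick sketch (the values are just examples; view the estimated plans with Ctrl+L in SSMS rather than executing the statements):

-- equality on the indexed UserId: both forms of the query produce the same seek-based plan
update myTable set CurrentHp=MaximumHp where UserId=1001695
-- range predicate: the estimated row count changes, and so do the plan shape and cost
update myTable set CurrentHp=MaximumHp where UserId>100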

Improve performance of insert?

I ran the Performance - Top Queries by Total IO report (I am trying to improve this process).
The #1 query is this code:
DECLARE @LeadsVS3 AS TT_LEADSMERGE
DECLARE @LastUpdateDate DATETIME
SELECT @LastUpdateDate = MAX(updatedate)
FROM [BUDatamartsource].[dbo].[salesforce_lead]
INSERT INTO @LeadsVS3
SELECT
Lead_id,
(more columns…)
OrderID__c,
City__c
FROM
[ReplicatedVS3].[dbo].[Lead]
WHERE
UpdateDate > @LastUpdateDate
(the code is a piece of a larger SP)
This is in a job that runs every 15 minutes... Other than running the job less frequently is there any other improvement I could make?
Try a local temp table like #LeadsVS3 instead of the table-type variable; it is faster than a UDTT in most cases (see the sketch below).
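A minimal sketch of that swap; the column types are assumptions and must match the TT_LEADSMERGE definition:

CREATE TABLE #LeadsVS3
(
    Lead_id INT,
    -- (more columns…)
    OrderID__c NVARCHAR(50),
    City__c NVARCHAR(100)
);

INSERT INTO #LeadsVS3 (Lead_id, OrderID__c, City__c)  -- plus the remaining columns
SELECT Lead_id, OrderID__c, City__c
FROM [ReplicatedVS3].[dbo].[Lead]
WHERE UpdateDate > @LastUpdateDate;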
There is also another trick you can use.
In cases where you always fetch all "recent" rows, you can get blocked on one row - the latest - while you wait for it to commit. You can sacrifice a small window, e.g. one minute, by ignoring the last minute's records (current datetime minus 1 minute). You pick those rows up on the next run and save yourself any transaction (or replication) lock waits. For example:
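The cutoff is straightforward to add to the existing filter (the one-minute window is the example value from above):

DECLARE @Cutoff DATETIME = DATEADD(MINUTE, -1, GETDATE());  -- skip the newest minute

INSERT INTO #LeadsVS3
SELECT Lead_id, OrderID__c, City__c  -- plus the remaining columns
FROM [ReplicatedVS3].[dbo].[Lead]
WHERE UpdateDate > @LastUpdateDate
AND UpdateDate <= @Cutoff;  -- anything newer is picked up by the next 15-minute run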
The execution plan that you posted appears to be the estimated execution plan (the actual execution plan includes the actual number of rows). Without the actual plan, it's impossible to tell what's really going on.
The obvious improvement would be to add a covering nonclustered index on Lead.UpdateDate that includes the other columns in your SELECT statement. Right now you're scanning the widest possible index (your clustered index) to retrieve a presumably small percentage of the records. Turning that clustered scan into a nonclustered seek will be huge.
On that same note, you could make that index a filtered index that only includes records with dates greater than your last UpdateDate, then set up a regular SQL Agent job that periodically rebuilds it to filter on a more current date. A sketch of both ideas combined follows.
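This is roughly what that index could look like (the index name and filter date are placeholders):

CREATE NONCLUSTERED INDEX IX_Lead_UpdateDate
ON [ReplicatedVS3].[dbo].[Lead] (UpdateDate)
INCLUDE (Lead_id, OrderID__c, City__c)  -- plus the remaining selected columns
WHERE UpdateDate > '20240101';
-- the scheduled job then re-creates it with a fresher filter date,
-- using CREATE INDEX ... WITH (DROP_EXISTING = ON)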
Other things you can do to increase insert performance:
Drop any constraints and/or indexes before the insert, then rebuild them after.
Use smaller data types.

SQL Server: wrong index is used when filter value exceeds the index histogram range

We have a very large table to which 1-2 million rows are added every day.
In this query:
SELECT jobid, exitstatus
FROM jobsData
WHERE finishtime >= {ts '2012-10-04 03:19:26'} AND task = 't1_345345_454'
GROUP BY jobid, exitstatus
Indexes exist on both Task and FinishTime.
We expected the Task index to be used, since its predicate matches far fewer rows. The problem we see is that SQL Server creates a bad query execution plan which uses the FinishTime index instead of the Task index, and the query takes a very long time.
This happens when the finish time value is outside the FinishTime index histogram.
Statistics are updated every day / several hours, but there are still many cases where the queries are for recent values.
The question: we can see clearly in the estimated execution plan that the estimated number of rows for FinishTime is 1 in this case, so the FinishTime index is selected. Why does SQL Server assume it is 1 when there is no histogram data? Is there a way to tell it to use something more reasonable?
When we replace the date with a slightly earlier one, statistics exist in the histogram and the estimated number of rows is ~7000.
You can use a Plan Guide to instruct the optimizer to use a specific query plan. This fits well for generated queries that you cannot modify to add hints. For example:
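A minimal sketch against the query above; the index name IX_jobsData_task is an assumption, and for @type = N'SQL' the @stmt text must match the text the application submits exactly:

EXEC sp_create_plan_guide
    @name = N'PG_jobsData_task_index',
    @stmt = N'SELECT jobid, exitstatus
FROM jobsData
WHERE finishtime >= {ts ''2012-10-04 03:19:26''} AND task = ''t1_345345_454''
GROUP BY jobid, exitstatus',
    @type = N'SQL',
    @module_or_batch = NULL,
    @params = NULL,
    @hints = N'OPTION (TABLE HINT (jobsData, INDEX (IX_jobsData_task)))';  -- assumed index name

Since the literal values in generated queries usually change from run to run, a TEMPLATE plan guide combined with forced parameterization is often the better fit in practice.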
