Which one is the better query plan? - sql-server

I have this query:
SELECT TOP 1
MAX(HORA_LEIT), ST_BOMBA, Q_BOMBA, SEQUENCIAL
FROM
DADOS
WHERE
COD_PONTO = 2085
AND (ST_BOMBA = 'ON' OR ST_BOMBA = 'OFF')
GROUP BY
ST_BOMBA, Q_BOMBA, SEQUENCIAL
ORDER BY
MAX(HORA_LEIT) DESC
I decided to create two indexes:
CREATE INDEX ix_1
ON dados (cod_ponto, St_bomba)
INCLUDE (q_bomba, sequencial, hora_leit)
WHERE St_bomba IN ('ON', 'OFF')
Actual execution plan: https://www.brentozar.com/pastetheplan/?id=HkxiKmDXs
and
CREATE INDEX ix_2
ON dados (cod_ponto, hora_leit)
INCLUDE (St_bomba, q_bomba, sequencial)
WHERE St_bomba IN ('ON', 'OFF')
Actual execution plan: https://www.brentozar.com/pastetheplan/?id=By_1tmDQj
As far as I can see, the first execution plan is better, yet the query optimizer prefers the second one.
Am I misevaluating the performance?

ix_2 is clearly better.
The first thing to note is that the query is written in a very convoluted way.
The query
SELECT TOP 1
MAX(HORA_LEIT), ST_BOMBA, Q_BOMBA, SEQUENCIAL
FROM
DADOS
WHERE
COD_PONTO = 2085
AND (ST_BOMBA = 'ON' OR ST_BOMBA = 'OFF')
GROUP BY
ST_BOMBA, Q_BOMBA, SEQUENCIAL
ORDER BY
MAX(HORA_LEIT) DESC
is equivalent to
SELECT TOP 1
HORA_LEIT, ST_BOMBA, Q_BOMBA, SEQUENCIAL
FROM
DADOS
WHERE
COD_PONTO = 2085
AND (ST_BOMBA = 'ON' OR ST_BOMBA = 'OFF')
ORDER BY
HORA_LEIT DESC
Because you are only interested in the TOP 1 row, the GROUP BY can be optimized out here.
As a side note, it may not be immediately apparent why, so here is the reasoning.
Take a row with the highest HORA_LEIT in the table (among those matching the WHERE conditions). Its group is at least tied for first place when the groups are ordered by MAX(HORA_LEIT) DESC.
So taking ST_BOMBA, Q_BOMBA, SEQUENCIAL from that row produces the values of a valid group that satisfies the original query. If different rows share the maximum HORA_LEIT but differ in ST_BOMBA, Q_BOMBA, SEQUENCIAL, then which one you get is nondeterministic in both versions of the query.
Both execution plans recognize this and contain no aggregation operators.
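The equivalence of the two query forms can be checked on a small data set. The sketch below is only an illustration: it uses Python's built-in sqlite3 module as a stand-in for SQL Server (so TOP 1 becomes LIMIT 1), and the rows are made-up sample data, not the asker's table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dados (cod_ponto INTEGER, hora_leit TEXT, "
    "st_bomba TEXT, q_bomba REAL, sequencial INTEGER)"
)
# Hypothetical sample rows; the latest matching reading is the 12:00 'ON' row.
conn.executemany(
    "INSERT INTO dados VALUES (?, ?, ?, ?, ?)",
    [
        (2085, "2022-10-01 10:00", "ON", 1.5, 1),
        (2085, "2022-10-01 11:00", "OFF", 0.0, 2),
        (2085, "2022-10-01 12:00", "ON", 1.7, 3),
        (2085, "2022-10-01 13:00", "FAIL", 0.0, 4),  # excluded by the WHERE clause
        (9999, "2022-10-01 14:00", "ON", 2.0, 5),    # different cod_ponto
    ],
)

# Original form: aggregate per group, then keep the group with the latest MAX.
grouped = conn.execute(
    "SELECT MAX(hora_leit), st_bomba, q_bomba, sequencial FROM dados "
    "WHERE cod_ponto = 2085 AND (st_bomba = 'ON' OR st_bomba = 'OFF') "
    "GROUP BY st_bomba, q_bomba, sequencial "
    "ORDER BY MAX(hora_leit) DESC LIMIT 1"
).fetchone()

# Simplified form: no GROUP BY, just the single latest matching row.
simple = conn.execute(
    "SELECT hora_leit, st_bomba, q_bomba, sequencial FROM dados "
    "WHERE cod_ponto = 2085 AND (st_bomba = 'ON' OR st_bomba = 'OFF') "
    "ORDER BY hora_leit DESC LIMIT 1"
).fetchone()

print(grouped)
print(simple)
```

With a unique maximum HORA_LEIT, as in this sample, both forms return the same row. With ties on the maximum, either form may return any of the tied rows.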
For index 2 the filtered index condition guarantees that all rows match the ST_BOMBA condition. It just has to do a backward ordered index seek on COD_PONTO = 2085 and read the first row and then stop (as the second key column is HORA_LEIT).
For index 1 the index seek is actually two seeks.
seek on (COD_PONTO, ST_BOMBA) = (2085, 'ON')
seek on (COD_PONTO, ST_BOMBA) = (2085, 'OFF')
The combined rows matching either of those conditions then go into the Top N sort to get the TOP 1 matching row as ordered by HORA_LEIT DESC. It is only 96 rows in this case but is potentially unbounded and just depends on your data.
Reading a single row and stopping is better than reading an arbitrary number of rows and sorting them.
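The difference in work can be modeled outside the database. The following is a toy model in Python, not how SQL Server is implemented: each index is represented as a sorted list, the 96 rows for COD_PONTO = 2085 are invented, and the function names are hypothetical. It counts how many index entries each access path has to read.

```python
from bisect import bisect_left, bisect_right

# Hypothetical rows (cod_ponto, st_bomba, hora_leit); all already 'ON' or 'OFF',
# as the filtered index condition guarantees.
rows = [(2085, "ON" if i % 2 else "OFF", i) for i in range(1, 97)]
rows += [(1000, "ON", 500), (3000, "OFF", 600)]

# ix_1 is keyed on (cod_ponto, st_bomba); hora_leit is only an included column.
ix_1 = sorted(rows, key=lambda r: (r[0], r[1]))
# ix_2 is keyed on (cod_ponto, hora_leit).
ix_2 = sorted(rows, key=lambda r: (r[0], r[2]))


def latest_via_ix_1(cod_ponto):
    """Two seeks, read every matching entry, then pick the latest (Top N sort)."""
    keys = [(r[0], r[1]) for r in ix_1]
    read = []
    for state in ("ON", "OFF"):
        lo = bisect_left(keys, (cod_ponto, state))
        hi = bisect_right(keys, (cod_ponto, state))
        read.extend(ix_1[lo:hi])
    return max(read, key=lambda r: r[2]), len(read)


def latest_via_ix_2(cod_ponto):
    """Backward ordered seek: jump to the last entry for cod_ponto and stop."""
    keys = [r[0] for r in ix_2]
    hi = bisect_right(keys, cod_ponto)
    return ix_2[hi - 1], 1  # exactly one entry is read


print(latest_via_ix_1(2085))
print(latest_via_ix_2(2085))
```

Both paths find the same row, but the ix_1 path reads every entry for the point while the ix_2 path reads one, regardless of how many readings exist.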

Aside from what marc said in the comments, the second query plan only has to seek to 1 row in the index (Actual Number of Rows), and its cardinality estimate is perfect because it also estimated 1 row.
The first execution plan estimates around 250 rows but actually reads 95 rows from the index, so it is slightly less efficient and has a worse cardinality estimate.
But to be honest, for such a simple query plan and small amount of data, you might find the best way to compare is by looking at the TIME STATISTICS and IO STATISTICS. E.g. run SET STATISTICS TIME, IO ON; first, then run each query that produces each plan above and compare. The results will be in the Messages window in SSMS.
TIME STATISTICS will give you the Parse and Compile Time of the query (which shouldn't be super relevant for this test) and the total CPU and Elapsed Time.
IO STATISTICS will tell you how many data pages were read from Memory (the Logical Reads) to serve your query.
The plan that required less of either or both is generally better.
Is there a better index to cover this query?
You may want to try the index on (COD_PONTO, ST_BOMBA, Q_BOMBA, SEQUENCIAL, HORA_LEIT DESC) WHERE ST_BOMBA IN ('ON', 'OFF') which will cover all of the fields in your query, or possibly (COD_PONTO, ST_BOMBA, HORA_LEIT DESC, Q_BOMBA, SEQUENCIAL) WHERE ST_BOMBA IN ('ON', 'OFF'). Generally (but not always) indexing by your predicates (JOIN, WHERE, HAVING clauses) first is most advantageous, then by the GROUP BY and ORDER BY clauses next. But definitely test and compare, that's the only way to be sure.

Related

creating an index did not change my query cost

I was trying to decrease the cost of query execution by creating an index on the rating column. The table has 2680 tuples.
SELECT * from cup_matches WHERE rating*3 > 20
However, when I used pgAdmin to view the query cost before and after indexing, it remained the same. I expected it to decrease, since an index should reduce the I/O cost of reading data from the hard disk into memory. Can someone tell me why it stayed the same?
The cost did not decrease because the WHERE clause applies an operation to the column (rating*3), so the plain index on rating cannot be used. Removing the "*3" operation lets the index be used:
SELECT * from cup_matches WHERE rating > 20
This form can use the index because the rating value is no longer transformed. When the column is wrapped in an expression, the database has to scan the whole table and evaluate the expression for every row. Note that this is a different condition from the original, so it returns different rows.
The index is not used because it is on rating, not on rating*3. To use your current index and keep the same condition, try the following (writing 20.0 avoids integer division, which would turn 20/3 into 6):
SELECT * from cup_matches WHERE rating > 20.0/3
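The effect can be reproduced in miniature. This sketch uses Python's built-in sqlite3 module rather than PostgreSQL, so the plan text differs from what pgAdmin shows, and the table contents, the venue column, and the index name idx_rating are invented for illustration. The point it demonstrates is the same: the planner can only use the index when the indexed column stands alone on one side of the comparison.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE cup_matches (id INTEGER PRIMARY KEY, rating REAL, venue TEXT)"
)
# Hypothetical data: ratings 0.0, 0.5, ..., 19.5
conn.executemany(
    "INSERT INTO cup_matches (rating, venue) VALUES (?, ?)",
    [(i * 0.5, f"venue {i}") for i in range(40)],
)
conn.execute("CREATE INDEX idx_rating ON cup_matches (rating)")


def plan(sql):
    # The last column of each EXPLAIN QUERY PLAN row is a readable description.
    return " | ".join(row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


wrapped = "SELECT * FROM cup_matches WHERE rating * 3 > 20"
bare = "SELECT * FROM cup_matches WHERE rating > 20.0 / 3"

plan_wrapped = plan(wrapped)  # column hidden inside an expression
plan_bare = plan(bare)        # column compared directly

rows_wrapped = conn.execute(wrapped + " ORDER BY id").fetchall()
rows_bare = conn.execute(bare + " ORDER BY id").fetchall()

print(plan_wrapped)
print(plan_bare)
print(rows_wrapped == rows_bare)
```

Both queries select the same rows; only the second one lets the planner search idx_rating instead of scanning the table.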

SQL join running slow

I have 2 SQL queries doing the same thing: the first takes 13 seconds to execute while the second takes 1 second. Is there any reason why?
Not necessarily all the IDs in ProcessMessages will have data in ProcessMessageDetails.
-- takes 13 sec to execute
Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where Id = 4 and Isdone = 0
--takes under a sec to execute
Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages t1
Where Id = 4 and Isdone = 0 )
I have a clustered index on t1.ProcessMessageId (PK) and a nonclustered index on t2.ProcessMessageId (FK).
I would need the actual execution plans to tell you exactly what SQL Server is doing behind the scenes. I can tell you these queries aren't doing exactly the same thing.
The first query finds all of the rows that meet the conditions for t1, finds all of the rows for t2, and then works out which ones match and joins them together.
The second one says: first find all of the rows from t1 that meet my criteria, and then find the rows in t2 that have one of those IDs.
Depending on your statistics, available indexes, hardware, and table sizes, SQL Server may choose different types of scans or seeks to read the data for each part of the query, and it may also choose different ways of joining the data together.
The answer to your question is really simple: the first query generates more rows than the second, so it takes more time to search through them. That is why the first query took 13 seconds and the second only one.
So it is generally suggested that you apply your conditions before making the join; otherwise the number of rows increases, and searching through the joined rows takes more time.

Is "offset-fetch and order by" ordering the whole table or partial table?

In SQL Server, if I try the following query:
select id from table
order by id
offset 1000000 ROWS
fetch next 1000000 ROWS ONLY;
How will SQL Server work? What strategy does SQL server use?
1. Do a sorting on the whole table first and then select the 1 million rows we need
2. Do a sorting on partial table and then return the 1 million rows we need.
I assume it is the 2nd option. If so, how does SQL Server decide which range of the table to sort?
Edit 1:
I am asking this question to understand what could make the query slow. I am testing with two queries:
--Query 1:
select id from table
order by id
offset 1 ROWS
fetch next 1 ROWS ONLY;
and
--Query 2:
select id from table
order by id
offset 1000000000 ROWS
fetch next 1 ROWS ONLY;
I found that the second query can take about 30 minutes to finish, while the first takes almost 0 seconds.
So I am curious what causes this difference. Do the two spend the same time on the ORDER BY? (Does it even really sort the whole table? The id is the clustered index column of the table, and I cannot imagine that sorting a terabyte table takes 0 seconds.)
If the sorting takes the same time, the only difference would be the clustered index scan. The first query only needs to scan the first 1 or 10 (a small number of) rows, while the second needs to scan a much larger number of rows (> 1000000000). But I am not quite sure if this is correct.
Thank you for your help!
Let me take a simple example.
order by id
offset 50 rows fetch 25 rows only
For the above query, the steps would be:
1. The table must be sorted by id (if it is not, you pay the penalty of a sort; there is no partial sort, always a full sort).
2. Then 50+25 rows are scanned (paying the cost of 75 rows) and only 25 rows are returned.
In an example on an orders table I have (orderid is the PK, so it is already sorted), you can see that even though we are getting only 20 rows, we are paying the cost of 120 rows.
Coming to your question: there is no partial sort (which means the first option, as far as sorting is concerned), even if you try to return just one row, as below.
select top 1 * from table
order by orderid
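The "offset + fetch rows are paid for" accounting can be modeled in a few lines. This is a simplified Python sketch with a hypothetical function name, not SQL Server's actual operator: it walks an already ordered sequence (standing in for a clustered index scan) and counts the rows it has to touch before the page is complete.

```python
def offset_fetch(ordered_ids, offset, fetch):
    """Return (page, rows_read) for an ordered scan that skips `offset` rows."""
    page = []
    rows_read = 0
    for row in ordered_ids:
        rows_read += 1
        if rows_read > offset:
            page.append(row)        # past the offset: this row is returned
        if rows_read == offset + fetch:
            break                   # page complete, the scan can stop early
    return page, rows_read


ids = range(1, 1001)  # hypothetical ids, already in index order

page, rows_read = offset_fetch(ids, offset=50, fetch=25)
print(len(page), rows_read)

page_small, read_small = offset_fetch(ids, offset=1, fetch=1)
page_deep, read_deep = offset_fetch(ids, offset=900, fetch=1)
print(read_small, read_deep)
```

Both single-row queries return one row, but the number of rows read grows with the offset, which is consistent with the large-offset query in the question being far slower than the small-offset one.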

How to speed up a query with a few intersect operations

Is there a way to speed up a query with a few intersect operations? The query looks something like this:
SELECT TOP(2000) vk_key FROM (
SELECT vk_key FROM mat_property_det where [type]='mp' AND (ISNULL(mat_property_det.value_max, ISNULL(mat_property_det.value_min, 9999)) <= 980 OR ISNULL(mat_property_det.value, 9999) <= 980) AND mat_property_det.property_id=6
INTERSECT
SELECT vk_key FROM search_advance_mat WHERE 1=1 AND (search_advance_mat.group_id = 101 )
INTERSECT
SELECT vk_key FROM V_search_advance_subgroup_en WHERE CONTAINS((Subgroup_desc, Comment, [Application], HeatTreatment), ' "plates*"') ) T
We don't know in advance how many intersections we will have, and we can't replace the intersections with e.g. inner joins, because the query is generated by the application according to the user's search parameters.
Here is an execution plan:
Any help or advice would be appreciated!
Help/Advice: Focus on the Table Spool (Lazy Spool) in your execution plan, as it is consuming 36% of the query effort. You should be able to mouse hover over that area of the plan and discover the table/view/function/etc involved.
As MSDN states for lazy spools, the spool operator gets a row from its input operator and stores it in the spool, rather than consuming all rows at once, which means every row is being scanned, one at a time.
Do everything you can to remove the Table Spool and/or improve the performance in this specific area of the query/execution plan and the entire query will benefit.

Not able to understand the correlation between cost in the explain plan and time

When I run below query
explain
select count(*) over() as t_totalcnt, max(hits) over() as t_maxhits, max(bytes) over() as t_maxbytes, *
from
(
select category,sum(hits) as hits,sum(bytes) as bytes
from (
select "5mintime",category,hits,bytes,appid, 0 as tmpfield
from web_categoryutfv1_24hr_ts_201209
where "5mintime" >='2012-09-12 00:00:00' and "5mintime" < '2012-09-19 00:00:00'
) as tmp
where "5mintime" >='2012-09-12 00:00:00'
and "5mintime" <= '2012-09-18 23:59:59'
and appid in ('') group by category order by hits desc
) as foo
limit 10;
I get the below output
Limit  (cost=31.31..31.61 rows=10 width=580)
  ->  WindowAgg  (cost=31.31..32.03 rows=24 width=580)
        ->  Subquery Scan foo  (cost=31.31..31.61 rows=24 width=580)
              ->  Sort  (cost=31.31..31.37 rows=24 width=31)
                    Sort Key: (sum(web_categoryutfv1_24hr_ts_201209.hits))
                    ->  HashAggregate  (cost=30.39..30.75 rows=24 width=31)
                          ->  Seq Scan on web_categoryutfv1_24hr_ts_201209  (cost=0.00..27.60 rows=373 width=31)
                                Filter: (("5mintime" >= '2012-09-12 00:00:00'::timestamp without time zone)
                                AND ("5mintime" < '2012-09-19 00:00:00'::timestamp without time zone)
                                AND ("5mintime" >= '2012-09-12 00:00:00'::timestamp without time zone)
                                AND ("5mintime" <= '2012-09-18 23:59:59'::timestamp without time zone)
                                AND ((appid)::text = ''::text))
When I run the above query without the explain keyword, I get the output within 1 second, while here cost=31.31..31.61.
Can anybody please help me understand what the cost keyword means in the explain plan? I mean, what are the units of cost in the explain plan?
Cost is the query planner's estimate of how difficult an operation is or how long it will take to perform. It's based on some machine-level parameters -- how long a disk seek will take versus a streaming read, for example -- along with table-level information like how big each row is, how many rows there are, or the distribution of values in each column. There are no units, and the resulting cost values are arbitrary. Costs are the metric PostgreSQL uses to figure out how to execute a query; it will consider the myriad ways in which to execute your query and choose the plan with the lowest cost. For more specifics on cost calculations, see Planner Cost Constants.
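To make the "no units" point concrete, here is a rough sketch of how a sequential scan estimate is assembled from those parameters. The three constants are PostgreSQL's default values for seq_page_cost, cpu_tuple_cost and cpu_operator_cost; the page and row counts are hypothetical and are not the statistics of the table in the question, and the real planner accounts for more details than this simplified formula.

```python
# PostgreSQL default planner cost constants (all relative, unitless weights).
SEQ_PAGE_COST = 1.0         # reading one page sequentially
CPU_TUPLE_COST = 0.01       # processing one row
CPU_OPERATOR_COST = 0.0025  # evaluating one operator on one row


def seq_scan_cost(pages, rows, filter_ops=0):
    """Approximate total cost of a sequential scan with simple filter operators."""
    io_cost = pages * SEQ_PAGE_COST
    cpu_cost = rows * (CPU_TUPLE_COST + filter_ops * CPU_OPERATOR_COST)
    return io_cost + cpu_cost


# Hypothetical table: 358 pages holding 10000 rows.
no_filter = seq_scan_cost(pages=358, rows=10000)
one_filter = seq_scan_cost(pages=358, rows=10000, filter_ops=1)
print(no_filter, one_filter)
```

The result is a weighted count of page reads and per-row work, which is why it can be compared between plans but cannot be read as seconds or milliseconds.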
Assuming you're using the default settings, a cost this low for a sequential scan suggests to me that PostgreSQL thinks there aren't many rows in that table. The fact that it's taking a full second to run suggests that there are, in fact, a lot of rows in that table. You can tell PostgreSQL to collect new statistics on that table by running ANALYZE web_categoryutfv1_24hr_ts_201209. The pg_autovacuum process should regularly collect statistics anyway, but maybe you're on an older version of PostgreSQL, or it hasn't run in a while, or who knows; regardless, there's no harm in doing it again by hand.
If PostgreSQL thinks that table is small, it'll prefer a sequential scan over using indexes, because a sequential read of the whole table is faster than an index scan followed by a bunch of random reads. On the other hand, if PostgreSQL thinks the table is large, it will likely be faster to reference an index on 5mintime and/or appid, assuming that said index will allow it to exclude many rows. If you have no such index, consider creating one.
One last thing: EXPLAIN has a big brother named EXPLAIN ANALYZE. While EXPLAIN shows you the query plan PostgreSQL will choose along with the costs that guided its decision, EXPLAIN ANALYZE actually executes the query and shows you how long each component took to run. See EXPLAIN for more.
