I would like to ask how SQL Server estimates the row counts in the query below. If it uses the histogram to calculate the estimated rows, how does it do that? Any hints or links to the answer are highly appreciated.
use AdventureWorks2012
go
select *
from sales.SalesOrderDetail
where SalesOrderID between 43792 and 44000
option (recompile)
This is the execution plan:
This is the statistics info:
SQL Server builds statistics on the column to analyze the data distribution in that column, and from that histogram it derives its estimates.
Let's take a small example to understand the data better.
drop table t1
create table t1
(
id int
)
insert into t1
select top 300 row_number() over(order by t1.number) as N
from master..spt_values t1
cross join master..spt_values t2
go 3
select * from t1 where id=1
dbcc show_statistics('t1','_WA_Sys_00000001_29572725')
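Note that the statistics name above (_WA_Sys_00000001_29572725) is auto-generated, so it will differ on your system. A quick way to look up the auto-created column statistics on t1 (a small sketch, assuming such statistics have already been created):
-- list auto-created column statistics on t1; pass the returned name to DBCC SHOW_STATISTICS
select s.name
from sys.stats as s
where s.object_id = object_id('t1')
  and s.auto_created = 1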
DBCC gives me the histogram below:
RANGE_HI_KEY  RANGE_ROWS  EQ_ROWS  DISTINCT_RANGE_ROWS  AVG_RANGE_ROWS
1             0           3        0                    1
3             3           3        1                    3
4             0           3        0                    1
6             3           3        1                    3
8             3           3        1                    3
10            3           3        1                    3
Above is a snippet of the DBCC output. Before explaining what those columns mean, let's understand how the data is distributed in the table:
there are values from 1 to 300, each duplicated 3 times, so the total row count is 900.
Now let's look at what each column means.
RANGE_HI_KEY:
SQL Server uses the values in this column as the upper boundary (top key) of each histogram step. A histogram is limited to at most 200 steps, so only up to 200 key values are chosen to build it; in this case the values are 1, 3, 4, 6 and so on.
RANGE_ROWS:
This is the number of rows within the step that are greater than the previous top key and less than the current top key, but not equal to either;
for example, rows where the value is > 1 and < 3, and so on.
EQ_ROWS:
The number of rows exactly equal to the top key; in this case, rows where id = 1, 3 and so on.
DISTINCT_RANGE_ROWS:
The number of distinct values within the step, excluding the top key. If all the rows in the range are unique, then RANGE_ROWS and DISTINCT_RANGE_ROWS will be equal;
for example, distinct values where the value is > 1 and < 3, and so on.
AVG_RANGE_ROWS:
This represents the average number of rows per distinct value within the step, excluding the top key; in other words, RANGE_ROWS / DISTINCT_RANGE_ROWS.
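To see where those numbers come from, here is a quick check against the demo table (assuming it was populated as above):
-- rows strictly between the previous key (1) and the key 3: only the value 2 qualifies
select count(*) as range_rows,                   -- expect 3
       count(distinct id) as distinct_range_rows -- expect 1
from t1
where id > 1 and id < 3
-- AVG_RANGE_ROWS for that step = RANGE_ROWS / DISTINCT_RANGE_ROWS = 3 / 1 = 3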
**Some demo queries**
select * from t1 where id = 1
We know EQ_ROWS for the key 1 has a value of 3, so you can see the estimated number of rows is 3.
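For a value that is not a step endpoint, the estimate should come from the AVG_RANGE_ROWS of the step the value falls into. A sketch against the same demo table (the expected number is an assumption based on the histogram above):
-- id = 2 is not a RANGE_HI_KEY; it falls into the step ending at 3,
-- so the expected estimate is that step's AVG_RANGE_ROWS = 3
select * from t1 where id = 2 option (recompile)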
That covers a simple equality query, but how does it work for multiple predicates, like the BETWEEN in your case?
Bart Duncan provides some insight:
The optimizer has a number of ways to estimate cardinality, none of which are completely foolproof.
If the predicate is simple like “column=123” and if the search value happens to be a histogram endpoint (RANGE_HI_KEY), then EQ_ROWS can be used for a very accurate estimate.
If the search value happens to fall between two step endpoints, then the average density of values in that particular histogram step is used to estimate predicate selectivity and operator cardinality.
If the specific search value is not known at compile time, the next best option is to use average column density (“All density”), which can be used to calculate the number of rows that will match an average value in the column.
In some cases none of the above are possible and optimizer has to resort to a “magic number”-based estimate. For example, it might make a totally blind guess that 10% of the rows will be returned, where the “10%” value would be hardcoded in the optimizer’s code rather than being derived from statistics.
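Applying that to a range predicate on the demo table gives a rough sketch of where a BETWEEN estimate comes from (assumption: the estimate is built by adding up the histogram steps the range covers):
select * from t1 where id between 3 and 6 option (recompile)
-- rough expectation from the histogram above:
--   EQ_ROWS(3) + RANGE_ROWS(4) + EQ_ROWS(4) + RANGE_ROWS(6) + EQ_ROWS(6)
--   = 3 + 0 + 3 + 3 + 3 = 12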
Further references and reading:
https://sqlperformance.com/2014/01/sql-plan/cardinality-estimation-for-multiple-predicates
https://blogs.msdn.microsoft.com/bartd/2011/01/25/query-tuning-fundamentals-density-predicates-selectivity-and-cardinality/
Related
I want to determine whether a user produced more than 1000 log rows. Given these queries, I let SQL Server Management Studio display the estimated execution plan.
select count(*) from tbl_logs where id_user = 3
select 1 from tbl_logs where id_user = 3 having count(1) > 1000
I thought the second one should be better because it can return as soon as SQL Server has found 1000 rows, whereas the first one computes the actual count of rows.
Also, when I profile the queries, they are equal in terms of reads, CPU, and duration.
What would be the most efficient query for my task?
This query should also improve performance:
select 1 from tbl_logs where id_user = 3 order by 1 offset (1000) rows fetch next (1) rows only
You get a 1 when more than 1,000 rows exist, and an empty result set when they don't.
It only fetches the first 1,001 rows, as Alexander's answer does, but it has the advantage that it doesn't need to re-count the rows already fetched.
If you want the result to be exactly 1 or 0, then you could read it like this:
with Row_1001 as (
select 1 as Row_1001 from tbl_logs where id_user = 3 order by 1 offset (1000) rows fetch next (1) rows only
)
select count(*) as More_Than_1000_Rows_Exist from Row_1001
I think some performance improvement can be achieved this way:
select 1 from (
select top 1001 1 as val from tbl_logs where id_user = 3
) cnt
having count(*) > 1000
In this example, the derived query will fetch only the first 1001 rows (if they exist) and the outer query will perform a logical check on the count.
However, it will not reduce reads if the table tbl_logs is tiny, because then the index seek uses such a small index that only a few pages need to be fetched anyway.
I have a table-valued function that returns four rows:
SELECT *
FROM dbo.GetStuff('0D182B8B-7A80-4D45-8900-23FA01FCFE5A')
ORDER BY TurboEncabulatorID DESC
and this returns quickly (< 1s):
TurboEncabulatorID Trunion Casing StatorSlots
------------------ ------- ------ -----------
4 G Red 19
3 F Pink 24
2 H Maroon 17
1 G Purple 32
But I only want the "last" row (i.e. the row with the highest TurboEncabulatorID), so I add TOP:
SELECT TOP 1 *
FROM dbo.GetStuff('0D182B8B-7A80-4D45-8900-23FA01FCFE5A')
ORDER BY TurboEncabulatorID DESC
This query takes ~40s to run, with a huge amount of I/O, and a much worse query plan.
Obviously this is an issue with the optimizer - but how can I work around it?
I've updated all statistics
I've rebuilt all indexes
Bonus Reading
Will the OPTIMIZE option work in a multi-statement table function?
OPTIMIZE FOR UNKNOWN – a little known SQL Server 2008 feature
The workaround I came up with, which obviously isn't a real answer, is to try to confuse the optimizer:
select top 1 rows
from top 1 percent of rows
from the table-valued function
In other words:
WITH optimizerWorkaround AS
(
SELECT TOP 1 PERCENT *
FROM dbo.GetStuff('0D182B8B-7A80-4D45-8900-23FA01FCFE5A')
ORDER BY TurboEncabulatorID DESC
)
SELECT TOP 1 *
FROM optimizerWorkaround
ORDER BY TurboEncabulatorID DESC
This returns quickly, as if I had no TOP in the first place.
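A different workaround that is sometimes used with multi-statement table-valued functions (not part of the answer above, just a sketch) is to materialize the function's result into a temp table first, so the TOP 1 runs against an ordinary table with known cardinality:
SELECT *
INTO #stuff
FROM dbo.GetStuff('0D182B8B-7A80-4D45-8900-23FA01FCFE5A')

SELECT TOP 1 *
FROM #stuff
ORDER BY TurboEncabulatorID DESC

DROP TABLE #stuff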
I have a SQLite database that I'm trying to use data from; basically, multiple sensors write to the database, and I need to join each row to the preceding row to calculate the value difference for that time period. The only catch is that the ROWID field can't be used for the join anymore, since more sensors have begun writing to the database.
In SQL Server it would be easy to use ROW_NUMBER and partition by sensor. I found this topic: How to use ROW_NUMBER in sqlite and implemented the suggestion:
select id, value ,
(select count(*) from data b where a.id >= b.id and b.value='yes') as cnt
from data a where a.value='yes';
It works but is very slow. Is there anything simple I'm missing? I've tried joining on the time difference and creating a view. I'm at wits' end! Thanks for any ideas!
Here is sample data:
ROWID  SensorID  Time      Value
1      2         1-1-2015  245
2      3         1-1-2015  4456
3      1         1-1-2015  52
4      2         2-1-2015  325
5      1         2-1-2015  76
6      3         2-1-2015  5154
I just need to join row 6 with row 2 and row 3 with row 5 and so forth based on the sensorID.
The subquery can be sped up with an index with the correct structure.
In this case, the column with the equality comparison must come first, and the one with the inequality comparison second:
CREATE INDEX xxx ON MyTable(SensorID, Time);
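For reference, here is a sketch of the kind of correlated subquery that index supports, pairing each reading with the previous reading from the same sensor. It assumes the sample table is named data with columns SensorID, Time and Value (so the index above would be created on data), and that Time sorts chronologically:
-- for each row, look up the most recent earlier reading for the same sensor
SELECT a.SensorID,
       a.Time,
       a.Value - (SELECT b.Value
                  FROM data b
                  WHERE b.SensorID = a.SensorID   -- equality column first in the index
                    AND b.Time < a.Time           -- inequality column second
                  ORDER BY b.Time DESC
                  LIMIT 1) AS diff
FROM data a;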
I'm trying to join two relatively simple tables together, but my query is experiencing serious hangups. I'm not sure why, but I think it might have something to do with the BETWEEN operator. My first table looks something like this (it has a lot of other columns, but this is the only column I'm pulling):
RowNumber
1
2
3
4
5
6
7
8
My second table "groups" my rows into "blocks", and has the following schema:
BlockID  RowNumberStart  RowNumberStop
1        1               3
2        4               7
3        8               8
The desired result is to link each RowNumber with its BlockID, as below, with the same number of rows as the first table. So the result would look like this:
RowNumber  BlockID
1          1
2          1
3          1
4          2
5          2
6          2
7          2
8          3
In order to get that, I used the following query, writing the results into a temp table:
select A.RowNumber, B.BlockID
into TEMP_TABLE
from TABLE_1 A left join TABLE_2 B
on A.RowNumber between B.RowNumberStart and B.RowNumberStop
TABLE_1 and TABLE_2 are actually very large tables. TABLE_1 is about 122M rows, and TABLE_2 is about 65M rows. In TABLE_1, RowNumber is defined as a bigint, and in TABLE_2, BlockID, RowNumberStart, and RowNumberStop are all defined as int. Not sure that makes a difference, but I just wanted to include that information, too.
The query has now been hung up for eight hours. Similar queries on this type and volume of data are not taking anywhere near this long. So I'm wondering if it could be the 'between' statement that's hanging up this query.
Definitely would welcome any suggestions on how to make this more efficient.
BETWEEN is simply shorthand for:
select A.RowNumber, B.BlockID
into TEMP_TABLE
from TABLE_1 A left join TABLE_2 B
on A.RowNumber >= B.RowNumberStart AND A.RowNumber <= B.RowNumberStop
If the execution plan goes from B to A (though the LEFT JOIN really indicates it has to go from A to B), then I'm assuming TABLE_1 is indexed on RowNumber (and that index should be covering for this query). If it only has a clustered index on RowNumber and the table is very wide, I recommend a non-clustered index on RowNumber alone, since you'll fit a lot more rows per page that way.
Otherwise, you want an index on TABLE_2 on RowNumberStart DESC or RowNumberStop ASC, because for a given A you'd need a descending order on RowNumberStart to find the matching range.
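A minimal sketch of the indexes described above (the index names and the included columns are illustrative, not from the question):
CREATE NONCLUSTERED INDEX IX_Table1_RowNumber
    ON TABLE_1 (RowNumber)               -- narrow index, covering for this query

CREATE NONCLUSTERED INDEX IX_Table2_RowNumberStart
    ON TABLE_2 (RowNumberStart DESC)
    INCLUDE (RowNumberStop, BlockID)     -- avoid lookups when probing the range table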
I think you might want to change your join to an INNER JOIN, given the way your join criteria are set up. (Are you ever going to have a TABLE_1 row that falls into no block?)
If you look at your execution plan, you should get more clues as to why the performance might be bad, but the Stop criterion is probably not used for the seek into TABLE_1.
Unfortunately SQLMenace's answer about the SELECT INTO has been deleted. My comment regarding it was: @Martin, SELECT INTO performance isn't as bad as it once was, but I still recommend CREATE TABLE for most production use, because SELECT INTO infers types and nullability. That's fine if you verify it is doing what you think it is doing, but inferring a very long varchar or a decimal column with strange precision can result not only in odd tables but in performance issues (especially with some of those big varchars when you forget a LEFT or whatever). I think it just helps to make explicit what you expect the table to look like. Often I will SELECT INTO using WHERE 0 = 1, inspect the schema, and then script it with my tweaks (like adding an IDENTITY column or a column with a timestamp default).
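Here is a small sketch of the WHERE 0 = 1 pattern mentioned above, applied to this query, just to capture the inferred schema without copying any rows:
SELECT A.RowNumber, B.BlockID
INTO TEMP_TABLE
FROM TABLE_1 A LEFT JOIN TABLE_2 B
  ON A.RowNumber BETWEEN B.RowNumberStart AND B.RowNumberStop
WHERE 0 = 1   -- creates an empty table; inspect or script it, tweak the types, then load the data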
You have one main problem: you want to handle too much data volume at once. Are you really sure you want to handle the result of ALL 122M rows from TABLE_1 at once? Do you really need that?
Is there a way in MS Access to return a dataset between specific row positions?
So let's say my dataset is:
rank  first_name  age
1     Max         23
2     Bob         40
3     Sid         25
4     Billy       18
5     Sally       19
But I only want to return the records between 'rank' 2 and 4, so my result set is Bob, Sid and Billy. However, rank is not part of the table; it should be generated when the query is run. Why don't I use an autonumber field? Because if a record is deleted the numbering becomes inconsistent, and what if I wanted the results in reverse order?
This is obviously very simple, and the reason I ask is that I am working on a product catalogue and am looking for a more efficient way of paging through the returned dataset. If I only return one page's worth of data from the database, that is obviously going to be quicker than returning a complete set of 3000 records and then sub-selecting from that set.
Thanks R.
Original suggestion:
SELECT * from table where rank BETWEEN 2 and 4;
Modified after the comment that rank does not exist in the table structure:
Select top 100 * from table;
And if you want to retrieve subsequent results, you can take the ID of the last record from the first query, say it was ID 100, and use a WHERE clause to get the next 100:
Select top 100 * from table where ID > 100;
But these won't give you what you're looking for either, I bet.
How are you calculating rank? I assume you are basing it on some data in another dataset somewhere. If so, create a function, do a table join, or do something that can calculate rank based on values in other table(s), then you can do queries based on the rank() function.
For example:
select *
from table
where rank() between 2 and 4
If you are not calculating rank based on some data somewhere, there really isn't a way to write this query, and you might as well be returning three random rows from the table.
I think you need to use a correlated subquery to calculate the rank on the fly, e.g. I'm guessing the rank is based on first_name:
SELECT T1.first_name, T1.age,
(
SELECT COUNT(*) + 1
FROM MyTable AS T2
WHERE T1.first_name > T2.first_name
) AS rank
FROM MyTable AS T1;
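To answer the original question with that technique, the same correlated subquery can go into the WHERE clause. A sketch, still assuming the rank is based on first_name and the table is named MyTable:
SELECT T1.first_name, T1.age
FROM MyTable AS T1
WHERE (
        SELECT COUNT(*) + 1
        FROM MyTable AS T2
        WHERE T1.first_name > T2.first_name
      ) BETWEEN 2 AND 4;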
The bad news is that the Access database engine is poorly optimized for this kind of query; in my experience, performance will start to noticeably degrade beyond a few hundred rows.
If it is not possible to maintain the rank on the db side of the house (e.g. high insertion environment) consider doing the paging on the client side. For example, an ADO classic recordset object has properties to support paging (PageCount, PageSize, AbsolutePage, etc), something for which DAO recordsets (being of an older vintage) have no support.
As always, you'll have to perform your own timings, but I suspect that when there are, say, 10K rows you will find it faster to take on the overhead of fetching all the rows into an ADO recordset and then locating the page (perhaps then fabricating a smaller ADO recordset consisting of just that page's worth of rows) than to perform a correlated subquery that fetches only the rows for the page.
Unfortunately the LIMIT keyword isn't available in MS Access -- that's what is used in MySQL for multi-page presentation. If you can write an order key into the results table, then you can use it in something like this:
SELECT TOP 25 MyOrder, Etc FROM Table1 WHERE MyOrder in
(SELECT TOP 55 MyOrder FROM Table1 ORDER BY MyOrder DESC)
ORDER BY MyOrder ASC
If I understand you correctly, there are only first_name and age columns in your table. If this is the case, then there is no way to return Bob, Sid, and Billy with a single query. Unless you do something like
SELECT * FROM Table
WHERE FirstName = 'Bob'
OR FirstName = 'Sid'
OR FirstName = 'Billy'
But I think that this is not what you are looking for.
This is because SQL databases make no guarantee as to the order that the data will come out of the database unless you specify an ORDER BY clause. It will usually come out in the same order it was added, but there are no guarantees, and once you get a lot of rows in your table, there's a reasonably high probability that they won't come out in the order you put them in.
As a side note, you should probably add a "rank" column (this column is usually called id) to your table and make it an auto-incrementing integer (see the Access documentation), so that you can do the query mentioned by Sev. It's also important to have a primary key so that you can be certain which rows are being updated when you run an update query, or which rows are being deleted when you run a delete query. For example, if you had two people named Max, and they were both 23, how would you delete one row without deleting the other? If you had an auto-incrementing unique column, you could specify the unique ID in your query to delete only one.
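For what it's worth, here is a sketch of adding such a column in Access DDL; COUNTER is Access's auto-incrementing type, the table and column names are illustrative, and you would typically also make the new column the primary key:
ALTER TABLE MyTable ADD COLUMN id COUNTER;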
[ADDITION]
Upon reading your comment: if you add an autoincrement field, want to read 3 rows, and know the ID of the first row you want to read, then you can use TOP to read 3 rows.
Assuming your data looks like this
ID  first_name  age
1   Max         23
2   Bob         40
6   Sid         25
8   Billy       18
15  Sally       19
You can query Bob, Sid and Billy with the following query:
SELECT TOP 3 FirstName, Age
From Table
WHERE ID >= 2
ORDER BY ID