Do the statistics (which help decide whether an index is to be used) take into account the number of rows per actual column value, or do they just use the average number of rows per value?
Suppose I have a table with a bit column called active which has a million rows, but with 99.99% set to false. If I have an index on this column, is SQL Server smart enough to use the index when searching for active = 1, but to know that there is no point when searching for active = 0?
Another example: suppose I have a table with, say, 1,000,000 records and an indexed column which contains about 50,000 different values, averaging 10 rows per value, but with one special value that has 500,000 rows. The index may not be useful when searching for this special value, but would be very useful when looking for any of the other values.
But does this special case ruin the effectiveness of the index?
You can see for yourself:
CREATE TABLE IndexTest (
Id int not null primary key identity(1,1),
Active bit not null default(0),
IndexedValue nvarchar(10) not null
)
CREATE INDEX IndexTestActive ON IndexTest (Active)
CREATE INDEX IndexTestIndexedValue ON IndexTest (IndexedValue)
DECLARE @values table
(
Id int primary key IDENTITY(1, 1),
Value nvarchar(10)
)
INSERT INTO @values(Value) VALUES ('1')
INSERT INTO @values(Value) VALUES ('2')
INSERT INTO @values(Value) VALUES ('3')
INSERT INTO @values(Value) VALUES ('4')
INSERT INTO @values(Value) VALUES ('5')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')
INSERT INTO @values(Value) VALUES ('Many')
DECLARE @rowCount int
SET @rowCount = 100000
WHILE(@rowCount > 0)
BEGIN
DECLARE @valueIndex int
SET @valueIndex = CAST(RAND() * 10 + 1 as int)
DECLARE @selectedValue nvarchar(10)
SELECT @selectedValue = Value FROM @values WHERE Id = @valueIndex
DECLARE @isActive bit
SELECT @isActive = CASE
WHEN RAND() < 0.001 THEN 1
ELSE 0
END
INSERT INTO IndexTest(Active, IndexedValue) VALUES (@isActive, @selectedValue)
SET @rowCount = @rowCount - 1
END
SELECT count(*) FROM IndexTest WHERE Active = 1
SELECT count(*) FROM IndexTest WHERE Active = 0
SELECT count(*) FROM IndexTest WHERE IndexedValue = '1'
SELECT count(*) FROM IndexTest WHERE IndexedValue = 'Many'
It looks to me like it always uses the indexes in the resulting query plan.
It creates a histogram and will thus use that.
With a bit column it will have a good idea of how many rows are 0 and how many are 1.
With a string column, it will have a rough idea of "bands" (values starting with a, b, c, etc.). The same goes for numbers (it creates a number of bands of value ranges).
Just look at how the statistics appear in Management Studio - you can actually access the histograms.
You can simply look at the statistics and see for yourself :) Use DBCC SHOW_STATISTICS. See the Remarks section; it has a nice explanation of how the histograms are actually stored and used:
To create the histogram, the query optimizer sorts the column values, computes the number of values that match each distinct column value and then aggregates the column values into a maximum of 200 contiguous histogram steps. Each step includes a range of column values followed by an upper bound column value. The range includes all possible column values between boundary values, excluding the boundary values themselves. The lowest of the sorted column values is the upper boundary value for the first histogram step.
For each histogram step:
Bold line represents the upper boundary value (RANGE_HI_KEY) and the number of times it occurs (EQ_ROWS).
Solid area left of RANGE_HI_KEY represents the range of column values and the average number of times each column value occurs (AVG_RANGE_ROWS). The AVG_RANGE_ROWS for the first histogram step is always 0.
Dotted lines represent the sampled values used to estimate the total number of distinct values in the range (DISTINCT_RANGE_ROWS) and the total number of values in the range (RANGE_ROWS). The query optimizer uses RANGE_ROWS and DISTINCT_RANGE_ROWS to compute AVG_RANGE_ROWS and does not store the sampled values.
The query optimizer defines the histogram steps according to their statistical significance. It uses a maximum difference algorithm to minimize the number of steps in the histogram while maximizing the difference between the boundary values. The maximum number of steps is 200. The number of histogram steps can be fewer than the number of distinct values, even for columns with fewer than 200 boundary points. For example, a column with 100 distinct values can have a histogram with fewer than 100 boundary points.
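For example, to inspect the histogram built for the IndexTest table from the question, you could run something like this (a minimal sketch; the UPDATE STATISTICS call is only there in case the statistics have not been refreshed since the test rows were inserted):
-- Refresh the statistics so the histogram reflects the rows inserted above
UPDATE STATISTICS IndexTest IndexTestIndexedValue
-- Show the header, density vector and histogram for the index statistics;
-- the histogram steps show how many rows match each sampled IndexedValue
DBCC SHOW_STATISTICS ('IndexTest', 'IndexTestIndexedValue')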
I want to update the column ItemValue of the table Items with a decimal value generated randomly between 1 and 100 (a different value for each row). Each value should have two (random) decimal places.
CREATE TABLE Items
(
ItemID int IDENTITY(1,1) NOT NULL,
ItemValue decimal(13, 4) NULL,
CONSTRAINT PK_Items PRIMARY KEY CLUSTERED (ItemID ASC)
)
INSERT INTO Items(ItemValue) VALUES (0)
INSERT INTO Items(ItemValue) VALUES (0)
INSERT INTO Items(ItemValue) VALUES (0)
INSERT INTO Items(ItemValue) VALUES (0)
-- Now, I want to update the table
You can use RAND to generate a random number. But there is one problem - RAND is evaluated only once per query, so all your rows would contain the same random value. You can use CHECKSUM(NEWID()) as the seed to make it random per row, like this:
UPDATE items
SET itemValue = ROUND(RAND(CHECKSUM(NEWID())) * (100), 2)
You could use this snippet to generate random decimal values:
CONVERT(DECIMAL(13, 4), 10 + (30 - 10) * RAND(CHECKSUM(NEWID())))
This will generate random decimal numbers between 10 and 30.
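Applied to the original question (values between 1 and 100, rounded to two decimal places), a sketch of the same pattern would be:
-- CHECKSUM(NEWID()) gives a different seed per row, so every row gets
-- its own random value between 1 and 100 with two decimal places
UPDATE Items
SET ItemValue = ROUND(1 + (100 - 1) * RAND(CHECKSUM(NEWID())), 2)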
I have a table with about 70,000,000 rows of phone numbers. I use OFFSET to read those numbers 50 at a time.
But it takes a long time (about 1 min).
There is a full-text index, but it is used for searching and does not help with the OFFSET.
How can I speed up my query?
SELECT *
FROM tblPhoneNumber
WHERE CountryID = @CountryID
ORDER BY ID
OFFSET ((@NumberCount - 1) * @PackageSize) ROWS
FETCH NEXT @PackageSize ROWS ONLY
Throw a sequence on that table, index it and fetch ranges by sequence. You could alternatively just use the ID column.
SELECT *
FROM tblPhoneNumber
WHERE
CountryID = @CountryID
AND Sequence BETWEEN @NumberCount AND (@NumberCount + @PackageSize)
If you're inserting/deleting frequently, this can leave gaps, so depending on the code that consumes these batches of numbers that might be an issue, but in general a few gaps here and there may not be a problem for you.
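If the ID column is reasonably dense, a sketch of the same idea keyed directly on the ID column (assuming a hypothetical @LastID variable that tracks the highest ID read by the previous batch) might look like this:
-- Keyset-style paging: seek past the last ID already processed instead of
-- counting and skipping rows with OFFSET
-- @LastID is assumed to hold the highest ID from the previous batch
SELECT TOP (@PackageSize) *
FROM tblPhoneNumber
WHERE CountryID = @CountryID
  AND ID > @LastID
ORDER BY ID
For this to seek rather than scan, an index on (CountryID, ID) would help.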
Try using CROSS APPLY instead of OFFSET FETCH and do it all in one go. I grab TOP 2 to show you that you can grab any number of rows.
IF OBJECT_ID('tempdb..#tblPhoneNumber') IS NOT NULL
DROP TABLE #tblPhoneNumber;
IF OBJECT_ID('tempdb..#Country') IS NOT NULL
DROP TABLE #Country;
CREATE TABLE #tblPhoneNumber (ID INT, Country VARCHAR(100), PhoneNumber INT);
CREATE TABLE #Country (Country VARCHAR(100));
INSERT INTO #Country
VALUES ('USA'),('UK');
INSERT INTO #tblPhoneNumber
VALUES (1,'USA',11111),
(2,'USA',22222),
(3,'USA',33333),
(4,'UK',44444),
(5,'UK',55555),
(6,'UK',66666);
SELECT *
FROM #Country
CROSS APPLY(
SELECT TOP (2) ID,Country,PhoneNumber --Just change to TOP(50) for your code
FROM #tblPhoneNumber
WHERE #Country.Country = #tblPhoneNumber.Country
) CA
I have one table in my database which should contain sequence numbers.
create table SequenceNumber(
number int identity(1,1) primary key
)
Now I want to store the numbers from 1 to 1448 without setting IDENTITY_INSERT ON/OFF and without a counter variable.
I need the values from 1 to 1448 in the 'number' column.
Can anyone tell me how I can do it?
Yes, you can do it as follows.
Just change the value 1448 as per your need.
Idea from here: http://www.codeproject.com/Tips/780441/Tricky-SQL-Questions
CREATE TABLE SequenceNumber(
NUMBER BIGINT IDENTITY(1,1) PRIMARY KEY
)
WHILE(1=1)
BEGIN
INSERT INTO SequenceNumber
DEFAULT VALUES
IF EXISTS(SELECT 1 FROM SequenceNumber WHERE NUMBER = 1448)
BREAK
END
SELECT NUMBER FROM SequenceNumber
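If you are running the script from SSMS or sqlcmd, a shorter alternative to the loop is to let the client repeat the batch for you (a sketch; GO with a count is a feature of those client tools, not of T-SQL itself):
-- Repeats the single-row insert 1448 times; no IDENTITY_INSERT and no counter variable
INSERT INTO SequenceNumber DEFAULT VALUES
GO 1448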
We have found that SQL Server uses an index scan instead of an index seek if the where clause contains parameterized values instead of string literals.
Following is an example:
SQL Server performs an index scan in the following case (parameters in the where clause):
declare @val1 nvarchar(40), @val2 nvarchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select
min(id)
from
scor_inv_binaries
where
col1 in (@val1, @val2)
group by
col1
On the other hand, the following query performs an index seek:
select
min(id)
from
scor_inv_binaries
where
col1 in ('val1', 'val2')
group by
col1
Has anyone observed similar behavior, and how did you fix it to ensure that the query performs an index seek instead of an index scan?
We are not able to use the FORCESEEK table hint, because FORCESEEK is not supported on SQL Server 2005.
I have updated the statistics as well.
Thank you very much for your help.
Well, to answer your question of why SQL Server is doing this: the query is not compiled in a logical order, each statement is compiled on its own merit,
so when the query plan for your select statement is being generated, the optimiser does not know that @val1 and @val2 will become 'val1' and 'val2' respectively.
When SQL Server does not know the value, it has to make a best guess about how many times that variable will appear in the table, which can sometimes lead to sub-optimal plans. My main point is that the same query with different values can generate different plans. Imagine this simple example:
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP 991 1
FROM sys.all_objects a
UNION ALL
SELECT TOP 9 ROW_NUMBER() OVER(ORDER BY a.object_id) + 1
FROM sys.all_objects a;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
All I have done here is create a simple table and add 1000 rows with values 1-10 for the column Val; however, 1 appears 991 times and the other 9 values each appear only once. The premise is that this query:
SELECT COUNT(Filler)
FROM #T
WHERE Val = 1;
would be more efficient as a scan of the entire table than as an index seek followed by 991 bookmark lookups to get the value for Filler, whereas with only 1 matching row the following query:
SELECT COUNT(Filler)
FROM #T
WHERE Val = 2;
is more efficient as an index seek and a single bookmark lookup to get the value for Filler (running these two queries will confirm this).
I am pretty certain the cut off for a seek and bookmark lookup actually varies depending on the situation, but it is fairly low. Using the example table, with a bit of trial and error, I found that I needed the Val column to have 38 rows with the value 2 before the optimiser went for a full table scan over an index seek and bookmark lookup:
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
DECLARE @i INT = 38;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP (991 - @i) 1
FROM sys.all_objects a
UNION ALL
SELECT TOP (@i) 2
FROM sys.all_objects a
UNION ALL
SELECT TOP 8 ROW_NUMBER() OVER(ORDER BY a.object_id) + 2
FROM sys.all_objects a;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
SELECT COUNT(Filler), COUNT(*)
FROM #T
WHERE Val = 2;
So for this example the limit is 3.7% of matching rows.
Since the query does not know how many rows will match when you are using a variable, it has to guess. The simplest way is to take the total number of rows and divide it by the total number of distinct values in the column, so in this example the estimated number of rows for WHERE Val = @val is 1000 / 10 = 100. The actual algorithm is more complex than this, but for example's sake this will do. So when we look at the execution plan for:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
We can see here (with the original data) that the estimated number of rows is 100, but the actual number of rows is 1. From the previous steps we know that with 38 or more matching rows the optimiser will opt for a clustered index scan over an index seek, so since the best guess for the number of rows is higher than this, the plan for an unknown variable is a clustered index scan.
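You can check this density-based guess directly against the statistics (a sketch, assuming the statistics created for the IX_T__Val index):
-- "All density" is 1 / (number of distinct values) = 0.1 here;
-- multiplied by the 1000-row table it gives the 100-row estimate above
DBCC SHOW_STATISTICS ('tempdb..#T', 'IX_T__Val') WITH DENSITY_VECTOR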
Just to further prove the theory, if we create the table with 1000 rows of numbers 1-27 evenly distributed (so the estimated row count will be approximately 1000 / 27 = 37.037)
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP 27 ROW_NUMBER() OVER(ORDER BY a.object_id)
FROM sys.all_objects a;
INSERT #T (val)
SELECT TOP 973 t1.Val
FROM #T AS t1
CROSS JOIN #T AS t2
CROSS JOIN #T AS t3
ORDER BY t2.Val, t3.Val;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
Then, running the query again, we get a plan with an index seek:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
So hopefully that pretty comprehensively covers why you get that plan. Now I suppose the next question is how you force a different plan, and the answer is to use the query hint OPTION (RECOMPILE), which forces the query to compile at execution time, when the value of the parameter is known. Reverting to the original data, where the best plan for Val = 2 is a seek and lookup but using a variable yields a plan with an index scan, we can run:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
GO
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i
OPTION (RECOMPILE);
We can see that the latter uses the index seek and key lookup because it has checked the value of the variable at execution time, and the most appropriate plan for that specific value is chosen. The trouble with OPTION (RECOMPILE) is that it means you can't take advantage of cached query plans, so there is the additional cost of compiling the query each time.
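If recompiling on every execution is too expensive, another option worth knowing about (an aside, not part of the answer above) is the OPTIMIZE FOR hint, which compiles the cached plan for a value you specify:
-- Build the cached plan as if @i were 2; only sensible when 2 is
-- representative of the values the query will normally run with
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i
OPTION (OPTIMIZE FOR (@i = 2));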
I had this exact problem and none of the query option solutions seemed to have any effect.
It turned out I was declaring the parameter as nvarchar(8) while the table had a column of varchar(8).
Upon changing the parameter type, the query did an index seek and ran instantaneously. The optimizer must have been getting tripped up by the implicit conversion.
This may not be the answer in this case, but something that's worth checking.
Try
declare @val1 nvarchar(40), @val2 nvarchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select
min(id)
from
scor_inv_binaries
where
col1 in (@val1, @val2)
group by
col1
OPTION (RECOMPILE)
What datatype is col1?
Your variables are nvarchar whereas your literals are varchar/char; if col1 is varchar/char it may be doing the index scan to implicitly cast each value in col1 to nvarchar for the comparison.
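If that is the case, a minimal sketch of the fix (assuming col1 is varchar(40)) is simply to declare the variables with the same type as the column:
-- Matching the variable type to the column type avoids the implicit
-- conversion of every col1 value to nvarchar, so the index can be sought
declare @val1 varchar(40), @val2 varchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select min(id)
from scor_inv_binaries
where col1 in (@val1, @val2)
group by col1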
I guess the first query is using a Predicate and the second query is using a Seek Predicate.
A Seek Predicate is the operation that describes the b-tree portion of the seek. A Predicate is the operation that describes the additional filter using non-key columns. Based on the description, it is very clear that a Seek Predicate is better than a Predicate, as it searches indexes, whereas with a Predicate the search is on non-key columns, which implies that the search is on the data in the pages themselves.
For more details please visit:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/36a176c8-005e-4a7d-afc2-68071f33987a/predicate-and-seek-predicate
I have tables of data samples, with a timestamp and some data. Each table has a clustered index on the timestamp, and then a data-specific key. Data samples are not necessarily equidistant.
I need to downsample the data in a particular time range in order to draw graphs - say, going from 100,000 rows to N, where N is about 50. While I may have to compromise on the "correctness" of the algorithm from a DSP point of view, I'd like to keep this in SQL for performance reasons.
My current idea is to group samples in the time range into N boxes, and then take the average of each group. One way to achieve this in SQL is to apply a partition function to the date that ranges from 0 to N-1 (inclusive) and then GROUP BY and AVG.
I think that this GROUP BY can be performed without a sort, because the date is from a clustered index, and the partition function is monotone. However, SQL Server doesn't seem to notice this, and it issues a sort that represents 78% of the execution cost (in the example below). Assuming I'm right, and this sort is unnecessary, I could make the query 5 times faster.
Is there any way to force SQL Server to skip the sort? Or is there a better way to approach the problem?
IF EXISTS(SELECT name FROM sysobjects WHERE name = N'test') DROP TABLE test
CREATE TABLE test
(
date DATETIME NOT NULL,
v FLOAT NOT NULL,
CONSTRAINT PK_test PRIMARY KEY CLUSTERED (date ASC, v ASC)
)
INSERT INTO test (date, v) VALUES ('2009-08-22 14:06:00.000', 1)
INSERT INTO test (date, v) VALUES ('2009-08-22 17:09:00.000', 8)
INSERT INTO test (date, v) VALUES ('2009-08-24 00:00:00.000', 2)
INSERT INTO test (date, v) VALUES ('2009-08-24 03:00:00.000', 9)
INSERT INTO test (date, v) VALUES ('2009-08-24 14:06:00.000', 7)
-- the lower bound is set to the table min for demo purposes; in reality
-- it could be any date
declare @min float
set @min = cast((select min(date) from test) as float)
-- similarly for max
declare @max float
set @max = cast((select max(date) from test) as float)
-- the number of results to return (assuming enough data is available)
declare @count int
set @count = 3
-- precompute scale factor
declare @scale float
set @scale = (@count - 1) / (@max - @min)
select @scale
-- this scales the dates from 0 to n-1
select (cast(date as float) - @min) * @scale, v from test
-- this rounds the scaled dates to the nearest partition,
-- groups by the partition, and then averages values in each partition
select round((cast(date as float) - @min) * @scale, 0), avg(v) from test
group by round((cast(date as float) - @min) * @scale, 0)
There is really no way SQL Server would know that the date clustered key can be used to guarantee the order of an expression like round(cast(... as float), 0). The cast alone would throw it off track. Add in the (... - @min) * @scale and you have a perfect mess. If you need to sort and group by such expressions, store them in persisted computed columns and index them. You probably want to use DATEPART though, as going through an imprecise type like float is likely to render the expression unusable for a persisted computed column.
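A rough sketch of that idea, assuming hourly buckets are acceptable (the column name, the anchor date and the bucket size are all assumptions, not part of the original answer):
-- Deterministic, integer-valued expression, so it can be persisted and indexed;
-- CONVERT with style 112 keeps the anchor-date conversion deterministic
ALTER TABLE test ADD hour_bucket AS
    DATEDIFF(hour, CONVERT(datetime, '20000101', 112), date) PERSISTED
CREATE INDEX IX_test_hour_bucket ON test (hour_bucket) INCLUDE (v)
-- Group on the indexed bucket instead of the float expression
SELECT hour_bucket, AVG(v)
FROM test
GROUP BY hour_bucket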
Update
On the topic of date and float being equivalent:
declare @f float, @d datetime;
select @d = cast(1 as datetime);
select @f = cast(1 as float);
select cast(@d as varbinary(8)), cast(@f as varbinary(8)), @d, cast(@d as float)
Produces this:
0x0000000100000000 0x3FF0000000000000 1900-01-02 00:00:00.000 1
So you can see that although they are both stored in 8 bytes (at least for float(25...53)), the internal representation of datetime is not a float with the integer part being the day and the fractional part being the time (as is often assumed).
To give another example:
declare @d datetime;
select @d = '1900-01-02 12:00 PM';
select cast(@d as varbinary(8)), cast(@d as float)
0x0000000100C5C100 1.5
Again the result of casting @d to float is 1.5, but the datetime internal representation of 0x0000000100C5C100 would be the IEEE double value 2.1284E-314, not 1.5.
Yes, SQL Server has always had some problems with this kind of time-partitioning summary SELECT. Analysis Services has a variety of ways to handle it, but the Data Services side is more limited.
What I would suggest you try (I cannot try or test anything from here) is to make a secondary "partition table" that contains your partition definitions and then join against it. You will need some matching indexes for this to have a chance of working.
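A rough sketch of that idea (the partition table, its column names, and the join condition are all assumptions here; the N bucket ranges would have to be populated to cover the requested time window):
-- Hypothetical partition table: one row per output bucket with its time range
CREATE TABLE test_partitions (
    bucket int NOT NULL PRIMARY KEY,
    range_start datetime NOT NULL,
    range_end datetime NOT NULL
)
CREATE INDEX IX_test_partitions_range ON test_partitions (range_start, range_end)
-- Join each sample to its bucket and average per bucket
SELECT p.bucket, AVG(t.v) AS avg_v
FROM test_partitions p
JOIN test t
    ON t.date >= p.range_start
   AND t.date < p.range_end
GROUP BY p.bucket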
Two questions:
How long does this query take?
And are you sure that it is sorting the date? Also, where in the plan is it sorting the date? After it partitions? That would be my guess. I doubt it is the first thing it does... Maybe the way it partitions or groups means it needs to do a sort again.
Anyway, even if it did sort an already sorted list, I would not think it would take very long, because it is already sorted...