Avoiding unnecessary sort in SQL Server GROUP BY? - sql-server

I have tables of data samples, with a timestamp and some data. Each table has a clustered index on the timestamp, and then a data-specific key. Data samples are not necessarily equidistant.
I need to downsample the data in a particular time range in order to draw graphs - say, going from 100,000 rows to N, where N is about 50. While I may have to compromise on the "correctness" of the algorithm from a DSP point of view, I'd like to keep this in SQL for performance reasons.
My current idea is to group samples in the time range into N boxes, and then take the average of each group. One way to achieve this in SQL is to apply a partition function to the date that ranges from 0 to N-1 (inclusive) and then GROUP BY and AVG.
I think that this GROUP BY can be performed without a sort, because the date is from a clustered index, and the partition function is monotone. However, SQL Server doesn't seem to notice this, and it issues a sort that represents 78% of the execution cost (in the example below). Assuming I'm right, and this sort is unnecessary, I could make the query 5 times faster.
Is there any way to force SQL Server to skip the sort? Or is there a better way to approach the problem?
Cheers.
Ben
IF EXISTS(SELECT name FROM sysobjects WHERE name = N'test') DROP TABLE test
CREATE TABLE test
(
date DATETIME NOT NULL,
v FLOAT NOT NULL,
CONSTRAINT PK_test PRIMARY KEY CLUSTERED (date ASC, v ASC)
)
INSERT INTO test (date, v) VALUES ('2009-08-22 14:06:00.000', 1)
INSERT INTO test (date, v) VALUES ('2009-08-22 17:09:00.000', 8)
INSERT INTO test (date, v) VALUES ('2009-08-24 00:00:00.000', 2)
INSERT INTO test (date, v) VALUES ('2009-08-24 03:00:00.000', 9)
INSERT INTO test (date, v) VALUES ('2009-08-24 14:06:00.000', 7)
-- the lower bound is set to the table min for demo purposes; in reality
-- it could be any date
declare @min float
set @min = cast((select min(date) from test) as float)
-- similarly for max
declare @max float
set @max = cast((select max(date) from test) as float)
-- the number of results to return (assuming enough data is available)
declare @count int
set @count = 3
-- precompute scale factor
declare @scale float
set @scale = (@count - 1) / (@max - @min)
select @scale
-- this scales the dates from 0 to n-1
select (cast(date as float) - @min) * @scale, v from test
-- this rounds the scaled dates to the nearest partition,
-- groups by the partition, and then averages values in each partition
select round((cast(date as float) - @min) * @scale, 0), avg(v) from test
group by round((cast(date as float) - @min) * @scale, 0)

There is really no way SQL Server could know that the clustered key on date still guarantees the order of an expression like round(cast(date as float) ...). That alone would throw it off track; add in the (... - @min) * @scale and you have a perfect mess. If you need to sort and group by such expressions, store them in persisted computed columns and index them. You probably want to use DATEPART though, as going through an imprecise type like float is likely to render the expression unusable for a persisted computed column.
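A minimal sketch of that suggestion, assuming a fixed bucket width (hourly here) is acceptable; the column and index names are illustrative, and the run-time @min/@scale arithmetic from the question cannot be baked into a computed column, so the granularity has to be chosen up front:
-- integer date arithmetic stays deterministic, so it can be persisted and indexed
ALTER TABLE test ADD bucket AS DATEDIFF(hour, 0, date) PERSISTED;
CREATE NONCLUSTERED INDEX IX_test_bucket ON test (bucket) INCLUDE (v);
-- the GROUP BY can then aggregate on the indexed integer column directly
SELECT bucket, AVG(v) FROM test GROUP BY bucket;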
Update
On the topic of date and float being equivalent:
declare @f float, @d datetime;
select @d = cast(1 as datetime);
select @f = cast(1 as float);
select cast(@d as varbinary(8)), cast(@f as varbinary(8)), @d, cast(@d as float)
Produces this:
0x0000000100000000 0x3FF0000000000000 1900-01-02 00:00:00.000 1
So you can see that although they are both stored in 8 bytes (at least for float(25..53)), the internal representation of datetime is not a float with the integer part being the day and the fractional part being the time (as is often assumed).
To give another example:
declare @d datetime;
select @d = '1900-01-02 12:00 PM';
select cast(@d as varbinary(8)), cast(@d as float)
0x0000000100C5C100 1.5
Again the result of casting @d to float is 1.5, but the datetime internal representation of 0x0000000100C5C100 would be the IEEE double value 2.1284E-314, not 1.5.

Yes, SQL Server has always had some problems with this kind of time-partitioning summary SELECT. Analysis Services has a variety of ways to handle it, but the Data Services side is more limited.
What I would suggest you try (I cannot try or test anything from here) is to make a secondary "partition table" that contains your partition definitions and then join against it. You will need some matching indexes for this to have a chance to work:
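A rough sketch of that idea follows; all names are illustrative, the boundary rows would have to be generated for each requested range, and the equal-width buckets below only approximate the rounding scheme in the question:
CREATE TABLE partitions
(
bucket INT NOT NULL PRIMARY KEY,
range_start DATETIME NOT NULL,
range_end DATETIME NOT NULL
);
-- three equal 16-hour buckets covering the demo range (last end nudged past the max)
INSERT INTO partitions VALUES (0, '2009-08-22 14:06', '2009-08-23 06:06');
INSERT INTO partitions VALUES (1, '2009-08-23 06:06', '2009-08-23 22:06');
INSERT INTO partitions VALUES (2, '2009-08-23 22:06', '2009-08-24 14:07');
-- each bucket becomes a range seek against the clustered index on date
SELECT p.bucket, AVG(t.v) AS avg_v
FROM partitions p
JOIN test t ON t.date >= p.range_start AND t.date < p.range_end
GROUP BY p.bucket;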

Two questions:
How long does this query take?
And are you sure that it is sorting by date? Where in the plan is it sorting the date? After it partitions? That would be my guess; I doubt it is the first thing it does... Maybe the way it partitions or groups means it needs to do a sort again.
Anyway, even if it did sort an already sorted list, I would not think that would take very long, because it is already sorted...

Related

SQL Server Sort varchar columns depend on numbers inside like int and decimal

Hello, I am looking for a solution to sort the sizes stored in my column.
Example:
-- CREATE TEMP TABLE
Create Table #MyTempTable (
size varchar(20)
);
-- insert sample data to TEMP TABLE
insert into #MyTempTable
values
('10.5W'),
('10W'),
('11.5W'),
('11W'),
('12W'),
('5.5W'),
('5W'),
('6.5W'),
('6W'),
('7.5W'),
('7W'),
('8.5W'),
('8W'),
('9.5W'),
('9W'),
('4')
select 'BEFORE',* from #MyTempTable
SELECT 'AFTER',size
FROM #MyTempTable
ORDER BY LEN(size)
When I order by LEN the sorting is not right; I get this:
AFTER 5W
AFTER 6W
AFTER 7W
AFTER 8W
AFTER 9W
AFTER 10W
AFTER 11W
AFTER 12W
AFTER 5.5W
AFTER 7.5W
AFTER 6.5W
AFTER 9.5W
AFTER 8.5W
AFTER 10.5W
AFTER 11.5W
All I'm looking for is to sort in the proper order, like this:
5W
5.5W
6W
6.5W
7W
7.5W
8W
8.5W
9W
9.5W
10W
10.5W
11W
11.5W
12W
I searched a lot on Stack Overflow and can't find a solution for this, because the values contain not only integers but also decimals, so I don't know how to do it.
Assuming each value would always end in just one unit, you may sort on the numeric portion cast to a decimal:
SELECT size
FROM #MyTempTable
ORDER BY CAST(
CASE WHEN size LIKE '%[A-Z]'
THEN LEFT(size, LEN(size) - 1)
ELSE size END AS DECIMAL(10, 2)
);
A couple of other options:
-- if you don't know all of the potential non-numeric characters:
SELECT size FROM #MyTempTable
ORDER BY TRY_CONVERT(decimal(5,2),
SUBSTRING(size,1,COALESCE(NULLIF
(PATINDEX('%[^0-9.]%', size),0),255)-1));
-- if there is a finite set (say, W and D):
DECLARE @KnownChars varchar(32) = 'WD';
SELECT size FROM #MyTempTable
ORDER BY TRY_CONVERT(decimal(5,2),
TRANSLATE(size, @KnownChars, REPLICATE(space(1), LEN(@KnownChars))));
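Note that TRANSLATE is only available from SQL Server 2017 onwards; on earlier versions you would need the PATINDEX variant above or nested REPLACE calls.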
In the ORDER BY clause, first remove the W and then cast to the NUMERIC data type. Now it sorts like a number:
SELECT *
FROM #MyTempTable AS mtt
ORDER BY CAST(LEFT(mtt.size, LEN(mtt.size) - 1) AS DECIMAL) ASC;

Why is SQL Server using index scan instead of index seek when WHERE clause contains parameterized values

We have found that SQL Server uses an index scan instead of an index seek if the WHERE clause contains parameterized values instead of string literals.
Following is an example:
SQL Server performs index scan in following case (parameters in where clause)
declare @val1 nvarchar(40), @val2 nvarchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select
min(id)
from
scor_inv_binaries
where
col1 in (@val1, @val2)
group by
col1
On the other hand, the following query performs an index seek:
select
min(id)
from
scor_inv_binaries
where
col1 in ('val1', 'val2')
group by
col1
Has any one observed similar behavior, and how they have fixed this to ensure that query performs index seek instead of index scan?
We are not able to use the FORCESEEK table hint, because it is not supported on SQL Server 2005.
I have updated the statistics as well.
Thank you very much for help.
Well, to answer your question of why SQL Server is doing this: the query is not compiled in a logical order; each statement is compiled on its own merit,
so when the query plan for your select statement is being generated, the optimiser does not know that @val1 and @val2 will become 'val1' and 'val2' respectively.
When SQL Server does not know the value, it has to make a best guess about how many times that value will appear in the table, which can sometimes lead to sub-optimal plans. My main point is that the same query with different values can generate different plans. Imagine this simple example:
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP 991 1
FROM sys.all_objects a
UNION ALL
SELECT TOP 9 ROW_NUMBER() OVER(ORDER BY a.object_id) + 1
FROM sys.all_objects a;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
All I have done here is create a simple table, and add 1000 rows with values 1-10 for the column val, however 1 appears 991 times, and the other 9 only appear once. The premise being this query:
SELECT COUNT(Filler)
FROM #T
WHERE Val = 1;
would be more efficient as a scan of the entire table than as an index seek followed by 991 bookmark lookups to get the value for Filler. However, with only 1 matching row, the following query:
SELECT COUNT(Filler)
FROM #T
WHERE Val = 2;
will be more efficient as an index seek and a single bookmark lookup to get the value for Filler (and running these two queries will confirm this).
I am pretty certain the cut off for a seek and bookmark lookup actually varies depending on the situation, but it is fairly low. Using the example table, with a bit of trial and error, I found that I needed the Val column to have 38 rows with the value 2 before the optimiser went for a full table scan over an index seek and bookmark lookup:
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
DECLARE @i INT = 38;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP (991 - @i) 1
FROM sys.all_objects a
UNION ALL
SELECT TOP (@i) 2
FROM sys.all_objects a
UNION ALL
SELECT TOP 8 ROW_NUMBER() OVER(ORDER BY a.object_id) + 2
FROM sys.all_objects a;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
SELECT COUNT(Filler), COUNT(*)
FROM #T
WHERE Val = 2;
So for this example the limit is 3.7% of matching rows.
Since the query does not know how many rows will match when you are using a variable, it has to guess. The simplest way is to take the total number of rows and divide it by the number of distinct values in the column, so in this example the estimated number of rows for WHERE Val = @Val is 1000 / 10 = 100. The actual algorithm is more complex than this, but for example's sake this will do. So when we look at the execution plan for:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
We can see here (with the original data) that the estimated number of rows is 100, but the actual rows is 1. From the previous steps we know that with more than 38 rows the optimiser will opt for a clustered index scan over an index seek, so since the best guess for the number of rows is higher than this, the plan for an unknown variable is a clustered index scan.
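If you want to see where that 100-row guess comes from, the density vector of the index statistics shows it. A quick sketch, assuming the statistics on IX_T__Val exist and are current:
DBCC SHOW_STATISTICS ('tempdb..#T', IX_T__Val) WITH DENSITY_VECTOR;
-- the "All density" value for Val is 0.1 (10 distinct values),
-- and 0.1 * 1000 rows = 100, matching the estimate in the plan above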
Just to further prove the theory, if we create the table with 1000 rows of numbers 1-27 evenly distributed (so the estimated row count will be approximately 1000 / 27 = 37.037)
IF OBJECT_ID(N'tempdb..#T', 'U') IS NOT NULL
DROP TABLE #T;
CREATE TABLE #T (ID INT IDENTITY PRIMARY KEY, Val INT NOT NULL, Filler CHAR(1000) NULL);
INSERT #T (Val)
SELECT TOP 27 ROW_NUMBER() OVER(ORDER BY a.object_id)
FROM sys.all_objects a;
INSERT #T (val)
SELECT TOP 973 t1.Val
FROM #T AS t1
CROSS JOIN #T AS t2
CROSS JOIN #T AS t3
ORDER BY t2.Val, t3.Val;
CREATE NONCLUSTERED INDEX IX_T__Val ON #T (Val);
Then run the query again, we get a plan with an index seek:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
So hopefully that pretty comprehensively covers why you get that plan. Now I suppose the next question is how do you force a different plan, and the answer is, to use the query hint OPTION (RECOMPILE), to force the query to compile at execution time when the value of the parameter is known. Reverting to the original data, where the best plan for Val = 2 is a lookup, but using a variable yields a plan with an index scan, we can run:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i;
GO
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i
OPTION (RECOMPILE);
We can see that the latter uses the index seek and key lookup because it has checked the value of the variable at execution time, and the most appropriate plan for that specific value is chosen. The trouble with OPTION (RECOMPILE) is that it means you can't take advantage of cached query plans, so there is an additional cost of compiling the query each time.
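If recompiling on every execution is too costly, a possible alternative (not part of the answer above, just a sketch) is the OPTIMIZE FOR hint, which keeps a single cached plan but builds it for a representative value:
DECLARE @i INT = 2;
SELECT COUNT(Filler)
FROM #T
WHERE Val = @i
OPTION (OPTIMIZE FOR (@i = 2));
This only helps when the chosen value has selectivity typical of the values you actually pass in; for a skewed value like Val = 1 in this example it could make things worse.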
I had this exact problem and none of the query option solutions seemed to have any effect.
It turned out I was declaring an nvarchar(8) parameter while the table had a varchar(8) column.
Upon changing the parameter type, the query did an index seek and ran instantaneously. The optimizer must have been getting thrown off by the implicit conversion.
This may not be the answer in your case, but it is something worth checking.
Try
declare @val1 nvarchar(40), @val2 nvarchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select
min(id)
from
scor_inv_binaries
where
col1 in (@val1, @val2)
group by
col1
OPTION (RECOMPILE)
What datatype is col1?
Your variables are nvarchar whereas your literals are varchar/char; if col1 is varchar/char it may be doing the index scan to implicitly cast each value in col1 to nvarchar for the comparison.
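If col1 turns out to be varchar, a sketch of the fix is simply to declare the variables with the column's own type (the length of 40 is carried over from the example above):
declare @val1 varchar(40), @val2 varchar(40);
set @val1 = 'val1';
set @val2 = 'val2';
select
min(id)
from
scor_inv_binaries
where
col1 in (@val1, @val2)
group by
col1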
I guess the first query is using a Predicate and the second query is using a Seek Predicate.
A Seek Predicate is the operation that describes the b-tree portion of the seek. A Predicate is the operation that describes the additional filter using non-key columns. Based on that, a Seek Predicate is preferable to a Predicate, as it searches the index, whereas with a Predicate the filter is applied to non-key columns, which means the data pages themselves have to be read.
For more details, please visit:
https://social.msdn.microsoft.com/Forums/sqlserver/en-US/36a176c8-005e-4a7d-afc2-68071f33987a/predicate-and-seek-predicate

SQL server, using a variable in a cast as numeric(#variable,2)

I have a big table that includes a column of floats (1.5million rows) that I join to a slightly smaller table (15k rows) which also has floats, I then multiply various floats.
I have discovered I get significant performance gains (over 10 times faster) using numerics rather than floats in the big table.
Trouble is I don't know the size of the floats in advance, so I was hoping to calculate the length of the biggest float and then use that information to cast the float to a numeric using a variable in the declaration, i.e. cast(MyFloatColumn as numeric(@varInt,2)).
It seems I'm not allowed to do this (Incorrect syntax error) so is there an alternative?
Below is some code that shows what I am trying to do - final statement is where the error is.
Many thanks for your help,
Simon
CREATE TABLE dbo.MyTable
(
MyFloatColumn float
);
GO
INSERT INTO dbo.MyTable VALUES (12345.12041);
INSERT INTO dbo.MyTable VALUES (123.1);
GO
declare @precisionofbiggest int
SELECT @precisionofbiggest = sizeofintpart + 2
FROM (SELECT TOP (1) Len(Cast(Cast(myfloatcolumn AS BIGINT) AS VARCHAR)) AS
sizeofintpart
FROM dbo.mytable
ORDER BY myfloatcolumn DESC) AS atable
SELECT cast(myfloatcolumn AS numeric(@precisionofbiggest,2)) AS anewnumericcolumn
FROM dbo.mytable
(@precisionofbiggest will be 7 in this example so if it worked I would get
aNewNumericColumn
12345.12
123.10
)
The last statement has to use dynamic SQL to get the variable value into the numeric declaration:
declare @sql nvarchar(max)
SET @sql = 'SELECT cast(myfloatcolumn AS numeric('
+ CONVERT(VARCHAR(20), @precisionofbiggest)
+ ',2)) AS anewnumericcolumn FROM dbo.mytable'
exec sp_executesql @sql

Detecting changes in SQL Server 2000 table data

I have a periodic check of a certain query (which by the way includes multiple tables) to add informational messages to the user if something has changed since the last check (once a day).
I tried to make it work with checksum_agg(binary_checksum(*)), but it does not help, so that question doesn't help much, because I have the following case (oversimplified):
select checksum_agg(binary_checksum(*))
from
(
select 1 as id,
1 as status
union all
select 2 as id,
0 as status
) data
and
select checksum_agg(binary_checksum(*))
from
(
select 1 as id,
0 as status
union all
select 2 as id,
1 as status
) data
Both of the above cases result in the same check-sum, 49, and it is clear that the data has been changed.
This doesn't have to be a simple function or a simple solution, but I need some way to uniquely identify the difference like these in SQL server 2000.
checksum_agg appears to simply add the results of binary_checksum together for all rows. Although each row has changed, the sum of the two checksums has not (i.e. 17+32 = 16+33). This is not really the norm for checking for updates, but the recommendations I can come up with are as follows:
Instead of using checksum_agg, concatenate the checksums into a delimited string, and compare strings, along the lines of SELECT binary_checksum(*) + ',' FROM MyTable FOR XML PATH(''). Much longer string to check and to store, but there will be much less chance of a false positive comparison.
Instead of using the built-in checksum routine, use HASHBYTES to calculate MD5 checksums in 8000 byte blocks, and xor the results together. This will give you a much more resilient checksum, although still not bullet-proof (i.e. it is still possible to get false matches, but very much less likely). I'll paste the HASHBYTES demo code that I wrote below.
The last option, and absolute last resort, is to actually store the table data in XML format and compare that. This is really the only way you can be absolutely certain of no false matches, but it is not scalable and involves storing and comparing large amounts of data.
Every approach, including the one you started with, has pros and cons, with varying degrees of data size and processing requirements against accuracy. Depending on what level of accuracy you require, use the appropriate option. The only way to get 100% accuracy is to store all of the table data.
Alternatively, you can add a date_modified field to each table, which is set to GetDate() using after insert and update triggers. You can do SELECT COUNT(*) FROM #test WHERE date_modified > @date_last_checked. This is a more common way of checking for updates. The downside of this one is that deletions cannot be tracked.
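A sketch of that approach, assuming the tracked table is called MyTable and has a primary key column id (both names are hypothetical):
ALTER TABLE MyTable ADD date_modified DATETIME NOT NULL DEFAULT GETDATE()
GO
CREATE TRIGGER trg_MyTable_date_modified ON MyTable
AFTER INSERT, UPDATE
AS
-- stamp the affected rows; relies on RECURSIVE_TRIGGERS being OFF (the default)
UPDATE t
SET date_modified = GETDATE()
FROM MyTable t
JOIN inserted i ON i.id = t.id
GO
-- the daily check then becomes:
-- SELECT COUNT(*) FROM MyTable WHERE date_modified > @date_last_checked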
Another approach is to create a "modified" table, with table_name (VARCHAR) and is_modified (BIT) fields, containing one row for each table you wish to track. Using insert, update and delete triggers, the flag against the relevant table is set to true. When you run your schedule, you check and reset the is_modified flag (in the same transaction), along the lines of UPDATE tblModified SET @is_modified = is_modified, is_modified = 0.
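And a sketch of that flag-table variant for one tracked table (again, MyTable is hypothetical):
CREATE TABLE tblModified (table_name VARCHAR(128) NOT NULL PRIMARY KEY, is_modified BIT NOT NULL)
INSERT INTO tblModified VALUES ('MyTable', 0)
GO
CREATE TRIGGER trg_MyTable_modified_flag ON MyTable
AFTER INSERT, UPDATE, DELETE
AS
UPDATE tblModified SET is_modified = 1 WHERE table_name = 'MyTable'
GO
-- check and reset atomically in one statement
DECLARE @is_modified BIT
UPDATE tblModified
SET @is_modified = is_modified, is_modified = 0
WHERE table_name = 'MyTable'
SELECT @is_modified AS was_modified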
The following script generates three result sets, each corresponding to the numbered list earlier in this response. I have commented which output corresponds with which option just before each SELECT statement. To see how the output was derived, you can work backwards through the code.
-- Create the test table and populate it
CREATE TABLE #Test (
f1 INT,
f2 INT
)
INSERT INTO #Test VALUES(1, 1)
INSERT INTO #Test VALUES(2, 0)
INSERT INTO #Test VALUES(2, 1)
/*******************
OPTION 1
*******************/
SELECT CAST(binary_checksum(*) AS VARCHAR) + ',' FROM #test FOR XML PATH('')
-- Declaration: Input and output MD5 checksums (@in and @out), input string (@input), and counter (@i)
DECLARE @in VARBINARY(16), @out VARBINARY(16), @input VARCHAR(MAX), @i INT
-- Initialize @input string as the XML dump of the table
-- Use this as your comparison string if you choose to not use the MD5 checksum
SET @input = (SELECT * FROM #Test FOR XML RAW)
/*******************
OPTION 3
*******************/
SELECT @input
-- Initialise counter and output MD5.
SET @i = 1
SET @out = 0x00000000000000000000000000000000
WHILE @i <= LEN(@input)
BEGIN
-- calculate MD5 for this batch
SET @in = HASHBYTES('MD5', SUBSTRING(@input, @i, CASE WHEN LEN(@input) - @i > 8000 THEN 8000 ELSE LEN(@input) - @i END))
-- xor the results with the output
SET @out = CAST(CAST(SUBSTRING(@in, 1, 4) AS INT) ^ CAST(SUBSTRING(@out, 1, 4) AS INT) AS VARBINARY(4)) +
CAST(CAST(SUBSTRING(@in, 5, 4) AS INT) ^ CAST(SUBSTRING(@out, 5, 4) AS INT) AS VARBINARY(4)) +
CAST(CAST(SUBSTRING(@in, 9, 4) AS INT) ^ CAST(SUBSTRING(@out, 9, 4) AS INT) AS VARBINARY(4)) +
CAST(CAST(SUBSTRING(@in, 13, 4) AS INT) ^ CAST(SUBSTRING(@out, 13, 4) AS INT) AS VARBINARY(4))
SET @i = @i + 8000
END
/*******************
OPTION 2
*******************/
SELECT @out

TSQL to identify long Float values

I'm dealing with a legacy system where I need to identify some bad records based on a column with a data type of Float.
Good records have a value of...
1
2
1.01
2.01
Bad records are anything such as..
1.009999999999999
2.003423785643000
3.009999990463260
I've tried a number of select statements where I convert to decimal, cast to a varchar and use the LEN() function, but this doesn't seem to work, as the good records that are 1.01 become 1.0100000000000000.
--Edit
I'm a little closer now as I have discovered I can do (Weight * 100) and all of the good records become whole number values such as 101,201,265,301,500, etc...
and bad ones such as 2.00999999046326 become 200.999999046326
This works on my SQL Server 2005 DB:
select len(cast(cast(1.01 as float) as varchar))
Result:
4
In fact, it even lets me skip the explicit varchar cast if I want to:
select len(cast(1.011 as float))
Result:
5
Update: First of all, I still needed the cast to varchar. Thinking otherwise was wrong. That said, I had this all working using strings and was about to post how. Then I saw your update on multiplying by 100 and realized that was the way to go. So here's my code for testing both ways:
declare @test table (c float)
insert into @test
select * from
( select 14.0100002288818 as c union
select 1.01 union
select 2.00999999046326 union
select 14.01
) t
select c,
case when c = cast(cast(c as varchar) as float) AND LEN(cast(c as varchar))<=5 then 1 else 0 end,
case when c * 100 = floor(c * 100) then 1 else 0 end
from @test
something like this, maybe? (adjust the precision/scale in the where clause, of course)
select val from mytable WHERE CONVERT(decimal(5,2), val) <> val
Have you thought about using CLR integration and using .NET to handle the validation?
See this link: Basics of Using a .NET Assembly in MS SQL - User Functions
Basically, you use .NET methods as a user-defined function to do the validation; .NET is better at working with numbers.
You could do something like this:
SELECT *
FROM YourTable
WHERE CAST(YourFloatField AS DECIMAL(15,2)) <> YourFloatField
I'm assuming that anything "bad" has more than 2 decimal places given.
This really will become a pain in the neck because float is an imprecise datatype and you will get implicit conversions when casting.
It also depends on where you run something like the following:
select convert(float,1.33)
in query analyzer the output is 1.3300000000000001
in SSMS the output is 1.33
when you convert to decimal you need to specify precision and scale,
so if you do
select convert(decimal(10,6),convert(float,1.33))
you get 1.330000 because you specified a scale of 6
you could do something like the following, where after converting to decimal you drop the trailing 0s
select replace(rtrim(replace(convert(varchar(30),
(convert(decimal(10,6),convert(float,1.33)))),'0',' ')),' ','0')
for a value of 3.00999999046326 you need a scale of at least 14
select replace(rtrim(replace(convert(varchar(30),
(convert(decimal(30,14),convert(float,3.00999999046326)))),'0',' ')),' ','0')
Run this:
DECLARE @d FLOAT;
SET @d = 1.23;
SELECT ABS(CAST(@d AS DECIMAL(10,2)) - CAST(@d AS DECIMAL(15,8)));
SET @d = 1.230000098;
SELECT ABS(CAST(@d AS DECIMAL(10,2)) - CAST(@d AS DECIMAL(15,8)));
Use some threshold such as:
ABS(CAST(@d AS DECIMAL(10,2)) - CAST(@d AS DECIMAL(15,8))) < 0.00001
