Why does CASE..WHEN get a table scan? How to work around it - sql-server

When I use CASE .. WHEN .. END I get an index scan, which is less efficient than an index seek.
I have complex business rules that require the CASE; is there any workaround?
Query A:
select * from [dbo].[Mobile]
where ((
CASE
WHEN ([MobileNumber] = (LTRIM(RTRIM('987654321')))) THEN 1
END
) = 1)
This query gets an index scan and 199 logical reads.
Query B:
select * from [dbo].[Mobile]
where ([MobileNumber] = (LTRIM(RTRIM('987654321'))))
This query gets an index seek and 122 logical reads.
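For anyone reproducing the comparison: logical-read counts like these can be surfaced by turning on I/O statistics for the session, e.g.
SET STATISTICS IO ON;
-- run Query A and Query B; the Messages output reports 'logical reads' per table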

For the table
CREATE TABLE #T(X CHAR(1) PRIMARY KEY);
And the query
SELECT *
FROM #T
WHERE CASE WHEN X = 'A' THEN 1 ELSE 0 END = 1;
It is apparent, without much thought, that the only circumstance in which the CASE expression evaluates to 1 is when X = 'A', and that the query has the same semantics as
SELECT *
FROM #T
WHERE X = 'A';
However the first query will get a scan and the second one a seek.
The SQL Server optimiser will try all sorts of relational transformations on queries, but it will not even attempt to rearrange an expression such as CASE WHEN X = 'A' THEN 1 ELSE 0 END = 1 into the form X = 'A' so that it can perform an index seek on it.
It is up to the query writer to write their queries in such a way that they are sargable.
There is no workaround to get an index seek on column MobileNumber with your existing CASE predicate. You just need to express the condition differently (as in your example B).
Potentially you could create a computed column with the CASE expression and index that - you could then see an index seek on the new column. However, this is unlikely to be useful to you, as I assume the mobile number 987654321 is in reality dynamic, not something to be hardcoded into a column used by an index.
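A minimal sketch of that computed-column approach, with illustrative names and the hardcoded number (which is exactly why it is unlikely to help here):
-- Sketch only: table, column, and index names are illustrative.
ALTER TABLE dbo.Mobile
ADD IsTargetNumber AS CASE WHEN MobileNumber = '987654321' THEN 1 ELSE 0 END;
-- The CASE expression is deterministic, so the computed column can be indexed.
CREATE INDEX IX_Mobile_IsTargetNumber ON dbo.Mobile (IsTargetNumber);
-- This predicate can now use an index seek, but only for the hardcoded '987654321'.
SELECT * FROM dbo.Mobile WHERE IsTargetNumber = 1;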

After cleaning up and fixing your code, you have a WHERE clause that is a boolean expression built around a CASE.
As mentioned by @MartinSmith, there is simply no way SQL Server will rearrange this. It does not do the kind of dynamic slicing that would allow it to rearrange the first query into the second version.
select *
from [dbo].[Mobile]
where
CASE
WHEN [MobileNumber] = LTRIM(RTRIM('987654321'))
THEN 1
END
= 1
You may ask: the second version also has an expression in it, so why does it not also get a scan?
select *
from [dbo].[Mobile]
where [MobileNumber] = LTRIM(RTRIM('987654321'))
The reason is that SQL Server can recognize that LTRIM(RTRIM('987654321')) is a deterministic constant expression: it does not change depending on runtime settings, nor on the result of in-row calculations.
Therefore, it can be evaluated once at compile time. Under the hood, the query becomes the following, which can use an index on MobileNumber.
select *
from [dbo].[Mobile]
where [MobileNumber] = '987654321'
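To make the sargability rule concrete: wrapping the column in functions defeats the seek, while wrapping only the constant does not (a sketch, assuming an index on MobileNumber):
-- Scan: the index can't be sought when the column itself is wrapped in functions.
SELECT * FROM dbo.Mobile WHERE LTRIM(RTRIM(MobileNumber)) = '987654321';
-- Seek: the functions apply only to the constant, which is folded at compile time.
SELECT * FROM dbo.Mobile WHERE MobileNumber = LTRIM(RTRIM('987654321'));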

Related

SQL Server CHOOSE() function behaving unexpectedly with RAND() function

I've encountered an interesting SQL Server behaviour while trying to generate random values in T-SQL using the RAND and CHOOSE functions.
My goal was to return one of two given values using RAND() as the RNG. Pretty easy, right?
For those of you who don't know it, the CHOOSE function accepts an index number (int) along with a collection of values and returns the value at the specified index. Pretty straightforward.
At first attempt my SQL looked like this:
select choose(ceiling((rand()*2)) ,'a','b')
To my surprise, this expression returned one of three values: NULL, 'a' or 'b'. Since I didn't expect the NULL value, I started digging. RAND() returns a float in the range from 0 (included) to 1 (excluded). Since I'm multiplying it by 2, it should return values anywhere in the range from 0 (included) to 2 (excluded). Therefore, after applying CEILING, the final value should be one of 0, 1 or 2. After realising that, I extended the value list with 'c' to check whether that would perhaps be returned. I also checked the docs page of CEILING and learnt that:
Return values have the same type as numeric_expression.
I had assumed the CEILING function returned an int; since it actually returns the type of its input (a float here), that would mean the value is implicitly cast to int before being used in CHOOSE, which sure enough is stated on the docs page:
If the provided index value has a numeric data type other than int,
then the value is implicitly converted to an integer.
Just in case I added an explicit cast. My SQL query looks like this now:
select choose(cast(ceiling((rand()*2)) as int) ,'a','b','c')
However, the result set didn't change. To check which values cause the problem, I tried generating the value beforehand and selecting it alongside the CHOOSE result. It looked like this:
declare @int int = cast(ceiling((rand()*2)) as int)
select @int, choose(@int, 'a', 'b', 'c')
Interestingly enough, the result set now changed to (1,a), (2,b), which was my original goal. After delving deeper into the CHOOSE docs page and some testing, I learned that NULL is returned in one of two cases:
The given index is NULL
The given index is out of range
In this case that would mean that the index value, when generated inside the SELECT statement, is either 0 or above 2 (or 3, with the extended list). (I'm assuming that negative numbers are not possible here and that CHOOSE indexes from 1.) As I've stated before, 0 should be one of the possible results of:
ceiling((rand()*2))
but for some reason it's never 0 (at least not when I tried it 1,000,000+ times like this):
set nocount on
declare @test table(ceiling_rand int)
declare @counter int = 0
while @counter<1000000
begin
insert into @test
select ceiling((rand()*2))
set @counter=@counter+1
end
select distinct ceiling_rand from @test
Therefore I assume that the value generated in the SELECT is either greater than 2 (or 3) or NULL. Why would it be like this only when generated inside a SELECT statement? Perhaps the order of resolving CAST, CEILING and RAND inside a SELECT is different than it would seem? It's true I've only tried it a limited number of times, but at this point the chances of it being a statistical fluctuation are extremely small. Is it somehow a floating-point error? I truly am stumped and looking forward to any explanation.
TL;DR: When generating a random number inside a SELECT statement, the set of possible result values is different than when the number is generated before the SELECT statement.
Cheers,
NFSU
EDIT: Formatting
You can see what's going on if you look at the execution plan.
SET SHOWPLAN_TEXT ON
GO
SELECT (select choose(ceiling((rand()*2)) ,'a','b'))
Returns
|--Constant Scan(VALUES:((CASE WHEN CONVERT_IMPLICIT(int,ceiling(rand()*(2.0000000000000000e+000)),0)=(1) THEN 'a' ELSE CASE WHEN CONVERT_IMPLICIT(int,ceiling(rand()*(2.0000000000000000e+000)),0)=(2) THEN 'b' ELSE NULL END END)))
The CHOOSE is expanded out to
SELECT CASE
WHEN ceiling(( rand() * 2 )) = 1 THEN 'a'
ELSE
CASE
WHEN ceiling(( rand() * 2 )) = 2 THEN 'b'
ELSE NULL
END
END
and rand() is referenced twice. Each evaluation can return a different result.
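The double evaluation is easy to observe directly: two references to RAND() in the same SELECT will generally return two different values.
-- Each reference is evaluated independently, so these usually differ.
SELECT RAND() AS first_reference, RAND() AS second_reference;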
You will get the same problem with the rewrite below, as it is expanded out in the same way:
SELECT CASE ceiling(( rand() * 2 ))
WHEN 1 THEN 'a'
WHEN 2 THEN 'b'
END
Avoid CASE for this and any of its variants.
One method would be
SELECT JSON_VALUE ( '["a", "b"]' , CONCAT('$[', FLOOR(rand()*2) ,']') )
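This works because RAND() appears, and is therefore evaluated, only once, and FLOOR(rand()*2) yields 0 or 1, matching the zero-based indexes of the JSON array.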

Unusual SQL Server query with "select top 1 @arastr = k"

select top 1 @arastr = k
from #m
where datalength(k) = (select max(datalength(k)) from #m)
What does this query do, and what is the point of select top 1 @arastr = k? This query is taken from a stored procedure which has been working for 7-8 years, so there is nothing wrong with it, but I could not understand what it does.
(#m is a temp table which is created in the early part of the query.)
The query selects one arbitrary value (since TOP is used without an ORDER BY clause) from the column k in the temporary table #m and assigns it to the variable @arastr (which has presumably been declared earlier). The value selected will be one of those matching the longest string in the table, measured in bytes by the DATALENGTH function.
This is a quite common (but a little old-fashioned) way to get the value of k into the (previously declared!) variable @arastr for later use.
The function DATALENGTH will measure the length of e.g. a VARCHAR.
With TOP 1 you get only one result row in any case, the one with the "longest" k; its value is in @arastr afterwards...
EDIT: As pointed out by @jpw, this will be random if there is more than one k with the same (longest) length.
Without knowing what #m looks like and what kind of data is in k, I cannot tell you any more.
It probably makes more sense if it looks like this:
SET @arastr = (SELECT TOP 1 k
FROM #m
WHERE DATALENGTH(k) = (SELECT MAX(DATALENGTH(k)) FROM #m))
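If a deterministic pick is wanted when several values tie for the longest length, the subquery can be dropped in favour of an ORDER BY with a tie-breaker (a sketch):
SET @arastr = (SELECT TOP 1 k
               FROM #m
               ORDER BY DATALENGTH(k) DESC, k)  -- tie-break on the value itself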

Why does SUM(...) on an empty recordset return NULL instead of 0?

I understand why null + 1 or (1 + null) returns null: null means "unknown value", and if a value is unknown, its successor is unknown as well. The same is true for most other operations involving null.[*]
However, I don't understand why the following happens:
SELECT SUM(someNotNullableIntegerField) FROM someTable WHERE 1=0
This query returns null. Why? There are no unknown values involved here! The WHERE clause returns zero records, and the sum of an empty set of values is 0.[**] Note that the set is not unknown, it is known to be empty.
I know that I can work around this behaviour by using ISNULL or COALESCE, but I'm trying to understand why this behaviour, which appears counter-intuitive to me, was chosen.
Any insights as to why this makes sense?
[*] with some notable exceptions such as null OR true, where obviously true is the right result since the unknown value simply does not matter.
[**] just like the product of an empty set of values is 1. Mathematically speaking, if I were to extend (ℤ, +) to (ℤ ∪ {null}, +), the obvious choice for the identity element would still be 0, not null, since x + 0 = x but x + null = null.
The ANSI-SQL-Standard defines the result of the SUM of an empty set as NULL. Why they did this, I cannot tell, but at least the behavior should be consistent across all database engines.
Reference: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt on page 126:
b) If AVG, MAX, MIN, or SUM is specified, then
Case:
i) If TXA is empty, then the result is the null value.
TXA is the operative resultset from the selected column.
An empty table behaves like a table containing only NULL values: in both cases the aggregate has no non-NULL values to work with, which is why we get NULL as output. You can consider this as by design for SQL Server.
Example 1
CREATE TABLE testSUMNulls
(
ID TINYINT
)
GO
INSERT INTO testSUMNulls (ID) VALUES (NULL),(NULL),(NULL),(NULL)
SELECT SUM(ID) FROM testSUMNulls
Example 2
CREATE TABLE testSumEmptyTable
(
ID TINYINT
)
GO
SELECT SUM(ID) Sums FROM testSumEmptyTable
In both examples you will get NULL as output.
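For completeness, the workaround the question already mentions - coercing the NULL back to 0 with COALESCE:
-- Returns 0 instead of NULL for an empty (or all-NULL) input.
SELECT COALESCE(SUM(someNotNullableIntegerField), 0)
FROM someTable
WHERE 1 = 0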

Querying whether all records are true in sql-server - is casting expensive performance-wise?

I have a table with a column of bit values. I want to write a function that returns true if all records of an associated item are true.
One way I found of doing it is:
Select @Ret = CAST(MIN(CAST(IsCapped as tinyInt)) As Bit)
from ContractCover cc
Inner join ContractRiskVersion crv on cc.ContractRiskId = crv.ContractRiskId
WHERE crv.ContractVersionId = @ContractVersionId
AND cc.IsActive = 1
return @ret
But is the cast to tinyint to get the minimum expensive? Should I instead query something like (count(Id) where IsCapped = 0) > 0 and return false, rather than doing the multiple casts?
In the execution plan it doesn't seem like calling this function is heavy (but I'm not too familiar with analysing query plans - it just seems to have the same % cost, about 2%, as another section of the stored proc).
Edit - when I execute the stored proc which calls the function and look at the execution plan, the part where it calls the function has a query cost (relative to the batch) of 1%, which is comparable to other sections of the stored proc. Unless I'm looking at the wrong thing :)
Thanks!!
I would do this with an EXISTS, as it can jump out of the query the moment it finds one record where IsCapped = 0, whereas your query will always read all the data.
CREATE FUNCTION dbo.fn_are_contracts_capped(@ContractVersionId int)
RETURNS bit
WITH SCHEMABINDING
AS
BEGIN
DECLARE @return_value bit
IF EXISTS(
SELECT 1
FROM dbo.ContractCover cc
JOIN dbo.ContractRiskVersion crv
ON cc.ContractRiskId = crv.ContractRiskId
WHERE crv.ContractVersionId = @ContractVersionId
AND cc.IsActive = 1
AND IsCapped = 0)
BEGIN
SET @return_value = 0
END
ELSE
BEGIN
SET @return_value = 1
END
RETURN @return_value
END
Compared to the IO required to read the data, the cast will not add a lot of overhead.
Edit: wrapped code in a scalar function.
Casting in the SELECT would be CPU and memory bound. Not sure how much in this case--under normal circumstances we usually try to optimize for IO first, and then worry about CPU and memory second. So I don't have a definite answer for you there.
That said, the problem with this particular solution is that it won't short-circuit. SQL Server will read all rows where ContractVersionId = @ContractVersionId and IsActive = 1, convert IsCapped to an INT, and take the min, when really you can quit as soon as you find a single row where IsCapped = 0. It won't matter much if ContractVersionId is highly selective and only returns a very small fraction of the table, or if most rows are capped. But if ContractVersionId is not highly selective, or if a high percentage of the rows are uncapped, then you are asking SQL Server to do too much work.
The second consideration is that scalar-valued functions are a notorious performance drag in SQL Server. It is better to create an inline table-valued function if possible, e.g.:
create function AreAllCapped(@ContractVersionId int)
returns table as return (
select
ContractVersionId = @ContractVersionId
, AreAllCapped = case when exists (
select *
from ContractCover cc
join ContractRiskVersion crv on cc.ContractRiskId = crv.ContractRiskId
where crv.ContractVersionId = @ContractVersionId
and cc.IsActive = 1
and IsCapped = 0
)
then 0 else 1 end
)
Which you then can call using CROSS APPLY in the FROM clause (assuming SQL 2005 or later).
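A hypothetical usage sketch (the ContractVersion table name is assumed for illustration):
-- One row per contract version, with its all-capped flag from the inline TVF.
SELECT cv.ContractVersionId, f.AreAllCapped
FROM ContractVersion cv
CROSS APPLY dbo.AreAllCapped(cv.ContractVersionId) f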
Final note: taking the count where IsCapped = 0 has similar problems. It's like the difference between Any() and Count() in LINQ, if you are familiar: Any() short-circuits, while Count() has to actually count all the elements. SELECT COUNT(*) ... WHERE IsCapped = 0 still has to count all the rows, even though finding a single row is all you need to move on.
Of course, it is a known fact that a bit column can't be passed as an argument to an aggregate function (and thus, if it needs to be passed, it has to be cast to an integer first), but bit columns can be sorted on. Your query, therefore, could be rewritten like this:
SELECT TOP 1 @Ret = IsCapped
FROM ContractCover cc
INNER JOIN ContractRiskVersion crv on cc.ContractRiskId = crv.ContractRiskId
WHERE crv.ContractVersionId = @ContractVersionId
AND cc.IsActive = 1
ORDER BY IsCapped;
Note that in this particular query it is assumed that IsCapped can't be NULL. If it can, you'll need to add an additional filter to the WHERE clause:
AND IsCapped IS NOT NULL
Unless, of course, you would actually prefer to return NULL rather than 0 in that case.
As for the cost of casting, I don't really have anything to add to what has already been said by Filip and Peter. I do find it a nuisance that bit data require casting before aggregating, but that's never something of a primary concern.

How would you query an array of 1's and 0's chars from a database?

Say you had a long array of chars that are either 1 or 0, kind of like a bit vector, but stored in a database column. How would you query to know which values are set/not set? Say you need to know whether char 500 and char 1500 are "true" or not.
SELECT
Id
FROM
BitVectorTable
WHERE
SUBSTRING(BitVector, 500, 1) = '1'
AND SUBSTRING(BitVector, 1000, 1) = '1'
No index can be used for this kind of query, though. When you have many rows, this will get slow very quickly.
Edit: On SQL Server at least, all built-in string functions are deterministic. That means you could look into the possibility of creating computed columns based on the SUBSTRING() results of the whole combined value, and putting an index on each of them. Inserts will be slower and the table size will increase, but searches will be really fast.
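The DDL for that idea might look like this (a sketch; the computed column names match the query below):
ALTER TABLE BitVectorTable
ADD BitVector_0500 AS SUBSTRING(BitVector, 500, 1),
    BitVector_1000 AS SUBSTRING(BitVector, 1000, 1);
-- SUBSTRING is deterministic, so the computed columns can be indexed.
CREATE INDEX IX_BitVector_0500 ON BitVectorTable (BitVector_0500);
CREATE INDEX IX_BitVector_1000 ON BitVectorTable (BitVector_1000);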
SELECT
Id
FROM
BitVectorTable
WHERE
BitVector_0500 = '1'
AND BitVector_1000 = '1'
Edit #2: The limits for SQL Server are:
1,024 columns per normal table
30,000 columns per "wide" table
In MySQL, something using substring like
select foo from bar
where substring(col, 500,1)='1' and substring(col, 1500,1)='1';
This will be pretty inefficient though; you might want to rethink your schema. For example, you could store each bit separately, trading off space for speed...
create table foo
(
id int not null,
bar varchar(128),
primary key(id)
);
create table foobit
(
foo_id int not null,
idx int not null,
value tinyint not null,
primary key(foo_id,idx),
index(idx,value)
);
Which would be queried
select foo.bar from foo
inner join foobit as bit500
on(foo.id=bit500.foo_id and bit500.idx=500)
inner join foobit as bit1500
on(foo.id=bit1500.foo_id and bit1500.idx=1500)
where
bit500.value=1 and bit1500.value=1;
Obviously consumes more storage, but should be faster for those query operations as an index will be used.
I would convert the column to multiple bit columns and rewrite the relevant code - bit masks are so much faster than string comparisons. But if you can't do that, you must use db-specific functions. Regular expressions could be an option:
-- Flavor: MySql
SELECT * FROM table WHERE column REGEXP "^.{499}1.{999}1"
select substring(your_col, 500,1) as char500,
substring(your_col, 1500,1) as char1500 from your_table;
