I've been trying to understand why I get a "divide by zero encountered" error (Msg 8134) with my SQL query, but I must be missing something. I would like to know the why for the specific case below; I am not looking for NULLIF, CASE WHEN... or similar, as I already know about them (and can of course use them in a situation like the one below).
I have an SQL statement with a computed column similar to
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
Running this statement errors with the above-mentioned error message, which, in itself, is not the problem - obviously TotalSize might well be 0 and therefore cause the error.
Now what I don't understand is this: when I comment the computed column out, I do not have any rows where the TotalSize column is 0 - I double-checked that.
Then I thought that for some reason the column computation might be performed on the whole result set before the conditions of the WHERE clause are applied, but a) this wouldn't make much sense imho, and b) when I tried to reproduce the error with a test set-up, everything worked fine (see below):
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0001',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0002',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0003',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0004',0)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0005',1)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0006',0)
INSERT INTO tblComputer (ComputerName, IsServer) VALUES ('PC0007',1)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (1,100,21)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (2,100,10)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (3,100,55)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (4,0,10)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (5,100,23)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (6,100,18)
INSERT INTO tblHDD (ComputerID, TotalSize, FreeSpace) VALUES (7,100,11)
-- This statement does not throw an error as apparently the row for ComputerID 4
-- is filtered out before computing the (FreeSpace / TotalSize * 100)
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM
tblComputer
JOIN
tblHDD ON
tblComputer.ID = tblHDD.ComputerID
WHERE
IsServer = 1
I am quite stumped and would like to know what the reason is.
Any ideas or pointers into the right direction are very welcome, thanks in advance
Update
Thank you so far for your input, but unfortunately I don't seem to be getting closer to the root of the problem. I managed to strip the statement down a bit, and now I have a case where I can execute it without errors if one JOIN is removed (I would need it for additional columns in the output, which I have temporarily removed).
I do not understand why using the JOIN leads to the error - shouldn't a standard INNER JOIN always return either the same number of rows or fewer, but never more?
Working code
SELECT
TotalSize,
FreeSpace,
((FreeSpace / TotalSize) * 100)
FROM
MyTable1
INNER JOIN
MyTable2 ON
MyTable1.ID = MyTable2.Table1ID
WHERE
SomeCondition
Error causing code
SELECT
TotalSize,
FreeSpace,
((FreeSpace / TotalSize) * 100)
FROM
MyTable1
INNER JOIN
MyTable2 ON
MyTable1.ID = MyTable2.Table1ID
-- This JOIN causes "divide by zero encountered" error
INNER JOIN
MyTable3 ON
MyTable2.ID = MyTable3.Table2ID
WHERE
SomeCondition
I also tried my luck using a cursor and looping over the result row by row, but in that case no error occurred (no matter which of the two statements above I tried).
Sorry for the messy code indentation, somehow the correct formatting doesn't seem to be applied.
G.
SQL is a declarative language; you write a query that logically describes the result you want, but it is up to the optimizer to produce a physical plan. This physical plan may bear little relation to the written form of the query, because the optimizer does not simply reorder 'steps' derived from the textual form of the query; it can apply over 300 different transformations to find an efficient execution strategy.
The optimizer has considerable freedom to reorder expressions, joins, and other logical query constructions. This means that you cannot, in general, rely on any written query form to force one thing to be evaluated before another. In particular, the rewrite given by Lieven does not force the WHERE clause predicate to be evaluated before the expression. The optimizer may, depending on cost estimations, decide to evaluate the expression wherever it seems most efficient to do so. This may even mean, in some cases, that the expression is evaluated more than once.
The original question considered this possibility, but rejected it as 'not making much sense'. Nevertheless, this is the way the product works - if SQL Server estimates that a join will reduce the set size enough to make it cheaper to compute the expression on the result of the join, it is free to do so.
The general rule is never to depend on a particular evaluation order to avoid things like overflow or divide-by-zero errors. In this example, one would employ a CASE expression to check for a zero divisor - an example of defensive programming.
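For the query in the question, a minimal sketch of that defensive pattern (reusing the table and column names from the test set-up above) might look like this:
SELECT
TotalSize,
FreeSpace,
-- Guard the divisor explicitly instead of relying on evaluation order;
-- rows with TotalSize = 0 yield NULL rather than an error
CASE WHEN TotalSize = 0 THEN NULL
ELSE (FreeSpace / TotalSize * 100)
END
FROM
tblComputer
JOIN
tblHDD ON
tblComputer.ID = tblHDD.ComputerID
WHERE
IsServer = 1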
The optimizer's freedom to reorder things is a fundamental tenet of its design. You can find cases where it leads to counter-intuitive behaviours, but overall the benefits far outweigh the disadvantages.
Paul
The basic steps that SQL Server uses to process a single SELECT statement include the following:
1. The parser scans the SELECT statement and breaks it into logical units such as keywords, expressions, operators, and identifiers.
2. A query tree, sometimes referred to as a sequence tree, is built describing the logical steps needed to transform the source data into the format required by the result set.
3. The query optimizer analyzes different ways the source tables can be accessed. It then selects the series of steps that returns the results fastest while using fewer resources. The query tree is updated to record this exact series of steps. The final, optimized version of the query tree is called the execution plan.
4. The relational engine starts executing the execution plan. As the steps that require data from the base tables are processed, the relational engine requests that the storage engine pass up data from the rowsets requested from the relational engine.
5. The relational engine processes the data returned from the storage engine into the format defined for the result set and returns the result set to the client.
My interpretation of things is that there is no guarantee that your WHERE clause gets evaluated before the computed column is evaluated for all rows.
You could verify that assumption by changing your query as below, forcing the WHERE clause to be evaluated before the computation.
SELECT
TotalSize,
FreeSpace,
(FreeSpace / TotalSize * 100)
FROM (
SELECT
TotalSize,
FreeSpace
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
) t
What rows are returned when you run:
SELECT
TotalSize
FROM
tblComputer
...[ couple of joins ]...
WHERE
SomeCondition = SomeValue
and ((TotalSize * 100) = 0)
This might give you a clue as to how SQL Server is evaluating (TotalSize * 100) to be zero.
Another idea: is there anything in your WHERE clause which might also be the problem?
You're assuming it's the TotalSize, but it might be somewhere else.
I was running into the same issue. In my case NULLs were acceptable so I was able to fix it this way:
Select Expression1 / Expression2 -- Caused Division By 0
Select Expression1 / NULLIF(Expression2,0) -- Causes result to be NULL
If you need other handling, you can wrap the entire expression in an ISNULL function like this:
Select ISNULL(Expression1 / NULLIF(Expression2,0), -5) -- Returns -5 instead of NULL or a divide-by-zero error
Related
Since Snowflake is a columnar database, does it impact performance when you use COUNT(*) vs COUNT(column)? This is assuming that the column you're referencing does NOT have any NULLs.
As a_horse_with_no_name explained, these two functions are different, but you already mentioned that the column has no NULL values, so they should return the same result in your case.
More importantly, Snowflake has a special optimization for the COUNT function. As far as I can see, it does NOT impact performance whether you use COUNT(*) or COUNT(column), even when the column contains NULL values! For both of them, Snowflake uses METADATA statistics, so it does not actually count rows.
You can test it with SNOWFLAKE_SAMPLE_DATA:
select count(*) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- 5999989709
select count(L_ORDERKEY) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- 5999989709
Both queries return a result immediately, although the table is about 170 GB and contains more than 5 billion rows.
I have to add this extra information because of the conversation between Niru and a_horse_with_no_name. a_horse_with_no_name said:
Even if all columns of a row are NULL, count(*) should include that row in the result. If it doesn't this is a clear violation of the SQL standard
I'm not sure about the SQL standard, but when you use COUNT(*), Snowflake doesn't check whether the columns are NULL or not (as you expected). I can see why Niru misunderstood the documents; the docs and the samples should be improved.
If you run my sample queries, you will see that they complete in milliseconds. We are talking about counting almost 6 billion rows:
select count(*) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- completes in milliseconds
select count(L_ORDERKEY) from snowflake_sample_data.TPCH_SF1000.LINEITEM;
-- completes in milliseconds
But if I make a little modification to the query, it takes about 3 minutes on the same warehouse (XSMALL):
select count(t.*) from snowflake_sample_data.TPCH_SF1000.LINEITEM t;
-- completes in 3 minutes!?
Here is the trick:
Alias.*, which indicates that the function should return the number of rows that do not contain any NULLs.
https://docs.snowflake.com/en/sql-reference/functions/count.html#arguments
Only if you use alias.* (as I used t.* in my sample) will Snowflake check whether all columns are NULL when producing the count. This is why it is much slower, and this is why there shouldn't be any performance issue when you run COUNT(XYZ) or COUNT(*) on a table.
Here is the Snowflake doc, hope it helps:
https://docs.snowflake.com/en/sql-reference/functions/count.html
Please refer to the Snowflake documentation: it does make a difference - count(alias.*) will check each column in the row, whereas count(column) does a NULL check only on that column.
As far as I am aware, the only way to get a random value per row in a SELECT statement is by using the newid() function, as the rand() function doesn't generate new values for each row.
This leads to the following awkward construction to get a random number from, say 0 - 9:
abs(checksum(newid())) % 10
If I use this expression in the SELECT clause, it behaves as expected. However, if I try something like the following:
select *
from table
where abs(checksum(newid())) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them. Apparently newid() is only evaluated once, instead of once for each row.
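For example, a quick sketch of the SELECT-clause behaviour (same placeholder table name as above):
-- r differs from row to row, as expected
select abs(checksum(newid())) % 10 as r
from table;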
The question is, how can I use a random number in the WHERE clause?
More
There is a similar question which asks for fixed number of rows at random. In the above example I could have used:
select top 50 percent * from table order by newid();
which will get me what I am looking for.
The question remains, how can I use a random number in the WHERE clause. For example, is it possible to do something like this?
select *
from table
where code={random number};
Here is one way to get around the problem
SELECT *
FROM (SELECT *,
Abs(Checksum(Newid())) % 10 AS ran
FROM yourtable) a
WHERE ran > 4;
For some reason, newid() in the WHERE clause is executed only once and treated as a constant in the comparison.
When I check the execution plans, your query is missing the Compute Scalar operator, whereas my query has a Compute Scalar present in its execution plan.
The newid() function is calculated only once in the WHERE clause, not row by row. The trick is to force it to run row by row.
Of course it is possible to include it in a SELECT clause, and, in turn, include that in a CTE or a subquery, as per the other answers.
Microsoft offer a solution here: https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2008-r2/ms189108(v=sql.105)?redirectedfrom=MSDN
The trick is to force newid() to recalculate by combining it with some row value. This is easily done in the checksum() function.
For example:
SELECT *
FROM table
WHERE abs(checksum(newid(),id)) % 10>4;
I should have thought that I would get roughly half the rows. Instead I get all or none of them
You may get all of the rows or none of them, since NEWID() is executed once per query when you use it in the WHERE clause. This is explained here by Conor Cunningham; the technical term for this is runtime constants.
You can look at your execution plan and look for the expression below:
Const ConstValue
which, as you can see, is calculated once and used throughout; in the end you are doing just a boolean comparison against a constant, so you will end up with all rows or none.
You have to use a CTE like the one in another answer, use TOP with ORDER BY NEWID(), or use TABLESAMPLE to return random rows.
You may find the TABLESAMPLE option more helpful, since it may not have to go through all the table data to get only a sample set of rows, unlike NEWID().
Below is one example on a table with 1,000,000 rows:
select * from Orders
TABLESAMPLE (50 PERCENT)
This is more of a curiosity question. I know it seems like an oddball, but I use null when checking for data because I'm not concerned with what data is there, only IF data is there. I believe the following scenario only occurs in SQL Server.
When I want to see if a record exists I'll use:
IF(EXISTS(SELECT null FROM Table1 WHERE Criteria IN (1, 2)))
The following code also works:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria = 1) = 2)
But this doesn't work:
IF((SELECT COUNT(null) FROM Table1 WHERE Criteria IN (1,2)) = 2)
and get this error:
Operand data type NULL is invalid for count operator.
Why is the third statement any different because of the IN clause?
Here is a SQL Fiddle of what I'm talking about:
http://sqlfiddle.com/#!6/6d7db/8
I've narrowed it down further: the error only occurs if there are multiple items in the IN clause.
It seems to be something about the query optimizer.
In the first two queries (from your fiddle), the COUNT(null) seems to be converted to COUNT(*), as you can see in the execution plan.
In the second query, IN with only one value is optimized to =, resulting in exactly the same query as the first.
With IN (1,2), the query fails. It's the same if you use COUNT(1): it's converted to COUNT(*) in the first two queries, but stays COUNT(1) in the third.
Another side note: the effect only occurs with a real table. If you use a table variable, all three statements throw the error.
The bottom line should probably be: COUNT(null) is wrong (as Heinzi explained); it just may slip through the optimizer in very rare circumstances.
COUNT(null), the short form of COUNT(ALL null), simply does not make sense. Let's have a look at the definition of COUNT (emphasis mine):
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates expression for each row in a group and returns the number of unique, nonnull values.
Thus, COUNT(ALL someExpressionThatYieldsNull) would always return 0, no matter how many records are matched by your WHERE clause. Obviously, that makes it utterly unsuitable for counting rows. COUNT(*) would be correct here.
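Applied to the third statement from the question, the corrected version would therefore be:
-- COUNT(*) counts matching rows regardless of column values
IF((SELECT COUNT(*) FROM Table1 WHERE Criteria IN (1,2)) = 2)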
I am quite surprised that your second example works at all, you might have stumbled upon a bug here. Trying the following in MSSQL 2012 (SQLFiddle):
SELECT COUNT(NULL) FROM someTable;
yields the following error:
Operand data type NULL is invalid for count operator.
which makes perfect sense.
Code analysis rule SR0007 for Visual Studio 2010 database projects states that:
You should explicitly indicate how to handle NULL values in comparison expressions by wrapping each column that can contain a NULL value in an ISNULL function.
However code analysis rule SR0006 is violated when:
As part of a comparison, an expression contains a column reference ... Your code could cause a table scan if it compares an expression that contains a column reference.
Does this also apply to ISNULL, or does ISNULL never result in a table scan?
Yes, it causes table scans (though it seems to get optimised out if the column isn't actually nullable).
The SR0007 rule is extremely poor blanket advice as it renders the predicate unsargable and means any indexes on the column will be useless. Even if there is no index on the column it might still make cardinality estimates inaccurate affecting other parts of the plan.
The categorization of it in the Microsoft.Performance category is quite amusing as it seems to have been written by someone with no understanding of query performance.
It claims the rationale is
If your code compares two NULL values or a NULL value with any other
value, your code will return an unknown result.
Whilst the expression itself does evaluate to unknown, your code returns a completely deterministic result once you understand that any =, <>, >, < etc. comparison with NULL evaluates as unknown, and that the WHERE clause only returns rows where the expression evaluates to true.
It is possible that they mean if ANSI_NULLS is off but the example they give in the documentation of WHERE ISNULL([c2],0) > 2; vs WHERE [c2] > 2; would not be affected by this setting anyway. This setting
affects a comparison only if one of the operands of the comparison is
either a variable that is NULL or a literal NULL.
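A quick standalone sketch of that three-valued logic:
SELECT CASE WHEN NULL = NULL THEN 'true'
WHEN NOT (NULL = NULL) THEN 'false'
ELSE 'unknown'
END
-- Returns 'unknown' (with ANSI_NULLS ON): the comparison is neither true
-- nor false, so a WHERE clause would simply not return the row.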
Execution plans showing scans vs seeks are below:
CREATE TABLE #foo
(
x INT NULL UNIQUE
)
INSERT INTO #foo
SELECT ROW_NUMBER() OVER (ORDER BY @@SPID)
FROM sys.all_columns
SELECT *
FROM #foo
WHERE ISNULL(x, 10) = 10
SELECT *
FROM #foo
WHERE x = 10
SELECT *
FROM #foo
WHERE x = 10
OR x IS NULL
These two statements are logically equivalent:
SELECT * FROM table WHERE someColumn BETWEEN 1 AND 100
SELECT * FROM table WHERE someColumn >= 1 AND someColumn <= 100
Is there a potential performance benefit to one versus the other?
No benefit, just syntactic sugar.
By using the BETWEEN version, you can avoid function reevaluation in some cases.
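For example, with an expression rather than a bare column on the left-hand side, BETWEEN lets you write the expression only once; whether it is actually evaluated only once is still up to the optimizer, as the Postgres experiment further down shows (dbo.ExpensiveFn is a hypothetical function):
SELECT * FROM someTable
WHERE dbo.ExpensiveFn(someColumn) BETWEEN 1 AND 100
-- versus spelling the call out twice:
SELECT * FROM someTable
WHERE dbo.ExpensiveFn(someColumn) >= 1
AND dbo.ExpensiveFn(someColumn) <= 100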
There's no performance benefit, it's just easier to read/write the first one.
No, no performance benefit. It's just a little syntactic candy.
If you were to check a query comparison, something like
DECLARE @Table TABLE(
ID INT
)
SELECT *
FROM @Table
WHERE ID >= 1 AND ID <= 100
SELECT *
FROM @Table
WHERE ID BETWEEN 1 AND 100
and check the execution plan, you should notice that it is exactly the same.
Hmm, here was a surprising result. I don't have SQL Server here, so I tried this in Postgres. Obviously disclaimers apply: this won't necessarily give the same results, your mileage may vary, consult a physician before using. But still ...
I just wrote a simple query in two different ways:
select *
from foo
where (select code from bar where bar.barid=foo.barid) between 'A' and 'B'
and
select *
from foo
where (select code from bar where bar.barid=foo.barid)>='A'
and (select code from bar where bar.barid=foo.barid)<='B'
Surprisingly to me, both had almost identical run times. When I did an EXPLAIN PLAN, they gave identical results. Specifically, the first query did the lookup against bar twice, once for the >= test and again for the <= test, just like the second query.
Conclusion: In Postgres, at least, BETWEEN is indeed just syntactic sugar.
Personally, I use it regularly because it is clearer to the reader, especially if the value being tested is an expression. Figuring out that two complex expressions are identical can be a non-trivial exercise. Figuring out that two complex expressions SHOULD BE identical even though they're not is even more difficult.
Oh, but you're all referring to the case where the search value is on the left side of the comparison.
Did anybody look at the differences when it is on the other side of the clause?
SELECT * FROM table WHERE @date BETWEEN someCol1 AND someCol2
SELECT * FROM table WHERE someCol1 <= @date AND someCol2 >= @date