SQL Server CREATE AGGREGATE logic error

I've created a CLR function for SQL Server 2014 that should calculate the difference between the first and the last value in column [Value].
Here is the table:
Date_Time                   Value
-----------------------     -----
2018-03-29 09:30:02.533      6771
2018-03-29 10:26:23.557      6779
2018-03-29 13:12:04.550      6787
2018-03-29 13:55:44.560      6795
Here is the code:
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[Serializable]
[SqlUserDefinedAggregate(Format.Native,
    IsInvariantToDuplicates = false,
    IsInvariantToNulls = true,
    IsInvariantToOrder = false,
    IsNullIfEmpty = true,
    Name = "SUBT")]
public struct SUBT
{
    private double first;
    private double last;
    private int count;

    public void Init()
    {
        first = 0.0;
        last = 0.0;
        count = 0;
    }

    public void Accumulate(SqlDouble Value)
    {
        if (!Value.IsNull)
        {
            if (count == 0)
                first = (double)Value;
            else
                last = (double)Value;

            count += 1;
        }
    }

    public void Merge(SUBT Group)
    {
        first = Group.first;
        last = Group.last;
        count += 1;
    }

    public SqlDouble Terminate()
    {
        double value = (double)(last - first);
        return new SqlDouble(value);
    }
}
So the result should be [Value]=24, i.e. 6795 - 6771, but I get 6795 :(
Where is the error?

This aggregate function is dependent on order but there is no ordering guarantee for the aggregation input stream. Consequently, the results are execution plan dependent.
Assuming the Date_Time value is the desired ordering, you could provide both Date_Time and Value as function arguments, save the values associated with the lowest and highest Date_Time values, and use those in the Merge and Terminate methods.
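For reference, once the aggregate is rewritten to take both columns, the call would look something like the sketch below. The two-argument signature and the table name are assumptions about the modified aggregate, not the code as posted:

SELECT dbo.SUBT([Date_Time], [Value]) AS [Diff]   -- modified aggregate with (datetime, float) arguments
FROM dbo.YourTable;                               -- YourTable is a placeholder name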

There are a couple of problems here, I believe:
In the Merge method you are making two bad assumptions:
You are assuming that the incoming Group has a value, yet it could have been called on only NULL values, in which case all 3 internal variables are 0. You then overwrite the current values of first and last, which could have non-0 values prior to Merge being called, but will end up back at 0 due to being overwritten.
You are assuming that at least one of the instances (the current one or the incoming Group) has values set (i.e. has been called at least once on a non-NULL value). In the case that both instances have only been called with NULL values, you will have 0 for first and last, yet you will still increment count. I am not sure whether Accumulate will be called again once things are being merged, but if it is, you will skip setting first. This is not your problem at the moment, since you do not have multiple (or any) NULL values, but it is a potential problem for real data sets.
In the case where both instances have been called on non-NULL values, both will at least have first set, and maybe last (or maybe not). By overwriting the current instance with the incoming Group, you could be losing the real first value and/or the real last value.
As @DanGuzman mentioned in his answer, there is no guaranteed ordering of User-Defined Aggregates (the IsInvariantToOrder property of the SqlUserDefinedAggregate attribute is ignored / unused). And, as he noted, you will need to pass in the Date_Time value in order to handle this aspect of the operation manually. However, it won't be used in the Terminate method. It will instead be compared against two new variables: firstDate and lastDate, initialized to a date far in the future and far in the past, respectively (this will likely require changing the Format to UserDefined and then adding custom Read and Write methods -- unless you can store the full DateTime values as ticks, perhaps).
Get rid of the count variable
In the Accumulate method, you will need to:
Compare the incoming Date_Time against firstDate. If it is before firstDate, then store this new value as firstDate and store Value as first, ELSE
Compare the incoming Date_Time against lastDate. If it is after lastDate, then store this new value as lastDate and store Value as last, ELSE do nothing
In the Merge method, do a similar comparison for firstDate between both instances and keep the earlier one (date and value). Do the same with lastDate and keep the later one (date and value). (Note: these changes should fix all of the Merge issues noted above in #1.)
The Terminate method shouldn't change.
For what it's worth, I ran the code exactly as you have posted in the question, and it returns the expected value, using the following test query:
CREATE TABLE #Test ([Date_Time] DATETIME, [Value] FLOAT);
-- TRUNCATE TABLE #Test;
INSERT INTO #Test VALUES ('2018-03-29 09:30:02.533', 6771);
INSERT INTO #Test VALUES ('2018-03-29 10:26:23.557', 6779);
INSERT INTO #Test VALUES ('2018-03-29 13:12:04.550', 6787);
INSERT INTO #Test VALUES ('2018-03-29 13:55:44.560', 6795);
SELECT dbo.SUBT([Value])
FROM #Test;
-- 24
So, if you are still having issues, then you will need to post more info, such as the test query (and maybe table) that you are using. But even if it appears to work, like it does on my system, it still has the potential ordering problem and will need to be updated as noted above regardless.
Other notes:
In the Accumulate method you have (double)Value. There is no need to cast the incoming parameter. All Sql* types have a Value property that returns the value in the native .NET type. In this case, just use Value.Value. That isn't great for readability, so consider changing the name of the input parameter ;-).
You never use the value of count, so why increment it? You could instead just use a bool and set it to true here. Setting it to true for each non-NULL value will not change the operation. However, this is a moot point since you truly need to set either first or last in each call to this UDA, based on the current Date_Time value.

Related

SQL Server CHOOSE() function behaving unexpectedly with RAND() function

I've encountered an interesting SQL Server behaviour while trying to generate random values in T-SQL using the RAND and CHOOSE functions.
My goal was to try to return one of two given values using RAND() as rng. Pretty easy right?
For those of you who don't know it, the CHOOSE function accepts an index number (int) along with a collection of values and returns the value at the specified index. Pretty straightforward.
At first attempt my SQL looked like this:
select choose(ceiling((rand()*2)) ,'a','b')
To my surprise, this expression returned one of three values: null, 'a' or 'b'. Since I didn't expect the null value, I started digging. The RAND() function returns a float in the range from 0 (included) to 1 (excluded). Since I'm multiplying it by 2, it should return values anywhere in the range from 0 (included) to 2 (excluded). Therefore, after applying the CEILING function, the final value should be one of 0, 1 or 2. After realising that, I extended the value list with 'c' to check whether that would perhaps be returned. I also checked the docs page of CEILING and learnt that:
Return values have the same type as numeric_expression.
I had assumed the CEILING function returned int; since it doesn't in this case, the value must be implicitly converted to int before being used in CHOOSE, which sure enough is stated on the docs page:
If the provided index value has a numeric data type other than int,
then the value is implicitly converted to an integer.
Just in case I added an explicit cast. My SQL query looks like this now:
select choose(cast(ceiling((rand()*2)) as int) ,'a','b','c')
However, the result set didn't change. To check which values cause the problem I tried generating the value beforehand and selecting it alongside the CHOOSE result. It looked like this:
declare #int int = cast(ceiling((rand()*2)) as int)
select #int,choose( #int,'a','b','c')
Interestingly enough, now the result set changed to (1, a), (2, b), which was my original goal. After delving deeper into the CHOOSE docs page and doing some testing, I learned that null is returned in one of two cases:
Given index is a null
Given index is out of range
In this case that would mean the index value, when generated inside the SELECT statement, is either 0 or greater than 2 (or 3 with the extended list). (I'm assuming that negative numbers are not possible here and that the CHOOSE function indexes from 1.) As I've stated before, 0 should be one of the possible results of:
ceiling((rand()*2))
but for some reason it's never 0 (at least not when I tried it 1 million+ times like this):
set nocount on
declare @test table(ceiling_rand int)
declare @counter int = 0

while @counter < 1000000
begin
    insert into @test
    select ceiling((rand()*2))

    set @counter = @counter + 1
end

select distinct ceiling_rand from @test
Therefore I assume that the value generated inside the SELECT is either greater than 2 (or 3) or NULL. Why would it be like this only when generated in a SELECT statement? Perhaps the order of resolving CAST, CEILING or RAND inside the SELECT is different than it would seem? It's true I've only tried it a limited number of times, but at this point the chances of it being a statistical fluctuation are extremely small. Is it somehow a floating-point error? I am truly stumped and looking forward to any explanation.
TL;DR: When generating a random number inside a SELECT statement, the set of possible results is different than when it's generated before the SELECT statement.
Cheers,
NFSU
You can see what's going on if you look at the execution plan.
SET SHOWPLAN_TEXT ON
GO
SELECT (select choose(ceiling((rand()*2)) ,'a','b'))
Returns
|--Constant Scan(VALUES:((CASE WHEN CONVERT_IMPLICIT(int,ceiling(rand()*(2.0000000000000000e+000)),0)=(1) THEN 'a' ELSE CASE WHEN CONVERT_IMPLICIT(int,ceiling(rand()*(2.0000000000000000e+000)),0)=(2) THEN 'b' ELSE NULL END END)))
The CHOOSE is expanded out to
SELECT CASE
           WHEN ceiling(( rand() * 2 )) = 1 THEN 'a'
           ELSE CASE
                    WHEN ceiling(( rand() * 2 )) = 2 THEN 'b'
                    ELSE NULL
                END
       END
and rand() is referenced twice. Each evaluation can return a different result.
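A quick way to see this effect directly (a small illustrative query, not from the original post): reference RAND() twice in one SELECT and the two values will almost always differ, because each reference is a separate evaluation.

SELECT RAND() AS first_reference,
       RAND() AS second_reference;
-- The two columns will usually show different values.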
You will get the same problem with the rewrite below, as it is expanded out in the same way:
SELECT CASE ceiling(( rand() * 2 ))
WHEN 1 THEN 'a'
WHEN 2 THEN 'b'
END
Avoid CASE for this and any of its variants.
One method would be
SELECT JSON_VALUE ( '["a", "b"]' , CONCAT('$[', FLOOR(rand()*2) ,']') )
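Note that JSON array paths are zero-based, which is why this uses FLOOR (producing 0 or 1) rather than CEILING, and the whole expression only references RAND() once, so there is no double evaluation.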

Compare and divide in report builder

I have a condition where I want to divide two values in Report Builder that share the same date but have different sample names, within the same table...
For example, in this image, I want to divide the CM2 (2.85) value by the Raw Meal (0.58) value; the result should be about 4.9.
And if these two parameters (CM2 and Raw Meal) are not on the same date, then the value should be empty or nothing. Please help, I am new to Report Builder expressions.
I've tried this expression but it does not give me what I need
IIf(InStr(Fields!Sample_Code.Value,"CM2") > 0, Fields!So3.Value, nothing) / IIf(InStr(Fields!Sample_Code.Value,"Raw Meal") > 0, Fields!So3.Value, nothing)
The record will either be CM2 or Raw Meal, but it will never contain both at the same time. If Fields!Sample_Code.Value = "CM2" is true, the second half of the expression will be false, or vice-versa. Fields!Sample_Code.Value can't be two different values at the same time, it can only contain the data from a single record.
Your expression will result in either:
nothing/Fields!So3.Value
or
Fields!So3.Value/nothing.
As a simplified example:
IF( x=1 , 1 , null) / IF( x=2 , 1 , null)
X cannot be two different values at the same time so the expression will never return a non-null result.
You'll need to join CM2 and RAW MEAL records together to evaluate them at the same time. That would probably require a significant change to what you've posted so far.
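To give a rough idea of what that join could look like in the dataset query, here is a hedged T-SQL sketch; the table and column names are assumptions based on the screenshot, not known from the question:

-- Self-join so the CM2 row and the Raw Meal row for the same date
-- end up on a single row, then divide the two So3 readings.
SELECT cm2.[Date],
       cm2.So3 / NULLIF(raw.So3, 0) AS So3_Ratio
FROM   dbo.SampleResults AS cm2
JOIN   dbo.SampleResults AS raw
       ON  raw.[Date] = cm2.[Date]
       AND raw.Sample_Code = 'Raw Meal'
WHERE  cm2.Sample_Code = 'CM2';

With an inner join, dates where one of the two samples is missing simply drop out, which matches the requirement that the value be empty when the two samples are not on the same date.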

How to push individual (optional) parameters into an array

I'm trying to populate an array (fruit.Price) with properties supplied in the first WITH line of the following Cypher code:
WITH [{Price_1:15,Price_2:20,Price_3:17,strFruit:"apples"},{Price_1:2,Price_2:1,Price_3:1.5,Price_4:3,strFruit:"pears"}] AS props
UNWIND props as p
MATCH (fruit:Fruit) WHERE fruit.strFruit=p.strFruit
FOREACH (price in [p.Price_1,p.Price_2,p.Price_3,p.Price_4] |SET fruit.Price = fruit.Price + price)
RETURN fruit
where the maximum quantity of p.Price_n is 4, but not all are necessarily supplied (as above, where p.Price_4 is missing in the first row). These properties will always be supplied consecutively i.e. Price_4 won't be supplied without Price_3 also.
How do I populate an array with a variable number of elements in this way? For what it's worth; I'm actually using the HTTP Rest API and the WITH line is in reality a parameters: command.
thanks
I would use coalesce(), and default to 0 for the ones that don't exist. Also, it might be easier to do reduce() instead of foreach(). (Updated to use CASE/WHEN instead of coalesce.)
Even easier would be to pass in an array of variable length {prices:[15,20,17], strFruit:"apples"}... or just the total price (if you have control over that).
WITH [{Price_1:15,Price_2:20,Price_3:17,strFruit:"apples"},{Price_1:2,Price_2:1,Price_3:1.5,Price_4:3,strFruit:"pears"}] AS props
UNWIND props as p
MATCH (fruit:Fruit) WHERE fruit.strFruit=p.strFruit
SET fruit.Price = reduce(total = [], price in [p.Price_1,p.Price_2,p.Price_3,p.Price_4] | CASE WHEN NOT price is NULL THEN total + price ELSE total END)
RETURN fruit
http://console.neo4j.org/r/o69bii

Why does SUM(...) on an empty recordset return NULL instead of 0?

I understand why null + 1 or (1 + null) returns null: null means "unknown value", and if a value is unknown, its successor is unknown as well. The same is true for most other operations involving null.[*]
However, I don't understand why the following happens:
SELECT SUM(someNotNullableIntegerField) FROM someTable WHERE 1=0
This query returns null. Why? There are no unknown values involved here! The WHERE clause returns zero records, and the sum of an empty set of values is 0.[**] Note that the set is not unknown, it is known to be empty.
I know that I can work around this behaviour by using ISNULL or COALESCE, but I'm trying to understand why this behaviour, which appears counter-intuitive to me, was chosen.
Any insights as to why this makes sense?
[*] with some notable exceptions such as null OR true, where obviously true is the right result since the unknown value simply does not matter.
[**] just like the product of an empty set of values is 1. Mathematically speaking, if I were to extend $(\mathbb{Z}, +)$ to $(\mathbb{Z} \cup \{\text{null}\}, +)$, the obvious choice for the identity element would still be $0$, not null, since $x + 0 = x$ but $x + \text{null} = \text{null}$.
The ANSI-SQL-Standard defines the result of the SUM of an empty set as NULL. Why they did this, I cannot tell, but at least the behavior should be consistent across all database engines.
Reference: http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt on page 126:
b) If AVG, MAX, MIN, or SUM is specified, then
Case:
i) If TXA is empty, then the result is the null value.
TXA is the operative resultset from the selected column.
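If you want 0 back instead of NULL, the usual workaround (which the question already alludes to) is simply to wrap the aggregate:

SELECT COALESCE(SUM(someNotNullableIntegerField), 0) AS Total
FROM someTable
WHERE 1 = 0;  -- returns 0 instead of NULL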
An empty result set behaves just like a set containing only NULL values: either way there are no non-NULL values to aggregate, so the aggregate functions return NULL. You can consider this as by design for SQL Server.
Example 1
CREATE TABLE testSUMNulls
(
ID TINYINT
)
GO
INSERT INTO testSUMNulls (ID) VALUES (NULL),(NULL),(NULL),(NULL)
SELECT SUM(ID) FROM testSUMNulls
Example 2
CREATE TABLE testSumEmptyTable
(
ID TINYINT
)
GO
SELECT SUM(ID) Sums FROM testSumEmptyTable
In both examples you will get NULL as output.

Is SQL Server's double checking needed here?

IF @insertedValue IS NOT NULL AND @insertedValue > 0
This logic is in a trigger.
The value comes from a deleted or inserted row (doesn't matter).
2 questions :
Do I need to check both conditions? (I want all values > 0; the value in the db can be nullable.)
Does SQL Server evaluate the expression in the order I wrote it?
1) Actually, no, since if @insertedValue is NULL, the expression @insertedValue > 0 will evaluate to false. (Actually, as Martin Smith points out in his comment, it will evaluate to a special value "unknown", which when forced to a Boolean result on its own collapses to false - examples: unknown AND true = unknown, which is forced to false; unknown OR true = true.) But you're relying on comparison behaviour with NULL values. A single-step equivalent method, BTW, would be:
IF ISNULL(@insertedValue, 0) > 0
IMHO, you're better sticking with the explicit NULL check for clarity if nothing else.
2) Since the query will be optimised before execution, there is absolutely no guarantee of order of execution or short circuiting of the AND operator.
Combining the two - if the double check is truly unnecessary, then it will probably be optimised out before execution anyway, but your SQL code will be more maintainable in my view if you make this explicit.
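As a small illustration of point 1 (a throwaway snippet, not taken from the original trigger), a NULL value never satisfies the > 0 comparison on its own:

DECLARE @insertedValue int = NULL;

IF @insertedValue > 0
    SELECT 'greater than zero';
ELSE
    SELECT 'NULL or not greater than zero';  -- this branch runs when the value is NULL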
You can use COALESCE, which returns the first non-null expression among its arguments.
This also makes the check more flexible: you can pass several columns or values to COALESCE and then apply the greater-than-zero condition to the first non-null one it returns. The important point to note here is that you have the option to check values across multiple columns.
declare #val int
set #val = COALESCE( null, 1, 10 )
if(#val>0)
select 'fine'
else
select 'not fine'
