Bounded cumulative sum in SQL - sql-server

How can I use SQL to compute a cumulative sum over a column so that the cumulative sum always stays within upper and lower bounds? The example below uses lower bound -2 and upper bound 10 and shows both the regular cumulative sum and the bounded cumulative sum.
id input
-------------
1 5
2 7
3 -10
4 -10
5 5
6 10
Result:
id cum_sum bounded_cum_sum
---------------------------------
1 5 5
2 12 10
3 2 0
4 -8 -2
5 -3 3
6 7 10
See https://codegolf.stackexchange.com/questions/61684/calculate-the-bounded-cumulative-sum-of-a-vector for some (non SQL) examples of a bounded cumulative sum.

You can (almost) always use a cursor to implement whatever cumulative logic you have. The technique is routine, so once you are familiar with it you can apply it to a variety of problems.
One specific thing to note: Here I update the table in-place, so the [id] column must be uniquely indexed.
(Tested on SQL Server 2017 latest linux docker image)
Test Dataset
use [testdb];
if OBJECT_ID('testdb..test') is not null
drop table testdb..test;
create table test (
[id] int,
[input] int,
);
insert into test (id, input)
values (1,5), (2,7), (3,-10), (4,-10), (5,5), (6,10);
Solution
/* A generic row-by-row cursor solution */
-- First of all, make [id] uniquely indexed to enable "where current of"
create unique index idx_id on test(id);
-- append answer columns
alter table test
add [cum_sum] int,
[bounded_cum_sum] int;
-- storage for each row
declare @id int,
@input int,
@cum_sum int,
@bounded_cum_sum int;
-- record accumulated values
declare @prev_cum_sum int = 0,
@prev_bounded_cum_sum int = 0;
-- open a cursor ordered by [id] and updatable for assigned columns
declare cur CURSOR local
for select [id], [input], [cum_sum], [bounded_cum_sum]
from test
order by id
for update of [cum_sum], [bounded_cum_sum];
open cur;
while 1=1 BEGIN
/* fetch next row and check termination condition */
fetch next from cur
into @id, @input, @cum_sum, @bounded_cum_sum;
if @@FETCH_STATUS <> 0
break;
/* program body */
-- main logic
set @cum_sum = @prev_cum_sum + @input;
set @bounded_cum_sum = @prev_bounded_cum_sum + @input;
if @bounded_cum_sum > 10 set @bounded_cum_sum=10
else if @bounded_cum_sum < -2 set @bounded_cum_sum=-2;
-- write the result back
update test
set [cum_sum] = @cum_sum,
[bounded_cum_sum] = @bounded_cum_sum
where current of cur;
-- setup for next row
set @prev_cum_sum = @cum_sum;
set @prev_bounded_cum_sum = @bounded_cum_sum;
END
-- cleanup
close cur;
deallocate cur;
-- show
select * from test;
Result
| | id | input | cum_sum | bounded_cum_sum |
|---|----|-------|---------|-----------------|
| 1 | 1 | 5 | 5 | 5 |
| 2 | 2 | 7 | 12 | 10 |
| 3 | 3 | -10 | 2 | 0 |
| 4 | 4 | -10 | -8 | -2 |
| 5 | 5 | 5 | -3 | 3 |
| 6 | 6 | 10 | 7 | 10 |
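As a cross-check of the row-by-row logic, here is a minimal Python sketch of the same clamping rule. It is only a model of what the cursor loop computes, not part of the SQL solution; the function name bounded_cumsum and its default bounds are assumptions chosen to match the example.

```python
def bounded_cumsum(values, lower=-2, upper=10):
    """Return (cum_sum, bounded_cum_sum) pairs, mirroring the cursor loop.

    Hypothetical helper: the bounds default to the example's -2 and 10.
    """
    total = 0      # plain running total (cum_sum)
    bounded = 0    # running total clamped after every step (bounded_cum_sum)
    rows = []
    for value in values:
        total += value
        # Add the input to the previous bounded value, then clamp to the bounds.
        bounded = min(upper, max(lower, bounded + value))
        rows.append((total, bounded))
    return rows


result = bounded_cumsum([5, 7, -10, -10, 5, 10])
for row_id, (cum_sum, bounded) in enumerate(result, start=1):
    print(row_id, cum_sum, bounded)
```

The key point is that each bounded value depends on the previous bounded value rather than on the plain running total, which is why a simple windowed SUM is not enough on its own.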

Related

Splitting up one row/column into one or multiple rows over two columns

First post here! I'm trying to update a stored procedure in my employer's Data Warehouse that links two tables on their IDs. The stored procedure is based on two columns in Table A: its primary key, and a column that contains the primary keys from Table B together with their domains. Note that it physically only needs Table A, since the IDs from B are in there. The old code used some PATINDEX/SUBSTRING code that assumes two things:
The FK's are always 7 characters long
Domain strings look like this "#xx-yyyy" where xx has to be two characters and yyyy four.
The problem however:
We've recently outgrown the 7-digit FK's and are now looking at 7 or 8 digits
Longer domain strings are implemented (where xx may be between 2 and 15 characters)
Sometimes there is no domain string. Just some FK's, delimited the same way.
The code is poorly documented and includes some ID exceptions (not a problem, just annoying)
Some info:
The Data Warehouse follows the Data Vault method and this procedure is stored on SQL Server and is triggered by SSIS. Subsequent to this procedure the HUB and Satellites are updated so in short: I can't just create a new stored procedure but instead will try to integrate my code into the old stored procedure.
The server is running on SQL Server 2012, so I can't use string_split
This platform is dying out so I just have to "keep it running" for this year.
An ID and domain are always separated by one space
If a record has no foreign keys it will always have an empty string
When a record has multiple (foreign) ID's it will always use the same delimiting, even when the individual FK's have no domain string next to it. Delimiter looks like this:
"12345678 #xx-xxxx[CR][CR][CR][LF]12345679 #yy-xxxx"
I've managed to create some code that will assign row numbers and is flexible in recognising the amount of FK's.
This is a piece of the old code:
DECLARE
@MAXCNT INT = (SELECT MAX(ROW) FROM #Worktable),
@C_ID INT,
@R_ID INT,
@SOURCE CHAR(5),
@STRING VARCHAR(20),
@VALUE CHAR(20),
@LEN INT,
@STARTSTRINGLEN INT =0,
@MAXSTRINGLEN INT,
@CNT INT = 1
WHILE @CNT <= @MAXCNT
BEGIN
SELECT @LEN=LEN(REQUESTS),@STRING =REQUESTS, @C_ID =C_ID FROM #Worktable WHERE ROW = @CNT
--1 REQUEST RELATED TO ONE CHANGE
IF @LEN < 17
BEGIN
INSERT INTO #ChangeRequest
SELECT @C_ID,SUBSTRING(@STRING,0,CASE WHEN PATINDEX('%-xxxx%',@STRING) = 0 THEN @LEN+1 ELSE PATINDEX('%-xxxx%',@STRING)-4 END)
--SELECT @STRING AS STRING, @LEN AS LENGTH
END
ELSE
-- MULTIPLE REQUESTS RELATED TO ONE CHANGE
SET @STARTSTRINGLEN = 0
WHILE @STARTSTRINGLEN<@LEN
BEGIN
SET @MAXSTRINGLEN = (SELECT PATINDEX('%-xxxx%',SUBSTRING(@STRING,@STARTSTRINGLEN,@STARTSTRINGLEN+17)))+7
INSERT INTO #ChangeRequest
--remove CRLF
SELECT @C_ID,
REPLACE(REPLACE(
substring(@string,@STARTSTRINGLEN+1,@MAXSTRINGLEN )
, CHAR(13), ''), CHAR(10), '')
SET @STARTSTRINGLEN=@STARTSTRINGLEN+@MAXSTRINGLEN
IF @MAXSTRINGLEN = 0 BEGIN SET @STARTSTRINGLEN = @len END
END
SET @CNT = @CNT + 1;
END;
Since this loop is assuming fixed lengths I need to make it more flexible. My code:
(CASE WHEN LEN([Requests]) = 0
THEN 0
ELSE (LEN(REPLACE(REPLACE(Requests,CHAR(10),'|'),CHAR(13),''))-LEN(REPLACE(REPLACE(Requests,CHAR(10),''),CHAR(13),'')))+1
END)
This consistently shows the accurate number of FK's and thus the number of rows to be created. Now I need to create a loop in which to physically create these rows and split the FK and domain into two columns.
Source table:
+---------+----------------------------------------------------------------------------+
| Some ID | Other ID's |
+---------+----------------------------------------------------------------------------+
| 1 | 21 |
| 2 | 31 #xxx-xxx |
| 3 | 41 #xxx-xxx[CR][CR][CR][LF]42 #yyy-xxx[CR][CR][CR][LF]43 #zzz-xxx |
| 4 | 51[CR][CR][CR][LF]52[CR][CR][CR][LF]53 #xxx-xxx[CR][CR][CR][LF]54 #yyy-xxx |
| 5 | <empty string> |
+---------+----------------------------------------------------------------------------+
Target table:
+-----+----------------+----------------+
| SID | OID | Domain |
+-----+----------------+----------------+
| 1 | 21 | <empty string> |
| 2 | 31 | xxx-xxx |
| 3 | 41 | xxx-xxx |
| 3 | 42 | yyy-xxx |
| 3 | 43 | zzz-xxx |
| 4 | 51 | <empty string> |
| 4 | 52 | <empty string> |
| 4 | 53 | xxx-xxx |
| 4 | 54 | yyy-xxx |
| 5 | <empty string> | <empty string> |
+-----+----------------+----------------+
Currently all rows are created but every one beyond the first for each SID is empty.
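To pin down the intended transformation independently of T-SQL, here is a small Python sketch of the split described above. It is only a model of the desired result, not the stored procedure itself; the function name split_requests and the marker parameter are hypothetical, and it assumes the delimiter always ends in a line feed and that the domain is prefixed with "#" as in the sample data.

```python
def split_requests(sid, raw, marker="#"):
    """Split one 'Other ID's' value into (SID, OID, Domain) rows.

    Assumptions (hypothetical helper): entries are separated by a run of
    carriage returns ending in a line feed, and an ID and its domain are
    separated by exactly one space.
    """
    if raw == "":
        # A record without foreign keys still yields one empty row.
        return [(sid, "", "")]
    rows = []
    for entry in raw.split("\n"):           # the line feed closes each delimiter
        entry = entry.replace("\r", "").strip()
        if not entry:
            continue
        oid, _, domain = entry.partition(" ")  # domain is "" when absent
        rows.append((sid, oid, domain.lstrip(marker)))
    return rows


SEP = "\r\r\r\n"  # [CR][CR][CR][LF]
source = [
    (1, "21"),
    (2, "31 #xxx-xxx"),
    (3, SEP.join(["41 #xxx-xxx", "42 #yyy-xxx", "43 #zzz-xxx"])),
    (4, SEP.join(["51", "52", "53 #xxx-xxx", "54 #yyy-xxx"])),
    (5, ""),
]
target = [row for sid, raw in source for row in split_requests(sid, raw)]
for row in target:
    print(row)
```

Splitting on the line feed first and only then on the first space avoids any dependence on fixed ID or domain lengths, which is the assumption that broke the old code.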

Why is my SQL Server 2017 query returning incorrect results?

Below is some repro code for an issue I am having.
If you run it on SQL Server 2017, you get a different (and incorrect) result compared with any other SQL Server version. Setting the database to a lower compatibility level on the SQL Server 2017 instance also makes it work correctly.
Why does this happen and how can I fix it without changing the compatibility level?
Actual Result
+--------------+--------------+----------------+---------+-----------+---------+------------+-------+
| IsPriorAfter | IsIdealAfter | IsCurrentAfter | IsPrior | IsCurrent | IsIdeal | SecurityID | PosID |
+--------------+--------------+----------------+---------+-----------+---------+------------+-------+
| 1 | 1 | 1 | 1 | 1 | 1 | 123 | 1 |
| 0 | 0 | 0 | 0 | 1 | 1 | 234 | 2 |
| 0 | 0 | 0 | 1 | 0 | 0 | 234 | 3 |
+--------------+--------------+----------------+---------+-----------+---------+------------+-------+
Expected Result
+--------------+--------------+----------------+---------+-----------+---------+------------+-------+
| IsPriorAfter | IsIdealAfter | IsCurrentAfter | IsPrior | IsCurrent | IsIdeal | SecurityID | PosID |
+--------------+--------------+----------------+---------+-----------+---------+------------+-------+
| 1 | 1 | 1 | 1 | 1 | 1 | 123 | 1 |
| 0 | 1 | 1 | 0 | 1 | 1 | 234 | 2 |
| 1 | 0 | 0 | 1 | 0 | 0 | 234 | 3 |
+--------------+--------------+----------------+---------+-----------+---------+------------+-------+
Repro
if object_id('ForSubQuery') is not null begin
DROP TABLE ForSubQuery
end
Create Table ForSubQuery
(
SecID int
)
INSERT INTO ForSubQuery SELECT 123
INSERT INTO ForSubQuery SELECT 234
GO
SELECT * FROM ForSubQuery
if object_id('MainTable') is not null begin
DROP TABLE MainTable
end
Create Table MainTable
(
IsPrior bit,
IsCurrent bit,
IsIdeal bit,
[SecurityID] int,
PosID int
)
INSERT INTO MainTable SELECT 1,1,1,123,1
INSERT INTO MainTable SELECT 0,1,1,234,2
INSERT INTO MainTable SELECT 1,0,0,234,3
GO
SELECT * FROM MainTable
SELECT
CASE
WHEN
Position.IsPrior = 1
AND Position.[SecurityID] in (SELECT
SecID
FROM ForSubQuery
)
THEN 1
ELSE 0
END AS IsPriorAfter
,CASE
WHEN
Position.IsIdeal = 1
AND [Position].[SecurityID] IN (SELECT
secid
FROM ForSubQuery
)
THEN 1
ELSE 0
END AS IsIdealAfter
,CASE
WHEN
Position.IsCurrent = 1
AND [Position].[SecurityID] IN (SELECT
secid
FROM ForSubQuery
)
THEN 1
ELSE 0
END AS IsCurrentAfter
, Position.*
FROM MainTable [Position]
order by Position.PosID
TLDR
This is a bug that has been fixed in CU8 so installing at least that CU and ideally the most recent one will fix it.
Pre SQL Server 2017
In SQL Server 2016 the plan looks as above. The IN is treated the same as EXISTS so it evaluates the following three columns.
CASE WHEN IsPrior = 1 AND EXISTS (SELECT * FROM ForSubQuery WHERE SecID = MainTable.SecurityID) THEN 1 ELSE 0 END AS IsPriorAfter
CASE WHEN IsIdeal = 1 AND EXISTS (SELECT * FROM ForSubQuery WHERE SecID = MainTable.SecurityID) THEN 1 ELSE 0 END AS IsIdealAfter
CASE WHEN IsCurrent = 1 AND EXISTS (SELECT * FROM ForSubQuery WHERE SecID = MainTable.SecurityID) THEN 1 ELSE 0 END AS IsCurrentAfter
Each subquery instance gets its own operator in the plan and the query returns the correct result but this is sub optimal as the identical subquery may be executed up to three times per row.
However, because each subquery is combined with another condition using AND, SQL Server can skip evaluating the subquery when that condition is not true. This is achieved by each nested loops operator carrying a pass-through predicate. For example, the one corresponding to the evaluation of IsPriorAfter has a pass-through predicate of IsFalseOrNull (IsPrior=1)
IsPrior=1 is a boolean expression that can return false, null, or true. The IsFalseOrNull then inverts the result and returns 1 for false, null and 0 for true. So the pass through predicate evaluates to true/1 if IsPrior is anything other than 1 (including NULL) and would then skip executing the sub query.
SQL Server 2017 RTM
SQL Server 2017 introduces a new optimisation rule CollapseIdenticalScalarSubquery. In the RTM version the execution plan is not correct.
Problem Plan
The sub query is now in a single operator and the pass through predicates are combined
IsFalseOrNull([IsCurrent]=(1)) OR IsFalseOrNull([IsIdeal]=(1)) OR IsFalseOrNull([IsPrior]=(1))
However this condition is not correct! It evaluates to true unless all three of IsPrior, IsIdeal, IsCurrent are 1.
So in your case the sub query is only executed once (for the first row in the table - where all three of the columns are equal to 1).
For the two other rows it should be executed but isn't. The nested loops has a probe column that is set to 1 if the correlated subquery returns a row. (Labelled Expr1016 in the plan). When execution is skipped this probe column is set to NULL
The final compute scalar in the plan has the following expression. When Expr1016 is null this evaluates to 0 for all three of your calculated columns using CASE.
[Expr1005] = Scalar Operator(CASE WHEN [IsPrior]=(1) AND [Expr1016] THEN (1) ELSE (0) END),
[Expr1009] = Scalar Operator(CASE WHEN [IsIdeal]=(1) AND [Expr1016] THEN (1) ELSE (0) END),
[Expr1013] = Scalar Operator(CASE WHEN [IsCurrent]=(1) AND [Expr1016] THEN (1) ELSE (0) END)
SQL Server 2017 patched
The final fixed plan after the CU is applied has the same plan shape as the 2017 RTM plan (with the subquery only appearing once) but the pass through predicate is now
IsFalseOrNull([IsCurrent]=(1)) AND IsFalseOrNull([IsIdeal]=(1)) AND IsFalseOrNull([IsPrior]=(1))
This only evaluates to true if none of those columns have a value of 1 so the sub query is now evaluated exactly when needed.
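The difference between the two pass-through predicates can be checked with a small Python model of the plan logic. This is a sketch of the behaviour described above, not SQL Server's actual implementation; the names is_false_or_null and evaluate are assumptions, and Python's any/all stand in for the OR and AND combinations.

```python
def is_false_or_null(bit):
    # Models IsFalseOrNull([col]=(1)): true unless the column equals 1.
    return bit != 1  # None (NULL) also counts as "not 1"


sec_ids = {123, 234}  # contents of ForSubQuery
# (IsPrior, IsCurrent, IsIdeal, SecurityID, PosID), as in MainTable
rows = [(1, 1, 1, 123, 1), (0, 1, 1, 234, 2), (1, 0, 0, 234, 3)]


def evaluate(row, combine):
    """Return (IsPriorAfter, IsIdealAfter, IsCurrentAfter) for one row."""
    is_prior, is_current, is_ideal, security_id, _pos_id = row
    # Pass-through predicate: when true, the subquery is skipped.
    skip = combine(is_false_or_null(f) for f in (is_current, is_ideal, is_prior))
    # Probe column: NULL (None) when skipped, else whether a match exists.
    probe = None if skip else (security_id in sec_ids)
    # CASE WHEN [flag]=(1) AND [probe] THEN (1) ELSE (0) END
    return tuple(1 if (flag == 1 and probe) else 0
                 for flag in (is_prior, is_ideal, is_current))


buggy = [evaluate(r, any) for r in rows]  # RTM: predicates combined with OR
fixed = [evaluate(r, all) for r in rows]  # patched: combined with AND
print("RTM:    ", buggy)
print("patched:", fixed)
```

With OR, the subquery is skipped as soon as any one flag is not 1, so rows 2 and 3 get a NULL probe and all three computed columns fall through to 0. With AND, it is skipped only when no flag is 1.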

SQL Update every time column hits x number of rows

I have a table called question with two columns. It contains more than 160K rows. Example:
id | questionID
1 | 1
2 | 2
3 | 3
4 | 4
5 | 5
6 | 6
7 | 7
8 | 8
9 | 9
10 | 10
...
I would like to update the questionID column so it looks like the example below. After every x rows, the numbering needs to restart from 1. The final result should be something like this:
id | questionID
1 | 1
2 | 2
3 | 3
4 | 4
5 | 1
6 | 2
7 | 3
8 | 4
9 | 1
10 | 2
...
The table contains so many rows that doing it manually is not an option.
What could be the easiest way to update the table?
Any help will be appreciated. Thanks
You can use the modulus operator. Both SQL Server and MySQL support %:
UPDATE question
SET questionID = 1 + ((id - 1) % 4);
If the numbers have gaps, then you need to do something different. In that case, the solution is highly database dependent.
Simply use modulo operator:
UPDATE question
SET questionID = CASE WHEN id % 4 = 0 THEN 4 ELSE id % 4 END
or, if id has gaps and you are using SQL Server, then you can use this:
UPDATE q1
SET questionID = (CASE WHEN q2.rn % 4 = 0 THEN 4 ELSE q2.rn % 4 END)
FROM question q1
INNER JOIN (
SELECT id, ROW_NUMBER() OVER (ORDER by id) AS rn
FROM question ) q2 ON q1.ID = q2.ID
UPDATE question SET questionID = (questionID - 1) % 4 + 1
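The modulus formulas are easy to verify outside the database. The following Python sketch is only a sanity check of the arithmetic, not a replacement for the UPDATE statements; the list gappy_ids is made-up sample data used to illustrate the case where id has gaps.

```python
ids = list(range(1, 11))

# 1 + ((id - 1) % 4): shift to zero-based, wrap, shift back.
formula_a = [1 + (i - 1) % 4 for i in ids]

# CASE WHEN id % 4 = 0 THEN 4 ELSE id % 4 END
formula_b = [4 if i % 4 == 0 else i % 4 for i in ids]

print(formula_a)
print(formula_b)

# With gaps in id, number the rows first (like ROW_NUMBER() OVER (ORDER BY id))
# and apply the same formula to the row number instead of the id.
gappy_ids = [3, 7, 8, 15, 21, 22]  # hypothetical ids with gaps
by_row_number = {
    row_id: 1 + (rn - 1) % 4
    for rn, row_id in enumerate(sorted(gappy_ids), start=1)
}
print(by_row_number)
```

Both formulas produce the same cycle for positive ids; the row-number variant keeps the cycle intact even when ids are missing.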

Getting list of spatial points from polygon within query

I have a database with various defined polygons which represent the outer boundaries of buildings on a map of a business park.
If I perform a Select within Management Studio, I get a result similar to the following:
LocationCode LocationPolygon
1 POLYGON((1 1, 2 1, 2 2, 1 2, 1 1))
2 POLYGON((10 10, 20 10, 20 20, 10 20, 10 10))
What I would like to get is the following:
LocationCode PointX PointY
1 1 1
1 2 1
1 2 2
1 1 2
2 10 10
etc etc etc
I cannot see any way to extract the points from the polygon using SQL Server from within a SQL query. I can evidently take the whole polygon and then do the rest on the client, but I would rather do it in SQL if possible.
Any help appreciated in pointing me in the right direction.
I've answered a similar question before and that time I used a user defined function to extract the points and return a table. Assuming a table Locations defined as: (LocationCode int, LocationPolygon geometry) then the following function:
CREATE FUNCTION dbo.GetPoints()
RETURNS @ret TABLE (LocationCode INT, PointX INT, PointY INT)
AS
BEGIN
DECLARE @max INT
SET @max = (SELECT MAX(LocationPolygon.STNumPoints()) FROM Locations)
;WITH Sequence(Number) AS
(
SELECT 1 AS Number
UNION ALL
SELECT Number + 1
FROM Sequence
WHERE Number < @max
)
INSERT INTO @ret
SELECT
l.LocationCode
,l.LocationPolygon.STPointN(nums.number).STX AS PointX
,l.LocationPolygon.STPointN(nums.number).STY AS PointY
FROM Locations l, Sequence nums
WHERE nums.number <= l.LocationPolygon.STNumPoints()
RETURN
END;
Executing it as SELECT DISTINCT * FROM dbo.GetPoints() ORDER BY LocationCode; gives the following result (using your sample data):
| LOCATIONCODE | POINTX | POINTY |
|--------------|--------|--------|
| 1 | 1 | 1 |
| 1 | 1 | 2 |
| 1 | 2 | 1 |
| 1 | 2 | 2 |
| 2 | 10 | 10 |
| 2 | 10 | 20 |
| 2 | 20 | 10 |
| 2 | 20 | 20 |
I'm sure the function can be improved, but it should give you some ideas on how this problem can be solved.
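The function pairs every location with a sequence of numbers and keeps only the numbers up to that polygon's point count. The Python sketch below models that pairing on the sample data. It parses the WKT text directly instead of calling STPointN, the helper name wkt_polygon_points is hypothetical, and it assumes simple single-ring polygons.

```python
def wkt_polygon_points(wkt):
    """Parse a single-ring 'POLYGON((x y, ...))' string into (x, y) tuples.

    Hypothetical helper for illustration; it does not handle holes or
    other geometry types.
    """
    inner = wkt[wkt.index("((") + 2 : wkt.rindex("))")]
    return [tuple(float(c) for c in pair.split()) for pair in inner.split(",")]


locations = {
    1: "POLYGON((1 1, 2 1, 2 2, 1 2, 1 1))",
    2: "POLYGON((10 10, 20 10, 20 20, 10 20, 10 10))",
}

# Sequence 1..max point count, like the recursive CTE in the function.
max_points = max(len(wkt_polygon_points(w)) for w in locations.values())
sequence = range(1, max_points + 1)

rows = []
for code, wkt in locations.items():
    points = wkt_polygon_points(wkt)
    for n in sequence:
        if n <= len(points):          # WHERE number <= STNumPoints()
            x, y = points[n - 1]      # STPointN is 1-based
            rows.append((code, x, y))

for row in rows:
    print(row)
```

Each ring repeats its first vertex as the closing point, so every polygon here yields five rows. That repeated point is what the DISTINCT in the query removes, at the cost of losing the original vertex order.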

SSIS data manipulation

I am currently using SSIS to read the data from a table, modify a column and insert it into a new table.
The modification I want to perform will occur if a previously read row has an identical value in a particular column.
My original idea was to use a C# script with a dictionary containing previously read values and a count of how many times each has been seen.
My problem is that I cannot save a dictionary as an SSIS variable. Is it possible to save a C# variable inside an SSIS script component, or is there another method I could use to accomplish this?
As an example, the data below
/--------------------------------\
| Unique Column | To be modified |
|--------------------------------|
| X5FG | 0 |
| QFJD | 0 |
| X5FG | 0 |
| X5FG | 0 |
| DFHG | 0 |
| DDFB | 0 |
| DDFB | 0 |
will be transformed into
/--------------------------------\
| Unique Column | To be modified |
|--------------------------------|
| X5FG | 0 |
| QFJD | 0 |
| X5FG | 1 |
| X5FG | 2 |
| DFHG | 0 |
| DDFB | 0 |
| DDFB | 1 |
Rather than use a cursor, just use a set-based statement.
Assuming SQL 2005+ or Oracle, use the ROW_NUMBER function in your source query like so. What's important to note is the PARTITION BY defines your group/when the numbers restart. The ORDER BY clause directs the order in which the numbers are applied (most recent mod date, oldest first, highest salary, etc)
SELECT
D.*
, ROW_NUMBER() OVER (PARTITION BY D.unique_column ORDER BY D.unique_column ) -1 AS keeper
FROM
(
SELECT 'X5FG'
UNION ALL SELECT 'QFJD'
UNION ALL SELECT 'X5FG'
UNION ALL SELECT 'X5FG'
UNION ALL SELECT 'DFHG'
UNION ALL SELECT 'DDFB'
UNION ALL SELECT 'DDFB'
) D (unique_column)
Results
unique_column keeper
DDFB 0
DDFB 1
DFHG 0
QFJD 0
X5FG 0
X5FG 1
X5FG 2
You can create a script component. When given the choice, select the row transformation (instead of source or destination).
In the script, you can create a global variable that you will update in the process row method.
Perhaps SSIS isn't the solution for this one task. Using a cursor with a table variable you would be able to accomplish the same result. I'm not a fan of cursors in most situations, but when you need to iterate through data that depends on previous iterations or is self-reliant then it can be useful. Here's an example:
DECLARE
@value varchar(4)
,@count int
DECLARE @dictionary TABLE ( value varchar(4), count int )
DECLARE cur CURSOR FOR
(SELECT UniqueColumn FROM SourceTable s)
OPEN cur;
FETCH NEXT FROM cur INTO @value;
WHILE @@FETCH_STATUS = 0
BEGIN
DECLARE @innerCount int = 0
IF NOT EXISTS (SELECT 1 FROM @dictionary WHERE value = @value)
BEGIN
INSERT INTO @dictionary ( value, count )
VALUES( @value, 0 )
END
ELSE
BEGIN
SET @innerCount = (SELECT count + 1 FROM @dictionary WHERE value = @value)
UPDATE @dictionary
SET count = @innerCount
WHERE value = @value
END
INSERT INTO TargetTable ( value, count )
VALUES (@value, @innerCount)
FETCH NEXT FROM cur INTO @value;
END
-- release the cursor once all rows are processed
CLOSE cur;
DEALLOCATE cur;
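The dictionary idea from the question and the cursor loop share the same logic: keep a per-value counter and emit it before incrementing. Here is a minimal Python sketch of that logic on the question's sample data. It only illustrates the counting rule and is not SSIS or T-SQL code; the function name running_counts is an assumption.

```python
from collections import defaultdict


def running_counts(values):
    """Pair each value with the number of times it was seen before.

    Hypothetical helper mirroring the dictionary-based approach.
    """
    seen = defaultdict(int)  # value -> occurrences so far
    out = []
    for value in values:
        out.append((value, seen[value]))  # 0 on first sight, then 1, 2, ...
        seen[value] += 1
    return out


sample = ["X5FG", "QFJD", "X5FG", "X5FG", "DFHG", "DDFB", "DDFB"]
for value, count in running_counts(sample):
    print(value, count)
```

Unlike the ROW_NUMBER query, which returns rows grouped by value, this keeps the rows in their original read order.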

Resources