Tsql group by clause with exceptions - sql-server

I have a problem with a query.
This is the data (order by Timestamp):
Data
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract the distinct values and the date of the first occurrence of each. The exception is that rows should be grouped only while the value is not interrupted by a different value in that timeframe.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I'm sure there is an easier way to do it; I just can't think of it. Could anyone help?
This is what I started with - probably could work with that. This is a query that should locate when a value is changed.
> SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
> d1.Value <> d2.Value
It could probably be done with good use of ROW_NUMBER, but I can't manage it.

Sample data:
create table #T (ID int, Value int, Timestamp date)
insert into #T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
select *,ROW_NUMBER() OVER (ORDER By TimeStamp) as rn --TODO - specific columns better than *
from #T
), Firsts as (
select
ov1.* --TODO - specific columns better than *
from
OrderedValues ov1
left join
OrderedValues ov2
on
ov1.Value = ov2.Value and
ov1.rn = ov2.rn + 1
where
ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If they are sequential and gap-free, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second CTE, Firsts, simply finds rows where there isn't an immediately preceding row with the same Value.
Result:
ID Value Timestamp rn
----------- ----------- ---------- --------------------
1 0 2001-01-01 1
3 1 2003-01-01 3
5 0 2005-01-01 5
6 2 2006-01-01 6
You can order by rn if you need the results in this specific order.
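The same "no immediately preceding row with the same Value" rule can be sketched outside SQL. This is a minimal Python illustration using the sample data from the question, not a replacement for the query; the function name first_of_each_run is made up for this example.

```python
# Rows as (ID, Value, Timestamp), already sorted by Timestamp.
rows = [
    (1, 0, "2001-01-01"),
    (2, 0, "2002-01-01"),
    (3, 1, "2003-01-01"),
    (4, 1, "2004-01-01"),
    (5, 0, "2005-01-01"),
    (6, 2, "2006-01-01"),
    (7, 2, "2007-01-01"),
    (8, 2, "2008-01-01"),
]


def first_of_each_run(sorted_rows):
    """Keep a row only when the row just before it has a different Value."""
    firsts = []
    previous_value = object()  # sentinel that never equals a real Value
    for row in sorted_rows:
        if row[1] != previous_value:
            firsts.append(row)
        previous_value = row[1]
    return firsts


result = first_of_each_run(rows)
for row in result:
    print(row)
```

Value 0 appears twice in the output (IDs 1 and 5) because the run of 1s interrupts it, which is the behaviour a plain GROUP BY cannot give.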

Related

Calculating the stddev and avg between the most recent number and all the other numbers in a running list snowflake

I have a dataset that looks something like this:
id committed delivered timestamp stddev
1 10 8 01-02-2022 ?
2 20 15 01-14-2022 ?
3 12 12 01-30-2022 ?
4 2 0 02-14-2022 ?
.
.
99 null
I am trying to calculate the standard deviation between sprint x and all the sprints after sprint x; for example, the standard deviation and avg over sprints 1, 2, 3 & 4, then 2, 3 & 4, then 3 & 4, and so on. If there are no records after 4, that stddev would be null.
With the current Snowflake functions, I am unable to calculate the stddev at all, let alone do something with a lag/lead function.
Does anyone have any advice? Thanks in advance!
Update:
I've figured out how to calculate a moving avg over sprint x and the next sprint, but not for all previous sprints:
(delivered + lead(delivered) over (partition by id order by timestamp asc)) / 2
For two values a and b, stddev can also be calculated as abs(a - b) / sqrt(2)
You're looking for a frame clause -- this is part of the window function that can specify which rows in the current partition to use in the calculation.
select
id,
stddev(delivered) over (
order by id asc
rows between current row and unbounded following
) as stddev,
avg(delivered) over (
order by id asc
rows between current row and unbounded following
) as avg
from my_data
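To make the frame "current row and unbounded following" concrete, here is a small Python sketch that computes the same thing by slicing a list. It assumes the four "delivered" values from the question, ordered by id, and uses the sample standard deviation, which is undefined for a single value; the name trailing_stats is made up for this example.

```python
from statistics import mean, stdev

# "delivered" values from the question, ordered by id.
delivered = [8, 15, 12, 0]


def trailing_stats(values):
    """For each position, aggregate over that value and every value after it."""
    out = []
    for i in range(len(values)):
        frame = values[i:]  # current row and unbounded following
        # Sample stddev needs at least two values; otherwise report None.
        sd = stdev(frame) if len(frame) > 1 else None
        out.append((sd, mean(frame)))
    return out


stats = trailing_stats(delivered)
for sprint_id, (sd, avg) in enumerate(stats, start=1):
    print(sprint_id, sd, avg)
```

The last row has no following rows, so its frame holds a single value and the standard deviation comes back as None.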
tconbeer is 100% correct, but here is the code with a count added to "show it working",
and your example data mashed into a VALUES section to avoid making a table.
I also stripped out timestamp, as it's not used in this demo. Normally I would order by it, but I could not see the pattern in the values, so I dropped it; it's not material to the example.
SELECT t.*
,count(delivered) over ( order by id asc rows between current row and unbounded following ) as _count
,stddev(delivered) over ( order by id asc rows between current row and unbounded following ) as stddev
,avg(delivered) over ( order by id asc rows between current row and unbounded following ) as avg
FROM VALUES
(1, 10, 8),
(2, 20, 15),
(3, 12, 12),
(4, 2, 0),
(5, 2, 0),
(6, 0, 0)
t(id, committed, delivered)
ORDER BY 1;
gives:
ID COMMITTED DELIVERED _COUNT STDDEV AVG
1 10 8 6 6.765106577 5.833
2 20 15 5 7.469939759 5.4
3 12 12 4 6 3
4 2 0 3 0 0
5 2 0 2 0 0
6 0 0 1 null 0
You can create a dummy table with sequentially generated ids for a particular range using the generator function, and LEFT JOIN it to your table. This way you get rows with NULL values where an id is not present, and you can then use lag / lead to get the average.
--- untested
select seq4() as id1 , TMP.* from table(generator(rowcount => 10)) v
LEFT JOIN (SELECT * FROM (
SELECT 1 as id , 8 As committed , '01-02-2022' as delivered UNION ALL
SELECT 2 as id , 20 As committed , '01-14-2022' as delivered UNION ALL
SELECT 3 as id , 12 As committed , '01-30-2022' as delivered UNION ALL
SELECT 5 as id , 2 As committed , '02-14-2022' as delivered
)) TMP
ON trim(id1) = trim(tmp.id)

Identify duplicates based on multiple columns and parent row

This is an example of table data that I am working on (the table contained a lot of columns, I am showing here only the relevant ones):
Id job_number status parent_id
1 42FWD-42 0 0
2 42FWD-42 1 1
3 42FWD-42 5 1
Id is auto generated. parent_id links the job using the id.
When a new job is created via the app, a new row is created (with status "0"). The auto-generated Id is then used for subsequent rows of same job, and set as parent id.
Another record with status "1" (which is code for started) is also created just after parent record.
Explanation of the problem: due to a bug in the app, there are duplicate sets of rows for the same job.
Example of problem
Id job_number status parent_id
1 42FWD-42 0 0
2 42FWD-42 0 0
3 42FWD-42 1 1
4 42FWD-42 1 2
5 42FWD-42 5 1
As you can see from this example, due to the bug, there are 2 rows with "0" status for the same job, and 2 rows with "1" status.
This creates a lot of problems in operation in app where the job is updated using the job number.
The status number should not repeat for a specific job.
What I want to do is to find all duplicates like those in example. For example, I want a query where I can find all duplicates which have same job number, but different parent_id and NO "5" status.
Example result using the example table above, I need the query to return:
Id job_number status parent_id
2 42FWD-42 0 0
4 42FWD-42 1 2
Explanation of this result:
Row with Id=1 is considered the correct record because it has an associated record with status "5"
Row with Id=2 is considered duplicate and its associated records are also considered duplicate
Another possible case: there are duplicate rows, but none have status=5. These rows can be discarded, ie need not be shown in results.
A brief explanation of how the query works will be appreciated.
EDIT:
I forgot to add an important piece of information:
job_number is case sensitive.
That is, 42FWD-42 and 42fwd-42 are different, valid job numbers. They should not be considered duplicates; they are 2 separate jobs.
The reason for this is the actual job number is not small text as in my example. It is a long string like a guid.
First I must mention that you should block identical rows by means of a unique constraint. I suggest that once you have eliminated all duplicates, you add such a constraint to keep this from happening again.
Now for your question: you can do this by grouping on the duplicated columns and keeping only the groups that contain more than one row.
Here is an example
create table #t (id int, job_number varchar(10), status int, parent_id int)
insert into #t
values (1, '42FWD-42', 0, 0), (2, '42FWD-42', 0, 0), (3, '42FWD-42', 1, 1), (4, '42FWD-42', 1, 2), (5, '42FWD-42', 5, 1)
select max(t.id) as id, t.job_number, t.status
from #t t
group by t.job_number, t.status
having count(*) > 1
the result is
id job_number status
2 42FWD-42 0
4 42FWD-42 1
and to also get the parent_id you can add a correlated subquery
select max(t.id) as id,
t.job_number,
t.status,
(select t2.parent_id from #t t2 where t2.id = max(t.id)) as parent_id
from #t t
group by t.job_number, t.status
having count(*) > 1
this returns
id job_number status parent_id
2 42FWD-42 0 0
4 42FWD-42 1 2
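For readers who find the GROUP BY / HAVING step easier to follow procedurally, here is a minimal Python sketch of the same idea: bucket rows by (job_number, status), keep only buckets with more than one row, and report the row with the highest id from each. The function name duplicate_rows is made up for this example, and the data is the example table from the question.

```python
from collections import defaultdict

# Rows as (id, job_number, status, parent_id), taken from the question.
rows = [
    (1, "42FWD-42", 0, 0),
    (2, "42FWD-42", 0, 0),
    (3, "42FWD-42", 1, 1),
    (4, "42FWD-42", 1, 2),
    (5, "42FWD-42", 5, 1),
]


def duplicate_rows(table):
    """Return the highest-id row of every (job_number, status) group with more than one row."""
    groups = defaultdict(list)
    for row in table:
        _row_id, job_number, status, _parent_id = row
        groups[(job_number, status)].append(row)
    dupes = [max(g, key=lambda r: r[0]) for g in groups.values() if len(g) > 1]
    return sorted(dupes)


duplicates = duplicate_rows(rows)
for row in duplicates:
    print(row)
```

Python compares string keys case-sensitively, so "42FWD-42" and "42fwd-42" would land in different buckets here without any extra work.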
EDIT
To solve the additional problem in the edit of your question, about case sensitivity, you can apply a case-sensitive COLLATE to the column both where you select it and where you group on it.
This should do it:
create table #t (id int, job_number varchar(10), status int, parent_id int)
insert into #t
values (1, '42FWD-42', 0, 0),
(2, '42FWD-42', 0, 0),
(3, '42FWD-42', 1, 1),
(4, '42fwd-42', 1, 2), -- LOWERCASE !!!
(5, '42FWD-42', 5, 1)
select max(t.id) as id,
t.job_number COLLATE Latin1_General_CS_AS,
t.status,
(select t2.parent_id from #t t2 where t2.id = max(t.id)) as parent_id
from #t t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1
and now the result will be
id job_number status parent_id
2 42FWD-42 0 0
Yet another edit
Now, suppose you need to use these duplicate ids in another query; you could do something like this
select t.*
from #t t
where t.id in ( select max(t.id) as id
from #t t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1
)
What I am doing here is getting only the duplicate ids in a form that can be used to feed a where clause in another query.
This way you can use the result set in any way you wish.
Also note that for this we don't need the self join to retrieve the parent_id anymore.
One possible use of this could be to delete duplicate rows, you can write
delete from yourtable
where id in ( select max(t.id) as id
from #t t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1
)
You can try using the ROW_NUMBER window function, partitioned by job_number, to find the duplicate status-0 rows and their ids, then use a recursive CTE to find all the erroneous records linked to those ids through parent_id.
Query 1:
;WITH CTE AS (
SELECT *,ROW_NUMBER() OVER (PARTITION BY job_number ORDER BY Id) rn
FROM T
WHERE status = 0
), CTE1 AS (
SELECT id,job_number,status,parent_id
FROM CTE
WHERE rn > 1
UNION ALL
SELECT t.id,t.job_number,t.status,t.parent_id
FROM CTE1 c INNER JOIN T t
ON c.id = t.parent_id
)
SELECT *
FROM CTE1
Results:
| id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 2 | 42FWD-42 | 0 | 0 |
| 4 | 42FWD-42 | 1 | 2 |
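The two CTEs above can be read as "find the extra status-0 rows, then follow parent_id downwards". A minimal Python sketch of that reading follows; the function name duplicate_chains is made up for this example, and it assumes no row is its own parent, since otherwise the loop would not terminate.

```python
# Rows as (id, job_number, status, parent_id), taken from the question.
rows = [
    (1, "42FWD-42", 0, 0),
    (2, "42FWD-42", 0, 0),
    (3, "42FWD-42", 1, 1),
    (4, "42FWD-42", 1, 2),
    (5, "42FWD-42", 5, 1),
]


def duplicate_chains(table):
    """Return every extra status-0 row per job, plus all rows descending from it."""
    seen_jobs = set()
    roots = []
    for row in sorted(table):  # lowest id first, like ORDER BY Id
        _row_id, job_number, status, _parent_id = row
        if status != 0:
            continue
        if job_number in seen_jobs:
            roots.append(row)  # corresponds to rn > 1
        else:
            seen_jobs.add(job_number)

    found = []
    frontier = roots
    while frontier:  # corresponds to the recursive member of the CTE
        found.extend(frontier)
        parent_ids = {r[0] for r in frontier}
        frontier = [r for r in table if r[3] in parent_ids]
    return found


chain = duplicate_chains(rows)
for row in chain:
    print(row)
```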

SQL Server - recursively calculated column

I need to calculate a column based on a seed row, where each row's value uses the "previous" row's values. I feel like this should be a recursive query but I can't quite wrap my head around it.
To illustrate:
BOP EOP IN OUT Wk_Num
--------------------------------------
6 4 10 12 1
? ? 2 6 2
? ? 7 5 3
... ... ... ... ...
So the next row's BOP and EOP columns need to be calculated using the seed row. The IN and OUT values are already present in the table.
BOP = (previous row's EOP)
EOP = (previous row's EOP) + IN - OUT (where IN and OUT are from the current row)
OUTPUT of this example should look like:
BOP EOP IN OUT Wk_num
-------------------------------------
6 4 10 12 1
4 0 2 6 2
0 2 7 5 3
2 6 4 0 4
... ... ... ... ...
You can use a Recursive CTE for this;
WITH RecursiveCTE AS (
-- Base Case
SELECT
BOP,
EOP,
[IN],
[OUT],
[WK_Num]
FROM [someTable]
WHERE BOP IS NOT NULL
UNION ALL
SELECT
r.EOP AS BOP,
r.EOP + r2.[In] - r2.[Out] AS EOP,
r2.[IN],
r2.[OUT],
r2.[WK_Num]
FROM [someTable] r2
INNER JOIN [RecursiveCTE] r
ON r2.[Wk_Num] = r.[Wk_Num] + 1
)
SELECT * FROM RecursiveCTE
Here is a SQL Fiddle: http://sqlfiddle.com/#!18/e041f/1
You basically define the base case as the first row (the row where BOP IS NOT NULL), then join to each following week with the Wk_Num + 1 join condition and reference the previous row's values.
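The recursion is just "carry the previous row's EOP forward". A minimal Python sketch of that step, using the seed row and the IN/OUT values from the question, looks like this; the function name fill_balances is made up for this example.

```python
def fill_balances(seed, movements):
    """seed is (BOP, EOP, IN, OUT, Wk_Num); movements is a list of (IN, OUT, Wk_Num) in week order."""
    rows = [seed]
    for qty_in, qty_out, wk_num in movements:
        prev_eop = rows[-1][1]               # BOP = previous row's EOP
        eop = prev_eop + qty_in - qty_out    # EOP = previous EOP + IN - OUT
        rows.append((prev_eop, eop, qty_in, qty_out, wk_num))
    return rows


balances = fill_balances((6, 4, 10, 12, 1), [(2, 6, 2), (7, 5, 3), (4, 0, 4)])
for row in balances:
    print(row)
```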
You can use SUM OVER like this:
CREATE TABLE #TempTable (T_In INT, T_Out INT, T_WeekNum INT)
INSERT INTO #TempTable VALUES (6, 0, 0)
INSERT INTO #TempTable VALUES (10, 12, 1)
INSERT INTO #TempTable VALUES (2, 6, 2)
INSERT INTO #TempTable VALUES (7, 5, 3)
INSERT INTO #TempTable VALUES (4, 0, 4)
SELECT
SUM(T_In - T_Out) OVER(ORDER BY T_WeekNum ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) - T_In + T_Out AS T_Bop,
SUM(T_In - T_Out) OVER(ORDER BY T_WeekNum ASC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS T_Eop,
T_In,
T_Out,
T_WeekNum
FROM #TempTable
The running sum is the same for BOP and EOP, but for BOP the current row's net change (T_In - T_Out) is subtracted again, which gives the running total as of the previous row.
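The running-total version can be sketched with a prefix sum. This Python illustration uses the same five rows as the inserts above, including the week-0 row that carries the opening balance of 6 as T_In; the variable names are made up for this example.

```python
from itertools import accumulate

# (T_In, T_Out, T_WeekNum); week 0 holds the opening balance as T_In.
weeks = [(6, 0, 0), (10, 12, 1), (2, 6, 2), (7, 5, 3), (4, 0, 4)]

net = [t_in - t_out for t_in, t_out, _ in weeks]  # net change per week
eop = list(accumulate(net))                        # running total through the current row
bop = [e - n for e, n in zip(eop, net)]            # running total before the current row

for (t_in, t_out, wk), b, e in zip(weeks, bop, eop):
    print(b, e, t_in, t_out, wk)
```

Unlike the recursive approach, nothing here depends on the previous output row, only on the inputs, which is why a single windowed SUM is enough.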

UPDATE with non-CLR product aggregate results in off-by-one

I was answering another question and ran into a strange outcome - the output of a product aggregate (without CLR) was different when used in a SELECT vs UPDATE.
This is simplified from the original question to minimally reproduce the problem:
GroupKey RowIndex A
----------- ----------- -----------
25 1 5
25 2 6
25 3 NULL
26 1 3
26 2 4
26 3 NULL
The goal is for each group key to update the A column of each row with a RowIndex = 3 to the product of the A columns of each row with RowIndex IN (1, 2), so this would produce the following changes:
GroupKey RowIndex A
----------- ----------- -----------
25 3 30
26 3 12
So this is the code I used:
UPDATE T SET
A = Products.Product
FROM #Table T
INNER JOIN (
SELECT
GroupKey,
EXP(SUM(LOG(A))) AS Product
FROM #Table
WHERE RowIndex IN (1, 2)
GROUP BY
GroupKey
) Products
ON Products.GroupKey = T.GroupKey
WHERE T.RowIndex = 3
SELECT * FROM #Table WHERE RowIndex = 3
Which then produced the off-by-one results:
GroupKey RowIndex A
----------- ----------- -----------
25 3 29
26 3 12
If I just run the sub-query, I see the correct values.
GroupKey Product
----------- ----------------------
25 30
26 12
Here's the full script to make it easy to play with. I can't figure out where the off-by-one is coming from.
CREATE TABLE #Table (GroupKey INT, RowIndex INT, A INT)
INSERT #Table VALUES (25, 1, 5), (25, 2, 6), (25, 3, NULL), (26, 1, 3), (26, 2, 4), (26, 3, NULL)
SELECT * FROM #Table
SELECT
GroupKey,
EXP(SUM(LOG(A))) AS Product
FROM #Table
WHERE RowIndex IN (1, 2)
GROUP BY
GroupKey
UPDATE T SET
A = Products.Product
FROM #Table T
INNER JOIN (
SELECT
GroupKey,
EXP(SUM(LOG(A))) AS Product
FROM #Table
WHERE RowIndex IN (1, 2)
GROUP BY
GroupKey
) Products
ON Products.GroupKey = T.GroupKey
WHERE T.RowIndex = 3
SELECT * FROM #Table WHERE RowIndex = 3
Here are some references I came across:
Non-CLR Aggregate: http://michaeljswart.com/2011/03/the-aggregate-function-product/
Original question: Set one row fields as a multiplication of 2 others
I'd say that this cute "PRODUCT" aggregate is inherently unreliable if you want to work with ints - EXP and LOG are only defined against the float type and so we get rounding errors creeping in.
Why they're not consistently appearing, I couldn't say, except to suggest that different queries may cause changes in evaluation orders.
As a simpler example of how this can go wrong:
select CAST(EXP(LOG(5)) as int)
This can produce 4. EXP and LOG together can produce a value that is just less than 5, and when converting a float to int, SQL Server truncates rather than rounds.
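The truncation effect is easy to reproduce in any language with binary floats. The Python sketch below uses a literal just below 5 rather than relying on exp/log, because the exact result of exp(log(5)) depends on the math library. Rounding before converting to an integer is shown as one possible workaround; it is an assumption of this example, not something the answer above proposes, and it only applies to positive inputs.

```python
import math

# A float just below 5, standing in for what EXP(LOG(5)) may return.
just_below_five = 4.999999999999999

truncated = int(just_below_five)   # truncation drops the fraction
rounded = round(just_below_five)   # rounding goes to the nearest integer


def product(values):
    """Product of positive numbers via exp(sum(log)), rounded before becoming an int."""
    return round(math.exp(sum(math.log(v) for v in values)))


print(truncated, rounded)
print(product([5, 6]), product([3, 4]))
```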

Highlighting smallest Value, if not within X of second smallest value in column group?

I can highlight the smallest value in each row of a column group of an SSRS tablix with no issue by adding a hidden Min(Value) column outside the group and comparing against it using ReportItems!MinVale.Value.
The column is calculated as:
=IIF(Fields!TotalSales.Value=0
,0
,Fields!Sales.Value / IIF(Fields!TotalSales.Value<>0
,Fields!TotalSales.Value
,1
)
)
I have been asked to only highlight it if it is less by a margin of 1% or more.
This record should not be highlighted as it is only .02% less than the next lowest value.
I cannot figure out a way to calculate the second lowest value for comparison and trying to google it hasn't turned up anything either.
Is it possible to calculate the second smallest value in each row within a column group?
(If possible I would like to avoid changing the underlying T-SQL query to rank the results in a new field and highlight based on that, as this is a small part of a far larger set of reports.)
CREATE TABLE #MyTable
(
RowId VARCHAR(20),
Field1 INT,
Field2 INT,
Field3 INT,
Field4 INT
)
INSERT INTO #MyTable
VALUES
('A', 1, 2, 3, 4 ),
('B', 2, 3, 4, 1 ),
('C', 3, 4, 1, 2 ),
('D', 4, 1, 2, 3 )
SELECT m.*,
u.FieldName,
u.ValueRank
FROM #MyTable m
LEFT JOIN
(
SELECT u.RowId,
u.FieldName,
u.Value,
RANK() OVER(PARTITION BY u.RowId ORDER BY VALUE DESC) ValueRank
FROM #MyTable
UNPIVOT
(
Value
for FieldName in (Field1, Field2, Field3, Field4)
) u
) u
ON u.RowId = m.RowId
AND u.ValueRank = 2
Here is the output:
RowId Field1 Field2 Field3 Field4 FieldName ValueRank
A 1 2 3 4 Field3 2
B 2 3 4 1 Field2 2
C 3 4 1 2 Field1 2
D 4 1 2 3 Field4 2
I unpivoted the columns so I could rank the values, and then I pull only the row with rank 2 (ordered descending) to find which column holds the second-largest value, here 3. For the second-smallest value, order ascending instead. You can use this same technique to order columns in a row from least to most, etc.
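Once the second-smallest value is available, the highlighting rule from the question reduces to a comparison. The Python sketch below shows that rule on plain lists. It assumes "1%" means an absolute gap of 0.01 between ratios (one percentage point), which the question does not state explicitly, and the function names are made up for this example.

```python
def second_smallest(values):
    """Second value in ascending order; needs at least two values."""
    return sorted(values)[1]


def should_highlight(values, margin=0.01):
    """Highlight the minimum only if it is below the runner-up by at least `margin`."""
    ordered = sorted(values)
    return (ordered[1] - ordered[0]) >= margin


print(second_smallest([3, 4, 1, 2]))
print(should_highlight([0.10, 0.12, 0.30]))
print(should_highlight([0.1000, 0.1002, 0.30]))
```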

Resources