Batches over groups - sql-server

I need to process rows in a table in batches of not less than N rows. Each batch needs to contain an entire group of rows (group is just another column) i.e. when I select top N rows from the table for processing, I need to extend that N to cover the last group in the batch rather than splitting the last group between batches.
Sample data:
CREATE TABLE test01 (id INT PRIMARY KEY CLUSTERED IDENTITY(1, 1) NOT NULL
, person_name NVARCHAR(100)
, person_surname NVARCHAR(100)
, person_group_code CHAR(2) NOT NULL);
INSERT INTO
dbo.test01 (person_name
, person_surname
, person_group_code)
VALUES
('n1', 's1', 'g1')
, ('n2', 's2', 'g1')
, ('n3', 's3', 'g1')
, ('n4', 's4', 'g1')
, ('n5', 's5', 'g2')
, ('n6', 's6', 'g2')
, ('n7', 's7', 'g2')
, ('n8', 's8', 'g2')
, ('n9', 's9', 'g2')
, ('n10', 's10', 'g2')
, ('n11', 's11', 'g3')
, ('n12', 's12', 'g3')
, ('n13', 's13', 'g3')
, ('n14', 's14', 'g3');
My current attempt:
DECLARE @batch_start INT = 1
, @batch_size INT = 5;
DECLARE @max_id INT = (SELECT MAX(id) FROM dbo.test01);
WHILE @batch_start <= @max_id
BEGIN
SELECT *
FROM dbo.test01
WHERE id BETWEEN @batch_start AND @batch_start + @batch_size - 1;
SELECT @batch_start += @batch_size;
END;
END;
DROP TABLE dbo.test01;
In the example above, I am splitting the 14 rows into 3 batches: 5 rows in batch #1, another 5 rows in batch #2 and then 4 rows in the final batch.
The first batch (id from 1 to 5) covers only a fraction of the 'g2' group, so I need to extend this batch to cover rows 1-10 (I need to process the entire g2 group in a single batch).
(By the way, I don't mind batch upsizing; I just need to make sure I cover at least one full group per batch.)
The result would be that batch #1 would cover groups g1 and g2 (10 rows) then batch #2 would cover group g3 (4 rows) and there would be no batch #3 at all.
Now, the table is billions of rows and batch sizes are around 50K-100K each so I need a solution that performs well.
Any hints on how to approach this with minimal performance hit?

The first thing I've noticed is that your current code assumes there are no gaps in the identity column, which is not a safe assumption. An identity column may (and often does) have gaps in its numbers, so the first thing you want to do is use row_number() over(order by id) to get a continuous running number for all your records.
The second thing I've added is a column that gives a numeric id to each group, ordered the same way as the identity column, using a well-known technique for solving gaps-and-islands problems.
I've used a table variable to store this data for each id of the source table for the purpose of this demonstration, but you might want to use a temporary table and add indexes on the relevant columns to improve performance.
I've also renamed your @batch_size variable to @batch_min_size and added a few other variables.
So here is the table variable I've used:
DECLARE @Helper As Table (Id int, Rn int, GroupId int)
INSERT INTO @Helper (Id, Rn, GroupId)
SELECT Id,
ROW_NUMBER() OVER(ORDER BY ID) As Rn,
ROW_NUMBER() OVER(ORDER BY ID) -
ROW_NUMBER() OVER(PARTITION BY person_group_code ORDER BY ID) As GroupId
FROM dbo.test01
This is the content of this table:
Id Rn GroupId
1 1 0
2 2 0
3 3 0
4 4 0
5 5 4
6 6 4
7 7 4
8 8 4
9 9 4
10 10 4
11 11 10
12 12 10
13 13 10
14 14 10
I've used a while loop to do the batches.
In the loop, I've used this table to calculate the first and last id of each batch, as well as the last row number of the batch.
Then all I had to do was to use the first and last id in the where clause of the original table:
DECLARE @batch_min_size int = 10
, @batch_end int = 0
, @batch_start int
, @first_id_of_batch int
, @last_id_of_batch int
, @total_row_count int;
SELECT @total_row_count = COUNT(*) FROM dbo.test01;
WHILE @batch_end < @total_row_count
BEGIN
SELECT @batch_start = @batch_end + 1;
SELECT @batch_end = MAX(Rn)
, @first_id_of_batch = MIN(Id)
, @last_id_of_batch = MAX(Id)
FROM @Helper
WHERE Rn >= @batch_start
AND GroupId <=
(
SELECT MAX(GroupId)
FROM @Helper
WHERE Rn <= @batch_start + @batch_min_size - 1
)
SELECT id, person_name, person_surname, person_group_code
FROM dbo.test01
WHERE Id >= @first_id_of_batch
AND Id <= @last_id_of_batch
END
END
See a live demo on rextester.
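The suggestion above to swap the table variable for an indexed temporary table could look roughly like this. It is only a sketch under a few assumptions: the index name IX_Helper_GroupId is made up for illustration, Rn and GroupId are widened to bigint (ROW_NUMBER() returns bigint, and the question mentions billions of rows), and the loop would then read from #Helper instead of @Helper. Whether these indexes actually pay off should be checked against the real data.

```sql
-- Sketch: indexed temp-table version of the helper (names are illustrative).
CREATE TABLE #Helper
(
    Id      int    NOT NULL,                        -- same type as dbo.test01.id
    Rn      bigint NOT NULL PRIMARY KEY CLUSTERED,  -- the loop filters on ranges of Rn
    GroupId bigint NOT NULL
);

INSERT INTO #Helper (Id, Rn, GroupId)
SELECT id,
       ROW_NUMBER() OVER (ORDER BY id),
       ROW_NUMBER() OVER (ORDER BY id)
     - ROW_NUMBER() OVER (PARTITION BY person_group_code ORDER BY id)
FROM dbo.test01;

-- Hypothetical supporting index for the "GroupId <= (...)" predicate.
CREATE NONCLUSTERED INDEX IX_Helper_GroupId
    ON #Helper (GroupId) INCLUDE (Id);
```

The clustered key on Rn is chosen because both the outer query and the subquery in the loop filter on Rn ranges.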

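See if the below helps. First, store the last id of each group in a temporary table:
CREATE TABLE #Temp (g_record_count int, groupname varchar(50));
INSERT INTO #Temp (g_record_count, groupname)
SELECT MAX(id), person_group_code FROM dbo.test01 GROUP BY person_group_code;
After this, loop through this temporary table (this reuses @batch_start, @batch_size and @max_id from your attempt, and assumes the rows of a group have consecutive ids):
DECLARE @rec_per_batch INT;
WHILE @batch_start <= @max_id
BEGIN
-- End of the batch: the first group boundary at or past the minimum batch size
SELECT @rec_per_batch = MIN(g_record_count) FROM #Temp WHERE g_record_count >= @batch_start + @batch_size - 1;
-- No such boundary means fewer than @batch_size rows remain, so take them all
SET @rec_per_batch = ISNULL(@rec_per_batch, @max_id);
SELECT *
FROM dbo.test01
WHERE id BETWEEN @batch_start AND @rec_per_batch;
-- The next batch starts right after the group that closed this one
SET @batch_start = @rec_per_batch + 1;
END;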

Related

SQL Server update rows without nulls ordered by consecutive numbers and nulls

I have a query returning a few rows. One of its columns contains consecutive numbers and NULLs.
For example, it has values from 1-10 then 5 nulls, then from 16-30 and then 10 nulls, then from 41-45 and so on.
I need to update that column, or create another column, to hold a groupId for each run of consecutive numbers.
Meaning as per above example, for rows 1-10, groupID can be 1. Then for 5 nulls nothing and then from 16-30 groupId can be 2. Then for 10 nulls nothing. Then from 41-45 groupId can be 3 and so on.
Please let me know
This was a fun one. Here is the solution with a simple table that contains just integers, but with gaps.
create table n(v int)
insert n values (1),(2),(3),(5),(6),(7),(9),(10)
select n.*, g.group_no
from n
join (
select row_number() over (order by low.v) group_no, low.v as low, min(high.v) as high
from n as low
join n as high on high.v>=low.v -- >= so that a single-row island is its own high end
and not exists(select * from n h2 where h2.v=high.v+1)
where not exists(select * from n l2 where l2.v=low.v-1)
group by low.v
) g on g.low<=n.v and g.high>=n.v
Result:
v group_no
1 1
2 1
3 1
5 2
6 2
7 2
9 3
10 3
Typical island & gap solution
select col, grp = dense_rank() over (order by grp)
from
(
select col, grp = col - dense_rank() over (order by col)
from yourtable
) d
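As a worked example of the query above, here is a small sketch that runs it against a table variable shaped like the question's data. The name @yourtable and the sample values are assumptions made for illustration. The NULL rows are filtered out before ranking, because otherwise they would form a group of their own and shift the numbering of the others.

```sql
-- Sketch: sample data with NULLs between the runs of consecutive numbers.
DECLARE @yourtable TABLE (col int NULL);
INSERT INTO @yourtable (col)
VALUES (1), (2), (3), (NULL), (5), (6), (7), (NULL), (9), (10);

SELECT col, grp = DENSE_RANK() OVER (ORDER BY grp)
FROM
(
    -- col minus its rank stays constant inside a run of consecutive values:
    -- 1,2,3 -> 0   5,6,7 -> 1   9,10 -> 2
    SELECT col, grp = col - DENSE_RANK() OVER (ORDER BY col)
    FROM @yourtable
    WHERE col IS NOT NULL
) d
ORDER BY col;
-- Returns grp = 1 for 1,2,3; grp = 2 for 5,6,7; grp = 3 for 9,10
```

If the NULL rows must also appear in the output with no group, one option is to LEFT JOIN this result back to the full table on col.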

Create new incremental grouping column based on if logic from group, rank, and category columns

I'm trying to sum totals together that goes beyond a basic "group by" or "case" statement.
Here's an example datasets:
Amt Cust_id Ranking PlanType
10 1 1 Term
6 1 2 Variable
8 1 3 Variable
7 1 4 Variable
12 1 5 Term
6 1 6 Variable
10 1 7 Variable
The objective is to return the max sum where the plan type is 'Variable' and
the Ranking numbers are adjacent to each other.
So the answer to the example would be the sum of rows 2-4 which returns 21.
The answer is not the sum of all variable plan types, because row 5 is a 'Term' which breaks it apart.
So I'd like to end with a dataset like below to handle multiple groups of customers:
Amt Cust_ID
21 1
30 2
45 3
Here's where I'm stuck which returns wrong answer:
Create Table #tb (Amt INT, Cust_id TINYINT, Ranking INT, PlanType
VARCHAR(10))
INSERT INTO #tb
VALUES (10,1,1,'Term'),
(6,1,2,'Variable'),
(8,1,3,'Variable'),
(7,1,4,'Variable'),
(12,1,5,'Term'),
(6,1,6,'Variable'),
(10,1,7,'Variable'),
(10,2,1,'Term'),
(6,2,2,'Variable'),
(7,2,4,'Variable'),
(12,2,5,'Term'),
(6,2,6,'Variable'),
(50,2,7,'Variable')
select
( SELECT SUM(Amt) FROM #tb as t2
WHERE t2.Cust_ID=t1.Cust_ID AND t2.Ranking<=t1.Ranking AND
t2.PlanType='Variable') RollingAmt
,Cust_ID, Ranking, Amt, PlanType
from #tb as t1
order by Cust_ID, Ranking
The query runs a rolling sum ordered by "Ranking" where PlanType = 'Variable'. Unfortunately it runs a rolling sum of all "Variable"'s together. I need it to not do that.
If it runs into a PlanType "Term" it needs to start over its sum within each group.
In order to do this you need to use a gaps-and-islands technique to generate a "group id" based on consecutive runs of the same PlanType, then you can sum and sort based on that new group id.
Try this:
DECLARE @data TABLE (Amt INT, Cust_id TINYINT, Ranking INT, PlanType VARCHAR(10))
INSERT INTO @data
VALUES (10,1,1,'Term'),
(6,1,2,'Variable'),
(8,1,3,'Variable'),
(7,1,4,'Variable'),
(12,1,5,'Term'),
(6,1,6,'Variable'),
(10,1,7,'Variable'),
(10,2,1,'Term'),
(6,2,2,'Variable'),
(7,2,4,'Variable'),
(12,2,5,'Term'),
(6,2,6,'Variable'),
(50,2,7,'Variable')
;WITH X AS
(
SELECT *,
ROW_NUMBER() OVER(PARTITION BY Cust_id,PlanType ORDER BY Ranking)
- ROW_NUMBER() OVER(PARTITION BY Cust_id ORDER BY Ranking) groupID /* Assign a groupID to consecutive runs of PlanTypes by Cust_id */
FROM @data
), Y AS
(
SELECT *, SUM(Amt) OVER(PARTITION BY Cust_id,groupID) AS AmtSum /* Sum Amt by Cust/groupID */
FROM X
WHERE PlanType='Variable'
), Z AS
(
SELECT *, ROW_NUMBER() OVER(PARTITION BY Cust_id ORDER BY AmtSum DESC) AS RN /* Assign a row number (1) to highest AmtSum by Cust */
FROM Y
)
SELECT AmtSum, Cust_id
FROM Z
WHERE RN=1 /* Only select RN=1 to get highest value by cust_id/groupId */
If you are curious about how this all works, you can comment the last SELECT and do SELECT * FROM X then SELECT * FROM Y etc, to see what each step does along the way; but only one SELECT can follow the entire CTE structure.

Transact-SQL - number rows until condition met

I'm trying to generate the numbers in the "x" column considering the values in field "eq", in a way that it should assign a number for every record until it meets the value "1", and the next row should reset and start counting again. I've tried with row_number, but the problem is that I only have ones and zeros in the column I need to evaluate, and the cases I've seen using row_number were using growing values in a column. Also tried with rank, but I haven't managed to make it work.
nInd Fecha Tipo #Inicio #contador_I #Final #contador_F eq x
1 18/03/2002 I 18/03/2002 1 null null 0 1
2 20/07/2002 F 18/03/2002 1 20/07/2002 1 1 2
3 19/08/2002 I 19/08/2002 2 20/07/2002 1 0 1
4 21/12/2002 F 19/08/2002 2 21/12/2002 2 1 2
5 17/03/2003 I 17/03/2003 3 21/12/2002 2 0 1
6 01/04/2003 I 17/03/2003 4 21/12/2002 2 0 2
7 07/04/2003 I 17/03/2003 5 21/12/2002 2 0 3
8 02/06/2003 F 17/03/2003 5 02/06/2003 3 0 4
9 31/07/2003 F 17/03/2003 5 31/07/2003 4 0 5
10 31/08/2003 F 17/03/2003 5 31/08/2003 5 1 6
11 01/09/2005 I 01/09/2005 6 31/08/2003 5 0 1
12 05/09/2005 I 01/09/2005 7 31/08/2003 5 0 2
13 31/12/2005 F 01/09/2005 7 31/12/2005 6 0 3
14 14/01/2006 F 01/09/2005 7 14/01/2006 7 1 4
There is another solution available:
select
nind, eq, row_number() over (partition by s order by nind) -- order by nind so the counter follows the row order
from (
select
nind, eq, coalesce((
select sum(eq) +1 from mytable pre where pre.nInd < mytable.nInd)
,1) s --this is the sum of eq!
from mytable) g
The inner subquery creates groups sequentially for each occurrence of 1 in eq. Then we can use row_number() over partition to get our counter.
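On SQL Server 2012 or later, the correlated subquery can probably be swapped for a windowed running sum, which expresses the same grouping idea. The sketch below is a variation, not the answer's own code. It assumes a table variable @mytable holding only the nInd and eq columns from the question.

```sql
-- Sketch: nInd/eq pairs copied from the question's sample rows.
DECLARE @mytable TABLE (nInd int PRIMARY KEY, eq int NOT NULL);
INSERT INTO @mytable (nInd, eq)
VALUES (1,0),(2,1),(3,0),(4,1),(5,0),(6,0),(7,0),
       (8,0),(9,0),(10,1),(11,0),(12,0),(13,0),(14,1);

SELECT nInd, eq,
       x = ROW_NUMBER() OVER (PARTITION BY s ORDER BY nInd)
FROM
(
    -- s = how many rows with eq = 1 come strictly before this row,
    -- so s changes on the row right after each eq = 1
    SELECT nInd, eq,
           s = COALESCE(SUM(eq) OVER (ORDER BY nInd
                                      ROWS BETWEEN UNBOUNDED PRECEDING
                                               AND 1 PRECEDING), 0)
    FROM @mytable
) g
ORDER BY nInd;
-- x comes out as 1,2,1,2,1,2,3,4,5,6,1,2,3,4, matching the x column in the question
```

The frame ends at 1 PRECEDING so that the row carrying eq = 1 still belongs to the group it closes.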
Here is an example using Sql Server
I have two answers here. One is based off of the ROW_NUMBER() and the other is based off of what appears to be your index (nInd). I wasn't sure if there would be a gap in your index so I made the ROW_NUMBER() as well.
My table format was as follows -
myIndex int identity(1,1) NOT NULL
number int NOT NULL
First one is ROW_NUMBER()...
WITH rn AS (SELECT *, ROW_NUMBER() OVER (ORDER BY myIndex) AS rn, COUNT(*) AS max
FROM counting c GROUP BY c.myIndex, c.number)
,cte (myIndex, number, level, row) AS (
SELECT r.myIndex, r.number, 1, r.rn + 1 FROM rn r WHERE r.rn = 1
UNION ALL
SELECT r1.myIndex, r1.number,
CASE WHEN r1.number = 0 AND r2.number = 1 THEN 1
ELSE c.level + 1
END,
row + 1
FROM cte c
JOIN rn r1
ON c.row = r1.rn
JOIN rn r2
ON c.row - 1 = r2.rn
)
SELECT c.myIndex, c.number, c.level FROM cte c OPTION (MAXRECURSION 0);
Now the index...
WITH cte (myIndex, number, level) AS (
SELECT c.myIndex + 1, c.number, 1 FROM counting c WHERE c.myIndex = 1
UNION ALL
SELECT c1.myIndex + 1, c1.number,
CASE WHEN c1.number = 0 AND c2.number = 1 THEN 1
ELSE c.level + 1
END
FROM cte c
JOIN counting c1
ON c.myIndex = c1.myIndex
JOIN counting c2
ON c.myIndex - 1 = c2.myIndex
)
SELECT c.myIndex - 1 AS myIndex, c.number, c.level FROM cte c OPTION (MAXRECURSION 0);
The answer that I have now uses a cursor.
I know that a solution without a cursor would perform better.
Here is a quick demo of my solution:
-- Create DBTest
use master
Go
Create Database DBTest
Go
use DBTest
GO
-- Create table
Create table Tabletest
(nInd int , eq int)
Go
-- insert dummy data
insert into Tabletest (nInd,eq)
values (1,0),
(2,1),
(3,0),
(4,1),
(5,0),
(6,0),
(7,0),
(8,0),
(9,1),
(8,0),
(9,1)
Create table #Tabletest (nInd int ,eq int ,x int )
go
DECLARE @nInd int , @eq int , @x int
set @x = 1
DECLARE db_cursor CURSOR FOR
SELECT nInd , eq
FROM Tabletest
order by nInd
OPEN db_cursor
FETCH NEXT FROM db_cursor INTO @nInd , @eq
WHILE @@FETCH_STATUS = 0
BEGIN
if (@eq = 0)
begin
insert into #Tabletest (nInd ,eq ,x) values (@nInd , @eq , @x)
set @x = @x +1
end
else if (@eq = 1)
begin
insert into #Tabletest (nInd ,eq ,x) values (@nInd , @eq , @x)
set @x = 1
end
FETCH NEXT FROM db_cursor INTO @nInd , @eq
END
CLOSE db_cursor
DEALLOCATE db_cursor
select * from #Tabletest
The end result set will be as following:
Hope it helps.
Looking at this in a slightly different way (which might not be right, but eliminates the need for cursors or recursive CTEs), it looks like you are building ordered groups within your dataset. So, start by finding those groups, then determine the ordering within each of them.
The real key is to determine the rules for finding the correct grouping. Based on your description and comments, I'm guessing each group runs from the start (ordered by the nInd column) and ends at each row with an eq value of 1, so you can do something like:
;with ends(nInd, ord) as (
--Find the ending row for each set
SELECT nInd, row_number() over(order by nInd)
FROM mytable
WHERE eq=1
), ranges(sInd, eInd) as (
--Find the previous ending row for each ending row, forming a range for the group
SELECT coalesce(s.nInd,0), e.nInd
FROM ends s
right join ends e on s.ord=e.ord-1
)
Then, using these group ranges, you can find the final ordering of each:
select t.nInd, t.Fecha, t.eq
,[x] = row_number() over(partition by sInd order by nInd)
from ranges r
join mytable t on r.sInd < t.nInd
and t.nInd <= r.eInd
order by t.nInd

Select randomly few Rows of the same ID in the same table (T-SQL)

I'm trying to randomly select a few rows for each Id stored in one table, where these Ids have multiple rows in this table. It's difficult to explain in words, so let me show you with an example:
Example from the table :
Id Review
1 Text11
1 Text12
1 Text13
2 Text21
3 Text31
3 Text32
4 Text41
5 Text51
6 Text61
6 Text62
6 Text63
Result expected :
Id Review
1 Text11
1 Text13
2 Text21
3 Text32
4 Text41
5 Text51
6 Text62
In fact, the table contains thousands of rows. Some Ids have only one review, but others can have hundreds of reviews. I would like to select 10% of the reviews for each Id, and at least one review for every Id that has only 1-9 reviews (I saw that SELECT TOP 10 PERCENT ... FROM table ORDER BY NEWID() includes a row even if it's alone).
I read some Stack topics and I think I have to use a subquery, but I can't find the correct solution.
Thanks in advance.
Regards.
Try this:
DECLARE @t table(Id int, Review char(6))
INSERT @t values
(1,'Text11'),
(1,'Text12'),
(1,'Text13'),
(2,'Text21'),
(3,'Text31'),
(3,'Text32'),
(4,'Text41'),
(5,'Text51'),
(6,'Text61'),
(6,'Text62'),
(6,'Text63')
;WITH CTE AS
(
SELECT
id, Review,
row_number() over (partition by id order by newid()) rn,
count(*) over (partition by id) cnt
FROM @t
)
SELECT id, Review
FROM CTE
WHERE rn <= (cnt / 10) + 1
Result(random):
id Review
1 Text12
2 Text21
3 Text31
4 Text41
5 Text51
6 Text63

How we can get two rows before and after for a given id in a table?

I have a table with 10 rows
id values
1 a
2 b
3 c
4 d
5 e
6 f
7 g
8 h
9 i
10 j
I want to get the two rows before and the two rows after @id = 5.
How can I get them?
Edit This should work as expected (hopefully):
select id, value
from [table]
where id-@id >= -2
AND id-@id <= 2
AND id-@id <> 0
Here's the running sql: http://sqlfiddle.com/#!6/ca4e5/3/0
One possible solution:
select *
from [table]
where id in (3, 4, 6, 7)
If you are using an int variable @id, you can do it like this:
select *
from [table]
where id in (@id-2, @id-1, @id+1, @id+2)
To select the previous two:
select top 2 *
from tablename
where id < @id
order by id desc
To select the next two:
select top 2 *
from tablename
where id > @id
order by id asc
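The two TOP queries above can be combined into a single result set. Here is a minimal sketch, assuming a table variable @t that stands in for the question's table (the names @t, prev and nxt are made up for illustration). Unlike the id arithmetic in the other answers, this should still return two neighbours on each side when the ids have gaps.

```sql
-- Sketch: the ten sample rows from the question.
DECLARE @t TABLE (id int PRIMARY KEY, value char(1));
INSERT INTO @t (id, value)
VALUES (1,'a'),(2,'b'),(3,'c'),(4,'d'),(5,'e'),
       (6,'f'),(7,'g'),(8,'h'),(9,'i'),(10,'j');

DECLARE @id int = 5;

SELECT id, value
FROM (SELECT TOP (2) id, value FROM @t
      WHERE id < @id ORDER BY id DESC) AS prev   -- the two closest rows below @id
UNION ALL
SELECT id, value
FROM (SELECT TOP (2) id, value FROM @t
      WHERE id > @id ORDER BY id ASC) AS nxt     -- the two closest rows above @id
ORDER BY id;
-- Returns ids 3, 4, 6, 7 for the sample data
```

Each TOP (2) sits in its own derived table so that it can carry its own ORDER BY; the final ORDER BY sorts the combined rows.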

Resources