SQL spread out one group proportionally to other groups - sql-server

So, for example, we have a number of users with different group IDs. Some of them don't have a group:
userID groupID
-------------
user1 group1
user2 group1
user3 group2
user4 group1
user5 NULL
user6 NULL
user7 NULL
user8 NULL
We need to group users by their groupID. We want users without a group (groupID is NULL) to be assigned to one of the existing groups (group1 or group2 in this example), but distributed in proportion to the number of users already assigned to each group. In our example, group1 has 3 users and group2 has only 1 user, so 75% (3/4) of the ungrouped users should be counted as members of group1 and the other 25% (1/4) should be "added" to group2. The end result should look like this:
groupID numOfUsers
-------------
group1 6
group2 2
This is a simplified example.
Basically we just can't figure out how users without a group can be divided between groups in a certain proportion, not just evenly distributed between them.
We can have any number of groups and users, so we can't just hardcode percentages.
Any help is appreciated.
Edit:
I tried to use NTILE(), but it gives an even distribution, not one proportional to the number of users in each group:
SELECT userID,
       NTILE(2) OVER (ORDER BY userID) AS gr  -- SQL Server requires ORDER BY inside OVER for NTILE
FROM (
    SELECT DISTINCT userID
    FROM test_task
    WHERE groupID IS NULL
) AS abc

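Here is one way:
select
    groupID
  , count(*)
    -- multiply by 1.0 to avoid integer division, and round the final share rather than the fraction;
    -- because each share is rounded separately, the totals can be off by one for some inputs
    + cast(round(count(*) * 1.0 / sum(count(*)) over () * (select count(*) from test_task where groupID is null), 0) as int) as numOfUsers
from test_task
where groupID is not null
group by groupID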
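Since rounding each group's share separately can leave the totals off by one, here is a minimal language-neutral sketch of the same arithmetic in Python. It swaps in the largest-remainder method (my choice, not something the answer above uses) so that the extra users always add up exactly. The function name `spread_proportionally` is made up for illustration.

```python
from collections import Counter


def spread_proportionally(group_sizes, unassigned):
    """Return final sizes after spreading `unassigned` users by existing share."""
    total = sum(group_sizes.values())
    if total == 0:
        raise ValueError("need at least one user already in a group")
    # Whole part of each group's share; the remainder is kept for ranking.
    extra = {g: (n * unassigned) // total for g, n in group_sizes.items()}
    remainder = {g: (n * unassigned) % total for g, n in group_sizes.items()}
    # Users lost to truncation go to the largest remainders first
    # (ties broken by group name so the result is deterministic).
    leftover = unassigned - sum(extra.values())
    for g in sorted(group_sizes, key=lambda g: (-remainder[g], g))[:leftover]:
        extra[g] += 1
    return {g: group_sizes[g] + extra[g] for g in group_sizes}


existing = Counter(["group1", "group1", "group2", "group1"])
print(spread_proportionally(existing, 4))  # {'group1': 6, 'group2': 2}
```

With the question's data this gives group1 = 6 and group2 = 2, matching the expected result.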

We can use an updatable CTE to do this:
First, we take all existing data, group it by groupID, and calculate a running sum of the row counts, as well as the total row count over the whole set.
Then we take the rows we want to update and add a row number (minus 1, so the modulo arithmetic works).
Finally, we join the two: each row's row number modulo the total number of existing rows must fall between the previous running sum (inclusive) and the current running sum (exclusive).
Note that the split is only exactly proportional when the number of rows to update is a multiple of the number of existing rows, e.g. 4 or 8 new rows for 4 existing rows.
WITH Groups AS (
SELECT
groupID,
perGroup = COUNT(*),
total = SUM(COUNT(*)) OVER (),
runningSum = SUM(COUNT(*)) OVER (ORDER BY groupID ROWS UNBOUNDED PRECEDING)
FROM test_task
WHERE groupID IS NOT NULL
GROUP BY groupID
),
ToUpdate AS (
SELECT
groupID,
userID,
rn = ROW_NUMBER() OVER (ORDER BY userID) - 1
FROM test_task tt
WHERE groupID IS NULL
)
UPDATE u
SET groupID = g.groupID
FROM ToUpdate u
JOIN Groups g
ON u.rn % (g.total) >= g.runningSum - g.perGroup
AND u.rn % (g.total) < g.runningSum;
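To make the join condition easier to follow, here is a rough Python sketch of the same idea. It is only an illustration of the logic, not a replacement for the SQL. The function name `assign_by_running_sum` is hypothetical, and it assumes at least one user already has a group.

```python
def assign_by_running_sum(group_sizes, new_users):
    """Put user number rn into the group whose range
    [running_sum - per_group, running_sum) contains rn % total."""
    total = sum(group_sizes.values())
    ranges = []
    running = 0
    for group in sorted(group_sizes):                 # ORDER BY groupID
        start, running = running, running + group_sizes[group]
        ranges.append((start, running, group))
    assignment = {}
    for rn, user in enumerate(sorted(new_users)):     # ROW_NUMBER() - 1
        slot = rn % total
        for start, end, group in ranges:
            if start <= slot < end:
                assignment[user] = group
                break
    return assignment


result = assign_by_running_sum({"group1": 3, "group2": 1},
                               ["user5", "user6", "user7", "user8"])
print(result)
# {'user5': 'group1', 'user6': 'group1', 'user7': 'group1', 'user8': 'group2'}
```

With only two new users, both would land in group1, which is the "divisible number of rows" caveat mentioned above.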

Related

SQL Server : how to divide database records into even, random groups

tblNames
OrganizationID (int)
LastName (varchar)
...
GroupNumber (int)
GroupNumber is currently NULL for all records, I need an UPDATE statement to update this column.
I need to split up records on an OrganizationID level into even, random groups.
If there are < 20,000 records for an OrganizationID, I need 2 even, random groups. So records for that OrganizationID will have a GroupNumber of 1 or 2. There will be the same (or if odd number of records difference of only 1) number of records for GroupNumber = 1 and for GroupNumber = 2, and there will be no recognizable way to tell how a person got into a GroupNumber - i.e. LastNames that start with A-L are group 1, M-Z are group 2 would not be OK.
If there are > 20,000 records for an OrganizationID, I need 4 even, random groups. So records for that OrganizationID will have a GroupNumber values of 1, 2, 3, or 4. There will be the same (or if odd number of records difference of only 1) number of records for each GroupNumber, and there will be no recognizable way to tell how a person got into a GroupNumber - i.e. LastNames that start with A-F are group 1, G-L are group 2, etc. would not be OK.
There are only about 20 organizations, so I can run an update statement 20 times, once per organizationID if needed.
I have full control of the table so I can add keys or columns, but for now this is what it is.
Would appreciate any help.
Create row numbers randomly (with ROW_NUMBER and NEWID). Then take them modulo 2 or 4, depending on the record count, to get buckets 0 to 1 or 0 to 3.
select
organizationid, lastname, ...,
case when cnt <= 20000 then rn % 2 else rn % 4 end as bucket
from
(
select
organizationid, lastname, ...,
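row_number() over (partition by organizationid order by newid()) as rn, -- random numbering within each organization
count(*) over (partition by organizationid) as cnt -- record count of that organization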
from mytable
) randomized;
UPDATE: I suppose the update statement would have to look something like this:
with randomized as
(
select
groupnumber,
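row_number() over (partition by organizationid order by newid()) as rn, -- random numbering within each organization
count(*) over (partition by organizationid) as cnt -- record count of that organization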
from mytable
)
update randomized
set groupnumber = case when cnt <= 20000 then rn % 2 else rn % 4 end + 1;
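For anyone who wants to see why "random row number modulo N" yields even groups, here is a small Python sketch for a single organization. The function name `random_even_groups` and the `small_limit` parameter are made up for illustration, and the shuffle only stands in for ORDER BY NEWID().

```python
import random


def random_even_groups(ids, small_limit=20000, rng=None):
    """Shuffle, number the rows, then bucket by row number modulo 2 or 4."""
    rng = rng or random.Random()
    shuffled = list(ids)
    rng.shuffle(shuffled)                  # stands in for ORDER BY NEWID()
    buckets = 2 if len(shuffled) <= small_limit else 4
    # rn starts at 1 like ROW_NUMBER(); "+ 1" turns 0..n-1 into 1..n.
    return {row_id: rn % buckets + 1
            for rn, row_id in enumerate(shuffled, start=1)}


groups = random_even_groups(range(7))
sizes = [list(groups.values()).count(g) for g in (1, 2)]
print(sizes)  # the two sizes differ by at most 1
```

Because consecutive row numbers cycle through the buckets, group sizes can never differ by more than one, whatever order the shuffle produces.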
Another, slightly different approach:
Setting up some fake data:
if object_id('tempdb.dbo.#Orgs') is not null drop table #Orgs
create table #Orgs
(
RID int identity(1,1) primary key clustered,
OrganizationId int,
LastName varchar(36),
GroupId int
)
insert into #Orgs (OrganizationId, LastName)
select top 40000 row_number() over (order by (select null)) % 20000, newid()
from sys.all_objects a, sys.all_objects b
Then use the rarely useful ntile() function to get groups as close to identical in size as possible. Sorting by newid() essentially sorts the data randomly (or at least as randomly as successive generated GUIDs are).
-- variables are prefixed with @ in T-SQL (# is for temp tables)
declare @NumRandomGroups int = 4
update o
set GroupId = x.GroupId
from #Orgs o
inner join (select RID, GroupId = ntile(@NumRandomGroups) over (order by newid())
            from #Orgs) x
    on o.RID = x.RID
select GroupId, count(1)
from #Orgs
group by GroupId
select *
from #Orgs
order by RID
You can then set @NumRandomGroups to whatever you want it to be based on the count of Organizations.
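As a side note on how ntile() sizes its groups: when the row count is not evenly divisible, the earlier groups each get one extra row. A tiny Python sketch of that rule (the helper name `ntile` here is just a stand-in, not the SQL function itself):

```python
def ntile(row_count, groups):
    """Return the 1-based group number for each row position, NTILE-style."""
    base, extra = divmod(row_count, groups)
    numbers = []
    for g in range(1, groups + 1):
        size = base + (1 if g <= extra else 0)  # first `extra` groups get one more row
        numbers.extend([g] * size)
    return numbers


print(ntile(10, 4))  # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
```

Combined with a random sort order, this is what makes the groups both random and within one row of each other in size.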

get first row for each group

I want to transform this data
account; docdate; docnum
17700; 9/11/2015; 1
17700; 9/12/2015; 2
70070; 9/1/2015; 4
70070; 9/2/2015; 6
70070; 9/3/2015; 9
into this
account; docdate; docnum
17700; 9/12/2015; 2
70070; 9/3/2015; 9
For each account I want one row with the most recent docdate (= max(docdate)). I already tried different approaches with CROSS APPLY and ROW_NUMBER but couldn't achieve the desired results.
Use ROW_NUMBER:
SELECT account, docdate, docnum
FROM (
SELECT account, docdate, docnum,
ROW_NUMBER() OVER (PARTITION BY account
ORDER BY docdate DESC) AS rn
FROM mytable ) AS t
WHERE t.rn = 1
The PARTITION BY account clause creates slices of rows sharing the same account value. ORDER BY docdate DESC places the record with the maximum docdate value at the top of its slice. Hence rn = 1 points to the record with the maximum docdate value within each account partition.
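The same "top row per partition" idea, sketched in Python with the sample data from the question. This is only an illustration: the function name `latest_per_account` is made up, and it assumes the dates are in month/day/year format.

```python
from datetime import datetime

rows = [
    (17700, "9/11/2015", 1),
    (17700, "9/12/2015", 2),
    (70070, "9/1/2015", 4),
    (70070, "9/2/2015", 6),
    (70070, "9/3/2015", 9),
]


def latest_per_account(rows):
    """Keep, per account, the row that would get rn = 1."""
    latest = {}
    for account, docdate, docnum in rows:
        when = datetime.strptime(docdate, "%m/%d/%Y")  # assumes month/day/year
        if account not in latest or when > latest[account][0]:
            latest[account] = (when, docdate, docnum)
    return [(acc, d, n) for acc, (_, d, n) in sorted(latest.items())]


print(latest_per_account(rows))
# [(17700, '9/12/2015', 2), (70070, '9/3/2015', 9)]
```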

Left outer join for first row in group only

I have a table that looks like this:
BANK       ACCOUNT_NAME  EXCESS  DEBT
Acme Bank  Checking1     500     300
Acme Bank  Personal      200     100
Bank One   Business      100     50
I need a SQL query that returns:
BANK       ACCOUNT_NAME  EXCESS  DEBT  AVAILABLE
Acme Bank  Checking1     500     300   300
Acme Bank  Personal      200     100   NULL
Bank One   Business      100     50    50
AVAILABLE would be the Sum(EXCESS) - Sum(DEBT) grouped by BANK. AVAILABLE would then appear only on the first row of BANK-ACCOUNT_NAME combination. How do I do this?
My first attempt results in AVAILABLE having values on all rows, which is not intended. I only want the first row in the group to have an AVAILABLE value.
SELECT
o.BANK
,o.ACCOUNT_NAME
,o.EXCESS
,o.DEBT
,inner2.AVAILABLE
FROM BankBalances AS o -- OUTER and INNER are reserved words, so other aliases are used
CROSS APPLY
(
SELECT TOP 1
Bank
,SUM(EXCESS) - SUM(DEBT) AS AVAILABLE
FROM BankBalances AS i
WHERE o.BANK = i.BANK -- WHERE has to come before GROUP BY
GROUP BY Bank
) AS inner2
You can use the following query:
SELECT BANK, ACCOUNT_NAME, EXCESS, DEBT,
CASE WHEN ROW_NUMBER() OVER (PARTITION BY BANK ORDER BY ACCOUNT_NAME) = 1
THEN SUM(EXCESS) OVER (PARTITION BY BANK) -
SUM(DEBT) OVER (PARTITION BY BANK)
ELSE NULL
END AS AVAILABLE
FROM BankBalances
You can use windowed version of SUM in order to avoid CROSS APPLY. ROW_NUMBER is simply used to check for first row.
I have made the assumption that first row is considered the one having the 'minimum' ACCOUNT_NAME value within each BANK partition.
You can use ROW_NUMBER and SUM() OVER with PARTITION BY like this:
;WITH CTE AS
(
SELECT
BANK
,ACCOUNT_NAME
,EXCESS
,DEBT
,SUM(EXCESS - DEBT) OVER(PARTITION BY BANK) AS AVAILABLE
,ROW_NUMBER()OVER(PARTITION BY BANK ORDER BY ACCOUNT_NAME ASC) rn
FROM BankBalances
)
SELECT BANK
,ACCOUNT_NAME
,EXCESS
,DEBT
,CASE WHEN rn = 1 THEN AVAILABLE ELSE null end as AVAILABLE
FROM CTE
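For readers who find the windowed version hard to picture, here is a plain Python sketch of the same two steps: total per bank, then show it only on the first account by name. The function name `with_available` is hypothetical, and "first row" follows the same assumption as above (lowest ACCOUNT_NAME within each BANK).

```python
from collections import defaultdict

rows = [
    ("Acme Bank", "Checking1", 500, 300),
    ("Acme Bank", "Personal", 200, 100),
    ("Bank One", "Business", 100, 50),
]


def with_available(rows):
    """Append AVAILABLE, filled only on each bank's first account by name."""
    totals = defaultdict(int)
    for bank, _, excess, debt in rows:
        totals[bank] += excess - debt          # SUM(EXCESS - DEBT) per bank
    seen = set()
    result = []
    for bank, account, excess, debt in sorted(rows):  # bank, then account name
        first = bank not in seen               # corresponds to rn = 1
        seen.add(bank)
        result.append((bank, account, excess, debt,
                       totals[bank] if first else None))
    return result


for row in with_available(rows):
    print(row)
# ('Acme Bank', 'Checking1', 500, 300, 300)
# ('Acme Bank', 'Personal', 200, 100, None)
# ('Bank One', 'Business', 100, 50, 50)
```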

Check for applicable Group query for shopping cart

I have a problem with one of my discount check conditions. I have the following table structure:
Cart table (id, customerid, productid)
Group table (groupid, groupname, discountamount)
Group Products table (groupproductid, groupid, productid)
When an order is placed, there will be multiple items in the cart. I want to check those items against the groups and find the top-most group whose products are all present in the shopping cart.
Example:
If group 1 consists of 2 products and both of those products exist in the cart table, then the group 1 discount should be returned.
Please help.
It's tricky without real table definitions or sample data, so I've made some up:
create table Carts(
id int,
customerid int,
productid int
)
create table Groups(
groupid int,
groupname int,
discountamount int
)
create table GroupProducts(
groupproductid int,
groupid int,
productid int
)
insert into Carts (id,customerid,productid) values
(1,1,1),
(2,1,2),
(3,1,4),
(4,2,2),
(5,2,3)
insert into Groups (groupid,groupname,discountamount) values
(1,1,10),
(2,2,15),
(3,3,20)
insert into GroupProducts (groupproductid,groupid,productid) values
(1,1,1),
(2,1,5),
(3,2,2),
(4,2,4),
(5,3,2),
(6,3,3)
;With MatchedProducts as (
select
c.customerid,gp.groupid,COUNT(*) as Cnt
from
Carts c
inner join
GroupProducts gp
on
c.productid = gp.productid
group by
c.customerid,gp.groupid
), GroupSizes as (
select groupid,COUNT(*) as Cnt from GroupProducts group by groupid
), MatchingGroups as (
select
mp.*
from
MatchedProducts mp
inner join
GroupSizes gs
on
mp.groupid = gs.groupid and
mp.Cnt = gs.Cnt
)
select * from MatchingGroups
Which produces this result:
customerid groupid Cnt
----------- ----------- -----------
1 2 2
2 3 2
What we're doing here is called "relational division" - if you want to search elsewhere for that term. In my current results, each customer only matches one group - if there are multiple matches, we need some tie-breaking conditions to determine which group to report. I prompted with two suggestions in comments (lowest groupid or highest discountamount). Your response of "added earlier" doesn't help - we don't have a column which contains the addition dates of groups. Rows have no inherent ordering in SQL.
We would do the tie-breaking in the definition of MatchingGroups and the final select:
MatchingGroups as (
select
mp.*,
ROW_NUMBER() OVER (PARTITION BY mp.customerid ORDER BY /*Tie break criteria here */) as rn
from
MatchedProducts mp
inner join
GroupSizes gs
on
mp.groupid = gs.groupid and
mp.Cnt = gs.Cnt
)
select * from MatchingGroups where rn = 1
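If the counting trick feels indirect, relational division can also be pictured as a subset test. A minimal Python sketch using the made-up sample data above (the function name `matching_groups` is hypothetical, and no tie-breaking is applied):

```python
from collections import defaultdict

# (customerid, productid) and (groupid, productid) pairs from the sample data
carts = [(1, 1), (1, 2), (1, 4), (2, 2), (2, 3)]
group_products = [(1, 1), (1, 5), (2, 2), (2, 4), (3, 2), (3, 3)]


def matching_groups(carts, group_products):
    """Return {customerid: [groupid, ...]} for groups fully contained in the cart."""
    cart_of = defaultdict(set)
    for customer, product in carts:
        cart_of[customer].add(product)
    products_of = defaultdict(set)
    for group, product in group_products:
        products_of[group].add(product)
    return {
        customer: sorted(g for g, needed in products_of.items() if needed <= cart)
        for customer, cart in cart_of.items()
    }


print(matching_groups(carts, group_products))  # {1: [2], 2: [3]}
```

The `needed <= cart` subset check plays the role of `mp.Cnt = gs.Cnt` in the query: a group matches only when every one of its products is in the cart.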

SQL - Most recent by date

I have a consumer table which has the columns - Email, AccountState and DateCreated
AccountState can have the values 1 (active), 2 (inactive) and 3 (archived)
A specific consumer can have multiple rows, one for each of the account states above.
What I am trying to do is construct a query which returns the following
A. A list of consumer records for each consumer (using email address)
B. Only the records which aren't the most recent (so if a specific email address has 3 records, 1 for each state, it would return the 2 which aren't the most recent)
Then once I have this list I want to set all these states to 3 as they need to be archived.
So for the example data shown here
Only rows 13 - 16 would be returned.
I have tried to do this using the query below but it isn't working.
SELECT con.Email,
con.Id,
con.DateCreated AS DateRegistered,
con.DateLastActivity,
con.hasiPhone,
con.hasAndroid,
con.hasSMS,
con.CurrencyCode AS Currency,
con.AccountState
FROM Consumer con
WHERE con.AccountState <> 1
AND DateCreated =( SELECT MAX(DateCreated)
FROM Consumer con_most_recent
WHERE con_most_recent.AccountState <> 1
AND con_most_recent.Id = con.Id)
order by Email asc
;WITH x AS
(
SELECT rn = ROW_NUMBER() OVER (PARTITION BY EMail ORDER BY DateCreated DESC),
Email, Id, DateCreated AS DateRegistered --, ... other columns
FROM dbo.Consumer
WHERE AccountState <> 1
)
SELECT Email, Id, DateRegistered --, ... other columns
FROM x
WHERE rn > 1
ORDER BY Email;
EDIT changing state for these rows
;WITH x AS
(
SELECT rn = ROW_NUMBER() OVER (PARTITION BY Email ORDER BY DateCreated DESC),
Email, Id, DateCreated AS DateRegistered, AccountState -- AccountState must be in the CTE's select list to be updated through it; ... other columns
FROM dbo.Consumer
WHERE AccountState <> 1
)
UPDATE x
SET AccountState = 3
WHERE rn > 1;
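Here is a small Python sketch of what rn > 1 selects. The question's sample data was not included, so the rows below are hypothetical, as is the function name `ids_to_archive`. Like the query, it ignores rows with AccountState = 1.

```python
from itertools import groupby

# Hypothetical rows (id, email, date_created, account_state); not the asker's data.
consumers = [
    (1, "a@example.com", "2015-01-01", 2),
    (2, "a@example.com", "2015-03-01", 2),
    (3, "a@example.com", "2015-02-01", 3),
    (4, "b@example.com", "2015-05-01", 2),
    (5, "b@example.com", "2015-06-01", 1),
]


def ids_to_archive(rows):
    """Return ids of every non-active row except the newest one per email."""
    rows = [r for r in rows if r[3] != 1]                # WHERE AccountState <> 1
    by_email = sorted(rows, key=lambda r: (r[1], r[2]))  # ISO dates sort as text
    stale = []
    for _, group in groupby(by_email, key=lambda r: r[1]):
        ordered = list(group)
        stale.extend(r[0] for r in ordered[:-1])         # all but the newest row
    return sorted(stale)


print(ids_to_archive(consumers))  # [1, 3]
```

Those ids are the rows the UPDATE above would set to AccountState = 3.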
