Left outer join for first row in group only - sql-server

I have a table that looks like this:
BANK        ACCOUNT_NAME   EXCESS   DEBT
Acme Bank   Checking1      500      300
Acme Bank   Personal       200      100
Bank One    Business       100      50
I need a SQL query that returns:
BANK        ACCOUNT_NAME   EXCESS   DEBT   AVAILABLE
Acme Bank   Checking1      500      300    300
Acme Bank   Personal       200      100    NULL
Bank One    Business       100      50     50
AVAILABLE would be Sum(EXCESS) - Sum(DEBT) grouped by BANK, and it should appear only on the first row of each BANK-ACCOUNT_NAME combination. How do I do this?
My first attempt results in AVAILABLE having values on all rows, which is not intended. I only want the first row in each group to have an AVAILABLE value.
SELECT
    [outer].BANK
   ,[outer].ACCOUNT_NAME
   ,[outer].EXCESS
   ,[outer].DEBT
   ,inner2.AVAILABLE
FROM BankBalances AS [outer]
CROSS APPLY
(
    SELECT TOP 1
        Bank
       ,SUM(EXCESS) - SUM(DEBT) AS AVAILABLE
    FROM BankBalances AS [inner]
    WHERE [outer].BANK = [inner].BANK
    GROUP BY Bank
) AS inner2

You can use the following query:
SELECT BANK, ACCOUNT_NAME, EXCESS, DEBT,
CASE WHEN ROW_NUMBER() OVER (PARTITION BY BANK ORDER BY ACCOUNT_NAME) = 1
THEN SUM(EXCESS) OVER (PARTITION BY BANK) -
SUM(DEBT) OVER (PARTITION BY BANK)
ELSE NULL
END AS AVAILABLE
FROM BankBalances
You can use the windowed version of SUM in order to avoid the CROSS APPLY. ROW_NUMBER is simply used to detect the first row.
I have made the assumption that the first row is the one having the 'minimum' ACCOUNT_NAME value within each BANK partition.
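As a sanity check, here is the same pattern run against an in-memory SQLite database via Python (a sketch purely for illustration; SQLite 3.25+ supports these window functions, and the table and data are recreated from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE BankBalances (BANK TEXT, ACCOUNT_NAME TEXT, EXCESS INT, DEBT INT)")
conn.executemany("INSERT INTO BankBalances VALUES (?, ?, ?, ?)", [
    ("Acme Bank", "Checking1", 500, 300),
    ("Acme Bank", "Personal", 200, 100),
    ("Bank One", "Business", 100, 50),
])

# AVAILABLE = windowed SUM(EXCESS) - SUM(DEBT), shown only where ROW_NUMBER() = 1
rows = conn.execute("""
    SELECT BANK, ACCOUNT_NAME, EXCESS, DEBT,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY BANK ORDER BY ACCOUNT_NAME) = 1
                THEN SUM(EXCESS) OVER (PARTITION BY BANK) -
                     SUM(DEBT) OVER (PARTITION BY BANK)
           END AS AVAILABLE
    FROM BankBalances
    ORDER BY BANK, ACCOUNT_NAME
""").fetchall()
for row in rows:
    print(row)
# ('Acme Bank', 'Checking1', 500, 300, 300)
# ('Acme Bank', 'Personal', 200, 100, None)
# ('Bank One', 'Business', 100, 50, 50)
```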

You can use ROW_NUMBER and a windowed SUM with PARTITION BY like this:
;WITH CTE AS
(
SELECT
BANK
,ACCOUNT_NAME
,EXCESS
,DEBT
,SUM(EXCESS - DEBT) OVER(PARTITION BY BANK) AS AVAILABLE
,ROW_NUMBER() OVER(PARTITION BY BANK ORDER BY ACCOUNT_NAME ASC) AS rn
FROM BankBalances
)
SELECT BANK
,ACCOUNT_NAME
,EXCESS
,DEBT
,CASE WHEN rn = 1 THEN AVAILABLE ELSE NULL END AS AVAILABLE
FROM CTE

How to create clusters of records from consecutive events

I have BI data stored in a table in snowflake. To simplify, let's say there are only 3 columns in the table:
user_id event_time event_key
I would like to create key clusters on top of the key events. For each user, I want to find groups of consecutive rows whose event_key is in <event_keys_array> and whose time difference (event_time) from the previous event of the set is less than 30 seconds.
Meaning, if an event is created less than 30 seconds after the previous event, and there is no event with an event_key outside <event_keys_array> between them, it is considered part of the same cluster.
How can I achieve this?
This can be done inline with a collection of nested window functions. I've taken some liberties with the "event_keys_array" requirement, since there was no example data to go on. I tend to nest subqueries, but this could just as easily be expressed as a chain of CTEs.
The key thing is identifying each cluster start. With that, the rest falls into place.
CREATE OR REPLACE TEMPORARY TABLE event_stream
(
event_id NUMBER(38,0)
,user_id NUMBER(38,0)
,event_key NUMBER(38,0)
,event_time TIMESTAMP_NTZ(3)
);
INSERT INTO event_stream
(event_id,user_id,event_key,event_time)
VALUES
(1 ,1,1,'2023-01-25 16:25:01.123')--User 1 - Cluster 1
,(2 ,1,1,'2023-01-25 16:25:22.123')--User 1 - Cluster 1
,(3 ,1,1,'2023-01-25 16:25:46.123')--User 1 - Cluster 1
,(4 ,1,2,'2023-01-25 16:26:01.123')--User 1 - Cluster 2 (Not in array)
,(5 ,1,3,'2023-01-25 16:26:02.123')--User 1 - Cluster 3
,(6 ,2,1,'2023-01-25 16:25:01.123')--User 2 - Cluster 1
,(7 ,2,1,'2023-01-25 16:26:01.123')--User 2 - Cluster 2
,(8 ,2,1,'2023-01-25 16:27:01.123')--User 2 - Cluster 3 (in array)
,(9 ,2,3,'2023-01-25 16:27:04.123')--User 2 - Cluster 3 (in array)
,(10,2,2,'2023-01-25 16:27:07.123')--User 2 - Cluster 4
;
SELECT --Distinct to dedup final output down to window function outputs. remove to bring event level data through alongside cluster details.
DISTINCT
D.user_id AS user_id
,MAX(CASE WHEN D.event_position = 1 THEN D.event_time END) OVER(PARTITION BY D.user_id,D.grp) AS event_cluster_start_time
,MAX(CASE WHEN D.event_position_reverse = 1 THEN D.event_time END) OVER(PARTITION BY D.user_id,D.grp) AS event_cluster_end_time
,DATEDIFF(SECOND,event_cluster_start_time,event_cluster_end_time) AS event_cluster_duration_seconds
,COUNT(1) OVER(PARTITION BY D.user_id,D.grp) AS event_cluster_total_contained_events
,FIRST_VALUE(D.event_id) OVER(PARTITION BY D.user_id,D.grp ORDER BY D.event_time ASC) AS event_cluster_initial_event_id
FROM (
SELECT *
,ROW_NUMBER() OVER(PARTITION BY A.user_id,A.grp ORDER BY A.event_time) AS event_position
,ROW_NUMBER() OVER(PARTITION BY A.user_id,A.grp ORDER BY A.event_time DESC) AS event_position_reverse
FROM (
SELECT *
--A rolling sum of cluster starts at the row level provides a value to partition the data on.
,SUM(A.is_start) OVER(PARTITION BY A.user_id ORDER BY A.event_time ROWS UNBOUNDED PRECEDING) AS grp
FROM (
SELECT A.event_id
,A.user_id
,A.event_key
,array_contains(A.event_key::variant, array_construct(1,3)) AS event_key_grouped
,A.event_time
,LAG(event_time,1) OVER(PARTITION BY A.user_id ORDER BY A.event_time) AS previous_event_time
,LAG(event_key_grouped,1) OVER(PARTITION BY A.user_id ORDER BY A.event_time) AS previous_event_key_grouped
,CASE
WHEN --Current event should be grouped with previous if within 30 seconds
DATEADD(SECOND,-30,A.event_time) <= previous_event_time
--add additional cluster inclusion criteria, e.g. same grouped key
AND event_key_grouped = previous_event_key_grouped
THEN NULL ELSE 1
END AS is_start
FROM event_stream A
) AS A
) AS A
) AS D
ORDER BY 1,2 ;
If you wanted to split clusters by another field value such as event_key you just need to add the field to all window function partitions.
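The heart of the query, flagging each cluster start and then taking a rolling sum of those flags to get a cluster id, can be sketched outside SQL as well. Here is a small Python rendering of that logic for a single user's stream (the data mirrors the User 1 rows above; this illustrates the technique, it is not Snowflake code):

```python
from datetime import datetime, timedelta

# One user's stream: (event_key, event_time), mirroring the User 1 rows.
events = [
    (1, datetime(2023, 1, 25, 16, 25, 1)),
    (1, datetime(2023, 1, 25, 16, 25, 22)),
    (1, datetime(2023, 1, 25, 16, 25, 46)),
    (2, datetime(2023, 1, 25, 16, 26, 1)),   # key not in array -> new cluster
    (3, datetime(2023, 1, 25, 16, 26, 2)),
]
event_keys_array = {1, 3}

clusters = []
grp = 0
prev_time = prev_grouped = None
for key, t in events:
    grouped = key in event_keys_array
    # Same cluster only if within 30s of the previous event AND the
    # in-array/out-of-array status matches the previous event.
    is_start = not (prev_time is not None
                    and t - prev_time <= timedelta(seconds=30)
                    and grouped == prev_grouped)
    if is_start:
        grp += 1            # the rolling SUM(is_start) in the SQL version
    clusters.append(grp)
    prev_time, prev_grouped = t, grouped

print(clusters)  # [1, 1, 1, 2, 3] -- matches the cluster comments above
```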

SQL spread out one group proportionally to other groups

So, for example, we have a number of users with different group IDs. Some of them don't have a group:
userID groupID
-------------
user1 group1
user2 group1
user3 group2
user4 group1
user5 NULL
user6 NULL
user7 NULL
user8 NULL
We need to group users by their groupID. We want users without a group (groupID is NULL) to be assigned to one of the existing groups (group1 or group2 in this example), distributed proportionally to the number of users already assigned to each group. In our example group1 has 3 users and group2 has only 1, so 75% (3/4) of the new users should be counted as members of group1 and the other 25% (1/4) should be "added" to group2. The end result should look like this:
groupID numOfUsers
-------------
group1 6
group2 2
This is a simplified example.
Basically, we can't figure out how users without a group can be divided between groups in a given proportion, rather than just evenly distributed.
We can have any number of groups and users, so we can't just hardcode percentages.
Any help is appreciated.
Edit:
I tried to use NTILE(), but it gives an even distribution, not one proportional to the number of users in each group:
SELECT userID ,
NTILE(2) OVER( ) gr
from(
select DISTINCT userID
from test_task
WHERE groupID IS NULL ) AS abc
Here is one way (note the 1.0 * factor to force non-integer division, and that the proportion is multiplied before rounding):
select
groupid
, count(*)
+ round(1.0 * count(*) / sum(count(*)) over()
* (select count(*) from test_task where groupid is null), 0)
from test_task
where groupid is not null
group by groupid
We can use an updatable CTE to do this.
First, we take all existing data, group it by groupID, and calculate a running sum of the number of rows, as well as the total rows over the whole set.
Then we take the rows we want to update and add a row number (minus 1 so the modulo calculations work).
We join the two on the condition that the row number modulo the total existing rows falls between the previous running sum and the current running sum.
Note that this only distributes exactly proportionally when the number of ungrouped rows is a multiple of the number of existing rows, e.g. 4 or 8 ungrouped rows against 4 existing rows.
WITH Groups AS (
SELECT
groupID,
perGroup = COUNT(*),
total = SUM(COUNT(*)) OVER (),
runningSum = SUM(COUNT(*)) OVER (ORDER BY groupID ROWS UNBOUNDED PRECEDING)
FROM test_task
WHERE groupID IS NOT NULL
GROUP BY groupID
),
ToUpdate AS (
SELECT
groupID,
userID,
rn = ROW_NUMBER() OVER (ORDER BY userID) - 1
FROM test_task tt
WHERE groupID IS NULL
)
UPDATE u
SET groupID = g.groupID
FROM ToUpdate u
JOIN Groups g
ON u.rn % (g.total) >= g.runningSum - g.perGroup
AND u.rn % (g.total) < g.runningSum;
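To see the modulo/running-sum join do its work, here is a small Python sketch of the same assignment logic using the question's numbers (the user names are hypothetical; the bands correspond to the Groups CTE):

```python
# Existing users per group, from the question: group1 has 3 users, group2 has 1.
existing = {"group1": 3, "group2": 1}
ungrouped = ["user5", "user6", "user7", "user8"]

total = sum(existing.values())

# Build [runningSum - perGroup, runningSum) bands per group, ordered by groupID,
# just like the Groups CTE.
bands = []
running = 0
for gid in sorted(existing):
    per = existing[gid]
    bands.append((gid, running, running + per))  # half-open band [lo, hi)
    running += per

# Each ungrouped row gets rn = ROW_NUMBER() - 1; rn % total picks its band.
assignment = {}
for rn, user in enumerate(ungrouped):
    slot = rn % total
    for gid, lo, hi in bands:
        if lo <= slot < hi:
            assignment[user] = gid
            break

counts = dict(existing)
for gid in assignment.values():
    counts[gid] += 1
print(counts)  # {'group1': 6, 'group2': 2}
```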

Adding conditions at the WHERE clause gives more results

I use SQL Server. I have a table with lots of columns, the most important of which are:
· User_name
· Partition - Date in xxxx-xx-xx format
· Game - a string that works as an ID
· Credits - A number
· Bet - Another number
· Prize - Another number
· Num_Spins - Another number
I wrote a query to select the rows that interest me in a given date range:
Select distinct CONCAT(User_Name, DATALENGTH(User_Name)) as User_name, Partition, Game, Bet, Num_spins, Credits, Prize
from ***
where Partition>='2019-09-01' and Partition<'2019-11-17' and Bet>0 and credits is not null
and User_Name IN (Select distinct userName from *** where GeoIpCountryCode='ES')
I wish I could make that a view or something, but unfortunately I don't have the privileges to do so. Therefore, I use it as a subquery:
I want to find, out of those rows, the ones whose numbers satisfy a certain condition: (Credits+Bet-Prize) > 100000 and num_spins > 5.
Select user_name, partition, count(Game) as difMachines
FROM
(
Select distinct CONCAT(User_Name, DATALENGTH(User_Name)) as User_name, Partition, Game, Bet, Num_spins, Credits, Prize
from ***
where Partition>='2019-09-01' and Partition<'2019-11-17' and Bet>0 and credits is not null
and User_Name IN (Select distinct userName from *** where GeoIpCountryCode='ES')
) as A
where
(Credits+Bet-Prize) > 100000 and num_spins>5
group by User_Name, Partition;
Now I have all the information I need. I run the last query to group these results by date so I can analyze them:
Select datepart(week,Partition) as Week, count (distinct user_name) as Users
from (
Select user_name, partition, count(Game) as difMachines
FROM
(
Select distinct CONCAT(User_Name, DATALENGTH(User_Name)) as User_name, Partition, Game, Bet, Num_spins, Credits, Prize
from ***
where Partition>='2019-09-01' and Partition<'2019-11-17' and Bet>0 and credits is not null
and User_Name IN (Select distinct userName from *** where GeoIpCountryCode='ES')
) as A
where
(Credits+Bet-Prize) > 100000 and num_spins>5
group by User_Name, Partition
) as B
Where difMachines=1
group by datepart(week,Partition)
order by Week asc;
I know the query can be optimized, but that's not what troubles me. The problem is that when running this query, I obtain 17050 users at week 36. If I change the line (Credits+Bet-Prize) > 100000 and num_spins>5 to (Credits+Bet-Prize) > 100000 (so I purely remove the num_spins>5 part), I get 16800 users instead.
To sum up, I get more results by being more restrictive in my query. That does not make sense to me. Can someone please help, or point me in the right direction?
Thank you
You are trying to get the count of the result set with the filter difMachines = 1, aren't you? But if you remove the filter num_spins > 5, the inner counts increase, so fewer rows end up with difMachines equal to exactly 1. Here is an example like yours:
Declare @t table
(
[user_name] varchar(5), [partition] date, Game varchar(10),num_spins int
)
insert into @t
select 'a','01nov19','g1',1
union all
select 'a','01nov19','g1',2
union all
select 'a','01nov19','g1',3
union all
select 'a','01nov19','g1',4
union all
select 'a','01nov19','g1',5
union all
select 'a','01nov19','g1',6
union all
select 'b','01nov19','g1',7
select * from
(
select [user_name],[partition],count(game) cnt
from @t
where num_spins>5
group by [user_name],[partition]
)a
where cnt=1
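The same effect can be reproduced end to end with SQLite from Python (a sketch mirroring the toy data above; the column name partition is shortened to part since it is reserved in some dialects):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (user_name TEXT, part TEXT, game TEXT, num_spins INT)")
# User 'a' has 6 rows (spins 1..6), user 'b' has 1 row (spin 7).
rows = [("a", "01nov19", "g1", n) for n in range(1, 7)] + [("b", "01nov19", "g1", 7)]
conn.executemany("INSERT INTO t VALUES (?, ?, ?, ?)", rows)

def users_with_cnt_1(extra_filter):
    """Count users whose per-(user, date) game count is exactly 1."""
    sql = f"""
        SELECT COUNT(*) FROM (
            SELECT user_name, part, COUNT(game) AS cnt
            FROM t
            WHERE {extra_filter}
            GROUP BY user_name, part
        ) AS s WHERE cnt = 1
    """
    return conn.execute(sql).fetchone()[0]

print(users_with_cnt_1("num_spins > 5"))  # 2: both users have exactly one qualifying row
print(users_with_cnt_1("1 = 1"))          # 1: user 'a' now counts 6 rows, so only 'b' qualifies
```

The stricter inner filter shrinks the counts, so more users hit cnt = 1, exactly the "more restrictive gives more results" behavior from the question.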

get first row for each group

I want to transform this data
account; docdate; docnum
17700; 9/11/2015; 1
17700; 9/12/2015; 2
70070; 9/1/2015; 4
70070; 9/2/2015; 6
70070; 9/3/2015; 9
into this
account; docdate; docnum
17700; 9/12/2015; 2
70070; 9/3/2015; 9
For each account I want to have one row with the most recent docdate (= max(docdate)). I already tried different approaches with CROSS APPLY and ROW_NUMBER but couldn't achieve the desired result.
Use ROW_NUMBER:
SELECT account, docdate, docnum
FROM (
SELECT account, docdate, docnum,
ROW_NUMBER() OVER (PARTITION BY account
ORDER BY docdate DESC) AS rn
FROM mytable ) AS t
WHERE t.rn = 1
The PARTITION BY account clause creates slices of rows sharing the same account value. ORDER BY docdate DESC places the record having the maximum docdate value at the top of its slice. Hence rn = 1 points to the record with the maximum docdate value within each account partition.
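Here is a runnable check of this query against the sample data (using SQLite via Python purely for demonstration; dates are stored as ISO yyyy-mm-dd strings so that ORDER BY docdate DESC sorts correctly, which the original m/d/yyyy strings would not):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mytable (account INT, docdate TEXT, docnum INT)")
conn.executemany("INSERT INTO mytable VALUES (?, ?, ?)", [
    (17700, "2015-09-11", 1),
    (17700, "2015-09-12", 2),
    (70070, "2015-09-01", 4),
    (70070, "2015-09-02", 6),
    (70070, "2015-09-03", 9),
])

# One row per account: the one ranked first by docdate DESC.
result = conn.execute("""
    SELECT account, docdate, docnum
    FROM (
        SELECT account, docdate, docnum,
               ROW_NUMBER() OVER (PARTITION BY account
                                  ORDER BY docdate DESC) AS rn
        FROM mytable
    ) AS t
    WHERE t.rn = 1
    ORDER BY account
""").fetchall()
print(result)  # [(17700, '2015-09-12', 2), (70070, '2015-09-03', 9)]
```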

Get total sum in SQL Server for unknown column location?

There is an insurance policy and this policy can be paid by 1-3 agents.
Line #1, for example: for policy ID 1, an agent whose ID is 100 paid 123.
Line #3, for example: for policy ID 3, an agent whose ID is 999 paid 741, and another agent whose ID is 100 paid 874.
(The representation is not how it should be done correctly, but that is how I have it as a fact.)
How can I find how much agent ID 100 has paid in total?
(123+541+874+557+471+552)
I have a very ugly UNION-based solution.
In a well-normalized model this would be a simple query. You can 'normalize' in a CTE and then sum:
with cte as (
select agent1id as id, agent1sum as s
from insurance where agent1id is not null
union all
select agent2id as id, agent2sum as s
from insurance where agent2id is not null
union all
select agent3id as id, agent3sum as s
from insurance where agent3id is not null
)
select sum( s)
from cte
where id = 100
This is an index-friendly approach if your table has indexes on the agent columns. An index-friendly query avoids a full table scan.
Looks like
SUM(
CASE WHEN agent1id=100 THEN agent1sum ELSE 0 END +
CASE WHEN agent2id=100 THEN agent2sum ELSE 0 END +
CASE WHEN agent3id=100 THEN agent3sum ELSE 0 END)
should aggregate it properly. If you need to do it for all agents, I'd use the agent table, or use a CTE before this query to get the distinct agent IDs, then replace the 100 above.
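A quick check of the CASE-based sum on the two rows described in the question (SQLite via Python for illustration; the column names agent1id/agent1sum etc. are assumed from the answers above, and only the rows spelled out in the question are included, so the total here is 123 + 874 rather than the question's full figure):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE insurance (
        policy_id INT,
        agent1id INT, agent1sum INT,
        agent2id INT, agent2sum INT,
        agent3id INT, agent3sum INT
    )
""")
# Only the two rows described in the question.
conn.executemany("INSERT INTO insurance VALUES (?, ?, ?, ?, ?, ?, ?)", [
    (1, 100, 123, None, None, None, None),
    (3, 999, 741, 100, 874, None, None),
])

# Per row, pick whichever agentNsum slots belong to agent 100, then sum.
total = conn.execute("""
    SELECT SUM(
        CASE WHEN agent1id = 100 THEN agent1sum ELSE 0 END +
        CASE WHEN agent2id = 100 THEN agent2sum ELSE 0 END +
        CASE WHEN agent3id = 100 THEN agent3sum ELSE 0 END)
    FROM insurance
""").fetchone()[0]
print(total)  # 997, i.e. 123 + 874
```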
