SQL: Number of occurrences grouped by frequency - sql-server

It's not clear to me which exact statement to use here. I want to know how often each number of occurrences appears in the table when the value is A. So for some sample data:
user | value
1 | A
1 | A
1 | B
4 | A
4 | A
4 | B
5 | A
5 | A
5 | A
Would result in:
Occurrence | Frequency
1 | 0
2 | 2
3 | 1
Which reads as: there are 0 users that have one value A, there are 2 users that have two values A, and so on.
I feel like I should use a group by and a count(*), but it's not clear to me how to construct it.

Since you want the occurrences even for 0 frequencies, you need a recursive CTE which returns all occurrences from 1 to the max number of occurrences.
Then you join this cte with a LEFT join to a query that aggregates on the table and aggregate once more to get the frequencies:
with
cte as (
select count(*) counter
from tablename
where value = 'A'
group by [user]
),
top_counter as (select max(counter) counter from cte),
occurrences as (
select 1 occurrence
union all
select occurrence + 1
from occurrences
where occurrence < (select counter from top_counter)
)
select o.occurrence, count(c.counter) frequency
from occurrences o left join cte c
on c.counter = o.occurrence
group by o.occurrence
See the demo.
Results:
> occurrence | frequency
> ---------: | --------:
> 1 | 0
> 2 | 2
> 3 | 1
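One caveat with the recursive approach: SQL Server stops a recursive CTE after 100 recursion levels by default, so the query above fails with a recursion-limit error once some user has more than roughly 100 rows with value A. Below is a minimal sketch of the same query with the limit raised. The inline `src` CTE is only a stand-in for `tablename`, and the limit of 1000 is an arbitrary assumption.

```sql
with
src as (
    -- stand-in for the real table; [user] is bracketed because USER is reserved
    select *
    from (values (1,'A'),(1,'A'),(1,'B'),
                 (4,'A'),(4,'A'),(4,'B'),
                 (5,'A'),(5,'A'),(5,'A')) as v([user], [value])
),
cte as (
    -- number of A rows per user
    select count(*) counter
    from src
    where [value] = 'A'
    group by [user]
),
top_counter as (select max(counter) counter from cte),
occurrences as (
    -- 1, 2, ..., max(counter)
    select 1 occurrence
    union all
    select occurrence + 1
    from occurrences
    where occurrence < (select counter from top_counter)
)
select o.occurrence, count(c.counter) frequency
from occurrences o left join cte c
on c.counter = o.occurrence
group by o.occurrence
order by o.occurrence
option (maxrecursion 1000); -- assumed upper bound; 0 removes the limit entirely
```

On the sample rows this returns the same three rows as above: (1, 0), (2, 2), (3, 1).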

You do use COUNT, just 2 of them:
WITH Counts AS(
SELECT V.[User],
COUNT([Value]) AS Frequency
FROM (VALUES(1,'A'),
(1,'A'),
(1,'B'),
(4,'A'),
(4,'A'),
(4,'B'),
(5,'A'),
(5,'A'),
(5,'A'))V([User],[Value]) --USER is a reserved keyword and should not be used for object names
WHERE V.[Value] = 'A'
GROUP BY V.[user])
SELECT V.I,
COUNT(C.Frequency) AS Frequency
FROM (VALUES(1),(2),(3))V(I)
LEFT JOIN Counts C ON V.I = C.Frequency
GROUP BY V.I;

Here's my take:
with cte as (
select * from (values
(1, 'A'),
(1, 'A'),
(1, 'B'),
(4, 'A'),
(4, 'A'),
(4, 'B'),
(5, 'A'),
(5, 'A'),
(5, 'A')
) as x([User], [Value])
)
select c, count(*)
from (
select [User], count(*) as c
from cte
where [Value] = 'A'
group by [User]
) as s
group by c;
The common table expression isn't important here - it's just setting up your test data.
What you're after is an aggregation of aggregations. That is, the first level aggregate is a "count of value by user". But then you're going to get a "count of (count of value by user) by (that count)". Note that my set doesn't produce the "0 users that have 1 value A" row. Nor does it produce "0 users that have 17 value A". If it's important that it produce certain zero-count results, you'll need a list of the occurrence values you care about, and you'll need to outer join that list to this set of results.

Related

Identify duplicates based on multiple columns and parent row

This is an example of table data that I am working on (the table contained a lot of columns, I am showing here only the relevant ones):
| Id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 1 | 42FWD-42 | 0 | 0 |
| 2 | 42FWD-42 | 1 | 1 |
| 3 | 42FWD-42 | 5 | 1 |
Id is auto generated. parent_id links the job using the id.
When a new job is created via the app, a new row is created (with status "0"). The auto-generated Id is then used for subsequent rows of the same job, where it is set as parent_id.
Another record with status "1" (which is the code for started) is also created just after the parent record.
Explanation of the problem: due to a bug in the app, there are duplicate sets of rows for the same job.
Example of problem
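| Id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 1 | 42FWD-42 | 0 | 0 |
| 2 | 42FWD-42 | 0 | 0 |
| 3 | 42FWD-42 | 1 | 1 |
| 4 | 42FWD-42 | 1 | 2 |
| 5 | 42FWD-42 | 5 | 1 |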
As you can see from this example, due to the bug, there are 2 rows with "0" status for the same job, and 2 rows with "1" status.
This causes a lot of problems in app operations that update a job by its job number.
The status number should not repeat for a specific job.
What I want to do is to find all duplicates like those in example. For example, I want a query where I can find all duplicates which have same job number, but different parent_id and NO "5" status.
Example result using the example table above, I need the query to return:
| Id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 2 | 42FWD-42 | 0 | 0 |
| 4 | 42FWD-42 | 1 | 2 |
Explanation of this result:
Row with Id=1 is considered the correct record because it has an associated record with status "5"
Row with Id=2 is considered duplicate and its associated records are also considered duplicate
Another possible case: there are duplicate rows, but none have status=5. These rows can be discarded, ie need not be shown in results.
A brief explanation of how the query works will be appreciated.
EDIT:
I forgot to add an important piece of information:
job_number is case sensitive.
That is, 42FWD-42 and 42fwd-42 are different, valid job numbers. They should not be considered duplicates; they are 2 separate jobs.
The reason for this is the actual job number is not small text as in my example. It is a long string like a guid.
First, I must mention that you should block identical rows by means of a unique constraint. I suggest that once you have eliminated all duplicates, you add such a constraint to keep this from happening again.
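As a rough sketch of such a constraint: the table name `dbo.jobs` and the constraint name are hypothetical, and it assumes the rule from the question that a status must not repeat within a job.

```sql
-- Hypothetical table and constraint names.
-- Run this only after the duplicates are gone; otherwise the ALTER fails
-- because the existing rows already violate the constraint.
alter table dbo.jobs
    add constraint UQ_jobs_job_number_status unique (job_number, status);
```

Uniqueness is evaluated with the collation of the job_number column, so under a case-insensitive collation two job numbers that differ only by case would be treated as the same value.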
Now for your question: you can do this by grouping on the duplicated columns and keeping only the groups with a count greater than one.
Here is an example
declare @t table (id int, job_number varchar(10), status int, parent_id int)
insert into @t
values (1, '42FWD-42', 0, 0), (2, '42FWD-42', 0, 0), (3, '42FWD-42', 1, 1), (4, '42FWD-42', 1, 2), (5, '42FWD-42', 5, 1)
select max(t.id) as id, t.job_number, t.status
from @t t
group by t.job_number, t.status
having count(*) > 1
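The result is:
| id | job_number | status |
|----|------------|--------|
| 2 | 42FWD-42 | 0 |
| 4 | 42FWD-42 | 1 |
To also get the parent_id, you can add a correlated subquery that looks the row up again by its id: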
select max(t.id) as id,
t.job_number,
t.status,
(select t2.parent_id from @t t2 where t2.id = max(t.id)) as parent_id
from @t t
group by t.job_number, t.status
having count(*) > 1
This returns:
| id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 2 | 42FWD-42 | 0 | 0 |
| 4 | 42FWD-42 | 1 | 2 |
EDIT
To solve the additional problem from the edit of your question, about case sensitivity, you can use COLLATE in both the column retrieval and the comparison.
This should do it:
declare @t table (id int, job_number varchar(10), status int, parent_id int)
insert into @t
values (1, '42FWD-42', 0, 0),
(2, '42FWD-42', 0, 0),
(3, '42FWD-42', 1, 1),
(4, '42fwd-42', 1, 2), -- LOWERCASE !!!
(5, '42FWD-42', 5, 1)
select max(t.id) as id,
t.job_number COLLATE Latin1_General_CS_AS as job_number,
t.status,
(select t2.parent_id from @t t2 where t2.id = max(t.id)) as parent_id
from @t t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1
And now the result will be:
| id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 2 | 42FWD-42 | 0 | 0 |
Yet another edit
Now, suppose you need to use the result of these duplicate id's in another query, you could do something like this
select t.*
from @t t
where t.id in ( select max(t.id) as id
from @t t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1
)
What I am doing here is getting only the duplicate id's in a form that can be used to feed a where clause in another query.
This way you can use the result set in any way you wish.
Also note that for this we don't need the self join to retrieve the parent_id anymore.
One possible use of this is to delete duplicate rows. You can write:
delete from yourtable
where id in ( select max(t.id) as id
from yourtable t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1
)
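Because a delete like this is hard to undo, it can help to preview it first. The sketch below is one possible way, assuming the same placeholder table name `yourtable`: it runs the delete inside a transaction, uses the OUTPUT clause to show the rows that were removed, and then rolls back.

```sql
begin transaction;

delete from yourtable
output deleted.id, deleted.job_number, deleted.status, deleted.parent_id
where id in ( select max(t.id)
              from yourtable t
              group by t.job_number COLLATE Latin1_General_CS_AS, t.status
              having count(*) > 1 );

-- Nothing is kept while this is a rollback.
-- Switch to "commit transaction" once the output looks right.
rollback transaction;
```

Note that max(id) removes only one row per duplicated group, so a group with three or more copies needs the statement to be run more than once.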
You can try the ROW_NUMBER window function to get the duplicate rows and their ids per job_number, and then use a recursive CTE to find all erroneous records linked to those ids.
Query 1:
;WITH CTE AS (
SELECT *,ROW_NUMBER() OVER (PARTITION BY job_number ORDER BY Id) rn
FROM T
WHERE status = 0
), CTE1 AS (
SELECT id,job_number,status,parent_id
FROM CTE
WHERE rn > 1
UNION ALL
SELECT t.id,t.job_number,t.status,t.parent_id
FROM CTE1 c INNER JOIN T t
ON c.id = t.parent_id
)
SELECT *
FROM CTE1
Results:
| id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 2 | 42FWD-42 | 0 | 0 |
| 4 | 42FWD-42 | 1 | 2 |
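The question also asks that a duplicate set be reported only when another set for the same job has a status "5" row. A hedged sketch of one way to add that rule follows. It reuses the table name `T` from the query above and assumes that child rows point directly at their status "0" row through parent_id (one level deep). The names `roots`, `bad_roots` and `has_5` are made up for illustration.

```sql
with roots as (
    -- one row per status-0 record, flagged if it has a status-5 child
    select j.id,
           j.job_number,
           case when exists (select 1
                             from T c
                             where c.parent_id = j.id
                               and c.status = 5)
                then 1 else 0 end as has_5
    from T j
    where j.status = 0
),
bad_roots as (
    -- roots without a status-5 child whose job has another root that does have one
    select r.id
    from roots r
    where r.has_5 = 0
      and exists (select 1
                  from roots g
                  where g.job_number = r.job_number
                    and g.has_5 = 1)
)
select j.id, j.job_number, j.status, j.parent_id
from T j
where j.id in (select id from bad_roots)
   or j.parent_id in (select id from bad_roots)
order by j.id;
```

On the example table this returns the rows with Id 2 and Id 4. Jobs where no set has status "5" are left out, as requested. The job_number comparison follows the column's collation, so a case-sensitive collation (or a COLLATE clause on the comparison) is needed for the case-sensitivity requirement.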

SQL Server SUM based on subsequent records

Microsoft SQL Server 2012 (SP1) - 11.0.3156.0 (X64)
I am not sure of the best way to word this and have tried a few different searches with different combinations of words without success.
I only want to sum rows with Sequence = 1 when the same ID also has rows with Sequence > 1; in the table below, those Sequence = 1 rows are marked with *. I don't care about checking that Sequence 2, 3, etc. match the same pattern, because if they exist at all I need to sum them.
I have data that looks like this:
| Sequence | ID | Num | OtherID |
|----------|----|-----|---------|
| 1 | 1 | 10 | 1 |*
| 2 | 1 | 15 | 1 |
| 3 | 1 | 20 | 1 |
| 1 | 2 | 10 | 1 |*
| 2 | 2 | 15 | 1 |
| 1 | 3 | 10 | 1 |
| 1 | 1 | 40 | 3 |
I need to sum the Num column but only when there is more than one sequence. My output would look like this:
| Sequence | Sum | OtherID |
|----------|-----|---------|
| 1 | 20 | 1 |
| 2 | 30 | 1 |
| 3 | 20 | 1 |
I have tried grouping the queries in a bunch of different ways but really by the time I get to the sum, I don't know how to look ahead to make sure there are greater than 1 sequences for an ID.
My query at the moment looks something like this:
select Sequence, Sum(Num) as [Sum], OtherID
from tbl
where ID in (Select ID from tbl where Sequence > 1)
Group by Sequence, OtherID
tbl is a CTE that I wrapped around my query and it partially works, but is not really the filter I wanted.
If this is something that just shouldn't be done or can't be done then I can handle that, but if it's something I am missing I'd like to fix the query.
Edit:
I can't give the full query here but I started with this table/data (to get the above output). The OtherID is there because the data has the same ID/Sequence combinations but that OtherID helps separate them out so the rows are not identical (multiple questions on a form).
Create table #tmpTable (ID int, Sequence int, Num int, OtherID int)
insert into #tmpTable (ID, Sequence, Num, OtherID) values (1, 1, 10, 1)
insert into #tmpTable (ID, Sequence, Num, OtherID) values (1, 2, 15, 1)
insert into #tmpTable (ID, Sequence, Num, OtherID) values (1, 3, 20, 1)
insert into #tmpTable (ID, Sequence, Num, OtherID) values (2, 1, 10, 1)
insert into #tmpTable (ID, Sequence, Num, OtherID) values (2, 2, 15, 1)
insert into #tmpTable (ID, Sequence, Num, OtherID) values (3, 1, 10, 1)
insert into #tmpTable (ID, Sequence, Num, OtherID) values (1, 1, 40, 3)
The following will sum over Sequence and OtherID, but only for rows where either the sequence is greater than 1, or another row exists with the same ID and OtherID but a different sequence.
Query:
select Sequence, Sum(Num) as SumNum, OtherID from #tmpTable a
where Sequence > 1
or exists (select * from #tmpTable b
where a.ID = b.ID
and a.OtherID = b.OtherID
and b.Sequence <> a.Sequence)
group by Sequence, OtherID;
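The same filter can also be written with a window count instead of EXISTS. This is only a sketch of an alternative, and it assumes that each combination of ID, OtherID and Sequence appears at most once in #tmpTable, so "more than one row for the ID and OtherID" means "more than one sequence". The alias `rows_in_group` is made up for illustration.

```sql
select Sequence, sum(Num) as SumNum, OtherID
from (
    select Sequence, Num, OtherID,
           -- how many rows share this ID and OtherID
           count(*) over (partition by ID, OtherID) as rows_in_group
    from #tmpTable
) t
where rows_in_group > 1
group by Sequence, OtherID
order by OtherID, Sequence;
```

With the sample rows, only the rows for ID 1 and ID 2 with OtherID 1 survive the filter, which gives the sums 20, 30 and 20 for sequences 1, 2 and 3.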
It looks like you are trying to sum by Sequence and OtherID if the Count of ID >1, so you could do something like below:
select Sequence, Sum(Num) as [Sum], OtherID
from tbl
where ID in (Select ID from tbl where Sequence > 1)
Group by Sequence, OtherID
Having count(id)>1

updating min value on the second column when the first column appears more than once

I'm struggling with how to do this in one step.
I have a column with values which vary between 1 and roughly 20. Linked to this is a second value, which is normally between 1 and 5.
What I want to do is this: when a Number 1 value appears more than once, I need to update the value in column Number 2 to 99, but only on the row with the highest number in the Number 2 column.
I have added a pic to explain better.
Basically id is unique; if a value in column 1 appears more than once, I need to update column 2 on the row where column 2 holds the highest value.
You can use row_number() to find the row with the highest No2 value and you can use count() over() to check if there are more than one row present for a No1 value.
SQL Fiddle
MS SQL Server 2008 Schema Setup:
create table YourTable
(
No1 int,
No2 int
);
insert into YourTable values
(1, 3),
(1, 2),
(2, 1);
Query 1:
with C as
(
select No2,
row_number() over(partition by No1 order by No2 desc) as rn,
count(*) over(partition by No1) as c
from YourTable
)
update C
set No2 = 99
where rn = 1 and
c > 1
Results:
Query 2:
select *
from YourTable
Results:
| NO1 | NO2 |
|-----|-----|
| 1 | 99 |
| 1 | 2 |
| 2 | 1 |
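Before running an update through a CTE like this, it may be worth checking which rows it would touch. A small sketch using the same CTE with a SELECT in place of the UPDATE (No1 is added to the column list only to make the output readable):

```sql
with C as
(
    select No1,
           No2,
           row_number() over(partition by No1 order by No2 desc) as rn,
           count(*) over(partition by No1) as c
    from YourTable
)
select No1, No2
from C
where rn = 1 and   -- highest No2 within each No1
      c > 1;       -- only when No1 appears more than once
```

Run against the three sample rows before the update, this returns the single row (No1 = 1, No2 = 3). If two rows tie for the highest No2, row_number() picks one of them arbitrarily.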

Tsql group by clause with exceptions

I have a problem with a query.
This is the data (order by Timestamp):
Data
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract distinct values and the date of their first occurrence. The exception here is that I need to group them only if they are not interrupted by a new value in that timeframe.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I'm sure there is an easier way to do it; I just can't think of it. Could anyone help?
This is what I started with - probably could work with that. This is a query that should locate when a value is changed.
> SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
> d1.Value <> d2.Value
It could probably be done with good use of the row_number function, but I can't manage it.
Sample data:
declare @T table (ID int, Value int, Timestamp date)
insert into @T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
select *,ROW_NUMBER() OVER (ORDER By TimeStamp) as rn --TODO - specific columns better than *
from @T
), Firsts as (
select
ov1.* --TODO - specific columns better than *
from
OrderedValues ov1
left join
OrderedValues ov2
on
ov1.Value = ov2.Value and
ov1.rn = ov2.rn + 1
where
ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If that's the situation, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second query simply finds rows where there isn't an immediate preceding row with the same Value.
Result:
ID Value Timestamp rn
----------- ----------- ---------- --------------------
1 0 2001-01-01 1
3 1 2003-01-01 3
5 0 2005-01-01 5
6 2 2006-01-01 6
You can order by rn if you need the results in this specific order.
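On SQL Server 2012 or later, the same "no immediately preceding row with the same Value" test can be sketched with LAG instead of a self join. This is an alternative technique, not the query above. It reads from the table named Data in the question and assumes that Value is never NULL; the alias `prev_value` is made up for illustration.

```sql
select ID, Value, [Timestamp]
from (
    select ID, Value, [Timestamp],
           -- Value of the previous row in time order; NULL for the first row
           lag(Value) over (order by [Timestamp]) as prev_value
    from Data
) x
where prev_value is null        -- first row overall
   or prev_value <> Value       -- value changed compared to the previous row
order by [Timestamp];
```

For the sample data this keeps the rows with ID 1, 3, 5 and 6.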

Reporting on data when data is missing (ie. how to report zero activities for a customer on a given week)

I want to create a report which aggregates the number of activities per customer per week.
If there have been no activities for that customer in a given week, 0 should be displayed (i.e. weeks 3 and 4 in the sample below).
CUSTOMER | #ACTIVITIES | WEEKNUMBER
A | 4 | 1
A | 2 | 2
A | 0 | 3
A | 0 | 4
A | 1 | 5
B ...
C ...
The problem is that if there are no activities, there is no data to report on, and therefore weeks 3 and 4 in the sample above are not in the report.
What is the "best" way to solve this?
Try this:
DECLARE @YourTable table (CUSTOMER char(1), ACTIVITIES int, WEEKNUMBER int)
INSERT @YourTable VALUES ('A' , 4 , 1)
INSERT @YourTable VALUES ('A' , 2 , 2)
INSERT @YourTable VALUES ('A' , 0 , 3)
INSERT @YourTable VALUES ('A' , 0 , 4)
INSERT @YourTable VALUES ('A' , 1 , 5)
INSERT @YourTable VALUES ('B' , 5 , 3)
INSERT @YourTable VALUES ('C' , 2 , 4)
DECLARE @StartNumber int
,@EndNumber int
SELECT @StartNumber=1
,@EndNumber=5
;WITH AllNumbers AS
(
SELECT @StartNumber AS Number
UNION ALL
SELECT Number+1
FROM AllNumbers
WHERE Number<@EndNumber
)
, AllCustomers AS
(
SELECT DISTINCT CUSTOMER FROM @YourTable
)
SELECT
n.Number AS WEEKNUMBER, c.CUSTOMER, CASE WHEN y.Customer IS NULL THEN 0 ELSE y.ACTIVITIES END AS ACTIVITIES
FROM AllNumbers n
CROSS JOIN AllCustomers c
LEFT OUTER JOIN @YourTable y ON n.Number=y.WEEKNUMBER AND c.CUSTOMER=y.CUSTOMER
ORDER BY n.Number, c.CUSTOMER
--OPTION (MAXRECURSION 500)
OUTPUT:
WEEKNUMBER CUSTOMER ACTIVITIES
----------- -------- -----------
1 A 4
1 B 0
1 C 0
2 A 2
2 B 0
2 C 0
3 A 0
3 B 5
3 C 0
4 A 0
4 B 0
4 C 2
5 A 1
5 B 0
5 C 0
(15 row(s) affected)
I use a CTE to build a Numbers table, but you could build a permanent one; see this question: What is the best way to create and populate a numbers table? With a permanent Numbers table you could write the query without the recursive CTE (same results as above):
SELECT
n.Number AS WEEKNUMBER, c.CUSTOMER, CASE WHEN y.Customer IS NULL THEN 0 ELSE y.ACTIVITIES END AS ACTIVITIES
FROM Numbers n
CROSS JOIN (SELECT DISTINCT
CUSTOMER
FROM @YourTable
) c
LEFT OUTER JOIN @YourTable y ON n.Number=y.WEEKNUMBER AND c.CUSTOMER=y.CUSTOMER
WHERE n.Number>=1 AND n.Number<=5
ORDER BY n.Number,c.CUSTOMER
Keep a separate table of time periods, and then left outer join the activities to it.
Like:
select *
from ReportingPeriod as p
left join Activities as a on a.ReportingPeriodId = p.ReportingPeriodId;
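To get one row per customer per period with a zero where nothing happened, the periods can be cross joined with the customers before the outer join. The sketch below assumes a hypothetical schema: a Customer table with a CustomerId column, and an Activities table with ActivityId, CustomerId and ReportingPeriodId columns. None of these appear in the original question.

```sql
select c.CustomerId,
       p.ReportingPeriodId,
       -- count(a.ActivityId) counts only matched rows, so unmatched pairs give 0
       count(a.ActivityId) as Activities
from ReportingPeriod as p
cross join Customer as c
left join Activities as a
       on a.ReportingPeriodId = p.ReportingPeriodId
      and a.CustomerId = c.CustomerId
group by c.CustomerId, p.ReportingPeriodId
order by c.CustomerId, p.ReportingPeriodId;
```

Counting a column from the outer-joined table, rather than using count(*), is what turns missing activity into 0 instead of 1.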
