TSQL count vs sum distinct values

I have a slightly confusing conundrum and I have been stuck all day.
I have the following types of data:
For each customer record I have order numbers; for each order, a series of package numbers; and for each package number, one or more zones. Normally the math would be relatively simple: if there is one package with one or more zones, we just select the distinct number of seats. For example:
+-----------+-------+-----+------+-------+
| customer | order | pkg | zone | seats |
+-----------+-------+-----+------+-------+
| 1 | 1 | 11 | 7 | 2 |
| 1 | 1 | 12 | 7 | 2 |
+-----------+-------+-----+------+-------+
We know customer 1 has 2 seats per package.
Here is where it gets tricky
+----------+-------+-----+------+-------+
| customer | order | pkg | zone | seats |
+----------+-------+-----+------+-------+
| 2 | 3 | 8 | 5 | 2 |
| 2 | 3 | 9 | 5 | 2 |
| 2 | 3 | 10 | 5 | 2 |
-- In the above case, a given customer has one order (#3) with three packages in the same zone; each package has two seats.
| 2 | 3 | 9 | 6 | 1 |
| 2 | 3 | 9 | 8 | 1 |
| 2 | 3 | 10 | 7 | 2 |
+----------+-------+-----+------+-------+
-- Here things are confusing because the same customer has a single order #3 (and it's possible
-- for both scenarios to occur in one order) with two packages, 9 and 10. Package 9 has two zones
-- with 1 seat each, and package 10 has one zone with two seats. How do we distinguish when to simply
-- count the seats, as in the first and second cases, from when to sum the seats, as in the last example?
To reiterate: a single customer has a single order; each order can have many packages with distinct package numbers; each package can have one or more zones; and each zone can have one or more seats.
When the zones are the same for a single package, we simply count distinct. When a single order + package has more than one zone, we sum instead of counting.
I can't figure out how to code the logic. Please help.
My columns are customer_no, order_no, pkg_no, zone_no and pkg_seats.
Here is a real example
+----------+-------+-----+-------+------+
| customer | order | pkg | seats | zone |
+----------+-------+-----+-------+------+
| 374 | 876 | 68 | 2 | 26 |
| 374 | 876 | 68 | 1 | 32 |
| 374 | 876 | 68 | 1 | 56 |
| 374 | 876 | 71 | 2 | 56 |
| 374 | 876 | 71 | 2 | 79 |
| 862 | 538 | 71 | 2 | 33 |
| 862 | 538 | 71 | 1 | 81 |
| 862 | 538 | 71 | 1 | 82 |
-- In the case below we simply count 2; in the cases above we sum.
| 575 | 994 | 68 | 2 | 34 |
| 575 | 994 | 68 | 2 | 79 |
+----------+-------+-----+-------+------+
I should add one super confusing piece. We have a series of packages that are part of other packages. For example, packages 68, 70 and 71 belong together, and the parent package is 68.
I can't figure out the grouping.

with data as (
    select *,
        -- lowest zone within each customer/order/package
        min(zone_no) over
            (partition by customer_no, order_no, pkg_no) as min_zone_no1,
        -- lowest zone within each customer/order/package/seat-count group
        min(zone_no) over
            (partition by customer_no, order_no, pkg_no, pkg_seats) as min_zone_no2
    from T
)
select
    customer_no, order_no,
    sum(case when zone_no = min_zone_no1 then pkg_seats end) as seat_total1,
    sum(case when zone_no = min_zone_no2 then pkg_seats end) as seat_total2
from data
group by customer_no, order_no
order by customer_no, order_no;
I've pored over your description a few times and I'm still not certain I'm on the right track. You seem to have a problem of double-counting: essentially you want a sum, but some of the rows shouldn't be included. (To "count distinct seats" is likely the wrong nomenclature here.)
My approach above tries to identify sets of rows that involve "duplicates", along with some data that helps count only one row from each set. I'm not sure what to make of order 876, which has different numbers of seats across the three zones.
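
To see what the query above returns for the question's "real example", here is a minimal sketch that loads those ten rows and runs it. It swaps SQL Server for SQLite through Python's sqlite3 module (this needs SQLite 3.25 or newer for window functions); the table name T and the column names are taken from the post. Which of the two totals, if either, matches the intended rule is still an open question.

```python
import sqlite3

# Rows copied from the "real example" in the question:
# (customer_no, order_no, pkg_no, pkg_seats, zone_no)
rows = [
    (374, 876, 68, 2, 26),
    (374, 876, 68, 1, 32),
    (374, 876, 68, 1, 56),
    (374, 876, 71, 2, 56),
    (374, 876, 71, 2, 79),
    (862, 538, 71, 2, 33),
    (862, 538, 71, 1, 81),
    (862, 538, 71, 1, 82),
    (575, 994, 68, 2, 34),
    (575, 994, 68, 2, 79),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (customer_no INT, order_no INT, pkg_no INT,"
    " pkg_seats INT, zone_no INT)"
)
conn.executemany("INSERT INTO T VALUES (?, ?, ?, ?, ?)", rows)

# Same query as in the answer, unchanged apart from layout.
query = """
with data as (
    select *,
        min(zone_no) over
            (partition by customer_no, order_no, pkg_no) as min_zone_no1,
        min(zone_no) over
            (partition by customer_no, order_no, pkg_no, pkg_seats) as min_zone_no2
    from T
)
select
    customer_no, order_no,
    sum(case when zone_no = min_zone_no1 then pkg_seats end) as seat_total1,
    sum(case when zone_no = min_zone_no2 then pkg_seats end) as seat_total2
from data
group by customer_no, order_no
order by customer_no, order_no
"""
totals = conn.execute(query).fetchall()
for row in totals:
    print(row)
# (374, 876, 4, 5)
# (575, 994, 2, 2)
# (862, 538, 2, 3)
```

seat_total1 keeps one row per package (the lowest zone), while seat_total2 keeps one row per package and seat count, so the two only differ where a package mixes seat counts across zones.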

Related

Programming In SQLITE

In college I learned PL/SQL, which I used to insert and update table data programmatically.
Is there any way to do that in SQLite?
I have one table, book, which has two columns: readPages and currentPage. readPages holds how many pages I read on a given day, and currentPage holds the total pages read up to that day.
Currently I only have data for readPages, so I want to calculate currentPage for past days, e.g.
readPages: 19 10 43 20 35 # I have data for 5 days
currentPage: 19 29 72 92 127 # I want to calculate it
This would be easy in a programming language, but how can it be done in SQLite, which has nothing like PL/SQL?

The order of the rows can be determined by id or by date.
The problem with the date column is that its format, 'DD-MM', is not comparable: as text it sorts by day first, not chronologically.
Better to change it to something like 'YYYY-MM-DD'.
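
A small sketch of why 'DD-MM' misbehaves and how the column could be rewritten in place, run through Python's sqlite3 module. The two sample dates and the year 2020 are assumptions for illustration only; the original data does not store a year, so one has to be supplied.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE findYourWhy (id INTEGER PRIMARY KEY, date TEXT)")
# Hypothetical rows: 31 May, then 1 June.
conn.executemany(
    "INSERT INTO findYourWhy (id, date) VALUES (?, ?)",
    [(1, "31-05"), (2, "01-06")],
)

# As text, '01-06' sorts before '31-05' even though 1 June is the later day.
dd_mm_order = [r[0] for r in conn.execute(
    "SELECT date FROM findYourWhy ORDER BY date")]

# Assumption: every row belongs to 2020. substr() is 1-based in SQLite,
# so substr(date, 4, 2) is the month and substr(date, 1, 2) is the day.
conn.execute(
    "UPDATE findYourWhy"
    " SET date = '2020-' || substr(date, 4, 2) || '-' || substr(date, 1, 2)"
)
iso_order = [r[0] for r in conn.execute(
    "SELECT date FROM findYourWhy ORDER BY date")]

print(dd_mm_order)  # ['01-06', '31-05']
print(iso_order)    # ['2020-05-31', '2020-06-01']
```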
Since your version of SQLite does not allow you to use window functions, you can do what you need with this:
update findYourWhy
set currentPage = coalesce(
(select sum(f.readPage) from findYourWhy f where f.id <= findYourWhy.id),
0
);
If you change the format of the date column, you can also do it with this:
update findYourWhy
set currentPage = coalesce(
(select sum(f.readPage) from findYourWhy f where f.date <= findYourWhy.date),
0
);
See the demo.
CREATE TABLE findYourWhy (
id INTEGER,
date TEXT,
currentPage INTEGER,
readPage INTEGER,
PRIMARY KEY(id)
);
INSERT INTO findYourWhy (id,date,currentPage,readPage) VALUES
(1,'06-05',null,36),
(2,'07-05',null,9),
(3,'08-05',null,12),
(4,'09-05',null,5),
(5,'10-05',null,12),
(6,'11-05',null,13),
(7,'12-05',null,2),
(8,'13-05',null,12),
(9,'14-05',null,3),
(10,'15-05',null,5),
(11,'16-05',null,6),
(12,'17-05',null,7),
(13,'18-05',null,7);
Results:
| id | date | currentPage | readPage |
| --- | ----- | ----------- | -------- |
| 1 | 06-05 | 36 | 36 |
| 2 | 07-05 | 45 | 9 |
| 3 | 08-05 | 57 | 12 |
| 4 | 09-05 | 62 | 5 |
| 5 | 10-05 | 74 | 12 |
| 6 | 11-05 | 87 | 13 |
| 7 | 12-05 | 89 | 2 |
| 8 | 13-05 | 101 | 12 |
| 9 | 14-05 | 104 | 3 |
| 10 | 15-05 | 109 | 5 |
| 11 | 16-05 | 115 | 6 |
| 12 | 17-05 | 122 | 7 |
| 13 | 18-05 | 129 | 7 |

If you're using sqlite 3.25 or newer, something like:
SELECT date, readPages
, sum(readPages) OVER (ORDER BY date) AS total_pages_read
FROM yourTableName
ORDER BY date;
will compute the running total of pages.
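
As a quick check against the numbers in the question, here is a sketch that runs the window query through Python's sqlite3 module (the bundled SQLite must be 3.25 or newer). The table name book comes from the question; the dates are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE book (date TEXT PRIMARY KEY, readPages INTEGER)")
# readPages values are from the question; the dates are made up.
conn.executemany(
    "INSERT INTO book (date, readPages) VALUES (?, ?)",
    [
        ("2020-05-06", 19),
        ("2020-05-07", 10),
        ("2020-05-08", 43),
        ("2020-05-09", 20),
        ("2020-05-10", 35),
    ],
)

running = conn.execute("""
    SELECT date, readPages,
           sum(readPages) OVER (ORDER BY date) AS total_pages_read
    FROM book
    ORDER BY date
""").fetchall()

running_totals = [row[2] for row in running]
print(running_totals)  # [19, 29, 72, 92, 127]
```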

How to delete duplicates in SQL table with Primary Key [duplicate]

As an example, consider the following table.
+-------+----------+----------+------------+
| ID(PK)| ClientID | ItemType | ItemID |
+-------+----------+----------+------------+
| 1 | 4 | B | 56 |
| 2 | 8 | B | 54 |
| 3 | 276 | B | 57 |
| 4 | 8653 | B | 25 |
| 5 | 3 | B | 55 |
| 6 | 4 | B | 56 |
| 7 | 4 | B | 56 |
| 8 | 276 | B | 57 |
| 9 | 8653 | B | 25 |
+-------+----------+----------+------------+
We have a process that's causing duplicates that we need to delete. In the example above, clients 4, 276, and 8653 should only ever have one ItemType/ItemID combination. How would I delete the extra rows that I don't need? In this example, I'd need to delete the rows with ID(PK) 6, 7, 8 and 9. This would need to happen on a much larger scale, so I can't just go in and delete the rows one by one. Is there a query that will identify all ID(PK)s that aren't the lowest ID(PK) in their group so I can delete them? I'm picturing a delete statement that operates on a subquery, but I'm open to suggestions. I've tried creating a row number to identify duplicates; however, because the table has a PK, all rows are unique, so that hasn't worked for me.
Thank you!
Edit: Here's the expected result
+-------+----------+----------+------------+
| ID(PK)| ClientID | ItemType | ItemID |
+-------+----------+----------+------------+
| 1 | 4 | B | 56 |
| 2 | 8 | B | 54 |
| 3 | 276 | B | 57 |
| 4 | 8653 | B | 25 |
| 5 | 3 | B | 55 |
+-------+----------+----------+------------+

You can use a CTE:
;WITH ToDelete AS (
    SELECT ROW_NUMBER() OVER (PARTITION BY ClientID, ItemType, ItemID
                              ORDER BY ID) AS rn
    FROM mytable
)
DELETE FROM ToDelete
WHERE rn > 1
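
The question also pictured a DELETE driven by a subquery. A sketch of that alternative follows: keep the lowest ID in each ClientID/ItemType/ItemID group and delete the rest. It is run here against SQLite through Python's sqlite3 module rather than SQL Server, with the table name mytable borrowed from the answer above; the DELETE statement itself uses only standard syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable (ID INTEGER PRIMARY KEY, ClientID INTEGER,"
    " ItemType TEXT, ItemID INTEGER)"
)
# Sample rows from the question.
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        (1, 4, "B", 56), (2, 8, "B", 54), (3, 276, "B", 57),
        (4, 8653, "B", 25), (5, 3, "B", 55), (6, 4, "B", 56),
        (7, 4, "B", 56), (8, 276, "B", 57), (9, 8653, "B", 25),
    ],
)

# Delete every row whose ID is not the smallest ID of its duplicate group.
cursor = conn.execute("""
    DELETE FROM mytable
    WHERE ID NOT IN (
        SELECT MIN(ID)
        FROM mytable
        GROUP BY ClientID, ItemType, ItemID
    )
""")
deleted = cursor.rowcount
remaining = [r[0] for r in conn.execute("SELECT ID FROM mytable ORDER BY ID")]

print(deleted)    # 4
print(remaining)  # [1, 2, 3, 4, 5]
```

Running the inner SELECT on its own first is a cheap way to review which rows would survive before deleting anything.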

How to get Detailed Explain Plan?

I worked with Management Studio in the past and remember that its explain/query plan was descriptive; it used to show:
1) The order in which statements will be fired
2) The number of rows returned by each statement
I am using "explain plan" in Oracle SQL Developer, but I don't see these features. Is there any other good free tool?

Order in which statements will be fired
Adrian Billington has created an "XPlan Utility", to extend the output of DBMS_XPLAN to include the execution order of the steps. The following output shows the difference between the default output and that produced by Adrian's XPlan Utility.
For example,
EXPLAIN PLAN FOR
SELECT *
FROM emp e, dept d
WHERE e.deptno = d.deptno
AND e.ename = 'SMITH';
SET LINESIZE 130
-- Default Output
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------
Plan hash value: 3625962092
----------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 58 | 3 (0)| 00:00:53 |
| 1 | NESTED LOOPS | | | | | |
| 2 | NESTED LOOPS | | 1 | 58 | 3 (0)| 00:00:53 |
|* 3 | TABLE ACCESS FULL | EMP | 1 | 38 | 2 (0)| 00:00:35 |
|* 4 | INDEX UNIQUE SCAN | PK_DEPT | 1 | | 0 (0)| 00:00:01 |
| 5 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)| 00:00:18 |
----------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("E"."ENAME"='SMITH')
4 - access("E"."DEPTNO"="D"."DEPTNO")
18 rows selected.
SQL>
Let's see the extended plan to see the order of steps. See the ORD column:
-- XPlan Utility output
@xplan.display.sql
PLAN_TABLE_OUTPUT
----------------------------------------------------------------------------------------------------
Plan hash value: 3625962092
----------------------------------------------------------------------------------------------------
| Id | Pid | Ord | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
----------------------------------------------------------------------------------------------------
| 0 | | 6 | SELECT STATEMENT | | 1 | 58 | 3 (0)| 00:00:53 |
| 1 | 0 | 5 | NESTED LOOPS | | | | | |
| 2 | 1 | 3 | NESTED LOOPS | | 1 | 58 | 3 (0)| 00:00:53 |
|* 3 | 2 | 1 | TABLE ACCESS FULL | EMP | 1 | 38 | 2 (0)| 00:00:35 |
|* 4 | 2 | 2 | INDEX UNIQUE SCAN | PK_DEPT | 1 | | 0 (0)| 00:00:01 |
| 5 | 1 | 4 | TABLE ACCESS BY INDEX ROWID| DEPT | 1 | 20 | 1 (0)| 00:00:18 |
----------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter("E"."ENAME"='SMITH')
4 - access("E"."DEPTNO"="D"."DEPTNO")
About
------
- XPlan v1.2 by Adrian Billington (http://www.oracle-developer.net)
18 rows selected.
SQL>
Number of rows returned by each statement
In SQL Developer, the explain plan window has the Cardinality column, which shows the number of rows (an optimizer estimate).
In SQL*Plus, using DBMS_XPLAN, you can display the plan in a readable format. The Rows column shows the estimated number of rows.
See "How to create and display explain plan in SQL*Plus" for a few good examples and usage.

Flatten SQL for 4 tables

In SQL Server 2012, I have four tables that look like:
Issues:
IssueID | IssueTitle
1 | Light Bulb Burnt Out
2 | Thermostat not working
LocationTypes:
TypeID | Type
1 | Building
2 | Floor
3 | Room
Locations:
LocaltionID | TypeID | Location | ParentLocation
0 | 1 | default | 0
1 | 1 | Sears Tower | 0
2 | 1 | IDS | 0
3 | 2 | Floor 1 | 1
4 | 2 | Floor 2 | 1
5 | 2 | Floor 3 | 1
6 | 2 | Floor 4 | 1
7 | 2 | Floor 5 | 1
8 | 2 | Floor 6 | 1
9 | 2 | Floor 7 | 1
10 | 2 | Floor 8 | 1
108 | 3 | Room 101 | 3
109 | 3 | Room 102 | 3
110 | 3 | Room 110 | 3
111 | 3 | Room 202 | 4
112 | 3 | Room 300 | 5
175 | 2 | 1st Floor | 2
185 | 2 | 2nd Floor | 2
186 | 3 | Suite 295 | 185
IssueLocations:
IssueID | LocationId
1 | 1
1 | 5
1 | 112
2 | 2
2 | 185
What I want to do is combine the tables so that I end up with one row for each issue, with the location types as column headers and the locations as values, like this:
Result:
IssueID | IssueTitle | Building | Floor | Room
--------------------------------------------------------------------------
1 | Light Bulb Burnt Out | Sears Tower | Floor 3 | Room 300
2 | Thermostat not working | IDS | 2nd Floor |
Notice the second issue doesn't have a room. No locations are required, so location-less issues are valid. Other constraints might make a location required, but I don't think that is relevant for this question.

You need to use PIVOT to transpose your rows into columns.
SQL FIDDLE DEMO
SELECT *
FROM (SELECT il.IssueID,
             l.Location,
             i.IssueTitle,
             lt.Type
      FROM Locations l
      JOIN LocationTypes lt
        ON l.TypeID = lt.TypeID
      JOIN IssueLocations il
        ON il.LocationId = l.LocaltionID
      JOIN issues i
        ON i.IssueID = il.IssueID) a
PIVOT (Max(location)
       FOR type IN ([Building],
                    [Floor],
                    [Room])) piv
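
For comparison, the same flattening can be sketched with conditional aggregation (MAX over CASE), which does not depend on the PIVOT keyword. This is an alternative technique, not the answer's method, and it is run here on SQLite through Python's sqlite3 module. Only the location rows referenced by IssueLocations are loaded. With the sample rows, issue 1 is linked to location 5, which is Floor 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Issues (IssueID INT, IssueTitle TEXT);
CREATE TABLE LocationTypes (TypeID INT, Type TEXT);
CREATE TABLE Locations (LocaltionID INT, TypeID INT, Location TEXT, ParentLocation INT);
CREATE TABLE IssueLocations (IssueID INT, LocationId INT);
INSERT INTO Issues VALUES (1, 'Light Bulb Burnt Out'), (2, 'Thermostat not working');
INSERT INTO LocationTypes VALUES (1, 'Building'), (2, 'Floor'), (3, 'Room');
-- Subset of the question's Locations rows: only those used by IssueLocations.
INSERT INTO Locations VALUES
    (1, 1, 'Sears Tower', 0), (2, 1, 'IDS', 0), (5, 2, 'Floor 3', 1),
    (112, 3, 'Room 300', 5), (185, 2, '2nd Floor', 2);
INSERT INTO IssueLocations VALUES (1, 1), (1, 5), (1, 112), (2, 2), (2, 185);
""")

# One MAX(CASE ...) per location type turns rows into columns.
# LEFT JOINs keep issues that have no locations at all.
flat = conn.execute("""
    SELECT i.IssueID, i.IssueTitle,
           MAX(CASE WHEN lt.Type = 'Building' THEN l.Location END) AS Building,
           MAX(CASE WHEN lt.Type = 'Floor'    THEN l.Location END) AS Floor,
           MAX(CASE WHEN lt.Type = 'Room'     THEN l.Location END) AS Room
    FROM Issues i
    LEFT JOIN IssueLocations il ON il.IssueID = i.IssueID
    LEFT JOIN Locations l       ON l.LocaltionID = il.LocationId
    LEFT JOIN LocationTypes lt  ON lt.TypeID = l.TypeID
    GROUP BY i.IssueID, i.IssueTitle
    ORDER BY i.IssueID
""").fetchall()

for row in flat:
    print(row)
# (1, 'Light Bulb Burnt Out', 'Sears Tower', 'Floor 3', 'Room 300')
# (2, 'Thermostat not working', 'IDS', '2nd Floor', None)
```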

Aggregating using a combination of rows and columns

Sample data from the Ranges table is shown below:
+-----------------+-------------------+----------+----------+
| SectionCategory | RangeName | LowerEnd | UpperEnd |
+-----------------+-------------------+----------+----------+
| Sanction | 0-7 days | 0 | 7 |
| Sanction | 8-15 days | 8 | 15 |
| Sanction | More than 15 days | 16 | 99999 |
| Disbursal | 0-7 days | 0 | 7 |
| Disbursal | 8-15 days | 8 | 15 |
| Disbursal | More than 15 days | 16 | 99999 |
+-----------------+-------------------+----------+----------+
Sample Data from the Delays Table is shown below:
+-----------+---------------+-----------------+
| Loan No. | SanctionDelay | Disbursal Delay |
+-----------+---------------+-----------------+
| 247 | 8 | 35 |
| 661 | 18 | 37 |
| 1235 | 12 | 6 |
| 1235 | 8 | 15 |
| 1241 | 28 | 9 |
| 1241 | 11 | 9 |
| 1283 | 22 | 20 |
| 1283 | 28 | 41 |
| 1523 | 1 | 27 |
| 1523 | 6 | 28 |
+-----------+---------------+-----------------+
The desired output is shown below:
+-----------+-------------------+-------+
| Section | Range | Count |
+-----------+-------------------+-------+
| Sanction | 0-7 days | 2 |
| Sanction | 8-15 days | 4 |
| Sanction | More than 15 days | 4 |
| Disbursal | 0-7 days | 1 |
| Disbursal | 8-15 days | 3 |
| Disbursal | More than 15 days | 6 |
+-----------+-------------------+-------+
Currently two separate queries are written and UNION is used to collate the output.
From a maintainability point of view, would it be possible to do this in a single query?
(For Sanction in the Ranges table, the SanctionDelay column from Delays Table should be used and for Disbursal, the DisbursalDelay column should be used.) The need is because the number of stages of the loan lifecycle is expected to increase and more and more UNIONs would be needed to collate the output.

It can be done with a CROSS JOIN, though I'm not sure how efficient it is.
Sample data:
declare @Ranges table (SectionCategory varchar(10) not null,RangeName varchar(20) not null,LowerEnd int not null,UpperEnd int not null)
insert into @Ranges (SectionCategory,RangeName,LowerEnd,UpperEnd) values
('Sanction','0-7 days',0,7),
('Sanction','8-15 days',8,15),
('Sanction','More than 15 days',16,99999),
('Disbursal','0-7 days',0,7),
('Disbursal','8-15 days',8,15),
('Disbursal','More than 15 days',16,99999)
declare @Delays table (LoanNo int not null,SanctionDelay int not null,DisbursalDelay int not null)
insert into @Delays (LoanNo,SanctionDelay,DisbursalDelay) values
( 247, 8,35),
( 661,18,37),
(1235,12, 6),
(1235, 8,15),
(1241,28, 9),
(1241,11, 9),
(1283,22,20),
(1283,28,41),
(1523, 1,27),
(1523, 6,28)
Query (must be run in the same batch as the sample data, because table variables only live for one batch):
select
    r.SectionCategory,
    r.RangeName,
    SUM(CASE
        WHEN r.SectionCategory='Sanction' and d.SanctionDelay BETWEEN r.LowerEnd and r.UpperEnd then 1
        WHEN r.SectionCategory='Disbursal' and d.DisbursalDelay BETWEEN r.LowerEnd and r.UpperEnd then 1
        else 0 end) as Cnt
from @Ranges r
    cross join
    @Delays d
group by
    r.SectionCategory,
    r.RangeName
order by SectionCategory,RangeName
Results:
SectionCategory RangeName Cnt
--------------- -------------------- -----------
Disbursal 0-7 days 1
Disbursal 8-15 days 3
Disbursal More than 15 days 6
Sanction 0-7 days 2
Sanction 8-15 days 4
Sanction More than 15 days 4
From a maintainability perspective, it may be better to have a single delay column in the delays table and an additional column that specifies the type of the delay. At the moment, it feels like some form of attribute splitting - in the Ranges table, the type is represented as a column value (Sanction, Disbursal, etc), yet in the delays table, this same "type" is represented in the table meta-data, in terms of distinct column names.
You say that "the number of stages of the loan lifecycle is expected to increase", and I'd expect that this cross over (representing attributes as data in some tables and meta-data in others) will increase the pain of writing decent queries.
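
A rough sketch of what that redesign could look like, run on SQLite through Python's sqlite3 module. The reshaped Delays table and its DelayType and DelayDays columns are hypothetical names, not part of the original schema. With the type stored as data, a new loan stage only needs new rows in Ranges and Delays, and the query stays the same.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Ranges (SectionCategory TEXT, RangeName TEXT, LowerEnd INT, UpperEnd INT);
INSERT INTO Ranges VALUES
    ('Sanction', '0-7 days', 0, 7), ('Sanction', '8-15 days', 8, 15),
    ('Sanction', 'More than 15 days', 16, 99999),
    ('Disbursal', '0-7 days', 0, 7), ('Disbursal', '8-15 days', 8, 15),
    ('Disbursal', 'More than 15 days', 16, 99999);
-- Hypothetical reshaped table: one row per loan record and delay type.
CREATE TABLE Delays (LoanNo INT, DelayType TEXT, DelayDays INT);
""")

# Sample data from the question: (LoanNo, SanctionDelay, DisbursalDelay).
wide = [
    (247, 8, 35), (661, 18, 37), (1235, 12, 6), (1235, 8, 15), (1241, 28, 9),
    (1241, 11, 9), (1283, 22, 20), (1283, 28, 41), (1523, 1, 27), (1523, 6, 28),
]
for loan, sanction, disbursal in wide:
    conn.execute("INSERT INTO Delays VALUES (?, 'Sanction', ?)", (loan, sanction))
    conn.execute("INSERT INTO Delays VALUES (?, 'Disbursal', ?)", (loan, disbursal))

# The delay type joins directly to SectionCategory, so no per-stage CASE is needed.
# LEFT JOIN plus COUNT(d.LoanNo) keeps ranges that have no matching delays (count 0).
counts = conn.execute("""
    SELECT r.SectionCategory, r.RangeName, COUNT(d.LoanNo) AS Cnt
    FROM Ranges r
    LEFT JOIN Delays d
           ON d.DelayType = r.SectionCategory
          AND d.DelayDays BETWEEN r.LowerEnd AND r.UpperEnd
    GROUP BY r.SectionCategory, r.RangeName, r.LowerEnd
    ORDER BY r.SectionCategory, r.LowerEnd
""").fetchall()

for row in counts:
    print(row)
# ('Disbursal', '0-7 days', 1)
# ('Disbursal', '8-15 days', 3)
# ('Disbursal', 'More than 15 days', 6)
# ('Sanction', '0-7 days', 2)
# ('Sanction', '8-15 days', 4)
# ('Sanction', 'More than 15 days', 4)
```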

Try this:
SELECT
    SectionCategory
    ,RangeName
    ,CASE
        WHEN R.SectionCategory='Sanction' THEN
            (SELECT COUNT(1) FROM Delays D WHERE D.SanctionDelay BETWEEN R.LowerEnd AND R.UpperEnd)
        WHEN R.SectionCategory='Disbursal' THEN
            (SELECT COUNT(1) FROM Delays D WHERE D.[Disbursal Delay] BETWEEN R.LowerEnd AND R.UpperEnd)
    END as cnt
FROM Ranges R
Here is SQLFiddle demo
