Converting subquery into window function - sql-server

I had an interview where they asked me: "Find the most popular class among students using their first enrollment date", where the assumption was that a student would pick their favorite class first. For simplicity, no two EnrollmentDT could be exactly the same and there are no data issues (e.g. a student can't be enrolled in the same class twice).
They expected me to use a window function, and I'm curious how to do that for this problem.
I quickly setup some seed data as such (I'm aware the seed portion isn't a perfect representation, but I needed something close enough quickly):
IF OBJECT_ID('StudentClass') IS NOT NULL DROP TABLE StudentClass;
IF OBJECT_ID('Class') IS NOT NULL DROP TABLE Class;
IF OBJECT_ID('Student') IS NOT NULL DROP TABLE Student;
CREATE TABLE Student (
StudentID INT IDENTITY(1,1) PRIMARY KEY,
[Name] UNIQUEIDENTIFIER DEFAULT NEWID(),
);
CREATE TABLE Class (
ClassID INT IDENTITY(1,1) PRIMARY KEY,
[Name] UNIQUEIDENTIFIER DEFAULT NEWID(),
ClassLevel INT DEFAULT CAST(CEILING(RAND() * 3) AS INT)
);
CREATE TABLE StudentClass (
StudentClassID INT IDENTITY(1,1),
StudentID INT FOREIGN KEY REFERENCES Student (StudentID),
ClassID INT FOREIGN KEY REFERENCES Class (ClassID),
EnrollmentDT DATETIME2
);
GO
INSERT INTO Student DEFAULT VALUES
GO 50
INSERT INTO Class DEFAULT VALUES
GO 5
DECLARE #StudentIndex INT = 1;
DECLARE #Cycle INT = 1;
WHILE #Cycle <= 5
BEGIN
IF RAND() > 0.5
BEGIN
INSERT INTO StudentClass (StudentID, ClassID, EnrollmentDT)
VALUES
(#StudentIndex, #Cycle, DATEADD(SECOND, CAST(CEILING(RAND() * 10000) AS INT), SYSDATETIME()))
END
SET #StudentIndex = #StudentIndex + 1;
IF #StudentIndex = 50
BEGIN
SET #Cycle = #Cycle + 1;
SET #StudentIndex = 1;
END
END
But the only thing I could come up with was:
SELECT
sc.ClassID,
COUNT(*) AS IsFavoriteClassCount
FROM
StudentClass sc
INNER JOIN (
SELECT
StudentID,
MIN(EnrollmentDT) AS MinEnrollmentDT
FROM
StudentClass
GROUP BY
StudentID
) sq
ON sc.StudentID = sq.StudentID
AND sc.EnrollmentDT = sq.MinEnrollmentDT
GROUP BY
sc.ClassID
ORDER BY
IsFavoriteClassCount DESC;
Any guidance on their thinking would be greatly appreciated! If I made any errors in my constructions / query, take that as a proper error and not something intentional.

SELECT
ClassID,
COUNT(*) AS IsFavoriteClassCount
FROM
(
SELECT
sc.ClassID,
sc.StudentID,
ROW_NUMBER() OVER (PARTITION BY sc.StudentID
ORDER BY
sc.EnrollmentDT) AS rn
FROM
StudentClass sc
)
t
WHERE
rn = 1
GROUP BY
ClassID
ORDER BY
IsFavoriteClassCount DESC;
The query uses ROW_NUMBER() as a window function to assign a unique row number to each student's first enrollment in a class (based on their enrollment date). The inner query selects the ClassID and the StudentID for each enrollment and adds the row number, partitioned by the StudentID and ordered by the enrollment date. The outer query filters for the rows where the row number is 1, which indicates that these are the first enrollments for each student, and aggregates the data by the ClassID to get the count of how many times each class was the first choice of a student.

You can use FIRST_VALUE() window function to get the first pick of each student:
WITH cte AS (
SELECT DISTINCT StudentID,
FIRST_VALUE(ClassID) OVER (PARTITION BY StudentID ORDER BY EnrollmentDT) AS ClassID
FROM StudentClass
)
SELECT TOP 1 WITH TIES
ClassID,
COUNT(*) AS IsFavoriteClassCount
FROM cte
GROUP BY ClassID
ORDER BY COUNT(*) DESC;
Remove TOP 1 WITH TIES to get results for all classes.

Related

Select all Main table rows with detail table column constraints with GROUP BY

I've 2 tables tblMain and tblDetail on SQL Server that are linked with tblMain.id=tblDetail.OrderID for orders usage. I've not found exactly the same situation in StackOverflow.
Here below is the sample table design:
/* create and populate tblMain: */
CREATE TABLE tblMain (
ID int IDENTITY(1,1) NOT NULL,
DateOrder datetime NULL,
CONSTRAINT PK_tblMain PRIMARY KEY
(
ID ASC
)
)
GO
INSERT INTO tblMain (DateOrder) VALUES('2021-05-20T12:12:10');
INSERT INTO tblMain (DateOrder) VALUES('2021-05-21T09:13:13');
INSERT INTO tblMain (DateOrder) VALUES('2021-05-22T21:30:28');
GO
/* create and populate tblDetail: */
CREATE TABLE tblDetail (
ID int IDENTITY(1,1) NOT NULL,
OrderID int NULL,
Gencod VARCHAR(255),
Quantity float,
Price float,
CONSTRAINT PK_tblDetail PRIMARY KEY
(
ID ASC
)
)
GO
INSERT INTO tblDetail (OrderID, Gencod, Quantity, Price) VALUES(1, '1234567890123', 8, 12.30);
INSERT INTO tblDetail (OrderID, Gencod, Quantity, Price) VALUES(1, '5825867890321', 2, 2.88);
INSERT INTO tblDetail (OrderID, Gencod, Quantity, Price) VALUES(3, '7788997890333', 1, 1.77);
INSERT INTO tblDetail (OrderID, Gencod, Quantity, Price) VALUES(3, '9882254656215', 3, 5.66);
INSERT INTO tblDetail (OrderID, Gencod, Quantity, Price) VALUES(3, '9665464654654', 4, 10.64);
GO
Here is my SELECT with grouping:
SELECT tblMain.id,SUM(tblDetail.Quantity*tblDetail.Price) AS TotalPrice
FROM tblMain LEFT JOIN tblDetail ON tblMain.id=tblDetail.orderid
WHERE (tblDetail.Quantity<>0) GROUP BY tblMain.id;
GO
This gives:
The wished output:
We see that id=2 is not shown even with LEFT JOIN, as there is no records with OrderID=2 in tblDetail.
How to design a new query to show tblMain.id = 2? Mean while I must keep WHERE (tblDetail.Quantity<>0) constraints. Many thanks.
EDIT:
The above query serves as CTE (Common Table Expression) for a main query that takes into account payments table tblPayments again.
After testing, both solutions work.
In my case, the main table has 15K records, while detail table has some millions. With (tblDetail.Quantity<>0 OR tblDetail.Quantity IS NULL) AND tblDetail.IsActive=1 added on JOIN ON clause it takes 37s to run, while the first solution of #pwilcox, the condition being added on the where clause, it ends up on 29s. So a gain of time of 20%.
tblDetail.IsActive column permits me ignore detail rows that is temporarily ignored by setting it to false.
So the for me it's ( #pwilcox's answer).
where (tblDetail.quantity <> 0 or tblDetail.quantity is null)
Change
WHERE (tblDetail.Quantity<>0)
to
where (tblDetail.quantity <> 0 or tblDetail.quantity is null)
as the former will omit id = 2 because the corresponding quantity would be null in a left join.
And as HABO mentions, you can also make the condition a part of your join logic as opposed to your where statement, avoiding the need for the 'or' condition.
select m.id,
totalPrice = sum(d.quantity * d.price)
from tblMain m
left join tblDetail d
on m.id = d.orderid
and d.quantity <> 0
group by m.id;

Efficiently query for the latest version of a record using SQL

I need to query a table for the latest version of a record for all available dates (end of day time-series). The example below illustrates what I am trying to achieve.
My question is whether the table's design (primary key, etc.) and the LEFT OUTER JOIN query is accomplishing this goal in the most efficient manner.
CREATE TABLE [PriceHistory]
(
[RowID] [int] IDENTITY(1,1) NOT NULL,
[ItemIdentifier] [varchar](10) NOT NULL,
[EffectiveDate] [date] NOT NULL,
[Price] [decimal](12, 2) NOT NULL,
CONSTRAINT [PK_PriceHistory]
PRIMARY KEY CLUSTERED ([ItemIdentifier] ASC, [RowID] DESC, [EffectiveDate] ASC)
)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-15',5.50)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-16',5.75)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-16',6.25)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-17',6.05)
INSERT INTO [PriceHistory] VALUES ('ABC','2016-03-18',6.85)
GO
SELECT
L.EffectiveDate, L.Price
FROM
[PriceHistory] L
LEFT OUTER JOIN
[PriceHistory] R ON L.ItemIdentifier = R.ItemIdentifier
AND L.EffectiveDate = R.EffectiveDate
AND L.RowID < R.RowID
WHERE
L.ItemIdentifier = 'ABC' and R.EffectiveDate is NULL
ORDER BY
L.EffectiveDate
Follow up: Table can contain thousands of ItemIdentifiers each with dacades worth of price data. Historical version of data needs to be preserved for audit reasons. Say I query the table and use the data in a report. I store #MRID = Max(RowID) at the time the report was generated. Now if the price for 'ABC' on '2016-03-16' is corrected at some later date, I can modify the query using #MRID and replicate the report that I ran earlier.
A slightly modified version of #SeanLange's answer will give you the last row per date, instead of per product:
with sortedResults as
(
select *
, ROW_NUMBER() over(PARTITION by ItemIdentifier, EffectiveDate
ORDER by ID desc) as RowNum
from PriceHistory
)
select ItemIdentifier, EffectiveDate, Price
from sortedResults
where RowNum = 1
order by 2
I assume you have more than 1 ItemIdentifier in your table. Your design is a bit problematic in that you are keeping versions of the data in your table. You can however do something like this quite easily to get the most recent one for each ItemIdentifier.
with sortedResults as
(
select *
, ROW_NUMBER() over(PARTITION by ItemIdentifier order by EffectiveDate desc) as RowNum
from PriceHistory
)
select *
from sortedResults
where RowNum = 1
Short answer, no.
You're hitting the same table twice, and possibly creating a looped table scan, depending on your existing indexes. In the best case, you're causing a looped index seek, and then throwing out most of the rows.
This would be the most efficient query for what you're asking.
SELECT
L.EffectiveDate,
L.Price
FROM
(
SELECT
L.EffectiveDate,
L.Price,
ROW_NUMBER() OVER (
PARTITION BY
L.ItemIdentifier,
L.EffectiveDate
ORDER BY RowID DESC ) RowNum
FROM [PriceHistory] L
WHERE L.ItemIdentifier = 'ABC'
) L
WHERE
L.RowNum = 1;

SELECT from multiple queries

I have this tables:
tblDiving(
diving_number int primary key
diving_club int
date_of_diving date)
tblDivingClub(
number int primary key not null check (number>0),
name char(30),
country char(30))
tblWorks_for(
diver_number int
club_number int
end_working_date date)
tblCountry(
name char(30) not null primary key)
I need to write a query to return a name of a country and the number of "Super club" in it.
a Super club is a club which have more than 25 working divers (tblWorks_for.end_working_date is null) or had more than 100 diving's in it(tblDiving) in the last year.
after I get the country and number of super club, I need to show only the country's that contains more than 2 super club.
I wrote this 2 queries:
select tblDivingClub.name,count(distinct tblWorks_for.diver_number) as number_of_guids
from tblWorks_for
inner join tblDivingClub on tblDivingClub.number = tblWorks_for.club_number,tblDiving
where tblWorks_for.end_working_date is null
group by tblDivingClub.name
select tblDivingClub.name, count(distinct tblDiving.diving_number) as number_of_divings
from tblDivingClub
inner join tblDiving on tblDivingClub.number = tblDiving.diving_club
WHERE tblDiving.date_of_diving <= DATEADD(year,-1, GETDATE())
group by tblDivingClub.name
But I don't know how do I continue.
Every query works separately, but how do I combine them and select from them?
It's university assignment and I'm not allowed to use views or temporary tables.
It's my first program so I'm not really sure what I'm doing:)
WITH CTE AS (
select tblDivingClub.name,count(distinct tblWorks_for.diver_number) as diving_number
from tblWorks_for
inner join tblDivingClub on tblDivingClub.number = tblWorks_for.club_number,tblDiving
where tblWorks_for.end_working_date is null
group by tblDivingClub.name
UNION ALL
select tblDivingClub.name, count(distinct tblDiving.diving_number) as diving_number
from tblDivingClub
inner join tblDiving on tblDivingClub.number = tblDiving.diving_club
WHERE tblDiving.date_of_diving <= DATEADD(year,-1, GETDATE())
group by tblDivingClub.name
)
SELECT * FROM CTE
You can combine the queries using a UNION ALL as long as there are the same number of columns in each query. You can then roll them into a Common Table Expression (CTE) and do a select from that.

Check for applicable Group query for shopping cart

I have problem for one of discount check condition. I have tables structure as below:
Cart table (id, customerid, productid)
Group table (groupid, groupname, discountamount)
Group Products table (groupproductid, groupid, productid)
While placing an order, there will be multiple items in cart, I want to check those items with top most group if that group consists of all product shopping cart have?
Example:
If group 1 consists 2 products and those two products exists in cart table then group 1 discount should be returned.
please help
It's tricky, without having real table definitions nor sample data. So I've made some up:
create table Carts(
id int,
customerid int,
productid int
)
create table Groups(
groupid int,
groupname int,
discountamount int
)
create table GroupProducts(
groupproductid int,
groupid int,
productid int
)
insert into Carts (id,customerid,productid) values
(1,1,1),
(2,1,2),
(3,1,4),
(4,2,2),
(5,2,3)
insert into Groups (groupid,groupname,discountamount) values
(1,1,10),
(2,2,15),
(3,3,20)
insert into GroupProducts (groupproductid,groupid,productid) values
(1,1,1),
(2,1,5),
(3,2,2),
(4,2,4),
(5,3,2),
(6,3,3)
;With MatchedProducts as (
select
c.customerid,gp.groupid,COUNT(*) as Cnt
from
Carts c
inner join
GroupProducts gp
on
c.productid = gp.productid
group by
c.customerid,gp.groupid
), GroupSizes as (
select groupid,COUNT(*) as Cnt from GroupProducts group by groupid
), MatchingGroups as (
select
mp.*
from
MatchedProducts mp
inner join
GroupSizes gs
on
mp.groupid = gs.groupid and
mp.Cnt = gs.Cnt
)
select * from MatchingGroups
Which produces this result:
customerid groupid Cnt
----------- ----------- -----------
1 2 2
2 3 2
What we're doing here is called "relational division" - if you want to search elsewhere for that term. In my current results, each customer only matches one group - if there are multiple matches, we need some tie-breaking conditions to determine which group to report. I prompted with two suggestions in comments (lowest groupid or highest discountamount). Your response of "added earlier" doesn't help - we don't have a column which contains the addition dates of groups. Rows have no inherent ordering in SQL.
We would do the tie-breaking in the definition of MatchingGroups and the final select:
MatchingGroups as (
select
mp.*,
ROW_NUMBER() OVER (PARTITION BY mp.customerid ORDER BY /*Tie break criteria here */) as rn
from
MatchedProducts mp
inner join
GroupSizes gs
on
mp.groupid = gs.groupid and
mp.Cnt = gs.Cnt
)
select * from MatchingGroups where rn = 1

Problem with unique SQL query

I want to select all records, but have the query only return a single record per Product Name. My table looks similar to:
SellId ProductName Comment
1 Cake dasd
2 Cake dasdasd
3 Bread dasdasdd
where the Product Name is not unique. I want the query to return a single record per ProductName with results like:
SellId ProductName Comment
1 Cake dasd
3 Bread dasdasdd
I have tried this query,
Select distict ProductName,Comment ,SellId from TBL#Sells
but it is returning multiple records with the same ProductName. My table is not realy as simple as this, this is just a sample. What is the solution? Is it clear?
Select ProductName,
min(Comment) , min(SellId) from TBL#Sells
group by ProductName
If y ou only want one record per productname, you ofcourse have to choose what value you want for the other fields.
If you aggregate (using group by) you can choose an aggregate function,
htat's a function that takes a list of values and return only one : here I have chosen MIN : that is the smallest walue for each field.
NOTE : comment and sellid can come from different records, since MIN is taken...
Othter aggregates you might find useful :
FIRST : first record encountered
LAST : last record encoutered
AVG : average
COUNT : number of records
first/last have the advantage that all fields are from the same record.
SELECT S.ProductName, S.Comment, S.SellId
FROM
Sells S
JOIN (SELECT MAX(SellId)
FROM Sells
GROUP BY ProductName) AS TopSell ON TopSell.SellId = S.SellId
This will get the latest comment as your selected comment assuming that SellId is an auto-incremented identity that goes up.
I know, you've got an answer already, I'd like to offer a way that was fastest in terms of performance for me, in a similar situation. I'm assuming that SellId is Primary Key and identity. You'd want an index on ProductName for best performance.
select
Sells.*
from
(
select
distinct ProductName
from
Sells
) x
join
Sells
on
Sells.ProductName = x.ProductName
and Sells.SellId =
(
select
top 1 s2.SellId
from
Sells s2
where
x.ProductName = s2.ProductName
Order By SellId
)
A slower method, (but still better than Group By and MIN on a long char column) is this:
select
*
from
(
select
*,ROW_NUMBER() over (PARTITION BY ProductName order by SellId) OccurenceId
from sells
) x
where
OccurenceId = 1
An advantage of this one is that it's much easier to read.
create table Sale
(
SaleId int not null
constraint PK_Sale primary key,
ProductName varchar(100) not null,
Comment varchar(100) not null
)
insert Sale
values
(1, 'Cake', 'dasd'),
(2, 'Cake', 'dasdasd'),
(3, 'Bread', 'dasdasdd')
-- Option #1 with over()
select *
from Sale
where SaleId in
(
select SaleId
from
(
select SaleId, row_number() over(partition by ProductName order by SaleId) RowNumber
from Sale
) tt
where RowNumber = 1
)
order by SaleId
-- Option #2
select *
from Sale
where SaleId in
(
select min(SaleId)
from Sale
group by ProductName
)
order by SaleId
drop table Sale

Resources