Interview question help on relatively basic JOIN and subqueries - sql-server

I was asked to:
Print the following sequence of columns for each plant that only blooms in one type of weather.
WEATHER_TYPE
PLANT_NAME
Schema
PLANTS (table name)
PLANT_NAME, string, The name of the plant. This is the primary key.
PLANT_SPECIES, string, The species of the plant.
SEED_DATE, date, The date the seed was planted.
WEATHER (table name)
PLANT_SPECIES, string, The species of the plant.
WEATHER_TYPE, string, The type of weather in which the plant will bloom.
I wrote the script below and tested it against sample input and achieved a desired result. I don't know if this is what is considered a 'printed' result.
Seeking understanding on what I might have missed. How might I make this script 'more efficient' and/or 'better' and/or 'more robust'?
SELECT WEATHER.WEATHER_TYPE, a.PLANT_NAME
FROM (SELECT b.PLANT_NAME, b.PLANT_SPECIES
FROM (SELECT PLANTS.PLANT_NAME, PLANTS.PLANT_SPECIES, PLANTS.SEED_DATE, WEATHER.WEATHER_TYPE
FROM PLANTS JOIN WEATHER
ON PLANTS.PLANT_SPECIES = WEATHER.PLANT_SPECIES) b
GROUP BY b.PLANT_NAME, b.PLANT_SPECIES
HAVING count(*) = 1) a JOIN WEATHER
ON a.PLANT_SPECIES = WEATHER.PLANT_SPECIES
I achieved the expected result in a SQL Server Management Studio window, but not sure if it's the 'printed' result the question-askers are looking for.

I personally consider CTEs easier to read and to debug, compared to nested "Table Expressions", as you have done. I would have done something like:
with
x as (
select p.plant_name, p.plant_species
from plants p
join weather w on w.plant_species = p.plant_species
group by p.plant_name, p.plant_species
having count(*) = 1
)
select x.plant_name, w.weather_type
from x
join weather w on w.plant_species = x.plant_species

I agree with The Impaler regarding readability and ease of debugging compared with nested table expressions. As an alternative to the CTE (which is really the better choice), if you really want to nest things without overthinking it, you can use a correlated subquery. It's easier to read, though you'll lose efficiency as your result set grows.
SELECT w.weather_type, p.plant_name
FROM plants p
JOIN weather w
ON w.plant_species = p.plant_species
WHERE (SELECT COUNT(1) FROM dbo.weather WHERE plant_species = w.plant_species) = 1
or with grouping...
SELECT w.weather_type, p.plant_name
FROM plants p
JOIN weather w
ON w.plant_species = p.plant_species
WHERE w.plant_species IN (SELECT plant_species FROM dbo.weather GROUP BY plant_species HAVING COUNT(1) = 1)
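The grouping variant above can be checked against a few rows. This is a minimal sketch, assuming SQLite (through Python's sqlite3 module) as a stand-in for SQL Server; the sample plants and weather rows are made up, and only the table and column names come from the question.

```python
import sqlite3

# Hypothetical sample rows; only the schema comes from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PLANTS (PLANT_NAME TEXT PRIMARY KEY, PLANT_SPECIES TEXT, SEED_DATE TEXT);
    CREATE TABLE WEATHER (PLANT_SPECIES TEXT, WEATHER_TYPE TEXT);
    INSERT INTO PLANTS VALUES
        ('Rosie', 'rose',  '2020-03-01'),
        ('Tilly', 'tulip', '2020-03-05'),
        ('Ruby',  'rose',  '2020-04-01');
    INSERT INTO WEATHER VALUES
        ('rose',  'Sunny'),
        ('tulip', 'Sunny'),
        ('tulip', 'Rainy');
""")

single_weather_plants = conn.execute("""
    SELECT w.WEATHER_TYPE, p.PLANT_NAME
    FROM PLANTS p
    JOIN WEATHER w ON w.PLANT_SPECIES = p.PLANT_SPECIES
    WHERE w.PLANT_SPECIES IN (
        SELECT PLANT_SPECIES
        FROM WEATHER
        GROUP BY PLANT_SPECIES
        HAVING COUNT(*) = 1      -- species listed with exactly one weather row
    )
    ORDER BY p.PLANT_NAME
""").fetchall()

print(single_weather_plants)
```

Only the two roses come back, because tulips are listed under two weather types. If WEATHER could contain duplicate rows, COUNT(DISTINCT WEATHER_TYPE) would probably be the more robust test.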

SELECT w.weather_type, p.plant_name
FROM plants p
JOIN weather w
ON w.plant_species = p.plant_species
WHERE w.weather_type = 'Sunny';

Related

TSQL: Relational Division with Remainder (RDWR)

I have a mapping table CandidatesSkills which holds the mapping between candidates and the skills they possess. Then I have another table, JobRequirements, that maps jobs to the skills required for those jobs.
A candidate can apply to a job if he possesses ALL the required skills for that job. A candidate can have extra skills. Given a CandidateID, I want to find all the jobs that candidate can apply to.
I think this is Relational Division with Remainder in SQL. And there is an article here that explains the exact issue. (Note: the article tries to find all candidates who have ALL the skills for a given job. My problem is exactly the opposite: I am trying to find all jobs that match a given candidate's skills.)
Candidate's Skills
Job to required skills mapping
based on the dataset, the query below should return JobID 2,3 and 5
Here my SQL (based on Peter Larsson (PESO) Solution for RDNR/RDWR)
DECLARE @CandidateID INT = 1
SELECT JobID
FROM
(
SELECT jr.JobID
,cnt=SUM(CASE WHEN jr.SkillID = c.SkillID THEN 1 ELSE 0 END)
,Items=COUNT(*)
FROM dbo.JobRequirements AS jr
CROSS JOIN dbo.CandidatesSkills AS c
WHERE c.CandidateID = @CandidateID
GROUP BY jr.JobID, jr.SkillID
) d
GROUP BY JobID
HAVING SUM(cnt) = MIN(Items)
AND MIN(cnt) >= 0;
However, the query does not return anything. I'm trying to find what's wrong with it.
Here is the SQL Fiddle
Something like:
DECLARE @CandidateID INT = 1;
with cj as
(
select cs.CandidateId,
jr.JobId,
count(*) over (partition by jr.JobId, cs.CandidateId) skillsPosessed,
(select count(*) from JobRequirements where JobId = jr.JobId) skillsRequired
from CandidatesSkills cs
join JobRequirements jr
on cs.SkillId = jr.SkillId
)
select distinct cj.CandidateId, cj.JobId
from cj
where cj.skillsPosessed = cj.skillsRequired
In this case, you are doing relational division with multiple divisors. In other words, you are dividing each set of JobRequirements, one per JobID, by the CandidatesSkills of that candidate.
In this case, a LEFT JOIN solution is much simpler
DECLARE @CandidateID INT = 1;
SELECT jr.JobID
,Skills = COUNT(c.SkillID)
,Requirements = COUNT(*)
FROM dbo.JobRequirements AS jr
LEFT JOIN dbo.CandidatesSkills AS c ON c.SkillID = jr.SkillID
AND c.CandidateID = @CandidateID
GROUP BY jr.JobID
HAVING COUNT(*) = COUNT(c.SkillID);
What this does is left-join the candidate's skills to the requirements. We then simply count up all the Requirements for the JobID, and ensure it is equal to the number of matches.
Another way to write this is
HAVING COUNT(CASE WHEN c.SkillID IS NULL THEN 1 END) = 0;
In other words: the number of non-matches should be zero.
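To see the LEFT JOIN counting trick at work, here is a small sketch that runs the same shape of query through Python's sqlite3 module (an assumption made so it can run without SQL Server). The skill and job rows are invented for illustration and are not the dataset from the question; jobs_for is a hypothetical helper name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CandidatesSkills (CandidateID INTEGER, SkillID INTEGER);
    CREATE TABLE JobRequirements (JobID INTEGER, SkillID INTEGER);
    -- Hypothetical data: candidate 1 has skills 1, 2 and 3; candidate 2 has skill 4.
    INSERT INTO CandidatesSkills VALUES (1, 1), (1, 2), (1, 3), (2, 4);
    -- Job 10 needs {1, 2}; job 20 needs {3}; job 30 needs {1, 4}.
    INSERT INTO JobRequirements VALUES (10, 1), (10, 2), (20, 3), (30, 1), (30, 4);
""")

def jobs_for(candidate_id):
    """Return the JobIDs whose every required skill the candidate possesses."""
    rows = conn.execute("""
        SELECT jr.JobID
        FROM JobRequirements AS jr
        LEFT JOIN CandidatesSkills AS c
               ON c.SkillID = jr.SkillID
              AND c.CandidateID = ?
        GROUP BY jr.JobID
        HAVING COUNT(*) = COUNT(c.SkillID)   -- every requirement found a match
        ORDER BY jr.JobID
    """, (candidate_id,)).fetchall()
    return [job_id for (job_id,) in rows]

print(jobs_for(1))
print(jobs_for(2))
```

Candidate 1 qualifies for jobs 10 and 20 but not 30, because the unmatched requirement leaves a NULL that COUNT(c.SkillID) skips. Candidate 2 qualifies for nothing.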

Database Design Relational Algebra query

I have this schema:
Suppliers(sid: integer, sname: string, address: string)
Parts(pid: integer, pname: string, color: string)
Catalog(sid: integer, pid: integer, cost: real)
And this task:
Find the sids of suppliers who supply every part.
What I don't understand is why in this solution we don't work with a negation. I was tempted to put C1.pid <> P.pid instead of C1.pid = P.pid in the end. Can someone explain?
SELECT C.sid
FROM Catalog C
WHERE NOT EXISTS (SELECT P.pid
FROM Parts P
WHERE NOT EXISTS (SELECT C1.sid
FROM Catalog C1
WHERE C1.sid = C.sid
AND C1.pid = P.pid))
Let's say you have 2 parts and 1 supplier. The supplier has both parts. If you join on <>, your innermost subquery will get two rows back: one for the Catalog entry for Part #1 (because Part #1 <> Part #2 is true); and one for the Catalog entry for Part #2 (likewise).
Your reasoning isn't entirely off, but the way to do that is not to use an inequality, but rather to use an outer join and test for the missing record on the "outer" table:
SELECT c.sid
FROM catalog c
WHERE NOT EXISTS
(SELECT c1.sid
FROM catalog c1 LEFT JOIN parts p ON c1.pid = p.pid
WHERE c.sid = c1.sid AND p.pid IS NULL)
Personally, I find the nested not exists to be a little confusing and needlessly complex. I would be more likely to solve this problem using count:
SELECT c.sid
FROM catalog c
GROUP BY c.sid
HAVING COUNT (DISTINCT c.pid) = (SELECT COUNT (*) FROM parts)
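The difference between = and <> in the innermost subquery shows up once a supplier carries some but not all parts. The sketch below is an illustration only: it uses Python's sqlite3 module rather than SQL Server, invented rows, a hypothetical helper sids, and adds DISTINCT so each sid appears once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Parts (pid INTEGER, pname TEXT, color TEXT);
    CREATE TABLE Catalog (sid INTEGER, pid INTEGER, cost REAL);
    INSERT INTO Parts VALUES (1, 'bolt', 'red'), (2, 'nut', 'green'), (3, 'cog', 'blue');
    -- Hypothetical rows: supplier 1 supplies all three parts, supplier 2 only parts 1 and 2.
    INSERT INTO Catalog VALUES (1, 1, 0.5), (1, 2, 0.7), (1, 3, 0.9), (2, 1, 0.4), (2, 2, 0.6);
""")

# The double NOT EXISTS from the question, with the innermost comparison left open.
TEMPLATE = """
    SELECT DISTINCT C.sid
    FROM Catalog C
    WHERE NOT EXISTS (SELECT P.pid
                      FROM Parts P
                      WHERE NOT EXISTS (SELECT C1.sid
                                        FROM Catalog C1
                                        WHERE C1.sid = C.sid
                                          AND C1.pid {op} P.pid))
    ORDER BY C.sid
"""

def sids(op):
    """Run the template with the given comparison operator and return the sids."""
    return [sid for (sid,) in conn.execute(TEMPLATE.format(op=op))]

with_equals = sids("=")
with_not_equals = sids("<>")

# The counting formulation from the answer above.
with_count = [sid for (sid,) in conn.execute("""
    SELECT c.sid
    FROM Catalog c
    GROUP BY c.sid
    HAVING COUNT(DISTINCT c.pid) = (SELECT COUNT(*) FROM Parts)
""")]

print(with_equals, with_not_equals, with_count)
```

With = only supplier 1 is returned. With <> supplier 2 slips through as well, since for every part it has a catalog row for some other part. The COUNT version agrees with the = version on this data.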

Join subquery with min

I'm pulling my hair out over a subquery that I'm using to avoid about 100 duplicates (out of about 40k records). The records that are duplicated are showing up because they have 2 dates in h2.datecreated for a valid reason, so I can't just scrub the data.
I'm trying to get only the earliest date to return. The first subquery (that starts with "select distinct address_id", with the MIN) works fine on its own...no duplicates are returned. So it would seem that the left join (or just plain join...I've tried that too) couldn't possibly see the second h2.datecreated, since it doesn't even show up in the subquery. But when I run the whole query, it's returning 2 values for some ipc.mfgid's, one with the h2.datecreated that I want, and the other one that I don't want.
I know it's got to be something really simple, or something that just isn't possible. It really seems like it should work! This is MSSQL. Thanks!
select distinct ipc.mfgid as IPC, h2.datecreated,
case when ad.Address is null
then ad.buildingname end as Address, cast(trace.name as varchar)
+ '-' + cast(trace.Number as varchar) as ONT,
c.ACCOUNT_Id,
case when h.datecreated is not null then h.datecreated
else h2.datecreated end as Install
from equipmentjoin as ipc
left join historyjoin as h on ipc.id = h.EQUIPMENT_Id
and h.type like 'add'
left join circuitjoin as c on ipc.ADDRESS_Id = c.ADDRESS_Id
and c.GRADE_Code like '%hpna%'
join (select distinct address_id, equipment_id,
min(datecreated) as datecreated, comment
from history where comment like 'MAC: 5%' group by equipment_id, address_id, comment)
as h2 on c.address_id = h2.address_id
left join (select car.id, infport.name, carport.number, car.PCIRCUITGROUP_Id
from circuit as car (NOLOCK)
join port as carport (NOLOCK) on car.id = carport.CIRCUIT_Id
and carport.name like 'lead%'
and car.GRADE_Id = 29
join circuit as inf (NOLOCK) on car.CCIRCUITGROUP_Id = inf.PCIRCUITGROUP_Id
join port as infport (NOLOCK) on inf.id = infport.CIRCUIT_Id
and infport.name like '%olt%' )
as trace on c.ccircuitgroup_id = trace.pcircuitgroup_id
join addressjoin as ad (NOLOCK) on ipc.address_id = ad.id
The typical approach to only getting the lowest row is one of the following. You didn't bother to specify what version of SQL Server you're using, what you want to do with ties, and I have little interest to try to work this into your complex query, so I'll show you an abstract simplification for different versions.
SQL Server 2000
SELECT x.grouping_column, x.min_column, x.other_columns ...
FROM dbo.foo AS x
INNER JOIN
(
SELECT grouping_column, min_column = MIN(min_column)
FROM dbo.foo GROUP BY grouping_column
) AS y
ON x.grouping_column = y.grouping_column
AND x.min_column = y.min_column;
SQL Server 2005+
;WITH x AS
(
SELECT grouping_column, min_column, other_columns,
rn = ROW_NUMBER() OVER (PARTITION BY grouping_column ORDER BY min_column)
FROM dbo.foo
)
SELECT grouping_column, min_column, other_columns
FROM x
WHERE rn = 1;
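A runnable sketch of the row-numbering approach follows, using the same abstract names. It assumes Python's sqlite3 module with SQLite 3.25 or later (needed for window functions) instead of SQL Server, and the rows are made up. The numbering restarts for each grouping_column value via PARTITION BY, so rn = 1 picks one row per group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foo (grouping_column TEXT, min_column INTEGER, other_columns TEXT);
    -- Hypothetical rows: two groups, two rows each.
    INSERT INTO foo VALUES
        ('a', 3, 'a-late'), ('a', 1, 'a-early'),
        ('b', 5, 'b-late'), ('b', 2, 'b-early');
""")

earliest_per_group = conn.execute("""
    WITH x AS (
        SELECT grouping_column, min_column, other_columns,
               ROW_NUMBER() OVER (PARTITION BY grouping_column
                                  ORDER BY min_column) AS rn
        FROM foo
    )
    SELECT grouping_column, min_column, other_columns
    FROM x
    WHERE rn = 1              -- lowest min_column within each group
    ORDER BY grouping_column
""").fetchall()

print(earliest_per_group)
```

Ties on min_column are broken arbitrarily by ROW_NUMBER; RANK would keep all tied rows instead.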
This subquery:
select distinct address_id, equipment_id,
min(datecreated) as datecreated, comment
from history where comment like 'MAC: 5%' group by equipment_id, address_id, comment
Probably will return multiple rows because the comment is not guaranteed to be the same.
Try this instead:
CROSS APPLY (
SELECT TOP 1 H2.DateCreated, H2.Comment -- H2.Equipment_id wasn't used
FROM History H2
WHERE
H2.Comment LIKE 'MAC: 5%'
AND C.Address_ID = H2.Address_ID
ORDER BY DateCreated
) H2
Switch that to OUTER APPLY in case you want rows that don't have a matching desired history entry.
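The effect of grouping by comment can be reproduced with two history rows. This is a sketch under assumptions: Python's sqlite3 module stands in for SQL Server, and the rows are invented. It compares the grouped subquery with and without comment in the GROUP BY rather than reproducing CROSS APPLY, which SQLite does not have.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE history (address_id INTEGER, equipment_id INTEGER,
                          datecreated TEXT, comment TEXT);
    -- Hypothetical rows: one address, two entries whose comments differ.
    INSERT INTO history VALUES
        (1, 7, '2012-01-01', 'MAC: 5A'),
        (1, 7, '2012-02-01', 'MAC: 5B');
""")

# Same shape as the subquery in the question: comment is part of the grouping key.
grouped_with_comment = conn.execute("""
    SELECT address_id, equipment_id, MIN(datecreated), comment
    FROM history
    WHERE comment LIKE 'MAC: 5%'
    GROUP BY equipment_id, address_id, comment
""").fetchall()

# Dropping comment from the grouping key collapses the address to one row.
grouped_without_comment = conn.execute("""
    SELECT address_id, equipment_id, MIN(datecreated)
    FROM history
    WHERE comment LIKE 'MAC: 5%'
    GROUP BY equipment_id, address_id
""").fetchall()

print(len(grouped_with_comment), grouped_without_comment)
```

Each distinct comment forms its own group, so the first query yields two rows for the address and the join then sees both dates.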

How to SELECT DISTINCT Info with TOP 1 Info and an Order By FROM the Top 1 Info

I have 2 tables, that look like:
CustomerInfo(CustomerID, CustomerName)
CustomerReviews(ReviewID, CustomerID, Review, Score)
I want to search reviews for a string and return CustomerInfo.CustomerID and CustomerInfo.CustomerName. However, I only want to show distinct CustomerID and CustomerName along with just one of their CustomerReviews.Reviews and CustomerReviews.Score. I also want to order by the CustomerReviews.Score.
I can't figure out how to do this, since a customer can leave multiple reviews, but I only want a list of customers with their highest scored review.
Any ideas?
This is the greatest-n-per-group problem that has come up dozens of times on Stack Overflow.
Here's a solution that works with a window function:
WITH CustomerCTE AS (
SELECT i.CustomerID, i.CustomerName, r.Review, r.Score,
ROW_NUMBER() OVER (PARTITION BY i.CustomerID ORDER BY r.Score DESC) AS RN
FROM CustomerInfo i
INNER JOIN CustomerReviews r ON i.CustomerID = r.CustomerID
WHERE CONTAINS(r.Review, '"search"')
)
SELECT * FROM CustomerCTE WHERE RN = 1
ORDER BY Score;
And here's a solution that works more broadly with RDBMS brands that don't support window functions:
SELECT i.*, r1.*
FROM CustomerInfo i
INNER JOIN CustomerReviews r1 ON i.CustomerID = r1.CustomerID
AND CONTAINS(r1.Review, '"search"')
LEFT OUTER JOIN CustomerReviews r2 ON i.CustomerID = r2.CustomerID
AND CONTAINS(r2.Review, '"search"')
AND (r1.Score < r2.Score OR r1.Score = r2.Score AND r1.ReviewID < r2.ReviewID)
WHERE r2.CustomerID IS NULL
ORDER BY Score;
I'm showing the CONTAINS() function because you should be using the fulltext search facility in SQL Server, not using LIKE with wildcards.
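The self-join technique can be tried on a handful of rows. This is a hedged sketch: it runs through Python's sqlite3 module, the rows are invented, and LIKE stands in for CONTAINS because SQLite has no equivalent of SQL Server's full-text predicate. The search filter on the second copy of the table is applied to r2, so that a higher-scored review which does not match the search cannot eliminate one that does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE CustomerInfo (CustomerID INTEGER, CustomerName TEXT);
    CREATE TABLE CustomerReviews (ReviewID INTEGER, CustomerID INTEGER,
                                  Review TEXT, Score INTEGER);
    -- Hypothetical rows for illustration.
    INSERT INTO CustomerInfo VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO CustomerReviews VALUES
        (1, 1, 'great search box', 3),
        (2, 1, 'search is slow',   5),
        (3, 1, 'nice colours',     9),   -- does not match the search term
        (4, 2, 'search works',     4);
""")

top_matching_review = conn.execute("""
    SELECT i.CustomerID, i.CustomerName, r1.Review, r1.Score
    FROM CustomerInfo i
    INNER JOIN CustomerReviews r1
            ON i.CustomerID = r1.CustomerID
           AND r1.Review LIKE '%search%'
    LEFT OUTER JOIN CustomerReviews r2
            ON i.CustomerID = r2.CustomerID
           AND r2.Review LIKE '%search%'
           AND (r1.Score < r2.Score
                OR (r1.Score = r2.Score AND r1.ReviewID < r2.ReviewID))
    WHERE r2.CustomerID IS NULL      -- keep r1 only if no better match exists
    ORDER BY r1.Score
""").fetchall()

print(top_matching_review)
```

Each customer appears once, paired with their highest-scored review among those that match the search term.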
I voted for Bill Karwin's answer, but I thought I'd throw out another option.
It uses a correlated subquery, which can often incur performance problems with large data sets, so use with caution. I think the only upside is that the query is easier to immediately understand.
select *
from [CustomerReviews] r
where [ReviewID] =
(
select top 1 [ReviewID]
from [CustomerReviews] rInner
where rInner.CustomerID = r.CustomerID
order by Score desc
)
order by Score desc
I didn't add the string search filter, but that can be easily added.
I think this should do it
select ci.CustomerID, ci.CustomerName, cr.Review, cr.Score
from CustomerInfo ci inner join
(select top 1 *
from CustomerReviews
where Review like '%search%'
order by Score desc) cr on ci.CustomerID = cr.CustomerID
order by cr.Score

Query Executing Problem

Using SQL 2005: “Taking too much time to execute”
I want to filter by date so that dates falling on holidays are not displayed, and I am using three tables with an inner join.
When I run the query below, it takes too much time to execute because I filter cardeventdate against three tables.
Query
SELECT
PERSONID, CardEventDate tmp_cardevent3
WHERE (CardEventDate NOT IN
(SELECT T_CARDEVENT.CARDEVENTDATE
FROM T_PERSON
INNER JOIN T_CARDEVENT ON T_PERSON.PERSONID = T_CARDEVENT.PERSONID
INNER JOIN DUAL_PRO_II_TAS.dbo.T_WORKINOUTTIME ON T_CARDEVENT.CARDEVENTDAY = DUAL_PRO_II_TAS.dbo.T_WORKINOUTTIME.DAYCODE
AND T_PERSON.TACODE = DUAL_PRO_II_TAS.dbo.T_WORKINOUTTIME.TACODE
WHERE (DUAL_PRO_II_TAS.dbo.T_WORKINOUTTIME.HOLIDAY = 'true')
)
)
ORDER BY PERSONID, CardEventDate DESC
Is there any other way to do the date filter for the query above?
I'm looking for alternative queries.
I'm pretty sure that it's not the joined tables that is the problem, but rather the "not in" that makes it slow.
Try to use a join instead:
select m.PERSONID, m.CardEventDate
from T_PERSON p
inner join T_CARDEVENT c on p.PERSONID = c.PERSONID
inner join DUAL_PRO_II_TAS.dbo.T_WORKINOUTTIME w
on c.CARDEVENTDAY = w.DAYCODE
and p.TACODE = w.TACODE
and w.HOLIDAY = 'true'
right join tmp_cardevent3 m on m.CardEventDate = c.CardEventDate
where c.CardEventDate is null
order by m.PERSONID, m.CardEventDate desc
(There is a from clause missing from your query, so I don't know what table you are trying to get the data from.)
Edit:
Put tmp_cardevent3 in the correct place.
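The join-and-test-for-NULL pattern can be compared with NOT IN on a reduced example. This is a sketch under assumptions: it uses Python's sqlite3 module, a hypothetical holiday_events table standing in for the three-table holiday subquery, invented rows, and a LEFT JOIN with the tables swapped instead of a RIGHT JOIN, since older SQLite versions lack RIGHT JOIN.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Simplified, hypothetical stand-ins for the tables in the question.
    CREATE TABLE tmp_cardevent3 (PERSONID INTEGER, CardEventDate TEXT);
    CREATE TABLE holiday_events (CardEventDate TEXT);
    INSERT INTO tmp_cardevent3 VALUES
        (1, '2009-01-01'), (1, '2009-01-02'), (2, '2009-01-01'), (2, '2009-01-05');
    INSERT INTO holiday_events VALUES ('2009-01-01');
""")

with_not_in = conn.execute("""
    SELECT PERSONID, CardEventDate
    FROM tmp_cardevent3
    WHERE CardEventDate NOT IN (SELECT CardEventDate FROM holiday_events)
    ORDER BY PERSONID, CardEventDate DESC
""").fetchall()

with_anti_join = conn.execute("""
    SELECT m.PERSONID, m.CardEventDate
    FROM tmp_cardevent3 m
    LEFT JOIN holiday_events h ON h.CardEventDate = m.CardEventDate
    WHERE h.CardEventDate IS NULL    -- no holiday row matched this date
    ORDER BY m.PERSONID, m.CardEventDate DESC
""").fetchall()

print(with_not_in == with_anti_join, with_anti_join)
```

Both forms return the same non-holiday rows here. Whether the join is actually faster on SQL Server depends on the plan and indexes, so it is worth comparing execution plans on the real tables.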
Have you created indices on all of the columns that you are using to do the joins? In particular, I'd consider indices on PERSONID in T_CARDEVENT, TACODE in both T_PERSON and T_WORKINOUTTIME, and HOLIDAY in T_WORKINOUTTIME.
