Is there a way to combine these queries? - sql-server

I have begun working some of the programming problems on HackerRank as a "productive distraction".
I was working on the first few in the SQL section and came across this problem (link):
Query the two cities in STATION with the shortest and
longest CITY names, as well as their respective lengths
(i.e.: number of characters in the name). If there is
more than one smallest or largest city, choose the one
that comes first when ordered alphabetically.
Input Format
The STATION table is described as follows:
where LAT_N is the northern latitude and LONG_W is
the western longitude.
Sample Input
Let's say that CITY only has four entries:
1. DEF
2. ABC
3. PQRS
4. WXY
Sample Output
ABC 3
PQRS 4
Explanation
When ordered alphabetically, the CITY names are listed
as ABC, DEF, PQRS, and WXY, with the respective lengths
3, 3, 4 and 3. The longest-named city is obviously PQRS,
but there are options for shortest-named city; we choose
ABC, because it comes first alphabetically.
I agree that this requirement could be written much more clearly, but the basic gist is pretty easy to get, especially with the clarifying example. The question I have, though, occurred to me because the instructions given in the comments for the question read as follows:
/*
Enter your query here.
Please append a semicolon ";" at the end of the query and
enter your query in a single line to avoid error.
*/
Now, writing a query on a single line doesn't necessarily imply a single query, though that seems to be the intended thrust of the statement. However, I was able to pass the test case using the following submission (submitted on 2 lines, with a carriage return in between):
SELECT TOP 1 CITY, LEN(CITY) FROM STATION ORDER BY LEN(CITY), CITY;
SELECT TOP 1 CITY, LEN(CITY) FROM STATION ORDER BY LEN(CITY) DESC, CITY;
Again, none of this is advanced SQL. But it got me thinking. Is there a non-trivial way to combine this output into a single result set? I have some ideas in mind where the WHERE clause uses sub-queries joined by OR to combine the two queries into one. Here is another submission I had that passed the test case:
SELECT
CITY,
LEN(CITY)
FROM
STATION
WHERE
ID IN (SELECT TOP 1 ID FROM STATION ORDER BY LEN(CITY), CITY)
OR
ID IN (SELECT TOP 1 ID FROM STATION ORDER BY LEN(CITY) DESC, CITY)
ORDER BY
LEN(CITY), CITY;
And, yes, I realize that the final , CITY in the final ORDER BY clause is superfluous, but it kind of makes the point that this query hasn't really saved that much effort, especially against returning the query results separately.
Note: This isn't a true MAX and MIN situation. Given the following input, you aren't actually taking the first and last rows:
Sample Input
1. ABC
2. ABCD
3. ZYXW
Based on the requirements as written, you'd take #1 and #2, not #1 and #3.
This makes me think that my solutions actually might be the most efficient way to accomplish this, but my set-based thinking could always use some strengthening, and I'm not sure if that might play in here or not.
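To make the note above concrete, here is a minimal sketch that runs the two-query approach against that three-row input. It uses Python's built-in sqlite3 module as a stand-in for SQL Server, so LENGTH and LIMIT 1 replace LEN and TOP 1; the function name and the in-memory table are assumptions made for illustration only.

```python
import sqlite3

def shortest_and_longest(cities):
    """Return [(city, length), (city, length)] for the shortest and longest names."""
    # Assumption: SQLite stands in for SQL Server (LENGTH/LIMIT instead of LEN/TOP).
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STATION (ID INTEGER PRIMARY KEY, CITY TEXT)")
    conn.executemany("INSERT INTO STATION (CITY) VALUES (?)", [(c,) for c in cities])
    # Shortest name first, ties broken alphabetically.
    shortest = conn.execute(
        "SELECT CITY, LENGTH(CITY) FROM STATION ORDER BY LENGTH(CITY), CITY LIMIT 1"
    ).fetchone()
    # Longest name first, ties still broken alphabetically (ascending).
    longest = conn.execute(
        "SELECT CITY, LENGTH(CITY) FROM STATION ORDER BY LENGTH(CITY) DESC, CITY LIMIT 1"
    ).fetchone()
    conn.close()
    return [shortest, longest]

print(shortest_and_longest(["ABC", "ABCD", "ZYXW"]))  # [('ABC', 3), ('ABCD', 4)]
```

Both picks come from the front of the alphabetical order within their length group, which is why #3 (ZYXW) is never chosen even though it sorts last overall.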

Here's another alternative. I think it's pretty straightforward and easy to understand what's going on. Performance is good.
Still has a couple of sub-queries though.
select
min(City), len(City)
from Station
group by
len(City)
having
len(City) = (select min(len(City)) from Station)
or
len(City) = (select max(len(City)) from Station)
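For anyone who wants to try the grouped form without a SQL Server instance, here is a rough sketch using Python's built-in sqlite3 module. The SQLite spelling (LENGTH instead of LEN), the trailing ORDER BY, and the helper name are my assumptions, not part of the answer above.

```python
import sqlite3

# SQLite spelling of the GROUP BY / HAVING query (LENGTH instead of LEN).
# The trailing ORDER BY is an addition so the row order is deterministic.
GROUPED_QUERY = """
SELECT MIN(CITY), LENGTH(CITY)
FROM STATION
GROUP BY LENGTH(CITY)
HAVING LENGTH(CITY) = (SELECT MIN(LENGTH(CITY)) FROM STATION)
    OR LENGTH(CITY) = (SELECT MAX(LENGTH(CITY)) FROM STATION)
ORDER BY LENGTH(CITY)
"""

def run_grouped(cities):
    """Load the given city names into an in-memory STATION table and run the query."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STATION (CITY TEXT)")
    conn.executemany("INSERT INTO STATION (CITY) VALUES (?)", [(c,) for c in cities])
    rows = conn.execute(GROUPED_QUERY).fetchall()
    conn.close()
    return rows

print(run_grouped(["DEF", "ABC", "PQRS", "WXY"]))  # [('ABC', 3), ('PQRS', 4)]
print(run_grouped(["DEF", "ABC", "WXY"]))          # [('ABC', 3)]
```

The second call shows one behavioural difference from the two-query version: when every name has the same length there is only one group, so a single row comes back instead of the same city twice.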

Untested as well, but I don't see a reason for it not to work:
SELECT *
FROM (
SELECT TOP (1) CITY, LEN(CITY) AS CITY_LEN
FROM STATION
ORDER BY CITY_LEN, CITY
) AS T
UNION ALL
SELECT *
FROM (
SELECT TOP (1) CITY, LEN(CITY) AS CITY_LEN
FROM STATION
ORDER BY CITY_LEN DESC, CITY
) AS T2;
You can't combine UNION ALL with a separate ORDER BY on each SELECT statement, but you can work around it by using subqueries together with a TOP (1) clause and ORDER BY.
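The same restriction can be observed outside SQL Server. The sketch below uses Python's built-in sqlite3 module purely as a stand-in (LIMIT 1 instead of TOP (1), LENGTH instead of LEN); the constant and function names are hypothetical. SQLite also refuses an ORDER BY placed before UNION ALL, and accepts the version wrapped in subqueries.

```python
import sqlite3

def make_station(cities):
    """Create an in-memory STATION table holding the given city names."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STATION (CITY TEXT)")
    conn.executemany("INSERT INTO STATION (CITY) VALUES (?)", [(c,) for c in cities])
    return conn

# ORDER BY / LIMIT attached directly to the first branch of the UNION ALL.
NAIVE = """
SELECT CITY, LENGTH(CITY) AS CITY_LEN FROM STATION ORDER BY CITY_LEN, CITY LIMIT 1
UNION ALL
SELECT CITY, LENGTH(CITY) AS CITY_LEN FROM STATION ORDER BY CITY_LEN DESC, CITY LIMIT 1
"""

# Each ordered, limited SELECT is wrapped in its own subquery first.
WRAPPED = """
SELECT * FROM (
    SELECT CITY, LENGTH(CITY) AS CITY_LEN FROM STATION ORDER BY CITY_LEN, CITY LIMIT 1
)
UNION ALL
SELECT * FROM (
    SELECT CITY, LENGTH(CITY) AS CITY_LEN FROM STATION ORDER BY CITY_LEN DESC, CITY LIMIT 1
)
"""

def naive_is_rejected(conn):
    """Return True if the engine refuses the unwrapped form."""
    try:
        conn.execute(NAIVE)
    except sqlite3.OperationalError:
        return True
    return False

conn = make_station(["DEF", "ABC", "PQRS", "WXY"])
print(naive_is_rejected(conn))
# UNION ALL does not promise a row order, so sort before displaying.
print(sorted(conn.execute(WRAPPED).fetchall()))
```

If the shortest row has to come first in the final output, an outer ORDER BY on CITY_LEN over the whole union would be the place to say so.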

UNTESTED:
WITH CTE AS (
Select City, len(City) as CityLen,
row_number() over (partition by len(City) order by City) as AlphaRN,
dense_rank() over (order by len(City) desc) as LenRN
from Station)
Select City, CityLen from cte
Where AlphaRN = 1 and (LenRN = (select max(LenRN) from cte) or
LenRN = (Select min(LenRN) from cte))

Here's the best I could come up with:
with Ordering as
(
select
City,
Forward = row_number() over (order by len(City), City),
Backward = row_number() over (order by len(City) desc, City)
from
Station
)
select City, len(City) from Ordering where 1 in (Forward, Backward);
There are definitely a lot of ways to approach this as evidenced by the variety of answers, but I don't think anything beats your original two-query solution in terms of cleanly and concisely expressing the intended behavior. Interesting question, though!
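To see why the `1 in (Forward, Backward)` filter picks exactly the two wanted rows, here is a small sketch that runs the same idea on the sample input from the question. It assumes Python's built-in sqlite3 module with a SQLite build that supports window functions (3.25 or later); the aliases fwd and bwd and the extra ORDER BY are mine.

```python
import sqlite3

# Same two row numberings as the Ordering CTE, in SQLite spelling.
ORDERING_QUERY = """
WITH Ordering AS (
    SELECT CITY,
           ROW_NUMBER() OVER (ORDER BY LENGTH(CITY), CITY)      AS fwd,
           ROW_NUMBER() OVER (ORDER BY LENGTH(CITY) DESC, CITY) AS bwd
    FROM STATION
)
SELECT CITY, LENGTH(CITY), fwd, bwd
FROM Ordering
WHERE 1 IN (fwd, bwd)
ORDER BY fwd
"""

def first_in_either_order(cities):
    """Return the rows that are number 1 in the forward or the backward ordering."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE STATION (CITY TEXT)")
    conn.executemany("INSERT INTO STATION (CITY) VALUES (?)", [(c,) for c in cities])
    rows = conn.execute(ORDERING_QUERY).fetchall()
    conn.close()
    return rows

print(first_in_either_order(["DEF", "ABC", "PQRS", "WXY"]))
# [('ABC', 3, 1, 2), ('PQRS', 4, 4, 1)]
```

ABC is first in the forward ordering and PQRS is first in the backward ordering; every other row has a number greater than 1 in both columns and is filtered out.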

This is what I came up with. I tried to use only one query, without CTEs or sub-queries.
;WITH STATION AS ( --Dummy table
SELECT *
FROM (VALUES
(1,'DEF','EU',1,9),
(2,'ABC','EU',1,6), -- This is shortest
(3,'PQRS','EU',1,5),
(4,'WXY','EU',1,4),
(5,'FGHA','EU',1,2),
(6,'ASDFHG','EU',1,3) --This is longest
) as t(ID, CITY, [STATE], LAT_N,LONG_W)
)
SELECT TOP 1 WITH TIES CITY,
LEN(CITY) as CITY_LEN
FROM STATION
ORDER BY ROW_NUMBER() OVER(PARTITION BY LEN(CITY) ORDER BY CITY ASC),
CASE WHEN MAX(LEN(CITY)) OVER (ORDER BY (SELECT NULL)) = LEN(CITY)
OR MIN(LEN(CITY)) OVER (ORDER BY (SELECT NULL))= LEN(CITY)
THEN 0 ELSE 1 END
Output:
CITY CITY_LEN
ABC 3
ASDFHG 6

select min(CITY), length(CITY)
from STATION
group by length(CITY)
having length(CITY) = (select min(length(CITY)) from STATION)
or length(CITY) = (select max(length(CITY)) from STATION);

Related

Top vs Rank/Row Number functions - Which performs higher?

I attempted to Google the Cost of using Top in a query vs using a Ranking or Row_Number type function.
Does the cost of each depend on the situation or can the cost of these two features be determined across the board for all situations?
Some mock SQL using a simple CTE to demonstrate my question is below:
WITH fData AS
(
SELECT 1 AS ID, 'John' AS fName, 'Black' AS lName, CAST('05/19/1975' AS DATE) AS birthDate UNION ALL
SELECT 2 AS ID, 'John' AS fName, 'Black' AS lName, CAST('04/1/1989' AS DATE) AS birthDate UNION ALL
SELECT 3 AS ID, 'John' AS fName, 'Black' AS lName, CAST('11/16/1995' AS DATE) AS birthDate UNION ALL
SELECT 4 AS ID, 'John' AS fName, 'Black' AS lName, CAST('01/16/1968' AS DATE) AS birthDate UNION ALL
SELECT 5 AS ID, 'John' AS fName, 'Black' AS lName, CAST('01/16/1968' AS DATE) AS birthDate
)
/* Using TOP 1 vs Row_Number() - Uncomment this and comment the below to VIEW TOP version */
--SELECT TOP 1 d.ID, d.fName, d.lName, d.birthDate
--FROM fData d
--ORDER BY d.birthDate
/* Using the below vs TOP 1 */
SELECT * FROM
( SELECT d.ID, d.fName, d.lName, d.birthDate, Row_Number() OVER (ORDER BY d.birthDate) AS ranker
FROM fData d
) r
WHERE r.ranker = 1
When using TOP there's not a need to apply a secondary Wrapping query around it and it looks cleaner. After applying a Row_Number or a Ranking function you then must wrap it to tell the query which row you are now wanting... either by applying the WHERE ranker = 1 or ranker >= 5 to achieve the same as TOP 1 or TOP 5.
Which is better or faster, if this is even something that can be determined?
In the case of your example the TOP is somewhat more efficient.
The execution plan for TOP is below
The TOP N sort with N=1 just needs to keep track of the row with the lowest birthDate that it sees.
For the row_number query it recognises that the row number is always ascending and does itself add a TOP 1 to the plan but it doesn't combine the separated TOP and SORT into a TOP N Sort - so it does a full sort of all 5 rows.
In the case that an index supplies rows in the desired order without the need for a sort there won't be much in it. The row_number query will have an extra couple of operators that are fairly inexpensive anyway.
WHY use ranking functions in SQL Server when it has TOP
Ranking functions in general are more powerful than TOP.
For the cases where both would work consider that TOP is a fairly ancient proprietary syntax and not standard SQL. It was in the product a long time before window functions were added. If portable SQL is a concern you should not use TOP.
Though you might not use ranking functions either, as another (standard SQL) alternative is
SELECT d.ID, d.fName, d.lName, d.birthDate
FROM fData d
ORDER BY d.birthDate
OFFSET 0 ROWS
FETCH NEXT 1 ROW ONLY
which gives the same plan as TOP 1
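One detail about the mock data is worth spelling out: IDs 4 and 5 share the earliest birthDate, so ordering by birthDate alone leaves the choice between them to the engine in either form. The sketch below says nothing about cost; it only checks that the two shapes return the same row once a tie-breaker is added. It assumes Python's built-in sqlite3 module (LIMIT 1 standing in for TOP 1, SQLite 3.25+ for window functions), ISO date strings, and ID as the tie-breaker, all of which are my additions.

```python
import sqlite3

# The five mock rows from the question, with dates rewritten as ISO strings.
ROWS = [
    (1, "John", "Black", "1975-05-19"),
    (2, "John", "Black", "1989-04-01"),
    (3, "John", "Black", "1995-11-16"),
    (4, "John", "Black", "1968-01-16"),
    (5, "John", "Black", "1968-01-16"),
]

# LIMIT 1 plays the role of TOP 1; ID is added as a tie-breaker.
LIMIT_QUERY = "SELECT ID, birthDate FROM fData ORDER BY birthDate, ID LIMIT 1"

# Row-number form with the same ordering, filtered in a wrapping query.
RANKED_QUERY = """
SELECT ID, birthDate FROM (
    SELECT ID, birthDate,
           ROW_NUMBER() OVER (ORDER BY birthDate, ID) AS ranker
    FROM fData
) r
WHERE r.ranker = 1
"""

def run(query):
    """Run a query against a fresh in-memory copy of the mock data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE fData (ID INTEGER, fName TEXT, lName TEXT, birthDate TEXT)")
    conn.executemany("INSERT INTO fData VALUES (?, ?, ?, ?)", ROWS)
    rows = conn.execute(query).fetchall()
    conn.close()
    return rows

print(run(LIMIT_QUERY))   # [(4, '1968-01-16')]
print(run(RANKED_QUERY))  # [(4, '1968-01-16')]
```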

Find/replace string values based upon table values

I have a somewhat unusual need to find/replace values in a string from values in a separate table.
Basically, I need to standardize a bunch of addresses, and one of the steps is to replace things like St, Rd or Blvd with Street, Road or Boulevard. I was going to write a function with a bunch of nested REPLACE() statements, but this is 1) inefficient; and 2) not practical. There are over 500 possible abbreviations for street types according to the USPS website.
What I'd like to do is something akin to:
REPLACE(Address1, Col1, Col2) where col1 and col2 are abbreviation and full street type in a separate table.
Anyone have any insight into something like this?
You can do such replacements using a recursive CTE. Something like this:
with r as (
select replacements.*, row_number() over (order by id) as seqnum
from replacements
),
cte as (
select replace(t.address, r.col1, r.col2) as address, seqnum as i
from t cross join
r
where r.seqnum = 1
union all
select replace(cte.address, r.col1, r.col2), i + 1
from cte join
r
on r.seqnum = cte.i + 1
)
select cte.*
from (select cte.*, max(i) over () as maxi
from cte
) cte
where maxi = i;
That said, this is basically iteration. It will be quite expensive to do this on a table where there are 500 replacements per row.
SQL is probably not the best tool for this.
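If the clean-up does move outside the database, one possible shape is a single regular-expression pass driven by the lookup table. This is only a sketch in Python: the three-entry dictionary stands in for the real abbreviation table, the function name is hypothetical, and matching is case-sensitive and whole-word only.

```python
import re

# Hypothetical subset of the abbreviation table (abbreviation -> full street type).
STREET_TYPES = {"St": "Street", "Rd": "Road", "Blvd": "Boulevard"}

def expand_street_types(address, replacements=STREET_TYPES):
    """Replace whole-word abbreviations (optionally followed by '.') in one pass."""
    # One alternation of all abbreviations; \b keeps "St" from matching inside "Stone".
    alternation = "|".join(re.escape(abbr) for abbr in replacements)
    pattern = re.compile(r"\b(" + alternation + r")\b\.?")
    # Look up the captured abbreviation in the table for each match.
    return pattern.sub(lambda m: replacements[m.group(1)], address)

print(expand_street_types("12 Main St"))        # 12 Main Street
print(expand_street_types("500 Sunset Blvd."))  # 500 Sunset Boulevard
print(expand_street_types("7 Stone Rd"))        # 7 Stone Road
```

Each address is scanned once regardless of how many abbreviations are in the table, instead of once per abbreviation. It does not try to handle ambiguous cases such as St meaning Saint.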

Hackerrank SQL challenge

This T-SQL query
SELECT city, Len(city)
FROM station
ORDER BY Len(city)
returns a table sorted by city, not by Len(city). Is this proper behavior?
Acme 4
Addison 7
Agency 6
Aguanga 7
Alanson 7
Alba 4
...
The challenge is :
https://www.hackerrank.com/challenges/weather-observation-station-5
Since you want first and last, I'd probably just use a union and top 1. It makes it clear what you're after and is easy to maintain.
And since you can use an alias in ORDER BY, I'd alias len(city).
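-- ORDER BY is only allowed after the last branch of a UNION,
-- so each TOP 1 ... ORDER BY goes in its own derived table.
SELECT city, LenCity
FROM (
SELECT TOP 1
city, len(city) LenCity
FROM
station
ORDER BY
LenCity ASC, city
) shortest
UNION ALL
SELECT city, LenCity
FROM (
SELECT TOP 1
city, len(city) LenCity
FROM
station
ORDER BY
LenCity DESC, city
) longest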
Here's a link to my GitHub if you have any problems with the other questions. These are the answers to all the Basic questions. Feel free to join in!
https://github.com/jaymoore3/SolvingHackerRank/tree/main/SQL/Basic
Buuuuutttt... If you just need the code:
select min(city) as city,len(city) as LengthOf
from station
group by len(city)
having len(city)=(select max(len(city)) from station)
union
select min(city) as city,len(city) as LengthOf
from station
group by len(city)
having len(city)=(select min(len(city)) from station)

Getting Distinct Data from two Columns SQL Server

I am trying to get the distinct data from two columns in the same table.
Table 1:
ID  Address      City
01  Test Street  Springdale
01  Main Street  Springdale
01  Pass Dr.     New Town
01  Main Street  New Town
I want the results to look like this;
Address      City
Test Street  Springdale
Main Street  New Town
Pass Dr.
Currently I have this:
SELECT DISTINCT Address
FROM Table1
WHERE ID = 01
UNION
SELECT DISTINCT City
FROM Table1
WHERE ID = 01
But what I get in return is:
Address
Test Street
Main Street
Pass Dr.
Springdale
New Town
Using nested CTEs as follows will produce the result set required in the OP:
;WITH CTE_Address AS
(
SELECT DISTINCT Address
FROM #T
), CTE_Address_rn AS
(
SELECT Address, ROW_NUMBER() OVER (ORDER BY Address) AS rn
FROM CTE_Address
), CTE_City AS
(
SELECT DISTINCT City
FROM #T
), CTE_City_rn AS
(
SELECT City, ROW_NUMBER() OVER (ORDER BY City) AS rn
FROM CTE_City
)
SELECT a.Address, c.City
FROM CTE_Address_rn AS a
LEFT JOIN CTE_City_rn AS c ON a.rn = c.rn
The basic idea is to produce two separate result sets containing distinct Addresses and Cities and join these by ROW_NUMBER.
SQL Fiddle Demo here
P.S. The above answer is based on the assumption that the OP just wants distinct Address and City values put into a single table, disassociated from each other.
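To illustrate what "disassociated" means in practice, here is a small sketch of the same row-number pairing run against the four rows from the question. It assumes Python's built-in sqlite3 module (SQLite 3.25+ for ROW_NUMBER), a table named T in place of #T, and a slightly condensed CTE layout; the final ORDER BY is added so the output order is fixed.

```python
import sqlite3

# Number the distinct addresses and the distinct cities separately, then pair by position.
PAIR_QUERY = """
WITH CTE_Address_rn AS (
    SELECT Address, ROW_NUMBER() OVER (ORDER BY Address) AS rn
    FROM (SELECT DISTINCT Address FROM T)
), CTE_City_rn AS (
    SELECT City, ROW_NUMBER() OVER (ORDER BY City) AS rn
    FROM (SELECT DISTINCT City FROM T)
)
SELECT a.Address, c.City
FROM CTE_Address_rn AS a
LEFT JOIN CTE_City_rn AS c ON a.rn = c.rn
ORDER BY a.rn
"""

def paired_distinct_values():
    """Return distinct addresses paired positionally with distinct cities."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE T (ID TEXT, Address TEXT, City TEXT)")
    conn.executemany(
        "INSERT INTO T VALUES (?, ?, ?)",
        [
            ("01", "Test Street", "Springdale"),
            ("01", "Main Street", "Springdale"),
            ("01", "Pass Dr.", "New Town"),
            ("01", "Main Street", "New Town"),
        ],
    )
    rows = conn.execute(PAIR_QUERY).fetchall()
    conn.close()
    return rows

for row in paired_distinct_values():
    print(row)
# ('Main Street', 'New Town')
# ('Pass Dr.', 'Springdale')
# ('Test Street', None)
```

The pairing is purely positional: "Pass Dr." lands next to "Springdale" only because both are second in their own alphabetical lists, not because any source row links them.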
It is because you are only ever selecting one column. The union just puts two data sets together and dedupes. So the first one reads distinct address and the second distinct city, and the result is returned as one list.
You should really return these as two different data sets or use two different procs. You can do the former just by getting rid of your UNION.
;WITH CTE_Address AS
(
SELECT ID, Street_Address, DENSE_RANK () over (Order by Street_Address) as Denserank_Street
FROM The_Table
),
CTE_City AS
(
SELECT ID, City_Name, DENSE_RANK () over (Order by City_Name) as Denserank_City
FROM The_Table
)
SELECT a.Street_Address, c.City_Name
FROM CTE_Address AS A
INNER JOIN CTE_City AS C ON A.ID = C.ID
P.S. Without the ID column, the JOIN statement will give wrong match between City and Address.

SSRS 2008 R2 - evaluating running total only on change of group

I have a report where I capture patient information, some of which is stored in the patient table and some of which is stored in the observations table. Taking date of birth as my example, if I count all the records for which the DOB has been supplied, I get significantly more than the total number of patients, because of the join to the observations table. How do I evaluate the running total only once for each group?
Edit: some sample data over at http://sqlfiddle.com/#!3/27b91/1/0. If I count birthdates from that query, I want 2 as the answer; same for race and ethnicity.
The following may or may not be the right approach for your specific situation, but it can be a useful technique to have at your disposal.
You can add some code to your select statement to help yourself answer questions like these 'downstream' (either via added criteria or via SSRS). See this modification of your SQL Fiddle:
select pid, firstName, lastName, dateOfBirth, obsName, obsValue, obsDate,
rowRank, CASE rowRank WHEN 1 THEN 1 ELSE 0 END AS countableRow
from
(
select Person.pid, Person.firstName, Person.lastName, Person.dateOfBirth
, Obs.obsName, Obs.obsValue, Obs.obsDate,
ROW_NUMBER() OVER (PARTITION BY Person.pid, Person.firstName, Person.lastName, Person.dateOfBirth ORDER BY Obs.obsDate) AS rowRank
from Person
join Obs on Person.pId = Obs.pId
) rankedData
The rowRank field will create a group-relative ranking number, which may or may not be useful to you downstream. The countableRow field will be either 1 or 0 such that each group will have one and only one row with a 1 in it. Doing SUM(countableRow) will give you the proper number of groups in your data.
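As a quick check of the counting claim, here is a sketch with made-up data: two people with three and two observations respectively. It assumes Python's built-in sqlite3 module (SQLite 3.25+ for ROW_NUMBER); the table contents and the trimmed column list are hypothetical and are not the data from the SQL Fiddle.

```python
import sqlite3

# Trimmed version of the ranked query: partition by person, flag the first row of each group.
RANKED_QUERY = """
SELECT pId, obsName, rowRank,
       CASE rowRank WHEN 1 THEN 1 ELSE 0 END AS countableRow
FROM (
    SELECT Person.pId, Obs.obsName,
           ROW_NUMBER() OVER (PARTITION BY Person.pId ORDER BY Obs.obsDate) AS rowRank
    FROM Person
    JOIN Obs ON Person.pId = Obs.pId
) rankedData
ORDER BY pId, rowRank
"""

def ranked_rows():
    """Build a tiny hypothetical Person/Obs data set and return the ranked join."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Person (pId INTEGER PRIMARY KEY, firstName TEXT, dateOfBirth TEXT);
        CREATE TABLE Obs (pId INTEGER, obsName TEXT, obsValue TEXT, obsDate TEXT);
        INSERT INTO Person VALUES (1, 'Ann', '1970-02-03'), (2, 'Bob', '1985-06-07');
        INSERT INTO Obs VALUES
            (1, 'WEIGHT', '150', '2013-01-01'),
            (1, 'HEIGHT', '65',  '2013-01-02'),
            (1, 'RACE',   'X',   '2013-01-03'),
            (2, 'WEIGHT', '180', '2013-02-01'),
            (2, 'HEIGHT', '70',  '2013-02-02');
    """)
    rows = conn.execute(RANKED_QUERY).fetchall()
    conn.close()
    return rows

rows = ranked_rows()
# Number of joined rows, then the sum of the countableRow flags.
print(len(rows), sum(r[3] for r in rows))  # 5 2
```

The join produces five rows, but only the first row of each person's partition carries a 1, so the sum matches the number of people rather than the number of observations.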
Now, you can extend this functionality (if you wish) by dumping out actual field values instead of a constant scalar like 1 in the first row of each group. So, if you had CASE rowRank WHEN 1 THEN dateOfBirth ELSE NULL END AS countableDOB, you could then, for example, get the total number of people with each distinct birthday using just this dataset.
Of course, you can do all those things using methods like @Russell's with SQL anyway, so this would be most relevant with specific downstream requirements that may not match your situation.
EDIT
Obviously the countableRow field there isn't a one-size-fits-all solution to the types of queries you want. I have added a few more examples of the PARTITION BY strategy to another SQL Fiddle:
select pid, firstName, lastName, dateOfBirth, obsName, obsValue, obsDate,
rowRank, CASE rowRank WHEN 1 THEN 1 ELSE 0 END AS countableRow,
valueRank, CASE valueRank WHEN 1 THEN 1 ELSE 0 END AS valueCount,
dobRank, CASE WHEN dobRank = 1 AND dateOfBirth IS NOT NULL THEN 1 ELSE 0 END AS dobCount
from
(
select Person.pid, Person.firstName, Person.lastName, Person.dateOfBirth
, Obs.obsName, Obs.obsValue, Obs.obsDate,
ROW_NUMBER() OVER (PARTITION BY Person.pid, Person.firstName, Person.lastName, Person.dateOfBirth ORDER BY Obs.obsDate) AS rowRank,
ROW_NUMBER() OVER (PARTITION BY Obs.obsName, Obs.obsValue ORDER BY Obs.obsDate) AS valueRank,
ROW_Number() OVER (PARTITION BY Person.dateOfBirth ORDER BY Person.pid) AS dobRank
from Person
join Obs on Person.pId = Obs.pId
) rankedData
Lest anyone misunderstand me as suggesting this is always appropriate, it obviously isn't. This isn't a better solution to getting specific answers using additional SQL queries. What it allows you to do is encode enough information to simply answer such questions in the consuming code all in a single result set. That's where it can come in handy.
SECOND EDIT
Since you were wondering whether you can do this if race data is stored in more than one place, the answer is, absolutely. I have revised the code from my previous SQL Fiddle, which is now available in a new one:
select pid, firstName, lastName, dateOfBirth, obsName, obsValue, obsDate,
rowRank, CASE rowRank WHEN 1 THEN 1 ELSE 0 END AS countableRow,
valueRank, CASE valueRank WHEN 1 THEN 1 ELSE 0 END AS valueCount,
dobRank, CASE WHEN dobRank = 1 AND dateOfBirth IS NOT NULL THEN 1 ELSE 0 END AS dobCount,
raceRank, CASE WHEN raceRank = 1 AND (race IS NOT NULL OR obsName = 'RACE') THEN 1 ELSE 0 END AS raceCount
from
(
select Person.pid, Person.firstName, Person.lastName, Person.dateOfBirth, Person.[race]
, Obs.obsName, Obs.obsValue, Obs.obsDate,
ROW_NUMBER() OVER (PARTITION BY Person.pid, Person.firstName, Person.lastName, Person.dateOfBirth ORDER BY Obs.obsDate) AS rowRank,
ROW_NUMBER() OVER (PARTITION BY Obs.obsName, Obs.obsValue ORDER BY Obs.obsDate) AS valueRank,
ROW_NUMBER() OVER (PARTITION BY Person.dateOfBirth ORDER BY Person.pid) AS dobRank,
ROW_NUMBER() OVER (PARTITION BY ISNULL(Person.race, CASE Obs.obsName WHEN 'RACE' THEN Obs.obsValue ELSE NULL END) ORDER BY Person.pid) AS raceRank
from Person
left join Obs on Person.pId = Obs.pId
) rankedData
As you can see, in the new Fiddle, this properly counts the number of Races as 3, with 2 being in the Obs table and the third being in the Person table. The trick is that PARTITION BY can contain expressions, not just raw column output. Note that I changed the join to a left join here, and that we need to use a CASE to only include obsValue WHERE obsName is 'RACE'. It is a little complicated, but not overwhelmingly so, and it handles even fairly complex cases gracefully.
It turned out that Jeroen's pointer to RunningValue was more on-target than I thought. I was able to get the results I wanted with the following code:
=RunningValue(Iif(Not IsNothing(Fields!DATEOFBIRTH.Value)
, Fields!PATIENTID.Value
, Nothing)
, CountDistinct
, Nothing
)
Thanks particularly to Dominic P, whose technique I'll keep in mind for next time.
This will only pull one record per patient, unless they reported different DOBs:
SELECT P.FOO,
P.BAR,
(etc.),
O.DOB
FROM Patients P
INNER JOIN Observations O
ON P.PatientID = O.PatientID
GROUP BY P.FOO, P.BAR, (P.etc), O.DOB
