Extracting ranked values from different rows - SQL - sql-server

I have a table with various categories. For each category I have three quantities, and I want to extract, per category, a row containing the 25th largest value of each quantity (ties can safely be ignored).
For example, I might have a table whose rows are towns or cities from one of several countries. The categories are countries, and the quantities might be population, land area, and latitude. The data would then look something like:
TownName Country Population LandArea Latitude
Paris France 500,715 47.9 45.76
Manchester USA 110,229 90.6 42.99
Calais France 72,589 33.5 50.95
Leicester England 337,653 73.3 52.63
Dunkirk France 90,995 43.9 51.04
... ... ... ... ...
In this example, the end result I'd want would be each of the countries in the list, along with their 25th largest population, 25th largest land area and 25th largest latitude. This no longer resembles some specific town or city, but gives some information about each country. This might look like:
Country Population LandArea Latitude
France 144,548 83.95 50.21
Poland 141,080 88.3 54.17
Australia 68,572 146 -21.35
... ... ... ...
I've figured out one way to do this, which is the following (sketched in code below):
Use the ROW_NUMBER function to rank one of Population, LandArea and Latitude in descending order, partitioned by country.
Repeat this three times (once for each quantity), and JOIN the three result sets together. In the ON clause, match on both the Country columns and the rank columns.
Use a WHERE statement to pull out the row for each country with rank 25.
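A minimal sketch of that approach (assuming the table is called MyTable and has the columns from the example above):
;WITH PopRanked AS
(
    SELECT Country, [Population],
           ROW_NUMBER() OVER (PARTITION BY Country ORDER BY [Population] DESC) AS rk
    FROM MyTable
),
AreaRanked AS
(
    SELECT Country, LandArea,
           ROW_NUMBER() OVER (PARTITION BY Country ORDER BY LandArea DESC) AS rk
    FROM MyTable
),
LatRanked AS
(
    SELECT Country, Latitude,
           ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Latitude DESC) AS rk
    FROM MyTable
)
SELECT p.Country, p.[Population], a.LandArea, l.Latitude
FROM PopRanked p
JOIN AreaRanked a ON a.Country = p.Country AND a.rk = p.rk
JOIN LatRanked  l ON l.Country = p.Country AND l.rk = p.rk
WHERE p.rk = 25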
I don't like this method because it involves creating three almost exact copies of a decent-sized chunk of code to produce the three result sets I join together (each block of code in the join is a decent size because this is a simplified example, and I had to do other work to get to this stage).
I was wondering whether there is a way that wouldn't involve repeating large chunks of code around a JOIN, as this makes my code big and ugly. Also, this seems like something that may crop up time and time again, so a more efficient method would be wonderful.
Thanks for your time

Perhaps if you can't find a way to eliminate the three-join approach, you can at least simplify the join condition by assigning each distinct Country a GroupID:
;WITH
MasterCTE AS
(
    SELECT *,
           DENSE_RANK() OVER (ORDER BY Country) AS GroupID -- Don't use ROW_NUMBER here. RANK or DENSE_RANK only
    FROM MyTable
),
cte1 AS
(
    SELECT GroupID, [Population],
           ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY [Population] DESC) AS PopulationRank
    FROM MasterCTE
),
cte2 AS
(
    SELECT GroupID, LandArea,
           ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY LandArea DESC) AS LandAreaRank
    FROM MasterCTE
),
cte3 AS
(
    SELECT GroupID, Latitude,
           ROW_NUMBER() OVER (PARTITION BY GroupID ORDER BY Latitude DESC) AS LatitudeRank
    FROM MasterCTE
)
SELECT DISTINCT -- Remember to include DISTINCT
    MasterCTE.Country,
    cte1.Population, cte2.LandArea, cte3.Latitude
FROM MasterCTE
INNER JOIN cte1 ON MasterCTE.GroupID = cte1.GroupID AND cte1.PopulationRank = 25
INNER JOIN cte2 ON MasterCTE.GroupID = cte2.GroupID AND cte2.LandAreaRank = 25
INNER JOIN cte3 ON MasterCTE.GroupID = cte3.GroupID AND cte3.LatitudeRank = 25
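If you want to avoid the joins altogether, a sketch of an alternative (using the same assumed MyTable columns) is to compute all three rankings in a single pass and pick out the rank-25 values with conditional aggregation:
WITH Ranked AS
(
    SELECT Country, [Population], LandArea, Latitude,
           ROW_NUMBER() OVER (PARTITION BY Country ORDER BY [Population] DESC) AS PopulationRank,
           ROW_NUMBER() OVER (PARTITION BY Country ORDER BY LandArea DESC) AS LandAreaRank,
           ROW_NUMBER() OVER (PARTITION BY Country ORDER BY Latitude DESC) AS LatitudeRank
    FROM MyTable
)
SELECT Country,
       MAX(CASE WHEN PopulationRank = 25 THEN [Population] END) AS Population,
       MAX(CASE WHEN LandAreaRank = 25 THEN LandArea END) AS LandArea,
       MAX(CASE WHEN LatitudeRank = 25 THEN Latitude END) AS Latitude
FROM Ranked
GROUP BY Country
Countries with fewer than 25 rows will simply return NULLs for the missing quantities here.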

Related

min(count(*)) over... behavior?

I'm trying to understand the behavior of
select ..... ,MIN(count(*)) over (partition by hotelid)
VS
select ..... ,count(*) over (partition by hotelid)
Ok.
I have a list of hotels (1,2,3)
Each hotel has departments.
On each departments there are workers.
My data looks like this:
select * from data
Ok. Looking at this query:
select hotelid,departmentid , cnt= count(*) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
I can perfectly understand what's going on here. On that result set, partitioning by hotelid, we are counting the visible (grouped) rows.
But look what happens with this query :
select hotelid,departmentid , min_cnt = min(count(*)) over (partition by hotelid)
from data
group by hotelid, departmentid
ORDER BY hotelid
Question:
Where do those numbers come from? I don't understand how adding MIN caused that result. The MIN of what?
Can someone please explain how's the calculation being made?
fiddle
The two statements are very different. The first query counts the rows after the grouping and then applies the PARTITION. So, for example, hotel 1 returns 1 row (all of its rows have the same department A), and so COUNT(*) OVER (PARTITION BY hotelid) returns 1. Hotel 2, however, has 2 departments, 'B' and 'C', and hence returns 2.
For your second query, you first have the COUNT(*), which is not itself a window function. That means it counts the rows within each group of the GROUP BY specified in your query: GROUP BY hotelid, departmentid. For hotel 1, there are 4 rows for department A, hence 4. Then you take the minimum of 4, which is unsurprisingly 4. The other hotels each have at least one department with only 1 row, and so the MIN over the hotel returns 1.
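To see the difference side by side, here is a minimal sketch with made-up data (the original table isn't shown, so the row counts are assumed):
-- Hypothetical data: hotel 1 has one department (A) with 4 workers,
-- hotel 2 has departments B (1 worker) and C (2 workers),
-- hotel 3 has departments D (1 worker) and E (3 workers).
CREATE TABLE data (hotelid int, departmentid char(1));
INSERT INTO data VALUES
(1,'A'),(1,'A'),(1,'A'),(1,'A'),
(2,'B'),(2,'C'),(2,'C'),
(3,'D'),(3,'E'),(3,'E'),(3,'E');

-- cnt counts the grouped rows per hotel (1, 2 and 2 here);
-- min_cnt first evaluates COUNT(*) per (hotelid, departmentid) group
-- and then takes the MIN of those counts per hotel (4, 1 and 1 here).
SELECT hotelid, departmentid,
       COUNT(*) OVER (PARTITION BY hotelid) AS cnt,
       MIN(COUNT(*)) OVER (PARTITION BY hotelid) AS min_cnt
FROM data
GROUP BY hotelid, departmentid
ORDER BY hotelid;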

Calculation of Closest Distance of one table's IDs to another reference table without joining in reference table

Really confused on this: I'm trying to get, for each ID in ID_table, the population rank of the city it is closest to. The city data comes from https://gist.github.com/Miserlou/c5cd8364bf9b2420bb29 (I converted it into a CSV and uploaded it to our SQL Server; let's call it City_table).
City_table has the latitude and longitude of each city as well as the rank of each city in terms of population, and the ID_table has the latitude & longitude of each ID. I can't join in City_table, because I'll need to calculate the distance from each ID to each city, and take the minimum of that.
The calculation below gets the distance from one location to another and converts to miles:
(ACOS(COS(RADIANS(90 - CAST(ID_table.GEO_LATITUDE AS REAL))) * COS(RADIANS(90 - City_table.latitude))
    + SIN(RADIANS(90 - CAST(ID_table.GEO_LATITUDE AS REAL))) * SIN(RADIANS(90 - City_table.latitude))
    * COS(RADIANS(CAST(ID_table.GEO_LONGITUDE AS REAL) - (-City_table.longitude)))) * 6371) * 0.621371
To recap, ID_table has an ID, latitude, and longitude. City_table has a latitude, longitude, city, and rank according to highest population. I need to get the rank of the city from City_table that is closest to the ID's location.
I really just don't know how to do this and would really appreciate any help.
I'm trying to accomplish something like the following (acknowledging the syntax isn't right; it's just the idea of what I think I'm trying to accomplish):
SELECT A.ID,City_table.rank,
FOR EACH CITY IN City_table{(ACOS(COS(RADIANS(90-CAST(ID_table.GEO_LATITUDE AS REAL)))*COS(RADIANS(90-City_table.latitude))+
SIN(RADIANS(90-CAST(ID_table.GEO_LATITUDE AS REAL)))*SIN(RADIANS(90-City_table.latitude)) *
COS(RADIANS(CAST(ID_table.GEO_LONGITUDE AS REAL)-(-City_table.longitude))))*6371)*0.621371} AS DISTANCE
FROM ID_table A
WHERE ROW_NUMBER()OVER(PARTITION BY A.ID ORDER BY DISTANCE ASC) = '1'
So I ended up figuring this out. For future reference, if anyone runs into this same problem, here is a solution (albeit maybe not the best one), but it works:
WITH X AS
(
    SELECT A.ID, A.CITY, A.[STATE], B.[Rank], B.City AS CITY_TABLE_CITY, B.State AS CITY_TABLE_STATE,
           ((ACOS(COS(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*COS(RADIANS(90-B.latitude))+
             SIN(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*SIN(RADIANS(90-B.latitude))*
             COS(RADIANS(CAST(A.GEO_LONGITUDE AS REAL)-B.longitude)))*6371)*0.621371) AS DISTANCE,
           ROW_NUMBER() OVER (PARTITION BY A.ID ORDER BY ((ACOS(COS(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*COS(RADIANS(90-B.latitude))+
             SIN(RADIANS(90-CAST(A.GEO_LATITUDE AS REAL)))*SIN(RADIANS(90-B.latitude))*
             COS(RADIANS(CAST(A.GEO_LONGITUDE AS REAL)-B.longitude)))*6371)*0.621371) ASC) AS DISTANCE_NUMBER
    FROM ID_TABLE A
    -- The <> condition effectively pairs every ID with (almost) every city,
    -- so the ROW_NUMBER above can pick out the closest one per ID.
    FULL OUTER JOIN CITY_TABLE B ON B.latitude <> A.GEO_LATITUDE
)
SELECT *
FROM X
WHERE DISTANCE_NUMBER = 1 AND DISTANCE IS NOT NULL
ORDER BY ID
I strongly suggest (as already mentioned by John Cappelletti) using the geography type. I had to do something similar to you, where my reference table contained the geography of the locations. I joined it to the main query ON 1=1. That way, for each row of your main query, you will have a location from your locations table (just keep in mind that if the tables you're dealing with are large, this will be slow!). In any case, the query looks something like this:
--The next line declares a reference location (some lat/long example)!
DECLARE @reference_point geography = geography::STGeomFromText('POINT(-84.206230 33.897247)', 4326);

SELECT t.LocationName,
       t.PointGeom.STDistance(@reference_point) / 1609.34 AS [DistanceInMiles]
FROM
(
    SELECT LocationName,
           geography::Point(ISNULL(Geom.STY, 0), ISNULL(Geom.STX, 0), 4326) AS [PointGeom],
           Geom,
           Geom.STX AS [Longitude],
           Geom.STY AS [Latitude]
    FROM MyLocationsTable
) AS t;
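For the "closest city per ID" part specifically, a sketch of how that could look with geography (table and column names are taken from the question, so treat them as assumptions; without a spatial index this still scans City_table once per ID):
SELECT i.ID, nearest.[Rank], nearest.DistanceInMiles
FROM ID_table AS i
CROSS APPLY
(
    SELECT TOP (1)
           c.[Rank],
           geography::Point(CAST(i.GEO_LATITUDE AS float), CAST(i.GEO_LONGITUDE AS float), 4326).STDistance(
               geography::Point(c.latitude, c.longitude, 4326)) / 1609.34 AS DistanceInMiles
    FROM City_table AS c
    ORDER BY DistanceInMiles
) AS nearest;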

How can I retrieve "exception" data from a table without knowing the data in advance?

I have a table that updates all the time.
The table maintains a list that links stores to clubs, and manages, among other things, "discount percentages" per store + club.
Table name: Policy_supplier
Column: POLXSUP_DISCOUNT
Suppose all the "vendors" in the table are marked with a 10% discount.
And someone accidentally assigns one vendor 8% or 15% (or even NULL).
How do I generate a query to retrieve the "abnormal" vendor?
You can find the mode of your discounts and then just pick out the records that aren't equal to that mode:
WITH mode_discount AS (
    SELECT TOP 1 POLXSUP_DISCOUNT
    FROM Policy_supplier
    GROUP BY POLXSUP_DISCOUNT
    ORDER BY count(*) DESC
)
SELECT *
FROM Policy_supplier
WHERE POLXSUP_DISCOUNT <> (SELECT POLXSUP_DISCOUNT FROM mode_discount)
   OR POLXSUP_DISCOUNT IS NULL; -- <> alone would not return the NULL rows
You can use the OVER clause with aggregates to calculate an aggregate over a range of data and include it in the results. For example,
SELECT AVG(POLXSUP_DISCOUNT)
FROM Policy_supplier
would return a single average value, while
SELECT POLXSUP_DISCOUNT, AVG(POLXSUP_DISCOUNT) OVER()
FROM Policy_supplier
would return the overall average in each row. Typically OVER is used with a PARTITION BY clause. If you wanted the average per supplier, you could write AVG(POLXSUP_DISCOUNT) OVER (PARTITION BY SupplierID).
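As a sketch (SupplierID is an assumed column name here, standing in for whatever identifies the supplier/vendor on each row):
SELECT SupplierID,
       POLXSUP_DISCOUNT,
       AVG(POLXSUP_DISCOUNT) OVER (PARTITION BY SupplierID) AS SupplierAvgDiscount
FROM Policy_supplier;
Rows whose POLXSUP_DISCOUNT differs from SupplierAvgDiscount are then easy to spot, or to filter in an outer query.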
To find anomalies, you should use one of the PERCENTILE functions, eg PERCENTILE_CONT. For example
select PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY POLXSUP_DISCOUNT) over()
from Policy_Supplier
will return a discount value below which you'll find 95% of the records. The 5% of discounts above this value are probably anomalies.
Similarly, PERCENTILE_CONT(0.05) will return a discount below which you'll find 5% of the records.
You can combine both to find potentially exceptional records, eg:
with percentiles as (
    select ID,
           POLXSUP_DISCOUNT,
           PERCENTILE_CONT(0.95) WITHIN GROUP (ORDER BY POLXSUP_DISCOUNT) over() as pct95,
           PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY POLXSUP_DISCOUNT) over() as pct05
    from Policy_Supplier)
select ID, POLXSUP_DISCOUNT
from percentiles
where POLXSUP_DISCOUNT > pct95 or POLXSUP_DISCOUNT < pct05

How to sum if within percentile in SQL Server?

I have a table that looks something like this:
It contains more than 100k rows.
I know how to get the median (or other percentile) values per week:
SELECT DISTINCT week,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY visits) OVER (PARTITION BY week) AS visit_median
FROM table
ORDER BY week
But how do I return a column with the total visits within the top N percentile of the group per week?
I don't think you want percentile_cont(). You can try using ntile(). For instance, the top decile:
SELECT week, SUM(visits)
FROM (SELECT t.*,
             NTILE(100) OVER (PARTITION BY week ORDER BY visits DESC) as tile
      FROM table t
     ) t
WHERE tile <= 10
GROUP BY week
ORDER BY week;
You need to understand how NTILE() handles ties. Rows with the same number of visits can go into different tiles. That is, the sizes of the tiles differ by at most 1. This may or may not be what you really want.
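If you do want tied rows to stay together, a sketch of an alternative is to filter on each row's cumulative position within its week (CUME_DIST, available in SQL Server 2012+) instead of NTILE():
SELECT week, SUM(visits) AS top_decile_visits
FROM (SELECT t.*,
             CUME_DIST() OVER (PARTITION BY week ORDER BY visits DESC) AS pct_pos
      FROM table t
     ) t
WHERE pct_pos <= 0.10
GROUP BY week
ORDER BY week;
Because tied rows share the same CUME_DIST value, a tie group is either entirely inside or entirely outside the cutoff.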

T-SQL sort by calculated column

I'm on SQL Server 2014 and am trying to order products by their price. Here is the problem: the price is a calculated field. I have created a function which evaluates a number of pricing rules (maybe about 4 tables, each with about 4,000,000 records, combined with JOINs to fit the current login) and returns the user's price for the product. While this is fine if I just want to return the price for a limited number of products, it is way too slow if I want to sort by it.
I was thinking about having an additional table like UserProductPrice which would get calculated in the background, but this will obviously not always hold the correct price, as the rules etc. could change between calculations.
Any suggestion on how I could sort by the price would be most appreciated.
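For reference, a sketch of the UserProductPrice idea described above (all names are assumptions; the real pricing function and refresh job would fill in the details):
-- Hypothetical cache table, refreshed by a background job; prices can be
-- stale until the next refresh if the pricing rules change in between.
CREATE TABLE dbo.UserProductPrice
(
    UserId    int   NOT NULL,
    ProductId int   NOT NULL,
    Price     money NOT NULL,
    CONSTRAINT PK_UserProductPrice PRIMARY KEY (UserId, ProductId)
);

-- Supports sorting/paging by price per user without re-evaluating the rules.
CREATE INDEX IX_UserProductPrice_User_Price
    ON dbo.UserProductPrice (UserId, Price) INCLUDE (ProductId);

DECLARE @UserId int = 1; -- example value for the current login
SELECT ProductId, Price
FROM dbo.UserProductPrice
WHERE UserId = @UserId
ORDER BY Price;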
You could use the ROW_NUMBER() function and place this into a temp table:
SELECT
    Product,
    dbo.ufnPrice(Price) AS Price,
    ROW_NUMBER() OVER (PARTITION BY Product ORDER BY dbo.ufnPrice(Price) DESC) AS Ranking
INTO #Products
FROM dbo.Products

SELECT
    Product,
    Price
FROM #Products
WHERE Ranking = 1

DROP TABLE #Products
Should give you what you need.
