rank() over in hive - sql-server

I'm converting a SQL Server stored procedure to HiveQL.
How can I convert something like:
SELECT
p.FirstName, p.LastName,
RANK() OVER (ORDER BY a.PostalCode) AS Rank

I Have seen this use case a few times, there is a way to do something similar to RANK() in Hive using a UDF.
There are basically a few steps:
Partition the data into groups with DISTRIBUTE BY
Order the data in each group with SORT BY
There is actually a nice article on the topic, and you can also find some code from Edward Capriolo here.
Here is an example query doing a rank in Hive:
ADD JAR p-rank-demo.jar;
CREATE TEMPORARY FUNCTION p_rank AS 'demo.PsuedoRank';
SELECT
category,country,product,sales,rank
FROM (
SELECT
category,country,product,sales,
p_rank(category, country) rank
FROM (
SELECT
category,country,product,
sales
FROM p_rank_demo
DISTRIBUTE BY
category,country
SORT BY
category,country,sales desc) t1) t2
WHERE rank <= 3
Which does the equivalent of the following query in MySQL:
SELECT
category,country,product,sales,rank
FROM (
SELECT
category,country,product, sales,
rank() over (PARTITION BY category, country ORDER BY sales DESC) rank
FROM p_rank_demo) t
WHERE rank <= 3

For anyone who comes across this question, Hive now supports rank() and other analytics functions.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

No big deal
Somehow in Hive 0.13.1 that comma does not work :-(
so I got it working without the comma
(PARTITION BY category country ORDER BY sales DESC)

Related

Why is Rank() OVER PARTITION BY returning too many results

I want the results of my query to be the top 3 newest, distinct Campaign Names for each Campaign Type.
My query at the moment is:
DECLARE #currentRecord varchar(160);
SET #currentRecord = '316827D2-B522-E811-816A-0050569FE3BD';
SELECT DISTINCT
rs.CampaignName,
rs.CampaignType,
rs.receivedon,
rs.Rank
FROM
(SELECT
fs_retentioncontact,
receivedon,
regardingobjectidname AS CampaignName,
fs_campaignresponsetypename AS CampaignType,
RANK() OVER (PARTITION BY fs_campaignresponsetypename, regardingobjectidname
ORDER BY receivedon DESC) AS Rank
FROM
dbo.FilteredCampaignResponse) rs
INNER JOIN
dbo.FilteredContact ON rs.fs_retentioncontact = dbo.FilteredContact.contactid
WHERE
(dbo.FilteredContact.parentcustomerid IN (#currentRecord))
AND Rank <= 3
ORDER BY
CampaignType, receivedon DESC;
There may be multiple results for each campaign name as well as campaign response because they are linked to individual contacts but I only want to see the 3 latest unique campaigns for each campaign type.
My query is not partitioning by each individual campaign response type (there are 6 different ones) as I was expecting. If I remove the regardingobjectidname from the PARTITION BY I only get a single row in the results when I should be getting 18 rows. This particular company has over 700 campaign responses across the 6 campaign types.
My query is returning 102 rows so it seems to be removing duplicates on campaign name which is part of what I need but not the whole story.
I have read quite a few posts regarding rank() on here e.g.
how-to-use-rank-in-sql-server
[ using-sql-rank-for-overall-rank-and-rank-within-a-group]2
but I am not able to work out what I am doing wrong from their examples. Could it be the positioning of the 'receivedon' in the ORDER BY? or something else?
I have finally worked out from reading a post on another site how to get the top 3 of each group. I shall post my answer in case it helps anyone else.
I had to use ROW_NUMBER() OVER (PARTITION BY instead of RANK() OVER (PARTITION BY and I also moved the INNER JOIN and WHERE clause (to filter for the correct company) from the outer query to the inner query.
DECLARE #currentRecord varchar(160)
SET #currentRecord='316827D2-B522-E811-816A-0050569FE3BD'
SELECT distinct rs.CampaignName
,rs.CampaignType
, rs.receivedon
,RowNum
FROM(
SELECT fs_retentioncontact
, receivedon
, regardingobjectidname AS CampaignName
,fs_campaignresponsetypename as CampaignType
,ROW_NUMBER() OVER (PARTITION BY fs_campaignresponsetypename ORDER BY fs_campaignresponsetypename, receivedon DESC) AS RowNum
FROM FilteredCampaignResponse
INNER JOIN dbo.FilteredContact ON fs_retentioncontact = dbo.FilteredContact.contactid
WHERE(dbo.FilteredContact.parentcustomerid IN (#currentRecord)))rs
WHERE RowNum <=3
ORDER BY CampaignType,receivedon DESC;

Joins and subqueries in SQL Server

I am trying to find all continents and their most-used currency.
ContinentCode
CurrencyCode
CurrencyUsage
I am not familiar with grouping so I will be very grateful if you can give me a hint using only subqueries and joins if they can be used adequately here.
Join the countries to the continents. Then aggregate to get the number the currencies are used. Then use row_number() (or rank(), if you want to keep ties) to produce an ordinal per continent -- the more the currency is used the lesser the ordinal. Only get the rows where this ordinal equals 1.
SELECT x.continentcode,
x.currencycode,
x.currencyusage
FROM (SELECT ct.continentcode,
cy.currencycode,
count(cy.currencycode) currencyusage,
row_number() OVER (PARTITION BY ct.continentcode
ORDER BY count(cy.currencycode) DESC) rn
FROM continents ct
LEFT JOIN countries cy
ON cy.continentcode = ct.continentcode
GROUP BY ct.continentcode,
cy.currencycode) x
WHERE x.rn = 1;
And next time do not post images of tables. Instead paste the CREATE and INSERT statements to create them as text.

Get the id of the row with the max value with two grouping

We have a data structure with four columns:
ContractoreName, ProjectCode, InvoiceID, OrderID
We want to group the data by both ContractoreName and ProjectCode columns, and then get the InvoiceID of the row for each group with MAX(OrderID).
You could use ROW_NUMBER:
SELECT ContractorName, ProjectName, OrderId, InvoiceId
FROM (SELECT *, ROW_NUMBER() OVER(PARTITION BY ContractorName, ProjectName
ORDER BY OrderId DESC) AS rn
FROM tab
) AS sub
WHERE rn = 1;
ROW_NUMBER() is what I would call the canonical solution. In many cases, an old-fashioned solution has better performance:
select t.*
from t
where t.orderid = (select max(t2.orderid)
from t t2
where t2.contractorname = t.contractorname and
t2.projectname = t.projectname
);
This is especially true if there is an index on (contractorname, projectname, orderid).
Why is this faster? Basically, SQL Server can scan the table doing a lookup in an index. The lookup is really fast because the index is designed for it, so the scan is just a little faster than a full table scan.
When using row_number(), SQL Server has to scan the table to calculate the row number (and that can use the index, so it might be fast). But then it has to go back to the table to fetch the columns and apply the where clause. So, even if it uses an index, it is doing more work.
EDIT:
I should also point out that this can be done without a subquery:
select distinct contractorname, projectname,
max(orderid) over (partition by contractorname, projectname) as lastest_order,
first_value(invoiceid) partition by (order by contractorname, projectname order by orderid desc) as lastest_invoice
from t;
Unfortunately, SQL Server doesn't offer first_value() as an aggregation function, but you can use select distinct and get the same effect.

SQL Server loop for changing multiple values of different users

I have the next table:
Supposing these users are sort in descending order based on their inserted date.
Now, what I want to do, is to change their sorting numbers in that way that for each user, the sorting number has to start from 1 up to the number of appearances of each user. The result should look something like:
Can someone provide me some clues of how to do it in sql server ? Thanks.
You can use the ROW_NUMBER ranking function to calculate a row's rank given a partition and order.
In this case, you want to calculate row numbers for each user PARTITION BY User_ID. The desired output shows that ordering by ID is enough ORDER BY ID.
SELECT
Id,
User_ID,
ROW_NUMBER() OVER (PARTITION BY User_ID ORDER BY Id) AS Sort_Number
FROM MyTable
There are other ranking functions you can use, eg RANK, DENSE_RANK to calculate a rank according to a score, or NTILE to calculate percentiles for each row.
You can also use the OVER clause with aggragets to create running totals or moving averages, eg SUM(Id) OVER (PARTITION BY User_ID ORDER BY Id) will create a running total of the Id values for each user.
use ROW_NUMBER() PARTITION BY User_Id
SELECT
Id,
[User_Id],
Sort_Number = ROW_NUMBER() OVER(PARTITION BY [User_Id]
ORDER BY [User_Id],[CreatedDate] DESC)
FROM YourTable
select id
,user_id
,row_number() over(partition by user_id order by user_id) as sort_number
from table
Use ranking function row_number()
select *,
row_number() over (partition by User_id order by user_id, date desc)
from table t

Replace Group By clause with any other clause

In below query, I am using GROUP BY clause to get list of recently updated records depends on updated date. But I would like to have the query without a GROUP BY clause because of some internal reasons. Can please any one help me to solve this.
SELECT Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress PP
WHERE Proj_UpdatedDate IN (SELECT MAX(Proj_UpdatedDate)
FROM ProjectProgress
GROUP BY
Proj_ProjectID)
ORDER BY
Proj_ProjectID
Using TOP 1 should give you the same result assuming you meant the MAX(Proj_UpdatedDate):
SELECT Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress PP
WHERE Proj_UpdatedDate IN (SELECT TOP 1 Proj_UpdatedDate
FROM ProjectProgress
ORDER BY Proj_UpdatedDate DESC)
ORDER BY
Proj_ProjectID
However your query actually returns multiple dates since it's GROUPED BY Proj_ProjectId (the max date for each project). Is that your desired outcome - to show a list of dates that the projects were updated and by whom?
If so, try using ROW_NUMBER():
SELECT Proj_UpdatedDate, Proj_UpdatedBy
FROM (
SELECT ROW_NUMBER() OVER (PARTITION BY Proj_ProjectID ORDER BY Proj_UpdatedBy DESC) rn,
Proj_UpdatedDate,
Proj_UpdatedBy
FROM ProjectProgress
) t
WHERE rn = 1
And here is the SQL Fiddle. This assumes you are running SQL Server 2005 or greater.
Good luck.

Resources