T_SQL Rank function not working as expected

Once again I really need your expertise. It looks like I am not ranking correctly here.
I have two records returned, and both are ranked 1 even though their date-times differ by a few minutes. Thanks guys, I really appreciate your help as always.
SELECT CD.MEMACT,
       CD.DATETIME, --DATETIME
       CD.AG_ID,
       RANK() OVER (PARTITION BY
                        CD.MEMACT,
                        CD.DATETIME,
                        CD.AG_ID
                    ORDER BY CD.DATETIME) RANKED
FROM MEM_ACT_TBL CD
WHERE CD.MEMACT = '1024518'

You provided this sample data as violating your expectations:
MEMACT   DATETIME         AG_ID    RANK
-------  ---------------  -------  ----
1024518  12/26/2013 7:43  Ag_1541  1
1024518  12/26/2013 7:53  Ag_2488  1
But in your example code, you PARTITION BY CD.MEMACT, CD.DATETIME, CD.AG_ID. This explicitly means to treat all rows that do not share these values as completely separate ranking series. Since the two rows you presented above do not have the same MEMACT, DATETIME, and AG_ID values, then they each start their ranking from 1.
If you were expecting the two rows above to ascend in rank, then you must not partition by anything but MEMACT, since only that column is the same between the rows. You would then most likely order by DATETIME and then AG_ID (to get determinism if you have two rows with the same date) like so:
PARTITION BY CD.MEMACT ORDER BY CD.DATETIME, CD.AG_ID
It never makes sense in any of the ranking functions to both PARTITION BY and ORDER BY the same column as this will ensure that the ORDER BY has no differing data to work with as you've already segregated all the separate series by the same column.
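To see the effect of the two PARTITION BY choices side by side, here is a minimal sketch. It is an assumption-laden stand-in, not the asker's environment: it uses Python's built-in sqlite3 module (SQLite 3.25 or newer, which supports window functions) instead of SQL Server, and the table name mem_act and column name dt are made up for the illustration. Only the two sample rows from the question are loaded.

```python
import sqlite3

# Hypothetical in-memory stand-in for MEM_ACT_TBL with the two sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mem_act (memact TEXT, dt TEXT, ag_id TEXT)")
conn.executemany(
    "INSERT INTO mem_act VALUES (?, ?, ?)",
    [
        ("1024518", "2013-12-26 07:43", "Ag_1541"),
        ("1024518", "2013-12-26 07:53", "Ag_2488"),
    ],
)

rows = conn.execute(
    """
    SELECT ag_id,
           -- Every row is its own partition, so every rank restarts at 1.
           RANK() OVER (PARTITION BY memact, dt, ag_id ORDER BY dt) AS over_partitioned,
           -- One partition per MEMACT, so ranks ascend within the member.
           RANK() OVER (PARTITION BY memact ORDER BY dt, ag_id) AS ranked
    FROM mem_act
    ORDER BY dt
    """
).fetchall()

for row in rows:
    print(row)
```

The first rank column stays at 1 for both rows, while the second counts 1, 2, which is the behaviour the question was expecting.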

Without seeing your data I could say it is due to the partitioning method. Your 'partitioning' statement is essentially WHAT you group by to determine number position resetting. The 'order by' determines the sequence of HOW you are ordering that data. See this simple example where in my first windowed function I use one too many partition by's and thus just show a whole bunch of redundant ones.
Simple self extracting example:
declare @Person table ( personID int identity, person varchar(8));
insert into @Person values ('Brett'),('Sean'),('Chad'),('Michael'),('Ray'),('Erik'),('Queyn');
declare @Orders table ( OrderID int identity, PersonID int, Desciption varchar(32), Amount int);
insert into @Orders values (1, 'Shirt', 20),(1, 'Shoes', 50),(2, 'Shirt', 22),(2, 'Shoes', 52),(3, 'Shirt', 20),(3, 'Shoes', 50),(3, 'Hat', 20),(4, 'Shirt', 20),(5, 'Shirt', 20),(5, 'Pants', 30),
(6, 'Shirt', 20),(6, 'RunningShoes', 70),(7, 'Shirt', 22),(7, 'Shoes', 40),(7, 'Coat', 80);
Select
  p.person
, o.Desciption
, o.Amount
  -- partitioning by the same column you order by leaves one row per partition
, row_number() over(partition by person, Desciption order by Desciption) as [Wrong I Used Too Many Partitions]
, row_number() over(partition by person order by Desciption) as [Correct By Alpha of Description]
, row_number() over(partition by person order by Amount desc) as [Correct By Amount of Highest First]
from @Person p
join @Orders o on p.personID = o.PersonID;
I would guess you could do this instead:
SELECT
  CD.MEMACT,
  CD.DATETIME, --DATETIME
  CD.AG_ID,
  RANK() OVER (PARTITION BY CD.MEMACT ORDER BY CD.DATETIME, CD.AG_ID) RANKED
FROM MEM_ACT_TBL CD
WHERE CD.MEMACT = '1024518'

Related

Identify an associated main meal for choice items

I have a MS SQL Server database for a restaurant. There is a transactional table that shows the meals ordered and, where applicable, the optional (choice) items ordered with each meal e.g. "with chips", "with beans".
I am trying to determine, for each choice item, what the main meal was that it was ordered with.
How can I write a query in such a way that it doesn't matter how many choice items there are for a particular main meal, it will always find the appropriate main meal? We can assume that each choice item belongs to the same main meal until a new main meal appears.
The expected outcome might look something like this (first and last columns created by the query, middle columns are present in the table): -
RowWithinTxn  TransactionID  LineNumber  CourseCategory  ItemName    AssociatedMainMeal
------------  -------------  ----------  --------------  ----------  ------------------
1             123            123456      Main            Steak       NULL
2             123            123457      Choice          Chips       Steak
3             123            123458      Choice          Beans       Steak
1             124            124567      Main            Fish        NULL
2             124            124568      Choice          Mushy Peas  Fish
As it's possible to have more than one choice item with the same main meal, something like LAST_VALUE() or LAG() (where we lag by 1), won't always give the correct answers (for example: -)
CASE WHEN CourseCategory = 'Choice Items' THEN
LAST_VALUE(ItemName)
OVER(PARTITION BY TransactionID
ORDER BY LineNumber ROWS BETWEEN UNBOUNDED PRECEDING AND 1 PRECEDING)
ELSE NULL END AS AssociatedMainMeal
OR
CASE WHEN CourseCategory = 'Choice Items' THEN
LAG(ItemName,1,0)
OVER(PARTITION BY TransactionID
ORDER BY LineNumber)
ELSE NULL END AS AssociatedMainMeal
I suspect ROW_NUMBER() might be useful somewhere, i.e.: ROW_NUMBER() OVER(PARTITION BY TransactionID ORDER BY LineNumber) AS RowWithinTxn, as this would chop the table up into separate TransactionIDs
Many Thanks!

You want to use conditional aggregation here:
MAX(CASE CourseCategory WHEN 'Main' THEN ItemName END) OVER (PARTITION BY TransactionID)
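As a rough check of this expression, the sketch below runs it against the question's sample rows. It is only an illustration under stated assumptions: it uses Python's sqlite3 module (SQLite 3.25+) rather than SQL Server, the table name txn_lines is made up, and the outer CASE that blanks the value on main-meal rows is added here to match the expected output in the question. It is only correct when each transaction contains a single main meal.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table holding the columns the question says exist in the source table.
conn.execute(
    """CREATE TABLE txn_lines (
           TransactionID INTEGER, LineNumber INTEGER,
           CourseCategory TEXT, ItemName TEXT)"""
)
conn.executemany(
    "INSERT INTO txn_lines VALUES (?, ?, ?, ?)",
    [
        (123, 123456, "Main", "Steak"),
        (123, 123457, "Choice", "Chips"),
        (123, 123458, "Choice", "Beans"),
        (124, 124567, "Main", "Fish"),
        (124, 124568, "Choice", "Mushy Peas"),
    ],
)

meals = conn.execute(
    """
    SELECT ItemName,
           CASE WHEN CourseCategory <> 'Main' THEN
               -- The CASE yields the item name only on 'Main' rows; MAX skips the NULLs.
               MAX(CASE CourseCategory WHEN 'Main' THEN ItemName END)
                   OVER (PARTITION BY TransactionID)
           END AS AssociatedMainMeal
    FROM txn_lines
    ORDER BY LineNumber
    """
).fetchall()

for row in meals:
    print(row)
```

With two main meals in one transaction, MAX would simply pick the alphabetically last main for every choice item, so this form does not generalise.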

Assuming that you could have many main meals per transaction, you effectively need a LAG function that can ignore certain rows. Some RDBMSs have an IGNORE NULLS option, where you could use something like:
LAG(CASE WHEN CourseCategory = 'Main' THEN ItemName END) IGNORE NULLS
OVER(PARTITION BY TransactionID ORDER BY LineNumber)
Unfortunately SQL Server did not support this before SQL Server 2022, but Itzik Ben-Gan came up with a nice workaround using window functions. Essentially, if you combine your sorting column and your data column into a single binary value, you can use MAX in a similar way to LAG: non-main rows produce NULL, which MAX ignores, and because the sorting column forms the leading bytes, the maximum value up to the current row always comes from the most recent main-meal row.
It does however make for quite painful reading because you have to convert two columns to binary and combine them:
CONVERT(BINARY(4), t.Linenumber) +
CONVERT(BINARY(50), CASE WHEN t.CourseCategory = 'Main' THEN t.ItemName END)
Then get the max of them
MAX(<result of above expression>) OVER(PARTITION BY t.TransactionID ORDER BY t.LineNumber)
Then remove the first 4 bytes (the line number) and convert back to varchar:
CONVERT(VARCHAR(50), SUBSTRING(<result of above expression>, 5, 50))
Then finally, you only want to show this value for non main meals
CASE WHEN t.CourseCategory <> 'Main' THEN <result of above expression> END
Putting this all together, along with some sample data you end up with:
DROP TABLE IF EXISTS #T;
CREATE TABLE #T
(
RowWithinTxn INT,
TransactionID INT,
LineNumber INT,
CourseCategory VARCHAR(6),
ItemName VARCHAR(10)
);
INSERT INTO #T
(
RowWithinTxn,
TransactionID,
LineNumber,
CourseCategory,
ItemName
)
VALUES
(1, 123, 123456, 'Main', 'Steak'),
(2, 123, 123457, 'Choice', 'Chips'),
(3, 123, 123458, 'Choice', 'Beans'),
(1, 124, 124567, 'Main', 'Fish'),
(2, 124, 124568, 'Choice', 'Mushy Peas');
SELECT t.RowWithinTxn,
t.TransactionID,
t.LineNumber,
t.CourseCategory,
t.ItemName,
AssociatedMainMeal = CASE WHEN t.CourseCategory <> 'Main' THEN
CONVERT(VARCHAR(50), SUBSTRING(MAX(CONVERT(BINARY(4), t.Linenumber) + CONVERT(BINARY(50), CASE WHEN t.CourseCategory = 'Main' THEN t.ItemName END))
OVER(PARTITION BY t.TransactionID ORDER BY t.LineNumber), 5, 50))
END
FROM #T AS t;
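For comparison, here is a different way to handle several main meals in one transaction. It is not the binary-concatenation technique described above: it swaps in a running count of 'Main' rows as a group key and then takes the main meal within each group. The sketch assumes Python's sqlite3 module (SQLite 3.25+) instead of SQL Server, and the table name txn_lines, the column alias MealGroup, and the sample transaction 200 are all invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE txn_lines (
           TransactionID INTEGER, LineNumber INTEGER,
           CourseCategory TEXT, ItemName TEXT)"""
)
# Hypothetical transaction with two main meals, each followed by its choice items.
conn.executemany(
    "INSERT INTO txn_lines VALUES (?, ?, ?, ?)",
    [
        (200, 1, "Main", "Steak"),
        (200, 2, "Choice", "Chips"),
        (200, 3, "Main", "Fish"),
        (200, 4, "Choice", "Mushy Peas"),
        (200, 5, "Choice", "Beans"),
    ],
)

result = conn.execute(
    """
    WITH grouped AS (
        SELECT *,
               -- Running count of 'Main' rows: increments each time a new main meal starts.
               SUM(CASE WHEN CourseCategory = 'Main' THEN 1 ELSE 0 END)
                   OVER (PARTITION BY TransactionID ORDER BY LineNumber) AS MealGroup
        FROM txn_lines
    )
    SELECT ItemName,
           CASE WHEN CourseCategory <> 'Main' THEN
               -- Each group holds exactly one 'Main' row, so MAX returns that meal.
               MAX(CASE WHEN CourseCategory = 'Main' THEN ItemName END)
                   OVER (PARTITION BY TransactionID, MealGroup)
           END AS AssociatedMainMeal
    FROM grouped
    ORDER BY TransactionID, LineNumber
    """
).fetchall()

for row in result:
    print(row)
```

This relies on LineNumber being unique within a transaction, so that the running count changes exactly at each main-meal row.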

How can I delete duplicates from a table based on a location id and name

columns are: Name, Location_Name, Location_ID
I want to check Names and Location_ID, and if there are two that are the same I want to delete/remove that row.
For example: If Name John Fox at location id 4 shows up two or more times I want to just keep one.
After that, I want to count how many people per location.
Location_Name1: 45
Location_Name2: 66
Etc...
The location name and Location Id are related.

Deleting duplicates is a common pattern. You apply a sequence number to all of the duplicates, then delete any that aren't first. In this case I order arbitrarily but you can choose to keep the one with the lowest PK value or that was modified last or whatever - just update the ORDER BY to sort the one you want to keep first.
;WITH cte AS
(
  SELECT *, rn = ROW_NUMBER() OVER
    (PARTITION BY Name, Location_ID ORDER BY @@SPID)
  FROM dbo.TableName
)
DELETE cte WHERE rn > 1;
Then to count, assuming there can't be two different Location_IDs for a given Location_Name (this is why schema + sample data is so helpful):
SELECT Location_Name, People = COUNT(Name)
FROM dbo.TableName
GROUP BY Location_Name;
If Location_Name and Location_ID are not tightly coupled (e.g. there could be Location_ID = 4, Location_Name = Place 1 and Location_ID = 4, Location_Name = Place 2), then you're going to have to define how to determine which place to display if you group by Location_ID, or admit that perhaps one of those columns is meaningless.
If Location_Name and Location_ID are tightly coupled, they shouldn't both be stored in this table. You should have a lookup/dimension table that stores both of those columns (once!) and you use the smallest data type as the key you record in the fact table (where it is repeated over and over again). This has several benefits:
Scanning the bigger table is faster, because it's not as wide
Storage is reduced because you're not repeating long strings over and over and over again
Aggregation is clearer and you can join to get names after aggregation, which will be faster
If you need to change a location's name, you only need to change it in exactly one place
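The number-then-delete pattern and the follow-up count can be tried end to end with the sketch below. Treat it as an approximation under stated assumptions: it runs on Python's sqlite3 module (SQLite 3.25+), not SQL Server. SQLite cannot delete through a CTE, so the sketch deletes by SQLite's implicit rowid instead, and it keeps the earliest-inserted copy of each duplicate. The People_Location rows are the sample rows posted elsewhere in this thread.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE People_Location (Name TEXT, Location_Name TEXT, Location_ID INTEGER)"
)
conn.executemany(
    "INSERT INTO People_Location VALUES (?, ?, ?)",
    [
        ("John Fox", "Moon", 4),
        ("John Bear", "Moon", 4),
        ("Peter", "Saturn", 5),
        ("John Fox", "Moon", 4),
        ("Micheal", "Sun", 1),
        ("Jackie", "Sun", 1),
        ("Tito", "Sun", 1),
        ("Peter", "Saturn", 5),
    ],
)

# Number the rows inside each (Name, Location_ID) group and delete all but the first.
conn.execute(
    """
    DELETE FROM People_Location
    WHERE rowid IN (
        SELECT rid
        FROM (
            SELECT rowid AS rid,
                   ROW_NUMBER() OVER (
                       PARTITION BY Name, Location_ID ORDER BY rowid
                   ) AS rn
            FROM People_Location
        )
        WHERE rn > 1
    )
    """
)

remaining = conn.execute("SELECT COUNT(*) FROM People_Location").fetchone()[0]

# Count people per location after de-duplication.
counts = conn.execute(
    """
    SELECT Location_Name, COUNT(Name)
    FROM People_Location
    GROUP BY Location_Name
    ORDER BY Location_Name
    """
).fetchall()

print(remaining, counts)
```

Changing the ORDER BY inside the window is how you would choose which copy survives, as described above.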

Sample code
CREATE TABLE People_Location
(
Name VARCHAR(30) NOT NULL,
Location_Name VARCHAR(30) NOT NULL,
Location_ID INT NOT NULL,
)
INSERT INTO People_Location
VALUES
('John Fox', 'Moon', 4),
('John Bear', 'Moon', 4),
('Peter', 'Saturn', 5),
('John Fox', 'Moon', 4),
('Micheal', 'Sun', 1),
('Jackie', 'Sun', 1),
('Tito', 'Sun', 1),
('Peter', 'Saturn', 5)
Get location and count
select Location_Name, count(1)
from
(select Name, Location_Name,
rn = ROW_NUMBER() OVER (PARTITION BY Name, Location_ID ORDER BY Name)
from People_Location
) t
where rn = 1
group by Location_Name
Result
Moon 2
Saturn 1
Sun 3

TSQL - De-duplication report - grouping

So I'm trying to create a report that ranks a duplicate record, the idea behind this is that the customer wants to merge a whole lot of duplicate records that came about from a migration.
I need the ranking so that my report can show which record should be the "main" record, i.e. the record that will have missing data pulled into it.
The duplicate definition is pretty simple:
If the email addresses are the same then it is always a duplicate, if
the emails do not match, then the first name, surname, and mobile must
match.
The ranking will be based on a whole bunch of columns in the table, so:
email address isn't NULL = 50
phone number isn't NULL = 20
etc.. whichever gets the highest number in the duplicate group becomes the main record. This is where I am having issues, I can't seem to find a way to get an incremental number for each duplicate set. This is some of the code I have so far:
( I took out some of the rank columns in the temp table and CTE expression to shorten it )
DECLARE @tmp_Duplicates TABLE (
  tmp_personID INT
, tmp_Firstname NVARCHAR(100)
, tmp_Surname NVARCHAR(100)
, tmp_HomeEmail NVARCHAR(300)
, tmp_MobileNumber NVARCHAR(100)
--- Ratings
, tmp_HomeEmail_Rating INT
--- Groupings
, tmp_GroupNumber INT
)
;WITH cteDupes AS
(
  SELECT ROW_NUMBER() OVER(PARTITION BY personHomeEmail ORDER BY personID DESC) AS RND,
         ROW_NUMBER() OVER(PARTITION BY personHomeEmail ORDER BY personId) AS RNA,
         p.personID, p.PersonFirstName, p.PersonSurname, p.PersonHomeEMail
       , personMobileTelephone
  FROM tblCandidate c INNER JOIN tblPerson p ON c.candidateID = p.personID
)
INSERT INTO @tmp_Duplicates
SELECT PersonID, PersonFirstName, PersonSurname, PersonHomeEMail, personMobileTelephone
     , 10, RND
FROM cteDupes
WHERE RNA + RND > 2 -- keeps only emails that occur more than once
ORDER BY personID, PersonFirstName, PersonSurname
SELECT * FROM @tmp_Duplicates
This gives me the results I want, but the group number isn't showing how I need it:
What I need is for each group to be an incremental value:

SQL Select set of records from one table, join each record to top 1 record of second table matching 1 column, sorted by a column in the second table

This is my first question on here, so I apologize if I break any rules.
Here's the situation. I have a table that lists all the employees and the building to which they are assigned, plus training hours, with ssn as the id column, I have another table that list all the employees in the company, also with ssn, but including name, and other personal data. The second table contains multiple records for each employee, at different points in time. What I need to do is select all the records in the first table from a certain building, then get the most recent name from the second table, plus allow the result set to be sorted by any of the columns returned.
I have this in place, and it works fine, it is just very slow.
A very simplified version of the tables are:
table1 (ssn CHAR(9), buildingNumber CHAR(7), trainingHours DEC(5,2)) (7200 rows)
table2 (ssn CHAR(9), fName VARCHAR(20), lName VARCHAR(20), sequence INT) (708,000 rows)
The sequence column in table 2 is a number that corresponds to a predetermined date to enter these records, the higher number, the more recent the entry. It is common/expected that each employee has several records. But several may not have the most recent(i.e. '8').
My SProc is:
@BuildingNumber CHAR(7), @SortField VARCHAR(25)
BEGIN
DECLARE @returnValue TABLE(ssn CHAR(9), buildingNumber CHAR(7), fname VARCHAR(20), lName VARCHAR(20), rowNumber INT)
INSERT INTO @returnValue(...)
SELECT(ssn,buildingNum,fname,lname,rowNum)
FROM SELECT(...,CASE @SortField Row_Number() OVER (PARTITION BY buildingNumber ORDER BY {sortField column} END AS RowNumber)
FROM table1 a
OUTER APPLY(SELECT TOP 1 fName,lName FROM table2 WHERE ssn = a.ssn ORDER BY sequence DESC) AS e
where buildingNumber = @BuildingNumber
SELECT * from @returnValue ORDER BY RowNumber
END
I have indexes for the following:
table1: buildingNumber(non-unique,nonclustered)
table2: sequence_ssn(unique,nonclustered)
Like I said this gets me the correct result set, but it is rather slow. Is there a better way to go about doing this?
It's not possible to change the database structure or the way table 2 operates. Trust me if it were it would be done. Are there any indexes I could make that would help speed this up?
I've looked at the execution plans, and it has a clustered index scan on table 2(18%), then a compute scalar(0%), then an eager spool(59%), then a filter(0%), then top n sort(14%).
That's 78% of the execution so I know it's in the section to get the names, just not sure of a better(faster) way to do it.
The reason I'm asking is that table 1 needs to be updated with current data. This is done through a webpage with a radgrid control. It has a range, start index, all that, and it takes forever for the users to update their data.
I can change how the update process is done, but I thought I'd ask about the query first.
Thanks in advance.

I would approach this with window functions. The idea is to assign a sequence number to the records in the table with duplicates (I think table2), such that the most recent record for each employee has a value of 1. Then just select that row as the most recent record:
select t1.*, t2.*
from table1 t1 join
     (select t2.*,
             row_number() over (partition by ssn order by sequence desc) as seqnum
      from table2 t2
     ) t2
     on t1.ssn = t2.ssn and t2.seqnum = 1
where t1.buildingNumber = @BuildingNumber;
My second suggestion is to use a user-defined function rather than a stored procedure:
create function XXX (
    @BuildingNumber char(7)
)
returns table as
return (
    select t1.ssn, t1.buildingNumber, t2.fName, t2.lName
    from table1 t1 join
         (select t2.*,
                 row_number() over (partition by ssn order by sequence desc) as seqnum
          from table2 t2
         ) t2
         on t1.ssn = t2.ssn and t2.seqnum = 1
    where t1.buildingNumber = @BuildingNumber
);
(This doesn't have the logic for the ordering because that doesn't seem to be the central focus of the question.)
You can then call it as:
select *
from dbo.XXX(<building number>);
EDIT:
The following may speed it up further, because you are only selecting a small(ish) subset of the employees:
select *
from (select t1.*, t2.fName, t2.lName, t2.sequence,
             row_number() over (partition by t1.ssn order by t2.sequence desc) as seqnum
      from table1 t1 join
           table2 t2
           on t1.ssn = t2.ssn
      where t1.buildingNumber = @BuildingNumber
     ) t
where seqnum = 1;
And, finally, I suspect that the following might be the fastest:
select t1.*, t2.*, row_number() over (partition by t1.ssn order by t2.sequence desc) as seqnum
from table1 t1 join
     table2 t2
     on t1.ssn = t2.ssn
where t1.buildingNumber = @BuildingNumber and
      t2.sequence = (select max(sequence) from table2 t2a where t2a.ssn = t1.ssn)
In all these cases, an index on table2(ssn, sequence) should help performance.
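The row_number() approach can be exercised on a toy data set with the sketch below. It is a simplified stand-in under stated assumptions: it runs on Python's sqlite3 module (SQLite 3.25+) rather than SQL Server, the rows and the index name ix_table2_ssn_sequence are invented, and it says nothing about how fast the real 708,000-row table would be.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE table1 (ssn TEXT, buildingNumber TEXT, trainingHours REAL);
    CREATE TABLE table2 (ssn TEXT, fName TEXT, lName TEXT, sequence INTEGER);
    -- The index suggested above, keyed on ssn first and sequence second.
    CREATE INDEX ix_table2_ssn_sequence ON table2 (ssn, sequence);
    """
)
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [
        ("111111111", "B000001", 4.0),
        ("222222222", "B000001", 2.5),
        ("333333333", "B000002", 1.0),
    ],
)
conn.executemany(
    "INSERT INTO table2 VALUES (?, ?, ?, ?)",
    [
        ("111111111", "Ann", "Smith", 7),
        ("111111111", "Ann", "Jones", 8),  # most recent name for this employee
        ("222222222", "Bob", "Lee", 6),    # has no record for sequence 8
        ("333333333", "Cy", "Young", 8),
    ],
)

latest = conn.execute(
    """
    SELECT t1.ssn, t2.fName, t2.lName
    FROM table1 t1
    JOIN (SELECT ssn, fName, lName,
                 ROW_NUMBER() OVER (PARTITION BY ssn ORDER BY sequence DESC) AS seqnum
          FROM table2) t2
      ON t1.ssn = t2.ssn AND t2.seqnum = 1
    WHERE t1.buildingNumber = ?
    ORDER BY t1.ssn
    """,
    ("B000001",),
).fetchall()

for row in latest:
    print(row)
```

Because the numbering is per ssn, an employee whose newest record has a lower sequence than everyone else's is still matched to their own latest row.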

Try using temp tables instead of table variables. I'm not sure what kind of system you are working on, but I have had pretty good luck with this. Both kinds of table are backed by tempdb, but temp tables carry statistics, so the optimizer tends to make better row estimates for them once they hold more than a handful of rows. Depending on other system usage, this might do the trick.
Simply define the temp table using #Tablename instead of @Tablename. Put the name-sorting subquery into a temp table before everything else fires off, and join to it.
Just make sure to drop the table at the end. A temp table created inside a stored procedure is dropped automatically when the procedure ends, but it is a good idea to drop it explicitly to be on the safe side.

How to choose the first record based on time

I have two tables: a call table and an after-call-work table. These two tables have a one-to-one relationship, even though the after-call-work records are stored in the call table. You can link a call to its after-call-work record by joining on three values in the call table. The call table also holds the start and end time of each call.
The data in the after-call-work table is a total mess: there are occasions where one call has many after-call-work records. My client wants me to pick out the first record based on the start time of the call and take only that one row of data.
It's been suggested that I use a RANKING function, but I'm unfamiliar with this. Has anybody got any ideas?
If anything needs explaining further let me know.
Thank you

Self extracting ranking and dense ranking example ASSUMING you are using SQL Server 2008 or newer.
declare @Person table ( personID int identity, person varchar(8));
insert into @Person values ('Brett'),('Sean'),('Chad'),('Michael'),('Ray'),('Erik'),('Queyn');
declare @Orders table ( OrderID int identity, PersonID int, Desciption varchar(32), Amount int);
insert into @Orders values (1, 'Shirt', 20),(1, 'Shoes', 50),(2, 'Shirt', 22),(2, 'Shoes', 20),(3, 'Shirt', 20),(3, 'Shoes', 50),(3, 'Hat', 20),(4, 'Shirt', 20),(5, 'Shirt', 20),(5, 'Pants', 30),
(6, 'Shirt', 20),(6, 'RunningShoes', 70),(7, 'Shirt', 22),(7, 'Shoes', 40),(7, 'Coat', 80);
with a as
(
  Select
    person
  , o.Desciption
  , o.Amount
    -- rank leaves gaps after ties (1, 1, 3); dense_rank does not (1, 1, 2)
  , rank() over(partition by p.personId order by Amount) as Ranking
  , Dense_rank() over(partition by p.personId order by Amount) as DenseRanking
  from @Person p
  join @Orders o on p.personID = o.PersonID
)
select *
from a
where Ranking <= 2 -- determine top 2, 3, etc.... whatever you want.
order by person, amount
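Applied to the after-call-work question itself, the same idea might look like the sketch below. Everything in it is an assumption made for illustration: the table after_call_work, its columns call_id, start_time and note, and the three sample rows are invented, the three-column join from the question is collapsed into a single call_id, and it runs on Python's sqlite3 module (SQLite 3.25+) rather than SQL Server. It uses ROW_NUMBER() rather than RANK() so that exactly one row survives per call even if two records share a start time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical after-call-work table; call_id stands in for the three join columns.
conn.execute("CREATE TABLE after_call_work (call_id INTEGER, start_time TEXT, note TEXT)")
conn.executemany(
    "INSERT INTO after_call_work VALUES (?, ?, ?)",
    [
        (1, "2014-01-06 09:05", "late duplicate"),
        (1, "2014-01-06 09:01", "first"),
        (2, "2014-01-06 10:30", "only"),
    ],
)

first_rows = conn.execute(
    """
    SELECT call_id, start_time, note
    FROM (
        SELECT *,
               -- Number each call's records from the earliest start time onward.
               ROW_NUMBER() OVER (PARTITION BY call_id ORDER BY start_time) AS rn
        FROM after_call_work
    )
    WHERE rn = 1
    ORDER BY call_id
    """
).fetchall()

for row in first_rows:
    print(row)
```

If ties on start time should all be kept, swapping ROW_NUMBER() for RANK() would return every record that shares the earliest time.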
