row_number() Group by? - sql-server

I have a data set that contains the columns Date, Cat, and QTY. What I want to do is add a unique column that only counts unique Cat values when it does the row count. This is what I want my result set to look like:
By using the SQL query below, I'm able to get the Row column using the row_number() function.
However, I can't get the unique column that I have depicted above. When I add GROUP BY to the OVER clause, it does not work. Does anybody have any ideas as to how I could get this unique count column to work?
SELECT
Date,
ROW_NUMBER() OVER (PARTITION BY Date ORDER By Date, Cat) as ROW,
Cat,
Qty
FROM SOURCE

Here is a solution.
You need not worry about the ordering of Cat. With the following SQL, the row number restarts at 1 for each Date and Cat combination.
SELECT
Date,
ROW_NUMBER() OVER (PARTITION BY Date, Cat ORDER By Date, Cat) as ROW,
Cat,
Qty
FROM SOURCE

DENSE_RANK() OVER (PARTITION BY date ORDER BY cat)
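To see why DENSE_RANK produces the "unique" counter while ROW_NUMBER does not, here is a minimal sketch in Python, with SQLite (3.25 or later, for window functions) standing in for SQL Server. The table contents are made-up sample rows, and the column names row_num and uniq are chosen only for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source (date TEXT, cat TEXT, qty INTEGER)")
# Hypothetical sample data: two dates, repeated categories within each date.
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [
        ("2019-01-01", "A", 1),
        ("2019-01-01", "A", 2),
        ("2019-01-01", "B", 3),
        ("2019-01-02", "A", 4),
        ("2019-01-02", "C", 5),
        ("2019-01-02", "C", 6),
    ],
)

rows = conn.execute(
    """
    SELECT date,
           -- increments on every row within a date
           ROW_NUMBER() OVER (PARTITION BY date ORDER BY cat, qty) AS row_num,
           -- increments only when cat changes within a date
           DENSE_RANK() OVER (PARTITION BY date ORDER BY cat)      AS uniq,
           cat,
           qty
    FROM source
    ORDER BY date, cat, qty
    """
).fetchall()

for row in rows:
    print(row)
```

Rows that share the same Cat within a date are ties for DENSE_RANK, so they receive the same value, and the next distinct Cat gets the next integer without gaps.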

Related

The field in ORDER BY affects the result of window functions

I have a simple T-SQL query, which calculates the row number, row count and total volume across all records:
DECLARE @t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY prev_date),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY prev_date),
*
FROM @t;
I get the following result:
However, this is NOT what I expected: in both rows, rows_count must be 2 and vol_total must be 300:
The workaround would be to add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING. However, I thought that there must be another way.
At the end of the day I found out that the ORDER BY clause must use the id field rather than the prev_date field:
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY id),
rows_count = COUNT(*) OVER (PARTITION BY id ORDER BY id),
vol_total = SUM(volume) OVER (PARTITION BY id ORDER BY id)
After this change the query's output is as expected.
But I don't understand why this is so. How come the ordering affects partitioning?
For aggregate functions, an ORDER BY in the window definition is generally not required unless you want the aggregation to accumulate row by row in a given order, as in a running total. Simply removing the ORDER BY will fix the problem.
The reason is the default frame: when ORDER BY is present and no frame is specified, the window runs from the start of the partition to the current row (RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW). Put another way, the window expands as you move from row to row. It starts at the first row and aggregates everything from the start of the partition up to the current row, which for the first row is just that row. Because RANGE treats rows with equal ORDER BY values as peers, ordering by id (the partition column) makes every row a peer of every other row, which is why that change returned the full totals.
If you remove the ORDER BY, the aggregation is computed over all the rows of the partition and no ordering is applied.
You can change the ORDER BY in the window definition to see its effect.
Of course, ranking functions need the ORDER BY; this point applies only to the aggregations.
DECLARE @t TABLE
(
id varchar(100),
volume float,
prev_date date
);
INSERT INTO @t VALUES
('0318610084', 100, '2019-05-16'),
('0318610084', 200, '2016-06-04');
SELECT
row_num = ROW_NUMBER() OVER (PARTITION BY id ORDER BY prev_date),
rows_count = COUNT(*) OVER (PARTITION BY id),
vol_total = SUM(volume) OVER (PARTITION BY id),
*
FROM @t;
Support for ORDER BY in the window of aggregate functions was added in SQL Server 2012; it was not part of the first release of the OVER clause in SQL Server 2005.
For a detailed explanation of ordering in window functions on aggregates, this is a great help:
Producing a moving average and cumulative total - SQL Server documentation
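The effect of ORDER BY on the window frame can be reproduced with the two rows from the question. This is a sketch in Python using SQLite (3.25 or later) as a stand-in for SQL Server, on the assumption that both follow the standard default frame when ORDER BY is present. The column aliases are illustrative names, not part of the original query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id TEXT, volume REAL, prev_date TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [("0318610084", 100, "2019-05-16"), ("0318610084", 200, "2016-06-04")],
)

rows = conn.execute(
    """
    SELECT prev_date,
           -- ORDER BY prev_date: frame ends at the current row (running count)
           COUNT(*) OVER (PARTITION BY id ORDER BY prev_date) AS running_count,
           -- ORDER BY id: all rows tie, so all are peers of the current row
           COUNT(*) OVER (PARTITION BY id ORDER BY id)        AS peers_count,
           -- no ORDER BY: the frame is the whole partition
           COUNT(*) OVER (PARTITION BY id)                    AS total_count,
           -- explicit frame: whole partition even with ORDER BY
           SUM(volume) OVER (PARTITION BY id ORDER BY prev_date
                             ROWS BETWEEN UNBOUNDED PRECEDING
                                      AND UNBOUNDED FOLLOWING) AS vol_total
    FROM t
    ORDER BY prev_date
    """
).fetchall()

for row in rows:
    print(row)
```

Only the first aggregate grows row by row. The other three see the whole partition, either because there is no ordering, because every row is a peer, or because the frame is spelled out.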

How to identify which column(s) have different value in SQL Server

I have a table with more than 100 columns. Normally contract_id should be unique in this table, but sometimes there are duplicate values. I use this SQL statement to retrieve data from this table:
select distinct contract_id, col1, col2,...colM
from the_table;
but I found duplicate contract_id values in the result. I know that some column(s) must hold different values for those rows, which is why I see duplicate contract_id values even though I use distinct. Is there a way to find out which columns have different values? There are lots of fields and only a few columns differ, so it is difficult to compare each column one by one manually.
Try something along these lines:
SELECT contract_id
FROM the_table
GROUP BY contract_id
HAVING COUNT(contract_id)>1;
or
WITH NumberedRows AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY contract_id ORDER BY(SELECT NULL)) AS RowNumber
,*
FROM the_table
)
SELECT *
FROM NumberedRows
WHERE RowNumber>1;
The first will show you all the contract_id values that occur at least twice. The second will show you all the rows you might want to manipulate (delete/change).
Attention: I used SELECT NULL in the ORDER BY of the OVER() clause, which leaves the numbering order arbitrary. It is very important to use a fitting ORDER BY clause here, because it decides which row gets the number 1 and which rows get higher numbers and therefore show up in the result due to RowNumber > 1.
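Both queries can be tried on a small table. The sketch below uses Python with SQLite (3.25 or later) in place of SQL Server, a made-up three-column version of the_table, and an explicit ORDER BY col1, col2 instead of SELECT NULL so that the row numbered 1 is deterministic. The data and the alias row_number_in_group are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE the_table (contract_id INTEGER, col1 TEXT, col2 TEXT)")
# Hypothetical data: contract_id 2 appears twice and differs only in col2.
conn.executemany(
    "INSERT INTO the_table VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "b", "y"), (2, "b", "z"), (3, "c", "w")],
)

# Query 1: which contract_id values occur more than once?
duplicated_ids = conn.execute(
    """
    SELECT contract_id
    FROM the_table
    GROUP BY contract_id
    HAVING COUNT(*) > 1
    """
).fetchall()

# Query 2: every row after the first within each contract_id.
extra_rows = conn.execute(
    """
    WITH numbered AS (
        SELECT ROW_NUMBER() OVER (PARTITION BY contract_id
                                  ORDER BY col1, col2) AS row_number_in_group,
               *
        FROM the_table
    )
    SELECT contract_id, col1, col2
    FROM numbered
    WHERE row_number_in_group > 1
    """
).fetchall()

print(duplicated_ids, extra_rows)
```

With a deterministic ORDER BY, the row that survives as number 1 is the same on every run, which matters if the rows numbered above 1 are going to be deleted.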

SQL Server loop for changing multiple values of different users

I have the next table:
Suppose these users are sorted in descending order by their inserted date.
Now, what I want to do is change their sorting numbers so that, for each user, the sorting number starts at 1 and goes up to the number of appearances of that user. The result should look something like:
Can someone give me some clues on how to do it in SQL Server? Thanks.
You can use the ROW_NUMBER ranking function to calculate a row's rank given a partition and order.
In this case, you want to calculate row numbers for each user, so use PARTITION BY User_ID. The desired output shows that ordering by ID is enough, so use ORDER BY ID.
SELECT
Id,
User_ID,
ROW_NUMBER() OVER (PARTITION BY User_ID ORDER BY Id) AS Sort_Number
FROM MyTable
There are other ranking functions you can use, e.g. RANK and DENSE_RANK to calculate a rank according to a score, or NTILE to assign each row to one of N roughly equal-sized groups (quartiles, for example).
You can also use the OVER clause with aggregates to create running totals or moving averages, e.g. SUM(Id) OVER (PARTITION BY User_ID ORDER BY Id) will create a running total of the Id values for each user.
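A quick way to check both the per-user numbering and the running total is the sketch below, written in Python with SQLite (3.25 or later) standing in for SQL Server. The rows in my_table are invented sample data, and running_total is an illustrative alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, user_id INTEGER)")
# Hypothetical data: user 10 appears three times, user 20 twice.
conn.executemany(
    "INSERT INTO my_table VALUES (?, ?)",
    [(1, 10), (2, 10), (3, 20), (4, 10), (5, 20)],
)

rows = conn.execute(
    """
    SELECT id,
           user_id,
           -- restarts at 1 for every user
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY id) AS sort_number,
           -- accumulates id values within each user
           SUM(id)      OVER (PARTITION BY user_id ORDER BY id) AS running_total
    FROM my_table
    ORDER BY user_id, id
    """
).fetchall()

for row in rows:
    print(row)
```

The same PARTITION BY and ORDER BY drive both columns: the ranking function numbers the rows, and the aggregate accumulates over the rows seen so far.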
Use ROW_NUMBER() with PARTITION BY User_Id:
SELECT
Id,
[User_Id],
Sort_Number = ROW_NUMBER() OVER(PARTITION BY [User_Id]
ORDER BY [User_Id],[CreatedDate] DESC)
FROM YourTable
select id
,user_id
,row_number() over(partition by user_id order by id) as sort_number
from YourTable
Use the ranking function row_number():
select *,
row_number() over (partition by User_id order by user_id, date desc) as sort_number
from YourTable t

SSIS - Filter duplicate rows

I have a table (Id, ArticleCode, StoreCode, Adress, Number) that contains duplicate entries based on only these columns [ArticleCode, StoreCode].
Currently I can filter duplicate rows using the Aggregate transformation, but the problem is that the output rows have only two columns [ArticleCode, StoreCode] and I need the other columns as well.
In the OLE DB Source component, set the data access mode to SQL command instead of a table name and use the following query as the source:
SELECT [ID]
      ,[ArticleCode]
      ,[StoreCode]
      ,[Address]
      ,[Number]
FROM (
    SELECT [ID]
          ,[ArticleCode]
          ,[StoreCode]
          ,[Address]
          ,[Number]
          ,ROW_NUMBER() OVER(PARTITION BY [ArticleCode], [StoreCode]
                             ORDER BY [ArticleCode], [StoreCode]) AS ROWNUM
    FROM [dbo].[Table_1]) AS T1
WHERE T1.ROWNUM = 1
To get rid of duplicates and select unique records by [ArticleCode, StoreCode]:
select top 1 with ties
Id ,
ArticleCode ,
StoreCode ,
Adress ,
Number
from
YourTable
order by
row_number() over(partition by ArticleCode, StoreCode order by Id)
But which of the two records should be selected when [ArticleCode, StoreCode] are equal and [Adress, Number] differ?
If Id is auto-increment, then order by Id gets the first entered record and order by Id desc gets the last.
You have to define somehow which [Adress, Number] pair among the duplicates is the correct one to select.
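The choice between the first and the last entered record comes down to the ORDER BY direction inside ROW_NUMBER. Here is a small sketch in Python with SQLite (3.25 or later) instead of SQL Server. Since TOP 1 WITH TIES is T-SQL specific, it uses the subquery form with rn = 1. The table contents, the snake_case column names and the helper one_row_per_pair are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table_1 ("
    "id INTEGER PRIMARY KEY, article_code TEXT, store_code TEXT, "
    "address TEXT, number INTEGER)"
)
# Hypothetical data: (A1, S1) and (A2, S2) each occur twice.
conn.executemany(
    "INSERT INTO table_1 VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A1", "S1", "Main St", 10),
        (2, "A1", "S1", "Oak Ave", 11),
        (3, "A2", "S1", "Pine Rd", 12),
        (4, "A2", "S2", "Elm St", 13),
        (5, "A2", "S2", "Elm St", 14),
    ],
)


def one_row_per_pair(direction):
    """Return the ids kept per (article_code, store_code) pair."""
    # Only these two fixed keywords are ever interpolated into the SQL text.
    if direction not in ("ASC", "DESC"):
        raise ValueError("direction must be 'ASC' or 'DESC'")
    query = f"""
        SELECT id
        FROM (
            SELECT id,
                   ROW_NUMBER() OVER (PARTITION BY article_code, store_code
                                      ORDER BY id {direction}) AS rn
            FROM table_1
        ) AS t1
        WHERE rn = 1
        ORDER BY id
    """
    return [row[0] for row in conn.execute(query)]


first_entered = one_row_per_pair("ASC")   # lowest id in each pair
last_entered = one_row_per_pair("DESC")   # highest id in each pair
print(first_entered, last_entered)
```

Ordering by a unique column such as an auto-increment Id makes the surviving row predictable, whereas ordering by the partition columns themselves leaves the pick arbitrary.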

How to group by column having spelling mistakes

While working with some legacy data, I want to group the data on a column ignoring spelling mistakes. I think SOUNDEX() could do the job to achieve the desired result. Here is what I tried:
SELECT soundex(AREA)
FROM MASTER
GROUP BY soundex(AREA)
ORDER BY soundex(AREA)
But (obviously) SOUNDEX returned only the 4-character codes in the result rows, like this, losing the actual strings:
A131
A200
A236
How could I include at least one occurrence from each group in the query result instead of the 4-character code?
SELECT soundex(AREA) as snd_AREA, min(AREA) as AREA_EXAMPLE_1, max(AREA) as AREA_EXAMPLE_2
from MASTER
group by soundex(AREA)
order by AREA_EXAMPLE_1
;
In MySQL you could select group_concat(distinct AREA) as list_area to get all the versions. I don't know the equivalent in SQL Server, but min and max give two examples of the areas, and you wanted to discard the differences anyway.
You could also use row_number() to get one row for each soundex(area) value:
select AREA, snd
from
(
select AREA, soundex(AREA) snd,
row_number() over(partition by soundex(AREA)
order by soundex(AREA)) rn
from master
) x
where rn = 1
See SQL Fiddle with Demo
