SQL Server - Delete Duplicate Rows - how does Partition By affect this query? - sql-server

I've been using the following inherited query where I'm trying to delete duplicate rows and I'm getting some unexpected results when first running it as a SELECT - I believe it has something to do with my lack of understanding of the Partition part of the statement:
WITH CTE AS(
SELECT [Id],
[Url],
[Identifier],
[Name],
[Entity],
[DOB],
RN = ROW_NUMBER()OVER(PARTITION BY Name ORDER BY Name)
FROM Data.Statistics
where Id = 2170
)
DELETE FROM CTE WHERE RN > 1
Can someone help me understand exactly what I'm doing with the Partition BY Name part of this? This doesn't limit the query in any way to only looking for duplicates in the Name field, correct? I need to ensure that it's looking for records where all 5 of the fields inside the CTE definition are the same for a record to be considered a duplicate.

ROW_NUMBER() OVER (PARTITION BY Name ORDER BY Name) doesn't make a lot of sense. You wouldn't ORDER BY the same thing you used in PARTITION BY since it will be the same value for everything in the partition, making the ORDER BY part useless.
Basically the CTE part of this query is saying to split the matching rows (those with [Id] = 2170) temporarily into groups for each distinct name, and within each group of rows with the same name, order those by name (which are obviously all the same value) and then return the row number within that sequence group as RN. Unique names will all have a row number of 1, because there is only one row with that name. Duplicate names will have row numbers 1, 2, 3, and so on. The order of those rows is undefined in this case because of the silly ORDER BY clause, but if you changed the ORDER BY to something meaningful, the row numbers would follow that sequence.
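To see how the PARTITION BY column list changes which rows count as duplicates, here is a minimal sketch. It runs against an in-memory SQLite database as a stand-in for SQL Server; the table name `stats`, the helper `count_flagged`, and the sample rows are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stats (Id INTEGER, Url TEXT, Identifier TEXT, "
    "Name TEXT, Entity TEXT, DOB TEXT)"
)
conn.executemany("INSERT INTO stats VALUES (?,?,?,?,?,?)", [
    (2170, "u1", "a", "Ann", "e1", "1990-01-01"),
    (2170, "u1", "a", "Ann", "e1", "1990-01-01"),  # exact duplicate of the row above
    (2170, "u2", "b", "Ann", "e1", "1990-01-01"),  # same Name, different Url/Identifier
    (2170, "u3", "c", "Bob", "e2", "1985-05-05"),
])

def count_flagged(partition_cols):
    """Count rows that would get RN > 1 for the given PARTITION BY list."""
    # SQLite allows ROW_NUMBER() without ORDER BY; SQL Server requires one,
    # e.g. ORDER BY (SELECT NULL) when the order does not matter.
    sql = f"""
        WITH cte AS (
            SELECT ROW_NUMBER() OVER (PARTITION BY {partition_cols}) AS rn
            FROM stats
            WHERE Id = 2170
        )
        SELECT COUNT(*) FROM cte WHERE rn > 1
    """
    return conn.execute(sql).fetchone()[0]

by_name = count_flagged("Name")
by_all = count_flagged("Url, Identifier, Name, Entity, DOB")
print(by_name, by_all)
```

With this sample data, partitioning by Name alone also flags the row that merely shares a name, while partitioning by all five columns flags only the exact duplicate.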

Related

How to identify which column(s) have different value in SQL Server

I have a table with more than 100 columns. Normally contract_id should be unique in this table, but sometimes there are duplicate values. I use this SQL statement to retrieve data from the table:
select distinct contract_id, col1, col2,...colM
from the_table;
but I still get duplicate contract_id values. I know that means some column(s) must hold different values across those rows, which is why DISTINCT does not collapse them. Is there a way to find out which columns differ? There are lots of fields and only a few of them have different values, so comparing each column one by one manually is difficult.
Try something along these lines:
SELECT contract_id
FROM the_table
GROUP BY contract_id
HAVING COUNT(contract_id)>1;
or
WITH NumberedRows AS
(
SELECT ROW_NUMBER() OVER(PARTITION BY contract_id ORDER BY(SELECT NULL)) AS RowNumber
,*
FROM the_table
)
SELECT *
FROM NumberedRows
WHERE RowNumber>1;
The first will show you all the contract_id values that occur at least twice; the second will show you all the rows you might want to manipulate (delete/change).
Attention: I used SELECT NULL in the ORDER BY of the OVER() clause, which leaves the numbering order arbitrary. It is very important to use a fitting ORDER BY clause here, because it decides which row gets the number 1 and which rows get higher numbers and therefore show up in the result due to >1.
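As a rough illustration of why the ORDER BY inside OVER() matters, the sketch below numbers rows newest-first so that the older row is the one reported as RowNumber > 1. It uses an in-memory SQLite database in place of SQL Server, and the `updated_at` and `col1` columns and their values are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE the_table (contract_id INTEGER, updated_at TEXT, col1 TEXT)"
)
conn.executemany("INSERT INTO the_table VALUES (?,?,?)", [
    (1, "2020-01-01", "old"),
    (1, "2021-06-01", "new"),
    (2, "2020-03-01", "only"),
])

# Newest row per contract_id gets RowNumber 1; everything else is "extra".
extra_rows = conn.execute("""
    WITH NumberedRows AS (
        SELECT ROW_NUMBER() OVER (
                   PARTITION BY contract_id
                   ORDER BY updated_at DESC
               ) AS RowNumber,
               *
        FROM the_table
    )
    SELECT contract_id, col1
    FROM NumberedRows
    WHERE RowNumber > 1
""").fetchall()
print(extra_rows)
```

Switching the ORDER BY to ascending would report the newer row instead, which is the kind of decision the ORDER BY (SELECT NULL) placeholder leaves to chance.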

How does DISTINCT work in SQL Server 2008 R2? Are there other options? [duplicate]

I need to retrieve all rows from a table where 2 columns combined are all different. So I want all the sales that do not have any other sales that happened on the same day for the same price. The sales that are unique based on day and price will get updated to an active status.
So I'm thinking:
UPDATE sales
SET status = 'ACTIVE'
WHERE id IN (SELECT DISTINCT (saleprice, saledate), id, count(id)
FROM sales
HAVING count = 1)
But my brain hurts going any farther than that.
SELECT DISTINCT a,b,c FROM t
is roughly equivalent to:
SELECT a,b,c FROM t GROUP BY a,b,c
It's a good idea to get used to the GROUP BY syntax, as it's more powerful.
For your query, I'd do it like this:
UPDATE sales
SET status='ACTIVE'
WHERE id IN
(
SELECT id
FROM sales S
INNER JOIN
(
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(*) = 1
) T
ON S.saleprice=T.saleprice AND s.saledate=T.saledate
)
If you put together the answers so far, clean up and improve, you would arrive at this superior query:
UPDATE sales
SET status = 'ACTIVE'
WHERE (saleprice, saledate) IN (
SELECT saleprice, saledate
FROM sales
GROUP BY saleprice, saledate
HAVING count(*) = 1
);
Which is much faster than either of them. Nukes the performance of the currently accepted answer by factor 10 - 15 (in my tests on PostgreSQL 8.4 and 9.1).
But this is still far from optimal. Use a NOT EXISTS (anti-)semi-join for even better performance. EXISTS is standard SQL, has been around forever (at least since PostgreSQL 7.2, long before this question was asked) and fits the presented requirements perfectly:
UPDATE sales s
SET status = 'ACTIVE'
WHERE NOT EXISTS (
SELECT FROM sales s1 -- SELECT list can be empty for EXISTS
WHERE s.saleprice = s1.saleprice
AND s.saledate = s1.saledate
AND s.id <> s1.id -- except for row itself
)
AND s.status IS DISTINCT FROM 'ACTIVE'; -- avoid empty updates. see below
Unique key to identify row
If you don't have a primary or unique key for the table (id in the example), you can substitute with the system column ctid for the purpose of this query (but not for some other purposes):
AND s1.ctid <> s.ctid
Every table should have a primary key. Add one if you didn't have one, yet. I suggest a serial or an IDENTITY column in Postgres 10+.
Related:
In-order sequence generation
Auto increment table column
How is this faster?
The subquery in the EXISTS anti-semi-join can stop evaluating as soon as the first dupe is found (no point in looking further). For a base table with few duplicates this is only mildly more efficient. With lots of duplicates this becomes way more efficient.
Exclude empty updates
For rows that already have status = 'ACTIVE', this update would not change anything but would still insert a new row version at full cost (minor exceptions apply). Normally, you do not want this. Add another WHERE condition as demonstrated above to avoid this and make it even faster.
If status is defined NOT NULL, you can simplify to:
AND status <> 'ACTIVE';
The data type of the column must support the <> operator. Some types like json don't. See:
How to query a json column for empty objects?
Subtle difference in NULL handling
This query (unlike the currently accepted answer by Joel) does not treat NULL values as equal. The following two rows for (saleprice, saledate) would qualify as "distinct" (though looking identical to the human eye):
(123, NULL)
(123, NULL)
Such rows also pass a unique index and almost any other equality check, since NULL values do not compare equal according to the SQL standard. See:
Create unique constraint with null columns
OTOH, GROUP BY, DISTINCT or DISTINCT ON () treat NULL values as equal. Use an appropriate query style depending on what you want to achieve. You can still use this faster query with IS NOT DISTINCT FROM instead of = for any or all comparisons to make NULL compare equal. More:
How to delete duplicate rows without unique identifier
If all columns being compared are defined NOT NULL, there is no room for disagreement.
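The difference in NULL handling can be checked with a small sketch. It uses an in-memory SQLite database rather than PostgreSQL, so the null-safe comparison is written with SQLite's IS operator (assumed here as the counterpart of IS NOT DISTINCT FROM); the sample rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER, saleprice INTEGER, saledate TEXT)")
conn.executemany("INSERT INTO sales VALUES (?,?,?)", [
    (1, 123, None),          # (123, NULL)
    (2, 123, None),          # (123, NULL) again
    (3, 50, "2020-01-01"),
    (4, 50, "2020-01-01"),   # true duplicate of id 3
    (5, 70, "2020-01-02"),   # unique
])

def ids(sql):
    return [row[0] for row in conn.execute(sql)]

# Plain "=": NULL never compares equal, so ids 1 and 2 both look unique.
not_exists_ids = ids("""
    SELECT s.id FROM sales s
    WHERE NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE s.saleprice = s1.saleprice
          AND s.saledate = s1.saledate
          AND s.id <> s1.id
    )
    ORDER BY s.id
""")

# GROUP BY treats NULLs as equal, so the two (123, NULL) rows form one group.
group_by_ids = ids("""
    SELECT MIN(id) FROM sales
    GROUP BY saleprice, saledate
    HAVING COUNT(*) = 1
    ORDER BY 1
""")

# Null-safe comparison makes NOT EXISTS agree with GROUP BY.
null_safe_ids = ids("""
    SELECT s.id FROM sales s
    WHERE NOT EXISTS (
        SELECT 1 FROM sales s1
        WHERE s.saleprice IS s1.saleprice
          AND s.saledate IS s1.saledate
          AND s.id <> s1.id
    )
    ORDER BY s.id
""")
print(not_exists_ids, group_by_ids, null_safe_ids)
```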
The problem with your query is that when using a GROUP BY clause (which you essentially do by using distinct) you can only use columns that you group by or aggregate functions. You cannot use the column id because there are potentially different values. In your case there is always only one value because of the HAVING clause, but most RDBMS are not smart enough to recognize that.
This should work however (and doesn't need a join):
UPDATE sales
SET status='ACTIVE'
WHERE id IN (
SELECT MIN(id) FROM sales
GROUP BY saleprice, saledate
HAVING COUNT(id) = 1
)
You could also use MAX or AVG instead of MIN, it is only important to use a function that returns the value of the column if there is only one matching row.
If your DBMS doesn't support DISTINCT over multiple columns like this:
select distinct(col1, col2) from table
a multi-column select can in general be executed safely as follows:
select distinct * from (select col1, col2 from table ) as x
This works on most DBMSs and is expected to be faster than the GROUP BY solution, since it avoids the grouping step.
I want to select the distinct values from one column 'GrondOfLucht' but they should be sorted in the order as given in the column 'sortering'. I cannot get the distinct values of just one column using
Select distinct GrondOfLucht,sortering
from CorWijzeVanAanleg
order by sortering
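It will also give the column 'sortering', and because the combination of 'GrondOfLucht' and 'sortering' differs from row to row, the result will be ALL rows.
Use GROUP BY to select the distinct values of 'GrondOfLucht', ordered by the smallest 'sortering' value in each group. Group by GrondOfLucht only; adding sortering to the GROUP BY would bring back one row per (GrondOfLucht, sortering) pair:
SELECT GrondOfLucht
FROM dbo.CorWijzeVanAanleg
GROUP BY GrondOfLucht
ORDER BY MIN(sortering)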
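A small check of how the GROUP BY column list changes the result, run on an in-memory SQLite database as a stand-in for SQL Server. The four sample rows are invented; only the table and column names come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CorWijzeVanAanleg (GrondOfLucht TEXT, sortering INTEGER)")
conn.executemany("INSERT INTO CorWijzeVanAanleg VALUES (?,?)", [
    ("Lucht", 1),
    ("Grond", 2),
    ("Lucht", 3),
    ("Grond", 4),
])

# Grouping by GrondOfLucht alone: one row per distinct value,
# ordered by the smallest sortering in each group.
one_col = [r[0] for r in conn.execute("""
    SELECT GrondOfLucht
    FROM CorWijzeVanAanleg
    GROUP BY GrondOfLucht
    ORDER BY MIN(sortering)
""")]

# Grouping by both columns: one row per (GrondOfLucht, sortering) pair.
two_cols = [r[0] for r in conn.execute("""
    SELECT GrondOfLucht
    FROM CorWijzeVanAanleg
    GROUP BY GrondOfLucht, sortering
    ORDER BY MIN(sortering)
""")]
print(one_col, two_cols)
```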

TSQL sub query invalid column and only one expression

It has been a while since writing T-SQL for me and I know this can be done but my memory is good enough to get me close (I think I'm close) but poor enough to not get it right.
To start I have this query:
SELECT DISTINCT(COMM_TYPE),
COUNT(COMM_TYPE) AS 'Total'
FROM
[MYDB].[dbo].[COMM]
GROUP BY
COMM_TYPE
Which returns:
COMM_TYPE Total
--------------------------
TypeA 1
TypeB 44474
TypeC 3
TypeD 3854
TypeE 12327
TypeF 362912
TypeG 484344
TypeH 386
TypeI 106
This is an accurate result.
So now I want the above PLUS a sample of each one. Something with columns like:
ID COMM_TYPE TOTAL DATA COMMENTS PRIMARY COMM_NUMBER
I believe this can be done with a sub query but I am not writing it correctly as I get two errors.
Msg 207, Level 16, State 1, Line 10
Invalid column name 'CT'.
Msg 116, Level 16, State 1, Line 7
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.
The second error I understand. My sub query has two columns being returned but positioned in the select as I have it wants only one.
The first error I'm more lost on. I thought I could reference a subquery column in the outer query?
Here is the query:
SELECT TOP(1)
*,
(SELECT
DISTINCT(COMM_TYPE),
COUNT(COMM_TYPE)
FROM
[MYDB].[dbo].[COMM]
GROUP BY
COMM_TYPE) AS CT
FROM
[MYDB].[dbo].[COMM]
WHERE
CT = COMM_TYPE
This is mostly for myself but if it helps anyone here ya go:
We start with a CTE to wrap the entire operation. It brings many benefits, but the two applicable here are:
1. Enable grouping by a column that is derived from a scalar subselect.
2. Reference the resulting table multiple times in the same statement.
WITH T
AS (
CTE SELECT Statement
)
FINAL SELECT Statement
Next, our CTE select basically returns three columns for us:
1. Total, which in my query was a COUNT on a column
2. RN, which is the row number
3. Wildcard *, which gets all the columns from the table
Now from this point we get into the partitioning.
So it seems that we need to choose how we are going to break this table up. Since I had defined DISTINCT(COMM_TYPE), without realizing it, there was my partition. In that first column definition we also do a COUNT(*). So what must be happening is that the SQL engine first breaks the table into pieces (partitions) and then counts the records in each piece?
SELECT Count(*)
OVER (PARTITION BY COMM_TYPE) AS Total,
Next we do a ROW_NUMBER() operation OVER (that is, operating against) my partition of COMM_TYPE again. We then order it and alias the column as RN. I was not sure why this was needed until I got to the end; then it made sense.
Row_number()
OVER (PARTITION BY COMM_TYPE
ORDER BY COMM_TYPE) AS RN,
Finally we just pull a wildcard, which is every column in the table.
So inside the SQL engine this must be quite a big hunk of data, with these repeated grouping operations "OVER" everything.
However, all we see is a single row per COMM_TYPE, because the last select keeps only the rows where RN = 1. That is what the RN column I didn't understand earlier is for.
Do I understand it properly?
This should do what you need.
WITH T
AS (SELECT Count(*)
OVER (PARTITION BY COMM_TYPE) AS Total,
Row_number()
OVER (PARTITION BY COMM_TYPE
ORDER BY COMM_TYPE) AS RN,
*
FROM MyDb.dbo.Comm)
SELECT *
FROM T
WHERE RN = 1
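To watch the two window functions side by side, here is a minimal sketch on an in-memory SQLite database (standing in for SQL Server). The `comm` table layout and its rows are made up, and the ROW_NUMBER is ordered by `id` rather than COMM_TYPE so that the sample row picked for each type is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comm (id INTEGER, comm_type TEXT, data TEXT)")
conn.executemany("INSERT INTO comm VALUES (?,?,?)", [
    (1, "TypeA", "x"),
    (2, "TypeB", "y"),
    (3, "TypeB", "z"),
    (4, "TypeB", "w"),
])

samples = conn.execute("""
    WITH T AS (
        SELECT COUNT(*)     OVER (PARTITION BY comm_type)             AS total,
               ROW_NUMBER() OVER (PARTITION BY comm_type ORDER BY id) AS rn,
               *
        FROM comm
    )
    SELECT comm_type, total, id, data
    FROM T
    WHERE rn = 1          -- keep one sample row per comm_type
    ORDER BY comm_type
""").fetchall()
print(samples)
```

Every row in a partition carries the same total, and RN = 1 simply picks one of them as the sample.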

How ROW_NUMBER used with insertions?

I have multiple UNIONed statements in MS SQL Server, and it is very hard to find a unique column in the result.
I need a unique value for each row, so I've used the ROW_NUMBER() function.
This result set is being copied to another place (actually a SOLR index).
The next time I run the same query, I need to pick up only the newly added rows.
So I need to confirm that the newly added rows will be numbered after the last ROW_NUMBER value from the previous run.
In other words, does the ROW_NUMBER function order the results by insertion order, assuming I don't add any ORDER BY clause?
If not (as I suspect), are there any alternatives?
Thanks.
Without seeing the sql I can only give the general answer that MS Sql does not guarantee the order of select statements without an order clause so that would mean that the row_number may not be the insertion order.
I guess you can do something like this..
;WITH
cte
AS
(
SELECT * , rn = ROW_NUMBER() OVER (ORDER BY SomeColumn)
FROM
(
/* Your Union Queries here*/
)q
)
INSERT INTO Destination_Table
SELECT CTE.* -- only the CTE's columns; SELECT * would also return the joined Destination_Table columns
FROM
CTE LEFT JOIN Destination_Table
ON CTE.Refrencing_Column = Destination_Table.Refrencing_Column
WHERE Destination_Table.Refrencing_Column IS NULL
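The LEFT JOIN ... IS NULL pattern can be sketched as follows, using an in-memory SQLite database in place of SQL Server. The tables `source_rows` and `destination`, the key column `ref`, and the helper `copy_new_rows` are hypothetical, and the sketch assumes the source has some stable key to join on, which ROW_NUMBER alone does not provide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE source_rows (ref TEXT, val INTEGER);
    CREATE TABLE destination (ref TEXT, val INTEGER);
    INSERT INTO source_rows VALUES ('a', 1), ('b', 2), ('c', 3);
    INSERT INTO destination VALUES ('a', 1);
""")

def row_count():
    return conn.execute("SELECT COUNT(*) FROM destination").fetchone()[0]

def copy_new_rows():
    """Insert only source rows whose key is not yet in destination."""
    before = row_count()
    conn.execute("""
        INSERT INTO destination (ref, val)
        SELECT s.ref, s.val
        FROM source_rows s
        LEFT JOIN destination d ON s.ref = d.ref
        WHERE d.ref IS NULL
    """)
    return row_count() - before

first_run = copy_new_rows()    # copies 'b' and 'c'
second_run = copy_new_rows()   # nothing new to copy
print(first_run, second_run)
```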
I would suggest you consider 'timestamping' the row with the time it was inserted. Or adding an identity column to the table.
But what it sounds like you want to do is get current max id and then add the row_number to it.
Select col1, col2, mid + row_number() over(order by smt) id
From (
Select col1, col2, (select max(id) from tbl) mid
From query
) t

Row_Number Over Where RowNumber between

I'm trying to select certain rows from my table using ROW_NUMBER() OVER. However, the query fails with the error message "Invalid column name 'ROWNUMBERS'". Can anyone correct me?
SELECT ROW_NUMBER() OVER (ORDER BY Price ASC) AS ROWNUMBERS, *
FROM Product
WHERE ROWNUMBERS BETWEEN #fromCount AND #toCount
Attempting to reference the aliased column in the WHERE clause does not work because of the logical query processing taking place. The WHERE is evaluated before the SELECT clause. Therefore, the column ROWNUMBERS does not exist when WHERE is evaluated.
The correct way to reference the column in this example would be:
SELECT a.*
FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Price ASC) AS ROWNUMBERS, *
FROM Product) a
WHERE a.ROWNUMBERS BETWEEN #fromCount AND #toCount
For your reference, the order for operations is:
FROM
WHERE
GROUP BY
HAVING
SELECT
ORDER BY
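The derived-table form can be tried out with the sketch below, which runs on an in-memory SQLite database as a stand-in for SQL Server. The product rows and the bounds 2 and 4 are invented, and the bounds are passed as bound parameters instead of the placeholders used in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Product (Name TEXT, Price INTEGER)")
conn.executemany("INSERT INTO Product VALUES (?,?)", [
    ("a", 10), ("b", 30), ("c", 20), ("d", 50), ("e", 40),
])

# The inner query computes the row number in its SELECT; the outer
# WHERE can then filter on it because it is an ordinary column there.
rows = conn.execute("""
    SELECT a.ROWNUMBERS, a.Name
    FROM (
        SELECT ROW_NUMBER() OVER (ORDER BY Price ASC) AS ROWNUMBERS, *
        FROM Product
    ) a
    WHERE a.ROWNUMBERS BETWEEN ? AND ?
    ORDER BY a.ROWNUMBERS
""", (2, 4)).fetchall()
print(rows)
```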
There is another answer here that solves the specific error reported. However, I also want to address the wider problem. It looks a lot like what you are doing here is paging your results for display. If that is the case, and if you can use Sql Server 2012, there is a better way now. Take a look at OFFSET/FETCH:
SELECT [First Name] + ' ' + [Last Name]
FROM Employees
ORDER BY [First Name]
OFFSET 10 ROWS FETCH NEXT 5 ROWS ONLY;
That would show the third page of a query where the page size is 5.
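The offset arithmetic behind "third page, page size 5" can be sketched like this. SQLite has no OFFSET ... FETCH syntax, so the sketch uses LIMIT/OFFSET as a rough equivalent; the helper `fetch_page`, the single `name` column, and the fifteen sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employees (name TEXT)")
conn.executemany(
    "INSERT INTO Employees VALUES (?)",
    [(f"E{i:02d}",) for i in range(1, 16)],  # E01 .. E15
)

def fetch_page(page_number, page_size):
    """Return one page of names; pages are numbered from 1."""
    offset = (page_number - 1) * page_size  # page 3, size 5 -> skip 10 rows
    rows = conn.execute(
        "SELECT name FROM Employees ORDER BY name LIMIT ? OFFSET ?",
        (page_size, offset),
    )
    return [r[0] for r in rows]

third_page = fetch_page(3, 5)
print(third_page)
```

As with OFFSET/FETCH in SQL Server, the ORDER BY is what makes the pages stable from one call to the next.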
