SSIS - Filter duplicate rows - sql-server

I have a table (Id, ArticleCode, StoreCode, Adress, Number) that contains duplicate entries based only on the columns [ArticleCode, StoreCode].
Currently I can filter out duplicate rows using the Aggregate transformation, but the output then contains only the two columns [ArticleCode, StoreCode], and I need the other columns as well.

In the OLE DB Source component, set the data access mode to SQL command instead of Table or view, and use the following query as the source:
SELECT [ID], [ArticleCode], [StoreCode], [Address], [Number]
FROM (
    SELECT [ID], [ArticleCode], [StoreCode], [Address], [Number],
           ROW_NUMBER() OVER (PARTITION BY [ArticleCode], [StoreCode]
                              ORDER BY [ArticleCode], [StoreCode]) AS ROWNUM
    FROM [dbo].[Table_1]
) AS T1
WHERE T1.ROWNUM = 1

To get rid of duplicates and select unique records by [ArticleCode, StoreCode]:
select top 1 with ties
Id ,
ArticleCode ,
StoreCode ,
Adress ,
Number
from
YourTable
order by
row_number() over(partition by ArticleCode, StoreCode order by Id)
But which of the two records should be selected when [ArticleCode, StoreCode] are equal and [Adress, Number] differ?
If Id is auto-incremented, then order by Id returns the first record entered, and order by Id desc returns the last.
You have to define somehow which [Adress, Number] pair among the duplicates is the correct one to select.
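To see how the ORDER BY inside ROW_NUMBER() decides which duplicate survives, here is a minimal sketch. It uses Python's built-in sqlite3 module as a stand-in for SQL Server (an assumption made so the example runs without a server; window functions need SQLite 3.25 or later), and the sample rows and the dedupe helper are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for SQL Server here (assumption); the
# ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...) syntax is the same.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Table_1 (Id INTEGER PRIMARY KEY, ArticleCode TEXT,"
    " StoreCode TEXT, Adress TEXT, Number INTEGER)"
)
# Hypothetical sample rows: Id 1 and 2 share the same (ArticleCode, StoreCode).
conn.executemany(
    "INSERT INTO Table_1 VALUES (?, ?, ?, ?, ?)",
    [
        (1, "A1", "S1", "Old Street", 10),
        (2, "A1", "S1", "New Street", 20),
        (3, "A2", "S1", "Main Road", 5),
    ],
)


def dedupe(direction):
    """Keep one row per (ArticleCode, StoreCode); direction is 'ASC' or 'DESC'."""
    if direction not in ("ASC", "DESC"):
        raise ValueError(direction)
    sql = f"""
        SELECT Id, ArticleCode, StoreCode, Adress, Number
        FROM (
            SELECT *, ROW_NUMBER() OVER (
                       PARTITION BY ArticleCode, StoreCode
                       ORDER BY Id {direction}) AS rn
            FROM Table_1
        ) AS t
        WHERE rn = 1
        ORDER BY Id
    """
    return conn.execute(sql).fetchall()


first_entered = dedupe("ASC")   # keeps the lowest Id in each group
last_entered = dedupe("DESC")   # keeps the highest Id in each group
print(first_entered)
print(last_entered)
```

With ORDER BY Id the pair ("Old Street", 10) is kept for the duplicated key, and with ORDER BY Id DESC the pair ("New Street", 20) is kept, so the ordering column is where the "which one is correct" decision gets encoded.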

Related

Column 'ACCOUNT.ACCOUNT_ID' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause

I am trying to get the available balance as of the last (max) activity date. I wrote the query below, but it raises an error.
select ACCOUNT_ID,AVAIL_BALANCE,OPEN_DATE,MAX(LAST_ACTIVITY_DATE)
from ACCOUNT
group by CUST_ID;
Column 'ACCOUNT.ACCOUNT_ID' is invalid in the select list because it
is not contained in either an aggregate function or the GROUP BY
clause.
I am new to SQL. Can anyone tell me what is wrong with this query?
Any column in the select list that is not wrapped in an aggregate function must appear in the GROUP BY clause.
select ACCOUNT_ID,AVAIL_BALANCE,OPEN_DATE,MAX(LAST_ACTIVITY_DATE)
from ACCOUNT
group by ACCOUNT_ID,AVAIL_BALANCE,OPEN_DATE;
If you want the most recent row for each customer, think ROW_NUMBER(), not GROUP BY:
;With Numbered as (
select *,ROW_NUMBER() OVER (
PARTITION BY CUST_ID
ORDER BY LAST_ACTIVITY_DATE desc) rn
from Account
)
select ACCOUNT_ID,AVAIL_BALANCE,OPEN_DATE,LAST_ACTIVITY_DATE
from Numbered
where rn=1
I think you want to select the one record having max(LAST_ACTIVITY_DATE) for each CUST_ID.
For this you can use TOP 1 WITH TIES, as follows.
SELECT TOP 1 WITH TIES
       account_id,
       avail_balance,
       open_date,
       last_activity_date
FROM account
ORDER BY Row_number() OVER (
             PARTITION BY cust_id
             ORDER BY last_activity_date DESC)
The issue with your query is that you can't select a non-aggregated column unless it is also listed in the GROUP BY clause.
If you want the max activity date for each customer, your query should be as below:
select CUST_ID, MAX(LAST_ACTIVITY_DATE)
from ACCOUNT
group by CUST_ID;
You can't select any other column that is not in the GROUP BY clause. The error message says the same thing.
with query(CUST_ID, LAST_ACTIVITY_DATE) as
(
select
CUST_ID,
MAX(LAST_ACTIVITY_DATE) as LAST_ACTIVITY_DATE
from ACCOUNT
group by CUST_ID
)
select
a.ACCOUNT_ID,
a.AVAIL_BALANCE,
a.OPEN_DATE,
a.LAST_ACTIVITY_DATE
from ACCOUNT as a
inner join query as q
on a.CUST_ID = q.CUST_ID
and a.LAST_ACTIVITY_DATE = q.LAST_ACTIVITY_DATE
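One difference between the join on MAX(LAST_ACTIVITY_DATE) and the ROW_NUMBER() approach shows up when two accounts of the same customer share the latest date. The sketch below illustrates it using Python's sqlite3 as a stand-in for SQL Server (an assumption); the sample data is hypothetical, and the ACCOUNT_ID tie-breaker in the window's ORDER BY is added here only to make the result deterministic.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); dates are ISO strings so
# that text comparison matches chronological order.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ACCOUNT (ACCOUNT_ID INTEGER PRIMARY KEY, CUST_ID INTEGER,"
    " AVAIL_BALANCE REAL, LAST_ACTIVITY_DATE TEXT)"
)
# Hypothetical rows: accounts 1 and 2 of customer 1 tie on the latest date.
conn.executemany(
    "INSERT INTO ACCOUNT VALUES (?, ?, ?, ?)",
    [
        (1, 1, 100.0, "2020-01-05"),
        (2, 1, 250.0, "2020-01-05"),
        (3, 1, 50.0, "2019-12-01"),
        (4, 2, 75.0, "2020-02-01"),
    ],
)

# Join back on the per-customer maximum: every tied account is returned.
join_ids = [r[0] for r in conn.execute("""
    SELECT a.ACCOUNT_ID
    FROM ACCOUNT AS a
    JOIN (SELECT CUST_ID, MAX(LAST_ACTIVITY_DATE) AS LAST_ACTIVITY_DATE
          FROM ACCOUNT
          GROUP BY CUST_ID) AS q
      ON a.CUST_ID = q.CUST_ID
     AND a.LAST_ACTIVITY_DATE = q.LAST_ACTIVITY_DATE
    ORDER BY a.ACCOUNT_ID
""")]

# ROW_NUMBER(): exactly one account per customer, ties broken by ACCOUNT_ID.
rn_ids = [r[0] for r in conn.execute("""
    SELECT ACCOUNT_ID
    FROM (SELECT ACCOUNT_ID,
                 ROW_NUMBER() OVER (PARTITION BY CUST_ID
                                    ORDER BY LAST_ACTIVITY_DATE DESC,
                                             ACCOUNT_ID) AS rn
          FROM ACCOUNT) AS t
    WHERE rn = 1
    ORDER BY ACCOUNT_ID
""")]

print(join_ids)
print(rn_ids)
```

Which behaviour is right depends on whether a customer with several accounts active on the same day should appear once or once per account.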

sql auto increment based on ID

I'm using SQL Server 2017. I have a table in which I want a counter that auto-increments separately for each ID.
Example Table A has columns
PolicyID, ClaimID, TranId
with the following values
ABC123, 111, 1
When another row is inserted, TranId should show 2, and so on. But if the PolicyID is different, say ABC456, the expected TranId should be 1; instead, my table just keeps incrementing rather than restarting per PolicyID.
You are trying to create a sequence and this shouldn't be stored in the table.
Try creating a view:
create view vw_xxx as
(
select PolicyID, ClaimID
, TranId = row_number() over (partition by PolicyID order by ClaimID)
from tableXXX
)
This is an example of how to do this. You need to choose the partition and order-by columns carefully to get the right sequence.
If this table is large, you will want an index on the partition and order-by columns.
Every time you insert a new row into MyTable, you have to run the following UPDATE:
UPDATE Table_A
SET Table_A.TranId = Table_B.[TranId]
FROM MyTable AS Table_A
INNER JOIN (
SELECT PolicyID, ClaimID, ROW_NUMBER() OVER (PARTITION BY PolicyID ORDER BY ClaimID) AS [TranId]
FROM MyTable
) Table_B
ON Table_A.PolicyID = Table_B.PolicyID AND Table_A.ClaimID = Table_B.ClaimID
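The view-based suggestion can be tried out with the sketch below. It uses Python's sqlite3 as a stand-in for SQL Server (an assumption), and the names TableA and vw_TableA as well as the sample rows are hypothetical. It also assumes ClaimID grows with insertion order, since that is what the numbering is ordered by.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); TableA and vw_TableA are
# hypothetical names. TranId is computed by the view, never stored.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE TableA (PolicyID TEXT, ClaimID INTEGER)")
conn.execute("""
    CREATE VIEW vw_TableA AS
    SELECT PolicyID,
           ClaimID,
           ROW_NUMBER() OVER (PARTITION BY PolicyID
                              ORDER BY ClaimID) AS TranId
    FROM TableA
""")


def read_view():
    """Return (PolicyID, ClaimID, TranId) rows in a stable order."""
    return conn.execute(
        "SELECT PolicyID, ClaimID, TranId FROM vw_TableA"
        " ORDER BY PolicyID, ClaimID"
    ).fetchall()


conn.executemany(
    "INSERT INTO TableA VALUES (?, ?)",
    [("ABC123", 111), ("ABC123", 112), ("ABC456", 113)],
)
before = read_view()

# A new row for ABC456 gets TranId 2 without any UPDATE being run.
conn.execute("INSERT INTO TableA VALUES (?, ?)", ("ABC456", 114))
after = read_view()

print(before)
print(after)
```

Because the number is derived at query time, it restarts at 1 for each PolicyID and stays consistent after inserts, with the trade-off that deleting a row renumbers the later rows of that policy.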

SQL Server how can I use COUNT DISTINCT(*) in HAVING clause?

I have a procedure that counts all the unique [customerid] values and displays them in a SELECT list. I'm trying to filter the result to only the [customerid] values whose count is "> 1" by using a HAVING clause, but SQL Server won't let me use DISTINCT COUNT inside the HAVING. In my mind it makes sense that HAVING should work with the COUNT, but it does not:
USE MyCompany;
GO
SELECT DISTINCT COUNT(customerid) AS NumberOfOrdersMade, customerid AS CustomerID
FROM tblItems_Ordered
GROUP BY customerid
HAVING DISTINCT COUNT(customerid) > 1
GO
You probably want SELECT COUNT(DISTINCT orderid) instead of DISTINCT COUNT(customerid):
USE MyCompany;
GO
SELECT COUNT(DISTINCT orderid) AS NumberOfOrdersMade, customerid AS CustomerID
FROM tblItems_Ordered
GROUP BY customerid
HAVING COUNT(DISTINCT orderid) > 1
GO
When outside of the COUNT, the DISTINCT will eliminate duplicate rows from a result set, which will have no effect in your query because you are doing a GROUP BY. When inside the COUNT, DISTINCT will limit the count to unique values of the column that you pass to the count function. Thus, it makes more sense to use an orderid column instead of customerid when you're aliasing it as NumberOfOrdersMade.
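The difference between counting rows and counting distinct orders can be checked with a small sketch. It uses Python's sqlite3 as a stand-in for SQL Server (an assumption); the orderid column comes from the answer's guess about the schema, and the item column and sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption). One row per ordered item,
# so a single order can span several rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblItems_Ordered (customerid INTEGER, orderid INTEGER,"
    " item TEXT)"
)
# Hypothetical rows: customer 1 has one order with two items,
# customer 2 has two separate orders.
conn.executemany(
    "INSERT INTO tblItems_Ordered VALUES (?, ?, ?)",
    [(1, 10, "pen"), (1, 10, "ink"), (2, 20, "pad"), (2, 21, "pen")],
)

# COUNT(customerid) counts item rows, so both customers pass the filter.
by_rows = conn.execute("""
    SELECT customerid, COUNT(customerid) AS NumberOfRows
    FROM tblItems_Ordered
    GROUP BY customerid
    HAVING COUNT(customerid) > 1
    ORDER BY customerid
""").fetchall()

# COUNT(DISTINCT orderid) counts orders, so only customer 2 passes.
by_orders = conn.execute("""
    SELECT customerid, COUNT(DISTINCT orderid) AS NumberOfOrdersMade
    FROM tblItems_Ordered
    GROUP BY customerid
    HAVING COUNT(DISTINCT orderid) > 1
    ORDER BY customerid
""").fetchall()

print(by_rows)
print(by_orders)
```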

Select a random database row based on another query

For internal control we would like to select a single random invoice for each of multiple invoice types and regions.
Here's the SQL to get a set of distinct Invoice Types and Regions
select InvoiceType,RegionID
from Invoices
group by InvoiceType, RegionID
For each row this returns I need to fetch a random row with that InvoiceType and RegionID. This is how I'm fetching random rows:
SELECT top 1
Invoices.CustomerID
,InvoiceNum
,Name
FROM Invoices
JOIN Customers on Customers.CustomerID=Invoices.CustomerID
where InvoiceType=X and RegionID=Y
ORDER BY NEWID()
But I don't know how to run this select statement for each row that the first statement returns. I could do it programmatically, but I would prefer an option using only a stored procedure, as this query isn't supposed to need a program.
WITH cteInvoices AS (
SELECT i.CustomerID, InvoiceNum, Name,
ROW_NUMBER() OVER(PARTITION BY InvoiceType, RegionID ORDER BY NEWID()) AS RowNum
FROM Invoices i
JOIN Customers cu ON cu.CustomerID = i.CustomerID
)
SELECT c.CustomerID, c.InvoiceNum, c.Name
FROM cteInvoices c
WHERE c.RowNum = 1;
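A quick way to convince yourself that this returns exactly one random invoice per (InvoiceType, RegionID) pair is the sketch below. It uses Python's sqlite3 as a stand-in for SQL Server (an assumption), with SQLite's random() swapped in for NEWID(). The table layouts, the sample rows, and the idea that Name lives in Customers are all assumptions.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption); random() replaces NEWID().
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers (CustomerID INTEGER PRIMARY KEY, Name TEXT)")
conn.execute(
    "CREATE TABLE Invoices (InvoiceNum INTEGER PRIMARY KEY, CustomerID INTEGER,"
    " InvoiceType TEXT, RegionID INTEGER)"
)
# Hypothetical sample data covering three (InvoiceType, RegionID) groups.
conn.executemany(
    "INSERT INTO Customers VALUES (?, ?)", [(1, "Acme"), (2, "Globex")]
)
conn.executemany(
    "INSERT INTO Invoices VALUES (?, ?, ?, ?)",
    [
        (1001, 1, "A", 1),
        (1002, 2, "A", 1),
        (1003, 1, "A", 2),
        (1004, 2, "B", 1),
        (1005, 1, "B", 1),
    ],
)

# Number the rows of each group in random order and keep the first one.
picked = conn.execute("""
    SELECT InvoiceType, RegionID, InvoiceNum, Name
    FROM (
        SELECT i.InvoiceType, i.RegionID, i.InvoiceNum, c.Name,
               ROW_NUMBER() OVER (PARTITION BY i.InvoiceType, i.RegionID
                                  ORDER BY random()) AS RowNum
        FROM Invoices AS i
        JOIN Customers AS c ON c.CustomerID = i.CustomerID
    ) AS t
    WHERE RowNum = 1
    ORDER BY InvoiceType, RegionID
""").fetchall()

for row in picked:
    print(row)
```

The invoice chosen inside each group changes from run to run, but the number of rows always equals the number of distinct groups.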

How to delete duplicates from a table and keeping only the one with highest id in sql server?

I have a table with a unique ID and then some fields. I would like to delete all the duplicate rows and keep only one, the one with the highest ID.
For example, assume a table with 3 fields: RECORD_ID, FIELD_ONE, FIELD_TWO.
Which query lets me delete all records that have the same values for FIELD_ONE and FIELD_TWO except the one with the highest RECORD_ID?
Found:
with cte
as
(
select *, row_number() over(partition by FIELD_ONE, FIELD_TWO order by RECORD_ID desc) RowNumber
from TestTable
)
delete cte
where RowNumber > 1
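For experimenting with the same idea outside SQL Server, here is a minimal sketch using Python's sqlite3 (an assumption, not the original environment). SQLite cannot delete through a CTE the way T-SQL can, so the sketch swaps in a DELETE ... WHERE RECORD_ID IN (...) around the same ROW_NUMBER() numbering; the sample rows are hypothetical.

```python
import sqlite3

# SQLite stands in for SQL Server (assumption). T-SQL's "delete cte" form
# is not available here, so the numbered rows are fed to an IN subquery.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE TestTable (RECORD_ID INTEGER PRIMARY KEY,"
    " FIELD_ONE TEXT, FIELD_TWO TEXT)"
)
# Hypothetical rows: ids 1, 2 and 4 are duplicates on (FIELD_ONE, FIELD_TWO).
conn.executemany(
    "INSERT INTO TestTable VALUES (?, ?, ?)",
    [(1, "a", "x"), (2, "a", "x"), (3, "b", "y"), (4, "a", "x"), (5, "b", "z")],
)

# Within each duplicate group the highest RECORD_ID gets row number 1;
# every row numbered 2 or above is deleted.
conn.execute("""
    DELETE FROM TestTable
    WHERE RECORD_ID IN (
        SELECT RECORD_ID
        FROM (
            SELECT RECORD_ID,
                   ROW_NUMBER() OVER (PARTITION BY FIELD_ONE, FIELD_TWO
                                      ORDER BY RECORD_ID DESC) AS RowNumber
            FROM TestTable
        ) AS numbered
        WHERE RowNumber > 1
    )
""")
conn.commit()

remaining = conn.execute(
    "SELECT RECORD_ID, FIELD_ONE, FIELD_TWO FROM TestTable ORDER BY RECORD_ID"
).fetchall()
print(remaining)
```

Running the inner SELECT on its own first is a cheap way to preview which rows would be removed before committing to the DELETE.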
