TSQL: group by Substring (Name) and retrieve ID in SELECT - sql-server

We have companies' data stored in a table. To de-duplicate the rows, we need to identify duplicate companies using the following criterion: if the first five letters of the CompanyName, the City, and the postal code match the same fields of another record, the record is a duplicate. We will remove the duplicates later. The problem I am running into is that I can't retrieve the IDs of these records, since I am not grouping the records on ID.
I am using the following SQL:
Select count(ID) as DupCount
, SUBSTRING(Name,1,5) as Name
, City
, PostalCode
from tblCompany
group by SUBSTRING(Name,1,5)
, City
, PostalCode
Having count(ID) > 1
order by count(ID) desc
How do I retrieve the ID of these records?

You can use window functions:
Select c.*
from (select c.*,
count(*) over (partition by left(Name, 5), City, PostalCode) as cnt
from tblCompany c
) c
where cnt >= 2;
This will return the individual rows with dups. You can then summarize this or do what you want with the result set.
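To see the window-function approach on concrete rows, here is a minimal sketch that runs the same query shape against an in-memory SQLite database from Python. SQLite (3.25 or later, for window functions) is only a stand-in for SQL Server, the sample rows are made up, and substr(Name, 1, 5) takes the place of T-SQL's LEFT(Name, 5).

```python
import sqlite3

# Hypothetical sample rows; SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblCompany (ID INTEGER PRIMARY KEY, Name TEXT, City TEXT, PostalCode TEXT)"
)
conn.executemany(
    "INSERT INTO tblCompany VALUES (?, ?, ?, ?)",
    [
        (1, "Acme Corporation", "Boston", "02101"),
        (2, "Acme Corp.", "Boston", "02101"),
        (3, "Acme Corp.", "Austin", "73301"),
        (4, "Zenith Ltd", "Boston", "02101"),
    ],
)

# Count the rows sharing the same (name prefix, city, postal code) and keep
# every individual row whose group has at least two members.
dup_rows = conn.execute(
    """
    SELECT ID, Name, City, PostalCode, cnt
    FROM (
        SELECT c.*,
               COUNT(*) OVER (
                   PARTITION BY substr(Name, 1, 5), City, PostalCode
               ) AS cnt
        FROM tblCompany c
    ) AS c
    WHERE cnt >= 2
    ORDER BY ID
    """
).fetchall()

for row in dup_rows:
    print(row)
```

Rows 1 and 2 share the prefix, city, and postal code, so both come back with their IDs; rows 3 and 4 are alone in their groups and are filtered out.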

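group_concat() is a MySQL function and is not available in SQL Server. On SQL Server 2017 and later, use STRING_AGG() to get the ids as a comma-separated list:
select
SUBSTRING(Name,1,5) as Name,
City,
PostalCode,
count(ID) as counter,
string_agg(cast(ID as varchar(20)), ',') within group (order by ID) as ids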
from tblCompany
group by SUBSTRING(Name,1,5), City, PostalCode
having count(ID) > 1
order by count(ID) desc
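The idea of collapsing each duplicate group into one row with a list of ids can be tried out with the sketch below. It uses SQLite from Python purely as a stand-in (SQLite happens to provide group_concat()); the table contents are hypothetical, and the ids are sorted in Python because this form of group_concat() does not promise an order.

```python
import sqlite3

# Hypothetical sample rows; SQLite stands in for SQL Server here.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tblCompany (ID INTEGER PRIMARY KEY, Name TEXT, City TEXT, PostalCode TEXT)"
)
conn.executemany(
    "INSERT INTO tblCompany VALUES (?, ?, ?, ?)",
    [
        (1, "Acme Corporation", "Boston", "02101"),
        (2, "Acme Corp.", "Boston", "02101"),
        (3, "Acme Corp.", "Austin", "73301"),
        (4, "Zenith Ltd", "Boston", "02101"),
    ],
)

rows = conn.execute(
    """
    SELECT substr(Name, 1, 5) AS NamePrefix, City, PostalCode,
           COUNT(ID) AS DupCount,
           group_concat(ID) AS ids
    FROM tblCompany
    GROUP BY substr(Name, 1, 5), City, PostalCode
    HAVING COUNT(ID) > 1
    """
).fetchall()

# Split the comma-separated list and sort it, since the aggregate's
# internal order is not guaranteed here.
dup_groups = [
    (prefix, city, postal, count, sorted(int(i) for i in ids.split(",")))
    for prefix, city, postal, count, ids in rows
]
print(dup_groups)
```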

Related

SQL - Return a value sum only once when grouped

I want to count the unique values of a string column grouped by date, and a value that has already appeared in an earlier group should not be counted again.
I've tried using distinct. It gives the unique count within each group, but the same value is counted again in every month where it appears.
Simplified version of the actual SQL query:
select
date,
count(distinct d.name) as count
from ...
group by date
Sample and desired output: (image not preserved)
Grab unique names and tag them with the earliest date. At that point it's just a matter of regrouping the resulting rows by date. Each name will uniquely correspond to only one date as desired:
with data as (select name, min("date") as dt from T group by name)
select dt, count(name) as cnt from data group by dt;
If you still need to see the original dates even when no names are counted, then flag each row according to whether it should be counted and then count the flags per date:
with data as (
select *,
case when "date" = min("date") over (partition by name)
then 1 end as flag
from T
)
select "date", count(flag) as cnt
from data
group by "date";
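The flagging variant can be checked with a small sketch. It assumes a hypothetical table T(name, "date") with made-up rows and uses SQLite from Python as a stand-in for SQL Server; the query itself keeps the same shape as the one above.

```python
import sqlite3

# Hypothetical data: 'a' and 'b' first appear in January, 'c' in March.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE T (name TEXT, "date" TEXT)')
conn.executemany(
    "INSERT INTO T VALUES (?, ?)",
    [
        ("a", "20190101"),
        ("b", "20190101"),
        ("a", "20190201"),
        ("b", "20190201"),
        ("a", "20190301"),
        ("c", "20190301"),
    ],
)

# Flag a row only when its date is the earliest date for that name,
# then count the flags per date. COUNT ignores the NULL flags, so dates
# with no first appearances still show up with a count of 0.
counts = conn.execute(
    """
    WITH data AS (
        SELECT name, "date",
               CASE WHEN "date" = MIN("date") OVER (PARTITION BY name)
                    THEN 1 END AS flag
        FROM T
    )
    SELECT "date", COUNT(flag) AS cnt
    FROM data
    GROUP BY "date"
    ORDER BY "date"
    """
).fetchall()

print(counts)
```

One thing to keep in mind with this approach: if the same name occurs more than once on its earliest date, each of those rows is flagged, so the table should hold at most one row per name and date for the counts to be exact.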
So you want each name to be counted only once:
SELECT COUNT(u.name) as name_count, u.[date]
FROM (
SELECT d.name,MIN(d.date) AS [date]
FROM yourTable d
GROUP BY d.name) u
GROUP BY u.[date];
You can add a ROW_NUMBER() that is Partitioned by name and ordered by date and add a WHERE clause that only returns the rows with Row_Number = 1.
You can try the following option:
SELECT A.Date,COUNT(B.[Name]) Count
FROM
(
SELECT DISTINCT Date FROM your_table
)A
LEFT JOIN
(
SELECT * FROM
(
SELECT *,ROW_NUMBER() OVER(PARTITION BY [Name] ORDER BY Date) RN
FROM your_table
)A WHERE RN = 1
)B ON A.Date = B.Date
GROUP BY A.Date
A better option, slightly modifying the approach from Shawnt00, is:
SELECT A.Date,COUNT(B.[Name]) Count
FROM
(
SELECT DISTINCT Date FROM your_table
)A
LEFT JOIN
(
SELECT [Name],MIN(Date) Date FROM your_table GROUP BY [Name]
)B ON A.Date = B.Date
GROUP BY A.Date
In both cases the output will be:
Date Count
20190101 2
20190201 0
20190301 1

How to use DISTINCT keyword in SQL Server?

How do I use the DISTINCT keyword in SQL Server? Specifically, can it be applied to a single given field?
select id, name, age
from dbo.XXX
The query returns multiple rows. I would like to find out how many distinct values there are of id, name, or age:
select distinct id, name, age from dbo.XXX or
select id, distinct name, age from dbo.XXX or
select id, name, distinct age from dbo.XXX
To sum up, I would like to use a single SQL statement to get the distinct count of each field, like select distinct id, distinct name, distinct age from dbo.XXX
Dense_Rank can be used to calculate a distinct count for any column, and for several columns at once. For each row, the ascending dense rank plus the descending dense rank of a column's value, minus 1, equals the number of distinct values in that column (a NULL, if present, counts as one value):
Select col1, col2, col3,
dense_rank() over (order by [col1]) + dense_rank() over (order by [col1] desc) - 1 as DistCountCol1,
dense_rank() over (order by [col2]) + dense_rank() over (order by [col2] desc) - 1 as DistCountCol2,
dense_rank() over (order by [col3]) + dense_rank() over (order by [col3] desc) - 1 as DistCountCol3
from [table]
select distinct ID
from dbo.XXX
Select distinct name
from dbo.XXX
Select distinct age
from dbo.XXX
If you want to know how many rows you have for each distinct ID or Name or Age, you can use the following:
Select ID, count(id) as [ID_Recurrence]
from dbo.XXX
group by ID
Select Age, count(age) as [Age_Recurrence]
from dbo.XXX
group by Age
Select Name, count(name) as [Name_Recurrence]
from dbo.XXX
group by Name
The DISTINCT keyword returns unique rows, as in the following:
SELECT DISTINCT ID FROM SomeTable
SELECT DISTINCT ID , SCORE FROM SomeTable
If you want to get one row per unique value, try the following code.
The code below is copied from another source:
select t.id, t.player_name, t.team
from tablename t
join (select team, min(id) as minid from tablename group by team) x
on x.team = t.team and x.minid = t.id
select COUNT(distinct id) uniqueIDCount
from dbo.XXX
This counts the distinct values of the id field. If you want to count distinct values of a combination of fields, you must concatenate the fields. Assuming your id is an integer and name is nvarchar (the separator keeps pairs such as id 1 with name '1a' and id 11 with name 'a' from colliding):
select COUNT(distinct CONVERT(nvarchar(20), id) + N'|' + name) uniqueIDCount
from dbo.XXX
Note that although this looks neat, it is probably not the most efficient approach. Here is a more efficient, though slightly longer, way:
with c as (
select distinct id, name
from dbo.XXX
)select COUNT(1)
from c
Not sure why this needs to be complicated. You can write three separate queries and combine them with UNION to return a single result set if you want.
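For the question as asked (a distinct count of each field in a single statement), COUNT(DISTINCT ...) can simply be repeated once per column. The sketch below illustrates this on made-up rows, with SQLite from Python standing in for SQL Server and a plain table XXX in place of dbo.XXX; it also shows the subquery form for counting distinct (id, name) pairs.

```python
import sqlite3

# Hypothetical rows: 3 distinct ids, 2 distinct names, 3 distinct ages.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE XXX (id INTEGER, name TEXT, age INTEGER)")
conn.executemany(
    "INSERT INTO XXX VALUES (?, ?, ?)",
    [(1, "Ann", 30), (2, "Bob", 30), (2, "Bob", 41), (3, "Ann", 25)],
)

# One statement, one distinct count per column.
per_column = conn.execute(
    """
    SELECT COUNT(DISTINCT id)   AS distinct_ids,
           COUNT(DISTINCT name) AS distinct_names,
           COUNT(DISTINCT age)  AS distinct_ages
    FROM XXX
    """
).fetchone()

# Distinct combinations of two columns, without string concatenation.
pair_count = conn.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT id, name FROM XXX) AS c"
).fetchone()[0]

print(per_column, pair_count)
```

COUNT(DISTINCT col) ignores NULLs, so a column containing NULL reports one fewer value than the DENSE_RANK trick would.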

Error in SQLServer: Subquery returned more than 1 value

I would like to insert data into the Clients table from two different tables (Surname and Name). I would also like a third column (email) that is a concatenation of the first two. When I try the code below, it gives me the following error: "Subquery returned more than 1 value".
insert into CLIENTS (LastName,Firstname, EMAIL)
select (select top 150 Surname from Surname order by NEWID()),
(select top 150 Name from Name order by Newid()),
(select concat(concat(FirstName, LastName),'#novaims.com') from clients);
Could you please help me understand where the problem is?
The error message tells you what is wrong: each of your subqueries can return more than one row where a single value is expected. Try this:
;WITH cte
AS (SELECT 1 AS val
UNION ALL
SELECT val + 1
FROM cte
WHERE val < 150)
SELECT FirstName,
LastName,
Concat(FirstName, LastName, '#novaims.com')
FROM cte
OUTER apply (SELECT TOP 1 Surname FROM Surname ORDER BY Newid()) s (LastName)
OUTER apply (SELECT TOP 1 Name FROM Name ORDER BY Newid()) n (FirstName)
Option (Maxrecursion 0)
You need to move the table references to the from clause. I think this does what you want:
insert into CLIENTS (LastName, Firstname, EMAIL)
select surname, name, concat(name, surname, '#novaims.com')
from (select Surname, row_number() over (order by newid()) as seqnum
from Surname
) s join
(select Name, row_number() over (order by newid()) as seqnum
from Name
) n
on n.seqnum = s.seqnum;
Another method uses apply:
insert into CLIENTS (LastName, Firstname, EMAIL)
select top 150 s.surname, n.name, concat(n.name, s.surname, '#novaims.com')
from surname s cross apply
(select top 1 n.*
from Name n
order by newid()
) n
order by newid();
This is more similar to your original idea. Do note, though, that the same name can appear more than once. And the performance should be better for the first version (because the sort is only happening once on each table).
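The row_number() pairing from the first version can be sketched end to end as follows. This is an illustration only: SQLite from Python stands in for SQL Server, random() takes the place of NEWID(), || replaces CONCAT, and the names in both tables are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Surname (Surname TEXT)")
conn.execute("CREATE TABLE Name (Name TEXT)")
conn.execute("CREATE TABLE Clients (LastName TEXT, FirstName TEXT, Email TEXT)")

# Hypothetical source data: three surnames, four first names.
surnames = ["Silva", "Costa", "Santos"]
first_names = ["Ana", "Rui", "Eva", "Luis"]
conn.executemany("INSERT INTO Surname VALUES (?)", [(s,) for s in surnames])
conn.executemany("INSERT INTO Name VALUES (?)", [(n,) for n in first_names])

# Number each table in random order, then pair rows with equal numbers.
# Each table is shuffled once, and every row is used at most once.
conn.execute(
    """
    INSERT INTO Clients (LastName, FirstName, Email)
    SELECT s.Surname, n.Name, n.Name || s.Surname || '#novaims.com'
    FROM (SELECT Surname, ROW_NUMBER() OVER (ORDER BY random()) AS seqnum
          FROM Surname) AS s
    JOIN (SELECT Name, ROW_NUMBER() OVER (ORDER BY random()) AS seqnum
          FROM Name) AS n
      ON n.seqnum = s.seqnum
    """
)

clients = conn.execute("SELECT LastName, FirstName, Email FROM Clients").fetchall()
for client in clients:
    print(client)
```

The join produces as many rows as the shorter of the two tables, so with three surnames and four names exactly three clients are inserted and one first name is left unused.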

Finding a recent most duplicate records from SQL Server 2012

I want to find the most recent duplicate records in SQL Server 2012. Here is the table structure I have.
I have a table called UserRegistration in which UserID (a GUID) can be duplicated, and the same table has a CreatedDate column (date). I want to find the most recent of the duplicate records in this table.
Here is the sample data.
id FirstName LastName CreatedDate UserID
109 FirstNameA LastNameA 28-04-2015 GUID1
110 FirstNameC LastNameD 19-05-2015 GUID2
111 FirstNameE LastNameF 22-05-2015 GUID1
As you can see in the table above, GUID1 is duplicated. I want to find the most recent one: the query should return only rows that have duplicates, and of those only the most recent. For the data above, it should return 111, because that record was created more recently than 109.
Let me know if you have any questions; I am happy to answer. Thanks.
Harshal
Try the query below; it should work for your input data:
create table #UserRegistration (id int,FirstName varchar(20),LastName varchar(20),CreatedDate date,UserID varchar(20))
insert into #UserRegistration
select 109, 'FirstNameA', 'LastNameA', '2015-04-28', 'GUID1' union
select 110, 'FirstNameC', 'LastNameD', '2015-05-19', 'GUID2' union
select 111, 'FirstNameE', 'LastNameF', '2015-05-22', 'GUID1'
select id, FirstName, LastName, CreatedDate, UserID from
(SELECT ur.*,row_number() over(partition by UserID order by CreatedDate) rn
FROM #UserRegistration ur) A
where rn > 1
You could use CTE. Group your records by UserID and give your particular row a rank ordered by CreatedDate.
insert into tab(id, FirstName, LastName, CreatedDate, UserID)
values(109, 'FirstNameA', 'LastNameA', '2015-04-28', 'guid1'),
(110, 'FirstNameC', 'LastNameD', '2015-05-19', 'guid2'),
(111, 'FirstNameE', 'LastNameF', '2015-05-22', 'guid1');
with cte as
(
select id, ROW_NUMBER() over (partition by UserID order by CreatedDate asc) as [Rank],
FirstName, LastName, CreatedDate, UserID
from tab
)
select id, FirstName, LastName, CreatedDate, UserID from cte where Rank > 1
The Rank > 1 condition is responsible for retrieving the duplicated items.
sqlfiddle link:
http://sqlfiddle.com/#!6/4d1f2/6
Solved this by using tmp-tables:
SELECT a.UserID,
MAX(a.CreatedDate) As CreatedDate
INTO #latest
FROM <your table> a
GROUP BY a.UserID
HAVING COUNT(a.UserID) > 1
SELECT b.id
FROM #latest a
INNER JOIN <your table> b ON a.UserID = b.UserID AND a.CreatedDate = b.CreatedDate
try this,
SELECT * FROM TableName tt WHERE
exists(select MAX(createdDate)
from TableName
where tt.UserID = UserID
group by UserID
having MAX(createdDate)= tt.createdDate)
I think your createddate field is not a date field; in that case, try Format.
WITH TempAns (id,UserID,duplicateRecordCount)
AS
(
SELECT id,
UserID,
ROW_NUMBER()OVER(partition by UserID ORDER BY id)
AS duplicateRecordCount
FROM #t
)
select * from #t where id in (
select max(id )
from TempAns
where duplicateRecordCount > 1
group by UserID )
You'd rank your records with ROW_NUMBER() so that the latest record per userid gets #1. With COUNT() you make sure to get only the userids that have more than one record.
select
id, firstname, lastname, createddate, userid
from
(
select
id, firstname, lastname, createddate, userid,
row_number() over (partition by userid order by createddate desc) as rn,
count(*) over (partition by userid) as cnt
from userregistration
) ranked
where rn = 1 -- only last one
and cnt > 1; -- but only if there is more than one record for the userid
This gets the latest record for every userid that has duplicates.
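As a quick check of this ROW_NUMBER() plus COUNT() approach, the sketch below loads the three sample rows from the question into an in-memory SQLite database (a stand-in for SQL Server, driven from Python) and runs the same query. Dates are stored as ISO strings, which is an assumption made so that they sort correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE UserRegistration (
        id INTEGER, FirstName TEXT, LastName TEXT, CreatedDate TEXT, UserID TEXT
    )
    """
)
# Sample rows from the question, with dates rewritten as yyyy-mm-dd.
conn.executemany(
    "INSERT INTO UserRegistration VALUES (?, ?, ?, ?, ?)",
    [
        (109, "FirstNameA", "LastNameA", "2015-04-28", "GUID1"),
        (110, "FirstNameC", "LastNameD", "2015-05-19", "GUID2"),
        (111, "FirstNameE", "LastNameF", "2015-05-22", "GUID1"),
    ],
)

# rn = 1 picks the newest row per UserID; cnt > 1 keeps only UserIDs
# that actually have duplicates.
latest_dups = conn.execute(
    """
    SELECT id, FirstName, LastName, CreatedDate, UserID
    FROM (
        SELECT ur.*,
               ROW_NUMBER() OVER (PARTITION BY UserID ORDER BY CreatedDate DESC) AS rn,
               COUNT(*) OVER (PARTITION BY UserID) AS cnt
        FROM UserRegistration ur
    ) AS ranked
    WHERE rn = 1 AND cnt > 1
    """
).fetchall()

print(latest_dups)
```

Only record 111 is returned: GUID2 has a single row and is dropped by the cnt filter, and 109 is the older of the two GUID1 rows.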

SQL Select distinct from multiple fields returning only one row

I have a table with the following columns in SQL Server:
MEMBERID, MEMBEREMAIL, FATHEREMAIL, MOTHEREMAIL, MEMBERNAME
MEMBERID is PK. The three email columns are not unique, so the same email may appear several times in the same row AND in several rows.
I am trying to extract a unique list of emails, and for each email also get a memberid and membername (it does not matter from which record).
For example if I have three rows:
1 x#x.com y#y.com y#y.com Mark
2 z#z.com y#y.com x#x.com John
3 x#x.com y#y.com z#z.com Susan
I want to get the three emails (x#x.com, y#y.com, z#z.com) and, for each of those, a MEMBERID in which it appears. It does NOT matter which MEMBERID: for example, for x#x.com I don't care whether I get the values 1 and Mark, 2 and John, or 3 and Susan, as long as x#x.com appears only once in the results.
If I use DISTINCT when trying to return the email and memberid and membername, of course I get all of the rows.
You could use a subquery to normalize all emails. Then you can use row_number to filter out one memberid, membername per email:
select *
from (
select row_number() over (partition by email order by memberid) as rn
, *
from (
select MEMBERID
, MEMBERNAME
, MEMBEREMAIL as email
from YourTable
union all
select MEMBERID
, MEMBERNAME
, FATHEREMAIL
from YourTable
union all
select MEMBERID
, MEMBERNAME
, MOTHEREMAIL
from YourTable
) as emails
) num_emails
where rn = 1
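The UNION ALL version can be run against the three example rows from the question with the sketch below. SQLite from Python is used only as a stand-in for SQL Server, and YourTable is the same placeholder name as in the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE YourTable (
        MEMBERID INTEGER PRIMARY KEY,
        MEMBEREMAIL TEXT, FATHEREMAIL TEXT, MOTHEREMAIL TEXT, MEMBERNAME TEXT
    )
    """
)
# The three example rows from the question.
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?, ?)",
    [
        (1, "x#x.com", "y#y.com", "y#y.com", "Mark"),
        (2, "z#z.com", "y#y.com", "x#x.com", "John"),
        (3, "x#x.com", "y#y.com", "z#z.com", "Susan"),
    ],
)

# Stack the three email columns into one, number the rows per email,
# and keep the first row (lowest MEMBERID) for each email.
unique_emails = conn.execute(
    """
    SELECT email, MEMBERID, MEMBERNAME
    FROM (
        SELECT ROW_NUMBER() OVER (PARTITION BY email ORDER BY MEMBERID) AS rn,
               emails.*
        FROM (
            SELECT MEMBERID, MEMBERNAME, MEMBEREMAIL AS email FROM YourTable
            UNION ALL
            SELECT MEMBERID, MEMBERNAME, FATHEREMAIL FROM YourTable
            UNION ALL
            SELECT MEMBERID, MEMBERNAME, MOTHEREMAIL FROM YourTable
        ) AS emails
    ) AS num_emails
    WHERE rn = 1
    ORDER BY email
    """
).fetchall()

print(unique_emails)
```

Each email appears exactly once, paired with the lowest MEMBERID that lists it in any of the three columns.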
You could also normalize the emails using the UNPIVOT clause, like this:
select *
from (
select row_number() over (partition by email order by memberid) as rn
, *
from (
select MEMBERID
, MEMBERNAME
, email
from YourTable
unpivot (
email
for emailOwner
in (
MEMBEREMAIL,
FATHEREMAIL,
MOTHEREMAIL
)
) as u
) as emails
) num_emails
where rn = 1
This code will give you the distinct emails for each member. You can then create a cursor over the members query and build a comma-separated list of emails per MEMBERID. I would write the result to an output table, which is easier if you need it for future use, and wrap the logic in a stored procedure that creates that table:
-- YourTable is a placeholder for the actual table name
select mem.*, mails.MEMBEREMAIL
from (
select MEMBERID,max(MEMBERNAME) as MEMBERNAME
from YourTable
group by MEMBERID
) as mem
inner join
(
select distinct MEMBERID, MEMBEREMAIL
from (
select MEMBERID, MEMBEREMAIL
from YourTable
union
select MEMBERID, FATHEREMAIL
from YourTable
union
select MEMBERID, MOTHEREMAIL
from YourTable
) as mail
) as mails on mem.MEMBERID = mails.MEMBERID

Resources