Identify duplicates based on multiple columns and parent row

This is an example of table data that I am working on (the table contained a lot of columns, I am showing here only the relevant ones):
Id  job_number  status  parent_id
---------------------------------
1   42FWD-42    0       0
2   42FWD-42    1       1
3   42FWD-42    5       1
Id is auto-generated. parent_id links the rows of a job using the Id.
When a new job is created via the app, a new row is created (with status "0"). The auto-generated Id of that row is then set as parent_id on the subsequent rows of the same job.
Another record with status "1" (which is the code for started) is also created just after the parent record.
Explanation of the problem: due to a bug in the app, there are duplicate sets of rows for the same job.
Example of the problem:
Id  job_number  status  parent_id
---------------------------------
1   42FWD-42    0       0
2   42FWD-42    0       0
3   42FWD-42    1       1
4   42FWD-42    1       2
5   42FWD-42    5       1
As you can see from this example, due to the bug, there are 2 rows with "0" status for the same job, and 2 rows with "1" status.
This causes a lot of problems in the app wherever a job is updated using the job number.
The status number should not repeat for a specific job.
What I want to do is to find all duplicates like those in example. For example, I want a query where I can find all duplicates which have same job number, but different parent_id and NO "5" status.
Example result using the example table above, I need the query to return:
Id  job_number  status  parent_id
---------------------------------
2   42FWD-42    0       0
4   42FWD-42    1       2
Explanation of this result:
Row with Id=1 is considered the correct record because it has an associated record with status "5"
Row with Id=2 is considered duplicate and its associated records are also considered duplicate
Another possible case: there are duplicate rows, but none of them has status=5. These rows can be discarded, i.e. they need not be shown in the results.
A brief explanation of how the query works will be appreciated.
EDIT:
I forgot to add an important piece of information:
job_number is case sensitive,
i.e. 42FWD-42 and 42fwd-42 are different and valid job numbers. They should not be considered duplicates; they are 2 separate jobs.
The reason for this is that the actual job number is not a short text as in my example. It is a long string like a GUID.

First, I must mention that you should block identical rows by means of a unique constraint. I suggest that once you have eliminated all duplicates, you add such a constraint to keep this from happening again.
Now for your question: you can do this by grouping on the duplicate columns and keeping only the groups that contain more than one row.
Here is an example:
declare @t table (id int, job_number varchar(10), status int, parent_id int)
insert into @t
values (1, '42FWD-42', 0, 0), (2, '42FWD-42', 0, 0), (3, '42FWD-42', 1, 1), (4, '42FWD-42', 1, 2), (5, '42FWD-42', 5, 1)
select max(t.id) as id, t.job_number, t.status
from @t t
group by t.job_number, t.status
having count(*) > 1
the result is
id job_number status
2 42FWD-42 0
4 42FWD-42 1
To also get the parent_id, you can add a correlated subquery that looks it up for the selected id:
select max(t.id) as id,
       t.job_number,
       t.status,
       (select t2.parent_id from @t t2 where t2.id = max(t.id)) as parent_id
from @t t
group by t.job_number, t.status
having count(*) > 1
this returns
id job_number status parent_id
2 42FWD-42 0 0
4 42FWD-42 1 2
EDIT
To solve the additional problem in the edit of your question, about case sensitivity, you can use a case-sensitive COLLATE both where the field is selected and where it is grouped (compared).
This should do it:
declare @t table (id int, job_number varchar(10), status int, parent_id int)
insert into @t
values (1, '42FWD-42', 0, 0),
       (2, '42FWD-42', 0, 0),
       (3, '42FWD-42', 1, 1),
       (4, '42fwd-42', 1, 2), -- LOWERCASE !!!
       (5, '42FWD-42', 5, 1)
select max(t.id) as id,
       t.job_number COLLATE Latin1_General_CS_AS as job_number,
       t.status,
       (select t2.parent_id from @t t2 where t2.id = max(t.id)) as parent_id
from @t t
group by t.job_number COLLATE Latin1_General_CS_AS, t.status
having count(*) > 1
and now the result will be
id job_number status parent_id
2 42FWD-42 0 0
Yet another edit
Now, suppose you need to use the ids of these duplicates in another query. You could do something like this:
select t.*
from @t t
where t.id in ( select max(t.id) as id
                from @t t
                group by t.job_number COLLATE Latin1_General_CS_AS, t.status
                having count(*) > 1
              )
What I am doing here is getting only the duplicate ids, in a form that can feed a where clause in another query.
This way you can use the result set in any way you wish.
Also note that for this we no longer need the extra subquery to retrieve the parent_id.
One possible use of this is to delete duplicate rows. You can write:
delete from yourtable
where id in ( select max(t.id) as id
              from yourtable t
              group by t.job_number COLLATE Latin1_General_CS_AS, t.status
              having count(*) > 1
            )
Note that max(id) picks one row per group, so a group with more than two identical rows needs the delete to be run more than once.

You can use the ROW_NUMBER window function to find the duplicate status-0 rows per job_number, and then use a recursive CTE to collect all rows that hang off those rows through parent_id.
Query 1:
;WITH CTE AS (
SELECT *,ROW_NUMBER() OVER (PARTITION BY job_number ORDER BY Id) rn
FROM T
WHERE status = 0
), CTE1 AS (
SELECT id,job_number,status,parent_id
FROM CTE
WHERE rn > 1
UNION ALL
SELECT t.id,t.job_number,t.status,t.parent_id
FROM CTE1 c INNER JOIN T t
ON c.id = t.parent_id
)
SELECT *
FROM CTE1
Results:
| id | job_number | status | parent_id |
|----|------------|--------|-----------|
| 2 | 42FWD-42 | 0 | 0 |
| 4 | 42FWD-42 | 1 | 2 |
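The queries above pick the later rows as duplicates but do not look at status 5, which the question uses to decide which set of rows is the genuine one. Below is a minimal sketch of that rule, not a tested production query. It is run through Python's sqlite3 module only so that it is self-contained (an assumption; the thread targets SQL Server). A status-0 row counts as a duplicate root when it has no status-5 child while another status-0 row with the same job_number does, and its child rows are returned with it. The names roots, dup_roots and has_5 are made up for illustration. SQLite compares text case-sensitively by default; on SQL Server the job_number comparison would need a case-sensitive COLLATE as shown earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE T (id INTEGER PRIMARY KEY, job_number TEXT, status INTEGER, parent_id INTEGER)"
)
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?, ?)",
    [
        (1, "42FWD-42", 0, 0),
        (2, "42FWD-42", 0, 0),
        (3, "42FWD-42", 1, 1),
        (4, "42FWD-42", 1, 2),
        (5, "42FWD-42", 5, 1),
    ],
)

QUERY = """
WITH roots AS (
    -- one row per status-0 record, flagged when it has a status-5 child
    SELECT r.id, r.job_number,
           CASE WHEN EXISTS (SELECT 1 FROM T c
                             WHERE c.parent_id = r.id AND c.status = 5)
                THEN 1 ELSE 0 END AS has_5
    FROM T r
    WHERE r.status = 0
),
dup_roots AS (
    -- roots without status 5 whose job also has a root with status 5
    SELECT d.id
    FROM roots d
    WHERE d.has_5 = 0
      AND EXISTS (SELECT 1 FROM roots g
                  WHERE g.job_number = d.job_number AND g.has_5 = 1)
)
SELECT t.id, t.job_number, t.status, t.parent_id
FROM T t
WHERE t.id IN (SELECT id FROM dup_roots)
   OR t.parent_id IN (SELECT id FROM dup_roots)
ORDER BY t.id
"""

duplicates = conn.execute(QUERY).fetchall()
for row in duplicates:
    print(row)
```

Jobs where no root has a status-5 child produce no rows, which matches the "can be discarded" case in the question.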

Related

How to fix SSRS does NOT count NULL as value?

I'm creating a report using SSRS, and I have a bunch of departments and I need to count the total of their running statuses. This is the result from my table
Please note, the UNKNOWN department is NULL inside the table. I hard-coded it to 'UNKNOWN' with ISNULL(department,'UNKNOWN').
I have checked that the table has NULL records and that I can count those NULL records with COUNT(*).
However, it seems like SSRS does not count NULL values.
The SSRS expression I have is =COUNT(Fields!ID.Value)
I need the UNKNOWN rows counted just like the other departments.
How do I fix this?

I think your problem comes from how the query was written. This is a guess (you didn't provide the query) but I expect you did something like this:
/* Start Demo Data */
DECLARE @Departments TABLE (DepartmentID INT IDENTITY, Name NVARCHAR(50));
INSERT INTO @Departments (Name) VALUES
('Architect'),('Business Intelligence Analyst'),('Data Analyst'),
('Database'),('Information Technology'),('Technical Analyst');
DECLARE @Tickets TABLE (TicketID INT IDENTITY, CreateDateUTC DATETIME DEFAULT GETUTCDATE(), DepartmentID INT, Status NVARCHAR(50));
INSERT INTO @Tickets (DepartmentID, Status) VALUES
(1, 'Completed'),
(2, 'Completed'),(2, 'Completed'),
(3, 'Completed'),(3, 'Completed'),(3, 'Completed'),(3, 'Completed'),(3, 'Completed'),
(3, 'Failure'),(3, 'Failure'),(3, 'Running'),(3, 'Running'),(3, 'Failure'),
(4, 'Completed'),
(5, 'Completed'),(5, 'Completed'),(5, 'Failure'),(5, 'Running'),(5, 'Completed'),
(6, 'Completed'),
(7, 'Failure'),(7, 'Completed');
/* End Demo Data */
SELECT COALESCE(d.Name,'Unknown') AS Department,
COUNT(CASE WHEN t.Status = 'Completed' THEN 1 END) AS Completed,
COUNT(CASE WHEN t.Status = 'Failure' THEN 1 END ) AS Failure,
COUNT(CASE WHEN t.Status = 'Running' THEN 1 END ) AS Running,
COUNT(t.Status) AS Total
FROM @Departments d
INNER JOIN @Tickets t
ON t.DepartmentID = d.DepartmentID
GROUP BY d.Name
ORDER BY Department
Department Completed Failure Running Total
-----------------------------------------------------------------
Architect 1 0 0 1
Business Intelligence Analyst 2 0 0 2
Data Analyst 5 3 2 10
Database 1 0 0 1
Information Technology 3 1 1 5
Technical Analyst 1 0 0 1
This will find all the tickets with a matching department ID in the tickets table, but it will not return any tickets which have a non-matching value in the departmentID column, a NULL for example.
If you change your approach to something like:
SELECT COALESCE(d.Name,'Unknown') AS Department,
COUNT(CASE WHEN t.Status = 'Completed' THEN 1 END) AS Completed,
COUNT(CASE WHEN t.Status = 'Failure' THEN 1 END ) AS Failure,
COUNT(CASE WHEN t.Status = 'Running' THEN 1 END ) AS Running,
COUNT(t.Status) AS Total
FROM @Tickets t
LEFT OUTER JOIN @Departments d
ON t.DepartmentID = d.DepartmentID
GROUP BY d.Name
ORDER BY Department
You're now asking for all the tickets, and joining that to the departments with a LEFT OUTER JOIN which allows non-matching rows from Tickets to be returned as well. When there is a non-matching (including NULL) value in the departmentID column, it's still part of the result set.
Department Completed Failure Running Total
-----------------------------------------------------------------
Architect 1 0 0 1
Business Intelligence Analyst 2 0 0 2
Data Analyst 5 3 2 10
Database 1 0 0 1
Information Technology 3 1 1 5
Technical Analyst 1 0 0 1
Unknown 1 1 0 2
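The difference between the two joins can be reproduced on a tiny data set. The following is only a sketch, using Python's sqlite3 module so it runs without a SQL Server instance (an assumption; the table contents are made up and much smaller than the demo data above). It also shows why counting a column that is NULL gives 0 while COUNT(*) still counts the row, which is the same behaviour the SSRS COUNT expression in the question runs into.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Departments (DepartmentID INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE Tickets (TicketID INTEGER PRIMARY KEY, DepartmentID INTEGER, Status TEXT);
INSERT INTO Departments (DepartmentID, Name) VALUES (1, 'Architect'), (2, 'Database');
-- two tickets have no matching department: one NULL, one unknown id 7
INSERT INTO Tickets (DepartmentID, Status) VALUES
    (1, 'Completed'), (2, 'Running'), (NULL, 'Failure'), (7, 'Completed');
""")

SELECT_COUNTS = """
SELECT COALESCE(d.Name, 'Unknown') AS Department,
       COUNT(*)              AS Total,    -- counts every joined row
       COUNT(d.DepartmentID) AS Matched   -- skips rows where the column is NULL
FROM Tickets t
{join} Departments d ON t.DepartmentID = d.DepartmentID
GROUP BY COALESCE(d.Name, 'Unknown')
ORDER BY Department
"""

inner_rows = conn.execute(SELECT_COUNTS.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(SELECT_COUNTS.format(join="LEFT OUTER JOIN")).fetchall()
print(inner_rows)
print(left_rows)
```

With the inner join the unmatched tickets disappear entirely; with the left join they show up under Unknown with Total 2 and Matched 0.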

How to update the parent field in SQL Server

My data looks like this
ID Text IsParent ParentID
-------------------------
1 A 1 NULL
2 B 0 NULL
3 C 0 NULL
4 D 0 NULL
5 E 1 NULL
6 F 0 NULL
7 G 1 NULL
8 H 0 NULL
I want to fill ParentID with the previous parentID.
Data is ordered so
ID : 2,3,4 have parentID : 1
ID : 6 has parentID : 5
ID : 8 has parentID : 7
How to do this with SQL?
I have tried with a cursor, but it is way too slow.
Here is my code:
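-- variable types are assumed; adjust them to the real column types
DECLARE @ID INT, @Text NVARCHAR(50), @IsParent BIT, @ParentID INT, @LastParentID INT;
DECLARE cur1 CURSOR FOR
    SELECT ID, Text, IsParent, ParentID
    FROM x2
    ORDER BY ID
OPEN cur1
FETCH NEXT FROM cur1 INTO @ID, @Text, @IsParent, @ParentID
WHILE @@FETCH_STATUS = 0
BEGIN
    IF @IsParent = 1
    BEGIN
        -- remember the most recent parent row
        SET @LastParentID = @ID
    END
    ELSE
    BEGIN
        UPDATE X2
        SET ParentID = @LastParentID
        WHERE ID = @ID
    END
    FETCH NEXT FROM cur1 INTO @ID, @Text, @IsParent, @ParentID
END;
CLOSE cur1;
DEALLOCATE cur1;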

You can do this with APPLY. The premise is to find the parent record with the highest ID, where the ID is lower than the child record.
Example
DECLARE @x2 TABLE (ID INT NOT NULL, Text CHAR(1), IsParent BIT, ParentID INT);
INSERT @x2 (ID, Text, IsParent)
VALUES
    (1, 'A', 1), (2, 'B', 0), (3, 'C', 0), (4, 'D', 0),
    (5, 'E', 1), (6, 'F', 0), (7, 'G', 1), (8, 'H', 0);
UPDATE c
SET ParentID = p.ID
FROM @x2 AS c
CROSS APPLY
    ( SELECT TOP 1 ID
      FROM @x2 AS p
      WHERE p.IsParent = 1   -- Is a parent record
      AND p.ID < c.ID        -- ID is lower than child record
      ORDER BY p.ID DESC     -- Order descending to get the highest ID
    ) AS p
WHERE c.IsParent = 0
AND c.ParentID IS NULL;
SELECT *
FROM @x2;
OUTPUT
ID Text IsParent ParentID
---------------------------------
1 A 1 NULL
2 B 0 1
3 C 0 1
4 D 0 1
5 E 1 NULL
6 F 0 5
7 G 1 NULL
8 H 0 7
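The same premise (highest parent ID below the child's ID) can also be written without APPLY, as a correlated subquery with MAX. This is a different formulation from the answer above, sketched here through Python's sqlite3 module so it can be run anywhere (an assumption; SQLite has no CROSS APPLY). The table name x2 follows the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE x2 (ID INTEGER PRIMARY KEY, Text TEXT, IsParent INTEGER, ParentID INTEGER)"
)
conn.executemany(
    "INSERT INTO x2 (ID, Text, IsParent) VALUES (?, ?, ?)",
    [(1, "A", 1), (2, "B", 0), (3, "C", 0), (4, "D", 0),
     (5, "E", 1), (6, "F", 0), (7, "G", 1), (8, "H", 0)],
)

# For every child row, look up the highest parent ID that is lower than its own ID.
conn.execute("""
UPDATE x2
SET ParentID = (SELECT MAX(p.ID)
                FROM x2 AS p
                WHERE p.IsParent = 1
                  AND p.ID < x2.ID)
WHERE IsParent = 0
  AND ParentID IS NULL
""")

parents = dict(conn.execute("SELECT ID, ParentID FROM x2 ORDER BY ID"))
print(parents)
```

A child row that appears before any parent row keeps a NULL ParentID, because MAX over no rows is NULL.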

You can use CTEs and window functions to achieve that.
First we create a group id (cid) as a running sum of IsParent, so that every parent row and the child rows following it share the same cid. Then we pick the minimum ID within each cid, which is the ID of the parent row, and finally we update the rows where IsParent is 0.
Try the following:
;WITH cte AS
(
SELECT *, sum(t.IsParent) OVER (ORDER BY id) cid
FROM #t t
),
cte2 AS
(
SELECT *, min(id) OVER (PARTITION BY cid ORDER BY id) pid
FROM cte c
)
UPDATE t
SET
t.ParentID = pid
FROM #t t
JOIN cte2 c ON c.id = t.ID
WHERE c.IsParent = 0
db<>fiddle demo.
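To make the running-sum idea easier to follow, here is the same logic traced in plain Python rather than SQL. It is only an illustration of how cid and pid come about, assuming the rows are already sorted by ID; the variable names are made up.

```python
from itertools import accumulate

# (ID, Text, IsParent), sorted by ID as in the question
rows = [(1, "A", 1), (2, "B", 0), (3, "C", 0), (4, "D", 0),
        (5, "E", 1), (6, "F", 0), (7, "G", 1), (8, "H", 0)]

# Running sum of IsParent: the value goes up by one at every parent row,
# so a parent and the children after it end up with the same cid.
cids = list(accumulate(is_parent for _, _, is_parent in rows))

# The first (lowest) ID seen for each cid is the parent row of that group.
first_id = {}
for (row_id, _, _), cid in zip(rows, cids):
    first_id.setdefault(cid, row_id)

# Only child rows receive a ParentID.
parent_ids = {
    row_id: first_id[cid]
    for (row_id, _, is_parent), cid in zip(rows, cids)
    if is_parent == 0
}
print(cids)        # [1, 1, 1, 1, 2, 2, 3, 3]
print(parent_ids)  # {2: 1, 3: 1, 4: 1, 6: 5, 8: 7}
```

One thing this makes visible: a child row that comes before the first parent gets cid 0 and would be assigned its own ID, so such rows may need to be excluded.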

The quickest way would be to insert child records with the parent id in the first place, instead of populating the parent ids after the fact. From code, you would first insert the parent records and get back the newly generated parent ids, then insert the child records with those ids.
Queries like the ones suggested will get slower and harder to maintain as the data grows. Just because you can, doesn't mean you should.
Also, as a side note: if you plan on having child-of-child records with an unknown depth, I would recommend looking into a hierarchyid column to avoid recursion.

SQL - View which duplicates values for missing entries

I need to create a view, which would propagate missing values by creating duplicates. Here is an example:
With such table:
NR|Description|FK
0 |Text1 |0
0 |Text2 |1
0 |Text4 |2
1 |Text3 |0
Create such view:
NR|Description|FK
0 |Text1 |0
0 |Text2 |1
0 |Text4 |2
1 |Text3 |0
1 |Text3 |1
1 |Text3 |2
The original table will always have at least one entry with specific NR column and column FK valued 0. So in short, if there is a row with unique NR and column FK with value 0 and there is no row with FK valued 1 then create one based on the row with FK value 0
Edit:
There can be more than one unique FK value

This should do it
declare @T table (NR int, Description varchar(10), FK int);
insert into @T values
  (0, 'Text1', 0)
, (0, 'Text2', 1)
, (1, 'Text3', 0);
select t1.NR, t1.Description, t1.FK
from @T t1
union
select t1.NR, t1.Description, 1
from @T t1
left join @T t2
  on t2.NR = t1.NR
 and t1.FK = 0
 and t2.fk = 1
where t2.NR is null;
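The query above fills in FK = 1 only, while the edit to the question says there can be more FK values. One possible generalisation, sketched here and not taken from the answer: build every NR and FK combination with a cross join, then use the row's own Description when it exists and fall back to the FK = 0 row otherwise. It is run through Python's sqlite3 module to keep it self-contained (an assumption; the thread is about SQL Server). The view name Filled and the aliases are made up, and the sketch assumes (NR, FK) is unique and every NR has an FK = 0 row, as the question states.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE T (NR INTEGER, Description TEXT, FK INTEGER);
INSERT INTO T VALUES (0, 'Text1', 0), (0, 'Text2', 1), (0, 'Text4', 2), (1, 'Text3', 0);

CREATE VIEW Filled AS
SELECT n.NR AS NR,
       -- own row if present, otherwise the FK = 0 row of the same NR
       COALESCE(own.Description, zero.Description) AS Description,
       f.FK AS FK
FROM (SELECT DISTINCT NR FROM T) AS n
CROSS JOIN (SELECT DISTINCT FK FROM T) AS f
JOIN T AS zero
  ON zero.NR = n.NR AND zero.FK = 0
LEFT JOIN T AS own
  ON own.NR = n.NR AND own.FK = f.FK;
""")

filled = conn.execute("SELECT NR, Description, FK FROM Filled ORDER BY NR, FK").fetchall()
for row in filled:
    print(row)
```

On the sample data from the question this yields the three original NR = 0 rows plus Text3 repeated for FK 0, 1 and 2.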

You could do something like this:
SELECT [NR], [Description], newtbl2.[FK] FROM (
SELECT [NR], [Description], newtbl.[FK] FROM [dbo].[myTable] oldtbl
LEFT OUTER JOIN (select [NR], [Description], 1 as [FK]) newtbl ON newtbl.[FK] = oldtbl.[FK]
) joined
LEFT OUTER JOIN (select [NR], [Description], 0 as [FK]) newtbl2 ON newtbl2.[FK] = joined.[FK]
What's happening here is: first, I'm left joining a duplicate table with 1 as the FK to the original table. That way, if there are no rows with 1 as the FK, it will create one; if there is such a row, it will just join on that row, giving you the same NR and Description.
Next, I'm joining an additional table with 0 as the FK. This is basically the same as the first step, but with 0 instead of 1.
The aliasing may need some work, but in principle this approach should work.

copy same records with Different ParentID from Existing ParentID

I want to copy records within the same table. If ParentID is null, I want to copy the parent and its children under another parent name (I'll use REPLACE for that).
If it is not null, I want to copy the record into the same parent, but only if the same record does not already exist under that parent.
create table #Table(ID int primary key , Name varchar(10),ParentID int )
insert into #Table
select 1,'Suresh', -1
union
select 2,'Naresh', 1
union
select 3,'John', 1
union
select 4,'Kumar',3
union
Select 5,'Dale John',3
select * from #Table
ID Name ParentID
-------------------
1 Suresh -1
2 Naresh 1
3 John 1
4 Kumar 3
5 Dale John 3
if I select ID = 1, then all ID child should insert into same table if Name not "Suresh" and ParentID not -1.
If I select ID = 3, then ID 3 and Child should insert into same table if Name and ParentID not John & 1

Take those into Temp table and make the join between them.
IF not exists(---)
then insert into table.

Tsql group by clause with exceptions

I have a problem with a query.
This is the data (order by Timestamp):
Data
ID Value Timestamp
1 0 2001-1-1
2 0 2002-1-1
3 1 2003-1-1
4 1 2004-1-1
5 0 2005-1-1
6 2 2006-1-1
7 2 2007-1-1
8 2 2008-1-1
I need to extract distinct values and the date of their first occurrence. The exception here is that I need to group them only if they are not interrupted by a new value in that timeframe.
So the data I need is:
ID Value Timestamp
1 0 2001-1-1
3 1 2003-1-1
5 0 2005-1-1
6 2 2006-1-1
I've made this work with a complicated query, but I'm sure there is an easier way to do it; I just can't think of it. Could anyone help?
This is what I started with, and it could probably be made to work. It is a query that should locate where a value changes.
SELECT * FROM Data d1 join Data d2 ON d1.Timestamp < d2.Timestamp and
d1.Value <> d2.Value
It could probably be done with good use of row_number, but I can't manage it.

Sample data:
declare @T table (ID int, Value int, Timestamp date)
insert into @T(ID, Value, Timestamp) values
(1, 0, '20010101'),
(2, 0, '20020101'),
(3, 1, '20030101'),
(4, 1, '20040101'),
(5, 0, '20050101'),
(6, 2, '20060101'),
(7, 2, '20070101'),
(8, 2, '20080101')
Query:
;With OrderedValues as (
select *,ROW_NUMBER() OVER (ORDER By TimeStamp) as rn --TODO - specific columns better than *
from @T
), Firsts as (
select
ov1.* --TODO - specific columns better than *
from
OrderedValues ov1
left join
OrderedValues ov2
on
ov1.Value = ov2.Value and
ov1.rn = ov2.rn + 1
where
ov2.ID is null
)
select * --TODO - specific columns better than *
from Firsts
I didn't rely on the ID values being sequential and without gaps. If that's the situation, you can omit OrderedValues (using the table and ID in place of OrderedValues and rn). The second query simply finds rows where there isn't an immediate preceding row with the same Value.
Result:
ID Value Timestamp rn
----------- ----------- ---------- --------------------
1 0 2001-01-01 1
3 1 2003-01-01 3
5 0 2005-01-01 5
6 2 2006-01-01 6
You can order by rn if you need the results in this specific order.
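An alternative to the self join on rn is the LAG window function, which reads the previous row's Value directly. This is a different technique from the answer above, shown as a sketch through Python's sqlite3 module (an assumption made to keep it self-contained; it needs SQLite 3.25 or later for window functions, and LAG is also available in SQL Server 2012 and later). It assumes Value is never NULL, because a NULL previous value is used to detect the first row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Data (ID INTEGER PRIMARY KEY, Value INTEGER, Timestamp TEXT)")
conn.executemany(
    "INSERT INTO Data VALUES (?, ?, ?)",
    [(1, 0, "2001-01-01"), (2, 0, "2002-01-01"), (3, 1, "2003-01-01"),
     (4, 1, "2004-01-01"), (5, 0, "2005-01-01"), (6, 2, "2006-01-01"),
     (7, 2, "2007-01-01"), (8, 2, "2008-01-01")],
)

QUERY = """
SELECT ID, Value, Timestamp
FROM (
    SELECT ID, Value, Timestamp,
           LAG(Value) OVER (ORDER BY Timestamp) AS PrevValue
    FROM Data
) AS s
-- keep the very first row and every row whose value differs from the row before it
WHERE PrevValue IS NULL OR PrevValue <> Value
ORDER BY Timestamp
"""

firsts = conn.execute(QUERY).fetchall()
for row in firsts:
    print(row)
```

Each returned row is the start of a run of equal values, so the value 0 appears twice because its two runs are interrupted by the value 1.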
