T-SQL finding of exactly same values in referenced table - sql-server

Let's assume I have 3 tables in my SQL Server 2008 database:
CREATE TABLE [dbo].[Properties](
[PropertyId] [int] NOT NULL,
[PropertyName] [nvarchar](50) NOT NULL
)
CREATE TABLE [dbo].[Entities](
[EntityId] [int] NOT NULL,
[EntityName] [nvarchar](50) NOT NULL
)
CREATE TABLE [dbo].[PropertyValues](
[EntityId] [int] NOT NULL,
[PropertyId] [int] NOT NULL,
[PropertyValue] [int] NOT NULL
)
The Properties table contains the set of possible properties whose values can be configured for business objects.
The Entities table contains the business objects that are configured from the app.
The PropertyValues table contains the selected property values for business objects. Each business object can have its own set of properties (i.e. "Property1" can be configured for the first object but not for the second one).
My task is to find business objects which are exactly the same as a given object (ones which have exactly the same set of properties with exactly the same values). Performance is critical.
Any suggestions?
[ADDED]
For example, there is an entry in the Entities table with EntityId = 1. In the PropertyValues table there are 3 rows related to this entry:
EntityId PropertyId PropertyValue
1 4 Val4
1 5 Val5
1 6 Val6
The requirement is to find other entries in the Entities table which have 3 related rows in the PropertyValues table, where these rows contain the same data as the rows for EntityId = 1 (apart from the EntityId column).
[ADDED]
Please, see my new question: Best approach to store data which attributes can vary
[BOUNTY1]
Thanks, everyone. The answers were very helpful. My task has become a little more complicated (though the complication may help performance). Please see the details below:
A new table named EntityTypes has been added.
An EntityTypeId column has been added to the Entities and Properties tables.
Now there are several types of entities. Each entity has its own set of properties.
Is it possible to improve performance using this information?
[BOUNTY2]
There is the second complication:
An IsDeleted column has been added to the Properties table.
The PropertyValues table can have values for properties which have already been deleted from the database. Entities which have such properties are considered invalid.
Some entities don't have values for every property in their EntityType's set. These entities are also considered invalid.
The question is: how can I write a script which selects all Entities with an additional IsValid column for them?

;with cteSource as
(
  select PropertyId,
         PropertyValue
  from PropertyValues
  where EntityId = @EntityID
)
select PV.EntityId
from PropertyValues as PV
  inner join cteSource as S
    on PV.PropertyId = S.PropertyId and
       PV.PropertyValue = S.PropertyValue and
       PV.EntityId <> @EntityID
group by PV.EntityId
having count(*) = (select count(*)
                   from cteSource) and
       count(*) = (select count(*)
                   from PropertyValues as PV1
                   where PV1.EntityId = PV.EntityId)
For your addition you can add this where clause:
where -- exclude entities with deleted properties
PV.EntityID not in (select PV2.EntityID
from Properties as P
inner join PropertyValues as PV2
on P.PropertyID = PV2.PropertyID
where P.IsDeleted = 1)
-- exclude entities with missing EntityType
and PV.EntityID not in (select E.EntityID
from Entities as E
where E.EntityType is null)
Edit:
If you want to test the query against some sample data you can do so here:
https://data.stackexchange.com/stackoverflow/q/110243/matching-properties
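If you'd rather try the query locally, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for SQL Server. The sample rows and the exact_matches helper are made up for illustration, and the @EntityID variable becomes a named parameter:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE PropertyValues (
    EntityId      INTEGER NOT NULL,
    PropertyId    INTEGER NOT NULL,
    PropertyValue INTEGER NOT NULL
);
-- Hypothetical sample data, not from the question.
INSERT INTO PropertyValues VALUES
    (1, 4, 40), (1, 5, 50), (1, 6, 60),              -- the given entity
    (2, 4, 40), (2, 5, 50), (2, 6, 60),              -- exact match
    (3, 4, 40), (3, 5, 50),                          -- subset only
    (4, 4, 40), (4, 5, 50), (4, 6, 60), (4, 7, 70),  -- superset
    (5, 4, 40), (5, 5, 50), (5, 6, 61);              -- one value differs
""")

QUERY = """
WITH cteSource AS (
    SELECT PropertyId, PropertyValue
    FROM PropertyValues
    WHERE EntityId = :entity_id
)
SELECT PV.EntityId
FROM PropertyValues AS PV
INNER JOIN cteSource AS S
    ON PV.PropertyId = S.PropertyId
   AND PV.PropertyValue = S.PropertyValue
   AND PV.EntityId <> :entity_id
GROUP BY PV.EntityId
HAVING COUNT(*) = (SELECT COUNT(*) FROM cteSource)
   AND COUNT(*) = (SELECT COUNT(*)
                   FROM PropertyValues AS PV1
                   WHERE PV1.EntityId = PV.EntityId)
"""

def exact_matches(entity_id):
    """Return the ids of entities whose property set equals entity_id's."""
    rows = conn.execute(QUERY, {"entity_id": entity_id}).fetchall()
    return sorted(r[0] for r in rows)

matches = exact_matches(1)
print(matches)
```

The first HAVING condition rejects entities that lack one of the given pairs (entities 3 and 5 here), and the second rejects entities that carry extra properties (entity 4).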

My task is to find business objects which are exactly same as given object (ones which have exactly same set of properties with exactly same values). Performance is critical.
Approach might vary depending on the average number of properties objects will typically have, a few versus dozens.
Assuming that objects have a varying number of properties:
I would start with a composite non-unique index on the dyad (PropertyValues.PropertyId, PropertyValues.PropertyValue) for select-performance.
Then, given an entity ID, I would select its (propertyid, propertyvalue) pairs into a cursor.
[EDIT:
Not sure whether (entityid, propertyid) is unique in your system or if you are allowing multiple instances of the same property id for an entity, e.g. FavoriteColors:
entityid propertyid property value
1 17 blue
1 17 dark blue
1 17 sky blue
1 17 ultramarine
You would also need either a non-unique index on the monad (PropertyValues.entityid) or a composite index on (PropertyValues.entityid,PropertyValues.propertyid); the composite index would be unique if you wanted to prevent the same propertyid from being associated with an entity more than once.
If a property can occur multiple times, you should probably have a CanBeMultivalued flag in your Properties table. You should have a unique index on the triad (entityid, propertyid, propertyvalue) if you wanted to prevent this:
entityid propertyid property value
1 17 blue
1 17 blue
If you have this triad indexed, you would not need (entityid) index or the (entityid, propertyid) composite index in the PropertyValues table.
[/EDIT]
Then I would create a temp table to store matching entity ids.
Then I would iterate my cursor above to grab the given entity's propertyid, propertyvalue pairs, one pair at a time, and issue a select statement with each iteration:
insert into temp
select entityid from PropertyValues
where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
At the end of the loop, you have a non-distinct set of entityids in your temp table for all entities that had at least one of the properties in common with the given object. But the ones you want must have all properties in common.
Since you know how many properties the given object has, you can do the following to fetch only those entities that have all of the properties in common with the given object:
select entityid from temp
group by entityid having count(entityid) = {the number of properties in the given object}
ADDENDUM:
After the first property-value pair of the given object is used to select all potential matches, your temp table would not be missing any possible matches; rather it would contain entityids that were not perfect matches, which must be discarded in some manner, either by being ignored (by your group by having... clause) or by being explicitly removed from the temp table.
Also, after the first iteration of the loop, you could explore the possibility that an inner join between the temp table and the PropertyValues table might offer some performance gain:
select entityid from propertyvalues
>> inner join temp on temp.entityid = propertyvalues.entityid <<
where propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
And you might also try removing entityids from temp after the first iteration:
delete from temp
where not exists
(
select 1 from propertyvalues
where propertyvalues.entityid = temp.entityid
and propertyid = mycursor.propertyid and propertyvalue = mycursor.propertyvalue
)
Alternatively, it would be possible to optimize this looping approach further if you stored some metadata about property frequency. Ideally, when looking for matches for a given entity, you'd want to begin with the least frequently occurring property-value pair. You could order the given object's property-value pairs by ascending frequency, so that in your loop you'd be looking for the rarest one first. That would reduce the set of potential matches to its smallest possible size on the first iteration of the loop.
Of course, if temp were empty at any time after the given object's first property-value pair was used to look for matches, you would know that there are no matches for your given object, because you have found a property-value that no other entity possesses, and you could exit the loop and return a null set.
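To make the loop concrete, here is a rough sketch of the same idea in plain Python rather than T-SQL, with a Counter playing the role of the temp table. The sample rows and the find_matches name are made up, and it assumes (entityid, propertyid) is unique. Note one addition that is not in the steps above: the count check alone would also accept entities that have extra properties, so the sketch also compares each candidate's total property count.

```python
from collections import Counter

# Hypothetical rows of (entityid, propertyid, propertyvalue).
property_values = [
    (1, 4, 40), (1, 5, 50), (1, 6, 60),
    (2, 4, 40), (2, 5, 50), (2, 6, 60),
    (3, 4, 40), (3, 5, 50),
    (4, 4, 40), (4, 5, 50), (4, 6, 60), (4, 7, 70),
]

def find_matches(given_id, rows):
    """Return entities with exactly the same pairs as given_id."""
    given_pairs = [(p, v) for e, p, v in rows if e == given_id]
    totals = Counter(e for e, _, _ in rows)  # number of properties per entity

    hits = Counter()  # stands in for the temp table of matching entity ids
    for pair in given_pairs:  # the "cursor" loop, one pair per iteration
        found = [e for e, p, v in rows if (p, v) == pair and e != given_id]
        if not found:
            return []  # no other entity has this pair, so nothing can match
        hits.update(found)

    needed = len(given_pairs)
    # "group by entityid having count = needed", plus the extra check
    # (my addition) that the candidate has no further properties.
    return sorted(e for e, n in hits.items()
                  if n == needed and totals[e] == needed)

result = find_matches(1, property_values)
print(result)
```

Without the totals check, entity 4 (a superset of entity 1) would be returned as well.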

One way to look at this: if I have all the baseball cards you have, we don't necessarily have the same baseball cards, because I may have more. But if you also have all the baseball cards that I have, then we have exactly the same baseball cards. This is a little more complex as we are looking by team. By team, you could compute the match count, my count, and your count and compare those 3 counts, but that takes 3 joins. This solution uses 2 joins, and I think it would be faster than the 3 join option.
To me the bonus questions did not make sense. There was a change to a table, but that table name did not match any of the tables. A full table description is needed for those bonus questions.
Below is the 2 join option:
select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
from
( select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
from [PropertyValue] as [PV1]
left outer join [PropertyValue] as [PV2]
on [PV2].[EntityId] <> [PV1].[EntityId]
and [PV2].[PropertyId] = [PV1].[PropertyId]
and [PV2].[PropertyValue] = [PV1].[PropertyValue]
group by [PV1].[EntityId], [PV2].[EntityId]
having count(*) = count([PV2].[EntityId])
) as [m1]
join
( select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb]
from [PropertyValue] as [PV1]
right outer join [PropertyValue] as [PV2]
on [PV2].[EntityId] <> [PV1].[EntityId]
and [PV2].[PropertyId] = [PV1].[PropertyId]
and [PV2].[PropertyValue] = [PV1].[PropertyValue]
group by [PV1].[EntityId], [PV2].[EntityId]
having count(*) = count([PV1].[EntityId])
) as [m2]
on [m1].[IDa] = [m2].[IDa] and [m1].[IDb] = [m2].[IDb]
Below is the 3 join count based option:
select [m1].[IDa] as [EntityId1], [m1].[IDb] as [EntityId2]
from
( select [PV1].[EntityId] as [IDa], [PV2].[EntityId] as [IDb], COUNT(*) as [count]
from [PropertyValue] as [PV1]
join [PropertyValue] as [PV2]
on [PV2].[EntityId] <> [PV1].[EntityId]
and [PV2].[PropertyId] = [PV1].[PropertyId]
and [PV2].[PropertyValue] = [PV1].[PropertyValue]
group by [PV1].[EntityId], [PV2].[EntityId]
) as [m1]
join
( select [PV1].[EntityId] as [IDa], COUNT(*) as [count]
from [PropertyValue] as [PV1]
group by [PV1].[EntityId]
) as [m2]
on [m1].[IDa] = [m2].[IDa] and [m1].[count] = [m2].[count]
join
( select [PV2].[EntityId] as [IDb], COUNT(*) as [count]
from [PropertyValue] as [PV2]
group by [PV2].[EntityId]
) as [m3]
on [m1].[IDb] = [m3].[IDb] and [m1].[count] = [m3].[count]

My task is to find business objects which are exactly same as given
object (ones which have exactly same set of properties with exactly
same values).
If the "given object" is described as e.g. #PropertyValues, the query would be:
create table #PropertyValues(
[PropertyId] [int] NOT NULL,
[PropertyValue] [int] NOT NULL
)
insert #PropertyValues
select
3, 3 -- e.g.
declare
@cnt int
select @cnt = count(*) from #PropertyValues
select
EntityId
from
PropertyValues pv
left join #PropertyValues t on t.PropertyId = pv.PropertyId and t.PropertyValue = pv.PropertyValue
group by
EntityId
having
count(t.PropertyId) = @cnt
and count(pv.PropertyId) = @cnt
drop table #PropertyValues
But if performance is that critical, you can create a special indexed field on the Entities table, e.g. EntityIndex varchar(8000), which is filled by a trigger on the PropertyValues table as convert(char(10), PropertyId) + convert(char(10), PropertyValue) (for all properties of the entity, sorted!). That makes a very fast seek on this field possible.
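The signature idea can be sketched outside the database as well. The Python below is only an illustration under my own assumptions: entity_signature and build_index are hypothetical names, the rows are made up, and a dictionary stands in for the indexed EntityIndex column. Each pair is padded to fixed-width fields, mimicking convert(char(10), ...), and the pairs are sorted by property id before concatenation.

```python
from collections import defaultdict

def entity_signature(pairs):
    """Concatenate (propertyid, propertyvalue) pairs, sorted, as fixed-width text."""
    return "".join(f"{p:<10}{v:<10}" for p, v in sorted(pairs))

def build_index(rows):
    """Group rows per entity, then map each signature to the entities having it."""
    by_entity = defaultdict(list)
    for entity_id, prop_id, prop_value in rows:
        by_entity[entity_id].append((prop_id, prop_value))
    index = defaultdict(list)
    for entity_id, pairs in by_entity.items():
        index[entity_signature(pairs)].append(entity_id)
    return by_entity, index

# Hypothetical rows of (entityid, propertyid, propertyvalue).
rows = [
    (1, 6, 60), (1, 4, 40), (1, 5, 50),   # inserted out of order on purpose
    (2, 4, 40), (2, 5, 50), (2, 6, 60),
    (3, 4, 40), (3, 5, 50),
]

by_entity, index = build_index(rows)
signature = entity_signature(by_entity[1])
same_as_1 = [e for e in index[signature] if e != 1]
print(same_as_1)
```

Sorting matters: entity 1's rows arrive in a different order than entity 2's, yet both produce the same signature, so a single equality lookup finds the match.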

I think this is just a simple self-join:
select P2.EntityID,E.EntityName
from PropertyValues P1
inner join PropertyValues P2
on P1.PropertyID = P2.PropertyID
and P1.PropertyValue = P2.PropertyValue
inner join Entity E
on P2.EntityID = E.EntityID
where P1.EntityId = 1
and P2.EntityId <> 1
group by P2.EntityID, E.EntityName

Related

Need to make NULL=Value evaluate to TRUE

I have a dimension table I'm trying to create that would require records with NULLs to be overwritten by a value when all other non-null fields match.
This logic works and shows what I mean by "null=Value evaluates to TRUE":
UPDATE A
SET
A.SSN = COALESCE(A.SSN, B.SSN)
,A.DOB = COALESCE(A.DOB, B.DOB)
,A.ID_1 = COALESCE(A.ID_1, B.ID_1)
,A.ID_2 = COALESCE(A.ID_2, B.ID_2)
,A.ID_3 = COALESCE(A.ID_3, B.ID_3)
,A.ID_4 = COALESCE(A.ID_4, B.ID_4)
FROM #TESTED1 A
INNER JOIN #TESTED1 B
ON (A.SSN = B.SSN
OR A.SSN IS NULL
OR B.SSN IS NULL)
AND (A.DOB = B.DOB
OR A.DOB IS NULL
OR B.DOB IS NULL)
AND (A.ID_1 = B.ID_1
OR A.ID_1 IS NULL
OR B.ID_1 IS NULL)
AND (A.ID_2 = B.ID_2
OR A.ID_2 IS NULL
OR B.ID_2 IS NULL)
AND (A.ID_3 = B.ID_3
OR A.ID_3 IS NULL
OR B.ID_3 IS NULL)
AND (A.ID_4 = B.ID_4
OR A.ID_4 IS NULL
OR B.ID_4 IS NULL)
WHERE A.ArbitraryTableID <> B.ArbitraryTableID
but it gets dramatically slower as the number of records grows: 10k records take 9 seconds, 100k records take 9 minutes, and so on. I'm trying to do an initial load of around 30 million records, and then I will have to evaluate the entire table in a MERGE operation with another 10k records every day.
For example I would need the following three rows (that all exist on the same table) to combine into two rows with all values populated:
Just like this:
Unfortunately members can have multiple IDs so I can't count on any one of these IDs to be unique or even exist at all to cut down on my join conditions.
For performance of this query, make sure you have an index sorting all the criteria you are making your join on.
I did a quick example of what you described:
declare @test table (
    row_name NVARCHAR(50),
    id1 int null,
    id2 int null,
    id3 int null
)
insert into @test values('row1', 1,2,3), ('row2',1,4,5), ('row3',11,null,null), ('row4',null,4,null), ('row5',3,6,5), ('row6',3,null,null)
select *
from @test t1
inner join @test t2 on (
    (t1.id1 = t2.id1
    or t1.id1 is null
    or t2.id1 is null)
    and (
    t1.id2 = t2.id2
    or t1.id2 is null
    or t2.id2 is null)
    and (
    t1.id3 = t2.id3
    or t1.id3 is null
    or t2.id3 is null)
)
where t1.row_name <> t2.row_name
order by t1.row_name
There are a couple of possible problems I see in my test output:
row3 and row4 in my example match because they have none of the same IDs. I'm guessing this is not desired but if you really have several independent systems with different keys, is it possible that you have a lot of rows that fall into this scenario? Every row with id1 set and no other keys and every row with id2 set and no other keys will match.
row1 and row4 do not match even though they should through transitivity (row1.id1 -> row2.id1, row2.id2-> row4.id2)
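Both observations can be reproduced without a database. Here is a small Python sketch of the same "NULL matches anything" rule applied to the six test rows, with None standing in for NULL. The compatible helper is just an illustrative name, and each pair is listed once instead of in both directions as the SQL join would return it.

```python
from itertools import combinations

# Same rows as the test table: (id1, id2, id3), None standing in for NULL.
rows = {
    "row1": (1, 2, 3),
    "row2": (1, 4, 5),
    "row3": (11, None, None),
    "row4": (None, 4, None),
    "row5": (3, 6, 5),
    "row6": (3, None, None),
}

def compatible(a, b):
    """True when every column is equal or NULL on at least one side."""
    return all(x is None or y is None or x == y for x, y in zip(a, b))

pairs = [(m, n) for m, n in combinations(sorted(rows), 2)
         if compatible(rows[m], rows[n])]
print(pairs)
```

row3 and row4 end up paired even though they share no ID at all, while row1 and row4 are never paired, because the rule only compares two rows directly and does not follow the chain through row2.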
Based on your response to my comment, I suggest the following solution:
a master record identifying the member/customer
child records for each master record storing the respective IDs
Replace your UPDATE statement with
INSERTs into the master table for all records in table A that are guaranteed to be unique (e.g. SSN).
INSERTs into the child table for all records in table A with not-NULL ID attributes
mark records in table A as processed by UPDATEing a foreign key column referencing the master records IDENTITY primary key
INSERT into the child table all records from A that you can safely assign to existing master records, and again set the FK
This solution would resolve the performance issues resulting from a 5-way JOIN, and also mark processed source records as processed.

Optimizing SQL Function

I'm trying to optimize or completely rewrite this query. It currently takes about 1500 ms to run. I know the DISTINCTs are fairly inefficient, as is the UNION, but I'm struggling to figure out exactly where to go from here.
I am thinking that the first select statement might not be needed to return the output of:
[Key | User_ID,(User_ID)]
Note: Program and Program_Scenario both use clustered indexes. I can provide a screenshot of the execution plan if needed.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID] (@_CompKey INT)
RETURNS VARCHAR(8000)
AS
BEGIN
DECLARE @UserIDs AS VARCHAR(8000);
SET @UserIDs = '';
SELECT @UserIDs = @UserIDs + ', ' + x.User_ID
FROM
(SELECT DISTINCT (UPPER(p.User_ID)) as User_ID FROM [dbo].[Program] AS p WITH (NOLOCK)
WHERE p.CompKey = @_CompKey
UNION
SELECT DISTINCT (UPPER(ps.User_ID)) as User_ID FROM [dbo].[Program] AS p WITH (NOLOCK)
LEFT OUTER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK) ON p.ProgKey = ps.ProgKey
WHERE p.CompKey = @_CompKey
AND ps.User_ID IS NOT NULL) x
RETURN Substring(@UserIDs, 3, 8000);
END
There are two things happening in this query
1. Locating rows in the [Program] table matching the specified CompKey (@_CompKey)
2. Locating rows in the [Program_Scenario] table that have the same ProgKey as the rows located in (1) above.
Finally, non-null UserIDs from both these sets of rows are concatenated into a scalar.
For step 1 to be efficient, you'd need an index on the CompKey column (clustered or non-clustered)
For step 2 to be efficient, you'd need an index on the join key which is ProgKey on the Program_Scenario table (this likely is a non-clustered index as I can't imagine ProgKey to be PK). Likely, SQL would resort to a loop join strategy - i.e., for each row found in [Program] matching the CompKey criteria, it would need to lookup corresponding rows in [Program_Scenario] with same ProgKey. This is a guess though, as there is not sufficient information on the cardinality and distribution of data.
Ensure the above two indexes are present.
Also, as others have noted the second left outer join is a bit confusing as an inner join is the right way to deal with it.
Per my interpretation, the inner part of the query can be rewritten this way. This is also the query you'd ideally run and optimize before tackling the string concatenation part. The DISTINCT is dropped because a UNION already removes duplicates. Try this version of the query along with the indexes above, and if it provides the necessary boost, then add the string concatenation or the STUFF / FOR XML approach to return a scalar.
SELECT UPPER(p.User_ID) as User_ID
FROM
    [dbo].[Program] AS p WITH (NOLOCK)
WHERE
    p.CompKey = @_CompKey
UNION
SELECT UPPER(ps.User_ID) as User_ID
FROM
    [dbo].[Program] AS p WITH (NOLOCK)
    INNER JOIN [dbo].[Program_Scenario] AS ps WITH (NOLOCK) ON p.ProgKey = ps.ProgKey
WHERE
    p.CompKey = @_CompKey
    AND ps.User_ID IS NOT NULL
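As a quick sanity check that UNION alone removes the duplicates, here is a small sketch using Python's sqlite3 module in place of SQL Server. The table contents and the user_ids_for helper are invented for the example, the NOLOCK hints are dropped because SQLite has no equivalent, and an ORDER BY is added only to make the output predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Program (ProgKey INTEGER PRIMARY KEY, CompKey INTEGER, User_ID TEXT);
CREATE TABLE Program_Scenario (ProgKey INTEGER, User_ID TEXT);
-- Hypothetical sample data.
INSERT INTO Program VALUES (1, 10, 'alice'), (2, 10, 'Bob'), (3, 20, 'carol');
INSERT INTO Program_Scenario VALUES (1, 'ALICE'), (1, 'dave'), (2, NULL), (3, 'erin');
""")

UNION_SQL = """
SELECT UPPER(p.User_ID) AS User_ID
FROM Program AS p
WHERE p.CompKey = :comp_key
UNION
SELECT UPPER(ps.User_ID) AS User_ID
FROM Program AS p
INNER JOIN Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
WHERE p.CompKey = :comp_key
  AND ps.User_ID IS NOT NULL
ORDER BY 1
"""

def user_ids_for(comp_key):
    """Return the de-duplicated, upper-cased user ids as one comma-separated string."""
    rows = conn.execute(UNION_SQL, {"comp_key": comp_key}).fetchall()
    return ", ".join(r[0] for r in rows)

joined = user_ids_for(10)
print(joined)
```

'alice' from Program and 'ALICE' from Program_Scenario collapse into a single entry after UPPER, with no DISTINCT in either branch.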
I am taking a shot in the dark here. I am guessing that the last code you posted is still a scalar function. It also did not have all the logic of your original query. Again, this is a shot in the dark since there is no table definitions or sample data posted.
This might be how this would look as an inline table valued function.
ALTER FUNCTION [dbo].[Fn_Get_Del_User_ID]
(
    @_CompKey INT
) RETURNS TABLE AS RETURN
select MyResult = STUFF(
(
    -- prefix each value with ', '; an unnamed column produces no XML element tags
    SELECT ', ' + x.User_ID
    FROM
    (
        SELECT UPPER(p.User_ID) as User_ID
        FROM dbo.Program AS p
        WHERE p.CompKey = @_CompKey
        UNION
        SELECT UPPER(ps.User_ID) as User_ID
        FROM dbo.Program AS p
        INNER JOIN dbo.Program_Scenario AS ps ON p.ProgKey = ps.ProgKey
        WHERE p.CompKey = @_CompKey
        AND ps.User_ID IS NOT NULL
    ) AS x
    for xml path ('')
), 1, 2, '') -- strip the leading ', '

postgresql: Insert two values in table b if both values are not in table a

I'm doing an assignment where I am to make an SQL database of tournament results. Players can be added by their name, and when the database has at least two players who have not already been assigned to a match, two players should be matched against each other.
For instance, if the tables currently are empty I add Joe as a player. I then also add James and since the table then has two players, who also are not in the matches-table, a new row in the matches-table is created with their p_id set to left_player_P_id and right_player_P_id.
I thought it would be a good idea to create a function and a trigger so that every time a row is added to the player-table, the sql-code would run and create the row in the matches as needed. I am open to other ways of doing this.
I've tried multiple different approaches including SQL - Insert if the number of rows is greater than and Using IF ELSE statement based on Count to execute different Insert statements but I am now at a loss.
Problematic code:
This approach returns a syntax error.
IF ((select count(*) from players_not_in_any_matches) >= 2)
begin
insert into matches values (
(select p_id from players_not_in_any_matches limit 1),
(select p_id from players_not_in_any_matches limit 1 offset 1)
)
end;
Alternative approach (still problematic code):
This approach seems more promising (but less readable). However, it inserts even if there are no rows returned inside the where not exists.
insert into matches (left_player_p_id, right_player_p_id)
select
(select p_id from players_not_in_any_matches limit 1),
(select p_id from players_not_in_any_matches limit 1 offset 1)
where not exists (
select * from players_not_in_any_matches offset 2
);
Tables
CREATE TABLE players (
p_id serial PRIMARY KEY,
full_name text
);
CREATE TABLE matches(
left_player_P_id integer REFERENCES players,
right_player_P_id integer REFERENCES players,
winner integer REFERENCES players
);
Views
-- view for getting all players not currently assigned to a match
create view players_not_in_any_matches as
select * from players
where p_id not in (
select left_player_p_id from matches
) and
p_id not in (
select right_player_p_id from matches
);
Try:
insert into matches (left_player_p_id, right_player_p_id)
select p1.p_id, p2.p_id
from players p1
join players p2
on p1.p_id <> p2.p_id
and not exists(
select 1 from matches m
where p1.p_id in (m.left_player_p_id, m.right_player_p_id)
)
and not exists(
select 1 from matches m
where p2.p_id in (m.left_player_p_id, m.right_player_p_id)
)
limit 1
Anti joins (not-exists operators) in the above query could be further simplified a bit using LEFT JOINs:
insert into matches (left_player_p_id, right_player_p_id)
select p1.p_id, p2.p_id
from players p1
join players p2
on p1.p_id <> p2.p_id
left join matches m1
on p1.p_id in (m1.left_player_p_id, m1.right_player_p_id)
left join matches m2
on p2.p_id in (m2.left_player_p_id, m2.right_player_p_id)
where m1.left_player_p_id is null
and m2.left_player_p_id is null
limit 1
but in my opinion the former query is more readable, while the latter one looks tricky.
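For a quick experiment, here is a sketch of the NOT EXISTS version run through Python's sqlite3 module instead of PostgreSQL. Two details are my own changes for the sake of a repeatable demo: the join uses p1.p_id < p2.p_id and an ORDER BY so that the chosen pair is deterministic, and the add_player helper (a hypothetical name) plays the role that a trigger would play in the real database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE players (p_id INTEGER PRIMARY KEY, full_name TEXT);
CREATE TABLE matches (
    left_player_p_id  INTEGER REFERENCES players,
    right_player_p_id INTEGER REFERENCES players,
    winner            INTEGER REFERENCES players
);
""")

# Pair up the two lowest-numbered players that appear in no match yet.
PAIR_SQL = """
INSERT INTO matches (left_player_p_id, right_player_p_id)
SELECT p1.p_id, p2.p_id
FROM players p1
JOIN players p2
  ON p1.p_id < p2.p_id
 AND NOT EXISTS (SELECT 1 FROM matches m
                 WHERE p1.p_id IN (m.left_player_p_id, m.right_player_p_id))
 AND NOT EXISTS (SELECT 1 FROM matches m
                 WHERE p2.p_id IN (m.left_player_p_id, m.right_player_p_id))
ORDER BY p1.p_id, p2.p_id
LIMIT 1
"""

def add_player(name):
    """Insert a player, then try to create a match from unassigned players."""
    conn.execute("INSERT INTO players (full_name) VALUES (?)", (name,))
    conn.execute(PAIR_SQL)  # inserts nothing when fewer than two are unassigned

for name in ["Joe", "James", "Ann", "Bob", "Eve"]:
    add_player(name)

pairs = conn.execute(
    "SELECT left_player_p_id, right_player_p_id FROM matches ORDER BY 1"
).fetchall()
print(pairs)
```

With five players added, two matches are created and the fifth player stays unassigned until someone else joins.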

SQL query with two tables, count and more info

I'm just learning this stuff and I'm having trouble with this one. I have two tables, STUDENTS and ADVISORS. The students are assigned advisors within the students table using a foreign key attached to the primary key of the advisors table.
The task here is this: Provide a list of all advisors and the number of active students assigned to each. Filter out any advisors with more than 1 student.
The current script is listed below:
select
Students.AdvisorID, count(Students.AdvisorID) as 'TotalStudents'
from
Students
left outer join
Advisors on Students.AdvisorID = Advisors.AdvisorID
where
Students.IsActive = 1
Group by
Students.AdvisorID
Having
count(Students.AdvisorID) < 2
This will output a proper list showing only the advisorID and total students.
I need to also display the
Advisors.FirstName + ' ' + Advisors.LastName as 'AdvisorName'
Any help would be greatly appreciated.
EDIT
students table
advisors table
I think your original attempt is on the right track, but you need to join again to the Advisors table to pull in the first and last name for each adviser. The reason for this is that after doing the aggregation all that remains is an ID for each adviser and a student count.
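SELECT t1.AdvisorID,
       t2.TotalStudents,
       t1.FirstName + ' ' + t1.LastName AS AdvisorName
FROM Advisors t1
INNER JOIN
(
    -- COUNT(s.AdvisorID) ignores the NULL row a LEFT JOIN produces for
    -- advisers without students, so those advisers get a count of 0
    SELECT a.AdvisorID, COUNT(s.AdvisorID) AS TotalStudents
    FROM Advisors a
    LEFT JOIN Students s
        ON a.AdvisorID = s.AdvisorID
       AND s.IsActive = 1
    GROUP BY a.AdvisorID
    HAVING COUNT(s.AdvisorID) < 2
) t2
    ON t1.AdvisorID = t2.AdvisorID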
Other notes:
I chose to LEFT JOIN advisers to students, not the other way around, since you want a statistic for each adviser. Doing the join as you first had it could filter out advisers who do not match to any student. This is not the behavior you want, since an adviser who does not match to any student should have a student count of zero.
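To see the zero-count behaviour in action, here is a small sketch using Python's sqlite3 module rather than SQL Server. The sample advisers and students, and the StudentID column, are assumptions made up for the demo. It folds the name into a single grouped query instead of the derived table above, and uses SQLite's || operator for string concatenation.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Advisors (AdvisorID INTEGER PRIMARY KEY, FirstName TEXT, LastName TEXT);
CREATE TABLE Students (
    StudentID INTEGER PRIMARY KEY,
    AdvisorID INTEGER REFERENCES Advisors(AdvisorID),
    IsActive  INTEGER NOT NULL
);
-- Hypothetical sample data.
INSERT INTO Advisors VALUES (1, 'Walter', 'White'), (2, 'Jesse', 'Pinkman'), (3, 'Saul', 'Goodman');
INSERT INTO Students VALUES (1, 1, 1), (2, 1, 1), (3, 2, 1), (4, 2, 0);
""")

rows = conn.execute("""
SELECT a.AdvisorID,
       a.FirstName || ' ' || a.LastName AS AdvisorName,
       COUNT(s.StudentID) AS TotalStudents   -- counts only matched students
FROM Advisors a
LEFT JOIN Students s
       ON s.AdvisorID = a.AdvisorID
      AND s.IsActive = 1
GROUP BY a.AdvisorID, a.FirstName, a.LastName
HAVING COUNT(s.StudentID) < 2
ORDER BY a.AdvisorID
""").fetchall()

# For comparison: COUNT(*) counts the NULL-extended row of a studentless adviser.
count_star = conn.execute("""
SELECT COUNT(*)
FROM Advisors a
LEFT JOIN Students s ON s.AdvisorID = a.AdvisorID AND s.IsActive = 1
WHERE a.AdvisorID = 3
""").fetchone()[0]

print(rows, count_star)
```

The adviser without students shows up with a count of 0 when a column from Students is counted, whereas COUNT(*) would report 1 for the same adviser.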
Here's a little sample data to work with
USE tempdb
GO
-- drop Students first: it has a foreign key referencing Advisors
IF OBJECT_ID('tempdb.dbo.Students') IS NOT NULL DROP TABLE dbo.Students;
IF OBJECT_ID('tempdb.dbo.Advisors') IS NOT NULL DROP TABLE dbo.Advisors;
CREATE TABLE dbo.Advisors (AdvisorID int primary key, AdvisorName varchar(100));
CREATE TABLE dbo.Students
(
studentID int identity primary key,
AdvisorID int foreign key references dbo.Advisors(AdvisorID)
);
INSERT dbo.Advisors VALUES (1, 'Mr. White'),(2,'Walter Jr.'),(3,'Mr. Pinkman');
INSERT dbo.Students (AdvisorID)
SELECT TOP (20) abs(checksum(newid())%3)+1 FROM sys.all_columns;
No Left Join needed, I think this will give you what you are looking for.
SELECT a.AdvisorID, total_students = COUNT(*)
FROM dbo.Advisors a
INNER JOIN dbo.Students s ON a.AdvisorID = s.AdvisorID
GROUP BY a.AdvisorID
HAVING COUNT(*) < 2;

How do I compare two rows from a SQL database table based on DateTime within 3 seconds?

I have a table of DetailRecords containing records that seem to be "duplicates" of other records, but they have a unique primary key [ID]. I would like to delete these "duplicates" from the DetailRecords table and keep the record with the longest/highest Duration. I can tell that they are linked records because their DateTime field is within 3 seconds of another row's DateTime field and the Duration is within 2 seconds of one another. Other data in the row will also be duplicated exactly, such as Number, Rate, or AccountID, but this could be the same for the data that is not "duplicate" or related.
CREATE TABLE #DetailRecords (
[AccountID] INT NOT NULL,
[ID] VARCHAR(100) NULL,
[DateTime] VARCHAR(100) NULL,
[Duration] INT NULL,
[Number] VARCHAR(200) NULL,
[Rate] DECIMAL(8,6) NULL
);
I know that I will most likely have to perform a self join on the table, but how can I find two rows that are similar within a DateTime range of plus or minus 3 seconds, instead of just exactly the same?
I am having the same trouble with the Duration within a range of plus or minus 2 seconds.
The key is taking the absolute value of the difference between the dates and between the durations. I don't know SQL Server, but here's how I'd do it in SQLite. The technique should be the same; only the specific function names will be different.
SELECT a.id, b.id
FROM DetailRecords a
JOIN DetailRecords b
ON a.id > b.id
WHERE abs(strftime('%s', a.DateTime) - strftime('%s', b.DateTime)) <= 3
AND abs(a.duration - b.duration) <= 2
Taking the absolute value of the difference covers the "plus or minus" part of the range. The self join is on a.id > b.id because a.id <> b.id would return every pair twice (once in each order), and no condition at all would also pair each row with itself.
Given the entries...
ID|DateTime |Duration
1 |2014-01-26T12:00:00|5
2 |2014-01-26T12:00:01|6
3 |2014-01-26T12:00:06|6
4 |2014-01-26T12:00:03|11
5 |2014-01-26T12:00:02|10
6 |2014-01-26T12:00:01|6
I get the pairs...
5|4
2|1
6|1
6|2
And you should really store those dates as DateTime types if you can.
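If you want to reproduce those pairs, here is a runnable sketch of the same query through Python's sqlite3 module, loaded with the six sample rows above. The explicit CAST and the ORDER BY are small additions of mine: the cast makes the seconds arithmetic unambiguous, and the ordering just makes the result easy to compare.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DetailRecords (ID INTEGER PRIMARY KEY, DateTime TEXT, Duration INTEGER);
INSERT INTO DetailRecords VALUES
    (1, '2014-01-26T12:00:00', 5),
    (2, '2014-01-26T12:00:01', 6),
    (3, '2014-01-26T12:00:06', 6),
    (4, '2014-01-26T12:00:03', 11),
    (5, '2014-01-26T12:00:02', 10),
    (6, '2014-01-26T12:00:01', 6);
""")

pairs = conn.execute("""
SELECT a.ID, b.ID
FROM DetailRecords a
JOIN DetailRecords b
  ON a.ID > b.ID   -- each unordered pair once, never a row with itself
WHERE abs(CAST(strftime('%s', a.DateTime) AS INTEGER)
        - CAST(strftime('%s', b.DateTime) AS INTEGER)) <= 3
  AND abs(a.Duration - b.Duration) <= 2
ORDER BY a.ID, b.ID
""").fetchall()

print(pairs)
```

Rows 4 and 1 are exactly 3 seconds apart but differ by 6 in duration, so they are not paired; both conditions have to hold at once.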
You could use a self-referential CTE and compare the DateTime fields.
;WITH CTE AS (
SELECT AccountID,
ID,
DateTime,
rn = ROW_NUMBER() OVER (PARTITION BY AccountID, ID, <insert any other matching keys> ORDER BY AccountID)
FROM table
)
SELECT earliestAccountID = c1.AccountID,
earliestDateTime = c1.DateTime,
recentDateTime = c2.DateTime,
recentAccountID = c2.AccountID
FROM cte c1
INNER JOIN cte c2
ON c1.rn = 1 AND c2.rn = 2 AND c1.DateTime <> c2.DateTime
Edit
I made several assumptions about the data set, so this may not be as relevant as you need. If you're simply looking for difference between possible duplicates, specifically DateTime differences, this will work. However, this does not constrain to your date range, nor does it automatically assume what the DateTime column is used for or how it is set.

Resources