Deleting duplicates in a time series - sql-server

I have a large set of measurements taken every 1 millisecond, stored in a SQL Server 2012 table. Whenever 3 or more rows in a row have the same value, I would like to delete the middle duplicates. Highlighted values in this image of sample data are the ones that I want to delete. Is there a way to do this with a SQL query?

You can do this using a CTE and ROW_NUMBER:
WITH CteGroup AS(
SELECT *,
grp = ROW_NUMBER() OVER(ORDER BY MS) - ROW_NUMBER() OVER(PARTITION BY Value ORDER BY MS)
FROM YourTable
),
CteFinal AS(
SELECT *,
RN_FIRST = ROW_NUMBER() OVER(PARTITION BY grp, Value ORDER BY MS),
RN_LAST = ROW_NUMBER() OVER(PARTITION BY grp, Value ORDER BY MS DESC)
FROM CteGroup
)
DELETE
FROM CteFinal
WHERE
RN_FIRST > 1
AND RN_LAST > 1
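To see why the difference of the two row numbers identifies each run of equal values, here is a minimal sketch that runs the same logic against an in-memory SQLite database from Python. SQLite is only a stand-in for SQL Server here (it needs version 3.25 or later for window functions, and it cannot delete through a CTE, so the delete goes through `WHERE MS IN (...)` instead). The sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE YourTable (MS INTEGER PRIMARY KEY, Value INTEGER)")

# Invented sample data: a run of four 10s, a run of two 20s, a run of three 10s.
rows = [(1, 10), (2, 10), (3, 10), (4, 10), (5, 20), (6, 20), (7, 10), (8, 10), (9, 10)]
conn.executemany("INSERT INTO YourTable VALUES (?, ?)", rows)

conn.execute("""
WITH CteGroup AS (
    -- Within one run of equal values both row numbers advance together,
    -- so their difference is constant for the whole run.
    SELECT MS, Value,
           ROW_NUMBER() OVER (ORDER BY MS)
         - ROW_NUMBER() OVER (PARTITION BY Value ORDER BY MS) AS grp
    FROM YourTable
),
CteFinal AS (
    SELECT MS,
           ROW_NUMBER() OVER (PARTITION BY grp, Value ORDER BY MS)      AS rn_first,
           ROW_NUMBER() OVER (PARTITION BY grp, Value ORDER BY MS DESC) AS rn_last
    FROM CteGroup
)
-- Rows that are neither the first nor the last of their run are the middle duplicates.
DELETE FROM YourTable
WHERE MS IN (SELECT MS FROM CteFinal WHERE rn_first > 1 AND rn_last > 1)
""")

remaining = conn.execute("SELECT MS, Value FROM YourTable ORDER BY MS").fetchall()
print(remaining)
```

Runs of one or two rows are left untouched, because every row in them is either the first or the last of its run.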

I'm sure there must be a more efficient way to do this, but you could join the table to itself twice to find the previous and next value in the list, and then delete all of the entries where all three values are the same.
DELETE FROM tbl
WHERE ms IN
(
SELECT T.ms
FROM tbl T
INNER JOIN tbl T1 ON T.ms = T1.ms + 1
INNER JOIN tbl T2 ON T.ms = T2.ms - 1
WHERE T.value = T1.value AND T.value = T2.value
)
If the table is really big, I can see this blowing tempdb though.
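One possible alternative to the double self-join is LAG and LEAD, which are available from SQL Server 2012 onwards. The sketch below runs the idea against an in-memory SQLite database from Python, as a stand-in for SQL Server (SQLite 3.25 or later is assumed). The sample rows are invented. Note one behavioural difference: the self-join only matches rows whose ms values differ by exactly 1, while LAG and LEAD compare against the neighbouring rows even across a gap in ms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (ms INTEGER PRIMARY KEY, value INTEGER)")

# Invented sample data; there is deliberately no row for ms = 5.
rows = [(1, 5), (2, 5), (3, 5), (4, 7), (6, 7), (7, 7), (8, 7), (9, 5)]
conn.executemany("INSERT INTO tbl VALUES (?, ?)", rows)

conn.execute("""
WITH neighbours AS (
    SELECT ms, value,
           LAG(value)  OVER (ORDER BY ms) AS prev_value,
           LEAD(value) OVER (ORDER BY ms) AS next_value
    FROM tbl
)
-- A row is a middle duplicate when both neighbours carry the same value.
-- The first and last rows have a NULL neighbour and are therefore kept.
DELETE FROM tbl
WHERE ms IN (SELECT ms FROM neighbours
             WHERE value = prev_value AND value = next_value)
""")

remaining_ms = [r[0] for r in conn.execute("SELECT ms FROM tbl ORDER BY ms")]
print(remaining_ms)
```

Whether a gap in ms should break a run depends on the data; if it should, the condition would also need to compare the neighbouring ms values.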

Yes there is
select * from table group by table.field ->value

Related

SQL Newbie - Over Partition?

I have the following query. I am trying to get the Row # to increment whenever the value in Value1 field changes. The SensorData table has 2800 records and the Value1 is either 0 or 3 and changes throughout the day.
SELECT
ROW_NUMBER() OVER(PARTITION BY Value1 ORDER BY Block ASC) AS Row#,
GatewayDetailID, Block, Value1
FROM
SensorData
ORDER BY
Row#
I get the following results:
It seems like it creates only 2 partitions, 0 and 3. It is not restarting the row number every time Value1 changes.
First, instead of creating a permanent table, I just changed it to a temp table.
So, given your example, here is what I came up with:
WITH CTE as(
select ROW_NUMBER() OVER(ORDER BY BLOCK) RN, LAG(Value1,1,VALUE1) OVER (ORDER BY BLOCK) LG,
GatewayDetailID, Block, Value1,Value2,Vaule3
from #tmp
),
CTE2 as (
select *, CASE WHEN LG <> VALUE1 THEN RN ELSE 0 END RowMark
from cte
),
CTE3 AS (
select MIN(Block) BL, RowMark from CTE2
GROUP BY ROwMark
),
CTE4 AS (
SELECT GatewayDetailID,Block,Value1,Value2,Vaule3,RMM from cte2 t1
CROSS APPLY (SELECT MAX(ROWMark) RMM FROM CTE3 t9 where t1.Block >= t9.ROwMark and t1.RN >= t9.RowMark) t2
)
SELECT GateWayDetailID,Block,Value1,Value2,Vaule3, ROW_NUMBER() OVER(Partition by RMM ORDER BY BLOCK) RN
FROM CTE4
ORDER BY BLOCK
I first had to get a row number for all the rows; then, depending on when Value1 changed, I marked that row as the start of a new group. From that I created a CTE with the block and row boundary for each group. Lastly, I cross applied that back to the table to find the group each row belongs to.
From that last CTE I simply applied a ROW_NUMBER() function partitioned by each RowMark group and poof... row numbers.
There may be a better way to do this, but this is how I logically worked through the problem.
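One shorter variant of the same idea, offered as a sketch rather than a replacement for the answer above: flag each row where Value1 differs from the previous row, turn the flags into a group number with a running SUM, and number rows within each group. It is run here against an in-memory SQLite database from Python (a stand-in for SQL Server, SQLite 3.25 or later assumed). The names is_change and grp, the reduced column list, and the sample rows are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE SensorData (Block INTEGER PRIMARY KEY, Value1 INTEGER)")

# Invented sample data: Value1 flips between 0 and 3.
rows = [(1, 0), (2, 0), (3, 3), (4, 3), (5, 3), (6, 0), (7, 0)]
conn.executemany("INSERT INTO SensorData VALUES (?, ?)", rows)

numbered = conn.execute("""
WITH flagged AS (
    -- 1 on the first row and on every row where Value1 differs from the previous row.
    SELECT Block, Value1,
           CASE WHEN Value1 = LAG(Value1) OVER (ORDER BY Block) THEN 0 ELSE 1 END AS is_change
    FROM SensorData
),
grouped AS (
    -- A running total of the change flags gives each run its own group number.
    SELECT Block, Value1,
           SUM(is_change) OVER (ORDER BY Block ROWS UNBOUNDED PRECEDING) AS grp
    FROM flagged
)
SELECT Block, Value1,
       ROW_NUMBER() OVER (PARTITION BY grp ORDER BY Block) AS rn
FROM grouped
ORDER BY Block
""").fetchall()

print(numbered)
```

The row number restarts at 1 each time Value1 changes, including when it returns to a value seen earlier.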

Updating multiple row with random data from another table?

Combining some examples, I came up with the following query (field and table names have been anonymised, so I hope I didn't insert typos).
UPDATE destinationTable
SET destinationField = t2.value
FROM destinationTable t1
CROSS APPLY (
SELECT TOP 1 'SomeRequiredPrefix ' + sourceField as value
FROM #sourceTable
WHERE sourceField <> ''
ORDER BY NEWID()
) t2
Problem
Currently, all records get the same value into destinationField , value needs to be random and different. I'm probably missing something here.
Here's a possible solution. Using CTEs, assign row numbers to both tables based on a random order. Join the tables together on that row number and update the rows accordingly.
;WITH
dt AS
(SELECT *, ROW_NUMBER() OVER (ORDER BY NEWID()) AS RowNum
FROM dbo.destinationtable),
st AS
(SELECT *, ROW_NUMBER() OVER (ORDER BY NEWID()) AS RowNum
FROM dbo.#sourcetable)
UPDATE dt
SET dt.destinationfield = 'SomeRequiredPrefix ' + st.sourcefield
FROM dt
JOIN st ON dt.RowNum = st.RowNum
UPDATED SOLUTION
I used CROSS JOIN to get all possible combinations, since you have fewer rows in the source table. Then I assign random row numbers and take only 1 row for each destination field.
;WITH cte
AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY destinationfield ORDER BY NEWID()) AS Rownum
FROM destinationtable
CROSS JOIN #sourcetable
WHERE sourcefield <> ''
)
UPDATE cte
SET cte.destinationfield = 'SomeRequiredPrefix ' + sourcefield
WHERE cte.Rownum = 1
SELECT * FROM dbo.destinationtable

partition by a count of a field

I have a table t1 with two int fields(id,month) and I have populated it with some values.
What I would like to see as an output is, the maximum of (count of id in a month). I have tried the following code and it works fine:
select id,max(freq) as maxfreq from
(select id,month,count(*) as freq
from t1
group by id,month) a
group by id
order by maxfreq desc
The result is:
ID MAXFREQ
1 3
2 3
3 1
4 1
This is fine. How can I achieve this using the OVER (PARTITION BY ...) clause? And which one is more efficient? In reality my table consists of several thousand records, so doing a subquery won't be a good idea, I guess! Thanks for any help.
;WITH tmp AS (select id, row_number() over (partition by id, month order by id) rn
FROM t1)
SELECT t.id, max(tmp.rn) as maxfreq
from t1 t
INNER JOIN tmp ON tmp.id = t.id
GROUP BY t.id
You can try this -
select id,max(freq) as maxfreq from
(select id,row_number() over (partition by id,month ORDER BY id) as freq
from t1
) a
group by id
order by id,maxfreq desc
but from a performance standpoint, I do not see much difference between your original query and this one.
Same solution, but using a CTE.
Actually, there is no point in forcing window functions onto this problem.
Compare both solutions with Plan Explorer.
;with c1 as
( select id,month,count(*) as freq
from t1
group by id,month
)
select id, max(freq) as maxfreq
from c1
group by id
order by maxfreq desc;

How to extract the last records based on entrydate sql server

I have many duplicate job IDs, but the entry date cannot be duplicated. I always need to fetch a unique job ID based on the latest entry date. I have solved it with the query below, but I would like to know whether there is a better way to write the same SQL for best performance when the data is huge. Please guide me, thanks.
SELECT A.JID,A.EntryDate,RefundDate,Comments,Refund, ActionBy
FROM (
(
select JID, Max(EntryDate) AS EntryDate
from refundrequested
GROUP BY JID
) A
Inner JOIN
(
SELECT JID,ENTRYDATE,refundDate,Comments,refund,ActionBy
from refundrequested
) B
ON A.JID=B.JID AND A.EntryDate = B.EntryDate
)
Using the row_number() function is usually a bit faster:
select *
from (
select row_number() over (partition by jid
order by EntryDate desc) as rn
, *
from refundrequested
) as SubQueryAlias
where rn = 1
Query:
SELECT t1.JID,
t1.EntryDate,
t1.RefundDate,
t1.Comments,
t1.Refund,
t1.ActionBy
FROM refundrequested t1
LEFT JOIN refundrequested t2
ON t2.JID = t1.JID
AND t2.EntryDate > t1.EntryDate
WHERE t2.JID is null
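As a quick check that the ROW_NUMBER() form and the LEFT JOIN anti-join form return the same rows, here is a small sketch run against an in-memory SQLite database from Python (a stand-in for SQL Server, SQLite 3.25 or later assumed). The table is cut down to three columns and the rows are invented. It says nothing about relative performance, which should be compared on the real data and indexes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE refundrequested (JID INTEGER, EntryDate TEXT, Comments TEXT)")

# Invented sample data: JID 1 and 3 have two entries each, JID 2 has one.
rows = [
    (1, "2014-01-01", "first"),
    (1, "2014-02-01", "second"),
    (2, "2014-01-15", "only"),
    (3, "2014-03-01", "a"),
    (3, "2014-03-05", "b"),
]
conn.executemany("INSERT INTO refundrequested VALUES (?, ?, ?)", rows)

# Latest row per JID using ROW_NUMBER().
by_row_number = conn.execute("""
SELECT JID, EntryDate, Comments
FROM (
    SELECT JID, EntryDate, Comments,
           ROW_NUMBER() OVER (PARTITION BY JID ORDER BY EntryDate DESC) AS rn
    FROM refundrequested
) AS latest
WHERE rn = 1
ORDER BY JID
""").fetchall()

# Latest row per JID using an anti-join: keep rows with no later row for the same JID.
by_anti_join = conn.execute("""
SELECT t1.JID, t1.EntryDate, t1.Comments
FROM refundrequested t1
LEFT JOIN refundrequested t2
       ON t2.JID = t1.JID AND t2.EntryDate > t1.EntryDate
WHERE t2.JID IS NULL
ORDER BY t1.JID
""").fetchall()

print(by_row_number)
print(by_row_number == by_anti_join)
```

Both forms rely on EntryDate being unique within a JID, as stated in the question; with ties, ROW_NUMBER() would pick one row arbitrarily while the anti-join would return all tied rows.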

SQL Update with row_number()

I want to update my column CODE_DEST with an incremental number. I have:
CODE_DEST RS_NOM
null qsdf
null sdfqsdfqsdf
null qsdfqsdf
I would like to update it to be:
CODE_DEST RS_NOM
1 qsdf
2 sdfqsdfqsdf
3 qsdfqsdf
I have tried this code:
UPDATE DESTINATAIRE_TEMP
SET CODE_DEST = TheId
FROM (SELECT Row_Number() OVER (ORDER BY [RS_NOM]) AS TheId FROM DESTINATAIRE_TEMP)
This does not work because of the )
I have also tried:
WITH DESTINATAIRE_TEMP AS
(
SELECT
ROW_NUMBER() OVER (ORDER BY [RS_NOM] DESC) AS RN
FROM DESTINATAIRE_TEMP
)
UPDATE DESTINATAIRE_TEMP SET CODE_DEST=RN
But this also does not work because of union.
How can I update a column using the ROW_NUMBER() function in SQL Server 2008 R2?
One more option
UPDATE x
SET x.CODE_DEST = x.New_CODE_DEST
FROM (
SELECT CODE_DEST, ROW_NUMBER() OVER (ORDER BY [RS_NOM]) AS New_CODE_DEST
FROM DESTINATAIRE_TEMP
) x
DECLARE @id INT
SET @id = 0
UPDATE DESTINATAIRE_TEMP
SET @id = CODE_DEST = @id + 1
GO
try this
http://www.mssqltips.com/sqlservertip/1467/populate-a-sql-server-column-with-a-sequential-number-not-using-an-identity/
With UpdateData As
(
SELECT RS_NOM,
ROW_NUMBER() OVER (ORDER BY [RS_NOM] DESC) AS RN
FROM DESTINATAIRE_TEMP
)
UPDATE DESTINATAIRE_TEMP SET CODE_DEST = RN
FROM DESTINATAIRE_TEMP
INNER JOIN UpdateData ON DESTINATAIRE_TEMP.RS_NOM = UpdateData.RS_NOM
Your second attempt failed primarily because you named the CTE same as the underlying table and made the CTE look as if it was a recursive CTE, because it essentially referenced itself. A recursive CTE must have a specific structure which requires the use of the UNION ALL set operator.
Instead, you could just have given the CTE a different name as well as added the target column to it:
With SomeName As
(
SELECT
CODE_DEST,
ROW_NUMBER() OVER (ORDER BY [RS_NOM] DESC) AS RN
FROM DESTINATAIRE_TEMP
)
UPDATE SomeName SET CODE_DEST=RN
This is a modified version of @Aleksandr Fedorenko's answer, adding a WHERE clause:
UPDATE x
SET x.CODE_DEST = x.New_CODE_DEST
FROM (
SELECT CODE_DEST, ROW_NUMBER() OVER (ORDER BY [RS_NOM]) AS New_CODE_DEST
FROM DESTINATAIRE_TEMP
) x
WHERE x.CODE_DEST <> x.New_CODE_DEST AND x.CODE_DEST IS NOT NULL
By adding a WHERE clause I found the performance improved massively for subsequent updates. SQL Server seems to update the row even if the value already exists, and it takes time to do so, so adding the WHERE clause makes it skip over rows where the value hasn't changed. I have to say I was astonished at how fast it could run my query.
Disclaimer: I'm no DB expert, and I'm using PARTITION BY for my clause so it may not be exactly the same results for this query. For me the column in question is a customer's paid order, so the value generally doesn't change once it is set.
Also make sure you have indexes, especially if you have a WHERE clause on the SELECT statement. A filtered index worked great for me as I was filtering based on payment statuses.
My query using PARTITION by
UPDATE UpdateTarget
SET PaidOrderIndex = New_PaidOrderIndex
FROM
(
SELECT PaidOrderIndex, SimpleMembershipUserName, ROW_NUMBER() OVER(PARTITION BY SimpleMembershipUserName ORDER BY OrderId) AS New_PaidOrderIndex
FROM [Order]
WHERE PaymentStatusTypeId in (2,3,6) and SimpleMembershipUserName is not null
) AS UpdateTarget
WHERE UpdateTarget.PaidOrderIndex <> UpdateTarget.New_PaidOrderIndex AND UpdateTarget.PaidOrderIndex IS NOT NULL
-- test to 'break' some of the rows, and then run the UPDATE again
update [order] set PaidOrderIndex = 2 where PaidOrderIndex=3
The 'IS NOT NULL' part isn't required if the column isn't nullable.
When I say the performance increase was massive I mean it was essentially instantaneous when updating a small number of rows. With the right indexes I was able to achieve an update that took the same amount of time as the 'inner' query does by itself:
SELECT PaidOrderIndex, SimpleMembershipUserName, ROW_NUMBER() OVER(PARTITION BY SimpleMembershipUserName ORDER BY OrderId) AS New_PaidOrderIndex
FROM [Order]
WHERE PaymentStatusTypeId in (2,3,6) and SimpleMembershipUserName is not null
I did this for my situation and it worked:
WITH myUpdate (id, myRowNumber )
AS
(
SELECT id, ROW_NUMBER() over (order by ID) As myRowNumber
FROM AspNetUsers
WHERE UserType='Customer'
)
update AspNetUsers set EmployeeCode = FORMAT(myRowNumber,'00000#')
FROM myUpdate
left join AspNetUsers u on u.Id=myUpdate.id
A simple and easy way to update through a derived table (CURSOR and TABLE are reserved words in T-SQL, so they are bracketed here):
UPDATE [Cursor]
SET [Cursor].CODE = [Cursor].New_CODE
FROM (
SELECT CODE, ROW_NUMBER() OVER (ORDER BY [CODE]) AS New_CODE
FROM [Table] WHERE CODE BETWEEN 1000 AND 1999
) [Cursor]
If the table has no relationships, just copy everything into a new table with a row number, then remove the old table and rename the new one to the old name.
Select RowNum = ROW_NUMBER() OVER(ORDER BY(SELECT NULL)) , * INTO cdm.dbo.SALES2018 from
(
select * from SALE2018) as SalesSource
In my case I added a new column and wanted to update it with the equivalent record number for the whole table:
id name new_column (ORDER_NUM)
1 Ali null
2 Ahmad null
3 Mohammad null
4 Nour null
5 Hasan null
6 Omar null
I wrote this query to have the new column populated with the row number
UPDATE My_Table
SET My_Table.ORDER_NUM = SubQuery.rowNumber
FROM (
SELECT id ,ROW_NUMBER() OVER (ORDER BY [id]) AS rowNumber
FROM My_Table
) SubQuery
INNER JOIN My_Table ON
SubQuery.id = My_Table.id
After executing this query I had the numbers 1, 2, 3, ... in my new column.
I update a temp table with the first occurrence of a part, where multiple parts can be associated with a sequence number. RowId = 1 returns the first occurrence, which I join to the temp table using part and sequence number.
update #Tmp
set
#Tmp.Amount=@Amount
from
(SELECT Part, Row_Number() OVER (ORDER BY [Part]) AS RowId FROM #Tmp
where Sequence_Num=@Sequence_Num
)data
where data.Part=#Tmp.Part
and data.RowId=1
and #Tmp.Sequence_Num=@Sequence_Num
I don't have a running ID in order to do what "Basheer AL-MOMANI" suggested.
I did something like this: (joined my table on myself, just to get the Row Number)
update T1 set inID = T2.RN
from (select *, ROW_NUMBER() over (order by ID) RN from MyTable) T1
inner join (select *, ROW_NUMBER() over (order by ID) RN from MyTable) T2 on T2.RN = T1.RN
