Using t-sql to select a dataset with duplicate values removed - sql-server

I have a set of tables in SQL Server 2005 which contain timeseries data. There is hence a datetime field and a set of values.
CREATE TABLE [dbo].[raw_data](
[Time] [datetime] NULL,
[field1] [float] NULL,
[field2] [float] NULL,
[field3] [float] NULL
)
The datetime field is unfortunately not a unique key, and there appear to be a lot of datetime values with multiple (non-identical) entries - hence DISTINCT doesn't work.
I want to select data from these tables for insertion into a new, properly indexed table.
Hence I want a select query that will return a dataset with a single row entry for each Time. I am not concerned which set of values is selected for a given time, as long as one (and only one) is chosen.
There are a LOT of these tables, so I do not have time to find and manually purge duplicate values, so a standard HAVING COUNT(*)>1 query is not applicable. There are also too many duplicates to just ignore those time values altogether.
Any ideas? I was thinking of some kind of cursor based on PARTITION BY, but got stuck beyond that point.

You don't need a cursor:
SELECT tmp.*
FROM
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY [Time] ORDER BY [Time]) AS RowNum
FROM raw_data
) AS tmp
WHERE tmp.RowNum = 1
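Since the stated goal is to load the de-duplicated rows into a new, properly indexed table, the same derived table can feed an INSERT directly. A sketch, where clean_data is a hypothetical target table:

```sql
-- Hypothetical target table, properly keyed on [Time]
CREATE TABLE [dbo].[clean_data](
[Time] [datetime] NOT NULL PRIMARY KEY,
[field1] [float] NULL,
[field2] [float] NULL,
[field3] [float] NULL
)

INSERT INTO [dbo].[clean_data] ([Time], [field1], [field2], [field3])
SELECT [Time], [field1], [field2], [field3]
FROM
(
    SELECT *, ROW_NUMBER() OVER (PARTITION BY [Time] ORDER BY [Time]) AS RowNum
    FROM raw_data
) AS tmp
WHERE tmp.RowNum = 1
```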

Related

How to define a view that will give zeros on days with no recording?

I have a couple tables with a view that joins them
CREATE TABLE Valve (
ValveID int IDENTITY NOT NULL,
ValveName varchar(100) NOT NULL,
ValveOwner varchar(100) NOT NULL
)
CREATE TABLE ValveRecording (
ValveID int NOT NULL,
Date date NOT NULL,
Measure varchar(100) NOT NULL,
Value numeric NOT NULL
)
ALTER VIEW ValveRecordingView
AS
SELECT
v.ValveName,
v.ValveOwner,
vr.Date,
vr.Measure,
vr.Value
FROM Valve v
LEFT OUTER JOIN ValveRecording vr on v.ValveID = vr.ValveID
*Note above was just typed in - so may have errors.
The problem with the view is that if it is queried for a date range in which no measurement was made, there is no row for the measure-value pair. I would like to restate the view so that some value is returned for each date-measure-value tuple; if one isn't present, return a zero.
I think that it's possible, but the SQL is a bit beyond me. I assume
that it involves a UNION ALL with a query that gets date-measure-value
tuples which aren't already present. Probably an ugly query but that's
okay.
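A sketch of one way to get there: build a calendar of dates, cross join it with each valve and each distinct measure, then left join the actual recordings and default missing values to zero. The @From/@To range and the recursive calendar CTE are illustrative assumptions on top of the question's schema:

```sql
DECLARE @From date = '2023-01-01', @To date = '2023-01-31'  -- example range

;WITH Dates AS (
    SELECT @From AS d
    UNION ALL
    SELECT DATEADD(day, 1, d) FROM Dates WHERE d < @To
)
SELECT v.ValveName,
       v.ValveOwner,
       dt.d                  AS [Date],
       m.Measure,
       COALESCE(vr.Value, 0) AS Value
FROM Dates dt
CROSS JOIN Valve v
CROSS JOIN (SELECT DISTINCT Measure FROM ValveRecording) m
LEFT JOIN ValveRecording vr
       ON vr.ValveID = v.ValveID
      AND vr.Date    = dt.d
      AND vr.Measure = m.Measure
OPTION (MAXRECURSION 366)
```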

SQL Server trigger inserting duplicates

I'm debugging a data pipeline that consists of several tables and triggers.
We have a table called Dimension_Date defined as such:
CREATE TABLE [dbo].[Dimension_Date]
(
[id_date] [bigint] IDENTITY(1,1) NOT NULL,
[a_date] [datetime] NOT NULL,
[yoy_date] [date] NOT NULL
)
We also have three different tables in which other processes are inserting data (with transactions, although I don't have access to these processes). Each table contains a Datetime column (x_date) that needs to be inserted in the Dimension table only if that datetime doesn't exist already in the Dimension table. If it already exists, it shouldn't be inserted.
On each of these tables there is a trigger that, among other things, checks if the datetime exists in the Dimension table and if it isn't, inserts the new date. Once all the actions are performed, the content of TABLE_1, 2 and 3 are deleted. The triggers (of tables TABLE_1, 2 and 3) contain the following query:
CREATE TRIGGER [dbo].[insert_Date_Trigger]
ON [dbo].[TABLE_1]
AFTER INSERT
AS
BEGIN
INSERT INTO Dimension_Date (a_date, yoy_date)
SELECT DISTINCT x_date, DateADD(yy, -100, CONVERT(date, x_date))
FROM TABLE_1
WHERE NOT EXISTS (SELECT id_date FROM Dimension_Date
WHERE a_date = TABLE_1.a_date);
(...)
DELETE FROM TABLE_1
END
The problem is that these triggers are inserting duplicates in the Dimension table (two different id_date for the same a_date field), and I can't figure out where the problem is. Could it be that the processes might not be using transactions? Is there anything wrong with the query?
Any help would be greatly appreciated.
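Two things stand out. First, the NOT EXISTS subquery correlates on TABLE_1.a_date, but TABLE_1's column is x_date, and the trigger reads the base table instead of the inserted pseudo-table. Second, even with that fixed, two concurrent transactions can both pass the NOT EXISTS check before either commits, producing duplicates. A hedged sketch of the insert portion only, serialized with the common UPDLOCK, HOLDLOCK pattern (a unique constraint on a_date would be a more robust safety net):

```sql
CREATE TRIGGER [dbo].[insert_Date_Trigger]
ON [dbo].[TABLE_1]
AFTER INSERT
AS
BEGIN
    -- Read only the rows from this insert, and hold a range lock on the
    -- dimension table so concurrent triggers cannot both pass the check.
    INSERT INTO Dimension_Date (a_date, yoy_date)
    SELECT DISTINCT i.x_date, DATEADD(yy, -100, CONVERT(date, i.x_date))
    FROM inserted i
    WHERE NOT EXISTS (SELECT 1
                      FROM Dimension_Date d WITH (UPDLOCK, HOLDLOCK)
                      WHERE d.a_date = i.x_date)
END
```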

SQL Server - Order Identity Fields in Table

I have a table with this structure:
CREATE TABLE [dbo].[cl](
[ID] [int] IDENTITY(1,1) NOT NULL,
[NIF] [numeric](9, 0) NOT NULL,
[Name] [varchar](80) NOT NULL,
[Address] [varchar](100) NULL,
[City] [varchar](40) NULL,
[State] [varchar](30) NULL,
[Country] [varchar](25) NULL,
Primary Key([ID],[NIF])
);
Imagine that this table has 3 records. Record 1, 2, 3...
Whenever I delete record number 2, the IDENTITY field generates a gap: the table then has Record 1 and Record 3. It's not correct!
Even if I use:
DBCC CHECKIDENT('cl', RESEED, 0)
It does not solve my problem because it will set the ID of the next inserted record to 1, and that's not correct either because the table will then have duplicate IDs.
Does anyone have a clue about this?
No database is going to reseed or recalculate an auto-incremented field/identity to use values in between ids as in your example. This is impractical on many levels, but some examples may be:
Integrity - since a re-used id could mean records in other systems are referring to an old value when the new value is saved
Performance - trying to find the lowest gap for each value inserted
In MySQL, this is not really happening either (at least in InnoDB or MyISAM - are you using something different?). In InnoDB, the behavior is identical to SQL Server: the counter is managed outside of the table, so deleted values or rolled-back transactions leave gaps between the last value and the next insert. In MyISAM, the value is calculated at time of insertion instead of being managed through an external counter. This calculation is what gives the perception of being recalculated - it's just never calculated until actually needed (MAX(Id) + 1). Even this won't insert inside gaps (like the id = 2 in your example).
Many people will argue if you need to use these gaps, then there is something that could be improved in your data model. You shouldn't ever need to worry about these gaps.
If you insist on using those gaps, your fastest method would be to log deletes in a separate table, then use an INSTEAD OF INSERT trigger to perform the inserts with your intended keys: first look for records in the deletions table to re-use (deleting them to prevent re-use), then fall back to MAX(Id) + 1 for any additional rows.
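A rough single-row sketch of that idea, assuming for this variant that ID is a plain int rather than an IDENTITY column, with all object names hypothetical:

```sql
-- Log of freed IDs, populated when rows are deleted
CREATE TABLE dbo.cl_free_ids (ID int NOT NULL PRIMARY KEY)
GO

CREATE TRIGGER tr_cl_log_deletes ON dbo.cl
AFTER DELETE
AS
    INSERT INTO dbo.cl_free_ids (ID) SELECT ID FROM deleted
GO

CREATE TRIGGER tr_cl_reuse_ids ON dbo.cl
INSTEAD OF INSERT
AS
BEGIN
    DECLARE @id int

    -- Re-use the lowest freed ID, removing it so it can't be used twice;
    -- otherwise fall back to MAX(ID) + 1.
    SELECT @id = MIN(ID) FROM dbo.cl_free_ids
    IF @id IS NOT NULL
        DELETE FROM dbo.cl_free_ids WHERE ID = @id
    ELSE
        SELECT @id = COALESCE(MAX(ID), 0) + 1 FROM dbo.cl

    -- Single-row sketch; multi-row inserts would need set-based logic
    INSERT INTO dbo.cl (ID, NIF, Name, Address, City, State, Country)
    SELECT @id, NIF, Name, Address, City, State, Country
    FROM inserted
END
```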
I guess what you want is something like this:
create table dbo.cl
(
SurrogateKey int identity(1, 1)
primary key
not null,
ID int not null,
NIF numeric(9, 0) not null,
Name varchar(80) not null,
Address varchar(100) null,
City varchar(40) null,
State varchar(30) null,
Country varchar(25) null,
unique (ID, NIF)
)
go
I added a surrogate key so you'll have the best of both worlds. Now you just need a trigger on the table to "adjust" the ID whenever some prior ID gets deleted:
create trigger tr_on_cl_for_auto_increment on dbo.cl
after delete, update
as
begin
update c
set ID = d.New_ID
from dbo.cl as c
inner join (
select c2.SurrogateKey,
row_number() over (order by c2.SurrogateKey asc) as New_ID
from dbo.cl as c2
) as d
on c.SurrogateKey = d.SurrogateKey
end
go
Of course this solution also implies that you'll have to ensure (whenever you insert a new record) that you check for yourself which ID to insert next.

How do I preserve timestamp values when altering a table in SQL Server (T-SQL)?

Or: how to copy timestamp data from one table to another?
Using SQL Server 2008, and having old design documents which require a table to have its columns ordered in a certain way (with the timestamp column last, something I guess comes from the time when Excel was used instead of an SQL database), I need to add a column in the middle of a table, keeping the timestamp data intact...
Do you know how to instruct SQL Server to do this?
Example T-SQL code:
-- In the beginning...
CREATE TABLE TestTableA
(
[TestTableAId] [int] IDENTITY(1,1) NOT NULL,
[TestTableAText] varchar(max) NOT NULL,
[TestTableATimeStamp] [timestamp] NOT NULL
)
INSERT INTO TestTableA (TestTableAText) VALUES ('TEST')
-- Many years pass...
-- Now we need to add a column to this table, but preserve all data, including timestamp data.
-- Additional requirement: We want SQL Server to keep the TimeStamp column last.
CREATE TABLE TestTableB
(
[TestTableBId] [int] IDENTITY(1,1) NOT NULL,
[TestTableBText] varchar(max) NOT NULL,
[TestTableBInt] [int] NULL,
[TestTableBTimeStamp] [timestamp] NOT NULL
)
-- How do we copy the timestamp data from TestTableATimeStamp to TestTableBTimeStamp?
SET IDENTITY_INSERT [TestTableB] ON
-- Next line will produce errormessage:
-- Cannot insert an explicit value into a timestamp column. Use INSERT with a column list to exclude the timestamp column, or insert a DEFAULT into the timestamp column.
INSERT INTO [TestTableB] (TestTableBId, TestTableBText, TestTableBTimeStamp)
SELECT TestTableAId, TestTableAText, TestTableATimestamp
FROM TestTableA
SET IDENTITY_INSERT [TestTableB] OFF
GO
Suggestions?
Drop table TestTableB first, then run this query (SELECT ... INTO cannot recreate a timestamp column from explicit values, so the copied column should come out as binary(8), which preserves the old values as plain binary data):
SELECT
TestTableAId AS TestTableBId,
TestTableAText AS TestTableBText,
cast(null as int) as TestTableBInt,
TestTableATimestamp AS TestTableBTimeStamp
INTO TestTableB
FROM TestTableA
First check requirements: it depends whether you need to preserve the timestamps. You may not, since they are just ROWVERSION values and don't actually encode the time in any way. So check that.
Why you might not want to preserve them: The only purpose of TIMESTAMP or ROWVERSION is to determine if the row has changed since last being read. If you are adding a column, you may want this to be seen as a change, particularly if the default is non-null.
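A quick sketch of that behavior, with a hypothetical table, showing that any later update would have bumped the value anyway:

```sql
CREATE TABLE dbo.RvDemo (
    Id  int IDENTITY(1,1) PRIMARY KEY,
    Txt varchar(10) NOT NULL,
    Rv  timestamp
)

INSERT INTO dbo.RvDemo (Txt) VALUES ('a')
SELECT Rv FROM dbo.RvDemo WHERE Id = 1   -- some generated value

UPDATE dbo.RvDemo SET Txt = 'b' WHERE Id = 1
SELECT Rv FROM dbo.RvDemo WHERE Id = 1   -- a new, higher value
```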
If you DO need to preserve the timestamps see Dmitry's answer.

Update DateUsed by trigger

I'm trying to update a table with a trigger from another table. I thought this would be a very simple query but the query I first came up with does not work and I don't understand why.
CREATE TABLE [dbo].[Vehicle](
[id] [int] IDENTITY(1,1) NOT NULL,
[plate] [nvarchar](50) NOT NULL,
[name] [nvarchar](50) NOT NULL,
[dateUsed] [datetime] NULL
)
CREATE TABLE [dbo].[Transaction](
[id] [int] IDENTITY(1,1) NOT NULL,
[vehicleId] [int] NOT NULL,
[quantity] [float] NOT NULL,
[dateTransaction] [datetime] NOT NULL,
)
When a transaction is added, I wish to update the Vehicle table. If the added dateTransaction is later than dateUsed, it should be updated so that the dateUsed field always contains the latest date for that specific vehicle.
I would think that this trigger should do the trick.. but it does not:
UPDATE [Vehicle]
SET [dateUsed] =
CASE
WHEN [dateUsed] < [Transaction].[dateTransaction]
OR [dateUsed] IS NULL
THEN [Transaction].[dateTransaction]
ELSE [dateUsed]
END
FROM [Transaction]
WHERE [Vehicle].[id]=[Transaction].[vehicleId]
It looks good to me... It should go over all newly inserted records and update the dateUsed field: if the dateTransaction is newer, use that one; if not, keep the current one. But I seem to be missing something, because it's not updating to the latest date. It does match one of the transactions of that specific vehicle, but not the latest one.
A query that does work:
UPDATE [Vehicle]
SET [dateUsed] = InsertedPartitioned.[dateTransaction]
FROM [Vehicle]
LEFT JOIN (
SELECT
[vehicleId],
[dateTransaction],
ROW_NUMBER() OVER(PARTITION BY [VehicleId] ORDER BY [dateTransaction] DESC) AS RC
FROM [Inserted]) AS InsertedPartitioned
ON InsertedPartitioned.RC=1
AND InsertedPartitioned.[vehicleId]=[Vehicle].[id]
WHERE InsertedPartitioned.[vehicleId] IS NOT NULL
AND ([Vehicle].[dateUsed] IS NULL
OR InsertedPartitioned.[dateTransaction] > [Vehicle].[dateUsed]);
So I have a working solution, and it may even be for the better (I haven't timed it with a large insert), but it bugs the hell out of me not knowing why the first is not working!
Can anyone 'enlighten me'?
why the first is not working
Because of a wonderful aspect of the Microsoft extension to UPDATE that uses a FROM clause:
Use caution when specifying the FROM clause to provide the criteria for the update operation. The results of an UPDATE statement are undefined if the statement includes a FROM clause that is not specified in such a way that only one value is available for each column occurrence that is updated, that is if the UPDATE statement is not deterministic.
(My emphasis).
That is, if more than one row from inserted matches the same row in Vehicle, it is undefined which row will be used to apply the update. All computations within the SET clause are evaluated "as if" in parallel, so a second attempt to update the same row does not observe the results of the first: the observable value of the dateUsed column is always the original value.
In ANSI standard SQL, you'd have to write the UPDATE without using the FROM extension and would thus have to write a correlated subquery, something like:
UPDATE [Vehicle]
SET [dateUsed] = COALESCE((SELECT i.[dateTransaction] FROM inserted i
                           WHERE i.[vehicleId] = Vehicle.[id] and
                                 (i.[dateTransaction] > Vehicle.[dateUsed] or
                                  Vehicle.[dateUsed] IS NULL)),
                          [dateUsed])
WHERE [id] IN (SELECT [vehicleId] FROM inserted)
Which, under the same circumstances, would nicely give you an error about a subquery returning more than one value (for the one inside the COALESCE, not the one in the IN) and thus give you a clue to why it's not working.
But, undeniably, the FROM extension is useful - I just wish it triggered a warning for this kind of situation.
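For what it's worth, the FROM-style update can be made deterministic by collapsing inserted to one row per vehicle before joining, e.g. with MAX (a sketch against the question's tables):

```sql
UPDATE v
SET [dateUsed] = i.maxDate
FROM [Vehicle] v
INNER JOIN (
    SELECT [vehicleId], MAX([dateTransaction]) AS maxDate
    FROM inserted
    GROUP BY [vehicleId]
) i ON i.[vehicleId] = v.[id]
WHERE v.[dateUsed] IS NULL OR i.maxDate > v.[dateUsed]
```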
