MSSql Batch Update of table on Primary Keys - sql-server

I have migrated an Access DB to SQL Server 2008 and found some anomalies carried over from the old database. In both databases the IDs are auto-incrementing and should be in line with the date, but as shown below, some rows have been saved in the wrong chronological order.
**Access:**
ID FileID DateOfTransaction SectionID
64490 95900 02/12/1997 100
64491 95900 04/04/1996 80
64492 95900 25/03/1996 90
**Desired Correct Format:**
ID FileID DateOfTransaction SectionID
64492 95900 02/12/1997 100
64491 95900 04/04/1996 80
64490 95900 25/03/1996 90
The PK (ID) column is linked to several other tables with ON UPDATE CASCADE set.
I need to group by FileID, sort by DateOfTransaction, and update the IDs accordingly.
I need some suggestions on how best to tackle this, as the data is quite sensitive. I have about 50K records to update.
Thanks for reading!

Try this query:
with cte as
(
    select * from
    (
        select *,
               ROW_NUMBER() over (partition by FileID order by DateOfTransaction) as row_num
        from t_Transactions
    ) A
    join
    (
        select ID as B_ID, FileID as B_FileID,
               ROW_NUMBER() over (partition by FileID order by ID) as B_row_num
        from t_Transactions
    ) B
        on A.FileID = B.B_FileID      -- pair rows within the same file
       and A.row_num = B.B_row_num    -- nth row by date gets that file's nth ID
)
select T.ID [Old_ID], CTE.B_ID [New_ID],
       T.FileID, T.DateOfTransaction, T.SectionID
--update T set T.ID=CTE.B_ID
from t_Transactions T
join cte
    on T.ID = CTE.ID
   and CTE.B_FileID = T.FileID
Before updating, you can run the SELECT first and confirm the result.
This query updates the table as per your requirement. You have mentioned that the ID column is linked to several other tables. Please be very careful about this, and make sure that updating the ID column doesn't break anything else.
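One quick way to see which tables reference the ID column and what their cascade settings are (a sketch using the catalog views, assuming the table is named t_Transactions as in the question):
SELECT OBJECT_NAME(fk.parent_object_id) AS referencing_table,
       fk.name AS foreign_key,
       fk.update_referential_action_desc
FROM sys.foreign_keys fk
WHERE fk.referenced_object_id = OBJECT_ID('t_Transactions');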

Designing a database to rely on the order of an artificially-generated key to match the date order of another column is a terrible anti-pattern, NOT best practice in the slightest.
Stop relying on it to represent insertion order. That is the answer. If you need that data, it should be another column separate from your PK. Can't you order by date, anyway? If not, create a new column.
It is always a mistake to invest internal database identifiers with meaning of any kind besides relating rows to each other.
I've seen this exact problem before at a former employer--and the database was rife with all sorts of other design problems as well. FK columns were actually named "frnkeyColumnName" to match the "keyColumnName" they pointed to. Never mind a PK that was also an FK...
Stop the madness!

I would seriously consider whether you need to do this at all. Is there any logic that depends on higher IDs having a later date? Was the data already out of order in the Access database? In that case, it doesn't matter.
If you do decide to proceed, back up the data first. You're probably going to make mistakes.
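For example, a cautious way to run the update (a sketch, assuming the t_Transactions table from the question; the backup table name is made up):
-- Keep an untouched copy before changing anything (indexes/constraints are not copied)
SELECT * INTO t_Transactions_Backup FROM t_Transactions;

BEGIN TRANSACTION;
    -- run the UPDATE from the answer above here, then inspect the result
-- COMMIT TRANSACTION;    -- only once the data looks right
-- ROLLBACK TRANSACTION;  -- otherwise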


Is it possible in SQL Server to create a self-maintaining table with self-references

I'm using Azure SQL Database and MS SQL Server Management Studio, and I'm wondering if it's possible to create a self-referencing table that maintains itself.
I have three tables: Race, Runner, Names. The Race table includes the following columns:
Race_ID (PK)
Race_Date
Race_Distance
Number_of_Runners
The second table is Runner. Runner contains the following columns:
Runner_Id (PK)
Race_ID (Foreign Key)
Name_ID
Finish_Position
Prior_Race_ID
The Names Table includes the following columns:
Full Name
Name_ID
The column of interest is Prior_Race_ID in the Runner table. I'd like to automatically populate this field via a trigger or stored procedure, but I'm not sure if it's possible to do so, or how to go about it. The goal is to be able to get all of a runner's races very quickly and easily by traversing the Prior_Race_ID field.
Can anyone point me to a good resource or reference that explains if and how this is achievable? Also, if there is a preferred approach to achieving my objective, please do share it.
Thanks for your input.
Okay, so we want, for each Competitor (better name than Names?), to find their two most recent races. You'd write a query like this:
SELECT
    * --TODO - Specific columns
FROM
(
    SELECT
        *, --TODO - Specific columns
        ROW_NUMBER() OVER (PARTITION BY n.Name_ID ORDER BY r.Race_Date DESC) rn
    FROM
        Names n
        inner join Runners rs on n.Name_ID = rs.Name_ID
        inner join Races r on rs.Race_ID = r.Race_ID
) t
WHERE
    t.rn in (1, 2)
That should produce two rows per competitor. If needed, you can then PIVOT this data if you want a single row per competitor, but I'd usually leave that up to the presentation layer, rather than do it in SQL.
And so, no, I wouldn't even have a Prior_Race_ID column. As a general rule, don't store data that can be calculated - that just introduces opportunities for that data to be incorrect compared to the base data.
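If a single row per competitor is wanted, one option is conditional aggregation over rn rather than PIVOT; a sketch, assuming the ranked query above is wrapped in a CTE named LastTwo:
;with LastTwo as (
    -- the ranked query from above, reduced to the columns needed here
    select n.Name_ID, r.Race_Date,
           ROW_NUMBER() OVER (PARTITION BY n.Name_ID ORDER BY r.Race_Date DESC) rn
    from Names n
    inner join Runners rs on n.Name_ID = rs.Name_ID
    inner join Races r on rs.Race_ID = r.Race_ID
)
select Name_ID,
       MAX(case when rn = 1 then Race_Date end) as LastRaceDate,
       MAX(case when rn = 2 then Race_Date end) as PriorRaceDate
from LastTwo
where rn in (1, 2)
group by Name_ID;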
If you do want to populate the column once, the following T-SQL sets each runner's Prior_Race_ID to that runner's most recent earlier race (TOP 1 with the date sort also copes with a runner having more than one race on the same day):
update r set r.Prior_Race_ID = prev.Race_ID
from Runner r
join Race cur on cur.Race_ID = r.Race_ID
outer apply (
    select top 1 r2.Race_ID
    from Runner r2 join Race ra on ra.Race_ID = r2.Race_ID
    where r2.Name_ID = r.Name_ID and ra.Race_Date < cur.Race_Date
    order by ra.Race_Date desc
) prev

Tuning Select statement to obtain faster results

I have benefited from this website for a long time now. This is my first question on the site. It is regarding performance tuning a reporting query. Here it goes.
SELECT Count(b1.primkey)
from tableA b1 --WITH (NOLOCK)
join tableA b2 --WITH (NOLOCK)
on b1.email = b2.email
and DateDiff(day, b2.BookedDate , b1.BookedDate) > 1
tableA has around 7 million rows. Email is a varchar(100) field, BookedDate is a datetime field, and primkey is an int primary key column.
My purpose in writing this query is to count the entries that have the same email IDs but came in more than one day apart. This query takes about 45 minutes to run, and I really want to reduce its execution time.
Since this is for reporting, I tried in vain to use the WITH (NOLOCK) hint to improve the read time. I have a columnstore index on tableA and I know that it is being used by the optimizer; I can see it in the execution plan. I am using SQL Server 2012.
Can someone tell me in such a case, what would be better? Using a nonclustered index on email or a nonclustered columnstore index on tableA?
Please help me.
Your query is relatively complex. You are essentially joining two tables that have 7 million records each on a column that is not unique.
How about the following query instead:
select Email
from TableA
group by Email
having MAX(BookedDate) > MIN(BookedDate) + 1
Also make sure you have an index with Email and BookedDate.
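A supporting index sketch, using the table and column names from the question (the index name is made up):
create index IX_TableA_Email_BookedDate on TableA (Email, BookedDate)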
Hope this helps.
You have 3 options here:
1. Create a clustered index on the email field, at least for the larger table. But I suppose there are other queries running on these tables, and the clustered index is needed on other fields.
2. Move emails to another table, and store email IDs in TableA and TableB; a join on an int field would be much faster than on varchar fields.
3. Create an index on the email field with BookedDate as an included column (no need to include primkey; you can count another field, or use count(*)). Code: create index idx_email on TableA (email) include (BookedDate)
I think the third option is the one you should go with. There's not much work to be done, and there will be a great performance gain. The only problem is that an index on a varchar field will take a lot of space and impact insert/update operations; but you said this is for reporting, so I think you can allow that.

SQL Server: insert a new row with an ID of 1

I have a table called ComplaintCodes which contains about 15 rows and 2 columns: ComplaintCodeId and ComplaintCodeText.
I want to insert a new row into that table, but have its ID set to 1, which will also add 1 to all of the IDs that already exist. Is this possible?
EDIT
Using SQL Server and ComplaintCodeId is an identity / PK column
It's possible as two separate DML statements: an UPDATE to shift the existing IDs up by one, and a subsequent INSERT. But this will fail if you are using the ID as a foreign key in another table of course, so you'd need to find a way to update across all related tables.
Why would you want to do this though? Suggest you take a step back and reconsider the design decision that has brought you to this question.
And yes, as podiluska says in his/her(/its!) comment, please specify which DBMS you are using in your question and/or tags.
update <table> set ComplaintCodeId =ComplaintCodeId +1
insert into <table>
select 1,'other column'
Edit:
If it's a PK + identity column, then it's a very bad idea to do it like this. You cannot update an identity column.
Instead of updating, you could do something like this:
select row_number() over (order by ComplaintCodeId desc) as row_num,
       ComplaintCodeId
from ComplaintCodes
and use row_num instead of ComplaintCodeId
After some thought, it seems to me that the best solution to your problem is to change the PK to be non-identity. Then you can set the value to whatever you'd like.
I still think that using a Display Order column (which is the only reason I can think of that you'd care about the order in the table) would be a fine solution, but if you really want the PK order to be the display order, then changing the PK to non-identity would be a good long-term solution, as you wouldn't have these problems in the future.
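If you go the Display Order route, a minimal sketch (the DisplayOrder column name and sample values are made up; ComplaintCodeId stays an identity PK):
ALTER TABLE ComplaintCodes ADD DisplayOrder int NULL;
GO

-- Seed the new column from the current IDs, leaving position 1 free for the new row
UPDATE ComplaintCodes SET DisplayOrder = ComplaintCodeId + 1;

-- ComplaintCodeId is an identity column, so it is omitted from the insert
INSERT INTO ComplaintCodes (ComplaintCodeText, DisplayOrder)
VALUES ('New complaint code', 1);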

MAX keyword taking a lot of time to select a value from a column

Well, I have a table with 40,000,000+ records, but when I try to execute a simple query, it takes ~3 min to finish. Since I am using the same query in my C# solution, where it needs to execute 100+ times, the overall performance of the solution takes a big hit.
This is the query that I am using in a proc
DECLARE @Id bigint
SELECT @Id = MAX(ExecutionID) from ExecutionLog where TestID = 50881
select @Id
Any help to improve the performance would be great. Thanks.
What indexes do you have on the table? It sounds like you don't have anything even close to useful for this particular query, so I'd suggest trying to do:
CREATE INDEX IX_ExecutionLog_TestID ON ExecutionLog (TestID, ExecutionID)
...at the very least. Your query is filtering by TestID, so this needs to be the primary column in the composite index: if you have no indexes on TestID, then SQL Server will resort to scanning the entire table in order to find rows where TestID = 50881.
It may help to think of indexes on SQL tables in the same way as those you'd find in the back of a big book: hierarchical and multi-level. If you were looking for something, you'd manually look under 'T' for TestID, then there'd be a sub-heading under TestID for ExecutionID. Without an index entry for TestID, you'd have to read through the entire book looking for TestID, then see if there's a mention of ExecutionID with it. This is effectively what SQL Server has to do.
If you don't have any indexes, then you'll find it useful to review all the queries that hit the table, and ensure that one of those indexes is a clustered index (rather than non-clustered).
Try to re-work everything into something that works in a set based manner.
So, for instance, you could write a select statement like this:
;With OrderedLogs as (
    Select ExecutionID, TestID,
           ROW_NUMBER() OVER (PARTITION BY TestID ORDER BY ExecutionID desc) as rn
    from ExecutionLog
)
select * from OrderedLogs where rn = 1 and TestID in (50881, 50882, 50883)
This would then find the maximum ExecutionID for 3 different tests simultaneously.
You might need to store that result in a table variable/temp table, but hopefully you can instead continue building up a larger, single query that processes all of the results at once.
This is the sort of processing that SQL is meant to be good at - don't cripple the system by iterating through the TestIDs in your code.
If you need to pass many test IDs into a stored procedure for this sort of query, look at Table Valued Parameters.
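A rough sketch of the table-valued-parameter approach (the type and procedure names here are made up, not from the original post):
-- Hypothetical table type holding the test IDs to look up
CREATE TYPE TestIdList AS TABLE (TestID int PRIMARY KEY);
GO

CREATE PROCEDURE GetLatestExecutions
    @TestIDs TestIdList READONLY
AS
BEGIN
    -- One MAX per requested TestID, resolved as a single set-based query
    SELECT e.TestID, MAX(e.ExecutionID) AS LatestExecutionID
    FROM ExecutionLog AS e
    JOIN @TestIDs AS t ON t.TestID = e.TestID
    GROUP BY e.TestID;
END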

joining latest of various usermetadata tags to user rows

I have a Postgres database with a user table (userid, firstname, lastname) and a usermetadata table (userid, code, content, created datetime). I store various information about each user in the usermetadata table by code, and keep a full history. So for example, a user (userid 15) has the following metadata:
15, 'QHS', '20', '2008-08-24 13:36:33.465567-04'
15, 'QHE', '8', '2008-08-24 12:07:08.660519-04'
15, 'QHS', '21', '2008-08-24 09:44:44.39354-04'
15, 'QHE', '10', '2008-08-24 08:47:57.672058-04'
I need to fetch a list of all my users and the most recent value of each of various usermetadata codes. I did this programmatically and it was, of course, godawful slow. The best I could figure out in SQL was to join sub-selects, which were also slow, and I had to do one for each code.
This is actually not that hard to do in PostgreSQL because it has the "DISTINCT ON" clause in its SELECT syntax (DISTINCT ON isn't standard SQL).
SELECT DISTINCT ON (code) code, content, createtime
FROM metatable
WHERE userid = 15
ORDER BY code, createtime DESC;
That will limit the returned results to the first result per unique code, and if you sort the results by the create time descending, you'll get the newest of each.
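To cover every user rather than just userid 15, the same idea can be joined to the user table; a sketch using the table and column names from the question (the answer above used metatable/createtime instead):
SELECT DISTINCT ON (u.userid, m.code)
       u.userid, u.firstname, u.lastname, m.code, m.content, m.created
FROM "user" u   -- quoted because user is a reserved word in PostgreSQL
JOIN usermetadata m ON m.userid = u.userid
ORDER BY u.userid, m.code, m.created DESC;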
I suppose you're not willing to modify your schema, so I'm afraid my answer might not be of much help, but here goes...
One possible solution would be to add a 'deprecation date' field that stays empty until the value is superseded by a newer one, at which point you fill it in. Another way is to expand the table with an 'active' column, but that would introduce some redundancy.
The classic solution would be to have both 'Valid-From' and 'Valid-To' fields where the 'Valid-To' fields are blank until some other entry becomes valid. This can be handled easily by using triggers or similar. Using constraints to make sure there is only one item of each type that is valid will ensure data integrity.
Common to these is that there is a single way of determining the set of current fields. You'd simply select all entries with the active user and a NULL 'Valid-To' or 'deprecation date' or a true 'active'.
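As a sketch of the constraint idea in PostgreSQL (assuming a valid_to column is added to usermetadata; the index name is made up), a partial unique index can guarantee only one current row per user and code:
ALTER TABLE usermetadata ADD COLUMN valid_to timestamptz;

-- Only one "current" (valid_to IS NULL) row allowed per user and code
CREATE UNIQUE INDEX one_current_value_per_code
    ON usermetadata (userid, code)
    WHERE valid_to IS NULL;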
You might be interested in taking a look at the Wikipedia entry on temporal databases and the article A consensus glossary of temporal database concepts.
A subselect is the standard way of doing this sort of thing. You just need a Unique Constraint on UserId, Code, and Date - and then you can run the following:
SELECT *
FROM Table
JOIN (
SELECT UserId, Code, MAX(Date) as LastDate
FROM Table
GROUP BY UserId, Code
) as Latest ON
Table.UserId = Latest.UserId
AND Table.Code = Latest.Code
AND Table.Date = Latest.Date
WHERE
UserId = #userId
