SQL Server: table design for changing Identity records

I have a view which displays test data from multiple sources for a GPS point.
The view displays the "GPS Point ID" and some geological test results associated with this GPS point.
The GPS-POINT-ID looks like this: XYZ-00XX-CCCC
XYZ: Area
00XX: ID
CCCC: Coordinates
The GPS point name changes over time: the first portion of the point name (XYZ-00XX) stays the same, but the coordinate part (CCCC) changes according to the new GPS point location.
I want to design a table that will have the previously mentioned view as a data source. I need to decide on the following:
Primary key: if I use the full GPS-POINT-ID, I won't be able to keep track of the changes, because it changes frequently over time; I can't keep track of the point, and I can't link it to its historical records.
If I use the fixed part of the GPS-POINT-ID (XYZ-00XX) as a computed column, I can't use it as a primary key, because the same point has many historical records that share the same (XYZ-00XX) part, and this would violate the primary key's uniqueness constraint.
If I create an identity column that increases for each new record, how can I keep track of each point's name changes and get the latest test data, as well as the historical data, for each point (XYZ-00XX)?
Sample rows from the view are attached in a snapshot.
Thanks

I would recommend using an identity column for the primary key, with no business value. I would store the data in two columns: one with the static part and another with the changing part. You can then have a computed column that puts them together as one field if that is necessary. You can also add a date field so that you can follow the history. The static data column is the identifier that ties the records together (sketched below).
I am assuming you do not want to use auditing to track historical records for some reason. That is the approach I would normally take:
http://weblogs.asp.net/jongalloway/adding-simple-trigger-based-auditing-to-your-sql-server-database
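For illustration, a minimal sketch of the table design described above (the table and column names here are hypothetical, not from the question):
CREATE TABLE GpsPointHistory (
GpsPointKey INT IDENTITY(1,1) PRIMARY KEY, -- surrogate key with no business value
GpsStaticData VARCHAR(10) NOT NULL, -- fixed part, e.g. 'RF-0014'
GpsChangingData VARCHAR(10) NOT NULL, -- changing coordinate part, e.g. '9876'
[GPS-POINT-ID] AS (GpsStaticData + '-' + GpsChangingData), -- computed full ID
UpdateDate DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME() -- lets you follow the history
);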
EDIT:
The sample query below works if only one update can happen on a given date. If more than one update can occur, the Row_Number function can be used instead of Group By (see the second EDIT below).
Select *
From Table T1
Join (Select Max(UpdateDate) MatchDate, GpsStaticData
From Table Group By GpsStaticData) T2
On T1.GpsStaticData = T2.GpsStaticData And T1.UpdateDate = T2.MatchDate
EDIT:
Using Row_Number()
With cteGetLatest As
(
Select UpdateDate MatchDate, GpsStaticData,
Row_Number() Over (Partition By GpsStaticData Order By UpdateDate Desc) SortOrder
From Table
)
Select *
From Table T1
Join (Select MatchDate, GpsStaticData
From cteGetLatest Where SortOrder = 1) T2
On T1.GpsStaticData = T2.GpsStaticData And T1.UpdateDate = T2.MatchDate
You can add more fields after Order By UpdateDate in the row_number function to determine which record is selected.

-- To avoid the overhead of artificial key columns, a compound primary key can be used:
-- Simulate the Source View
CREATE TABLE ybSourceView (
[GPS-POINT-ID] VARCHAR(20),
[Status] NVARCHAR(MAX),
UpdateDate [datetime2],
Reason NVARCHAR(MAX),
OpId VARCHAR(15)
);
-- Source View sample data
INSERT INTO ybSourceView ([GPS-POINT-ID], [Status], UpdateDate, Reason, OpId)
VALUES ('RF-0014-9876', 'Reachable' , '2015-01-27 13:36', 'New Updated Coordinate' , 'AFERNANDO'),
('RF-0014-9876', 'Reachable' , '2014-02-27 09:37', 'New Updated Coordinate' , 'AFERNANDO'),
('RF-0014-3465', 'Reachable' , '2015-04-27 09:42', 'New Updated Coordinate' , 'HRONAULD' ),
('RF-0014-2432', 'Reachable' , '2013-06-27 12:00', 'New Updated Coordinate' , 'AFERNANDO'),
('RF-0015-9876', 'OUT_OF_Range', '2014-04-14 12:00', 'Point Abandoned, getting new coordinate', 'AFERNANDO');
-- Historic Data Table
CREATE TABLE ybGPSPointHistory (
Area VARCHAR(5) NOT NULL DEFAULT '',
ID VARCHAR(10) NOT NULL DEFAULT '',
Coordinates VARCHAR(20) NOT NULL DEFAULT '',
[GPS-POINT-ID] VARCHAR(20),
[Status] NVARCHAR(MAX),
UpdateDate [datetime2] NOT NULL DEFAULT SYSUTCDATETIME(),
Reason NVARCHAR(MAX),
OpId VARCHAR(15),
CONSTRAINT ybGPSPointHistoryPK PRIMARY KEY (Area, ID, UpdateDate) --< Compound Primary Key
);
GO
-- Update Historic Data Table from the Source View
INSERT INTO ybGPSPointHistory (Area, ID, Coordinates, [GPS-POINT-ID], [Status], UpdateDate, Reason, OpId)
SELECT LEFT(Src.[GPS-POINT-ID], LEN(Src.[GPS-POINT-ID]) - 10), RIGHT(LEFT(Src.[GPS-POINT-ID], LEN(Src.[GPS-POINT-ID]) - 5), 4), RIGHT(Src.[GPS-POINT-ID], 4), Src.[GPS-POINT-ID], Src.[Status], Src.UpdateDate, Src.Reason, Src.OpId
FROM ybSourceView Src
LEFT JOIN ybGPSPointHistory Tgt ON Tgt.[GPS-POINT-ID] = Src.[GPS-POINT-ID] AND Tgt.UpdateDate = Src.UpdateDate
WHERE Tgt.[GPS-POINT-ID] Is NULL;
--Tests (check Actual Execution Plan to see PK use):
-- Full history
SELECT * FROM ybGPSPointHistory;
-- Up-to-date only
SELECT *
FROM (
SELECT *, RANK () OVER (PARTITION BY Area, ID ORDER BY UpdateDate DESC) As HistoricOrder
FROM ybGPSPointHistory
) a
WHERE HistoricOrder = 1;
-- Latest record for a particular ID
SELECT TOP 1 *
FROM ybGPSPointHistory a
WHERE [GPS-POINT-ID] = 'RF-0014-9876'
ORDER BY UpdateDate DESC;
-- Latest record for a particular full ID, filtering on the split key columns (more efficient: the compound PK can be used for a seek)
SELECT TOP 1 *
FROM ybGPSPointHistory a
WHERE Area = 'RF' AND ID = '0014' AND Coordinates = '9876'
ORDER BY UpdateDate DESC;
-- Latest record for a particular point
SELECT TOP 1 *
FROM ybGPSPointHistory a
WHERE Area = 'RF' AND ID = '0014'
ORDER BY UpdateDate DESC;
--Clean-up:
DROP TABLE ybGPSPointHistory;
DROP TABLE ybSourceView;

Related

Snowflake - how to do multiple DML operations on same primary key in a specific order?

I am trying to set up continuous data replication in Snowflake. I get the transactions that happened in the source system, and I need to perform them in Snowflake in the same order as in the source system. I am trying to use MERGE for this, but when there are multiple operations on the same key in the source system, MERGE does not work correctly: it either misses an operation or returns a "duplicate row detected during DML operation" error.
Please note that the transactions need to happen in the exact order, and it is not possible to take just the latest transaction for a key and apply it (for example, if a record has been INSERTED and then UPDATED, in Snowflake too it needs to be inserted first and then updated, even though the insert is only a transient state).
Here is the example:
create or replace table employee_source (
id int,
first_name varchar(255),
last_name varchar(255),
operation_name varchar(255),
binlogkey integer
);
create or replace table employee_destination ( id int, first_name varchar(255), last_name varchar(255) );
insert into employee_source values (1,'Wayne','Bells','INSERT',11);
insert into employee_source values (1,'Wayne','BellsT','UPDATE',12);
insert into employee_source values (2,'Anthony','Allen','INSERT',13);
insert into employee_source values (3,'Eric','Henderson','INSERT',14);
insert into employee_source values (4,'Jimmy','Smith','INSERT',15);
insert into employee_source values (1,'Wayne','Bellsa','UPDATE',16);
insert into employee_source values (1,'Wayner','Bellsat','UPDATE',17);
insert into employee_source values (2,'Anthony','Allen','DELETE',18);
MERGE INTO employee_destination AS T
USING (select * from employee_source order by binlogkey) AS S
ON T.id = S.id
WHEN NOT MATCHED AND S.operation_name = 'INSERT' THEN
INSERT (id, first_name, last_name)
VALUES (S.id, S.first_name, S.last_name)
WHEN MATCHED AND S.operation_name = 'UPDATE' THEN
UPDATE SET T.first_name = S.first_name, T.last_name = S.last_name
WHEN MATCHED AND S.operation_name = 'DELETE' THEN
DELETE;
I am expecting to see "Bellsat" as the last name for employee id 1 in the employee_destination table after all rows are processed. Likewise, I should not see emp id 2 in the employee_destination table.
Is there any other alternative to MERGE to achieve this? Basically, I need to go over every single DML in the same order (using the binlogkey column for ordering).
thanks.
You need to manipulate your source data to ensure that you only have one record per key/operation; otherwise the join will be non-deterministic and will (depending on your settings) either error out or update using a random one of the applicable source records. This is covered in the documentation here: https://docs.snowflake.com/en/sql-reference/sql/merge.html#duplicate-join-behavior.
In any case, why would you want to update a record only for it to be overwritten by another update? That would be incredibly inefficient.
Since your updates appear to include the new values for all rows, you can use a window function to get to just the latest incoming change, and then merge those results into the target table. For example, the select for that merge (with the window function to get only the latest change) would look like this:
with SOURCE_DATA as
(
select COLUMN1::int ID
,COLUMN2::string FIRST_NAME
,COLUMN3::string LAST_NAME
,COLUMN4::string OPERATION_NAME
,COLUMN5::int PROCESSING_ORDER
from values
(1,'Wayne','Bells','INSERT',11),
(1,'Wayne','BellsT','UPDATE',12),
(2,'Anthony','Allen','INSERT',13),
(3,'Eric','Henderson','INSERT',14),
(4,'Jimmy','Smith','INSERT',15),
(1,'Wayne','Bellsa','UPDATE',16),
(1,'Wayne','Bellsat','UPDATE',17),
(2,'Anthony','Allen','DELETE',18)
)
select * from SOURCE_DATA
qualify row_number() over (partition by ID order by PROCESSING_ORDER desc) = 1
That will produce a result set that has only the changes required to merge into the target table:
ID  FIRST_NAME  LAST_NAME  OPERATION_NAME  PROCESSING_ORDER
--  ----------  ---------  --------------  ----------------
1   Wayne       Bellsat    UPDATE          17
2   Anthony     Allen      DELETE          18
3   Eric        Henderson  INSERT          14
4   Jimmy       Smith      INSERT          15
You can then change the WHEN NOT MATCHED clause to remove the operation_name predicate: if a row is listed as an UPDATE but is not in the target table, it is because it was inserted in a previous operation within the new changes.
For the WHEN MATCHED clause, you can use the operation_name to determine whether the row should be updated or deleted, as sketched below.
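Putting both edits together, the final MERGE might look something like this (a sketch assuming the deduplicated rows come straight from employee_source; not the poster's verified code):
MERGE INTO employee_destination AS T
USING (
select *
from employee_source
qualify row_number() over (partition by id order by binlogkey desc) = 1
) AS S
ON T.id = S.id
WHEN NOT MATCHED AND S.operation_name <> 'DELETE' THEN
-- covers INSERTs, and UPDATEs whose original INSERT happened earlier in this batch
INSERT (id, first_name, last_name)
VALUES (S.id, S.first_name, S.last_name)
WHEN MATCHED AND S.operation_name = 'UPDATE' THEN
UPDATE SET T.first_name = S.first_name, T.last_name = S.last_name
WHEN MATCHED AND S.operation_name = 'DELETE' THEN
DELETE;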

Perform SCD2 on snowflake table based upon oracle input data

Currently I am sourcing data from Oracle.
As part of the initial load, I ingested all history data from Oracle table oracle_a into Snowflake table "snow_a" using a named stage and COPY INTO commands.
I would like to perform SCD2 on the snow_a table based upon the oracle_a table.
I mean: if any new record is added to the oracle_a table, that record should be inserted; and for any change to an existing record of the oracle_a table, the existing record of the snow_a table should be expired and the new record inserted.
The oracle_a table has key columns key_col1, key_col2, key_col3. attr1 and attr2 are the other attributes of the table.
Implementing SCD Type 2 functionality on a table in Snowflake is no different than in any other relational database. However, there is additional functionality that can help with this process. Please have a look at this blog post series on using Snowflake Streams and Tasks to perform the SCD logic.
https://www.snowflake.com/blog/building-a-type-2-slowly-changing-dimension-in-snowflake-using-streams-and-tasks-part-1/
Cheers,
Michael Rainey
Ok, so here is what I found, though you may need to adjust where the UPDATE and INSERT come from, since oracle_a is not in Snowflake.
CREATE TABLE snowflake_a(key_col1 varchar(10), key_col2 varchar(10), key_col3 varchar(10), attr1 varchar(8), attr2 varchar(10), eff_ts TIMESTAMP, exp_ts TIMESTAMP, valid varchar(10));
DROP TABLE IF EXISTS oracle_a;
INSERT INTO snowflake_a VALUES('PT_1', 'DL_1', 'RPT_1', 'Address1', 'APT_1', current_date, current_date, 'Active');
CREATE TABLE oracle_a(key_col1 varchar(10), key_col2 varchar(10), key_col3 varchar(10), attr1 varchar(8), attr2 varchar(8), eff_ts TIMESTAMP, exp_ts TIMESTAMP);
INSERT INTO oracle_a
VALUES( 'PT_1', 'DL_1', 'RPT_1', 'Address1', 'APT_1', '10/24/2019', '12/31/1999');
UPDATE snowflake_a
SET valid = 'Expired'
WHERE valid LIKE '%Active%';
SELECT * FROM snowflake_a;
INSERT INTO snowflake_a VALUES( 'PT_1', 'DL_1', 'RPT_1', 'Address1', 'APT_1', '10/24/2019', '12/31/1999', 'Active');
SELECT * FROM snowflake_a;
Or better yet, what are you using to connect from your Oracle ecosystem to the Snowflake ecosystem?
From the question, it seems that the incoming Oracle rows do not contain any SCD2-type columns and that each row inserted into Snowflake is to be handled with SCD2-type functionality.
SCD2 columns can have a specific meaning to the business, such that exp_ts could be an actual date or a business date. A Snowflake stage does not include SCD2 functionality; this is usually the role of an ETL framework, not that of a fast/bulk load utility.
Most ETL vendors have SCD2 functions as a part of their offering.
I did the following steps to perform SCD2 (sketched below):
1. Loaded the oracle_a table data into a TEMPORARY scd2_temp table.
2. Performed an update on snow_a to expire "changed records", joining on the key columns and checking the rest of the attributes.
3. Inserted the records from the TEMPORARY scd2_temp table into the snow_a table.
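A rough sketch of those steps in Snowflake SQL (an interpretation, not the poster's exact code; it assumes snow_a has columns key_col1..3, attr1, attr2, eff_ts, exp_ts, like the snowflake_a example above):
-- 1: stage the incoming Oracle rows in a temporary table
CREATE TEMPORARY TABLE scd2_temp AS
SELECT * FROM oracle_a;
-- 2: expire changed records by joining on the key columns and comparing attributes
UPDATE snow_a t
SET t.exp_ts = CURRENT_TIMESTAMP()
FROM scd2_temp s
WHERE t.key_col1 = s.key_col1
AND t.key_col2 = s.key_col2
AND t.key_col3 = s.key_col3
AND (t.attr1 <> s.attr1 OR t.attr2 <> s.attr2);
-- 3: insert incoming rows that don't already have an identical version
INSERT INTO snow_a
SELECT s.key_col1, s.key_col2, s.key_col3, s.attr1, s.attr2,
CURRENT_TIMESTAMP() AS eff_ts,
CAST('9999-12-31 23:59:59' AS TIMESTAMP) AS exp_ts
FROM scd2_temp s
WHERE NOT EXISTS (
SELECT 1 FROM snow_a t
WHERE t.key_col1 = s.key_col1
AND t.key_col2 = s.key_col2
AND t.key_col3 = s.key_col3
AND t.attr1 = s.attr1
AND t.attr2 = s.attr2);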
Here's a solution based on the following assumptions:
1. The source Oracle table is not itself responsible for SCD2 processing (so Eff/Exp TS columns wouldn't be present on that table).
2. There is an external process that is only extracting/loading delta (new, updated) records into Snowflake.
3. The source Oracle is not deleting records.
First create the tables and add the first set of delta data:
CREATE OR REPLACE TABLE stg.cdc2_oracle_d (
key1 varchar(10),
key2 varchar(10),
key3 varchar(10),
attr1 varchar(8),
attr2 varchar(8));
CREATE OR REPLACE TABLE edw.cdc2_snowflake_d (
key1 varchar(10),
key2 varchar(10),
key3 varchar(10),
attr1 varchar(8),
attr2 varchar(8),
eff_ts TIMESTAMP_LTZ(0),
exp_ts TIMESTAMP_LTZ(0),
active_fl char(1));
INSERT INTO stg.cdc2_oracle_d VALUES
( 'PT_1', 'DL_1', 'RPT_1', 'Addr1a', 'APT_1.0'),
( 'PT_2', 'DL_2', 'RPT_2', 'Addr2a', 'APT_2.0'),
( 'PT_3', 'DL_3', 'RPT_3', 'Addr3a', 'APT_3.0');
Then run the following Transformation script:
BEGIN;
-- 1: insert net-new records from the stg table that don't currently exist in the edw table
INSERT INTO edw.cdc2_snowflake_d
SELECT
key1,
key2,
key3,
attr1,
attr2,
CURRENT_TIMESTAMP(0) AS eff_ts,
CAST('9999-12-31 23:59:59' AS TIMESTAMP) AS exp_ts,
'Y' AS active_fl
FROM stg.cdc2_oracle_d stg
WHERE NOT EXISTS (
SELECT 1
FROM edw.cdc2_snowflake_d edw
WHERE edw.key1 = stg.key1
AND edw.key2 = stg.key2
AND edw.key3 = stg.key3
AND edw.active_fl = 'Y');
-- 2: insert a new version of the record from the stg table where the key currently exists in the edw table,
-- but only add it if the attr columns are different; otherwise it's the same record
INSERT INTO edw.cdc2_snowflake_d
SELECT
stg.key1,
stg.key2,
stg.key3,
stg.attr1,
stg.attr2,
CURRENT_TIMESTAMP(0) AS eff_ts,
CAST('9999-12-31 23:59:59' AS TIMESTAMP) AS exp_ts,
'T' AS active_fl -- set flag to a temporary value
FROM stg.cdc2_oracle_d stg
JOIN edw.cdc2_snowflake_d edw ON edw.key1 = stg.key1 AND edw.key2 = stg.key2
AND edw.key3 = stg.key3 AND edw.active_fl = 'Y'
WHERE (stg.attr1 <> edw.attr1
OR stg.attr2 <> edw.attr2);
-- 3: deactivate the current record where there is a new record from the step above,
-- and set the exp_ts to 1 second prior to the new record so there is no overlap in data
UPDATE edw.cdc2_snowflake_d old
SET old.active_fl = 'N',
old.exp_ts = DATEADD(SECOND, -1, new.eff_ts)
FROM edw.cdc2_snowflake_d new
WHERE old.key1 = new.key1
AND old.key2 = new.key2
AND old.key3 = new.key3
AND new.active_fl = 'T'
AND old.active_fl = 'Y';
-- 4: finally set all the temporary records to active
UPDATE edw.cdc2_snowflake_d tmp
SET tmp.active_fl = 'Y'
WHERE tmp.active_fl = 'T';
COMMIT;
Review the results, then truncate & add new data and run the script again:
SELECT * FROM stg.cdc2_oracle_d;
SELECT * FROM edw.cdc2_snowflake_d ORDER BY 1,2,3,5;
TRUNCATE TABLE stg.cdc2_oracle_d;
INSERT INTO stg.cdc2_oracle_d VALUES
( 'PT_1', 'DL_1', 'RPT_1', 'Addr1a', 'APT_1.1'), -- record has updated attr2
( 'PT_2', 'DL_2', 'RPT_2', 'Addr2a', 'APT_2.0'), -- record has no changes
( 'PT_4', 'DL_4', 'RPT_4', 'Addr4a', 'APT_4.0'); -- new record
You'll see that PT_1 now has 2 records with non-overlapping timestamps, only 1 of which is active.

How to shift entire row from last to 3rd position without changing values in SQL Server

This is my table:
DocumentTypeId DocumentType UserId CreatedDtm
--------------------------------------------------------------------------
2d47e2f8-4 PDF 443f-4baa 2015-12-03 17:56:59.4170000
b4b-4803-a Images a99f-1fd 1997-02-11 22:16:51.7000000
600-0e32 XL e60e07a6b 2015-08-19 15:26:11.4730000
40f8ff9f Word 79b399715 1994-04-23 10:33:44.2300000
8230a07c email 750e-4c3d 2015-01-10 09:56:08.1700000
How can I shift the entire last row (DocumentType = email) to the 3rd position (before DocumentType = XL), without changing the table values?
Without wishing to deny the truth of what others have said here, SQL Server does have CLUSTERED indices. For full details on these and the difference between a clustered table and a non-clustered one, please see here. In effect, a clustered table does have data written to disk in index order. However, due to subsequent insertions and deletions, you should never rely on any given record being in a fixed ordinal position.
To get your data showing email third and XL fourth, you simply need to order by CreatedDtm. Thus:
declare @test table
(
DocumentTypeID varchar(20),
DocumentType varchar(10),
UserID varchar(20),
CreatedDtm datetime
)
INSERT INTO @test VALUES
('2d47e2f8-4','PDF','443f-4baa','2015-12-03 17:56:59'),
('b4b-4803-a','Images','a99f-1fd','1997-02-11 22:16:51'),
('600-0e32','XL','e60e07a6b','2015-08-19 15:26:11'),
('40f8ff9f','Word','79b399715','1994-04-23 10:33:44'),
('8230a07c','email','750e-4c3d','2015-01-10 09:56:08')
SELECT * FROM @test order by CreatedDtm
This gives a result set of:
40f8ff9f Word 79b399715 1994-04-23 10:33:44.000
b4b-4803-a Images a99f-1fd 1997-02-11 22:16:51.000
8230a07c email 750e-4c3d 2015-01-10 09:56:08.000
600-0e32 XL e60e07a6b 2015-08-19 15:26:11.000
2d47e2f8-4 PDF 443f-4baa 2015-12-03 17:56:59.000
This may be what you are looking for, but I cannot stress enough that it only gives email 3rd and XL 4th in this particular case. If the dates were different, it would not be so. But perhaps this is all that you needed?
I assumed that you need to sort by the DocumentType column.
By joining with a derived table, which can virtually contain the DocumentTypes with the desired SortOrder, you can achieve the result you want.
declare @tbl table(
DocumentTypeID varchar(50),
DocumentType varchar(50)
)
insert into @tbl(DocumentTypeID, DocumentType)
values
('2d47e2f8-4','PDF'),
('b4b-4803-a','Images'),
('600-0e32','XL'),
('40f8ff9f','Word'),
('8230a07c','email')
;
--this will give you the original output
select * from @tbl;
--this will output rows with the new sort order
select t.* from @tbl t
inner join
(
select *
from
(values
('PDF',1, 1),
('Images',2, 2),
('XL',3, 4),
('Word',4, 5),
('email',5, 3) --here I put the new sort order '3'
) as dt(TypeName, SortOrder, NewSortOrder)
) dt
on dt.TypeName = t.DocumentType
order by dt.NewSortOrder
Row positions don't really matter in SQL tables, since a table is an unordered set of rows, but if you really want to switch the rows I'd suggest you copy all your data to a temp table, e.g.:
SELECT * INTO #temptable FROM [tablename]
then delete/truncate the data from the original table (if it won't mess up the other tables it's connected to) and insert from the temp table in the order you like, since it will have all the same fields with the same data as the original.
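A sketch of that sequence, using the table from the question (note that, as the first answer stresses, the physical insert order is not a reliable read order; you must still use ORDER BY when querying):
SELECT * INTO #temptable FROM [tablename]; -- copy everything out
TRUNCATE TABLE [tablename];                -- empty the original
INSERT INTO [tablename]                    -- re-insert in the desired order
SELECT DocumentTypeId, DocumentType, UserId, CreatedDtm
FROM #temptable
ORDER BY CreatedDtm;
DROP TABLE #temptable;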

TSQL Update Issue

Ok, SQL Server fans, I have an issue with a legacy stored procedure that sits inside a SQL Server 2008 R2 instance that I have inherited, along with PROD data which is, to say the least, horrible. Also, I can NOT make any changes to the data nor to the table structures.
So here is my problem: the stored procedure in question runs daily and is used to update the employee table. As you can see from my example, the incoming data (#New_Employees) contains the updated data, and I need to use it to update the employee data stored in the #Existing_Employees table. Throughout the years, different formattings of the EMP_ID value have been used, and they must be maintained as-is (I fought and lost that battle). Thankfully, I have been successful in changing the format of the EMP_ID column in the #New_Employees table (Yeah!), and any new records will use this format!
So now you may see my problem: I need to update the ID column in the #New_Employees table with the corresponding ID from the #Existing_Employees table by matching (that's right, you guessed it) on the EMP_ID columns. I came up with an extremely hacky way to handle the disparate formats of the EMP_ID columns, but it is very slow considering the number of rows that I need to process (1M+).
I thought of creating a staging table where I could simply cast the EMP_ID columns to an INT and back to an NVARCHAR in each table to remove the leading zeros, and I am sort of leaning that way, but I wanted to see if there is another way to handle this dysfunctional data. Any constructive comments are welcome.
IF OBJECT_ID(N'TempDB..#NEW_EMPLOYEES') IS NOT NULL
DROP TABLE #NEW_EMPLOYEES
CREATE TABLE #NEW_EMPLOYEES(
ID INT
,EMP_ID NVARCHAR(50)
,NAME NVARCHAR(50))
GO
IF OBJECT_ID(N'TempDB..#EXISTING_EMPLOYEES') IS NOT NULL
DROP TABLE #EXISTING_EMPLOYEES
CREATE TABLE #EXISTING_EMPLOYEES(
ID INT PRIMARY KEY
,EMP_ID NVARCHAR(50)
,NAME NVARCHAR(50))
GO
INSERT INTO #NEW_EMPLOYEES
VALUES(NULL, '00123', 'Adam Arkin')
,(NULL, '00345', 'Bob Baker')
,(NULL, '00526', 'Charles Nelson O''Reilly')
,(NULL, '04321', 'David Numberman')
,(NULL, '44321', 'Ida Falcone')
INSERT INTO #EXISTING_EMPLOYEES
VALUES(1, '123', 'Adam Arkin')
,(2, '000345', 'Bob Baker')
,(3, '526', 'Charles Nelson O''Reilly')
,(4, '0004321', 'Ed Sullivan')
,(5, '02143', 'Frank Sinatra')
,(6, '5567', 'George Thorogood')
,(7, '0000123-1', 'Adam Arkin')
,(8, '7', 'Harry Hamilton')
-- First Method - Not Successful
UPDATE NE
SET ID = EE.ID
FROM
#NEW_EMPLOYEES NE
LEFT OUTER JOIN #EXISTING_EMPLOYEES EE
ON EE.EMP_ID = NE.EMP_ID
SELECT * FROM #NEW_EMPLOYEES
-- Second Method - Successful but Slow
UPDATE NE
SET ID = EE.ID
FROM
dbo.#NEW_EMPLOYEES NE
LEFT OUTER JOIN dbo.#EXISTING_EMPLOYEES EE
ON CAST(CASE WHEN NE.EMP_ID LIKE N'%[^0-9]%'
THEN NE.EMP_ID
ELSE LTRIM(STR(CAST(NE.EMP_ID AS INT))) END AS NVARCHAR(50)) =
CAST(CASE WHEN EE.EMP_ID LIKE N'%[^0-9]%'
THEN EE.EMP_ID
ELSE LTRIM(STR(CAST(EE.EMP_ID AS INT))) END AS NVARCHAR(50))
SELECT * FROM #NEW_EMPLOYEES
"the number of rows that I need to process (1M+)"
A million employees? Per day?
I think I would add a 3rd table:
create table #ids ( id INT NOT NULL PRIMARY KEY
, emp_id NVARCHAR(50) NOT NULL UNIQUE );
Populate that table using your LTRIM(STR(CAST ...)), ahem, algorithm, and update Employees directly from a join of those three tables.
I recommend using an ANSI update, not Microsoft's nonstandard update ... from, because the ANSI version prevents nondeterministic results in cases where the FROM clause produces more than one matching row (see the sketch below).
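A sketch of that approach (my interpretation; it reuses the normalization expression from the question, and updates #NEW_EMPLOYEES via a correlated subquery, the ANSI form):
-- populate the lookup table with normalized EMP_IDs from the existing data
INSERT INTO #ids (id, emp_id)
SELECT ID,
CAST(CASE WHEN EMP_ID LIKE N'%[^0-9]%'
THEN EMP_ID
ELSE LTRIM(STR(CAST(EMP_ID AS INT))) END AS NVARCHAR(50))
FROM #EXISTING_EMPLOYEES;
-- ANSI-style update: errors out instead of silently picking a row if duplicate matches exist
UPDATE #NEW_EMPLOYEES
SET ID = (SELECT i.id
FROM #ids i
WHERE i.emp_id = CAST(CASE WHEN #NEW_EMPLOYEES.EMP_ID LIKE N'%[^0-9]%'
THEN #NEW_EMPLOYEES.EMP_ID
ELSE LTRIM(STR(CAST(#NEW_EMPLOYEES.EMP_ID AS INT))) END AS NVARCHAR(50)));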

Computed column expression

I have a specific need for a computed column called ProductCode
ProductId | SellerId | ProductCode
--------- | -------- | -----------
1         | 1        | 000001
2         | 1        | 000002
3         | 2        | 000001
4         | 1        | 000003
ProductId is an identity column, incrementing by 1.
SellerId is a foreign key.
So my computed column ProductCode must look at how many products the seller has, and be in the format 000000. The problem here is how to know which seller's products to count.
I've written the following T-SQL, which doesn't take into account how many products a seller has:
ALTER TABLE dbo.Product
ADD ProductCode AS RIGHT('000000' + CAST(ProductId AS VARCHAR(6)) , 6) PERSISTED
You cannot have a computed column based on data outside of the current row that is being updated. The best you can do to make this automatic is to create an after-trigger that queries the entire table to find the next value for the product code. But in order to make this work you'd have to use an exclusive table lock, which will utterly destroy concurrency, so it's not a good idea.
I also don't recommend using a view, because it would have to calculate the ProductCode every time you read the table. This would be a huge performance-killer as well. And unless the value is saved in the database, never to be touched again, your product codes would be subject to spurious changes (as in the case of perhaps deleting an erroneously-entered and never-used product).
Here's what I recommend instead. Create a new table:
dbo.SellerProductCode
SellerID LastProductCode
-------- ---------------
1 3
2 1
This table reliably records the last-used product code for each seller. On INSERT to your Product table, a trigger will update the LastProductCode in this table appropriately for all affected SellerIDs, and then update all the newly-inserted rows in the Product table with appropriate values. It might look something like the below.
See this trigger working in a Sql Fiddle
CREATE TRIGGER TR_Product_I ON dbo.Product FOR INSERT
AS
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE #LastProductCode TABLE (
SellerID int NOT NULL PRIMARY KEY CLUSTERED,
LastProductCode int NOT NULL
);
WITH ItemCounts AS (
SELECT
I.SellerID,
ItemCount = Count(*)
FROM
Inserted I
GROUP BY
I.SellerID
)
MERGE dbo.SellerProductCode C
USING ItemCounts I
ON C.SellerID = I.SellerID
WHEN NOT MATCHED BY TARGET THEN
INSERT (SellerID, LastProductCode)
VALUES (I.SellerID, I.ItemCount)
WHEN MATCHED THEN
UPDATE SET C.LastProductCode = C.LastProductCode + I.ItemCount
OUTPUT
Inserted.SellerID,
Inserted.LastProductCode
INTO #LastProductCode;
WITH P AS (
SELECT
NewProductCode =
L.LastProductCode + 1
- Row_Number() OVER (PARTITION BY I.SellerID ORDER BY P.ProductID DESC),
P.*
FROM
Inserted I
INNER JOIN dbo.Product P
ON I.ProductID = P.ProductID
INNER JOIN #LastProductCode L
ON P.SellerID = L.SellerID
)
UPDATE P
SET P.ProductCode = Right('00000' + Convert(varchar(6), P.NewProductCode), 6);
Note that this trigger works even if multiple rows are inserted. There is no need to preload the SellerProductCode table, either--new sellers will automatically be added. This will handle concurrency with few problems. If concurrency problems are encountered, proper locking hints can be added without deleterious effect as the table will remain very small and ROWLOCK can be used (except for the INSERT which will require a range lock).
Please do see the Sql Fiddle for working, tested code demonstrating the technique. Now you have real product codes that have no reason to ever change and will be reliable.
I would normally recommend using a view to do this type of calculation. The view could even be indexed if select performance is the most important factor (I see you're using persisted).
You cannot have a subquery in a computed column, which essentially means that you can only access the data in the current row. The only ways to get this count would be to use a user-defined function in your computed column, or triggers to update a non-computed column.
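For completeness, a sketch of the UDF route (the function name is hypothetical; note the column cannot be PERSISTED because the function accesses data, and the function will run per row on every read):
CREATE FUNCTION dbo.fnSellerProductCode (@SellerId int, @ProductId int)
RETURNS varchar(6)
AS
BEGIN
-- count this seller's products up to and including this one, then zero-pad
RETURN (SELECT RIGHT('000000' + CAST(COUNT(*) AS varchar(6)), 6)
FROM dbo.Product
WHERE SellerId = @SellerId
AND ProductId <= @ProductId);
END;
GO
ALTER TABLE dbo.Product
ADD ProductCode AS dbo.fnSellerProductCode(SellerId, ProductId);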
A view might look like the following:
create view ProductCodes as
select p.ProductId, p.SellerId,
(
select right('000000' + cast(count(*) as varchar(6)), 6)
from Product
where SellerID = p.SellerID
and ProductID <= p.ProductID
) as ProductCode
from Product p
One big caveat to your product numbering scheme, and a downfall for both the view and UDF options, is that we're relying upon a count of rows with a lower ProductId. This means that if a product is inserted in the middle of the sequence, it would actually change the ProductCodes of existing products with a higher ProductId. At that point, you must either:
1. Guarantee the sequencing of ProductId (identity alone does not do this),
2. Rely upon a different column that has a guaranteed sequence (still dubious, but maybe CreateDate?), or
3. Use a trigger to get a count at insert time which is then never changed.
