I have a stored procedure that merges a local temp table and an existing table.
ALTER PROCEDURE [dbo].[SyncProductVariantsFromServices]
@Items ProductVariantsTable READONLY
AS
BEGIN
CREATE TABLE #ProductVariantsTemp
(
ItemCode nvarchar(10) collate SQL_Latin1_General_CP1_CI_AS,
VariantCode nvarchar(10) collate SQL_Latin1_General_CP1_CI_AS,
VariantDescriptionBG nvarchar(100) collate SQL_Latin1_General_CP1_CI_AS,
VariantDescriptionEN nvarchar(100) collate SQL_Latin1_General_CP1_CI_AS
)
insert into #ProductVariantsTemp
select ItemCode, VariantCode, VariantDescriptionBG, VariantDescriptionEN
from @Items
MERGE ProductVariants AS TARGET
USING #ProductVariantsTemp AS SOURCE
ON (TARGET.ItemCode = SOURCE.ItemCode AND TARGET.VariantCode= SOURCE.VariantCode)
WHEN NOT MATCHED BY TARGET THEN
INSERT (ItemCode, VariantCode, VariantDescriptionBG, VariantDescriptionEN)
VALUES (SOURCE.ItemCode, SOURCE.VariantCode, SOURCE.VariantDescriptionBG, SOURCE.VariantDescriptionEN)
OUTPUT INSERTED.ItemCode, INSERTED.VariantCode, GETDATE() INTO SyncLog;
END
The problem is: I know that in the OUTPUT clause I have access to the inserted or deleted records, including in the NOT MATCHED BY SOURCE case. But when a row is 'not matched by source' I want to run an update instead:
Update ProductVariants Set Active = 0 -- when not matched by source
What is the most efficient way to do this?
Normally you use `WHEN NOT MATCHED BY SOURCE` when you want to delete a record that exists in the target table but not in the source. If you want to 'inactivate' a record instead, that has to be done through a `MATCHED`-style clause with additional conditions.
If you want to keep a history of the records, consider using "Slowly Changing Dimensions"; here are some examples of how Kimball handles this kind of historical data:
Slowly Changing Dimensions - Part 1
Slowly Changing Dimensions - Part 2
Use the WHEN NOT MATCHED BY SOURCE clause of the MERGE with an UPDATE statement.
MERGE ProductVariants AS TARGET
USING #ProductVariantsTemp AS SOURCE
    ON (TARGET.ItemCode = SOURCE.ItemCode AND TARGET.VariantCode = SOURCE.VariantCode)
WHEN NOT MATCHED BY TARGET THEN
    INSERT (ItemCode, VariantCode, VariantDescriptionBG, VariantDescriptionEN)
    VALUES (SOURCE.ItemCode, SOURCE.VariantCode, SOURCE.VariantDescriptionBG, SOURCE.VariantDescriptionEN)
WHEN NOT MATCHED BY SOURCE THEN
    UPDATE SET Active = 0
OUTPUT INSERTED.ItemCode, INSERTED.VariantCode, GETDATE() INTO SyncLog;
Since the OUTPUT clause's INSERTED table can now return either inserted or updated records, you can add the special $action column, which tells you whether the operation for each row was an INSERT or an UPDATE. You will have to change the SyncLog table to receive this value, though.
OUTPUT
INSERTED.ItemCode, INSERTED.VariantCode, GETDATE(), $action
INTO SyncLog;
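That change could be as small as adding one more column to hold the action; a minimal sketch, assuming SyncLog currently has exactly the three columns already being logged and using a hypothetical column name SyncAction:
ALTER TABLE SyncLog ADD SyncAction nvarchar(10) NULL;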
I am trying to set up continuous data replication into Snowflake. I receive the transactions that happened in the source system and I need to apply them in Snowflake in the same order as in the source system. I am trying to use MERGE for this, but when there are multiple operations on the same key in the source system, MERGE does not work correctly: it either misses an operation or returns a "duplicate row detected during DML operation" error.
Please note that the transactions need to be applied in exact order; it is not possible to take only the latest transaction for a key and apply just that (i.e., if a record has been INSERTED and then UPDATED, in Snowflake too it needs to be inserted first and then updated, even though the insert is only a transient state).
Here is the example:
create or replace table employee_source (
id int,
first_name varchar(255),
last_name varchar(255),
operation_name varchar(255),
binlogkey integer
)
create or replace table employee_destination ( id int, first_name varchar(255), last_name varchar(255) );
insert into employee_source values (1,'Wayne','Bells','INSERT',11);
insert into employee_source values (1,'Wayne','BellsT','UPDATE',12);
insert into employee_source values (2,'Anthony','Allen','INSERT',13);
insert into employee_source values (3,'Eric','Henderson','INSERT',14);
insert into employee_source values (4,'Jimmy','Smith','INSERT',15);
insert into employee_source values (1,'Wayne','Bellsa','UPDATE',16);
insert into employee_source values (1,'Wayner','Bellsat','UPDATE',17);
insert into employee_source values (2,'Anthony','Allen','DELETE',18);
MERGE into employee_destination as T using (select * from employee_source order by binlogkey)
AS S
ON T.id = s.id
when not matched
And S.operation_name = 'INSERT' THEN
INSERT (id,
first_name,
last_name)
VALUES (
S.id,
S.first_name,
S.last_name)
when matched AND S.operation_name = 'UPDATE'
THEN
update set T.first_name = S.first_name, T.last_name = S.last_name
When matched
And S.operation_name = 'DELETE' THEN DELETE;
I am expecting to see 'Bellsat' as the last name for employee id 1 in the employee_destination table after all rows are processed. Similarly, I should not see employee id 2 in the employee_destination table.
Is there any other alternative to MERGE to achieve this? Basically, to go over every single DML in the same order (using the binlogkey column for ordering).
thanks.
You need to manipulate your source data to ensure that you only have one record per key, otherwise the join will be non-deterministic and will (depending on your settings) either error or update using a random one of the applicable source records. This is covered in the documentation here: https://docs.snowflake.com/en/sql-reference/sql/merge.html#duplicate-join-behavior.
In any case, why would you want to update a record only for it to be overwritten by another update - this would be incredibly inefficient?
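For reference, the setting alluded to is Snowflake's ERROR_ON_NONDETERMINISTIC_MERGE session parameter: with the default (TRUE) a non-deterministic MERGE raises the duplicate-row error, and setting it to FALSE makes Snowflake silently pick one of the matching source rows instead (not recommended here, since you actually care which row wins):
ALTER SESSION SET ERROR_ON_NONDETERMINISTIC_MERGE = FALSE;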
Since your updates appear to include the new values for all columns, you can use a window function to get just the latest incoming change for each key, and then merge that result into the target table. For example, the SELECT for that merge (with the window function to get only the latest change) would look like this:
with SOURCE_DATA as
(
select COLUMN1::int ID
,COLUMN2::string FIRST_NAME
,COLUMN3::string LAST_NAME
,COLUMN4::string OPERATION_NAME
,COLUMN5::int PROCESSING_ORDER
from values
(1,'Wayne','Bells','INSERT',11),
(1,'Wayne','BellsT','UPDATE',12),
(2,'Anthony','Allen','INSERT',13),
(3,'Eric','Henderson','INSERT',14),
(4,'Jimmy','Smith','INSERT',15),
(1,'Wayne','Bellsa','UPDATE',16),
(1,'Wayne','Bellsat','UPDATE',17),
(2,'Anthony','Allen','DELETE',18)
)
select * from SOURCE_DATA
qualify row_number() over (partition by ID order by PROCESSING_ORDER desc) = 1
That will produce a result set that has only the changes required to merge into the target table:
ID  FIRST_NAME  LAST_NAME  OPERATION_NAME  PROCESSING_ORDER
1   Wayne       Bellsat    UPDATE          17
2   Anthony     Allen      DELETE          18
3   Eric        Henderson  INSERT          14
4   Jimmy       Smith      INSERT          15
You can then change the WHEN NOT MATCHED clause to remove the operation_name condition: if a row is listed as an UPDATE but is not in the target table, that is because it was inserted by an earlier operation within this batch of new changes.
For the WHEN MATCHED clause, you can use the operation_name to determine whether the row should be updated or deleted, as sketched below.
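Putting those pieces together, a sketch of what the final MERGE could look like (using the same deduplication as the query above, directly against employee_source; an illustration rather than a drop-in script):
merge into employee_destination as T
using (
    select * from employee_source
    qualify row_number() over (partition by id order by binlogkey desc) = 1
) as S
on T.id = S.id
-- key not yet in the target: insert it, unless its final state is a delete
when not matched and S.operation_name <> 'DELETE' then
    insert (id, first_name, last_name)
    values (S.id, S.first_name, S.last_name)
-- key exists and its latest change is a delete: remove it
when matched and S.operation_name = 'DELETE' then
    delete
-- key exists and its latest change is an insert/update: take the new values
when matched then
    update set T.first_name = S.first_name, T.last_name = S.last_name;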
I need to import data into SQL from Excel via a .NET user application. I need to avoid duplication. However, some records might have NULLs in certain columns.
I'm using stored procedures to implement the imports, but I can't seem to come up with a "universal" solution that matches on the data when it exists, or on NULLs when it doesn't exist.
Note that my Part table uses an Identity PK, but the import records won't include it.
Below is an example (I did not include all the columns for brevity):
CREATE PROCEDURE [dbo].[spInsertPart]
(@PartNo NCHAR(50),
@PartName NCHAR(50) = NULL,
@PartVariance NCHAR(30) = NULL)
AS
BEGIN
SET NOCOUNT OFF;
IF NOT EXISTS (SELECT PartNo, PartVariance
FROM Part
WHERE PartNo = @PartNo AND PartVariance = @PartVariance)
BEGIN
INSERT INTO Part (PartNo, PartName, PartVariance)
VALUES (@PartNo, @PartName, @PartVariance)
END
END
The import data may or may not include a PartVariance, and the existing records may (or may not) also have NULL as the PartVariance.
If both are NULL, then I get a duplicate record - which I don't want.
How can I re-write the procedure to not duplicate, but to treat the NULL value like any other value? (That is, add a record if either contains NULL, but not both).
I think you need to provide clearer information on the following before this question can be answered correctly:
Which columns are used to 'match' an incoming record against the rows of the 'Part' table? In other words, matching values in which columns should cause the remaining columns of the 'Part' table to be updated with the incoming values, versus a new record being inserted into the 'Part' table?
Considering that only the 'PartNo' and 'PartVariance' columns are used for 'matching', as seen in the query, and that only the PartVariance column can have NULL, here is a solution:
CREATE PROCEDURE [dbo].[spInsertPart]
(@PartNo NCHAR(50),
@PartName NCHAR(50) = NULL,
@PartVariance NCHAR(30) = NULL)
AS
BEGIN
SET NOCOUNT OFF;
IF NOT EXISTS (
SELECT 1
FROM Part
WHERE PartNo = @PartNo
AND COALESCE(PartVariance, '') = COALESCE(@PartVariance, '')
)
BEGIN
INSERT INTO Part (PartNo, PartName, PartVariance)
VALUES (@PartNo, @PartName, @PartVariance)
END
END
Note: you have mentioned that only PartVariance can be NULL. If the same is true of PartNo, then COALESCE can be used for matching the PartNo column as well.
Well, NULL is a problem when it comes to SQL Server. You can't use equality checks on it (=, <>), since both will return UNKNOWN, which is treated as false.
However, you can use a combination of IS NULL, OR and AND to get the desired results.
With SQL Server 2012 or higher (in older versions, change IIF to a CASE expression), you can do this:
IF NOT EXISTS (SELECT PartNo, PartVariance
FROM Part
WHERE IIF((PartNo IS NULL AND @PartNo IS NULL) OR (PartNo = @PartNo), 0, 1) = 0
AND IIF((PartVariance IS NULL AND @PartVariance IS NULL) OR (PartVariance = @PartVariance), 0, 1) = 0)
If both PartNo and @PartNo are null, or they contain the same value (remember, null = any other value evaluates to false), the IIF will return 0 and the row counts as a match for the EXISTS check; otherwise (meaning the column and the variable contain different values, even if one of them is null) it will return 1 and the row is not treated as a match.
Of course, the second IIF does the same thing for the other column/variable combination.
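For completeness, a sketch of how that predicate could slot back into the procedure from the question (use ALTER PROCEDURE if it already exists; everything except the duplicate check is unchanged):
CREATE PROCEDURE [dbo].[spInsertPart]
(@PartNo NCHAR(50),
@PartName NCHAR(50) = NULL,
@PartVariance NCHAR(30) = NULL)
AS
BEGIN
SET NOCOUNT OFF;
-- a row is a duplicate only when both columns match, treating NULL as an ordinary value
IF NOT EXISTS (SELECT 1
FROM Part
WHERE IIF((PartNo IS NULL AND @PartNo IS NULL) OR (PartNo = @PartNo), 0, 1) = 0
AND IIF((PartVariance IS NULL AND @PartVariance IS NULL) OR (PartVariance = @PartVariance), 0, 1) = 0)
BEGIN
INSERT INTO Part (PartNo, PartName, PartVariance)
VALUES (@PartNo, @PartName, @PartVariance)
END
END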
Referred here by #sqlhelp on Twitter (solved - see the solution at the end of the post).
I'm trying to speed up an SSIS package that inserts 29 million rows of new data, then updates those rows with 2 additional columns. So far the package loops through a folder containing files, inserts the flat files into the database, then performs the update and archives the file. Added (thanks to @billinkc): the SSIS order is Foreach Loop, Data Flow, Execute SQL Task, File Task.
What doesn't take long: The loop, the file move and truncating the tables (stage).
What takes long: inserting the data and running the two statements below. First, this update:
UPDATE dbo.Stage
SET Number = REPLACE(Number,',','')
and then this Execute SQL Task:
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
-- Variables for insert
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SELECT @state
SELECT @date
-- Inserts the values into main table
INSERT INTO dbo.MainTable (Phone,State,Date)
SELECT d.Number, @state, @date
FROM Stage d
-- Clears the Reference and Stage table
DROP TABLE #Ref
TRUNCATE TABLE Stage
Note that I've toyed with upping Rows per batch on the insert and Max insert commit size, but neither have affected the package speed.
Solved and Added:
For those interested in the numbers: the original package time was 11.75 minutes; with William's technique (see below) it dropped to 9.5 minutes. Granted, with 29 million rows and on a slower server this is to be expected, but hopefully it shows the actual data behind how effective this is. The key is to keep as much of the processing as possible in the Data Flow task, as updating the data after the data flow consumed a significant portion of the time.
Hopefully that helps anyone else out there with a similar problem.
Update two: I added an IF statement and that reduced it from 9 minutes to 4 minutes. Final code for Execute SQL Task:
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
DECLARE @validdate datetime
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SET @validdate = DATEADD(DD,-30,getdate())
IF @date < @validdate
BEGIN
TRUNCATE TABLE dbo.Stage
TRUNCATE TABLE #Ref
END
ELSE
BEGIN
-- Inserts new values
INSERT INTO dbo.MainTable (Number,State,Date)
SELECT d.Number, @state, @date
FROM Stage d
-- Clears the Reference and Stage table after the insert
DROP TABLE #Ref
TRUNCATE TABLE Stage
END
As I understand it, you are reading ~29,000,000 rows from flat files and writing them into a staging table, then running a SQL script that updates (reads/writes) the same 29,000,000 rows in the staging table, and then moving those 29,000,000 records (read from staging, write to NAT) into the final table.
Couldn't you read from your flat files, use SSIS transformations to clean your data and add your two additional columns, and then write directly into the final table? You would then work on each distinct set of data only once, rather than the three (six, if you count reads and writes separately) times that your current process does.
I would change your data flow to transform in process the needed items and write directly into my final table.
edit
From the SQL in your question it appears you are transforming the data by removing commas from the Phone field, then retrieving the State and the Date from specific portions of the path of the file currently being processed, and then storing those three data points in the NAT table. All of this can be done with a Derived Column transformation in your Data Flow.
For the State and Date columns, set up two new variables called State and Date. Use expressions in the variable definitions to set them to the correct values (like you did in your SQL). When the Path variable updates (in your loop, I assume), the State and Date variables will update as well.
In the Derived Column Transformation, drag the State Variable into the Expression field and create a new column called State.
Repeat for Date.
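As a rough sketch (assuming a package variable named User::Path holding the full path of the current file, and the same character positions as your SQL), the variable expressions might look like this:
State variable expression:
SUBSTRING(RIGHT(@[User::Path], FINDSTRING(REVERSE(@[User::Path]), "\\", 1) - 1), 12, 2)
Date variable expression:
SUBSTRING(RIGHT(@[User::Path], FINDSTRING(REVERSE(@[User::Path]), "\\", 1) - 1), 1, 10)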
For the Phone column, in the Derived Column transformation create an expression like the following:
REPLACE( [Phone], ",", "" )
Set the Derived Column field to Replace 'Phone'
For your output, create a destination to your NAT table and link Phone, State, and Date columns in your data flow to the appropriate columns in the NAT table.
If there are additional columns in your input, you can choose not to bring them in from your source, since it appears that you are only acting on the Phone column from the original data.
/edit
I am using SSIS to move Excel data to a temp SQL Server table and from there to the target table.
So my temp table consists of only varchar columns, while my target table expects money values for some columns. In the source Excel file the original columns contain a formula but leave an empty cell on some rows, which the temp table then also holds as an empty value. But when I cast one of these columns to money, these originally blank cells become 0,00 in the target column.
Of course that is not what I want, so how can I get NULL values in there? Keep in mind that a genuine 0,00 can also show up in these columns.
I guess I would need to edit my temp table to turn the empty cells into NULL. Can I do this from within an SSIS package, or is there a setting for the table I could use?
thank you.
For existing data you can write a simple script that updates data to NULL where empty.
UPDATE YourTable SET Column = NULL WHERE Column = ''
For inserts you can use the NULLIF function to insert NULLs where the value is empty:
INSERT INTO YourTable (yourColumn)
SELECT NULLIF(sourceColum, '') FROM SourceTable
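Since the target columns are money, the same NULLIF trick can be combined with the cast when loading the target table; a minimal sketch with assumed table and column names (TargetTable, MoneyColumn):
INSERT INTO TargetTable (MoneyColumn)
SELECT CAST(NULLIF(sourceColum, '') AS money) -- empty strings become NULL instead of 0,00
FROM SourceTable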
Edit: for multiple column updates you need to combine the two solutions and write something like:
UPDATE YourTable SET
Column1 = NULLIF(Column1, '')
, Column2 = NULLIF(Column2, '')
WHERE Column1 = '' OR Column2 = ''
and so on for any additional columns. That will update all of them in a single statement.
I have a database in SQL Server with its data. I need to change part of some column values under certain conditions.
Imagine the value as "0010020001".
002 belongs to another value in my database, and whenever I want to change it to 005, I must update the previous 10-digit code to "001005001".
Actually, I need to update just part of a column's value using an UPDATE statement. How can I do it (in this example)?
While everyone else is correct that if you have control of the schema you should definitely not store your data this way, this is how I would solve the issue as you described it if I couldn't adjust the schema.
IF OBJECT_ID('tempdb..#test') IS NOT NULL
DROP TABLE #test
create table #test
(
id int,
multivaluecolumn varchar(20)
)
insert #Test
select 1,'001002001'
UNION
select 2,'002004002'
UNION
select 3,'003006003'
GO
declare @oldmiddlevalue char(3)
set @oldmiddlevalue = '002'
declare @newmiddlevalue char(3)
set @newmiddlevalue = '005'
select * from #Test
Update #Test set multivaluecolumn = left(multivaluecolumn,3) + @newmiddlevalue + right(multivaluecolumn,3)
where substring(multivaluecolumn,4,3) = @oldmiddlevalue
select * from #Test
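If you prefer, the same positional replacement can be written with STUFF, which swaps out a fixed-length segment starting at a given position (a sketch using the same variables as above):
Update #Test set multivaluecolumn = stuff(multivaluecolumn, 4, 3, @newmiddlevalue)
where substring(multivaluecolumn, 4, 3) = @oldmiddlevalue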
Why don't you use CSV (comma-separated values), or some other delimiter such as ~, to store the values? When you need to update part of it, split the string (for example with PHP's explode function), update the piece you need, and once your work is done concatenate the values again to get the desired string to store in your column.
In that case your column will have VARCHAR values like 001~002~0001.