I have a problem that I cannot solve. I have a point of sale that inserts into 3 tables after selling something: the header table, the detail table and the customers table (in that order).
I have a stored procedure that inserts the last record into the bak table; in the "text" column it inserts a concatenation of data from the 3 tables (this is essential). It also uses a STUFF method that gathers all the details into a single row with the corresponding header (only one row per header). When I execute the procedure manually after the insert it works normally, but when it runs from a trigger I get an error saying a NULL value cannot be inserted into the "text" column. This happens because the trigger is on the header table and fires before the other 2 tables are filled. If I put the trigger on the detail table, the STUFF method does not work (since it inserts one record per header row). The idea is to keep the trigger on the header table but wait until everything is completed. Is there some way to do it? I cannot touch anything in the point of sale; everything has to be done at the database level (SQL Server 2008). Can something be done? Can the trigger be delayed so that it waits until the other 2 tables have been filled?
--INTERMEDIATE TABLE
CREATE TABLE [bak]
(
[id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[date] [date] NOT NULL,
[serie] [varchar](2) NOT NULL,
[text] [varchar](6000) NOT NULL
);
--STORED PROCEDURE
CREATE PROCEDURE sp_bak
AS
BEGIN
;
WITH CTE
AS (SELECT
h.date InsertDate,
h.series DocumentSerie,
('349891894' + h.date + h.series + h.total +
h.type + CHAR(13) + CHAR(10)) HeaderData,
(d.quantity + d.price + d.description + c.name +
c.identification) DetailData
FROM header h
FULL JOIN detail d
ON h.cod = d.cod
FULL JOIN customers c
ON d.cod = c.cod)
INSERT dbo.bak (date, serie, text)
SELECT TOP 1
InsertDate,
DocumentSerie,
HeaderData + REPLACE(STUFF((SELECT
';' + DetailData
FROM CTE C
WHERE C.HeaderData = T.HeaderData
FOR xml PATH ('')), 1, 1, ''), ';', CHAR(13) + CHAR(10))
FROM CTE T
GROUP BY HeaderData,
DocumentSerie,
InsertDate
order by InsertDate DESC
END
--TRIGGER
CREATE TRIGGER dbo.tr_bak
ON dbo.header
AFTER INSERT
AS
BEGIN
EXEC sp_bak
END
GO
--ERROR
Msg 515, Level 16, State 2, Procedure sp_bak, Line 4
Cannot insert the value NULL into column 'text', table 'VIDEOJUEGOS.dbo.bak'; column does not allow nulls. INSERT fails.
What I understand is: you have data coming into three tables, and only after all three tables have received the data do you have to insert into the BAK table. You don't have control over when the data will arrive in the three tables.
I would suggest the approaches below:
Have a batch stored procedure that is scheduled to run on a regular basis and does this kind of batch insertion. You can use a date filter to pick up only the values that were inserted after the previous run.
Disclaimer: this is a bad way to use triggers. Define three triggers, one on each of the three tables. Each trigger checks for the existence of the matching records in all three tables and builds the header text; if the header value is NOT NULL, it inserts into the BAK table. That way the BAK insert can come through any of the three triggers (see the sketch below).
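For example, a minimal sketch of the trigger on the customers table (the last one the point of sale fills), assuming the three tables are linked by the cod column used in sp_bak's joins; the triggers on header and detail would follow the same pattern:
-- Sketch only: run the BAK load only once the matching header and detail rows exist
CREATE TRIGGER dbo.tr_bak_customers
ON dbo.customers
AFTER INSERT
AS
BEGIN
    IF EXISTS (SELECT 1
               FROM inserted i
               JOIN dbo.header h ON h.cod = i.cod
               JOIN dbo.detail d ON d.cod = i.cod)
    BEGIN
        EXEC sp_bak;
    END
END
GO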
In your stored procedure for sp_bak, you can have the first statement as:
WAITFOR DELAY '00:00:05'
The time can be any number of seconds, but it should be enough for the other tables to get their values inserted. The statement above delays execution of the statements that follow the WAITFOR by 5 seconds.
Be aware that this might not be a good way to handle this kind of issue: if the delay is too long, the trigger takes that long to execute and delays the calling operation.
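For illustration, a sketch of where the delay would sit; the CTE and INSERT are exactly the ones already in the question's procedure:
ALTER PROCEDURE sp_bak
AS
BEGIN
    -- Give the point of sale time to finish filling detail and customers
    WAITFOR DELAY '00:00:05';

    -- ... the existing CTE + INSERT INTO dbo.bak logic from the question ...
END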
I have a stored procedure that performs a bulk insert of a large number of DNS log entries. I wish to summarise this raw data in a new table for analysis. The new table holds only one record per FQDN and record type, with a hit count.
Source table might include 100 rows of:
FQDN, Type
www.microsoft.com,A
Destination table would have:
FQDN, Type, HitCount
www.microsoft.com, A, 100
The SP establishes a unique ID made up of [FQDN] +'|'+ [Type], which is then used as the primary key in the destination table.
My plan was to have the SP fire a trigger that did an UPDATE...IF @@ROWCOUNT=0...INSERT. However, that of course failed because the trigger receives all the [inserted] rows as a single set so always throws a key violation error.
I'm having trouble getting my head around a solution and need some fresh eyes and better skills to take a look. The bulk insert SP works just fine and the raw data is exactly as desired. However trying to come up with a suitable method to create the summary data is beyond my present skills/mindset.
I have several 10s of TB of data to process, so I don't see the summary as something we could do dynamically with a SELECT COUNT - which is why I started down the trigger route.
The relevant code in the SP is driven by a cursor consisting of a list of compressed log files needing to be decompressed and bulk-inserted, and is as follows:
-- Bulk insert to a view because bulk insert cannot populate the UID field
SET @strDynamicSQL = 'BULK INSERT [DNS_Raw_Logs].[dbo].[vwtblRawQueryLogData] FROM ''' + @strTarFolder + '\' + @strLogFileName + ''' WITH (FIRSTROW = 1, FIELDTERMINATOR = '' '', ROWTERMINATOR = ''0x0a'', ERRORFILE = ''' + @strTarFolder + '\' + @strErrorFile + ''', TABLOCK)'
--PRINT @strDynamicSQL
EXEC (@strDynamicSQL)
-- Update [UID] field after the bulk insert
UPDATE [DNS_Raw_Logs].[dbo].[tblRawQueryLogData]
SET [UID] = [FQDN] + '|' + [Type]
FROM [tblRawQueryLogData]
WHERE [UID] IS NULL
I know that the UPDATE...IF @@ROWCOUNT=0...INSERT solution is wrong because it assumes that the input data is a single row. I'd appreciate help on a way to do this.
Thank you
First, at that scale make sure you understand columnstore tables. They are very highly compressed and fast to scan.
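For instance, a minimal sketch of a clustered columnstore index on the raw table; this assumes SQL Server 2014 or later, that the existing primary key is (or can be made) nonclustered, and the index name is illustrative:
-- Columnstore storage compresses the raw log heavily and speeds up large scans/aggregations
CREATE CLUSTERED COLUMNSTORE INDEX cci_tblRawQueryLogData
    ON dbo.tblRawQueryLogData;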
Then write a query that reads from the raw table and returns the summarized data:
create or alter view DnsSummary
as
select FQDN, Type, count(*) HitCount
from tblRawQueryLogData
group by FQDN, Type
Then, if querying that view directly is too expensive, write a stored procedure that loads a summary table after each bulk insert, or make the view into an indexed view (a sketch of the indexed view follows).
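A minimal sketch of the indexed-view option (the view above would need to be recreated with these options); indexed views require WITH SCHEMABINDING, two-part object names, COUNT_BIG(*), and a unique clustered index to materialize the result:
CREATE VIEW dbo.DnsSummary
WITH SCHEMABINDING
AS
SELECT FQDN, Type, COUNT_BIG(*) AS HitCount
FROM dbo.tblRawQueryLogData
GROUP BY FQDN, Type;
GO

-- The unique clustered index is what persists (materializes) the view
CREATE UNIQUE CLUSTERED INDEX IX_DnsSummary
    ON dbo.DnsSummary (FQDN, Type);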
Thanks for the answer, David - obvious when someone else looks at it!
I ran the view-based solution with 14M records (about 4 hours' worth) and it took 40 seconds to return, so I think I'll modify the SP to drop and re-create the summary table each time it runs the bulk insert.
The source table also includes a timestamp for each entry. I would like to grab the earliest and latest times associated with each UID and add that to the summary.
My current summary query (courtesy of David) looks like this:
SELECT [UID], [FQDN], [Type], COUNT([UID]) AS [HitCount]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY COUNT([UID]) DESC
And returns:
UID, FQDN, Type, HitCount
www.microsoft.com|A, www.microsoft.com, A, 100
If I wanted to grab the earliest and latest times then I think I'm looking at nesting 3 queries to grab the earliest time (SELECT TOP N...ORDER BY... ASC), the latest time (SELECT TOP N...ORDER BY... DESC) and the hit count. Is there a more efficient way of doing this, before I try and wrap my head around this route?
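For what it's worth, a single aggregate pass with MIN and MAX would avoid nesting queries entirely; a sketch, where [LogTime] stands in for whatever the timestamp column is actually called:
SELECT [UID], [FQDN], [Type],
       COUNT(*)       AS [HitCount],
       MIN([LogTime]) AS [EarliestHit],
       MAX([LogTime]) AS [LatestHit]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY COUNT(*) DESC;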
I have a table called dsReplicated.matDB and a column fee_earner. When that column is updated, I want to record two pieces of information:
dsReplicated.matDB.mt_code
dsReplicated.matDB.fee_earner
from the row where fee_earner has been updated.
I've got the basic syntax for doing something when the column is updated but need a hand with the above to get this over the line.
ALTER TRIGGER [dsReplicated].[tr_mfeModified]
ON [dsReplicated].[matdb]
AFTER UPDATE
AS
BEGIN
IF (UPDATE(fee_earner))
BEGIN
print 'Matter fee earner changed to '
END
END
The problem with triggers in SQL Server is that they are called once per SQL statement - not once per row. So if your UPDATE statement updates 10 rows, your trigger is called once, and the Inserted and Deleted pseudo tables inside the trigger each contain 10 rows of data.
In order to see if fee_earner has changed, I'd recommend using this approach instead of the UPDATE() function:
ALTER TRIGGER [dsReplicated].[tr_mfeModified]
ON [dsReplicated].[matdb]
AFTER UPDATE
AS
BEGIN
-- I'm just *speculating* here what you want to do with that information - adapt as needed!
INSERT INTO dbo.AuditTable (Id, TriggerTimeStamp, Mt_Code, Old_Fee_Earner, New_Fee_Earner)
SELECT
i.PrimaryKey, SYSDATETIME(), i.Mt_Code, d.fee_earner, i.fee_earner
FROM Inserted i
-- use the two pseudo tables to detect if the column "fee_earner" has
-- changed with the UPDATE operation
INNER JOIN Deleted d ON i.PrimaryKey = d.PrimaryKey
AND d.fee_earner <> i.fee_earner
END
The Deleted pseudo table contains the values before the UPDATE - so that's why I take the d.fee_earner as the value for the Old_Fee_Earner column in the audit table.
The Inserted pseudo table contains the values after the UPDATE - so that's why I take the other values from that Inserted pseudo-table to insert into the audit table.
Note that you really must have an unchangeable primary key in that table in order for this trigger to work. This is a recommended best practice for any data table in SQL Server anyway.
OK SQL Server fans, I have an issue with a legacy stored procedure that sits inside a SQL Server 2008 R2 instance I have inherited, along with PROD data that is, to say the least, horrible. Also, I can NOT make any changes to the data or the table structures.
So here is my problem: the stored procedure in question runs daily and is used to update the employee table. As you can see from my example, the incoming data (#New_Employees) contains the updated data, and I need to use it to update the employee data stored in the #Existing_Employees table. Throughout the years different formats of the EMP_ID value have been used, and they must be maintained as is (I fought and lost that battle). Thankfully, I have been successful in changing the format of the EMP_ID column in the #New_Employees table (yeah!), and any new records will use this format.
So now you may see my problem: I need to update the ID column in the #New_Employees table with the corresponding ID from the #Existing_Employees table by matching (that's right, you guessed it) on the EMP_ID columns. I came up with an extremely hacky way to handle the disparate formats of the EMP_ID columns, but it is very slow considering the number of rows I need to process (1M+).
I thought of creating a staging table where I could simply cast the EMP_ID columns to an INT and then back to an NVARCHAR in each table to remove the leading zeros, and I am sort of leaning that way, but I wanted to see if there was another way to handle this dysfunctional data. Any constructive comments are welcome.
IF OBJECT_ID(N'TempDB..#NEW_EMPLOYEES') IS NOT NULL
DROP TABLE #NEW_EMPLOYEES
CREATE TABLE #NEW_EMPLOYEES(
ID INT
,EMP_ID NVARCHAR(50)
,NAME NVARCHAR(50))
GO
IF OBJECT_ID(N'TempDB..#EXISTING_EMPLOYEES') IS NOT NULL
DROP TABLE #EXISTING_EMPLOYEES
CREATE TABLE #EXISTING_EMPLOYEES(
ID INT PRIMARY KEY
,EMP_ID NVARCHAR(50)
,NAME NVARCHAR(50))
GO
INSERT INTO #NEW_EMPLOYEES
VALUES(NULL, '00123', 'Adam Arkin')
,(NULL, '00345', 'Bob Baker')
,(NULL, '00526', 'Charles Nelson O''Reilly')
,(NULL, '04321', 'David Numberman')
,(NULL, '44321', 'Ida Falcone')
INSERT INTO #EXISTING_EMPLOYEES
VALUES(1, '123', 'Adam Arkin')
,(2, '000345', 'Bob Baker')
,(3, '526', 'Charles Nelson O''Reilly')
,(4, '0004321', 'Ed Sullivan')
,(5, '02143', 'Frank Sinatra')
,(6, '5567', 'George Thorogood')
,(7, '0000123-1', 'Adam Arkin')
,(8, '7', 'Harry Hamilton')
-- First Method - Not Successful
UPDATE NE
SET ID = EE.ID
FROM
#NEW_EMPLOYEES NE
LEFT OUTER JOIN #EXISTING_EMPLOYEES EE
ON EE.EMP_ID = NE.EMP_ID
SELECT * FROM #NEW_EMPLOYEES
-- Second Method - Successful but Slow
UPDATE NE
SET ID = EE.ID
FROM
dbo.#NEW_EMPLOYEES NE
LEFT OUTER JOIN dbo.#EXISTING_EMPLOYEES EE
ON CAST(CASE WHEN NE.EMP_ID LIKE N'%[^0-9]%'
THEN NE.EMP_ID
ELSE LTRIM(STR(CAST(NE.EMP_ID AS INT))) END AS NVARCHAR(50)) =
CAST(CASE WHEN EE.EMP_ID LIKE N'%[^0-9]%'
THEN EE.EMP_ID
ELSE LTRIM(STR(CAST(EE.EMP_ID AS INT))) END AS NVARCHAR(50))
SELECT * FROM #NEW_EMPLOYEES
the number of rows that I need to process (1M+).
A million employees? Per day?
I think I would add a 3rd table:
create table #ids ( id INT NOT NULL PRIMARY KEY
, emp_id NVARCHAR(50) NOT NULL UNIQUE );
Populate that table using your LTRIM(STR(CAST(...))) algorithm (ahem), and update Employees directly from a join of those three tables.
I recommend using an ANSI update, not Microsoft's nonstandard UPDATE ... FROM, because the ANSI version prevents nondeterministic results in cases where the FROM produces more than one row. A sketch of that approach follows.
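A minimal sketch of that idea, reusing the normalization rule from the question's second method (strip leading zeros only from purely numeric EMP_IDs); the #ids name and the UNIQUE constraint are assumptions:
-- One row per existing employee, keyed by the normalized EMP_ID.
-- The UNIQUE constraint will immediately surface any two existing
-- employees whose EMP_IDs normalize to the same value.
CREATE TABLE #ids (
     id     INT          NOT NULL PRIMARY KEY
    ,emp_id NVARCHAR(50) NOT NULL UNIQUE
);

INSERT INTO #ids (id, emp_id)
SELECT ID,
       CASE WHEN EMP_ID LIKE N'%[^0-9]%'
            THEN EMP_ID
            ELSE LTRIM(STR(CAST(EMP_ID AS INT))) END
FROM #EXISTING_EMPLOYEES;

-- ANSI-style update: the correlated subquery returns at most one row
-- per #NEW_EMPLOYEES row because #ids.emp_id is unique
UPDATE #NEW_EMPLOYEES
SET ID = (SELECT i.id
          FROM #ids i
          WHERE i.emp_id = CASE WHEN #NEW_EMPLOYEES.EMP_ID LIKE N'%[^0-9]%'
                                THEN #NEW_EMPLOYEES.EMP_ID
                                ELSE LTRIM(STR(CAST(#NEW_EMPLOYEES.EMP_ID AS INT))) END);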
Referred here by #sqlhelp on Twitter (Solved - See the solution at the end of the post).
I'm trying to speed up an SSIS package that inserts 29 million rows of new data, then updates those rows with 2 additional columns. So far the package loops through a folder containing files, inserts the flat files into the database, then performs the update and archives the file. Added (thanks to @billinkc): the SSIS order is Foreach Loop, Data Flow, Execute SQL Task, File Task.
What doesn't take long: The loop, the file move and truncating the tables (stage).
What takes long: inserting the data and running the statements below (first the UPDATE, then the Execute SQL Task):
UPDATE dbo.Stage
SET Number = REPLACE(Number,',','')
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
-- Variables for insert
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SELECT @state
SELECT @date
-- Inserts the values into main table
INSERT INTO dbo.MainTable (Phone,State,Date)
SELECT d.Number, @state, @date
FROM Stage d
-- Clears the Reference and Stage table
DROP TABLE #Ref
TRUNCATE TABLE Stage
Note that I've toyed with upping Rows per batch on the insert and Max insert commit size, but neither has affected the package speed.
Solved and Added:
For those interested in the numbers: the original package time was 11.75 minutes; with William's technique (see below) it dropped to 9.5 minutes. Granted, with 29 million rows and on a slower server, this can be expected, but hopefully that shows you the actual data behind how effective this is. The key is to keep as many processes happening in the Data Flow task as possible, as updating the data (after the data flow) consumed a significant portion of the time.
Hopefully that helps anyone else out there with a similar problem.
Update two: I added an IF statement and that reduced it from 9 minutes to 4 minutes. Final code for Execute SQL Task:
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
DECLARE @validdate datetime
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SET @validdate = DATEADD(DD,-30,getdate())
IF @date < @validdate
BEGIN
TRUNCATE TABLE dbo.Stage
TRUNCATE TABLE #Ref
END
ELSE
BEGIN
-- Inserts new values
INSERT INTO dbo.MainTable (Number,State,Date)
SELECT d.Number, #state, #date
FROM Stage d
-- Clears the Reference and Stage table after the insert
DROP TABLE #Ref
TRUNCATE TABLE Stage
END
As I understand it, you are reading ~29,000,000 rows from flat files and writing them into a staging table, then running a SQL script that updates (reads/writes) the same 29,000,000 rows in the staging table, then moves those 29,000,000 records (read from staging, then write to NAT) to the final table.
Couldn't you read from your flat files, use SSIS transformations to clean your data and add your two additional columns, then write directly into the final table? You would then work on each distinct set of data only once, rather than the three (six if you count reads and writes as distinct) times that your process does.
I would change your data flow to transform the needed items in the flow and write directly into the final table.
edit
From the SQL in your question it appears you are transforming the data by removing commas from the PHONE field, then retrieving the STATE and the Date from specific portions of the path of the currently processed file, then storing those three data points in the NAT table. Those things can be done with the Derived Column transformation in your Data Flow.
For the State and Date columns, set up two new variables called State and Date. Use expressions in the variable definitions to set them to the correct values (like you did in your SQL). When the Path variable updates (in your loop, I assume), the State and Date variables will update as well.
In the Derived Column Transformation, drag the State Variable into the Expression field and create a new column called State.
Repeat for Date.
For the PHONE column, in the Derived Column transformation create an expression like the following:
REPLACE( [Phone], ",", "" )
Set the Derived Column field to Replace 'Phone'
For your output, create a destination to your NAT table and link Phone, State, and Date columns in your data flow to the appropriate columns in the NAT table.
If there are additional columns in your input, you can choose not to bring them in from your source, since it appears that you are only acting on the Phone column from the original data.
/edit
I have two processes that work with data in the same table.
One process inserts about 20,000 records a day into the target table, one by one (pure ADO.NET).
The second process periodically (every 15 minutes) calls a stored procedure that:
1. Detects the duplicates in those 20,000 records by looking at all the records 7 days back and marks them as such.
2. Marks all records that are not duplicates with a 'ToBeCopied' flag.
3. Selects a number of columns from the records marked as 'ToBeCopied' and returns the set.
Sometimes these two processes overlap (due to delays in data processing), and I suspect that if the first process inserts new records while the second process is somewhere between steps 1 and 2, those records will be marked 'ToBeCopied' without having gone through the duplicate sifting.
This means that the stored procedure is now returning some duplicates.
This is my theory, but in practice I have not been able to replicate it...
I am using LINQ to SQL to insert duplicates (40-50 or so a second) and while this is running I am manually calling the stored procedure and store its results.
It appears that when the stored procedure is running the inserting pauses ... such that at the end no duplicates have made it to the final result set.
I am wondering if LINQ to SQL or SQL Server has a default mechanism that prevents concurrency and is pausing the inserting while the selecting or updating takes place.
What do you think?
EDIT 1:
The 'duplicates' are not identical rows. They are 'equivalent' given the business/logical entities these records represent. Each row has a unique primary key.
P.S. Selecting the result set takes place with NOLOCK. Trying to reproduce on SQL Server 2008. Problem is alleged to occur on SQL Server 2005.
What do I think?
Why do you have duplicates in the database? Data purity begins in the client at the app drawing board, which should have a data model that simply does not allow for duplicates.
Why do you have duplicates in the database? Check constraints should prevent this from happening if the client app misbehaves
If you have duplicates, the reader must be prepared to handle them.
You cannot detect duplicates in two stages (look then mark), it has to be one single atomic mark. In fact, you cannot do almost anything in a database in two stages 'look and mark'. All 'look for record then mark the records found' processes fail under concurrency.
NOLOCK will give you inconsistent reads. Records will be missing or read twice. Use SNAPSHOT isolation (see the snippet after this list).
Linq-To-SQL has no pixie dust to replace bad design.
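For reference, snapshot isolation has to be enabled at the database level before sessions can use it; a minimal snippet, where YourDatabase is a placeholder for the actual database name:
-- One-time setup: allow SNAPSHOT isolation in the database (placeholder name)
ALTER DATABASE YourDatabase SET ALLOW_SNAPSHOT_ISOLATION ON;

-- Readers then opt in per session or transaction:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;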
Update
Consider this for instance:
A staging table with a structure like:
CREATE TABLE T1 (
id INT NOT NULL IDENTITY(1,1) PRIMARY KEY,
date DATETIME NOT NULL DEFAULT GETDATE(),
data1 INT NULL,
data2 INT NULL,
data3 INT NULL);
Process A is doing inserts at leisure into this table. It does not do any validation, it just dumps raw records in:
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (1,2,3);
INSERT INTO T1 (data1, data2, data3) VALUES (2,2,3);
...
INSERT INTO T1 (data1, data2, data3) VALUES (2,1,4);
...
Process B is tasked with extracting from this staging table and moving cleaned-up data into a table T2. It has to remove duplicates, which by business rules means records with the same values in data1, data2 and data3. Within a set of duplicates, only the first record by date should be kept:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
DECLARE @maxid INT;
begin transaction
-- Snap the current max (ID)
--
select @maxid = MAX(id) from T1;
-- Extract the cleaned rows into T2 using ROW_NUMBER() to
-- filter out duplicates
--
with cte as (
SELECT date, data1, data2, data3,
ROW_NUMBER() OVER
(PARTITION BY data1, data2, data3 ORDER BY date) as rn
FROM T1
WHERE id <= @maxid)
MERGE INTO T2
USING (
SELECT date, data1, data2, data3
FROM cte
WHERE rn = 1
) s ON s.data1 = T2.data1
AND s.data2 = T2.data2
AND s.data3 = T2.data3
WHEN NOT MATCHED BY TARGET
THEN INSERT (date, data1, data2, data3)
VALUES (s.date, s.data1, s.data2, s.data3);
-- Delete the processed row up to #maxid
--
DELETE FROM T1
WHERE id <= @maxid;
COMMIT;
Assuming Process A only inserts, this procedure will safely process the staging table and extract the cleaned, de-duplicated rows. Of course, this is just a skeleton; a true ETL process would have error handling via BEGIN TRY/BEGIN CATCH and transaction log size control via batching. A rough sketch of such a wrapper follows.
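Purely illustrative, a sketch of that wrapper around the skeleton above; the batch size, loop shape and error policy are assumptions, and RAISERROR is used because the question targets SQL Server 2005/2008:
SET TRANSACTION ISOLATION LEVEL SNAPSHOT;

DECLARE @maxid INT;
DECLARE @batchsize INT;
SET @batchsize = 100000;

WHILE 1 = 1
BEGIN
    BEGIN TRY
        BEGIN TRANSACTION;

        -- Snap a bounded batch instead of everything inserted so far,
        -- to keep each transaction (and the log) small
        SELECT @maxid = MAX(id)
        FROM (SELECT TOP (@batchsize) id FROM T1 ORDER BY id) AS b;

        IF @maxid IS NULL
        BEGIN
            COMMIT TRANSACTION;
            BREAK;  -- staging table drained
        END

        -- ... the CTE + MERGE + DELETE from the skeleton above, bounded by @maxid ...

        COMMIT TRANSACTION;
    END TRY
    BEGIN CATCH
        IF XACT_STATE() <> 0 ROLLBACK TRANSACTION;
        RAISERROR('Staging ETL batch failed', 16, 1);
        BREAK;
    END CATCH
END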
When are you calling SubmitChanges() on your data context? I believe that this happens within a transaction.
As for your problem, what you are saying sounds plausible - would it maybe make more sense to do your load into a staging table (if it's slow) and then do an
INSERT INTO ProductionTable SELECT * FROM StagingTable
once your load is complete?