Update on key violation in Stored Procedure using BULK INSERT & Trigger - sql-server

I have a stored procedure that performs a bulk insert of a large number of DNS log entries. I wish to summarise this raw data in a new table for analysis. For a given FQDN and record type, the new table holds only one record, with a hit count.
Source table might include 100 rows of:
FQDN, Type
www.microsoft.com,A
Destination table would have:
FQDN, Type, HitCount
www.microsoft.com, A, 100
The SP establishes a unique ID made up of [FQDN] +'|'+ [Type], which is then used as the primary key in the destination table.
My plan was to have the SP fire a trigger that did an UPDATE...IF @@ROWCOUNT=0...INSERT. However, that of course failed because the trigger receives all the inserted rows as a single set, so it always throws a key violation error.
I'm having trouble getting my head around a solution and need some fresh eyes and better skills to take a look. The bulk insert SP works just fine and the raw data is exactly as desired. However, coming up with a suitable method to create the summary data is beyond my present skills/mindset.
I have several tens of TB of data to process, so I don't see the summary as something we could do dynamically with a SELECT COUNT - which is why I started down the trigger route.
The relevant code in the SP is driven by a cursor consisting of a list of compressed log files needing to be decompressed and bulk-inserted, and is as follows:
-- Bulk insert to a view because bulk insert cannot populate the UID field
SET @strDynamicSQL = 'BULK INSERT [DNS_Raw_Logs].[dbo].[vwtblRawQueryLogData] FROM ''' + @strTarFolder + '\' + @strLogFileName + ''' WITH (FIRSTROW = 1, FIELDTERMINATOR = '' '', ROWTERMINATOR = ''0x0a'', ERRORFILE = ''' + @strTarFolder + '\' + @strErrorFile + ''', TABLOCK)'
--PRINT @strDynamicSQL
EXEC (@strDynamicSQL)
-- Update [UID] field after the bulk insert
UPDATE [DNS_Raw_Logs].[dbo].[tblRawQueryLogData]
SET [UID] = [FQDN] + '|' + [Type]
FROM [tblRawQueryLogData]
WHERE [UID] IS NULL
I know that the UPDATE...IF @@ROWCOUNT=0...INSERT solution is wrong because it assumes that the input data is a single row. I'd appreciate help on a way to do this.
Thank you

First, at that scale make sure you understand columnstore tables. They are very highly compressed and fast to scan.
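For reference, a minimal sketch of putting a clustered columnstore index on the raw table (SQL Server 2014 or later; the index name is illustrative, and this assumes the table does not already have a clustered index, or that you are willing to rebuild it):
-- Converts the raw table to columnstore storage (highly compressed, fast to scan)
CREATE CLUSTERED COLUMNSTORE INDEX CCI_tblRawQueryLogData
ON [DNS_Raw_Logs].[dbo].[tblRawQueryLogData];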
Then write a query that reads from the raw table and returns the summarized data:
create or alter view DnsSummary
as
select FQDN, Type, count(*) HitCount
from tblRawQueryLogData
group by FQDN, Type
Then if querying that view directly is too expensive, write a stored procedure that loads a table after each bulk insert. Or make the view into an indexed view.
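If you go the indexed-view route, bear in mind that indexed views require SCHEMABINDING, two-part table names and COUNT_BIG(*); a minimal sketch (the view and index names are illustrative):
CREATE VIEW dbo.DnsSummaryIndexed
WITH SCHEMABINDING
AS
SELECT FQDN, [Type], COUNT_BIG(*) AS HitCount
FROM dbo.tblRawQueryLogData
GROUP BY FQDN, [Type];
GO
-- The unique clustered index is what materializes (and automatically maintains) the summary
CREATE UNIQUE CLUSTERED INDEX IX_DnsSummaryIndexed
ON dbo.DnsSummaryIndexed (FQDN, [Type]);
The trade-off is that SQL Server then maintains the summary on every insert into the raw table, which will slow the bulk loads.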

Thanks for the answer David, obvious when someone else looks at it!
I ran the view-based solution with 14M records (about 4 hours' worth) and it took 40 seconds to return, so I think I'll modify the SP to drop and re-create the summary table each time it runs the bulk insert.
The source table also includes a timestamp for each entry. I would like to grab the earliest and latest times associated with each UID and add that to the summary.
My current summary query (courtesy of David) looks like this:
SELECT [UID], [FQDN], [Type], COUNT([UID]) AS [HitCount]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY COUNT([UID]) DESC
And returns:
UID, FQDN, Type, HitCount
www.microsoft.com|A, www.microsoft.com, A, 100
If I wanted to grab the earliest and latest times then I think I'm looking at nesting 3 queries: one to grab the earliest time (SELECT TOP N...ORDER BY... ASC), one for the latest time (SELECT TOP N...ORDER BY... DESC) and one for the hit count. Is there a more efficient way of doing this, before I try and wrap my head around this route?
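For what it's worth, MIN and MAX are aggregates just like COUNT, so the earliest and latest times can usually come out of the same GROUP BY rather than three nested queries. A sketch, assuming the timestamp column is called [LogTime] (substitute the real column name):
SELECT [UID], [FQDN], [Type],
       COUNT(*)       AS [HitCount],
       MIN([LogTime]) AS [FirstSeen],  -- [LogTime] is a placeholder for the actual timestamp column
       MAX([LogTime]) AS [LastSeen]
FROM [DNS_Raw_Logs].[dbo].tblRawQueryLogData
GROUP BY [UID], [FQDN], [Type]
ORDER BY [HitCount] DESC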

Related

Read amount on a postgres table

Is there any way to calculate the number of reads per second on a Postgres table?
But what I actually need to know is whether a table has any reads at the moment. (If not, then I can safely drop it.)
Thank you
To figure out if the table is currently in use, run:
SELECT pid
FROM pg_locks
WHERE relation = 'mytable'::regclass;
That will return the process IDs of all backends currently using it.
To measure whether a table is used at all, run this query:
SELECT seq_scan + idx_scan + n_tup_ins + n_tup_upd + n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'mytable';
Then repeat the query in a day. If the numbers haven't changed, nobody has used the table.
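If you want to make that comparison easier, one hypothetical option (not part of the original answer) is to snapshot the counters into a small table and compare the snapshots later:
-- Illustrative snapshot table; the names are made up for this sketch
CREATE TABLE IF NOT EXISTS usage_snapshot (
    taken_at timestamptz DEFAULT now(),
    activity bigint
);
-- Run this now and again in a day; identical activity values mean the table was not used in between
INSERT INTO usage_snapshot (activity)
SELECT seq_scan + idx_scan + n_tup_ins + n_tup_upd + n_tup_del
FROM pg_stat_user_tables
WHERE relname = 'mytable';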
Audit SELECT activity
My suggestion is to wrap mytable in a view (called the_view_to_use_instead in the example) which invokes a logging function on every select, and then to select from the view, i.e.
select <whatever you need> from the_view_to_use_instead ...
instead of
select <whatever you need> from mytable ...
So here it is
create table audit_log (table_name text, event_time timestamptz);
create function log_audit_event(tname text) returns void language sql as
$$
insert into audit_log values (tname, now());
$$;
create view the_view_to_use_instead as
select mytable.*
from mytable, log_audit_event('mytable') as ignored;
Every time someone queries the_view_to_use_instead, an audit record with a timestamp appears in the audit_log table. You can then examine it to find out whether and when mytable was selected from, and make your decision. The function log_audit_event can be reused in other similar scenarios. The average number of selects per second over the last 24 hours would be:
select count(*)::numeric/86400
from audit_log
where event_time > now() - interval '86400 seconds';

How to access a table ABC_XXX in Teradata where XXX changes periodically?

I have a table in Teradata named ABC_XXX, where XXX changes on a monthly basis.
For example: ABC_1902, ABC_1812, ABC_1904, etc.
I need to access this table in my application without modifying the code every month.
Is there any way to do this in Teradata, or any alternative solution?
Please help.
Can you try using DBC.TABLES in a subquery like below:
with tbl as (
  select 'select * from ' || databasename || '.' || tablename as tb
  from dbc.tables
  where tablename like 'ABC_%'
)
select * from tbl;
If you can get the final query executed in your application, you will be able to query the required table without editing the query.
The above solution assumes that the previous month's table gets dropped whenever a new month's table is created.
However, if the previous table is not being dropped, then you can try the below approach:
select 'select * from db.ABC_' || to_char(current_date, 'YYMM')
The output will be:
select * from db.ABC_1902
Execute the output in your application and you will be able to query the dynamic table.

How can I make the trigger wait?

I have a problem that I cannot solve. I have a point of sale that inserts into 3 tables after each sale: the header table, detail, and customers (in that order).
I have a stored procedure that inserts the latest record into the bak table; in the "text" column it inserts a concatenation of data from the 3 tables (this is fundamental). At the same time I have a STUFF-based query that gathers all the detail rows into a single row with the corresponding header (only one row per header). When I execute the procedure manually after the insert it works normally, but when I run it from a trigger I get an error that a null value cannot be inserted into the "text" column. This is because the trigger fires on the header table while the other 2 tables have not been filled yet; if I put the trigger at the detail level instead, the STUFF query does not work (since it inserts one record per header). The idea is to fire on the header table but wait until everything is completed. Is there some way to do that? I cannot touch anything on the point-of-sale side; everything has to be done at the database level (SQL Server 2008). Can something be done? Can the trigger be delayed so that it waits for the other 2 tables to be filled?
--INTERMEDIATE TABLE
CREATE TABLE [bak]
(
[id] [int] IDENTITY(1,1) NOT NULL PRIMARY KEY,
[date] [date] NOT NULL,
[serie] [varchar](2) NOT NULL,
[text] [varchar](6000) NOT NULL
);
--STORED PROCEDURE
CREATE PROCEDURE sp_bak
AS
BEGIN
;
WITH CTE
AS (SELECT
h.date InsertDate,
h.series DocumentSerie,
('349891894' + h.date + h.series + h.total +
h.type + CHAR(13) + CHAR(10)) HeaderData,
(d.quantity + d.price + d.description + c.name +
c.identification) DetailData
FROM header h
FULL JOIN detail d
ON h.cod = d.cod
FULL JOIN customers c
ON d.cod = c.cod)
INSERT dbo.bak (date, serie, text)
SELECT TOP 1
InsertDate,
DocumentSerie,
HeaderData + REPLACE(STUFF((SELECT
';' + DetailData
FROM CTE C
WHERE C.HeaderData = T.HeaderData
FOR xml PATH ('')), 1, 1, ''), ';', CHAR(13) + CHAR(10))
FROM CTE T
GROUP BY HeaderData,
DocumentSerie,
InsertDate
order by InsertDate DESC
END
--TRIGGER
CREATE TRIGGER dbo.tr_bak
ON dbo.header
AFTER INSERT
AS
BEGIN
EXEC sp_bak
END
GO
--ERROR
Msg 515, Level 16, State 2, Procedure sp_bak, Line 4
Cannot insert the value NULL into column 'text', table 'VIDEOJUEGOS.dbo.bak'; column does not allow nulls. INSERT fails.
What I understand is: you have data coming into three tables, and only after all three tables have got the data do you have to insert data into the BAK table. You don't have control over when the data will arrive in the three tables.
I would suggest the approaches below:
Have a batch stored procedure that is scheduled on a regular basis and does these kinds of batch insertions. You can have a date filter so you only pick the values that were inserted after the previous run.
Disclaimer: this is a bad way of using triggers. You need to define three triggers on the three tables. Always check for the existence of records in all three tables and generate the header. If the header value is NOT NULL, then insert into the BAK table. So here, the BAK table insert can come through any of the three triggers (a rough sketch follows).
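For example, since customers is the last of the three tables to be filled, one hypothetical variant is a trigger on customers that only calls the procedure once the matching header and detail rows exist. The trigger name and the [cod] join column below are assumptions based on the code above:
CREATE TRIGGER dbo.tr_customers_bak
ON dbo.customers
AFTER INSERT
AS
BEGIN
    -- Sketch only: run sp_bak only when the matching header and detail rows are already present
    IF EXISTS (SELECT 1
               FROM inserted i
               JOIN dbo.header h ON h.cod = i.cod
               JOIN dbo.detail d ON d.cod = i.cod)
    BEGIN
        EXEC dbo.sp_bak;
    END
END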
In your stored procedure for sp_bak, you can have the first statement as:
WAITFOR DELAY '00:00:05'
The time can be any number of seconds, but it should be enough for the other tables to get their values inserted. The above statement delays the execution of the statements following the WAITFOR by 5 seconds.
Be aware, this might not be a good way to handle this kind of issue: if the delay time is too high, the trigger will take that long to execute and delay the calling operations.

SSIS data flow - copy new data or update existing

I query some data from table A (source) based on a certain condition and insert it into a temp table (destination) before upserting into CRM.
If the data already exists in CRM I don't want to query it from table A and insert it into the temp table (I want this table to stay empty), unless that data has been updated or new data has been created. So basically I want to query only new data, or modified data from table A that already exists in CRM. At the moment my data flow is like this:
Clear the temp table - DELETE SQL statement.
Query from source table A and insert into the temp table.
From the temp table, insert into CRM using a script component.
In source table A I have audit columns: createdOn and modifiedOn.
I found one way to do this (SSIS DataFlow - copy only changed and new records) but I'm not really clear on how to do it.
What is the best and simplest way to achieve this?
The link you posted is basically saying to stage everything and use a MERGE to update your table (essentially an UPDATE/INSERT).
The only way I can really think of to make your process quicker (to a significant degree) by partially selecting from table A would be to add a "last updated" timestamp to table A and enforce that it is always kept up to date.
One way to do this is with a trigger; see here for an example.
You could then select based on that timestamp, perhaps keeping a record of the last timestamp used each time you run the SSIS package, and then adding a margin of safety to that.
Edit: I just saw that you already have a modifiedOn column, so you could use that as described above.
Examples:
There are a few different ways you could do it:
ONE
Include the modifiedOn column in your final destination table.
You can then build a dynamic query for your data flow source in an SSIS string variable, something like:
"SELECT * FROM [table A] WHERE modifiedOn >= DATEADD(DAY, -1, '" + #[User::MaxModifiedOnDate] + "')"
#[User::MaxModifiedOnDate] (string variable) would come from an Execute SQL Task, where you would write the result of the following query to it:
SELECT FORMAT(CAST(MAX(modifiedOn) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DestinationTable
The DATEADD part, as well as the CAST to a certain degree, represent your margin of safety.
TWO
If this isn't an option, you could keep a data load history table that would tell you when you need to load from, e.g.:
CREATE TABLE DataLoadHistory
(
DataLoadID int PRIMARY KEY IDENTITY
, DataLoadStart datetime NOT NULL
, DataLoadEnd datetime
, Success bit NOT NULL
)
You would begin each data load with this (Execute SQL Task):
CREATE PROCEDURE BeginDataLoad
@DataLoadID int OUTPUT
AS
INSERT INTO DataLoadHistory
(
DataLoadStart
, Success
)
VALUES
(
GETDATE()
, 0
)
SELECT @DataLoadID = SCOPE_IDENTITY()
You would store the returned DataLoadID in an SSIS integer variable, and use it when the data load is complete as follows:
CREATE PROCEDURE DataLoadComplete
@DataLoadID int
AS
UPDATE DataLoadHistory
SET
DataLoadEnd = GETDATE()
, Success = 1
WHERE DataLoadID = @DataLoadID
When it comes to building your query for table A, you would do it the same way as before (with the dynamically generated SQL query), except MaxModifiedOnDate would come from the following query:
SELECT FORMAT(CAST(MAX(DataLoadStart) AS date), 'yyyy-MM-dd') MaxModifiedOnDate FROM DataLoadHistory WHERE Success = 1
So the DataLoadHistory table, rather than your destination table.
Note that this would fail on the first run, as there'd be no successful entries in the history table, so you'd need to insert a dummy record or find some other way around it.
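One illustrative way around the first-run problem (not from the original answer) is to fall back to a date far enough in the past that everything gets staged, for example:
-- COALESCE supplies a very old default when the history table has no successful rows yet
SELECT FORMAT(CAST(COALESCE(MAX(DataLoadStart), '19000101') AS date), 'yyyy-MM-dd') MaxModifiedOnDate
FROM DataLoadHistory
WHERE Success = 1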
THREE
I've seen it done a lot where, say your data load is running every day, you would just stage the last 7 days, or something like that - some margin of safety that you're pretty sure will never be exceeded (because the process is being monitored for failures).
It's not my preferred option, but it is simple, and can work if you're confident in how well the process is being monitored.
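In that case the source query stops being dynamic at all; a sketch of the fixed-margin version (the 7-day window is just an example):
-- Stage everything modified in the last 7 days, relying on the margin of safety described above
SELECT *
FROM [table A]
WHERE modifiedOn >= DATEADD(DAY, -7, CAST(GETDATE() AS date))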

Speeding Up SSIS Package (Insert and Update)

Referred here by #sqlhelp on Twitter (Solved - See the solution at the end of the post).
I'm trying to speed up an SSIS package that inserts 29 million rows of new data, then updates those rows with 2 additional columns. So far the package loops through a folder containing files, inserts the flat files into the database, then performs the update and archives the file. Added (thanks to #billinkc): the SSIS order is Foreach Loop, Data Flow, Execute SQL Task, File Task.
What doesn't take long: The loop, the file move and truncating the tables (stage).
What takes long: inserting the data, and running this statement:
UPDATE dbo.Stage
SET Number = REPLACE(Number,',','')
as well as this Execute SQL Task:
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
-- Variables for insert
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SELECT @state
SELECT @date
-- Inserts the values into main table
INSERT INTO dbo.MainTable (Phone,State,Date)
SELECT d.Number, @state, @date
FROM Stage d
-- Clears the Reference and Stage table
DROP TABLE #Ref
TRUNCATE TABLE Stage
Note that I've toyed with upping Rows per batch on the insert and Max insert commit size, but neither have affected the package speed.
Solved and Added:
For those interested in the numbers: the OP package time was 11.75 minutes; with William's technique (see below) it dropped to 9.5 minutes. Granted, with 29 million rows and on a slower server this is to be expected, but hopefully that shows you the actual data behind how effective this is. The key is to keep as many processes as possible happening in the Data Flow task, as updating the data (after the data flow) consumed a significant portion of the time.
Hopefully that helps anyone else out there with a similar problem.
Update two: I added an IF statement and that reduced it from 9 minutes to 4 minutes. Final code for Execute SQL Task:
-- Creates temp table for State and Date
CREATE TABLE #Ref (Path VARCHAR(255))
INSERT INTO #Ref VALUES(?)
DECLARE @state AS VARCHAR(2)
DECLARE @date AS VARCHAR(12)
DECLARE @validdate datetime
SET @state = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),12,2) FROM #Ref)
SET @date = (SELECT SUBSTRING(RIGHT([Path], CHARINDEX('\', REVERSE([Path]))-1),1,10) FROM #Ref)
SET @validdate = DATEADD(DD,-30,getdate())
IF @date < @validdate
BEGIN
TRUNCATE TABLE dbo.Stage
TRUNCATE TABLE #Ref
END
ELSE
BEGIN
-- Inserts new values
INSERT INTO dbo.MainTable (Number,State,Date)
SELECT d.Number, @state, @date
FROM Stage d
-- Clears the Reference and Stage table after the insert
DROP TABLE #Ref
TRUNCATE TABLE Stage
END
As I understand it, you are reading ~29,000,000 rows from flat files and writing them into a staging table, then running a SQL script that updates (reads/writes) the same 29,000,000 rows in the staging table, then moves those 29,000,000 records (read from staging, write to NAT) to the final table.
Couldn't you read from your flat files, use SSIS transformations to clean your data and add your two additional columns, then write directly into the final table? You would then work on each distinct set of data only once, rather than the three (six if you count reads and writes as distinct) times that your process does.
I would change your data flow to transform the needed items in the pipeline and write directly into the final table.
edit
From the SQL in your question it appears you are transforming the data by removing commas from the Phone field, then retrieving the State and the Date from specific portions of the path of the file currently being processed, then storing those three data points in the NAT table. Those things can all be done with a Derived Column transformation in your Data Flow.
For the State and Date columns, set up two new variables called State and Date. Use expressions in the variable definitions to set them to the correct values (like you did in your SQL). When the Path variable updates (in your loop, I assume), the State and Date variables will update as well.
In the Derived Column Transformation, drag the State Variable into the Expression field and create a new column called State.
Repeat for Date.
For the Phone column, create an expression like the following in the Derived Column transformation:
REPLACE( [Phone], ",", "" )
Set the Derived Column field to Replace 'Phone'
For your output, create a destination to your NAT table and link Phone, State, and Date columns in your data flow to the appropriate columns in the NAT table.
If there are additional columns in your input, you can choose not to bring them in from your source, since it appears that you are only acting on the Phone column from the original data.
/edit
