Check what procedure modified data in IBM Netezza

I found that Netezza stores query history in the HISTDB schema. Is it possible to join its tables so that I get a history of which table was modified by which procedure?
The reason is that I have a DataStage job that loads a Netezza table, and its after-SQL command triggers procedures that add another set of data to that same table. I need to have all events documented for data lineage purposes.
My current query returns the procedure's call time. The issue is joining to USER_HISTDB."$hist_table_access_3". The only field that matches is NPSINSTANCEID; LOGENTRYID, OPID and SESSIONID have different values.
That stops me from linking the procedure to the table.
SELECT
b.SUBMITTIME,
b.QUERYTEXT,
b.USERNAME,
b.DBNAME,
b.SCHEMANAME,
a.*
FROM USER_HISTDB."$hist_log_entry_3" a
JOIN USER_HISTDB."$hist_query_prolog_3" b
ON a.LOGENTRYID = b.LOGENTRYID
AND a.SESSIONID = b.SESSIONID
AND a.NPSID = b.NPSID
AND a.NPSINSTANCEID = b.NPSINSTANCEID
WHERE b.QUERYTEXT like '%PROCEDURE_NAME%'

-- By default, information about stored procedures is not logged
-- in the query history database. To enable logging of such ...
set ENABLE_SPROC_HIST_LOGGING = on;
-------------------------------------------------------------------------
-- TABLE -- All Info About All Accesses
-- ====================================
SELECT
QP.submittime,
substr(QP.querytext, 1, 100) as SQL_STATEMENT,
xid, -- the transaction id (which might be either a CREATEXID or DELETEXID)
username,
CASE
when usage = 1 then 'SELECTED'
when usage = 2 then 'INSERTED'
when usage = 3 then 'SELECTED/INSERTED'
when usage = 4 then 'DELETED'
when usage = 5 then 'SELECTED/DELETED'
when usage = 8 then 'UPDATED'
when usage = 9 then 'SELECTED/UPDATED'
when usage = 16 then 'TRUNCATED'
when usage = 32 then 'DROPPED'
when usage = 64 then 'CREATED'
when usage = 128 then 'GENSTATS'
when usage = 256 then 'LOCKED'
when usage = 512 then 'ALTERED'
else 'other'
END AS OPERATION,
TA.dbname,
TA.schemaname,
TA.tablename,
TA.tableid,
PP.planid -- The MAIN query plan (not all table operations involve a query plan)
-- If you want to see EVERYTHING, uncomment the next line.
-- Or pick and choose the columns you want to see.
-- ,*
FROM
---- SESSION information
"$hist_session_prolog_3" SP
left outer join "$hist_session_epilog_3" SE using ( SESSIONID, npsid, npsinstanceid )
---- QUERY information (to include the SQL statement that was issued)
left outer join "$hist_query_prolog_3" QP using ( SESSIONID, npsid, npsinstanceid )
left outer join "$hist_query_epilog_3" QE using ( OPID, npsid, npsinstanceid )
left outer join "$hist_table_access_3" TA using ( OPID, npsid, npsinstanceid )
---- PLAN information
---- Not all queries result in a query plan (for example, TRUNCATE and DROP do not)
---- And some queries might result in multiple query plans (such as a GROOM statement)
---- By including these joins we might get multiple rows (for any given row in the $hist_table_access_3 table)
left outer join "$hist_plan_prolog_3" PP using ( OPID, npsid, npsinstanceid )
left outer join "$hist_plan_epilog_3" PE using ( PLANID, npsid, npsinstanceid )
WHERE
(ISMAINPLAN isnull or ISMAINPLAN = true)
---- So ...
---- If there is NO plan file (as with a truncate) ... then ISMAINPLAN will be null. Include this row.
---- If there is a plan file, include ONLY the record corresponding to the MAIN plan file.
---- (Otherwise, there could end up being a lot of duplicated information).
and TA.tableid > 200000
---- Ignore access information for SYSTEM tables (where the OBJID # < 200000)
----
----Add any other restrictions here (otherwise, this query as written will return a lot of data)
----
ORDER BY 1;
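The values in the CASE expression above look like bit flags (3 = 1 + 2, 5 = 1 + 4, 9 = 1 + 8). Assuming that is how the usage column is encoded, a small helper can label combinations the CASE does not list instead of falling through to 'other'. This is only a sketch: the names USAGE_FLAGS and decode_usage are made up for illustration, and the flag values are copied from the query, not from Netezza documentation.

```python
# Flag values copied from the CASE expression in the query above.
# Treating them as combinable bit flags is an assumption.
USAGE_FLAGS = {
    1: "SELECTED",
    2: "INSERTED",
    4: "DELETED",
    8: "UPDATED",
    16: "TRUNCATED",
    32: "DROPPED",
    64: "CREATED",
    128: "GENSTATS",
    256: "LOCKED",
    512: "ALTERED",
}


def decode_usage(usage: int) -> str:
    """Return a slash-separated label for every known flag set in `usage`."""
    names = [name for bit, name in USAGE_FLAGS.items() if usage & bit]
    return "/".join(names) if names else "other"


if __name__ == "__main__":
    for value in (1, 3, 5, 9, 10, 0):
        print(value, decode_usage(value))
```

For the values the CASE already covers, the helper gives the same labels (for example 3 becomes SELECTED/INSERTED).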

The transaction ID is unique for each execution of a given statement, and it's visible on each record in the table (in hidden columns called CreateXid and DeleteXid). That same ID can be found in the HISTDB tables.
Do you need help with a query against those tables?

Related

Which is more efficient way to extract latest records from table in Snowflake: RANK function or Filter rows with Max updated timestamp using Self Join

Use Case: Real-Time Change Data Capture from Source Table into Snowflake Stream. Then consuming the Stream using a Task to merge (INSERT/UPDATE) change records into the Target Table at regular intervals.
End Result: Target table as exact replica of the source table.
Problem Statement: In a scenario where a record with Primary key field (say "ID" ) has more than one (multiple) updates in the source table, it makes sense to extract only the most recent modified record i.e. record with max(updated timestamp) for each "ID" from the change Stream and execute UPDATE into the target table.
There could be two approaches to extract only the latest records for all distinct "ID"s from the source:
1. Using the RANK window function
select * from (
select *, RANK() OVER(PARTITION BY ID ORDER BY updated_timestamp desc) as rnk
from STREAM
) X
where rnk = 1
2. Using a subquery and self join
select A.*
from STREAM A
join (select ID, max(updated_timestamp) AS max_updated_timestamp
from STREAM B
GROUP BY ID
) B
ON A.ID = B.ID
AND A.updated_timestamp = B.max_updated_timestamp
Which approach, 1 or 2, will be more efficient for frequently updated, large data streams?
I tried both on a sample dataset and observed that the RANK version scans fewer partitions than the self-join version. I would like to understand which one takes less time on a huge dataset.
You said:
In a scenario where a record with Primary key field (say "ID" ) has more than one (multiple) updates in the source table, it makes sense to extract only the most recent modified record
But it doesn't make sense because a standard stream returns the DELTA:
create or replace table table_stream_test (id number, v varchar, z varchar ) as
select * from values (1,'Gokhan','Test'),(2,'Joe','Test');
create or replace stream stream_test on table table_stream_test;
-- two separate updates on the same row!
update table_stream_test set v = 'Jack' where id = 1;
update table_stream_test set z = 'Prod' where id = 1;
select * from stream_test; -- returns 2 rows (1 INSERT + 1 DELETE, the changes are combined)
-- reverting the changes:
update table_stream_test set v = 'Gokhan' where id = 1;
update table_stream_test set z = 'Test' where id = 1;
select * from stream_test; -- returns 0 rows
In short, you don't need to extract the most recent modified record. The stream will return the delta to apply.
In terms of streams and tasks, we use a QUALIFY rank statement, just to have a little less code
SELECT * FROM stream
QUALIFY RANK() OVER(PARTITION BY ID ORDER BY updated_timestamp desc)=1
I haven't tried the MAX style on a stream, so I can't say whether it would be more efficient. Previously, though, I had a view that initially used a window function, and as the data grew it became slower and slower. A MAX-style query turned out to be faster in that case, so we switched to it.
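To check that the RANK and MAX variants from the question really pick the same rows, here is a minimal sketch on a toy table. It uses SQLite through Python as a stand-in for Snowflake (window functions need SQLite 3.25 or newer), and the table name stream_rows and its contents are made up. It only demonstrates that the results match; it says nothing about which one is faster on Snowflake.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE stream_rows (id INTEGER, updated_timestamp TEXT, v TEXT);
    INSERT INTO stream_rows VALUES
        (1, '2024-01-01 10:00:00', 'a'),
        (1, '2024-01-01 11:00:00', 'b'),
        (2, '2024-01-01 09:00:00', 'c'),
        (2, '2024-01-01 09:30:00', 'd'),
        (3, '2024-01-01 08:00:00', 'e');
""")

# Approach 1: rank rows per id, newest first, keep rank 1.
rank_sql = """
    SELECT id, updated_timestamp, v FROM (
        SELECT *, RANK() OVER (PARTITION BY id ORDER BY updated_timestamp DESC) AS rnk
        FROM stream_rows
    ) x
    WHERE rnk = 1
    ORDER BY id
"""

# Approach 2: find the max timestamp per id, then join back to the rows.
max_sql = """
    SELECT a.id, a.updated_timestamp, a.v
    FROM stream_rows a
    JOIN (SELECT id, MAX(updated_timestamp) AS max_ts
          FROM stream_rows
          GROUP BY id) b
      ON a.id = b.id AND a.updated_timestamp = b.max_ts
    ORDER BY a.id
"""

rank_rows = conn.execute(rank_sql).fetchall()
max_rows = conn.execute(max_sql).fetchall()

if __name__ == "__main__":
    print(rank_rows)
    print(rank_rows == max_rows)
```

Both variants return every tied row when two updates share the same timestamp for an ID, so a tie-breaker column would be needed if exactly one row per ID is required.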

SQL delete query takes too long

I use either query 1:
delete dp
from [linkedserver\sqlserver].[test].[dbo].[documentpos] dp
where not exists (
select 1 from document d where d.GUID = dp.documentguid
)
or query 2:
DELETE cqdp
FROM [linkedserver\sqlserver].[test].[dbo].[documentpos] cqdp
left join Document cqd on cqd.GUID = cqdp.DocumentGUID
where cqd.guid is null
Both queries do the same, but take too long. I've canceled the execution after 2 days.
This is the estimated execution plan for both queries:
I also have other queries that use the same linked server, and those don't take this long. But apparently there is a problem with the linked server (the remote scan is 98% of the time). What can I do to reduce the cost of the remote scan?
Try this:
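-- SELECT ... INTO cannot create a table on a linked server, so temp_guids
-- must already exist in [test].[dbo] on the remote side.
INSERT INTO [linkedserver\sqlserver].[test].[dbo].[temp_guids] (GUID)
SELECT DISTINCT GUID
FROM document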
DELETE cqdp
FROM [linkedserver\sqlserver].[test].[dbo].[documentpos] cqdp
left join [linkedserver\sqlserver].[test].[dbo].[temp_guids] cqd on cqd.GUID = cqdp.DocumentGUID
where cqd.guid is null
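The idea behind this answer is to ship the (small) set of keys to the remote side once, so the delete only compares remote tables with each other. Here is a rough sketch of that flow, with two in-memory SQLite databases standing in for the local server and the linked server. The sample rows are made up, and this is not how SQL Server linked servers work internally; it only shows the order of operations.

```python
import sqlite3

local = sqlite3.connect(":memory:")   # stands in for the local server
remote = sqlite3.connect(":memory:")  # stands in for the linked server

local.execute("CREATE TABLE document (guid TEXT PRIMARY KEY)")
local.executemany("INSERT INTO document VALUES (?)", [("g1",), ("g2",)])

remote.execute("CREATE TABLE documentpos (id INTEGER, documentguid TEXT)")
remote.executemany(
    "INSERT INTO documentpos VALUES (?, ?)",
    [(1, "g1"), (2, "g2"), (3, "g3"), (4, "g4")],
)

# Step 1: copy only the distinct keys to a staging table on the remote side.
guids = local.execute("SELECT DISTINCT guid FROM document").fetchall()
remote.execute("CREATE TABLE temp_guids (guid TEXT PRIMARY KEY)")
remote.executemany("INSERT INTO temp_guids VALUES (?)", guids)

# Step 2: the anti-join delete now touches remote tables only.
remote.execute("""
    DELETE FROM documentpos
    WHERE NOT EXISTS (
        SELECT 1 FROM temp_guids t WHERE t.guid = documentpos.documentguid
    )
""")
remote.commit()

remaining = remote.execute("SELECT id FROM documentpos ORDER BY id").fetchall()

if __name__ == "__main__":
    print(remaining)
```

Rows 3 and 4 have no matching document, so only rows 1 and 2 remain.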

How to force the query engine to read from the right partition without the $PARTITION function?

We partitioned a table on an IsActive (bit) flag. Almost 70% of the 500k rows are current, and we wanted to keep old and new rows in 2 different partitions. However, when I return data from the table, I see a difference in execution time and I/O cost between these two queries.
SELECT [columns] FROM TableX
WHERE $PARTITION.partition_function_name(IsActive) = 2 -- partition has active records.
SELECT [columns] FROM TableX
WHERE IsActive = 1 -- Returns active rows
Is there a way to avoid the $PARTITION.partition_function_name(IsActive) = 2 predicate? I assumed that, since the table is partitioned by the IsActive column, the engine would automatically read from the right partition at a lower cost.
Execution plan details (images not preserved): one plan with $PARTITION.partition_function_name(IsActive) = 2, one with IsActive = 1.

SQL join conditional either or not both?

I have 3 tables that I'm joining and 2 variables that I'm using in one of the joins.
What I'm trying to do is figure out how to join based on either of the statements but not both.
Here's the current query:
SELECT DISTINCT
WR.Id,
CAL.Id as 'CalendarId',
T.[First Of Month],
T.[Last of Month],
WR.Supervisor,
WR.cd_Manager as [Manager], --Added to search by the Manager--
WR.[Shift] as 'ShiftId'
INTO #Workers
FROM #T T
--Calendar
RIGHT JOIN [dbo].[Calendar] CAL
ON CAL.StartDate <= T.[Last of Month]
AND CAL.EndDate >= T.[First of Month]
--Workers
--This is the problem join
RIGHT JOIN [dbo].[Worker_Filtered] WR
ON WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN(@Supervisors))
or (WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN(@Supervisors))
AND WR.cd_Manager IN(SELECT Id FROM [dbo].[User] WHERE FullName IN(@Manager))) --Added to search by the Manager--
AND WR.[Type] = '333E7907-EB80-4021-8CDB-5380F0EC89FF' --internal
WHERE CAL.Id = WR.Calendar
AND WR.[Shift] IS NOT NULL
What I want to do is either have the result based on the Worker_Filtered table matching the @Supervisor or (but not both) have it matching both the @Supervisor and @Manager.
The way it is now if it matches either condition it will be returned. This should be limiting the returned results to Workers that have both the Supervisor and Manager which would be a smaller data set than if they only match the Supervisor.
UPDATE
The query that I have above is part of a greater whole that pulls data for a supervisor's workers.
I want to also limit it to managers that are under a particular supervisor.
For example, if @Supervisor = John Doe and @Manager = Jane Doe and John has 9 workers, 8 of whom are under Jane's management, then I would expect the end result to show only 8 workers for each month. With the current query, it still shows all 9 for each month.
If I change part of the RIGHT JOIN to:
WR.Supervisor IN (SELECT Id FROM [dbo].[User] WHERE FullName IN (@Supervisors))
AND WR.cd_Manager IN(SELECT Id FROM [dbo].[User] WHERE FullName IN(@Manager))
Then it just returns 12 rows of NULL.
UPDATE 2
Sorry, this has taken so long to get a sample up. I could not get SQL Fiddle to work for SQL Server 2008/2014 so I am using rextester instead:
Sample
This shows the results as 108 lines. But what I want to show is just the first 96 lines.
UPDATE 3
I have made a slight update to the Sample. This does get the results that I want. I can set @Manager to NULL and it will pull all 108 records, or I can have the correct Manager name in there and it'll only pull those that match both Supervisor and Manager.
However, I'm doing this with an IF ELSE and I was hoping to avoid doing that as it duplicates code for the insert into the Worker table.
The description of expected results in update 3 makes it all clear now, thanks. Your 'problem' join needs to be:
RIGHT JOIN Worker_Filtered wr on (wr.Supervisor in(@Supervisors)
and case when @Manager is null then 1
else case when wr.Manager in(@Manager) then 1 else 0 end
end = 1)
By the way, I don't know what you are expecting the in(@Supervisors) to achieve, but if you're hoping to supply a comma separated list of supervisors as a single string and have wr.Supervisor match any one of them then you're going to be disappointed. This query works exactly the same if you have = @Supervisors instead.
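The nested CASE above is one way to write an optional filter. A shorter form that should behave the same is (manager parameter IS NULL OR manager = parameter). Here is a minimal sketch of that pattern, using SQLite through Python rather than SQL Server. The worker table, its rows and the count_workers helper are made up to mirror the 9-workers example from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE worker (id INTEGER, supervisor TEXT, manager TEXT)")

# John Doe supervises 9 workers; 8 of them are managed by Jane Doe.
rows = [(i, "John Doe", "Jane Doe") for i in range(1, 9)]
rows.append((9, "John Doe", "Someone Else"))
conn.executemany("INSERT INTO worker VALUES (?, ?, ?)", rows)


def count_workers(supervisor, manager=None):
    """Count a supervisor's workers, optionally restricted to one manager."""
    sql = """
        SELECT COUNT(*) FROM worker
        WHERE supervisor = :sup
          AND (:mgr IS NULL OR manager = :mgr)  -- filter is skipped when :mgr is NULL
    """
    return conn.execute(sql, {"sup": supervisor, "mgr": manager}).fetchone()[0]


if __name__ == "__main__":
    print(count_workers("John Doe", "Jane Doe"))  # supervisor and manager
    print(count_workers("John Doe"))              # supervisor only
```

With the manager supplied, only the 8 workers under Jane are counted; with it left as NULL, all 9 come back, so no IF ELSE branch is needed.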

Computed column expression

I have a specific need for a computed column called ProductCode
ProductId | SellerId | ProductCode
1 1 000001
2 1 000002
3 2 000001
4 1 000003
ProductId is identity, increments by 1.
SellerId is a foreign key.
So my computed column ProductCode must count how many products the seller has and be in the format 000000. The problem is how to know which seller's products to count.
I've written T-SQL that doesn't take into account how many products a seller has:
ALTER TABLE dbo.Product
ADD ProductCode AS RIGHT('000000' + CAST(ProductId AS VARCHAR(6)) , 6) PERSISTED
You cannot have a computed column based on data outside of the current row that is being updated. The best you can do to make this automatic is to create an after-trigger that queries the entire table to find the next value for the product code. But in order to make this work you'd have to use an exclusive table lock, which will utterly destroy concurrency, so it's not a good idea.
I also don't recommend using a view, because it would have to calculate the ProductCode every time you read the table. This would be a huge performance-killer as well. And because the value is not saved in the database, never to be touched again, your product codes would be subject to spurious changes (as in the case of perhaps deleting an erroneously-entered and never-used product).
Here's what I recommend instead. Create a new table:
dbo.SellerProductCode
SellerID LastProductCode
-------- ---------------
1 3
2 1
This table reliably records the last-used product code for each seller. On INSERT to your Product table, a trigger will update the LastProductCode in this table appropriately for all affected SellerIDs, and then update all the newly-inserted rows in the Product table with appropriate values. It might look something like the below.
See this trigger working in a Sql Fiddle
CREATE TRIGGER TR_Product_I ON dbo.Product FOR INSERT
AS
SET NOCOUNT ON;
SET XACT_ABORT ON;
DECLARE @LastProductCode TABLE (
SellerID int NOT NULL PRIMARY KEY CLUSTERED,
LastProductCode int NOT NULL
);
WITH ItemCounts AS (
SELECT
I.SellerID,
ItemCount = Count(*)
FROM
Inserted I
GROUP BY
I.SellerID
)
MERGE dbo.SellerProductCode C
USING ItemCounts I
ON C.SellerID = I.SellerID
WHEN NOT MATCHED BY TARGET THEN
INSERT (SellerID, LastProductCode)
VALUES (I.SellerID, I.ItemCount)
WHEN MATCHED THEN
UPDATE SET C.LastProductCode = C.LastProductCode + I.ItemCount
OUTPUT
Inserted.SellerID,
Inserted.LastProductCode
INTO @LastProductCode;
WITH P AS (
SELECT
NewProductCode =
L.LastProductCode + 1
- Row_Number() OVER (PARTITION BY I.SellerID ORDER BY P.ProductID DESC),
P.*
FROM
Inserted I
INNER JOIN dbo.Product P
ON I.ProductID = P.ProductID
INNER JOIN @LastProductCode L
ON P.SellerID = L.SellerID
ON P.SellerID = L.SellerID
)
UPDATE P
SET P.ProductCode = Right('00000' + Convert(varchar(6), P.NewProductCode), 6);
Note that this trigger works even if multiple rows are inserted. There is no need to preload the SellerProductCode table, either--new sellers will automatically be added. This will handle concurrency with few problems. If concurrency problems are encountered, proper locking hints can be added without deleterious effect as the table will remain very small and ROWLOCK can be used (except for the INSERT which will require a range lock).
Please do see the Sql Fiddle for working, tested code demonstrating the technique. Now you have real product codes that have no reason to ever change and will be reliable.
I would normally recommend using a view to do this type of calculation. The view could even be indexed if select performance is the most important factor (I see you're using persisted).
You cannot have a subquery in a computed column, which essentially means that you can only access the data in the current row. The only ways to get this count would be to use a user-defined function in your computed column, or triggers to update a non-computed column.
A view might look like the following:
create view ProductCodes as
select p.ProductId, p.SellerId,
(
select right('000000' + cast(count(*) as varchar(6)), 6)
from Product
where SellerID = p.SellerID
and ProductID <= p.ProductID
) as ProductCode
from Product p
One big caveat to your product numbering scheme, and a downfall for both the view and UDF options, is that we're relying upon a count of rows with a lower ProductId. This means that if a Product is inserted in the middle of the sequence, it would actually change the ProductCodes of existing Products with a higher ProductId. At that point, you must either:
Guarantee the sequencing of ProductId (identity alone does not do this)
Rely upon a different column that has a guaranteed sequence (still dubious, but maybe CreateDate?)
Use a trigger to get a count at insert which is then never changed.
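To make the caveat concrete, here is a small sketch of the count-based view, using SQLite through Python instead of SQL Server (printf stands in for the right('000000' + ...) padding, and the snake_case names are made up). It shows the renumbering for a delete, which is the same effect as a row appearing in the middle of the sequence: codes of later products for that seller shift.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product (product_id INTEGER PRIMARY KEY, seller_id INTEGER);
    INSERT INTO product VALUES (1, 1), (2, 1), (3, 2), (4, 1);

    -- The code is the count of the seller's products up to and including this one.
    CREATE VIEW product_codes AS
    SELECT p.product_id, p.seller_id,
           (SELECT printf('%06d', COUNT(*))
            FROM product q
            WHERE q.seller_id = p.seller_id
              AND q.product_id <= p.product_id) AS product_code
    FROM product p;
""")


def codes():
    """Return {product_id: product_code} as currently computed by the view."""
    return dict(conn.execute(
        "SELECT product_id, product_code FROM product_codes ORDER BY product_id"
    ))


before = codes()
conn.execute("DELETE FROM product WHERE product_id = 2")
after = codes()

if __name__ == "__main__":
    print(before)
    print(after)
```

Before the delete, product 4 has code 000003; afterwards the view reports 000002 for it, which is exactly the kind of silent change a stored, never-recomputed code avoids.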
