I use either query 1:
delete dp
from [linkedserver\sqlserver].[test].[dbo].[documentpos] dp
where not exists (
select 1 from document d where d.GUID = dp.documentguid
)
or query 2:
DELETE cqdp
FROM [linkedserver\sqlserver].[test].[dbo].[documentpos] cqdp
left join Document cqd on cqd.GUID = cqdp.DocumentGUID
where cqd.guid is null
Both queries do the same thing, but they take too long. I canceled the execution after 2 days.
This is the estimated execution plan for both queries:
I also have other queries that use the same linked server, and those don't take this long. But apparently there is a problem with the linked server (the remote scan takes 98% of the time). What can I do to reduce the cost of the remote scan?
Try this:
-- SELECT ... INTO cannot create a table through a four-part linked
-- server name, so create temp_guids on the remote server first
-- (assuming GUID is a uniqueidentifier):
EXEC ('CREATE TABLE test.dbo.temp_guids (GUID uniqueidentifier PRIMARY KEY)') AT [linkedserver\sqlserver];

INSERT INTO [linkedserver\sqlserver].[test].[dbo].[temp_guids] (GUID)
SELECT DISTINCT GUID
FROM document

DELETE cqdp
FROM [linkedserver\sqlserver].[test].[dbo].[documentpos] cqdp
LEFT JOIN [linkedserver\sqlserver].[test].[dbo].[temp_guids] cqd ON cqd.GUID = cqdp.DocumentGUID
WHERE cqd.GUID IS NULL
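If the DELETE still drives the work from the local server, another option is to run the whole statement on the remote instance with EXEC ... AT, so that no documentpos rows cross the linked-server boundary at all. This is a sketch; it assumes RPC Out is enabled for the linked server and that the temp_guids table mentioned above has already been created and filled on the remote side:

```sql
-- Executes entirely on the remote server; only the GUID list
-- (already in temp_guids) ever travels over the link.
EXEC ('
    DELETE dp
    FROM test.dbo.documentpos dp
    LEFT JOIN test.dbo.temp_guids tg
        ON tg.GUID = dp.DocumentGUID
    WHERE tg.GUID IS NULL;
') AT [linkedserver\sqlserver];
```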
I have 13 rows in a json.gz file. I am running this MERGE statement.
MERGE INTO order_lines
USING (
SELECT
$1:tenant_id AS tenant_id,
$1:data:id AS id,
$1:data AS data,
$1:data_hash AS data_hash
FROM @s3_some_stage/dump/order_lines/2022-02-13_21-24-20_518.json.gz
) AS new_batch
ON
order_lines.tenant_id = new_batch.tenant_id
AND order_lines.id = new_batch.id
WHEN MATCHED AND order_lines.data_hash != new_batch.data_hash THEN
UPDATE SET
id = new_batch.id,
data = new_batch.data,
data_hash = new_batch.data_hash
WHEN NOT MATCHED THEN
INSERT (tenant_id, id, data, data_hash)
VALUES (new_batch.tenant_id, new_batch.id, new_batch.data, new_batch.data_hash);
It takes 15 seconds to run. When I initially ran it, 3 rows were updated and it took 15 seconds. When I ran it again, no rows changed, but it still took 15 seconds on an S (small) warehouse. order_lines has 9.3M rows.
[{"number of rows inserted":0,"number of rows updated":0}]
SELECT
$1:tenant_id AS tenant_id,
$1:data:id AS id,
$1:data AS data,
$1:data_hash AS data_hash
FROM @s3_some_stage/dump/order_lines/2022-02-13_21-24-20_518.json.gz
Takes 600ms to run and has 13 rows. Pretty small file.
Going to the query profiler, it does show an execution time of 15 seconds, but looking at the nodes, the most expensive node is 129ms. Snowflake spent 14s in "Processing"; what does that mean?
The merge statement doesn't update any rows since the data_hash values are the same. So the MERGE statement is a no-op, and I'd expect it to be very fast.
If I do a join between the staged file and the actual table, the filter returns in 400ms (13 rows). So why is the MERGE so slow?
WITH tmp as (
SELECT $1:tenant_id as tenant_id, $1:data:id::varchar AS id
from @s3_some_stage/dump/order_lines/2022-02-13_21-24-20_518.json.gz
)
select order_lines.id
from order_lines
right join tmp on
order_lines.tenant_id = tmp.tenant_id and order_lines.id = tmp.id;
MERGE is one of the most expensive operations in any database, if not the most expensive. That is compounded in this case by reading directly from a file rather than from the database.
Logically, every row must be examined and compared, although partition pruning eliminates some of this.
Also, I suggest you try loading the file into a Snowflake table and then running the merge; a larger warehouse may help as well.
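As a sketch of that suggestion (the staging table name and the ::varchar casts are assumptions, not from the original post), you could materialize the file into a table once and merge from that table instead of from the stage:

```sql
-- Load the staged file into a transient table once ...
CREATE OR REPLACE TRANSIENT TABLE order_lines_stage AS
SELECT
    $1:tenant_id::varchar AS tenant_id,
    $1:data:id::varchar   AS id,
    $1:data               AS data,
    $1:data_hash::varchar AS data_hash
FROM @s3_some_stage/dump/order_lines/2022-02-13_21-24-20_518.json.gz;

-- ... then merge from the table, not the stage.
MERGE INTO order_lines
USING order_lines_stage AS new_batch
ON  order_lines.tenant_id = new_batch.tenant_id
AND order_lines.id        = new_batch.id
WHEN MATCHED AND order_lines.data_hash != new_batch.data_hash THEN
    UPDATE SET data = new_batch.data, data_hash = new_batch.data_hash
WHEN NOT MATCHED THEN
    INSERT (tenant_id, id, data, data_hash)
    VALUES (new_batch.tenant_id, new_batch.id, new_batch.data, new_batch.data_hash);
```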
I found that Netezza stores query history data in the HISTDB database. Is it possible to join those tables so I can get a history of which table was modified by which procedure?
The reason is that I have a DataStage job that loads a Netezza table, and an after-SQL command then triggers procedures that add another set of data to that same table. I need all of these events documented for data lineage purposes.
The current query I have returns the procedure's call time. The issue is with joining to USER_HISTDB."$hist_table_access_3": the only field that matches is NPSINSTANCEID; LOGENTRYID, OPID and SESSIONID have different values.
That stops me from linking a procedure to a table.
SELECT
b.SUBMITTIME,
b.QUERYTEXT,
b.USERNAME,
b.DBNAME,
b.SCHEMANAME,
a.*
FROM USER_HISTDB."$hist_log_entry_3" a
JOIN USER_HISTDB."$hist_query_prolog_3" b
ON a.LOGENTRYID = b.LOGENTRYID
AND a.SESSIONID = b.SESSIONID
AND a.NPSID = b.NPSID
AND a.NPSINSTANCEID = b.NPSINSTANCEID
WHERE b.QUERYTEXT like '%PROCEDURE_NAME%'
-- By default, information about stored procedures is not logged
-- in the query history database. To enable logging of such ...
set ENABLE_SPROC_HIST_LOGGING = on;
-------------------------------------------------------------------------
-- TABLE -- All Info About All Accesses
-- ====================================
SELECT
QP.submittime,
substr(QP.querytext, 1, 100) as SQL_STATEMENT,
xid, -- the transaction id (which might be either a CREATEXID or DELETEXID)
username,
CASE
when usage = 1 then 'SELECTED'
when usage = 2 then 'INSERTED'
when usage = 3 then 'SELECTED/INSERTED'
when usage = 4 then 'DELETED'
when usage = 5 then 'SELECTED/DELETED'
when usage = 8 then 'UPDATED'
when usage = 9 then 'SELECTED/UPDATED'
when usage = 16 then 'TRUNCATED'
when usage = 32 then 'DROPPED'
when usage = 64 then 'CREATED'
when usage = 128 then 'GENSTATS'
when usage = 256 then 'LOCKED'
when usage = 512 then 'ALTERED'
else 'other'
END AS OPERATION,
TA.dbname,
TA.schemaname,
TA.tablename,
TA.tableid,
PP.planid -- The MAIN query plan (not all table operations involve a query plan)
-- If you want to see EVERYTHING, uncomment the next line.
-- Or pick and choose the columns you want to see.
-- ,*
FROM
---- SESSION information
"$hist_session_prolog_3" SP
left outer join "$hist_session_epilog_3" SE using ( SESSIONID, npsid, npsinstanceid )
---- QUERY information (to include the SQL statement that was issued)
left outer join "$hist_query_prolog_3" QP using ( SESSIONID, npsid, npsinstanceid )
left outer join "$hist_query_epilog_3" QE using ( OPID, npsid, npsinstanceid )
left outer join "$hist_table_access_3" TA using ( OPID, npsid, npsinstanceid )
---- PLAN information
---- Not all queries result in a query plan (for example, TRUNCATE and DROP do not)
---- And some queries might result in multiple query plans (such as a GROOM statement)
---- By including these joins we might get multiple rows (for any given row in the $hist_table_access_3 table)
left outer join "$hist_plan_prolog_3" PP using ( OPID, npsid, npsinstanceid )
left outer join "$hist_plan_epilog_3" PE using ( PLANID, npsid, npsinstanceid )
WHERE
(ISMAINPLAN isnull or ISMAINPLAN = true)
---- So ...
---- If there is NO plan file (as with a truncate) ... then ISMAINPLAN will be null. Include this row.
---- If there is a plan file, include ONLY the record corresponding to the MAIN plan file.
---- (Otherwise, there could end up being a lot of duplicated information).
and TA.tableid > 200000
---- Ignore access information for SYSTEM tables (where the OBJID # < 200000)
----
----Add any other restrictions here (otherwise, this query as written will return a lot of data)
----
ORDER BY 1;
The transaction ID is unique for each execution of a given statement, and it's visible on the record in the table (hidden columns called CreateXid and DeleteXid). That same ID can be found in the HISTDB tables.
Do you need help with a query against those tables?
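For example, to tie a specific row back to the statement that created it, something along these lines should work. This is only a sketch: your_table and its predicate are placeholders, and the exact history table/column that exposes the transaction id (selected as xid in the query above) can vary with the history schema version, so adjust accordingly:

```sql
-- CREATEXID is a hidden column on every Netezza row; it can be
-- selected explicitly even though it is not in the table DDL.
SELECT t.CREATEXID, t.*
FROM your_table t          -- placeholder table name
WHERE t.id = 42;           -- placeholder predicate

-- Then look that transaction id up in the history database:
SELECT qp.SUBMITTIME, qp.QUERYTEXT
FROM USER_HISTDB."$hist_query_prolog_3" qp
WHERE qp.XID = 123456;     -- the CREATEXID value from above
```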
I am having a problem fetching a count of records while joining tables. Please see the query below:
SELECT
H.EIN,
H.OUC,
(
SELECT
COUNT(1)
FROM
tbl_Checks C
INNER JOIN INFM_People_OR.dbo.tblHierarchy P
ON P.EIN = C.EIN_Checked
WHERE
(H.EIN IN (P.L1, P.L2)
OR H.EIN = C.EIN_Checked)
AND C.[Read] = 1
) AS [Read]
FROM
INFM_People_OR.dbo.tblHierarchy H
LEFT JOIN tbl_Checks C
ON H.EIN = C.EIN_Checked
WHERE
H.L1 = @EIN
GROUP BY
H.EIN,
H.OUC,
C.Check_Date
Even though there are just 100 records, this query takes much more time (around 1 minute).
Please suggest a way to tune this query, as it is throwing an error in the front end.
Given just the query there are a few things that stick out as being non-optimal:
Any use of OR will be slower:
WHERE
(H.EIN IN (P.L1, P.L2)
OR H.EIN = C.EIN_Checked)
AND C.[Read] = 1
If there's any way to rework this, based on your data set, so that both the IN and the OR are replaced with ANDs, that would help.
Also, use of a local variable in the WHERE clause will not work well with the optimizer:
WHERE
H.L1 = @EIN
Finally, make sure you have indexes (and hopefully these are integer fields) on the columns where you are doing your joins and group bys (H.EIN, H.OUC, C.Check_Date).
The size of the result set (100 records) doesn't matter as much as the size of the joined tables and whether or not they have appropriate indexes.
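For instance, the OR in the correlated subquery could be split into two AND-only counts that are added together, so each branch can seek on an index. This is a sketch based on the query in the question; the second branch's NOT EXISTS is my attempt to avoid double counting, so verify it returns the same counts on your data:

```sql
SELECT
    H.EIN,
    H.OUC,
    (   -- branch 1: checks visible through the hierarchy
        SELECT COUNT(1)
        FROM tbl_Checks C
        INNER JOIN INFM_People_OR.dbo.tblHierarchy P
            ON P.EIN = C.EIN_Checked
        WHERE H.EIN IN (P.L1, P.L2)
          AND C.[Read] = 1
    )
  + (   -- branch 2: the user's own checks, not already counted above
        SELECT COUNT(1)
        FROM tbl_Checks C
        WHERE C.EIN_Checked = H.EIN
          AND C.[Read] = 1
          AND NOT EXISTS (
              SELECT 1
              FROM INFM_People_OR.dbo.tblHierarchy P
              WHERE P.EIN = C.EIN_Checked
                AND H.EIN IN (P.L1, P.L2)
          )
    ) AS [Read]
FROM INFM_People_OR.dbo.tblHierarchy H
WHERE H.L1 = @EIN;
```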
The estimated number of rows affected is 1,196,880, which is very high and results in a high execution time for the query. I have also tried joining the tables only once, but that gives different output.
Please suggest a solution other than creating indexes, as I have already created a non-clustered index on tbl_Checks and it doesn't make any difference.
Below is the SQL execution plan.
I use stored procedures:
In my WHERE clause, I use short circuits (ORs) to speed up execution, as the Query Optimiser knows that most of my inputs default to Null. This keeps my query flexible and fast.
I have added a Table Valued Parameter to the WHERE clause. The execution time for a report has risen from 150ms to 450ms, reads from 70,000 to 200,000.
...
WHERE
--Integer value parameters
AND ((@hID is Null) OR (h.ID = @hID))
AND ((@dID is Null) OR (d.ID = @dID))
AND ((@mID is NULL) OR (m.ID = @mID))
--New table value parameter
--Execution, processing time and reads increased.
--No additional JOIN added.
AND (NOT EXISTS (SELECT Null FROM @rIDs) OR r.ID IN (SELECT r FROM @rIDs))
How can I short circuit the NOT EXISTS, or otherwise speed up this query? I have tried adding a BIT value and checking whether rows exist in the Table Valued Parameter before executing the query. The only way I have found is having two queries and executing one or the other. That's not great if I have to modify a whole bunch of queries or add multiple Table Valued Parameters to the mix.
Thanks in advance.
EDIT:
A comparison of table value parameter:
AND (NOT EXISTS (SELECT Null FROM @rIDs) OR r.ID IN (SELECT r FROM @rIDs))
and integer parameter:
AND ((@rID is Null) OR (r.ID = @rID))
showed similar execution speed after compilation, with the TVP at 0 rows and the integer parameter null. I assume the Query Optimiser is short circuiting in the correct manner and my previous comparison was incorrect. The execution plan splits the above cost at 55% vs 45%, which is acceptable. Although the split doesn't change when there are more rows in the TVP, the time to generate the report increases because more pages have to be read from disk. Interesting.
if exists (select * from #rIDs)
begin
.... -- query with TVP
end
else
begin
.... -- query without TVP
end
This allows a separate execution plan for each query.
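Alternatively, you can keep a single query and append OPTION (RECOMPILE). This is a sketch (column and parameter names taken from the question): it trades a compile on every execution for a plan built with the actual parameter values, including whether the TVP is empty, so the dead branches of the catch-all predicates are pruned at compile time:

```sql
SELECT ...
FROM ...
WHERE ((@hID IS NULL) OR (h.ID = @hID))
  AND ((@dID IS NULL) OR (d.ID = @dID))
  AND (NOT EXISTS (SELECT 1 FROM @rIDs) OR r.ID IN (SELECT r FROM @rIDs))
OPTION (RECOMPILE);
```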
It looks like you are using a table variable. If you use a temporary table and index the column you are using for your criteria (r in your example), you will avoid a table scan. This makes it a multiple-step process, but the payoff can be huge.
To be more specific to your question, you can change the last line of your example to be
AND EXISTS (SELECT 1 FROM @rIDs t WHERE t.r = r.ID AND t.r IS NOT NULL)
If you could post the execution plan, I could give you a much better answer. Click the Display Estimated Execution Plan, right click the execution plan and select Save Execution Plan As...
You could try a LEFT JOIN between your table to be queried (on the left) and your TVP.
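A sketch of that idea (aliases follow the question; untested against the actual schema): join the TVP once, then keep a row either when the TVP is empty or when the join found a match.

```sql
SELECT ...
FROM ... r
LEFT JOIN @rIDs t ON t.r = r.ID
WHERE (NOT EXISTS (SELECT 1 FROM @rIDs) OR t.r IS NOT NULL)
```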
During query optimization I encountered some strange behaviour of SQL Server (SQL Server 2008 R2 Enterprise). I created several indexes on tables, as well as some indexed views. I have two queries, for example:
select top 10 N0."Oid",N1."ObjectType",N1."OptimisticLockField" from ((("dbo"."Issue" N0
inner join "dbo"."Article" N1 on (N0."Oid" = N1."Oid"))
inner join "dbo"."ProductLink" N2 on (N1."ProductLink" = N2."Oid"))
inner join "dbo"."Technology" N3 on (N2."Technology" = N3."Oid"))
where (N1."GCRecord" is null and (N0."IsPrivate" = 0) and ((N0."HasMarkedAnswers" = 0) or N0."HasMarkedAnswers" is null) and (N3."Name" = N'Discussions'))
order by N1."ModifiedOn" desc
and
select top 30 N0."Oid",N1."ObjectType",N1."OptimisticLockField" from ((("dbo"."Issue" N0
inner join "dbo"."Article" N1 on (N0."Oid" = N1."Oid"))
inner join "dbo"."ProductLink" N2 on (N1."ProductLink" = N2."Oid"))
inner join "dbo"."Technology" N3 on (N2."Technology" = N3."Oid"))
where (N1."GCRecord" is null and (N0."IsPrivate" = 0) and ((N0."HasMarkedAnswers" = 0) or N0."HasMarkedAnswers" is null) and (N3."Name" = N'Discussions'))
order by N1."ModifiedOn" desc
Both queries are the same, except that the first starts with select top 10 and the second with select top 30. Both queries return the same result set - 6 rows. But the second query is 5 times faster than the first one! I looked at the actual execution plans for both queries and, of course, they differ. The second query uses the indexed view and performs great, while the first query refuses to use it, using indexes on the base tables instead. I repeat: both queries are the same, against the same tables, on the same server; they differ only by the number in the "top" clause.
I tried to force the optimizer to use the indexed view in the first query by updating statistics, destroying the indexes it used, and so on. No matter what I try, the actual execution plan does not use the indexed view for the first query and always uses it for the second one.
I am really interested in the reasons causing such behavior. Any suggestions?
Update: I am not sure this helps without describing the corresponding indexes and view, but these are the actual execution plan diagrams:
for select top 19:
for select top 18:
Another confusing fact is that for the select top 19 query the indexed view is sometimes used and sometimes not.
The only thing I can think of is that perhaps the optimizer in the first query concluded that the specified criteria are not selective enough for the "better" execution plan to be used.
If you are still investigating this, see whether TOP 60, 90, 100, ... produces the second execution plan and performs well. You could also tinker with it to find the threshold at which the optimizer selects the second plan in this case.
Also try the queries without the ORDER BY clause to see if that is affecting the selection of the query plan (check the index on that field, etc.).
Beyond that, you said you can't use index hints, so perhaps a rewrite where you select top X from your Article table (N1) with a bunch of EXISTS statements in your WHERE clause would provide better performance.
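A sketch of that rewrite (untested against this schema; it drives from Article and pushes the other tables into EXISTS, relying on Issue.Oid = Article.Oid as in the original joins):

```sql
SELECT TOP 10 N1."Oid", N1."ObjectType", N1."OptimisticLockField"
FROM "dbo"."Article" N1
WHERE N1."GCRecord" IS NULL
  AND EXISTS (SELECT 1
              FROM "dbo"."Issue" N0
              WHERE N0."Oid" = N1."Oid"
                AND N0."IsPrivate" = 0
                AND (N0."HasMarkedAnswers" = 0 OR N0."HasMarkedAnswers" IS NULL))
  AND EXISTS (SELECT 1
              FROM "dbo"."ProductLink" N2
              INNER JOIN "dbo"."Technology" N3 ON N2."Technology" = N3."Oid"
              WHERE N2."Oid" = N1."ProductLink"
                AND N3."Name" = N'Discussions')
ORDER BY N1."ModifiedOn" DESC;
```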