I have a cursor in an Oracle database that fetches thousands of rows in sorted order, but I actually need only the first row (i.e., the oldest one). The loop is designed so that it processes the first row and exits, and the cursor is then opened again to fetch the remaining rows. My question is: if I use 'fetch first 1 rows only' in the cursor, could it really help improve performance?
Basically I want to know which of the two queries below is more efficient in terms of performance:
Query 1:
select a.col1,a.col2,a.col3,a.rowid rid,a.col4
from table1 a, table2 b
where a.status = 'N'
and b.col1 = 1
and b.col2 = a.col5
order by insert_time;
Query 2:
select a.col1,a.col2,a.col3,a.rowid rid,a.col4
from table1 a, table2 b
where a.status = 'N'
and b.col1 = 1
and b.col2 = a.col5
order by insert_time
fetch first 1 rows only;
Letting the database know your "intentions" (e.g. "I only want the first x rows") can be critical to performance. For example, normal sorting operations store the entire result set in memory or on disk in temporary tablespace. But with the FETCH clause Oracle knows it only has to track the top N rows and can use significantly less memory for sorting.
Here's a complete video walkthrough of why, including demos and the impact on response time, memory and performance:
https://www.youtube.com/watch?v=rhOVF82KY7E
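As a concrete (hedged) sketch of what that looks like for the cursor in the question, assuming Oracle 12c or later and the columns shown in Query 2:
-- Cursor limited to the single oldest unprocessed row (12c+ row-limiting clause)
CURSOR c_oldest IS
  SELECT a.col1, a.col2, a.col3, a.rowid rid, a.col4
    FROM table1 a, table2 b
   WHERE a.status = 'N'
     AND b.col1 = 1
     AND b.col2 = a.col5
   ORDER BY insert_time
   FETCH FIRST 1 ROWS ONLY;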
Cursor processing is slow; if you can do the work in plain SQL instead of a cursor, use SQL to process the data.
What happens if you have more than one row to process? Will you still process each row by going through the cursor more than once?
Thanks
Update: I will get the query plan as soon as I can.
We had a poorly performing query that took 4 minutes for a particular organization. After the usual recompiling of the stored proc and updating statistics didn't help, we re-wrote the IF EXISTS(...) as a SELECT COUNT(*)..., and the stored procedure went from 4 minutes to 70 milliseconds. What is the problem with the conditional that makes a 70 ms query take 4 minutes? See the examples below.
These all take 4+ minutes:
if (
SELECT COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)) > 0
print 'records';
-
IF (EXISTS(
SELECT *
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)))
print 'records'
This takes 70 milliseconds:
DECLARE @recordCount INT;
SELECT @recordCount = COUNT(*)
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL);
IF (@recordCount > 0)
print 'records';
It doesn't make sense to me why moving the exact same Count(*) query into an if statement causes such degradation or why 'Exists' is slower than Count. I even tried the exists() in a select CASE WHEN Exists() and it is still 4+ minutes.
Given that my previous answer was mentioned, I'll try to explain again because these things are pretty tricky. So yes, I think you're seeing the same problem as the other question. Namely a row goal issue.
So to try and explain what's causing this I'll start with the three types of joins that are at the disposal of the engine (and pretty much categorically): Loop Joins, Merge Joins, Hash Joins. Loop joins are what they sound like, a nested loop over both sets of data. Merge Joins take two sorted lists and move through them in lock-step. And Hash joins throw everything in the smaller set into a filing cabinet and then look for items in the larger set once the filing cabinet has been filled.
So performance wise, loop joins require pretty much no set up and if you're only looking for a small amount of data they're really optimal. Merge are the best of the best as far as join performance for any data size, but require data to be already sorted (which is rare). Hash Joins require a fair amount of setup but allow large data sets to be joined quickly.
Now we get to your query and the difference between COUNT(*) and EXISTS/TOP 1. So the behavior you're seeing is that the optimizer thinks that rows of this query are really likely (you can confirm this by planning the query without grouping and seeing how many records it thinks it will get in the last step). In particular it probably thinks that for some table in that query, every record in that table will appear in the output.
"Eureka!" it says, "if every row in this table ends up in the output, to find if one exists I can do the really cheap start-up loop join throughout because even though it's slow for large data sets, I only need one row." But then it doesn't find that row. And doesn't find it again. And now it's iterating through a vast set of data using the least efficient means at its disposal for weeding through large sets of data.
By comparison, if you ask for the full count of data, it has to find every record by definition. It sees a vast set of data and picks the choices that are best for iterating through that entire set of data instead of just a tiny sliver of it.
If, on the other hand, it really was correct and the records were very well correlated it would have found your record with the smallest possible amount of server resources and maximized its overall throughput.
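If you want to test that theory, one hedged option on SQL Server 2016 SP1 and later is to keep the "stop after one row" semantics but tell the optimizer not to plan around the row goal, for example:
-- Sketch only: same EXISTS-style check, but the hint removes the
-- "the first matching row will turn up quickly" assumption.
DECLARE @found BIT = 0;

SELECT TOP (1) @found = 1
FROM ObservationOrganism omo
JOIN Observation om ON om.ObservationID = omo.ObservationMicID
JOIN Organism o ON o.OrganismID = omo.OrganismID
JOIN ObservationMicDrug omd ON omd.ObservationOrganismID = omo.ObservationOrganismID
JOIN SIRN srn ON srn.SIRNID = omd.SIRNID
JOIN OrganismDrug od ON od.OrganismDrugID = omd.OrganismDrugID
WHERE
om.StatusCode IN ('F', 'C')
AND o.OrganismGroupID <> -1
AND od.OrganismDrugGroupID <> -1
AND (om.LabType <> 'screen' OR om.LabType IS NULL)
OPTION (USE HINT('DISABLE_OPTIMIZER_ROWGOAL'));

IF (@found = 1)
print 'records';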
I am having a problem fetching a number of records while joining tables. Please see the query below:
SELECT
H.EIN,
H.OUC,
(
SELECT
COUNT(1)
FROM
tbl_Checks C
INNER JOIN INFM_People_OR.dbo.tblHierarchy P
ON P.EIN = C.EIN_Checked
WHERE
(H.EIN IN (P.L1, P.L2)
OR H.EIN = C.EIN_Checked)
AND C.[Read] = 1
) AS [Read]
FROM
INFM_People_OR.dbo.tblHierarchy H
LEFT JOIN tbl_Checks C
ON H.EIN = C.EIN_Checked
WHERE
H.L1 = @EIN
GROUP BY
H.EIN,
H.OUC,
C.Check_Date
Even though it returns just 100 records, this query takes much more time (around 1 minute).
Please suggest a solution to tune this query, as it is causing an error in the front end.
Given just the query there are a few things that stick out as being non-optimal:
Any use of OR will be slower:
WHERE
(H.EIN IN (P.L1, P.L2)
OR H.EIN = C.EIN_Checked)
AND C.[Read] = 1
If there's any way to rework this, based on your data set, so that both the IN and the OR are replaced with ANDs, that would help.
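As a generic illustration of that kind of rewrite (hypothetical table and column names, not your schema), an OR across different columns can often be split into two index-friendly branches:
-- Original form: the OR tends to force a scan
SELECT id FROM orders WHERE customer_id = 42 OR salesperson_id = 42;

-- Equivalent rewrite: each branch can use its own index, UNION removes duplicates
SELECT id FROM orders WHERE customer_id = 42
UNION
SELECT id FROM orders WHERE salesperson_id = 42;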
Also, use of a local variable in the WHERE clause will not work well with the optimizer:
WHERE
H.L1 = @EIN
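One common mitigation, if the extra compilation cost is acceptable, is OPTION (RECOMPILE), which lets the optimizer see the variable's runtime value instead of guessing from average density. A minimal sketch (the value and the INT type of @EIN are assumptions):
DECLARE @EIN INT = 12345;  -- hypothetical value

SELECT H.EIN, H.OUC
FROM INFM_People_OR.dbo.tblHierarchy H
WHERE H.L1 = @EIN
OPTION (RECOMPILE);  -- plan is built using the actual value of @EIN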
Finally, make sure you have indexes (and hopefully these are integer fields) on the columns used in your joins and GROUP BY (H.EIN, H.OUC, C.Check_Date).
The size of the result set (100 records) doesn't matter as much as the size of the joined tables and whether or not they have appropriate indexes.
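For example, something along these lines might help (a sketch only; it assumes these indexes don't already exist and that the columns are reasonably selective):
-- Support the H.L1 = @EIN filter and the grouping columns
CREATE NONCLUSTERED INDEX IX_tblHierarchy_L1
    ON INFM_People_OR.dbo.tblHierarchy (L1) INCLUDE (EIN, OUC);

-- Support the join on EIN_Checked and the [Read] = 1 filter
CREATE NONCLUSTERED INDEX IX_tblChecks_EIN_Checked
    ON tbl_Checks (EIN_Checked, [Read]) INCLUDE (Check_Date);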
The estimated number of rows affected is 1,196,880, which is very high and results in a long execution time for the query. I have also tried joining the tables only once, but that gives different output.
Please suggest a solution other than creating indexes, as I have already created a non-clustered index on tbl_Checks and it doesn't make any difference.
Below is the SQL execution plan.
Condensed Example & Explanation
I want to write a WHERE IN clause that selects from a pre-populated set of numbers
Here's some code. I want to store this set of numbers and select from them so I don't have to repeat the query that generates the set.
ARRAY_OF_NUMBERS = Values from some select statement
-- SHIPMENTS CURSOR
OPEN O_SHIPMENTS_CURSOR FOR
SELECT *
FROM Q194977.AN_SHIPMENT_INFO SI
WHERE INTERNAL_ASN IN (ARRAY_OF_NUMBERS) -- need to populate something
ORDER BY INTERNAL_ASN;
-- ORDER CURSOR
OPEN O_ORDERS_CURSOR FOR
SELECT *
FROM Q194977.AN_ORDER_INFO OI
WHERE INTERNAL_ASN IN (ARRAY_OF_NUMBERS) -- need to populate something
ORDER BY INTERNAL_ASN;
I read something about using an array, but it said it had to be a global array rather than session level. I'm not sure how true this is, and I'm not sure what a global array even is, but I imagine this needs to be session level as it would change with each procedure call. Perhaps I could use a temporary table.
Any ideas on the best way I can accomplish this?
------------- EDIT ------------
(Adding detailed example)
Detailed Example and Explanation
I have 4 tables at 4 different hierarchical levels, and 4 stored procedures. Each procedure contains input criteria to build a selection of data at all 4 levels via criteria for a certain level.
In this example, my caller will input selection criteria that exist at the carton level. Then I will use the INTERNAL_ASN numbers narrowed down by this selection to move up the hierarchical levels and retrieve the ORDERS this carton is on and the SHIPMENTS that ORDER is on, and then down to retrieve the ITEMS in this CARTON.
I noticed that when going up levels I was repeating the same selection, and thought I should somehow store this set of numbers so I didn't rerun the selection each time, but I wasn't sure how.
-- SHIPMENTS CURSOR
OPEN O_SHIPMENTS_CURSOR FOR
SELECT *
FROM Q194977.AN_SHIPMENT_INFO SI
WHERE INTERNAL_ASN IN
(SELECT INTERNAL_ASN
FROM Q194977.AN_CARTON_INFO CI
WHERE (I_BOL IS NULL OR BILL_OF_LADING = I_BOL)
AND ( I_CARTON_NO IS NULL
OR CARTON_NO = I_CARTON_NO)
AND (I_PO_NO = 0 OR PO_NO = I_PO_NO)
AND (I_STORE_NO = 0 OR STORE_NO = I_STORE_NO))
ORDER BY INTERNAL_ASN;
-- ORDER CURSOR
OPEN O_ORDERS_CURSOR FOR
SELECT *
FROM Q194977.AN_ORDER_INFO OI
WHERE INTERNAL_ASN IN
(SELECT INTERNAL_ASN
FROM Q194977.AN_CARTON_INFO CI
WHERE (I_BOL IS NULL OR BILL_OF_LADING = I_BOL)
AND ( I_CARTON_NO IS NULL
OR CARTON_NO = I_CARTON_NO)
AND (I_PO_NO = 0 OR PO_NO = I_PO_NO)
AND (I_STORE_NO = 0 OR STORE_NO = I_STORE_NO))
AND (I_PO_NO = 0 OR PO_NO = I_PO_NO)
ORDER BY INTERNAL_ASN;
-- CARTONS CURSOR
OPEN O_CARTONS_CURSOR FOR
SELECT *
FROM Q194977.AN_CARTON_INFO CI
WHERE (I_BOL IS NULL OR BILL_OF_LADING = I_BOL)
AND (I_CARTON_NO IS NULL OR CARTON_NO = I_CARTON_NO)
AND (I_PO_NO = 0 OR PO_NO = I_PO_NO)
AND (I_STORE_NO = 0 OR STORE_NO = I_STORE_NO)
ORDER BY INTERNAL_ASN;
-- ITEMS CURSOR
OPEN O_ITEMS_CURSOR FOR
SELECT *
FROM Q194977.AN_ITEM_INFO II
WHERE CARTON_NO IN
(SELECT CARTON_NO
FROM Q194977.AN_CARTON_INFO CI
WHERE (I_BOL IS NULL OR BILL_OF_LADING = I_BOL)
AND ( I_CARTON_NO IS NULL
OR CARTON_NO = I_CARTON_NO)
AND (I_PO_NO = 0 OR PO_NO = I_PO_NO)
AND (I_STORE_NO = 0 OR STORE_NO = I_STORE_NO))
ORDER BY INTERNAL_ASN;
Assuming that you mean a collection of numbers (there are three collection types in PL/SQL, one of which is an associative array, but that doesn't sound like what you want here), you could do something like
CREATE OR REPLACE TYPE num_tbl
AS TABLE OF NUMBER;
Then, in your procedure
l_nums num_tbl;
BEGIN
SELECT some_number
BULK COLLECT INTO l_nums
FROM <<your query to get the numbers>>;
<<more code>>
OPEN O_SHIPMENTS_CURSOR FOR
SELECT *
FROM Q194977.AN_SHIPMENT_INFO SI
WHERE INTERNAL_ASN IN (SELECT column_value
FROM TABLE( l_nums ))
ORDER BY INTERNAL_ASN;
That is syntactically valid. Whether it is actually going to be useful to you, however, is a separate question.
Collections are stored in the relatively expensive PGA memory on the database server. If you're storing a couple hundred numbers in a collection, that's probably not a huge concern. If, on the other hand, you're storing tens or hundreds of MB of data and running this in multiple sessions, this one bit of code could easily consume many GB of RAM on the database server, leading to lots of performance issues.
Moving large quantities of data from SQL to PL/SQL and then back to SQL can also be somewhat problematic from a performance standpoint-- it's generally more efficient to leave everything in SQL and let the SQL engine handle it.
If you use a collection in this way, you're preventing the optimizer from considering join orders and query plans that merge the two queries in a more efficient manner. If you are certain that the most efficient plan is one where a small number of internal_asn values are used to probe the an_shipment_info table using an index, that may not be a major concern. If you're not sure about what the best query plan is, and particularly if your actual queries are more complicated than what you posted, however, you might be preventing the optimizer from using the most efficient plan for each query.
What is the problem that you're trying to solve? You talk about not wanting to duplicate code. That would lead me to suspect that you really just want a view that you can reference in your queries rather than repeating the code for a complicated SQL statement. But that presumes that the issue you're trying to solve is one of code elegance which may or may not be accurate.
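If the duplication is the real issue and a plain view doesn't fit (the carton criteria are parameters), another hedged option, along the lines of the temporary table you mentioned, is a global temporary table that is populated once per call and then referenced by all four cursors. A sketch (the table name tmp_selected_asn is made up):
-- One-time DDL; rows are private to each session
CREATE GLOBAL TEMPORARY TABLE tmp_selected_asn (
  internal_asn NUMBER
) ON COMMIT PRESERVE ROWS;

-- Inside the procedure: clear any leftovers, run the carton selection once...
DELETE FROM tmp_selected_asn;

INSERT INTO tmp_selected_asn (internal_asn)
SELECT INTERNAL_ASN
  FROM Q194977.AN_CARTON_INFO CI
 WHERE (I_BOL IS NULL OR BILL_OF_LADING = I_BOL)
   AND (I_CARTON_NO IS NULL OR CARTON_NO = I_CARTON_NO)
   AND (I_PO_NO = 0 OR PO_NO = I_PO_NO)
   AND (I_STORE_NO = 0 OR STORE_NO = I_STORE_NO);

-- ...then reuse it in each cursor
OPEN O_SHIPMENTS_CURSOR FOR
  SELECT *
    FROM Q194977.AN_SHIPMENT_INFO SI
   WHERE INTERNAL_ASN IN (SELECT internal_asn FROM tmp_selected_asn)
   ORDER BY INTERNAL_ASN;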
I am currently running into some performance issues when running a query which joins multiple tables. The main table has 170 million records, so it is pretty big.
What I encounter is that when I run the query with a TOP 1000 clause, the results are instantaneous. However, when I increase that to TOP 8000 the query easily runs for 15 minutes (and then I kill it). Through trial and error I found that the tipping point is with TOP 7934 (works like a charm) and TOP 7935 (runs forever).
Does anyone recognise this behaviour and see what I am doing wrong? Maybe my query is faulty in some respect.
Thanks in advance
SELECT top 7934 h.DocIDBeg
,h.[Updated By]
,h.Action
,h.Type
,h.Details
,h.[Update Date]
,h.[Updated Field Name]
,i.Name AS 'Value Set To'
,COALESCE(i.Name,'') + COALESCE(h.NewValue, '') As 'Value Set To'
,h.OldValue
FROM
(SELECT g.DocIDBeg
,g.[Updated By]
,g.Action
,g.Type
,g.Details
,g.[Update Date]
,CAST(g.details as XML).value('auditElement[1]/field[1]/@name','nvarchar(max)') as 'Updated Field Name'
,CAST(g.details as XML).value('(/auditElement//field/setChoice/node())[1]','nvarchar(max)') as 'value'
,CAST(g.details as XML).value('(/auditElement//field/newValue/node())[1]','nvarchar(max)') as 'NewValue'
,CAST(g.details as XML).value('(/auditElement//field/oldValue/node())[1]','nvarchar(max)') as 'OldValue'
FROM(
SELECT a.ArtifactID
,f.DocIDBeg
,b.FullName AS 'Updated By'
,c.Action
,e.ArtifactType AS 'Type'
,a.Details
,a.TimeStamp AS 'Update Date'
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c
ON a.Action = c.AuditActionID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Artifact] d
ON a.ArtifactID = d.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[ArtifactType] e
ON d.ArtifactTypeID = e.ArtifactTypeID
INNER JOIN [EDDS1015272].[EDDSDBO].[Document] f
ON a.ArtifactID = f.ArtifactID
) g
) h
LEFT JOIN [EDDS1015272].[EDDSDBO].[Code] i
ON h.value = i.ArtifactID
I used to work with data warehouses a lot and encountered similar problems quite often. The root cause is obviously memory usage, as was already mentioned here. I don't think that rewriting your query will help a lot if you really need to query all 170 million records, and I don't think waiting for more memory resources is an option for you.
So here is just a simple workaround from me:
Try to split your query. For example, first query all the data you need from the AuditRecord table joined to the AuditUser table and store the result in another table (a temporary table, for example). Then join this new table with the Artifact table, and so on. Each of these steps will require less memory than running the whole query at once and having it hang. In the long run you will have not a single query but a script, which is easier to track since you can print out status messages along the way, and it will get the job done, unlike the query that never ends.
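A rough sketch of that idea (column list abbreviated; adjust it to the columns you actually need):
-- Step 1: stage the audit rows and user names once
SELECT a.ArtifactID, a.Action, a.Details, a.TimeStamp, b.FullName
INTO #AuditStage
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
    ON a.UserID = b.UserID;

-- Step 2: join the staged rows to the remaining tables
SELECT s.*, c.Action AS ActionName, e.ArtifactType
FROM #AuditStage s
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c ON s.Action = c.AuditActionID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Artifact] d ON s.ArtifactID = d.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[ArtifactType] e ON d.ArtifactTypeID = e.ArtifactTypeID;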
Also make sure that you really need to query all this data at once. If it is an application, you should implement paging; if it is some export functionality, then maybe there is a timeline you can use to batch the data, for example exporting on a daily basis and querying only the data from yesterday. In that case you end up with an incremental export.
"Through trial and error I found that the tipping point is with Top 7934 (works like a charm) and Top 7935 (Runs for ever)"
This sounds very much like a spill. Adam Machanic does a nice demo of the internals of this in the video below. Basically the TOP forces a sort, which requires memory. If the memory grant is not big enough to complete the operation, some of it gets done on disk.
https://www.youtube.com/watch?v=j5YGdIk3DXw
Go to 1:03:50 to see Adam demo a spill. In his query, 668,935 rows do not spill but 668,936 rows do and the query time more than doubles.
Watch the whole session if you have time. Very good for performance tuning!
Could also be the tipping point, as @Remus suggested, but it's all guessing without knowing the actual plan.
I think the subselects are forcing the server to fetch everything before the filter can be applied.
This causes more memory usage (the XML fields) and makes it hard to use a decent query plan.
As to the strange TOP behavior: TOP has a big influence on query plan generation.
It is possible that 7935 is the cutoff point for one optimal plan and that SQL Server will choose a different path when it needs to fetch more.
Or it could come back to the memory issue and run out of memory at 7935.
Update:
I reworked your query to eliminate the nested selects. I'm not saying it's now going to be that much faster, but it eliminates some fields that weren't used, and it should be easier to understand and optimize based on the query plan.
Since we don't know the exact size of each table and we can hardly run the query to test, it's impossible to give you the best answer, but I can offer some tips:
One step would be to check whether you need all the LEFT JOINs and turn them into INNER JOINs where possible, e.g. AuditUser: would an AuditRecord always have a user?
Another thing you could try is to put the data from (preferably) the smaller tables into a temp table and join the bigger tables to that temp table, possibly eliminating a lot of records to join.
If possible you could denormalize a bit and, for example, put the username in AuditRecord too, so you would eliminate the join on AuditUser altogether.
But it all depends on what you need, what you can/are allowed to do, and the data/server.
SELECT top 7934 f.DocIDBeg
,b.FullName AS 'Updated By'
,c.Action
,e.ArtifactType AS 'Type'
,a.Details
,a.TimeStamp AS 'Update Date'
,CAST(a.Details as XML).value('auditElement[1]/field[1]/@name','nvarchar(max)') as 'Updated Field Name'
,i.Name AS 'Value Set To'
,COALESCE(i.Name,'') + COALESCE(CAST(a.Details as XML).value('(/auditElement//field/newValue/node())[1]','nvarchar(max)'), '') As 'Value Set To'
,CAST(a.Details as XML).value('(/auditElement//field/oldValue/node())[1]','nvarchar(max)') as 'OldValue'
FROM [EDDS1015272].[EDDSDBO].[AuditRecord] a
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditUser b
ON a.UserID = b.UserID
LEFT JOIN [EDDS1015272].[EDDSDBO].AuditAction c
ON a.Action = c.AuditActionID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Artifact] d
ON a.ArtifactID = d.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[ArtifactType] e
ON d.ArtifactTypeID = e.ArtifactTypeID
INNER JOIN [EDDS1015272].[EDDSDBO].[Document] f
ON a.ArtifactID = f.ArtifactID
LEFT JOIN [EDDS1015272].[EDDSDBO].[Code] i
ON CAST(a.details as XML).value('(/auditElement//field/setChoice/node())[1]','nvarchar(max)') = i.ArtifactID
I need to have an MS SQL database table and 8 (identical) processes accessing the same table in parallel: each does a SELECT TOP n, processes those n rows, and updates a column of those rows. The problem is that I need to select and process each row just once. This means that if one process gets to the database and selects the top n rows, then when the second process comes it should find those rows locked and select rows n to 2*n, and so on...
Is it possible to put a lock on some rows when you select them, and when someone requests top n rows which are locked to return the next rows, and not to wait for the locked ones? Seems like a long shot, but...
Another thing I was thinking - maybe not so elegant but sounds simple and safe, is to have in the database a counter for the instances which made selects on that table. The first instance that comes will increment the counter and select top n, the next one will increment the counter and select rows from n*(i-1) to n*i, and so on...
Does this sound like a good idea? Do you have any better suggestions? Any thought is highly appreciated!
Thanks for your time.
Here's a sample I blogged about a while ago:
The READPAST hint is what ensures multiple processes don't block each other when polling for records to process. Plus, in this example I have a bit field to physically "lock" a record - could be a datetime if needed.
DECLARE @NextId INTEGER
BEGIN TRANSACTION
-- Find next available item available
SELECT TOP 1 @NextId = ID
FROM QueueTable WITH (UPDLOCK, READPAST)
WHERE IsBeingProcessed = 0
ORDER BY ID ASC
-- If found, flag it to prevent being picked up again
IF (@NextId IS NOT NULL)
BEGIN
UPDATE QueueTable
SET IsBeingProcessed = 1
WHERE ID = @NextId
END
COMMIT TRANSACTION
-- Now return the queue item, if we have one
IF (@NextId IS NOT NULL)
SELECT * FROM QueueTable WHERE ID = @NextId
The simplest method is to use row locking:
BEGIN TRAN
SELECT *
FROM authors
WITH (HOLDLOCK, ROWLOCK)
WHERE au_id = '274-80-9391'
/* Do all your stuff here while the record is locked */
COMMIT TRAN
But if you are accessing your data and then closing the connection, you won't be able to use this method.
How long will you be needing to lock the rows for? The best way might actually be as you say to place a counter on the rows you select (best done using OUTPUT clause within an UPDATE).
The best idea if you want to select records in this manner would be to use a counter in a separate table.
You really don't want to be locking rows on a production database exclusively for any great period of time, therefore I would recommend using a counter. This way only one of your processes would be able to grab that counter number at a time (as it will lock as it is being updated) which will give you the concurrency that you need.
If you need a hand writing the tables and procedures that will do this (simply and safely as you put it!) just ask.
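For what it's worth, here is a hedged sketch of that counter idea (the table and column names are made up; the paging uses OFFSET/FETCH, which needs SQL Server 2012+, otherwise substitute ROW_NUMBER()):
DECLARE @n INT = 100;   -- batch size
DECLARE @batch INT;

-- Atomically claim the next batch number; the UPDATE lock serialises the workers
UPDATE BatchCounter
SET @batch = CounterValue = CounterValue + 1;

-- Read "our" slice of the queue
SELECT *
FROM QueueTable
ORDER BY ID
OFFSET (@batch - 1) * @n ROWS FETCH NEXT @n ROWS ONLY;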
EDIT: ahh, nevermind, you're working in a disconnected style. How about this:
UPDATE TOP (@n) QueueTable SET Locked = 1
OUTPUT INSERTED.Col1, INSERTED.Col2 INTO #this
WHERE Locked = 0
<do your stuff>
Perhaps you are looking for the READPAST hint?
<begin or save transaction>
INSERT INTO #this (Col1, Col2)
SELECT TOP (@n) Col1, Col2
FROM Table1 WITH (ROWLOCK, HOLDLOCK, READPAST)
<do your stuff>
<commit or rollback>