Load time variance with .CacheSize/.PageSize in ADODB.Recordset - arrays

I am working on a project for a client using a classic ASP application I am very familiar with, but in his environment it is performing more slowly than I have ever seen in a wide variety of other environments. I'm attacking the sluggishness from several angles, but it has also led me to look at something I've never had to examine before -- so this is more of an "academic" question.
I am curious to understand why a category page with, say, 1800 product records takes roughly three times as long to load as a category page with, say, 54 records, when both are set to display 50 products per page. That is, when the number of items to loop through is the same, why does the total number of records in the result set make a difference in the time it takes to display a constant number of products?
Here are the methods used, abstracted to the essential aspects:
SELECT {tableA.fields} FROM tableA, tableB WHERE tableA.key = tableB.key AND {other refining criteria};
set rs=Server.CreateObject("ADODB.Recordset")
rs.CacheSize=iPageSize
rs.PageSize=iPageSize
pcv_strPageSize=iPageSize
rs.Open query, connObj, adOpenStatic, adLockReadOnly, adCmdText
dim iPageCount, pcv_intProductCount
iPageCount=rs.PageCount
If Cint(iPageCurrent) > Cint(iPageCount) Then iPageCurrent=Cint(iPageCount)
If Cint(iPageCurrent) < 1 Then iPageCurrent=1
if NOT rs.eof then
rs.AbsolutePage=Cint(iPageCurrent)
pcArray_Products = rs.getRows()
pcv_intProductCount = UBound(pcArray_Products,2)+1
end if
set rs = nothing
tCnt=Cint(0)
do while (tCnt < pcv_intProductCount) and (tCnt < pcv_strPageSize)
{display stuff}
tCnt=tCnt + 1
loop
The recordset is converted to an array via GetRows() and then destroyed; the number of records displayed will always be iPageSize or less.
Here's the big question:
Why, on the initial page load, does the larger record set (~1800 records) take significantly longer to loop through the page of 50 records than the smaller record set (~54 records)? The loop runs from 0 to 49 either way, but it takes a lot longer the larger the initial record set/GetRows() array is. In other words, why would looping through the first 50 rows be slower when the array is larger, given that the loop exits after the same number of rows?
Running MS SQL Server 2008 R2 Web edition

You are not actually limiting the number of records returned: the query still retrieves the full result set, so it takes longer to load roughly 33 times as many records (~1800 vs. ~54). In particular, GetRows() called without a row count copies every remaining row into the array, not just the current page. You should change your query to limit the records directly rather than retrieving all of them and terminating your loop after the first 50.
Try this:
SELECT *
FROM
(SELECT tableA.*, ROW_NUMBER() OVER(ORDER BY tableA.key) AS RowNum
FROM tableA
INNER JOIN tableB
ON tableA.key = tableB.key
WHERE {other refining criteria}) AS ResultSet
WHERE RowNum BETWEEN 1 AND 50
Also make sure the columns you are using to join are indexed.
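To page beyond the first 50 rows, the BETWEEN bounds can be computed from the page number. A minimal sketch (not from the original answer), where @PageNumber and @PageSize are stand-ins for iPageCurrent and iPageSize and the question's placeholders are carried over:
DECLARE @PageNumber int = 1;  -- stand-in for iPageCurrent
DECLARE @PageSize   int = 50; -- stand-in for iPageSize

SELECT *
FROM
(SELECT tableA.*, ROW_NUMBER() OVER(ORDER BY tableA.key) AS RowNum
FROM tableA
INNER JOIN tableB
ON tableA.key = tableB.key
WHERE {other refining criteria}) AS ResultSet
WHERE RowNum BETWEEN (@PageNumber - 1) * @PageSize + 1
AND @PageNumber * @PageSize
Opened this way, the recordset only ever contains one page of rows, so the ASP loop and GetRows() stay cheap regardless of how many products match.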

Related

OFFSET and FETCH causing massive performance hit on a query - including when OFFSET = 0

I have a fairly complex SQL query that involves returning about 20 columns from a large number of joins, used to populate a grid of results in a UI. It also uses a couple of CTEs to pre-filter the results. I've included an approximation of the query below (I've commented out the lines that fix the performance).
As the amount of data in the DB increased, the query performance tanked pretty hard, with only about 2500 rows in the main table 'Contract'.
Through experimentation, I found that just removing the ORDER BY ... OFFSET ... FETCH at the end took the performance from around 30 seconds to just 1 second!
order by 1 OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
This makes no sense to me. The final line should be pretty cheap, free even when the OFFSET is zero, so why is it adding 29 seconds to my query time?
In order to maintain the same function for the SQL, I adapted it so that I first select into #TEMP, then perform the above order-offset-fetch on the temp table, then drop the temp table. This completes in about 2-3 seconds.
My 'optimisation' feels pretty wrong; surely there's a saner way to achieve the same speed?
I haven't extensively tested this for larger datasets; it's essentially a quick fix to get performance back for now, and I doubt it will be efficient as the data size grows.
Other than the Clustered Indexes on the primary keys, there are no indexes on the tables. The Query Execution plan didn't appear to show any major bottlenecks, but I'm not an expert on interpreting it.
WITH tableOfAllContractIdsThatMatchRequiredStatus(contractId)
AS (
SELECT DISTINCT c.id
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE
ISNULL(s.Deleted, '0') = 0
AND ss.status in ('saved')
)
,tableOfAllStatusesForAContract(contractId, status)
AS (
SELECT DISTINCT c.id, ss.status
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE ss.SupplyType IN ('Electricity') AND ISNULL(s.Deleted, '0') = 0
)
SELECT
[Contract].[Id]
,[Contract].[IsMultiSite]
,statuses.StatusesAsCsv
... lots more columns
,[WaterSupply].[Status] AS ws
--INTO #temp
FROM
(
SELECT
tableOfAllStatusesForAContract.contractId,
string_agg(status, ', ') AS StatusesAsCsv
FROM
tableOfAllStatusesForAContract
GROUP BY
tableOfAllStatusesForAContract.contractId
) statuses
JOIN contract ON Contract.id = statuses.contractId
JOIN tableOfAllContractIdsThatMatchRequiredStatus ON tableOfAllContractIdsThatMatchRequiredStatus.contractId = Contract.id
JOIN Site ON contract.Id = site.contractId and site.isprimarySite = 1 AND ISNULL(Site.Deleted,0) = 0
... several more joins
JOIN [User] ON [Contract].ownerUserId = [User].Id
WHERE isnull(Deleted, 0) = 0
AND
(
[Contract].[Id] = '12659'
OR [Site].[Id] = '12659'
... often more search term type predicates here
)
--select * from #temp
order by 1
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
--drop table #temp
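For reference, the temp-table workaround described above amounts to uncommenting those three lines, i.e. a shape roughly like this (heavily abbreviated; the joins and WHERE clause are the same as in the full query):
SELECT
[Contract].[Id]
,[Contract].[IsMultiSite]
,statuses.StatusesAsCsv
-- ... lots more columns
INTO #temp
FROM ... -- same joins and predicates as above

SELECT * FROM #temp
ORDER BY 1
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY

DROP TABLE #temp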
I've not had an answer, so I'm going to try to explain it myself, with my admittedly poor understanding of how SQL works and some pointers from Jeroen in the comments above. It's probably not exactly right, but it fits what I've discovered, and I do know how to fix my immediate problem, so it could help others.
I'll explain it with an analogy, as this is what I believe is probably happening:
Imagine you're a chef in a restaurant, and you have to prepare a large number of meals (rows in the results). You know there's going to be a lot, as your front of house has told you so (TOP 10 or FETCH 10).
You spend time setting out the multitude of ingredients required (table joins) and the equipment you'll need, and as the first order comes in, you make sure you're going to be really efficient: chopping up more than you need for the first order and putting it in little bowls ready to use on the subsequent orders. The first order takes you quite a while (30 secs) as you're planning ahead and want the subsequent dishes to go out as fast as possible.
However, as you sit in the kitchen waiting for the next orders... they don't arrive. That's it, just one order. Well, that was a waste of time! If you'd just tried to get one dish out, you could have done it much faster (1 sec), but you were planning ahead for something that was never needed.
The next night, you ditch your previous strategy and just do one plate at a time. However, this time there are hundreds of customers, and you can't deliver the meals fast enough doing them one at a time. All the orders would have gone out much faster if you'd planned ahead like the previous night. (I've not tested this hypothesis, but I expect it is what would probably happen.)
For my query, I don't know whether there's going to be 1 result or hundreds. I may be able to do some analysis up front based on the search criteria entered by the user, or I may have to adapt my UI to give me more information so I can predict this better, which would let me pick the appropriate strategy for SQL to use up front. As it is, I'm optimised for a small number of results, which works fine for now - but I need to do some more extensive testing to see how performance is affected as the dataset grows.
"If you want a answer to something, post something that's wrong on the internet and someone will be sure to correct you"

SQL join running slow

I have 2 SQL queries doing the same thing; the first takes 13 seconds to execute while the second takes 1 second. Any reason why?
Not necessarily all the IDs in ProcessMessages will have data in ProcessMessageDetails.
-- takes 13 sec to execute
Select * from dbo.ProcessMessages t1
join dbo.ProcessMessageDetails t2 on t1.ProcessMessageId = t2.ProcessMessageId
Where Id = 4 and Isdone = 0
--takes under a sec to execute
Select * from dbo.ProcessMessageDetails
where ProcessMessageId in ( Select distinct ProcessMessageId from dbo.ProcessMessages t1
Where Id = 4 and Isdone = 0 )
I have a clustered index on t1.ProcessMessageId (PK) and a nonclustered index on t2.ProcessMessageId (FK).
I would need the actual execution plans to tell you exactly what SQL Server is doing behind the scenes, but I can tell you these queries aren't doing the exact same thing.
The first query is going through and finding all of the items that meet the conditions for t1 and finding all of the items for t2 and then finding which ones match and joining them together.
The second one is saying first find all of the items that are meet my criteria from t1, and then find the items in t2 that have one of these IDs.
Depending on your statistics, available indexes, hardware, and table sizes, SQL Server may decide to do different types of scans or seeks to pick data for each part of the query, and it may also decide to join the data together in different ways.
The answer to your question is really simple: the first query you used generates more rows than the second, so it takes more time to work through them. That's why your first query took 13 seconds and the second one only 1 second.
So it is generally suggested that you apply your conditions before making your join; otherwise the number of rows increases and you then need more time to work through that many joined rows.
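If the goal is the semi-join behaviour of the second query with the readability of the first, an EXISTS form (not from the original thread, assuming Id and Isdone are columns of ProcessMessages) is another common way to write it:
Select d.*
from dbo.ProcessMessageDetails d
where exists ( Select 1
from dbo.ProcessMessages m
where m.ProcessMessageId = d.ProcessMessageId
and m.Id = 4 and m.Isdone = 0 )
SQL Server usually produces the same plan for the IN (subquery) and EXISTS forms, so the practical difference from the first query is the plan chosen, not the syntax.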

Table Scan very high "actual rows" when filter placed on different table

I have a query, that I did not write, that takes 2.5 minutes to run. I am trying to optimize it without being able to modify the underlying tables, i.e. no new indexes can be added.
During my optimization troubleshooting I commented out a filter and all of a sudden my query ran in 0.5 seconds. I have experimented with the formatting and placement of that filter, and if it is there the query takes 2.5 minutes; without it, 0.5 seconds. The biggest problem is that the filter is not on the table that is being table-scanned (with over 300k records); it is on a table with 300 records.
The "Actual Execution Plan" of both the 0:0:0.5 vs 0:2:30 are identical down to the exact percentage costs of all steps:
[Execution Plan screenshot]
The only difference is that, for the table-scanned table, the "Actual Number of Rows" on the 2.5-minute query shows 3.7 million rows, even though the table only has 300k rows, whereas the 0.5-second query shows an Actual Number of Rows of 2,063. The filter is actually being placed on the FS_EDIPartner table, which only has 300 rows.
With the filter I get the correct 51 records, but it takes 2.5 minutes to return. Without the filter I get duplication, so I get 2,796 rows, and it only takes half a second to return.
I cannot figure out why adding the filter to a table with 300 rows and a correct index causes the table scan of a different table to show such a significant difference in actual number of rows. I am even running the table-scanned table as a sub-query to filter its records down from 300k to 17k prior to doing the join. Here is the actual query in its current state; sorry the tables don't make a lot of sense -- I could not reproduce this behavior in test data.
SELECT dbo.FS_ARInvoiceHeader.CustomerID
, dbo.FS_EDIPartner.PartnerID
, dbo.FS_ARInvoiceHeader.InvoiceNumber
, dbo.FS_ARInvoiceHeader.InvoiceDate
, dbo.FS_ARInvoiceHeader.InvoiceType
, dbo.FS_ARInvoiceHeader.CONumber
, dbo.FS_EDIPartner.InternalTransactionSetCode
, docs.DocumentName
, dbo.FS_ARInvoiceHeader.InvoiceStatus
FROM dbo.FS_ARInvoiceHeader
INNER JOIN dbo.FS_EDIPartner ON dbo.FS_ARInvoiceHeader.CustomerID = dbo.FS_EDIPartner.CustomerID
LEFT JOIN (Select DocumentName
FROM GentranDatabase.dbo.ZNW_Documents
WHERE DATEADD(SECOND,TimeCreated,'1970-1-1') > '2016-06-01'
AND TransactionSetID = '810') docs on dbo.FS_ARInvoiceHeader.InvoiceNumber = docs.DocumentName COLLATE Latin1_General_BIN
WHERE docs.DocumentName IS NULL
AND dbo.FS_ARInvoiceHeader.InvoiceType = 'I'
AND dbo.FS_ARInvoiceHeader.InvoiceStatus <> 'Y'
--AND (dbo.FS_EDIPartner.InternalTransactionSetCode = '810')
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'CB%'))
AND (NOT (dbo.FS_ARInvoiceHeader.CONumber LIKE 'DM%'))
AND InvoiceDate > '2016-06-01'
The commented-out line in the WHERE clause is the culprit; uncommenting it causes the 2.5-minute run.
It could be that the table statistics have gotten out of whack. These include the number of records each table has, which is used to choose the best query plan. Try running this and then running the query again:
EXEC sp_updatestats
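As a narrower alternative (not part of the original answer), statistics can also be refreshed for just the tables involved rather than the whole database:
UPDATE STATISTICS dbo.FS_ARInvoiceHeader;
UPDATE STATISTICS dbo.FS_EDIPartner;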
Using #jeremy's comment as a guideline and realising that the Actual Number of Rows was not my problem, but rather the number of executions, I figured out that the Hash Match took 0.5 seconds and the Nested Loops took 2.5 minutes. Trying to force the hash match using LEFT HASH JOIN was inconsistent depending on what the other filters were set to; changing dates took it from 0.5 seconds to 30 seconds sometimes, so forcing the hash (which is highly discouraged anyway) wasn't a good solution. Finally I resorted to moving the poorly performing view into a stored procedure, splitting out both of the tables that were related to the poor performance into table variables, and then joining those table variables. This gave the most consistently good performance. On average the SP returns in less than 1 second, which is far better than the 2.5 minutes it started at.
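A rough sketch of the shape of that fix, assuming for illustration that the two tables split out were FS_ARInvoiceHeader and FS_EDIPartner; the procedure name, column types and trimmed column lists are hypothetical, and the Gentran LEFT JOIN and remaining filters are omitted:
CREATE PROCEDURE dbo.GetOutstandingInvoices  -- hypothetical name
AS
BEGIN
DECLARE @Invoices TABLE (CustomerID int, InvoiceNumber varchar(50), InvoiceDate date,
InvoiceType char(1), CONumber varchar(50), InvoiceStatus char(1));
DECLARE @Partners TABLE (CustomerID int, PartnerID varchar(50), InternalTransactionSetCode varchar(10));

INSERT INTO @Invoices
SELECT CustomerID, InvoiceNumber, InvoiceDate, InvoiceType, CONumber, InvoiceStatus
FROM dbo.FS_ARInvoiceHeader
WHERE InvoiceType = 'I' AND InvoiceStatus <> 'Y' AND InvoiceDate > '2016-06-01';

INSERT INTO @Partners
SELECT CustomerID, PartnerID, InternalTransactionSetCode
FROM dbo.FS_EDIPartner
WHERE InternalTransactionSetCode = '810';

SELECT i.*, p.PartnerID, p.InternalTransactionSetCode
FROM @Invoices i
INNER JOIN @Partners p ON p.CustomerID = i.CustomerID;
-- the LEFT JOIN to GentranDatabase.dbo.ZNW_Documents and the CONumber filters would go here
END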
#Jeremy gets the credit, but since his wasn't an answer, I thought I would document what was actually done in case someone else stumbles across this later.

SQL Invoice Query Performance converting Credits to negative numbers

I have a 3rd party database that contains invoice data I need to report on. The Quantity and Amount fields are stored as positive numbers regardless of whether the "invoice" is a Credit Memo or an actual Invoice. There is a single-character field that contains the Type: "I" = Invoice, "R" = Credit.
In a report that aggregates 1.4 million records, I need to sum this data so that Credits subtract from the total and Invoices add to it, and I need to do this for 8 different columns in the report (CurrentYear, PreviousYear, etc.).
My problem is the performance of the many different ways to achieve this.
The best-performing approach seems to be using a CASE expression within the calculation, like so:
Case WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * (CASE WHEN InvoiceType = 'R' THEN -1 ELSE 1 END) ELSE 0 END as PPY_INVOICED_QTY
But readability-wise this is super ugly, since I have to do it for 8 different columns; performance is good, though -- it runs against all 1.4M records in 16 seconds.
Using a Scalar UDF kills performance
Case WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * dbo.fn_GetMultiplier(ARH.InvoiceType) ELSE 0 END as PPY_INVOICED_QTY
Takes almost 5 minutes. So can't do that.
Other options I can think of would be:
Multiple levels of Views, use a new view to add a Multiplier column, then SELECT from that and do the multiplication using the new column
Build a table that has 2 columns and 2 records, R, -1 and I, 1, and join it based on InvoiceType, but this seems excessive.
Any other ideas I am missing, or suggestions on best practice for this sort of thing? I cannot change the stored data, that is established by the 3rd party application.
I decided to go with the multiple views as Igor suggested, actually using the nested version; even though readability is lower, maintenance is easier due to only 1 named view instead of 2. Performance is similar to the 8 different CASE expressions, so overall it runs in just under 20 seconds.
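For illustration, the "add a Multiplier column in a view" idea might look roughly like this; the view, table and key names below are hypothetical stand-ins for the 3rd-party schema:
CREATE VIEW dbo.vw_InvoiceLinesSigned  -- hypothetical view name
AS
SELECT ARL.ShipQuantity,
ARH.AccountingYear,
ARH.InvoiceType,
CASE WHEN ARH.InvoiceType = 'R' THEN -1 ELSE 1 END AS Multiplier
FROM dbo.ARInvoiceLine AS ARL      -- hypothetical line table
JOIN dbo.ARInvoiceHeader AS ARH    -- hypothetical header table
ON ARH.InvoiceID = ARL.InvoiceID;  -- hypothetical join key
Each report column can then be written as CASE WHEN AccountingYear - 2 = @iCurrentYear THEN ShipQuantity * Multiplier ELSE 0 END, which keeps the sign logic in one place.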
Thanks for the insights.

Perl DBI fetch portion of database?

I'm working on a database where the number of rows is somewhere above 1,000,000. I have my SELECT statement, but if I fetchall to begin with, I run out of memory quickly. Here are my 2 questions:
Since I don't know the exact size of the database to start, is there any way to find out the number of rows without doing a fetchall? The computer literally cannot support it.
Is there any way to fetch, say, a certain chunk of the database, maybe 5,000 rows at a time to process, instead of doing an individual fetchrow for each and every line? I just finished running a test, and doing it row by row looks to take almost 4 minutes per 1000 rows worked on, and the boss isn't looking favorably on a program that is going to take almost 3 days to complete.
This is my code:
while ($i < $rows)
{
    if ($i + $chunkRows < $rows)
    {
        for ($j = 0; $j < $chunkRows; $j++)
        {
            @array = $sth->fetchrow();
            ($nameOne, $numberOne, $numberTwo) = someFunction($lineCount, @array, $nameOne, $numberOne, $numberTwo);
        }
    }
    else
    {
        # run the same for loop for $j < $rows % $chunkRows
    }
    $i = $i + $j;
}
Show your fetchrow looping code; there may be ways to improve it, depending on how you are calling it and just what you are doing with the data.
I believe the database drivers for most databases will fetch multiple rows at once from the server; you are going to have to say what underlying type of database you are using to get good advice there. If indeed it is communicating with the server for each row, you are going to have to modify the SQL to get sets of rows at a time, but how to do that varies depending on what database you are using.
Ah, DB2. I'm not sure, but I think you have to do something like this:
SELECT *
FROM (SELECT col1, col2, col3, ROW_NUMBER() OVER () AS RN FROM table) AS cols
WHERE RN BETWEEN 1 AND 10000;
and adjust the numbers for each query until you get an empty result. Obviously this is more work on the database side to have it repeat the query multiple times; I don't know if there are DB2 ways to optimize this (i.e. temporary tables).
To get the number of rows in a table, you can use
Select count(*) from Table
Limiting the number of rows returned may be specific to your database. MySQL, for example, has a LIMIT keyword which will let you pull back only a certain number of rows.
That being said, if you are pulling back all rows, you may want to add some other questions here describing specifically what you are doing, because that's not a common thing in most applications.
If you don't have a limit available in your database, you can do things like flag a column with a boolean to indicate that a row was processed, and then re-run your query for a limited number of rows, skipping those that have been completed. Or record the last row id processed, and then limit your next query to rows with a greater id. There are a lot of ways around that.
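As a sketch of that last idea in SQL (assuming an indexed id column and DB2-style syntax, neither of which is confirmed by the thread), each pass fetches the next chunk after the last id already processed:
-- hypothetical table/column names; the placeholder is the last id processed so far
SELECT id, col1, col2
FROM my_table
WHERE id > ?
ORDER BY id
FETCH FIRST 5000 ROWS ONLY;
Repeating this until it returns no rows processes the table in 5,000-row chunks without ever holding more than one chunk in memory.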
