Perl DBI fetch portion of database?

I'm working on a database where the number of rows is somewhere above 1,000,000. I have my select statement, but if I fetchall to begin with, I run out of memory quickly. Here are my 2 questions:
Since I don't know the exact size of the database to start, is there any way to find out the size of the database without doing a fetchall? The computer literally cannot support it.
Is there any way to fetch, say, a certain chunk of the database, maybe 5,000 rows at a time to process, instead of doing an individual fetchrow for each and every line? I just finished running a test, and doing it row by row looks to be almost 4 minutes per 1,000 rows worked on, and the boss isn't looking favorably on a program that is going to take almost 3 days to complete.
This is my code:
while ($i < $rows)
{
    if ($i + $chunkRows < $rows)
    {
        for ($j = 0; $j < $chunkRows; $j++)
        {
            @array = $sth->fetchrow();
            ($nameOne, $numberOne, $numberTwo) = someFunction($lineCount, @array, $nameOne, $numberOne, $numberTwo);
        }
    }
    else
    {
        # run the for loop for $j < $rows % $chunkRows
    }
    $i = $i + $j;
}

Show your fetchrow looping code; there may be ways to improve it, depending on how you are calling it and just what you are doing with the data.
I believe the database drivers for most databases will fetch multiple rows at once from the server; you are going to have to say what underlying type of database you are using to get good advice there. If indeed it is communicating with the server for each row, you are going to have to modify the SQL to get sets of rows at a time, but how to do that varies depending on what database you are using.
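That said, DBI itself can hand you rows in batches without changing the SQL: fetchall_arrayref takes an optional $max_rows argument, so you can pull, say, 5,000 rows per call and loop until the statement is exhausted. A rough sketch, assuming $sth is already prepared and executed and reusing the names from your snippet:
my $chunk_rows = 5_000;
# Fetch up to $chunk_rows rows per call instead of one fetchrow per line.
while (my $chunk = $sth->fetchall_arrayref(undef, $chunk_rows)) {
    last unless @$chunk;                      # statement exhausted
    for my $row (@$chunk) {                   # $row is an arrayref of columns
        ($nameOne, $numberOne, $numberTwo) =
            someFunction($lineCount, @$row, $nameOne, $numberOne, $numberTwo);
    }
}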
Ah, DB2. I'm not sure, but I think you have to do something like this:
SELECT *
FROM (SELECT col1, col2, col3, ROW_NUMBER() OVER () AS RN FROM table) AS cols
WHERE RN BETWEEN 1 AND 10000;
and adjust the numbers for each query until you get an empty result. Obviously this is more work on the database side, since it has to repeat the query multiple times; I don't know if there are DB2 ways to optimize this (e.g. temporary tables).
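If you do page on the DB2 side, the window bounds can be bind parameters so the statement is only prepared once. A rough Perl sketch of driving the query above (process_chunk is just a placeholder for whatever you do with each batch):
my $page_size = 10_000;
my $sth = $dbh->prepare(q{
    SELECT *
    FROM (SELECT col1, col2, col3, ROW_NUMBER() OVER () AS RN FROM table) AS cols
    WHERE RN BETWEEN ? AND ?
});
my $start = 1;
while (1) {
    $sth->execute($start, $start + $page_size - 1);
    my $rows = $sth->fetchall_arrayref;
    last unless @$rows;        # empty window: nothing left to read
    process_chunk($rows);      # placeholder for the real per-batch work
    $start += $page_size;
}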

To get the number of rows in a table, you can use
Select count(*) from Table
How to limit the number of rows returned may be specific to your database. MySQL, for example, has a LIMIT keyword which will let you pull back only a certain number of rows.
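From Perl that count is a one-liner with DBI's selectrow_array, so you can size the job up front without fetching any data rows (the table name here is just an example):
# One round trip, no data rows fetched, just the count.
my ($row_count) = $dbh->selectrow_array('SELECT COUNT(*) FROM my_table');
print "about to process $row_count rows\n";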
That being said, if you are pulling back all rows, you may want to add some other questions here describing specifically what you are doing, because that's not a common thing in most applications.
If you don't have a limit available in your database, you can do things like flag a column with a boolean to indicate that a row was processed, and then re-run your query for a limited number of rows, skipping those that have been completed. Or record the last row id processed, and then limit your next query to rows with a greater id. There are a lot of ways around that.
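The "record the last row id processed" idea could look roughly like this with DBI. It assumes the table has a unique, increasing id column, and it uses MySQL-style LIMIT purely as an illustration; other databases spell the row cap differently (DB2 uses FETCH FIRST n ROWS ONLY):
my $chunk_rows = 5_000;
my $last_id    = 0;
# Keyset pagination: each query resumes after the last id we processed.
my $sth = $dbh->prepare(qq{
    SELECT id, name_one, number_one, number_two
    FROM my_table
    WHERE id > ?
    ORDER BY id
    LIMIT $chunk_rows
});
while (1) {
    $sth->execute($last_id);
    my $rows = $sth->fetchall_arrayref;
    last unless @$rows;
    for my $row (@$rows) {
        # ... process $row here ...
    }
    $last_id = $rows->[-1][0];   # id column of the last row in this chunk
}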

Related

OFFSET and FETCH causing massive performance hit on a query - including when OFFSET = 0

I have a fairly complex SQL query that involves returning about 20 columns from a large number of joins, used to populate a grid of results in a UI. It also uses a couple of CTEs to pre-filter the results. I've included an approximation of the query below (I've commented out the lines that fix the performance)
As the amount of data in the DB increased, the query performance tanked pretty hard, with only about 2,500 rows in the main table 'Contract'.
Through experimentation, I found that just by removing the ORDER BY ... OFFSET ... FETCH at the end, the performance went from around 30 sec to just 1 sec!
order by 1 OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
This makes no sense to me. The final line should be pretty cheap, even free when the OFFSET is zero, so why is it adding 29 secs to my query time?
In order to maintain the same behaviour for the SQL, I adapted it so that I first SELECT INTO #TEMP, then perform the above order-offset-fetch on the temp table, then drop the temp table. This completes in about 2-3 seconds.
My 'optimisation' feels pretty wrong; surely there's a saner way to achieve the same speed?
I haven't extensively tested this for larger datasets, it's essentially a quick fix to get performance back for now. I doubt it will be efficient as the data size grows.
Other than the Clustered Indexes on the primary keys, there are no indexes on the tables. The Query Execution plan didn't appear to show any major bottlenecks, but I'm not an expert on interpreting it.
WITH tableOfAllContractIdsThatMatchRequiredStatus(contractId)
AS (
SELECT DISTINCT c.id
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE
ISNULL(s.Deleted, '0') = 0
AND ss.status in ('saved')
)
,tableOfAllStatusesForAContract(contractId, status)
AS (
SELECT DISTINCT c.id, ss.status
FROM contract c
INNER JOIN site s ON s.ContractId = c.id
INNER JOIN SiteSupply ss ON ss.SiteId = s.id AND ss.status != 'Draft'
WHERE ss.SupplyType IN ('Electricity') AND ISNULL(s.Deleted, '0') = 0
)
SELECT
[Contract].[Id]
,[Contract].[IsMultiSite]
,statuses.StatusesAsCsv
... lots more columns
,[WaterSupply].[Status] AS ws
--INTO #temp
FROM
(
SELECT
tableOfAllStatusesForAContract.contractId,
string_agg(status, ', ') AS StatusesAsCsv
FROM
tableOfAllStatusesForAContract
GROUP BY
tableOfAllStatusesForAContract.contractId
) statuses
JOIN contract ON Contract.id = statuses.contractId
JOIN tableOfAllContractIdsThatMatchRequiredStatus ON tableOfAllContractIdsThatMatchRequiredStatus.contractId = Contract.id
JOIN Site ON contract.Id = site.contractId and site.isprimarySite = 1 AND ISNULL(Site.Deleted,0) = 0
... several more joins
JOIN [User] ON [Contract].ownerUserId = [User].Id
WHERE isnull(Deleted, 0) = 0
AND
(
[Contract].[Id] = '12659'
OR [Site].[Id] = '12659'
... often more search term type predicates here
)
--select * from #temp
order by 1
OFFSET 0 ROWS FETCH NEXT 10 ROWS ONLY
--drop table #temp
I've not had an answer, so I'm going to try and explain it myself, with my admittedly poor understanding of how SQL works and some pointers from Jeroen in comments above. It's probably not right, but from what I've discovered it could be correct, and I do know how to fix my immediate problem so it could help others.
I'll explain it with an analogy, as this is what I believe is probably happening:
Imagine you're a chef in a restaurant, and you have to prepare a large number of meals (rows in the results). You know there's going to be a lot, as your front of house has told you this (TOP 10 or FETCH 10).
You spend time setting out the multitude of ingredients required (table joins) and the equipment you'll need, and as the first order comes in, you make sure you're going to be really efficient: chopping up more than you need for the first order, putting it in little bowls ready to use on the subsequent orders. The first order takes you quite a while (30 secs) as you're planning ahead and want the subsequent dishes to go out as fast as possible.
However, as you sit in the kitchen waiting for the next orders... they don't arrive. That's it, just one order. Well, that was a waste of time! If you'd just tried to get one dish out, you could have done it much faster (1 sec), but you were planning ahead for something that was never needed.
The next night, you ditch your previous strategy and just do each plate one at a time. However, this time there are hundreds of customers. You can't deliver the orders fast enough doing them one at a time. The total time to deliver all the orders would have been much shorter if you'd planned ahead like the previous night. (I've not tested this hypothesis, but I expect it is what would probably happen.)
For my query, I don't know if there's going to be 1 result or hundreds. I may be able to do some analysis up front based on the search criteria entered by the user, or I may have to adapt my UI to give me more information so I can predict this better, which would mean I can pick the appropriate strategy for SQL to use up front. As it is, I'm optimised for a small number of results, which works fine for now, but I need to do some more extensive testing to see how performance is affected as the dataset grows.
"If you want an answer to something, post something that's wrong on the internet and someone will be sure to correct you."

SQL Invoice Query Performance converting Credits to negative numbers

I have a 3rd party database that contains Invoice data I need to report on. The Quantity and Amount Fields are stored as Positive numbers regardless of whether the "invoice" is a Credit Memo or actual Invoice. There is a single character field that contains the Type "I" = Invoice, "R" = Credit.
In a report that evaluates 1.4 million records, I need to sum this data so that Credits subtract from the total and Invoices add to the total, and I need to do this for 8 different columns in the report (CurrentYear, PreviousYear, etc.).
My problem is the performance of the many different ways to achieve this.
The Best performing seems to be using a CASE statement within the equation like so:
Case WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * (CASE WHEN InvoiceType = 'R' THEN -1 ELSE 1 END) ELSE 0 END as PPY_INVOICED_QTY
But readability-wise this is super ugly, since I have to do it for 8 different columns; performance is good though, running against all 1.4M records in 16 seconds.
Using a Scalar UDF kills performance
Case WHEN ARH.AccountingYear - 2 = @iCurrentYear THEN ARL.ShipQuantity * dbo.fn_GetMultiplier(ARH.InvoiceType) ELSE 0 END as PPY_INVOICED_QTY
Takes almost 5 minutes. So can't do that.
Other options I can think of would be:
Multiple levels of views: use a new view to add a Multiplier column, then SELECT from that and do the multiplication using the new column.
Build a table that has 2 columns and 2 records (R, -1 and I, 1) and join to it based on InvoiceType, but this seems excessive.
Any other ideas I am missing, or suggestions on best practice for this sort of thing? I cannot change the stored data, that is established by the 3rd party application.
I decided to go with the multiple views as Igor suggested, actually using the nested version; even though readability is lower, maintenance is easier due to having only 1 named view instead of 2. Performance is similar to the 8 different case statements, so overall it runs in just under 20 seconds.
Thanks for the insights.

Load time variance with .CacheSize/.PageSize in ADODB.Recordset

I am working on a project for a client using a classic ASP application I am very familiar with, but in his environment it is performing more slowly than I have ever seen in a wide variety of other environments. I'm on it with many solutions; however, the sluggishness has got me to look at something I've never had to look at before -- it's more of an "academic" question.
I am curious to understand why a category page with, say, 1800 product records takes ~3 times as long to load as a category page with, say, 54, when both are set to display 50 products per page. That is, when the number of items to loop through is the same, why does the variance in the total number of records make a difference in loading the number of products displayed, when that is a constant?
Here are the methods used, abstracted to the essential aspects:
SELECT {tableA.fields} FROM tableA, tableB WHERE tableA.key = tableB.key AND {other refining criteria};
set rs=Server.CreateObject("ADODB.Recordset")
rs.CacheSize=iPageSize
rs.PageSize=iPageSize
pcv_strPageSize=iPageSize
rs.Open query, connObj, adOpenStatic, adLockReadOnly, adCmdText
dim iPageCount, pcv_intProductCount
iPageCount=rs.PageCount
If Cint(iPageCurrent) > Cint(iPageCount) Then iPageCurrent=Cint(iPageCount)
If Cint(iPageCurrent) < 1 Then iPageCurrent=1
if NOT rs.eof then
rs.AbsolutePage=Cint(iPageCurrent)
pcArray_Products = rs.getRows()
pcv_intProductCount = UBound(pcArray_Products,2)+1
end if
set rs = nothing
tCnt=Cint(0)
do while (tCnt < pcv_intProductCount) and (count < pcv_strPageSize)
{display stuff}
count=count + 1
loop
The record set is converted to an array via getRows() and then destroyed; the records displayed will always be iPageSize or fewer.
Here's the big question:
Why, on the initial page load for the larger record set (~1800 records), does it take significantly longer to loop through the page size (say 50 records) than on the smaller record set (~54 records)? It's running through 0 to 49 either way, but it takes a lot longer to do that the larger the initial record set/getRows() array is. That is, why would it take longer to loop through the first 50 records when the initial record set/getRows() array is larger, when it's still looping through the same number of records/rows before exiting the loop?
Running MS SQL Server 2008 R2 Web edition
You are not actually limiting the number of records returned. It will take longer to load 36 times more records. You should change your query to limit the records directly rather than retrieving all of them and terminating your loop after the first 50.
Try this:
SELECT *
FROM
(SELECT *, ROW_NUMBER() OVER(ORDER BY tableA.Key) AS RowNum
FROM tableA
INNER JOIN tableB
ON tableA.key = tableB.key
WHERE {other refining criteria}) AS ResultSet
WHERE RowNum BETWEEN 1 AND 50
Also make sure the columns you are using to join are indexed.

Increment a row in SQL Server database after the LAST row within a range

I have tried looking for an answer to my question, but due to lack of proper knowledge of databases, I thought some genius here could help me out.
I'm trying to add rows into a database table starting from the number 800 and onwards in a sequential order. The database currently has records from 1 up until 300.
This record in the table is not automatically incremented, but assigned manually.
I basically want to start another batch of numbers, but rather than incrementing them from where the program left off (i.e. from 300) I would like to start them from 800.
I tried different SQL statements but got nothing concrete to go with. I'm using PHP to enter these rows.
I feel like this shouldn't be so hard but I'm a novice and it kills!
I thought maybe adding one record that starts from 800 and then executing a statement like
SELECT LAST FROM A RANGE BETWEEN 800 AND 899
would be the way to go.
Any help, advice?
You can obtain the highest numbered record with the MAX aggregate function. The following will find the next number after the highest one currently between 800 and 899.
SELECT MAX(my_assigned_number) + 1 as next_number
FROM my_table
WHERE my_assigned_number >= 800 AND my_assigned_number <= 899;

100k Rows Returned in a random order, without a SQL time out please

Ok,
I've been doing a lot of reading over the last year on returning a random row set, and the solution we came up with was
ORDER BY newid()
This is fine for <5k rows, but when we are getting >10-20k rows we are getting SQL timeouts. The execution plan tells me that 76% of my query cost comes from this line, and removing this line increases the speed by an order of magnitude when we have a large number of rows.
Our users have a requirement of doing up to 100k rows at a time like this.
To give you all a bit more details.
We have a table with 2.6 million 4-digit alphanumeric codes. We use a random set of these to gain entry into a venue. For example, if we have an event with a 5,000 capacity, a random set of 5,000 of these will be drawn from the table and issued to each customer as a bar-code, and the bar-code scanning app at the door will have the same list of 5,000. The reason for using a 4-digit alphanumeric code (and not a stupidly long number like a GUID) is that it is easy for people to write the number down (or SMS it to a friend) and just bring the number and have it entered manually, so we don't want a large number of characters. Customers love the last bit, btw.
Is there a better way than ORDER BY newid(), or is there a faster way to get 100k random rows from a table with 2.6 million?
Oh, and we are using MS SQL 2005.
Thanks,
Jo
There is an MSDN article entitled "Selecting Rows Randomly from a Large Table" that talks about this exact problem and shows a solution (using no sorting but instead using a WHERE clause on a generated column to filter the rows).
The reason your query is slow is that the ORDER BY clause causes the whole table to be copied into tempdb for sorting.
If you want to generate random 4-digit codes, why not just generate them instead of trying to pull them out of a database?
Generate 100k unique numbers from 0 to 1,679,616 (which is the number of unique four-digit alphanumeric codes, ignoring case - 2.6 million rows must have some duplicates) and convert them to your four-digit codes.
You don't have to sort.
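If you go that route, the generation itself is trivial in whatever language the app is written in; here is a rough sketch in Perl that draws 100k unique 4-character codes (0-9 and A-Z, so 36^4 = 1,679,616 possibilities), using a hash to reject duplicates rather than generating numbers and converting them:
use strict;
use warnings;
my @alphabet = ('0' .. '9', 'A' .. 'Z');   # 36 characters, so 36**4 possible codes
my $wanted   = 100_000;
my %seen;
# Keep drawing random 4-character codes until we have 100k distinct ones.
while (keys %seen < $wanted) {
    my $code = join '', map { $alphabet[int rand @alphabet] } 1 .. 4;
    $seen{$code} = 1;
}
my @codes = keys %seen;
print scalar(@codes), " unique codes generated\n";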
DECLARE @RandomNumber int
DECLARE @Threshold float
SELECT @RandomNumber = COUNT(*) FROM customers
SELECT @Threshold = 50000.0 / @RandomNumber   -- float division so the fraction is not truncated to 0
SELECT TOP 50000 * FROM customers WHERE rand() > @Threshold ORDER BY newid()
Just as a matter of interest, what is the performance like if you replace
ORDER BY newid()
by
ORDER BY CHECKSUM(newid())
One thought is to break down the process into steps. Add a column in the table for a GUID, then run an update statement to populate the GUIDs. This can be done ahead of time if necessary. You should then be able to run the query with an ORDER BY on the GUID column to receive the results the same way.
Have you tried using % (modulo) on a given int column? Not sure what your table structure is, but you could do something like this:
select top 50000 *
from your_table
where CAST((CAST(ASCII(SUBSTRING(venuecode,1,1)) as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,2,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,3,1))as varchar(3))+
CAST(ASCII(SUBSTRING(venuecode,4,1))as varchar(3))) as bigint) % 500000 between 0 and 50000
The above code will take all of your alphanumeric venue codes and convert them to an integer, and then split the entire table into 500,000 buckets, of which you are taking the top 50,000 that fall between 0 and 50,000. You can play with the number after the % sign (500,000), and you can play with the BETWEEN. This should randomize it for you. Not sure if the WHERE clause will bite you on performance, but it's worth a shot. Also, without an ORDER BY, there is no guarantee of the order (if you have multiple CPUs and threading).
