Improving OFFSET performance in PostgreSQL

I have a table I'm doing an ORDER BY on before a LIMIT and OFFSET in order to paginate.
Adding an index on the ORDER BY column makes a massive difference to performance when combined with a small LIMIT: on a 500,000-row table I saw a 10,000x improvement from adding the index.
However, the index has no impact for high OFFSETs (i.e. later pages in my pagination). This is understandable: a b-tree index makes it easy to iterate in order from the beginning but not to find the nth item.
It seems that what would help is a counted b-tree index, but I'm not aware of support for these in PostgreSQL. Is there another solution? It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.
Unfortunately, the PostgreSQL manual simply says "The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient."
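To make the setup concrete, here is a sketch of the kind of query I mean (the table and column names are only illustrative):
-- Illustrative only: with an index on created_at the first pages are fast,
-- but for page 10,000 Postgres still has to read and discard 200,000 index
-- entries before returning anything.
SELECT *
FROM events
ORDER BY created_at
LIMIT 20
OFFSET 200000;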

You might want a computed index.
Let's create a table:
create table sales(day date, amount real);
And fill it with some random stuff:
insert into sales
select current_date + s.a as day, random() * 100 as amount
from generate_series(1, 20) as s(a);
Index it by day, nothing special here:
create index sales_by_day on sales(day);
Create a row position function. There are other approaches, this one is the simplest:
create or replace function sales_pos (date) returns bigint
as 'select count(day) from sales where day <= $1;'
language sql immutable;
Check if it works (don't call it like this on large datasets though):
select sales_pos(day), day, amount from sales;
 sales_pos |    day     |  amount
-----------+------------+----------
         1 | 2011-07-08 | 41.6135
         2 | 2011-07-09 | 19.0663
         3 | 2011-07-10 | 12.3715
 ..................
Now the tricky part: add another index computed on the sales_pos function values:
create index sales_by_pos on sales using btree(sales_pos(day));
Here is how you use it. 5 is your "offset", 10 is the "limit":
select * from sales where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
    day     | amount
------------+---------
 2011-07-12 | 94.3042
 2011-07-13 | 12.9532
 2011-07-14 | 74.7261
 ...............
It is fast, because when you call it like this, Postgres uses precalculated values from the index:
explain select * from sales
where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
                                QUERY PLAN
--------------------------------------------------------------------------
 Index Scan using sales_by_pos on sales  (cost=0.50..8.77 rows=1 width=8)
   Index Cond: ((sales_pos(day) >= 5) AND (sales_pos(day) < 15))
Hope it helps.

I don't know anything about "counted b-tree indexes", but one thing we've done in our application to help with this is break our queries into two, possibly using a sub-query. My apologies for wasting your time if you're already doing this.
SELECT *
FROM massive_table
WHERE id IN (
SELECT id
FROM massive_table
WHERE ...
LIMIT 50
OFFSET 500000
);
The advantage here is that, while it still has to calculate the proper ordering of everything, it doesn't order the entire row--only the id column.
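To make that concrete, here is a slightly fuller sketch of the pattern with an explicit ORDER BY; sort_col stands for whatever column you actually order by, and the row counts are placeholders:
-- Hypothetical fuller version of the pattern above.
SELECT *
FROM massive_table
WHERE id IN (
    SELECT id
    FROM massive_table
    ORDER BY sort_col
    LIMIT 50
    OFFSET 500000
)
ORDER BY sort_col;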

Instead of using an OFFSET, a very efficient trick is to use a temporary table:
CREATE TEMPORARY TABLE just_index AS
SELECT ROW_NUMBER() OVER (ORDER BY myID) AS row_number, myID
FROM mytable;
For 10 000 000 rows it needs about 10s to be created.
Then, when you want to SELECT or UPDATE your table, you simply run:
SELECT *
FROM mytable
INNER JOIN (
    SELECT just_index.myID
    FROM just_index
    WHERE row_number >= *your offset*
    LIMIT 1000000
) indexes ON mytable.myID = indexes.myID;
Filtering mytable through just_index was more efficient (in my case) with an INNER JOIN than with WHERE myID IN (SELECT ...).
This way you don't have to store the last myID value; you simply replace the offset with a WHERE clause that can use an index.
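If lookups on the temporary table are still slow, it may also help to index it on the generated row number. A sketch; it assumes the window-function column kept its default name row_number:
-- Assumes the window-function column is named "row_number".
CREATE INDEX just_index_by_rownum ON just_index (row_number);
ANALYZE just_index;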

It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.
It seems a little unusual to me. Most people, most of the time, don't seem to skim through very many pages. It's something I'd support, but wouldn't work hard to optimize.
But anyway . . .
Since your application code knows which ordered values it's already seen, it should be able to reduce the result set and reduce the offset by excluding those values in the WHERE clause. Assuming you order by a single column and it's sorted ascending, your app code can store the last value on the page, then add AND your-ordered-column-name > last-value-seen to the WHERE clause in some appropriate way.
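A minimal sketch of that idea, assuming a single ascending sort column called created_at, a page size of 20, and an events table (all names hypothetical):
-- First page
SELECT *
FROM events
ORDER BY created_at
LIMIT 20;

-- Next page: pass in the last created_at value seen on the previous page
SELECT *
FROM events
WHERE created_at > '2011-07-10'
ORDER BY created_at
LIMIT 20;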

I recently worked on a problem like this and wrote a blog post about how to approach it. It is very similar, so I hope it is helpful to someone.
I used a lazy-list approach with partial acquisition, replacing the LIMIT and OFFSET pagination of the query with manual pagination.
In my example, the select returns 10 million records; I fetch them and insert them into a "temporary table":
create or replace function load_records ()
returns VOID as $$
BEGIN
  -- restart the row-number sequence on every load
  drop sequence if exists temp_seq;
  create temp sequence temp_seq;

  -- tmp_table must already exist; each row gets a sequential ROWNUM
  insert into tmp_table
  SELECT linea.*
  FROM
  (
    select nextval('temp_seq') as ROWNUM, *
    from table1 t1
    join table2 t2 on (t2.fieldpk = t1.fieldpk)
    join table3 t3 on (t3.fieldpk = t2.fieldpk)
  ) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting rows, using the assigned sequence value instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java side, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get method uses a data access interface to fetch the next set of data and release the previous one from the heap:
@Override
public E get(int index) {
    if (bufferParcial.size() <= (index - lastIndexRoulette)) {
        // buffer exhausted: remember where this window starts and refill it
        lastIndexRoulette = index;
        bufferParcial = new ArrayList<E>();
        bufferParcial.addAll(daoInterface.getBufferParcial());
        if (bufferParcial.isEmpty()) {
            return null;
        }
    }
    return bufferParcial.get(index - lastIndexRoulette);
}
On the other hand, the data access interface uses the query above to paginate and implements a method to iterate progressively, 25,000 records at a time, until the whole set is consumed.
Results for this approach can be seen here:
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html

Related

Handling very big table in SQL Server Performance

I'm having some trouble dealing with a very big table in my database. Before talking about the problem, let me explain what I want to achieve.
I have two source tables :
Source 1: SALES_MAN (ID_SMAN, SM_LATITUDE, SM_LONGITUDE)
Source 2: CLIENT (ID_CLIENT, CLATITUDE, CLONGITUDE)
Target: DISTANCE (ID_SMAN, ID_CLIENT, SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE, DISTANCE)
The idea is to find the top N nearest SALES_MAN for every client using a ROW_NUMBER in the target table.
What I'm doing currently is calculating the distance between every client and every sales man :
INSERT INTO DISTANCE ([ID_SMAN], [ID_CLIENT], [DISTANCE],
[SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE])
SELECT
[ID_SMAN], [ID_CLIENT],
geography::STGeomFromText('POINT(' + [CLATITUDE] + ' ' + [CLONGITUDE] + ')', 4326).STDistance(geography::STGeomFromText('POINT(' + [SM_LATITUDE] + ' ' + [SM_LONGITUDE] + ')', 4326)) / 1000 as distance,
[SM_LATITUDE], [SM_LONGITUDE], [CLATITUDE], [CLONGITUDE]
FROM
[dbo].[SALES_MAN], [dbo].[CLIENT]
The DISTANCE table contains approximately 1 billion rows.
The second step, to get my 5 nearest salesmen per client, is to run this query:
SELECT *
FROM
(SELECT
*,
ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE) rang
FROM DISTANCE) TAB
WHERE rang < 6
The last query is a really heavy one. To avoid the SORT operator I tried to create a sorted nonclustered index on DISTANCE and ID_CLIENT, but it did not work. I also tried to include all the needed columns in both indexes.
But when I created a clustered index on DISTANCE and kept the nonclustered sorted index on ID_CLIENT, things went better.
So why is a nonclustered sorted index not working in this case?
But when I use the clustered index, I have another problem loading data, and I am more or less forced to drop it before starting the loading process.
So what do you think? And how can we deal with this kind of table so that we can select, insert or update data without performance issues?
Many thanks
Too long for a comment, but consider the following points.
Item 1) Consider adding a Geography field to each of your source tables. This will eliminate the redundant GEOGRAPHY::Point() function calls
Update YourTable Set GeoPoint = GEOGRAPHY::Point([Lat], [Lng], 4326)
So then the calculation for distance would simply be
,InMeters = C.GeoPoint.STDistance(S.GeoPoint)
,InMiles = C.GeoPoint.STDistance(S.GeoPoint) / 1609.344
Item 2) Rather than generating EVERY possible combination, consider adding a condition to the JOIN. Keep in mind that every "1" of Lat or Lng is approx 69 miles, so you can reduce the search area. For example
From CLIENT C
Join SALES_MAN S
on S.Lat between C.Lat-1 and C.Lat+1
and S.Lng between C.Lng-1 and C.Lng+1
This +/- 1 could be any reasonable value (e.g. 0.5 or even 2.0).
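Putting the two items together, a rough sketch might look like this. It is only a sketch and assumes the GeoPoint columns from Item 1 have been added and populated on both tables:
-- Sketch only: GeoPoint columns are assumed to exist and be populated.
INSERT INTO DISTANCE (ID_SMAN, ID_CLIENT, DISTANCE,
                      SM_LATITUDE, SM_LONGITUDE, CLATITUDE, CLONGITUDE)
SELECT
    S.ID_SMAN,
    C.ID_CLIENT,
    C.GeoPoint.STDistance(S.GeoPoint) / 1000 AS DISTANCE,  -- kilometres
    S.SM_LATITUDE, S.SM_LONGITUDE, C.CLATITUDE, C.CLONGITUDE
FROM CLIENT C
JOIN SALES_MAN S
  ON S.SM_LATITUDE  BETWEEN C.CLATITUDE  - 1 AND C.CLATITUDE  + 1
 AND S.SM_LONGITUDE BETWEEN C.CLONGITUDE - 1 AND C.CLONGITUDE + 1;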
ROW_NUMBER is a window function that needs all the rows related to the ORDER BY column, so it is better to filter your result set before applying ROW_NUMBER,
and you have to change the following code:
SELECT * FROM (
SELECT *, ROW_NUMBER() OVER(PARTITION BY ID_CLIENT ORDER BY DISTANCE)
rang FROM DISTANCE
) TAB
WHERE rang < 6
into this:
WITH DISTANCE_CLIENT_IDS (CLIENT_ID) AS
(
    SELECT DISTINCT ID_CLIENT
    FROM DISTANCE
)
SELECT Dx.*
FROM DISTANCE_CLIENT_IDS D1
CROSS APPLY
(
    SELECT Dt.*, ROW_NUMBER() OVER (ORDER BY Dt.DISTANCE) AS rang
    FROM (
        SELECT TOP (5) *
        FROM DISTANCE D2
        WHERE D1.CLIENT_ID = D2.ID_CLIENT
        ORDER BY D2.DISTANCE
    ) Dt
) Dx
And make sure you have added indexes on both the ID_CLIENT and DISTANCE columns.
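For example, something along these lines (the index name is made up; adjust it to what the outer query actually selects) lets each per-client TOP (5) read only a handful of index rows:
-- Hypothetical supporting index for the per-client nearest-5 lookup.
CREATE NONCLUSTERED INDEX IX_DISTANCE_Client_Distance
    ON DISTANCE (ID_CLIENT, DISTANCE);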

SQL Get Second Record

I am looking to retrieve only the second (duplicate) record from a data set. For example in the following picture:
Inside the UnitID column there are two separate records for 105. I only want the returned data set to return the second 105 record. Additionally, I want this query to return the second record for all duplicates, not just 105.
I have tried everything I can think of, albeit I am not that experienced, and I cannot figure it out. Any help would be greatly appreciated.
You need to use GROUP BY for this.
Here's an example (I can't read your first column name, so I'm calling it JobUnitK):
SELECT MAX(JobUnitK), Unit
FROM JobUnits
WHERE DispatchDate = 'oct 4, 2015'
GROUP BY Unit
HAVING COUNT(*) > 1
I'm assuming JobUnitK is your ordering/id field. If it's not, just replace MAX(JobUnitK) with MAX(FieldIOrderWith).
Use the RANK function. Rank the rows OVER (PARTITION BY UnitID ...) and pick the rows with rank 2.
For reference -
https://msdn.microsoft.com/en-IN/library/ms176102.aspx
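For illustration, a sketch of that approach using the table and column names guessed from the other answers (JobUnits, UnitID and JobUnitKeyID are assumptions):
-- Assumed table/column names; order by whatever defines "second".
SELECT *
FROM (
    SELECT *,
           RANK() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID) AS rnk
    FROM JobUnits
    WHERE DispatchDate = '20151004'
) AS ranked
WHERE rnk = 2;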
Assuming SQL Server 2005 and up, you can use the Row_Number windowing function:
WITH DupeCalc AS (
   SELECT
      DupID = Row_Number() OVER (PARTITION BY UnitID ORDER BY JobUnitKeyID),
      *
   FROM JobUnits
   WHERE DispatchDate = '20151004'
)
SELECT *
FROM DupeCalc
WHERE DupID >= 2
ORDER BY UnitID DESC;
This is better than a solution that uses Max(JobUnitKeyID) for multiple reasons:
There could be more than one duplicate, in which case you would need Min(JobUnitKeyID) in conjunction with UnitID to join back on the UnitID where JobUnitKeyID <> MinJobUnitKeyID.
In any case, using Min or Max requires you to join back to the same data (which will be inherently slower).
If the ordering key you use turns out to be non-unique, you won't be able to pull the right number of rows with either one.
If the ordering key consists of multiple columns, the query using Min or Max explodes in complexity.

Indexing and optimization of where clause based on datetime field

I have a database with more than a million rows. When I execute this query it takes hours, mostly due to PAGEIOLATCH_SH waits. There is currently no indexing. Can you suggest possible indexing for the WHERE clause? I believe it should be on the datetime column, since it is used in the WHERE as well as the ORDER BY; if so, which index should I use?
if(<some condition>)
BEGIN
select <some columns>
From <some tables with joins(no lock)>
WHERE
((@var2 IS NULL AND a.addr IS NOT NULL) OR
(a.addr LIKE @var2 + '%')) AND
((@var3 IS NULL AND a.ca_id IS NOT NULL) OR
(a.ca_id = @var3)) AND
b.time >= @from_datetime AND b.time <= @to_datetime AND
(
(
b.shopping_product IN ('CX12343', 'BG8945', 'GF4543') AND
b.shopping_category IN ('online', 'COD')
)
OR
(
b.shopping_product = 'LX3454' and b.sub_shopping_list in ('FF544','GT544','KK543','LK5343')
)
OR
(
b.shopping_product = 'LK434434' and b.sub_shopping_list in ('LL5435','PO89554','IO948854','OR4334','TH5444')
)
OR
(
b.shopping_product = 'AZ434434' and b.sub_shopping_list in ('LL54352','PO489554','IO9458854','OR34334','TH54344')
)
)
ORDER BY
b.time desc
END
ELSE
BEGIN
select <some columns>
From <some tables with joins(no lock)>
where <similar where as above with slight difference>
END
Okay then,
I said: "First create indexes on these: shopping_product, shopping_category and sub_shopping_list; secondly you can try one on the date; after that, look at the execution plan (or it would be better to create a partition on the time column)."
I'm working on Oracle, but the basics are the same.
You can create 3 distinct indexes on those columns: shopping_product, shopping_category, sub_shopping_list. Or you can create 1 composite index for those 3 columns. The point is that you need to examine the execution plan to see which one is most effective for you.
Oh, and there is the a.ca_id column (almost forgot), you need an index for this too.
For the date column I think you would be better off creating a partition instead of an index.
In summary, two ways:
- create 4 distinct indexes (shopping_product, shopping_category, sub_shopping_list, ca_id) and create a range-typed partition on the date column
- create 1 composite index (shopping_product, shopping_category, sub_shopping_list) and 1 normal index (ca_id), and create a range-typed partition on the date column
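In SQL Server terms the second option might be sketched like this; the real table names behind aliases a and b are unknown, so orders and addresses are only placeholders:
-- Sketch of option 2; "orders" and "addresses" stand in for the real tables
-- behind aliases b and a in the question.
CREATE INDEX IX_orders_product_category_sublist
    ON orders (shopping_product, shopping_category, sub_shopping_list);

CREATE INDEX IX_addresses_ca_id
    ON addresses (ca_id);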
You probably should learn about indexing if you're dealing with tables of this size. It's not a trivial process. JOIN operations are a big deal when sorting out which indexes you need. Read this. http://use-the-index-luke.com/
In the meantime, if your date-range is highly selective (that is, if
b.time >= @from_datetime AND b.time <= @to_datetime
chooses a reasonably small fraction of the rows in your database) you should try the following compound index.
b.shopping_product, b.time
If that doesn't help, try
b.time
by itself. The idea is to structure your index so the server can do a range scan. Without knowledge of your whole query, there's not much else to offer.
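If the date range is selective enough to try the compound index, it might be created like this (again, orders is only a placeholder for the table aliased as b):
-- Hypothetical: "orders" stands in for the table aliased as b.
CREATE INDEX IX_orders_product_time
    ON orders (shopping_product, [time]);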

"order by newid()" - how does it work?

I know that If I run this query
select top 100 * from mytable order by newid()
it will get 100 random records from my table.
However, I'm a bit confused as to how it works, since I don't see newid() in the select list. Can someone explain? Is there something special about newid() here?
I know what NewID() does, I'm just trying to understand how it would help in the random selection. Is it that (1) the select statement will select EVERYTHING from mytable, (2) for each row selected, tack on a uniqueidentifier generated by NewID(), (3) sort the rows by this uniqueidentifier and (4) pick off the top 100 from the sorted list?
Yes, this is pretty much exactly correct (except it doesn't necessarily need to sort all the rows). You can verify this by looking at the actual execution plan.
SELECT TOP 100 *
FROM master..spt_values
ORDER BY NEWID()
The Compute Scalar operator adds the NEWID() column on for each row (2,506 rows in the table in my example query), then the rows are sorted by this column and the top 100 are selected.
SQL Server doesn't actually need to sort the entire set beyond position 100, so it uses a TOP N Sort operator, which attempts to perform the entire sort operation in memory (for small values of N).
In general it works like this:
All rows from mytable are "looped"
NEWID() is executed for each row
The rows are sorted according to the random number from NEWID()
The first 100 rows are selected
As MSDN says:
NewID() creates a unique value of type uniqueidentifier.
and your table will be sorted by these random values.
Use select top 100 randid = newid(), * from mytable order by randid
and it will be clearer then.
I have an unimportant query which uses newId() and joins many tables. It returns about 10k rows in about 3 seconds. So, newId() might be ok in such cases where performance is not too bad & does not have a huge impact. But, newId() is bad for large tables.
Here is the explanation from Brent Ozar's blog - https://www.brentozar.com/archive/2018/03/get-random-row-large-table/.
From the above link, I have summarized the methods which you can use to generate a random id. You can read the blog for more details.
4 ways to get a random row from a large table:
Method 1, Bad: ORDER BY NEWID() > Bad performance!
Method 2, Better but Strange: TABLESAMPLE > Many gotchas & is not really random!
Method 3, Best but Requires Code: Random Primary Key > Fastest, but won't work for negative numbers.
Method 4, OFFSET-FETCH (2012+) > Only performs properly with a clustered index.
More on method 3:
Get the top ID field in the table, generate a random number, and look for that ID. For top N rows, call the code below N times or generate N random numbers and use in an IN clause.
/* Get a random number smaller than the table's top ID */
DECLARE @rand BIGINT;
DECLARE @maxid INT = (SELECT MAX(Id) FROM dbo.Users);
SELECT @rand = ABS((CHECKSUM(NEWID()))) % @maxid;

/* Get the first row around that ID */
SELECT TOP 1 *
FROM dbo.Users AS u
WHERE u.Id >= @rand;
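For completeness, Method 4 from the list above might look roughly like this. It is only a sketch and assumes the same dbo.Users table, a clustered index on Id, and at least one row in the table:
/* Method 4 sketch: skip to a random position with OFFSET-FETCH (SQL Server 2012+) */
DECLARE @rows BIGINT = (SELECT COUNT(*) FROM dbo.Users);
DECLARE @skip BIGINT = ABS(CHECKSUM(NEWID())) % @rows;

SELECT *
FROM dbo.Users
ORDER BY Id
OFFSET @skip ROWS FETCH NEXT 1 ROWS ONLY;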

MS Access row number, specify an index

Is there a way in MS Access to return a dataset between specific indexes?
So lets say my dataset is:
rank | first_name | age
-----+------------+----
   1 | Max        |  23
   2 | Bob        |  40
   3 | Sid        |  25
   4 | Billy      |  18
   5 | Sally      |  19
But I only want to return those records between 'rank' 2 and 4, so my result set is Bob, Sid and Billy. However, rank is not part of the table; it should be generated when the query is run. Why don't I use an autogenerated number? Because if a record is deleted the numbering becomes inconsistent, and what if I wanted the results in reverse?
This is obviously very simple; the reason I ask is that I am working on a product catalogue and am looking for a more efficient way of paging through the returned dataset. If I only return one page's worth of data from the database, this is obviously going to be quicker than returning a complete set of 3000 records and then sub-selecting from that set.
Thanks R.
Original suggestion:
SELECT * from table where rank BETWEEN 2 and 4;
Modified after the comment that rank does not exist in the structure:
Select top 100 * from table;
And if you want to choose subsequent results, you can choose the ID of the last record from the first query, say it was ID 101, and use a WHERE clause to get the next 100;
Select top 100 * from table where ID > 100;
But these won't give you what you're looking for either, I bet.
How are you calculating rank? I assume you are basing it on some data in another dataset somewhere. If so, create a function, do a table join, or do something that can calculate rank based on values in other table(s), then you can do queries based on the rank() function.
For example:
select *
from table
where rank() between 2 and 4
If you are not calculating rank based on some data somewhere, there really isn't a way to write this query, and you might as well be returning three random rows from the table.
I think you need to use a correlated subquery to calculate the rank on the fly e.g. I'm guessing the rank is based on name:
SELECT T1.first_name, T1.age,
(
SELECT COUNT(*) + 1
FROM MyTable AS T2
WHERE T1.first_name > T2.first_name
) AS rank
FROM MyTable AS T1;
The bad news is the Access data engine is poorly optimized for this kind of query; in my experience, performance will start to noticeably degrade beyond a few hundred rows.
If it is not possible to maintain the rank on the db side of the house (e.g. high insertion environment) consider doing the paging on the client side. For example, an ADO classic recordset object has properties to support paging (PageCount, PageSize, AbsolutePage, etc), something for which DAO recordsets (being of an older vintage) have no support.
As always, you'll have to perform your own timings, but I suspect that when there are, say, 10K rows you will find it faster to take on the overhead of fetching all the rows into an ADO recordset and then finding the page (then perhaps fabricating a smaller ADO recordset consisting of just that page's worth of rows) than it is to perform a correlated subquery to fetch only the number of rows for the page.
Unfortunately the LIMIT keyword isn't available in MS Access -- that's what is used in MySQL for a multi-page presentation. If you can write an order key into the results table, then you can use it something like this:
SELECT TOP 25 MyOrder, Etc FROM Table1 WHERE MyOrder in
(SELECT TOP 55 MyOrder FROM Table1 ORDER BY MyOrder DESC)
ORDER BY MyOrder ASC
If I understand you correctly, there are only first_name and age columns in your table. If this is the case, then there is no way to return Bob, Sid, and Billy with a single query. Unless you do something like
SELECT * FROM Table
WHERE FirstName = 'Bob'
OR FirstName = 'Sid'
OR FirstName = 'Billy'
But I think that this is not what you are looking for.
This is because SQL databases make no guarantee as to the order that the data will come out of the database unless you specify an ORDER BY clause. It will usually come out in the same order it was added, but there are no guarantees, and once you get a lot of rows in your table, there's a reasonably high probability that they won't come out in the order you put them in.
As a side note, you should probably add a "rank" column (this column is usually called id) to your table, and make it an auto-incrementing integer (see the Access documentation), so that you can do the query mentioned by Sev. It's also important to have a primary key so that you can be certain which rows are being updated when you run an update query, or which rows are being deleted when you run a delete query. For example, if you had 2 people named Max, and they were both 23, how would you delete one row without deleting the other? If you had another auto-incrementing unique column in there, you could specify the unique ID in your query to delete only one.
[ADDITION]
Upon reading your comment: if you add an autoincrement field, and want to read 3 rows, and you know the ID of the first row you want to read, then you can use "TOP" to read 3 rows.
Assuming your data looks like this
ID | first_name | age
---+------------+----
 1 | Max        |  23
 2 | Bob        |  40
 6 | Sid        |  25
 8 | Billy      |  18
15 | Sally      |  19
You can query Bob, Sid and Billy with the following query:
SELECT TOP 3 FirstName, Age
From Table
WHERE ID >= 2
ORDER BY ID
