Cassandra best practice to ORDER BY using PRIMARY KEY - database

Originally I had a cassandra table like this:
CREATE TABLE table (
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY(open_time));
open_time | close | high | low | open | volume
---------------------------------+--------+--------+-------+--------+--------
2020-08-05 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
2020-08-04 06:00:00.000000+0000 | 181.53 | 184.32 | 181.1 | 184.32 | 100
I need to perform a query to get the latest open_time. After noticing that queries like
SELECT open_time FROM table ORDER BY open_time DESC LIMIT 1;
are not allowed, I wonder what's the best practice here.
My idea is to add an id column so that I can use open_time as the clustering key. Something like:
CREATE TABLE table (
id int,
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY(id, open_time)
)
WITH CLUSTERING ORDER BY (open_time DESC);
Is this a valid solution to get the job done, or are there better ways, e.g. something without an extra id column, since I would never query by the id itself?
Most queries would be something like:
SELECT * FROM table WHERE open_time >= '2013-01-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Thanks!

CLUSTERING ORDER enforces the on-disk sort order within each partition. So ordering by the same key that you're partitioning on isn't possible. Partitioning by id will face a similar challenge, in that the CLUSTERING ORDER BY open_time will only be enforced within each id.
I wonder what's the best practice here.
Models like these are usually solved by time bucketing, as I mentioned in an answer to a similar question earlier today. To select the best "bucket," you'll need to understand your business case like number of entries per day, as well as the query requirements.
For the sake of example, let's say that month would work the best. If each row contained a value of 'YEAR-MONTH', the PK definition would look like this:
PRIMARY KEY (month_bucket,open_time))
WITH CLUSTERING ORDER BY (open_time DESC);
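A full table definition along those lines might look like this (a sketch only; table_by_month is a hypothetical name, and month_bucket is assumed to be text holding the 'YEAR-MONTH' value):
CREATE TABLE table_by_month (
month_bucket text,
open_time timestamp,
open double,
close double,
high double,
low double,
volume bigint,
PRIMARY KEY (month_bucket, open_time))
WITH CLUSTERING ORDER BY (open_time DESC);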
Then, you could support a query like this:
SELECT * FROM table
WHERE month_bucket = '2013-08'
AND open_time >= '2013-08-01 00:00:00+0200' AND open_time <= '2013-08-13 23:59:00+0200';
Likewise, querying the most recent entry would only require the most recent (current?) month as a parameter:
SELECT * FROM table
WHERE month_bucket = '2020-08'
LIMIT 1;
As the results are stored within each month_bucket sorted by open_time in descending order, that query would return the most-recent entry.
I wrote an article on this for DataStax (several years ago) which is relevant to this problem. It's been moved to a new part of their site, which hosed the formatting, but the content is definitely there. Give it a read; hope it helps: We Shall Have Order!

If id is part of the primary key (as the partition key), it must be included in the WHERE clause; otherwise the query would need ALLOW FILTERING.
You can try querying with SELECT max(open_time) ...; otherwise you can use id as above, incremented with every record, so that the row with the highest id will always be the latest record.

Related

Metadata database design

I am trying to store metadata about a document in SQL Server. The documents are stored in a document archive, which returns an identifier so I can get a document back later by asking the archive for it by that identifier.
Our users would like to be able to search for these documents based on different metadata. The metadata could be 1 attribute or 5 depending on the document type, and the users should be able to create new document types from an admin site.
I can see two solutions here. One is that each documenttype gets its own metadata table, where all metadata attributes are predefined; if one should be added, a new column needs to be created, and if a new documenttype is created, a new metadata table needs to be created. Our DBA will freak out with a solution like this, and I also see a problem with indexes: if the documenttype has 5 different metadata attributes, it needs to be searchable with 1 or 4 of them specified in the search. Then I would need to write indexes for all the different combinations of possible searches.
Here is a (fictitious) example:
|documentId | Name | InsertDate | CustomerId | City
| 1 | John | 2014-01-01 | 2 | London
| 2 | John | 2014-01-20 | 5 | New York
| 3 | Able | 2014-01-01 | 10 | Paris
I could here say:
Give me all documents where Name = 'John'
Give me all documents where Name = 'John' and CustomerId = 5
Give me all documents where InsertDate = '2014-01-01' and City = 'London'
This will be 3 different indexes, and then I haven't covered all possible combinations. This isn't practical.
So I am looking into the evil 'EAV' (anti)pattern.
So instead of having the metadata as columns, I can have them as rows.
|documentId | MetaAttribute | MetaValue
| 1 | Name | John
| 1 | InsertDate | 2014-01-01
| 1 | CustomerId | 2
| 1 | City | London
| 2 | Name | John
| 2 | InsertDate | 2014-01-20
| 2 | CustomerId | 5
| 2 | City | New York
| 3 | Name | Able
| 3 | InsertDate | 2014-01-01
| 3 | CustomerId | 10
| 3 | City | Paris
Here it's simple to create one index on MetaAttribute and MetaValue, and it's covered. If a new documenttype is created, new metadata can be created for that documenttype in a MetaAttribute table (that contains all MetaAttributes for the different documenttypes). So there is no need to create new tables or columns if a new documenttype is added, or if a new attribute is added to a documenttype. Instead, all MetaValues must be strings :( and the SQL query to find the document id is a bit more complicated.
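For example, the single covering index I have in mind would be something like this (a sketch; the index name is arbitrary, and the table is the [MetaData] table used in the query below):
CREATE NONCLUSTERED INDEX IX_MetaData_Attribute_Value
ON [MetaData] (MetaAttribute, MetaValue)
INCLUDE (documentId);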
This is what I figured out. (In this example the MetaAttribute is a string, but would be an ID to the MetaAttribute Table)
SELECT * FROM [Document]
WHERE ID IN (SELECT documentId FROM [MetaData]
WHERE ((MetaAttribute = 'Name' AND MetaValue = 'John')
OR (MetaAttribute = 'CustomerId' and MetaValue = '5'))
GROUP BY [documentId]
HAVING Count(1) = 2)
Here I need to ask if Name = 'John' and CustomerId = 5. I do that by finding all records that match either Name = 'John' or CustomerId = '5', then grouping them by documentId and counting the number of items in each group. If I get 2, then both Name = 'John' and CustomerId = '5' are true for this search. I return the documentId and use that to retrieve information about the document, like the document archive storage id.
There should be a better SQL statement for this isn't there?
So my question is: is there a better approach than these 2? Is the EAV (anti)pattern so bad that I should stick with the first approach, with a freaked-out DBA and "ten millions of indexes"?
We are talking about a system that will have around 10-20 million new records each month and contain data for at least 3 years, so the tables will be pretty big, and good indexes are necessary for performance.
Best Regards
Magnus
The EAV model is appealing if you have unbounded attributes--that is, anyone can set up anything as an attribute. However, it sounds from your description that this is not the case--the possible document attributes come from a known and fairly limited set. If this is the case, routine normalization suggests the following:
-- One per document
CREATE TABLE Document
(
DocumentId -- primary key
,DocumentType
,<etc>
)
-- One per "type" of document
CREATE TABLE DocumentType
(
DocumentTypeId -- primary key
,Name
)
-- One per possible document attribute.
-- Note that multiple document types can reference the same attribute
CREATE TABLE DocumentAttributes
(
AttributeId -- primary key
,Name
)
-- This lists which attributes are used by a given type
CREATE TABLE DocumentTypeAttributes
(
DocumentTypeId
,AttributeId
-- compound primary key on both columns
-- foreign keys on both columns
)
-- This contains the final association of document and attributes
CREATE TABLE DocumentAttributeValues
(
DocumentId
,AttributeId
,Value
-- compound primary key on DocumentId, AttributeId
-- foreign keys on both columns to their respective parent tables
)
A tighter model with more robust keys could be implemented to ensure at the database level that an attribute cannot be assigned to a document with an “inappropriate” type.
Queries have to use joins, but (presumably) only the Document and DocumentAttributeValues tables will ever be large. An index on (AttributeId + Value) facilitates lookups by attribute type, and depending on cardinality an index on (Value + AttributeId) could make searches for specific attribute values quite efficient.
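For example (a sketch only, assuming SQL Server syntax and the table/column names above):
CREATE NONCLUSTERED INDEX IX_DocAttrValues_Attr_Value
ON DocumentAttributeValues (AttributeId, Value)
INCLUDE (DocumentId);
-- Only if cardinality justifies it: search by value first
CREATE NONCLUSTERED INDEX IX_DocAttrValues_Value_Attr
ON DocumentAttributeValues (Value, AttributeId)
INCLUDE (DocumentId);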
(Edit)
Ooh, clever, I created two tables with the same name. I've renamed the last one to DocumentAttributeValues. (Free advice is clearly worth what you paid for it!)
This shows how ugly these systems can get in SQL, as you have to “look up” both attributes separately. On the plus side you don’t have to worry about “does this type go with this document”, as those rules have (better had) been applied when the data was loaded. Two examples:
This one spells everything out in joins, and as such I think it might perform worse than the next:
-- Top-down
SELECT do.DocumentId
from Documents do
inner join DocumentAttributes da1
on da1.Name = 'Name'
inner join DocumentAttributeValues dav1
on dav1.AttributeId = da1.AttributeId
and dav1.DocumentId = do.DocumentId
and dav1.Value = 'John'
inner join DocumentAttributes da2
on da2.Name = 'CustomerId'
inner join DocumentAttributeValues dav2
on dav2.AttributeId = da2.AttributeId
and dav2.DocumentId = do.DocumentId
and dav2.Value = '5'
This one picks out the attributes, then finds which documents have all of them. It might perform better, as there’s one less table to process:
-- Bottom-up
SELECT xx.DocumentId
from (-- All documents with name "John"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'Name'
and dav.Value = 'John'
-- This combines the two sets, with "all" keeping any duplicate entries
union all
-- All documents with CustomerId = "5"
select dav.DocumentId
from DocumentAttributes da
inner join DocumentAttributeValues dav
on dav.AttributeId = da.AttributeId
where da.Name = 'CustomerId'
and dav.Value = '5') xx -- Have to give the subquery an alias
group by xx.DocumentId
having count(*) = 2
While further refinements might be possible, the more attributes you're filtering on, the uglier the queries will be. Five attributes max might work ok in SQL, but if you've got tons of attributes, a NoSQL solution might be what you're looking for.
(Please note that, as with my original post, I have not tested this code, so there may be typos or subtle--or not so subtle--errors in here.)
SQL Server 2008+ offers three related features for dealing with such cases:
Sparse Columns which allow you to define hundreds of columns even if only a subset are used at a time
Column Sets allow you to group these columns and treat them as a group
Filtered indexes can index only the rows that actually have values in them.
These features allow you to work with more-or-less normal SQL statements to handle all metadata columns.
These features were specifically added to address the EAV/metadata scenario.
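A minimal sketch of how the three features fit together (hypothetical table and column names, not the poster's schema):
CREATE TABLE DocumentMeta
(
DocumentId int PRIMARY KEY,
CustomerId int SPARSE NULL,
City nvarchar(100) SPARSE NULL,
InsertDate date SPARSE NULL,
-- exposes all sparse columns together as one XML column
AllMeta xml COLUMN_SET FOR ALL_SPARSE_COLUMNS
);
-- Filtered index: only rows that actually have a City are indexed
CREATE INDEX IX_DocumentMeta_City
ON DocumentMeta (City)
WHERE City IS NOT NULL;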
EDIT
If you have a limited set of attributes that are always filled, there is no need for Sparse Columns or the EAV anti-pattern either.
You can create your tables as you normally would and add indexes to optimize the real workload you encounter. Certain types of queries will occur far more often than others and SQL Server's Index tuning advisor can propose the indexes and statistics to use based on a trace captured using SQL Server's Profiler.
It's quite possible that only a subset of the columns will accelerate searches and the rest can be added as include columns in the index.
Full Text Search
A more powerful option is to use SQL Server's Full Text Search. This will allow you to execute queries using arbitrary attributes. This is another technique used by document/content management systems, ERPs and CRMs to handle arbitrary attributes.
With FTS you simply specify the columns to include in one FTS index and don't have to create separate indexes for each attribute.
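Creating one looks roughly like this (a sketch: the catalog name is arbitrary, and KEY INDEX must name an existing unique index on the table; PK_Product_ProductID is assumed here):
CREATE FULLTEXT CATALOG ProductCatalog AS DEFAULT;
CREATE FULLTEXT INDEX ON Production.Product (Name)
KEY INDEX PK_Product_ProductID
WITH CHANGE_TRACKING AUTO;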
You can use FTS predicates in SELECT queries like this:
SELECT Name, ListPrice
FROM Production.Product
WHERE ListPrice = 80.99
AND CONTAINS(Name, 'Mountain')
This can result in much simpler queries (you just write a modified SELECT) and simpler administration (no worries about column order in indexes, only one FTS index to manage).

Why sort on sorted non clustered index field?

Say I have a table with ID, Name, and Date.
And I have a non-clustered index like,
CREATE NONCLUSTERED INDEX IX_Test_NameDate ON [dbo].[Test] (Name, Date)
When I run the query,
select
[Name], [Date]
from
[dbo].[Test] WITH (INDEX(IX_Test_NameDate))
where
[Name] like 'A%'
order by
[Date] asc
I get in SQL Server's execution plan,
Select <-- Sort <-- Index Seek (NonClustered)
Why the sort? Isn't the date already sorted in the non-clustered index? What would a better non-clustered index look like that doesn't require a sort (only an index seek)?
(Can't use a clustered index as this example is a condensed version of a bigger example with multiple rows/indexes).
For example, I get the execution plan (with sort) for a table that looks like this,
ID Name Date
1 A 2014-01-01
2 A 2014-02-01
3 A 2014-03-01
4 A 2014-04-01
5 B 2014-01-01
6 B 2014-02-01
7 B 2014-03-01
8 B 2014-04-01
9 B 2014-05-01
10 B 2014-06-01
Shouldn't the dates be sorted in this case?
No, the Date column is not "already sorted in the non-clustered index", at least, not by itself. It is sorted after Name.
Consider the following trivial table data:
Name Date
----- --------
Allen 1/1/2014
Barb 1/1/2013
Charlie 1/1/2015
Darlene 1/1/2012
Ernie 1/1/2016
Faith 1/1/2011
Once you've sorted by Name, the Date columns are potentially out of order. Dates are guaranteed in order only for rows that have the same Name.
Your goals are at cross-purposes to each other. You want multiple names--so the data is best ordered by name so that the seek is possible, but then you want to sort by Date. How would you propose storing the above six-row table so that it is sorted by Date for every possible range of names?
If there is some kind of regularity or pattern about the ranges of names (perhaps, for example, you always pull names by first letter only) then there is a possible workaround.
ALTER TABLE dbo.Test ADD NamePrefix AS (Left(Name, 1)) PERSISTED;
CREATE NONCLUSTERED INDEX IX_Test_NamePrefix_Date ON dbo.Test (NamePrefix, Date);
Now this query theoretically should not need to perform the sort:
SELECT Name, Date
FROM dbo.Test
WHERE NamePrefix = 'A'
ORDER BY Date;
Be aware that there are some likely gotchas with adding a persisted computed column like this: increased data size, the fact that such a design is almost certainly wrong in the general case, and that a proliferation of computed columns would be very bad, among others.
P.S. It is generally not best practice to force indexes manually--let the optimizer choose.

Efficiently counting strength of relationship between rows in Postgres

I have a table that looks similar to this:
session_id | sku
------------|-----
a | 1
a | 2
a | 3
a | 4
b | 2
b | 3
c | 3
I want to pivot this into a table similar to this:
sku1 | sku2 | score
------|------|------
1 | 2 | 1
1 | 3 | 1
1 | 4 | 1
2 | 3 | 2
2 | 4 | 1
3 | 4 | 1
The idea is to store a denormalised table that allows one to look up, for a given sku, which other skus have been related to the same sessions, and how many times both skus were related to the same session.
What algorithms, patterns or strategies could you suggest for implementing this in PostgreSQL or other technologies?
I realise that this kind of lookup can be done on the original table using counts, or using a facetting search engine. However, I want to make the reads more performant, and just want to keep the overall statistics. The idea is that I will be performing this pivot regularly on the newest few thousand rows in the first table, then storing the result in the second. I'm only concerned with approximate statistics for the second table.
I've got some SQL that works, but VERY slowly. Also looking into the potential for using a graph database of some sort, but wanted to avoid adding another technology for a small part of the app.
Update: The SQL below seems performant enough. I can convert 1.2 million rows in the first table (tags) into 250k rows in the second table (product_relations) with around 2-3k variations of sku in about 5 minutes on my iMac. I will realistically be denormalising only up to 10k rows per day. Question is whether this is actually the best approach. Seems a little dirty to me.
BEGIN;
CREATE
TEMPORARY TABLE working_tags(tag_id int, session_id varchar, sku varchar) ON COMMIT DROP;
INSERT INTO working_tags
SELECT id,
session_id,
sku
FROM tags
WHERE time < now() - interval '12 hours'
AND processed_product_relation IS NULL
AND sku IS NOT NULL LIMIT 200000;
CREATE
TEMPORARY TABLE working_relations (sku1 varchar, sku2 varchar, score int) ON COMMIT DROP;
INSERT INTO working_relations
SELECT a.sku AS sku1,
b.sku AS sku2,
count(DISTINCT a.session_id) AS score
FROM working_tags AS a
INNER JOIN working_tags AS b ON a.session_id = b.session_id
AND a.sku < b.sku
WHERE a.sku IS NOT NULL
AND b.sku IS NOT NULL
GROUP BY a.sku,
b.sku;
UPDATE product_relations
SET score = working_relations.score+product_relations.score
FROM working_relations
WHERE working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2;
INSERT INTO product_relations (sku1, sku2, score)
SELECT working_relations.sku1,
working_relations.sku2,
working_relations.score
FROM working_relations
LEFT OUTER JOIN product_relations ON (working_relations.sku1 = product_relations.sku1
AND working_relations.sku2 = product_relations.sku2)
WHERE product_relations.sku1 IS NULL;
UPDATE tags
SET processed_product_relation = TRUE
WHERE id IN
(SELECT tag_id
FROM working_tags);
COMMIT;
If I've interpreted your intention correctly (per comments) this should do it:
SELECT
s1.sku AS sku1,
s2.sku AS sku2,
count(session_id)
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku
ORDER BY 1,2;
See: http://sqlfiddle.com/#!15/2e0b2/1
In other words: Self-join session, then find all pairings of SKUs for each session ID, excluding ones where the left is greater than or equal to the right in order to avoid repeating pairings - if we have (1,2,count) we don't want (2,1,count) as well. Then group by the SKU pairings and count how many rows are found for each pairing.
You may want to count(distinct session_id) instead, if your SKU pairings can repeat and you want to exclude duplicates. There will probably be more efficient ways to do that, but that's the simplest.
An index on at least session_id will be very useful. You may also want to mess with planner cost parameters to make sure it chooses a good plan - in particular, make sure effective_cache_size is accurate and random_page_cost vs seq_page_cost reflects your caching and I/O costs. Finally, throw as much work_mem at it as you can afford.
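For example (session-level settings; the values below are purely illustrative and depend on your hardware):
SET effective_cache_size = '8GB';
SET random_page_cost = 1.1;
SET work_mem = '256MB';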
If you're creating a materialized view, just CREATE UNLOGGED TABLE whatever AS SELECT .... . That way you minimise the number of writes/rewrites/overwrites.
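A sketch of that, reusing the pairing query from above (product_relations_mv is just an example name):
CREATE UNLOGGED TABLE product_relations_mv AS
SELECT
s1.sku AS sku1,
s2.sku AS sku2,
count(DISTINCT session_id) AS score
FROM session s1
INNER JOIN session s2 USING (session_id)
WHERE s1.sku < s2.sku
GROUP BY s1.sku, s2.sku;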

T-SQL rolling twelve month per day performance

I have checked similar problems, but none have worked well for me. The most useful was http://forums.asp.net/t/1170815.aspx/1, but the performance makes my query run for hours and hours.
I have 1.5 million records based on product sales (about 10k products) over 4 years. I want to have a table that contains date, product and rolling twelve-month sales.
This query (from the link above) works, and shows what I want, but the performance makes it useless:
select day_key, product_key, price,
    (select sum(price) as R12
     from #ORDER_TURNOVER as tb1
     where tb1.day_key <= a.day_key
       and tb1.day_key > dateadd(mm, -12, a.day_key)
       and tb1.product_key = a.product_key) as RSum
into #hejsan
from #ORDER_TURNOVER as a
I tried a rolling-sum cursor function over all records, which was fast as lightning, but I couldn't get it to sum only the sales over the last 365 days.
Any ideas on how to solve this problem is much appreciated.
Thank you.
I'd change your setup slightly.
First, have a table that lists all the product keys that are of interest...
CREATE TABLE product (
product_key INT NOT NULL,
price INT,
some_fact_data VARCHAR(MAX),
what_ever_else SOMEDATATYPE,
PRIMARY KEY CLUSTERED (product_key)
)
Then, I'd have a calendar table, with each individual date that you could ever need to report on...
CREATE TABLE calendar (
date SMALLDATETIME,
is_bank_holiday INT,
what_ever_else SOMEDATATYPE,
PRIMARY KEY CLUSTERED (date)
)
Finally, I'd ensure that your data table has a covering index on all the relevant fields...
CREATE INDEX IX_product_day ON #ORDER_TURNOVER (product_key, day_key)
This would then allow the following query...
SELECT
product.product_key,
product.price,
calendar.date,
SUM(price) AS RSum
FROM
product
CROSS JOIN
calendar
INNER JOIN
#ORDER_TURNOVER AS data
ON data.product_key = product.product_key
AND data.day_key > dateadd(mm, -12, calendar.date)
AND data.day_key <= calendar.date
GROUP BY
product.product_key,
product.price,
calendar.date
By doing everything in this way, each product/calendar_date combination will then relate to a set of records in your data table that are all consecutive to each other. This will make the act of looking up the data to be aggregated much, much simpler for the optimiser.
[Requires a single index, specifically in the order (product, date).]
If you have the index the other way around, it is actually much harder...
Example data:
product | date date | product
---------+------------- ------------+---------
A | 01/01/2012 01/01/2012 | A
A | 02/01/2012 01/01/2012 | B
A | 03/01/2012 02/01/2012 | A
B | 01/01/2012 02/01/2012 | B
B | 02/01/2012 03/01/2012 | A
B | 03/01/2012 03/01/2012 | B
On the left you just get all the records that are next to each other in a 365-day block.
On the right you search for each record before you can aggregate. The search is relatively simple, but you do 365 of them. Much more than the version on the left.
This is how one does "running totals" / "sum subsets" in SQL Server 2005-2008. In SQL 2012 there is native support for running totals but we are all still working with 2005-2008 db's
SELECT day_key ,
product_key ,
price ,
( SELECT SUM(price) AS R12
FROM #ORDER_TURNOVER AS tb1
WHERE tb1.day_key <= a.day_key
AND tb1.day_key > DATEADD(mm, -12, a.day_key)
AND tb1.product_key = a.product_key
) AS RSum
INTO #hejsan
FROM #ORDER_TURNOVER AS a
A few suggestions.
You could pre-calculate the running totals so that they are not calculated again and again. What you are doing in the above select is a disguised loop and not a set query (unless the optimizer can convert the subquery to a join).
The above solution requires a few changes to the code.
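One way to read that suggestion, as a sketch only (untested, reusing the temp-table names from the question; #rolling12 is a hypothetical name), is to materialise the rolling sums once with a set-based join instead of a correlated subquery:
SELECT d.day_key,
       d.product_key,
       SUM(t.price) AS R12
INTO #rolling12
FROM (SELECT DISTINCT day_key, product_key FROM #ORDER_TURNOVER) AS d
INNER JOIN #ORDER_TURNOVER AS t
    ON t.product_key = d.product_key
   AND t.day_key > DATEADD(mm, -12, d.day_key)
   AND t.day_key <= d.day_key
GROUP BY d.day_key, d.product_key;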
Another solution that you can certainly try is to create a clustered index on your #ORDER_TURNOVER temp table. This is safer because it's a local change.
CREATE CLUSTERED INDEX IndexName
ON #ORDER_TURNOVER (day_key, product_key)
All your 3 expressions in the WHERE clause are SARGs, so chances are good that the optimizer will now do a seek instead of a scan.
If the index solution does not give enough performance gains, it's well worth investing in solution 1.

Improving OFFSET performance in PostgreSQL

I have a table I'm doing an ORDER BY on before a LIMIT and OFFSET in order to paginate.
Adding an index on the ORDER BY column makes a massive difference to performance (when used in combination with a small LIMIT). On a 500,000 row table, I saw a 10,000x improvement adding the index, as long as there was a small LIMIT.
However, the index has no impact for high OFFSETs (i.e. later pages in my pagination). This is understandable: a b-tree index makes it easy to iterate in order from the beginning but not to find the nth item.
It seems that what would help is a counted b-tree index, but I'm not aware of support for these in PostgreSQL. Is there another solution? It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.
Unfortunately, the PostgreSQL manual simply says "The rows skipped by an OFFSET clause still have to be computed inside the server; therefore a large OFFSET might be inefficient."
You might want a computed index.
Let's create a table:
create table sales(day date, amount real);
And fill it with some random stuff:
insert into sales
select current_date + s.a as day, random()*100 as amount
from generate_series(1,20) as s(a);
Index it by day, nothing special here:
create index sales_by_day on sales(day);
Create a row position function. There are other approaches, this one is the simplest:
create or replace function sales_pos (date) returns bigint
as 'select count(day) from sales where day <= $1;'
language sql immutable;
Check if it works (don't call it like this on large datasets though):
select sales_pos(day), day, amount from sales;
sales_pos | day | amount
-----------+------------+----------
1 | 2011-07-08 | 41.6135
2 | 2011-07-09 | 19.0663
3 | 2011-07-10 | 12.3715
..................
Now the tricky part: add another index computed on the sales_pos function values:
create index sales_by_pos on sales using btree(sales_pos(day));
Here is how you use it. 5 is your "offset", 10 is the "limit":
select * from sales where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
day | amount
------------+---------
2011-07-12 | 94.3042
2011-07-13 | 12.9532
2011-07-14 | 74.7261
...............
It is fast, because when you call it like this, Postgres uses precalculated values from the index:
explain select * from sales
where sales_pos(day) >= 5 and sales_pos(day) < 5+10;
QUERY PLAN
--------------------------------------------------------------------------
Index Scan using sales_by_pos on sales (cost=0.50..8.77 rows=1 width=8)
Index Cond: ((sales_pos(day) >= 5) AND (sales_pos(day) < 15))
Hope it helps.
I don't know anything about "counted b-tree indexes", but one thing we've done in our application to help with this is break our queries into two, possibly using a sub-query. My apologies for wasting your time if you're already doing this.
SELECT *
FROM massive_table
WHERE id IN (
SELECT id
FROM massive_table
WHERE ...
LIMIT 50
OFFSET 500000
);
The advantage here is that, while it still has to calculate the proper ordering of everything, it doesn't order the entire row--only the id column.
Instead of using an OFFSET, a very efficient trick is to use a temporary table:
CREATE TEMPORARY TABLE just_index AS
SELECT ROW_NUMBER() OVER (ORDER BY myID), myID
FROM mytable;
For 10 000 000 rows it needs about 10s to be created.
Then, when you want to SELECT from or UPDATE your table, you simply:
SELECT *
FROM mytable
INNER JOIN (
    SELECT just_index.myId
    FROM just_index
    WHERE row_number >= *your offset*
    LIMIT 1000000
) indexes ON mytable.myID = indexes.myID
Filtering mytable with only just_index is more efficient (in my case) with an INNER JOIN than with WHERE myID IN (SELECT ...).
This way you don't have to store the last myID value; you simply replace the offset with a WHERE clause that uses indexes.
It seems that optimizing for large OFFSETs (especially in pagination use-cases) isn't that unusual.
It seems a little unusual to me. Most people, most of the time, don't seem to skim through very many pages. It's something I'd support, but wouldn't work hard to optimize.
But anyway . . .
Since your application code knows which ordered values it's already seen, it should be able to reduce the result set and reduce the offset by excluding those values in the WHERE clause. Assuming you order by a single column and it's sorted ascending, your app code can store the last value on the page, then add AND your-ordered-column-name > last-value-seen to the WHERE clause in some appropriate way.
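In SQL that looks something like this (column and parameter names are placeholders):
-- First page
SELECT *
FROM massive_table
ORDER BY ordered_column
LIMIT 50;
-- Next pages: pass in the last ordered_column value seen on the previous page
SELECT *
FROM massive_table
WHERE ordered_column > :last_value_seen
ORDER BY ordered_column
LIMIT 50;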
Recently I worked on a problem like this and wrote a blog post about how I faced it. It is very similar; I hope it is helpful to anyone.
I used a lazy-list approach with partial acquisition, replacing the LIMIT/OFFSET pagination of the query with manual pagination.
In my example, the select returns 10 million records; I fetch them and insert them into a "temporary table":
create or replace function load_records ()
returns VOID as $$
BEGIN
drop sequence if exists temp_seq;
create temp sequence temp_seq;
insert into tmp_table
SELECT linea.*
FROM
(
select nextval('temp_seq') as counterrow, * from table1 t1
join table2 t2 on (t2.fieldpk = t1.fieldpk)
join table3 t3 on (t3.fieldpk = t2.fieldpk)
) linea;
END;
$$ language plpgsql;
After that, I can paginate without counting each row, using the assigned sequence instead:
select * from tmp_table where counterrow >= 9000000 and counterrow <= 9025000
From the Java perspective, I implemented this pagination through partial acquisition with a lazy list, that is, a list that extends AbstractList and implements the get() method. The get method can use a data access interface to fetch the next set of data and release the memory heap:
@Override
public E get(int index) {
if (bufferParcial.size() <= (index - lastIndexRoulette))
{
lastIndexRoulette = index;
bufferParcial.removeAll(bufferParcial);
bufferParcial = new ArrayList<E>();
bufferParcial.addAll(daoInterface.getBufferParcial());
if (bufferParcial.isEmpty())
{
return null;
}
}
return bufferParcial.get(index - lastIndexRoulette);
}
On the other hand, the data access interface uses a query to paginate and implements a method to iterate progressively, 25,000 records at a time, until all of them have been retrieved.
Results for this approach can be seen here:
http://www.arquitecturaysoftware.co/2013/10/laboratorio-1-iterar-millones-de.html
