Postgres index-only scan taking too long - database

I have a table with the below structure and indexes:
Table "public.client_data"
Column | Type | Modifiers
-------------------------+---------+-----------
account_id | text | not null
client_id | text | not null
client_type | text | not null
creation_time | bigint | not null
last_modified_time | bigint | not null
Indexes:
"client_data_pkey" PRIMARY KEY, btree (account_id, client_id)
"client_data_last_modified_time_index" btree (last_modified_time)
From this table I need to find the oldest record - for this I used the following query:
SELECT last_modified_time FROM client_data ORDER BY last_modified_time ASC LIMIT 1;
However, on this table of around 61 million rows the query runs very slowly (90-100 minutes) on a db.r4.2xlarge RDS instance running AWS Aurora Postgres 9.6, with no other concurrent queries.
Changing the query to use DESC makes it finish instantly. What could be the problem? Since I have an index on last_modified_time, I expected that selecting only that column, ordered by that column with a LIMIT applied, would be an index-only scan that stops after the first entry in the index.
Here is the output of the explain analyze:
EXPLAIN ANALYZE SELECT last_modified_time FROM client_data ORDER BY last_modified_time ASC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..2.31 rows=1 width=8) (actual time=6297823.287..6297823.287 rows=1 loops=1)
-> Index Only Scan using client_data_last_modified_time_index on client_data (cost=0.57..1049731749.38 rows=606590292 width=8) (actual time=6297823.287..6297823.287 rows=1 loops=1)
Heap Fetches: 26575013
Planning time: 0.078 ms
Execution time: 6297823.306 ms
The same for the DESC version of the query results in the following
EXPLAIN ANALYZE SELECT last_modified_time FROM client_data ORDER BY last_modified_time DESC LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..2.32 rows=1 width=8) (actual time=2.265..2.266 rows=1 loops=1)
-> Index Only Scan Backward using client_data_last_modified_time_index on client_data (cost=0.57..1066049674.69 rows=611336085 width=8) (actual time=2.264..2.264 rows=1 loops=1)
Heap Fetches: 9
Planning time: 0.095 ms
Execution time: 2.278 ms
Any pointers?

The difference is this:
The slow plan has
Heap Fetches: 26575013
and the fast plan
Heap Fetches: 9
Heap fetches are what turn a fast index-only scan into a slow normal index scan.
Did the table experience mass updates or deletions recently?
The reason for the slow scan is that it has to wade through a lot of invisible (deleted) tuples before it hits the first match.
Run VACUUM on the table, and both scans will be fast.
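To check this diagnosis before and after vacuuming, something like the following could be used. It is only a sketch: it assumes the standard pg_stat_user_tables statistics view is readable on the Aurora instance, and it reuses the table name from the question.

```sql
-- How many dead tuples does the table carry, and when was it last vacuumed?
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'client_data';

-- Remove dead tuples, update the visibility map, and refresh planner statistics.
VACUUM (VERBOSE, ANALYZE) client_data;

-- Re-run the slow query; "Heap Fetches" should now be small.
EXPLAIN ANALYZE
SELECT last_modified_time
FROM client_data
ORDER BY last_modified_time ASC
LIMIT 1;
```

If n_dead_tup stays high after the VACUUM, a long-running transaction may be preventing the dead tuples from being removed.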

Related

Why in my case ST_DWithin is not using index

I'm using Postgis extension for Postgres and trying to optimize my query for searching points in circle.
Consider I have this table with index:
create table position
(
id bigserial not null primary key,
date timestamp with time zone,
point GEOMETRY(Point, 4326),
alias varchar(50)
);
create index position_point_idx on position using gist (point);
Now when I query with a polygon, everything works as expected. In the explain plan I can see that the query uses the index.
SELECT distinct alias
FROM position
WHERE date > '2021-11-28T19:26:18.574Z'
AND date < '2021-11-28T20:26:18.574Z'
AND ST_contains(ST_GeomFromText(
'POLYGON ((13.970947489142418 49.59174558308953, 13.970947489142418 50.12515341892287, 15.208740681409838 50.12515341892287, 15.208740681409838 49.59174558308953, 13.970947489142418 49.59174558308953))',
4326), point);
-> Bitmap Index Scan on position_point_idx (cost=0.00..183.82 rows=5254 width=0) (actual time=5.981..5.981 rows=94462 loops=1)
Okay, now I want to search for aliases within a circle, but for some reason it takes seconds and does not use the index at all.
SELECT distinct alias
FROM position
WHERE
date > '2021-11-28T19:26:18.574Z' AND date < '2021-11-28T20:26:18.574Z'
AND
ST_DWithin (point,ST_GeomFromText('POINT (14.32983409613371
49.91815471231952)',4326),62815.14152820495);
ST_DWithin is in the list here, so it should use the index, but it's ignoring it.
What am I doing wrong here? Thanks for any hint.
Here is my query plan
HashAggregate (cost=687537.59..687538.59 rows=100 width=9) (actual time=2874.991..2875.003 rows=100 loops=1)
Output: alias
Group Key: "position".alias
-> Gather (cost=1000.00..686702.70 rows=333955 width=9) (actual time=0.254..2041.354 rows=5008801 loops=1)
Output: alias
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on public."position" (cost=0.00..652307.20 rows=139148 width=9) (actual time=0.021..2117.644 rows=1669600 loops=3)
Output: alias
Filter: (("position".date > '2021-11-28 19:26:18.574+00'::timestamp with time zone) AND ("position".date < '2021-11-28 20:26:18.574+00'::timestamp with time zone) AND ("position".point && '0103000020E6100000010000000500000077EC65F919AAEEC0B42AE025A7A5EEC077EC65F919AAEEC03A26ECE821B2EE4077646615AFADEE403A26ECE821B2EE4077646615AFADEE40B42AE025A7A5EEC077EC65F919AAEEC0B42AE025A7A5EEC0'::geometry) AND ('0101000020E61000000100C003E0A82C40520AF71786F54840'::geometry && st_expand("position".point, '62815.1415282049493'::double precision)) AND _st_dwithin("position".point, '0101000020E61000000100C003E0A82C40520AF71786F54840'::geometry, '62815.1415282049493'::double precision))
Rows Removed by Filter: 86028
Worker 0: actual time=0.023..2492.854 rows=1922778 loops=1
Worker 1: actual time=0.025..2493.448 rows=2024544 loops=1
Planning Time: 0.211 ms
Execution Time: 2876.783 ms
PostgreSQL chooses a sequential scan because it thinks that that is the most efficient access strategy, and I would say it is right. After all, the WHERE condition removed only 250000 out of approximately 5 million rows.
I think you wanted to use geography, not geometry. In geometry 4326, the entire earth (and the rest of the universe, I suppose) is well within 62815.14152820495 degrees of every other point, so the index would be profoundly ineffective.
If you were using geography, that would be 39 miles, for which the index would be useful, and in my hands it would be used.
The stats on your table also seem to be way off.
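A possible way to try the geography suggestion is sketched below. It assumes the column stays geometry(Point, 4326) and is cast at query time, with an expression index on the cast; the index name position_point_geog_idx is made up for illustration. With geography, the distance argument of ST_DWithin is in metres.

```sql
-- Assumption: expression index on the geography cast (hypothetical index name).
CREATE INDEX position_point_geog_idx
    ON "position" USING gist ((point::geography));

-- Refresh planner statistics, which looked far off in the plan above.
ANALYZE "position";

SELECT DISTINCT alias
FROM "position"
WHERE date > '2021-11-28T19:26:18.574Z'
  AND date < '2021-11-28T20:26:18.574Z'
  AND ST_DWithin(
        point::geography,
        ST_GeogFromText('SRID=4326;POINT(14.32983409613371 49.91815471231952)'),
        62815.14152820495);  -- radius in metres, roughly 39 miles
```

Whether the planner actually picks the index still depends on how selective the circle and the date range are for the data in question.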

Can I provide extra context somehow to allow postgres to efficiently sort/limit views without calculating their rows?

Given this somewhat contrived query
select id, pg_sleep(0.001)::text from administrative_areas;
When I add an order and limit to it directly, the sleep executes only once and results are returned quickly.
> explain analyze select id, pg_sleep(0.001)::text from administrative_areas order by id desc limit 1;
Limit (cost=0.28..0.39 rows=1 width=36) (actual time=4.227..4.228 rows=1 loops=1)
-> Index Only Scan Backward using administrative_areas_pkey on administrative_areas (cost=0.28..69.50 rows=604 width=36) (actual time=4.227..4.227 rows=1 loops=1)
Heap Fetches: 1
Planning time: 0.066 ms
Execution time: 4.243 ms
If I throw the same query in a view
CREATE OR REPLACE VIEW sleepy AS
select id, pg_sleep(0.001)::text from administrative_areas;
When querying with an order and limit, the sleep executes once for every item in the underlying administrative_areas table.
> explain analyze select * from sleepy order by id desc limit 1;
Limit (cost=30.63..30.63 rows=1 width=36) (actual time=3794.827..3794.829 rows=1 loops=1)
-> Sort (cost=30.63..32.14 rows=604 width=36) (actual time=3794.825..3794.825 rows=1 loops=1)
Sort Key: administrative_areas.id DESC
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on administrative_areas (cost=0.00..21.57 rows=604 width=36) (actual time=6.432..3792.566 rows=604 loops=1)
Planning time: 0.072 ms
Execution time: 3794.851 ms
Is there any additional context I can add to the view or provide at query time to allow the planner to optimize this?
I believe this is because pg_sleep is a volatile function. When you are querying the view, you are, in effect, doing this:
select * from (select id, pg_sleep(0.001)::text from administrative_areas) as sleepy order by id desc limit 1;
Postgres sees that volatile function in the subquery and runs it for each row. Let's test this.
create table test as select id from generate_series(1, 1000) g(id);
create index on test(id);
analyze test;
create view sleepy as select id, pg_sleep(0.001)::text from test;
explain analyze select * from sleepy order by id desc limit 1;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
Limit (cost=37.50..37.50 rows=1 width=36) (actual time=1640.368..1640.439 rows=1 loops=1)
-> Sort (cost=37.50..40.00 rows=1000 width=36) (actual time=1640.336..1640.358 rows=1 loops=1)
Sort Key: test.id DESC
Sort Method: top-N heapsort Memory: 25kB
-> Seq Scan on test (cost=0.00..22.50 rows=1000 width=36) (actual time=1.511..1623.058 rows=1000 loops=1)
Planning Time: 0.175 ms
Execution Time: 1640.617 ms
(7 rows)
This ran pg_sleep for each row in test, as expected.
Now try a stable function:
create function not_so_sleepy()
returns void AS
$$
select pg_sleep(0.001)
$$ language sql
stable; -- NOTE: this is just to trick postgres
create view not_as_sleepy as
select id,
not_so_sleepy()::text
FROM test;
explain analyze select *
from not_as_sleepy
order by id desc limit 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.28..0.32 rows=1 width=36) (actual time=0.049..0.198 rows=1 loops=1)
-> Index Only Scan Backward using test_id_idx on test (cost=0.28..43.27 rows=1000 width=36) (actual time=0.024..0.048 rows=1 loops=1)
Heap Fetches: 1
Planning Time: 1.786 ms
Execution Time: 0.308 ms
(5 rows)
In the second case, we told postgres that the function would not have any side-effects so it could safely ignore it. So, the function must be marked as stable or immutable (and of course, it must be actually stable/immutable) for postgres to not bother running the function.
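To see how a function is currently marked, the system catalog can be queried. This is a small sketch using the function names from the example above; provolatile is 'i' for immutable, 's' for stable, and 'v' for volatile.

```sql
-- Inspect the volatility classification of the two functions used above.
SELECT proname, provolatile
FROM pg_proc
WHERE proname IN ('pg_sleep', 'not_so_sleepy');
```

pg_sleep should show up as volatile, while not_so_sleepy shows the stable marking declared when it was created.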

Postgres using primary_key index in almost every query

We're upgrading our Postgres database from version 9.3.14 to 9.4.9 and are currently in the testing phase. During testing we encountered a problem that leads to high CPU usage once the database is on 9.4.9: for some queries, Postgres 9.4 uses the primary key index even though cheaper options exist. For example, running EXPLAIN ANALYZE on the query below:
SELECT a.id as a_id, b.col_id as col_id
FROM a
INNER JOIN b ON b.id = a.b_id
WHERE (a.col_text = 'pqrs' AND a.col_int = 1)
ORDER BY a.id ASC LIMIT 1
gives this:
Limit (cost=0.87..4181.94 rows=1 width=8) (actual time=93014.991..93014.992 rows=1 loops=1)
-> Nested Loop (cost=0.87..1551177.78 rows=371 width=8) (actual time=93014.990..93014.990 rows=1 loops=1)
-> Index Scan using a_pkey on a (cost=0.43..1548042.20 rows=371 width=8) (actual time=93014.968..93014.968 rows=1 loops=1)
Filter: ((col_int = 1) AND ((col_text)::text = 'pqrs'::text))
Rows Removed by Filter: 16114217
-> Index Scan using b_pkey on b (cost=0.43..8.44 rows=1 width=8) (actual time=0.014..0.014 rows=1 loops=1)
Index Cond: (id = a.b_id)
Planning time: 0.291 ms
Execution time: 93015.041 ms
While the query plan for the same query in 9.3.14 gives this:
Limit (cost=17.06..17.06 rows=1 width=8) (actual time=5.066..5.067 rows=1 loops=1)
-> Sort (cost=17.06..17.06 rows=1 width=8) (actual time=5.065..5.065 rows=1 loops=1)
Sort Key: a.id
Sort Method: quicksort Memory: 25kB
-> Nested Loop (cost=1.00..17.05 rows=1 width=8) (actual time=5.047..5.049 rows=1 loops=1)
-> Index Scan using index_a_on_col_text on a (cost=0.56..8.58 rows=1 width=8) (actual time=3.154..3.155 rows=1 loops=1)
Index Cond: ((col_text)::text = 'pqrs'::text)
Filter: (col_int = 1)
-> Index Scan using b_pkey on b (cost=0.43..8.46 rows=1 width=8) (actual time=1.888..1.889 rows=1 loops=1)
Index Cond: (id = a.b_id)
Total runtime: 5.112 ms
If I remove the ORDER BY clause, the query works fine and uses the proper index. I understand that with ORDER BY the planner tries to walk the primary key index in order and filter rows until it finds a match, but as the plans show, sorting explicitly is much cheaper.
I've explored Postgres parameters like enable_indexscan and enable_seqscan, which are on by default. We want to leave it to the database to decide whether to use an index scan or a sequential scan. We've also tried tweaking effective_cache_size, random_page_cost and seq_page_cost. enable_sort is also on.
This happens not only with this query but with several others, where the primary key index is used instead of more efficient alternatives.
P.S.:
After opening a case to AWS Support, this is what I got:
I understand that you want to know why you have degraded performance
on your recently upgraded instance. This is the expected and general
behavior of upgrade on a Postgres instance. Once upgrade is completed,
you need to run ANALYZE on each user database to update statistics of
the tables. This also makes SQLs performing better. A better way to do
that is using vacuumdb[1], like this:
vacuumdb -U [your user] -d [your database] -Ze -h [your rds endpoint]
It will optimize your database execution plans only, not freeing space,
but will take less time than a complete vacuum.
And this has resolved the issue. Hope this helps others who stumble upon such an issue.
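The same statistics refresh can also be done from a SQL session instead of the vacuumdb command line. The following is a minimal sketch, assuming it is run once in each user database by a role with permission to analyze the tables.

```sql
-- Rebuild planner statistics for every table in the current database.
ANALYZE VERBOSE;

-- Confirm that statistics now exist: last_analyze should be recent for each table.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
ORDER BY relname;
```

Tables with neither last_analyze nor last_autoanalyze set after an upgrade are the likely candidates for bad plans such as the one shown above.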

Postgres: the query is sometimes too slow

I need help or any hint. I have a Postgres 9.4 database, and one query is SOMETIMES processed very slowly.
SELECT COUNT(*) FROM "table_a" INNER JOIN "table_b" ON "table_b"."id" = "table_a"."table_b_id" AND "table_b"."deleted_at" IS NULL WHERE "table_a"."deleted_at" IS NULL AND "table_b"."company_id" = ? AND "table_a"."company_id" = ?
Query plan for this -
Aggregate (cost=308160.70..308160.71 rows=1 width=0)
-> Hash Join (cost=284954.16..308160.65 rows=20 width=0)
Hash Cond: ?
-> Bitmap Heap Scan on table_a (cost=276092.39..299260.96 rows=6035 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_a_on_created_at_and_company_id (cost=0.00..276090.89 rows=6751 width=0)
Index Cond: ?
-> Hash (cost=8821.52..8821.52 rows=3220 width=4)
-> Bitmap Heap Scan on table_b (cost=106.04..8821.52 rows=3220 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_b_on_company_id (cost=0.00..105.23 rows=3308 width=0)
Index Cond: ?
But usually this query executes fast enough (about 69.7 ms). I don't understand why it is sometimes slow. The performance logs for that period show that my RDS instance consumes a lot of memory and that this query is executed about 100 times per second. Any help on where to look to solve this problem would be appreciated.
I am not sure if this will solve your problem or not :)
When this query returns quickly, the data pages it needs are most likely already cached in memory, so it does not have to read them from disk. (PostgreSQL caches data pages, not query results, so the query is still executed each time.)
First of all, check whether too many queries are being executed on these tables, especially inserts/updates/deletes. Such statements can cause locking, and a SELECT may have to wait until the lock is released.
The query can also be slow because the join and WHERE comparisons between table_a and table_b are expensive.
You can reduce this cost by indexing the columns "table_b"."id", "table_a"."table_b_id", "table_a"."deleted_at", "table_b"."company_id", and "table_a"."company_id".
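One way to read the indexing advice is sketched below. The index names are hypothetical, and the use of partial indexes restricted to rows where deleted_at IS NULL is an assumption based on the filters in the query, not something taken from the original schema.

```sql
-- Hypothetical index names; partial indexes skip soft-deleted rows,
-- matching the "deleted_at IS NULL" filters in the query.
CREATE INDEX index_table_a_on_company_id_and_table_b_id
    ON table_a (company_id, table_b_id)
    WHERE deleted_at IS NULL;

CREATE INDEX index_table_b_on_company_id_and_id
    ON table_b (company_id, id)
    WHERE deleted_at IS NULL;
```

On a busy production table, CREATE INDEX CONCURRENTLY avoids blocking writes while the index is built. Whether the planner prefers these indexes over the existing ones depends on the data distribution, so it is worth comparing EXPLAIN ANALYZE output before and after.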
A materialized view can reduce the cost as well, since it stores a precomputed result (a plain view does not cache anything).
One last thing: you can reduce the cost by using temporary tables as well. I have given an example below.
QUERIES:
CREATE TEMPORARY TABLE table_a_temp AS
SELECT "table_a"."table_b_id" FROM "table_a"
WHERE "table_a"."deleted_at" IS NULL AND "table_a"."company_id" = ?;
CREATE TEMPORARY TABLE table_b_temp AS
SELECT "table_b"."id" FROM "table_b"
WHERE "table_b"."deleted_at" IS NULL AND "table_b"."company_id" = ?;
SELECT COUNT(*) FROM "table_a_temp" INNER JOIN "table_b_temp"
ON "table_b_temp"."id" = "table_a_temp"."table_b_id";

Is there a way to search partial match in multivalue field in PostgreSQL?

I have a table quite like this:
CREATE TABLE myTable (
family text,
names text[]
)
I can search like this:
SELECT family
FROM myTable where names @> array['B0WP04'];
But I would like to do:
SELECT family
FROM myTable where names @> array['%P0%'];
Is this possible?
In PostgreSQL 9.3 and later you can:
select family
from myTable
join lateral unnest(mytable.names) as un(name) on true
where un.name like '%P0%';
But keep in mind that it can produce duplicates, so perhaps you'd like to add DISTINCT.
For earlier versions:
select family
from myTable where
exists (select 1 from unnest(names) as un(name) where un.name like '%P0%');
Adding a bit on Radek's answer, I tried
select family
from myTable where
exists (select 1 from unnest(names) as name where name like '%P0%');
and it also works. I searched in the PostgreSQL docs for the un() function, but can't find anything.
I'm not saying it doesn't do anything, but I'm just curious about what the un() function should do (and happy to have my problem solved)
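For what it's worth, un(name) in that answer is not a function call. It is a table alias (un) with a column alias (name) for the set that unnest() returns. A small self-contained illustration, using made-up array values:

```sql
-- "un" names the row set produced by unnest(); "name" names its single column.
SELECT un.name
FROM unnest(ARRAY['B0WP04', 'X1']) AS un(name)
WHERE un.name LIKE '%P0%';
```

Writing "unnest(names) as name" works too because, for a function returning a single column, the alias is also used as the column name.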
You can use the parray_gin extension https://github.com/theirix/parray_gin
This extension is said to work only up to 9.2 but I just installed and tested it on 9.3 and it works well.
Here is how to install it on ubuntu-like systems :)
# install postgresql extension network client and postgresql extension build tools
sudo apt-get install python-setuptools
easy_install pgxnclient
sudo apt-get install postgresql-server-dev-9.3
# get the extension
pgxn install parray_gin
And here is my test
-- as a superuser: add the extension to the current database
CREATE EXTENSION parray_gin;
-- as a normal user
CREATE TABLE test (
id SERIAL PRIMARY KEY,
names TEXT []
);
INSERT INTO test (names) VALUES
(ARRAY ['nam1', 'nam2']),
(ARRAY ['2nam1', '2nam2']),
(ARRAY ['Hello', 'Woooorld']),
(ARRAY ['Woooorld', 'Hello']),
(ARRAY [] :: TEXT []),
(NULL),
(ARRAY ['Hello', 'is', 'it', 'me', 'you''re', 'looking', 'for', '?']);
-- double up the rows in test table, with many rows, the index is used
INSERT INTO test (names) (SELECT names FROM test);
SELECT count(*) from test; /*
count
--------
997376
(1 row)
*/
Now that we have some test data, it's magic time:
-- http://pgxn.org/dist/parray_gin/doc/parray_gin.html
CREATE INDEX names_idx ON test USING GIN (names parray_gin_ops);
--- now it's time for some tests
EXPLAIN ANALYZE SELECT * FROM test WHERE names @> ARRAY ['is']; /*
-- WITHOUT INDEX ON NAMES
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..25667.00 rows=1138 width=49) (actual time=0.021..508.599 rows=51200 loops=1)
Filter: (names @> '{is}'::text[])
Rows Removed by Filter: 946176
Total runtime: 653.879 ms
(4 rows)
-- WITH INDEX ON NAMES
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=455.73..3463.37 rows=997 width=49) (actual time=14.327..240.365 rows=51200 loops=1)
Recheck Cond: (names @> '{is}'::text[])
-> Bitmap Index Scan on names_idx (cost=0.00..455.48 rows=997 width=0) (actual time=12.241..12.241 rows=51200 loops=1)
Index Cond: (names @> '{is}'::text[])
Total runtime: 341.750 ms
(5 rows)
*/
EXPLAIN ANALYZE SELECT * FROM test WHERE names @@> ARRAY ['%nam%']; /*
-- WITHOUT INDEX ON NAMES
QUERY PLAN
------------------------------------------------------------------------------------------------------------
Seq Scan on test (cost=0.00..23914.20 rows=997 width=49) (actual time=0.023..590.093 rows=102400 loops=1)
Filter: (names @@> '{%nam%}'::text[])
Rows Removed by Filter: 894976
Total runtime: 796.636 ms
(4 rows)
-- WITH INDEX ON NAMES
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on test (cost=159.73..3167.37 rows=997 width=49) (actual time=20.164..293.942 rows=102400 loops=1)
Recheck Cond: (names @@> '{%nam%}'::text[])
-> Bitmap Index Scan on names_idx (cost=0.00..159.48 rows=997 width=0) (actual time=18.539..18.539 rows=102400 loops=1)
Index Cond: (names @@> '{%nam%}'::text[])
Total runtime: 490.060 ms
(5 rows)
*/
The final performance depends entirely on your data and queries, but in my dummy example this extension is very effective: it cuts query time roughly in half.
