Why does a SELECT take a long time on a partitioned table in PostgreSQL? - database

I have a daily partitioned table in PostgreSQL, partitioned on cdr_date. Even a simple query against it takes a long time, and I don't know why.
This is the query:
EXPLAIN (ANALYZE , BUFFERS )
select * FROM cdr
WHERE cdr_date >= '2018-05-24 11:59:00.937000 +00:00'
AND cdr_date <= '2018-05-25 23:59:59.937000 +00:00'
and this is its result:
Append (cost=0.56..1036393.46 rows=14908437 width=295) (actual time=5019.283..335535.305 rows=15191628 loops=1)
Buffers: shared hit=252735 read=1443977 written=125
-> Index Scan using ind_cdr_cdr_date on cdr (cost=0.56..8.58 rows=1 width=286) (actual time=5019.190..5019.190 rows=0 loops=1)
Index Cond: ((cdr_date >= '2018-05-24 11:59:00.937+00'::timestamp with time zone) AND (cdr_date <= '2018-05-25 23:59:59.937+00'::timestamp with time zone))
Buffers: shared hit=178464 read=708130 written=125
-> Index Scan using ind_cdr_2018_05_24 on cdr_2018_05_24 (cost=0.43..567998.02 rows=7158579 width=295) (actual time=0.091..311773.252 rows=7846816 loops=1)
Index Cond: ((cdr_date >= '2018-05-24 11:59:00.937+00'::timestamp with time zone) AND (cdr_date <= '2018-05-25 23:59:59.937+00'::timestamp with time zone))
Buffers: shared hit=74264 read=383715
-> Seq Scan on cdr_2018_05_25 (cost=0.00..468386.85 rows=7749857 width=295) (actual time=5.192..16189.737 rows=7344812 loops=1)
Filter: ((cdr_date >= '2018-05-24 11:59:00.937+00'::timestamp with time zone) AND (cdr_date <= '2018-05-25 23:59:59.937+00'::timestamp with time zone))
Buffers: shared hit=7 read=352132
Planning time: 3.394 ms
Execution time: 336984.703 ms
Here is my root table:
CREATE TABLE cdr
(
id BIGSERIAL NOT NULL
CONSTRAINT cdr_pkey
PRIMARY KEY,
username VARCHAR(256) NOT NULL,
user_id BIGINT,
cdr_date TIMESTAMP WITH TIME ZONE NOT NULL,
created_at TIMESTAMP WITH TIME ZONE NOT NULL,
last_reset_time TIMESTAMP WITH TIME ZONE,
prev_cdr_date TIMESTAMP WITH TIME ZONE NOT NULL
);
CREATE INDEX ind_cdr_user_id
ON cdr (user_id);
CREATE INDEX ind_cdr_cdr_date
ON cdr (cdr_date);
And here is one of the child tables:
-- auto-generated definition
CREATE TABLE cdr_2018_05_25
(
CONSTRAINT cdr_2018_05_25_cdr_date_check
CHECK ((cdr_date >= '2018-05-25 00:00:00+00' :: TIMESTAMP WITH TIME ZONE) AND
(cdr_date <= '2018-05-26 00:23:29.064408+00' :: TIMESTAMP WITH TIME ZONE))
)
INHERITS (cdr);
CREATE INDEX ind_cdr_2018_05_25_user_id
ON cdr_2018_05_25 (user_id);
CREATE INDEX ind_cdr_2018_05_25
ON cdr_2018_05_25 (cdr_date);

Because your partition is big, and you're basically selecting most of the data in the partition.
The filter is not equal to the check, so after it determines which partition to use, it still scans the index.
There are 3 solutions that I can propose, and they can work together:
1. Don't partition on ranges with such a high resolution. Consider adding another field that holds just the DATE component, and have the check use an equality operator instead. This also ensures that your partitions don't overlap, as they do in this case. It won't help much in this exact case, unless you really want to select all the data from a single partition.
2. Cluster the table on the cdr_date index, which will drastically speed up such queries:
CLUSTER cdr_2018_05_24 USING ind_cdr_2018_05_24;
3. Consider partitioning the partitions by hour, in case you often select smaller time ranges. 7 million rows is quite a lot for such a query.
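As a rough illustration of the "partitions should not overlap" point, here is a minimal sketch of a child table whose CHECK constraint uses a half-open range on cdr_date. Note that this is a variant of the suggestion above (it keeps the timestamp column rather than adding a separate DATE column), and the partition name cdr_2018_05_26, its bounds, and its index name are assumptions made for the example.

```sql
-- Hypothetical next-day partition; names and bounds are illustrative only.
-- The upper bound is exclusive (<), so adjacent days cannot overlap.
CREATE TABLE cdr_2018_05_26
(
    CONSTRAINT cdr_2018_05_26_cdr_date_check
    CHECK (cdr_date >= '2018-05-26 00:00:00+00'::timestamptz
       AND cdr_date <  '2018-05-27 00:00:00+00'::timestamptz)
)
INHERITS (cdr);

CREATE INDEX ind_cdr_2018_05_26
    ON cdr_2018_05_26 (cdr_date);
```

With bounds like these, a row can satisfy at most one child's constraint, which is what lets the planner exclude the other children for a given time range.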

There is no way it should take 5 seconds to find 0 rows on an index scan of the root table. I would say your root table (or indexes, anyway) is massively bloated. And if that is the case, maybe your other ones are as well. Are you vacuuming these tables sufficiently, or even at all? Look in pg_stat_user_tables for the last time they were vacuumed, either manually or auto.
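A query along these lines could be used to check the vacuum history. It is only a sketch: the `LIKE 'cdr%'` filter assumes that the root table and all partitions share the cdr prefix, as they do in the question.

```sql
-- Last manual/automatic vacuum and analyze times, plus dead-row estimates,
-- for the root table and its partitions (name pattern is an assumption).
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_vacuum,
       last_autovacuum,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname LIKE 'cdr%'
ORDER BY n_dead_tup DESC;
```

Tables with a large n_dead_tup and empty last_vacuum / last_autovacuum columns would be the first candidates to look at.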

Related

Why is ST_DWithin not using the index in my case?

I'm using the PostGIS extension for Postgres and trying to optimize my query for searching for points within a circle.
Consider this table with an index:
create table position
(
id bigserial not null primary key,
date timestamp with time zone,
point GEOMETRY(Point, 4326),
alias varchar(50)
);
create index position_point_idx on position using gist (point);
Now, when I query with a polygon, everything works as expected. In the explain plan I can see that the query uses the index.
SELECT distinct alias
FROM position
WHERE date > '2021-11-28T19:26:18.574Z'
AND date < '2021-11-28T20:26:18.574Z'
AND ST_contains(ST_GeomFromText(
'POLYGON ((13.970947489142418 49.59174558308953, 13.970947489142418 50.12515341892287, 15.208740681409838 50.12515341892287, 15.208740681409838 49.59174558308953, 13.970947489142418 49.59174558308953))',
4326), point);
-> Bitmap Index Scan on position_point_idx (cost=0.00..183.82 rows=5254 width=0) (actual time=5.981..5.981 rows=94462 loops=1)
Okay, now I want to search for aliases within a circle, but for some reason it takes seconds and doesn't use the index at all.
SELECT distinct alias
FROM position
WHERE
date > '2021-11-28T19:26:18.574Z' AND date < '2021-11-28T20:26:18.574Z'
AND
ST_DWithin (point,ST_GeomFromText('POINT (14.32983409613371
49.91815471231952)',4326),62815.14152820495);
ST_DWithin is in the list here, so it should use the index, but it's ignoring it.
What am I doing wrong here? Thanks for any hint.
Here is my query plan:
HashAggregate (cost=687537.59..687538.59 rows=100 width=9) (actual time=2874.991..2875.003 rows=100 loops=1)
Output: alias
Group Key: "position".alias
-> Gather (cost=1000.00..686702.70 rows=333955 width=9) (actual time=0.254..2041.354 rows=5008801 loops=1)
Output: alias
Workers Planned: 2
Workers Launched: 2
-> Parallel Seq Scan on public."position" (cost=0.00..652307.20 rows=139148 width=9) (actual time=0.021..2117.644 rows=1669600 loops=3)
Output: alias
Filter: (("position".date > '2021-11-28 19:26:18.574+00'::timestamp with time zone) AND ("position".date < '2021-11-28 20:26:18.574+00'::timestamp with time zone) AND ("position".point && '0103000020E6100000010000000500000077EC65F919AAEEC0B42AE025A7A5EEC077EC65F919AAEEC03A26ECE821B2EE4077646615AFADEE403A26ECE821B2EE4077646615AFADEE40B42AE025A7A5EEC077EC65F919AAEEC0B42AE025A7A5EEC0'::geometry) AND ('0101000020E61000000100C003E0A82C40520AF71786F54840'::geometry && st_expand("position".point, '62815.1415282049493'::double precision)) AND _st_dwithin("position".point, '0101000020E61000000100C003E0A82C40520AF71786F54840'::geometry, '62815.1415282049493'::double precision))
Rows Removed by Filter: 86028
Worker 0: actual time=0.023..2492.854 rows=1922778 loops=1
Worker 1: actual time=0.025..2493.448 rows=2024544 loops=1
Planning Time: 0.211 ms
Execution Time: 2876.783 ms
PostgreSQL chooses a sequential scan because it thinks that that is the most efficient access strategy, and I would say it is right. After all, the WHERE condition removed only 250000 out of approximately 5 million rows.
I think you wanted to use geography, not geometry. In geometry 4326, the entire earth (and the rest of the universe, I suppose) is well within 62815.14152820495 degrees of every other point, so the index would be profoundly ineffective.
If you were using geography, that would be 39 miles, for which the index would be useful, and in my hands it would be used.
The stats on your table also seem to be way off.
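To make the geography suggestion concrete, here is a minimal sketch against the table from the question. The index name position_point_geog_idx is made up for the example, and whether the planner actually picks the index still depends on how selective the circle is on your data.

```sql
-- Expression index on the geography cast (index name is hypothetical).
CREATE INDEX position_point_geog_idx
    ON "position" USING gist ((point::geography));

-- Refresh planner statistics, since the row estimates in the plan look off.
ANALYZE "position";

-- With geography arguments, the distance is interpreted in metres.
SELECT DISTINCT alias
FROM "position"
WHERE date > '2021-11-28T19:26:18.574Z'
  AND date < '2021-11-28T20:26:18.574Z'
  AND ST_DWithin(
        point::geography,
        ST_GeogFromText('SRID=4326;POINT(14.32983409613371 49.91815471231952)'),
        62815.14152820495);
```

The cast in the query has to match the indexed expression (`point::geography`) for the expression index to be considered.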

SQL Actual Execution Plan with Sort took high cost

I have a table named DocumentItem whose Id column is the clustered index (primary key).
Please see these two queries:
Query 1 (without ORDER BY):
select *
from DocumentItem
where (HistoryCreateDate >= '2019-09-04 05:00:00' AND HistoryCreateDate <= '2019-12-04 05:00:00') and ActNodeState>140100
The result took: 00:00:09 with 168.357 rows.
Query 2 (with ORDER BY):
select *
from DocumentItem
where (HistoryCreateDate >= '2019-09-04 05:00:00' AND HistoryCreateDate <= '2019-12-04 05:00:00') and ActNodeState>140100 order by Id
The result took: 00:02:41 with 168.357 rows.
Here is the actual execution plan:
Why did the second query take so long?
SQL Server has decided that your index IX_HistoryCreateDate (not sure of the full name) is sufficiently selective that it will use it to find the rows that it needs. However, that index isn't sorted on the ID column. It does include the ID column already (whether you specified it or not) because it's the clustering key.
I'd suggest recreating your IX_HistoryCreateDate index like this:
CREATE INDEX IX_HistoryCreateDate ON DocumentItem
( HistoryCreateDate, ID)
INCLUDE (ActNodeState);
And I think you'll be fine. It's still not going to be great and it will have to do a large number of lookups, because your query uses SELECT *. Do you really need all columns returned? If so, and you do this all the time, you might consider reclustering the table in the order that you need.

Postgres index-only scan taking too long

I have a table with the below structure and indexes:
Table "public.client_data"
Column | Type | Modifiers
-------------------------+---------+-----------
account_id | text | not null
client_id | text | not null
client_type | text | not null
creation_time | bigint | not null
last_modified_time | bigint | not null
Indexes:
"client_data_pkey" PRIMARY KEY, btree (account_id, client_id)
"client_data_last_modified_time_index" btree (last_modified_time)
From this table I need to find the oldest record - for this I used the following query:
SELECT last_modified_time FROM client_data ORDER BY last_modified_time ASC LIMIT 1;
However, this query on this table with around 61 million rows runs very slowly (90-100 mins) on a db.r4.2xlarge RDS instance in AWS Aurora Postgres 9.6 with no other concurrent queries running.
Changing the query to use DESC, on the other hand, makes it finish instantly. What could be the problem? I was expecting that, since I have an index on last_modified_time, querying only that column, ordered by that column, with the limit applied would use an index-only scan that stops after the first entry in the index.
Here is the output of the explain analyze:
EXPLAIN ANALYZE SELECT last_modified_time FROM client_data ORDER BY last_modified_time ASC LIMIT 1;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..2.31 rows=1 width=8) (actual time=6297823.287..6297823.287 rows=1 loops=1)
-> Index Only Scan using client_data_last_modified_time_index on client_data (cost=0.57..1049731749.38 rows=606590292 width=8) (actual time=6297823.287..6297823.287 rows=1 loops=1)
Heap Fetches: 26575013
Planning time: 0.078 ms
Execution time: 6297823.306 ms
The same for the DESC version of the query results in the following
EXPLAIN ANALYZE SELECT last_modified_time FROM client_data ORDER BY last_modified_time DESC LIMIT 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.57..2.32 rows=1 width=8) (actual time=2.265..2.266 rows=1 loops=1)
-> Index Only Scan Backward using client_data_last_modified_time_index on client_data (cost=0.57..1066049674.69 rows=611336085 width=8) (actual time=2.264..2.264 rows=1 loops=1)
Heap Fetches: 9
Planning time: 0.095 ms
Execution time: 2.278 ms
Any pointers?
The difference is this:
The slow plan has
Heap Fetches: 26575013
and the fast plan
Heap Fetches: 9
Heap fetches are what turn a fast index-only scan into a slow normal index scan.
Did the table experience mass updates or deletions recently?
The reason for the slow scan is that it has to wade through a lot of invisible (deleted) tuples before it hits the first match.
Run VACUUM on the table, and both scans will be fast.
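As a sketch of that suggestion, one could vacuum the table and then re-run the slow plan to see whether the heap fetches drop. The VERBOSE and ANALYZE options are additions for the example; a plain VACUUM would do as well.

```sql
-- Remove dead tuples, update the visibility map, and refresh statistics.
VACUUM (VERBOSE, ANALYZE) client_data;

-- Re-check the ascending query; "Heap Fetches" should now be small.
EXPLAIN ANALYZE
SELECT last_modified_time
FROM client_data
ORDER BY last_modified_time ASC
LIMIT 1;
```

If the heap fetch count stays high after the vacuum, a long-running transaction may be preventing the dead tuples from being removed.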

Postgres: the query sometimes takes too long

I need help or any hint. I have a Postgres 9.4 database with one query that is SOMETIMES processed very slowly.
SELECT COUNT(*) FROM "table_a" INNER JOIN "table_b" ON "table_b"."id" = "table_a"."table_b_id" AND "table_b"."deleted_at" IS NULL WHERE "table_a"."deleted_at" IS NULL AND "table_b"."company_id" = ? AND "table_a"."company_id" = ?
Query plan for this -
Aggregate (cost=308160.70..308160.71 rows=1 width=0)
-> Hash Join (cost=284954.16..308160.65 rows=20 width=0)
Hash Cond: ?
-> Bitmap Heap Scan on table_a (cost=276092.39..299260.96 rows=6035 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_a_on_created_at_and_company_id (cost=0.00..276090.89 rows=6751 width=0)
Index Cond: ?
-> Hash (cost=8821.52..8821.52 rows=3220 width=4)
-> Bitmap Heap Scan on table_b (cost=106.04..8821.52 rows=3220 width=4)
Recheck Cond: ?
Filter: ?
-> Bitmap Index Scan on index_table_b_on_company_id (cost=0.00..105.23 rows=3308 width=0)
Index Cond: ?
Usually, though, this query executes quickly enough (about 69.7 ms). I don't understand why it is sometimes slow. The performance logs for that period show that my RDS instance was consuming a lot of memory and that the rate of these queries reached about 100 per second. Any help would be appreciated: where should I look to solve this problem?
I am not sure if this will solve your problem or not :)
When this query returns very quickly, it is most likely being served from cache: PostgreSQL does not cache query results, but it does keep recently used table and index pages in memory, so a repeated query avoids most disk reads.
First of all, check whether too many queries are being executed against these tables, especially INSERTs, UPDATEs, and DELETEs. Such queries can cause lock contention, and a SELECT may have to wait until a lock is released.
The query can also be slow because the join and WHERE comparisons between table_a and table_b are expensive.
You can reduce that cost by indexing the columns "table_b"."id", "table_a"."table_b_id", "table_a"."deleted_at", "table_b"."company_id", and "table_a"."company_id".
You can create a materialized view to reduce the cost as well; unlike a plain view, a materialized view stores its result, so reads do not recompute the join.
One last thing: you can reduce the cost by using temporary tables as well. I have given an example below.
QUERIES:
CREATE TEMPORARY TABLE table_a_temp as
SELECT "table_a"."table_b_id" FROM "table_a"
WHERE "table_a"."deleted_at" IS NULL AND "table_a"."company_id" = ? ;
CREATE TEMPORARY TABLE table_b_temp as
SELECT "table_b"."id" FROM "table_b"
WHERE "table_b"."deleted_at" IS NULL AND "table_b"."company_id" = ?;
SELECT COUNT(*) FROM "table_a_temp" INNER JOIN "table_b_temp"
ON "table_b_temp"."id" = "table_a_temp"."table_b_id" ;

PostgreSQL - Index not used

I have created a table (table1).
I've also written a procedure that updates this table whenever I update another table.
That is, when I update table2, a few records from table2 are copied into table1 by the trigger I've created on table2.
I could've created view instead of doing that. But the main purpose of it is I won't be able to create index on views.
Hence I did it this way. table2 contains about 500k rows. I'm copying about 220k rows to table1, along with an extra column computed by some calculation that gives each row either 0 or 1 based on some criteria.
And I've created an index on table1.
If I execute a count(*) query in table2 in which i already have only one index for date col. The query executes in 200ms which has about 500k rows.
But if I execute the same query on table1, it takes double the time compared to table2.
And if I remove the index on table1, it adds another 500-600ms to the execution time.
So creating the index on table1 has saved only 500-600ms.
EXPLAIN ANALYZE of the query with 2 columns:
HashAggregate (cost=80421.85..80422.94 rows=73 width=4) (actual time=6248.826..6248.829 rows=3 loops=1)
-> Seq Scan on table1 (cost=0.00..70306.88 rows=2022994 width=4) (actual time=0.048..4203.224 rows=2022994 loops=1)
Filter: ((date >= '2014-02-01'::date) AND (date <= '2014-04-30'::date))
Total runtime: 6248.895 ms
Table Definition:
CREATE TABLE table1
(
label1 text NOT NULL,
label2 text NOT NULL,
label3 text NOT NULL,
date date NOT NULL,
"mobile no" bigint NOT NULL,
"start time" time without time zone NOT NULL,
"end time" time without time zone NOT NULL,
label4 text NOT NULL,
label5 text NOT NULL,
value1 integer NOT NULL,
count numeric NOT NULL
)
Index Definition :
CREATE INDEX ix_date
ON table1
USING btree
(date);
And the COUNT(*) I've given is just for an example.
Actually I sum up the count column by grouping label1, 2, 3 and extracting the month from date.
Firstly,
I could've created view instead of doing that. But the main purpose of it is I won't be able to create index on views.
Views are "expanded" when processing queries so, e.g. a SELECT x FROM my_view JOIN y... will practically directly substitute the view definition inside your query, and the resulting expanded query will be able to use any indexes directly, if applicable.
Secondly,
If I execute a count(*) query in table2 in which i already have only one index for date col. The query executes in 200ms which has about 500k rows.
Unfortunately, COUNT(*) queries in PostgreSQL don't usually use indexes, even in recent (9.2+) versions with index-only scans. See here: https://wiki.postgresql.org/wiki/Index-only_scans#Is_.22count.28.2A.29.22_much_faster_now.3F for a description of why that is. A non-unique (or primary key) index will NEVER be used for COUNT(*).
Thirdly, updating records in a MVCC database such as PostgreSQL creates updated copies of those records, instead of updating them in-place. This almost always results in significant internal data fragmentation, which is sorely visible if you use drives with slow seek times, like mechanical drives. If you want linearly reduced COUNT(*) times between tables of different sizes, either make sure the data is not fragmented (VACUUM FULL ANALYZE + REINDEX will mostly do the trick), or just use an SSD.
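The defragmentation step mentioned above could look roughly like this for table1. Treat it as a sketch: VACUUM FULL rewrites the whole table under an exclusive lock, so it blocks reads and writes while it runs and is best done in a maintenance window.

```sql
-- Rewrite table1 compactly and refresh planner statistics.
-- Takes an ACCESS EXCLUSIVE lock for the duration of the rewrite.
VACUUM FULL ANALYZE table1;

-- Rebuild all indexes on table1, including ix_date. On PostgreSQL 9.0 and
-- later VACUUM FULL already rebuilds the indexes, so this is mostly a
-- belt-and-braces step.
REINDEX TABLE table1;
```

Afterwards, re-running the EXPLAIN ANALYZE from the question shows whether the sequential scan time on table1 has come down.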

Resources