I have a Postgres DB hosted on Amazon RDS with 150 GB of storage, 8 GB RAM and 2 vCPUs. The DB has a table with 320 columns and 20 million rows as of now. The problem I am facing is that the response time of DB queries has degraded quite a lot as we began inserting more data. At 18 million rows, the DB response was quite fast, but after inserting another 2 million rows, performance dropped considerably. I ran a simple query as follows:
explain analyze SELECT * from "data_table" WHERE foreign_key_id = 7 ORDER BY "TimeStamp" DESC LIMIT 1;
The response for the above is as follows:
Limit (cost=0.43..90.21 rows=1 width=2552) (actual time=650065.806..650065.807 rows=1 loops=1)
-> Index Scan Backward using "data_table_TimeStamp_219314ec" on data_table (cost=0.43..57250559.80 rows=637678 width=2552) (actual time=650065.803..650065.803 rows=1 loops=1)
Filter: (foreign_key_id = 7)
Rows Removed by Filter: 4910074
Planning time: 44.072 ms
Execution time: 650066.004 ms
I ran another query with a different id for the foreign key, and the result was as shown below:
explain analyze SELECT * from "data_table" WHERE foreign_key_id = 1 ORDER BY "TimeStamp" DESC LIMIT 1;
Limit (cost=0.43..13.05 rows=1 width=2552) (actual time=2.749..2.750 rows=1 loops=1)
-> Index Scan Backward using "data_table_TimeStamp_219314ec" on data_table (cost=0.43..57250559.80 rows=4539651 width=2552) (actual time=2.747..2.747 rows=1 loops=1)
Filter: (foreign_key_id = 1)
Planning time: 0.496 ms
Execution time: 2.927 ms
As you can see, two queries of the same type give highly different results. There are about 11 million rows with foreign_key_id = 1 and about 1 million rows with foreign_key_id = 7.
I am not able to figure out why this is happening. There is a huge delay in the response for every foreign_key_id except foreign_key_id = 1. The first plan has a "Rows Removed by Filter" line, which is not present in the second plan.
Could anyone help me understand this issue?
Additional Information
"TimeStamp" is indexed using a btree index.
A small amount of data is inserted every 10 minutes. Occasionally we also insert bulk data (5-6 million rows) using scripts.
You could add an index to generate a different execution plan. With a composite index on (foreign_key_id, "TimeStamp"), the planner can descend directly to the newest row for a given foreign_key_id instead of walking the "TimeStamp" index backwards and filtering out millions of non-matching rows:
CREATE INDEX idx ON data_table(foreign_key_id, "TimeStamp" DESC);
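Since the table is large and receives inserts every 10 minutes, you may prefer to build the index without blocking writes. A minimal sketch, assuming the column names from the question (the index name is arbitrary):
-- CREATE INDEX CONCURRENTLY avoids locking out concurrent writes;
-- it must be run outside of a transaction block.
CREATE INDEX CONCURRENTLY data_table_fk_ts_idx
    ON data_table (foreign_key_id, "TimeStamp" DESC);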
Related
I'm brand new to the concept of database administration, so I have no basis for what to expect. I am working with approximately 100GB of data in the form of five different tables. Descriptions of the data, as well as the first few rows of each file, can be found here.
I'm currently just working with the flows table in an effort to gauge performance. Here are the results from \d flows:
Table "public.flows"
Column | Type | Modifiers
------------+-------------------+-----------
time | real |
duration | real |
src_comp | character varying |
src_port | character varying |
dest_comp | character varying |
dest_port | character varying |
protocol | character varying |
pkt_count | real |
byte_count | real |
Indexes:
"flows_dest_comp_idx" btree (dest_comp)
"flows_dest_port_idx" btree (dest_port)
"flows_protocol_idx" btree (protocol)
"flows_src_comp_idx" btree (src_comp)
"flows_src_port_idx" btree (src_port)
Here are the results from EXPLAIN ANALYZE SELECT src_comp, COUNT(DISTINCT dest_comp) FROM flows GROUP BY src_comp;, which I thought would be a relatively simple query:
GroupAggregate (cost=34749736.06..35724568.62 rows=200 width=64) (actual time=1292299.166..1621191.771 rows=11154 loops=1)
Group Key: src_comp
-> Sort (cost=34749736.06..35074679.58 rows=129977408 width=64) (actual time=1290923.435..1425515.812 rows=129977412 loops=1)
Sort Key: src_comp
Sort Method: external merge Disk: 2819360kB
-> Seq Scan on flows (cost=0.00..2572344.08 rows=129977408 width=64) (actual time=26.842..488541.987 rows=129977412 loops=1)
Planning time: 6.575 ms
Execution time: 1636290.138 ms
(8 rows)
If I'm interpreting this correctly (which I might not be, since I'm new to PSQL), it's saying that my query takes almost 30 minutes to execute, which is much, much longer than I would expect, even with ~130 million rows.
My computer is running with an 8th-gen i7 quad-core CPU, 16 GB of RAM, and a 2TB HDD (full specs can be found here).
My questions then are: 1) is this performance to be expected, and 2) is there anything I can do to speed it up, other than buying an external SSD?
1 - src_comp and dest_comp, which are used by the query, are both indexed. However, they are indexed independently. If you had an index on (src_comp, dest_comp), there is a possibility that the database could process this entirely via indexes, eliminating a full table scan.
2 - src_comp and dest_comp are character varying. That is NOT a good thing for indexed fields unless it is necessary. What are these values really? Numbers? IP addresses? Computer network names? If there is a relatively finite number of them and they can be identified as they are added to the database, change them to integers used as foreign keys into other tables. That will make a HUGE difference in this query. If they can't be stored that way, but they at least have a definite finite length - e.g., 15 characters for IPv4 addresses in dotted-quad format - then set a maximum length for the fields, which should help somewhat.
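A minimal sketch of both suggestions, assuming the flows schema shown above; the hosts table, its columns, and all index names here are hypothetical illustrations, not part of the original schema:
-- 1) Composite index covering both columns the query touches:
CREATE INDEX flows_src_dest_idx ON flows (src_comp, dest_comp);

-- 2) Normalize the repeated text values to integer surrogate keys
--    (hypothetical lookup table; backfill the new columns afterwards):
CREATE TABLE hosts (
    id   serial PRIMARY KEY,
    name character varying UNIQUE NOT NULL
);
ALTER TABLE flows ADD COLUMN src_comp_id  integer REFERENCES hosts (id);
ALTER TABLE flows ADD COLUMN dest_comp_id integer REFERENCES hosts (id);
-- ...populate hosts and the new columns, then index the integer columns
-- instead of the character varying ones.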
I have a table where I keep a record of who is following whom on a Twitter-like application:
\d follow
Table "public.follow" .
Column | Type | Modifiers
---------+--------------------------+-----------------------------------------------------
xid | text |
followee | integer |
follower | integer |
id | integer | not null default nextval('follow_id_seq'::regclass)
createdAt | timestamp with time zone |
updatedAt | timestamp with time zone |
source | text |
Indexes:
"follow_pkey" PRIMARY KEY, btree (id)
"follow_uniq_users" UNIQUE CONSTRAINT, btree (follower, followee)
"follow_createdat_idx" btree ("createdAt")
"follow_followee_idx" btree (followee)
"follow_follower_idx" btree (follower)
The number of entries in the table is more than a million, and when I run explain analyze on the query I get this:
explain analyze SELECT "follow"."follower"
FROM "public"."follow" AS "follow"
WHERE "follow"."followee" = 6
ORDER BY "follow"."createdAt" DESC
LIMIT 15 OFFSET 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..353.69 rows=15 width=12) (actual time=5.456..21.497 rows=15 loops=1)
-> Index Scan Backward using follow_createdat_idx on follow (cost=0.43..61585.45 rows=2615 width=12) (actual time=5.455..21.488 rows=15 loops=1)
Filter: (followee = 6)
Rows Removed by Filter: 62368
Planning time: 0.068 ms
Execution time: 21.516 ms
Why is it doing a backward index scan on follow_createdat_idx, when execution could have been faster if it had used follow_followee_idx?
This query takes around 33 ms the first time it runs and around 22 ms on subsequent calls, which I feel is on the higher side.
I am using Postgres 9.5 provided by Amazon RDS. Any idea what could be going wrong here?
The multicolumn index on (followee, "createdAt") that user1937198 suggested is perfect for the query - as you found in your test already.
Since "createdAt" can be NULL (it is not defined NOT NULL), you may want to add NULLS LAST to the query and the index:
...
ORDER BY "follow"."createdAt" DESC NULLS LAST
And:
"follow_follower_createdat_idx" btree (follower, "createdAt" DESC NULLS LAST)
More:
PostgreSQL sort by datetime asc, null first?
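For reference, a sketch of the matching DDL and the amended query (the index name is illustrative):
-- Composite index matching the WHERE column and the sort order:
CREATE INDEX follow_followee_createdat_idx
    ON follow (followee, "createdAt" DESC NULLS LAST);

SELECT "follow"."follower"
FROM "public"."follow" AS "follow"
WHERE "follow"."followee" = 6
ORDER BY "follow"."createdAt" DESC NULLS LAST
LIMIT 15 OFFSET 0;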
There are other minor performance implications:
The multicolumn index on (followee, "createdAt") is 8 bytes per row bigger than the simple index on (followee) - 44 bytes vs. 36. More (btree indexes have mostly the same page layout as tables):
Making sense of Postgres row sizes
Columns involved in an index in any way cannot be changed with a HOT update. Adding more columns to an index might block this optimization for updates that change "createdAt" - which seem particularly unlikely given the column name. And since you have another index on just ("createdAt"), that's not an issue anyway. More:
PostgreSQL Initial Database Size
There is no downside to having another index on just ("createdAt"), other than the maintenance cost of each index (for write performance, not for read performance). Both indexes support different queries. You may or may not need the index on just ("createdAt") additionally. Detailed explanation:
Is a composite index also good for queries on the first field?
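To illustrate how the two indexes serve different queries, a sketch using the table from the question:
-- Served by the plain ("createdAt") index: newest rows across all users
SELECT follower FROM follow ORDER BY "createdAt" DESC NULLS LAST LIMIT 15;

-- Served by the composite (followee, "createdAt") index: newest rows for one followee
SELECT follower FROM follow WHERE followee = 6 ORDER BY "createdAt" DESC NULLS LAST LIMIT 15;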
In sort operations, the main bottleneck is disk I/O, i.e. fetching all of the data from disk.
So I thought of experimenting with IMCS, an in-memory columnar store plugin for Postgres.
Below is the regular query plan (I didn't include an index, for learning purposes):
perf_test=# explain (analyze,buffers) select * from users_act_demo order by userid limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=410011.04..410011.06 rows=10 width=14) (actual time=4328.393..4328.396 rows=10 loops=1)
Buffers: shared hit=3 read=59875
-> Sort (cost=410011.04..437703.22 rows=11076875 width=14) (actual time=4328.387..4328.389 rows=10 loops=1)
Sort Key: userid
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=3 read=59875
-> Seq Scan on users_act_demo (cost=0.00..170643.75 rows=11076875 width=14) (actual time=0.042..1938.043 rows=11076719 loops=1)
Buffers: shared read=59875
Planning time: 42.055 ms
Execution time: 4328.482 ms
(10 rows)
After running cs_load, which is a function that moves all of the pages into memory, the query plan is as below.
perf_test=# explain (analyze,buffers) select * from users_act_demo order by userid limit 10;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=410011.04..410011.06 rows=10 width=14) (actual time=3902.765..3902.767 rows=10 loops=1)
Buffers: shared hit=59875
-> Sort (cost=410011.04..437703.22 rows=11076875 width=14) (actual time=3902.763..3902.763 rows=10 loops=1)
Sort Key: userid
Sort Method: top-N heapsort Memory: 25kB
Buffers: shared hit=59875
-> Seq Scan on users_act_demo (cost=0.00..170643.75 rows=11076875 width=14) (actual time=0.028..1520.772 rows=11076719 loops=1)
Buffers: shared hit=59875
Planning time: 0.145 ms
Execution time: 3902.827 ms
My question is: why isn't the sort faster? If fetching the data is no longer I/O bound, isn't the sort CPU bound? Why is it still slow, and what exactly is the cost involved here?
I have a table that has over 15 million rows in PostgreSQL. Users can save these rows (let's say items) into their library, and when they request their library, the system loads the user's items.
The query in PostgreSQL looks like:
SELECT items.id, items.name
FROM items JOIN library ON (library.item_id = items.id)
WHERE library.user_id = 1;
The table is already indexed and denormalized, so I don't need any other JOIN.
If a user has many items in the library (say 1k items), the query time increases accordingly (for example, around 7 s for 1k items). My aim is to reduce the query time for large result sets.
I already use Solr for full-text searching, and I tried queries like ?q=id:1 OR id:100 OR id:345, but I'm not sure whether that is efficient in Solr.
I want to know my alternatives for querying this dataset. The bottleneck in my system seems to be disk speed. Should I buy a server with over 15 GB of memory and stay with PostgreSQL with a larger shared memory setting, try something like MongoDB or another in-memory database, or should I build a cluster and replicate the data in PostgreSQL?
items:
Column | Type
--------------+-------------------
id | text
mbid | uuid
name | character varying
length | integer
track_no | integer
artist | text[]
artist_name | text
release | text
release_name | character varying
rank | numeric
user_library:
Column | Type | Modifiers
--------------+-----------------------------+--------------------------------------------------------------
user_id | integer | not null
recording_id | character varying(32) |
timestamp | timestamp without time zone | default now()
id | integer | primary key nextval('user_library_idx_pk'::regclass)
-------------------
explain analyze
SELECT recording.id,name,track_no,artist,artist_name,release,release_name
FROM recording JOIN user_library ON (user_library.recording_id = recording.id)
WHERE user_library.user_id = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..10745.33 rows=1036539 width=134) (actual time=0.168..57.663 rows=1000 loops=1)
Join Filter: (recording.id = (recording_id)::text)
-> Seq Scan on user_library (cost=0.00..231.51 rows=1000 width=19) (actual time=0.027..3.297 rows=1000 loops=1) (my opinion: because user_library has only 2 rows, Postgresql didn't use index to save resources.)
Filter: (user_id = 1)
-> Append (cost=0.00..10.49 rows=2 width=165) (actual time=0.045..0.047 rows=1 loops=1000)
-> Seq Scan on recording (cost=0.00..0.00 rows=1 width=196) (actual time=0.001..0.001 rows=0 loops=1000)
-> Index Scan using de_recording3_table_pkey on de_recording recording (cost=0.00..10.49 rows=1 width=134) (actual time=0.040..0.042 rows=1 loops=1000)
Index Cond: (id = (user_library.recording_id)::text)
Total runtime: 58.589 ms
(9 rows)
First, if your interesting (commonly used) set of data, along with all indexes, fits comfortably in memory, you will have much better performance, so yes, more RAM will help. However, with 1k records, part of your time will be spent on materializing records and sending them to the client.
A couple of other initial points:
There is no substitute for real profiling. You may be surprised where the time is actually being spent. On PostgreSQL, use explain analyze to profile a query.
Look into cursors and limit/offset; a sketch of both follows below.
I don't think one can come up with better advice until these are done.
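A minimal sketch of both approaches, reusing the recording/user_library query from the question (the page size, ordering column, and cursor name are arbitrary choices):
-- LIMIT/OFFSET pagination: fetch the library one page at a time
SELECT recording.id, name, track_no, artist, artist_name, release, release_name
FROM recording JOIN user_library ON (user_library.recording_id = recording.id)
WHERE user_library.user_id = 1
ORDER BY user_library.id
LIMIT 100 OFFSET 0;  -- next pages: OFFSET 100, 200, ...

-- Cursor: keep one server-side result set open and fetch it in chunks
BEGIN;
DECLARE library_cur CURSOR FOR
    SELECT recording.id, name, track_no, artist, artist_name, release, release_name
    FROM recording JOIN user_library ON (user_library.recording_id = recording.id)
    WHERE user_library.user_id = 1;
FETCH 100 FROM library_cur;  -- repeat until it returns no rows
CLOSE library_cur;
COMMIT;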
When I run below query
explain
select count(*) over() as t_totalcnt, max(hits) over() as t_maxhits, max(bytes) over() as t_maxbytes, *
from
(
select category,sum(hits) as hits,sum(bytes) as bytes
from (
select "5mintime",category,hits,bytes,appid, 0 as tmpfield
from web_categoryutfv1_24hr_ts_201209
where "5mintime" >='2012-09-12 00:00:00' and "5mintime" < '2012-09-19 00:00:00'
) as tmp
where "5mintime" >='2012-09-12 00:00:00'
and "5mintime" <= '2012-09-18 23:59:59'
and appid in ('') group by category order by hits desc
) as foo
limit 10;
I get the below output
Limit (cost=31.31..31.61 rows=10 width=580)
-> WindowAgg (cost=31.31..32.03 rows=24 width=580)
-> Subquery Scan foo (cost=31.31..31.61 rows=24 width=580)
-> Sort (cost=31.31..31.37 rows=24 width=31)
Sort Key: (sum(web_categoryutfv1_24hr_ts_201209.hits))
-> HashAggregate (cost=30.39..30.75 rows=24 width=31)
-> Seq Scan on web_categoryutfv1_24hr_ts_201209 (cost=0.00..27.60 rows=373 width=31)
Filter: (("5mintime" >= '2012-09-12 00:00:00'::timestamp without time zone)
AND ("5mintime" < '2012-09-19 00:00:00'::timestamp without time zone)
AND ("5mintime" >= '2012-09-12 00:00:00'::timestamp without time zone)
AND ("5mintime" <= '2012-09-18 23:59:59'::timestamp without time zone)
AND ((appid)::text = ''::text))
When I run the above query without explain, I get the output within 1 second, yet here the cost is 31.31..31.61.
Could anybody help me understand what the cost values in an explain plan mean - in particular, what units they are in?
Cost is the query planner's estimate of how difficult an operation is or how long it will take to perform. It's based on some machine-level parameters -- how long a disk seek will take versus a streaming read, for example -- along with table-level information like how big each row is, how many rows there are, or the distribution of values in each column. There are no units, and the resulting cost values are arbitrary. Costs are the metric PostgreSQL uses to figure out how to execute a query; it will consider the myriad ways in which to execute your query and choose the plan with the lowest cost. For more specifics on cost calculations, see Planner Cost Constants.
Assuming you're using the default settings, a cost this low for a sequential scan suggests to me that PostgreSQL thinks there aren't many rows in that table. The fact that it's taking a full second to run suggests that there are, in fact, a lot of rows in that table. You can tell PostgreSQL to collect new statistics on that table by saying ANALYZE web_categoryutfv1_24hr_ts_201209. The pg_autovacuum process should regularly collect statistics anyway, but maybe you're on an older version of PostgreSQL, or it hasn't run in a while, or who knows; regardless, there's no harm in doing it again by hand.
If PostgreSQL thinks that table is small, it'll prefer a sequential scan over using indexes, because a sequential read of the whole table is faster than an index scan followed by a bunch of random reads. On the other hand, if PostgreSQL thinks the table is large, it will likely be faster to reference an index on 5mintime and/or appid, assuming that said index will allow it to exclude many rows. If you have no such index, consider creating one.
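A sketch of both suggestions (the index name here is illustrative, not from the question):
-- Refresh the planner's statistics for the table
ANALYZE web_categoryutfv1_24hr_ts_201209;

-- Composite index matching the WHERE clause on "5mintime" and appid
CREATE INDEX web_cat_5mintime_appid_idx
    ON web_categoryutfv1_24hr_ts_201209 ("5mintime", appid);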
One last thing: EXPLAIN has a big brother named EXPLAIN ANALYZE. While EXPLAIN shows you the query plan PostgreSQL will choose along with the costs that guided its decision, EXPLAIN ANALYZE actually executes the query and shows you how long each component took to run. See EXPLAIN for more.