I have a table with over 15M rows in PostgreSQL. Users can save these rows (let's call them items) to their library, and when they request their library, the system loads the saved items.
The query in PostgreSQL looks like this:
SELECT items.id, items.name
FROM items JOIN library ON (library.item_id = items.id)
WHERE library.user_id = 1
The tables are already indexed and denormalized, so I don't need any other JOIN.
If a user has many items in their library (say 1k), the query time naturally increases; for 1k items it is around 7 s. My aim is to reduce the query time for these large libraries.
I already use Solr for full-text search, and I tried queries like ?q=id:1 OR id:100 OR id:345, but I'm not sure whether that is efficient in Solr.
I want to know my alternatives for querying this dataset. The bottleneck in my system seems to be disk speed. Should I buy a server with more than 15 GB of RAM and stay on PostgreSQL with a larger shared_buffers setting, try something like MongoDB or another in-memory database, or set up a cluster and replicate the data in PostgreSQL?
items:
Column | Type
--------------+-------------------
id | text
mbid | uuid
name | character varying
length | integer
track_no | integer
artist | text[]
artist_name | text
release | text
release_name | character varying
rank | numeric
user_library:
Column | Type | Modifiers
--------------+-----------------------------+--------------------------------------------------------------
user_id | integer | not null
recording_id | character varying(32) |
timestamp | timestamp without time zone | default now()
id | integer | primary key nextval('user_library_idx_pk'::regclass)
-------------------
explain analyze
SELECT recording.id,name,track_no,artist,artist_name,release,release_name
FROM recording JOIN user_library ON (user_library.recording_id = recording.id)
WHERE user_library.user_id = 1;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop (cost=0.00..10745.33 rows=1036539 width=134) (actual time=0.168..57.663 rows=1000 loops=1)
Join Filter: (recording.id = (recording_id)::text)
-> Seq Scan on user_library (cost=0.00..231.51 rows=1000 width=19) (actual time=0.027..3.297 rows=1000 loops=1) (my opinion: because user_library has so few rows, PostgreSQL didn't use the index, to save resources.)
Filter: (user_id = 1)
-> Append (cost=0.00..10.49 rows=2 width=165) (actual time=0.045..0.047 rows=1 loops=1000)
-> Seq Scan on recording (cost=0.00..0.00 rows=1 width=196) (actual time=0.001..0.001 rows=0 loops=1000)
-> Index Scan using de_recording3_table_pkey on de_recording recording (cost=0.00..10.49 rows=1 width=134) (actual time=0.040..0.042 rows=1 loops=1000)
Index Cond: (id = (user_library.recording_id)::text)
Total runtime: 58.589 ms
(9 rows)
First, if your interesting (commonly used) set of data, along with all indexes, fits comfortably in memory, you will get much better performance, so yes, more RAM will help. However, with 1k records, part of your time will be spent materializing records and sending them to the client.
A couple other initial points:
There is no substitute for real profiling. You may be surprised as to where the time is being spent. On PostgreSQL, use EXPLAIN ANALYZE to profile a query.
Look into cursors and LIMIT/OFFSET for paging through large result sets (see the sketch after this list).
I don't think one can come up with better advice until these are done.
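To make the second point concrete, here is a rough sketch of both approaches against the tables from the question (the page size of 100 is picked arbitrarily):

-- Paging with LIMIT/OFFSET (simple, but later pages get slower,
-- since the skipped rows still have to be produced):
SELECT items.id, items.name
FROM   items
JOIN   library ON library.item_id = items.id
WHERE  library.user_id = 1
ORDER  BY items.id
LIMIT  100 OFFSET 0;          -- next page: OFFSET 100, 200, ...

-- Server-side cursor (fetch the result in chunks inside one transaction):
BEGIN;
DECLARE lib_cur CURSOR FOR
    SELECT items.id, items.name
    FROM   items
    JOIN   library ON library.item_id = items.id
    WHERE  library.user_id = 1;
FETCH 100 FROM lib_cur;       -- repeat until no rows are returned
CLOSE lib_cur;
COMMIT;

The cursor keeps the result open on the server and streams it in chunks, so the client never has to materialize all 1k rows at once.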
Related
I am hoping to get some advice regarding next steps for our production database, which, in hindsight, has been very poorly thought out.
We have an events based system that generates payment information based on "contracts" (a "contract" defines how/what the user gets paid).
Example schema:
Approx             Quantity is ess.         Quantity is slightly
5M rows            1-to-1 with "Events"     less, approx.
                   i.e. ~5M rows            3M rows

----------         ----------------         --------------
| Events |         | Contract     |         | Payment    |
----------         ----------------         --------------
| id     |         | id           |         | id         |
| userId |   ->    | eventId      |  ---->  | contractId |
| jobId  |         | <a bunch of  |         | amount     |
| date   |         | data to calc |         | currency   |
----------         | payments>    |         --------------
                   ----------------
There is some additional data on each table too such as generic "created_at", "updated_at", etc.
These are the questions I have currently come up with (please let me know if I'm barking up the wrong tree):
Should I denormalise the payment information?
Currently, in order to get a payment based on an event ID we need at least one join, but realistically two (there are rarely scenarios where we want a payment based only on contract information). With 5M rows (and growing daily) it takes well over 20 s to do a plain SELECT with a sort, let alone a join. My thought is that by moving the payment info directly onto the events table we can skip the joins when fetching payment information; a rough sketch of what I mean is below.
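Something along these lines, say (table and column names follow the diagram above and the EXPLAIN below; the new payment columns are invented just for illustration, and I'm assuming at most one payment per contract):

-- Rough sketch: copy the calculated payment onto the corresponding event row
ALTER TABLE "event"
    ADD COLUMN "paymentAmount"   numeric,
    ADD COLUMN "paymentCurrency" text;

UPDATE "event" e
SET    "paymentAmount"   = p.amount,
       "paymentCurrency" = p.currency
FROM   contract c
JOIN   payment  p ON p."contractId" = c.id
WHERE  c."eventId" = e.id;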
Should I be "archiving" old rows from Events table?
Since the Events table constantly receives so many new rows (and also needs to be read for other business logic), should I implement some strategy to move events whose payments have already been calculated into an "archive" table (as sketched below)? My thought is that this will keep the very I/O-heavy Events table more "freed up" so it can be read and written faster.
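Roughly like this, say (event_archive would be a new table with the same structure as event; the "already paid" condition here is just a placeholder):

BEGIN;

WITH moved AS (
    -- Remove events that already have a payment, handing the rows back
    DELETE FROM "event" e
    WHERE  EXISTS (
               SELECT 1
               FROM   contract c
               JOIN   payment  p ON p."contractId" = c.id
               WHERE  c."eventId" = e.id
           )
    RETURNING e.*
)
INSERT INTO event_archive
SELECT * FROM moved;

COMMIT;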
What other optimisation strategies should I be thinking about?
The application in question is currently live to paying customers and our team is growing concerned about the manpower that will be required to keep this unoptimised setup in check. Are there any "quick wins" we could perform on the tables/database to speed things up? (I've read about "indexing" but I haven't grasped it well enough to be confident deploying it to a production database.)
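For concreteness, is a "quick win" something like the statement below? (The index name is made up; my understanding is that CONCURRENTLY avoids locking the live table while the index builds.)

-- Hypothetical index on the column we sort by most often
CREATE INDEX CONCURRENTLY event_updated_at_idx ON "event" ("updatedAt" DESC);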
Thank you so much in advance for any advice you can give!
Edit to include DB information:
explain (analyze, buffers) select * from "event" order by "updatedAt" desc
Gather Merge (cost=1237571.93..1699646.19 rows=3960360 width=381) (actual time=71565.082..76242.131 rows=5616130 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=107649 read=211715, temp read=805207 written=805973
I/O Timings: read=76365.059
-> Sort (cost=1236571.90..1241522.35 rows=1980180 width=381) (actual time=69698.164..70866.085 rows=1872043 loops=3)
Sort Key: "updatedAt" DESC
Sort Method: external merge Disk: 725448kB
Worker 0: Sort Method: external merge Disk: 685968kB
Worker 1: Sort Method: external merge Disk: 714432kB
Buffers: shared hit=107649 read=211715, temp read=805207 written=805973
I/O Timings: read=76365.059
-> Parallel Seq Scan on event (cost=0.00..339111.80 rows=1980180 width=381) (actual time=0.209..26506.333 rows=1872043 loops=3)
Buffers: shared hit=107595 read=211715
I/O Timings: read=76365.059
Planning Time: 1.167 ms
Execution Time: 76799.910 ms
I have a Postgres DB hosted on Amazon RDS with 150 GB of storage, 8 GB RAM and 2 vCPUs. The DB has a table with 320 columns and 20 million rows as of now. The problem I am facing is that query response times have degraded quite a lot as we began inserting more data. At 18 million rows the DB responses were quite fast, but after inserting another 2 million rows the performance dropped considerably. I ran a simple query as follows:
explain analyze SELECT * from "data_table" WHERE foreign_key_id = 7 ORDER BY "TimeStamp" DESC LIMIT 1;
The response for the above is as follows:
Limit (cost=0.43..90.21 rows=1 width=2552) (actual time=650065.806..650065.807 rows=1 loops=1)
-> Index Scan Backward using "data_table_TimeStamp_219314ec" on data_table (cost=0.43..57250559.80 rows=637678 width=2552) (actual time=650065.803..650065.803 rows=1 loops=1)
Filter: (asset_id = 7)
Rows Removed by Filter: 4910074
Planning time: 44.072 ms
Execution time: 650066.004 ms
I ran another query with a different id for the foreign key and the result was as shown below
explain analyze SELECT * from "data_table" WHERE foreign_key_id = 1 ORDER BY "TimeStamp" DESC LIMIT 1;
Limit (cost=0.43..13.05 rows=1 width=2552) (actual time=2.749..2.750 rows=1 loops=1)
-> Index Scan Backward using "data_table_TimeStamp_219314ec" on data_table (cost=0.43..57250559.80 rows=4539651 width=2552) (actual time=2.747..2.747 rows=1 loops=1)
Filter: (asset_id = 1)
Planning time: 0.496 ms
Execution time: 2.927 ms
As you can see, two queries of the same type give very different results. The number of records with foreign_key_id=1 is 11 million, while the number with foreign_key_id=7 is about 1 million.
I am not able to figure out why this is happening. There is a huge delay in response for every foreign_key_id except foreign_key_id=1. The first query's plan has a "Rows Removed by Filter" line, which is not there in the second plan.
Could anyone help me with understanding this issue?
Additional Information
The TimeStamp is indexed using btree
A small amount of data is inserted every 10 minutes. Occasionally we also insert bulk data (5-6 million rows) using scripts.
You could add an index to get a different execution plan:
CREATE INDEX idx ON data_table(foreign_key_id, "TimeStamp" DESC);
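With a composite index like that, the planner can descend directly to the newest "TimeStamp" for the requested foreign_key_id, instead of walking the timestamp-only index backwards and filtering out millions of non-matching rows (the "Rows Removed by Filter: 4910074" line in the first plan). That is also why the query for foreign_key_id = 1 happens to be fast today: its rows are so common that the backward scan finds a match almost immediately.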
I'm brand new to the concept of database administration, so I have no basis for what to expect. I am working with approximately 100GB of data in the form of five different tables. Descriptions of the data, as well as the first few rows of each file, can be found here.
I'm currently just working with the flows table in an effort to gauge performance. Here are the results from \d flows:
Table "public.flows"
Column | Type | Modifiers
------------+-------------------+-----------
time | real |
duration | real |
src_comp | character varying |
src_port | character varying |
dest_comp | character varying |
dest_port | character varying |
protocol | character varying |
pkt_count | real |
byte_count | real |
Indexes:
"flows_dest_comp_idx" btree (dest_comp)
"flows_dest_port_idx" btree (dest_port)
"flows_protocol_idx" btree (protocol)
"flows_src_comp_idx" btree (src_comp)
"flows_src_port_idx" btree (src_port)
Here are the results from EXPLAIN ANALYZE SELECT src_comp, COUNT(DISTINCT dest_comp) FROM flows GROUP BY src_comp;, which I thought would be a relatively simple query:
GroupAggregate (cost=34749736.06..35724568.62 rows=200 width=64) (actual time=1292299.166..1621191.771 rows=11154 loops=1)
Group Key: src_comp
-> Sort (cost=34749736.06..35074679.58 rows=129977408 width=64) (actual time=1290923.435..1425515.812 rows=129977412 loops=1)
Sort Key: src_comp
Sort Method: external merge Disk: 2819360kB
-> Seq Scan on flows (cost=0.00..2572344.08 rows=129977408 width=64) (actual time=26.842..488541.987 rows=129977412 loops=1)
Planning time: 6.575 ms
Execution time: 1636290.138 ms
(8 rows)
If I'm interpreting this correctly (which I might not be, since I'm new to PostgreSQL), it's saying that my query took almost 30 minutes to execute, which is much, much longer than I would expect, even with ~130 million rows.
My computer has an 8th-gen i7 quad-core CPU, 16 GB of RAM, and a 2 TB HDD (full specs can be found here).
My questions then are: 1) is this performance to be expected, and 2) is there anything I can do to speed it up, other than buying an external SSD?
1 - src_comp and dest_comp, which are used by the query, are both indexed. However, they are indexed independently. If you had a composite index on (src_comp, dest_comp), there is a possibility that the database could satisfy the whole query from indexes, eliminating the full table scan.
2 - src_comp and dest_comp are character varying. That is NOT a good thing for indexed fields unless it's necessary. What are these values really? Numbers? IP addresses? Computer network names? If there is a relatively finite number of them and they can be identified as they are added to the database, change them to integers used as foreign keys into other tables; that will make a HUGE difference in this query. If they can't be stored that way, but they at least have a definite finite length - e.g., 15 characters for IPv4 addresses in dotted-quad format - then set a maximum length for the fields, which should help some.
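A rough sketch of both suggestions (the index name and the "computers" lookup table are illustrative, not taken from your schema):

-- (1) Composite index so the GROUP BY / COUNT(DISTINCT) can be served from an index
CREATE INDEX flows_src_dest_idx ON flows (src_comp, dest_comp);

-- (2) Replace free-form strings with integer surrogate keys
CREATE TABLE computers (
    id   serial PRIMARY KEY,
    name character varying UNIQUE NOT NULL
);

ALTER TABLE flows
    ADD COLUMN src_comp_id  integer REFERENCES computers (id),
    ADD COLUMN dest_comp_id integer REFERENCES computers (id);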
I have a table where I keep record of who is following whom on a Twitter-like application:
\d follow
Table "public.follow" .
Column | Type | Modifiers
---------+--------------------------+-----------------------------------------------------
xid | text |
followee | integer |
follower | integer |
id | integer | not null default nextval('follow_id_seq'::regclass)
createdAt | timestamp with time zone |
updatedAt | timestamp with time zone |
source | text |
Indexes:
"follow_pkey" PRIMARY KEY, btree (id)
"follow_uniq_users" UNIQUE CONSTRAINT, btree (follower, followee)
"follow_createdat_idx" btree ("createdAt")
"follow_followee_idx" btree (followee)
"follow_follower_idx" btree (follower)
The number of entries in the table is more than a million, and when I run EXPLAIN ANALYZE on the query I get this:
explain analyze SELECT "follow"."follower"
FROM "public"."follow" AS "follow"
WHERE "follow"."followee" = 6
ORDER BY "follow"."createdAt" DESC
LIMIT 15 OFFSET 0;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.43..353.69 rows=15 width=12) (actual time=5.456..21.497 rows=15 loops=1)
-> Index Scan Backward using follow_createdat_idx on follow (cost=0.43..61585.45 rows=2615 width=12) (actual time=5.455..21.488 rows=15 loops=1)
Filter: (followee = 6)
Rows Removed by Filter: 62368
Planning time: 0.068 ms
Execution time: 21.516 ms
Why is it doing a backward index scan on follow_createdat_idx when execution could have been faster if it had used follow_followee_idx?
This query takes around 33 ms on the first run and around 22 ms on subsequent calls, which I feel is on the high side.
I am using Postgres 9.5 provided by Amazon RDS. Any idea what could be going wrong here?
The multicolumn index on (followee, "createdAt") that user1937198 suggested is perfect for the query - as you found in your test already.
Since "createdAt" can be NULL (it is not defined NOT NULL), you may want to add NULLS LAST to the query and the index:
...
ORDER BY "follow"."createdAt" DESC NULLS LAST
And:
"follow_follower_createdat_idx" btree (follower, "createdAt" DESC NULLS LAST)
More:
PostgreSQL sort by datetime asc, null first?
There are other, minor performance implications:
The multicolumn index on (followee, "createdAt") is 8 bytes per row bigger than the simple index on (followee) - 44 bytes vs 36. More (btree indexes have mostly the same page layout as tables):
Making sense of Postgres row sizes
Columns involved in an index in any way cannot be changed with a HOT update, so adding "createdAt" to an index blocks that optimization for updates to the column - though updating a creation timestamp seems particularly unlikely given the column name. And since you already have another index on just ("createdAt"), that's not an issue anyway. More:
PostgreSQL Initial Database Size
There is no downside to having another index on just ("createdAt") other than the maintenance cost (which affects write performance, not read performance); both indexes support different queries. You may or may not need the index on just (followee) additionally. Detailed explanation:
Is a composite index also good for queries on the first field?
I have to optimize my little-big database because it's too slow; maybe we'll find another solution together.
First of all, let's talk about the data stored in the database. There are two objects: users and, let's say, messages.
Users
It looks something like this:
+----+---------+-------+-----+
| id | user_id | login | etc |
+----+---------+-------+-----+
| 1 | 100001 | A | ....|
| 2 | 100002 | B | ....|
| 3 | 100003 | C | ....|
|... | ...... | ... | ....|
+----+---------+-------+-----+
There is no problem with this table. (Don't be put off by id and user_id; user_id is used by another application, so it has to be here.)
Messages
The second table is the one with the problem. Each user has messages, for example like this:
+----+---------+------+----+
| id | user_id | from | to |
+----+---------+------+----+
| 1 | 1 | aab | bbc|
| 2 | 2 | vfd | gfg|
| 3 | 1 | aab | bbc|
| 4 | 1 | fge | gfg|
| 5 | 3 | aab | gdf|
|... | ...... | ... | ...|
+----+---------+------+----+
There is no need to edit messages, but there should be a way to update the list of messages for a user. For example, an external service sends all of a user's messages to the DB and the list has to be updated.
And the most important thing is that there are about 30 million users and the average user has 500+ messages. Another problem is that I have to search through the from field and count the number of matches. I designed a simple SQL query with a join (roughly the shape shown below), but it takes too much time to get the data.
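Something like this, simplified (table and column names as in the tables above):

-- Count matches on "from" for a given sender, per user
-- (FROM is a reserved word, hence the quoting)
SELECT u.user_id, count(*) AS matches
FROM   users    u
JOIN   messages m ON m.user_id = u.id
WHERE  m."from" = 'aab'
GROUP  BY u.user_id;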
So... it's quite a big amount of data. I decided not to stay with an RDBMS (I used PostgreSQL) and to move to databases like ClickHouse and so on.
However, I ran into the problem that, for example, ClickHouse doesn't support the UPDATE statement.
To resolve this issue I decided to store all of a user's messages in one row, so the Messages table would look like this:
Here I'd like to store the messages in JSON format:

{"from":"aaa", "to":"bbe"}
{"from":"ret", "to":"fdd"}
{"from":"gfd", "to":"dgf"}
             ||
             \/
+----+---------+----------+------+      And there I'd like to store the
| id | user_id | messages | hash |  <=  hash of the messages.
+----+---------+----------+------+
I think that full-text search inside the messages column will save some time, resources and so on.
Do you have any ideas? :)
In ClickHouse, the optimal way is to store data in one "big flat table".
So, you store every message in a separate row.
15 billion rows is Ok for ClickHouse, even on single node.
Also, it's reasonable to have the user attributes directly in the messages table (pre-joined), so you don't need to do JOINs. This is suitable if the user attributes are not updated.
These attributes will have repeated values for each of a user's messages - that's OK because ClickHouse compresses data well, especially repeated values.
If user attributes are updated, consider storing the users table in a separate database and using the 'external dictionaries' feature to join it.
If a message is updated, don't update it in place. Write another row with the modified message to the table instead and leave the old message as is.
It's important to have the right primary key for your table. You should use a table engine from the MergeTree family, which constantly reorders data by primary key and so maintains the efficiency of range queries. The primary key is not required to be unique; for example, you could define the primary key as just (from) if you frequently filter on from = ... and those queries must be processed quickly.
Alternatively, you could use user_id as the primary key if queries by user id are frequent and must be processed as fast as possible, but then queries with a predicate on from will scan the whole table (mind that ClickHouse does full scans efficiently).
If you need fast lookups by many different attributes, you can simply duplicate the table with different primary keys. Typically the table compresses well enough that you can afford to keep the data in a few copies, ordered differently for different range queries.
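As a minimal sketch of such a flat table (column names and types are assumptions, and from/to are renamed to avoid the reserved words):

CREATE TABLE messages
(
    user_id   UInt32,
    msg_from  String,   -- the "from" field
    msg_to    String,   -- the "to" field
    login     String    -- pre-joined user attribute, repeated per message
)
ENGINE = MergeTree
ORDER BY (user_id, msg_from);   -- non-unique primary key: fast lookups by user_id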
First of all, when you have such a big dataset, the from and to columns should be integers if possible, as integer comparisons are faster.
Second, you should consider creating proper indexes. Since each user has relatively few records (500 compared to 30M in total), they should give you a huge performance benefit.
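For example, something along these lines (index names are illustrative):

-- Narrow down to one user's messages quickly
CREATE INDEX messages_user_id_idx ON messages (user_id);

-- Support counting matches on "from" (quoted because FROM is a reserved word)
CREATE INDEX messages_from_idx ON messages ("from");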
If everything else fails, consider using partitions:
https://www.postgresql.org/docs/9.1/static/ddl-partitioning.html
In your case they would have to be created dynamically, which would hinder first-time inserts immensely, so I would consider them only as a last (if potentially very effective) resort.