It is known that database indexing only makes sense if you have large tables and more reads than writes as the creation of the indices leads to additional writing overhead as each modification in the database also leads to a modification of the indices.
If we assume that the indexing structure of the database is
a) a B+ tree
b) a hashtable
what is a rule of thumb for the ratio of reads to writes at which it starts to make sense to implement database indexing?
For information on how database indexing works, check out How does database indexing work?
There are many factors involved here. For example, is the index data in the cache? Are the data blocks in the cache? How many rows are retrieved by the query? How many rows are in the table? And so on.
I have been asked this question (or a variant of it) many times. Rather than give a number, I prefer to give some information so that the person asking the question can make an informed decision. Here is a "back of the envelope" calculation example with some stated assumptions.
Let's say the table has 100M rows
Let's say a row is 200 bytes on average
That means the table consumes about 19GB (ignoring any compression)
Let's assume that the index is fully cached and takes zero time
Let's assume that the data has to be read from disk, at 5ms per read
Let's assume that your IO subsystem can deliver data at 10 GB/s
Let's assume that you need to read 50,000 rows from your table. Let's also assume that the rows are spread out such that two of the required rows live in the same block.
OK, now we can do some math.
Scenario 1. Using an index
Index reads are zero time
Data block reads: 25,000 blocks at 5ms each is 125s
Scenario 2. Scanning the table
19GB scanned at 10GB/s is 2s
Therefore in this case, scanning the data is much faster than using an index.
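The arithmetic above can be sketched in a few lines of Python. The function and variable names are made up for illustration, and "GB" is taken as 2^30 bytes so that the table size matches the "about 19GB" figure; all other numbers are the stated assumptions.

```python
GIB = 1024 ** 3  # assumption: "GB" above means 2**30 bytes

def index_read_seconds(rows_needed, rows_per_block, seconds_per_read):
    """Time to fetch rows via a fully cached index: one random read per data block."""
    blocks = rows_needed / rows_per_block
    return blocks * seconds_per_read

def full_scan_seconds(table_rows, bytes_per_row, bytes_per_second):
    """Time to stream the whole table sequentially."""
    return table_rows * bytes_per_row / bytes_per_second

index_s = index_read_seconds(50_000, 2, 0.005)
scan_s = full_scan_seconds(100_000_000, 200, 10 * GIB)

# Number of rows at which both approaches would take the same time,
# under the same assumptions (2 rows per block, 5 ms per block read).
break_even_rows = scan_s * 2 / 0.005

print(f"index: {index_s:.0f} s, scan: {scan_s:.2f} s, "
      f"break-even at about {break_even_rows:.0f} rows")
```

Under these assumptions the index only wins when the query fetches fewer than several hundred rows; changing any input (cache hit rate, rows per block, IO bandwidth) moves that break-even point.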
Costs formula
Nothing:
insert: O(1)
search O(n)
cost search = n db lookup operations
cost insert = db insert
a) B+ tree of order b
insert: O (log n)
search: O (log n)
cost search = log_b n lookup operations + db lookup
cost insert = log_b n lookup operations + db insert
With increasing order, the number of lookup operations goes down, but the cost per lookup operation increases.
b) Hashtable
insert: O(1)
search: O(1)
cost search = hash calculation [+ extra buckets when collisions happen] + db lookup
cost insert = hash calculation [+ extra buckets when collisions happen] + db insert
Cost at n = 1000
Nothing:
cost search = 1000 db lookup operations
cost insert = 1 db insert
a1) B+ tree of order 2
cost search = 10 lookup operations + 1 db lookup
cost insert = 10 lookup operations + 1 db insert
a2) B+ tree of order 10
cost search = 3 lookup operations + 1 db lookup
cost insert = 3 lookup operations + 1 db insert
b) Hashtable
cost search = hash calculation [+ extra buckets when collisions happen] + db lookup
cost insert = hash calculation [+ extra buckets when collisions happen] + db insert
Cost at n = 1000000
Nothing:
cost search = 1000000 db lookup operations
cost insert = 1 db insert
a1) B+ tree of order 2
cost search = 20 lookup operations + 1 db lookup
cost insert = 20 lookup operations + 1 db insert
a2) B+ tree of order 10
cost search = 6 lookup operations + 1 db lookup
cost insert = 6 lookup operations + 1 db insert
b) Hashtable
cost search = hash calculation [+ extra buckets when collisions happen] + db lookup
cost insert = hash calculation [+ extra buckets when collisions happen] + db insert
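The lookup counts in the tables above can be reproduced with a small sketch. It assumes the simplified model used in this answer (log_b n index lookups plus one db lookup, rounded up to whole levels); the helper names are hypothetical.

```python
def tree_levels(n, order):
    """Smallest L with order**L >= n, computed with integers to avoid float rounding."""
    levels, capacity = 0, 1
    while capacity < n:
        capacity *= order
        levels += 1
    return levels

def search_cost(n, order=None):
    """Return (index lookups, db lookups) for a single-key search.

    order=None models the "Nothing" case: no index, so up to n db lookups.
    """
    if order is None:
        return (0, n)
    return (tree_levels(n, order), 1)

for n in (1000, 1_000_000):
    print(n, "no index:", search_cost(n),
          "order 2:", search_cost(n, 2),
          "order 10:", search_cost(n, 10))
```

The hash table case is left out because its cost in this model does not depend on n.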
There are also quite large cost gains to be had from hitting the hardware caches on subsequent accesses. The exact gains depend on the hardware your database is installed on.
The costs are not all easily comparable. The hash calculation is generally more expensive than a single lookup, but it stays the same as n gets large. B tree lookups and sequential DB lookups are likely to hit hardware caches (because entire blocks are loaded).
The larger a table, the more important it is to have an index (see cost at n=1000 vs n=1000000). So the number of writes vs reads will vary with the size of your table.
You should also take your specific queries into account. For example, hash tables are not ordered while B trees are. So if a query needs to pick up all values between a minimum and a maximum value, a B tree would perform better than a hash table (a good hash is uniformly distributed).
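As a rough illustration of the range-query point, here is a sketch that uses a sorted Python list as a stand-in for the ordered leaves of a B tree and a dict as a stand-in for a hash index. Neither is how a real database stores its indexes; the keys and helper names are invented.

```python
import bisect

keys = [3, 8, 15, 23, 42, 57, 91]            # kept sorted, like B tree leaf order
hash_index = {k: f"row-{k}" for k in keys}   # unordered lookup by exact key

def range_sorted(sorted_keys, lo, hi):
    """Two binary searches find the slice boundaries; only matching keys are touched."""
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[left:right]

def range_hash(index, lo, hi):
    """A hash index has no order, so every key must be examined."""
    return sorted(k for k in index if lo <= k <= hi)

print(range_sorted(keys, 10, 60))
print(range_hash(hash_index, 10, 60))
```

Both calls return the same keys, but the first does work proportional to log n plus the result size, while the second always walks all n keys.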
In general, you would have to measure performance of the queries and inserts you would be using in practice. You could start without an index and then add one when you need it. Indexes can be added later without having to change the queries of the programs using your DB (the response time of the queries will change though, reads will get faster if they use the index, writes will get slower).
If you are in the specific case where you have a load operation with a lot of inserts followed by a period of reads, you can temporarily turn off the index and recalculate it after the load.
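A minimal sketch of that load pattern, using SQLite from the Python standard library as a stand-in database. SQLite cannot disable an index, so the sketch drops and recreates it; other systems have their own mechanisms. The table and index names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, kind INTEGER)")
conn.execute("CREATE INDEX idx_events_kind ON events (kind)")

def bulk_load(conn, rows):
    """Drop the secondary index, insert everything, then rebuild the index once."""
    conn.execute("DROP INDEX IF EXISTS idx_events_kind")
    with conn:  # one transaction for the whole load
        conn.executemany("INSERT INTO events (id, kind) VALUES (?, ?)", rows)
    conn.execute("CREATE INDEX idx_events_kind ON events (kind)")

bulk_load(conn, [(i, i % 10) for i in range(10_000)])

row_count = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
index_names = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'index'")]
print(row_count, index_names)
```

Whether this is actually faster than loading with the index in place depends on the data volume and the database, so it is worth timing both variants.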
References:
B trees vs hash tables:
https://cs.stackexchange.com/questions/270/hash-tables-versus-binary-trees
Cache
http://www.moreprocess.com/devices/computer-memory-hierarchy-internal-register-cache-ram-hard-disk-magnetic-tape
The point of using a database is usually to cut the time spent looking for something in the data, so it is mostly about reading. I am very sure that 90% of people use a database for reading, if not 100%.
Let's think of a few cases:
Writes only, no reads: what's the point? Just for backup? Well, the backup will be used for reads when it is restored.
Many writes, few reads: I can't think of such a situation, but if you do have it, which one do you value more: presenting the data quickly, or saving the data quickly? If you value how fast data is saved, you can remove the index and have a nightly run prepare the data for presentation (I'm sure it's big enough to need a nightly run to summarize it, considering how fast the data is being saved). You then have near-real-time writes and reads that lag by H+1 or more, which raises the question: what's the point of such a configuration?
Few writes, many reads: this is the most common situation.
No writes, only reads: you can't produce this situation at all. With no writes, what is there to read?
Sorry for my English, hope it helps.
I'm a beginner with DynamoDB, my online instructor doesn't answer his Q/A lol, and I've been confused about this.
I know that the partition key decides the partition in which the item will be placed.
I also know that the number of partitions is calculated based on throughput or storage using the famous formulas.
So let's say a table has user_id as its partition key, with 200 user_ids. Does that automatically mean that we have 200 partitions? If so, why didn't we calculate the number of partitions based on the famous formulas?
Thanks
Let's establish 2 things.
A DynamoDB partition can support 3000 read operations and 1000 write operations per second. It keeps a divider between read and write ops so they do not interfere with each other. If you had a table that was configured to support 18000 reads and 6000 writes, you'd have at least 12 partitions, but probably a few more for some headroom.
A provisioned capacity table has 1 partition by default, but an on-demand table has 4 partitions by default.
So, to answer your question directly: just because you have 200 items does not mean you have 200 partitions. It is very possible for those 200 items to be in just one partition if your table is in provisioned capacity mode. If the configuration of the table changes or it takes on more traffic, those items might move around to new partitions.
There are a few distinct times where DynamoDB will add partitions.
When a partition grows larger than 10GB in storage size. DynamoDB might see that you are taking on data and try to do this proactively, but 10GB is the cutoff.
When your table needs to support more operations per second than it is currently doing. This can happen manually because you configured your table to support 20,000 reads/sec where before it only supported 2000. DynamoDB would have to add partitions and move data to be able to handle that 20,000 reads/sec. Or it can happen automatically because you configured floor and ceiling values in DynamoDB auto-scaling: DynamoDB senses your ops/sec is climbing and adjusts the number of partitions in response to capacity exceptions.
When your table is in on-demand capacity mode and DynamoDB attempts to automatically keep 2x your previous high water mark of capacity. For example, say your table just reached 10,000 RCU for the first time. DynamoDB would see that this is past your previous high water mark and start adding more partitions as it tries to keep 2x the capacity at the ready in case you peak again like you just did.
When DynamoDB, which actively monitors your table, sees that one or more items in the same partition are being hit particularly hard (hot keys), which might create a hot partition. If that is happening, DynamoDB might split the partition to help isolate those items and prevent or fix a hot partition situation.
There are one or two other more rare edge cases, but you'd likely be talking to AWS Support if you encountered this.
Note: Once DynamoDB creates partitions, the number of partitions never shrinks and this is ok. Throughput dilution is no longer a thing in DynamoDB.
The partition key value is hashed to determine the actual partition to place the data item into.
Thus the number of distinct partition key values has zero effect on the number of physical partitions.
The only things that affect the physical number of partitions are RCUs/WCUs (throughput) and the amount of data stored.
Number of partitions for throughput: Pt = RCU/3000 + WCU/1000
Number of partitions for storage: Ps = GB/10
Unless one of the above is more than 1.0, there will likely be only a single partition. I'm sure the split happens as you approach the limits, but exactly when is something only AWS knows.
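To make the two points concrete (partition count comes from throughput and storage, and keys are hashed into whatever partitions exist), here is a hedged sketch. Rounding up and taking the larger of Pt and Ps are assumptions about how the formulas are combined, and the MD5-based placement is only a stand-in, since DynamoDB's real hash function and split timing are internal to AWS.

```python
import hashlib
import math

def partitions_for_throughput(rcu, wcu):
    """Pt = RCU/3000 + WCU/1000, rounded up (rounding is an assumption)."""
    return math.ceil(rcu / 3000 + wcu / 1000)

def partitions_for_storage(size_gb):
    """Ps = GB/10, rounded up (rounding is an assumption)."""
    return math.ceil(size_gb / 10)

def estimated_partitions(rcu, wcu, size_gb):
    """Assumed combination: the larger of Pt and Ps, and never fewer than 1."""
    return max(1, partitions_for_throughput(rcu, wcu), partitions_for_storage(size_gb))

def partition_of(key, partition_count):
    """Stand-in placement: hash the key, then map the hash onto a partition number."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count

small_table = estimated_partitions(rcu=1000, wcu=500, size_gb=2)
user_ids = [f"user_{i}" for i in range(200)]
used = {partition_of(u, small_table) for u in user_ids}
print(small_table, len(used))
print(estimated_partitions(rcu=18000, wcu=6000, size_gb=5))
```

With a small table, all 200 distinct user_id values land in the same single partition, which is the point of the answer: the number of key values does not drive the number of partitions.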
I have been searching for good information about index benchmarking on PostgreSQL and found nothing really good.
I need to understand how PostgreSQL behaves while handling a huge amount of records.
Let's say 2000M records on a single non-partitioned table.
Theoretically, b-trees are O(log(n)) for reads and writes, but in practice I think that's kind of an ideal scenario that does not consider things like HUGE indexes not fitting entirely in memory (swapping?) and maybe other things I am not seeing at this point.
There are no JOIN operations, which is fine, but note this is not an analytical database and response times below 150ms (less the better) are required. All searches are expected to be done using indexes, of course. Where we have 2-3 indexes:
UUID PK index
timestamp index
VARCHAR(20) index (non unique but high cardinality)
My concern is how writes and reads will perform once the table reaches its expected top capacity (2500M records).
... so specific questions might be:
May "adding more iron" achieve reasonable performance in such scenario?
NOTE this is non-clustered DB so this is vertical scaling.
What would be the main sources of time consumption for both reads and writes?
What would be the amount of records on a table that we can consider "too much" for this standard setup on PostgreSql (no cluster, no partitions, no sharding)?
I know this amount of records suggests taking some alternative (sharding, partitioning, etc) but this question is about learning and understanding PostgreSQL capabilities more than other thing.
There should be no performance degradation inserting into or selecting from a table, even if it is large. The expense of index access grows with the logarithm of the table size, but the base of the logarithm is large, and the depth of the index cannot be more than 5 or 6. The upper levels of the index are probably cached, so you end up with a handful of blocks read when accessing a single table row per index scan. Note that you don't have to cache the whole index.
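A quick way to see why the depth stays small is to count levels for a given fanout. The fanout values below (how many child pointers fit in one index page) are illustrative assumptions, not measured PostgreSQL numbers; real fanout depends on key size and page fill.

```python
def btree_depth(rows, fanout):
    """Number of levels needed so that fanout**depth >= rows (integer arithmetic)."""
    depth, capacity = 1, fanout
    while capacity < rows:
        capacity *= fanout
        depth += 1
    return depth

ROWS = 2_500_000_000  # the expected top capacity from the question

for fanout in (100, 200, 500):  # assumed fanouts, for illustration only
    print(f"fanout {fanout}: depth {btree_depth(ROWS, fanout)}")
```

Even with a pessimistic fanout of 100 entries per page, 2500M rows fit in 5 levels, so a lookup touches only a handful of index blocks, most of which sit in the cached upper levels.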
Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:
PK1 PK2 PK3 V1 V2
Where PK represents a partition key and V represents a value. So in the above example, the partition key consists of 3 columns.
Scylla requires that you specify all columns of the partition key in the WHERE clause.
If you want to execute a query while specifying only some of the columns, you'd get a warning, as this requires a full table scan:
SELECT V1, V2 FROM table WHERE PK1 = X AND PK2 = Y
In the above query, we only specify 2 out of 3 columns. Suppose the query matches 1 billion out of 10 billion rows - what is a good mental model to think about the cost/performance of this query?
My assumption is that the cost is high: It is equivalent to executing ten billion separate queries on the data set since 1) there is no logical association between the rows in the way the rows are stored to disk as each row has a different partition key (high cardinality) 2) in order for Scylla to determine which rows match the query it has to scan all 10 billion rows (even though the result set only matches 1 billion rows)
Assuming a single server can process 100K transactions per second (well within the range advertised by ScyllaDB folks) and the data resides on 100 servers, the (estimated) time to process this query can be calculated as: 100K * 100 = 10 million queries per second. 10 billion divided by 10M = 1,000 seconds. So it would take the cluster approx. 1,000 seconds to process the query (consuming all of the cluster resources).
Is this correct? Or is there any flaw in my mental model of how Scylla processes such queries?
Thanks
As you suggested yourself, Scylla (and everything I will say in my answer also applies to Cassandra) keeps the partitions hashed by the full partition key - containing three columns. So Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions, and check for each of them whether its partition key matches the request.
However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data itself isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. There's a lot of work that goes into random-access reads - the request needs to reach a coordinator which then sends it to replicas, each replica needs to find the specific position in its on-disk data files (often multiple files), often needs to over-read from the disk (as disk and compression alignments require), and so on. Compare this to a scan - which can read long contiguous swathes of data sorted by tokens (partition-key hash) from disk and can return many rows fairly quickly with fewer I/O operations and less CPU work.
So if your example setup can do 100,000 random-access reads per node, it can probably read a lot more than 100,000 rows per second during a scan. I don't know which exact number to give you, but in the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's 12 million rows per second on each node instead of your estimate of 100,000. So your example use case can potentially be over in just 8.3 seconds, instead of 1000 seconds as you calculated.
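The two estimates can be put side by side in a short sketch. It uses only numbers quoted in the question and the answer (10 billion rows, 100 nodes, 100,000 reads per node per second, and a billion rows per second across 83 nodes) and assumes the per-node scan rate from the blog post would carry over to this cluster, which is not guaranteed.

```python
TOTAL_ROWS = 10_000_000_000
NODES = 100

def scan_seconds(total_rows, nodes, rows_per_node_per_second):
    """Time to touch every row if all nodes work in parallel at the given rate."""
    return total_rows / (nodes * rows_per_node_per_second)

# The question's model: each row costs as much as one random-access read.
random_read_estimate = scan_seconds(TOTAL_ROWS, NODES, 100_000)

# The answer's model: per-node scan rate taken from the blog post (1e9 rows/s on 83 nodes).
blog_rate_per_node = 1_000_000_000 / 83
scan_estimate = scan_seconds(TOTAL_ROWS, NODES, blog_rate_per_node)

print(f"random-read model: {random_read_estimate:.0f} s")
print(f"scan model: {scan_estimate:.1f} s "
      f"(about {blog_rate_per_node / 1e6:.0f}M rows/s per node)")
```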
Finally, please don't forget (and this is also mentioned in the aforementioned blog post) that if you do a large scan you should explicitly parallelize, i.e., split the token range into pieces and scan them in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more-or-less unavoidable. Second, scanning returns partitions in partition order, which (as I explained above) sit contiguously on individual replicas - which is great for peak throughput but also means that only one node (or even one CPU) will be active at any time during the scan. So it's important to split the scan into pieces and do all of them in parallel. We also had a blog post about the importance of parallel scan, and how to do it: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.
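Here is a sketch of the "split the token range into pieces" step. It assumes the default Murmur3 partitioner, whose tokens span -2^63 to 2^63-1, and uses hypothetical keyspace, table and column names. The `scan_piece` function only builds the CQL text; in a real client it would execute the statement through a driver session and consume the rows.

```python
from concurrent.futures import ThreadPoolExecutor

MIN_TOKEN = -(2 ** 63)      # assumption: default Murmur3 partitioner token range
MAX_TOKEN = 2 ** 63 - 1

def split_token_range(pieces):
    """Split the full token range into `pieces` contiguous, non-overlapping subranges."""
    total = MAX_TOKEN - MIN_TOKEN + 1
    ranges, start = [], MIN_TOKEN
    for i in range(1, pieces + 1):
        end = MIN_TOKEN + (total * i) // pieces - 1
        ranges.append((start, end))
        start = end + 1
    return ranges

def scan_piece(token_range):
    """Placeholder worker: returns the query a real worker would run for its subrange."""
    lo, hi = token_range
    # ks.tbl, pk1..pk3, v1 and v2 are hypothetical names.
    return ("SELECT pk1, pk2, pk3, v1, v2 FROM ks.tbl "
            f"WHERE token(pk1, pk2, pk3) >= {lo} AND token(pk1, pk2, pk3) <= {hi}")

ranges = split_token_range(8)
with ThreadPoolExecutor(max_workers=8) as pool:
    queries = list(pool.map(scan_piece, ranges))

print(queries[0])
```

Each worker filters on PK1 and PK2 in the rows it receives. In practice the number of pieces is usually much larger than the number of workers, so that work stays evenly spread across nodes and CPUs.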
Another option is to turn one of the PK columns into a clustering key. That way, if you have the first two PKs, you'll be able to locate the partition and just search within it.
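As a mental model only (this is plain Python, not Scylla's storage engine), the difference between the two layouts can be sketched with dictionaries keyed by the partition key. The sample rows and helper names are invented.

```python
from collections import defaultdict

# (pk1, pk2, pk3, v1, v2) sample rows, invented for illustration
rows = [
    ("a", "x", 1, 10, 20),
    ("a", "x", 2, 11, 21),
    ("a", "y", 1, 12, 22),
    ("b", "x", 1, 13, 23),
]

def build_layout(rows, partition_key_len):
    """Group rows by the first `partition_key_len` columns (the partition key)."""
    layout = defaultdict(list)
    for row in rows:
        layout[row[:partition_key_len]].append(row[partition_key_len:])
    return layout

def find_by_scan(layout, pk1, pk2):
    """Three-column partition key: every partition must be inspected."""
    inspected, matches = 0, []
    for key, values in layout.items():
        inspected += 1
        if key[:2] == (pk1, pk2):
            matches.extend(values)
    return matches, inspected

def find_direct(layout, pk1, pk2):
    """Two-column partition key, pk3 as clustering key: one partition lookup."""
    return layout.get((pk1, pk2), []), 1

three_col = build_layout(rows, 3)
two_col = build_layout(rows, 2)
print(find_by_scan(three_col, "a", "x"))
print(find_direct(two_col, "a", "x"))
```

The trade-off is that partitions become larger, since all rows sharing PK1 and PK2 now live together, so this only makes sense if no single (PK1, PK2) pair holds too much data.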
My understanding is that creating indexes on small tables could cost more than it benefits.
For example, there is no point creating indexes on a table with fewer than 100 rows (or even 1000 rows?)
Is there any specific number of rows as a threshold for creating indexes?
Update 1
The more I investigate, the more conflicting information I get. I might be too concerned about preserving IO write operations, since my SQL Server database is in HA synchronous-commit mode.
Point #1:
This question concerns very much the IO write performance. With scenarios like SQL Server HA Synchronous-commit mode, the cost of IO write is high when database servers reside in cross subnet data centers. Adding indexes adds to the expensive IO write cost.
Point #2:
Books Online suggests:
Indexing small tables may not be optimal because it can take the query
optimizer longer to traverse the index searching for data than to
perform a simple table scan. Therefore, indexes on small tables might
never be used, but must still be maintained as data in the table
changes.
I am not sure adding an index to a table with only one row will ever have any benefit - or am I wrong?
Your understanding is wrong. Small tables also benefit from indexes, especially when they are joined with bigger tables.
The cost of an index has two parts: storage space and processing time during inserts/updates. The first is very cheap these days, so it can almost be disregarded. So your only consideration should be tables with a lot of updates and inserts, where you need to apply the proper configuration.
When I am searching for rows satisfying a certain condition:
SELECT something FROM table WHERE type = 5;
Is the difference in time linear when I execute this query on a table containing 10K rows versus one containing 10M rows?
In other words - is making this kind of query on a 10K table 1000 times faster than making it on a 10M table?
My table contains a column type which contains numbers from 1 to 10. The most frequent query on this table will be the one above. If the difference in performance is real, I will have to make 10 tables, one for each type, to achieve better performance. If this is not really an issue, I will have two tables - one for the types, and a second one for the data with a column type_id.
EDIT:
There are multiple rows with the type value.
(Answer originally tagged postgresql and this answer is in those terms. Other DBMSes will vary.)
Like with most super broad questions, "it depends".
If there's no index present, then time is probably roughly linear, though with a nearly fixed startup cost plus some breakpoints - e.g. from when the table fits in RAM to when it no longer fits in RAM. All sorts of effects can come into play - memory banking and NUMA, disk readahead, parallelism in the underlying disk subsystem, fragmentation on the file system, MVCC bloat in the tables, etc - that make this far from simple.
If there's a b-tree index on the attribute in question, time is going to increase at a less than linear rate - probably around O(log n). How much less will vary based on whether the index fits in RAM, whether the table fits in RAM, etc. However, PostgreSQL usually then has to do a heap lookup for each index pointer, which adds random I/O cost rather unpredictably depending on the data distribution/clustering, caching and readahead, etc. It might be able to do an index-only scan, in which case this secondary lookup is avoided, if vacuum is running often enough.
So ... in extremely simplified terms, no index = O(n), with index ~= O(log n). Very, very approximately.
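To put rough numbers on that for the 10K versus 10M case from the question, here is a deliberately crude step-count model. It treats a scan as one step per row and an index lookup as a binary search to the first match, and it ignores the cost of fetching the matching rows themselves, caching, and everything else listed above.

```python
def scan_steps(rows):
    """No index: every row is examined once."""
    return rows

def index_steps(rows):
    """With an index: roughly ceil(log2(rows)) comparisons to reach the first match."""
    return max(1, (rows - 1).bit_length())

small, large = 10_000, 10_000_000

scan_ratio = scan_steps(large) / scan_steps(small)
index_ratio = index_steps(large) / index_steps(small)

print(f"scan: {scan_steps(small)} vs {scan_steps(large)} steps, ratio {scan_ratio:.0f}x")
print(f"index: {index_steps(small)} vs {index_steps(large)} steps, ratio {index_ratio:.2f}x")
```

In this model the unindexed query is about 1000 times slower on the larger table, while locating the first match through the index takes fewer than twice as many steps. With only 10 distinct type values, though, the query still returns about a tenth of the table, so returning the rows dominates in practice.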
I think the underlying intent of the question is along the lines of: is it faster to have 1000 tables of 1000 rows, or 1 table of 1,000,000 rows? If so: in the great majority of cases the single bigger table will be the better choice for performance and administration.