In Postgres, is overall indexing time for a column dependent on row count or on table disk space usage? - database

I'm working with a few large row count tables with a commonly named column (table_id). I now intend to add an index to this column in each table. One or two of these tables use up 10x more space than the others, but let's say for simplicity that all tables have the same row count. Basically, the 10x added space is because some tables have many more columns than others. I have two inter-related questions:
I'm curious about whether overall indexing time is a function of the table's disk usage or just the row count?
Additionally, would duplicate values in a column speed up the indexing time at all, or would it actually slow down indexing?
If indexing time is only dependent on row count, then all my tables should get indexed at the same speed. I'm sure someone could do benchmark tests to answer all these questions, but my disks are currently tied up indexing those tables.

The speed of indexing depends on the following factors:
Row count: this is one of the factors that most affects how long the index build takes.
Column type (int, text, bigint, json): the data type also influences indexing time.
Duplicate data: this mainly affects index size, not build speed, on which it has at most a very slight effect. If a column has a lot of duplicate values, its index is smaller.
Disk: the index can be created in a different tablespace that is configured to live on another disk. If the disk holding the index is an SSD and the other is a regular HDD, index creation will be faster.
Server configuration: PostgreSQL has memory usage and other parameters that affect index creation speed, so if parameters such as buffer memory are set high, indexing will be faster.

The speed of CREATE INDEX depends on several factors:
the kind of index
the number of rows
the speed of your disks
the setting of maintenance_work_mem and max_parallel_maintenance_workers
The effort to sort the data grows with O(n * log(n)), where n is the number of rows. Reading and writing should grow linearly with the number of rows.
I am not certain, but I'd say that duplicate rows should slow down indexing a little bit from v13 on, where B-tree index deduplication was introduced.
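To relate these growth rates to the question of row count versus disk usage, here is a rough sketch in Python. It is a toy model, not a benchmark: the function name `relative_build_cost` and the `bytes_per_row` parameter are made up for illustration, and it assumes the build reads every table row once and sorts one key per row.

```python
import math

def relative_build_cost(rows, bytes_per_row):
    """Toy model (assumption): returns (sort_work, heap_bytes_read)."""
    sort_work = rows * math.log2(rows)      # O(n * log(n)) sorting effort
    heap_bytes_read = rows * bytes_per_row  # linear in rows, scales with row width
    return sort_work, heap_bytes_read

# Same row count, but the second table has rows that are 10x wider.
narrow = relative_build_cost(100_000_000, 100)
wide = relative_build_cost(100_000_000, 1_000)
print(wide[0] / narrow[0])  # sort work ratio: 1.0
print(wide[1] / narrow[1])  # bytes read ratio: 10.0
```

Under these assumptions both tables need the same sorting effort, while the wider table has ten times as many bytes to read from disk. How much that extra reading matters in practice depends on the disks and settings listed above.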

Related

Index performance on Postgresql for big tables

I have been searching good information about index benchmarking on PostgreSQL and found nothing really good.
I need to understand how PostgreSQL behaves while handling a huge amount of records.
Let's say 2000M records on a single non-partitioned table.
Theoretically, b-trees are O(log(n)) for reads and writes, but in practice I think that's an ideal scenario that doesn't account for things like HUGE indexes not fitting entirely in memory (swapping?) and maybe other things I am not seeing at this point.
There are no JOIN operations, which is fine, but note this is not an analytical database and response times below 150ms (less the better) are required. All searches are expected to be done using indexes, of course. Where we have 2-3 indexes:
UUID PK index
timestamp index
VARCHAR(20) index (non unique but high cardinality)
My concern is how writes and reads will perform once the table reaches its expected top capacity (2500M records)
... so specific questions might be:
May "adding more iron" achieve reasonable performance in such scenario?
NOTE this is non-clustered DB so this is vertical scaling.
What would be the main sources of time consumption for both reads and writes?
What would be the amount of records on a table that we can consider "too much" for this standard setup on PostgreSql (no cluster, no partitions, no sharding)?
I know this amount of records suggests taking some alternative (sharding, partitioning, etc) but this question is about learning and understanding PostgreSQL capabilities more than other thing.
There should be no performance degradation inserting into or selecting from a table, even if it is large. The expense of index access grows with the logarithm of the table size, but the base of the logarithm is large, and the depth of the index cannot be more than 5 or 6. The upper levels of the index are probably cached, so you end up with a handful of blocks read when accessing a single table row per index scan. Note that you don't have to cache the whole index.
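The claim about index depth can be checked with a small back-of-the-envelope sketch. The fanout of 200 entries per index page is an assumption chosen for illustration (the real value depends on key size and page fill), and `btree_levels` is a hypothetical helper, not a PostgreSQL function.

```python
def btree_levels(rows, fanout):
    """Smallest number of levels such that fanout ** levels >= rows."""
    levels = 1
    capacity = fanout
    while capacity < rows:
        capacity *= fanout
        levels += 1
    return levels

# Assumed fanout of 200 entries per page.
for rows in (1_000_000, 100_000_000, 2_500_000_000):
    print(rows, btree_levels(rows, fanout=200))
# 1000000 3
# 100000000 4
# 2500000000 5
```

Going from one million to 2500 million rows adds only two levels in this model, which is why the cost of a single index lookup grows so slowly with table size.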

Oracle index fragmentation

I would like to know if there is any way other than "analyze index validate structure" to find out index fragmentation in an Oracle database, as this causes row locks in a production environment.
"analyze index validate structure online" doesn't populate the index_stats.
Thanks.
If your optimizer statistics are relatively current, you can do something like
get number of LEAF_BLOCKS from USER_INDEXES
get NUM_ROWS from USER_INDEXES
get AVG_COL_LEN for each column in the index from USER_TAB_COLUMNS
Summing the column lengths, plus 6 bytes for the rowid, times the number of rows gives you a total byte figure for the index entries.
Depending on the data within it, and how that data was inserted, an index will typically sit at around 65-90% full. Throw in some block-level overhead (let's say 200 bytes per 8k block), and you can use this to get an estimate of how many leaf blocks you expect the index to have.
If that is roughly close to the LEAF_BLOCKS statistic you have, then you can assume that the index is probably not "fragmented" (although that is a term that can cover a multitude of things).
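The estimate described above can be sketched as a short calculation. The helper name `expected_leaf_blocks`, the sample row count and the column lengths are made up for illustration; the 6-byte rowid, the 200-byte overhead per 8k block and the 65-90% fill range come from the steps above.

```python
def expected_leaf_blocks(num_rows, avg_col_lens, fill_pct,
                         block_size=8192, block_overhead=200, rowid_bytes=6):
    """Rough estimate of leaf blocks for an index (illustrative only)."""
    entry_bytes = sum(avg_col_lens) + rowid_bytes    # column lengths plus rowid
    total_bytes = entry_bytes * num_rows             # all index entries
    usable_per_block = (block_size - block_overhead) * fill_pct // 100
    return -(-total_bytes // usable_per_block)       # ceiling division

# Hypothetical index: 1,000,000 rows, two columns with AVG_COL_LEN 7 and 5.
low = expected_leaf_blocks(1_000_000, [7, 5], fill_pct=90)
high = expected_leaf_blocks(1_000_000, [7, 5], fill_pct=65)
print(low, high)  # 2503 3466
```

If the LEAF_BLOCKS statistic for such an index fell roughly within that range, the index would look reasonably compact by this measure; a value far above it would be a reason to look closer.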
But unless you have a performance issue that you can currently tie back to this index, then I wouldn't worry too much about index fragmentation.
A row lock is a row lock; as such it is very unlikely that this is due to index fragmentation (which is another topic).
A row lock is generally taken out by the application, and is normal behavior.

Optimum number of rows in a table for creating indexes

My understanding is that creating indexes on small tables could be more cost than benefit.
For example, there is no point creating indexes on a table with less than 100 rows (or even 1000 rows?)
Is there any specific number of rows as a threshold for creating indexes?
Update 1
The more I investigate, the more conflicting information I get. I might be too concerned about preserving IO write operations, since my SQL Server database is in HA synchronous-commit mode.
Point #1:
This question concerns very much the IO write performance. With scenarios like SQL Server HA Synchronous-commit mode, the cost of IO write is high when database servers reside in cross subnet data centers. Adding indexes adds to the expensive IO write cost.
Point #2:
Books Online suggests:
Indexing small tables may not be optimal because it can take the query
optimizer longer to traverse the index searching for data than to
perform a simple table scan. Therefore, indexes on small tables might
never be used, but must still be maintained as data in the table
changes.
I am not sure adding an index to a table with only one row will ever have any benefit - or am I wrong?
Your understanding is wrong. Small tables also benefit from indexes, especially when they are joined with bigger tables.
The cost of an index has two parts: storage space, and processing time during inserts and updates. The first is very cheap these days, so it can almost be disregarded. Your only consideration should be tables with a lot of updates and inserts, where you need to apply the proper configuration.

How does SELECTing from a table scale with table size?

When I am searching for rows satisfying a certain condition:
SELECT something FROM table WHERE type = 5;
Is the difference in time linear when I execute this query on a table containing 10K rows versus one containing 10M rows?
In other words - is running this kind of query on a 10K table 1000 times faster than running it on a 10M table?
My table contains a column type which contains numbers from 1 to 10. The most frequent query on this table will be the one above. If the difference in performance is real, I will have to make 10 tables, one for each type, to achieve better performance. If this is not really an issue, I will have two tables - one for the types, and the second one for data with column type_id.
EDIT:
There are multiple rows with the type value.
(Answer originally tagged postgresql and this answer is in those terms. Other DBMSes will vary.)
Like with most super broad questions, "it depends".
If there's no index present, then time is probably roughly linear, though with a nearly fixed startup cost plus some breakpoints - e.g. from when the table fits in RAM to when it no longer fits in RAM. All sorts of effects can come into play - memory banking and NUMA, disk readahead, parallelism in the underlying disk subsystem, fragmentation on the file system, MVCC bloat in the tables, etc - that make this far from simple.
If there's a b-tree index on the attribute in question, time is going to increase at a less than linear rate - probably around O(log n). How much less will vary based on whether the index fits in RAM, whether the table fits in RAM, etc. However, PostgreSQL usually then has to do a heap lookup for each index pointer, which adds random I/O cost rather unpredictably depending on the data distribution/clustering, caching and readahead, etc. It might be able to do an index-only scan, in which case this secondary lookup is avoided, if vacuum is running enough.
So ... in extremely simplified terms, no index = O(n), with index ~= O(log n). Very, very approximately.
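Applied to the 10K versus 10M example from the question, the simplified rule above works out as follows. This is only a sketch of the two growth rates: `scan_ratio` is a hypothetical helper, and the logarithmic figure covers finding an entry in the index, not fetching every matching row afterwards.

```python
import math

def scan_ratio(small_rows, large_rows):
    """How much more work the larger table needs under each simplified model."""
    seq_ratio = large_rows / small_rows                        # no index: O(n)
    index_ratio = math.log(large_rows) / math.log(small_rows)  # index: O(log n)
    return seq_ratio, index_ratio

seq_ratio, index_ratio = scan_ratio(10_000, 10_000_000)
print(seq_ratio)              # 1000.0
print(round(index_ratio, 2))  # 1.75
```

So without an index the larger table costs about 1000 times as much to scan, while the index descent costs less than twice as much. When many rows match, as with `type = 5` here, the time to fetch those rows still grows with their number.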
I think the underlying intent of the question is along the lines of: Is it faster to have 1000 tables of 1000 rows, or 1 table of 1,000,000 rows?. If so: In the great majority of cases the single bigger table will be the better choice for performance and administration.

How do indexes and disk seeks play well together?

I have another question, but I'll be more specific.
I see that selecting from a million-row table takes < 1 second. What I don't understand is how it might do this with indexes. A seek seems to take 10ms, so to finish within 1 second it must do < 100 seeks. If there is an index entry per row, then 1M rows need at least 1K blocks to store the index (actually more if it's 8 bytes per row (32-bit index value + 32-bit key offset)). Then we would need to actually travel to the rows and collect the data. How do databases keep the seeks low and pull that data as fast as they do?
One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read in a range of values along the indexed field, you find the first one, and you can just read it all in at once with no extra IO.
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records that the SQL server statistically expects to retrieve are so high that the random access requirement of going from the index to the underlying rows will require so many IO operations that it would be faster to do a table scan, then it will do a table scan instead. You can see this e.g. using the query planner in SQL Server or PostgreSQL. But for small ranges the index is usually better, and the query plan will reflect this.
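The trade-off in point 2 can be illustrated with a deliberately simplified cost comparison. This is not the formula any real planner uses: `cheaper_plan` is a hypothetical helper that assumes the worst case of one random page read per matching row. The default weights mirror PostgreSQL's default `seq_page_cost` (1.0) and `random_page_cost` (4.0) settings.

```python
def cheaper_plan(table_pages, matching_rows,
                 seq_page_cost=1.0, random_page_cost=4.0):
    """Toy comparison: sequential read of all pages vs. random read per row."""
    seq_scan_cost = table_pages * seq_page_cost
    index_scan_cost = matching_rows * random_page_cost  # assumed worst case
    return "index scan" if index_scan_cost < seq_scan_cost else "table scan"

print(cheaper_plan(table_pages=10_000, matching_rows=50))       # index scan
print(cheaper_plan(table_pages=10_000, matching_rows=500_000))  # table scan
```

For a small range the index wins easily, while for a large fraction of the table the random reads add up to more than one sequential pass. Real planners refine this with statistics on data distribution and caching, and their chosen plan is visible in the query plan output.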

Resources