How do indexes and disk seeks play well together?

How do indexes and disk seeks play well together? - database

I have another question but i'll be more specific.
I see that when selecting a million row table it takes < 1second. What I don't understand is how might it do this with indexes. It seems to take 10ms to do a seek so for it to succeed 1sec it must do <100seeks. If there is an index entry per row then 1M rows is at least 1K blocks to store the indexes (actually its higher if its 8bytes per row (32bit index value + 32 key offset)). Then we would need to actually travel to the rows and collect the data. How do databases keep the seeks low and pull that data as fast as they do?

One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read in a range of values along the indexed field, you find the first one, and you can just read it all in at once with no extra IO.
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records that the SQL server statistically expects to retrieve are so high that the random access requirement of going from the index to the underlying rows will require so many IO operations that it would be faster to do a table scan, then it will do a table scan instead. You can see this e.g. using the query planner in SQL Server or PostgreSQL. But for small ranges the index is usually better, and the query plan will reflect this.

Related

In Postgres, is overall indexing time for a column dependent on row count or on table disk space usage?

I'm working with a few large row count tables with a commonly named column (table_id). I now intend to add an index to this column in each table. One or two of these tables use up 10x more space than the others, but let's say for simplicity that all tables have the same row count. Basically, the 10x added space is because some tables have many more columns than others. I have two inter-related questions:
I'm curious about whether overall indexing time is a function of the table's disk usage or just the row count?
Additionally, would duplicate values in a column speed up the indexing time at all, or would it actually slow down indexing?
If indexing time is only dependent on row count, then all my tables should get indexed at the same speed. I'm sure someone could do benchmark tests to answer all these questions, but my disks are currently tied up indexing those tables.

The speed of indexing depends on the following factors:
Count of rows, it is one of the factors that most affect the speed of the index.
The type of column (int, text, bigint, json) is also one of the factors that influence indexing.
Duplicate data affects index size, not index speed. It may have a very slight effect on the speed of the index. It mainly affects the size. So if a column has a lot of duplicate data, the size of the column index decreases.
The speed of the index can be affected by the disk in this way, for example: When you created index choosing is in different tablespace, and this tablespace is setting to another HDD using configuration. And at this time, if the disk on which the index is created is an SSD, and the other is a regular HDD disk, then of course the index creation speed will increase.
Also, PostgreSQL server configurations have memory usage and other parameters that can affect index creation speed, so if some parameters like buffer memory are high, then indexing speed will increase.

The speed of CREATE INDEX depends on several factors:
the kind of index
the number of rows
the speed of your disks
the setting of maintenance_work_mem and max_parallel_maintenance_workers
The effort to sort the data grows with O(n * log(n)), where n is the number of rows. Reading and writing should grow linearly with the number of rows.
I am not certain, but I'd say that duplicate rows should slow down indexing a little bit from v13 on, where B-tree index deduplication was introduced.

Efficiency of Querying 10 Billion Rows (with High Cardinality) in ScyllaDB

Suppose I have a table with ten billion rows spread across 100 machines. The table has the following structure:
PK1 PK2 PK3 V1 V2
Where PK represents a partition key and V represents a value. So in the above example, the partition key consists of 3 columns.
Scylla requires that you to specify all columns of the partition key in the WHERE clause.
If you want to execute a query while specifying only some of the columns you'd get a warning as this requires a full table scan:
SELECT V1 & V2 FROM table WHERE PK1 = X & PK2 = Y
In the above query, we only specify 2 out of 3 columns. Suppose the query matches 1 billion out of 10 billion rows - what is a good mental model to think about the cost/performance of this query?
My assumption is that the cost is high: It is equivalent to executing ten billion separate queries on the data set since 1) there is no logical association between the rows in the way the rows are stored to disk as each row has a different partition key (high cardinality) 2) in order for Scylla to determine which rows match the query it has to scan all 10 billion rows (even though the result set only matches 1 billion rows)
Assuming a single server can process 100K transactions per second (well within the range advertised by ScyllaDB folks) and the data resides on 100 servers, the (estimated) time to process this query can be calculated as: 100K * 100 = 10 million queries per second. 10 billion divided by 10M = 1,000 seconds. So it would take the cluster approx. 1,000 seconds to process the query (consuming all of the cluster resources).
Is this correct? Or is there any flaw in my mental model of how Scylla processes such queries?
Thanks

As you suggested yourself, Scylla (and everything I will say in my answer also applies to Cassandra) keeps the partitions hashed by the full partition key - containing three columns. ּSo Scylla has no efficient way to scan only the matching partitions. It has to scan all the partitions, and check each of those whether its partition-key matches the request.
However, this doesn't mean that it's as grossly inefficient as "executing ten billion separate queries on the data". A scan of ten billion partitions is usually (when each row's data itself isn't very large) much more efficient than executing ten billion random-access reads, each reading a single partition individually. There's a lot of work that goes into random-access reads - Scylla needs to reach a coordinator which then sends it to replicas, each replica needs to find the specific position in its one-disk data files (often multiple files), often need to over-read from the disk (as disk and compression alignments require), and so on. Compare to this a scan - which can read long contiguous swathes of data sorted by tokens (partition-key hash) from disk and can return many rows fairly quickly with fewer I/O operations and less CPU work.
So if your example setup can do 100,000 random-access reads per node, it can probably read a lot more than 100,000 rows per second during scan. I don't know which exact number to give you, but the blog post https://www.scylladb.com/2019/12/12/how-scylla-scaled-to-one-billion-rows-a-second/ we (full disclosure: I am a ScyllaDB developer) showed an example use case scanning a billion (!) rows per second with just 83 nodes - that's 12 million rows per second on each node instead of your estimate of 100,000. So your example use case can potentially be over in just 8.3 seconds, instead of 1000 seconds as you calculated.
Finally, please don't forget (and this also mentioned in the aforementioned blog post), that if you do a large scan you should explicitly parallelize, i.e., split the token range into pieces and scan then in parallel. First of all, obviously no single client will be able to handle the results of scanning a billion partitions per second, so this parallelization is more-or-less unavoidable. Second, scanning returns partitions in partition order, which (as I explained above) sit contiguously on individual replicas - which is great for peak throughput but also means that only one node (or even one CPU) will be active at any time during the scan. So it's important to split the scan into pieces and do all of them in parallel. We also had a blog post about the importance of parallel scan, and how to do it: https://www.scylladb.com/2017/03/28/parallel-efficient-full-table-scan-scylla/.

Another option is to move a PK to become a clustering key, this way if you have the first two PKs, you'll be able to locate partition, and just search withing it

Oracle index fragmentation

I would like to know if there is any way other than "analyze index validate structure" to find out index fragmentation in Oracle database? As this causes row lock in a production environment.
"analyze index validate structure online" doesn't populate the index_stats.
Thanks.

If your optimizer statistics are relatively current, you can do something like
get number of LEAF_BLOCKS from USER_INDEXES
get NUM_ROWS from USER_INDEXES
get AVG_COL_LEN for each column in the index from USER_TABLES
Summing the column lengths, plus 6 bytes for the rowid, times the number of rows gives you a total byte figure for the index entries.
Depending on the data within it, and how that data was inserted, an index will typically sit at around 65-90% full. Throw in some block level overheads (lets say 200 bytes per 8k block), and you can use this get an estimate of how many leaf blocks you expect the index to have.
If that is roughly close to the LEAF_BLOCK statistic you have, then you can assume that the index is probably not "fragmented" (although that is a term that can cover a multitude of things).
But unless you have a performance issue that you can currently tie back to this index, then I wouldn't worry too much about index fragmentation.

A row lock is a row lock; as such it is very unlikely that this is due to index fragmentation (which is another topic).
A row lock is generally taken out by the application, and is normal behavior.

Why PostgreSQL(timescaledb) costs more storage in table?

I'm new to database. Recently I start using timescaledb, which is an extension in PostgreSQL, so I guess this is also PostgreSQL related.
I observed a strange behavior. I calculated my table structure, 1 timestamp, 2 double, so totally 24bytes per row. And I imported (by psycopg2 copy_from) 2,750,182 rows from csv file. I manually calculated the size should be 63MB, but I query timescaledb, it tells me the table size is 137MB, index size is 100MB and total 237MB. I was expecting that the table size should equal my calculation, but it doesn't. Any idea?

There are two basic reasons your table is bigger than you expect:
1. Per tuple overhead in Postgres
2. Index size
Per tuple overhead: An answer to a related question goes into detail that I won't repeat here but basically Postgres uses 23 (+padding) bytes per row for various internal things, mostly multi-version concurrency control (MVCC) management (Bruce Momjian has some good intros if you want more info). Which gets you pretty darn close to the 137 MB you are seeing. The rest might be because of either the fill factor setting of the table or if there are any dead rows still included in the table from say a previous insert and subsequent delete.
Index Size: Unlike some other DBMSs Postgres does not organize its tables on disk around an index, unless you manually cluster the table on an index, and even then it will not maintain the clustering over time (see https://www.postgresql.org/docs/10/static/sql-cluster.html). Rather it keeps its indices separately, which is why there is extra space for your index. If on-disk size is really important to you and you aren't using your index for, say, uniqueness constraint enforcement, you might consider a BRIN index, especially if your data is going in with some ordering (see https://www.postgresql.org/docs/10/static/brin-intro.html).

SQL Server Profiler - Evaluating Reads. What is considered 'good' or 'bad'?

I'm profiling (SQL Server 2008) some of our views and queries to determine their efficiency with regards to CPU usage and Reads. I understand Reads are the number of logical disk reads in 8KB pages. But I'm having a hard time determining what I should be satisfied with.
For example, when I query one of our views, which in turn joins with another view and has three OUTER APPLYs with table-valued UDFs, I get a Reads value of 321 with a CPU value of 0. My first thought is that I should be happy with this. But how do I evaluate the value of 321? This tells me 2,654,208 bytes of data were logically read to satisfy the query (which returned a single row and 30 columns).
How would some of you go about determining if this is good enough, or requires more fine tuning? What criteria would you use?
Also, I'm curious what is included in the 2,654,208 bytes of logical data read. Does this include all the data contained in the 30 columns in the single row returned?

The 2.5MB includes all data in the 321 pages, including the other rows in the same pages as those retrieved for your query, as well as the index pages retrieved to find your data. Note that these are logical reads, not physical reads, e.g. read from a cached page will make the read much 'cheaper' - take CPU and profiler cost indicator as well when optimising.
w.r.t. How to determine an optimum 'target' for reads.
FWIW I compare the actual reads with a optimum value which I can think of as the minimum number of pages needed to return the data in your query in a 'perfect' world.
e.g. if you calculate roughly 5 rows per page from table x, and your query returns 20 rows, the 'perfect' number of reads would be 4, plus some overhead of navigating indexes (assuming of course that the rows are clustered 'perfectly' for your query) - so utopia would be around say 5-10 pages.
For a performance critical query, you can use the actual reads vs 'utopian' reads to micro-optimise, e.g.:
Whether I can fit more rows per page in the cluster (table), e.g. replacing non-searched strings with varchar() not char, or using varchar not nvarchar() or using smaller integer types etc.
Whether the clustered index could be changed such that fewer pages would need to be fetched (e.g. if the 20 rows for the above query were scattered across different pages, then reads would be > 4)
Failing which (since you can only one CI), whether covering indexes could replace the need to go to the table data (cluster) at all, since covering indexes fitting your query will have higher 'row' densities
And for indexes, density improvements such as fillfactors or narrower indexing for indexes can mean less index reads
You might find this article useful
HTH!

321 reads with a CPU value of 0 sounds pretty good, but it all depends.
How often is this query run? Why are table-returning UDFs used instead of just doing joins? What is the context of database use (how many users, number of transactions per second, database size, is it OLTP or data warehousing)?
The extra data reads come from:
All the other data in the pages needed to satisfy the reads done in the execution plan. Note this includes clustered and nonclustered indexes. Examining the execution plan will give you a better idea of what exactly is being read. You'll see references to all sorts of indexes and tables, and whether a seek or scan was required. Note that a scan means every page in the whole index or table was read. That is why seeks are desirable over scans.
All the related data in tables INNER JOINed to in the views regardless of whether these JOINs are needed to give correct results for the query you're performing, since the optimizer doesn't know that these INNER JOINs will or won't exclude/include rows until it JOINs them.
If you provide the queries and execution plans, as requested, I would probably be able to give you better advice. Since you're using table-valued UDFs, I would also need to see the UDFs themselves or at least the execution plan of the UDFs (which is only possibly by tearing out its meat and running outside a function context, or converting it to a stored procedure).

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight