I would like to know if there is any way other than "analyze index validate structure" to find out index fragmentation in an Oracle database, as this causes row locks in a production environment.
"analyze index validate structure online" doesn't populate the index_stats.
Thanks.
If your optimizer statistics are relatively current, you can do something like
get number of LEAF_BLOCKS from USER_INDEXES
get NUM_ROWS from USER_INDEXES
get AVG_COL_LEN for each column in the index from USER_TAB_COLUMNS
Summing the column lengths, plus 6 bytes for the rowid, times the number of rows gives you a total byte figure for the index entries.
Depending on the data within it, and how that data was inserted, an index will typically sit at around 65-90% full. Throw in some block level overheads (let's say 200 bytes per 8k block), and you can use this to get an estimate of how many leaf blocks you expect the index to have.
If that is roughly close to the LEAF_BLOCK statistic you have, then you can assume that the index is probably not "fragmented" (although that is a term that can cover a multitude of things).
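A rough sketch of that arithmetic as a single query (the 75% packing and 200-byte block overhead are assumptions picked from the ranges above, an 8k block size is assumed, and MY_INDEX is a hypothetical index name):

    SELECT i.index_name,
           i.leaf_blocks AS actual_leaf_blocks,
           -- bytes per entry (column lengths + 6 bytes for the rowid), packed into
           -- 8k blocks less ~200 bytes of overhead at an assumed ~75% fullness
           CEIL((i.num_rows * (c.entry_bytes + 6)) / ((8192 - 200) * 0.75)) AS estimated_leaf_blocks
    FROM   user_indexes i
           JOIN (SELECT ic.index_name, SUM(tc.avg_col_len) AS entry_bytes
                 FROM   user_ind_columns ic
                        JOIN user_tab_columns tc
                          ON tc.table_name  = ic.table_name
                         AND tc.column_name = ic.column_name
                 GROUP  BY ic.index_name) c
             ON c.index_name = i.index_name
    WHERE  i.index_name = 'MY_INDEX';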
But unless you have a performance issue that you can currently tie back to this index, then I wouldn't worry too much about index fragmentation.
A row lock is a row lock; it is very unlikely to be caused by index fragmentation (which is a separate topic anyway).
A row lock is generally taken out by the application, and is normal behavior.
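If you want to see which sessions are actually blocked and by whom, a quick sketch against V$SESSION (adapt the columns to taste):

    -- Sessions currently waiting on a lock, and the session that is blocking them.
    SELECT sid, blocking_session, event, seconds_in_wait
    FROM   v$session
    WHERE  blocking_session IS NOT NULL;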
I'm working with a few tables with large row counts that share a commonly named column (table_id). I now intend to add an index to this column in each table. One or two of these tables use up 10x more space than the others, but let's say for simplicity that all tables have the same row count. Basically, the 10x extra space is because some tables have many more columns than others. I have two inter-related questions:
I'm curious about whether overall indexing time is a function of the table's disk usage or just the row count?
Additionally, would duplicate values in a column speed up the indexing time at all, or would it actually slow down indexing?
If indexing time is only dependent on row count, then all my tables should get indexed at the same speed. I'm sure someone could do benchmark tests to answer all these questions, but my disks are currently tied up indexing those tables.
The speed of indexing depends on the following factors:
Row count: this is one of the factors that most affects indexing speed.
The type of the column (int, text, bigint, json) also influences indexing speed.
Duplicate data mainly affects index size, not indexing speed (it may have a very slight effect on speed). If a column contains a lot of duplicate values, the index on that column will be smaller.
Disk speed also matters. For example, if you create the index in a tablespace that is configured on a different disk, and that disk is an SSD rather than a regular HDD, index creation will be faster.
PostgreSQL server configuration also affects index creation speed: if memory-related parameters such as maintenance_work_mem are set high, indexing will be faster (a sketch follows below).
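A minimal sketch of those last two points, assuming PostgreSQL and entirely made-up names (fast_ssd, /mnt/ssd/pgdata, orders, table_id); the tablespace step needs superuser rights and an existing directory:

    -- Put the index on a tablespace that lives on a faster disk.
    CREATE TABLESPACE fast_ssd LOCATION '/mnt/ssd/pgdata';

    -- Give the index build more memory and parallel workers for this session.
    SET maintenance_work_mem = '2GB';
    SET max_parallel_maintenance_workers = 4;

    CREATE INDEX idx_orders_table_id
        ON orders (table_id)
        TABLESPACE fast_ssd;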
The speed of CREATE INDEX depends on several factors:
the kind of index
the number of rows
the speed of your disks
the setting of maintenance_work_mem and max_parallel_maintenance_workers
The effort to sort the data grows with O(n * log(n)), where n is the number of rows. Reading and writing should grow linearly with the number of rows.
I am not certain, but I'd say that duplicate values should slow down indexing a little bit from v13 on, where B-tree index deduplication was introduced.
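If you want to see the effect of deduplication yourself, a small sketch (PostgreSQL 13+; table and index names are made up):

    -- Build the same index with and without B-tree deduplication and compare sizes.
    -- (\timing in psql also shows the build times if you want to compare speed.)
    CREATE INDEX idx_t_dedup    ON t (table_id);
    CREATE INDEX idx_t_no_dedup ON t (table_id) WITH (deduplicate_items = off);

    SELECT pg_size_pretty(pg_relation_size('idx_t_dedup'))    AS with_dedup,
           pg_size_pretty(pg_relation_size('idx_t_no_dedup')) AS without_dedup;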
I have been searching good information about index benchmarking on PostgreSQL and found nothing really good.
I need to understand how PostgreSQL behaves while handling a huge amount of records.
Let's say 2000M records on a single non-partitioned table.
Theoretically, B-trees are O(log(n)) for reads and writes, but in practice I think that's kind of an ideal scenario, not considering things like HUGE indexes not fitting entirely in memory (swapping?) and maybe other things I am not seeing at this point.
There are no JOIN operations, which is fine, but note this is not an analytical database and response times below 150ms (less the better) are required. All searches are expected to be done using indexes, of course. Where we have 2-3 indexes:
UUID PK index
timestamp index
VARCHAR(20) index (non unique but high cardinality)
My concern is how writes and reads will perform once the table reaches its expected top capacity (2500M records)
... so specific questions might be:
May "adding more iron" achieve reasonable performance in such scenario?
NOTE: this is a non-clustered DB, so this means vertical scaling.
What would be the main sources of time consumption either for reads and writes?
What would be the amount of records on a table that we can consider "too much" for this standard setup on PostgreSql (no cluster, no partitions, no sharding)?
I know this amount of records suggests taking some alternative (sharding, partitioning, etc), but this question is about learning and understanding PostgreSQL capabilities more than anything else.
There should be no performance degradation inserting into or selecting from a table, even if it is large. The expense of index access grows with the logarithm of the table size, but the base of the logarithm is large, and the depth of the index shouldn't be more than 5 or 6. The upper levels of the index are probably cached, so you end up with a handful of blocks read when accessing a single table row per index scan. Note that you don't have to cache the whole index.
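If you want to verify the depth on a concrete index, the pgstattuple extension reports it (the index name below is hypothetical):

    CREATE EXTENSION IF NOT EXISTS pgstattuple;

    -- tree_level is the tree level of the root page (roughly, the depth of the index).
    SELECT tree_level, leaf_pages, avg_leaf_density
    FROM   pgstatindex('my_table_pkey');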
I'm new to databases. Recently I started using timescaledb, which is an extension of PostgreSQL, so I guess this is also PostgreSQL related.
I observed some strange behavior. I calculated my table structure: 1 timestamp, 2 doubles, so 24 bytes per row in total. I imported (by psycopg2 copy_from) 2,750,182 rows from a csv file. I manually calculated that the size should be 63MB, but when I query timescaledb, it tells me the table size is 137MB, the index size is 100MB, and the total is 237MB. I was expecting the table size to match my calculation, but it doesn't. Any idea?
There are two basic reasons your table is bigger than you expect:
1. Per tuple overhead in Postgres
2. Index size
Per tuple overhead: An answer to a related question goes into detail that I won't repeat here, but basically Postgres uses 23 (+ padding) bytes per row for various internal things, mostly multi-version concurrency control (MVCC) management (Bruce Momjian has some good intros if you want more info). That gets you pretty darn close to the 137 MB you are seeing. The rest might be because of either the fill factor setting of the table, or any dead rows still included in the table from, say, a previous insert and subsequent delete.
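As a rough back-of-the-envelope check (the 23-byte tuple header and 1 byte of alignment padding are assumptions, and the table name is hypothetical), you can compare the estimate with what Postgres itself reports:

    -- Estimated heap size from the numbers in the question:
    -- 2,750,182 rows x (24 bytes of data + 23 bytes of header + 1 byte of padding).
    -- Per-page headers, fill factor and free space make up the remaining difference.
    SELECT pg_size_pretty(2750182::bigint * (24 + 23 + 1)) AS estimated_heap_size;

    -- What Postgres reports for a plain table (TimescaleDB stores hypertable data
    -- in chunk tables and has its own size helper functions).
    SELECT pg_size_pretty(pg_relation_size('my_table'))       AS heap_size,
           pg_size_pretty(pg_indexes_size('my_table'))        AS index_size,
           pg_size_pretty(pg_total_relation_size('my_table')) AS total_size;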
Index Size: Unlike some other DBMSs Postgres does not organize its tables on disk around an index, unless you manually cluster the table on an index, and even then it will not maintain the clustering over time (see https://www.postgresql.org/docs/10/static/sql-cluster.html). Rather it keeps its indices separately, which is why there is extra space for your index. If on-disk size is really important to you and you aren't using your index for, say, uniqueness constraint enforcement, you might consider a BRIN index, especially if your data is going in with some ordering (see https://www.postgresql.org/docs/10/static/brin-intro.html).
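A minimal sketch of that last suggestion (table and column names are made up): a BRIN index on a timestamp column whose values arrive in roughly insertion order is typically a small fraction of the size of the equivalent B-tree.

    -- BRIN stores one summary entry per range of table blocks, so it stays very
    -- small for naturally ordered data such as timestamps.
    CREATE INDEX readings_time_brin  ON readings USING brin  (time);

    -- Compare with a regular B-tree on the same column:
    CREATE INDEX readings_time_btree ON readings USING btree (time);

    SELECT pg_size_pretty(pg_relation_size('readings_time_brin'))  AS brin_size,
           pg_size_pretty(pg_relation_size('readings_time_btree')) AS btree_size;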
I have another question, but I'll be more specific.
I see that selecting from a million-row table takes < 1 second. What I don't understand is how it might do this with indexes. It seems to take 10ms to do a seek, so to finish in 1 second it must do fewer than 100 seeks. If there is an index entry per row, then 1M rows is at least 1K blocks to store the index (actually it's higher if it's 8 bytes per row: a 32-bit index value + a 32-bit key offset). Then we would need to actually travel to the rows and collect the data. How do databases keep the seeks low and pull the data as fast as they do?
One way is something called a 'clustered index', where the rows of the table are physically ordered according to the clustered index's sort. Then when you want to read in a range of values along the indexed field, you find the first one, and you can just read it all in at once with no extra IO.
Also:
1) When reading an index, a large chunk of the index will be read in at once. If descending the B-tree (or moving along the children at the bottom, once you've found your range) moves you to another node already read into memory, you've saved an IO.
2) If the number of records that the SQL server statistically expects to retrieve is so high that the random access required to go from the index to the underlying rows would need so many IO operations that a table scan would be faster, then it will do a table scan instead. You can see this e.g. using the query planner in SQL Server or PostgreSQL. But for small ranges the index is usually better, and the query plan will reflect this.
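For example, in PostgreSQL (the table and column names here are made up), EXPLAIN shows which choice the planner made:

    -- A narrow range is likely to be planned as an index scan:
    EXPLAIN SELECT * FROM events
    WHERE created_at BETWEEN '2024-01-01' AND '2024-01-02';

    -- A range covering most of the table may be planned as a sequential scan:
    EXPLAIN SELECT * FROM events
    WHERE created_at BETWEEN '2000-01-01' AND '2030-01-01';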
I have a table myTable with a unique clustered index myId with fill factor 100%
It's an integer, starting at zero (but it's not an identity column for the table)
I need to add a new type of row to the table.
It might be nice if I could distinguish these rows by using negative values of myId.
Would having negative values incur extra page splitting and slow down inserts?
Extra Background:
This table exists as part of the ETL for a data warehouse that gathers data from disparate systems. I now want to accommodate a new type of data. A way for me to do this is to reserve negative ids for this new data, which will thus be automatically clustered. This will also avoid major key changes or extra columns in the schema.
Answer Summary:
A fill factor of 100% will normally slow down inserts, but not inserts that happen sequentially, and that includes the sequential negative inserts.
Besides the practical administration points you already got and the dubious use of negative ids to represent data model attributes, there is also a valid question here: given a table with int ids from 0 to N, where would newly inserted negative values go, and would they cause additional splits?
The initial rows will be placed on the clustered index leaf pages, row with id 0 on first page and row with id N on the last page, filling the pages in between. When the first row with value of -1 is inserted, this will sort ahead of row with id 0 and as such will add a new page to the tree (will allocate an extent of 8 pages actually, but that is a different point) and will link the page in front of the leaf level linked list of pages. This will NOT cause a page split of the former first page. On further inserts of values -2, -3 etc they will go to the same new page and they will be inserted in the proper position (-2 ahead of -1, -3 ahead of -2 etc) until the page fills. Further inserts will add a new page ahead of this one, that will accommodate further new values. Inserts of positive values N+1, N+2 will go at the last page and be placed in it until it fills, then they'll cause a new page to be added and will start filling that page.
So basically the answer is this: inserts at either end of a clustered index should not cause page splits. Page splits can be caused only by inserts between two existing keys. This actually extends to the non-leaf pages as well; an insert at either end of the cluster will not split a non-leaf page either. I do not discuss here the impact of updates, of course (they can cause splits if they increase the length of a variable-length column).
Lately there has been a lot of talk in the SQL Server blogosphere about the potential performance problems of page splits, but I must warn against going to unnecessary extremes to avoid them. Page splits are a normal index operation. If you find yourself in an environment where the page split performance hit is visible during inserts, then you'll probably be hit worse by the 'mitigation' measures, because you'll create artificial page latch hot spots that are far worse, as they'll affect every insert. What is true is that prolonged operation with frequent splits will result in high fragmentation, which impacts data access time. I'd say that is best mitigated with an off-peak periodic index maintenance operation (reorganize). Avoid premature optimizations; always measure first.
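A sketch of that kind of check and maintenance (the table name is a placeholder taken from the question):

    -- How fragmented are the indexes on this table, and how big are they?
    SELECT index_id, avg_fragmentation_in_percent, page_count
    FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.myTable'),
                                          NULL, NULL, 'LIMITED');

    -- Off-peak maintenance: reorganize rather than trying to prevent every split.
    ALTER INDEX ALL ON dbo.myTable REORGANIZE;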
Not enough to notice for any reasonable system.
Page splits happen when a page is full, either at the start or at the end of the range.
As long as you do regular index maintenance...
Edit, after Fill factor comments:
After a page split with a FF of 90 or 100, each page will be 50% full. FF = 100 only means a split will happen sooner (probably on the 1st insert).
With a strictly monotonically increasing (or decreasing) key (+ve or -ve), a page split happens at either end of the range.
However, from BOL, FILLFACTOR, under "Adding Data to the End of the Table":
A nonzero fill factor other than 0 or 100 can be good for performance if the new data is evenly distributed throughout the table. However, if all the data is added to the end of the table, the empty space in the index pages will not be filled. For example, if the index key column is an IDENTITY column, the key for new rows is always increasing and the index rows are logically added to the end of the index. If existing rows will be updated with data that lengthens the size of the rows, use a fill factor of less than 100. The extra bytes on each page will help to minimize page splits caused by extra length in the rows.
So does fill factor matter for strictly monotonic keys...? Especially if it's low-volume writes.
No, not at all. Negative values are just as valid INTegers as positive ones. No problem. Basically, internally, they're all just 4 bytes' worth of zeroes and ones :-)
Marc
You are asking the wrong question!
If you create a clustered index that has a fill factor of 100%, every time a record is inserted, deleted or even modified, page splits can occur because there is likely no room on the existing index data page to write the change.
Even with regular index maintenance, a fill factor of 100% is counterproductive on a table where you know inserts are going to be performed. A more usual value would be 90%.
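If you do lower it, a one-line sketch (the index and table names are placeholders):

    -- Rebuild the clustered index leaving 10% free space on each leaf page.
    ALTER INDEX PK_myTable ON dbo.myTable REBUILD WITH (FILLFACTOR = 90);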
I'm concerned that this post may have taken a wrong turn, in that there seems to be an underlying design issue at work here, irrespective of the resultant page splits.
Why do you need to introduce a negative ID?
An integer primary key, for example, should uniquely identify a row; its sign should be irrelevant. I suspect that there may be a definition issue with the primary key for your table if this is not the case.
If you need to flag/identify the newly inserted records then create a column specifically for this purpose.
This solution would be ideal because you may then be able to ensure that your primary key is sequential (perhaps using an Identity data type, although not essential), thereby avoiding issues with page splits (on insert) altogether.
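A minimal sketch of that approach (the column name, type, and default are made up for illustration):

    -- Flag the new kind of row explicitly instead of overloading the sign of myId.
    ALTER TABLE dbo.myTable
        ADD SourceType tinyint NOT NULL
            CONSTRAINT DF_myTable_SourceType DEFAULT (0);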
Also, to confirm if I may: a fill factor of 100% for a clustered index primary key (an identity integer, for example) will not cause page splits for sequential inserts!