When does a partial index get updated in PostgreSQL?

I am doing performance tuning on one of the largest tables in our project. While reading about indexes I came across partial indexes in PostgreSQL. This sounds like a very nice idea: putting an index only on the rows that are accessed frequently.
However, I am not able to figure out how a partial index gets updated. For example, I have a table with the following columns:
task_uuid, job_id, enqueued_at, updated_at, task_status
task_status = ENQUEUED, RUNNING, ASSIGNED, FAILED
We search very frequently for records that are in the ENQUEUED state. If we add a partial index on (task_uuid, task_status), it will build a unique key and improve performance. But I want to know what happens when the record gets updated, e.g. when we set its status to RUNNING. (task_uuid, task_status) is still unique, but will the row be removed from the partial index, since it no longer fulfills the index condition?
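For concreteness, here is a minimal sketch of the index being described, assuming the table is named tasks:

CREATE UNIQUE INDEX task_enqueued_idx
    ON tasks (task_uuid, task_status)
    WHERE task_status = 'ENQUEUED';
-- task_status is constant within this index, so including it as a
-- key column is redundant but harmless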

If we add a partial index on (task_uuid, task_status), it will build a unique key and improve performance.
It will only build it as unique if you specify that in the definition of the index. Otherwise it won't be a unique index, even if those columns do happen to be unique.
When a record gets updated so that it no longer matches the WHERE predicate of the index, nothing happens to the index. It still has a pointer to the row, it just points to something no longer valid. If you did specify the index as UNIQUE, then upon inserting a conflicting index tuple, it will follow the pointer for the old tuple to the table, realize it is invalid, and allow the insertion to continue.
The next time the table is vacuumed, those obsolete pointers will be cleaned up. Queue tables with partial indexes should usually be vacuumed frequently (more frequently than the default) because the index bloats easily. The autovacuum settings depend on the fraction of the table's rows obsoleted, not on the fraction of the index's rows obsoleted; for partial indexes, these fractions are not the same. (On the other hand, you don't seem to have a status for "COMPLETED". If completed tasks are immediately deleted, perhaps the queue table will stay small enough that this does not matter.)
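For example, autovacuum can be tuned per table, so the queue table can be vacuumed far more aggressively than the database-wide default. The thresholds below are only illustrative, and the table name tasks is assumed:

ALTER TABLE tasks SET (
    autovacuum_vacuum_scale_factor = 0.01,  -- default is 0.2, i.e. 20% of rows
    autovacuum_vacuum_threshold    = 1000   -- fixed floor added to the scale factor
);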
Also, when an index scan follows the pointer from the index into the table and finds the row is no longer visible to anyone, it will mark the index entry as dead. Then future index scans won't have to pointlessly jump to the table. But this "microvacuum" only happens for regular index scans, not bitmap scans, and it only happens for queries done on the master, not for any done just on a hot standby.


How to efficiently detect data corruption in a SQLite database?

Supposing there is no power loss or catastrophic OS crash, so that fsync() calls succeed and DB changes are truly written to oxide.
Since PRAGMA integrity_check does NOT [1] actually detect data corruption in a SQLite database, the developer needs to come up with a way to detect data corruption themselves.
Storing a checksum of the entire database file works but is too inefficient to do for every row insert (suppose the database is multiple gigabytes in size).
Keeping a checksum for each row is the obvious solution but this does not detect the case where entire rows are deleted by the data corruption. So we need a way to detect whether rows have been deleted from the table.
The method I thought up to detect missing rows is to keep an XOR sum of all the row checksums. This is efficient because every time a row is updated, we simply XOR the sum with the row's old checksum, and then XOR the sum with the row's new checksum. It occurs to me that this is not the strongest method, but I have not been able to find any stronger alternatives that are efficient, so suggestions are welcome.
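As a rough sketch of how the XOR sum could be maintained inside SQLite itself: assume a data table with columns id and payload, plus a hypothetical application-registered function row_checksum() (created via sqlite3_create_function). Note that SQLite has no ^ operator, so bitwise XOR has to be spelled as (a | b) - (a & b):

-- Single-row table holding the XOR of all row checksums.
CREATE TABLE checksum_state (xor_sum INTEGER NOT NULL);

-- On update: XOR out the old row's checksum, then XOR in the new one.
-- (a | b) - (a & b) computes bitwise XOR, which SQLite lacks natively.
-- row_checksum() is a hypothetical application-defined function.
CREATE TRIGGER data_after_update AFTER UPDATE ON data
BEGIN
    UPDATE checksum_state
       SET xor_sum = (xor_sum | row_checksum(OLD.id, OLD.payload))
                   - (xor_sum & row_checksum(OLD.id, OLD.payload));
    UPDATE checksum_state
       SET xor_sum = (xor_sum | row_checksum(NEW.id, NEW.payload))
                   - (xor_sum & row_checksum(NEW.id, NEW.payload));
END;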
EDIT: I have thought of an alternative method which requires all tables to be append-only. In an append-only table we can assume that the rowids are consecutive, which means we can easily notice any missing rows, except for rows missing at the end; to cover that case we can store the last rowid somewhere.
References:
[1] https://www.sqlite.org/pragma.html
Here is a scheme that should work:
In the table, add a "prev" and a "next" column which hold the primary keys of the previous and next rows respectively, so you can think of the table as a doubly linked list. When a row is deleted from the table via data corruption, a scan through the table will find that a "next" key does not match the key of the next row, or a "prev" key does not match the key of the previous row. These columns should also be made UNIQUE so that two rows can't have the same "next" or "prev" row, and they should have FOREIGN KEY constraints.
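A minimal sketch of such a table (the table and column names are illustrative):

PRAGMA foreign_keys = ON;  -- SQLite does not enforce FOREIGN KEYs by default

CREATE TABLE log (
    id      INTEGER PRIMARY KEY,
    payload BLOB,
    prev    INTEGER UNIQUE REFERENCES log(id),  -- key of the previous row
    next    INTEGER UNIQUE REFERENCES log(id)   -- key of the next row
);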
See below for incorrect answer:
Pragma integrity_check will in fact detect missing rows; see where it says:
Missing or surplus index entries
So if you lose a row then it will say something like:
row 12345 missing from index some_index
EDIT: This answer is actually incorrect. The "row x missing" message only indicates that a row does not have a corresponding entry in the index; it does NOT mean that an entry in the index does not have a corresponding row (which is what would happen if a row was deleted).

Does an incremented column make the b-tree index on the column unbalanced?

I have been thinking about two questions, and couldn't find any resources on the internet about them. How do DBMSes handle this? Or do they? Especially Oracle.
Before the questions, here is an example: say I have a master table "MASTER" and a slave table "SLAVE".
The master table has an "ID" column which is the primary key, and the index is created by Oracle. The slave table has the foreign key "MASTER_ID", which refers to the master table, and "SLAVE_NO". These two together form the primary key of the slave table, which is again indexed.
MASTER               SLAVE
(P) ID  <--------->  (P)(F) MASTER_ID
                     (P)    SLAVE_NO
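For concreteness, in Oracle DDL the example schema would look roughly like this (a sketch; ID is assumed to be populated from a sequence):

CREATE TABLE master (
    id NUMBER PRIMARY KEY   -- auto-incremented via a sequence
);

CREATE TABLE slave (
    master_id NUMBER REFERENCES master (id),
    slave_no  NUMBER,
    PRIMARY KEY (master_id, slave_no)
);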
Now, the questions:
1- If MASTER_ID is an auto-incremented column, and no record is ever deleted, doesn't this make the table's index unbalanced? Does Oracle rebuild indexes periodically? As far as I know, Oracle only balances index branches at build time. Does Oracle ever rebuild indexes automatically, say if the index grows to some high number of levels?
2- Assuming Oracle does not rebuild indexes automatically, and apart from scheduling a job that rebuilds the index periodically, would it be wiser to reverse the order of the SLAVE table's primary key columns? That is, instead of ("MASTER_ID", "SLAVE_NO"), ordering it as ("SLAVE_NO", "MASTER_ID"); would that help keep the slave table's b-tree index more balanced? (Each master record might not have exactly the same number of slave records, but it still seems better than the reverse order.)
Does anyone know anything about this? Or have opinions?
If MASTER_ID is an auto-incremented column, and no record is ever deleted, doesn't this make the table's index unbalanced?
Oracle's indexes are never "unbalanced": every leaf in the index is at the same depth as any other leaf.
No page split introduces a new level by itself: a leaf page does not become a parent for new pages like it would in a non-self-balancing tree.
Instead, a sibling for the split page is created, and the new record (plus possibly some of the records from the old page) goes to the new page. A pointer to the new page is added to the parent.
If the parent page is out of space too (can't accept the pointer to the newly created leaf page), it gets split as well, and so on.
These splits can propagate up to the root page, whose split is the only thing which increases the index depth (and does it for all pages at once).
Index pages are additionally organized into doubly linked lists, each list on its own level. This would be impossible if the tree were unbalanced.
If master_id is auto-incremented, all splits occur at the right-hand edge of the index (so-called 90/10 splits), which produces the densest index possible.
Would it be wiser to reverse the order of the SLAVE table's primary key columns?
No, it would not, for the reasons above.
If you join slave to master often, you may consider creating a CLUSTER of the two tables, indexed by master_id. This means that records from both tables sharing the same master_id go to the same or nearby data pages, which makes a join between them very fast.
When the engine finds a record in master, via an index or otherwise, it has also already found the records from slave to be joined with that master. And vice versa: locating a slave record also means locating its master.
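For illustration, a sketch of such a cluster in Oracle (the names are illustrative; an indexed cluster needs its cluster index before any rows can be inserted):

CREATE CLUSTER master_slave (master_id NUMBER);

CREATE INDEX master_slave_idx ON CLUSTER master_slave;

CREATE TABLE master (
    id NUMBER PRIMARY KEY
) CLUSTER master_slave (id);

CREATE TABLE slave (
    master_id NUMBER REFERENCES master (id),
    slave_no  NUMBER,
    PRIMARY KEY (master_id, slave_no)
) CLUSTER master_slave (master_id);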
The b-tree index on MASTER_ID will remain balanced for most useful definitions of "balanced". In particular, the distance between the root block and any child block will always be the same, and the amount of data in any branch will be at least roughly equal to the amount of data in any other sibling branch.
As you insert data into an index on a sequence-generated column, Oracle will perform 90-10 block splits on the leaves when the amount of data in any particular level increases. If, for example, you have a leaf block that can hold 10 rows, when it is full and you want to add an 11th row, Oracle creates a new block, leaves the first 9 entries in the first block, puts 2 entries in the new block, and updates the parent block with the address of the new block. If the parent block needs to split because it is holding addresses for too many children, a similar process takes place. That leaves the indexes relatively balanced throughout their life. Richard Foote (the expert on Oracle indexes) has an excellent blog on When Does an Oracle Index Increase in Height that goes into much more detail about how this works.
The only case where you potentially have to worry about an index becoming skewed is when you regularly delete most, but not all, of the rows from the left-hand side of the index. If, for example, you decide to delete 90% of the data from the left-hand side of the index leaving one row in every leaf node pointing to data, your index can become unbalanced in the sense that some paths lead to vastly more data than other paths. Oracle doesn't reclaim index leaf nodes until they are completely empty, so you can end up with the left-hand side of the index using a lot more space than is really necessary. This doesn't really affect the performance of the system much-- it is more of a space utilization issue-- but it can be fixed by coalescing the index after you've purged the data (or structuring your purges so that you don't leave lots of sparse blocks).
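For example, after a purge the sparse left-hand leaf blocks can be merged in place without a full rebuild (the index name here is illustrative):

ALTER INDEX slave_pk COALESCE;

-- or, the heavier-weight option:
ALTER INDEX slave_pk REBUILD;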

Clustered index considerations with regard to distinct values and large result sets, and a single vertical table for auditing

I've been researching best practices for creating clustered indexes, and I'm trying to fully understand these two suggestions that are listed in pretty much every blog or article on the matter:
Columns that contain a large number of distinct values.
Queries that return large result sets.
These seem slightly contradictory, or maybe it just depends on how you're accessing the table... or my interpretation of what "large result sets" means is wrong.
Unless you're doing range queries over the clustered column, it seems like you typically won't be getting large result sets that matter. So in cases where SQL Server defaults the clustered index to the PK, you're rarely going to fulfill the large-result-set suggestion, though of course it does satisfy the large number of distinct values.
To give the question a little more context: this question stems from a vertical auditing table we have that has a column for TABLE... Every single query that's written against this table has a
WHERE TABLE = 'TABLENAME'
But the TableName is highly non-distinct... Each result set for a given table name is rather large, which seems to fulfill that second condition, but it's definitely not largely unique... which means all that other stuff happens with having to add the 4-byte uniquifier, which makes the table a lot larger, etc...
This situation has come up a few times for me when I've come upon DBs that have, say, all the contacts or some accounts normalized into a single table, separated only by a TYPE parameter, which is on every query...
In the case of the audit table the queries are typically not that exciting either: they are just sorted by date modified, sometimes filtered by column, the user that made the change, etc...
My other thought with this auditing scenario was to just make the auditing table a HEAP, so that inserting is fast and there's no contention between the tables being audited, and then to generate indexed views over the data...
Index design is just as much art as it is science.
There are many things to consider, including:
How the table will be accessed most often: mostly inserts? any updates? more SELECTs than DML statements? Any audit table will likely have mostly inserts, no updates, rarely deletes unless there is a time-limit on the data, and some SELECTs.
For Clustered indexes, keep in mind that the data in each column of the clustered index will be copied into each non-clustered index (though not for UNIQUE indexes, I believe). This is helpful as those values are available to queries using the non-clustered index for covering, etc. But it also means that the physical space taken up by the non-clustered indexes will be that much larger.
Clustered indexes generally should either be declared with the UNIQUE keyword or be the Primary Key (though there are exceptions, of course). A non-unique clustered index will have a hidden 4-byte field called a uniquifier that is required to make each row with a non-unique key value addressable, and it is mostly wasted space, given that the order of rows within the non-unique groupings is not obvious, so trying to narrow down to a single row is still a range scan.
As is mentioned everywhere, the clustered index is the physical ordering of the data so you want to cater to what needs the best I/O. This relates also to the point directly above where non-unique clustered indexes have an order but if the data is truly non-unique (as opposed to unique data but missing the UNIQUE keyword when the index was created) then you miss out on a lot of the benefit of having the data physically ordered.
Regardless of any information or theory, TEST TEST TEST. There are many more factors involved that pertain to your specific situation.
So, you mentioned having a Date field as well as the TableName. If the combination of the Date and TableName is unique then those should be used as a composite key on a PK or UNIQUE CLUSTERED index. If they are not then find another field that creates the uniqueness, such as UserIDModified.
While most recommendations are to have the most unique field as the first one (due to statistics being kept only on the first field), this doesn't hold true for all situations. Given that all of your queries filter by TableName, I would opt for putting that field first to make use of the physical ordering of the data. This way SQL Server can read more relevant data per read without having to seek to other locations on disk. You would likely also be ordering on the Date, so I would put that field second.

Putting TableName first will cause higher fragmentation across INSERTs than putting the Date first, but after an index rebuild the data access will be faster, as the data is already both grouped (TableName) and ordered (Date) as the queries expect. If you put Date first then the data is still ordered properly, but the rows needed to satisfy the query are likely spread out across the data file(s), which requires more I/O to get. And more data pages to satisfy the same query means more pages in the Buffer Pool, potentially pushing out other pages and reducing Page Life Expectancy (PLE). Also, you would then really need to include the Date field in all queries, as any query using only TableName (and possibly other filters, but NOT the Date field) would have to scan the clustered index or force you to create a nonclustered index with TableName first.
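To make that concrete, here is a sketch of the suggested key order; the table and column names (dbo.Audit, DateModified, AuditId) are assumptions based on the question, with AuditId acting as the tie-breaker that guarantees uniqueness:

CREATE UNIQUE CLUSTERED INDEX CIX_Audit
    ON dbo.Audit (TableName, DateModified, AuditId);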
I would be wary of the Heap plus Indexed View model. Yes, it might be optimized for inserts, but the system still needs to maintain the data in the indexed view across all DML statements against the heap. Again, you would need to test, but I don't see that being materially better than a good choice of fields for a clustered index on the audit table.
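For comparison, the heap-plus-indexed-view approach being cautioned against would look something like the sketch below (same assumed names as above). Every INSERT into the heap must also maintain the view's clustered index, which is exactly the overhead in question:

CREATE VIEW dbo.vAuditByTable
WITH SCHEMABINDING            -- required for an indexed view
AS
SELECT TableName, DateModified, AuditId
FROM dbo.Audit;
GO

-- Materializing the view: a unique clustered index must be created first.
CREATE UNIQUE CLUSTERED INDEX IX_vAuditByTable
    ON dbo.vAuditByTable (TableName, DateModified, AuditId);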

SQL Server Index cost

I have read that one of the tradeoffs for adding table indexes in SQL Server is the increased cost of insert/update/delete queries to benefit the performance of select queries.
I can conceptually understand what happens in the case of an insert because SQL Server has to write entries into each index matching the new rows, but update and delete are a little more murky to me because I can't quite wrap my head around what the database engine has to do.
Let's take DELETE as an example and assume I have the following schema (pardon the pseudo-SQL)
CREATE TABLE Foo (
    col1 int NOT NULL,
    col2 int NOT NULL,
    col3 int NOT NULL,
    col4 int NOT NULL,
    PRIMARY KEY (col1, col2)
);

CREATE INDEX IX_1 ON Foo (col3) INCLUDE (col4);
Now, if I issue the statement
DELETE FROM Foo WHERE col1=12 AND col2 > 34
I understand what the engine must do to update the table (or clustered index if you prefer). The index is set up to make it easy to find the range of rows to be removed and do so.
However, at this point it also needs to update IX_1 and the query that I gave it gives no obvious efficient way for the database engine to find the rows to update. Is it forced to do a full index scan at this point? Does the engine read the rows from the clustered index first and generate a smarter internal delete against the index?
It might help me to wrap my head around this if I understood better what is going on under the hood, but I guess my real question is this. I have a database that is spending a significant amount of time in delete and I'm trying to figure out what I can do about it.
When I display the execution plan for the deletion, it just shows an entry for "Clustered Index Delete" on table Foo which lists in the details section the other indices that need to be updated but I don't get any indication of the relative cost of these other indices.
Are they all equal in this case? Is there some way that I can estimate the impact of removing one or more of these indices without having to actually try it?
Nonclustered indexes also store the clustered keys.
It does not have to do a full scan, since:
your query will use the clustered index to locate rows
rows contain the other index value (c3)
using the other index value (c3) and the clustered index values (c1,c2), it can locate matching entries in the other index.
(Note: I had trouble interpreting the docs, but I would imagine that IX_1 in your case could be defined as if it was also sorted on c1,c2. Since these are already stored in the index, it would make perfect sense to use them to more efficiently locate records for e.g. updates and deletes.)
All this, however, has a cost. For each matching row:
it has to read the row, to find out the value for c3
it has to find the entry for (c3,c1,c2) in the nonclustered index
it has to delete the entry from there as well.
Furthermore, while the range query can be efficient on the clustered index in your case (linear access, after finding a match), maintenance of the other indexes will most likely result in random access to them for every matching row. Random access has a much higher cost than just enumerating B+ tree leaf nodes starting from a given match.
Given the above query, more time is spent on the non-clustered index maintenance; the amount depends heavily on the number of records selected by the col1 = 12 AND col2 > 34 predicate.
My guess is that the cost is conceptually the same as if you did not have a secondary index but had e.g. a separate table, holding (c3,c1,c2) as the only columns in a clustered key and you did a DELETE for each matching row using (c3,c1,c2). Obviously, index maintenance is internal to SQL Server and is faster, but conceptually, I guess the above is close.
The above would mean that maintenance costs of indexes would stay pretty close to each other, since the number of entries in each secondary index is the same (the number of records) and deletion can proceed only one-by-one on each index.
If you need the indexes, then performance-wise, depending on the number of deleted records, you might be better off scheduling the deletes, dropping the indexes that are not used during the delete beforehand, and adding them back afterwards. Depending on the number of records affected, rebuilding the indexes this way might be faster.
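In SQL Server you don't even have to drop and re-create the index; it can be disabled before the purge and rebuilt afterwards. A sketch using IX_1 from the question's schema:

ALTER INDEX IX_1 ON Foo DISABLE;   -- index is no longer maintained

DELETE FROM Foo WHERE col1 = 12 AND col2 > 34;

ALTER INDEX IX_1 ON Foo REBUILD;   -- rebuilds and re-enables the index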

I have one table with ten foreign keys. How much of a hit am I going to take on insertion if I have indexes on all ten keys?

The consensus seems to be that all foreign keys need to have indexes. How much overhead am I going to incur on inserts if I follow the letter of the law?
NOTES:
Assume that the database is a good design, and that all of the joins are legitimate.
All Primary and Foreign Keys are of type Int.
Some tables are lookup tables, with fewer than ten records, that are not likely to grow in size.
It is an OLTP database.
Some of the joins are to lookup tables with fewer than 10 records.
Here is a decent list of examples of when and what type of index to use. I don't think you should accept the "law" and index everything. You need to define what will be used in the query joins and index accordingly.
There is a significant performance penalty on inserts as all of the indexes need to be updated. Roughly, you will incur one disk write for the insert on the large table and slightly more than one (on average) for each index on the table. Each index leaf node will incur a write, and some additional writes will occur from time to time as the leaf and (less often) parent nodes split.
Each table and index write will also incur log traffic. The particularly nasty penalty is on bulk inserted data, as active indexes on tables where you are inserting bulk loaded data will be updated for each row - and these updates are not minimally logged. This will massively blow out your I/O (which will all be random access rather than nice sequential bulk writes) and will also generate vast amounts of log traffic.
There's no need to put an index on foreign keys that point into lookup tables with small numbers of elements.
The only possible way to answer your question is to test. For instance, if any of the keys have a cardinality of 10, they probably won't be very helpful. So you've got some testing work to do. But it has a lot to do with your table sizes, the key sizes, the absolute activity level, and the mix of CRUD operations. Distrust all simple answers.
EDIT:
If you have no data currently because this is the initial design, start with only the obvious indexes and add others as you need them based on testing. It makes little sense to add them all unless it's a low-change database. But if it's read-only, there's not much penalty at all. (Another piece of information you haven't provided.)
The consensus seems to be that all foreign keys need to have indexes. How much overhead am I going to incur on inserts if I follow the letter of the law?
There are two overheads: on DML over the referencing table, and DML over the referenced table.
A referenced table must have a unique index (typically its primary key) on the referenced columns; otherwise you won't be able to create the FOREIGN KEY at all.
A referencing table can have no index. An index there will make INSERTs into the referencing table a little bit slower, and won't affect INSERTs into the referenced table.
Whenever you insert a row into a referencing table, the following occurs:
The row is checked against the FOREIGN KEY as in this query:
SELECT TOP 1 NULL
FROM referenced ed
WHERE ed.pk = #new_fk_value
The row is inserted
The index on the row (if any) is updated.
The first two steps are always performed, and step 1 generally uses an index on the referenced table (again, you just cannot create a FOREIGN KEY relationship without having this index).
Step 1 is the only overhead specific to a FOREIGN KEY.
The overhead of step 3 comes only from the fact that the index exists. It would be exactly the same if there were no FOREIGN KEY.
But UPDATE's and DELETE's from the referenced table can be much slower if you don't define an index on the referencing table, especially if the latter is large.
Whenever you DELETE from the referenced table, the following occurs:
The rows are checked against the FOREIGN KEY as in this query:
SELECT TOP 1 NULL
FROM referencing ing
WHERE ing.fk = #old_pk_value
The row is deleted
The index on the row is updated.
It's easy to see that this query will most probably benefit from an index on referencing.fk.
Otherwise, to check the constraint the optimizer will need to build a HASH TABLE over the whole referencing table, even if you are deleting a single record.
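That is, the fix is a plain index on the foreign key column, using the illustrative names from the pseudo-queries above:

CREATE INDEX ix_referencing_fk ON referencing (fk);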
The only way to know the impact is to test. The answer may differ greatly depending on whether your system tends to insert large amounts of data in bulk or one record at a time from the user interface. It also depends a lot on the size of the tables and the total number of indexes. Testing is the only way to know for certain which indexes you should use. A general rule of thumb is to start by indexing foreign key fields and fields you will be using in WHERE clauses. But that's just where to start looking at your system, not the be-all and end-all answer.
I will say that I have observed that users tend to be more tolerant of a little longer time spent on insert than they are of more time spent on querying the system. This is especially true since senior managers tend to do more querying than data entry and they can get cranky and have the power to do something about it if they feel their time is being wasted.
In a new system you need to generate test records at the expected volume the system will have when implemented. If you don't, then you will find that queries (and designs) that worked OK in a small test bed can be horrible with real users doing multiple things simultaneously against large tables. It's no fun at all to refactor a database where performance wasn't considered and tested in the design. It's no fun to pull back production changes because the query takes longer than the timeout setting because the developer didn't test against the true volume (or, in the case of a new project, the expected volume).
SQL Server has tools to help you determine the best indexes. Use the indexing wizard and the execution plans to see where you need indexes. Put indexes on the fields and test inserts to see if there is a negative impact. There is no one right answer. It won't even stay the same answer for the lifetime of your database, in all likelihood.
Insert/update/delete always hits the index and writes to it. Select sometimes hits the index to read from it, depending on the query optimizer's analysis or best guess. If you don't need an index to speed up reads (such as if the column only has a low number of potential values), then get rid of it.
If you have a billion rows in a child table and wish to delete 100 million of them because you're deleting one row from the parent table where that row is the parent of all 100 million child rows, then having an index will only slow the whole operation down, because the system has to delete from the index too, but won't speed the operation up, because the system will not use the index to choose which rows to delete.
I know performance is a critical issue.
IMO, you should consider the ramifications of not having an index (and therefore no FK) on OLTP data. You can suffer data integrity issues on such a system.
Thank you all for your input.
Based on your feedback, I think I will add indexes to all of the foreign keys EXCEPT those pointing to lookup tables (containing a small number of records that are not likely to change). This will cut the number of required foreign key indexes in half (from ten to five).
If anyone has further insight, feel free to post new answers. I still have some votes left. :)
Are the fields going to be used in searching and sorting? If so, an index might be a good idea. The only way to know is to test, measure, and test again.
Edit: The lookup table will probably be cached, but that won't help a search query against the referencing table (your data table, that is).
