How do you determine when to use table clusters? There are two types, index and hash, to use for different cases. In your experience, have the introduction and use of table clusters paid off?
If none of your tables are set up this way, modifying them to use table clusters would add to the complexity of the setup. But would the expected performance benefits outweigh the cost of increased complexity in future maintenance work?
Do you have any favorite online references or books that describe table clustering well and give good implementation examples?
//Oracle tips greatly appreciated.
The killer feature of table clusters is that you can store related rows of different tables at the same physical location.
That can improve join performance by an order of magnitude. However, it doesn't pay off as often as it sounds.
The only time I used it was for a three-table join, executed by two hash joins. It took too long ;). However, the join was on the same column, so it was possible to use a hash table cluster keyed by the join column. That caused all related rows to be stored next to each other (ideally, in the same database block). Knowing that, Oracle can execute the join with a special optimization ("cluster join").
It's more or less pre-joined, but the tables still feel like normal tables (for INSERT/SELECT/UPDATE/DELETE).
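To make the co-location idea concrete, here is a toy Python model. It is not Oracle's implementation, only a sketch of the principle: rows from three hypothetical tables that share a join key land in the same bucket (standing in for a database block), so the join for one key reads a single bucket. The table names, the bucket count, and the modulo hash are all assumptions made for illustration.

```python
from collections import defaultdict

N_BUCKETS = 8  # assumed number of hash buckets ("blocks") in this toy model


def bucket_for(key: int) -> int:
    """Toy hash function; a real hash cluster uses the database's own hashing."""
    return key % N_BUCKETS


class ToyHashCluster:
    """Stores rows of several tables together, grouped by the cluster key."""

    def __init__(self) -> None:
        # bucket number -> list of (table_name, cluster_key, row)
        self.buckets = defaultdict(list)

    def insert(self, table: str, key: int, row: dict) -> None:
        self.buckets[bucket_for(key)].append((table, key, row))

    def cluster_join(self, key: int) -> dict:
        """Collect the rows of every table for one key by reading one bucket."""
        result = defaultdict(list)
        for table, row_key, row in self.buckets[bucket_for(key)]:
            if row_key == key:  # other keys may hash to the same bucket
                result[table].append(row)
        return dict(result)


cluster = ToyHashCluster()
cluster.insert("orders", 42, {"order_id": 42, "customer": "ACME"})
cluster.insert("order_lines", 42, {"order_id": 42, "item": "bolt"})
cluster.insert("order_lines", 42, {"order_id": 42, "item": "nut"})
cluster.insert("shipments", 42, {"order_id": 42, "carrier": "rail"})
cluster.insert("orders", 50, {"order_id": 50, "customer": "Initech"})  # shares a bucket with 42

joined = cluster.cluster_join(42)
print(joined)
```

The three "tables" are still inserted into separately, which mirrors the point that clustered tables keep behaving like normal tables for DML.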
On the other hand, there are "single table clusters" that are mostly used to control the "clustering factor" -- an idea similar to clustered indexes (called index-organized tables in Oracle), but without the high cost of using a secondary index.
A lot can be said about clustering, but the most complete explanation of Oracle clusters I have found (pros and cons, when and how to use them) is in Tom Kyte's book, Effective Oracle by Design. You can also search asktom for specific cluster usage examples (1, 2 etc). You should definitely take a look at this book if you haven't yet.
Some info you can also find here.
But the thing you should always do before creating complex schema structures is to try, to test, to benchmark and choose the one solution that best fits your needs :)
Hope this helps.
I haven't used Oracle's table clusters myself, but I understand that its index table clusters are very much like MS SQL Server's clustered indexes. That is, the row data is physically organized by the clustered index's key.
That makes one ideal for a heavily-accessed column that has a reasonably small number of possible values (compared to the total number of rows), where most queries want to retrieve all rows with a particular value. Because all such rows are physically stored together, disk I/O, particularly seek time, is reduced.
"Reasonably small" is not easily defined, but postal or zip codes in an address table seems reasonable if you're often querying for all addresses in a single code's region. Province/state/territory codes are likely too small a selection for a country-wide address table.
So, you don't want to use them on columns with few possible values (e.g., M/F for gender) because then the clustering doesn't buy you anything and likely costs you for insertions. You also never want to use clustering on "autonumber" surrogate key columns (from sequences in Oracle) because that will create a "hot spot" in the last extent of the table as all insertions must physically happen there. You also don't want to apply clustering to a column value that will be updated because the RDBMS will have to physically move the record to maintain the clustered ordering.
Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like “regular” tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically an index-organized table is an index without a table. There is a table object which we can find in USER_TABLES but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column then you have a possible candidate for INDEX ORGANIZED.
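The "index without a table" idea can be tried out without an Oracle instance. The sketch below uses SQLite's WITHOUT ROWID tables through Python's sqlite3 module as a stand-in; they are a similar concept (rows stored in the primary key b-tree), not Oracle IOTs, and the table and column names are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Ordinary table: rows live in the table b-tree, and the TEXT primary key
# gets its own automatically created index.
conn.execute("CREATE TABLE country_heap (code TEXT PRIMARY KEY, name TEXT)")
# WITHOUT ROWID: the rows are stored inside the primary key b-tree itself.
conn.execute("CREATE TABLE country_iot (code TEXT PRIMARY KEY, name TEXT) WITHOUT ROWID")

for table in ("country_heap", "country_iot"):
    conn.executemany(
        f"INSERT INTO {table} VALUES (?, ?)", [("NO", "Norway"), ("PE", "Peru")]
    )


def structures(table: str) -> list:
    """List the schema objects (table and index entries) that belong to a table."""
    rows = conn.execute(
        "SELECT type, name FROM sqlite_master WHERE tbl_name = ? ORDER BY type, name",
        (table,),
    )
    return rows.fetchall()


heap_structures = structures("country_heap")  # table entry plus the automatic PK index
iot_structures = structures("country_iot")    # the table entry only
name = conn.execute("SELECT name FROM country_iot WHERE code = ?", ("PE",)).fetchone()[0]
print(heap_structures, iot_structures, name)
```

Both tables are queried with the same SQL; the only visible difference is that the second one has no separate index object to maintain.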
The main use case for index organized table is a table which is almost always accessed by its primary key and we always want to retrieve all its columns. In practice, index organized tables are most likely to be reference data, code look-up affairs. Application tables are almost always heap organized.
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly if we find ourselves contemplating the need for additional indexes on the non-primary key columns then we're probably better off with a regular heap table. So, as most tables probably need additional indexes most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (i.e. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE, END_DATE are common). Also there is no prevailing access path - we want to find all the students who take a class as often as we want to find all the classes a student is taking - so we need an indexing strategy which supports both equally well. I'm not saying intersection tables are not a use case for IOTs, just that they are not automatically so.
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many relationships). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you use an IOT?
Small tables can be good candidates for IOTs as discussed by Richard Foote here
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table that you already are going to have an index that covers all the columns anyway (i.e. may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when related pieces of data must be stored together or data must be physically stored in a specific order. This type of table is often used for information retrieval, spatial (see "Overview of Oracle Spatial"), and OLAP applications (see "OLAP").
This question from AskTom may also be of some interest, especially where someone gives a scenario and then asks whether an IOT would perform better than a heap-organised table. Tom's response is:
we can hypothesize all day long, but until you measure it, you'll never know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations about what other database features can and cannot be used with index-organized tables -- I recall that in at least one version one could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and as you might have spent two or three or more on the IOT/index this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single table hash cluster. When correctly created they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical i/o's to locate the leaf node.
I can't per se comment on IOTs, however if I'm reading this right then they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if your primary key (or the value(s) you're indexing if it's not a primary key) are likely to be distributed fairly randomly - as these inserts can result in many page splits (expensive).
Indexes such as identity columns (sequences in Oracle?) and dates 'around the current date' tend to make for good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index organized tables (IOT) are indexes which actually hold the data which is being indexed, unlike the indexes which are stored somewhere else and have links to actual data.
The database I'm working with is currently over 100 GiB and promises to grow much larger over the next year or so. I'm trying to design a partitioning scheme that will work with my dataset but thus far have failed miserably. My problem is that queries against this database will typically test the values of multiple columns in this one large table, ending up in result sets that overlap in an unpredictable fashion.
Everyone (the DBAs I'm working with) warns against having tables over a certain size and I've researched and evaluated the solutions I've come across but they all seem to rely on a data characteristic that allows for logical table partitioning. Unfortunately, I do not see a way to achieve that given the structure of my tables.
Here's the structure of our two main tables to put this into perspective.
Table: Case
Columns:
Year
Type
Status
UniqueIdentifier
PrimaryKey
etc.
Table: Case_Participant
Columns:
Case.PrimaryKey
LastName
FirstName
SSN
DLN
OtherUniqueIdentifiers
Note that any of the columns above can be used as query parameters.
Rather than guess, measure. Collect usage statistics (queries run), look at the engine's own statistics like sys.dm_db_index_usage_stats, and then make an informed decision: the partitioning that best balances data size and gives the best affinity for the most often run queries will be a good candidate. Of course you'll have to compromise.
Also don't forget that partitioning is per index (where 'table' = one of the indexes), not per table, so the question is not what to partition on, but which indexes to partition or not and what partitioning function to use. Your clustered indexes on the two tables are going to be the most likely candidates obviously (not much sense to partition just a non-clustered index and not partition the clustered one) so, unless you're considering redesign of your clustered keys, the question is really what partitioning function to choose for your clustered indexes.
If I'd venture a guess I'd say that for any data that accumulates over time (like 'cases' with a 'year') the most natural partition is the sliding window.
If you have no other choice you can partition by key modulo the number of partition tables.
Let's say that you want to partition into 10 tables.
You will define tables:
Case00
Case01
...
Case09
And partition your data by UniqueIdentifier or PrimaryKey modulo 10, placing each record in the corresponding table (depending on your UniqueIdentifier you might need to start allocating ids manually).
When performing a query, you will need to run the same query on all tables and use UNION ALL to merge the results into a single result set (the tables hold disjoint rows, so no duplicate elimination is needed).
It's not as good as partitioning the tables based on some logical separation which corresponds to the expected query, but it's better than hitting the size limit of a table.
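The modulo scheme described above can be sketched as follows. This is only an illustration using Python's sqlite3 module with an in-memory database; the column list is cut down from the question's Case table, and the helper names (table_for, insert_case, find_by_status) are hypothetical.

```python
import sqlite3

N_PARTITIONS = 10
conn = sqlite3.connect(":memory:")

# One physical table per remainder: Case00 .. Case09 (columns are illustrative).
for i in range(N_PARTITIONS):
    conn.execute(
        f"CREATE TABLE Case{i:02d} (PrimaryKey INTEGER PRIMARY KEY, Year INTEGER, Status TEXT)"
    )


def table_for(pk: int) -> str:
    """Name of the partition table that holds a given primary key."""
    return f"Case{pk % N_PARTITIONS:02d}"


def insert_case(pk: int, year: int, status: str) -> None:
    conn.execute(f"INSERT INTO {table_for(pk)} VALUES (?, ?, ?)", (pk, year, status))


def find_by_status(status: str) -> list:
    """Run the same query against every partition and merge with UNION ALL."""
    query = " UNION ALL ".join(
        f"SELECT PrimaryKey, Year, Status FROM Case{i:02d} WHERE Status = ?"
        for i in range(N_PARTITIONS)
    )
    return conn.execute(query + " ORDER BY 1", [status] * N_PARTITIONS).fetchall()


for pk in range(1, 26):
    insert_case(pk, 2009, "open" if pk % 3 == 0 else "closed")

open_cases = find_by_status("open")
print(table_for(17), [row[0] for row in open_cases])
```

A lookup by primary key only has to touch the one table returned by table_for; any other predicate fans out to all ten tables, which is the cost this scheme accepts.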
Another possible thing to look at (before partitioning) is your model.
Are you in a normalized database? Are there further steps which could improve performance by different choices in the normalization/de-/partial-normalization? Are there options to transform the data into a Kimball-style dimensional star model which is optimal for reporting/querying?
If you aren't going to drop partitions of the table (sliding window, as mentioned) or treat different partitions differently (you say any columns can be used in the query), I'm not sure what you are trying to get out of the partitioning that you won't already get out of your indexing strategy.
I'm not aware of any table limits on rows. AFAIK, the number of rows is limited only by available storage.
The consensus seems to be that all foreign keys need to have indexes. How much overhead am I going to incur on inserts if I follow the letter of the law?
NOTES:
Assume that the database is a good design, and that all of the joins are legitimate.
All Primary and Foreign Keys are of type Int.
Some tables are lookup tables, with fewer than ten records, that are not likely to grow in size.
It is an OLTP database.
Some of the joins are to lookup tables with fewer than 10 records.
Here is a decent list of examples of when and what type of index to use. I don't think you should accept the "law" and index everything. You need to determine what will be used in the query joins and index accordingly.
There is a significant performance penalty on inserts as all of the indexes need to be updated. Roughly, you will incur one disk write for the insert on the large table and slightly more than one (on average) for each index on the table. Each index leaf node will incur a write, and some additional writes will occur from time to time as the leaf and (less often) parent nodes split.
Each table and index write will also incur log traffic. The particularly nasty penalty is on bulk inserted data, as active indexes on tables where you are inserting bulk loaded data will be updated for each row - and these updates are not minimally logged. This will massively blow out your I/O (which will all be random access rather than nice sequential bulk writes) and will also generate vast amounts of log traffic.
There's no need to put an index on foreign keys that point into lookup tables with small numbers of elements.
The only possible way to answer your question is to test. For instance, if any of the keys have a cardinality of 10, they probably won't be very helpful. So you've got some work to do testing. But it has a lot to do with your table sizes, the key sizes, the absolute activity level and the mix of CRUD elements. Distrust all simple answers.
EDIT:
If you have no data currently because this is the initial design, start with only the obvious indexes and add others as you need them based on testing. It makes little sense to add them all unless it's a low-change database. But if it's read-only, there's not much penalty at all. (Another piece of information you haven't provided.)
The consensus seems to be that all foreign keys need to have indexes. How much overhead am I going to incur on inserts if I follow the letter of the law?
There are two overheads: on DML over the referencing table, and DML over the referenced table.
A referenced table should have an index, otherwise you won't be able to create a FOREIGN KEY.
A referencing table can have no index. Adding one will make INSERTs into the referencing table a little bit slower, and won't affect INSERTs into the referenced table.
Whenever you insert a row into a referencing table, the following occurs:
1. The row is checked against the FOREIGN KEY as in this query:
SELECT TOP 1 NULL
FROM referenced ed
WHERE ed.pk = @new_fk_value
2. The row is inserted.
3. The index on the row (if any) is updated.
The first two steps are always performed, and step 1 generally uses an index on the referenced table (again, you just cannot create a FOREIGN KEY relationship without having this index).
Step 1 is the only overhead specific to a FOREIGN KEY.
The overhead of step 3 is implied only by the fact that the index exists. It would be exactly the same if there were no FOREIGN KEY.
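The check-then-insert sequence can be observed with a small experiment. The sketch below uses SQLite through Python's sqlite3 module as a stand-in for SQL Server (an assumption; the answer's TOP 1 query is SQL Server syntax), with the same referenced/referencing table names and otherwise made-up columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce foreign keys by default

# The referenced column is a primary key, so it is indexed.
conn.execute("CREATE TABLE referenced (pk INTEGER PRIMARY KEY, label TEXT)")
conn.execute(
    "CREATE TABLE referencing ("
    "id INTEGER PRIMARY KEY, fk INTEGER NOT NULL REFERENCES referenced (pk))"
)
conn.execute("INSERT INTO referenced VALUES (1, 'open')")

# The check passes: a parent row with pk = 1 exists, so the row is inserted.
conn.execute("INSERT INTO referencing (fk) VALUES (1)")

# The check fails: there is no parent row with pk = 99, so the insert is rejected.
try:
    conn.execute("INSERT INTO referencing (fk) VALUES (99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

child_rows = conn.execute("SELECT COUNT(*) FROM referencing").fetchone()[0]
print(rejected, child_rows)
```

Note that no index on referencing.fk was needed for any of this; the lookup runs against the referenced table's key.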
But UPDATE's and DELETE's from the referenced table can be much slower if you don't define an index on the referencing table, especially if the latter is large.
Whenever you DELETE from the referenced table, the following occurs:
1. The rows are checked against the FOREIGN KEY as in this query:
SELECT TOP 1 NULL
FROM referencing ing
WHERE ing.fk = @old_pk_value
2. The row is deleted.
3. The index on the row is updated.
It's easy to see that this query will most probably benefit from an index on referencing.fk.
Otherwise, the optimizer will need to build a HASH TABLE over the whole table even if you are deleting a single record to check the constraint.
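To see how much the constraint check depends on an index on referencing.fk, one can compare query plans before and after creating it. This is a sketch with SQLite via Python's sqlite3 module, so the plan text and the planner's strategy differ from SQL Server's; the index name ix_referencing_fk and the plan helper are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE referenced (pk INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE referencing (id INTEGER PRIMARY KEY, fk INTEGER REFERENCES referenced (pk))"
)

# Same shape as the TOP 1 check: is there any child row pointing at the old key?
CHECK = "SELECT 1 FROM referencing WHERE fk = ? LIMIT 1"


def plan(sql: str, params: tuple = ()) -> str:
    """Return the detail text of each EXPLAIN QUERY PLAN row as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


plan_without_index = plan(CHECK, (1,))  # full scan of the referencing table
conn.execute("CREATE INDEX ix_referencing_fk ON referencing (fk)")
plan_with_index = plan(CHECK, (1,))     # index search on fk

print(plan_without_index)
print(plan_with_index)
```

On a large referencing table the first plan means reading every row for each parent row deleted, which is where the slowdown described above comes from.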
The only way to know the impact is to test. The answer may differ greatly depending on whether your system tends to insert large amounts of data in a bulk insert or one record at a time from the user interface. It also depends a lot on the size of the tables and the total number of indexes. Testing is the only way to know for certain what indexes you should use. A general rule of thumb is to start by indexing foreign key fields and fields you will be using in the where clauses. But that's just where to start looking at your system, not the "be all - end all" answer.
I will say that I have observed that users tend to be more tolerant of a little longer time spent on insert than they are of more time spent on querying the system. This is especially true since senior managers tend to do more querying than data entry and they can get cranky and have the power to do something about it if they feel their time is being wasted.
In a new system you need to generate test records at the expected volume the system will have when implemented. If you don't, you will find that the queries (and design) that worked OK in a test bed can be horrible with real users doing multiple things simultaneously against large tables. It's no fun at all to refactor a database where performance wasn't considered and tested in the design. It's no fun to pull back production changes because the query takes longer than the timeout setting because the developer didn't test against the true volume (or, in the case of a new project, the expected volume).
SQL Server has tools to help you determine the best indexes. Use the indexing wizard and the execution plans to see where you need indexes. Put indexes on the fields and test inserts to see if there is a negative impact. There is no one right answer. It won't even stay the same answer for the lifetime of your database in all likelihood.
Insert/update/delete always hits the index and writes to it. Select sometimes hits the index to read from it, depending on the query optimizer's analysis or best guess. If you don't need an index to speed up reads (such as if the column only has a low number of potential values), then get rid of it.
If you have a billion rows in a child table and wish to delete 100 million of them because you're deleting one row from the parent table where that row is the parent to all 100 million of the child rows, then having an index will only slow the whole operation down because the system has to delete from the index too, but won't speed the operation up because the system will not use the index to speed up choosing which rows to delete.
I know performance is a critical issue.
IMO, you should consider the ramifications of not having an index (and therefore no FK) on OLTP data. You can suffer data integrity issues on such a system.
Thank you all for your input.
Based on your feedback, I think I will add indexes to all of the foreign keys EXCEPT those pointing to lookup tables (containing a small number of records that are not likely to change). This will cut the number of required foreign key indexes in half (from ten to five).
If anyone has further insight, feel free to post new answers. I still have some votes left. :)
Are the fields going to be used in searching and sorting? If so, an index might be a good idea. The only way to know is to test, measure, and test again.
Edit: The lookup table will probably be cached, but that won't help a search query against the referencing table (your data table, that is).
Suppose you have a very large database, and to simplify let's say it consists of one major table you will be doing your lookups on with one (and only one) primary key field - pk.
Given the fact that all lookups are going to be basically SELECT * FROM table_name WHERE pk=someKeyValue, what is the best way to optimize this database for the fastest lookups?
Edit: just a few more details - INSERTs and UPDATEs are going to be very infrequent so I don't mind sacrificing performance there to achieve better lookup performance.
Also, seems like clustering is the way to go. Do you have any examples of the kind of increase in performance I can achieve with this method? And how exactly is this done (on any kind of DB)?
If the primary key is clustered, then you won't get any quicker.
If it isn't clustered, and the number of columns in your table is relatively small, then you could in theory create a covering index to speed up the query. But then this negates any insert/update performance enhancements that having the non-clustered primary key would have given you.
If your primary key is an always-increasing field (e.g. a SQL Server identity, or generated from a sequence in Oracle) then the clustered primary key has no drawbacks anyway.
One thing you could do is make the primary key clustered, this results in the actual data being physically ordered on the disk, resulting in faster queries.
It will also mean slower inserts, but if you select much more frequently than you insert, this should not be a problem.
If you're using MySQL, you can do some additional things (beyond tuning your cache values). The table engine can be a factor; for instance, MyISAM is widely held to be faster at SELECTs than InnoDB. If this table is primarily a lookup table, and you were using MySQL, that might be a good thing to do. (InnoDB is pretty good on average; it's better on writes than MyISAM, and also, InnoDB never needs to be repaired.)
I have to add two more options to all that was proposed above (I like dwc’s answer). You should consider partitioning if your table is really big.
First, horizontal partitioning (especially if I/O is the bottleneck in your DB). You create several filegroups and locate them on different hard drives. Then create a Partition Function and a Partition Scheme to divide your table and put parts of your table on separate HDs (like rows 1-499999 to the F: drive, 500000-999999 to the G: drive, and so on).
Second, vertical partitioning. This would work if you select column sets (not *) in most of your queries. In this case, divide columns in the table in two groups: first, fields that you need in all queries; second, fields that you rarely need. Create two tables with the same primary key. Use JOINs on the primary key when you need columns from both tables.
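A minimal sketch of the vertical split, using SQLite through Python's sqlite3 module instead of SQL Server so that it can be run anywhere. The table names item_hot and item_cold and their columns are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Columns needed by almost every query.
conn.execute("CREATE TABLE item_hot (pk INTEGER PRIMARY KEY, name TEXT, price REAL)")
# Rarely needed, bulky columns, keyed by the same primary key.
conn.execute(
    "CREATE TABLE item_cold (pk INTEGER PRIMARY KEY REFERENCES item_hot (pk), description TEXT)"
)

conn.execute("INSERT INTO item_hot VALUES (1, 'bolt', 0.25)")
conn.execute("INSERT INTO item_cold VALUES (1, 'M8 hex bolt, zinc plated')")

# The common query touches only the narrow table.
common = conn.execute("SELECT name, price FROM item_hot WHERE pk = ?", (1,)).fetchone()

# The occasional query joins both halves on the shared primary key.
full = conn.execute(
    "SELECT h.name, h.price, c.description "
    "FROM item_hot h JOIN item_cold c ON c.pk = h.pk "
    "WHERE h.pk = ?",
    (1,),
).fetchone()

print(common, full)
```

The narrow table packs more rows per page, so the frequent lookups read less data; the price is a join whenever the rarely used columns are needed.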
(This answer pertains to SQL Server 2005/2008.)
If all your queries are going to be based off the PK, you wouldn't get any added benefit by setting an index on the PK since it should already be indexing by that.
Edit: The only other possible things I would suggest is looking at normalizing your table (if that is even an option or necessity). By splitting off items into other tables, you can refine what is being pulled back in each query and only pull the less-used items when needed using joins.
Based off the limited description of "a very large database with a single table" it is hard to locate any easy and obvious ways to optimize without looking at what kind of data you are actually storing in your fields.
If your PK order matches insertion order, i.e. time or id/autoincrement, then make it clustered. This will reduce disk and cache thrashing on inserts, leaving more resources to devote to lookups.
Consider tweaking page sizes on the table to be an exact multiple of your record size. This requires intimate knowledge of the particular database software for details of how, and record/index overhead, etc.
If practical, use fixed-size for all columns rather than variable size.
Consider putting the index and/or transaction log files on a separate volume.
Install as much RAM as the software and hardware can use.
If you were using Oracle then I'd advise benchmarking three approaches:
1. Heap table with primary key index
2. Index-organised table
3. Single table hash cluster
1 represents a very vanilla approach -- really it's the lowest common denominator, but could mean 5+ logical reads to get each row, with one of those being a probable physical read of the table if it is not completely cached.
2 will save you one of those logical reads by avoiding the probe to a separate table segment, but might not save you the physical read because the IOT segment will be larger and harder to cache than the index alone.
3 will potentially get you the row with a single logical read, but unless you have the entire table cached that's probably going to translate into a physical read.
Benchmarking is highly recommended.
When working with tables in Oracle, how do you know when you are setting up a good index versus a bad index?
This depends on what you mean by 'good' and 'bad'. Basically you need to realise that every index you add will increase performance on any search by that column (so adding an index to the 'lastname' column of a person table will increase performance on queries that have "where lastname = " in them) but decrease write performance across the whole table.
The reason for this is when you add or update a row, it must add-to or update both the table itself and every index that row is a member of. So if you have five indexes on a table, each addition must write to six places - five indexes and the table - and an update may be touching up to six places in the worst case.
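The "six places" count can be made visible by listing the storage structures behind a table with five indexes. This sketch uses SQLite via Python's sqlite3 module rather than Oracle, and the person table and its columns are hypothetical; the point is only that each index is a separate structure that every insert has to maintain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE person (id INTEGER PRIMARY KEY, lastname TEXT, firstname TEXT, "
    "city TEXT, birthdate TEXT, email TEXT)"
)
for column in ("lastname", "firstname", "city", "birthdate", "email"):
    conn.execute(f"CREATE INDEX ix_person_{column} ON person ({column})")

# Every schema object listed here is a separate b-tree that an INSERT must write to.
write_targets = conn.execute(
    "SELECT type, name FROM sqlite_master WHERE tbl_name = 'person' ORDER BY type, name"
).fetchall()
n_indexes = sum(1 for kind, _ in write_targets if kind == "index")
n_tables = sum(1 for kind, _ in write_targets if kind == "table")

print(n_tables, n_indexes, len(write_targets))
```

An UPDATE only has to touch the indexes whose columns actually change, which is why six is the worst case rather than the rule.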
Index creation is a balancing act then between query speed and write speed. In some cases, such as a datamart that is only loaded with data once a week in an overnight job but queried thousands of times daily, it makes a great deal of sense to overload with indexes and speed the queries up as much as possible. In the case of online transaction processing systems however, you want to try and find a balance between them.
So in short, add indexes to columns that are used a lot in select queries, but try to avoid adding too many and so add the most-used columns first.
After that it's a matter of load testing to see how the performance reacts under production conditions, and a lot of tweaking to find an acceptable balance.
Fields that are diverse, highly specific, or unique make good indexes, such as dates and timestamps, unique incrementing numbers (commonly used as primary keys), people's names, license plate numbers, etc.
A counterexample would be gender - there are only two common values, so the index doesn't really help reduce the number of rows that must be scanned.
Full-length descriptive free-form strings make poor indexes, as whoever is performing the query rarely knows the exact value of the string.
Linearly-ordered data (such as timestamps or dates) are commonly used as a clustered index, which forces the rows to be stored in index order, and allows in-order access, greatly speeding range queries (e.g. 'give me all the sales orders between October and December'). In such a case the DB engine can simply seek to the first record specified by the range and start reading sequentially until it hits the last one.
@Infamous Cow -- you must be thinking of primary keys, not indexes.
@Xenph Yan --
Something others have not touched on is choosing what kind of index to create. Some databases don't really give you much of a choice, but some have a large variety of possible indexes. B-trees are the default but not always the best kind of index. Choosing the right structure depends on the kind of usage you expect to have. What kind of queries do you need to support most? Are you in a read-mostly or write-mostly environment? Are your writes dominated by updates or appends? Etc, etc.
A description of the different types of indexes and their pros and cons is available here: http://20bits.com/2008/05/13/interview-questions-database-indexes/ .
Here's a great SQL Server article:
http://www.sql-server-performance.com/tips/optimizing_indexes_general_p1.aspx
Although the mechanics won't work on Oracle, the tips are very apropos (minus the thing on clustered indexes, which don't quite work the same way in Oracle).
Some rules of thumb if you are trying to improve a particular query.
For a particular table (where you think Oracle should start) try indexing each of the columns used in the WHERE clause. Put columns with equality first, followed by columns with a range or like.
For example:
WHERE CompanyCode = ? AND Amount BETWEEN 100 AND 200
If columns are very large in size (e.g. you are storing some XML or something) you may be better off leaving them out of the index. This will make the index smaller to scan, assuming you have to go to the table row to satisfy the select list anyway.
Alternatively, if all the values in the SELECT and WHERE clauses are in the index, Oracle will not need to access the table row. So sometimes it is a good idea to put the selected values last in the index and avoid a table access altogether.
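Both rules of thumb (equality column first, then the range column, and index-only access when the select list is covered) can be checked against a query planner. The sketch below uses SQLite via Python's sqlite3 module as a stand-in, so Oracle's optimizer may decide differently; the invoices table, the Notes column, and the index name are assumptions built around the WHERE clause from the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoices (id INTEGER PRIMARY KEY, CompanyCode TEXT, Amount REAL, Notes TEXT)"
)
# Equality column first, range column second.
conn.execute("CREATE INDEX ix_company_amount ON invoices (CompanyCode, Amount)")

WHERE = " FROM invoices WHERE CompanyCode = ? AND Amount BETWEEN 100 AND 200"


def plan(sql: str, params: tuple = ()) -> str:
    """Return the detail text of each EXPLAIN QUERY PLAN row as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)


# Notes is not in the index, so each match still needs a table lookup.
plan_with_table = plan("SELECT Notes" + WHERE, ("ACME",))
# Both selected columns are in the index, so the table is never touched.
plan_index_only = plan("SELECT CompanyCode, Amount" + WHERE, ("ACME",))

print(plan_with_table)
print(plan_index_only)
```

In SQLite the second plan is reported as using a covering index, which is the same effect as the index-only access described above.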
You could write a book about the best ways to index - look for author Jonathan Lewis.
A good index is something that you can rely on to be unique for a specific table row.
One commonly used index scheme is the use of numbers which increment by 1 for each row in the table. Every row will end up having a different number index.