Different types of database indexes? - database

I am trying to compile a list of non-system-specific database indexes. I've looked at Oracle, DB2, MySQL, Postgres and Sybase, and almost every resource has a different list. So far I have seen:
clustered, multi-dimensional clustered, unclustered, unique,
non-unique, b-tree, hash, GiST, GIN, full-text, bitmap,
partitioned, function-based.
It seems that different systems have different names for the same types of indexes.
Are there standard index types across all systems?

If for whatever reason somebody else comes across this and is wondering the same thing, I ended up finding a good list at:
http://en.wikipedia.org/wiki/Comparison_of_relational_database_management_systems#Indexes

Many of the these concepts are orthogonal. A clustered index means that the rows are arranged in the table in the same order as they appear in the index. Independently, that index can be implemented using a B-tree, a B+ tree, a hash, spatially, etc. And then it may partition the table or not. One aspect may constrain but does not necessarily imply another.

You should scour harder :-) - Wiki gives a good description
http://en.wikipedia.org/wiki/Database_index

http://msdn.microsoft.com/en-us/library/ms175049.aspx
http://publib.boulder.ibm.com/infocenter/db2luw/v9r5/index.jsp?topic=%2Fcom.ibm.db2.luw.admin.dbobj.doc%2Fdoc%2Fc0020180.html
these links might give more clear idea...

Related

Unique identifier as primary key when syncing databases

I have spent the best part of a day researching and testing different methods of syncing a SQL server database with CoreData on the Mac. I have tested both INT's and GUID's (sequential GUID's also) as my primary keys and although GUID's are by far the worst in terms of performance I can't see no other way of ensuring uniqueness across systems.
Is using GUID's for primary keys the wrong way to go when syncing data between platforms? I find it hard to believe that companies use GUID's when syncing, but most articles I read on the subject seem to point to just that. If developers are using GUID's does anybody know how performance can be improved? I tried using a GUID as a primary key with a non clustered index and created a date field as my clustered index with no great performance improvement.
Any help would be much appreciated, especially if you have tackled a similar problem.
GUIDs make syncing a lot easier. Sequential guids will alleviate the fragmentation issue strongly, leaving only the 16 byte column size as the major issue.
As long as you ensure you have another sequential and narrow column as your clustered key, you'll save a lot of space for your nonclustered indexes - it seems like you already know this.
Assuming you're not dealing with GBs of data, performance shouldn't be that affected by a GUID in this case, given you've already handled the GUID column with proper care.
If you only need to sync two systems, I've previously created systems where A would use an identity(-1,-1) as the primary key where the other system used an identity(1,1) as the primary key. That ensure easy syncing while keeping the primary key nice and narrow. Won't work for more than two systems however.
Agreed, when you use GUIDs you may come across huge fragmentation of the indexes that use it as the key. Other ways of ensuring uniqueness across systems is to
use identity columns and seed them to different non-overlapping ranges:
1-100 million server A
100million1 -200 million server B
etc
or to use composite keys (identity int + location code) to distinguish original locations of data:
Three different rows:
1 AB
1 BZ
1 XV
Regards
Piotr

How do I know when to index a column, and with what?

In docs for various ORMs they always provide a way to create indexes, etc. They always mention to be sure to create the appropriate indexes for efficiency, as if that is inherent knowledge to a non-hand-written-SQLer who needs to use an ORM. My understanding of indexes (outside of PK) is basically: If you plan to do LIKE queries (ie, search) based on the contents of a column, you should use a full text index for that column. What else should I know regarding indexes (mostly pertaining to efficiency)? I feel like there is a world of knowledge at my door step, but there's a huge folded mouse pad jammed up under it, so I can't get through (I don't know why I felt like I needed to say that, but thanks for providing the couch).
Think of an index very roughly like the index in the back of a book. It's a totally separate area from the content of the book, where if you are seeking some specific value, you can go to the index and look it up (indexes are ordered, so finding things there is much quicker than scanning every page of the book).
The index entry has a page number, so you can then quickly go to the page seeking your topic. A database index is very similar; it is an ordered list of the relevant information in your database (the field(s) included in the index), with information for the database to find the records which match.
So... you would create an index when you have information that you need to search on frequently. Normal indexes don't help you for 'partial' seeks like LIKE queries, but any time you need to get a set of results where field X has certain value(s), they keep the DBMS from needing to 'scan' the whole table, looking for matching values.
They also help when you need to sort on a column.
Another thing to keep in mind; If the DBMS allows you to create single indexes that have multiple fields, be sure to investigate the effects of doing so, specific to your DBMS. An index that includes multiple fields is likely only to be fully (or at all) useful if all those fields are being used in a query. Conversely, having multiple indexes for a single table, with one field per index, may not be of much (or any) help for queries that are filtering/sorting by multiple fields.
You mentioned Full Text indexes and PKs (Primary Keys). These are different than regular indexes, though they often serve similar purposes.
First, note that a Primary Key is usually an index (in MSSQL, a 'Clustered Index', in fact), but this does not need to be the case specifically. As an example, an MSSQL PK is a Clustered Index by default; clustered indexes are special in that they are not a separate bit of data stored elsewhere, but the data itself is arranged in the table in order by the Clustered Index. This is why a popular PK is an int value that is auto-generated with sequential, increasing values. So, a Clustered Index sorts the data in the table specifically by the field's value. Compare this to a traditional dictionary; the entries themselves are ordered by the 'key', which is the word being defined.
But in MSSQL (check your DBMS documentation for your information), you can change the Clustered Index to be a different field, if you like. Sometimes this is done on datetime based fields.
Full Text indexes are different kinds of beasts entirely. They use some of the same principles, but what they are doing isn't exactly the same as normal indexes, which I am describing. Also: in some DBMS's, LIKE queries do not use the full text index; special query operators are required.
These indexes are different because their intent is not to find/sort on the whole value of the column (a number, a date, a short bit of char data), but instead to find individual words/phrases within the text field(s) being indexed.
They can also often enable searching for similar words, different tenses, common misspellings and the like, and typically ignore noise words. The different way in which they work is why they also may need different operators to use them. (again, check your local documentation for your DBMS!)
This answer is Oracle-specific, but the main points in the answers apply to most relational database systems
How to choose and optimize oracle indexes?

When should I use Oracle's Index Organized Table? Or, when shouldn't I?

Index Organized Tables (IOTs) are tables stored in an index structure. Whereas a table stored
in a heap is unorganized, data in an IOT is stored and sorted by primary key (the data is the index). IOTs behave just like “regular” tables, and you use the same SQL to access them.
Every table in a proper relational database is supposed to have a primary key... If every table in my database has a primary key, should I always use an index organized table?
I'm guessing the answer is no, so when is an index organized table not the best choice?
Basically an index-organized table is an index without a table. There is a table object which we can find in USER_TABLES but it is just a reference to the underlying index. The index structure matches the table's projection. So if you have a table whose columns consist of the primary key and at most one other column then you have a possible candidate for INDEX ORGANIZED.
The main use case for index organized table is a table which is almost always accessed by its primary key and we always want to retrieve all its columns. In practice, index organized tables are most likely to be reference data, code look-up affairs. Application tables are almost always heap organized.
The syntax allows an IOT to have more than one non-key column. Sometimes this is correct. But it is also an indication that maybe we need to reconsider our design decisions. Certainly if we find ourselves contemplating the need for additional indexes on the non-primary key columns then we're probably better off with a regular heap table. So, as most tables probably need additional indexes most tables are not suitable for IOTs.
Coming back to this answer I see a couple of other responses in this thread propose intersection tables as suitable candidates for IOTs. This seems reasonable, because it is common for intersection tables to have a projection which matches the candidate key: STUDENTS_CLASSES could have a projection of just (STUDENT_ID, CLASS_ID).
I don't think this is cast-iron. Intersection tables often have a technical key (i.e. STUDENT_CLASS_ID). They may also have non-key columns (metadata columns like START_DATE, END_DATE are common). Also there is no prevailing access path - we want to find all the students who take a class as often as we want to find all the classes a student is taking - so we need an indexing strategy which supports both equally well. Not saying intersection tables are not a use case for IOTs. just that they are not automatically so.
I'd consider them for very narrow tables (such as the join tables used to resolve many-to-many tables). If (virtually) all the columns in the table are going to be in an index anyway, then why shouldn't you used an IOT.
Small tables can be good candidates for IOTs as discussed by Richard Foote here
I consider the following kinds of tables excellent candidates for IOTs:
"small" "lookup" type tables (e.g. queried frequently, updated infrequently, fits in a relatively small number of blocks)
any table that you already are going to have an index that covers all the columns anyway (i.e. may as well save the space used by the table if the index duplicates 100% of the data)
From the Oracle Concepts guide:
Index-organized tables are useful when
related pieces of data must be stored
together or data must be physically
stored in a specific order. This type
of table is often used for information
retrieval, spatial (see "Overview of
Oracle Spatial"), and OLAP
applications (see "OLAP").
This question from AskTom may also be of some interest especially where someone gives a scenario and then asks would an IOT perform better than an heap organised table, Tom's response is:
we can hypothesize all day long, but
until you measure it, you'll never
know for sure.
An index-organized table is generally a good choice if you only access data from that table by the key, the whole key, and nothing but the key.
Further, there are many limitations about what other database features can and cannot be used with index-organized tables -- I recall that in at least one version one could not use logical standby databases with index-organized tables. An index-organized table is not a good choice if it prevents you from using other functionality.
All an IOT really saves is the logical read(s) on the table segment, and as you might have spent two or three or more on the IOT/index this is not always a great saving except for small data sets.
Another feature to consider for speeding up lookups, particularly on larger tables, is a single table hash cluster. When correctly created they are more efficient for large data sets than an IOT because they require only one logical read to find the data, whereas an IOT is still an index that needs multiple logical i/o's to locate the leaf node.
I can't per se comment on IOTs, however if I'm reading this right then they're the same as a 'clustered index' in SQL Server. Typically you should think about not using such an index if your primary key (or the value(s) you're indexing if it's not a primary key) are likely to be distributed fairly randomly - as these inserts can result in many page splits (expensive).
Indexes such as identity columns (sequences in Oracle?) and dates 'around the current date' tend to make for good candidates for such indexes.
An Index-Organized Table--in contrast to an ordinary table--has its own way of structuring, storing, and indexing data.
Index organized tables (IOT) are indexes which actually hold the data which is being indexed, unlike the indexes which are stored somewhere else and have links to actual data.

Advantages and disadvantages of GUID / UUID database keys

I've worked on a number of database systems in the past where moving entries between databases would have been made a lot easier if all the database keys had been GUID / UUID values. I've considered going down this path a few times, but there's always a bit of uncertainty, especially around performance and un-read-out-over-the-phone-able URLs.
Has anyone worked extensively with GUIDs in a database? What advantages would I get by going that way, and what are the likely pitfalls?
Advantages:
Can generate them offline.
Makes replication trivial (as opposed to int's, which makes it REALLY hard)
ORM's usually like them
Unique across applications. So We can use the PK's from our CMS (guid) in our app (also guid) and know we are NEVER going to get a clash.
Disadvantages:
Larger space use, but space is cheap(er)
Can't order by ID to get the insert order.
Can look ugly in a URL, but really, WTF are you doing putting a REAL DB key in a URL!? (This point disputed in comments below)
Harder to do manual debugging, but not that hard.
Personally, I use them for most PK's in any system of a decent size, but I got "trained" on a system which was replicated all over the place, so we HAD to have them. YMMV.
I think the duplicate data thing is rubbish - you can get duplicate data however you do it. Surrogate keys are usually frowned upon where ever I've been working. We DO use the WordPress-like system though:
unique ID for the row (GUID/whatever). Never visible to the user.
public ID is generated ONCE from some field (e.g. the title - make it the-title-of-the-article)
UPDATE:
So this one gets +1'ed a lot, and I thought I should point out a big downside of GUID PK's: Clustered Indexes.
If you have a lot of records, and a clustered index on a GUID, your insert performance will SUCK, as you get inserts in random places in the list of items (that's the point), not at the end (which is quick).
So if you need insert performance, maybe use a auto-inc INT, and generate a GUID if you want to share it with someone else (e.g., showing it to a user in a URL).
Why doesn't anyone mention performance? When you have multiple joins, all based on these nasty GUIDs the performance will go through the floor, been there :(
#Matt Sheppard:
Say you have a table of customers. Surely you don't want a customer to exist in the table more than once, or lots of confusion will happen throughout your sales and logistics departments (especially if the multiple rows about the customer contain different information).
So you have a customer identifier which uniquely identifies the customer and you make sure that the identifier is known by the customer (in invoices), so that the customer and the customer service people have a common reference in case they need to communicate. To guarantee no duplicated customer records, you add a uniqueness-constraint to the table, either through a primary key on the customer identifier or via a NOT NULL + UNIQUE constraint on the customer identifier column.
Next, for some reason (which I can't think of), you are asked to add a GUID column to the customer table and make that the primary key. If the customer identifier column is now left without a uniqueness-guarantee, you are asking for future trouble throughout the organization because the GUIDs will always be unique.
Some "architect" might tell you that "oh, but we handle the real customer uniqueness constraint in our app tier!". Right. Fashion regarding that general purpose programming languages and (especially) middle tier frameworks changes all the time, and will generally never out-live your database. And there is a very good chance that you will at some point need to access the database without going through the present application. == Trouble. (But fortunately, you and the "architect" are long gone, so you will not be there to clean up the mess.) In other words: Do maintain obvious constraints in the database (and in other tiers, as well, if you have the time).
In other words: There may be good reasons to add GUID columns to tables, but please don't fall for the temptation to make that lower your ambitions for consistency within the real (==non-GUID) information.
The main advantages are that you can create unique id's without connecting to the database. And id's are globally unique so you can easilly combine data from different databases. These seem like small advantages but have saved me a lot of work in the past.
The main disadvantages are a bit more storage needed (not a problem on modern systems) and the id's are not really human readable. This can be a problem when debugging.
There are some performance problems like index fragmentation. But those are easilly solvable (comb guids by jimmy nillson: http://www.informit.com/articles/article.aspx?p=25862 )
Edit merged my two answers to this question
#Matt Sheppard I think he means that you can duplicate rows with different GUIDs as primary keys. This is an issue with any kind of surrogate key, not just GUIDs. And like he said it is easilly solved by adding meaningfull unique constraints to non-key columns. The alternative is to use a natural key and those have real problems..
GUIDs may cause you a lot of trouble in the future if they are used as "uniqifiers", letting duplicated data get into your tables. If you want to use GUIDs, please consider still maintaining UNIQUE-constraints on other column(s).
One other small issue to consider with using GUIDS as primary keys if you are also using that column as a clustered index (a relatively common practice). You are going to take a hit on insert because of the nature of a guid not begin sequential in anyway, thus their will be page splits, etc when you insert. Just something to consider if the system is going to have high IO...
primary-keys-ids-versus-guids
The Cost of GUIDs as Primary Keys (SQL Server 2000)
Myths, GUID vs. Autoincrement (MySQL 5)
This is realy what you want.
UUID Pros
Unique across every table, every database, every server
Allows easy merging of records from different databases
Allows easy distribution of databases across multiple servers
You can generate IDs anywhere, instead of having to roundtrip to the database
Most replication scenarios require GUID columns anyway
GUID Cons
It is a whopping 4 times larger than the traditional 4-byte index value; this can have serious performance and storage implications if you're not careful
Cumbersome to debug (where userid='{BAE7DF4-DDF-3RG-5TY3E3RF456AS10}')
The generated GUIDs should be partially sequential for best performance (eg, newsequentialid() on SQL 2005) and to enable use of clustered indexes
There is one thing that is not really addressed, namely using random (UUIDv4) IDs as primary keys will harm the performance of the primary key index. It will happen whether or not your table is clustered around the key.
RDBMs usually ensure the uniqueness of the primary keys, and ensure the lookups by a key, in a structure called BTree, which is a search tree with a large branching factor (a binary search tree has branching factor of 2). Now, a sequential integer ID would cause the inserts to occur just one side of the tree, leaving most of the leaf nodes untouched. Adding random UUIDs will cause the insertions to split leaf nodes all over the index.
Likewise if the data stored is mostly temporal, it is often the case that the most recent data needs to be accessed and joined against the most. With random UUIDs the patterns will not benefit from this, and will hit more index rows, thereby needing more of the index pages in memory. With sequential IDs if the most-recent data is needed the most, the hot index pages would require less RAM.
Advantages:
UUID values are unique between tables and databases. Thats why it can be merge rows between two databases or distributed databases.
UUID is more safer to pass through url than integer type data.
If one pass UUID through url, attackers can't guess the next id.But if we pass Integer type such as 10, then attackers can guess the next id is 11 then 12 etc.
UUID can generate offline.
One thing not mentioned so far: UUIDs make it much harder to profile data
For web apps at least, it's common to access a resource with the id in the url, like stackoverflow.com/questions/45399. If the id is an integer, this both
provides information about the number of questions (ie September 5th, 2008, the 45,399th question was asked)
provides a leverage point to iterate through questions (what happens when I increment that by 1? I open the next asked question)
From the first point, I can combine the timestamp from the question and the number to profile how frequently questions are asked and how that changes over time. this matters less on a site like Stack Overflow, with publicly available information, but, depending on context, this may expose sensitive information.
For example, I am a company that offers customers a permissions gated portal. the address is portal.com/profile/{customerId}. If the id is an integer, you could profile the number of customers regardless of being able to see their information by querying for lastKnownCustomerCount + 1 regularly, and checking if the result is 404 - NotFound (customer does not exist) or 403 - Forbidden (customer does exist, but you do not have access to view).
UUIDs non-sequential nature mitigate these issues. This isn't a garunted to prevent profiling, but it's a start.

When do you use table clusters?

How do you determine when to use table clusters? There are two types, index and hash, to use for different cases. In your experience, have the introduction and use of table clusters paid off?
If none of your tables are set up this way, modifying them to use table clusters would add to the complexity of the set up. But would the expected performance benefits outweight the cost of increased complexity in future maintenance work?
Do you have any favorite online references or books that describe table clustering well and give good implementation examples?
//Oracle tips greatly appreciated.
The killer feature of table clusters is that you can store related rows of different tables at the same physical location.
That can improve join performance by an order of magnitude. However, it doesn't pay of so often as it sounds.
The only time I used it was a three-table join, executed by two hash joins. It took too long ;). However, the join was on the same column, so it was possible to use a hash table cluster keyed by the join column. That caused all related rows to be stored alongside (ideally, in the same database block). Knowing that, Oracle can execute the join with a special optimization ("cluster join").
It's more or less pre-joined, but still feeling like normal tables (for INSERT/SELECT/UPDATE/DELETE).
On the other hand, there are "single table clusters" that are mostly used to control the "clustering factor" -- A similar idea like clustered indexes (called Index-Organized-Table in Oracle) but not adding high cost if using a secondary index.
One can speak a lot about clustering, but I found that almost ultimate explanation about Oracle clusters (pros and cons, when to use and how to use) can be found in Tom Kyte's book - Effective Oracle by Design, also you can search asktom for some specific cluster usage examples (1, 2 etc). You should definitely take a look at this book if you haven't yet.
Some info you can also find here.
But the thing you should always do before creating complex schema structures is to try, to test, to benchmark and choose the one solution that best fits your needs :)
Hope this helps.
I haven't used Oracle's table clusters myself, but I understand that its index table clusters are very much like MS SQL Server's clustered indexes. That is, the row data is physically organized by the clustered index's key.
That makes one ideal for a heavily-accessed column that has a reasonably small number of possible values (compared to the total number of rows), where most queries want to retrieve all rows with a particular value. Because all such rows are physically stored together, disk I/O, particularly seek time, is reduced.
"Reasonably small" is not easily defined, but postal or zip codes in an address table seems reasonable if you're often querying for all addresses in a single code's region. Province/state/territory codes are likely too small a selection for a country-wide address table.
So, you don't want to use them on columns with few possible values (e.g., M/F for gender) because then the clustering doesn't buy you anything and likely costs you for insertions. You also never want to use clustering on "autonumber" surrogate key columns (from sequences in Oracle) because that will create a "hot spot" in the last extent of the table as all insertions must physically happen there. You also don't want to apply clustering to a column value that will be updated because the RDBMS will have to physically move the record to maintain the clustered ordering.

Resources