When working with tables in Oracle, how do you know when you are setting up a good index versus a bad index?
This depends on what you mean by 'good' and 'bad'. Basically you need to realise that every index you add will increase performance on any search by that column (so adding an index to the 'lastname' column of a person table will increase performance on queries that have "where lastname = " in them) but decrease write performance across the whole table.
The reason for this is when you add or update a row, it must add-to or update both the table itself and every index that row is a member of. So if you have five indexes on a table, each addition must write to six places - five indexes and the table - and an update may be touching up to six places in the worst case.
Index creation is a balancing act then between query speed and write speed. In some cases, such as a datamart that is only loaded with data once a week in an overnight job but queried thousands of times daily, it makes a great deal of sense to overload with indexes and speed the queries up as much as possible. In the case of online transaction processing systems however, you want to try and find a balance between them.
So in short, add indexes to columns that are used a lot in select queries, but try to avoid adding too many and so add the most-used columns first.
After that it's a matter of load testing to see how the performance reacts under production conditions, and a lot of tweaking to find an acceptable balance.
Fields that are diverse, highly specific, or unique make good indexes: dates and timestamps, unique incrementing numbers (commonly used as primary keys), people's names, license plate numbers, etc.
A counterexample would be gender - there are only two common values, so the index doesn't really help reduce the number of rows that must be scanned.
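One rough way to put a number on "diverse" is selectivity: distinct values divided by total rows. The sketch below uses made-up sample data, and `selectivity` is a hypothetical helper name, not something from any database API.

```python
def selectivity(values):
    """Return distinct-value count divided by row count (1.0 means unique)."""
    values = list(values)
    if not values:
        return 0.0
    return len(set(values)) / len(values)


# Made-up columns for a 1,000-row person table.
person_ids = list(range(1000))                            # one value per row
genders = ["F" if i % 2 else "M" for i in range(1000)]    # only two values

id_selectivity = selectivity(person_ids)
gender_selectivity = selectivity(genders)
print(id_selectivity, gender_selectivity)
```

A value near 1.0 means an equality lookup narrows the search to very few rows; a value near 0 means the index still leaves a large share of the table to read.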
Full-length descriptive free-form strings make poor indexes, as whoever is performing the query rarely knows the exact value of the string.
Linearly-ordered data (such as timestamps or dates) are commonly used as a clustered index, which forces the rows to be stored in index order, and allows in-order access, greatly speeding range queries (e.g. 'give me all the sales orders between October and December'). In such a case the DB engine can simply seek to the first record specified by the range and start reading sequentially until it hits the last one.
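The seek-then-read-sequentially behaviour can be tried out in miniature. This is only a sketch using Python's built-in sqlite3 module as a stand-in: it uses an ordinary (non-clustered) index, the table and index names are invented, and other engines will print their plans differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales_orders (id INTEGER PRIMARY KEY, order_date TEXT, total REAL)"
)
conn.execute("CREATE INDEX idx_orders_date ON sales_orders (order_date)")

# One made-up order per month of 2008.
rows = [(f"2008-{month:02d}-15", 100.0 * month) for month in range(1, 13)]
conn.executemany("INSERT INTO sales_orders (order_date, total) VALUES (?, ?)", rows)

range_sql = (
    "SELECT order_date, total FROM sales_orders "
    "WHERE order_date BETWEEN ? AND ? ORDER BY order_date"
)
params = ("2008-10-01", "2008-12-31")

# Column 3 of each EXPLAIN QUERY PLAN row is the human-readable detail.
plan = " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + range_sql, params))
q4_orders = conn.execute(range_sql, params).fetchall()
print(plan)
print(q4_orders)
```

The plan text should mention the index by name, which indicates the engine searched the index for the range instead of scanning every row.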
#Infamous Cow -- you must be thinking of primary keys, not indexes.
#Xenph Yan --
Something others have not touched on is choosing what kind of index to create. Some databases don't really give you much of a choice, but some have a large variety of possible indexes. B-trees are the default but not always the best kind of index. Choosing the right structure depends on the kind of usage you expect to have. What kind of queries do you need to support most? Are you in a read-mostly or write-mostly environment? Are your writes dominated by updates or appends? Etc, etc.
A description of the different types of indexes and their pros and cons is available here: http://20bits.com/2008/05/13/interview-questions-database-indexes/ .
Here's a great SQL Server article:
http://www.sql-server-performance.com/tips/optimizing_indexes_general_p1.aspx
Although the mechanics won't work on Oracle, the tips are very apropos (minus the thing on clustered indexes, which don't quite work the same way in Oracle).
Some rules of thumb if you are trying to improve a particular query.
For a particular table (where you think Oracle should start) try indexing each of the columns used in the WHERE clause. Put columns with equality first, followed by columns with a range or LIKE.
For example:
WHERE CompanyCode = ? AND Amount BETWEEN 100 AND 200
If columns are very large in size (e.g. you are storing some XML or something) you may be better off leaving them out of the index. This will make the index smaller to scan, assuming you have to go to the table row to satisfy the select list anyway.
Alternatively, if all the values in the SELECT and WHERE clauses are in the index, Oracle will not need to access the table row. So sometimes it is a good idea to put the selected values last in the index and avoid a table access altogether.
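Both rules of thumb (equality column first, then the range column; and covering the select list to skip the table access) can be sketched with SQLite through Python's sqlite3 module. This is a stand-in, not Oracle: the `payments` table, its columns, and the index name are assumptions made up for the example, and Oracle's plan output looks different.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE payments (
        id INTEGER PRIMARY KEY,
        company_code TEXT,
        amount INTEGER,
        notes TEXT
    )
""")
# Equality column first, range column second.
conn.execute("CREATE INDEX idx_pay_company_amount ON payments (company_code, amount)")
conn.executemany(
    "INSERT INTO payments (company_code, amount, notes) VALUES (?, ?, ?)",
    [("ACME", 50, "a"), ("ACME", 150, "b"), ("ACME", 250, "c"), ("INIT", 150, "d")],
)


def plan_for(sql, params=()):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN for a statement."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params))


where = " FROM payments WHERE company_code = ? AND amount BETWEEN 100 AND 200"

# 'notes' is not in the index, so the table row must still be fetched.
needs_table_plan = plan_for("SELECT notes" + where, ("ACME",))
# Every selected column is in the index, so the table can be skipped.
covered_plan = plan_for("SELECT company_code, amount" + where, ("ACME",))

matches = conn.execute("SELECT notes" + where, ("ACME",)).fetchall()
print(needs_table_plan)
print(covered_plan)
print(matches)
```

In SQLite the second plan is reported as a covering index, which is the same idea as avoiding the table access described above.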
You could write a book about the best ways to index - look for author Jonathan Lewis.
A good index is something that you can rely on to be unique for a specific table row.
One commonly used index scheme is the use of numbers which increment by 1 for each row in the table. Every row will end up having a different number index.
Good day,
In SQL Server 2005, I have a table with numerous columns, including a few boolean (bit) columns. For example,
table 'Person' has columns ID and columns HasItem1, HasItem2, HasItem3, HasItem4. This table is kinda large, so I would like to create indexes to get faster search results.
I know that it is not a good idea to create an index on a bit column, so I thought about using an index with all of the bit columns. However, the thing is, any of these bit columns may or may not be in the query. Since the order of the indexed columns is important in an index, and I don't know which ones will be used in the query, how should I handle this?
BTW, there is already a clustered index that I can't remove.
I would suggest that this is probably not a good idea. Trying to index fields with very low cardinality will generally not make queries faster and you have the overhead of maintaining the index as well.
If you generally search for one of your bit fields with another field then a composite index on the two fields would probably benefit you.
If you were to create a composite index on the bit fields then this would help but only if the composite fields at the beginning of the index were provided. If you do not include the 1st value within the composite index then the index will probably not be used at all.
If, as an example, bita was used in 90% of your queries, bitd in 70%, and bits b and c in 20%, then a composite index on (bita, bitd, bitb, bitc) would probably yield some benefit, but for at least 10% of your queries and possibly even 40% the index would most likely not be used.
The best advice is probably to try it with the same data volumes and data cardinality and see what the Execution plan says.
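As a small illustration of checking the plan, here is a sketch using SQLite via Python's sqlite3 module (a stand-in, not SQL Server; the table, column and index names are invented). It compares a filter on the leading column of a composite index with a filter on only the second column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE person (
        id INTEGER PRIMARY KEY,
        has_item1 INTEGER,
        has_item2 INTEGER,
        name TEXT
    )
""")
conn.execute("CREATE INDEX idx_person_items ON person (has_item1, has_item2)")


def plan_for(sql):
    """Return the detail text of SQLite's EXPLAIN QUERY PLAN for a statement."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Filter on the leading column of the composite index.
leading_plan = plan_for("SELECT name FROM person WHERE has_item1 = 1")
# Filter on the second column only.
trailing_plan = plan_for("SELECT name FROM person WHERE has_item2 = 1")

print(leading_plan)
print(trailing_plan)
```

Here the first plan names the index and the second falls back to a scan. Some engines can still use an index without its leading column under certain conditions (a "skip scan"), so the real test is the plan on your own data.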
I don't know a lot of specifics on SQL Server, but in general indexing a column that has non-unique data is not very effective. In some RDBMS systems, the optimizer will ignore indexes that are less than a certain percent unique anyway, so the index may as well not even exist.
Using a composite, or multi-column, index can help, but only in particular cases where the filter constraints are in the same order that the index was built in. If your index includes 'field1, field2' and you are searching for 'field2, field1' or some other combination, the index may not be used. You could add an index for each of the particular search cases that you want to optimize; that is really all I can think of that you could do. And if your data is not very unique, even after considering all of the bit fields, the index may be ignored anyway.
For example, if you have 3 bit fields, you are only segmenting your data into 8 distinct groups. If you have a reasonable number of rows in the table, segmenting it by 8 isn't going to be very effective.
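The arithmetic behind this is short enough to write down. The sketch assumes a made-up table of 1,000,000 rows with the flag values spread evenly, which real data rarely is; `average_group_size` is a hypothetical helper.

```python
def average_group_size(row_count, bit_columns):
    """Average rows per distinct combination of two-valued flag columns."""
    combinations = 2 ** bit_columns
    return row_count / combinations


groups = 2 ** 3                                     # combinations of 3 bit fields
rows_per_group = average_group_size(1_000_000, 3)   # rows left to read per combination
print(groups, rows_per_group)
```

Even with all three bits specified, an index on them would still point at about an eighth of the table.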
Odds are it will be easier for SQL to query the large table with the person_id and item_id and BitValue than it will be to search a single table with Item1, Item2, ... ItemN.
I don't know about 2005 but in SQL Server 2000 (From Books Online):
"Columns of type bit cannot have indexes on them."
How about using checksum?
Add an int field named mysum to your table and execute this:
UPDATE checksumtest SET mysum = CHECKSUM(hasitem1,hasitem2,hasitem3,hasitem4)
Now you have a value that represents the combination of bits.
Do the same checksum calc in your search query and match on mysum.
This may speed things up.
You should revisit the design of your database. Instead of having a table with fields HasItem1 to HasItem#, you should create a bridge entity, and a master Items table if you don't have one. The bridge entity (table), person_items, would have (a minimum of) two fields: person_id and item_id.
Designing the database this way doesn't lock you in to a database that only handles N number of items based on column definitions. You can add as many items as you want to a master Items table, and associate as many of them as you need with as many people as you need.
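A minimal sketch of that bridge-table layout, using SQLite through Python's sqlite3 module as a stand-in for SQL Server. The sample rows and the extra index on (item_id, person_id) are assumptions added for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (person_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE items  (item_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);

    -- Bridge table: one row per (person, item) pair that exists.
    CREATE TABLE person_items (
        person_id INTEGER NOT NULL REFERENCES person (person_id),
        item_id   INTEGER NOT NULL REFERENCES items (item_id),
        PRIMARY KEY (person_id, item_id)
    );
    -- Lets "who has item X?" search by item_id instead of scanning.
    CREATE INDEX idx_person_items_item ON person_items (item_id, person_id);

    INSERT INTO person VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO items  VALUES (1, 'Item1'), (2, 'Item2');
    INSERT INTO person_items VALUES (1, 1), (1, 2), (2, 2);
""")

owners_of_item2 = [
    name
    for (name,) in conn.execute(
        """
        SELECT p.name
        FROM person_items AS pi
        JOIN person AS p ON p.person_id = pi.person_id
        WHERE pi.item_id = ?
        ORDER BY p.name
        """,
        (2,),
    )
]
print(owners_of_item2)
```

Adding a new item becomes an INSERT into items, with no schema change and no new HasItemN column to index.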
Good day,
I have about 4GB of data, separated in about 10 different tables. Each table has a lot of columns, and each column can be a search criteria in a query. I'm not a DBA at all, and I don't know much about indexes, but I want to speed up the search as much as possible. The important point is, there won't be any update, insert or delete at any moment (the tables are populated once every 4 months). Is it appropriate to create an index on each and every column? Remember: no insert, update or delete, only selects!
Also, if I can make all of these columns integer instead of varchar, would it make a difference in speed?
Thank you very much!
Answer: No. Indexing every column separately is not good design. Indexes need to comprise multiple columns in many cases, and there are different types of indexes for different requirements.
The tuning wizard mentioned in other answers is a good first cut (esp. for a learner).
Don't try to guess your way through it, or hope you understand complex analyses - get advice specific to your situation. We seem to have several threads going here that are quite active for specific situations and query optimization.
Have you looked at running the Index Tuning Wizard? It will give you suggestions of indexes based on a workload.
Absolutely not.
You have to understand how indexes work. Say you have a table of 1000 records with a BIT column, which can hold one of two values: if you index on that column and that column only, the index will be worthless, because it will not be selective enough. When you index on a column, be very cognizant of what types of selects are going to be done on the table. When you create an index on a column, will that index be selective enough for the optimizer to use effectively?
To that point, you may very well find that a few carefully selected composite indexes will vastly outperform the solution of many single indexes on each column. The golden rule: how the database is queried will determine how you should make your indexes.
Two pieces of missing information: how many distinct values are in each column, and which DBMS you're using. If you're using Oracle and have less than a few thousand distinct values per column, you can create bitmap indexes. These are very space- and execution-efficient for exact matches.
Otherwise, it's a tradeoff: each index will add roughly the same amount of space as a one-column table containing the same data, so you'll essentially double (probably 2.5x) your space requirements. So maybe 10G, which isn't a whole lot of data.
Then there's the question of whether your DBMS will efficiently merge multiple index-based selects. It's quite possible that it won't, unless you do self-joins for every column that you're selecting against.
Best answer: try it on a smaller dataset (so that you're not spending all your time building the indexes) and see how it works.
If you are selecting a set of columns from the table greater than those covered by the columns in the selected indexes, then you will inevitably incur a bookmark lookup in the query plan, which is where the query processor has to retrieve the non-covered columns from the clustered index using the reference ID from leaf rows in the associated non-clustered index.
In my experience, bookmark lookups can really kill query performance, due to the volume of extra reads required and the fact that each row in the clustered index has to be resolved individually. This is why I try to make NC indexes covering wherever possible, which is easier on smaller tables where the required query plans are well-known, but if you have large tables with lots of columns with arbitrary queries expected then this probably won't be feasible.
This means you only get bang for your buck with an NC index of any kind, if the index is covering, or selects a small-enough data set that the cost of a bookmark lookup is mitigated - indeed, you may find that the query optimizer won't even look at your indexes if the cost is prohibitive compared to a clustered index scan, where all the columns are already available.
So there is no point in creating an index unless you know that index will optimize the result of a given query. The value of an index is therefore proportional to the percentage of queries that it can optimize for a given table, and this can only be determined by analyzing the queries that are being executed, which is exactly what the Index Tuning Wizard does for you.
So in summary:
1) Don't index every column. This is classic premature optimization. You cannot optimize a large table with indexes for all possible query plans in advance.
2) Don't index any column, until you have captured and run a base workload through the Index Tuning Wizard. This workload needs to be representative of the usage patterns of your application, so that the wizard can determine what indexes would actually help the performance of your queries.
I'm importing Brazilian stock market data to a SQL Server database. Right now I have a table with price information from three kinds of assets: stocks, options and forwards. I'm still on 2006 data and the table has over half a million records. I have 12 more years of data to import, so the table will exceed a million records for sure.
Now, my first approach for optimization was to keep the data to a minimum size, so I reduced the row size to an average of 60 bytes, with the following columns:
[Stock] [int] NOT NULL
[Date] [smalldatetime] NOT NULL
[Open] [smallmoney] NOT NULL
[High] [smallmoney] NOT NULL
[Low] [smallmoney] NOT NULL
[Close] [smallmoney] NOT NULL
[Trades] [int] NOT NULL
[Quantity] [bigint] NOT NULL
[Volume] [money] NOT NULL
Now, the second approach for optimization was to make a clustered index. Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique: I can't have two quotes for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second point true?
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Now for a third approach, I'm thinking of maybe splitting the table into three tables, one for each market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help, or does it not matter much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
Another approach would be using the partition feature of SQL Server. I don't know much about it, but I think it's normally used when the tables are large and you can span them across multiple disks to reduce I/O latency, am I right? Would partitioning be of any help in this case? I believe I can partition the newest values (latest years) and oldest values into different tables. The probability of seeking the newest data is higher, and with a small partition it will probably be faster, right?
What would be other good approaches to make this the fastest possible? The main select usage of the table will be seeking a specific range of records from a specific asset, like the latest 3 months of asset X. There will be other usages, but this will be the most common, possibly executed by more than 3k users concurrently.
At 1 million records, I wouldn't consider this a particularly large table needing unusual optimization techniques such as splitting the table up, denormalizing, etc. But those decisions will come when you've tried all the normal means that don't affect your ability to use standard query techniques.
Now, the second approach for optimization was to make a clustered index. Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique: I can't have two quotes for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second point true?
It's logically true - the clustered index defines the logical ordering of the records on the disk, which is all you should be concerned about. SQL Server may forego the overhead of sorting within a physical block, but it will still behave as if it did, so it's not significant. Querying for one stock will probably be 1 or 2 page reads in any case; and the optimizer doesn't benefit much from unordered data within a page read.
Right now with a half million records it's taking around 200ms to select 700 quotes from a specific asset. I believe this number will get higher as the table grows.
Not necessarily significantly. There isn't a linear relationship between table size and query speed. There are usually a lot more considerations that are more important. I wouldn't worry about it in the range you describe. Is that the reason you're concerned? 200 ms would seem to me to be great, enough to get you to the point where your tables are loaded and you can start doing realistic testing, and get a much better idea of real-life performance.
Now for a third approach, I'm thinking of maybe splitting the table into three tables, one for each market (stocks, options and forwards). This will probably cut the table size by 1/3. Now, will this approach help, or does it not matter much? Right now the table is 50 MB in size, so it can fit entirely in RAM without much trouble.
No! This kind of optimization is so premature it's probably stillborn.
Another approach would be using the partition feature of SQL Server.
Same comment. You will be able to stick for a long time to strictly logical, fully normalized schema design.
What would be other good approaches to make this the fastest possible?
The best first step is clustering on stock. Insertion speed is of no consequence at all until you are looking at multiple records inserted per second - I don't see anything anywhere near that activity here. This should get you close to maximum efficiency because it will efficiently read every record associated with a stock, and that seems to be your most common index. Any further optimization needs to be accomplished based on testing.
A million records really isn't that big. It does sound like it's taking too long to search though - is the column you're searching against indexed?
As ever, the first port of call should be the SQL profiler and query plan evaluator. Ask SQL Server what it's going to do with the queries you're interested in. I believe you can even ask it to suggest changes such as extra indexes.
I wouldn't start getting into partitioning etc just yet - as you say, it should all comfortably sit in memory at the moment, so I suspect your problem is more likely to be a missing index.
Check your execution plan on that query first. Make sure your indexes are being used. I've found that a million records is not a lot. To give some perspective, we had an inventory table with 30 million rows in it, and our entire query, which joined tons of tables and did lots of calculations, could run in under 200 ms. We found that on a quad-processor 64-bit server we could have significantly more records, so we never bothered partitioning.
You can use SQL Profiler to see the execution plan, or just run the query from SQL Management Studio or Query Analyzer.
Reevaluate the indexes... that's the most important part. The size of the data doesn't really matter; well, it does, but not entirely for speed purposes.
My recommendation is to rebuild the indexes for that table and make a composite one for the columns you'll need the most. Now that you have only a few records, play with the different indexes; otherwise it'll get quite annoying to try new things once you have all the historical data in the table.
After you do that, review your query, make the query plan evaluator your friend, and check if the engine is using the right index.
I just read your last post, and there's one thing I don't get: you are querying the table while you insert data? At the same time? What for? By inserting, do you mean one record or hundreds of thousands? How are you inserting? One by one?
But again, the key to this is the indexes. Don't mess with partitioning and stuff yet, especially with a million records; that's nothing. I have tables with 150 million records, and returning 40k specific records takes the engine about 1500 ms...
I work for a school district and we have to track attendance for each student. It's how we make our money. My table that holds the daily attendance mark for each student is currently 38.9 Million records large. I can pull up a single student's attendance very quickly from this. We keep 4 indexes (including the primary key) on this table. Our clustered index is student/date which keeps all the student's records ordered by that. We've taken a hit on inserts to this table with regards to that in the event that an old record for a student is inserted, but it is a worthwhile risk for our purposes.
With regards to select speed, I would certainly take advantage of caching in your circumstance.
You've mentioned that your primary key is a compound on (Stock, Date), and clustered. This means the table is organised by Stock and then by Date. Whenever you insert a new row, it has to insert it into the middle of the table, and this can cause the other rows to be pushed out to other pages (page splits).
I would recommend trying to reverse the primary key to (Date, Stock), and adding a non-clustered index on Stock to facilitate quick lookups for a specific Stock. This will allow inserts to always happen at the end of the table (assuming you're inserting in order of date), so they won't affect the rest of the table and there is less chance of page splits.
The execution plan shows it's using the clustered index just fine, but I forgot an extremely important fact: I'm still inserting data! The insert is probably locking the table too often. Is there a way we can see this bottleneck?
The execution plan doesn't seem to show anything about lock issues.
Right now this data is only historical; when the importing process is finished the inserts will become much less frequent. But I will have a larger table for real-time data soon, which will suffer from this constant insert problem and will be bigger than this table. So any approach to optimizing this kind of situation is very welcome.
Another solution would be to create a historical table for each year, put all these tables in a historical database, fill them all in, and then create the appropriate indexes for them. Once you are done with this you won't have to touch them ever again. Why would you have to keep on inserting data? To query all those tables you just "union all" them :p
The current-year table should be very different from these historical tables. From what I understood, you are planning to insert records on the go? I'd plan something different, like doing a bulk insert or something similar every now and then throughout the day. Of course all this depends on what you want to do.
The problem here seems to be in the design. I'd go for a new design. The one you have now, from what I understand, is not suitable.
Actually the primary index is automatically clustered, and I made it a compound index with the Stock and Date fields. This is unique: I can't have two quotes for the same stock on the same day.
The clustered index makes sure that quotes from the same stock stay together, and probably ordered by date. Is this second point true?
Indexes in SQL Server are always sorted by column order in index. So an index on [stock,date] will first sort on stock, then within stock on date. An index on [date, stock] will first sort on date, then within date on stock.
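To see the [stock, date] ordering at work, here is a sketch using SQLite through Python's sqlite3 module. It is a stand-in, not SQL Server: a `WITHOUT ROWID` table is SQLite's closest analogue to a clustered primary key, and the `quotes` table with its handful of rows is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# WITHOUT ROWID stores the rows themselves in primary-key order.
conn.execute("""
    CREATE TABLE quotes (
        stock INTEGER NOT NULL,
        date  TEXT    NOT NULL,
        close REAL    NOT NULL,
        PRIMARY KEY (stock, date)
    ) WITHOUT ROWID
""")
# Rows are inserted out of order on purpose.
conn.executemany(
    "INSERT INTO quotes VALUES (?, ?, ?)",
    [
        (2, "2006-01-03", 9.5),
        (1, "2006-03-01", 11.0),
        (1, "2006-01-02", 10.0),
        (1, "2006-02-01", 10.5),
    ],
)

sql = (
    "SELECT date, close FROM quotes "
    "WHERE stock = ? AND date BETWEEN ? AND ? ORDER BY date"
)
params = (1, "2006-01-01", "2006-02-28")

plan = " ".join(r[3] for r in conn.execute("EXPLAIN QUERY PLAN " + sql, params))
stock1_quotes = conn.execute(sql, params).fetchall()
print(plan)
print(stock1_quotes)
```

Because the key sorts by stock and then by date, one stock's date range is a single contiguous slice of the key, and the plan needs no separate sort step for the ORDER BY.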
When doing a query, you should always include the first column(s) of an index in the WHERE part, else the index cannot be efficiently used.
For your specific problem: If date range queries for stocks are the most common usage, then do the primary key on [date, stock], so the data will be stored sequentially by date on disk and you should get fastest access. Build up other indexes as needed. Do an index rebuild/statistics update after inserting lots of new data.
As a follow up to "What are indexes and how can I use them to optimise queries in my database?" where I am attempting to learn about indexes, what columns are good index candidates? Specifically for an MS SQL database?
After some googling, everything I have read suggests that columns that are generally increasing and unique make a good index (things like MySQL's auto_increment), I understand this, but I am using MS SQL and I am using GUIDs for primary keys, so it seems that indexes would not benefit GUID columns...
Indexes can play an important role in query optimization and in retrieving results quickly from tables. The most important step is to select which columns are to be indexed. There are two major places where we can consider indexing: columns referenced in the WHERE clause and columns used in JOIN clauses. In short, index the columns you search on to find particular records. Suppose we have a table named buyers where the SELECT query uses indexes like below:
SELECT
buyer_id /* no need to index */
FROM buyers
WHERE first_name='Tariq' /* consider indexing */
AND last_name='Iqbal' /* consider indexing */
Since "buyer_id" is referenced in the SELECT portion, MySQL will not use it to limit the chosen rows. Hence, there is no great need to index it. Below is another example, slightly different from the one above:
SELECT
buyers.buyer_id, /* no need to index */
country.name /* no need to index */
FROM buyers LEFT JOIN country
ON buyers.country_id=country.country_id /* consider indexing */
WHERE
first_name='Tariq' /* consider indexing */
AND
last_name='Iqbal' /* consider indexing */
According to the above queries first_name, last_name columns can be indexed as they are located in the WHERE clause. Also an additional field, country_id from country table, can be considered for indexing because it is in a JOIN clause. So indexing can be considered on every field in the WHERE clause or a JOIN clause.
The following list also offers a few tips that you should always keep in mind when you intend to create indexes on your tables:
Only index those columns that are required in WHERE and ORDER BY clauses. Indexing columns in abundance will result in some disadvantages.
Try to take advantage of the "index prefix" or "multi-columns index" feature of MySQL. If you create an index such as INDEX(first_name, last_name), don't create INDEX(first_name). However, "index prefix" or "multi-columns index" is not recommended in all search cases.
Use the NOT NULL attribute for those columns you are considering for indexing, so that NULL values will never be stored.
Use the --log-long-format option to log queries that aren't using indexes. In this way, you can examine this log file and adjust your queries accordingly.
The EXPLAIN statement reveals how MySQL will execute a query. It shows how and in what order tables are joined. This can be very useful for determining how to write optimized queries, and whether columns need to be indexed.
Update (23 Feb'15):
Any index (good/bad) increases insert and update time.
How a result is searched depends on your indexes (their number and type). If your search time is going to increase because of an index, then that's a bad index.
As in any book, the index page could list where each chapter, topic, and subtopic starts. Some detail in the index helps, but a more detailed index might confuse or scare you. Indexes also take up memory.
Index selection should be wise. Keep in mind not all columns would require index.
Some folks answered a similar question here: How do you know what a good index is?
Basically, it really depends on how you will be querying your data. You want an index that quickly identifies a small subset of your dataset that is relevant to a query. If you never query by datestamp, you don't need an index on it, even if it's mostly unique. If all you do is get events that happened in a certain date range, you definitely want one. In most cases, an index on gender is pointless -- but if all you do is get stats about all males, and separately, about all females, it might be worth your while to create one. Figure out what your query patterns will be, and access to which parameter narrows the search space the most, and that's your best index.
Also consider the kind of index you make -- B-trees are good for most things and allow range queries, but hash indexes get you straight to the point (but don't allow ranges). Other types of indexes have other pros and cons.
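The difference can be sketched in plain Python, with a dict standing in for a hash index and a sorted list plus binary search standing in for a B-tree. This is an analogy only, not how any database implements either structure, and the data is made up.

```python
import bisect

# Made-up (event_date, row_id) pairs standing in for table rows.
rows = [("2008-03-10", 1), ("2008-01-05", 2), ("2008-02-20", 3), ("2008-01-05", 4)]

# Hash-style index: key -> row ids. Direct exact match, but no key ordering.
hash_index = {}
for key, row_id in rows:
    hash_index.setdefault(key, []).append(row_id)

# Ordered index (B-tree stand-in): entries sorted by key, built once.
ordered_index = sorted(rows)
ordered_keys = [key for key, _ in ordered_index]


def exact_match(key):
    """Look up one key directly in the hash-style index."""
    return hash_index.get(key, [])


def range_match(low, high):
    """Row ids with low <= key <= high, found by binary search on sorted keys."""
    start = bisect.bisect_left(ordered_keys, low)
    stop = bisect.bisect_right(ordered_keys, high)
    return [row_id for _, row_id in ordered_index[start:stop]]


print(exact_match("2008-01-05"))
print(range_match("2008-01-01", "2008-02-29"))
```

Answering the range question from `hash_index` alone would mean checking every key, which is why hash indexes are a poor fit for ranges.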
Good luck!
It all depends on what queries you expect to ask about the tables. If you ask for all rows with a certain value for column X, you will have to do a full table scan if an index can't be used.
Indexes will be useful if:
The column or columns have a high degree of uniqueness
You frequently need to look for a certain value or range of values for the column.
They will not be useful if:
You are selecting a large % (>10-20%) of the rows in the table
The additional space usage is an issue
You want to maximize insert performance. Every index on a table reduces insert and update performance because they must be updated each time the data changes.
Primary key columns are typically great for indexing because they are unique and are often used to lookup rows.
Any column that is going to be regularly used to extract data from the table should be indexed.
This includes:
foreign keys -
select * from tblOrder where status_id=:v_outstanding
descriptive fields -
select * from tblCust where Surname like 'O''Brian%'
The columns do not need to be unique. In fact you can get really good performance from a binary index when searching for exceptions.
select * from tblOrder where paidYN='N'
In general (I don't use mssql so can't comment specifically), primary keys make good indexes. They are unique and must have a value specified. (Also, primary keys make such good indexes that they normally have an index created automatically.)
An index is effectively a copy of the column which has been sorted to allow binary search (which is much faster than linear search). Database systems may use various tricks to speed up search even more, particularly if the data is more complex than a simple number.
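To get a feel for how much faster, the sketch below counts how many keys a hand-written binary search examines in a sorted list of one million made-up integer keys. Real indexes are usually tree-structured, so treat this as an analogy for the growth rate, not a model of any particular engine.

```python
def binary_search_probes(sorted_keys, target):
    """Return (position or None, number of keys examined)."""
    low, high, probes = 0, len(sorted_keys) - 1, 0
    while low <= high:
        mid = (low + high) // 2
        probes += 1
        if sorted_keys[mid] == target:
            return mid, probes
        if sorted_keys[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return None, probes


keys = list(range(1_000_000))      # already sorted, like an index
position, probes = binary_search_probes(keys, 999_999)
missing, missing_probes = binary_search_probes(keys, -1)
print(position, probes)
```

A linear search for the same last key would examine all 1,000,000 entries, while the binary search needs at most about 20.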
My suggestion would be to not use any indexes initially and profile your queries. If a particular query (such as searching for people by surname, for example) is run very often, try creating an index over the relevant attributes and profile again. If there is a noticeable speed-up on queries and a negligible slow-down on insertions and updates, keep the index.
(Apologies if I'm repeating stuff mentioned in your other question, I hadn't come across it previously.)
It really depends on your queries. For example, if you almost only write to a table then it is best not to have any indexes, they just slow down the writes and never get used. Any column you are using to join with another table is a good candidate for an index.
Also, read about the Missing Indexes feature. It monitors the actual queries being used against your database and can tell you what indexes would have improved the performance.
Your primary key should always be an index. (I'd be surprised if it weren't automatically indexed by MS SQL, in fact.) You should also index columns you SELECT or ORDER BY frequently; their purpose is both quick lookup of a single value and faster sorting.
The only real danger in indexing too many columns is slowing down changes to rows in large tables, as the indexes all need updating too. If you're really not sure what to index, just time your slowest queries, look at what columns are being used most often, and index them. Then see how much faster they are.
Numeric data types which are ordered in ascending or descending order are good indexes for multiple reasons. First, numbers are generally faster to evaluate than strings (varchar, char, nvarchar, etc). Second, if your values aren't ordered, rows and/or pages may need to be shuffled about to update your index. That's additional overhead.
If you're using SQL Server 2005 and are set on using uniqueidentifiers (GUIDs), and do NOT need them to be of a random nature, check out the NEWSEQUENTIALID() function, which generates sequential uniqueidentifier values.
Lastly, if you're talking about clustered indexes, you're talking about the sort of the physical data. If you have a string as your clustered index, that could get ugly.
A GUID column is not the best candidate for indexing. Indexes are best suited to columns with a data type that can be given some meaningful order, i.e. sorted (integer, date, etc.).
It does not matter if the data in a column is generally increasing. If you create an index on the column, the index will create its own data structure that simply references the actual items in your table without concern for stored order (a non-clustered index). Then, for example, a binary search can be performed over the index data structure to provide fast retrieval.
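As a rough illustration of that idea, here is a small Python sketch. It assumes a simplified model in which the "index" is just a sorted list of (key, row position) pairs kept separately from the table. Real databases typically use B-tree structures rather than a flat sorted list, and the rows and names below are invented for the example.

```python
from bisect import bisect_left

# Hypothetical table: rows are stored in arrival order, not sorted by surname.
rows = [
    {"id": 1, "surname": "Smith"},
    {"id": 2, "surname": "Adams"},
    {"id": 3, "surname": "O'Brian"},
    {"id": 4, "surname": "Jones"},
    {"id": 5, "surname": "Adams"},
]

# The "index": sorted (key, row position) pairs kept apart from the table.
surname_index = sorted((row["surname"], pos) for pos, row in enumerate(rows))


def lookup(index, table, key):
    """Binary-search the index, then follow row positions back to the table."""
    i = bisect_left(index, (key, -1))  # -1 sorts before any real row position
    matches = []
    while i < len(index) and index[i][0] == key:
        matches.append(table[index[i][1]])
        i += 1
    return matches


adams = lookup(surname_index, rows, "Adams")
print([r["id"] for r in adams])
```

The table itself is never reordered; only the separate index is sorted, which is why a table can carry several such indexes at once.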
It is also possible to create a "clustered index" that will physically reorder your data. However you can only have one of these per table, whereas you can have multiple non-clustered indexes.
The ol' rule of thumb was columns that are used a lot in WHERE, ORDER BY, and GROUP BY clauses, or any that seem to be used in joins frequently. Keep in mind I'm referring to indexes, NOT primary keys.
Not to give a 'vanilla-ish' answer, but it truly depends on how you are accessing the data.
It should be even faster if you are using a GUID.
Suppose you have the records
100
200
3000
....
If you have an index (binary search), you can find the physical location of the record you are looking for in O(lg n) time, instead of searching sequentially in O(n) time. This is because, without an index, you don't know where any given record sits in your table.
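To make the O(lg n) versus O(n) difference concrete, here is a small Python sketch that counts how many records each approach has to look at. The key values are hypothetical (one million sorted record numbers), and a plain binary search stands in for whatever structure a real database index uses.

```python
def linear_search(values, target):
    """Scan in order; return (position or None, comparisons made)."""
    comparisons = 0
    for pos, value in enumerate(values):
        comparisons += 1
        if value == target:
            return pos, comparisons
    return None, comparisons


def binary_search(sorted_values, target):
    """Halve the search range each step; return (position or None, probes made)."""
    lo, hi = 0, len(sorted_values) - 1
    probes = 0
    while lo <= hi:
        mid = (lo + hi) // 2
        probes += 1
        if sorted_values[mid] == target:
            return mid, probes
        if sorted_values[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, probes


# Hypothetical keys: one million sorted record numbers 100, 200, 300, ...
keys = [100 * i for i in range(1, 1_000_001)]
target = keys[-1]  # the last record is the worst case for the linear scan

linear_pos, linear_steps = linear_search(keys, target)
binary_pos, binary_steps = binary_search(keys, target)
print(linear_steps, binary_steps)
```

For a million records the sequential scan may need a million comparisons, while the binary search needs no more than about twenty probes.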
Best index depends on the contents of the table and what you are trying to accomplish.
Take an example: a member database with a primary key of the member's Social Security Number. We choose the SSN because the application primarily refers to the individual in this way, but you also want to create a search function that uses the member's first and last name. I would then suggest creating an index over those two fields.
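A minimal sketch of that member table, again using Python's sqlite3 module as a stand-in for the real database. The table, column, and index names are assumptions made up for the example, as is the generated data.

```python
import sqlite3

# SQLite stands in for the real database; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member ("
    " ssn TEXT PRIMARY KEY,"
    " first_name TEXT NOT NULL,"
    " last_name TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO member VALUES (?, ?, ?)",
    [(f"{i:09d}", f"First{i % 100}", f"Last{i % 1000}") for i in range(10_000)],
)

# One composite index covering both columns used by the name search.
conn.execute("CREATE INDEX idx_member_name ON member (last_name, first_name)")

name_search = "SELECT ssn FROM member WHERE last_name = ? AND first_name = ?"

# Ask the optimizer how it would run the name search.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN " + name_search, ("Last7", "First7")
).fetchall()
name_plan = " | ".join(row[-1] for row in plan_rows)
print(name_plan)

found = conn.execute(name_search, ("Last7", "First7")).fetchall()
print(len(found))
```

Putting last_name first is a design choice: a composite index can generally also serve searches on its leading column alone, and searching by last name only is the more likely case here.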
You should first find out what data you will be querying and then make the determination of which data you need indexed.
How do you determine when to use table clusters? There are two types, index and hash, to use for different cases. In your experience, have the introduction and use of table clusters paid off?
If none of your tables are set up this way, modifying them to use table clusters would add to the complexity of the setup. But would the expected performance benefits outweigh the cost of increased complexity in future maintenance work?
Do you have any favorite online references or books that describe table clustering well and give good implementation examples?
//Oracle tips greatly appreciated.
The killer feature of table clusters is that you can store related rows of different tables at the same physical location.
That can improve join performance by an order of magnitude. However, it doesn't pay off as often as it sounds.
The only time I used it was a three-table join, executed by two hash joins. It took too long ;). However, the join was on the same column, so it was possible to use a hash table cluster keyed by the join column. That caused all related rows to be stored alongside (ideally, in the same database block). Knowing that, Oracle can execute the join with a special optimization ("cluster join").
It's more or less pre-joined, but it still feels like normal tables (for INSERT/SELECT/UPDATE/DELETE).
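The co-location idea can be pictured with a toy Python model. This is not how Oracle implements clusters; it only assumes that every row, whatever table it belongs to, is filed in a "block" chosen by its cluster key, so that a join on that key reads a single block. The class, table names, and data are all hypothetical.

```python
from collections import defaultdict


class HashCluster:
    """Toy model: one block per cluster key, shared by several tables."""

    def __init__(self):
        # cluster key -> table name -> list of rows
        self.blocks = defaultdict(lambda: defaultdict(list))

    def insert(self, table, key, row):
        """Place the row in the block chosen by its cluster key."""
        self.blocks[key][table].append(row)

    def cluster_join(self, key, left, right):
        """Join two tables on the cluster key by reading a single block."""
        block = self.blocks.get(key, {})
        return [(l, r) for l in block.get(left, []) for r in block.get(right, [])]


cluster = HashCluster()
cluster.insert("customer", 42, {"cust_id": 42, "name": "Jones"})
cluster.insert("orders", 42, {"order_id": 1, "cust_id": 42})
cluster.insert("orders", 42, {"order_id": 2, "cust_id": 42})
cluster.insert("orders", 7, {"order_id": 3, "cust_id": 7})

joined = cluster.cluster_join(42, "customer", "orders")
print(len(joined))
```

Each table can still be inserted into and read on its own, while the join no longer has to search either table for matching rows.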
On the other hand, there are "single table clusters", which are mostly used to control the "clustering factor". The idea is similar to clustered indexes (called index-organized tables in Oracle), but without the high cost when a secondary index is used.
One can say a lot about clustering, but I found that the most complete explanation of Oracle clusters (pros and cons, when to use them and how) is in Tom Kyte's book, Effective Oracle by Design. You can also search asktom for specific cluster usage examples (1, 2, etc.). You should definitely take a look at this book if you haven't yet.
Some info you can also find here.
But the thing you should always do before creating complex schema structures is to try, to test, to benchmark and choose the one solution that best fits your needs :)
Hope this helps.
I haven't used Oracle's table clusters myself, but I understand that its index table clusters are very much like MS SQL Server's clustered indexes. That is, the row data is physically organized by the clustered index's key.
That makes one ideal for a heavily-accessed column that has a reasonably small number of possible values (compared to the total number of rows), where most queries want to retrieve all rows with a particular value. Because all such rows are physically stored together, disk I/O, particularly seek time, is reduced.
"Reasonably small" is not easily defined, but postal or zip codes in an address table seem reasonable if you're often querying for all addresses in a single code's region. Province/state/territory codes are likely too small a selection for a country-wide address table.
So, you don't want to use them on columns with few possible values (e.g., M/F for gender) because then the clustering doesn't buy you anything and likely costs you for insertions. You also never want to use clustering on "autonumber" surrogate key columns (from sequences in Oracle) because that will create a "hot spot" in the last extent of the table as all insertions must physically happen there. You also don't want to apply clustering to a column value that will be updated because the RDBMS will have to physically move the record to maintain the clustered ordering.