database row/ record pointers - database

I don't know the correct words for what I'm trying to find out about and as such having a hard time googling.
I want to know whether its possible with databases (technology independent but would be interested to hear whether its possible with Oracle, MySQL and Postgres) to point to specific rows instead of executing my query again.
So I might initially execute a query find some rows of interest and then wish to avoid searching for them again by having a list of pointers or some other metadata which indicates the location on a database which I can go to straight away the next time I want those results.
I realise there is caching on databases, but I want to keep these "pointers" else where and as such caching doesn't ultimately solve this problem. Is this just an index and I store the index and look up by this? most of my current tables don't have indexes and I don't want the speed decrease that sometimes comes with indexes.
So whats the magic term I've been trying to put into google?
Cheers

In Oracle it is called ROWID. It identifies the file, the block number, and the row number in that block. I can't say that what you are describing is a good idea, but this might at least get you started looking in the right direction.
Check here for more info: http://www.orafaq.com/wiki/ROWID.
By the way, the "speed decrease that comes with indexes" that you are afraid of is only relevant if you do more inserts and updates than reads. Indexes only speed up reads, so if the read ratio is high, you might not have an issue and an index might be your best solution.

most of my current tables don't have
indexes and I don't want the speed
decrease that sometimes comes with
indexes.
And you also don't want the speed increase which usually comes with indexes but you want to hand-roll a bespoke pseudo-cache instead?
I'm not being snarky here, this is a serious point. Database designers have expended a great deal of skill and energy into optimizing their products. Wouldn't it be more sensible to learn how to take advantage of their efforts rather re-implementing some core features?

In general, the best way to handle this sort of requirement is to use the primary key (or in fact any convenient, compact unique identifier) as the 'pointer', and rely on the indexed lookup to be swift - which it usually will be.
You can use ROWID in more DBMS than just Oracle, but it generally isn't recommended for a variety or reasons. If you succumb to the 'every table has an autoincrement column' school of database design, then you can record the autoincrement column values as the identifiers.
You should have at least one index on (almost) all of your tables - that index will be for the primary key. The exception might be for a table so small that it fits in memory easily and won't be updated and will be used enough not to be evicted from memory. Then an index might be a distraction; however, such tables are typically seldom updated so the index won't harm anything, and the optimizer will ignore it if the index doesn't help (and it may not).
You may also have auxilliary indexes. In a system where most of the activity is reading the data, you may want to erro on the side of having more indexes rather than fewer, because access time is most critical. If your system was update intensive, then you would go with fewer indexes because there is a cost associated with updating indexes when data is added, removed or updated. Clearly, you need to design the indexes to work well with the queries that your users actually perform (or your applications perform).

You may also be interested in cursors. (Note that the index debate is still valid with cursors.)
Wikipedia definition here.

Related

What is the best moment to create SQL indexes?

When starting a project, should SQL indexes be created at the beginning?
I have a project where I haven´t created any indexes yet in production. The table that will grow most has 30000 rows and I have measured the time of the queries against this table creating an index and deleting it afterwards. The times are very similar.
I have decided to postpone the creation of the indexes in production until I notice a reduction of the response time in queries when creating them.
Is my approach correct? Or should I create them now?
I'm pretty deep into the topic of database indexing (it's actually my full-time job, also wrote a book about it (SQL Performance Explained) which is available for free here).
In my opinion, indexes should be created at the time you write the query because this is the time you have all the required information needed to decide which indexes to create in your head. In other words, if you do it at that time, it doesn't take you any extra effort. Another reason is that indexing sometimes affects the way you have to write the query so it can actually take benefit of that index.
However, the above statement assumes that you know how indexes work so you can decide which indexes to create. If you don't know that, I'd really suggest to learn about proper indexing first. Again, the book I've written is available for free on the web (Table of Contents). According to a recent survey, it takes you about 4-5 hours to read through it. Well-spent time, I'd say.
However, due to the ludicrous speed of modern hardware and vast amount of memory—even cheap commodity hardware—it is absolutely possible that you cannot measure any difference with these small tables (30k is small in DB world) yet. Nevertheless, you because you cannot measure this difference with a timers resolution of maybe 10ms, it doesn't mean the difference isn't there. Further: did you verify that the index was actually used? Are you sure the index you created was a good index for the given query?
Never the less, if the overall system is fast enough for you at the moment, sure you can go on without indexes. The risk remains, however, that it isn't fast enough on the day a major news outlet covers your app. What is supposed to be your best day might turn out to become your worst day :(
You didn't tell us a lot about your app, so I've to do some guesswork. I guess it is more like an OLTP app like an online website (as opposed to BI/OLAP). Although indexes add some overhead to write operations (insert, update, delete and merge), this is typically small compared to the benefit they bring to select (still assuming OLTP). Sure you can misuse indexes (e.g., creating hundreds on a single table) so that the overhead becomes a major problem too. But adding "a few" indexes on an OLTP table will most certainly not cause any problems due to the maintenance overhead.
Coming to an end: if you already know which indexes are good for your queries (verify it using explain), add them now before it is too late. If you are not sure, I'd still suggest to put some effort on that now. If you are not afraid of load peaks taking your app down, go on without indexes.
If you need more help, create a new question containing your query, table & index definitions as well as the explain output and people will be happy to help you figuring out if that index is fine or not.
Just create them now based on sensible choices: start with primary and foreign keys - thst'll keep your joins fast - then add indexes on single columns you'll be searching on (name, phone, etc) you are using.
Avoid creating multiple column indexes until you have a demonstrated performance problem and you can prove that an index helps. Often, reworking the query will fix the problem better than some complicated index.
The only time I delay creating indexes is if I'm about to load a heap of data and building indexes before loading means a much slower load as the index is updated for every row addition, although some databases allow the index rebuild to be deferred until after the load, so even then there's no point in waiting.

Clustered indexes on non-identity columns to speed up bulk inserts?

My two questions are:
Can I use clustered indexes to speed
up bulk inserts in big tables?
Can I then still efficiently use
foreign key relationships if my
IDENTITY column is not the clustered
index anymore?
To elaborate, I have a database with a couple of very big (between 100-1000 mln rows) tables containing company data. Typically there is data about 20-40 companies in such a table, each as their own "chunk" marked by "CompanyIdentifier" (INT). Also, every company has about 20 departments, each with their own "subchunk" marked by "DepartmentIdentifier" (INT).
It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.
I have been working to optimize these tables for the following functions:
Queries that are run on subchunks
"Benchmarking" queries that are run on the table as a whole
Inserting/removing big chunks of data.
For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.
But for 3) I have struggled to find a good solution.
My first strategy was to always disable indexes, bulk insert a big chunk and rebuild indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, it takes a very long time to rebuild the index each time.
At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.
I seem to have noticed that by adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier, the loading of new "chunks" into the table is faster. Before I had abandoned this strategy in favour of adding a clustered index on an IDENTITY column, as several articles pointed out to me that the clustered index is contained in all other indexes and so the clustered index should be as small as possible. But now I am thinking of reviving this old strategy to speed up the inserts. My question, would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts or is that just my imagination?
I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?
Thanks a lot for any suggestions or explanations.
Well, I've put it to the test, and putting a clustered index on the two "chunk-defining" columns increases the performance of my table.
Inserting a chunk is now relatively fast compared to the situation where I had a clustered IDENTITY key, and about as fast as when I did not have any clustered index. Deleting a chunk is faster than with or without clustered index.
I think the fact that all the records I want to delete or insert are guaranteed to be all together on a certain part of the harddisk makes the tables faster - it would seem logical to me.
Update: After a year of experience with this design I can say that for this approach to work, it is necessary to schedule regular rebuilding of all the indexes (we do it once a week). Otherwise, the indexes become fragmented very soon and performance is lost. Nevertheless, we are in a process of migration to a new database design with partitioned tables, which is basically better in every way - except for the Enterprise Server license cost, but we've already forgotten about it by now. At least I have.
A clustered index is a physical index, a physical data structure, a row order. If you insert in the middle of the clustered index, the data will be physically inserted in the middle of the present data. I imagine a serious performance issue in this case. I only know this from theory, because if I do this in practice, it will be a mistake according to my theoretical knowledge.
Therefore, I only use (and advise the use) of clustered indexes on fields that are always, physically, inserted at the end, preserving the order.
A clustered index can be placed on a datetime field which marks the moment of insertion or something like that, because physically they will be ordered after appending a row. Identity is a good clustered index also, but not always relevant for querying.
In your solution you place a [uniquifier] field, but why do this when you can put an identity that will do just that? It will be unique, physically ordered, small (for foreign keys in other tables means smaller index), and in some cases faster.
Can't you try this, experiment? I have a similar situation here, where I have 4 billion rows, constantly more are inserting (up to 100 per second), the table has no primary key and no clustered index, so the propositions in this topic are VERY interesting for me too.
Can I use clustered indexes to speed up bulk inserts in big tables?
Never! Imagine another million rows that you need to put in that table and have them physically ordered it is a colossal loss in performance in the long run.
Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?
Absolutely. By the way, clustered index is no silver bullet and may be slower than your ordinary index.
Have a look at the System.Data.SqlClient.SqlBulkCopy API. Given your requirements to write signficant numbers of rows in and out of the database, it might be what you need?
Bulk copy streams the data into the table in a single operation then performs the index check once. I use it to copy 500,000 rows in and out of a database table and it's performance is an order of magnitude better than any other technique I've tried, assuming that your application can be structured to take use of the API?
i've been playing around with some etl stuff the last little bit. i went through jsut regularly inserting into the table, then removing and readding indexes before and after the insert, tried merge statements, then i finally tried ssis. I'm sold on ssis. Just yesterday i managed to cut an etl process (~24 million records, ~6gb) from ~1-1 1/2 hours per run to ~24 minutes, jsut by letting ssis handle the inserts.
i believe with advanced services you should be able to use ssis.
(Given you have already chosen the Answer and given yourself the points, this is provided as a free service, a charitable act !)
A little knowledge is a dangerous thing. There are many issues to be considered; and they must be considered together. Taking any one issue and examining it in isolation is a very fragmented way to go about administering a database: you will forever be finding some new truth and changing eveything you thought before. Before launching into it, please read this ▶question/answer◀ for context.
Do not forget, these days anyone with a keyboard and a modem can get their "papers" published. Some of them work for MS, evangelising the latest "enhancement"; others publish glowing reports of features they have never used, or used only once, in one context, but they publish that it works in every context. (Look at Spence's answer: he is enthusiastic and "sold" but under scrutiny, the statements are false; he is not a bad person, just typical of the masses in the MS world and how they operate; how they publish.)
Note: I use the term MicroSofties to describe those people who believe in the gatesian notion that any unqualified person can administer a database; and that MS will fix everything. It is not intended as an insult, more as an endearment, because of the belief in magic, and the suspension of the laws of physics.
Clustered Indices
Were designed for Relational databases, by real engineers (Sybase, before MS acquired the code) who have more brains than all of MS put together. Relational databases have Relational Keys, not Idiot keys. These are multi-column keys, that automatically distribute the data, and therefore the insert load, eg. inserting Invoices for various Companies all the time (although not in our discussed case of "chunks").
if you have good Relational keys, CIs provide Range Queries (your (1) & (2) ), and other advantages, that NCIs simply do not have.
Starting off with Id columns, before modelling and normalising the data, severely hinders the modelling and normalisation processes.
If you have an Idiot database, then you will have more indices than not. The contents of many MS databases are not "relational", they are commonly just unnormalised filing systems, with way more indices than a Normalised database would have. Therefore there is a big push, a lot of MS "enhancements" to try and give these abortions a bit of speed. Fix the symptom but don't go anywhere near the problem that caused the symptom.
In SQL 2005 and again in 2008 MS has screwed around with CIs, and the result is they are now better in some ways, but worse in other ways; the universality of CIs has been lost.
It is not correct that NCIs carry the CI (the CI is the basic single storage structure; the NCIs are secondary, and dependent on the CI; that's why when you re-create a CI, all the NCIs are automatically re-created). The NCIs carry the CI Key at the leaf level.
Microsoft has its problems, which change with the major releases (but are not eliminated):
and in MS this is not efficiently done, so the NCI index size is large; in enterprise DBMS when this is efficiently done, this is not a consideration.
In the MS world, therefore, it is only half true, that the CI key should be as short as possible. If you understand that the consideration is the size of NCIs, and if you are willing to incur that expense, it return for a table that is very fast due to a carefully constructed CI, then that is the best option.
The common advice that the CI should be theIdiot column is totally and completely wrong. The worst canditate fo a CI key is a monotonically increasing value (IDENTITY, DATETIME, etc). WHy ? because you have guaranteed that all concurrent inserts will fight for the current insert location, the last page on the index.
The real purpose of Partitioning (Which MS provided 10 years after the Enterprise vendors) is to spread this load. Sure, they then have to provide a method of allocating the Partitions, on guess what, nothing but a Relational Key; but to start with, now the Idiot key is spread across 32 or 64 Partitions, providing better concurrency.
the CI must be Unique. Relational dbs demand Unique keys, so that is a no-brainer.
But for the amateurs who have poured non-relational contents into the database, if they do not know this rule, but they know that the CI spreads the data (a little knowledge is a dangerous thing), they keep their Idiot key in a NCI (good) but they create the CI on an almost-but-not-quite Unique Key. Deadly. CI's must be Unique, that is a design demand. Duplicate (remember we are talking CI Key here) rows are off-page, located in Overflow pages, and the (then) last page; and constitute a method of badly fragmenting the Page Chain.
Update, since this point is being questioned elsewhere. I have already stated the MS keeps changing the methods without fixing the problem.
The MS Online manual, with their pretty pictures (not technical diagrams) tells us that In 2008, they have replaced (substitued one for another) Overflow Pages, with the adorable "Uniqueifier".
That totally satisfies the MicroSofties. Non-Unique CIs are not a problem. It is handled by magic. Case closed.
But there is no logic or completeness to the statements, and qualified people will ask the obvious questions: where is this "Uniqueifier" located ? On every row, or just the rows needing "Uniqueifying". DBBC PAGE shows it is on every row. So MS has just added a 4-byte secret column (including handling overhead) to every row, instead of a few Overflow Pages for the non-unique rows only. That's MS idea of engineering.
End Update
Anyway, the point remains, that Non-Unique CIs have a substantial overhead (now more than before) and should be avoided. you would be better off adding a 1- or 2-byte column yourself, to force uniqueness.
.
Therefore, unchanged from the beginning (1984), the best candidate for a CI is a multi-column unique Relational key (I cannot say that yours is for sure, but it certainly looks like it).
And put any monotonically increasing keys (IDENTITY, DATETIME) in an NCI.
Remember also that the CI is a single storage structure, which eliminates the (otherwise) Heap; the CI B-Tree is married to the rows at the Leaf level; the Leaf Level entry is the row. That guarantees one less read on every access.
So it is not possible, that a NCI+Heap can be faster than a CI. Anther common myth in the MS world that defies the laws of physics: navigating a B-Tree and writing to the one place you are already in, has got to be faster than additionally writing the row to a separate storage structure. But MicroSofties do believe in magic, they've suspended the laws of physics.
.
There are many other features you need to learn and use, I will mention at least FILLFACTOR and RESERVEPAGEGAP, to give this post a bit of completeness. Do not use these features until you understand them. All performance features have a cost that you need to understand and accept.
CIs are also self-trimming at both the Page and Extent level, there is no wasted space. PageSplits are something to monitor for (Random inserts only), and that is easily modulated by FILLFACTOR and RESERVEPAGEGAP.
And read the SO site for Clustered Indices, but keep in mind all the above, esp. the first two paras.
Your Specific Case
By all means, get rid of your surrogate keys (Idiot columns), and replace them with true natural Relational keys. Surrogates are always an additional key and index; that is a price that should not be forgotten or taken lightly.
CompanyIdentifier+DepartmentIdentifier+[uniquiefier] is exactly what I am talking about. Now notice that they are already INTs, and very fast, so it is very silly to add a NUMERIC(10,0) Idiot Key. Use a 1- or 2-byte column toforce Uniqueness.
If you get this right, you may not need a Partition licence.
The CompanyIdentifier+DepartmentIdentifier+[uniquifier] is the perfect candidate (not knowing anything about your db other than that which you have posted) for a CI, in the context that you perform mass delete/insert periodically. Detailed above.
Contrary to what others have stated, this is a good thing, and does not fragment the CI. Lets' say ou have 20 Companies, and you delete 1, which constitutes 5% of the data. That entire PageChain which was reasonably contiguous, is now relegated to the FreePageChain, contiguous and intact. To be precise, you have a single point of fragmentation, but not fragmentation in the sense of the normal use of the word. And guess what, if you turn around and perform a mass insert, where do you think that data will go ? That's right the exact same physical location as the Deleted rows. And the FreePageChain moves to the PageChain, extent and page at a time.
.
but what is alarming is that you did not know about the demand for CI to be Unique. Sad that the MicroSofties write rubbish, but not why/what each simplistic rule is based on; not the core information. The exact symptom of non-unique CIs is, the table will be very fast immediately after DROP/CREATE CI, and then slow down over time. An good Unique CI will hold its speed, and it would take a year to slow down (2 years on my large, active banking dbs).
4 hours is a very long time for 1 Billion rows (I can recreate a CI on 16 billion rows with a 6-column key in 3 minutes on an enterprise platform). But in any case, that means you have to schedule it as regular weekly or demand maintenance.
why aren't you using the WITH SORTED_DATA option ? Wasn't your data sorted, before the drop ? This option rewrites the CI Non-leaf pages but not the leaf pages (containing the rows). It can only do that if it is confident that the data was sorted. Not using this option rewrites every page, in physical order.
Now, please be kind. Before you ask me twenty questions, read up a little and understand all the issues I have defined here.

Is adding indexes to a SQL Server ever a bad idea?

We have a mid-size SQL Server based application that has no indexes defined. Not even on the the identity columns. I suggested to our moderately expensive application consultant that perhaps we might get better performance (particularly as our database grows) by creating some indexes on appropriate fields, and he said:
"Indexes will significantly impact other areas of the application and customers should not create them under any circumstances."
Anybody ever heard of anything like this? Are there ever circumstances where one should not create any indexes? I can see nothing special about this app - it's got int identity columns, then lots of string columns, bunch of relational tables but nothing special or weird that I can see.
Thanks!
[EDIT: the identity columns are not using "identity specification", they seem to be set by the program, looking at the database with Management Studio, I can find NO indexes...]
FOLLOWUP: At a conference I asked the CEO (and chief architect) of the company producing this product about this, his response was that they felt for small to midsize deployments, the overhead associated with maintaining indexes would have more of a negative to overall user experience (the application does a lot of writes) than the benefits of the indexes would offset, but for large databases, they do create indexes. The tech support guy was just overzealous and very unhelpful with his answer. Mystery solved.
Hire me and I'll create the indexes for you. 14 years' Sybase/SQL Server experience tells me to create those !darn! indexes. Unless your table has less than 500 records each.
My idea is that an index hash node is roughly sized to 1000.
The other thing you need to look out for is whether your consultant has normalized the tables. Perhaps, the table has 500 fields/columns, containing more than one conceptual entity or a whole dozen of conceptual entities. And that could be why he is nervous about creating indexes, because if there are 12 conceptual entities in the table there would be at least 12 set of indexes - in which case, he is absolutely true - under no circumstances ... blah blah.
However, if he indeed does have 500 columns or detectably multiple conceptual entities per table - he is a very very lousy data design engineer. In all my years working with more experienced data engineers, our tables rarely exceed 20 columns. 5 on the low side, 10 on the average. Sometimes for performance' sake we do allow mixing two entities in a table, or horizontalizing row occurrences into columns of a table.
When you look at the table design you can with an untrained eye see Product, Project, BuildSheet, FloorPlan, Equipment, etc records all rolled into one long row. You cannot mix all these entities together into one table.
That is the only reason I know why he could advise you against having indexes. If he is doing that, you should know that he is fraudulently representing his data design skills to your company and you should immediately drop him from your weekly contractual expenses.
OK, after reading larry's post - I agree with him too.
There is such a thing as over-indexing, especially in INSERT and UPDATE heavy applications with very large tables. So the answer to the question in your title is yes, it can sometimes be a bad idea to add indexes.
That's quite a different question from the one you ask in the body of your question, which is "Is it ever normal to have NO indexes in a SQL Server database". The answer is that unless you're using the database as a "write-only" system, in which data is added but only read after being bulk extracted and transformed into a another data store, it's exceedingly unusual not to have some indexes in the database.
Your consultant's statement is odd enough to make me believe that you may have left some important information out of your description. If not, I'd say he's nuts.
Do you have the disk space to spare? I've seen cases where the indexes weighed more than the table.
However, No indexes exist whatsoever! There can't be a case for that except for when all read operations need the entire table.
Columns with key constraints will have an implicit index on them anyway. So if you're always selecting by the primary key, then there's no point adding more indexes. If you're selecting by other criteria, then it makes sense to add indexes on those columns that you're querying on.
It also depends on how insert-heavy your data is. If you're inserting more often than you're querying, then the overhead of keeping the indexes up to date can make your inserts slower.
But to say you "should not create [indexes] under any circumstances" is a bit much.
What I would recommend is that you run the SQL Server Profiler tool with some your queries. This tool will recommend which indexes to add that will have the biggest effect on performance.
In most run-of-the-mill applications, the impact of indexes on insertion performance is a bit of non-issue. You're usually better off creating the index and if insertion performance drops dramatically (which it probably won't) you can try something else. Obviously there are some exceptions, where you should be more careful, like tables that are used for logging for instance.
As mentioned, disk space can be an issue.
Creating irrelevant indexes (e.g. duplicates) will also waste microseconds and occasionally result in a bad query execution plan.
The other problem I've seen is with strangely code third-party applications that generate parts of the database at runtime, and can delete or choke on indexes that they don't know about.
In the vast majority of cases though, a carefully chosen index will only be a benefit.
Not having indexes on id columns sounds really unusual and I would find any justification for not including them to smell very fishy.
You should be aware that if you are doing a high volume of commits to the database, adding more indexes will affect the speed of insertion, but no index on id? Wow.
It would be good to get better justification of exactly how adding extra indexes might cause problems though.
the more indexes you have the slower data inserts and modifications will be. Make sure that you add indexes when appropriate and write queries that can take advantage of those indexes, also if the selectivity leve of your index is low, it will not be used effectively
I would say that if your server is having troubles with CPU time, indexes could be a solution. If you are querying tables without indexes, the server will need a lot more resources and if tables are having millions of records, it can become a serious problem. I recently cooled down a CPU from 80-90% all the time to 10-20% just by putting the right indexes.
If using MS SQL, you could check the activity monitor to see what queries are expensive and create indexes based on the where clauses or joins.
Then at the recent expensive queries:
You can then right click and check the complete query!

Access database performance

For a few different reasons one of my projects is hosted on a shared hosting server
and developed in asp.Net/C# with access databases (Not a choice so don't laugh at this limitation, it's not from me).
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1- Is the order of the records in the database only visual or is there an actual difference internally. More specifically, the reason I ask is that the way it is currently designed all records (for all databases in this project) are ordered by a row identifying key (which is an auto number field) ascending but since over 80% of my queries will be querying fields that should be towards the end of the table would it increase the query performance if I set the table to showing the most recent record at the top instead of at the end?
2- Are there any other performance tuning that can be done to help with access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes and have a primary key (automatically indexes) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing, I'm already doing all that can be done for access performance. I'll give the question "accepted answer" to the one that was the fastest to answer.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes there is an actual order to the records in the database. Setting the defaults on the table preference isn't going to change that.
I would ensure there are indexes on all your where clause columns. This is a rule of thumb. It would rarely be optimal, but you would have to do workload testing against different database setups to prove the most optimal solution.
I work daily with legacy access system that can be reasonably fast with concurrent users, but only for smallish number of users.
You can use indexes on the fields you search for (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it's the same as SQL server you won't need to include the primary key in the index (assuming it's clustering on this) as it's included by default.
If you're running queries on a small sub-set of fields you could get your index to be a 'covering' index by including all the fields required, there's a space trade-off here, so I really only recommend it for 5 fields or less, depending on your requirements.
Are you actually experiencing a performance problem now or is this just a general optimization question? Also from your post it sounds like you are talking about a db with 1 table, is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if only 80% of your table rows are not accessed (as implied in your question), create an archive table for older records. Or, if the bulk of your performance hits are from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
In regard to table order, Jet/ACE writes the actual table date in PK order. If you want a different order, change the PK.
But this oughtn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field Length, create your fields as large as you'll need them but don't go over - for instance, if you have a number field and the value will never be over 1000 (for the sake of argument) then don't type it as a Long Integer, something smaller like Integer would be more appropriate, or use a single instead of a double for decimal numbers, etc. By the same token, if you have a text field that won't have more than 50 chars, don't set it up for 255, etc, etc. Sounds obvious, but it's done, often times with the idea that "I might need that space in the future" and your app suffers in the mean time.
Not to beat the indexing thing to death...but, tables that you're joining together in your queries should have relationships established, this will create indexes on the foreign keys which greatly increases the performance of table joins (NOTE: Double check any foreign keys to make sure they did indeed get indexed, I've seen cases where they haven't been - so apparently a relationship doesn't explicitly mean that the proper indexes have been created)
Apparently compacting your DB regularly can help performance as well, this reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under tools Analyze > Performance, it might be worth running it on your tables & queries at least to see what it comes up with. The table analyzer (available from the same menu) can help you split out tables with alot of redundant data, obviously, use with caution - but it's could be helpful.
This link has a bunch of stuff on access performance optimization on pretty much all aspects of the database, tables, queries, forms, etc - it'd be worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here it is useful to consider how access works, in an un-indexed table there is unlikely to be any value in organising the data so that recently accessed records are at the end. Indeed by the virtue of the fact that Access / the JET engine is an ISAM database it's the other way around. (http://en.wikipedia.org/wiki/ISAM) That's rather moot however as I would never suggest putting frequently accessed values at the top of a table, it is best as others have said to rely on useful indexes.

How many database indexes is too many?

I'm working on a project with a rather large Oracle database (although my question applies equally well to other databases). We have a web interface which allows users to search on almost any possible combination of fields.
To make these searches go fast, we're adding indexes to the fields and combinations of fields on which we believe users will commonly search. However, since we don't really know how our customers will use this software, it's hard to tell which indexes to create.
Space isn't a concern; we have a 4 terabyte RAID drive of which we are using only a small fraction. However, I'm worried about the possible performance penalties of having too many indexes. Because those indexes need to be updated every time a row is added, deleted, or modified, I imagine it'd be a bad idea to have dozens of indexes on a single table.
So how many indexes is considered too many? 10? 25? 50? Or should I just cover the really, really common and obvious cases and ignore everything else?
It depends on the operations that occur on the table.
If there's lots of SELECTs and very few changes, index all you like.... these will (potentially) speed the SELECT statements up.
If the table is heavily hit by UPDATEs, INSERTs + DELETEs ... these will be very slow with lots of indexes since they all need to be modified each time one of these operations takes place
Having said that, you can clearly add a lot of pointless indexes to a table that won't do anything. Adding B-Tree indexes to a column with 2 distinct values will be pointless since it doesn't add anything in terms of looking the data up. The more unique the values in a column, the more it will benefit from an index.
I usually proceed like this.
Get a log of the real queries run on the data on a typical day.
Add indexes so the most important queries hit the indexes in their execution plan.
Try to avoid indexing fields that have a lot of updates or inserts
After a few indexes, get a new log and repeat.
As with all any optimization, I stop when the requested performance is reached (this obviously implies that point 0. would be getting specific performance requirements).
Everyone else has been giving you great advice. I have an added suggestion for you as you move forward. At some point you have to make a decision as to your best indexing strategy. In the end though, the best PLANNED indexing strategy can still end up creating indexes that don't end up getting used. One strategy that lets you find indexes that aren't used is to monitor index usage. You do this as follows:-
alter index my_index_name monitoring usage;
You can then monitor whether the index is used or not from that point forward by querying v$object_usage. Information on this can be found in the Oracle® Database Administrator's Guide.
Just remember that if you have a warehousing strategy of dropping indexes before updating a table, then recreating them, you will have to set the index up for monitoring again, and you'll lose any monitoring history for that index.
In data warehousing it is very common to have a high number of indexes. I have worked with fact tables having two hundred columns and 190 of them indexed.
Although there is an overhead to this it must be understood in the context that in a data warehouse we generally only insert a row once, we never update it, but it can then participate in thousands of SELECT queries which might benefit from indexing on any of the columns.
For maximum flexibility a data warehouse generally uses single column bitmap indexes except on high cardinality columns, where (compressed) btree indexes can be used.
The overhead on index maintenance is mostly associated with the expense of writing to a great many blocks and the block splits as new rows are added with values that are "in the middle" of existing value ranges for that column. This can be mitigated by partitioning and having the new data loads aligned with the partitioning scheme, and by using direct path inserts.
To address your question more directly, I think it is probably fine to index the obvious at first, but do not be afraid of adding more indexes on if the queries against the table would benefit.
In a paraphrase of Einstein about simplicity, add as many indexes as you need and no more.
Seriously, however, every index you add requires maintenance whenever data is added to the table. On tables that are primarily read only, lots of indexes are a good thing. On tables that are highly dynamic, fewer is better.
My advice is to cover the common and obvious cases and then, as you encounter issues where you need more speed in getting data from specific tables, evaluate and add indices at that point.
Also, it's a good idea to re-evaluate your indexing schemes every few months, just to see if there is anything new that needs indexing or any indices that you've created that aren't being used for anything and should be gotten rid of.
In addition to the points everyone else has raised, the Cost Based Optimizer incurs a cost when creating a plan for an SQL statement if there are more indexes because there are more combinations for it to consider. You can reduce this by correctly using bind variables so that SQL statements stay in the SQL cache. Oracle can then do a soft parse and re-use the plan it found last time.
As always, nothing is simple. If there are skewed columns and histograms involved then this can be a bad idea.
In our web applications we tend to limit the combinations of searches that we allow. Otherwise you would have to test literally every combination for performance to ensure you did not have a lurking problem that someone will find one day. We have also implemented resource limits to stop this causing issues elsewhere in the application should something go wrong.
I made some simple tests on my real project and real MySql database. I already answered in this topic: What is the cost of indexing multiple db columns?
But I think it will be better if I quote it here:
I made some simple tests using my real
project and real MySql database.
My results are: adding average index
(1-3 columns in an index) to a table -
makes inserts slower by 2.1%. So, if
you add 20 indexes, your inserts will
be slower by 40-50%. But your selects
will be 10-100 times faster.
So is it ok to add many indexes? - It
depends :) I gave you my results - You
decide!
Ultimately how many indexes you need depend on the behavior of your applications that ride on top of your database server.
In general the more inserting you do the more painful your indexes become. Each time you do an insert, all the indexes that include that table have to be updated.
Now if your application has a decent amount of reading, or even more so if it's almost all reading, then indexes are the way to go as there will be major performance improvements for very little cost.
There's no static answer in my opinion, this sort of thing falls under 'performance tuning'.
It could be that everything your app does is looked up by a primary key, or it could be the oposite in that queries are done over unristricted combinations of fields and any one in particular could be used at any given time.
Beyond just indexing, there's reogranizing your DB to include calculated search fields, splitting tables, etc - it's really dependant on your load shapes and query parameters, how much/what data 'really' needs to be retruend by a query.
If your entire DB is fronted by stored-procedure facades turning becomes a bit easier, as you don't have to wory about every ad-hoc query. Or you may have a deep understanding of the kind of queries that will hit your DB, and can limit the tuning to those.
For SQL Server I've found the Database Engine Tuning advisor usefull - you set up 'typical' workloads and it can make recommendations about adding/removing indexes and statistics. I'm sure other DBs have similar tools, either 'offical' or third party.
This really is a more theoretical questions than practical. Indexes impact on your performance depends on the hardware you have, the version of Oracle, index types, etc. Yesterday I heard Oracle announced a dedicated storage, made by HP, which is supposed to perform 10 times faster with 11g database.
As for your case, there can be several solutions:
1. Have a large amount of indexes (>20) and rebuild them daily (nightly). This would be especially useful if the table gets thousands of updates/deletes daily.
2. Partition your table (if that applies your data model).
3. Use a separate table for new/updated data, and run a nightly process which combines the data together. This would require a change in your application logic.
4. Switch to IOT (index organized table), if your data support this.
Of course there might be many more solutions for such case. My first suggestion to you, would be to clone the DB to a development environment, and run some stress testing against it.
An index imposes a cost when the underlying table is updated. An index provides a benefit when it is used to spped up a query. For each index, you need to balance the cost against the benefit. How much slower does the query run without the index? How much of a benefit is running faster? Can you or your users tolerate the slow speed when the index is missing?
Can you tolerate the additional time it takes to complete an update?
You need to compare costs and benefits. That's particular to your situation. There's no magic number of indexes that passes the threshold of "too many".
There's also the cost of the space needed to store the index, but you've said that in your situation that's not an issue. The same is true in most situations, given how cheap disk space has become.
If you do mostly reads (and few updates) then there's really no reason not to index everything you'll need to index. If you update often, then you may need to be cautious on how many indexes you have. There's no hard number, but you'll notice when things start to slow down. Make sure your clustered index is the one that makes the most sense based on the data.
One thing you may consider is building indexes to target a standard combination of searches. If column1 is commonly searched, and column2 is often used with it, and column3 is sometimes used with column2 and column1, then an index on column1, column2, and column3 in that order can be used for any of those three circumstances, though it is only one index that has to be maintained.
How many columns are there?
I have always been told to make single-column indexes, not multi-column indexes. So no more indexes than the amount of columns, IMHO.
What it really comes down to is, don't add an index unless you know (and this often means gathering usage statistics) that it will be used far more often than it's updated.
Any index that doesn't meet that criteria will cost you more to rebuild than the performance penalty of not having it in the odd case it got used.
Sql server gives you some good tools that let you see which indexes are actually being used.
This article, http://www.mssqltips.com/tip.asp?tip=1239, gives you some queries that let you get a better insight into how much an index is used, as opposed to how much it is updated.
It is totally based on the columns which are being used in Where Clause.
And as the Thumb of Rule, we must have indexes on Foreign Key Columns to avoid DEADLOCKS.
AWR report should analyze periodically to understand the need of indexes.

Resources