GUID vs INT IDENTITY [duplicate] - database

Possible Duplicate:
How do you like your primary keys?
I'm aware of the benefits of using a GUID, as well as the benefits of using an INT as a PK in a database. Considering that a GUID is in essence a 128-bit INT and a normal INT is 32-bit, the INT is a space saver (though this point is generally moot in most modern systems).
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?

Kimberly Tripp (SQLSkills.com) has an article on using GUIDs as primary keys. She advises against it because of the unnecessary overhead.

To answer your question:
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
I would use a GUID if my system had an online/offline mode, where data saved in the offline version is transferred back to the server later during a sync. That way, you are sure you won't end up with the same key twice inside your database.
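For illustration, a minimal T-SQL sketch of that setup (table and column names are invented for the example): rows created on disconnected clients get a random GUID from NEWID(), so keys generated offline won't collide when they are synced back.

    -- Hypothetical table whose rows may be created offline and synced to the server later.
    CREATE TABLE dbo.Orders
    (
        OrderId   UNIQUEIDENTIFIER NOT NULL
                  CONSTRAINT DF_Orders_OrderId DEFAULT NEWID()
                  CONSTRAINT PK_Orders PRIMARY KEY,
        OrderDate DATETIME       NOT NULL,
        Amount    DECIMAL(18, 2) NOT NULL
    );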

We use GUIDs everywhere in our very complex enterprise software, and it works smoothly.
I believe GUIDs are semantically more suitable to serve as identifiers. There is also no point in worrying about performance until you are actually faced with that problem; beware premature optimization.
There is also an advantage with database migration of any sort. With GUIDs you will have no collisions. If you attempt to merge several DBs where ints are used for identity, you will have to replace their values. If those old values were used in URLs, they will now be different, and your SEO will take a hit.

Apart from being a poor choice when you need to synchronize several database instances, INTs have one drawback I haven't seen mentioned: inserts always occur at one end of the index tree. This increases lock contention when you have a table with a lot of movement (since the same index pages have to be modified by concurrent inserts, whereas GUIDs will be inserted all over the index). The index may also have to be rebalanced more often if a B* tree or similar data structure is used.
Of course, ints are easier on the eye when doing manual queries and report construction, and space consumption may add up through FK usage.
I'd be interested to see any measurements of how well e.g. SQL Server actually handles insert-heavy tables with IDENTITY PKs.
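Not a benchmark, but a hedged sketch of the two layouts you would compare; NEWSEQUENTIALID() is SQL Server's middle ground that keeps a GUID key while making inserts roughly append-only again (table names are invented):

    -- Sequential INT key: compact, but every insert lands on the last index page.
    CREATE TABLE dbo.EventsInt
    (
        EventId INT IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
        Payload NVARCHAR(200) NOT NULL
    );

    -- GUID key generated with NEWSEQUENTIALID(): 16 bytes, but inserts stay roughly sequential,
    -- avoiding the random page splits that NEWID() values cause.
    CREATE TABLE dbo.EventsGuid
    (
        EventId UNIQUEIDENTIFIER NOT NULL
                CONSTRAINT DF_EventsGuid_EventId DEFAULT NEWSEQUENTIALID()
                CONSTRAINT PK_EventsGuid PRIMARY KEY CLUSTERED,
        Payload NVARCHAR(200) NOT NULL
    );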

the INT is a space saver (though this point is generally moot in most modern systems).
Not so. It may seem so at first glance, but note that the primary key of each table will be repeated multiple times throughout the database, in indexes and as a foreign key in other tables. And it will be involved in nearly any query containing its table - and very intensively when it's a foreign key used for a join.
Furthermore, remember that modern CPUs are very, very fast, but RAM speeds have not kept up. Cache behaviour therefore becomes increasingly important. And the best way to get good cache behaviour is to have smaller data sets. So the seemingly irrelevant difference between 4 and 16 bytes may well result in a noticeable difference in speed. Not necessarily always - but it's something to consider.
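As a rough back-of-the-envelope illustration (the numbers are invented for the example): with 50 million rows and five nonclustered indexes that each carry the key, the 12 extra bytes per key occurrence - once in the table, once per index - come to roughly 50,000,000 × 12 × 6 ≈ 3.6 GB of additional storage that has to be read, cached and logged.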

When comparing values, such as in a primary-to-foreign-key relationship, the INT will be faster. If the tables are indexed properly and the tables are small, you might not see much of a slowdown, but you'd have to try it to be sure. INTs are also easier to read and communicate to other people. It's a lot simpler to say, "Can you look at record 1234?" instead of "Can you look at record 031E9502-E283-4F87-9049-CE0E5C76B658?"

If you are planning on merging databases at some stage, i.e. for a multi-site replication-type setup, GUIDs will save a lot of pain. But other than that I find INTs easier.

If the data lives in a single database (as most data for the applications that we write in general does), then I use an IDENTITY. It's easy, intended to be used that way, doesn't fragment the clustered index and is more than enough. You'll run out of room at 2 billion some records (~ 4 billion if you use negative values), but you'd be toast anyway if you had that many records in one table, and then you have a data warehousing problem.
If the data lives in multiple, independent databases or interfaces with a third-party service, then I'll use the GUID that was likely already generated. A good example would be a UserProfiles table in the database that maps users in Active Directory to their user profiles in the application via their objectGUID that Active Directory assigned to them.
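A minimal sketch of such a mapping table (names are illustrative, not from the original post); the key is simply the objectGUID that Active Directory already assigned:

    CREATE TABLE dbo.UserProfiles
    (
        ObjectGuid  UNIQUEIDENTIFIER NOT NULL PRIMARY KEY,  -- the AD objectGUID, reused as-is
        DisplayName NVARCHAR(256)    NOT NULL,
        CreatedAt   DATETIME         NOT NULL DEFAULT GETDATE()
    );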

Some OSes no longer generate GUIDs based on unique hardware features (CPU ID, MAC address), because it made tracing users too easy (privacy concerns). This means GUID uniqueness is often no longer as universal as many people think.
If you use some auto-id function of your database, the database could in theory make absolutely sure that there is no duplication.

I always think PKs should be numeric where possible. Don't forget that having GUIDs as a PK will probably mean that they are also used in other tables as foreign keys, so paging and index size will be greater.

An INT is certainly much easier to read when debugging, and much smaller.
I would, however, use a GUID or similar as a license key for a product. You know it's going to be unique, and you know that it's not going to be sequential.

I think the database also matters. From a MySQL perspective - generally, the smaller the datatype the faster the performance.
It seems to hold true for int vs GUID too -
http://kccoder.com/mysql/uuid-vs-int-insert-performance/

I would use a GUID as a PK only if the key is naturally bound to such a value. For example, a user ID (users in Windows NT are described by GUIDs), or a user group ID.
Another example: if you develop a distributed document management system where different parts of the system, in different places all over the world, can create documents. In such a case I would use a GUID, because it guarantees that two documents created in different parts of the distributed system won't have the same ID.

Related

Questions about database indexes

When a database index is created for a unique constraint on a field or multiple indexes are created for a multiple field unique constraint, are those same indexes able to be used for efficiency gains when querying for objects, much in the same way any other database index is used? My guess is that the indexes created for unique constraints are the same as ones created for efficiency gains, and the unique constraint itself is something additional, but I'm not well versed in databases.
Is it ever possible to break a unique constraint, including a multiple-field constraint (e.g. field_a and field_b are unique together), in any way through long transactions, high concurrency, etc.? Or does a unique constraint offer 100% protection?
As to question 1:
YES - these are indexes just like any other indexes you define, and they are used in query plans, for example for performance gains... you can define unique indexes without defining a "unique constraint", by the way.
As to question 2:
YES - it is 100% protection as long as the DB engine is ACID compliant and reliable (i.e. no bugs in this regard) and as long as you don't temporarily disable the constraint.
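To make that concrete, a small SQL Server sketch (table and column names are made up); both statements end up creating a unique index that the optimizer can use for lookups:

    -- Option 1: declare a unique constraint; the engine backs it with a unique index automatically.
    ALTER TABLE dbo.Customers
        ADD CONSTRAINT UQ_Customers_Email UNIQUE (Email);

    -- Option 2: create a unique index directly, without declaring a constraint.
    CREATE UNIQUE INDEX UX_Customers_TaxNumber ON dbo.Customers (TaxNumber);

    -- Either index can serve a query like this:
    SELECT CustomerId FROM dbo.Customers WHERE Email = 'someone@example.com';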
Yes. A unique constraint is an index (in SQL Server) and will (can) be used in query plans
This is impossible. Regardless of transaction times or concurrency issues, you cannot store data in a table that violates a constraint (at least in SQL Server). BTW, if your transactions are so long that you're worried about this, you NEED to rethink what you're doing in the context of this transaction. Even though you won't violate database constraints with long transaction operations, YOU WILL run into other problems.
The problem with your question is that it is very general and not tailored to a specific implementation. Therefore any answer will be quite generic, too.
In this mind:
Whenever a database thinks that access via an index might speed things up, it will do so - uniqueness is not a concern here. If many indexes exist on one table, a decent database will try to use the "best" one - with different views about what "best" actually means. BUT many databases will only use one index to get a row. Therefore, as a rule of thumb, DBs usually try to use indexes where lookups result in as few rows as possible. A unique index is quite good at this. :-)
Actually this is not one point but two different points:
A decent DB will not corrupt your index, even for long-running transactions or high concurrency. At least not on purpose. And if it does, it is either a bug in the DB software which has to be fixed very quickly - otherwise the DB vendor might suffer a very drastic reputation loss - or it is not a decent DB at all but a mere persistent hashmap or something like that. If the data really matters, then high concurrency and long-running transactions are no excuse.
Multi-column unique indexes are a beast: DB implementations differ slightly in what they consider "unique" when one or more of the key columns contain NULL. For example, you can look at the PostgreSQL documentation regarding this point: http://www.postgresql.org/docs/9.1/interactive/indexes-unique.html
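Concretely, SQL Server and PostgreSQL disagree here: SQL Server treats NULLs as equal for uniqueness purposes, while PostgreSQL by default treats them as distinct. A small illustrative sketch (hypothetical table):

    CREATE TABLE dbo.Memberships
    (
        PersonId INT NULL,
        GroupId  INT NULL
    );
    CREATE UNIQUE INDEX UX_Memberships ON dbo.Memberships (PersonId, GroupId);

    INSERT INTO dbo.Memberships VALUES (1, NULL);
    INSERT INTO dbo.Memberships VALUES (1, NULL);  -- duplicate key error on SQL Server,
                                                   -- but allowed by default on PostgreSQL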
Hope this makes some things clear.

GUID vs. int ID auto-increment

I'm trying to shift my design sensibilities from the LAMP stack to the Microsoft stack, and I just thought of something - when would I want to use a GUID? What benefits/drawbacks does it have compared to the old, reliable auto-incremented int?
Your experience never having needed anything beyond autoincremental ids might suggest that GUIDs are often a solution in search of a problem. Use them when and if you ever run into a requirement where your familiar pattern doesn't work. The microsoftiness is irrelevant.
The only realistic scenario I've seen is merging tables from two sources.
"old, reliable auto-incremented int" depends rather strongly on just how scalable your database needs to be. Auto incrementing stops working in the trivial case when you have a setup with at least two masters. It's not too difficult to work around that, of course, because it's such a common problem; Different database engines may coordinate the sequence between the masters, for instance only one master may allocate from any given sequence.
When you get into sharding the data, it's normally desirable to know the shard from the key. An auto incremented id doesn't contain information about which shard should host that record.
GUIDs solve the problem in a different way; two distinct masters have distinct host identifiers (typically a MAC address). Since that is used in computing a new GUID, distinct masters cannot create GUIDs that collide. Furthermore, since the host is part of the ID, it can be used to directly identify the shard that holds the record.
A third option is to use no surrogate keys at all (neither auto increment integers nor guids).
One problem with GUIDs:
Because they are not sequential, the database has to work hard to update indexes. With a sequential ID, it can usually just append to the end (more or less). Since GUIDs are random, it has to fit each one into an existing block. That said, we use GUIDs for some tables and they seem to work fine even under fairly heavy load.
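If you want to see that effect on your own tables, SQL Server exposes index fragmentation through a DMV; a hedged sketch (the table name is illustrative):

    -- Compare fragmentation between a randomly keyed table and a sequentially keyed one.
    SELECT i.name AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.Orders'), NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
        ON i.object_id = ips.object_id AND i.index_id = ips.index_id;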
I would recommend using an int ID instead of a GUID, for the following reasons:
An int ID has a size of 4 bytes (32 bits), whereas a GUID has a size of 16 bytes (128 bits), which is 4 times larger. In this case you might have performance problems and storage implications.
By default, GUIDs are not sequential (be careful).
It's easy to visualize links and relationships between tables.

Clustered indexes on non-identity columns to speed up bulk inserts?

My two questions are:
Can I use clustered indexes to speed up bulk inserts in big tables?
Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?
To elaborate, I have a database with a couple of very big tables (between 100 and 1,000 million rows) containing company data. Typically there is data on about 20-40 companies in such a table, each as its own "chunk" marked by "CompanyIdentifier" (INT). Also, every company has about 20 departments, each with its own "subchunk" marked by "DepartmentIdentifier" (INT).
It frequently happens that a whole "chunk" or "subchunk" is added or removed from the table. My first thought was to use Table Partitioning on those chunks, but since I am using SQL Server 2008 Standard Edition I am not entitled to it. Still, most queries I have are executed on a "chunk" or "subchunk" rather than on the table as a whole.
I have been working to optimize these tables for the following functions:
Queries that are run on subchunks
"Benchmarking" queries that are run on the table as a whole
Inserting/removing big chunks of data.
For 1) and 2) I haven't encountered a lot of problems. I have created several indexes on key fields (also containing CompanyIdentifier and DepartmentIdentifier where useful) and the queries are running fine.
But for 3) I have struggled to find a good solution.
My first strategy was to always disable indexes, bulk insert a big chunk and rebuild indexes. This was very fast in the beginning, but now that there are a lot of companies in the database, it takes a very long time to rebuild the index each time.
At the moment my strategy has changed to just leaving the index on while inserting, since this seems to be faster now. But I want to optimize the insert speed even further.
I seem to have noticed that by adding a clustered index defined on CompanyIdentifier + DepartmentIdentifier, the loading of new "chunks" into the table is faster. I had previously abandoned this strategy in favour of a clustered index on an IDENTITY column, as several articles pointed out to me that the clustered index is contained in all other indexes and so the clustered index should be as small as possible. But now I am thinking of reviving this old strategy to speed up the inserts. My question: would this be wise, or will I suffer performance hits in other areas? And will this really speed up my inserts, or is that just my imagination?
I am also not sure whether in my case an IDENTITY column is really needed. I would like to be able to establish foreign key relationships with other tables, but can I also use something like a CompanyIdentifier+DepartmentIdentifier+[uniquifier] scheme for that? Or does it have to be a table-wide, fragmented IDENTITY number?
Thanks a lot for any suggestions or explanations.
Well, I've put it to the test, and putting a clustered index on the two "chunk-defining" columns increases the performance of my table.
Inserting a chunk is now relatively fast compared to the situation where I had a clustered IDENTITY key, and about as fast as when I did not have any clustered index. Deleting a chunk is faster than it was either with the clustered IDENTITY key or with no clustered index at all.
I think the fact that all the records I want to delete or insert are guaranteed to be together on a certain part of the hard disk makes these operations faster - it seems logical to me.
Update: After a year of experience with this design I can say that for this approach to work, it is necessary to schedule regular rebuilding of all the indexes (we do it once a week). Otherwise, the indexes become fragmented very soon and performance is lost. Nevertheless, we are in a process of migration to a new database design with partitioned tables, which is basically better in every way - except for the Enterprise Server license cost, but we've already forgotten about it by now. At least I have.
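For anyone wanting to try the same layout, a hedged sketch of what is described above (the two identifier columns come from the question; everything else is invented):

    CREATE TABLE dbo.CompanyData
    (
        Id                   INT IDENTITY(1, 1) NOT NULL,
        CompanyIdentifier    INT NOT NULL,
        DepartmentIdentifier INT NOT NULL,
        MeasureValue         DECIMAL(18, 4) NOT NULL,
        -- keep the IDENTITY as a nonclustered PK so foreign keys can still reference it
        CONSTRAINT PK_CompanyData PRIMARY KEY NONCLUSTERED (Id)
    );

    -- Cluster on the "chunk-defining" columns so each chunk is physically contiguous.
    CREATE CLUSTERED INDEX CIX_CompanyData_Chunk
        ON dbo.CompanyData (CompanyIdentifier, DepartmentIdentifier);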
A clustered index is a physical index, a physical data structure, a row order. If you insert in the middle of the clustered index, the data will be physically inserted in the middle of the present data. I imagine a serious performance issue in this case. I only know this from theory, because if I did it in practice it would be a mistake, according to my theoretical knowledge.
Therefore, I only use (and advise the use) of clustered indexes on fields that are always, physically, inserted at the end, preserving the order.
A clustered index can be placed on a datetime field which marks the moment of insertion or something like that, because physically they will be ordered after appending a row. Identity is a good clustered index also, but not always relevant for querying.
In your solution you place a [uniquifier] field, but why do this when you can put an identity that will do just that? It will be unique, physically ordered, small (for foreign keys in other tables means smaller index), and in some cases faster.
Can't you try this and experiment? I have a similar situation here, where I have 4 billion rows, with more constantly being inserted (up to 100 per second). The table has no primary key and no clustered index, so the propositions in this topic are VERY interesting to me too.
Can I use clustered indexes to speed up bulk inserts in big tables?
Never! Imagine another million rows that you need to put into that table and have physically ordered; that is a colossal loss of performance in the long run.
Can I then still efficiently use foreign key relationships if my IDENTITY column is not the clustered index anymore?
Absolutely. By the way, a clustered index is no silver bullet and may be slower than an ordinary index.
Have a look at the System.Data.SqlClient.SqlBulkCopy API. Given your requirement to move significant numbers of rows into and out of the database, it might be what you need.
Bulk copy streams the data into the table in a single operation and then performs the index check once. I use it to copy 500,000 rows in and out of a database table, and its performance is an order of magnitude better than any other technique I've tried, assuming that your application can be structured to make use of the API.
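If you would rather stay on the T-SQL side than call the .NET SqlBulkCopy API, BULK INSERT takes a similar bulk-load path from the server; a hedged sketch with an invented staging table and file path:

    -- Load a whole chunk in one bulk operation; TABLOCK lets the engine use minimal logging
    -- when the other preconditions for it are met.
    BULK INSERT dbo.CompanyData_Staging
    FROM 'C:\loads\company_chunk.csv'
    WITH (
        FIELDTERMINATOR = ',',
        ROWTERMINATOR   = '\n',
        TABLOCK,
        BATCHSIZE = 100000
    );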
I've been playing around with some ETL stuff lately. I went through just regularly inserting into the table, then removing and re-adding indexes before and after the insert, tried MERGE statements, and then I finally tried SSIS. I'm sold on SSIS. Just yesterday I managed to cut an ETL process (~24 million records, ~6 GB) from ~1-1.5 hours per run to ~24 minutes, just by letting SSIS handle the inserts.
I believe that with Advanced Services you should be able to use SSIS.
(Given you have already chosen the Answer and given yourself the points, this is provided as a free service, a charitable act !)
A little knowledge is a dangerous thing. There are many issues to be considered, and they must be considered together. Taking any one issue and examining it in isolation is a very fragmented way to go about administering a database: you will forever be finding some new truth and changing everything you thought before. Before launching into it, please read this question/answer for context.
Do not forget, these days anyone with a keyboard and a modem can get their "papers" published. Some of them work for MS, evangelising the latest "enhancement"; others publish glowing reports of features they have never used, or used only once, in one context, but they publish that it works in every context. (Look at Spence's answer: he is enthusiastic and "sold" but under scrutiny, the statements are false; he is not a bad person, just typical of the masses in the MS world and how they operate; how they publish.)
Note: I use the term MicroSofties to describe those people who believe in the gatesian notion that any unqualified person can administer a database; and that MS will fix everything. It is not intended as an insult, more as an endearment, because of the belief in magic, and the suspension of the laws of physics.
Clustered Indices
Were designed for Relational databases, by real engineers (Sybase, before MS acquired the code) who have more brains than all of MS put together. Relational databases have Relational Keys, not Idiot keys. These are multi-column keys, that automatically distribute the data, and therefore the insert load, eg. inserting Invoices for various Companies all the time (although not in our discussed case of "chunks").
If you have good Relational keys, CIs provide Range Queries (your (1) and (2)) and other advantages that NCIs simply do not have.
Starting off with Id columns, before modelling and normalising the data, severely hinders the modelling and normalisation processes.
If you have an Idiot database, then you will have more indices than not. The contents of many MS databases are not "relational", they are commonly just unnormalised filing systems, with way more indices than a Normalised database would have. Therefore there is a big push, a lot of MS "enhancements" to try and give these abortions a bit of speed. Fix the symptom but don't go anywhere near the problem that caused the symptom.
In SQL 2005 and again in 2008 MS has screwed around with CIs, and the result is they are now better in some ways, but worse in other ways; the universality of CIs has been lost.
It is not correct that NCIs carry the CI (the CI is the basic single storage structure; the NCIs are secondary, and dependent on the CI; that's why when you re-create a CI, all the NCIs are automatically re-created). The NCIs carry the CI Key at the leaf level.
Microsoft has its problems, which change with the major releases (but are not eliminated):
and in MS this is not efficiently done, so the NCI index size is large; in enterprise DBMS when this is efficiently done, this is not a consideration.
In the MS world, therefore, it is only half true that the CI key should be as short as possible. If you understand that the consideration is the size of NCIs, and if you are willing to incur that expense in return for a table that is very fast due to a carefully constructed CI, then that is the best option.
The common advice that the CI should be the Idiot column is totally and completely wrong. The worst candidate for a CI key is a monotonically increasing value (IDENTITY, DATETIME, etc). Why? Because you have guaranteed that all concurrent inserts will fight for the current insert location, the last page of the index.
The real purpose of Partitioning (which MS provided 10 years after the Enterprise vendors) is to spread this load. Sure, they then have to provide a method of allocating the Partitions, on guess what, nothing but a Relational Key; but to start with, now the Idiot key is spread across 32 or 64 Partitions, providing better concurrency.
the CI must be Unique. Relational dbs demand Unique keys, so that is a no-brainer.
But for the amateurs who have poured non-relational contents into the database, if they do not know this rule, but they know that the CI spreads the data (a little knowledge is a dangerous thing), they keep their Idiot key in a NCI (good) but they create the CI on an almost-but-not-quite Unique Key. Deadly. CI's must be Unique, that is a design demand. Duplicate (remember we are talking CI Key here) rows are off-page, located in Overflow pages, and the (then) last page; and constitute a method of badly fragmenting the Page Chain.
Update, since this point is being questioned elsewhere. I have already stated the MS keeps changing the methods without fixing the problem.
The MS Online manual, with its pretty pictures (not technical diagrams), tells us that in 2008 they replaced (substituted one for the other) Overflow Pages with the adorable "Uniqueifier".
That totally satisfies the MicroSofties. Non-Unique CIs are not a problem. It is handled by magic. Case closed.
But there is no logic or completeness to the statements, and qualified people will ask the obvious questions: where is this "Uniqueifier" located? On every row, or just the rows needing "Uniqueifying"? DBCC PAGE shows it is on every row. So MS has just added a 4-byte secret column (including handling overhead) to every row, instead of a few Overflow Pages for the non-unique rows only. That's MS's idea of engineering.
End Update
Anyway, the point remains that Non-Unique CIs have a substantial overhead (now more than before) and should be avoided. You would be better off adding a 1- or 2-byte column yourself, to force uniqueness.
.
Therefore, unchanged from the beginning (1984), the best candidate for a CI is a multi-column unique Relational key (I cannot say that yours is for sure, but it certainly looks like it).
And put any monotonically increasing keys (IDENTITY, DATETIME) in an NCI.
Remember also that the CI is a single storage structure, which eliminates the (otherwise) Heap; the CI B-Tree is married to the rows at the Leaf level; the Leaf Level entry is the row. That guarantees one less read on every access.
So it is not possible that an NCI+Heap can be faster than a CI. Another common myth in the MS world that defies the laws of physics: navigating a B-Tree and writing to the one place you are already in has got to be faster than additionally writing the row to a separate storage structure. But MicroSofties do believe in magic; they've suspended the laws of physics.
.
There are many other features you need to learn and use, I will mention at least FILLFACTOR and RESERVEPAGEGAP, to give this post a bit of completeness. Do not use these features until you understand them. All performance features have a cost that you need to understand and accept.
CIs are also self-trimming at both the Page and Extent level, there is no wasted space. PageSplits are something to monitor for (Random inserts only), and that is easily modulated by FILLFACTOR and RESERVEPAGEGAP.
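A hedged SQL Server example of the FILLFACTOR feature mentioned above (RESERVEPAGEGAP is a Sybase option with no direct SQL Server equivalent); the index and table names are illustrative:

    -- Rebuild the clustered index leaving 10% free space on each leaf page,
    -- so random inserts between rebuilds cause fewer page splits.
    ALTER INDEX CIX_CompanyData_Chunk ON dbo.CompanyData
        REBUILD WITH (FILLFACTOR = 90);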
And read the SO site for Clustered Indices, but keep in mind all the above, esp. the first two paras.
Your Specific Case
By all means, get rid of your surrogate keys (Idiot columns), and replace them with true natural Relational keys. Surrogates are always an additional key and index; that is a price that should not be forgotten or taken lightly.
CompanyIdentifier+DepartmentIdentifier+[uniquifier] is exactly what I am talking about. Now notice that they are already INTs, and very fast, so it is very silly to add a NUMERIC(10,0) Idiot Key. Use a 1- or 2-byte column to force Uniqueness.
If you get this right, you may not need a Partition licence.
The CompanyIdentifier+DepartmentIdentifier+[uniquifier] is the perfect candidate (not knowing anything about your db other than that which you have posted) for a CI, in the context that you perform mass delete/insert periodically. Detailed above.
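A hedged sketch of that key shape (names are invented; the tiny extra column is the hand-rolled uniquifier described above, which only works if no single (company, department) chunk exceeds the smallint range - use an INT if it does):

    CREATE TABLE dbo.CompanyFacts
    (
        CompanyIdentifier    INT      NOT NULL,
        DepartmentIdentifier INT      NOT NULL,
        RowSeq               SMALLINT NOT NULL,   -- 2-byte uniquifier within each chunk
        MeasureValue         DECIMAL(18, 4) NOT NULL
    );

    CREATE UNIQUE CLUSTERED INDEX CIX_CompanyFacts
        ON dbo.CompanyFacts (CompanyIdentifier, DepartmentIdentifier, RowSeq);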
Contrary to what others have stated, this is a good thing, and does not fragment the CI. Let's say you have 20 Companies, and you delete 1, which constitutes 5% of the data. That entire PageChain, which was reasonably contiguous, is now relegated to the FreePageChain, contiguous and intact. To be precise, you have a single point of fragmentation, but not fragmentation in the sense of the normal use of the word. And guess what, if you turn around and perform a mass insert, where do you think that data will go? That's right, the exact same physical location as the deleted rows. And the FreePageChain moves to the PageChain, an extent and page at a time.
.
But what is alarming is that you did not know about the demand for the CI to be Unique. It is sad that the MicroSofties write rubbish, but not why/what each simplistic rule is based on; not the core information. The exact symptom of non-unique CIs is that the table will be very fast immediately after DROP/CREATE CI, and then slow down over time. A good Unique CI will hold its speed, and it would take a year to slow down (2 years on my large, active banking dbs).
4 hours is a very long time for 1 Billion rows (I can recreate a CI on 16 billion rows with a 6-column key in 3 minutes on an enterprise platform). But in any case, that means you have to schedule it as regular weekly or demand maintenance.
Why aren't you using the WITH SORTED_DATA option? Wasn't your data sorted before the drop? This option rewrites the CI Non-leaf pages but not the leaf pages (containing the rows). It can only do that if it is confident that the data was sorted. Not using this option rewrites every page, in physical order.
Now, please be kind. Before you ask me twenty questions, read up a little and understand all the issues I have defined here.

Access database performance

For a few different reasons one of my projects is hosted on a shared hosting server
and developed in asp.Net/C# with access databases (Not a choice so don't laugh at this limitation, it's not from me).
Most of my queries are on the last few records of the databases they are querying.
My question is in 2 parts:
1- Is the order of the records in the database only visual, or is there an actual difference internally? More specifically, the reason I ask is that, the way it is currently designed, all records (for all databases in this project) are ordered ascending by a row-identifying key (which is an AutoNumber field). But since over 80% of my queries will be querying fields that should be towards the end of the table, would it increase query performance if I set the table to show the most recent record at the top instead of at the end?
2- Are there any other performance tuning that can be done to help with access tables?
"Access" and "performance" is an euphemism but the database type wasn't a choice
and so far it hasn't proven to be a big problem but if I can help the performance
I would sure like to do whatever I can.
Thanks.
Edit:
No, I'm not currently experiencing issues with my current setup, just trying to look forward and optimize everything.
Yes, I do have indexes and have a primary key (automatically indexes) on the unique record identifier for each of my tables. I definitely should have mentioned that.
You're all saying the same thing: I'm already doing all that can be done for Access performance. I'll give the "accepted answer" to the one who answered fastest.
Thanks everyone.
As far as I know...
1 - That change would just be visual. There'd be no impact.
2 - Make sure your fields are indexed. If the fields you are querying on are unique, then make sure you make the fields a unique key.
Yes there is an actual order to the records in the database. Setting the defaults on the table preference isn't going to change that.
I would ensure there are indexes on all your where clause columns. This is a rule of thumb. It would rarely be optimal, but you would have to do workload testing against different database setups to prove the most optimal solution.
I work daily with a legacy Access system that can be reasonably fast with concurrent users, but only for a smallish number of users.
You can use indexes on the fields you search for (aren't you already?).
http://www.google.com.br/search?q=microsoft+access+indexes
The order is most likely not the problem. Besides, I don't think you can really change it in Access anyway.
What is important is how you are accessing those records. Are you accessing them directly by the record ID? Whatever criteria you use to find the data you need, you should have an appropriate index defined.
By default, there will only be an index on the primary key column, so if you're using any other column (or combination of columns), you should create one or more indexes.
Don't just create an index on every column though. More indexes means Access will need to maintain them all when a new record is inserted or updated, which makes it slower.
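In Access you can add such indexes either in the table designer or with a DDL query; a hedged Jet SQL sketch (table and field names are made up):

    -- Index the column(s) your WHERE clauses actually filter on.
    CREATE INDEX idxOrdersCustomer ON Orders (CustomerID);

    -- Where the data allows it, a unique index gives the engine even more to work with.
    CREATE UNIQUE INDEX idxOrdersNumber ON Orders (OrderNumber);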
Here's one article about indexes in Access.
Have a look at the field or fields you're using to query your data and make sure you have an index on those fields. If it's the same as SQL server you won't need to include the primary key in the index (assuming it's clustering on this) as it's included by default.
If you're running queries on a small sub-set of fields you could get your index to be a 'covering' index by including all the fields required, there's a space trade-off here, so I really only recommend it for 5 fields or less, depending on your requirements.
Are you actually experiencing a performance problem now or is this just a general optimization question? Also from your post it sounds like you are talking about a db with 1 table, is that accurate? If you are already experiencing a problem and you are dealing with concurrent access, some answers might be:
1) indexing fields used in where clauses (mentioned already)
2) Splitting tables. For example, if the bulk of your table rows are rarely accessed (as your question implies), create an archive table for older records. Or, if the bulk of your performance hits come from reads (complicated reports) and you don't want to impinge on performance for people adding records, create a separate reporting table structure and query off of that.
3) If this is a reporting scenario, all queries are similar or the same, concurrency is somewhat high (very relative number given Access) and the data is not extremely volatile, consider persisting the data to a file that can be periodically updated, thus offloading the querying workload from the Access engine.
In regard to table order, Jet/ACE writes the actual table data in PK order. If you want a different order, change the PK.
But this oughtn't be a significant issue.
Indexes on the fields other than the PK that you sort on should make sorting pretty fast. I have apps with 100s of thousands of records that return subsets of data in non-PK sorted order more-or-less instantaneously.
I think you're engaging in "premature optimization," worrying about something before you actually have an issue.
The only circumstances in which I think you'd have a performance problem is if you had a table of 100s of thousands of records and you were trying to present the whole thing to the end user. That would be a phenomenally user-hostile thing to do, so I don't think it's something you should be worrying about.
If it really is a concern, then you should consider changing your PK from the Autonumber to a natural key (though that can be problematic, given real-world data and the prohibition on non-Null fields in compound unique indexes).
I've got a couple of things to add that I didn't notice being mentioned here, at least not explicitly:
Field length: create your fields as large as you'll need them, but don't go over. For instance, if you have a number field and the value will never be over 1000 (for the sake of argument), then don't type it as a Long Integer; something smaller like Integer would be more appropriate. Or use a Single instead of a Double for decimal numbers, etc. By the same token, if you have a text field that won't have more than 50 chars, don't set it up for 255, etc. Sounds obvious, but it's done, oftentimes with the idea that "I might need that space in the future", and your app suffers in the meantime.
Not to beat the indexing thing to death... but tables that you're joining together in your queries should have relationships established; this will create indexes on the foreign keys, which greatly increases the performance of table joins. (NOTE: double-check any foreign keys to make sure they did indeed get indexed; I've seen cases where they haven't been - so apparently a relationship doesn't necessarily mean that the proper indexes have been created.)
Apparently compacting your DB regularly can help performance as well, this reduces internal fragmentation of the file and can speed things up that way.
Access actually has a Performance Analyzer, under Tools > Analyze > Performance; it might be worth running it on your tables and queries at least to see what it comes up with. The Table Analyzer (available from the same menu) can help you split out tables with a lot of redundant data. Obviously, use it with caution - but it could be helpful.
This link has a bunch of stuff on access performance optimization on pretty much all aspects of the database, tables, queries, forms, etc - it'd be worth checking out for sure.
http://office.microsoft.com/en-us/access/hp051874531033.aspx
To understand the answers here it is useful to consider how Access works. In an un-indexed table there is unlikely to be any value in organising the data so that recently accessed records are at the end; indeed, by virtue of the fact that Access / the JET engine is an ISAM database (http://en.wikipedia.org/wiki/ISAM), it's the other way around. That's rather moot, however, as I would never suggest putting frequently accessed values at the top of a table; it is best, as others have said, to rely on useful indexes.

Strategy for coping with database identity/autonumber maxing out

Autonumber fields (e.g. "identity" in SQL Server) are a common method for providing a unique key for a database table. However, given that they are quite common, at some point in the future we'll be dealing with the problem where they will start reaching their maximum value.
Does anyone know of or have a recommended strategy to avoid this scenario? I expect that many answers will suggest switching to guids, but given that this will take large amount of development (especially where many systems are integrated and share the value) is there another way? Are we heading in a direction where newer hardware/operating systems/databases will simply allow larger and larger values for integers?
If you really expect your IDs to run out, use bigint. For most practical purposes, it won't ever run out, and if it does, you should probably use a uniqueidentifier.
If you have 3 billion transactions per second (that is, one transaction per clock cycle on a 3.0 GHz processor), it'll take about a century for bigint to run out (even if you give up a bit for the sign).
That said, as someone once put it, "640K ought to be enough for anybody." :)
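Spelling out that arithmetic: a signed bigint tops out around 9.2 × 10^18, and 9.2 × 10^18 divided by 3 × 10^9 per second is about 3.1 × 10^9 seconds, or roughly 97 years.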
See these related questions:
Reset primary key (int as identity)
Is bigint large enough for an event log table?
The accepted/top voted answers pretty much cover it.
Is there a possibility to cycle through the numbers that have been deleted from the database? Or are most of the records still alive? Just a thought.
My other idea was Mehrdad's suggestion of switching to bigint
Identity columns are typically set to start at 1 and increment by +1. Negative values are just as valid as positive, thus doubling the pool of available identifiers.
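A hedged SQL Server sketch of that idea (the table name is invented); you can either seed a new table at the bottom of the range or reseed an existing one, provided no negative IDs are already in use:

    -- New table: start the identity at the smallest 32-bit value.
    CREATE TABLE dbo.AuditLog
    (
        AuditId INT IDENTITY(-2147483648, 1) NOT NULL PRIMARY KEY,
        Message NVARCHAR(400) NOT NULL
    );

    -- Existing table: move the current seed into the negative range.
    DBCC CHECKIDENT ('dbo.AuditLog', RESEED, -2147483648);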
If you may be getting such large amounts of data that your IDs max out, you may also want support for replication, so that you can have multiple, somehow synchronized instances of your database.
For such cases, and for cases where you want to avoid having "guessable" IDs (web applications etc.), I'd suggest using Guids (Uniqueidentifiers) with a default of a new Guid as replacement for identity columns.
Because Guids are unique, they allow data to be synchronized properly, even if records were added concurrently to the system.
