When to use TableGenerator? - database

Can you tell me when to use it, and why I should create a new table just to keep primary keys, when most DBMSs now support auto-increment and let you adjust it easily?

The major advantage of the TableGenerator is portability: it is the only strategy guaranteed to work with any database. Another advantage is that table sequencing is fully transactional and permits truly sequential ids to be allocated (at the cost of performance and concurrency), if desired.
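For concreteness, here is a minimal sketch of the table-based sequencing pattern that a TableGenerator relies on; the table and column names are illustrative, not the exact ones JPA generates.

    -- One row per logical sequence; plain ANSI SQL.
    CREATE TABLE id_gen (
        gen_name  VARCHAR(50) PRIMARY KEY,
        gen_value INT NOT NULL
    );
    INSERT INTO id_gen (gen_name, gen_value) VALUES ('customer_seq', 0);

    -- Allocating the next id inside a transaction: the UPDATE takes a
    -- row lock, so concurrent allocators serialize and ids can be
    -- allocated truly sequentially.
    UPDATE id_gen SET gen_value = gen_value + 1 WHERE gen_name = 'customer_seq';
    SELECT gen_value FROM id_gen WHERE gen_name = 'customer_seq';

In practice JPA amortizes the extra round trip by reserving a block of ids at a time (the allocationSize attribute of @TableGenerator).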
See also
JPA wiki book
Advanced Sequencing

SQL Server 2008 or above: Is there a clear criterion to determine whether I should or should not create an index?

I have been reading "Microsoft SQL Server 2008: A Beginner's Guide" and came across the chapter on indices. The book does a really good job of demonstrating how to create an index, but not why to create an index.
Nor have I been able to find any satisfactory explanation on the Internet. All the search results I find are variations on a theme: "how to create an index."
My question is this: is there a clear criterion I can use to determine whether I should create an index? I have read so far that indices should be used intelligently, and that they might create overhead problems for tables with frequent inserts, updates, and deletes (which is true for my database).
Is the answer "it depends -- keep monitoring and adjusting" or is there a crisp criterion I can use to determine whether I should use indices?
Thank you for any insight.
There's no crisp criterion; there are just rules of thumb:
Overall rule: if a field is used in a "decision context" (a WHERE, JOIN, ORDER BY, etc.), then it should be indexed (a sketch follows below).
After that, it comes down to usage cases, i.e.:
Rarely read, frequently modified: indexes may hurt. The overhead of keeping the indexes current will probably outweigh any benefit they give the relatively rare read operations.
Frequently read, rarely modified: definitely index. Index maintenance overhead is only paid when the table is modified; since you modify rarely, the efficiency gains for read operations will vastly outweigh the index overhead costs.
Some reads, some updates: no way to tell; you'd have to benchmark your particular cases. A table with one index will have far less maintenance overhead than a table with hundreds of (overlapping?) indexes.
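As a hypothetical illustration of the overall rule (the table and column names here are invented):

    -- orders are read far more often than modified, and queries
    -- filter and join on these columns, so they are index candidates.
    CREATE INDEX ix_orders_customer ON orders (customer_id);  -- JOIN column
    CREATE INDEX ix_orders_placed   ON orders (placed_at);    -- WHERE / ORDER BY

    SELECT o.order_id, c.name
    FROM orders o JOIN customers c ON c.customer_id = o.customer_id
    WHERE o.placed_at >= '2012-01-01'
    ORDER BY o.placed_at;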

Questions about database indexes

When a database index is created for a unique constraint on a field or multiple indexes are created for a multiple field unique constraint, are those same indexes able to be used for efficiency gains when querying for objects, much in the same way any other database index is used? My guess is that the indexes created for unique constraints are the same as ones created for efficiency gains, and the unique constraint itself is something additional, but I'm not well versed in databases.
Is it ever possible to break a unique constraint, including a multiple-field constraint (e.g. field_a and field_b are unique together), in any way through long transactions, high concurrency, etc.? Or does a unique constraint offer 100% protection?
As to question 1:
YES - these are indexes like any other indexes you define, and they are used in query plans for performance gains, for example. By the way, you can define unique indexes without defining a "unique constraint" (see the sketch below).
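A short sketch of the two equivalent ways to get such an index (names illustrative; either statement alone is enough):

    -- declared constraint: documents the rule and creates a unique index
    ALTER TABLE users ADD CONSTRAINT uq_users_email UNIQUE (email);

    -- bare unique index: same enforcement and same use in query plans,
    -- without a named constraint in the schema
    CREATE UNIQUE INDEX ux_users_email ON users (email);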
As to question 2:
YES - it is 100% protection as long as the DB engine is ACID compliant and reliable (i.e. no bugs in this regard), and as long as you don't temporarily disable the constraint.
Yes. A unique constraint is an index (in SQL Server) and will (can) be used in query plans
This is impossible. Regardless of transaction times or concurrency issues, you cannot store data in a table that violates a constraint (at least in SQL Server). BTW, if your transactions are so long that you're worried about this, you NEED to rethink what you're doing in the context of this transaction. Even though you won't violate database constraints with long transaction operations, YOU WILL run into other problems.
The problem with your question is that it is very general and not tailored to a specific implementation. Therefore any answer will be quite generic too.
With this in mind:
Whenever a database thinks that access via an index might speed things up, it will use one; uniqueness is no concern here. If many indexes exist on one table, a decent database will try to use the "best" one, with differing views of what "best" actually means. BUT many databases will only use one index to get a row. Therefore, as a rule of thumb, DBs usually try to use indexes whose lookups yield as few rows as possible, and a unique index is quite good at this. :-)
Actually this is not one point but two different points:
A decent DB will not corrupt your index, even under long-running transactions or high concurrency. At least not on purpose. If it does, either it is a bug in the DB software that has to be fixed very quickly (otherwise the DB vendor suffers a drastic loss of reputation), or it is not a decent DB but a mere persistent hashmap or something like that. If the data really matters, then high concurrency and long-running transactions are no excuse.
Multi-column unique indexes are a beast: DB implementations differ slightly in what they consider "unique" when one or more of the key columns contain NULL (illustrated below). For example, you can look at the PostgreSQL documentation regarding this point: http://www.postgresql.org/docs/9.1/interactive/indexes-unique.html
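To make the NULL point concrete, a small sketch with an invented schema:

    CREATE TABLE t (field_a INT, field_b INT);
    CREATE UNIQUE INDEX ux_t_ab ON t (field_a, field_b);

    -- PostgreSQL treats NULLs as distinct in unique indexes,
    -- so both of these inserts succeed:
    INSERT INTO t VALUES (1, NULL);
    INSERT INTO t VALUES (1, NULL);
    -- SQL Server, by contrast, treats NULLs as equal in a unique
    -- index and would reject the second insert.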
Hope this makes some things clear.

database row/ record pointers

I don't know the correct words for what I'm trying to find out about, and as such I'm having a hard time googling.
I want to know whether it's possible with databases (technology-independent, but I would be interested to hear whether it's possible with Oracle, MySQL and Postgres) to point to specific rows instead of executing my query again.
So I might initially execute a query, find some rows of interest, and then wish to avoid searching for them again by keeping a list of pointers or some other metadata indicating their location in the database, which I can go to directly the next time I want those results.
I realise there is caching on databases, but I want to keep these "pointers" elsewhere, so caching doesn't ultimately solve this problem. Is this just an index, and I store the index and look up by this? Most of my current tables don't have indexes, and I don't want the speed decrease that sometimes comes with indexes.
So what's the magic term I've been trying to put into Google?
Cheers
In Oracle it is called ROWID. It identifies the file, the block number, and the row number in that block. I can't say that what you are describing is a good idea, but this might at least get you started looking in the right direction.
Check here for more info: http://www.orafaq.com/wiki/ROWID.
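A hedged sketch of how that looks in Oracle (the ROWID literal below is just a placeholder value):

    -- capture the physical address once...
    SELECT ROWID, emp_name FROM employees WHERE emp_name = 'SMITH';

    -- ...then later fetch by address, skipping the search entirely
    SELECT emp_name FROM employees WHERE ROWID = 'AAAR3sAAEAAAACXAAA';

Bear in mind that a ROWID can change (row movement, export/import, table reorganization), so it is not a durable pointer.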
By the way, the "speed decrease that comes with indexes" that you are afraid of is only relevant if you do more inserts and updates than reads. Indexes only speed up reads, so if the read ratio is high, you might not have an issue and an index might be your best solution.
"most of my current tables don't have indexes and I don't want the speed decrease that sometimes comes with indexes."
And you also don't want the speed increase which usually comes with indexes but you want to hand-roll a bespoke pseudo-cache instead?
I'm not being snarky here; this is a serious point. Database designers have expended a great deal of skill and energy on optimizing their products. Wouldn't it be more sensible to learn how to take advantage of their efforts rather than re-implementing some core features?
In general, the best way to handle this sort of requirement is to use the primary key (or in fact any convenient, compact unique identifier) as the 'pointer', and rely on the indexed lookup to be swift - which it usually will be.
You can use ROWID in more DBMSs than just Oracle, but it generally isn't recommended, for a variety of reasons. If you succumb to the 'every table has an autoincrement column' school of database design, then you can record the autoincrement column values as the identifiers, as in the sketch below.
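A minimal sketch of the key-as-pointer approach (invented schema):

    -- run the expensive search once and remember the compact keys
    SELECT order_id FROM orders WHERE status = 'OPEN';

    -- later, fetch exactly those rows via the indexed primary key
    SELECT * FROM orders WHERE order_id IN (101, 205, 314);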
You should have at least one index on (almost) all of your tables: the index for the primary key. The exception might be a table so small that it fits easily in memory, won't be updated, and is used often enough not to be evicted from memory. Then an index might be a distraction; however, such tables are typically seldom updated, so the index won't harm anything, and the optimizer will ignore it if it doesn't help (and it may not).
You may also have auxiliary indexes. In a system where most of the activity is reading the data, you may want to err on the side of having more indexes rather than fewer, because access time is most critical. If your system is update-intensive, then you would go with fewer indexes, because there is a cost associated with updating indexes when data is added, removed or updated. Clearly, you need to design the indexes to work well with the queries that your users (or your applications) actually perform.
You may also be interested in cursors. (Note that the index debate is still valid with cursors.)
Wikipedia definition here.

What advantages do constraints provide to a database?

I realize this question may seem a little on the "green" side, but after the number of "enterprise" or "commercial" databases I've encountered, I've begun to ask it. What advantages do constraints provide to a database? I'm asking more about foreign key constraints than unique constraints. Do they offer performance gains, or just data integrity?
I've been rather surprised at the number of relational databases without foreign keys or even without specified primary keys (just constraints on fields being not null or a unique constraint on the field).
Thoughts?
"just" data integrity? You say that like it's a minor thing. In all applications, it's critical. So yes, it provides that, and it's a huge benefit.
Data integrity is what they offer. If anything, they have a performance cost (albeit usually a very minor one).
They provide both performance and data integrity, and the latter is paramount to any serious system. I cringe every time I see a database without any foreign keys and where all integrity is done through triggers (if at all). And I saw quite a bit of those out there.
The following, assuming you get the constraint right in the first place:
Your data will be valid with respect to the constraint.
The database knows your data is valid with respect to the constraint and can use this when querying or updating the database (e.g. removing an unnecessary join for a query on a view; see the sketch after this list).
The constraint is documented for future users of the database.
A violation of the constraint will be caught as soon as possible, not in some later, unrelated process that fails.
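The join-removal point deserves a sketch. Some optimizers (e.g. Oracle and SQL Server, when the foreign key is enabled and trusted) can eliminate a join entirely because the constraint guarantees a match; the names here are illustrative:

    CREATE VIEW order_details AS
        SELECT o.order_id, o.amount, c.customer_name
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id;

    -- orders.customer_id is a NOT NULL foreign key to customers, so a
    -- query touching only order columns lets the optimizer drop the join:
    SELECT order_id, amount FROM order_details;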
In relational theory, a database that allows inconsistent data isn't really a relational database. Foreign keys are necessary for data integrity and consistency to keep the database "relational"; i.e. the logical model of the database is always correct.
In practical terms, it's usually easier to define a foreign key and let the DB engine handle making sure the relation is valid (a one-line sketch follows the list below). The other options are:
nothing - guaranteed data corruption at some point
DB triggers - which will usually be slower and less performant
application code - which will eventually cause problems when either you forget to call the right code or another application accesses the database
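For comparison, the declarative version is a single statement (names illustrative):

    ALTER TABLE orders
        ADD CONSTRAINT fk_orders_customer
        FOREIGN KEY (customer_id) REFERENCES customers (customer_id);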
Data is an asset. Lots of textbooks state that.
But it is actually wrong. It should rather say "correct data is an asset, incorrect data is a liability".
And database constraints give you the best possible guarantee that data is correct.
In some DBMSs (e.g. Oracle) constraints can actually improve the performance of some queries, since the optimiser can use the constraints to gain knowledge about the structure of the data. For some examples, see this Oracle Magazine article.
I would say all required constraints must be in the database. Foreign key constraints prevent unusable data. They aren't a nice-to-have - they are a requirement unless you want a useless database. Foreign keys may hurt the performance of deletes and updates, but that is OK. Is it better to take a little longer to do a delete (or to tell the application not to delete this person because he has orders in the system), or to delete the user but not his data? Lack of foreign keys may cause unexpected and often serious problems in querying the data. For instance, the cost reports may depend on all the tables having related data, and so may fail to show important data because one or more tables have nothing to join to.
Unique constraints are also a requirement of any decent database. If a field or group of fields must be unique, failing to define this at the database level is to create data problems that are extremely hard to fix.
You don't mention other constraints, but you should. Any business rule that must always be applied to all data in the table should always be applied in the database: through a datatype (such as a datetime datatype, which will not accept '02\31\2009' as a valid date), through a constraint (say, one that does not allow the field to have a value greater than 100), or through a trigger if the logic is so complex it cannot be handled by an ordinary constraint. (Triggers are tricky to write if you don't know what you are doing, so if you have logic this complex, you hopefully have a database professional on your team.) The order is important too: datatypes are the first choice, followed by constraints, followed by triggers as a last choice, as sketched below.
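A small sketch of the first two choices in that order (invented table):

    CREATE TABLE readings (
        taken_at  DATE NOT NULL,  -- datatype: rejects impossible dates outright
        pct_value INT  NOT NULL
            CHECK (pct_value <= 100)  -- declarative rule: no values over 100
    );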
Simpler Application Code
One nice thing they provide is that your application code has to do a lot less error checking and validation. Contrast these two bits of code and multiply by thousands of operations and you can see there's a big win.
    get department number for employee   # known valid because of constraints
    do something with department number

vs.

    get department number for employee
    if department number is empty
        ...
    else if department number not in list of good department numbers
        ...
    else
        do something with department number
Of course, people that ignore constraints probably don't put a lot of effort into code validation anyway... :-/
Oh, and if the data constraints change, it's a database configuration issue and not a code change issue.
Integrity constraints are especially important when you integrate several applications using a shared database.
You may be able to properly manage data integrity in a single application's code (and even if you don't, at least the broken data affects only that application), but with multiple apps it gets hairy (and at the least redundant).
"Oh, and if the data constraints change, it's a database configuration issue and not a code change issue."
Unless it's a constraint that disappears from the design. In that case, there definitely IS a code impact, because some code might have been written that depends on that removed constraint being there.
That must always be taken into consideration when relaxing or removing any declared constraint.
Other than that, you are of course completely right.

GUID vs INT IDENTITY [duplicate]

Possible Duplicate:
How do you like your primary keys?
I'm aware of the benefits of using a GUID, as well as the benefits of using an INT, as a PK in a database. Considering that a GUID is in essence a 128-bit INT and a normal INT is 32-bit, the INT is a space saver (though this point is generally moot in most modern systems).
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
Kimberley Tripp (SQLSkills.com) has an article on using GUIDs as primary keys. She advises against it because of the unnecessary overhead.
To answer your question:
In the end, in what circumstances would you see yourself using an INT as a PK versus a GUID?
I would use a GUID if my system had an online/offline version, where the offline version can save data that is transferred back to the server during a later sync. That way, you are sure you won't have the same key twice inside your database.
We have Guids in our very complex enterprise software everywhere. Works smoothly.
I believe Guids are semantically more suitable to serve as identifiers. There is also no point in unnecessarily worrying about performance until you are faced with that problem. Beware premature optimization.
There is also an advantage with database migration of any sort. With GUIDs you will have no collisions. If you attempt to merge several DBs where ints are used for identity, you will have to replace their values. If those old values were used in URLs, they will now be different, with a resulting SEO hit.
Apart from being a poor choice when you need to synchronize several database instances, INTs have one drawback I haven't seen mentioned: inserts always occur at one end of the index tree. This increases lock contention when you have a table with a lot of movement, since the same index pages have to be modified by concurrent inserts, whereas GUIDs will be inserted all over the index. The index may also have to be rebalanced more often if a B*-tree or similar data structure is used.
Of course, ints are easier on the eye when doing manual queries and report construction, and space consumption may add up through FK usages.
I'd be interested to see any measurements of how well e.g. SQL Server actually handles insert-heavy tables with IDENTITY PKs.
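A middle ground sometimes used in SQL Server is NEWSEQUENTIALID(), which keeps GUID keys but makes inserts append-only again; a hedged sketch with an invented table:

    CREATE TABLE events (
        event_id UNIQUEIDENTIFIER NOT NULL
            CONSTRAINT df_events_id DEFAULT NEWSEQUENTIALID(),
        payload  NVARCHAR(400) NOT NULL,
        CONSTRAINT pk_events PRIMARY KEY (event_id)
    );
    -- Trade-off: less page fragmentation than NEWID(), but the
    -- last-page insert hot spot described above comes back.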
"the INT is a space saver (though this point is generally moot in most modern systems)."
Not so. It may seem so at first glance, but note that the primary key of each table will be repeated multiple times throughout the database: in indexes, and as a foreign key in other tables. It will also be involved in nearly any query containing its table, and very intensively when it's a foreign key used for a join.
Furthermore, remember that modern CPUs are very, very fast, but RAM speeds have not kept up. Cache behaviour therefore becomes increasingly important, and the best way to get good cache behaviour is to have smaller data sets. So the seemingly irrelevant difference between 4 and 16 bytes may well result in a noticeable difference in speed. Not necessarily always - but it's something to consider.
When comparing values, such as in a primary-to-foreign-key relationship, the INT will be faster. If the tables are indexed properly and the tables are small, you might not see much of a slowdown, but you'd have to try it to be sure. INTs are also easier to read and communicate with other people. It's a lot simpler to say, "Can you look at record 1234?" instead of "Can you look at record 031E9502-E283-4F87-9049-CE0E5C76B658?"
If you are planning on merging databases at some stage, i.e. for a multi-site replication type setup, GUIDs will save a lot of pain. But other than that, I find INTs easier.
If the data lives in a single database (as most data for the applications that we write in general does), then I use an IDENTITY. It's easy, intended to be used that way, doesn't fragment the clustered index and is more than enough. You'll run out of room at 2 billion some records (~ 4 billion if you use negative values), but you'd be toast anyway if you had that many records in one table, and then you have a data warehousing problem.
If the data lives in multiple, independent databases or interfaces with a third-party service, then I'll use the GUID that was likely already generated. A good example would be a UserProfiles table in the database that maps users in Active Directory to their user profiles in the application via their objectGUID that Active Directory assigned to them.
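A hypothetical SQL Server contrast of the two situations described above:

    -- data lives in a single database: IDENTITY is enough
    CREATE TABLE invoices (
        invoice_id INT IDENTITY(1,1) PRIMARY KEY,
        total      DECIMAL(10,2) NOT NULL
    );

    -- the key originates elsewhere (e.g. Active Directory's objectGUID):
    -- store the GUID that already exists
    CREATE TABLE user_profiles (
        object_guid  UNIQUEIDENTIFIER PRIMARY KEY,
        display_name NVARCHAR(100) NOT NULL
    );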
Some OSes no longer generate GUIDs based on unique hardware features (CPU ID, MAC address) because it made tracing users too easy (privacy concerns). This means GUID uniqueness is often no longer as universal as many people think.
If you use some auto-id function of your database, the database could in theory make absolutely sure that there is no duplication.
I always think PKs should be numeric where possible. Don't forget that having GUIDs as a PK will probably mean they are also used in other tables as foreign keys, so paging and index size etc. will be greater.
An INT is certainly much easier to read when debugging, and much smaller.
I would, however, use a GUID or similar as a license key for a product. You know it's going to be unique, and you know that it's not going to be sequential.
I think the database also matters. From a MySQL perspective - generally, the smaller the datatype the faster the performance.
It seems to hold true for int vs GUID too -
http://kccoder.com/mysql/uuid-vs-int-insert-performance/
I would use a GUID as a PK only if the key corresponds to a value that is already a GUID: for example, a user id (users in WinNT are described with GUIDs), or a user group id.
Another example: if you develop a distributed document management system in which different parts of the system, in different places all over the world, can create documents. In such a case I would use a GUID, because it guarantees that two documents created in different parts of the distributed system won't have the same id.
