Request suggestions for defining a primary key in Oracle 11g


All,
I have read several other posts before posing my question to you.
I have a lot of programming/admin experience with other databases: MySQL, MSSQL, PostgreSQL, and the one which will not be named. I just don't have much experience with Oracle.
I was tasked with designing a few web applications and supporting database tables. The tables were designed using an ER diagram and sent to the development group for implementation. When they sent back the proposed table creation statements, I saw two things that seem wrong to me: the primary key is NUMBER(5), and the sequence sets MAXVALUE to 99999.
I would have expected MAXVALUE to be omitted in favor of NOMAXVALUE, and the primary key column to be a NUMBER(*,0) or a LONG. Since I don't have much experience with Oracle table design, would you please offer your advice?
Sincerely
Kristofer Hoch
Edit
Thank you for the information on LONG. I'll make sure to use NUMBER, but I'm still unclear on the best way to define it: NUMBER, or NUMBER(*,0), or NUMBER(9), etc.

I agree with you: since the column is to hold a surrogate key generated from a sequence, the only possible purpose of the 5 digit limit would be to restrict the total number of rows ever allowed in the table to under 100,000 - which would seem perverse. It certainly does not confer any performance or space efficiency advantages. Probably it is just the default of their ERD tool's DDL generator.
Do not use LONG: in Oracle that is an obsolete and deprecated way of storing large text strings (for which CLOB is now preferred).

If it was me, I wouldn't specify the size of the number, but that may be more laziness (set it, forget it) than a good practice for design purposes.
But "long" would cause problems. In Oracle, "long" is a character data type that's being deprecated. It's not the same thing as the long data type for numbers in other languages/systems. It's tricky.

Related

MS SQL: What is more efficient? Using a junction table or storing everything in a varchar?

Here is a simple question to which I would like an answer:
We have a member table. Each member practices one, many or no sports. Initially we (the developers) created a [member] table, a [sports] table and a [member_sports] table, just as we have always done.
However our client here doesn't like this and wants to store all the sports that the member practices in a single varchar column, separated with a special character.
So if:
1 is football
2 is tennis
3 is ping-pong
4 is swimming
and I like swimming and ping-pong, my favourite sports will be stored into the varchar column as:
x3,x4
Now we don't want to just walk up to the client and claim that his system isn't right. We would like to back it up with proof that the operation to fetch the sports from [member_sports] is more efficient than simply storing the fields as a varchar.
Is there any documentation that can back our claims? Help!

Ask your client if they care about storing accurate information[1] rather than random strings.
Then set them a series of challenges. First, ensure that the sport information is in the correct "domain". For the member_sports table, that is:
sport_id int not null
         ^
         |--correct type
For their "store everything in a varchar column" solution, I guess you're writing a CHECK constraint. A regex would probably help here but there's no native support for regex in SQL Server - so you're either bodging it or calling out to a CLR function to make sure that only actual int values are stored.
Next, we not only want to make sure that the domain is correct but that the sports are actually defined in your system. For member_sports, that's:
CONSTRAINT FK_Member_Sports_Sports FOREIGN KEY (Sport_ID) references Sports (Sport_ID)
For their "store everything in a varchar column" I guess this is going to be a far more complex CHECK constraint using UDFs to query other tables. It's going to be messy and procedural. Plus if you want to prevent a row from being removed from sports while it's still referenced by any member, you're talking about a trigger on the sports table that has to query every row in members[2].
Finally, let's say that it's meaningless for the same sport to be recorded for a single member multiple times. For member_sports, that is (if it's not the PK):
CONSTRAINT UQ_Member_Sports UNIQUE (Member_ID,Sport_ID)
For their "store everything in a varchar column" it's another horrifically procedural UDF called from a CHECK constraint.
Even if the varchar variant performed better (unlikely since you need to be ripping strings apart and T-SQL's string manipulation functions are notoriously weak (see above re: regex)) for certain values of "performs better", how do they propose that the data is meaningful and not nonsense?
Writing the procedural variants that can also cope with nonsense is an even more challenging endeavour.
In case it's not clear from the above - I am a big fan of Declarative Referential Integrity (DRI). Stating what you want versus focussing on mechanisms is a huge part of why SQL appeals to me. You construct the right DRI and know that your data is always correct (or, at least, as you expect it to be)
[1] "The application will always do this correctly" isn't a good answer. If you manage to build an application and related database in which nobody ever writes some direct SQL to fix something, I guess you'll be the first.
But in most circumstances, there's always more than one application, and even if the other application is a direct SQL client only employed by developers, you're already beyond being able to trust that the application will always act correctly. And bugs in applications are far more likely than bugs in SQL database engine's implementations of constraints, which have been tested far more times than any individual application's attempt to enforce constraints.
[2] Let alone the far more likely query: find all members who are associated with a particular sport. A second index on member_sports makes this a trivial query[3]. No index helps the "it's somewhere in this string" solution, so you're looking at a table scan.
[3] Any index that has sport_id first should be able to satisfy such a query.
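To make the three constraints above concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than SQL Server, purely so that it runs anywhere, so the dialect details (the PRAGMA, the INTEGER PRIMARY KEY columns, the index name) are assumptions of this sketch and not T-SQL. The table names follow the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this is switched on per connection.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE member (member_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE sports (sport_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE member_sports (
    member_id INTEGER NOT NULL REFERENCES member (member_id),
    sport_id  INTEGER NOT NULL REFERENCES sports (sport_id),
    CONSTRAINT uq_member_sports UNIQUE (member_id, sport_id)
);
-- Index with sport_id first, for "who practises sport X?" lookups.
CREATE INDEX ix_member_sports_sport ON member_sports (sport_id, member_id);
""")

conn.executemany(
    "INSERT INTO sports VALUES (?, ?)",
    [(1, "football"), (2, "tennis"), (3, "ping-pong"), (4, "swimming")],
)
conn.execute("INSERT INTO member VALUES (1, 'alice')")
conn.executemany("INSERT INTO member_sports VALUES (?, ?)", [(1, 3), (1, 4)])


def rejected(sql):
    """Return True if the database itself refuses the statement."""
    try:
        conn.execute(sql)
    except sqlite3.IntegrityError:
        return True
    return False


# Sport 99 does not exist: stopped by the foreign key.
unknown_sport_rejected = rejected("INSERT INTO member_sports VALUES (1, 99)")
# Same sport twice for one member: stopped by the UNIQUE constraint.
duplicate_rejected = rejected("INSERT INTO member_sports VALUES (1, 3)")
# Deleting a sport that is still referenced: stopped by the foreign key.
sport_in_use_protected = rejected("DELETE FROM sports WHERE sport_id = 4")

# "Find all members for a sport" is a plain equality lookup.
swimmers = [
    row[0]
    for row in conn.execute(
        "SELECT member_id FROM member_sports WHERE sport_id = 4"
    )
]
```

None of these checks needed any procedural code. With the "x3,x4" string, each one would have to be written and maintained by hand.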

How to choose between UUIDs, autoincrement/sequence keys and sequence tables for database primary keys?

I'm looking at the pros and cons of these three primary methods of coming up with primary keys for database rows.
So assuming I am using a database that supports more than one of these methods, is there a simple heuristic to determine what the best option would be for me?
How do considerations such as distributed/multiple masters, performance requirements, ORM use, security and testing affect the choice?
Any unexpected drawbacks that one might run into?

UUIDs
Unless these are generated "in increasing monotonic sequence" they can drastically hurt/fragment indexes. Support for UUID generation varies by system. While usable, I would not use a UUID as my primary clustered index/PK in most cases. If needed I would likely make it a secondary column, perhaps indexed, perhaps not.
Some people argue that UUIDs can be used to safely generate/merge records from an arbitrary number of systems. While a UUID (depending upon method) generally has an astronomically small chance of collision, it is possible to -- at least with some outside input or very bad luck :) -- generate collisions. I am of the belief that only a true PK should be transmitted between systems, which I would argue is not (or should not be) a database-generated UUID in most cases.
autoincrement/sequence keys and sequence tables
This really depends on what the database supports well. Some databases support sequences, which are more flexible than a simple "auto-increment". This may or may not be desirable (or it may simply be the only way to do this kind of task). Sequence tables are generally more flexible yet, but if this kind of "flexibility" is needed I would be tempted to go back and revisit the design, especially if it involves the use of triggers. While I dislike "limiting ORMs", that may also make a difference in choosing the "simpler" auto-increment or sequence types/database support.
Regardless of the method used, when using surrogate primary keys, the true primary key should still be identified and encoded into the schema.
In addition, I argue that "security compromises through exposing an auto-sequence PK" are a result of incorrectly exposing an internal database property. While a very simple way to handle CRUD operation, I believe there is a distinction between the internal keys and the exposed keys (e.g. pretty customer number).
Just my two cents.
Edit, additional replies to Tim:
I think the generated vs. true PK question is a very good one and one I need to consider also. I'd like UUIDs in general to the points you make. My hesitation was in size vs. an int/long. Was not aware of potential indexing de-optimizations, which is a much bigger concern for me.
I wouldn't really worry about the size -- if a UUID is best, then it's best. If it's not, then it's not. In the overall scheme the extra 12 bytes over an int likely won't make much of a difference. SQL Server 2005+ supports the NEWSEQUENTIALID() UUID generation function to avoid the fragmentation associated with normal UUID generation. The page discusses it some. I am sure that other databases have similar solutions.
And by "encoded into the schema", do you mean more than adding a uniqueness constraint?
Yes. The primary key doesn't have to be the only [unique] constraint. Just using a surrogate PK doesn't mean the database model should be compromised :-) Additional indexes can also be used to cover, etc.
And by "distinction between", are you saying that surrogate primary keys never leak out?
The wording in my initial post was a tad hard. It's not "never" so much as "if they do and it matters then that's another problem". Often times people complain of insecurity through guessable numbers -- e.g. if your order is 23 then there is likely an order 22 and 24, etc. If this is your "protection" and/or can leak sensitive information then the system is already flawed. (Separating internal and external ids does not inherently fix this issue and authentication/authorization is still required. However, it is one issue raised against using "sequential ids" -- I find encoding a nonce into distributed URLs handles this for my use-case rather well.)
More to what I really wanted to get across: Just because the surrogate PK id happens to be 8942 doesn't mean that it's order 8942. That is, keeping with the "some fields are internal only to db" design, the order "number" might be entirely unrelated on the surface (but fully supported in the DB model), such as "#2010-42c" or whatever makes sense for the business requirement(s). It is this external number that should be exposed in most cases.
I feel that sometimes the generated key is really the true primary key as other fields are mutable (eg. user may change email and username).
This may be the case within a database and I will not argue this statement. However, once again holding that the surrogate PK's are internal to the database, just make sure to only export/import tuples that can be well-identified. If the username/email may change, then this might very well include a UUID assigned upon account creation -- and could very well be the surrogate PK itself.
Of course, as with everything, remain open and fit the model to the problem, not the problem to the model :-) A service like Twitter, for instance, uses its own number generation scheme. See Twitter's new ID generation. Unlike [some] UUID generation, the approach by Twitter (assuming that all the servers are correctly set up) guarantees that none of the distributed machines/processes will ever generate a duplicate ID, requires only 64 bits, and maintains rough ordering (the most significant bits are a timestamp). (The number of records generated by Twitter may be in no way related to local requirements ;-)
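A rough sketch of that kind of generator follows. The 41/10/12 bit split is the layout commonly described for Twitter's scheme, but treat it as an assumption here; the class name, the epoch constant and the worker numbering are made up for illustration.

```python
import threading
import time

# Assumed layout: 41 bits of milliseconds, 10 bits of worker id,
# 12 bits of per-millisecond sequence (63 bits used in total).
WORKER_BITS = 10
SEQ_BITS = 12
SEQ_MASK = (1 << SEQ_BITS) - 1
EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch, not Twitter's


class SnowflakeLikeGenerator:
    """Time-ordered 64-bit IDs; unique as long as worker ids are unique."""

    def __init__(self, worker_id):
        if not 0 <= worker_id < (1 << WORKER_BITS):
            raise ValueError("worker_id out of range")
        self.worker_id = worker_id
        self._last_ms = -1
        self._seq = 0
        self._lock = threading.Lock()

    @staticmethod
    def _now_ms():
        return time.time_ns() // 1_000_000

    def next_id(self):
        with self._lock:
            now = self._now_ms()
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards")
            if now == self._last_ms:
                self._seq = (self._seq + 1) & SEQ_MASK
                if self._seq == 0:
                    # Sequence used up for this millisecond: wait for the next.
                    while now <= self._last_ms:
                        now = self._now_ms()
            else:
                self._seq = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (WORKER_BITS + SEQ_BITS))
                | (self.worker_id << SEQ_BITS)
                | self._seq
            )


gen_a = SnowflakeLikeGenerator(worker_id=1)
gen_b = SnowflakeLikeGenerator(worker_id=2)
ids_a = [gen_a.next_id() for _ in range(1000)]
ids_b = [gen_b.next_id() for _ in range(1000)]
```

Two generators never collide because the worker id is part of every value, which is exactly the "servers are correctly set up" condition: each process must be given a distinct worker id.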
Happy coding.

Database Table design

I am having a problem choosing the variable types for database table.
Can someone please give me some general guidelines as to how to choose the types? The following are some of the questions I have --
What should a userid be? An INT seems small as the design should take a large number of users into account. So if not INT what else? BIGINT? VARCHAR? Am I not thinking straight?
When should I choose varchar and text and tinytext, binary?
I searched online and did not find any helpful links. Can someone point me in the right direction? Maybe I need to read a book.

Update for 2013: The internet now has roughly the same number of users as a signed 32-bit integer. Facebook has about a billion users. So if you're planning on being the next Google or Facebook, and you definitely won't have time to re-engineer between now and then, go with something bigger. Otherwise stick with what's easy.
As of 2010:
An int seems small? Are you anticipating having 2,147,483,648 users? (For the record, that's like 1 in 3 people in the world, or 400 million more than the total number of people who use the internet; or 5 and a half times more users than Facebook).
Answer: use an int. Don't overthink it. If you get in a situation where our old friend the 32-bit integer just isn't cutting it for you, you will probably also have a staff of genius college graduates and $5b capital infusion to help solve the problem then.

I would actually make a user ID a character type since I'd rather log in as "pax" than 14860. However, if it has to be numeric (such as if you want the name to be changed easily or you want to minimise storage for foreign keys to the user table), an integer should be fine. Just how many users are you expecting to have? :-)
I would always use varchar for character data since I know that's supported on all SQL databases (or at least the ones that matter). I prefer to stay with the standards as much as possible so moving between databases is easier.

Your username field should probably be a varchar (you pick the length). If you have an id_user on the user table, I agree with the above answers that int (2 billion) will more than adequately cover your needs.
As far as your alphanumeric data types, it depends on how long you want it to be. For example, I think a varchar(255), or probably even smaller (personally I'd go with 50), would work for a username and password field. This chart might help you.
Binary would be for storing non-character data such as images. bit can be used for boolean values.
A couple other useful types are uniqueidentifier for UUID/GUID and xml for XML. Although you can also use varchar or text types for those too.
There's also nvarchar for Unicode in SQL Server and several date types (I'm glad they came out with date in SQL Server 2008, which, unlike smalldatetime, doesn't include a time).

SQL 2008 data types - Which ones to use?

I'm using SQL Express 2008 edition. I've planned my database tables/relationships on a piece of paper and was ready to begin.
Unfortunately, when I clicked on data type for my pkID_Officer I was offered more than I expected, and therefore doubt has set in.
Can anyone point me to a place where these data types are explained and examples are given as to which fields work better with which data types?
Examples:
is int for an ID (primary key) still the obvious choice or does the uniqueidentifier take its crown?
Telephone numbers where the digits are separated by '.' (01.02.03.04.05)
Email
items that will be hyper-links
nChar and vChar?
Any help is always appreciated.
Thanks
Mike.

The MSDN site has a good overview of the SQL 2008 datatypes.
http://msdn.microsoft.com/en-us/library/ms187752.aspx
For the ID field use a guid if it needs to be unique across separate systems or tables or you want to be able to generate the ID outside of the database. Otherwise an int/identity value works just fine.
I store telephone numbers as character data since I won't ever be doing calculations on them. I would think email would be stored the same way.
As for hyperlinks you can basically store the hyperlink by itself as a varchar and render the link on the client or store the markup itself in the database. Really depends on the circumstances.
Use nvarchar if you ever think you'll need to support double byte languages now or in the future.

For primary key I always prefer (as starting point) to use an auto increment int. It makes everything more usable and you don't have any "natural" relationship with the real data. Of course there can be exception to this...

is int for an ID (primary key) still the obvious choice or does the uniqueidentifier take its crown?
I personally prefer INT IDENTITY over GUIDs - especially for your clustered index. GUIDs are random in nature and thus lead to a lot of index fragmentation and therefore poor performance when used as a clustered index on SQL Server. INT doesn't have this trouble, plus it's only 4 bytes vs. 16 bytes, so if you have lots of rows, and lots of non-clustered indexes (the clustered key gets added to each and every entry in each and every non-clustered index), using a GUID will unnecessarily bloat your space requirements (on disk and also in your machine's RAM)
Telephone numbers where the digits are separated by '.' (01.02.03.04.05)
Email
items that will be hyper-links
I'd use string fields for all of these.
VARCHAR is fine, as long as you don't need any "foreign" language support, e.g. it's okay for English and Western European languages, but fails on Eastern European and Asian languages (Cyrillic, Chinese etc.).
NVARCHAR will handle all those extra pesky languages at a price - each character is stored in 2 bytes, e.g. a string of 100 chars will use 200 bytes of storage - ALWAYS.
Hope this helps a bit !
Marc

uniqueidentifier are GUIDs:
http://de.wikipedia.org/wiki/Globally_Unique_Identifier
Guids
- are worldwide unique
- have their own type in .NET, for example (System.Guid)
- are 16 bytes (128 bits) long
If you want such things, then use it. If you are fine with int-ids, then it's ok.
Telephone numbers / emails / hyperlinks are normal strings.

NCHAR/NVARCHAR are the Unicode counterparts of the CHAR/VARCHAR datatypes. I almost always use them in my apps - unless I have compelling reasons not to use them.

What's your opinion on using UUIDs as database row identifiers, particularly in web apps?

I've always preferred to use long integers as primary keys in databases, for simplicity and (assumed) speed. But when using a REST or Rails-like URL scheme for object instances, I'd then end up with URLs like this:
http://example.com/user/783
And then the assumption is that there are also users with IDs of 782, 781, ..., 2, and 1. Assuming that the web app in question is secure enough to prevent people entering other numbers to view other users without authorization, a simple sequentially-assigned surrogate key also "leaks" the total number of instances (older than this one), in this case users, which might be privileged information. (For instance, I am user #726 in stackoverflow.)
Would a UUID/GUID be a better solution? Then I could set up URLs like this:
http://example.com/user/035a46e0-6550-11dd-ad8b-0800200c9a66
Not exactly succinct, but there's less implied information about users on display. Sure, it smacks of "security through obscurity" which is no substitute for proper security, but it seems at least a little more secure.
Is that benefit worth the cost and complexity of implementing UUIDs for web-addressable object instances? I think that I'd still want to use integer columns as database PKs just to speed up joins.
There's also the question of in-database representation of UUIDs. I know MySQL stores them as 36-character strings. Postgres seems to have a more efficient internal representation (128 bits?) but I haven't tried it myself. Anyone have any experience with this?
Update: for those who asked about just using the user name in the URL (e.g., http://example.com/user/yukondude), that works fine for object instances with names that are unique, but what about the zillions of web app objects that can really only be identified by number? Orders, transactions, invoices, duplicate image names, stackoverflow questions, ...

I can't speak to the web side of your question. But UUIDs are great for n-tier applications. PK generation can be decentralized: each client generates its own PK without risk of collision.
And the speed difference is generally small.
Make sure your database supports an efficient storage datatype (16 bytes, 128 bits).
At the very least you can encode the uuid string in base64 and use char(22).
I've used them extensively with Firebird and do recommend.
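The base64 suggestion above can be sketched in a few lines. This version uses the URL-safe base64 alphabet and drops the two padding characters, which is where the 22 comes from; the function names are made up for illustration.

```python
import base64
import uuid


def uuid_to_short(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes as 22 URL-safe base64 characters."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def short_to_uuid(s: str) -> uuid.UUID:
    """Restore the padding that was stripped, then decode back to a UUID."""
    padded = s + "=" * (-len(s) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))


original = uuid.UUID("035a46e0-6550-11dd-ad8b-0800200c9a66")
short = uuid_to_short(original)
restored = short_to_uuid(short)
```

If the database has a native 16-byte type, storing u.bytes directly is smaller still; the 22-character form is mainly useful where only text columns or URLs are available.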

For what it's worth, I've seen a long running stored procedure (9+ seconds) drop to just a few hundred milliseconds of run time simply by switching from GUID primary keys to integers. That's not to say displaying a GUID is a bad idea, but as others have pointed out, joining on them, and indexing them, by definition, is not going to be anywhere near as fast as with integers.

I can tell you that in SQL Server, if you use a uniqueidentifier (GUID) datatype and use the NEWID() function to create values, you will get horrible fragmentation because of page splits. The reason is that the values NEWID() generates are not sequential. SQL Server 2005 added the NEWSEQUENTIALID() function to remedy that.
One way to still use both a GUID and an int is to have a table in which the GUID maps to the int. The GUID is used externally, but the int is used internally in the DB.
for example
457180FB-C2EA-48DF-8BEF-458573DA1C10 1
9A70FF3C-B7DA-4593-93AE-4A8945943C8A 2
1 and 2 will be used in joins and the guids in the web app. This table will be pretty narrow and should be pretty fast to query
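A minimal sketch of such a mapping table, again using Python's sqlite3 module as a stand-in for SQL Server so that it runs as-is. The table, column and function names are hypothetical.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE user_key (
        user_id     INTEGER PRIMARY KEY,   -- internal key, used in joins
        public_guid TEXT NOT NULL UNIQUE   -- external key, shown in URLs
    )
    """
)


def create_user():
    """Insert a row and return (internal int id, external GUID string)."""
    guid = str(uuid.uuid4())
    cur = conn.execute(
        "INSERT INTO user_key (public_guid) VALUES (?)", (guid,)
    )
    return cur.lastrowid, guid


def resolve(guid):
    """Translate an external GUID into the internal int id, or None."""
    row = conn.execute(
        "SELECT user_id FROM user_key WHERE public_guid = ?", (guid,)
    ).fetchone()
    return row[0] if row else None


user_id, guid = create_user()
```

The web layer only ever sees the GUID; it calls resolve() once per request, and every join after that uses the narrow integer.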

Why couple your primary key with your URI?
Why not have your URI key be human readable (or unguessable, depending on your needs), and your primary index integer based, that way you get the best of both worlds. A lot of blog software does that, where the exposed id of the entry is identified by a 'slug', and the numeric id is hidden away inside of the system.
The added benefit here is that you now have a really nice URL structure, which is good for SEO. Obviously for a transaction this is not a good thing, but for something like stackoverflow, it is important (see URL up top...). Getting uniqueness isn't that difficult. If you are really concerned, store a hash of the slug inside a table somewhere, and do a lookup before insertion.
edit: Stackoverflow doesn't quite use the system I describe, see Guy's comment below.

Rather than URLs like this:
http://example.com/user/783
Why not have:
http://example.com/user/yukondude
Which is friendlier to humans and doesn't leak that tiny bit of information?

You could use an integer which is related to the row number but is not sequential. For example, you could take the 32 bits of the sequential ID and rearrange them with a fixed scheme (for example, bit 1 becomes bit 6, bit 2 becomes bit 15, etc..).
This is a reversible, one-to-one mapping (an obfuscation rather than real encryption), so you can be sure that two different IDs will always map to two different values.
It would obviously be easy to decode, if one takes the time to generate enough IDs and get the schema, but, if I understand correctly your problem, you just want to not give away information too easily.
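As a sketch of the fixed bit rearrangement described above: the particular permutation below (bit i moves to position (13 * i + 7) mod 32) is an arbitrary choice made for illustration, and the function names are hypothetical. Any fixed one-to-one shuffle of the 32 positions works the same way.

```python
BITS = 32

# Fixed scheme: source bit i lands on bit (13 * i + 7) % 32.
# 13 is odd, so this visits every position exactly once.
PERMUTATION = [(13 * i + 7) % BITS for i in range(BITS)]

# Inverse table: for each destination bit, where it came from.
INVERSE = [0] * BITS
for src, dst in enumerate(PERMUTATION):
    INVERSE[dst] = src


def _permute(value, table):
    if not 0 <= value < (1 << BITS):
        raise ValueError("value does not fit in 32 bits")
    out = 0
    for src, dst in enumerate(table):
        if (value >> src) & 1:
            out |= 1 << dst
    return out


def obfuscate(sequential_id):
    """Sequential database id -> scattered public id."""
    return _permute(sequential_id, PERMUTATION)


def reveal(public_id):
    """Scattered public id -> original sequential id."""
    return _permute(public_id, INVERSE)


public_ids = [obfuscate(i) for i in range(1, 1001)]
```

Because the mapping only moves bits around, the number of set bits is preserved, which is one of the ways an observer could work out the scheme. That matches the caveat above: it hides the sequence from casual inspection and nothing more.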

We use GUIDs as primary keys for all our tables as it doubles as the RowGUID for MS SQL Server Replication. Makes it very easy when the client suddenly opens an office in another part of the world...

I don't think a GUID gives you many benefits. Users hate long, incomprehensible URLs.
Create a shorter ID that you can map to the URL, or enforce a unique user name convention (http://example.com/user/brianly). The guys at 37Signals would probably mock you for worrying about something like this when it comes to a web app.
Incidentally you can force your database to start creating integer IDs from a base value.

It also depends on what you care about for your application. For n-tier apps GUIDs/UUIDs are simpler to implement and are easier to port between different databases. To produce integer keys, some databases support a sequence object natively and some require custom construction of a sequence table.
Integer keys probably (I don't have numbers) provide an advantage for query and indexing performance as well as space usage. Direct DB querying is also much easier using numeric keys, less copy/paste as they are easier to remember.

I work with a student management system which uses UUID's in the form of an integer. They have a table which hold the next unique ID.
Although this is probably a good idea from an architectural point of view, it makes working with it on a daily basis difficult. Sometimes there is a need to do bulk inserts, and having a UUID makes this very difficult, usually requiring writing a cursor instead of a simple SELECT INTO statement.

I've tried both in real web apps.
My opinion is that it is preferable to use integers and have short, comprehensible URLs.
As a developer, it feels a little bit awful seeing sequential integers and knowing that some information about total record count is leaking out, but honestly - most people probably don't care, and that information has never really been critical to my businesses.
Having long ugly UUID URLs seems to me like much more of a turn off to normal users.

I think that this is one of those issues that cause quasi-religious debates, and it's almost futile to talk about. I would just say use what you prefer. In 99% of systems it will not matter which type of key you use, so the benefits (stated in the other posts) of using one sort over the other will never be an issue.

I think using a GUID would be the better choice in your situation. It takes up more space but it's more secure.

YouTube uses 11 characters with base64 encoding, which offers 64^11 (about 7.4 × 10^19) possibilities, and they are usually pretty manageable to write. I wonder if that would offer better performance than a full-on UUID. A UUID converted to base64 would be double the size (22 characters), I believe.
More information can be found here: https://www.youtube.com/watch?v=gocwRvLhDf8

Pros and Cons of UUID
Note: uuid_v7 is time based uuid instead of random. So you can
use it to order by creation date and solve some performance issues
with db inserts if you do really many of them.
Pros:
can be generated on api level (good for distributed systems)
hides count information about entity
doesn't have limit 2,147,483,647 as 32-bit int
removes a layer of errors related to accidentally passing one entity id (userId: 25) to get another (bookId: 25)
more friendly graphql usage as ID key
Cons:
128-bit instead of a 32-bit int (slightly bigger size in db and ~40% bigger index, around ~30MB for 1 million rows), should be a minor concern
can't be sorted by creation (can be solved with uuid_v7)
non-time-ordered UUID versions such as UUIDv4 have poor database index locality (can be solved with uuid_v7)
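To show why uuid_v7 addresses the last two cons, here is a hand-rolled sketch of its layout as I understand it from RFC 9562: a 48-bit Unix millisecond timestamp in the most significant bits, followed by the version and variant markers and random bits. The function name is hypothetical, and a maintained library implementation is preferable in real code.

```python
import os
import time
import uuid


def uuid7_sketch(ts_ms=None):
    """Build a version-7 style UUID: timestamp first, random bits after."""
    if ts_ms is None:
        ts_ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                # top 12 random bits
    rand_b = rand & ((1 << 62) - 1)    # low 62 random bits

    value = (ts_ms & ((1 << 48) - 1)) << 80   # bits 127..80: timestamp
    value |= 0x7 << 76                        # bits 79..76: version 7
    value |= rand_a << 64                     # bits 75..64: random
    value |= 0b10 << 62                       # bits 63..62: RFC variant
    value |= rand_b                           # bits 61..0: random
    return uuid.UUID(int=value)


older = uuid7_sketch(ts_ms=1_700_000_000_000)
newer = uuid7_sketch(ts_ms=1_700_000_000_001)
```

Since the timestamp occupies the leading bits, values created later compare greater, so new rows land at the end of an index instead of at random positions. Ordering between two values created in the same millisecond is not guaranteed by this sketch.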
URL usage
Depending on app you may care or not care about url. If you don't care, just use uuid as is, it's fine.
If you care, then you will need to decide on url format.
The best-case scenario is to use a unique slug, if you are OK with never changing it:
http://example.com/sale/super-duper-phone
If your url is generated from the title and you want to change the slug when the title changes, there are a few options. Use it as is and query by uuid (the slug is just decoration):
http://example.com/book/035a46e0-6550-11dd-ad8b-0800200c9a66/new-title
Convert it to base64url:
you can get uuid back from AYEWXcsicACGA6PT7v_h3A
AYEWXcsicACGA6PT7v_h3A - 22 characters
035a46e0-6550-11dd-ad8b-0800200c9a66 - 36 characters
http://example.com/book/AYEWXcsicACGA6PT7v_h3A/new-title
Generate a unique short 11 chars length string just for slug usage:
http://example.com/book/icACEWXcsAY-new-title
http://example.com/book/icACEWXcsAY/new-title
If you don't want a uuid or short id in the url and want only the slug, but do care about SEO and user bookmarks, you will need to redirect all requests from
http://example.com/sale/phone-1-title
to
http://example.com/sale/phone-1-title-updated
This adds the complexity of managing slug history: every query that uses a slug needs a fallback to the history, plus a redirect if the slugs don't match.

As long as you use a DB system with efficient storage, HDD is cheap these days anyway...
I know GUIDs can be a b*tch to work with sometimes and come with some query overhead; however, from a security perspective they are a savior.
Thinking security by obscurity, they fit well when forming obscure URIs, and when building normalised DBs with table-, record- and column-defined security you can't go wrong with GUIDs. Try doing that with integer-based IDs.