I'm busy with the database design of a new project, and I'm not sure whether to use UUIDs or normal table-unique auto-increment ids.
Up to now, the sites I've built have all run on a single server, and very heavy traffic has never been too much of a concern. However, this web application will eventually run concurrently on multiple servers, serve an API, and need to process thousands of requests per second, and I want to make sure that the design I choose now doesn't cripple any of those possibilities later.
I have my suspicions, of course, and they should be clear through the way I phrased my question, but I would like to hear from those with more experience what trouble I can run into later if I do or don't have UUIDs, and what I should really be basing my decision on.
So, in short: What are the considerations I should give into deciding whether or not to use UUIDs for all database models, so that any one object can be identified uniquely by one string, and when is it appropriate to use this as the primary key, instead of table-by-table auto-increment?
Note: I've seen this question (When are you truly forced to use UUID as part of the design?), and read all the answers, but they mostly answer "How rarely do UUIDs collide", instead of "When is it appropriate to use them".
One consideration that I've used when deciding on UUIDs vs. auto-increment ids is whether they're going to be user-visible, and if so, whether I want users to know how many I have of that table. For example, if I didn't want to make public the number of registered users my site has, I wouldn't assign auto-increment user ids.
And to address one other specific point you raised, it's still possible to use auto-incrementing ids with multiple servers (though not with the built-in MySQL). You just need to start all the ids at different offsets, and increment accordingly. That is, if you had 3 servers, you could start server A at 1, server B at 2, and server C at 3, and then increment the ids by 10 each time instead of 1. That way, you could guarantee no collisions.
And finally, the last thing I consider is how important performance is to my application. Integers are much more easily indexed than UUIDs that are string-based, so indexes are smaller, more quickly searched, etc.
UUID's or GUID's can be very useful especially for the web. If you use auto-increment values to store UserId anyone can view the source of your web pages and see the simplicity of it's use. They could then try any integer value to get data they are not supposed to see.
GUID's are not created in any sequential format, therefore if you create them one right after the other, there sequence can not easily be guessed.
I don't think it's necessary to use GUID's for simple lookup type data such as ColorId 1=Blue, 2=Red, 3=Green.
GUID's are also very useful for session and state management.
That's my $0.02
Related
First, I know this is a rather subjective question but I need some kind of formal documentation to help me educate my client.
Background - a large enterprise application with hundreds of tables and SP's, all neatly designed with normalized tables and foreign keys using identity columns.
Our client has a few employees writing complex reports in Crystal enterprise using a replicated copy of our production Db.
We have tables that store what I would classify as 'system' base information, such as a list of office locations, or departments within the company, standard set of roles for users, statuses of other objects (open/closed etc), basically data that doesn't change often.
The issue - the report designers and financial analysts are writing queries with hardcoded identity values inside of them. Something like this
SELECT xxx FROM OFFICE WHERE OFFICE_ID = 6
I'm greatly simplifying here, but basically they're using these hard coded int values inside their procedures all over the place.
For SQL developers seeing this will obviously make you facepalm as it's just a built-in instinct not to do this.
However, surprisingly I can't find any documentation or even best practices articles as to why this shouldn't be done.
They would argue it's fine to do this since the values never change, and they're right, within that single system those values won't change, however across multiple environments (staging/QA/Dev) those values can and are absolutely different, making their reporting design approach non-portable and only able to function in 1 isolated server environment.
Do any of the SQL guru's out there have any more in-depth information/articles etc that I can use to help educate my client on why they should avoid this approach?
Seems to me the strongest argument to your report writers is your second to last sentence "...those values can and are absolutely different [between environments]". That would be pretty much the gist of my response to them.
Of course there's always gray area to any question. Identity columns are essentially magic numbers. They have the benefit to the database of being...
Small
Sequential
Fast to seek and join on, sort by and create
...but have the downside of being of completely meaningless, and in effect, randomly assigned (sort the inserts into that table one way, you get a different identity per row than if you sorted the other way). As such, in cases where you have to look up something specific like that, it's common use also include a "business/natural/alternate" key (e.g. maybe (a completely made up example) [CategoryName] where CatgoryName is something short, unique and human readable, while. [CategoryId] is an identity, but not something intended to be sought on)
If you have a website with, say, a dropdown menu, usually the natural key gets put into the visible part of the drop down, and the surrogate/identity key gets passed around on the back end, invisible to the end user.
This gets a little trickier when you have people writing queries directly against the database. If they're owners of the data, they may know things about the larger data structure which they can take advantage of in *cough "clever" ways. If you know the keys wont change and you know what those values are, there might be a case to be made just referencing those. But again, not if they're going to be different when you query a different server.
Of course the flip side is, if you don't want them to use the identity values, you'll have to give them an alternative. And if your tables don't already include a business/natural/alternate key, you're going to have to add one wherever one doesn't already exist.
Also, there's nothing wrong with that alternate key being an integer too (maybe you already have company-wide identifiers for your offices of 1, 2, 3 etc), but the point is that it's deterministic no matter where you run your query.
I am trying out Google App Engine Java, however the absence of a unique constraint is making things difficult.
I have been through this post and this blog suggests a method to implement something similar. My background is in MySQL.Moving to datastore without a unique constraint makes me jittery because I never had to worry about duplicate values before and checking each value before inserting a new value still has room for error.
"No, you still cannot specify unique
during schema creation."
-- David Underhill talks about GAE and the unique constraint (post link)
What are you guys using to implement something similar to a unique or primary key?
I heard about a abstract datastore layer created using the low level api which worked like a regular RDB, which however was not free(however I do not remember the name of the software)
Schematic view of my problem
sNo = biggest serial_number in the db
sNo++
Insert new entry with sNo as serial_number value //checkpoint
User adds data pertaining to current serial_number
Update entry with data where serial_number is sNo
However at line number 3(checkpoint), I feel two users might add the same sNo. And that is what is preventing me from working with appengine.
This and other similar questions come up often when talking about transitioning from a traditional RDB to a BigTable-like datastore like App Engine's.
It's often useful to discuss why the datastore doesn't support unique keys, since it informs the mindset you should be in when thinking about your data storage schemes. The reason unique constraints are not available is because it greatly limits scalability. Like you've said, enforcing the constraint means checking all other entities for that property. Whether you do it manually in your code or the datastore does it automatically behind the scenes, it still needs to happen, and that means lower performance. Some optimizations can be made, but it still needs to happen in one way or another.
The answer to your question is, really think about why you need that unique constraint.
Secondly, remember that keys do exist in the datastore, and are a great way of enforcing a simple unique constraint.
my_user = MyUser(key_name=users.get_current_user().email())
my_user.put()
This will guarantee that no MyUser will ever be created with that email ever again, and you can also quickly retrieve the MyUser with that email:
my_user = MyUser.get(users.get_current_user().email())
In the python runtime you can also do:
my_user = MyUser.get_or_create(key_name=users.get_current_user().email())
Which will insert or retrieve the user with that email.
Anything more complex than that will not be scalable though. So really think about whether you need that property to be globally unique, or if there are ways you can remove the need for that unique constraint. Often times you'll find with some small workarounds you didn't need that property to be unique after all.
You can generate unique serial numbers for your products without needing to enforce unique IDs or querying the entire set of entities to find out what the largest serial number currently is. You can use transactions and a singleton entity to generate the 'next' serial number. Because the operation occurs inside a transaction, you can be sure that no two products will ever get the same serial number.
This approach will, however, be a potential performance chokepoint and limit your application's scalability. If it is the case that the creation of new serial numbers does not happen so often that you get contention, it may work for you.
EDIT:
To clarify, the singleton that holds the current -- or next -- serial number that is to be assigned is completely independent of any entities that actually have serial numbers assigned to them. They do not need to be all be a part of an entity group. You could have entities from multiple models using the same mechanism to get a new, unique serial number.
I don't remember Java well enough to provide sample code, and my Python example might be meaningless to you, but here's pseudo-code to illustrate the idea:
Receive request to create a new inventory item.
Enter transaction.
Retrieve current value of the single entity of the SerialNumber model.
Increment value and write it to the database
Return value as you exit transaction.
Now, the code that does all the work of actually creating the inventory item and storing it along with its new serial number DOES NOT need to run in a transaction.
Caveat: as I stated above, this could be a major performance bottleneck, as only one serial number can be created at any one time. However, it does provide you with the certainty that the serial number that you just generated is unique and not in-use.
I encountered this same issue in an application where users needed to reserve a timeslot. I needed to "insert" exactly one unique timeslot entity while expecting users to simultaneously request the same timeslot.
I have isolated an example of how to do this on app engine, and I blogged about it. The blog posting has canonical code examples using Datastore, and also Objectify. (BTW, I would advise to avoid JDO.)
I have also deployed a live demonstration where you can advance two users toward reserving the same resource. In this demo you can experience the exact behavior of app engine datastore click by click.
If you are looking for the behavior of a unique constraint, these should prove useful.
-broc
I first thought an alternative to the transaction technique in broc's blog, could be to make a singleton class which contains a synchronized method (say addUserName(String name)) responsible adding a new entry only if it is unique or throwing an exception. Then make a contextlistener which instantiates a single instance of this singleton, adding it as an attribute to the servletContext. Servlets then can call the addUserName() method on the singleton instance which they obtain through getServletContext.
However this is NOT a good idea because GAE is likely to split the app across multiple JVMs so multiple singleton class instances could still occur, one in each JVM. see this thread
A more GAE like alternative would be to write a GAE module responsible for checking uniqueness and adding new enteries; then use manual or basic scaling with...
<max-instances>1</max-instances>
Then you have a single instance running on GAE which acts as a single point of authority, adding users one at a time to the datastore. If you are concerned about this instance being a bottleneck you could improve the module, adding queuing or an internal master/slave architecture.
This module based solution would allow many unique usernames to be added to the datastore in a short space of time, without risking entitygroup contention issues.
I'm using Google Gears to be able to use an application offline (I know Gears is deprecated). The problem I am facing is the synchronization with the database on the server.
The specific problem is the primary keys or more exactly, the foreign keys. When sending the information to the server, I could easily ignore the primary keys, and generate new ones. But then how would I know what the relations are.
I had one sollution in mind, bet the I would need to save all the pk for every client. What is the best way to synchronize multiple client with one server db.
Edit:
I've been thinking about it, and I guess seqential primary keys are not the best solution, but what other possibilities are there? Time based doesn't seem right because of collisions which could happen.
A GUID comes to mind, is that an option? It looks like generating a GUID in javascript is not that easy.
I can do something with natural keys or composite keys. As I'm thinking about it, that looks like the best solution. Can I expect any problems with that?
This is not quite a full answer, but might at least provide you with some ideas...
The question you're asking (and the problem you're trying to address) is not specific to Google Gears, and will remains valid with other solutions, like HTML 5 or systems based or Flash/Air.
There's been a presentation about that subject given during the last ZendCon a few month ago -- and the slides are available on slideshare : Planning for Synchronization with Browser-Local Databases
Going through thoses slides, you'll see notes about a couple of possibilities that might come to mind (some did actually come to your mind, or in other answers) :
Using GUID
Composite Keys
Primary key pool (i.e. reserve a range of keys beforehand)
Of course, for each one of those, there are advantages... and drawbacks -- I will not copy-paste them : take a look at the slides ;-)
Now, in your situation, which solution will be the best ? Hard to say, actually -- and the sooner your think about synchronisation, the better/easier it'll probably be : adding stuff into an application is so much simpler when that application is still in its design stage ^^
First, it might be interesting to determine whether :
Your application is generally connected, and being dis-connected only rarely happens
Or if your application is generally dis-connected, and only connects once in a while.
Then, what are you going to synchronise ?
Data ?
Like "This is the list of all commands made by that user"
With that data replicated on each dis-connected device, of course -- which can each modify it
In this case, if one user deletes a line, and another one adds a line, how to know which one has the "true" data ?
Or actions made on those data ?
Like "I am adding an entry in the list of commands made by that user"
In this case, if one user deletes a line, and another one adds a line, it's easy to synchronize, as you just have to synchronise those two actions to your central DB
But this is not quite easy to implements, especially for a big application / system : each time an action is made, you have to kind of log it !
There is also a specific problem to which we don't generally think -- until it happens : especially if your synchronisation process can take some time (if you have a lot of data, if you don't synchronise often, ...), what if the synchronisation is stopped when it's not finished yet ?
For instance, what if :
A user, in a train, has access to the network, with some 3G card
The synchronisation starts
there is a tunnel -- and the connection is lost.
Having half-synchronised data might not be that good, in most situations...
So, you have to find a solution to that problem, too : in most cases, the synchronisation has to be atomic !
I've came up with the following solution:
Every client gets a unique id from the server. Everywhere a primary key is referenced, I use a composite key with the client id and an auto increment field.
This way, the combination is unique, and it's easy to implement. The only thing left is making sure every client does get a unique id.
I just found out one drawback: SQLite doesn't support autoincrement on composite primary keys, so I would have to handle the id's myself.
I would use a similar setup to your latest answer. However, to get around your auto-increment issue, I would use a single auto-increment surrogate key in your master database and then store the client primary key and your client id as well. That way you are not losing or changing any data in the process and you are also tracking which client the data was originally sourced from.
Be sure to also set up a unique index on your Client Pk, Client Id to enable referential integrity from any child tables.
Is there a reasonable limit to how many objects the client can create while disconnected?
One possibilty I can see is to create a sort of "local sequence".
When your client connects to the central server, it gets a numeric ID, say a 7 digit number (the server generates it as a sequence).
The actual PKs are created as strings like this: 895051|000094 or 895051|005694 where the first part is the 7 digit number sent from the server, and the second part is a "local" sequence managed by the client.
As soon as you synch with the central, you can get a new 7 digit number and restart your local sequence. This is not too different from what you were proposing, all in all. It just makes the actual PK completely independant from the client identity.
Another bonus is that if you have a scenario where the client has never connected to the server, it can use 000000|000094 locally, require a new number from the server and update the keys on its side before sending back to the server for synch (this is tricky if you have lots of FK constraints though, and could not be feasible).
I'm working on a project that I want to have be as flexible and scalable as possible from the beginning. A problem I'm concerned about is one best described by Joshua Schacter in Founders at Work, who noted it as one detail he wish he would've planned for ahead of time.
Scaling past one machine, one database, is very challenging, even with replication. The tools that are there are not quite right.
For example, when you add things to a table and it numbers them, that means you can't have a second machine also adding to them because the numbers will collide. So what do you do? You have to come up with some completely different way to do it.
Do you have a central server that hands out number sets, or do you come up with something that's not numbers? Do you use random numbers and hope they never collide? Whatever it is, auto-assigned IDs just don't fly.
Has anyone here faced this problem? What are ways to move beyond auto-incremented IDs, or is there a way to have them scale with multiple servers?
Use GUID/UUID (globally/universally unique identifier). In theory it's guaranteed to be unique across multiple machines.
GUIDs, your chances of collision are astronomically low.
It's also possible to have (what we called) SmartGUIDs (usually called COMB GUIDS - see this analysis, particularly page 7) where you can encode a timestamp within the GUID, so you get record creation date information "for free" - so you can save a timestamp column for record creation datetime - which gets back some of what you lost on moving from 32-bit integer to 128-bit GUID. These can also be guaranteed to be monotonic, unlike regular GUIDs, which can be useful for clustered indexes and for sorting.
You can also use composite keys with some kind of server/db ID with a regular auto-increment identity or auto-number.
I've always preferred to use long integers as primary keys in databases, for simplicity and (assumed) speed. But when using a REST or Rails-like URL scheme for object instances, I'd then end up with URLs like this:
http://example.com/user/783
And then the assumption is that there are also users with IDs of 782, 781, ..., 2, and 1. Assuming that the web app in question is secure enough to prevent people entering other numbers to view other users without authorization, a simple sequentially-assigned surrogate key also "leaks" the total number of instances (older than this one), in this case users, which might be privileged information. (For instance, I am user #726 in stackoverflow.)
Would a UUID/GUID be a better solution? Then I could set up URLs like this:
http://example.com/user/035a46e0-6550-11dd-ad8b-0800200c9a66
Not exactly succinct, but there's less implied information about users on display. Sure, it smacks of "security through obscurity" which is no substitute for proper security, but it seems at least a little more secure.
Is that benefit worth the cost and complexity of implementing UUIDs for web-addressable object instances? I think that I'd still want to use integer columns as database PKs just to speed up joins.
There's also the question of in-database representation of UUIDs. I know MySQL stores them as 36-character strings. Postgres seems to have a more efficient internal representation (128 bits?) but I haven't tried it myself. Anyone have any experience with this?
Update: for those who asked about just using the user name in the URL (e.g., http://example.com/user/yukondude), that works fine for object instances with names that are unique, but what about the zillions of web app objects that can really only be identified by number? Orders, transactions, invoices, duplicate image names, stackoverflow questions, ...
I can't say about the web side of your question. But uuids are great for n-tier applications. PK generation can be decentralized: each client generates it's own pk without risk of collision.
And the speed difference is generally small.
Make sure your database supports an efficient storage datatype (16 bytes, 128 bits).
At the very least you can encode the uuid string in base64 and use char(22).
I've used them extensively with Firebird and do recommend.
For what it's worth, I've seen a long running stored procedure (9+ seconds) drop to just a few hundred milliseconds of run time simply by switching from GUID primary keys to integers. That's not to say displaying a GUID is a bad idea, but as others have pointed out, joining on them, and indexing them, by definition, is not going to be anywhere near as fast as with integers.
I can answer you that in SQL server if you use a uniqueidentifier (GUID) datatype and use the NEWID() function to create values you will get horrible fragmentation because of page splits. The reason is that when using NEWID() the value generated is not sequential. SQL 2005 added the NEWSEQUANTIAL() function to remedy that
One way to still use GUID and int is to have a guid and an int in a table so that the guid maps to the int. the guid is used externally but the int internally in the DB
for example
457180FB-C2EA-48DF-8BEF-458573DA1C10 1
9A70FF3C-B7DA-4593-93AE-4A8945943C8A 2
1 and 2 will be used in joins and the guids in the web app. This table will be pretty narrow and should be pretty fast to query
Why couple your primary key with your URI?
Why not have your URI key be human readable (or unguessable, depending on your needs), and your primary index integer based, that way you get the best of both worlds. A lot of blog software does that, where the exposed id of the entry is identified by a 'slug', and the numeric id is hidden away inside of the system.
The added benefit here is that you now have a really nice URL structure, which is good for SEO. Obviously for a transaction this is not a good thing, but for something like stackoverflow, it is important (see URL up top...). Getting uniqueness isn't that difficult. If you are really concerned, store a hash of the slug inside a table somewhere, and do a lookup before insertion.
edit: Stackoverflow doesn't quite use the system I describe, see Guy's comment below.
Rather than URLs like this:
http://example.com/user/783
Why not have:
http://example.com/user/yukondude
Which is friendlier to humans and doesn't leak that tiny bit of information?
You could use an integer which is related to the row number but is not sequential. For example, you could take the 32 bits of the sequential ID and rearrange them with a fixed scheme (for example, bit 1 becomes bit 6, bit 2 becomes bit 15, etc..).
This will be a bidirectional encryption, and you will be sure that two different IDs will always have different encryptions.
It would obviously be easy to decode, if one takes the time to generate enough IDs and get the schema, but, if I understand correctly your problem, you just want to not give away information too easily.
We use GUIDs as primary keys for all our tables as it doubles as the RowGUID for MS SQL Server Replication. Makes it very easy when the client suddenly opens an office in another part of the world...
I don't think a GUID gives you many benefits. Users hate long, incomprehensible URLs.
Create a shorter ID that you can map to the URL, or enforce a unique user name convention (http://example.com/user/brianly). The guys at 37Signals would probably mock you for worrying about something like this when it comes to a web app.
Incidentally you can force your database to start creating integer IDs from a base value.
It also depends on what you care about for your application. For n-tier apps GUIDs/UUIDs are simpler to implement and are easier to port between different databases. To produce Integer keys some database support a sequence object natively and some require custom construction of a sequence table.
Integer keys probably (I don't have numbers) provide an advantage for query and indexing performance as well as space usage. Direct DB querying is also much easier using numeric keys, less copy/paste as they are easier to remember.
I work with a student management system which uses UUID's in the form of an integer. They have a table which hold the next unique ID.
Although this is probably a good idea for an architectural point of view, it makes working with on a daily basis difficult. Sometimes there is a need to do bulk inserts and having a UUID makes this very difficult, usually requiring writing a cursor instead of a simple SELECT INTO statement.
I've tried both in real web apps.
My opinion is that it is preferable to use integers and have short, comprehensible URLs.
As a developer, it feels a little bit awful seeing sequential integers and knowing that some information about total record count is leaking out, but honestly - most people probably don't care, and that information has never really been critical to my businesses.
Having long ugly UUID URLs seems to me like much more of a turn off to normal users.
I think that this is one of these issues that cause quasi-religious debates, and its almost futile to talk about. I would just say use what you prefer. In 99% of systems it will no matter which type of key you use, so the benefits (stated in the other posts) of using one sort over the other will never be an issue.
I think using a GUID would be the better choice in your situation. It takes up more space but it's more secure.
YouTube uses 11 characters with base64 encoding which offers 11^64 possibilities, and they are usually pretty manageable to write. I wonder if that would offer better performance than a full on UUID. UUID converted to base 64 would be double the size I believe.
More information can be found here: https://www.youtube.com/watch?v=gocwRvLhDf8
Pros and Cons of UUID
Note: uuid_v7 is time based uuid instead of random. So you can
use it to order by creation date and solve some performance issues
with db inserts if you do really many of them.
Pros:
can be generated on api level (good for distributed systems)
hides count information about entity
doesn't have limit 2,147,483,647 as 32-bit int
removes layer of errors related to passing one entity id userId: 25 to get another bookId: 25 accidently
more friendly graphql usage as ID key
Cons:
128-bit instead 32-bit int (slightly bigger size in db and ~40% bigger index, around ~30MB for 1 million rows), should be a minor concern
can't be sorted by creation (can be solved with uuid_v7)
non-time-ordered UUID versions such as UUIDv4 have poor database index locality (can be solved with uuid_v7)
URL usage
Depending on app you may care or not care about url. If you don't care, just use uuid as is, it's fine.
If you care, then you will need to decide on url format.
Best case scenario is a use of unique slug if you ok with never changing it:
http://example.com/sale/super-duper-phone
If your url is generated from title and you want to change slug on title change there is a few options. Use it as is and query by uuid (slug is just decoration):
http://example.com/book/035a46e0-6550-11dd-ad8b-0800200c9a66/new-title
Convert it to base64url:
you can get uuid back from AYEWXcsicACGA6PT7v_h3A
AYEWXcsicACGA6PT7v_h3A - 22 characters
035a46e0-6550-11dd-ad8b-0800200c9a66 - 36 characters
http://example.com/book/AYEWXcsicACGA6PT7v_h3A/new-title
Generate a unique short 11 chars length string just for slug usage:
http://example.com/book/icACEWXcsAY-new-title
http://example.com/book/icACEWXcsAY/new-title
If you don't want uuid or short id in url and want only slug, but do care about seo and user bookmarks, you will need to redirect all request from
http://example.com/sale/phone-1-title
to
http://example.com/sale/phone-1-title-updated
this will add additional complexity of managing slug history, adding fallback to history for all queries where slug is used and redirects if slugs doesn't match
As long as you use a DB system with efficient storage, HDD is cheap these days anyway...
I know GUID's can be a b*tch to work with some times and come with some query overhead however from a security perspective they are a savior.
Thinking security by obscurity they fit well when forming obscure URI's and building normalised DB's with Table, Record and Column defined security you cant go wrong with GUID's, try doing that with integer based id's.