Should I check for UUID collisions? - uuid

I'm creating a program where I heavily use UUIDs to identify things like users and groups. Given the extremely low chance of a UUID already being taken, should I worry about the possibility of a collision?

That depends very much on (A) your requirements and (B) the underlying implementation, i.e. the real chance of your library producing a duplicate UUID.
Example: I am working on a software stack where our customers run several "worker machines" W1, W2, ... Wn and one or more "manager machines" M1, M2.
Those workers create "objects" identified by a UUID. We fully trust the standard libraries to create those UUIDs there, so there is no additional checking.
On the other hand, the manager instances see all workers. To avoid conflicts at their level, we make sure that each worker has its own UUID, and that UUID is factored into the UUIDs generated on that worker machine.
In other words: if your UUIDs are generated and used on one system, there is not much reason to be worried. But when your IDs need to be unique within a larger context, you could look into this kind of precaution.
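The answer does not say how the worker's UUID is "factored into" the object UUIDs, so the following is only one possible sketch. It assumes a name-based (version 5) UUID with the worker's UUID as the namespace; the function name `scoped_id` is made up for illustration.

```python
import uuid
from typing import Optional


def scoped_id(worker_id: uuid.UUID, local_id: Optional[uuid.UUID] = None) -> uuid.UUID:
    """Derive an object ID that is scoped to one worker (hypothetical scheme).

    The worker's UUID is used as the version 5 namespace, so the same
    local value on two different workers hashes to different results.
    """
    if local_id is None:
        local_id = uuid.uuid4()  # the "standard library" UUID the worker trusts
    return uuid.uuid5(worker_id, str(local_id))


worker_1 = uuid.uuid4()
worker_2 = uuid.uuid4()

# Simulate the unlucky case: both workers come up with the same local UUID.
clash = uuid.uuid4()
print(scoped_id(worker_1, clash))
print(scoped_id(worker_2, clash))  # differs, because the namespace differs
```

If a hard guarantee is needed rather than a hash, storing the pair (worker UUID, object UUID) as a composite key achieves the same scoping without relying on SHA-1.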

Related

Is it safe to use UUIDs generated from 2 different systems without a chance of collision?

We have an old system that generated some UUIDs. We have more records that need a UUID, but we can't use the old system to generate them, so we will need to generate them elsewhere. This immediately struck me as not a good idea, and I have been searching for an answer but haven't found this exact question. There would be no check to make sure the UUID wasn't already generated in the old system. The UUID would just get populated for records that don't have one. Safe?
If you use UUID V1, all values generated are guaranteed to be unique, provided each generator has a distinct node ID (normally the MAC address) and a sane clock. However, most folks avoid V1 because it leaks data about the system that generated it (specifically, the MAC address) and the exact time, neither of which is ideal in many contexts.
Most folks use UUID V4, which is statistically unique. While it is theoretically possible for the same value to be generated twice, the odds of it ever actually happening before the heat death of the universe are approximately zero, which is good enough for any practical purpose.
UUID V3/V5 are used for when you want predictable values, which doesn’t sound like a good fit for your needs.
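To make the difference between the versions concrete, here is a small sketch using Python's standard `uuid` module (the question does not name a language, so Python is my assumption). It shows what a V1 value gives away, that V4 values are random, and that V5 values are predictable. The example URL is made up.

```python
import uuid
from datetime import datetime, timedelta, timezone

# V1: built from a timestamp and a node ID, both of which can be read back.
v1 = uuid.uuid1()
# The timestamp counts 100-nanosecond intervals since 1582-10-15 (UTC).
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)
created = GREGORIAN_EPOCH + timedelta(microseconds=v1.time // 10)
# Note: Python falls back to a random node ID if it cannot find a MAC address.
print(f"node={v1.node:012x} created={created.isoformat()}")

# V4: random; nothing about the machine or the time is encoded.
v4 = uuid.uuid4()
print(v4)

# V5: name-based; the same namespace and name always give the same UUID.
v5_a = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/users/42")
v5_b = uuid.uuid5(uuid.NAMESPACE_URL, "https://example.com/users/42")
print(v5_a == v5_b)
```

For filling in missing IDs on existing records, the V4 call is the one that matches the answer above.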

Java UUID: how does it work?

Is a java.util.UUID unique across files?
For example:
We have a file IDFile1.java, which generates a random ID using UUID.randomUUID().toString().
And we have a second file, IDFile2.java, which generates another ID.
Will these two files' IDs collide with each other?
Is there any way to "give back" a used ID generated by java.util.UUID, so that the same ID could be used again?
The purpose of functions like randomUUID is to munge together various pieces of information which are likely to have some amount of randomness such that two UUIDs generated at different times or in different places would have an extremely small likelihood of matching unless all sources of potential randomness happened to yield identical results. No effort is made to keep track of which UUIDs have or have not been issued; instead, the goal is to have enough randomness that the probability of an unintentional match will be small relative to e.g. the probability of a computer being smashed to a million pieces by a meteor strike.
Note that the system may use "number of UUIDs issued" as part of its UUID calculation, but that would only be one of many factors that go into it. The purpose of such a counter isn't to allow one to "go back" to a previous UUID, but rather to ensure that if e.g. two UUIDs are requested nearly simultaneously without any source of randomness having become available between them, the two requests will yield different values.
UUIDs are unrelated to which file generates them. It doesn't matter which file generates them; they will almost certainly not be the same.
For question 2: the way UUIDs are generated doesn't really allow them to be regenerated in any meaningful way. Depending on the type, they are derived from information about your computer, the current time, or random data. Java's UUID.randomUUID() uses a cryptographically strong random number generator and produces what are known as type 4 UUIDs.

When is it appropriate to use UUIDs for a web project?

I'm busy with the database design of a new project, and I'm not sure whether to use UUIDs or normal table-unique auto-increment ids.
Up to now, the sites I've built have all run on a single server, and very heavy traffic has never been too much of a concern. However, this web application will eventually run concurrently on multiple servers, serve an API, and need to process thousands of requests per second, and I want to make sure that the design I choose now doesn't cripple any of those possibilities later.
I have my suspicions, of course, and they should be clear through the way I phrased my question, but I would like to hear from those with more experience what trouble I can run into later if I do or don't have UUIDs, and what I should really be basing my decision on.
So, in short: What should I consider when deciding whether or not to use UUIDs for all database models, so that any one object can be identified uniquely by one string, and when is it appropriate to use this as the primary key instead of table-by-table auto-increment?
Note: I've seen this question (When are you truly forced to use UUID as part of the design?), and read all the answers, but they mostly answer "How rarely do UUIDs collide", instead of "When is it appropriate to use them".
One consideration that I've used when deciding on UUIDs vs. auto-increment ids is whether they're going to be user-visible, and if so, whether I want users to know how many I have of that table. For example, if I didn't want to make public the number of registered users my site has, I wouldn't assign auto-increment user ids.
And to address one other specific point you raised, it's still possible to use auto-incrementing ids with multiple servers (though not with MySQL's default auto-increment settings). You just need to start all the ids at different offsets, and increment accordingly. That is, if you had 3 servers, you could start server A at 1, server B at 2, and server C at 3, and then increment the ids by 10 each time instead of 1. Any step at least as large as the number of servers works; 10 leaves room to add more servers later. That way, you could guarantee no collisions.
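A quick simulation of the offset scheme, just to show why it cannot collide. This is plain Python rather than any particular database's configuration, and the helper name `id_sequence` is made up for the example.

```python
from itertools import count


def id_sequence(offset: int, step: int):
    """Yield offset, offset + step, offset + 2*step, ... like one server would."""
    return count(start=offset, step=step)


STEP = 10  # must be at least the number of servers
servers = {
    "A": id_sequence(1, STEP),
    "B": id_sequence(2, STEP),
    "C": id_sequence(3, STEP),
}

# Let every server hand out four ids independently of the others.
issued = {name: [next(seq) for _ in range(4)] for name, seq in servers.items()}
for name, ids in issued.items():
    print(name, ids)
```

Each server only ever produces ids with the same remainder modulo the step, so two servers with different offsets can never produce the same value.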
And finally, the last thing I consider is how important performance is to my application. Integers are much more easily indexed than UUIDs that are string-based, so indexes are smaller, more quickly searched, etc.
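To put rough numbers on the size difference, here is a small sketch comparing the same UUID as text and as raw bytes. How your database actually stores and indexes the value depends on the column type, so treat this as an illustration only.

```python
import uuid

u = uuid.uuid4()

as_text = str(u)    # canonical form: 32 hex digits plus 4 hyphens
as_bytes = u.bytes  # the underlying 128-bit value as 16 raw bytes

print(len(as_text), "characters as text")
print(len(as_bytes), "bytes as binary")

# The binary form round-trips to the same UUID, so nothing is lost.
restored = uuid.UUID(bytes=as_bytes)
print(restored == u)
```

If you do go with UUID keys, a binary or native UUID column narrows the gap to integers considerably compared with a string column.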
UUIDs or GUIDs can be very useful, especially for the web. If you use auto-increment values to store UserId, anyone can view the source of your web pages and see how simple the scheme is. They could then try any integer value to get data they are not supposed to see.
GUIDs are not created in any sequential format, so if you create them one right after the other, their sequence cannot easily be guessed.
I don't think it's necessary to use GUIDs for simple lookup-type data such as ColorId 1=Blue, 2=Red, 3=Green.
GUIDs are also very useful for session and state management.
That's my $0.02

How to choose between UUIDs, autoincrement/sequence keys and sequence tables for database primary keys?

I'm looking at the pros and cons of these three primary methods of coming up with primary keys for database rows.
So assuming I am using a database that supports more than one of these methods, is there a simple heuristic to determine what the best option would be for me?
How do considerations such as distributed/multiple masters, performance requirements, ORM use, security and testing affect the choice?
Any unexpected drawbacks that one might run into?
UUIDs
Unless these are generated "in increasing monotonic sequence" they can drastically hurt/fragment indexes. Support for UUID generation varies by system. While usable, I would not use a UUID as my primary clustered index/PK in most cases. If needed I would likely make it a secondary column, perhaps indexed, perhaps not.
Some people argue that UUIDs can be used to safely generate/merge records from an arbitrary number of systems. While a UUID (depending upon method) generally has an astronomically small chance of collision, it is possible to -- at least with some outside input or very bad luck :) -- generate collisions. I am of the belief that only a true PK should be transmitted between systems, which I would argue is not (or should not be) a database-generated UUID in most cases.
autoincrement/sequence keys and sequence tables
This really depends on what the database supports well. Some databases support sequences, which are more flexible than a simple "auto-increment". This may or may not be desirable (or may even be the only simple way to do this kind of task). Sequence tables are generally more flexible yet, but if this kind of "flexibility" is needed I would be tempted to go back and revisit the design pattern, especially if it involves the use of triggers. While I dislike "limiting ORMs", that may also make a difference in choosing the "simpler" auto-increment or sequence types/database support.
Regardless of the method used, when using surrogate primary keys, the true primary key should still be identified and encoded into the schema.
In addition, I argue that "security compromises through exposing an auto-sequence PK" are a result of incorrectly exposing an internal database property. While a very simple way to handle CRUD operation, I believe there is a distinction between the internal keys and the exposed keys (e.g. pretty customer number).
Just my two cents.
Edit, additional replies to Tim:
I think the generated vs. true PK question is a very good one, and one I need to consider also. I like UUIDs in general, given the points you make. My hesitation was in size vs. an int/long. I was not aware of potential indexing de-optimizations, which is a much bigger concern for me.
I wouldn't really worry about the size -- if a UUID is best, then it's best. If it's not, then it's not. In the overall scheme the extra 12 bytes over an int likely won't make much of a difference. SQL Server 2005+ supports the newsequentialid UUID generation function to avoid the fragmentation associated with normal UUID generation. The page discusses it some. I am sure that other databases have similar solutions.
And by "encoded into the schema", do you mean more than adding a uniqueness constraint?
Yes. The primary key doesn't have to be the only [unique] constraint. Just using a surrogate PK doesn't mean the database model should be compromised :-) Additional indexes can also be used to cover, etc.
And by "distinction between", are you saying that surrogate primary keys never leak out?
The wording in my initial post was a tad hard. It's not "never" so much as "if they do and it matters then that's another problem". Often times people complain of insecurity through guessable numbers -- e.g. if your order is 23 then there is likely an order 22 and 24, etc. If this is your "protection" and/or can leak sensitive information then the system is already flawed. (Separating internal and external ids does not inherently fix this issue and authentication/authorization is still required. However, it is one issue raised against using "sequential ids" -- I find encoding a nonce into distributed URLs handles this for my use-case rather well.)
More to what I really wanted to get across: Just because the surrogate PK id happens to be 8942 doesn't mean that it's order 8942. That is, keeping with the "some fields are internal only to db" design, the order "number" might be entirely unrelated on the surface (but fully supported in the DB model), such as "#2010-42c" or whatever makes sense for the business requirement(s). It is this external number that should be exposed in most cases.
I feel that sometimes the generated key is really the true primary key as other fields are mutable (eg. user may change email and username).
This may be the case within a database and I will not argue this statement. However, once again holding that the surrogate PK's are internal to the database, just make sure to only export/import tuples that can be well-identified. If the username/email may change, then this might very well include a UUID assigned upon account creation -- and could very well be the surrogate PK itself.
Of course, as with everything, remain open and fit the model to the problem, not the problem to the model :-) A service like Twitter, for instance, uses its own number generation scheme. See Twitter's new ID generation. Unlike [some] UUID generation, the approach by Twitter (assuming that all the servers are correctly set up) guarantees that none of the distributed machines/processes will ever generate a duplicate ID, requires only 64 bits, and maintains rough ordering (most significant bits are a timestamp). (The number of records generated by Twitter may be in no way related to local requirements ;-)
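For the curious, here is a minimal sketch of that kind of scheme: timestamp in the most significant bits, then a machine id, then a per-millisecond counter. The 41/10/12 bit split and the custom epoch are my assumptions for illustration, and the class name is made up; this is not Twitter's actual code.

```python
import threading
import time

# Assumed layout: 41 bits of milliseconds | 10 bits machine id | 12 bits sequence.
MACHINE_BITS = 10
SEQUENCE_BITS = 12
EPOCH_MS = 1_600_000_000_000  # hypothetical custom epoch (September 2020)


class OrderedIdGenerator:
    """Issues roughly time-ordered 64-bit ids that are unique per machine id."""

    def __init__(self, machine_id: int) -> None:
        if not 0 <= machine_id < (1 << MACHINE_BITS):
            raise ValueError("machine_id does not fit in MACHINE_BITS")
        self._machine_id = machine_id
        self._last_ms = -1
        self._sequence = 0
        self._lock = threading.Lock()

    def next_id(self) -> int:
        with self._lock:
            now = time.time_ns() // 1_000_000
            if now < self._last_ms:
                raise RuntimeError("clock moved backwards; refusing to issue ids")
            if now == self._last_ms:
                self._sequence = (self._sequence + 1) % (1 << SEQUENCE_BITS)
                if self._sequence == 0:
                    # Counter exhausted for this millisecond: wait for the next one.
                    while now <= self._last_ms:
                        now = time.time_ns() // 1_000_000
            else:
                self._sequence = 0
            self._last_ms = now
            return (
                ((now - EPOCH_MS) << (MACHINE_BITS + SEQUENCE_BITS))
                | (self._machine_id << SEQUENCE_BITS)
                | self._sequence
            )


gen_a = OrderedIdGenerator(machine_id=1)
gen_b = OrderedIdGenerator(machine_id=2)
print(gen_a.next_id(), gen_b.next_id())
```

The "correctly set up" caveat shows up directly in the code: uniqueness depends on every process having a distinct machine id and on the clock not running backwards.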
Happy coding.

What is a UUID?

Well, what is one?
It's an identification number that will uniquely identify something. The idea is that the ID number will be universally unique, so no two things should have the same UUID. In fact, if you were to generate 10 trillion random (version 4) UUIDs, the chance of any two of them being the same would be on the order of 0.00000000001 (about 1 in 100 billion).
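If you want to check figures like this yourself, the usual tool is the birthday approximation. The sketch below assumes random (version 4) UUIDs with 122 random bits; the function name is made up for the example.

```python
import math


def collision_probability(n: int, random_bits: int = 122) -> float:
    """Approximate chance that at least two of n random IDs are equal.

    Uses the birthday approximation p = 1 - exp(-n(n-1) / (2 * 2**bits)).
    """
    space = 2 ** random_bits
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-n * (n - 1) / (2 * space))


ten_trillion = 10 ** 13
print(collision_probability(ten_trillion))
```

For 10 trillion UUIDs this comes out on the order of 1e-11, which is a useful reminder of how many IDs you would need before a collision check starts to earn its keep.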
Standardized identifiers
UUIDs are defined in RFC 4122. They're Universally Unique IDentifiers, that can be generated without the use of a centralized authority. There are four major types of UUIDs which are used in slightly different scenarios. All UUIDs are 128 bits in length, but are commonly represented as 32 hexadecimal characters separated by four hyphens.
Version 1 UUIDs, the most common, combine a MAC address and a timestamp to produce sufficient uniqueness. In the event of multiple UUIDs being generated fast enough that the timestamp doesn't increment before the next generation, the timestamp is manually incremented by 1. If no MAC address is available, or if its presence would be undesirable for privacy reasons, 6 random bytes sourced from a cryptographically secure random number generator may be used for the node ID instead.
Version 3 and Version 5 UUIDs, the least common, use the MD5 and SHA1 hash functions respectively, plus a namespace, plus an already unique data value to produce a unique ID. This can be used to generate a UUID from a URL for example.
Version 4 UUIDs are simply random data: of the 128 bits, 122 are random and the remaining 6 are set by some bit-twiddling to identify the UUID version and variant.
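The bit-twiddling for version 4 is small enough to show in full. This is a sketch of the RFC 4122 layout using Python's `os.urandom` as the random source; the function name `make_uuid4` is made up, and in practice you would just call `uuid.uuid4()`.

```python
import os
import uuid
from typing import Optional


def make_uuid4(random_bytes: Optional[bytes] = None) -> uuid.UUID:
    """Build a version 4 UUID by hand from 16 bytes of random data."""
    raw = bytearray(random_bytes if random_bytes is not None else os.urandom(16))
    if len(raw) != 16:
        raise ValueError("exactly 16 bytes are required")
    raw[6] = (raw[6] & 0x0F) | 0x40  # high nibble of byte 6 = version 4
    raw[8] = (raw[8] & 0x3F) | 0x80  # top two bits of byte 8 = variant 10
    return uuid.UUID(bytes=bytes(raw))


u = make_uuid4()
print(u, u.version, u.variant)

# With all-zero input only the fixed version and variant bits remain set.
print(make_uuid4(bytes(16)))
```

Six bits are overwritten in total (four for the version, two for the variant), which is why only 122 of the 128 bits carry randomness.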
UUID collisions are extremely unlikely to happen, especially not in a single application space.
UUID stands for Universally Unique IDentifier.
It's a 128-bit value used for a unique identification in software development. UUID is the same as GUID (Microsoft) and is part of the Distributed Computing Environment (DCE), standardized by the Open Software Foundation (OSF).
As mentioned, they are intended to have a high likelihood of uniqueness over space and time and to be computationally difficult to guess. Their generation is based on the current timestamp and a unique property of the workstation that generated the UUID.
Image from https://segment.com/blog/a-brief-history-of-the-uuid/
It's a very long string of bits that is supposed to be unique now and forever, i.e. no possible clash with any other UUID produced by you or anybody else in the world.
The way it works is simple: the current timestamp and an internet-related unique property of the computer that generated it (like the IP address, which ought to be unique at the moment you're connected to the internet, or the MAC address, which is lower level, a hard-wired ID for your network card) become part of the bit string.
Originally every network card in the world had its own unique MAC address, but on later hardware you can change the MAC address through software, so it is no longer as reliable as a unique ID.
It's a Universally Unique Identifier
A UUID is a 128-bit number that is used to uniquely identify some entity. Depending on the specific mechanisms used, a UUID is guaranteed to be different or is, at least, extremely likely to be different from any other UUID generated. The UUID relies upon a combination of components to ensure uniqueness. A UUID contains a reference to the network address of the host that generated the UUID, a timestamp and a randomly generated component. Because the network address identifies a unique computer, and the timestamp is unique for each UUID generated from a particular host, those two components should sufficiently ensure uniqueness.
I just want to add that it is better to use usUUID (static Windows identifiers).
For example, if a computer user relies on third-party software, like a screen reader for blind or low-vision users, the other software (in this case the screen reader) will play better with unique identifiers!
After all, how happy would you be if someone moved your car after you had memorized where you parked it?
A universally unique identifier (UUID) is a 128-bit number used to identify information in computer systems. The term globally unique identifier (GUID) is also used, typically in software created by Microsoft.
When generated according to the standard methods, UUIDs are for practical purposes unique. Their uniqueness does not depend on a central registration authority or coordination between the parties generating them, unlike most other numbering schemes. While the probability that a UUID will be duplicated is not zero, it is generally considered close enough to zero to be negligible.
