Which version of the UUID should you use? I saw a lot of threads explaining what each version entails, but I am having trouble figuring out what's best for what applications.
There are two different ways of generating a UUID.
If you just need a unique ID, you want a version 1 or version 4.
Version 1: This generates a unique ID based on a network card MAC address and current time. If any of these things is sensitive in any way, don't use this. The advantage of this version is that, while looking at a list of UUIDs generated by machines you trust, you can easily know whether many UUIDs got generated by the same machine, or infer some time relationship between them.
Version 4: These are generated from random (or pseudo-random) numbers. If you just need to generate a UUID, this is probably what you want. The advantage of this version is that when you're debugging and looking at a long list of information matched with UUIDs, it's quicker to spot matches.
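To make the privacy point about version 1 concrete, here is a minimal sketch that decodes the timestamp and node out of a version 1 UUID using Python's standard uuid module. The helper name describe_v1 and the node value 0x001122334455 are made up for illustration; a real uuid.uuid1() call with no arguments would normally embed the machine's own MAC address.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Start of the Gregorian calendar, the epoch used by version 1 timestamps.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def describe_v1(u: uuid.UUID):
    """Return (creation time, node as a MAC-style string) for a version 1 UUID."""
    if u.version != 1:
        raise ValueError("not a version 1 UUID")
    # u.time counts 100-nanosecond intervals since the Gregorian epoch.
    created = GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)
    # u.node is the 48-bit node field, printed here byte by byte.
    node = ":".join(f"{(u.node >> shift) & 0xFF:02x}" for shift in range(40, -8, -8))
    return created, node


# Hypothetical node value, so the example does not expose a real MAC address.
example = uuid.uuid1(node=0x001122334455)
created, node = describe_v1(example)
print(created.isoformat(), node)
```

Anyone holding the UUID can run the same decoding, which is why version 1 is a poor fit when the generating machine or the creation time is sensitive.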
If you need to generate reproducible UUIDs from given names, you want a version 3 or version 5. If you are interacting with other systems, this choice has already been made, and you should check which version and namespaces they use.
Version 3: This generates a unique ID from an MD5 hash of a namespace and name. If you are dealing with very strict resource requirements (e.g. a very busy Arduino board), use this.
Version 5: This generates a unique ID from an SHA-1 hash of a namespace and name. This is the more secure and generally recommended version.
If you want a random number, use a random number library. If you want a unique identifier with effectively 0.00...many more 0s here...001% chance of collision, you should use UUIDv1. See Nick's post for UUIDv3 and v5.
UUIDv1 is NOT secure. It isn't meant to be. It is meant to be UNIQUE, not un-guessable. UUIDv1 uses the current timestamp, plus a machine identifier, plus some random-ish stuff to make a number that will never be generated by that algorithm again. This is appropriate for a transaction ID (even if everyone is doing millions of transactions/s).
To be honest, I don't understand why UUIDv4 exists... from reading RFC4122, it looks like that version does NOT eliminate the possibility of collisions. It is just a random number generator. If that is true, then you have a very GOOD chance of two machines in the world eventually creating the same "UUID"v4 (quotes because there isn't a mechanism for guaranteeing U.niversal U.niqueness). In that situation, I don't think that algorithm belongs in an RFC describing methods for generating unique values. It would belong in an RFC about generating randomness. For a set of random numbers:
chance_of_collision = 1 - (set_size! / (set_size - tries)!) / (set_size ^ tries)
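The factorials in that formula are impractical to evaluate for UUID-sized sets, so here is a rough sketch that uses the standard birthday approximation 1 - exp(-n(n-1)/(2N)) instead. It assumes 122 random bits per version 4 UUID (128 bits minus the 4 version bits and 2 variant bits); the function name is made up for illustration.

```python
import math


def collision_probability(tries: int, random_bits: int = 122) -> float:
    """Approximate chance of at least one collision among `tries` random values.

    Uses the birthday approximation 1 - exp(-n(n-1) / (2N)), which is very
    close to the exact factorial formula when tries is much smaller than N.
    """
    set_size = 2.0 ** random_bits
    exponent = -tries * (tries - 1) / (2.0 * set_size)
    # expm1 keeps precision when the result is extremely close to zero.
    return -math.expm1(exponent)


# Ten trillion random UUIDs.
print(collision_probability(10**13))
```

Plugging in your own expected number of IDs is a quick way to judge whether the collision risk matters for your application.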
That's a very general question. One answer is: "it depends what kind of UUID you wish to generate". But a better one is this: "Well, before I answer, can you tell us why you need to code up your own UUID generation algorithm instead of calling the UUID generation functionality that most modern operating systems provide?"
Doing that is easier and safer, and since you probably don't need to generate your own, why bother coding up an implementation? In that case, the answer becomes use whatever your O/S, programming language or framework provides. For example, in Windows, there is CoCreateGuid or UuidCreate or one of the various wrappers available from the numerous frameworks in use. In Linux there is uuid_generate.
If you, for some reason, absolutely need to generate your own, then at least have the good sense to stay away from generating v1 and v2 UUIDs. It's tricky to get those right. Stick, instead, to v3, v4 or v5 UUIDs.
Update:
In a comment, you mention that you are using Python and link to this. Looking through the interface provided, the easiest option for you would be to generate a v4 UUID (that is, one created from random data) by calling uuid.uuid4().
If you have some data that you need to (or can) hash to generate a UUID from, then you can use either v3 (which relies on MD5) or v5 (which relies on SHA1). Generating a v3 or v5 UUID is simple: first pick the UUID type you want to generate (you should probably choose v5) and then pick the appropriate namespace and call the function with the data you want to use to generate the UUID from. For example, if you are hashing a URL you would use NAMESPACE_URL:
uuid.uuid3(uuid.NAMESPACE_URL, 'https://ripple.com')
Please note that this UUID will be different than the v5 UUID for the same URL, which is generated like this:
uuid.uuid5(uuid.NAMESPACE_URL, 'https://ripple.com')
A nice property of v3 and v5 UUIDs is that they should be interoperable between implementations. In other words, if two different systems are using an implementation that complies with RFC4122, they will (or at least should) both generate the same UUID if all other things are equal (i.e. generating the same version UUID, with the same namespace and the same data). This property can be very helpful in some situations (especially in content-addressable storage scenarios), but perhaps not in your particular case.
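As a small self-contained sketch of the calls above, the snippet below generates v3 and v5 UUIDs for the same URL and shows that repeating the call gives the same value. It only uses Python's standard uuid module; the variable names are just for illustration.

```python
import uuid

url = "https://ripple.com"

# Name-based UUIDs: the same namespace and name always give the same result.
v3_id = uuid.uuid3(uuid.NAMESPACE_URL, url)  # MD5-based
v5_id = uuid.uuid5(uuid.NAMESPACE_URL, url)  # SHA-1-based

# Calling again with identical inputs reproduces the identical UUID.
v5_again = uuid.uuid5(uuid.NAMESPACE_URL, url)

print(v3_id, v5_id, v5_id == v5_again)
```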
Postgres documentation describes the differences between UUIDs. A couple of them:
V3:
uuid_generate_v3(namespace uuid, name text) - This function generates a version 3 UUID in the given namespace using the specified input name.
V4:
uuid_generate_v4 - This function generates a version 4 UUID, which is derived entirely from random numbers.
Since it's not mentioned yet: you can use UUIDv1 if you want to be able to sort your entities by creation time without a separate, explicit timestamp. While that's not 100% precise and in many cases not the best way to go (due to the lack of explicitness), it comes in handy in some scenarios, e.g. when you're working with a Cassandra database.
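A minimal sketch of that idea in Python: sort version 1 UUIDs by their embedded time field. Note that sorting the string form does not work, because the low bits of the timestamp come first in the text representation. The node value below is a made-up placeholder so the example does not embed a real MAC address.

```python
import random
import uuid

# Hypothetical node value; uuid.uuid1() with no arguments would use the host's.
ids = [uuid.uuid1(node=0x001122334455) for _ in range(5)]

# Pretend the IDs arrived in arbitrary order.
shuffled = ids[:]
random.shuffle(shuffled)

# u.time is the 60-bit timestamp (100 ns ticks), so it orders by creation time.
by_time = sorted(shuffled, key=lambda u: u.time)

print([str(u) for u in by_time])
```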
Version 1: UUIDs using a timestamp and monotonic counter.
Version 3: UUIDs based on the MD5 hash of some data.
Version 4: UUIDs with random data.
Version 5: UUIDs based on the SHA1 hash of some data.
Version 6: UUIDs using a timestamp and monotonic counter.
Version 7: UUIDs using a Unix timestamp.
Version 8: UUIDs using user-defined data.
Read more in the Rust documentation.
Related
I am building a database for physical objects and trying to make sure I represent them in the most intelligent way possible.
I would like to generate a QR code with a UUID, and would like to have the following attributes:
Guaranteed to be unique
Ideally would allow there to be a canonical UUID for something, and then an iterable component of an instance of it that would allow for knowing that without looking at the database. For example, I might have 10 identical lamps; ideally the encoding would have one UUID for the lamp, then a way to identify each one. Maybe this would be a UUID with something additional tagged onto it? Or is it better to just link them together on the backend and use conventional UUIDs?
My understanding is that the best way to generate the UUID would be to do a version-5 using my domain to namespace it. I would also like to have a human-readable shortened version similar to the git short hash; I assume I could take the first 7-8 characters for that purpose then have a collision likelihood for that shortened version that is XXX (what would that likelihood be???).
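One way this could look in Python, as a sketch only: derive a namespace from the domain, derive a per-model UUID from that, and derive per-instance UUIDs from the model UUID. The domain example.com, the function names, and the two-level scheme are all assumptions for illustration, not an established convention. The last function estimates the collision likelihood of a shortened prefix with the usual birthday approximation.

```python
import math
import uuid

# Hypothetical domain; substitute your own.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")


def item_uuid(model: str, instance: int) -> uuid.UUID:
    """Canonical UUID per model, then a derived UUID per physical instance."""
    model_id = uuid.uuid5(NAMESPACE, model)
    return uuid.uuid5(model_id, str(instance))


def short_id(u: uuid.UUID, length: int = 8) -> str:
    """Git-style short form: the first `length` hex characters."""
    return u.hex[:length]


def prefix_collision_probability(items: int, hex_chars: int = 8) -> float:
    """Birthday approximation for a clash among `items` short IDs."""
    space = 16.0 ** hex_chars
    return -math.expm1(-items * (items - 1) / (2.0 * space))


lamp_3 = item_uuid("lamp", 3)
print(lamp_3, short_id(lamp_3), prefix_collision_probability(1000))
```

With 8 hex characters (32 bits), this approximation gives roughly a 0.01% chance of a clash among 1,000 items and reaches about 50% at around 77,000 items, so a short form is fine for human reference but should not be the database key.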
We have an old system that generated some UUIDs. We have more records that need UUIDs but can't use the old system to generate them, so we will need to generate them elsewhere. This immediately struck me as not a good idea, and I have been searching for an answer but haven't found this exact question. There would be no check to make sure the UUID wasn't generated already in the old system. The UUID would just get populated for records that don't have one. Safe?
If you use UUID V1, all values generated are guaranteed to be unique. However, most folks avoid V1 because it leaks data about the system that generated it (specifically, the MAC address) and the exact time, neither of which is ideal in many contexts.
Most folks use UUID V4, which is statistically unique. While it is theoretically possible for the same value to be generated twice, the odds of it ever actually happening before the heat death of the universe are approximately zero, which is good enough for any practical purpose.
UUID V3/V5 are used for when you want predictable values, which doesn’t sound like a good fit for your needs.
I find myself often in a situation where I want to generate a very large number of UUIDs rapidly. It would be desirable to generate one UUID "properly," then manipulate it to generate the rest. For example, I might just "add one" to the lowest bits of the UUID.
For most of my applications, I control the consumers of the UUIDs sufficiently that I can guarantee that they accept these UUIDs without complication. I'm curious whether I could say the same if I did not control the consumers.
In particular, I'm interested in the following three algorithms:
Generate a UUID v5 with some unique string. For subsequent UUIDs, add 1 to the "node" section (last bytes)
Generate a UUID v4 from a pseudorandom source. Again, add 1 to the "node" section for subsequent UUIDs (in effect, using a tainted random number source with a strong correlation to the first number)
Generate a UUIDv5 with some unique string. Manually change the version number to v4, and begin adding 1 for subsequent UUIDs.
All of these are clearly not standard by the intent of the standard. However, are there any issues which would cause this to run afoul of the letter of the law, not just the spirit?
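For reference, the "add 1 to the node section" step from the first two algorithms might be sketched like this in Python. The helper name increment_node is hypothetical. It only touches the last 48 bits, so the version and variant bits of the original UUID are left intact; whether the result still counts as a valid v4 or v5 UUID is exactly the question being asked.

```python
import uuid

NODE_MASK = 0xFFFFFFFFFFFF  # the node field is the last 48 bits


def increment_node(u: uuid.UUID, step: int = 1) -> uuid.UUID:
    """Return a copy of `u` with `step` added to its node field (wrapping)."""
    node = (u.node + step) & NODE_MASK
    return uuid.UUID(fields=(
        u.time_low,
        u.time_mid,
        u.time_hi_version,       # carries the version bits, left unchanged
        u.clock_seq_hi_variant,  # carries the variant bits, left unchanged
        u.clock_seq_low,
        node,
    ))


base = uuid.uuid5(uuid.NAMESPACE_URL, "some unique string")
derived = [increment_node(base, i) for i in range(1, 4)]
print(base, derived)
```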
I'm planning on storing a bunch of records in a file, where each record is then signed with libsodium. However, I would like future versions of my program to be able to check signatures the current version has made, and ideally vice-versa.
For the current version of Sodium, signatures are made using the Ed25519 algorithm. I imagine that the default primitive can change in new versions of Sodium (otherwise libsodium wouldn't expose a way to choose a particular one, I think).
Should I...
Always use the default primitive (i.e. crypto_sign)
Use a specific primitive (i.e. crypto_sign_ed25519)
Do (1), but store the value of sodium_library_version_major() in the file (either in a dedicated 'sodium version' field or a general 'file format revision' field) and quit if the currently running version is lower
Do (3), but also store crypto_sign_primitive()
Do (4), but also store crypto_sign_bytes() and friends
...or should I do something else entirely?
My program will be written in C.
Let's first identify the set of possible problems and then try to solve them. We have some data (a record) and a signature. The signature can be computed with different algorithms. The program can evolve and change its behaviour, and libsodium can also (independently) evolve and change its behaviour. On the signature generation front we have:
crypto_sign(), which uses some default algorithm to produce signatures (at the moment of writing it just invokes crypto_sign_ed25519())
crypto_sign_ed25519(), which produces signatures based on specific ed25519 algorithm
I assume that for one particular algorithm given the same input data and the same key we'll always get the same result, as it's math and any deviation from this rule would make the library completely unusable.
Let's take a look at the two main options:
Using crypto_sign_ed25519() all the time and never changing this. Not that bad of an option: it's simple, and as long as crypto_sign_ed25519() exists in libsodium and is stable in its output, you have nothing to worry about, with a stable fixed-size signature and zero management overhead. Of course, in the future someone could discover some horrible problem with this algorithm, and if you're not prepared to change the algorithm, that could mean a horrible problem for you.
Using crypto_sign(). With this we suddenly have a lot of problems, because the algorithm can change, so you must store some metadata along with the signature, which opens up a set of questions:
what to store?
should this metadata be record-level or file-level?
What do we have in mentioned functions for the second approach?
sodium_library_version_major() is a function to tell us the library API version. It's not directly related to changes in supported/default algorithms so it's of little use for our problems.
crypto_sign_primitive() is a function that returns a string identifying the algorithm used in crypto_sign(). That's a perfect match for what we need, because supposedly its output will change at exactly the time when the algorithm would change.
crypto_sign_bytes() is a function that returns the size of signature produced by crypto_sign() in bytes. That's useful for determining the amount of storage needed for the signature, but it can easily stay the same if algorithm changes, so it's not the metadata we need to store explicitly.
Now that we know what to store, there is the question of processing that stored data. You need to get the algorithm name and use it to invoke the matching verification function. Unfortunately, from what I see, libsodium itself doesn't provide any simple way to get the proper function given the algorithm name (like EVP_get_cipherbyname() or EVP_get_digestbyname() in openssl), so you need to make one yourself (which of course should fail for an unknown name). And if you have to make one yourself, maybe it would be even easier to store some numeric identifier instead of the name from the library (more code though).
Now let's get back to file-level vs record-level. To solve that there are another two questions to ask — can you generate new signatures for old records at any given time (is that technically possible, is that allowed by policy) and do you need to append new records to old files?
If you can't generate new signatures for old records or you need to append new records and don't want the performance penalty of signature regeneration, then you don't have much choice and you need to:
have a dynamic-size field for your signature
store the algorithm (dynamic string field or internal (for your application) ID) used to generate the signature along with the signature itself
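To illustrate the record-level layout, here is a language-neutral sketch written in Python (the actual program would do the same thing in C). The byte layout, the numeric algorithm IDs, and the function names are all hypothetical application-level choices, not anything defined by libsodium. The lookup fails for unknown IDs, as suggested above.

```python
import struct

# Hypothetical application-level algorithm IDs (not defined by libsodium).
ALG_ED25519 = 1
KNOWN_ALGORITHMS = {ALG_ED25519: "ed25519"}

# Assumed header: 1-byte algorithm ID, then 2-byte little-endian signature length.
HEADER = struct.Struct("<BH")


def pack_signature(alg_id: int, signature: bytes) -> bytes:
    """Prefix a signature with its algorithm ID and length."""
    return HEADER.pack(alg_id, len(signature)) + signature


def unpack_signature(blob: bytes):
    """Return (algorithm name, signature, remaining bytes); reject unknown IDs."""
    alg_id, sig_len = HEADER.unpack_from(blob)
    if alg_id not in KNOWN_ALGORITHMS:
        raise ValueError(f"unknown signature algorithm id {alg_id}")
    start = HEADER.size
    end = start + sig_len
    if end > len(blob):
        raise ValueError("truncated signature field")
    return KNOWN_ALGORITHMS[alg_id], blob[start:end], blob[end:]


# A 64-byte placeholder stands in for a real signature here.
record = pack_signature(ALG_ED25519, bytes(64)) + b"next record"
print(unpack_signature(record)[0])
```

The verifier would then dispatch on the returned algorithm name to pick the matching verification function.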
If you can generate new signatures, or especially if you don't need to append new records, then you can get away with a simpler file-level approach where you store the algorithm used in a special file-level field and, if the signature algorithm changes, regenerate all signatures when saving the file (or use the old one when appending new records; that's also more of a compatibility policy question).
Other options? Well, what's so special about crypto_sign()? Its behaviour is not under your control: the libsodium developers choose the algorithm for you (no doubt they choose a good one). But if you have any versioning information in your file structure (not signature-specific, I mean), nothing prevents you from making your own choice and using one algorithm with one file version and a different one with another (with conversion code when needed, of course). Again, that's based on the assumption that you can generate new signatures and that doing so is allowed by policy.
Which brings us back to the original two choices, with the question of whether it's worth the trouble of doing all that compared to just using crypto_sign_ed25519(). That mostly depends on your program's life span. I'd say (just as an opinion) that if it's less than 5 years, it's easier to just use one particular algorithm. If it can easily be more than 10 years, then no, you really need to be able to survive algorithm (and probably even whole crypto library) changes.
Just use the high-level API.
Functions from the high-level API are not going to use a different algorithm without the major version of the library being bumped.
The only breaking change one can expect in libsodium 1.x.y is the removal of deprecated/undocumented functions (that don't even exist in current releases compiled with the --enable-minimal switch). Everything else will remain backward compatible.
New algorithms might be introduced in 1.x.y versions without high-level wrappers, and will be stabilized and exposed via a new high-level API in libsodium 2.
Therefore, do not bother calling crypto_sign_ed25519(). Just use crypto_sign().
Well, what is one?
It's an identification number that uniquely identifies something. The idea is that the ID number will be universally unique, so no two things should have the same UUID. In fact, if you were to generate 10 trillion random (version 4) UUIDs, the chance of any two of them being the same would be on the order of 1 in 100 billion.
Standardized identifiers
UUIDs are defined in RFC 4122. They're Universally Unique IDentifiers, that can be generated without the use of a centralized authority. There are four major types of UUIDs which are used in slightly different scenarios. All UUIDs are 128 bits in length, but are commonly represented as 32 hexadecimal characters separated by four hyphens.
Version 1 UUIDs, the most common, combine a MAC address and a timestamp to produce sufficient uniqueness. In the event of multiple UUIDs being generated fast enough that the timestamp doesn't increment before the next generation, the timestamp is manually incremented by 1. If no MAC address is available, or if its presence would be undesirable for privacy reasons, 6 random bytes sourced from a cryptographically secure random number generator may be used for the node ID instead.
Version 3 and Version 5 UUIDs, the least common, use the MD5 and SHA1 hash functions respectively, plus a namespace, plus an already unique data value to produce a unique ID. This can be used to generate a UUID from a URL for example.
Version 4 UUIDs are simply 128 bits of random data, with some bit-twiddling to identify the UUID version and variant (which leaves 122 bits that are actually random).
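The bit-twiddling for version 4 can be sketched in a few lines of Python. In practice you would just call uuid.uuid4(); the hypothetical make_uuid4 helper below only exists to show which bits get overwritten.

```python
import os
import uuid


def make_uuid4(random_bytes=None) -> uuid.UUID:
    """Build a version 4 UUID from 16 random bytes by setting version/variant bits."""
    raw = bytearray(random_bytes if random_bytes is not None else os.urandom(16))
    # High nibble of byte 6 holds the version number (4).
    raw[6] = (raw[6] & 0x0F) | 0x40
    # Top two bits of byte 8 hold the RFC 4122 variant (binary 10).
    raw[8] = (raw[8] & 0x3F) | 0x80
    return uuid.UUID(bytes=bytes(raw))


# All-zero input makes the overwritten bits easy to spot.
print(make_uuid4(bytes(16)))  # 00000000-0000-4000-8000-000000000000
print(make_uuid4())
```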
UUID collisions are extremely unlikely to happen, especially not in a single application space.
UUID stands for Universally Unique IDentifier.
It's a 128-bit value used for unique identification in software development. UUID is the same as GUID (Microsoft) and is part of the Distributed Computing Environment (DCE), standardized by the Open Software Foundation (OSF).
As mentioned, they are intended to have a high likelihood of uniqueness over space and time and are computationally difficult to guess. Its generation is based on the current timestamp and a unique property of the workstation that generated the UUID.
It's a very long string of bits that is supposed to be unique now and forever, i.e. no possible clash with any other UUID produced by you or anybody else in the world.
The way it works is simply by combining the current timestamp with an internet-related unique property of the computer that generated it (like the IP address, which ought to be unique at the moment you're connected to the internet; or the MAC address, which is more low level, a hard-wired ID for your network card); both become part of the bit string.
Originally every network card in the world had its own unique MAC address, but in later generations you can change the MAC address through software, so it's not as reliable as a unique ID anymore.
It's a Universally Unique Identifier
A UUID is a 128-bit number that is used to uniquely identify some entity. Depending on the specific mechanisms used, a UUID is guaranteed to be different or is, at least, extremely likely to be different from any other UUID generated. The UUID relies upon a combination of components to ensure uniqueness. A UUID contains a reference to the network address of the host that generated the UUID, a timestamp and a randomly generated component. Because the network address identifies a unique computer, and the timestamp is unique for each UUID generated from a particular host, those two components should sufficiently ensure uniqueness.
I just want to add that it is better to use usUUID (Static windows identifiers).
For example, if a computer user relies on third-party software like a screen reader for blind or low-vision users, the other software (in this case the screen reader) will play better with unique identifiers!
After all how happy will you be if someone moves your car after you know the place you parked it at!!!
A universally unique identifier (UUID) is a 128-bit number used to identify information in computer systems. The term globally unique identifier (GUID) is also used, typically in software created by Microsoft.
When generated according to the standard methods, UUIDs are for practical purposes unique. Their uniqueness does not depend on a central registration authority or coordination between the parties generating them, unlike most other numbering schemes. While the probability that a UUID will be duplicated is not zero, it is generally considered close enough to zero to be negligible.