Non-standard UUID generation (such as using counters) - uuid

I find myself often in a situation where I want to generate a very large number of UUIDs rapidly. It would be desirable to generate one UUID "properly," then manipulate it to generate the rest. For example, I might just "add one" to the lowest bits of the UUID.
For most of my applications, I control the consumers of the UUIDs sufficiently that I can guarantee that they accept these UUIDs without complication. I'm curious whether I could say the same if I did not control the consumers.
In particular, I'm interested in the following three algorithms:
Generate a UUID v5 with some unique string. For subsequent UUIDs, add 1 to the "node" section (last bytes)
Generate a UUID v4 from a pseudorandom source. Again, add 1 to the "node" section for subsequent UUIDs (in effect, using a tainted random number source with a strong correlation to the first number)
Generate a UUIDv5 with some unique string. Manually change the version number to v4, and begin adding 1 for subsequent UUIDs.
All of these are clearly non-standard by the intent of the standard. However, are there any issues that would cause them to run afoul of the letter of the law, not just the spirit?
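To make the three schemes concrete, here is a rough Python sketch using the standard uuid module. It is only an illustration under assumptions: the helper name counter_uuids, the seed string "example.com", and the choice to wrap the counter inside the 48-bit node field (so a carry never reaches the version or variant bits) are mine and not part of any standard.

```python
import uuid

NODE_MASK = (1 << 48) - 1  # the node field is the lowest 48 bits of the 128-bit value


def counter_uuids(seed: uuid.UUID, count: int):
    """Yield `count` UUIDs that differ from `seed` only in the node field.

    Hypothetical helper: the counter wraps within the 48 node bits, so the
    version and variant bits of the seed are never disturbed.
    """
    high = seed.int & ~NODE_MASK
    node = seed.int & NODE_MASK
    for i in range(count):
        yield uuid.UUID(int=high | ((node + i) & NODE_MASK))


# Scheme 1: a v5 seed, then counting in the node field.
v5_seed = uuid.uuid5(uuid.NAMESPACE_DNS, "example.com")
scheme1 = list(counter_uuids(v5_seed, 3))

# Scheme 2: a v4 seed from the random source, then counting.
scheme2 = list(counter_uuids(uuid.uuid4(), 3))

# Scheme 3: the v5 seed relabelled as v4, then counting.
# Passing version=4 makes the constructor overwrite the version and variant bits.
relabelled = uuid.UUID(int=v5_seed.int, version=4)
scheme3 = list(counter_uuids(relabelled, 3))

for u in scheme1 + scheme2 + scheme3:
    print(u, "version", u.version)
```

As far as I can tell, every value produced this way still parses as an RFC 4122-variant UUID with the claimed version number, which is about all a consumer can verify from the bits of a single UUID.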

Related

Is it safe to use UUIDs generated from 2 different systems without a chance of collision?

We have an old system that generated some UUIDs. We have more records that need UUIDs but can't use the old system to generate them, so we will need to generate them elsewhere. This immediately struck me as a bad idea, and I have been searching for an answer but haven't found this exact question. There would be no check to make sure a UUID hadn't already been generated by the old system. The UUID would just get populated for records that don't have one. Safe?
If you use UUID V1, the values are designed to be unique, provided each generator has a distinct node ID (normally its MAC address) and a correctly functioning clock. However, most folks avoid V1 because it leaks data about the system that generated it (specifically, the MAC address) and the exact time, neither of which is ideal in many contexts.
Most folks use UUID V4, which is statistically unique. While it is theoretically possible for the same value to be generated twice, the odds of it ever actually happening before the heat death of the universe are approximately zero, which is good enough for any practical purpose.
UUID V3/V5 are used for when you want predictable values, which doesn’t sound like a good fit for your needs.

Does this bin-packing variant have a name?

I have what sounds like a typical bin-packing problem: x products of differing sizes need to be packed into y containers of differing capacities, minimizing the number of containers used, as well as minimizing the wasted space.
I can simplify the problem in that product sizes and container capacities can be reduced to standard 1-dimensional units. i.e. this product is 1 unit big while that one is 3 units, this box holds 6 units, that one 12. Think of eggs and cartons, or cases of beer.
But there's an additional constraint: each container has a particular attribute (we'll call it colour), and each product has a set of colours it is compatible with. There is no correlation between colour and product/container sizing; one product may be colour-compatible with the entire palette, while another may only be compatible with the red containers.
Is this problem variant already described in literature? If so, what is its name?
I think there is no special name for this variant. Although the coloring constraint first gives the impression it's graph coloring related, it's not. It's simply a limitation on the values for a variable.
In a typical solver implementation, each product (= item) will have a variable recording which container it's assigned to. The color constraint just reduces the value range for a specific variable. So instead of specifying that all variables use the same value range, make it variable-specific. (For example, in OptaPlanner this is the difference between a value range provided by the solution generally or by the entity specifically.) So the coloring constraint doesn't even need to be a constraint: it can be part of the model in most solvers.
Any solver that can handle bin packing should be able to handle this variant. Your problem is actually a relaxation of the Roadef 2012 Machine Reassignment problem, which is about assigning processes to computers. Simply drop all the constraints except for one resource usage constraint and the constraint that excludes certain processes from certain machines. That use case is implemented in many solvers. (Although, in practice, it is probably easier to start from a basic bin packing example such as Cloud Balancing.)
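To illustrate the "reduced value range" idea without a solver, here is a small Python sketch. It uses a plain first-fit-decreasing heuristic, which is my substitution for a real solver and will not always find the optimal packing. The Container class, the function name and the sample data are all made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Container:
    name: str
    capacity: int
    colour: str
    used: int = 0
    items: list = field(default_factory=list)


def first_fit_decreasing(products, containers):
    """Greedy packing; `products` is a list of (name, size, allowed_colours).

    Returns the names of the products that could not be placed.
    """
    unplaced = []
    for name, size, allowed in sorted(products, key=lambda p: -p[1]):
        # The colour rule is not a separate constraint here: it simply
        # narrows the list of containers this product may be assigned to.
        candidates = [c for c in containers if c.colour in allowed]
        for c in candidates:
            if c.used + size <= c.capacity:
                c.used += size
                c.items.append(name)
                break
        else:
            unplaced.append(name)
    return unplaced


blue_box = Container("blue box", capacity=12, colour="blue")
red_box = Container("red box", capacity=6, colour="red")
products = [
    ("eggs", 3, {"red"}),
    ("beer", 6, {"red", "blue"}),
    ("milk", 1, {"blue"}),
]
unplaced = first_fit_decreasing(products, [blue_box, red_box])
print(blue_box.items, red_box.items, unplaced)
```

The order in which containers are offered matters for a greedy pass like this, which is one reason to hand the same model to a proper solver once the problem grows.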
Most likely 2d bin-packing or classic knapsack problem.

Java UUID: how does it work?

Is a java.util.UUID unique across files?
For example:
We have file IDFile1.java, generating here a random id, using UUID.randomUUID().toString().
And we have file2 with name IDFile2.java, which generates another id.
Will the IDs from these two files collide with each other?
Is there any way to "turn back" a used ID generated by java.util.UUID, meaning that the ID could be issued again?
The purpose of functions like randomUUID is to munge together various pieces of information which are likely to have some amount of randomness such that two UUIDs generated at different times or in different places would have an extremely small likelihood of matching unless all sources of potential randomness happened to yield identical results. No effort is made to keep track of which UUIDs have or have not been issued; instead, the goal is to have enough randomness that the probability of an unintentional match will be small relative to e.g. the probability of a computer being smashed to a million pieces by a meteor strike.
Note that the system may use "number of UUIDs issued" as part of its UUID calculation, but that would only be one of many factors that go into it. The purpose of such a counter isn't to allow one to "go back" to a previous UUID, but rather to ensure that if e.g. two UUIDs are requested nearly simultaneously without any source of randomness having become available between them, the two requests will yield different values.
A UUID is unrelated to which file generates it. It doesn't matter which file generates them; they will almost certainly not be the same.
For question 2, the way UUIDs are generated doesn't really allow them to be regenerated in any meaningful way. Some UUID versions are built from information about your computer, the current time and other inputs. Java's UUID.randomUUID() instead uses a cryptographically strong random number generator and produces what are known as type 4 UUIDs.

Which UUID version to use?

Which version of the UUID should you use? I saw a lot of threads explaining what each version entails, but I am having trouble figuring out what's best for what applications.
There are two different ways of generating a UUID.
If you just need a unique ID, you want a version 1 or version 4.
Version 1: This generates a unique ID based on a network card MAC address and current time. If any of these things is sensitive in any way, don't use this. The advantage of this version is that, while looking at a list of UUIDs generated by machines you trust, you can easily know whether many UUIDs got generated by the same machine, or infer some time relationship between them.
Version 4: These are generated from random (or pseudo-random) numbers. If you just need to generate a UUID, this is probably what you want. The advantage of this version is that when you're debugging and looking at a long list of information matched with UUIDs, it's quicker to spot matches.
If you need to generate reproducible UUIDs from given names, you want a version 3 or version 5. If you are interacting with other systems, this choice was already made and you should check which version and namespaces they use.
Version 3: This generates a unique ID from an MD5 hash of a namespace and name. If you are dealing with very strict resource requirements (e.g. a very busy Arduino board), use this.
Version 5: This generates a unique ID from an SHA-1 hash of a namespace and name. This is the more secure and generally recommended version.
If you want a random number, use a random number library. If you want a unique identifier with effectively 0.00...many more 0s here...001% chance of collision, you should use UUIDv1. See Nick's post for UUIDv3 and v5.
UUIDv1 is NOT secure. It isn't meant to be. It is meant to be UNIQUE, not un-guessable. UUIDv1 uses the current timestamp, plus a machine identifier, plus some random-ish stuff to make a number that will never be generated by that algorithm again. This is appropriate for a transaction ID (even if everyone is doing millions of transactions/s).
To be honest, I don't understand why UUIDv4 exists... from reading RFC4122, it looks like that version does NOT eliminate the possibility of collisions. It is just a random number generator. If that is true, then you have a very GOOD chance of two machines in the world eventually creating the same "UUID"v4 (quotes because there isn't a mechanism for guaranteeing U.niversal U.niqueness). In that situation, I don't think that algorithm belongs in an RFC describing methods for generating unique values. It would belong in an RFC about generating randomness. For a set of random numbers:
chance_of_collision = 1 - (set_size! / (set_size - tries)!) / (set_size ^ tries)
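To put numbers on that formula, here is a short Python sketch. The function names are mine. The first function evaluates the expression above directly (written as a running product so the factorials never have to be computed), and the second uses the usual birthday approximation 1 - exp(-n(n-1)/(2N)), assuming the 122 random bits that a version 4 UUID carries.

```python
import math


def exact_collision_chance(tries: int, set_size: int) -> float:
    """Evaluate 1 - (set_size! / (set_size - tries)!) / set_size**tries."""
    p_all_distinct = 1.0
    for i in range(tries):
        p_all_distinct *= (set_size - i) / set_size
    return 1.0 - p_all_distinct


def approx_collision_chance(tries: int, random_bits: int = 122) -> float:
    """Birthday approximation, usable when set_size is astronomically large."""
    set_size = 2 ** random_bits
    return -math.expm1(-tries * (tries - 1) / (2 * set_size))


# Classic sanity check: 23 people, 365 possible birthdays.
print(exact_collision_chance(23, 365))

# Ten trillion version 4 UUIDs.
print(approx_collision_chance(10 ** 13))
```

The approximation is used for the UUID case because looping ten trillion times, or taking factorials of 2^122, is not practical.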
That's a very general question. One answer is: "it depends what kind of UUID you wish to generate". But a better one is this: "Well, before I answer, can you tell us why you need to code up your own UUID generation algorithm instead of calling the UUID generation functionality that most modern operating systems provide?"
Doing that is easier and safer, and since you probably don't need to generate your own, why bother coding up an implementation? In that case, the answer becomes use whatever your O/S, programming language or framework provides. For example, in Windows, there is CoCreateGuid or UuidCreate or one of the various wrappers available from the numerous frameworks in use. In Linux there is uuid_generate.
If you, for some reason, absolutely need to generate your own, then at least have the good sense to stay away from generating v1 and v2 UUIDs. It's tricky to get those right. Stick, instead, to v3, v4 or v5 UUIDs.
Update:
In a comment, you mention that you are using Python and link to this. Looking through the interface provided, the easiest option for you would be to generate a v4 UUID (that is, one created from random data) by calling uuid.uuid4().
If you have some data that you need to (or can) hash to generate a UUID from, then you can use either v3 (which relies on MD5) or v5 (which relies on SHA1). Generating a v3 or v5 UUID is simple: first pick the UUID type you want to generate (you should probably choose v5) and then pick the appropriate namespace and call the function with the data you want to use to generate the UUID from. For example, if you are hashing a URL you would use NAMESPACE_URL:
uuid.uuid3(uuid.NAMESPACE_URL, 'https://ripple.com')
Please note that this UUID will be different than the v5 UUID for the same URL, which is generated like this:
uuid.uuid5(uuid.NAMESPACE_URL, 'https://ripple.com')
A nice property of v3 and v5 UUIDs is that they should be interoperable between implementations. In other words, if two different systems are using an implementation that complies with RFC4122, they will (or at least should) both generate the same UUID if all other things are equal (i.e. generating the same version UUID, with the same namespace and the same data). This property can be very helpful in some situations (especially in content-addressable storage scenarios), but perhaps not in your particular case.
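A quick way to convince yourself of that reproducibility is the following sketch, which only uses Python's standard uuid module and the same URL as above. The variable names are arbitrary.

```python
import uuid

url = "https://ripple.com"

v3_a = uuid.uuid3(uuid.NAMESPACE_URL, url)
v3_b = uuid.uuid3(uuid.NAMESPACE_URL, url)
v5_a = uuid.uuid5(uuid.NAMESPACE_URL, url)
v5_b = uuid.uuid5(uuid.NAMESPACE_URL, url)

# Same version, namespace and name: the two results are equal.
print(v3_a == v3_b, v5_a == v5_b)

# Different hash function (and version digit): the results differ.
print(v3_a == v5_a)

# Same name under a different namespace: the results differ as well.
print(uuid.uuid5(uuid.NAMESPACE_DNS, url) == v5_a)
```

Any other RFC 4122-compliant implementation given the same namespace and name should print the same v3 and v5 values, which is the interoperability property described above.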
Postgres documentation describes the differences between UUIDs. A couple of them:
V3:
uuid_generate_v3(namespace uuid, name text) - This function generates a version 3 UUID in the given namespace using the specified input name.
V4:
uuid_generate_v4 - This function generates a version 4 UUID, which is derived entirely from random numbers.
Since it's not mentioned yet: you can use UUIDv1 if you want to be able to sort your entities by creation time without a separate, explicit timestamp. While that's not 100% precise and in many cases not the best way to go (due to the lack of explicitness), it comes in handy in some scenarios, e.g. when you're working with a Cassandra database.
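As a rough illustration in Python: the time attribute of a version 1 UUID holds the 60-bit timestamp, counted in 100-nanosecond steps since 1582-10-15, so it can be turned back into a date and used as a sort key. The helper name uuid1_created_at is my own. Note that sorting the plain text form does not give creation order, because the low bits of the timestamp come first in the string.

```python
import uuid
from datetime import datetime, timezone

# Number of 100 ns intervals between 1582-10-15 (the UUID epoch) and 1970-01-01.
UUID_EPOCH_OFFSET = 0x01B21DD213814000


def uuid1_created_at(u: uuid.UUID) -> datetime:
    """Return the UTC creation time embedded in a version 1 UUID."""
    if u.version != 1:
        raise ValueError("expected a version 1 UUID")
    unix_100ns = u.time - UUID_EPOCH_OFFSET
    return datetime.fromtimestamp(unix_100ns / 1e7, tz=timezone.utc)


ids = [uuid.uuid1() for _ in range(3)]

# Sort on the embedded timestamp, not on str(u).
by_creation = sorted(ids, key=lambda u: u.time)
print([uuid1_created_at(u).isoformat() for u in by_creation])
```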
Version 1: UUIDs using a timestamp and monotonic counter.
Version 3: UUIDs based on the MD5 hash of some data.
Version 4: UUIDs with random data.
Version 5: UUIDs based on the SHA1 hash of some data.
Version 6: UUIDs using a timestamp and monotonic counter.
Version 7: UUIDs using a Unix timestamp.
Version 8: UUIDs using user-defined data.
Read more at Rust documentation.

What is a UUID?

Well, what is one?
It's an identification number that will uniquely identify something. The idea is that the ID number will be universally unique. Thus, no two things should have the same UUID. In fact, if you were to generate 10 trillion random (version 4) UUIDs, there would be something along the lines of a 1 in 100 billion chance of two UUIDs being the same.
Standardized identifiers
UUIDs are defined in RFC 4122. They're Universally Unique IDentifiers, that can be generated without the use of a centralized authority. There are four major types of UUIDs which are used in slightly different scenarios. All UUIDs are 128 bits in length, but are commonly represented as 32 hexadecimal characters separated by four hyphens.
Version 1 UUIDs, the most common, combine a MAC address and a timestamp to produce sufficient uniqueness. In the event of multiple UUIDs being generated fast enough that the timestamp doesn't increment before the next generation, the timestamp is manually incremented by 1. If no MAC address is available, or if its presence would be undesirable for privacy reasons, 6 random bytes sourced from a cryptographically secure random number generator may be used for the node ID instead.
Version 3 and Version 5 UUIDs, the least common, use the MD5 and SHA1 hash functions respectively, plus a namespace, plus an already unique data value to produce a unique ID. This can be used to generate a UUID from a URL for example.
Version 4 UUIDs are simply 128 bits of random data, with six of those bits overwritten to identify the UUID version and variant.
UUID collisions are extremely unlikely to happen, especially not in a single application space.
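The layout described above is easy to inspect from Python's standard uuid module. This is just a small sketch. The index 14 used below is the position of the version digit in the usual 8-4-4-4-12 text form.

```python
import uuid

u = uuid.uuid4()
text = str(u)

print(text)
print(len(u.hex), "hex digits,", text.count("-"), "hyphens")
print("version:", u.version)
print("RFC 4122 variant:", u.variant == uuid.RFC_4122)
print("version digit in the text form:", text[14])
```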
UUID stands for Universally Unique IDentifier.
It's a 128-bit value used for a unique identification in software development. UUID is the same as GUID (Microsoft) and is part of the Distributed Computing Environment (DCE), standardized by the Open Software Foundation (OSF).
As mentioned, they are intended to have a high likelihood of uniqueness over space and time and are computationally difficult to guess. Their generation is based on the current timestamp and a unique property of the workstation that generated the UUID.
Image from https://segment.com/blog/a-brief-history-of-the-uuid/
It's a very long string of bits that is supposed to be unique now and forever, i.e. no possible clash with any other UUID produced by you or anybody else in the world.
The way it works is simple: the current timestamp and an internet-related unique property of the computer that generated it become part of the bit string. That property might be the IP address, which ought to be unique at the moment you're connected to the internet, or the MAC address, which is lower level, a hard-wired ID for your network card.
Originally every network card in the world had its own unique MAC address, but on later hardware you can change the MAC address through software, so it's not as reliable as a unique ID anymore.
It's a Universally Unique Identifier
A UUID is a 128-bit number that is used to uniquely identify some entity. Depending on the specific mechanisms used, a UUID is guaranteed to be different or is, at least, extremely likely to be different from any other UUID generated. The UUID relies upon a combination of components to ensure uniqueness. A UUID contains a reference to the network address of the host that generated the UUID, a timestamp and a randomly generated component. Because the network address identifies a unique computer, and the timestamp is unique for each UUID generated from a particular host, those two components should sufficiently ensure uniqueness.
I just want to add that it is better to use usUUID (static Windows identifiers).
For example, if a computer user relies on third-party software, like a screen reader for blind or low-vision users, the other software (in this case the screen reader) will play better with unique identifiers!
After all, how happy would you be if someone moved your car after you had noted where you parked it?
A universally unique identifier (UUID) is a 128-bit number used to identify information in computer systems. The term globally unique identifier (GUID) is also used, typically in software created by Microsoft.
When generated according to the standard methods, UUIDs are for practical purposes unique. Their uniqueness does not depend on a central registration authority or coordination between the parties generating them, unlike most other numbering schemes. While the probability that a UUID will be duplicated is not zero, it is close enough to zero to be negligible.
