Sorry in advance as this question is similar (but not the same!) to others.
Anyway, I need to be able to generate surrogate keys in more than one location to be synchronized at a later time. I was considering using GUIDs, however these keys may have to appear in the parameters of a URL and GUIDs would be really complicated and ugly.
I was considering a scheme that would allow me to use integers, providing better performance in the database, but obviously I cannot simply use auto numbers. The idea is to use a key with two meanings - the High-Low Strategy as I believe it is called. The key would consist of the source (where it was generated, generally 1 of 2 locations in this business case) and the auto incremented value. For instance:
1-000000567,
1-000000568,
1-000000569,
1-000000570,
...
And for another source:
2-000000567,
2-000000568,
...
This would also mean that I could store them in the database as integers (i.e., "2-000000567" would become the integer 2000000567).
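A minimal sketch of that packing in Python. The names `compose_key` and `split_key` are hypothetical, and the nine-digit counter width is an assumption read off the example values above:

```python
SEQ_DIGITS = 9                 # assumed width of the zero-padded counter, as in "2-000000567"
SEQ_LIMIT = 10 ** SEQ_DIGITS   # first counter value that no longer fits


def compose_key(source: int, seq: int) -> int:
    """Pack a source id and a per-source counter into one integer."""
    if source < 1:
        raise ValueError("source must be a positive integer")
    if not 0 <= seq < SEQ_LIMIT:
        raise ValueError("sequence does not fit in the counter part")
    return source * SEQ_LIMIT + seq


def split_key(key: int) -> tuple[int, int]:
    """Recover (source, seq) from a packed key."""
    return divmod(key, SEQ_LIMIT)
```

With these assumptions `compose_key(2, 567)` returns `2000000567`. A signed 32-bit column tops out at 2,147,483,647, so a third source would already need a 64-bit integer.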
Can anyone see any issues with this? Such as indexing or fragmentations that may occur? Or perhaps even a better way of doing it?
Just to confirm, there is no business meaning in this key, the user will never see it (except perhaps in the parameters of a URL) nor use it.
I look forward to your opinions and appreciate your time, Thanks a million :)
This explains the hilo algorithm which you refer to: What's the Hi/Lo algorithm?
It's the often-used solution to "disconnected" problems such as yours. For example, if you're using Hibernate/NHibernate, it's one of the recommended primary key options.
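As a rough illustration of the idea, here is a minimal Hi/Lo sketch in Python. It is not Hibernate's implementation; the class name, `reserve_hi`, and the in-memory counter standing in for a shared database row are all assumptions made for the example:

```python
import itertools
import threading


class HiLoGenerator:
    """Hands out ids from a block reserved by a single 'hi' allocation."""

    def __init__(self, next_hi, block_size=100):
        self._next_hi = next_hi        # callable returning a fresh hi value
        self._block_size = block_size
        self._lock = threading.Lock()
        self._hi = None
        self._lo = block_size          # forces a reservation on first use

    def next_id(self):
        with self._lock:
            if self._lo >= self._block_size:
                self._hi = self._next_hi()   # the only trip to the shared store
                self._lo = 0
            value = self._hi * self._block_size + self._lo
            self._lo += 1
            return value


# Stand-in for a shared counter row in the central database (hypothetical).
_shared = itertools.count()
_shared_lock = threading.Lock()


def reserve_hi():
    with _shared_lock:
        return next(_shared)


site_a = HiLoGenerator(reserve_hi, block_size=3)
site_b = HiLoGenerator(reserve_hi, block_size=3)
ids_a = [site_a.next_id() for _ in range(4)]   # uses block 0, then reserves block 1
ids_b = [site_b.next_id() for _ in range(2)]   # reserves block 2
```

Each site only contacts the shared store once per block, and blocks never overlap, so ids generated at different locations cannot collide.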
Trying to define some policy for keys in a key-value store (we are using Redis). The keyspace should be:
Shardable (can introduce more servers and spread out the keyspace between them)
Namespaced (there should be some mechanism to "group" keys together logically, for example by domain or associated concepts)
Efficient (try to use as little as possible space in the DB for keys, to allow for as much data as possible)
As collision-less as possible (avoid keys for two different objects to be equal)
Two alternatives that I have considered are these:
Use prefixes for namespaces, separated by some character (like human_resources:person:<some_id>). The upside of this is that it is pretty scalable and easy to understand. The downside would be possible conflicts depending on the separator (what if the id contains the character : itself?), and possibly size efficiency (too many nested namespaces might create very long keys).
Use some data structure (like Ordered Set or Hash) to store namespaces. The main drawback to this would be loss of "shardability", since the structure to store the namespaces would need to be in a single database.
Question: What would be a good way to manage a keyspace in a sharded setup? Should we use one of these alternatives, or is there some other, better pattern that we have not considered?
Thanks very much!
The generally accepted convention in the Redis world is option 1 - i.e. namespaces separated by a character such as colon. That said, the namespaces are almost always one level deep. For example : person:12321 instead of human_resources:person:12321.
How does this work with the 4 guidelines you set?
Shardable - This approach is shardable. Each key can get into a different shard or same shard depending on how you set it up.
Namespaced - Namespaces as a way to avoid collisions work with this approach. However, namespaces as a way to group keys don't work out. In general, using keys as a way to group data is a bad idea. For example, what if the person moves from one department to another? If you change the key, you will have to update all references - and that gets tricky.
It's best to ensure the key never changes for an object. Grouping can then be handled externally by creating a separate index.
For example, let's say you want to group people by department, by salary range, by location. Here's how you'd do it -
Individual people go in separate hash with keys persons:12321
Create a set for each grouping - for example, persons_by:department - and only store the numeric identifiers for each person in this set. For example [12321, 43432]. This way, you get the advantages of Redis' compact integer set encoding.
Efficient - The method explained above is pretty efficient memory-wise. To save some more memory, you can compress the keys further on the application side. For example, you can store p:12321 instead of persons:12321. You should do this only if you have determined via profiling that you need such memory savings. In general, it isn't worth the cost.
Collision Free - This depends on your application. Each User or Person should have a primary key that never changes. Use this in your Redis key, and you won't have collisions.
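The "stable key plus separate index" idea from the Namespaced point can be sketched without a Redis server. In the sketch below plain Python dicts and sets stand in for Redis hashes and sets; the helper names, the sample people, and the `persons_by:department:<name>` key layout (one set per department value) are assumptions for illustration:

```python
from collections import defaultdict

hashes = {}                  # stands in for Redis hashes: key -> field dict
sets = defaultdict(set)      # stands in for Redis sets:   key -> members


def person_key(person_id: int) -> str:
    return f"persons:{person_id}"


def save_person(person_id: int, name: str, department: str) -> None:
    hashes[person_key(person_id)] = {"name": name, "department": department}
    sets["persons_by:department:" + department].add(person_id)


def move_person(person_id: int, new_department: str) -> None:
    record = hashes[person_key(person_id)]
    sets["persons_by:department:" + record["department"]].discard(person_id)
    sets["persons_by:department:" + new_department].add(person_id)
    record["department"] = new_department   # the person's own key is untouched


save_person(12321, "Ada", "engineering")
save_person(43432, "Grace", "engineering")
move_person(12321, "research")
```

Moving a person only touches the index sets; every reference to `persons:12321` elsewhere stays valid.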
You mentioned two problems with this approach, and I will try to address them
What if the id has a colon?
It is of course possible, but your application's design should prevent it. It's best not to allow special characters in identifiers, because they will be used across multiple systems. For example, the identifier will very likely be a part of a URL, and the colon is a reserved character even in URLs.
If you really must allow special characters in your identifier, you would have to write a small wrapper in your code that encodes the special characters. URL encoding is perfectly capable of handling this.
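A minimal sketch of such a wrapper in Python, using the standard library's percent-encoding. The function names `make_key` and `parse_key` are hypothetical:

```python
from urllib.parse import quote, unquote


def make_key(namespace: str, raw_id: str) -> str:
    """Build 'namespace:id' with every reserved character in the id escaped."""
    encoded = quote(raw_id, safe="")   # safe="" so ':' and '/' are escaped too
    return namespace + ":" + encoded


def parse_key(key: str) -> tuple[str, str]:
    """Split on the first colon only; the id part contains none after encoding."""
    namespace, encoded = key.split(":", 1)
    return namespace, unquote(encoded)
```

For instance, `make_key("person", "a:b")` yields `person:a%3Ab`, so the separator stays unambiguous.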
Size Efficiency
There is a cost to long keys, however it isn't too much. In general, you should worry about the data size of your values rather than the keys. If you think keys are consuming too much memory, profile the database using a tool like redis-rdb-tools.
If you do determine that key size is a problem and want to save the memory, you can write a small wrapper that rewrites the keys using an alias.
I'm working on a project that I want to be as flexible and scalable as possible from the beginning. A problem I'm concerned about is one best described by Joshua Schachter in Founders at Work, who noted it as one detail he wishes he had planned for ahead of time.
Scaling past one machine, one database, is very challenging, even with replication. The tools that are there are not quite right.
For example, when you add things to a table and it numbers them, that means you can't have a second machine also adding to them because the numbers will collide. So what do you do? You have to come up with some completely different way to do it.
Do you have a central server that hands out number sets, or do you come up with something that's not numbers? Do you use random numbers and hope they never collide? Whatever it is, auto-assigned IDs just don't fly.
Has anyone here faced this problem? What are ways to move beyond auto-incremented IDs, or is there a way to have them scale with multiple servers?
Use GUID/UUID (globally/universally unique identifier). In theory it's guaranteed to be unique across multiple machines.
With GUIDs, your chances of collision are astronomically low.
It's also possible to have (what we called) SmartGUIDs (usually called COMB GUIDS - see this analysis, particularly page 7) where you can encode a timestamp within the GUID, so you get record creation date information "for free" - so you can save a timestamp column for record creation datetime - which gets back some of what you lost on moving from 32-bit integer to 128-bit GUID. These can also be guaranteed to be monotonic, unlike regular GUIDs, which can be useful for clustered indexes and for sorting.
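To make the timestamp-in-GUID idea concrete, here is a minimal sketch in Python. The byte layout (10 random bytes followed by a 6-byte millisecond timestamp) is an assumption chosen for illustration, not the layout from the linked analysis, and the function names are hypothetical:

```python
import os
import time
import uuid


def comb_guid(now_ms=None):
    """Return a 128-bit id: 10 random bytes followed by a 6-byte millisecond timestamp."""
    if now_ms is None:
        now_ms = time.time_ns() // 1_000_000
    return uuid.UUID(bytes=os.urandom(10) + now_ms.to_bytes(6, "big"))


def created_ms(guid):
    """Read the embedded creation time back out of the id."""
    return int.from_bytes(guid.bytes[10:], "big")
```

Whether such values actually sort by creation time depends on how the database compares the 16 bytes, so the position of the timestamp would need to match the target engine's ordering rules.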
You can also use composite keys with some kind of server/db ID with a regular auto-increment identity or auto-number.
I have a situation where I need to store a general piece of data (could be an int, float, or string) in my database, but I don't know ahead of time which it will be. I need a table (or less preferably tables) to store this unknown typed data.
What I think I am going to do is have a column for each data type, only use one for each record and leave the others NULL. This requires some logic above the database, but this is not too much of a problem because I will be representing these records in models anyway.
Basically, is there a best practice way to do something like this? I have not come up with anything that is less of a hack than this, but it seems like this is a somewhat common problem. Thanks in advance.
EDIT: Also, is this considered 3NF?
You could easily do that if you used SQLite as a database backend :
Any column in a version 3 database, except an INTEGER PRIMARY KEY column, may be used to store any type of value.
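A small sketch of that behaviour using Python's built-in `sqlite3` module. The table and column names are made up for the example; the `value` column is deliberately declared without a type:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'value' has no declared type, so SQLite stores each value as given.
conn.execute("CREATE TABLE setting (name TEXT PRIMARY KEY, value)")
conn.executemany(
    "INSERT INTO setting (name, value) VALUES (?, ?)",
    [("retries", 3), ("ratio", 0.75), ("label", "fast")],
)
rows = conn.execute(
    "SELECT name, value, typeof(value) FROM setting ORDER BY name"
).fetchall()
```

`typeof(value)` reports the storage class per row, so the application can tell an int from a float from a string without a separate type column.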
For other RDBMS systems, I would go with Philip's solution.
Note that in my line of software (business applications), I cannot think of any situation where this kind of requirement would be needed (a value with an unknown datatype). Unless the domain model was flawed, of course... I can imagine that other lines of software may incur different practices, but I suggest that you consider rethinking your overall design.
If your application can reliably convert datatypes, you might consider a single column solution based on a variable-length binary column, with a second column to track original data type. (I did a very small routine based on this once before, and it worked well enough.) Testing would show if conversion is more efficiently handled on the application or database side.
If I were to do this I would choose either your method, or I would cast everything to string and use only one column. Of course there would be another column with the type (which would probably be useful for the first method too).
For faster code I would probably go with your method.
I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea book mark. For example, the bookmark for http://stackoverflow/ is
http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d
where e4a42d992025b928a586b8bdc36ad38d is the md5 hash of http://stackoverflow/.
Does anybody have views on the pros and cons of this approach?
For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.
I'd welcome any thoughts on whether this sounds sensible, or whether there other (better?) ways of doing this.
It's perfectly fine.
Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).
It's such an improbable chance that you're a trillion times more likely to get your data messed up due to an undetected hardware failure than due to an actual collision.
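The figure above can be sanity-checked with the usual birthday approximation. This is a back-of-the-envelope sketch, assuming MD5 behaves like a uniformly random 128-bit function:

```python
import math

HASH_BITS = 128                          # MD5 digest size
RATE_PER_SECOND = 6e9                    # hashing rate quoted above
SECONDS_PER_YEAR = 365.25 * 24 * 3600

# Birthday approximation: p = 1 - exp(-n^2 / (2N)), so p = 0.5 at n = sqrt(2 N ln 2).
n_for_half = math.sqrt(2 * 2**HASH_BITS * math.log(2))
years_needed = n_for_half / RATE_PER_SECOND / SECONDS_PER_YEAR
```

Under that assumption `years_needed` comes out at a little over a century, the same order of magnitude as the claim.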
Even though there is a known collision attack against MD5, intentional malicious collisions are currently impossible against hashed URLs.
The type of collision you'd need to intentionally collide with a hash of another URL is called a pre-image attack. There are no known pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined well-funded attacker can't compute a URL that would hash to a hash of any existing URL in your database.
The only known collision attack against MD5 is not useful for attacking URL-like keys. It works by generating a pair of binary blobs that collide only with each other. The blobs will be relatively long, contain NUL and other unprintable bytes, so they're extremely unlikely to resemble anything like a URL.
After browsing Stack Overflow a little more I found an earlier question, Advantages and disadvantages of GUID / UUID database keys, which covers much of this ground.
Multiple strings can produce the same md5 hash. Primary keys must be unique. So using the hash as the primary key is not good. Better is to use the GUID directly.
Is a GUID suitable for use in a URL? Sure. Here's a GUID (actually, a UUID) I just created using Java: 1ccb9467-e326-4fed-b9a7-7edcba52be84
The url could be:
http://example.com/view?id=1ccb9467-e326-4fed-b9a7-7edcba52be84
It's longish but perfectly usable and achieves what you describe.
Maybe this document is something you want to read:
http://www.hpl.hp.com/techreports/2002/HPL-2002-216.pdf
Often lots of different urls point to the same page.
http://example.com/
example.com
http://www.example.com/
http://example.com/index.html
http://example.com/.
https://example.com/
etc.
This might or might not be a problem for you.
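A quick way to see the effect, sketched in Python with a few of the spellings listed above. The helper name `url_key` is hypothetical:

```python
import hashlib

variants = [
    "http://example.com/",
    "http://www.example.com/",
    "http://example.com/index.html",
    "https://example.com/",
]


def url_key(url: str) -> str:
    """Hex md5 digest of the URL exactly as written."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


keys = {url: url_key(url) for url in variants}
distinct_keys = set(keys.values())
```

Every spelling hashes to a different key, so the same page would be stored several times unless the URLs are normalised before hashing.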
MD5 is considered deprecated - at least for cryptographic purposes, but I would suggest only using md5 for backwards compatibility with existing stuff. You should have a good reason to go with md5 when we do have other hash algos out there that aren't (at least yet) broken.
Problems I see with the approach:
Duplicate objects, because the url identifier is different
(As arend mentioned)
URLs changing
The latter being the one that might be important - this could be done as simply as a remove and an add. That is, if these ids are never visible/storable outside the database. (Like as a component of a URL.)
I guess these won't be a problem for DOIs.
How would it work with a non-autonumber integer id setup, but where the offline inserter agent creates the numbers? (Can use a dedicated range of numbers, maybe?)
Might have a problem with duplication should two users independently add the same url?
An md5 hash is almost unique, but not totally unique, so don't use it as a primary key. It is deprecated for cryptographic use. The chance of a key collision is small, but if you have a pretty big database with billions of rows, there is still some chance of collision. If you insist on using a hash as the primary key, use a better hash. You cannot use non-unique values for a primary key.
If you have pretty big table, don't use it. If you have small table, you might use it, but not recommended.
I've always preferred to use long integers as primary keys in databases, for simplicity and (assumed) speed. But when using a REST or Rails-like URL scheme for object instances, I'd then end up with URLs like this:
http://example.com/user/783
And then the assumption is that there are also users with IDs of 782, 781, ..., 2, and 1. Assuming that the web app in question is secure enough to prevent people entering other numbers to view other users without authorization, a simple sequentially-assigned surrogate key also "leaks" the total number of instances (older than this one), in this case users, which might be privileged information. (For instance, I am user #726 in stackoverflow.)
Would a UUID/GUID be a better solution? Then I could set up URLs like this:
http://example.com/user/035a46e0-6550-11dd-ad8b-0800200c9a66
Not exactly succinct, but there's less implied information about users on display. Sure, it smacks of "security through obscurity" which is no substitute for proper security, but it seems at least a little more secure.
Is that benefit worth the cost and complexity of implementing UUIDs for web-addressable object instances? I think that I'd still want to use integer columns as database PKs just to speed up joins.
There's also the question of in-database representation of UUIDs. I know MySQL stores them as 36-character strings. Postgres seems to have a more efficient internal representation (128 bits?) but I haven't tried it myself. Anyone have any experience with this?
Update: for those who asked about just using the user name in the URL (e.g., http://example.com/user/yukondude), that works fine for object instances with names that are unique, but what about the zillions of web app objects that can really only be identified by number? Orders, transactions, invoices, duplicate image names, stackoverflow questions, ...
I can't speak to the web side of your question. But UUIDs are great for n-tier applications. PK generation can be decentralized: each client generates its own PK without risk of collision.
And the speed difference is generally small.
Make sure your database supports an efficient storage datatype (16 bytes, 128 bits).
At the very least you can encode the uuid string in base64 and use char(22).
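A minimal sketch of the 22-character encoding in Python. It uses the URL-safe base64 alphabet and strips the padding, which is one possible choice rather than the only one; the function names are hypothetical:

```python
import base64
import uuid


def uuid_to_short(u: uuid.UUID) -> str:
    """Encode the 16 raw bytes as unpadded URL-safe base64."""
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")


def short_to_uuid(s: str) -> uuid.UUID:
    """Restore the two stripped padding characters and decode."""
    return uuid.UUID(bytes=base64.urlsafe_b64decode(s + "=="))


original = uuid.UUID("035a46e0-6550-11dd-ad8b-0800200c9a66")
short = uuid_to_short(original)
```

Sixteen bytes always encode to 24 base64 characters including two padding characters, hence the fixed length of 22 once the padding is removed.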
I've used them extensively with Firebird and do recommend.
For what it's worth, I've seen a long running stored procedure (9+ seconds) drop to just a few hundred milliseconds of run time simply by switching from GUID primary keys to integers. That's not to say displaying a GUID is a bad idea, but as others have pointed out, joining on them, and indexing them, by definition, is not going to be anywhere near as fast as with integers.
I can tell you that in SQL Server, if you use a uniqueidentifier (GUID) datatype and use the NEWID() function to create values, you will get horrible fragmentation because of page splits. The reason is that the values NEWID() generates are not sequential. SQL Server 2005 added the NEWSEQUENTIALID() function to remedy that.
One way to still use both a GUID and an int is to have a table that maps each GUID to an int. The GUID is used externally, but the int is used internally in the DB.
for example
457180FB-C2EA-48DF-8BEF-458573DA1C10 1
9A70FF3C-B7DA-4593-93AE-4A8945943C8A 2
1 and 2 will be used in joins and the guids in the web app. This table will be pretty narrow and should be pretty fast to query
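A sketch of such a mapping table, using Python's built-in `sqlite3` purely for illustration (the answer is about SQL Server). The table, column, and function names are assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE public_id (id INTEGER PRIMARY KEY, guid TEXT NOT NULL UNIQUE)"
)


def register(guid: str) -> int:
    """Store a GUID and return the integer used internally for joins."""
    cur = conn.execute("INSERT INTO public_id (guid) VALUES (?)", (guid,))
    return cur.lastrowid


def internal_id(guid: str):
    """Translate a GUID from a URL into the internal integer, or None."""
    row = conn.execute(
        "SELECT id FROM public_id WHERE guid = ?", (guid,)
    ).fetchone()
    return row[0] if row else None


first = register("457180FB-C2EA-48DF-8BEF-458573DA1C10")
second = register("9A70FF3C-B7DA-4593-93AE-4A8945943C8A")
```

The UNIQUE constraint gives the GUID column its own index, so the lookup at the edge of the application is a single indexed read.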
Why couple your primary key with your URI?
Why not have your URI key be human readable (or unguessable, depending on your needs), and your primary index integer based, that way you get the best of both worlds. A lot of blog software does that, where the exposed id of the entry is identified by a 'slug', and the numeric id is hidden away inside of the system.
The added benefit here is that you now have a really nice URL structure, which is good for SEO. Obviously for a transaction this is not a good thing, but for something like stackoverflow, it is important (see URL up top...). Getting uniqueness isn't that difficult. If you are really concerned, store a hash of the slug inside a table somewhere, and do a lookup before insertion.
edit: Stackoverflow doesn't quite use the system I describe, see Guy's comment below.
Rather than URLs like this:
http://example.com/user/783
Why not have:
http://example.com/user/yukondude
Which is friendlier to humans and doesn't leak that tiny bit of information?
You could use an integer which is related to the row number but is not sequential. For example, you could take the 32 bits of the sequential ID and rearrange them with a fixed scheme (for example, bit 1 becomes bit 6, bit 2 becomes bit 15, etc..).
This is a reversible (one-to-one) mapping, so you can be sure that two different IDs will always have different encoded values.
It would obviously be easy to decode, if one takes the time to generate enough IDs and get the schema, but, if I understand correctly your problem, you just want to not give away information too easily.
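A minimal sketch of the bit-rearranging idea in Python. The particular permutation is arbitrary; it is derived from a fixed seed here only to keep the example short, and a real deployment would more likely hard-code the table. All names are hypothetical:

```python
import random

BITS = 32
_rng = random.Random(2024)          # fixed seed; hard-code the table in practice
PERMUTATION = list(range(BITS))
_rng.shuffle(PERMUTATION)           # PERMUTATION[i] = destination of bit i


def scramble(value: int) -> int:
    """Move every bit of a 32-bit id to its permuted position."""
    out = 0
    for src, dst in enumerate(PERMUTATION):
        if (value >> src) & 1:
            out |= 1 << dst
    return out


def unscramble(value: int) -> int:
    """Inverse of scramble: move every bit back to where it came from."""
    out = 0
    for src, dst in enumerate(PERMUTATION):
        if (value >> dst) & 1:
            out |= 1 << src
    return out
```

Because a permutation of bit positions is one-to-one, distinct ids always produce distinct public values, and `unscramble` recovers the original row id.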
We use GUIDs as primary keys for all our tables as it doubles as the RowGUID for MS SQL Server Replication. Makes it very easy when the client suddenly opens an office in another part of the world...
I don't think a GUID gives you many benefits. Users hate long, incomprehensible URLs.
Create a shorter ID that you can map to the URL, or enforce a unique user name convention (http://example.com/user/brianly). The guys at 37Signals would probably mock you for worrying about something like this when it comes to a web app.
Incidentally you can force your database to start creating integer IDs from a base value.
It also depends on what you care about for your application. For n-tier apps GUIDs/UUIDs are simpler to implement and are easier to port between different databases. To produce integer keys, some databases support a sequence object natively and some require custom construction of a sequence table.
Integer keys probably (I don't have numbers) provide an advantage for query and indexing performance as well as space usage. Direct DB querying is also much easier using numeric keys, less copy/paste as they are easier to remember.
I work with a student management system which uses UUID's in the form of an integer. They have a table which hold the next unique ID.
Although this is probably a good idea from an architectural point of view, it makes working with it on a daily basis difficult. Sometimes there is a need to do bulk inserts, and having a UUID makes this very difficult, usually requiring writing a cursor instead of a simple SELECT INTO statement.
I've tried both in real web apps.
My opinion is that it is preferable to use integers and have short, comprehensible URLs.
As a developer, it feels a little bit awful seeing sequential integers and knowing that some information about total record count is leaking out, but honestly - most people probably don't care, and that information has never really been critical to my businesses.
Having long ugly UUID URLs seems to me like much more of a turn off to normal users.
I think that this is one of those issues that cause quasi-religious debates, and it's almost futile to talk about. I would just say use what you prefer. In 99% of systems it will not matter which type of key you use, so the benefits (stated in the other posts) of using one sort over the other will never be an issue.
I think using a GUID would be the better choice in your situation. It takes up more space but it's more secure.
YouTube uses 11-character IDs in base64 encoding, which offers 64^11 possibilities, and they are usually pretty manageable to write. I wonder if that would offer better performance than a full-on UUID. A UUID converted to base64 would be double the size, I believe.
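The sizes can be checked with a few lines of Python. Eleven characters over a 64-symbol alphabet give 64^11 combinations, and each base64 character carries 6 bits:

```python
import math

youtube_ids = 64 ** 11                 # 11 characters, 64 symbols each
uuid_values = 2 ** 128                 # a UUID is 128 bits
chars_for_uuid = math.ceil(128 / 6)    # base64 characters needed for 128 bits
```

So an 11-character ID covers 2^66 values, while a full UUID needs 22 base64 characters, twice the length.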
More information can be found here: https://www.youtube.com/watch?v=gocwRvLhDf8
Pros and Cons of UUID
Note: uuid_v7 is time based uuid instead of random. So you can
use it to order by creation date and solve some performance issues
with db inserts if you do really many of them.
Pros:
can be generated on api level (good for distributed systems)
hides count information about entity
doesn't have limit 2,147,483,647 as 32-bit int
removes a layer of errors related to passing one entity id userId: 25 to get another bookId: 25 accidentally
more friendly graphql usage as ID key
Cons:
128-bit instead of 32-bit int (slightly bigger size in db and ~40% bigger index, around ~30MB for 1 million rows), should be a minor concern
can't be sorted by creation (can be solved with uuid_v7)
non-time-ordered UUID versions such as UUIDv4 have poor database index locality (can be solved with uuid_v7)
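To show why a time-based UUID sorts by creation time, here is a hand-rolled sketch of the version 7 layout as described in RFC 9562 (48-bit millisecond timestamp, 4-bit version, 12 random bits, 2-bit variant, 62 random bits). The function name is hypothetical, and it is written by hand so it does not depend on library support for version 7:

```python
import os
import time
import uuid


def uuid7_sketch(ms=None):
    """Build a UUID whose top 48 bits are a Unix millisecond timestamp."""
    if ms is None:
        ms = time.time_ns() // 1_000_000
    rand = int.from_bytes(os.urandom(10), "big")   # 80 random bits
    rand_a = rand >> 68                            # top 12 bits
    rand_b = rand & ((1 << 62) - 1)                # low 62 bits
    value = (
        ((ms & ((1 << 48) - 1)) << 80)   # timestamp
        | (0x7 << 76)                    # version 7
        | (rand_a << 64)
        | (0b10 << 62)                   # RFC variant
        | rand_b
    )
    return uuid.UUID(int=value)
```

Because the timestamp occupies the most significant bits, ids created in later milliseconds compare greater, which is what keeps index inserts close together.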
URL usage
Depending on the app, you may or may not care about the URL. If you don't care, just use the uuid as is; it's fine.
If you care, then you will need to decide on a URL format.
The best-case scenario is to use a unique slug, if you are OK with never changing it:
http://example.com/sale/super-duper-phone
If your URL is generated from the title and you want to change the slug when the title changes, there are a few options. Use it as is and query by uuid (the slug is just decoration):
http://example.com/book/035a46e0-6550-11dd-ad8b-0800200c9a66/new-title
Convert it to base64url:
you can get uuid back from AYEWXcsicACGA6PT7v_h3A
AYEWXcsicACGA6PT7v_h3A - 22 characters
035a46e0-6550-11dd-ad8b-0800200c9a66 - 36 characters
http://example.com/book/AYEWXcsicACGA6PT7v_h3A/new-title
Generate a unique short 11 chars length string just for slug usage:
http://example.com/book/icACEWXcsAY-new-title
http://example.com/book/icACEWXcsAY/new-title
If you don't want a uuid or short id in the URL and want only a slug, but do care about SEO and user bookmarks, you will need to redirect all requests from
http://example.com/sale/phone-1-title
to
http://example.com/sale/phone-1-title-updated
This adds the complexity of managing slug history, adding a fallback to that history for all queries where the slug is used, and redirecting if the slugs don't match.
As long as you use a DB system with efficient storage, HDD is cheap these days anyway...
I know GUIDs can be a b*tch to work with sometimes and come with some query overhead; however, from a security perspective they are a savior.
Thinking security by obscurity, they fit well when forming obscure URIs, and when building normalised DBs with table-, record- and column-defined security you can't go wrong with GUIDs. Try doing that with integer-based ids.