What is the best one way permutation function I could use to digest an e-mail so I can use it as a primary key without storing personal data?
I'm getting my first F2P game ready: a simple yet (hopefully) addictive 2D casual puzzler based on aiming mechanics. It's made with Unity and will be released on Android very soon.
In order for the player to keep the same data across different devices, I have an SQL table with the device e-mail as the primary key, then another string as the savegame data.
But I don't want to store the user e-mail for privacy reasons.
So I thought of digesting it with some function that would use the original e-mail to generate a new string that:
is unique (will never collide with another string generated from a different e-mail address)
is not decipherable (there should be no way to obtain the original e-mail from the digested string, or at least it should be hard enough)
This way I could still use the Android device e-mail to retrieve the savegame data, without storing personal data from the player.
As far as I've researched, the solution seems to be called a one way permutation function. The problem is that I can't seem to find an appropriate function on the internet; instead, all answers seem to be plagued with solutions for password hashing, which are very interesting (salting, MD5, SHAXXX...) but don't meet my first requirement of no collisions.
Thank you in advance for any answer on this topic.
What you need is a cryptographic hash function such as SHA-256. Such functions are designed to be collision resistant; Git, for example, uses the older SHA-1. Most languages and platforms support these functions; just search for "Android SHA-256" along with your language of choice.
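As a minimal sketch of the idea in Python (the function name email_key and the normalisation step are assumptions for illustration, not part of the question), the e-mail can be digested with SHA-256 from the standard hashlib module and the hex digest used as the primary key:

```python
import hashlib

def email_key(email: str) -> str:
    """Return a fixed-length hex digest to use as the lookup key."""
    # Assumption: trimming whitespace and lowercasing, so that
    # " Player@Example.com" and "player@example.com" find the same row.
    normalized = email.strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

key = email_key("player@example.com")
print(len(key))  # 64 hex characters (256 bits)
```

Note that e-mail addresses are guessable, so anyone holding the table could hash candidate addresses and compare; mixing in a server-side secret would be one way to make that harder.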
One option is to append a creation timestamp.
Update: If SHA-256 does not provide sufficient collision resistance for your needs, consider a GUID. From RFC 4122: "A UUID is 128 bits long, and can guarantee uniqueness across space and time." Of course, you need to find a good implementation.
I have a doubt regarding the exposure of internal database primary keys.
I have decided to use UUIDs in place of auto-increment longs (see here for details). This way, among other things, people cannot discover the relative size of my data or their growth over time.
Now, the UUID doesn't provide any internal information but it is not very URL friendly, although it is URL safe. Furthermore if long PKs shouldn't be exposed, then UUIDs shouldn't either.
Usually to make UUIDs more user friendly, people base64 encode them.
Example:
- UUID: 7b3149e7-bdab-4895-b659-a5f5b0d0
- base64: ezFJ572rSJW2WQAApfWw0A
My point is: anyone could still take that base64 string from the URL and decode it to obtain the original UUID. This means that even in this case the UUIDs would end up being exposed as well.
Should I use another type of encoding? Is there something already well known, or should I create my own custom encoding? If so, should I follow any guidelines?
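The round trip is easy to demonstrate. The sketch below (helper names are hypothetical) encodes a random UUID as URL-safe base64 without padding and then recovers it, which shows the encoding shortens the string but hides nothing:

```python
import base64
import uuid

def encode_uuid(u: uuid.UUID) -> str:
    # 16 raw bytes become 24 base64 characters, 2 of them padding.
    return base64.urlsafe_b64encode(u.bytes).rstrip(b"=").decode("ascii")

def decode_uuid(text: str) -> uuid.UUID:
    # Restore the stripped padding before decoding.
    padded = text + "=" * (-len(text) % 4)
    return uuid.UUID(bytes=base64.urlsafe_b64decode(padded))

original = uuid.uuid4()
short = encode_uuid(original)
print(short, decode_uuid(short) == original)
```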
Thank you
At first glance, you could add a small degree of secrecy to those identifiers by running them through a one-way hash function such as SHA-2 (which is a cryptographic function, not an encoding). In practice, this buys you no real security advantage.
If you rely only on object reference IDs for access control and try to keep them secret, I suggest you rethink your access control and authorization model.
It is good to have random, non-guessable, collision-free object reference IDs. However, relying on the secrecy of a reference ID for security is a big flaw (the old OWASP Top 10 called this Insecure Direct Object References, and the OWASP Top 10 2017 covers it under Broken Access Control). You need to consider the full AAA chain: authentication, authorization, and audit/accountability. Access should rely on a random, unique token with a short validity period, tied to a subject, which the system then uses to decide that subject's authorization and access levels and to permit interaction only with the objects the subject is entitled to.
The reason you aren't supposed to expose PKs is that they may (a) leak information and (b) allow people to guess other values. Neither is true of UUIDs (at least v3/4/5), which is one of the main reasons to use them in the first place. The human factor you mention is why so many folks use base64 (or other) encoding; it's not for security.
That said, you should never rely on URL secrecy as security; there are far too many ways that URLs leak, and your users may even do it intentionally--but they'd be very upset if sending a link to their friend meant that friend had full access to their account.
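To illustrate the point shared by both answers, here is a minimal sketch in which knowing an ID is not enough: every read checks the authenticated user against the object's owner. The in-memory DOCUMENTS dict and the function names are assumptions standing in for a real database and a real authentication layer.

```python
import uuid

DOCUMENTS = {}  # hypothetical stand-in for a database table

def create_document(owner: str, body: str) -> str:
    doc_id = str(uuid.uuid4())
    DOCUMENTS[doc_id] = {"owner": owner, "body": body}
    return doc_id

def read_document(user: str, doc_id: str) -> str:
    doc = DOCUMENTS.get(doc_id)
    # Same error for "missing" and "not yours", so IDs cannot be probed.
    if doc is None or doc["owner"] != user:
        raise PermissionError("not allowed")
    return doc["body"]

doc_id = create_document("alice", "savegame data")
print(read_document("alice", doc_id))
```

With a check like this in place, a leaked or shared URL exposes the identifier but not the data behind it.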
I will release my GAE application in a few months as a closed beta, so that just a few users can use it and I can gather some data and learn where and how to improve it. My idea is to use a key system to let them access the application.
What I want to do:
I want to generate a bunch of keys and store them in the Datastore. When a user comes to the application for the first time, he logs in with his Google account and has to enter a key to activate his account.
My question:
My previous software didn't require license keys or anything similar, so this is a new area for me. Do you think this is a good way to run a closed beta? My second idea was to generate a bunch of keys and validate them algorithmically, the way other popular software does, but I think this is unnecessary and I want to avoid someone being able to write a key generator. My suggestion would be simply to generate and store the keys, then check whether an entered key exists in the Datastore, mark it as used, and activate the account.
How can I generate a lot of valid keys and easily add more later (without duplicates)? I'm thankful for any experience and suggestions.
As a refinement to Ashley's suggestion, if you'd like to generate shorter and/or easier to type IDs, you can generate some random data and encode it using base32:
base64.b32encode(os.urandom(8)).decode('ascii').strip('=')
Make it a bit more readable by inserting hyphens:
'-'.join(base64.b32encode(os.urandom(8)).decode('ascii').strip('=')[5*x:5*(x+1)] for x in range(3))
This gives you codes like the following:
'C6ZVG-NJ6KA-CWE'
Then just store the result in your datastore and hand them out to users. I'd suggest storing the code without the hyphens, and stripping those characters before checking the database. If you want to get really fancy, base32's alphabet is chosen to avoid characters that look similar; you could substitute those characters before you do the check to account for typos.
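A rough sketch of that clean-up step follows. The standard base32 alphabet is A-Z plus 2-7, so the digits 0, 1, 8 and 9 never appear in a valid code; the particular substitutions chosen here (0 to O, 1 to I, 8 to B) and the name normalize_code are assumptions for illustration.

```python
# Digits that cannot occur in base32, mapped to the letters they resemble.
TYPO_MAP = str.maketrans({"0": "O", "1": "I", "8": "B"})

def normalize_code(typed: str) -> str:
    """Strip separators, uppercase, and undo likely look-alike typos."""
    cleaned = typed.replace("-", "").replace(" ", "").upper()
    return cleaned.translate(TYPO_MAP)

print(normalize_code("c6zvg-nj6ka-cwe"))
```

The normalized string is what would be compared against the hyphen-free code stored in the datastore.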
8 bytes of random data gives you 2^64 possible invite codes; if you hand out, say, 2^16 (65,536) of them, an attacker will still have to try 2^48 (about 300 trillion) codes to find a valid one. You can make your codes shorter at the cost of reducing the search space, if you want.
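The arithmetic can be checked directly. This small sketch (the helper name is hypothetical) divides the size of the code space by the number of codes issued, which is roughly how many guesses an attacker needs per valid code, and shows the effect of using fewer random bytes:

```python
def guesses_per_valid_code(random_bytes: int, issued: int) -> int:
    """Size of the code space divided by the number of valid codes."""
    return 2 ** (8 * random_bytes) // issued

issued = 2 ** 16  # 65,536 invite codes handed out

print(f"{guesses_per_valid_code(8, issued):,}")  # 8 random bytes: 2^48
print(f"{guesses_per_valid_code(5, issued):,}")  # 5 random bytes: 2^24
```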
I use UUID for generating random keys:
UUID.randomUUID().toString().replace("-", "");
From the docs: "The UUID is generated using a cryptographically strong pseudo random number generator".
Generate a long list of them in the datastore. Then, when a user arrives at something like yourapp.com/betainvite/blahblahkey, you can simply check whether the key is in the table and whether its rsvp property is null. If the property is already set to the date the key was used, you deny the invite.
You could store the key against your User too, so you can find out who used each one and when.
It is also a good idea to maintain an invited date on the keys; as you hand each one out you can mark it as invited, so you don't double-invite people.
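The flow described above could look roughly like this in Python. The dict named invites is an assumption standing in for the datastore, and uuid.uuid4().hex plays the role of the Java call above with the hyphens removed; the field names are hypothetical.

```python
import uuid
from datetime import datetime, timezone

invites = {}  # hypothetical stand-in for the datastore table

def generate_keys(count: int) -> list:
    """Create new unused keys, skipping any that already exist."""
    created = []
    while len(created) < count:
        key = uuid.uuid4().hex
        if key not in invites:
            invites[key] = {"used_at": None, "used_by": None}
            created.append(key)
    return created

def redeem(key: str, user: str) -> bool:
    """Activate the account only if the key exists and is unused."""
    record = invites.get(key)
    if record is None or record["used_at"] is not None:
        return False
    record["used_at"] = datetime.now(timezone.utc)
    record["used_by"] = user
    return True

first = generate_keys(3)[0]
print(redeem(first, "alice"), redeem(first, "bob"))
```

In a real datastore, the check and the update would need to happen in one transaction so that two simultaneous requests cannot both redeem the same key.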
I'm creating a simple license key system to "keep honest people honest". I don't care about especially stringent cryptography.
If they get too annoyed with the demo limitations, they go to my registration website, pay, and give me their email. I give them a license key.
I'm keeping things really simple, so:
license_key = md5(email + "Salt_String");
I have PHP and C# functions run that same algorithm and get the same key.
The problem is that the output of these functions is a 32-character string like:
A69761CF99316358D04771C5ECFCCDC5
Which is potentially hard to remember/type. Yes, I know about copy/paste, but I want to make it REALLY easy for all paying customers to unlock the software.
Should I somehow convert this long string into something shorter?
Let's say I use only the first 6 characters, so: A69761
There are obviously way more cryptographic collisions in that, but will it matter at all in practical use?
Any other ideas to make the thing more human readable/typeable?
Keeping 6-10 characters will be enough: the user will not be able to guess the code anyway, and it will be easy to type in.
It would also be a good idea to register each license on your server, so that you can check that the user really is honest and didn't give the license key to another person.
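To put rough numbers on the truncation question, here is a sketch in Python that mirrors the md5(email + salt) scheme from the question and uses the standard birthday approximation to estimate how likely it is that any two customers end up with the same truncated key. The function names are assumptions for illustration.

```python
import hashlib
import math

def license_key(email: str, length: int = 6, salt: str = "Salt_String") -> str:
    digest = hashlib.md5((email + salt).encode("utf-8")).hexdigest()
    return digest[:length].upper()

def shared_key_probability(customers: int, length: int = 6) -> float:
    """Birthday approximation: chance that at least two keys coincide."""
    space = 16 ** length  # number of possible hex strings of this length
    return -math.expm1(-customers * (customers - 1) / (2 * space))

print(license_key("customer@example.com"))
for n in (100, 1000, 10000):
    print(n, round(shared_key_probability(n), 4))
```

Because the key is recomputed from the e-mail at validation time, two customers sharing a key is mostly harmless; what truncation really changes is the chance that a random guess for a given e-mail succeeds, which is 1 in 16^6 (about 16.8 million) for six characters.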
In my experience, asking the user to type or copy/paste a 30-character code indeed leads to frustrated customers. It's not that it's so difficult. It's simply a hurdle that people don't care for.
The solution I've used for my business is to have separate trial and purchased downloads. To get their licensed copy, the customer types in their email address and a short user ID on the download form. Entering only the email automatically resends the user ID. You didn't ask about this, but a system to automatically look up whatever code the customer needs is even more important than having a simple system. The download system looks up the user's details in the database and serves a SetupSomeProductCustomerName.exe that has the user's license embedded in it. This setup installs the customer's licensed copy without requiring any further identification or server connections.
This system has worked really well for us. The customer has only one file to back up and no serial numbers to lose to make sure they can reinstall the software in the future.
That said, if you prefer to use a system using a one-way hash, simply use an algorithm that generates a smaller hash. E.g. CRC-32 results in 8 hexadecimal digits.
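As a sketch of that shorter option in Python, zlib.crc32 from the standard library produces a 32-bit checksum that formats to 8 hexadecimal digits. The function name and the reuse of the question's "Salt_String" are assumptions for illustration.

```python
import zlib

def crc32_key(email: str, salt: str = "Salt_String") -> str:
    # Mask to an unsigned 32-bit value, then format as 8 hex digits.
    checksum = zlib.crc32((email + salt).encode("utf-8")) & 0xFFFFFFFF
    return f"{checksum:08X}"

print(crc32_key("customer@example.com"))
```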
There's no point in the hash being cryptographically secure. A cracker will simply walk through your code, copy the entire block of code that mutates the email address into the license key, and paste that into their keygen. Then they can generate license keys for any email address. They can do that regardless of how complex your hashing algorithm is.
If you want to prevent this, you need to use public key encryption, which results in keys that are far too long to type in. If you go that route, you'll either need to annoy your customers with long keys to paste in or separate key files, or use the personalized download system I described above.
I'm working on a multi-tenant application that will be implementing service APIs. I don't want to expose the default auto increment key for security reasons and data migration/replication concerns so I'm looking at alternative keys. GUID/UUID is an obvious choice but they make the URL a bit long and while reading an article about them I saw that Google uses "truncated SHA1" for their URL IDs.
How does this work? It's my understanding that you hash part/all of the object contents to come up with the key. My objects can change over time so hashing the whole object wouldn't work since the key will need to remain the same over time. Could I implement UUIDs and hash those? What limitations/issues are there in using SHA1 for keys (e.g. max records, collision, etc.)?
I've been searching Google but haven't come up with the right search query.
/* edit: more information about environment */
Currently we are a Java shop using Spring/Hibernate with MySQL on the back end. We are in the process of switching core development to Grails, which is where this idea will be implemented.
I thought about a similar problem some time ago and ended up implementing Blowfish in the URL. It's not super safe but gives much shorter URLs than for instance SHA256 and also it's completely collision free.
That's actually a pretty solid idea, though it might make key lookups a little tough (unless you hashed the key and kept it inline in the table, I suppose). You'd just have to hash every key you use, though if you're auto-incrementing, that's no problem. You wouldn't even need a GUID - you could even just hash the key, since it's a one-way operation and can't be easily reversed. You could even "salt" your key before you hash it, which would make it virtually unbreakable by making the key unpredictable.
There is a concern about collision, but with SHA1, your hash is 160 bits, or has 1.46 × 10^48 unique values, which should be enough to support some fraction of that many unique keys without worrying about a collision. If you've got enough keys that you're still worried about a collision, you can upgrade to something like SHA256 or even SHA512, which should be plenty long as to avoid any reasonable concern about a collision.
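A small sketch of the salted-hash idea, with the collision estimate made explicit. SECRET_SALT and both function names are hypothetical; the row estimate uses the standard birthday approximation, which shows how quickly the safe number of rows shrinks when the hash is truncated.

```python
import hashlib
import math

SECRET_SALT = b"hypothetical-secret-salt"  # assumption: kept server-side

def public_id(record_id: int, hex_chars: int = 16) -> str:
    """Salted SHA-1 of the auto-increment key, truncated for the URL."""
    digest = hashlib.sha1(SECRET_SALT + str(record_id).encode("ascii")).hexdigest()
    return digest[:hex_chars]

def rows_for_collision_chance(bits: int, chance: float = 0.5) -> float:
    """Approximate row count at which a collision has the given chance."""
    return math.sqrt(2 * (2 ** bits) * math.log(1 / (1 - chance)))

print(public_id(42))
print(f"{float(2 ** 160):.3g}")                   # size of the full SHA-1 space
print(f"{rows_for_collision_chance(160):.3g}")    # full 160-bit hash
print(f"{rows_for_collision_chance(64):.3g}")     # truncated to 16 hex chars
```

Since the hash cannot be reversed, the public ID would be stored in its own indexed column next to the real key so that lookups stay cheap, and a unique constraint on that column would catch the unlikely collision.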
If you need some hashing code, post the language you're using and I can find some, though there's plenty available online if you know what you're looking for.
I'm building a database that will store information on a range of objects (such as scientific papers, specimens, DNA sequences, etc.) that all have a presence online and can be identified by a URL, or an identifier such as a DOI. Using these GUIDs as the primary key for the object seems a reasonable idea, and I've followed delicious and Connotea in using the md5 hash of the GUID. You'll see the md5 hash in your browser status bar if you mouse over the edit or delete buttons in a delicious or Connotea book mark. For example, the bookmark for http://stackoverflow/ is
http://delicious.com/url/e4a42d992025b928a586b8bdc36ad38d
where e4a42d992025b928a586b8bdc36ad38d is the md5 hash of http://stackoverflow/.
Does anybody have views on the pros and cons of this approach?
For me an advantage of this approach (as opposed to using an auto incrementing primary key generated by the database itself) is that I have to do a lot of links between objects, and by using md5 hashes I can store these links externally in a file (say, as the result of data mining/scraping), then import them in bulk into the database. In the same way, if the database has to be rebuilt from scratch, the URLs to the objects won't change because they use the md5 hash.
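A minimal sketch of that workflow in Python: because the key is computed from the identifier alone, links between objects can be turned into key pairs without consulting the database. The example.org URLs and the function name object_key are hypothetical.

```python
import hashlib

def object_key(guid: str) -> str:
    """md5 hex digest of the object's URL or identifier."""
    return hashlib.md5(guid.encode("utf-8")).hexdigest()

# Hypothetical links found by scraping: (citing object, cited object).
links = [
    ("http://example.org/paper/1", "http://example.org/specimen/42"),
    ("http://example.org/paper/1", "http://example.org/sequence/7"),
]

# Rows ready for bulk import; no database round trip is needed.
rows = [(object_key(src), object_key(dst)) for src, dst in links]
for row in rows:
    print(",".join(row))
```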
I'd welcome any thoughts on whether this sounds sensible, or whether there other (better?) ways of doing this.
It's perfectly fine.
Accidental collision of MD5 is impossible in all practical scenarios (to get a 50% chance of collision you'd have to hash 6 billion URLs per second, every second, for 100 years).
It's such an improbable chance that you're a trillion times more likely to get your data messed up due to an undetected hardware failure than due to an actual collision.
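The figure above can be sanity-checked with the usual birthday approximation, p ≈ 1 - exp(-n² / 2N), where N = 2^128 is the number of possible MD5 values. This is only a rough sketch; the function name and the choice of a 365.25-day year are assumptions.

```python
import math

HASH_BITS = 128
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60

def collision_probability(items: float, bits: int = HASH_BITS) -> float:
    """Birthday approximation for at least one accidental collision."""
    # expm1 keeps precision when the probability is extremely small.
    return -math.expm1(-(items ** 2) / (2 * 2 ** bits))

hashed = 6e9 * SECONDS_PER_YEAR * 100  # 6 billion URLs/s for 100 years
print(f"{hashed:.3g} URLs -> {collision_probability(hashed):.2f}")
print(f"1e9 URLs -> {collision_probability(1e9):.3g}")
```

Under this approximation the 100-year scenario lands somewhat below one half, the same order of magnitude as the claim, while a database of a billion URLs has a collision chance that is vanishingly small.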
Even though there is a known collision attack against MD5, intentional malicious collisions are currently impossible against hashed URLs.
The type of collision you'd need to intentionally collide with a hash of another URL is called a pre-image attack. There are no known pre-image attacks against MD5. As of 2017 there's no research that comes even close to feasibility, so even a determined well-funded attacker can't compute a URL that would hash to a hash of any existing URL in your database.
The only known collision attack against MD5 is not useful for attacking URL-like keys. It works by generating a pair of binary blobs that collide only with each other. The blobs will be relatively long, contain NUL and other unprintable bytes, so they're extremely unlikely to resemble anything like a URL.
After browsing Stack Overflow a little more I found an earlier question, "Advantages and disadvantages of GUID / UUID database keys", which covers much of this ground.
Multiple strings can produce the same md5 hash. Primary keys must be unique. So using the hash as the primary key is not good. Better is to use the GUID directly.
Is a GUID suitable for use in a URL? Sure. Here's a GUID (actually, a UUID) I just created using Java: 1ccb9467-e326-4fed-b9a7-7edcba52be84
The url could be:
http://example.com/view?id=1ccb9467-e326-4fed-b9a7-7edcba52be84
It's longish but perfectly usable and achieves what you describe.
Maybe this document is something you want to read:
http://www.hpl.hp.com/techreports/2002/HPL-2002-216.pdf
Often lots of different urls point to the same page.
http://example.com/
example.com
http://www.example.com/
http://example.com/index.html
http://example.com/.
https://example.com/
etc.
This might or might not be a problem for you.
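If it is a problem, one option is to normalize each URL before hashing it. The sketch below uses urllib.parse from the Python standard library; every rule in it (treating http and https as the same, dropping a leading "www.", a trailing dot on the host, the port, and a trailing index.html) is an assumption that may or may not hold for a given site.

```python
import hashlib
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    if "://" not in url:
        url = "http://" + url          # bare "example.com"
    parts = urlsplit(url)
    # hostname is already lowercased and excludes any port.
    host = (parts.hostname or "").rstrip(".")
    if host.startswith("www."):
        host = host[4:]
    path = parts.path or "/"
    if path.endswith("/index.html"):
        path = path[: -len("index.html")]
    # Assumption: scheme differences do not distinguish pages.
    return urlunsplit(("http", host, path, parts.query, ""))

def url_key(url: str) -> str:
    return hashlib.md5(normalize_url(url).encode("utf-8")).hexdigest()

print(normalize_url("https://www.example.com/index.html"))
```

Rules like these are heuristics: two URLs that normalize to the same string can still serve different content, so the rule set should match what is known about the sources being indexed.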
MD5 is considered deprecated, at least for cryptographic purposes, so I would suggest using md5 only for backwards compatibility with existing systems. You should have a good reason to go with md5 when there are other hash algorithms available that aren't (at least yet) broken.
Problems I see with the approach:
- Duplicate objects, because the URL identifier is different (as arend mentioned)
- URLs changing
The latter being the one that might be important - this could be done as simply as a remove and an add. That is, if these ids are never visible/storable outside the database. (Like as a component of a URL.)
I guess these won't be a problem for DOIs.
How would it work with a non-autonumber integer id setup, but where the offline inserter agent creates the numbers? (Can use a dedicated range of numbers, maybe?)
Might have a problem with duplication should two users independently add the same url?
An md5 hash is almost unique, but not totally unique, so don't use it as a primary key. It is also deprecated for cryptographic use. The chance of a key collision is small, but if you have a pretty big database with billions of rows, there is still some chance of collision. If you insist on using a hash as the primary key, use a better hash. You cannot use non-unique values for a primary key.
If you have a pretty big table, don't use it. If you have a small table, you might use it, but it is not recommended.