How safe is the MD5 hashing algorithm if I know part of the original data

I'm planning to use an MD5 hash to confirm a secure operation. Part of the data being hashed is public and the other part is not:
partnerId : fixed 15 chars string (PUBLIC)
amount: int value, from 0 to 6500000 (PUBLIC)
transactionId: string, 5 to 30 chars (PUBLIC)
secure: string, yet to decide how long it needs to be. (PRIVATE)
The resulting hash is md5(partnerId.amount.transactionId.secure);
The secure constant is kept safe on my partner's server and on my server, so in theory we are the only ones who can replicate the hash. But I wonder how long the secure variable needs to be in order to keep the hash safe. How long should it be, in percentage terms, compared to the public part of the hash: 80 private / 20 public? 50 / 50?
Is there maybe an equation to measure this?
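The construction described in the question can be sketched in Python as below. This is only an illustration under stated assumptions: the function names `sign` and `secret_entropy_bits` and the sample values are hypothetical, and the fields are joined by plain concatenation as in the question. The second function is a rough upper bound on how hard the secret is to guess, assuming it is drawn uniformly at random from a fixed alphabet. That bound depends only on the secret's length and alphabet, not on the ratio of private to public characters.

```python
import hashlib
import math


def sign(partner_id: str, amount: int, transaction_id: str, secure: str) -> str:
    """MD5 over the concatenated fields, as described in the question."""
    data = f"{partner_id}{amount}{transaction_id}{secure}"
    return hashlib.md5(data.encode("utf-8")).hexdigest()


def secret_entropy_bits(length: int, alphabet_size: int) -> float:
    """Upper bound on the entropy of a uniformly random secret, in bits."""
    return length * math.log2(alphabet_size)


# Hypothetical sample values, for illustration only.
digest = sign("PARTNER00000001", 1500, "TX-0001", "not-a-real-secret")
print(digest)                       # 32 hex characters
print(secret_entropy_bits(16, 16))  # 16 random hex characters
```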

Related

How can we turn the SHA-256 hash of a passphrase into an EC_KEY private key?

I have a question. I have been practicing C and OpenSSL recently and noticed that this is the common way to create an EC_KEY:
EC_KEY *eckey = EC_KEY_new();
EC_GROUP *ecgroup= EC_GROUP_new_by_curve_name(NID_secp192k1);
int set_group_status = EC_KEY_set_group(eckey,ecgroup);
int gen_status = EC_KEY_generate_key(eckey);
This method generates the EC_KEY from a random integer. Is there any code that takes the SHA-256 hash of a passphrase and makes it the private key of the EC_KEY we just created? I read that an EC_KEY private key has the same format as a SHA-256 hash.
//Example
const char *exam = "somewhere over the rainbow";
unsigned char output[32];
/* SHA256() takes a const unsigned char *, so the string is cast explicitly. */
SHA256((const unsigned char *)exam, strlen(exam), output);

Not directly for that curve.
An ECC private key is actually a random integer less than the order of the base point, or equivalently the order of the group generated by the base point.
Although it is not true for all ECC curves (groups), the X9/Certicom/NIST prime curves were generated so that the generated group order is equal to the curve order (formally, cofactor = 1), and the curve order is always close to the underlying field order which for these curves was chosen very close to 2^N.
Thus a private key for a 256-bit prime curve, like P-256/secp256r1 (commonly used in TLS, and SSH, and some other applications) or secp256k1 (used in Bitcoin and some derivative coins), is almost a random 256-bit string -- close enough that in practice it will work.
Similarly for secp192k1 a random 192-bit string is close enough, and could be generated by taking the first 192 bits of a SHA-256 output (or last, or middle, if you prefer) as long as it was computed on input (your passphrase) having sufficient entropy to provide the desired security.
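As a rough illustration of the point above, the following Python sketch turns a SHA-256 digest into a candidate private-key integer. The function names are hypothetical. The secp256k1 group order is the published curve constant. For the 192-bit case the sketch only shows the truncation step and leaves out the comparison against the secp192k1 order, which a real implementation would still need. The seed is assumed to come from a high-entropy source, not from a phrase a person made up.

```python
import hashlib
import secrets

# Order n of the secp256k1 base point (a published curve constant).
SECP256K1_ORDER = 0xFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFEBAAEDCE6AF48A03BBFD25E8CD0364141


def candidate_key_256(seed: bytes) -> int:
    """Interpret SHA-256(seed) as an integer and check it lies in [1, n-1]."""
    d = int.from_bytes(hashlib.sha256(seed).digest(), "big")
    if not 1 <= d < SECP256K1_ORDER:
        # Extremely unlikely, but a valid private key must be in range.
        raise ValueError("digest is not a valid secp256k1 private key")
    return d


def candidate_key_192(seed: bytes) -> int:
    """First 192 bits (24 bytes) of SHA-256(seed), as suggested for a 192-bit curve."""
    return int.from_bytes(hashlib.sha256(seed).digest()[:24], "big")


seed = secrets.token_bytes(32)  # assumption: machine-generated, high-entropy seed
print(hex(candidate_key_256(seed)))
print(hex(candidate_key_192(seed)))
```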
If by passphrase you mean a phrase chosen by a person, no. There is abundant evidence that people do not choose randomly even when they try to, and passwords and passphrases that are chosen by people and not cryptographically 'strengthened' (which your method does not do) are regularly broken. As an example, this was tried in the Bitcoin community a few years ago under the name 'brain wallet' -- i.e. your private key, giving access to your bitcoins, is in your brain. Many of these keys were broken and the bitcoins stolen.
If you mean a series of words (not really a meaningful phrase) generated randomly by the computer to have sufficient entropy, or by some other process that actually is random like rolling fair dice, then yes. The current standard in Bitcoin for a 'seed phrase' is 12 words from a list of 2048 giving 128 bits of entropy plus 4 bits of redundancy; for your curve you only need 96 bits of entropy so 9 such words would work (although it isn't standard). Numerous other similar schemes have been developed and used over the years. In practice you will probably have to write this 'phrase' down and/or store it somewhere, and then secure that storage appropriately.
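The word-count arithmetic in the paragraph above can be checked with a few lines of Python. This is only a sketch of the arithmetic. The helper names are hypothetical, and it assumes every word is drawn uniformly and independently from the 2048-word list.

```python
import math

WORDLIST_SIZE = 2048
BITS_PER_WORD = int(math.log2(WORDLIST_SIZE))  # 11 bits per uniformly chosen word


def phrase_bits(words: int) -> int:
    """Total bits carried by a phrase of the given number of words."""
    return words * BITS_PER_WORD


def words_needed(entropy_bits: int) -> int:
    """Smallest number of words carrying at least entropy_bits bits."""
    return math.ceil(entropy_bits / BITS_PER_WORD)


print(phrase_bits(12))   # 132 = 128 bits of entropy + 4 bits of redundancy
print(words_needed(96))  # 9 words cover the 96 bits needed for a 192-bit curve
```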

Combining two GUID/UUIDs with MD5, any reasons this is a bad idea?

I need to derive a single ID from N IDs. At first I had a complex table in my database with FirstID, SecondID, and a varbinary(MAX) holding the remaining IDs. While this technically works, it's painful, slow, and centralized, so I came up with this:
Simple version in C#:
Guid idA = Guid.NewGuid();
Guid idB = Guid.NewGuid();
byte[] data = new byte[32];
idA.ToByteArray().CopyTo(data, 0);
idB.ToByteArray().CopyTo(data, 16);
byte[] hash = MD5.Create().ComputeHash(data);
Guid newID = new Guid(hash);
A proper version will sort the IDs, support more than two, and probably reuse the MD5 object, but this should be faster to understand.
Security is not a factor here and none of the IDs are secret. I only mention it because everyone I talk to reacts badly when you say MD5. MD5 is particularly useful for this because it outputs 128 bits and thus can be converted directly to a new Guid.
It seems to me that this should be just dandy: while I may increase the odds of a Guid collision, it still seems like I could do this until the sun burns out and be nowhere near running into a practical issue.
However, I have no clue how MD5 is actually implemented and may have overlooked something significant, so my question is this: is there any reason this should cause problems? (Assume sub-trillion records; ideally the output IDs should be just as global/universal as the other IDs.)

My first thought is that you would not be generating a true UUID. You would end up with an arbitrary set of 128 bits, but a UUID is not an arbitrary set of bits. See the 'M' and 'N' callouts in the Wikipedia page. I don't know if this is a concern in practice or not. Perhaps you could manipulate a few bits (the 13th and 17th hex digits) inside your MD5 output to transform the hash output into a true UUID, as mentioned in this description of Version 4 UUIDs.
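A minimal Python sketch of that idea follows: hash the sorted IDs with MD5, then stamp the version and variant bits so the result is a well-formed UUID. The name `combine_ids` is hypothetical. The sketch labels the result as version 3 (the MD5-based UUID version) rather than version 4, which is a choice made here and not something the question specifies. Python's `UUID.bytes` also uses a different byte order than .NET's `Guid.ToByteArray()`, so the two would not produce the same values without extra conversion.

```python
import hashlib
import uuid


def combine_ids(*ids: uuid.UUID) -> uuid.UUID:
    """Derive one UUID from any number of UUIDs, independent of argument order."""
    ordered = sorted(ids, key=lambda u: u.bytes)
    digest = hashlib.md5(b"".join(u.bytes for u in ordered)).digest()
    # version=3 overwrites the version nibble (13th hex digit) and the
    # variant bits (top of the 17th hex digit) in the 16-byte digest.
    return uuid.UUID(bytes=digest, version=3)


a = uuid.UUID("78BC2A6B-4F03-48D0-BB74-051A6A75CCA1")
b = uuid.UUID("FCF1B8E4-5548-4C43-995A-8DA2555459C8")
print(combine_ids(a, b))
```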
Another issue: as the Wikipedia article puts it, MD5 is not collision resistant. That means someone can deliberately construct two different inputs with the same hash. Since security is not a factor here and your inputs are GUIDs rather than attacker-chosen data, this is unlikely to matter in your case.
Nevertheless, as you pointed out, probably the chance of a collision is unrealistic.
I might be tempted to try to increase the entropy by repeating your combined value to create a much longer input to the MD5 function. In your example code, take that 32-octet value and use it repeatedly to create a value 10 or 1,000 times longer (320 octets, 32,000 octets, or whatever).
In other words, if working with hex strings for my own convenience here instead of the octets of your example, given these two UUIDs:
78BC2A6B-4F03-48D0-BB74-051A6A75CCA1
FCF1B8E4-5548-4C43-995A-8DA2555459C8
…instead of feeding this to the MD5 function:
78BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C8
…feed this:
78BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C878BC2A6B-4F03-48D0-BB74-051A6A75CCA1FCF1B8E4-5548-4C43-995A-8DA2555459C8
…or something repeated even longer.

Are standard hash functions like MD5 or SHA1 guaranteed to be unique for small input (4 bytes)?

Scenario:
I'm writing a web service that will act as an identity provider for a 3rd-party application. I have to send this 3rd-party application some unique identifier of our user. In our database, the unique user identifier is an integer (4 bytes, 32 bits). Per our security rules I can't send those in plain form, so sending them out hashed (through a function like MD5 or SHA1) was my first idea.
Problem:
The result of MD5 is 16 bytes, the result of SHA1 is 40 bytes. I know they can't be unique for larger input sets, but given that my input is only 4 bytes long (smaller than the hashed results), are they guaranteed to be unique, or am I doomed to some poor-man's hash function (like XORing the integer input with some number, shifting bits, adding predefined bits, etc.)?

For what you're trying to achieve (preventing a 3rd party from determining your user identifier), a straight MD5 or SHA1 hash is insufficient. 32 bits = about 4 billion values, so it would take less than 2 hours for the 3rd party to brute force every value (at 1m hashes/sec). I'd really suggest using HMAC-SHA1 instead.
As for collisions, this question has an extremely good answer on their likelihood. tl;dr For 32 bits of input, the probability of a collision is extremely small.
If your user identifiers aren't random (they increment by 1 or there is a known algorithm for creating them), then there's no reason you can't generate every hash to make sure that no collision will occur.
This will check the first 10,000,000 integers for a collision with HMAC-SHA1 (will take about 2 minutes to run):
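// Requires: using System; using System.Collections.Generic; using System.Security.Cryptography;
public static bool checkCollisionHmacSha1(byte[] key){
    using (HMACSHA1 mac = new HMACSHA1(key)){
        // A HashSet<byte[]> compares arrays by reference and would never report
        // a duplicate, so each digest is stored as a Base64 string instead.
        HashSet<string> values = new HashSet<string>();
        for(int i = 0; i < 10000000; i++){
            byte[] value = BitConverter.GetBytes(i);
            // Add returns false when the digest was already present: a collision.
            if (!values.Add(Convert.ToBase64String(mac.ComputeHash(value))))
                return true;
        }
        return false;
    }
}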
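For comparison, the same idea can be sketched in Python with the standard `hmac` module. The function names and the key are hypothetical, and the integer is packed as 4 little-endian bytes, which is an assumption about the encoding rather than a requirement. Digests are stored as `bytes`, which compare by value, so set membership works as a collision test.

```python
import hashlib
import hmac
import struct


def pseudonym(key: bytes, user_id: int) -> bytes:
    """HMAC-SHA1 of a 32-bit signed integer packed as 4 little-endian bytes."""
    return hmac.new(key, struct.pack("<i", user_id), hashlib.sha1).digest()


def has_collision(key: bytes, count: int) -> bool:
    """Check the first `count` integers for a repeated HMAC-SHA1 digest."""
    seen = set()
    for i in range(count):
        tag = pseudonym(key, i)
        if tag in seen:
            return True
        seen.add(tag)
    return False


key = b"hypothetical-secret-key"  # assumption: a real key would be random and kept private
print(pseudonym(key, 42).hex())
print(has_collision(key, 100_000))
```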

First, SHA1 is 20 bytes not 40 bytes.
Second, although input is very small, there still may be a collision. It is best to test this, but I do not know a feasible way to do that.
In order to prevent any potential collision:
1 - Hash your input and produce the 16/20 bytes of hash
2 - Spray your actual integer onto this hash.
Like put a byte of your int every 4/5 bytes.
This will guarantee the uniqueness by using the input itself.
Also, take a look at Collision Column part

hashing function guaranteed to be unique?

In our app we're going to be handed PNG images along with a ~200 character byte array. I want to save the image with a filename corresponding to that byte array, but not the byte array itself, as I don't want 200-character filenames. So what I thought was that I would save the byte array into the database, and then MD5 it to get a short filename. When it comes time to display a particular image, I look up its byte array, MD5 it, then look for that file.
So far so good. The problem is that potentially two different bytearrays could hash down to the same MD5. Then, one file would effectively overwrite another. Or could they? I guess my questions are
Could two ~200 char bytearrays MD5-hash down to the same string?
If they could, is it a once-per-10-ages-of-the-universe sort of deal or something that could conceivably happen in my app?
Is there a hashing algorithm that will produce a (say) 32 char string that's guaranteed to be unique?

It's logically impossible to get a 32 byte code from a 200 byte source which is unique among all possible 200 byte sources, since you can store more information in 200 bytes than in 32 bytes.
The only exception would be if the information stored in these 200 bytes would also fit into 32 bytes, in which case your source data format would be extremely inefficient and space-wasting.

When hashing (as opposed to encrypting), you're reducing the information space of the data being hashed, so there's always a chance of a collision.
The best you can hope for in a hash function is that all hashes are evenly distributed in the hash space and your hash output is large enough to provide your "once-per-10-ages-of-the-universe sort of deal" as you put it!
So whether a hash is "good enough" for you depends on the consequences of a collision. You could always add a unique id to a checksum/hash to get the best of both worlds.
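To put a number on "good enough", the usual birthday approximation can be sketched as below. It assumes the hash behaves like a uniformly random function on non-adversarial input. The function name is hypothetical, and the formula p ≈ 1 - exp(-n(n-1) / 2^(b+1)) is an approximation, not an exact probability.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation for at least one collision among n_items hashes."""
    x = n_items * (n_items - 1) / (2 * 2 ** hash_bits)
    # -expm1(-x) equals 1 - exp(-x) but stays accurate when x is tiny.
    return -math.expm1(-x)


# One trillion items hashed to 128 bits (MD5-sized output).
print(collision_probability(10**12, 128))
# The same number of items hashed to only 64 bits.
print(collision_probability(10**12, 64))
```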

Why don't you use a unique ID from your database?

The probability that two hashes collide depends on the hash size. MD5 produces a 128-bit hash, so among 2^128 + 1 distinct inputs there is guaranteed to be at least one collision.
This number is 2^160 + 1 for SHA1 and 2^512 + 1 for SHA512.
The same rule applies here: the more output bits, the lower the chance of a collision and the more computation. So there is a trade-off, and you have to choose the one that fits your needs.

Could two ~200 char bytearrays MD5-hash down to the same string?
Considering that there are more 200-byte strings than 32-character strings (MD5 digests in hex), that is guaranteed to be the case.
All hash functions have that problem, but some are more robust than MD5. Try SHA-1. git is using it for the same purpose.

It may happen that two MD5 hashes collide (are the same). In 1996, a flaw was found in the MD5 algorithm, and cryptanalysts advised switching to the SHA-1 hashing algorithm.
So, I will advise you to switch to SHA-1 (40 characters). But do not worry: I doubt that your two pictures will get the same hash. I think you can assume this risk in your application.

As others said before, a hash doesn't give you what you need unless you are fine with the risk of collision.
A database is helpful here.
It gives you a unique index for each 200-character string, so there are no collisions. You need to index the 200-character names; this uses extra memory, but it keeps them sorted, which makes searching very fast. You also get a unique ID that can easily be used for filenames.

I haven't worked much with hashing algorithms, but as I understand it there is always a chance of collision in a hashing algorithm, i.e. two different objects may hash to the same value, although a given object is guaranteed to hash to the same value every time. There are techniques for handling this, like linear probing.

How long (max characters) can a datastore entity key_name be? Is it bad to have very long key_names?

What is the maximum number of characters that can be used to define the key_name of a datastore entity?
Is it bad to have very long key_names?
For example:
Let's say we use key_names of 170 characters: the 140 characters of a Twitter message, plus 10 numeric characters for latitude, 10 for longitude, and 10 for a timestamp.
(Reasoning of such a key_name: So by using such a key_name we can easily and quickly be sure of no duplicate postings, since the same message should not come from the same place and time more than once.)

key names are limited to 500 characters, just like string property values. see e.g. Key.to_path(), which calls ValidateString():
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/api/datastore_types.py#413
which defaults max_len to _MAX_STRING_LENGTH, which is 500:
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/api/datastore_types.py#87

There's no hard maximum - the maximum length of a key name is the maximum length of a key, less some overhead, and keys can get pretty long.
It is bad to have very long key names, however: apart from the cost of storing and retrieving it, every index entry contains the key name it refers to, so longer key names mean higher indexing overhead. If you want to ensure uniqueness over a large text, your best option is to make the key name the MD5 or SHA1 sum of the input, which gives you practical uniqueness (collisions are possible but extremely unlikely) and a short(-ish) key name.
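A minimal sketch of that suggestion in Python, using only `hashlib`: the function name `key_name_for`, the `|` separator, and the sample values are assumptions made for illustration, and nothing here calls the datastore API.

```python
import hashlib


def key_name_for(message: str, latitude: str, longitude: str, timestamp: str) -> str:
    """Derive a fixed-length key name from the fields that define a duplicate."""
    # A separator avoids ambiguity between e.g. ("ab", "c") and ("a", "bc").
    raw = "|".join([message, latitude, longitude, timestamp])
    return hashlib.sha1(raw.encode("utf-8")).hexdigest()


name = key_name_for("hello world", "0515074000", "-000127800", "1300000000")
print(name, len(name))  # always 40 hex characters, whatever the message length
```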