SQL Server hash algorithms

If my input length is less than the hash output length, are there any hashing algorithms that can guarantee no collisions?
I know that a one-way hash can have collisions across multiple inputs due to the lossy nature of hashing, especially since the input size is often greater than the output size, but does that still apply with smaller input sizes?

Use a symmetric block cipher with a randomly chosen static key. Encryption can never produce a duplicate because that would prevent unambiguous decryption.
This scheme will force a certain output length, which is a multiple of the cipher block size. If you can make use of a variable-length output, you can use a stream cipher as well.
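Here is a minimal sketch of that idea in Python, assuming the `cryptography` package is available; the padding scheme and key handling are illustrative choices, not part of the answer above.

```python
import os
from cryptography.hazmat.primitives.ciphers import Cipher, algorithms, modes

# AES under a fixed key is a permutation on 16-byte blocks, so distinct
# (injectively padded) inputs can never produce the same output.
KEY = os.urandom(32)  # in practice, a fixed, securely stored secret

def collision_free_map(data: bytes) -> bytes:
    if len(data) >= 16:
        raise ValueError("input must be shorter than one 16-byte block")
    padded = data + b"\x80" + b"\x00" * (15 - len(data))  # injective padding
    enc = Cipher(algorithms.AES(KEY), modes.ECB()).encryptor()
    return enc.update(padded) + enc.finalize()

print(collision_free_map(b"abc").hex())
```

Note that, unlike a hash, this is reversible by anyone who holds the key, so it only makes sense when the key stays secret or reversibility is acceptable.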

Your question sounds like you're looking for a perfect hash function. The problem with perfect hash functions is they tend to be tailored towards a specific set of data.
The following assumes you're not trying to hide, secure or encrypt the data...
To think of it another way, the easiest way to "generate" a perfect hash function that will accept your inputs is to map the data you want to store to a table and associate those inputs with a surrogate primary key. You then create a unique constraint for the column (or columns) to ensure the input you're mapping only maps to a single surrogate value.
The surrogate key could be an int, bigint, or guid. It all depends on how many rows you're looking to store.
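Not SQL, but here is a rough in-memory sketch of the same mapping idea in Python; the dict and function name are made up for illustration.

```python
# Each distinct input gets the next integer surrogate; the dict plays the role
# of the table with a unique constraint on the natural value.
surrogates: dict[str, int] = {}

def surrogate_key(value: str) -> int:
    if value not in surrogates:
        surrogates[value] = len(surrogates) + 1  # like an identity column
    return surrogates[value]

print(surrogate_key("alpha"))  # 1
print(surrogate_key("beta"))   # 2
print(surrogate_key("alpha"))  # 1 again: the mapping is collision-free by construction
```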

If your input lengths are known to be small, such as 32 bits, then you could actually enumerate all possible inputs and check the resulting hashes for collisions. That's only 4,294,967,296 possible inputs, and it shouldn't take too terribly long to enumerate all of them. Essentially you'd be building a rainbow table to test for collisions.
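As a rough illustration, here is that enumeration in Python, shrunk to 16-bit inputs and a deliberately truncated digest so it finishes in seconds; the full 32-bit check is the same loop, just longer.

```python
import hashlib

seen = {}
collisions = 0
for i in range(2 ** 16):                        # 16-bit toy version of the 32-bit space
    data = i.to_bytes(2, "big")
    digest = hashlib.sha256(data).digest()[:4]  # truncated on purpose so collisions are plausible
    if digest in seen:
        collisions += 1
    else:
        seen[digest] = i
print(f"collisions among 2**16 inputs with a 4-byte digest: {collisions}")
```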
If there is some security relying on this though, one of the issues is that if an attacker knows your input lengths are constrained, it is easy for them to perform the same enumeration and build a map/table from hashes back to the original values. "Attacker" is a pretty terrible term here though, because I have no context for how you are using these hashes and whether you are concerned about being able to reverse them.

Related

How to store multiple values for a key in a TRIE structure?

So I am still in the theory portion of my project (a phonebook) and am wondering how to store multiple values for a single key in a TRIE structure. When I looked it up, most people said to use a TRIE when creating a phone book: but if I wanted to store number, email, address, etc. all under the key - which would be the name - how would that work? Could I still use a TRIE? Or am I thinking about this the wrong way? Thanks.
I think usually you would create separate indexes for each dimension (i.e. phone number, name, ...). These would normally be B-trees or Hashmaps.
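As a toy sketch of that per-dimension approach in Python (field names are made up for illustration), each dict acts as one index over shared contact records:

```python
# One hashmap per dimension, all pointing at the same contact records.
contacts = [
    {"name": "Alice", "phone": "555-0100", "email": "alice@example.com"},
    {"name": "Bob",   "phone": "555-0101", "email": "bob@example.com"},
]

by_name = {c["name"]: c for c in contacts}
by_phone = {c["phone"]: c for c in contacts}

print(by_name["Alice"]["email"])      # lookup by name
print(by_phone["555-0101"]["name"])   # lookup by phone number
```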
Generally, these individual indexes will allow faster lookup than a multi-dimensional index (even though multidimensional TRIEs have very fast look up speed).
If you really want to use a multi-dimensional TRIE, have a look at the PH-Tree, it is a true multi-dimensional TRIE (disclaimer: I am self-advertising here).
There are Java and C++ implementations, but they are all aimed at 64bit numbers, e.g. coordinates in space. WARNING: The available implementations allow only one entry per key, so you will have to store Lists for each key in order to allow multiple entries per key.
If you want to use the PH-Tree for strings etc (I would treat the phone number as a string): you can either write your own PH-Tree (not easy to do) or encode the strings in a magic number. For example, convert the leading six characters into numbers by using their ASCII code and create a small hash of the whole string and store the hash in the remaining two bytes. There are many ways to improve this number. This number can then be used as one dimension of the key.
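A rough Python sketch of that encoding (the exact packing is an illustrative assumption; the description above only outlines the idea):

```python
import hashlib

def string_to_magic_number(s: str) -> int:
    # Leading six characters fill the high six bytes, and a 2-byte hash of the
    # whole string fills the low two bytes, giving one 64-bit dimension value.
    prefix = s.encode("ascii", errors="replace")[:6].ljust(6, b"\x00")
    small_hash = hashlib.sha256(s.encode()).digest()[:2]
    return int.from_bytes(prefix + small_hash, "big")

print(hex(string_to_magic_number("555-0100")))
print(hex(string_to_magic_number("555-0101")))
```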
Conceptually, the PH-Tree interleaves the bits of all dimensions into a single long bitstring. These long bit-strings are then stored in a TRIE, however there are some quirks (i.e. each node has up to 2^dimension children). For a full explanation, please have a look at the papers on the website.
In summary: I would not use a multi-dimensional index unless you really need it. If you need to use a multi-dimensional index, the PH-Tree may be a good choice, it is very fast to update and scales comparatively well with increasing dimensionality.

Encrypt hash map keys while keeping constant lookup speed

I would like to encrypt the keys and values of a hash map with AES256 CBC, individually.
The challenge is to encrypt the keys while maintaining the constant lookup speed and security (mainly against dictionary attacks).
I read about blind indices, but these need some randomness at creation (salt, nonce), and it is impossible for the lookup function to recreate the nonce when searching. At lookup we would need to know where to fetch the nonce from for a particular key, which in the end just moves the vulnerability elsewhere.
So far, I can only think of two options.
First one would be to just not encrypt keys, although I would prefer to do it.
The second one would be to obtain the blind indices by applying a transformation like
blind_index(key) = encrypt(digest(key))
but the problem here is that you need a unique initialisation vector for each key encryption, which brings us again to the problem described above: having a table of IVs used, in order for the lookup function to be able to reconstruct the blind index when searching, which is moving the same problem elsewhere.
For the second approach, my thought was: since I always encrypt unique values (keys are unique and even if they are substrings of one another, e.g. 'awesome' and 'awesome_key', they are hashed before encryption, so they look quite different in their 'hashed & unencrypted' form) I could use a global IV for all encryptions, which can be easily accessible to the lookup function. Since the lookup function requires the encryption key, only the owner will be able to compute the blind index correctly and in the map itself there will be no visible similarities between keys that are similar in plaintext.
The big problem I have with the second approach is that it violates the idea of never using IVs for more than one encryption. I could obfuscate the IV 'to make it more secure,' but that's again a bad idea since IVs are supposed to be plaintext.
More details about the circumstances:
app for mobile
map will be dumped to a file
map will be accessible for lookup through a REST API
Maybe I should use a different algorithm (e.g. ECB)?
Thanks in advance!
This is completely in the realm of Format Preserving Encryption (FPE). However, applying it is hard and libraries that handle it well are not all that common. FPE takes a number of bits or even a range and then returns an encrypted value of the same size or in the same range. This ciphertext is pseudo-random in the given domain as long as the input values are unique (which, for keys in a hash table, they are by definition).
If you may expand your ciphertext compared to the plaintext then you could also look at SIV modes (AES-SIV or AES-GCM-SIV), which are much easier to handle. These return a byte array, which you could turn into a String, e.g. by using base64 encoding. Otherwise you could wrap the byte array and provide your own equals and hashCode methods. Note that these expand your plaintext relatively significantly; these are authenticated modes. Advantage: the IV gets calculated from the input, and any change in the input will randomize the ciphertext again.
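Here is a rough sketch of the SIV idea in Python, assuming the `cryptography` package's AESSIV class is acceptable in your stack; the names and the base64 step are illustrative choices:

```python
import base64
import os
from cryptography.hazmat.primitives.ciphers.aead import AESSIV

# AES-SIV with no nonce is deterministic: the same plaintext always yields the
# same ciphertext under a given key, so the ciphertext can serve as the map key.
siv = AESSIV(os.urandom(64))  # in practice, a fixed stored secret

def blind_index(map_key: str) -> str:
    return base64.urlsafe_b64encode(siv.encrypt(map_key.encode(), None)).decode()

encrypted_map = {blind_index("awesome_key"): siv.encrypt(b"some value", None)}

# The lookup recomputes the same deterministic ciphertext:
print(blind_index("awesome_key") in encrypted_map)  # True
```

Only someone holding the key can recompute the index, and similar plaintext keys produce unrelated ciphertexts.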
Finally, you could of course simply use an IV or nonce to produce your ciphertext and prefix it to the value. However, beware that reencryption of changed values using the same IV would be rather dangerous, as you may leak information through repetition. In some modes this could entirely break the confidentiality offered. So you would have to prevent reuse of the IV.
The use of ECB is certainly not recommended for strings. A single block encrypt would work of course if the input is (or can be expanded to) a single block.

Computing the key of a symmetric encryption given the plaintext and ciphertext

As part of an assignment I need to make an algorithm that takes 2 files as input, one containing a plaintext and one containing a ciphertext. Considering the encryption model is hardcoded/known, and is a symmetric encryption, is there a way to use openSSL to compute the key used to encrypt the provided plaintext into the provided ciphertext?
For convenience I used 5 paragraphs of Lorem Ipsum as the plaintext, and Blowfish as the cipher.
The openSSL documentation and Google have proved less than useful.
Thank you!
No, the ability to do that would pretty much defeat the entire purpose of cryptography. There might be tools that can do that sort of thing with trivial systems (Caesar cipher for example) but if keys could be computed in reasonable times for current cryptosystems they would be broken.
What you are looking at is a "Known Plaintext Attack": if the attacker knows both the ciphertext and the plaintext, can the key be found?
All good modern ciphers, including Blowfish, are designed to resist this attack. Hence, as has been said, the answer to your question is, "No, you can't find the key."
No you can't.
Not for the blowfish algorithm.
The reason, however, is not that any encryption scheme would be broken if it were possible to derive the key from a plaintext/ciphertext pair, even if doing so were easy.
The rest of this answer is to explain that.
There is at least one encryption scheme which is secure in spite of allowing the key to be derived. It is the one-time-pad encryption scheme, which happens to be the only known truly secure encryption scheme, being provably unbreakable.
The point is that deriving the key of one message only breaks an encryption scheme if knowing the key of one message allows decryption of all future messages. This in turn only applies if the same key is reused.
The specialty of the one-time-pad encryption is
a) each key is used for only a single message and never again
(this is why it is called "pad", referring to a notepad with many keys, from which the sheet with a used key is easily taken away and destroyed)
b) the key is as long as the message
(otherwise deriving the key for a part of the cipher with a partial known plain text would allow decrypting the rest of the message)
With those attributes, encrypting even with the humble XOR is unbreakable, each bit in the message corresponding to its own dedicated bit in the key. This is also as fast as de-/encryption gets and never increases the message length.
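As a tiny illustration in plain Python (nothing here is library-specific): with a one-time pad, the key is trivially recovered as plaintext XOR ciphertext, yet nothing else is learned, because that key is never used again.

```python
import os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

plaintext = b"attack at dawn"
key = os.urandom(len(plaintext))   # pad: as long as the message, used exactly once
ciphertext = xor(plaintext, key)

# Knowing plaintext and ciphertext gives the key back immediately...
recovered_key = xor(plaintext, ciphertext)
print(recovered_key == key)        # True
# ...but since that key is never reused, no other message is compromised.
```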
There is of course a huge disadvantage to the one-time-pad encryption, namely key logistics. Using this encryption is hardly ever practical, because of the need to provide the receiver of a message with many large keys (or better a very long key which can be used partially for any size of message) and to do so beforehand.
This is the reason the one-time-pad encryption is not used, in spite of the fact that it is safer and faster than all the others in use and at least as size-efficient.
Other encryption schemes are considered practically secure, otherwise they would of course not be used.
It is however necessary to increase key sizes in parallel with any noticeable progress in cryptanalysis. There is no mathematical proof that any other algorithm is underivable (meaning it is impossible to derive the key from a plain/cipher pair). No mathematician accepts "I cannot think of any way to do that" as proof that something is impossible. On top of that, new technologies could reduce the time for key derivation, or for finding plain text without the key, to a fraction, spelling sudden doom for commonly used key lengths.
The symmetry or asymmetry of the algorithm is irrelevant by the way. Both kinds can be derivable or not.
Only the key size in relation to the message length is important. Even with the one-time-pad encryption, a short key (message length being a multiple of key length) has to be used more than once. If the first part of a cipher has a known plain text and allows the key to be derived, reusing it allows finding the unknown plain text for the rest of the message.
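A short sketch of that failure mode (the key and messages are made-up examples): a short key repeated over the message is recovered from a known prefix and then decrypts everything that follows.

```python
from itertools import cycle

def xor_repeat(data: bytes, key: bytes) -> bytes:
    return bytes(b ^ k for b, k in zip(data, cycle(key)))

secret_key = b"SHORTKEY"   # reused across the whole message
message = b"KNOWN HEADER: the rest is meant to stay secret"
ciphertext = xor_repeat(message, secret_key)

# The attacker knows the first 8 bytes of plaintext, so the key falls out...
known_prefix = b"KNOWN HE"
derived_key = bytes(c ^ p for c, p in zip(ciphertext, known_prefix))
# ...and decrypts the whole message, known part and unknown part alike.
print(xor_repeat(ciphertext, derived_key).decode())
```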
This is also true for block cipher schemes, which change the key for each block, but still allow finding the new key with the knowledge of the previous key, if it is the same. Hybrid schemes which use one (possibly asymmetric) main key to create multiple (usually symmetric) block keys which cannot be derived from each other are, for the sake of this answer, considered derivable if the main key can be derived. There is of course no widely used algorithm for which this is true.
For any scheme, the risk of being derivable increases with the ratio of message bits to key bits. The more pairs of cipher bits and plain bits relate to each key bit, the more information is available for analysis. With a one-to-one relation, restricting the information of one plain/cipher pair to that single pair is possible.
Because of this, any encryption that is derivable, yet still secure, requires a key length equal to the message length.
In reverse, this means that only non-derivable encryptions can have short keys. And having short keys is of course an advantage, especially if key length implies processing duration. Most encryption schemes take longer with longer keys. The one-time-pad however is equally fast for any key length.
So any algorithm with easy key logistics (no need to agree on huge amounts of keybits beforehand) will be non-derivable. Also any algorithm with acceptable speed will be non-derivable.
Both are true for any widely used algorithm, including Blowfish.
It is however not true for all algorithms, especially not for the only truly safe one, the one-time-pad encryption (XOR).
So the answer to your specific question is indeed:
You can't with blowfish and most algorithms you probably think of. But ...

Short Text Database Key's vs. Numeric Keys; When is either more efficient than the other?

I am well aware that if I use a nvarchar field as a primary key, or as a foreign key, that this will add some time and space overhead to the usage of the generated index in the majority (if not all) of cases.
As a general rule, using numeric keys is a good idea, but under certain common circumstances (small sets of data, for instance) it isn't a problem to use text-based keys.
However, I am wondering if anyone could provide rigorous information on whether it is MORE efficient, or at least equal, to use text for database keys rather than numeric values under certain circumstances.
Consider a case where a table contains a short list of records. For our example, we'll say we need 50 records. Each record needs an ID. You could use, generic int (or even smallint) numbers (e.g. [1...50]) OR you could assign meaningful, 2 character values to a char(2) field (e.g. [AL, AK, AZ, AR, ... WI]).
In the above case, we could assume that using a char(2) field is potentially more efficient than using an int key, since the char data is 2 bytes vs. the 4 bytes used by an int. Using a smallint field would theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.
The benefit of using the text-based key over the numeric key is that the values are readable, which should make it obvious to many that my list of 50 records is likely a list of US states.
As stated, using keys that are smaller or equal in size of a comparable numeric key should be of similar efficiency. However, depending on the architecture and design of the database engine it is possible that in-practice usage may yield unexpected results.
With that stated, is it ever more, equal or less efficient to use any form of text-based value as a key within SQL Server?
I don't need obsessively thorough research results (though I wouldn't mind it), but I am looking for an answer that goes beyond stating what we would expect from a database.
Definitively, how does efficiency of text-based keys compare to numeric-based keys as the size of the text key increases/decreases?
In most cases considerations driven by the business requirements (use cases) will far outweigh any performance differences between numeric v. text keys. Unless you are looking at very large and/or very high throughput systems your choice of primary key type should be based on how the keys will be used rather than any small difference in performance you will see between numeric and text keys.
Think in assembly to find out the answer. You stated this:
we could assume that using a char(2) field is potentially more efficient than using an int key, since the char data is 2 bytes vs. the 4 bytes used by an int. Using a smallint field would theoretically be just as efficient as the char(2) field and, possibly, a varchar(2) field.
This isn't true, as you can't move 2 characters simultaneously in a single instruction (to my knowledge). So even though a char(2) is smaller than a 4-byte int, you have to move the characters one by one into a register to do a comparison. To compare two instances of a 4-byte int, even though each is larger, you only need 1 move instruction per int (disregarding that you also need to move them out of the register back into memory).
So what happens if you use an int:
Move one of them into one register
Move the other into another
Do a comparison operation
Move to appropriate memory location depending on the comparison result
In the case of a char, however:
Move one of them into one register
Move the other into another
Do a comparison
If you are lucky and the order can be determined, then you are done, and the cost is the same as in the case of ints.
If they are equal, rinse and repeat with the subsequent characters until the order or equality can be determined. Obviously, this is more costly.
Point is that on low level, the determining factor is not the data size in this case but the number of instructions needed.
Apart from the low-level stuff:
Yes, there might be cases where it simply doesn't matter because of the small amount of data that is not likely to ever change - the chemical symbols of the elements, for example (though I am not sure whether I'd use them as PKs).
Generally, you don't use artificial PKs for time and space considerations, but because, having nothing to do with real-life data, they are not subject to change. Can you imagine the name of a US state ever changing? I can. If it happens, you would have to update the record itself (if the abbreviation changes too, of course) and all other records that reference it. If you use an int instead, then your key has nothing to do with what happens in reality, in which case you only have to update the abbreviation and the state name itself and you can sit back assured that everything is consistent.
Comparing short strings is not always as trivial as comparing the numeric value of their binary representations. When you also have to consider internationalization, you need to rely on custom (or framework/platform-provided) logic to compare them. To use my language as an example, the letter 'Á' has a decimal value of 193, which is greater than the value of 66 of the letter 'B', yet in the Hungarian alphabet 'Á' precedes 'B'.
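A small Python illustration of that mismatch (the 'hu_HU.UTF-8' locale is assumed to be installed, which may not hold on every system):

```python
import locale

# Plain code-point comparison puts 'Á' (193) after 'B' (66)...
print(ord('Á'), ord('B'))   # 193 66
print('Á' < 'B')            # False under code-point ordering

# ...while a collation-aware comparison respects the Hungarian alphabet.
try:
    locale.setlocale(locale.LC_COLLATE, 'hu_HU.UTF-8')
    print(locale.strcoll('Á', 'B') < 0)   # True: 'Á' precedes 'B'
except locale.Error:
    print("Hungarian locale not available on this system")
```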
Using textual data rather than an artificial numeric PK can also cause some fragmentation, and the write operations are likely to be slower. The reason for this is that an artificial, monotonically increasing numeric PK will cause your newly created rows to be inserted at the end of the table in all cases, thereby avoiding the need to "move stuff around to free up space in between".

generating custom uniqueidentifier sql server

I'm using SQL Server 2012. Is it possible to generate a uniqueidentifier value based on two or three values, mostly varchars or decimals - I mean any data type which takes 0-9 and a-z?
Usually a uniqueidentifier varies from system to system. For my requirement, I need a custom one: whenever I call this function, it should give me the same value on all systems.
I have been thinking of converting the values into varbinary and taking certain parts of it to generate a uniqueidentifier. How good is this approach?
I'm still working on this approach.
Please provide your suggestions.
What you describe is a hash of the values. Use the HASHBYTES function to digest your values into a hash. But your definition of the problem contradicts the requirement for uniqueness since, by definition, reducing an input of size M to a hash of size N, where N < M, may generate collisions. If you truly need uniqueness then redefine the requirements in a manner which would at least allow for uniqueness. Namely, the requirement that it "should get me the same value in all the systems" must be dropped, since the only way to guarantee it is to output exactly the input. If you remove this requirement then the new requirements are satisfied by NEWID() (yes, it does not consider the input, but it doesn't have to in order to meet your requirements).
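Not T-SQL, but a Python sketch of the hashing idea: a name-based (version 5) UUID from RFC 4122 digests the concatenated values into something GUID-shaped that comes out the same on every system, with the collision caveat above still applying. The namespace and separator are arbitrary illustrative choices.

```python
import uuid

# Version 5 UUIDs hash their input, so the same values give the same
# GUID-shaped identifier everywhere; collisions remain possible in principle.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_DNS, "example.local")  # arbitrary choice

def deterministic_id(*values: str) -> uuid.UUID:
    return uuid.uuid5(NAMESPACE, "|".join(values))

print(deterministic_id("ABC123", "42.50"))
print(deterministic_id("ABC123", "42.50") == deterministic_id("ABC123", "42.50"))  # True
```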
The standards document for Uniqueidentifier goes to some length showing how they are generated. http://www.ietf.org/rfc/rfc4122.txt
I would give this a read (especially section 4.1.2, as that breaks down how a guid should be generated) and maybe keep the timestamp components but hard-code your network location element, which will give you what you are looking for.
