How long (max characters) can a datastore entity key_name be? Is it bad to have very long key_names? - google-app-engine

What is the maximum number of characters that can be used to define the key_name of a datastore entity?
Is it bad to have very long key_names?
For example:
Let's say we use key_names of 170 characters, which is the length of a Twitter message (140) plus 10 numeric characters for latitude, 10 for longitude, and 10 for a timestamp.
(The reasoning behind such a key_name: it lets us easily and quickly make sure there are no duplicate postings, since the same message should not come from the same place and time more than once.)

Key names are limited to 500 characters, just like string property values. See e.g. Key.to_path(), which calls ValidateString():
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/api/datastore_types.py#413
which defaults max_len to _MAX_STRING_LENGTH, which is 500:
http://code.google.com/p/googleappengine/source/browse/trunk/python/google/appengine/api/datastore_types.py#87
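
As a quick sanity check, a hedged sketch against the Python SDK (the 'Posting' kind is only an illustrative name):

from google.appengine.ext import db

# A 500-character key name is accepted...
key = db.Key.from_path('Posting', 'x' * 500)

# ...while 501 characters should be rejected by ValidateString()
# with a BadValueError.
try:
    db.Key.from_path('Posting', 'x' * 501)
except db.BadValueError:
    print('key names longer than 500 characters are rejected')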

There's no hard maximum - the maximum length of a key name is the maximum length of a key, less some overhead, and keys can get pretty long.
It is bad to have very long key names, however: apart from the cost of storing and retrieving it, every index entry contains the key name it refers to, so longer key names mean higher indexing overhead. If you want to ensure uniqueness over a large text, your best option is to make the key name the MD5 or SHA1 sum of the input, which ensures both uniqueness and a short(-ish) key name.
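For example, a minimal Python sketch of that approach (make_key_name and the Posting model are illustrative names, not part of any API):

import hashlib

def make_key_name(message, lat, lon, timestamp):
    # Join the fields that must be unique together, then hash them:
    # SHA1 always yields a 40-character hex string, however long the input.
    raw = u'%s|%s|%s|%s' % (message, lat, lon, timestamp)
    return hashlib.sha1(raw.encode('utf-8')).hexdigest()

# e.g. Posting.get_or_insert(make_key_name(msg, lat, lon, ts), ...)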

Related

SQL Server : create table columns for most efficient size

My SQL Server database was created & designed by a freelance developer.
I see the database getting quite big and I want to ensure that the column datatypes are the most efficient at keeping its size as small as possible.
Most columns were created as
VARCHAR (255), NULL
This covers columns whose contents are:
Numerics with a maximum length of 2 digits
Numerics whose length will never be more than 3 digits, or blank
Alpha values which contain just 1 letter or are blank
Then there are a number of alphanumeric columns, some with a maximum of 10 alphanumeric characters and some with a maximum of 25.
There is one big alphanumeric column which can be up to 300 characters.
There has been an amendment for a column which shows the time taken, in seconds, to race an event: under 1000 seconds, with up to 2 decimal places.
This is set as DECIMAL (18,2) NULL
The question is: can I reduce the size of the database by changing the column data types, or was the original design optimal for the purpose?
You should definitely strive to use the most appropriate data types for all columns - and in this regard, that freelance developer did a very poor job, both in terms of consistency and usability (just try to sum up the numbers in a VARCHAR(255) column, or sort them by their numeric value - horribly bad design) and from a performance point of view.
Numerics with a maximum length of 2 digits
Numerics whose length will never be more than 3 digits, or blank
-> If you don't need any fractional decimal places (only whole numbers), use INT
Alpha values which contain just 1 letter or are blank
-> In this case, I'd use a CHAR(1) (or NCHAR(1) if you need to be able to handle Unicode characters, like Hebrew, Arabic, Cyrillic or East Asian languages). Since it's really only ever 1 character (or nothing), there's no need or point in using a variable-length string type, which only adds at least 2 bytes of overhead per string stored.
There is one big alphanumeric column which can be up to 300 characters.
-> That's a great candidate for a VARCHAR(300) column (or again: NVARCHAR(300) if you need to support Unicode). Here I'd definitely use a variable-length string type, to avoid padding the column with spaces up to the defined length when you store fewer characters.

Allowed characters in AppEngine Datastore key name

If I create a named key for use in Google AppEngine, what kind of String is the key-name? Does it use Unicode characters or is it a binary string?
More specifically, if I want to have my key-name made up of 8-bit binary data, is there a way I can do it? If not, can I at least use 7-bit binary data? Or are there any reserved values? Does it use NULL as the End-Of-String marker, for example?
The GAE docs do not specify any restrictions on the key-name String, so a String with any content should be valid.
If you want to use binary data as an identifier, then you should encode it into a String. You can use any of the binary-to-text encoding methods: the most widely used seem to be Base64 (3 bytes = 4 chars) and BinHex (1 byte = 2 chars).
I meanwhile had some time to actually test this out by generating a bunch of keys with binary names and then performing a kind-only query to get all the keys back. Here are the results:
Any binary character is fine. If you create an entity with key name "\x00\x13\x127\x255", a query will find this entity and its key name will return that same string.
The AppEngine Dashboard, Database Viewer, and other tools will simply omit characters that aren't displayable, so the key names "\x00test" and "\x00\x00test" will show up as separate entities, but both of their keys are displayed as "test".
I have not tested all available AppEngine tools, just some of the basics in the Console, so there may be other tools that get confused by such keys...
Key names are stored UTF-8 encoded, so any character between 128 and 255 takes up 2 bytes of storage.
From this, I would derive the following recommendations:
If you need to be able to work with individual entities from the AppEngine console and need to identify them by key, you are limited to printable characters and thus need to encode the binary key name into a String, either in Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead) - see the sketch below
If you don't care about key readability, but need to pack as much data as possible into the key name with minimal storage use, use a Base128 encoding (i.e. 7 bits per character; 14% overhead) to avoid the implicit UTF-8 encoding (50% overhead!) of 8-bit data
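To illustrate the printable options, a small Python sketch (the raw identifier is made up, and older standard libraries have no Base85/Base128 codec, so only Base16 and Base64 are shown):

import base64, binascii

raw = b'\x00\x13\x7f\xff'  # hypothetical 8-bit binary identifier

# Base16 (hex): 2 chars per byte -> 100% overhead, but fully printable
print(binascii.hexlify(raw).decode('ascii'))

# Base64 (URL-safe alphabet): 4 chars per 3 bytes -> ~33% overhead
print(base64.urlsafe_b64encode(raw).decode('ascii'))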
Asides:
I will accept @PeterKnego's answer instead of this one since this one basically only confirms and expands on what he already assumed correctly.
From looking through the source code of the Java API, I think that the UTF-8 encoding of the key-name happens in the API (while building the protocol buffer) rather than in BigTable, so if you really want to go nuts on storage space maximization, it may be possible to build your own protocol buffers and store full 8-bit data without overhead. But this is probably asking for trouble...

Are standard hash functions like MD5 or SHA1 guaranteed to be unique for small input (4 bytes)?

Scenario:
I'm writing a web service that will act as an identity provider for a 3rd-party application. I have to send this 3rd-party application some unique identifier of our user. In our database, the unique user identifier is an integer (4 bytes, 32 bits). Per our security rules I can't send it in plain form - so sending it out hashed (through a function like MD5 or SHA1) was my first idea.
Problem:
The result of MD5 is 16 bytes, the result of SHA1 is 40 bytes. I know they can't be unique for larger input sets, but given the fact that my input set is only 4 bytes long (smaller than the hash results) - are they guaranteed to be unique, or am I doomed to some poor man's hash function (like XORing the integer input with some number, shifting bits, adding predefined bits, etc.)?
For what you're trying to achieve (preventing a 3rd party from determining your user identifier), a straight MD5 or SHA1 hash is insufficient. 32 bits = about 4 billion values, so it would take less than 2 hours for the 3rd party to brute force every value (at 1M hashes/sec). I'd really suggest using HMAC-SHA1 instead.
As for collisions, this question has an extremely good answer on their likelihood. tl;dr: for 32 bits of input, the chance of a collision is vanishingly small.
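To illustrate the HMAC suggestion, a minimal Python sketch (SECRET_KEY and public_user_token are hypothetical names):

import hashlib, hmac, struct

SECRET_KEY = b'server-side secret'  # hypothetical; never expose it

def public_user_token(user_id):
    # HMAC mixes a secret key into the digest, so a 3rd party cannot
    # brute-force all 2**32 possible ids the way they could with a
    # plain, unkeyed MD5/SHA1 hash.
    return hmac.new(SECRET_KEY, struct.pack('>I', user_id), hashlib.sha1).hexdigest()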
If your user identifiers aren't random (they increment by 1 or there is a known algorithm for creating them), then there's no reason you can't generate every hash to make sure that no collision will occur.
This will check the first 10,000,000 integers for a collision with HMAC-SHA1 (it will take about 2 minutes to run):
// Requires: using System; using System.Collections.Generic; using System.Security.Cryptography;
public static bool checkCollisionHmacSha1(byte[] key){
    HMACSHA1 mac = new HMACSHA1(key);
    // HashSet<byte[]> would compare array references, not contents,
    // so store each hash as a Base64 string to get value equality.
    HashSet<string> values = new HashSet<string>();
    for(int i = 0; i < 10000000; i++){
        byte[] value = BitConverter.GetBytes(i);
        // Add() returns false if an equal hash is already present.
        if(!values.Add(Convert.ToBase64String(mac.ComputeHash(value))))
            return true;  // collision found
    }
    return false;
}
First, SHA1 is 20 bytes not 40 bytes.
Second, although the input is very small, there still may be a collision. It is best to test this, but I do not know of a feasible way to do that.
In order to prevent any potential collision:
1 - Hash your input and produce the 16/20 bytes of hash.
2 - Spray your actual integer onto this hash, e.g. put one byte of your int every 4/5 bytes.
This will guarantee uniqueness by embedding the input itself.
Also, take a look at the Collision Column part.
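A minimal Python sketch of this 'spraying' idea (sprayed_hash and the one-byte-every-5-bytes interval are illustrative choices):

import hashlib, struct

def sprayed_hash(user_id):
    packed = struct.pack('>I', user_id)     # the 4 bytes of the integer
    digest = hashlib.sha1(packed).digest()  # 20 bytes of hash
    # Insert one byte of the original integer after every 5 digest bytes,
    # spreading the 4 id bytes evenly across the 20-byte digest.
    parts = [digest[i*5:(i+1)*5] + packed[i:i+1] for i in range(4)]
    return b''.join(parts)                  # 24 bytes, unique per id

Note that because the id bytes are embedded verbatim, the result is unique but no longer hides the identifier, so it would not satisfy the security requirement above.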

Key-indexed search in existence tables

I have a doubt about one of my book's statements.
Talking about key-indexed search in a symbol table, at a certain point it says: "If there are no records (but only keys), we can use a bit table. In this case, the symbol table is called an existence table, because we can consider the k-th bit as an indicator of whether key k is or is not in the table. For example, using a 313-word table on a 32-bit computer, we can use this method to quickly determine whether a given 4-digit internal telephone number has already been assigned."
Well, I know what a word is, so that existence table should be a 10,016-bit table in this case. But what does that mean? What does the fact that the telephone number has 4 digits have to do with it? And how can you implement a symbol table with key-indexed search when the records correspond to the keys?
There are 9000 four-digit numbers (in base 10, decimal), and 10000 (nonnegative) numbers with at most four digits, so a table with at least 10,000 bits is sufficient to indicate for each of these numbers whether it's present (is bit no. n set or not?). For five-digit numbers - 90,000 of them - you'd need a larger table.
Since the bit-table can only tell you either "yes, we have it" or "no, we haven't", you can't use it if you need any information exceeding that. But if that's all you need to know, any injective mapping of keys to indices into the table (array) gives you access to that information, compactly stored. In the case of the telephone numbers, the mapping is trivial.
You can use a bit table of 10000 bits (each bit corresponding to a phone number), which fits in 313 32-bit words (10000/32 = 312.5 ~= 313).
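A minimal sketch of such an existence table in Python, modelling the 32-bit words explicitly:

WORD_BITS = 32
table = [0] * 313  # 313 words * 32 bits = 10016 bits, enough for 0..9999

def mark_assigned(number):
    # Set bit `number`: word index is number / 32, bit index is number % 32.
    table[number // WORD_BITS] |= 1 << (number % WORD_BITS)

def is_assigned(number):
    # Test bit `number` - a single word load, shift and mask.
    return (table[number // WORD_BITS] >> (number % WORD_BITS)) & 1 == 1

mark_assigned(1234)
assert is_assigned(1234) and not is_assigned(1235)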

hashing function guaranteed to be unique?

In our app we're going to be handed PNG images along with a ~200 character byte array. I want to save the image with a filename corresponding to that byte array, but not the byte array itself, as I don't want 200-character filenames. So, what I thought was that I would save the byte array into the database, and then MD5 it to get a short filename. When it comes time to display a particular image, I look up its byte array, MD5 it, then look for that file.
So far so good. The problem is that potentially two different bytearrays could hash down to the same MD5. Then, one file would effectively overwrite another. Or could they? I guess my questions are
Could two ~200 char bytearrays MD5-hash down to the same string?
If they could, is it a once-per-10-ages-of-the-universe sort of deal or something that could conceivably happen in my app?
Is there a hashing algorithm that will produce a (say) 32 char string that's guaranteed to be unique?
It's logically impossible to get a 32-byte code from a 200-byte source that is unique among all possible 200-byte sources, since you can store more information in 200 bytes than in 32 bytes.
The only exception would be if the information stored in those 200 bytes would also fit into 32 bytes, in which case your source data format would be extremely inefficient and space-wasting.
When hashing (as opposed to encrypting), you're reducing the information space of the data being hashed, so there's always a chance of a collision.
The best you can hope for in a hash function is that all hashes are evenly distributed in the hash space and your hash output is large enough to provide your "once-per-10-ages-of-the-universe sort of deal" as you put it!
So whether a hash is "good enough" for you depends on the consequences of a collision. You could always add a unique id to a checksum/hash to get the best of both worlds.
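For instance, a minimal Python sketch of that hash-plus-unique-id idea (image_filename and the name format are made up):

import hashlib

def image_filename(byte_array, db_id):
    # The MD5 digest keeps the name short; appending the database id
    # guarantees uniqueness even if two byte arrays ever share a digest.
    digest = hashlib.md5(byte_array).hexdigest()  # 32 hex characters
    return '%s_%d.png' % (digest, db_id)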
Why don't you use a unique ID from your database?
The probability that two hashes collide depends on the hash size. MD5 produces a 128-bit hash, so among 2^128 + 1 hashes there is guaranteed to be at least one collision.
This number is 2^160 + 1 for SHA1 and 2^512 + 1 for SHA512.
The general rule applies here: more output bits mean more uniqueness but also more computation, so there is a trade-off. What you have to do is choose an optimal one.
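As a back-of-the-envelope illustration of that trade-off, the birthday bound estimates how many random hashes you need before a collision becomes likely (a rough sketch):

import math

def birthday_bound(bits):
    # Approximate number of random hashes for a ~50% chance of at least
    # one collision: about 1.1774 * sqrt(2**bits).
    return 1.1774 * math.sqrt(2.0 ** bits)

for name, bits in [('MD5', 128), ('SHA1', 160), ('SHA512', 512)]:
    print('%s: ~%.3g hashes for 50%% collision odds' % (name, birthday_bound(bits)))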
Could two ~200 char bytearrays MD5-hash down to the same string?
Considering that there are more 200-byte strings than 32-character hex strings (MD5 digests), that is guaranteed to be the case.
All hash functions have that problem, but some are more robust than MD5. Try SHA-1; git uses it for the same purpose.
It may happen that two MD5 hashes collide (are the same). In 1996, a flaw was found in the MD5 algorithm, and cryptanalysts advised switching to the SHA-1 hashing algorithm.
So, I would advise you to switch to SHA-1 (40 characters). But do not worry: I doubt that your two pictures will get the same hash. I think you can accept this risk in your application.
As others have said before, a hash doesn't give you what you need unless you are fine with the risk of collision.
A database is helpful here.
You get a unique index for each 200-character string - no collisions there. You need to set your 200-character column to be indexed; that uses extra memory, but it keeps the values sorted for you, making searches very fast, and the unique id you get can easily be used for filenames.
I haven't worked much with hashing algorithms, but as I understand it, there is always a chance of collision in a hashing algorithm, i.e. two different objects may be hashed to the same hash value, though it is guaranteed that an object will always be hashed to the same hash value. There are other techniques that may be used to handle this, like linear probing.
