Allowed characters in AppEngine Datastore key name - google-app-engine

If I create a named key for use in Google AppEngine, what kind of String is the key-name? Does it use Unicode characters or is it a binary string?
More specifically, if I want to have my key-name made up of 8-bit binary data, is there a way I can do it? If not, can I at least use 7-bit binary data? Or are there any reserved values? Does it use NULL as the End-Of-String marker, for example?

GAE docs do not specify any restrictions on the key-name String. So a String with any content should be valid.
If you want to use binary data as an identifier, then you should encode it into a String. You can use any of the binary-to-text encoding methods: the most common are Base64 (3 bytes = 4 chars) and Base16/hex (1 byte = 2 chars).
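A quick sketch of both encodings in Python, using only the standard library (the sample bytes are arbitrary):

```python
import base64

raw = b"\x00\x13\x7f\xff"  # arbitrary 8-bit binary identifier

# Base16 (hex): 1 byte -> 2 chars, always printable
hex_name = raw.hex()

# Base64 (URL-safe variant avoids "/" and "+"): 3 bytes -> 4 chars
b64_name = base64.urlsafe_b64encode(raw).decode("ascii")

# Both round-trip losslessly back to the original bytes:
assert bytes.fromhex(hex_name) == raw
assert base64.urlsafe_b64decode(b64_name) == raw
```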

I meanwhile had some time to actually test this out by generating a bunch of keys with binary names and then performing a kind-only query to get all the keys back. Here are the results:
Any binary character is fine: if you create an entity with key name "\x00\x13\x127\x255", a query will find this entity and its key name will return that same string.
The AppEngine Dashboard, Database Viewer, and other tools simply omit characters that aren't displayable, so the key names "\x00test" and "\x00\x00test" show up as separate entities, but both keys are displayed as "test".
I have not tested all available AppEngine tools, just some of the basics in the Console, so there may be other tools that get confused by such keys.
Key names are UTF-8 encoded, so any character between 128 and 255 takes up 2 bytes of storage.
From this, I would derive the following recommendations:
If you need to be able to work with individual entities from the AppEngine console and need to identify them by key, you are limited to printable characters and thus need to encode the binary key name into a String in Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead)
If you don't care about key readability, but need to pack as much data as possible into the key name with minimal storage use, use a Base128 encoding (i.e. 7 bits only; 14% overhead) to avoid the implicit UTF-8 encoding (50% overhead!) of 8-bit data
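A Base128 packer along these lines can be sketched in Python. This is a minimal illustration of the 7-bit idea, not a production codec; note the output may still contain control characters, so it optimizes storage, not display:

```python
def base128_encode(data: bytes) -> str:
    """Pack 8-bit bytes into 7-bit code units (all below U+0080,
    so UTF-8 stores each code unit in a single byte)."""
    acc, nbits, out = 0, 0, []
    for b in data:
        acc = (acc << 8) | b
        nbits += 8
        while nbits >= 7:
            nbits -= 7
            out.append(chr((acc >> nbits) & 0x7F))
    if nbits:  # left-align any leftover bits into a final code unit
        out.append(chr((acc << (7 - nbits)) & 0x7F))
    return "".join(out)

def base128_decode(text: str) -> bytes:
    """Reverse the packing: 7-bit code units back into 8-bit bytes."""
    acc, nbits, out = 0, 0, bytearray()
    for ch in text:
        acc = (acc << 7) | ord(ch)
        nbits += 7
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    return bytes(out)
```

Each input byte costs 8/7 of a code unit, hence the roughly 14% overhead mentioned above.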
Asides:
I will accept @PeterKnego's answer instead of this one, since this one basically only confirms and expands on what he already assumed correctly.
From looking through the source code of the Java API, I think that the UTF-8 encoding of the key-name happens in the API (while building the protocol buffer) rather than in BigTable, so if you really want to go nuts on storage space maximization, it may be possible to build your own protocol buffers and store full 8-bit data without overhead. But this is probably asking for trouble...

How are strings stored as bytes in databases?

This question may be a little vague, but let me try to explain it clearly. I have been reading a database related tutorial, and it mentioned tables are serialized to bytes to be persisted on the disk. When we deserialize them, we can locate each column based on the size of its type.
For example, we have a table:
---------------------------------------------------
| id (unsigned int 8) | timestamp (signed int 32) |
---------------------------------------------------
| Some Id | Some time |
---------------------------------------------------
When we are deserializing a byte array loaded from a file, we know the first 8 bits are the id, and the following 32 bits are the timestamp.
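That fixed-width layout can be illustrated with Python's struct module (the field names and sizes are the question's hypothetical schema):

```python
import struct

ROW_FORMAT = "<Bi"  # little-endian: unsigned 8-bit id, signed 32-bit timestamp
ROW_SIZE = struct.calcsize(ROW_FORMAT)  # 5 bytes per row

row_bytes = struct.pack(ROW_FORMAT, 42, 1_700_000_000)
rec_id, timestamp = struct.unpack(ROW_FORMAT, row_bytes)
# Because every column has a fixed size, row N starts at byte N * ROW_SIZE.
```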
But the tutorial never mentioned how strings are handled in databases. They are not limited to a fixed size, like 32 bits, and their size is not predictable (there can always be a longer string, who knows). So how exactly do databases handle strings?
I know that in an RDBMS you need to specify the size of the string, as Varchar(45) for example, and then it becomes easier. But what about DBs like MongoDB or Redis, which do not require a size specification for strings? Do they just assume a specific length and increase the size once a longer value comes in?
That is basically my vague, non-specific question; I hope someone can give me some ideas on this. Thank you very much.
In MongoDB, documents are serialized as BSON (Binary JSON-like objects). See BSON spec for more details regarding the datatypes for each type.
For string type, it is stored as:
<unsigned32 strsizewithnull><cstring>
From this line in the MongoDB source.
So a string field is stored with its length (including the null terminator) in the BSON object. The string itself is UTF-8 encoded as per the BSON spec, so it can be encoded using a variable number of bytes per symbol. Together with the other fields that make up a document, it is compressed using Snappy by default. This compressed representation is the one persisted to disk.
WiredTiger is a no-overwrite storage engine. If a document is updated, WiredTiger creates a new document, updates the internal pointer to the new one, and marks the old document as "space available for reuse".
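That layout can be reproduced by hand in Python as a sketch of the wire format (for real work you would use a BSON library):

```python
import struct

def bson_string_bytes(value: str) -> bytes:
    """Lay out a string the way BSON does: an int32 byte length
    (including the trailing NUL), then the UTF-8 bytes, then NUL."""
    payload = value.encode("utf-8") + b"\x00"
    return struct.pack("<i", len(payload)) + payload

# "abc" -> length 4 (3 UTF-8 bytes + NUL), then the bytes themselves
assert bson_string_bytes("abc") == b"\x04\x00\x00\x00abc\x00"
```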

How to write a keyboard layout dll using the KbdLayerDescriptor symbol?

Looking at example source code wasn't enough, and I couldn't find any official documentation about the KbdLayerDescriptor symbol. So I still have some questions about it:
What is the purpose of the ligature table, or more precisely, how does it work? Is it for writing pre-composed characters? If not, does it mean automatically inserting the ZERO WIDTH JOINER character, or does it simply write several characters without a ligature?
Is it possible to define three or more shift states with keys of the numeric pad?
I saw that KBD_TYPE needs to be defined. What is the purpose of each of its integer values?
Is it possible to use Unicode values larger than 16 bits, like the mathematical 𝚤?
I saw keyboard layouts use [HKLM\SYSTEM\CurrentControlSet\Control\Keyboard Layout\DosKeybCodes] and [HKLM\SYSTEM\CurrentControlSet\Control\Keyboard Layouts], but it seems these are not the only registry keys that need to be filled in to register a system-wide keyboard. So what are the required registry keys for installing a system-wide keyboard layout?

What would be the best database server to store value of PI?

Say 100 million digits, one string.
The purpose is to query the DB to find recurrence of a search string.
While I know that the LONGTEXT type in MySQL would allow the string to be stored, I am not sure that querying for a substring would actually result in acceptable performance.
Would a NoSQL key-value model perform better?
Any suggestions or experience welcome (it does not have to be pi).
This may be taking you in the wrong direction but...
Using MySQL seems to be a high-overhead way of solving the specific problem of finding a string in a file.
100M digits, 8 bits each, is only 100MB. In Python, you could write the file as a sequence of bytes, each byte representing an ASCII digit. Or, you could pack them into nibbles (4 bits are enough to cover the digits 0-9), halving the size.
In Python, you can read the file using:
fInput = open(<yourfilenamehere>, "rb")   # binary mode
fInput.seek(number_of_digit_you_want)     # jump straight to that digit's offset
fInput.read(1)                            # read a single byte
From this, it is easy to build a search solution to seek out a specific string.
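A minimal version of such a search in Python, assuming a hypothetical file that stores one ASCII digit per byte, could use mmap so the OS pages the 100MB in on demand:

```python
import mmap

def find_all(path, pattern: bytes):
    """Return every byte offset at which `pattern` occurs in the file.
    Assumes one ASCII digit per byte, as described above."""
    with open(path, "rb") as f, \
         mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        hits, pos = [], mm.find(pattern)
        while pos != -1:
            hits.append(pos)
            pos = mm.find(pattern, pos + 1)  # allow overlapping matches
        return hits
```

With one digit per byte, the offset of a match is also the (0-based) position of that digit in pi.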
Again, MySQL might be the way to go, but for your application a non-high-overhead method might be the ticket.

Facebook content collation and non-western encoded characters

If a user writes a string of text in arabic into a facebook comment and saves, what is the collation type of the data storage?
I don't believe that they use a MySQL table for comments, but I've just experimented with the topic using a localhost MySQL table, where I stored some Arabic text in a column with a binary character set.
It transformed the text into some presumably escaped sequence of characters, but once you saved it, it stayed that way.
Considering i18n: even when I have Facebook set to English, typing in other non-Western characters still saves and displays correctly.
Any insight into how they've achieved this?
First: I don't know for sure, but I don't believe MySQL comes into play anywhere for this.
The right thing to do is store it as UTF-8 in <some-system>, period. Which might as well be MySQL, I guess. I don't know specifics, but I do believe MySQL (and PHP for that matter**) are not really up to par with UTF-8/Unicode support, so they might manifest some "glitches". For example, you need to execute "SET NAMES utf8" or some such first thing after opening the connection for UTF-8 to work at all (which might be why your test didn't work). Also, I remember something about MySQL not supporting 4-byte encoded UTF-8 characters, only up to 3. Don't know if that is true currently, but I vaguely remember something about it. [edit] Should be fixed in 5.5+
I don't know about Arabic but they might be the 4-byte kind. [edit] They should need 2 or 3 bytes.
And while we're on glitches: about PHP I remember stuff like strlen() returning bytes instead of actual characters etc. If I'm not mistaken it has some mb_XXX functions (multibyte string) that should handle UTF-8 better. [edit] Turns out it does.
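Python draws the same character-versus-byte distinction explicitly, which makes the counts easy to check (the sample strings below are my own):

```python
s = "مرحبا"  # Arabic "hello": 5 characters
assert len(s) == 5                    # characters (PHP's mb_strlen)
assert len(s.encode("utf-8")) == 10   # bytes (PHP's strlen): 2 bytes per Arabic letter

emoji = "😀"  # U+1F600, outside the BMP
assert len(emoji.encode("utf-8")) == 4  # the 4-byte case MySQL's old 3-byte
                                        # "utf8" charset could not store
```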
I don't see how i18n and setting facebook to English (or Swahili for that matter) would affect this at all. It's just the language used in the interface (and maybe/probably affecting datetime formatting etc.) and has nothing to do with user-generated content.
Oh, almost forgot the obligatory The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)-link
** Just mentioning it because it usually goes hand-in-hand with MySQL.

Is it a good idea to use an integer column for storing US ZIP codes in a database?

From first glance, it would appear I have two basic choices for storing ZIP codes in a database table:
Text (probably most common), i.e. char(5) or varchar(9) to support +4 extension
Numeric, i.e. 32-bit integer
Both would satisfy the requirements of the data, if we assume that there are no international concerns. In the past we've generally just gone the text route, but I was wondering if anyone does the opposite? Just from brief comparison it looks like the integer method has two clear advantages:
It is, by means of its nature, automatically limited to numerics only (whereas without validation the text style could store letters and such which are not, to my knowledge, ever valid in a ZIP code). This doesn't mean we could/would/should forgo validating user input as normal, though!
It takes less space, being 4 bytes (which should be plenty even for 9-digit ZIP codes) instead of 5 or 9 bytes.
Also, it seems like it wouldn't hurt display output much. It is trivial to slap a ToString() on a numeric value, use simple string manipulation to insert a hyphen or space or whatever for the +4 extension, and use string formatting to restore leading zeroes.
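A sketch of that display logic in Python:

```python
def format_zip(zip5, plus4=None):
    """Restore leading zeros and insert the +4 hyphen for display."""
    base = f"{zip5:05d}"  # 1235 -> "01235"
    return base if plus4 is None else f"{base}-{plus4:04d}"

assert format_zip(1235) == "01235"
assert format_zip(12345, 678) == "12345-0678"
```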
Is there anything that would discourage using int as a datatype for US-only ZIP codes?
A numeric ZIP code is -- in a small way -- misleading.
Numbers should mean something numeric. ZIP codes don't add or subtract or participate in any numeric operations. 12309 - 12345 does not compute the distance from downtown Schenectady to my neighborhood.
Granted, for ZIP codes, no one is confused. However, for other number-like fields, it can be confusing.
Since ZIP codes aren't numbers -- they just happen to be coded with a restricted alphabet -- I suggest avoiding a numeric field. The 1-byte saving isn't worth much. And I think that meaning is more important than the byte.
Edit.
"As for leading zeroes..." is my point. Numbers don't have leading zeros. The presence of meaningful leading zeros on ZIP codes is yet another proof that they're not numeric.
Are you going to ever store non-US postal codes? Canada is 6 characters with some letters. I usually just use a 10 character field. Disk space is cheap, having to rework your data model is not.
Use a string with validation. Zip codes can begin with 0, so numeric is not a suitable type. Also, this applies neatly to international postal codes (e.g. UK, which is up to 8 characters). In the unlikely case that postal codes are a bottleneck, you could limit it to 10 characters, but check out your target formats first.
Here are validation regexes for UK, US and Canada.
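Since the linked regexes aren't reproduced here, a simplified Python version might look like this (real postal rules are stricter; for example, Canadian codes never use certain letters, and UK formats have more variants):

```python
import re

US_ZIP      = re.compile(r"^\d{5}(-\d{4})?$")               # 12345 or 12345-6789
CA_POSTAL   = re.compile(r"^[A-Za-z]\d[A-Za-z] ?\d[A-Za-z]\d$")  # K1A 0B1
UK_POSTCODE = re.compile(r"^[A-Za-z]{1,2}\d[A-Za-z\d]? ?\d[A-Za-z]{2}$")  # SW1A 1AA
```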
Yes, you can pad to get the leading zeroes back. However, you're theoretically throwing away information that might help in case of errors. If someone finds 1235 in the database, is that originally 01235, or has another digit been missed?
Best practice says you should say what you mean. A zip code is a code, not a number. Are you going to add/subtract/multiply/divide zip codes? And from a practical perspective, it's far more important that you're excluding extended zips.
Normally you would use a non-numerical datatype such as a varchar which would allow for more zip code types. If you are dead set on only allowing 5 digit [XXXXX] or 9 digit [XXXXX-XXXX] zip codes, you could then use a char(5) or char(10), but I would not recommend it. Varchar is the safest and most sane choice.
Edit: It should also be noted that if you don't plan on doing numerical calculations on the field, you should not use a numerical data type. A ZIP code is not a number in the sense that you add or subtract against it. It is just a string that happens to be made up, typically, of numbers, so you should refrain from using numerical data types for it.
From a technical standpoint, some points raised here are fairly trivial. I work with address data cleansing on a daily basis, in particular cleansing address data from all over the world. It's not a trivial task by any stretch of the imagination. When it comes to zip codes, you could store them as an integer although it may not be "semantically" correct. The fact is, the data is of a numeric form, whether or not, strictly speaking, it is considered numeric in value.
However, the very real drawback of storing them as numeric types is that you lose the ability to easily see whether the data was entered incorrectly (i.e. has missing values) or whether the system removed leading zeros, leading to costly operations to validate zip codes that look invalid but were otherwise correct.
It's also very hard to force the user to input correct data if one of the repercussions is a delay of business. Users often don't have the patience to enter correct data if it's not immediately obvious. Using a regex is one way of guaranteeing correct data, however if the user enters a value that doesn't conform and they're displayed an error, they may just omit this value altogether or enter something that conforms but is otherwise incorrect. One example [using Canadian postal codes] is that you often see A0A 0A0 entered which isn't valid but conforms to the regex for Canadian postal codes. More often than not, this is entered by users who are forced to provide a postal code, but they either don't know what it is or don't have all of it correct.
One suggestion is to validate the whole of the entry as a unit validating that the zip code is correct when compared with the rest of the address. If it is incorrect, then offering alternate valid zip codes for the address will make it easier for them to input valid data. Likewise, if the zip code is correct for the street address, but the street number falls outside the domain of that zip code, then offer alternate street numbers for that zip code/street combination.
No, because:
You never do math functions on zip codes.
They could contain dashes.
They could start with 0.
NULL values are sometimes interpreted as zero in the case of scalar types like integer (e.g. when you export the data somehow).
A zip code, even if it's a number, is a designation of an area, meaning it is a name rather than a numeric quantity of anything.
Unless you have a business requirement to perform mathematical calculations on ZIP code data, there's no point in using an INT. You're over-engineering.
Hope this helps,
Bill
ZIP Codes are traditionally digits, as well as a hyphen for Zip+4, but there is at least one Zip+4 with a hyphen and capital letters:
10022-SHOE
https://www.prnewswire.com/news-releases/saks-fifth-avenue-celebrates-the-10th-birthday-of-its-famed-10022-shoe-salon-300504519.html
Realistically, a lot of business applications will not need to support this edge case, even if it is valid.
Integer is nice, but it only works in the US, which is why most people don't do it. Usually I just use a varchar(20) or so. Probably overkill for any locale.
If you were to use an integer for US Zips, you would want to multiply the leading part by 10,000 and add the +4. The encoding in the database has nothing to do with input validation. You can always require the input to be valid or not, but the storage is matter of how much you think your requirements or the USPS will change. (Hint: your requirements will change.)
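That packing scheme fits comfortably in a signed 32-bit integer (the maximum, 999999999, is below 2^31 - 1); a sketch:

```python
def pack_zip(zip5, plus4):
    """Combine a 5-digit ZIP and its +4 extension into one integer."""
    return zip5 * 10_000 + plus4

def unpack_zip(packed):
    """Split the packed integer back into (zip5, plus4)."""
    return divmod(packed, 10_000)

assert pack_zip(12345, 678) == 123_450_678
assert unpack_zip(123_450_678) == (12345, 678)
```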
I learned recently that in Ruby one reason you would want to avoid this is that some zip codes begin with leading zeroes, which, if written as integer literals, will automatically be interpreted as octal.
From the docs:
You can use a special prefix to write numbers in decimal, hexadecimal, octal or binary formats. For decimal numbers use a prefix of 0d, for hexadecimal numbers use a prefix of 0x, for octal numbers use a prefix of 0 or 0o…
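Python 3 makes the same pitfall easy to demonstrate (it kept the 0o prefix but outlawed the bare leading-zero literal form that Ruby, C, and Python 2 treat as octal):

```python
# What "01235" would mean if read as an octal literal:
assert 0o1235 == 669
# String parsing stays decimal, leading zero or not...
assert int("01235") == 1235
# ...unless you explicitly ask for base 8:
assert int("01235", 8) == 669
```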
I think a ZIP code in an int datatype can also affect an ML model: the model may treat higher codes as numerically larger, which can create spurious outliers in the data used for the calculation.