Variant UUID format - uuid

RFC 4122 defines UUID as having a specific format:
Timestamp: 8 bytes; Clock sequence and variant: 2 bytes; Node: 6 bytes
In my application, I use UUIDs for different types of entities, say foo, bar and baz. IOW, every object of type foo gets a UUID, so does every bar and baz. This results in a slew of UUIDs in the input and output (logs, CLI) of the application.
To validate the input and to make the output easier to handle, I am considering devoting one nibble (4 bits) to the type of the entity: foo, bar, baz, etc. For example, if that nibble were '5', it would indicate it is a UUID of a foo object. Each UUID would be self-describing in this scheme, can be validated on input and can be automatically prefixed with category names in the output, like "foo:5ace6432-..."
Is it a good idea to tweak the UUID format in this way? Randomness is not a source of worry, as we still have 118 bits (128 bits total - 6 bits used by the standard for variants - 4 bits used by me).
Where should I place this nibble? If I place it up front, it would overwrite a part of the timestamp. If I place it at the end, it would be less visible. If I place it somewhere in the middle, it will be even less visible. Is overwriting a part of the timestamp a problem?
Thanks.

UUIDs are 128-bit numbers; the textual format is just for reading.
It is bad practice to try to interpret the internal parts. The UUID should be opaque: one single unit that identifies something. If you need other information about your entities, you need other fields.
In short: self-describing UUIDs are a bad idea.
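One way to follow this advice and still get validated input and prefixed output is to carry the entity type next to the UUID instead of inside it. The following is a minimal sketch, not a definitive implementation: the `EntityId` wrapper and the `KNOWN_KINDS` set are hypothetical names, and the type names come from the question's examples.

```python
import uuid
from dataclasses import dataclass

# Hypothetical set of entity types, taken from the question's examples.
KNOWN_KINDS = {"foo", "bar", "baz"}


@dataclass(frozen=True)
class EntityId:
    kind: str          # entity type, stored next to the UUID, not inside it
    value: uuid.UUID   # left untouched, so it stays a standard UUID

    def __post_init__(self):
        if self.kind not in KNOWN_KINDS:
            raise ValueError(f"unknown entity type: {self.kind!r}")

    def __str__(self):
        # Output form used in logs and the CLI, e.g. "foo:<uuid>".
        return f"{self.kind}:{self.value}"

    @classmethod
    def parse(cls, text):
        kind, sep, raw = text.partition(":")
        if not sep:
            raise ValueError(f"missing type prefix: {text!r}")
        # uuid.UUID raises ValueError on malformed input.
        return cls(kind, uuid.UUID(raw))


foo_id = EntityId("foo", uuid.uuid4())
print(foo_id)
round_tripped = EntityId.parse(str(foo_id))
```

The UUID itself stays fully standard, so any library or database that expects an RFC 4122 value keeps working, while the prefix does the validation and labelling.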

Related

Is it possible to get the original UUID1 from a timestamp?

Say, for example, I generate a time based UUID with the following program.
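import uuid
u = uuid.uuid1()  # time-based (version 1) UUID; named u to avoid shadowing the module
print(u)
print(u.time)     # the 60-bit timestamp embedded in the UUID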
I get the following:
47702997-155d-11ea-92d3-6030d48747ec
137946228962896279
Can I get back the original UUID, that is 47702997-155d-11ea-92d3-6030d48747ec, if I know the timestamp (137946228962896279)?
I am reading about UUID version 1 and found a few programs that "kind of" try to reverse it, but every time I get a different UUID.
The parts that keep changing are the timestamp part (last 4 digits of the first block - 47702997) and the clock_sequence (92d3).
If it is possible to get back the original UUID, what would I need?
Any help/direction is greatly appreciated.
I also made a post in Security Stackexchange but later realized that this question should have been posted here.
The more I look into it, I can see that this is not at all possible since the timestamp does not contain information on the clock_sequence unless I am wrong, in which case, please correct me.
A UUIDv1 contains two main pieces: a temporally unique part (aka timestamp) and a spatially unique part. By design, the two are completely independent, so if you throw one of the pieces away, there is no way to recover it later from the other one.
More generally, notice how the entire UUID is 32 hexadecimal digits, or 128 bits of information, but the timestamp is 18 decimal digits, or only about 60 bits of information. Even without knowing much about UUIDs you could guess that some of the bits are redundant or fixed (and a few are fixed, or at least guessable), but over half of them? Not likely, which means this translation is not reversible.
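To make the independence of the pieces concrete, here is a small sketch that rebuilds a version 1 UUID from its parts with the standard `uuid.UUID(fields=...)` constructor. The helper name `uuid1_from_parts` is hypothetical. It shows that the timestamp alone is not enough: the same timestamp combined with a different clock sequence gives a different, equally valid UUID.

```python
import uuid


def uuid1_from_parts(timestamp, clock_seq, node):
    """Assemble a version 1 UUID from its three independent pieces."""
    time_low = timestamp & 0xFFFFFFFF
    time_mid = (timestamp >> 32) & 0xFFFF
    time_hi_version = ((timestamp >> 48) & 0x0FFF) | 0x1000   # version 1
    clock_seq_low = clock_seq & 0xFF
    clock_seq_hi_variant = ((clock_seq >> 8) & 0x3F) | 0x80   # RFC 4122 variant
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version,
                             clock_seq_hi_variant, clock_seq_low, node))


original = uuid.UUID("47702997-155d-11ea-92d3-6030d48747ec")

# With all three pieces the UUID is recovered exactly.
rebuilt = uuid1_from_parts(original.time, original.clock_seq, original.node)
print(rebuilt)   # 47702997-155d-11ea-92d3-6030d48747ec

# Same timestamp, but a guessed clock sequence: a different UUID.
guessed = uuid1_from_parts(original.time, 0, original.node)
print(guessed)   # 47702997-155d-11ea-8000-6030d48747ec
```

So recovering the original requires the timestamp plus the clock sequence (14 bits) and the node (48 bits), neither of which can be derived from the timestamp.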

Attempting to read raw database file

I'm attempting to read data from a database file (which employs the c-tree data structure). This is a very old product, and for various reasons, the ODBC drivers are no longer available to me.
What I have found is that the data is basically just line-by-line "flat-file". So my plan is to simply read the raw binary data from the file, and in effect, fashion my own custom-built ODBC.
Using a tool provided by the c-tree company themselves, I have even been able to get the details of each field's address (i.e. where it starts), its length (length of the byte array) and a column that I assume is actually telling me how the field is encoded (see below):
ADDRESS  LENGTH  TYPE(encoding?)         FIELD NAME
0        8       (128-0x80) CT_ARRAY     Reserved
8        4       (59-0x3B)  CT_INT4U     Record_ID
12       2       (41-0x29)  CT_INT2U     Type
14       2       (41-0x29)  CT_INT2U     Changes
16       52      (144-0x90) CT_FSTRING   Name
Am I correct in assuming that something like "(128-0x80)" should be the only information I need to decode the field into actual text? Or is it likely there's some further encryption that I'm not considering here?
Also could anyone tell me perhaps what exactly "(128-0x80)" is? I recognise the 0x80 as hex, but what does the 128 mean? With, at the very least, some kind of terminology to describe this thing, I could do some more google research.
Thanks in advance!
This type encoding is purely internal as our byte value representation of that data type.
For example: 0x80 (hex) = 128 (decimal) = data type CT_ARRAY, and likewise for the others.
ADDRESS  LENGTH  TYPE(encoding?)         FIELD NAME
0        8       (128-0x80) CT_ARRAY     Reserved
8        4       (59-0x3B)  CT_INT4U     Record_ID
12       2       (41-0x29)  CT_INT2U     Type
14       2       (41-0x29)  CT_INT2U     Changes
16       52      (144-0x90) CT_FSTRING   Name
You can view the data type descriptions online at documentation: https://docs.faircom.com/doc/ctreeplus/28396.htm
It is likely this is a fixed length record. Variable length records will contain a 10-byte header on each which will need to be accounted for. Also, this appears to be a 1- or 2-byte (16 bit application) packed aligned data record, which should always be considered. Other C-structure compiler defined alignments will complicate data extraction.
The "Reserved" field could be simply the placeholder marker for our 1 byte deleted record mark and deleted record stack value (also described in our documentation). However, it could also contain application specific data relevant only to that application as it is 8 bytes long.
There should be no other encryption or encoding of the data (certainly no Unicode).
Unfortunately, this ODBC Driver has been discontinued. There is a way for you to easily extract all data as it was intended with the c-treeACE database, but you will need to contact FairCom support for more information. The link for support is faircom.com/support.
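For the fixed-length, packed case described above, the field listing translates fairly directly into a `struct` format string. This is only a sketch under several assumptions that are not confirmed by the listing: little-endian byte order, no alignment padding, a single-byte character set, and NUL or space padding in the CT_FSTRING field. The name `decode_record` is hypothetical, and the real record may contain further fields after offset 68.

```python
import struct

# Layout from the field listing: 8-byte array, uint32, two uint16, 52-byte string.
# "<" means little-endian with no padding; the byte order is an assumption
# that should be checked against known values in the real file.
RECORD = struct.Struct("<8sIHH52s")


def decode_record(raw: bytes) -> dict:
    """Decode the first 68 bytes of one fixed-length record."""
    reserved, record_id, type_, changes, name = RECORD.unpack_from(raw, 0)
    return {
        "Reserved": reserved,
        "Record_ID": record_id,
        "Type": type_,
        "Changes": changes,
        # Assumes NUL or space padding and a single-byte character set.
        "Name": name.split(b"\x00", 1)[0].decode("latin-1").rstrip(),
    }


# Synthetic record for illustration only, not real c-tree data.
sample = RECORD.pack(b"\x00" * 8, 42, 1, 3, b"Widget".ljust(52, b" "))
print(decode_record(sample))
```

If Record_ID values come out implausibly large, the byte order is probably the other way round, in which case ">" in place of "<" is worth trying. Variable-length records would also need the 10-byte header skipped first.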

Allowed characters in AppEngine Datastore key name

If I create a named key for use in Google AppEngine, what kind of String is the key-name? Does it use Unicode characters or is it a binary string?
More specifically, if I want to have my key-name made up of 8-bit binary data, is there a way I can do it? If not, can I at least use 7-bit binary data? Or are there any reserved values? Does it use NULL as the End-Of-String marker, for example?
GAE docs do not specify any restrictions on the key-name String. So a String with any content should be valid.
If you want to use binary data as an identifier, then you should encode it into a String. You can use any of the binary-to-text encoding methods: the most used seem to be Base64 (3 bytes = 4 chars) and hexadecimal, i.e. Base16 (1 byte = 2 chars).
I meanwhile had some time to actually test this out by generating a bunch of keys with binary names and then performing a kind-only query to get all the keys back. Here are the results:
Any binary character is fine. If you create an entity with key name "\x00\x13\x127\x255", a query will find this entity and its key name will return that same string
The AppEngine Dashboard, Database Viewer, and other tools will simply omit characters that aren't displayable, so the key names "\x00test" and "\x00\x00test" will both show up as separate entities, but their keys are both shown as "test"
I have not tested all available AppEngine tools, just some of the basics in the Console, so there may be other tools that get confused by such keys...
Keys are UTF-8 encoded, so any character between 128 and 255 takes up 2 bytes of storage
From this, I would derive the following recommendations:
If you need to be able to work with individual entities from the AppEngine console and need to identify them by key, you are limited to printable characters and thus need to encode the binary key name into a String either in Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead)
If you don't care about key readability, but need to pack as much data as possible into the key name with minimal storage use, use Base128 encoding (i.e. 7 bits only; 14% overhead) to avoid the implicit UTF-8 encoding (50% overhead!) of 8-bit data
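The overhead figures can be checked with Python's standard `base64` module. This sketch measures overhead as extra stored bytes relative to the raw input and uses one copy of every 8-bit value as sample data. The UTF-8 line models the behaviour observed above by treating each byte as one character (code points 0 to 255), which is an assumption about how the key name is built, not a statement about the AppEngine API.

```python
import base64


def overhead(raw: bytes, encoded: bytes) -> float:
    """Extra bytes needed by the encoded form, as a percentage of the raw size."""
    return (len(encoded) - len(raw)) / len(raw) * 100


raw = bytes(range(256))  # every 8-bit value exactly once

encodings = {
    "Base16": base64.b16encode(raw),
    "Base64": base64.b64encode(raw),
    "Base85": base64.b85encode(raw),
    # One character per byte, then UTF-8: values 128-255 take two bytes each.
    "8-bit as UTF-8": raw.decode("latin-1").encode("utf-8"),
}

for name, encoded in encodings.items():
    print(f"{name}: {len(encoded)} bytes, {overhead(raw, encoded):.0f}% overhead")
```

Base64 comes out slightly above 33% here because of padding at the end of the input. Base128 is not in the standard library, so it is left out of the comparison.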
Asides:
I will accept @PeterKnego's answer instead of this one since this one basically only confirms and expands on what he already assumed correctly.
From looking through the source code of the Java API, I think that the UTF-8 encoding of the key-name happens in the API (while building the protocol buffer) rather than in BigTable, so if you really want to go nuts on storage space maximization, it may be possible to build your own protocol buffers and store full 8-bit data without overhead. But this is probably asking for trouble...

Which SQL Server 2008 data type should be used for image and text data?

I have a table in SQLServer, and one of the fields is expecting rather mixed data; sometimes it will receive a text value, let's say for the sake of argument:
ASCII ART PICTURE OF A CAT WITH A POORLY SPELT CAPTION
Then sometimes it will contain an actual JPEG which contains images of cats with poorly spelt captions. What's the best datatype to use in this instance? A big, long, object.
We can assume that the ASCII art pictures are fairly small, say 16 chars x 16 chars. You could store them in a VARCHAR(256) field, at least if it wasn't for those pesky big 2MB JPEGs that people send.
Additional bit
Semantically, we can say that both of these are images of cats, albeit in very different formats. The question I guess I'm asking is that if there is an appropriate datatype to handle values which are semantically the same thing, but could come in very different forms.
Additional bit #2
In this case, the amount of data about our cat varies wildly. Sometimes the cat caption image is very small, and we only need ~256bytes of information to be stored. But sometimes it is very large, if it were a plaintext ASCII cat image, it might take up ~256megabytes, but we can shove it in a JPEG and it only takes up 2megabytes. Yay, moar cat images can be stored on our harddrive! But naturally, I do not want to convert those ~256byte ASCII images to a 2megabyte JPEG, because then I can't fit as many pictures of cats on my harddrive.
In case there's a suggestion along these lines - I can't make a new format for those 2MB JPEGs, the output will always be 2MB for it to be compatible with the legacy, third-party, closed source cat-caption generator. That generates captions of such incredible awesomeness that rewriting it is a difficult task, and outside of the scope of the software being written.
NB: The actual real life use case is to do with data coming off a genetic analysis machine, if you're interested, but I'd rather the question be a more general use case of cat pictures. This being the internet and all.
Additional bit #3
We can say that the database structure looks a bit like
MyCatContent
idCatContent INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
catContentType INT NOT NULL REFERENCES CatContentType (catContentType),
content {whatSortOfFieldIsThis?}
CatContentType
catContentType INT NOT NULL AUTO_INCREMENT PRIMARY KEY,
description VARCHAR(20) NOT NULL

Phone Number Columns in a Database

In the last 3 companies I've worked at, the phone number columns are of type varchar(n). The reason being that they might want to store extensions (ext. 333). But in every case, the "-" characters are stripped out when inserting and updating. I don't understand why the ".ext" characters are okay to store but not the "-" character. Has anyone else seen this, and what explanation can you think of for doing it this way? If all you want to store is the numbers, then aren't you better off using an int field? Conversely, if you want to store the number as a string/varchar, then why not keep all the characters and not bother with formatting on display and cleaning on write?
I'm also interested in hearing about other ways in which phone number storage is implemented in other places.
Quick test: are you going to add/subtract/multiply/divide Phone Numbers? Nope. Similarly to SSNs, Phone Numbers are discrete pieces of data that can contain actual numbers, so a string type is probably most appropriate.
One point with storing phone numbers is a leading 0.
e.g. 01202 8765432
In an int column, the 0 will be stripped off, which makes the phone number invalid.
I would hazard a guess that the "-" is swapped for spaces (or dropped) because it doesn't actually mean anything.
e.g. 123-456-789 = 123 456 789 = 123456789
Personally, I wouldn't strip out any characters, as depending on where the phone number is from, it could mean different things. Leave the phone number in the exact format it was entered, as obviously that's the way the person who typed it in is used to seeing it.
It doesn't really matter how you store it, as long as it's consistent. The norm is to strip out formatting characters, but you can also store country code, area code, exchange, and extension separately if you have a need to query on those values. Again, the requirement is that it's consistent - otherwise querying it is a PITA.
Another reason I can think of not to store phone numbers as 'numbers' but as strings of characters, is that often enough part of the software stack you'd use to access the database (PHP, I am looking at you) wouldn't support big enough integers (natively) to be able to store some of the longer and/or exotic phone numbers.
The largest number that 32 bits can carry, without sign, is 4294967295. That wouldn't work for just any Russian mobile phone number; take, for instance, the number 4959261234.
So you have yourself an extra inconvenience of finding a way to carry more than 32-bits worth of number data. Even though databases have long supported very large integers, you only need one bad link in the chain for a showstopper. Like PHP, again.
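Both integer pitfalls mentioned in these answers, the lost leading zero and the 32-bit limit, are easy to reproduce. A short sketch using the example numbers from the answers above (the variable names are illustrative only):

```python
UINT32_MAX = 2**32 - 1  # 4294967295, the largest unsigned 32-bit value

# Leading zero: converting to an integer silently drops it.
entered = "01202 8765432"
as_int = int(entered.replace(" ", ""))
print(as_int)                    # 12028765432

# 32-bit limit: this number does not fit in an unsigned 32-bit integer.
long_number = "4959261234"
print(int(long_number) > UINT32_MAX)   # True

# Stored as text, the number survives exactly as entered.
as_text = entered
print(as_text == "01202 8765432")      # True
```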
Stripping some characters and allowing others may have an impact if the database table is going to drive another system, e.g. IP telephony of some sort. Depending on the systems involved, it may be legitimate to have ext.333 as a suffix, whereas the developers may not have accounted for "-" in the string (and yes, I am guessing here...)
As for storing as a varchar rather than an int, this is just plain-ole common sense to me. As mentioned before, leading zeros may be stripped in an int field, the query on an int field may perform implicit math functions (which could also explain stripping "-" from the text, you don't want to enter 555-1234 and have it stored as -679 do you?)
In short, I don't know the exact reasoning, but can deduce some possibilities.
I'd opt to store the digits as a string and add the various "()" and "-" in my display code. It does get more difficult with international numbers. We handle it by having various "internationalized" display formats depending on country.
What I like to do if I know the phone numbers are only going to be within a specific region, such as North America, is to change the entry into 4 fields: 3 digits for area code, 3 for prefix, 4 for line, and maybe 5 for extension. I then insert these as 1 field with '-' and maybe an 'e' to designate the extension. Any searching of course also needs to follow the same process. This ensures I get more regular data and even allows for the number to be used for actually making a phone call, once the - and the extension are removed. I can also get back to the original 4 fields easily.
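The four-field approach could be sketched roughly as follows. This assumes the North American layout of a 3-digit area code, 3-digit prefix and 4-digit line number, with up to 5 extension digits after an 'e'. The function names and the exact stored layout are hypothetical, chosen only to illustrate the round trip.

```python
import re

# Assumed stored layout: AAA-PPP-LLLL, optionally followed by "e" and 1-5 digits.
STORED = re.compile(r"([0-9]{3})-([0-9]{3})-([0-9]{4})(?:e([0-9]{1,5}))?")


def to_stored(area, prefix, line, ext=""):
    """Join the four entry fields into the single stored string."""
    stored = f"{area}-{prefix}-{line}" + (f"e{ext}" if ext else "")
    if not STORED.fullmatch(stored):
        raise ValueError(f"not a valid number: {stored!r}")
    return stored


def to_fields(stored):
    """Split the stored string back into the original four fields."""
    match = STORED.fullmatch(stored)
    if not match:
        raise ValueError(f"not in stored format: {stored!r}")
    area, prefix, line, ext = match.groups()
    return area, prefix, line, ext or ""


def to_dialable(stored):
    """Digits only, without separators or extension, for placing a call."""
    area, prefix, line, _ext = to_fields(stored)
    return area + prefix + line


stored = to_stored("212", "555", "1234", "333")
print(stored)               # 212-555-1234e333
print(to_fields(stored))    # ('212', '555', '1234', '333')
print(to_dialable(stored))  # 2125551234
```

Because every row passes through the same validation, searches can rely on one consistent format.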
Good stuff! It seems that the main point is that the formatting of the phone number is not actually part of the data but is instead an aspect of the source country. Still, by keeping the extension part of the number as is, one might be breaking the model of separating the formatting from the data. I doubt that all countries use the same syntax/format to describe an extension. Additionally, if integrating with a phone system is a (possible) requirement, then it might be better to store the extension separately and build the message as it is expected. But Mark also makes a good point that if you are consistent, then it probably won't matter how you store it since you can query and process it consistently as well.
Thank you Eric for the link to the other question.
When an automated telephone system uses a field to make a phone call, it may not be able to tell which characters it should dial and which it should ignore. A human being may see a "(", ")" or "-" character and know these are delimiters separating the area code (NPA), exchange (NXX) and line number. Remember, though, that each character represents a binary pattern that, unless the dialer is pre-programmed to ignore it, would be entered by an automated dialer. To account for this, it is better to store only the equivalent of the characters a user would press on the phone handset, and even better to store the individual values in separate columns so the dialer can use individual fields without having to parse the string.
Even if you are not using dialing automation, it is good practice to store data in a form you won't need to update in the future. It is much easier to add characters between fields than to strip them out of strings.
On the question of string vs. integer data types: as noted above, strings are the proper way to store phone numbers because of variations between countries. There is an important caveat, though: when aggregating statistics for reporting (i.e. counting how many numbers or calls), character strings are MUCH slower to count than integers. To account for this, it's important to add an integer identity column that you can use for counting instead of the varchar or char column.

Resources