Attempting to read a raw database file

I'm attempting to read data from a database file (which employs a c-tree data structure). This is a very old product, and for various reasons, the ODBC drivers are no longer available to me.
What I have found is that the data is basically just a line-by-line "flat file". So my plan is to simply read the raw binary data from the file and, in effect, fashion my own custom-built ODBC driver.
Using a tool provided by the c-tree company themselves, I have even been able to get the details of each field's address (i.e. where it starts), its length (the length of the byte array), and a column that I assume is telling me how the field is encoded (see below):
ADDRESS LENGTH TYPE(encoding?) FIELD NAME
0 8 (128-0x80) CT_ARRAY Reserved
8 4 (59-0x3B) CT_INT4U Record_ID
12 2 (41-0x29) CT_INT2U Type
14 2 (41-0x29) CT_INT2U Changes
16 52 (144-0x90) CT_FSTRING Name
Am I correct in assuming that something like "(128-0x80)" should be the only information I need to decode the field into actual text? Or is it likely there's some further encryption that I'm not considering here?
Also, could anyone tell me what exactly "(128-0x80)" is? I recognise the 0x80 as hex, but what does the 128 mean? With, at the very least, some kind of terminology to describe this thing, I could do some more Google research.
Thanks in advance!

This type encoding is purely internal: it is our byte-value representation of that data type.
For example: 0x80 (hex) = 128 (decimal) = data type CT_ARRAY, and likewise for the others.
ADDRESS LENGTH TYPE(encoding?) FIELD NAME
0 8 (128-0x80) CT_ARRAY Reserved
8 4 (59-0x3B) CT_INT4U Record_ID
12 2 (41-0x29) CT_INT2U Type
14 2 (41-0x29) CT_INT2U Changes
16 52 (144-0x90) CT_FSTRING Name
You can view the data type descriptions online at documentation: https://docs.faircom.com/doc/ctreeplus/28396.htm
It is likely this is a fixed-length record. Variable-length records contain a 10-byte header on each record, which will need to be accounted for. Also, this appears to be a 1- or 2-byte packed-aligned data record (from a 16-bit application), which should always be considered. Other compiler-defined C-structure alignments will complicate data extraction.
The "Reserved" field could be simply the placeholder marker for our 1 byte deleted record mark and deleted record stack value (also described in our documentation). However, it could also contain application specific data relevant only to that application as it is 8 bytes long.
There should be no other encryption or encoding of the data (certainly no Unicode).
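To illustrate what reading such a record could look like, here is a minimal Python sketch based only on the field table above. The file name ("data.dat"), the absence of a file header, the little-endian byte order, and the string padding convention are all assumptions you would need to verify against your actual file:

import struct

# Field layout taken from the table above: (offset, length, name, kind).
FIELDS = [
    (0,  8,  "Reserved",  "bytes"),    # CT_ARRAY: raw bytes
    (8,  4,  "Record_ID", "uint32"),   # CT_INT4U: unsigned 4-byte integer
    (12, 2,  "Type",      "uint16"),   # CT_INT2U: unsigned 2-byte integer
    (14, 2,  "Changes",   "uint16"),   # CT_INT2U
    (16, 52, "Name",      "fstring"),  # CT_FSTRING: fixed-length string
]
RECORD_LEN = 68  # 16 bytes of fixed fields + 52-byte Name

def decode_record(buf: bytes) -> dict:
    rec = {}
    for offset, length, name, kind in FIELDS:
        raw = buf[offset:offset + length]
        if kind == "uint32":
            rec[name] = struct.unpack("<I", raw)[0]  # "<" = little-endian (assumed)
        elif kind == "uint16":
            rec[name] = struct.unpack("<H", raw)[0]
        elif kind == "fstring":
            # Assumes NUL- or space-padded fixed strings; verify against your data.
            rec[name] = raw.rstrip(b"\x00 ").decode("ascii", errors="replace")
        else:
            rec[name] = raw
    return rec

# Assumes fixed-length records and no file header; adjust if the tool reports one.
with open("data.dat", "rb") as f:
    while (chunk := f.read(RECORD_LEN)) and len(chunk) == RECORD_LEN:
        print(decode_record(chunk))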
Unfortunately, this ODBC Driver has been discontinued. There is a way for you to easily extract all data as it was intended with the c-treeACE database, but you will need to contact FairCom support for more information. The link for support is faircom.com/support.

Related

How are strings stored as bytes in databases?

This question may be a little vague, but let me try to explain it clearly. I have been reading a database-related tutorial, and it mentioned that tables are serialized to bytes to be persisted on disk. When we deserialize them, we can locate each column based on the size of its type.
For example, we have a table:
---------------------------------------------------
| id (unsigned int 8) | timestamp (signed int 32) |
---------------------------------------------------
| Some Id | Some time |
---------------------------------------------------
When we are deserializing a byte array loaded from a file, we know the first 8 bits are the id, and the following 32 bits are the timestamp.
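As a quick sketch of that fixed-offset idea (Python's struct module here, with big-endian byte order chosen arbitrarily for the illustration):

import struct

# Layout from the table above: id = unsigned 8-bit, timestamp = signed 32-bit.
row = struct.pack(">Bi", 42, 1700000000)   # serialize one row to 5 bytes
some_id, some_time = struct.unpack(">Bi", row)
print(some_id, some_time)                  # 42 1700000000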
But the tutorial never mentioned how strings are handled in databases. They are not restricted to a fixed size, like 32 bits, and their size is not predictable (there can always be a very long string, who knows). So how exactly do databases handle strings?
I know that in an RDBMS you need to specify the size of the string, as Varchar(45) for example, and then it becomes easier. But what about databases like MongoDB or Redis, which do not require a size specification for strings? Do they just assume a specific length and increase the size once a longer one comes in?
That is basically my vague, non-specific question; I hope someone can give me some ideas on this. Thank you very much!
In MongoDB, documents are serialized as BSON (binary JSON-like objects). See the BSON spec for more details regarding each data type.
For string type, it is stored as:
<unsigned32 strsizewithnull><cstring>
From this line in the MongoDB source.
So a string field is stored with its length (including the null terminator) in the BSON object. The string itself is UTF-8 encoded as per the BSON spec, so it can use a variable number of bytes per symbol. Together with the other fields that make up the document, it is compressed using Snappy by default. This compressed representation is the one persisted to disk.
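For illustration only (this is not MongoDB's code), a small Python sketch of that length-prefixed layout:

import struct

def bson_string(value: str) -> bytes:
    """Encode a string the way BSON does: int32 length (including the
    trailing NUL) followed by the UTF-8 bytes and a NUL terminator."""
    data = value.encode("utf-8")
    return struct.pack("<i", len(data) + 1) + data + b"\x00"

def read_bson_string(buf: bytes, offset: int = 0) -> tuple[str, int]:
    """Decode a BSON string starting at offset; return (value, next_offset)."""
    (size,) = struct.unpack_from("<i", buf, offset)
    start = offset + 4
    value = buf[start:start + size - 1].decode("utf-8")  # drop the NUL
    return value, start + size

encoded = bson_string("héllo")      # UTF-8: variable bytes per symbol
print(encoded)                      # b'\x07\x00\x00\x00h\xc3\xa9llo\x00'
print(read_bson_string(encoded))    # ('héllo', 11)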
WiredTiger is a no-overwrite storage engine. If that document is updated, WiredTiger creates a new document, updates the internal pointer to the new one, and marks the old document as "space available for reuse".

How to encode list of 16 numbers in less than 2 bytes

I need to convey the availability of 16 items, identified by id (0-15), in a variable.
char item_availablity[16];
I can encode it with 2 bytes, where every bit is mapped to one item id: 1 represents available and 0 represents unavailable.
For example, 0000100010001000 indicates that the items with ids 4, 8, and 12 are available (counting bit positions from the left, starting at 0).
I need to encode this information by using less than 2 Bytes.
Is this possible? If so, how?
To put it simply:
No, you cannot. It's simply impossible to store 1 bit of information about 16 separate things in less than 16 bits. That is, in the general case.
An exception is if there are some restrictions. For instance, let's call the items i_1, i_2 ... i_16. If you know that i_1 is available if and only if i_2 is also available, then you can encode the availability of these two items in just one bit.
A more complicated example is that i_1 is available iff i_2 or i_3 is available. Then you could store the availability for these three items in two bits.
But for the general case, nope, it's completely impossible.
Trying to think outside the box here: if some items are rarer than others, you could use a variable-length encoding so that, on average, it would take fewer than 16 bits to store the information. However, there will be combinations of availabilities where it takes more than 16 bits.
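For reference, here is a small Python sketch of the 2-byte bitmask the question describes, which is already optimal for the general case. Note that it numbers bits from the least-significant end, so the printed string is mirrored relative to the question's example:

def pack_availability(available_ids: set[int]) -> int:
    """Pack 16 availability flags into one 16-bit integer (2 bytes)."""
    mask = 0
    for item_id in available_ids:
        mask |= 1 << item_id           # bit i set means item i is available
    return mask

def unpack_availability(mask: int) -> set[int]:
    return {i for i in range(16) if mask & (1 << i)}

mask = pack_availability({4, 8, 12})
print(f"{mask:016b}")                     # 0001000100010000
print(sorted(unpack_availability(mask)))  # [4, 8, 12]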

Variant UUID format

RFC 4122 defines UUID as having a specific format:
Timestamp: 8 bytes; Clock sequence and variant: 2 bytes; Node: 6 bytes
In my application, I use UUIDs for different types of entities, say foo, bar and baz. IOW, every object of type foo gets a UUID, so does every bar and baz. This results in a slew of UUIDs in the input and output (logs, CLI) of the application.
To validate the input and to make the output easier to handle, I am considering devoting one nibble (4 bits) to the type of the entity: foo, bar, baz, etc. For example, if that nibble were '5', it would indicate it is a UUID of a foo object. Each UUID would be self-describing in this scheme, can be validated on input and can be automatically prefixed with category names in the output, like "foo:5ace6432-..."
Is it a good idea to tweak the UUID format in this way? Randomness is not a source of worry, as we still have 118 bits (128 bits total - 6 bits used by the standard for variants - 4 bits used by me).
Where should I place this nibble? If I place it up front, it would overwrite a part of the timestamp. If I place it at the end, it would be less visible. If I place it somewhere in the middle, it would be even less visible. Is overwriting a part of the timestamp a problem?
Thanks.
UUIDs are 128-bit numbers; the format is just for human reading.
It is bad practice to try to interpret or modify the internal parts. A UUID should be opaque: a single unit that identifies something. If you need other information about your entities, you need other fields.
In short: self-describing UUIDs are a bad idea.
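As a sketch of what this answer recommends instead (the helper names here are hypothetical), carry the type next to the UUID rather than inside it, as the question's own "foo:5ace6432-..." output format already does:

import uuid

def make_id(entity_type: str) -> str:
    """Keep the UUID opaque; carry the entity type alongside it."""
    return f"{entity_type}:{uuid.uuid4()}"     # e.g. "foo:5ace6432-..."

def parse_id(tagged: str) -> tuple[str, uuid.UUID]:
    entity_type, _, raw = tagged.partition(":")
    return entity_type, uuid.UUID(raw)         # raises ValueError if malformed

tagged = make_id("foo")
print(tagged)
print(parse_id(tagged))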

Can I limit specific characters in a SQL Server column? Will it improve size and query speed?

I couldn't figure out the correct terminology for what I am asking so I apologize if this is in the wrong place or format.
I am rebuilding a database, call it aspsessionsv2. It consists of a single table with over 11 billion rows. Column 1 is a string with no limits other than being under 20 characters. The other columns all contain hex data... so there isn't any reason for those fields to store characters outside of A-F and 0-9. So...
Is there a way I can configure SQL Server to limit the field to those characters?
Will that reduce the overall size of the database?
Will that speed up queries to a database of this size?
What got me thinking about this was WinRAR. I compressed a 50GB file containing only hex characters down to 206MB. That blows my mind, even though I understand how it works, so I am curious whether I can do the same kind of "compression" on a SQL Server database.
Thank you!
Been a little bit since I have asked a question. Here is some tech info: Windows Server 2008 R2, SQL Server 2008, 10 Columns, 11 Billion Rows
You could use a blob (binary large object); that would cut the size of the hexadecimal-data fields in half. Hexadecimal encoding is often used to circumvent character-encoding issues.
You could also use a Base64 encoding rather than a Base16 (hexadecimal) encoding; it carries 6 bits per character rather than 4, so it would increase storage relative to a blob only 4:3 times, instead of the 2-fold increase of hexadecimal strings.
If you are using varchar or nvarchar to store strings of the characters 0-9 and A-F, then you should really be using the varbinary type instead. Each pair of hexadecimal characters represents one byte, so with varbinary each byte of data needs 1 byte on disk, with varchar it needs 2 bytes, and with nvarchar it needs 4 bytes.
Using varbinary instead of varchar will reduce the overall size of the database and will speed up queries, because fewer bytes need to be read from disk.
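A quick Python illustration of why that halves the storage (this shows the arithmetic only, not SQL Server itself):

hex_string = "123abc" * 8                # 48 hex characters, as stored in a varchar
as_binary = bytes.fromhex(hex_string)    # the same data as raw bytes (varbinary)
print(len(hex_string), len(as_binary))   # 48 24: varbinary needs half the bytes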
Hex values are just numbers, so you are likely better off storing them as such. For example, 123abc converts nicely to 1194684 and would only require 4 bytes instead of 8 bytes (6 characters + 2 bytes of varchar overhead). So, provided a number isn't going to exceed 2147483647, you can store them all as int.
However, if you wanted to restrict the column to containing only the characters 0-9 and a-f, you could use a check constraint, something like this:
ALTER TABLE YourTable
ADD CONSTRAINT CK_YourTable_YourColumn CHECK (YourColumn NOT LIKE '%[^0-9a-fA-F]%')
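To verify the conversion arithmetic from the answer above (Python used here purely as a calculator):

print(int("123abc", 16))   # 1194684, fits easily in a 4-byte int
print(2**31 - 1)           # 2147483647, the int upper bound mentioned above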

Allowed characters in AppEngine Datastore key name

If I create a named key for use in Google AppEngine, what kind of String is the key-name? Does it use Unicode characters or is it a binary string?
More specifically, if I want to have my key-name made up of 8-bit binary data, is there a way I can do it? If not, can I at least use 7-bit binary data? Or are there any reserved values? Does it use NULL as the End-Of-String marker, for example?
The GAE docs do not specify any restrictions on the key-name String, so a String with any content should be valid.
If you want to use binary data as an identifier, then you should encode it into a String. You can use any of the binary-to-text encoding methods: the most used seem to be Base64 (3 bytes = 4 chars) and BinHex (1 byte = 2 chars).
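For example, a sketch of the Base64 route in Python (GAE's actual key API is not shown here; the raw_id value is arbitrary):

import base64

raw_id = b"\x00\x13\xff\xfe"                        # arbitrary 8-bit binary data
key_name = base64.b64encode(raw_id).decode("ascii")
print(key_name)                                     # 'ABP//g==' (printable, safe)
assert base64.b64decode(key_name) == raw_id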
I meanwhile had some time to actually test this out by generating a bunch of keys with binary names and then performing a kind-only query to get all the keys back. Here are the results:
Any binary character is fine. If you create an entity with key name "\x00\x13\x127\x255", a query will find this entity and its key name will return that same string.
The AppEngine Dashboard, Database Viewer, and other tools will simply omit characters that aren't displayable, so the key names "\x00test" and "\x00\x00test" will both show up as separate entities, but their keys are both shown as "test".
I have not tested all available AppEngine tools, just some of the basics in the Console, so there may be other tools that get confused by such keys...
Keys are UTF-8 encoded, so any character between 128 and 255 takes up 2 bytes of storage.
From this, I would derive the following recommendations:
If you need to be able to work with individual entities from the AppEngine console and need to identify them by key, you are limited to printable characters and thus need to encode the binary key name into a String, either in Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead)
If you don't care about key readability, but need to pack as much data as possible into the key name with minimal storage use, use a Base128 encoding (i.e. 7 bits only; 14% overhead) to avoid the implicit UTF-8 encoding (50% overhead!) of 8-bit data; see the sketch after this list
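Here is a sketch of that 7-bit packing idea (hypothetical code, not a GAE API): keep every character in the 0-127 range so the implicit UTF-8 step stays at one byte per character.

def encode_base128(data: bytes) -> str:
    """Pack 8-bit bytes into 7-bit characters (0-127), LSB-first."""
    out, acc, nbits = [], 0, 0
    for byte in data:
        acc |= byte << nbits
        nbits += 8
        while nbits >= 7:
            out.append(chr(acc & 0x7F))
            acc >>= 7
            nbits -= 7
    if nbits:
        out.append(chr(acc))
    return "".join(out)

def decode_base128(text: str, length: int) -> bytes:
    """Reverse of encode_base128; length is the original byte count."""
    acc, nbits, out = 0, 0, bytearray()
    for ch in text:
        acc |= ord(ch) << nbits
        nbits += 7
        while nbits >= 8 and len(out) < length:
            out.append(acc & 0xFF)
            acc >>= 8
            nbits -= 8
    return bytes(out)

data = bytes(range(16))
encoded = encode_base128(data)   # 16 bytes -> ceil(128/7) = 19 chars (~14% asymptotically)
assert decode_base128(encoded, len(data)) == data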
Asides:
I will accept @PeterKnego's answer instead of this one, since this one basically only confirms and expands on what he already assumed correctly.
From looking through the source code of the Java API, I think that the UTF-8 encoding of the key-name happens in the API (while building the protocol buffer) rather than in BigTable, so if you really want to go nuts on storage space maximization, it may be possible to build your own protocol buffers and store full 8-bit data without overhead. But this is probably asking for trouble...
