How are strings stored as bytes in databases?

This question may be a little vague, but let me try to explain it clearly. I have been reading a database-related tutorial, and it mentioned that tables are serialized to bytes to be persisted on disk. When we deserialize them, we can locate each column based on the size of its type.
For example, we have a table:
---------------------------------------------------
| id (unsigned int 8) | timestamp (signed int 32) |
---------------------------------------------------
| Some Id | Some time |
---------------------------------------------------
When we are deserializing a byte array loaded from a file, we know the first 8 bits are the id, and the following 32 bits are the timestamp.
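For instance, a minimal Python sketch of decoding such a fixed-size row (the little-endian byte order and the sample values here are just assumptions for illustration):
import struct

# One row: an unsigned 8-bit id followed by a signed 32-bit timestamp.
# "<" = little-endian with no padding; B = uint8, i = int32.
ROW_FORMAT = "<Bi"

row_bytes = struct.pack(ROW_FORMAT, 42, 1700000000)
record_id, timestamp = struct.unpack(ROW_FORMAT, row_bytes)
print(record_id, timestamp)  # 42 1700000000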
But the tutorial never mentioned how strings are handled in databases. They are not limited to a fixed size, like 32 bits, and their size is not predictable (there can always be a longer string, who knows). So how exactly do databases handle strings?
I know that in an RDBMS you need to specify the size of a string column, as VARCHAR(45) for example, and then it becomes easier. But what about databases like MongoDB or Redis, which do not require a size to be specified for strings? Do they just assume a specific length and increase the size once a longer one comes in?
That is basically my vague, non-specific question; I hope someone can give me some ideas on this. Thank you very much.

In MongoDB, documents are serialized as BSON (binary, JSON-like objects). See the BSON spec for more details regarding the encoding of each datatype.
A string value is stored as:
<unsigned32 strsizewithnull><cstring>
From this line in the MongoDB source.
So a string field is stored in the BSON object with its length (including the null terminator) in front of it. The string itself is UTF-8 encoded as per the BSON spec, so each character may take a variable number of bytes. Together with the other fields that make up the document, it is then compressed (Snappy by default), and that compressed representation is what is persisted to disk.
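As a rough illustration (not the actual driver code), encoding a single BSON string value in Python would look like this:
import struct

def encode_bson_string(value):
    # BSON string: int32 length (UTF-8 bytes + trailing NUL),
    # then the UTF-8 bytes, then the NUL terminator.
    utf8 = value.encode("utf-8")
    return struct.pack("<i", len(utf8) + 1) + utf8 + b"\x00"

print(encode_bson_string("hi").hex())  # 03000000686900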
WiredTiger is a no-overwrite storage engine: if a document is updated, WiredTiger writes a new copy of it, updates the internal pointer to the new one, and marks the old document's space as available for reuse.

Related

MongoDB 16 MB document example, how much actual data?

Does anyone have a practical downloadable/viewable example of a 16 MB (max size) MongoDB document?
It should be a lot of data, but I'm trying to get a feel for and an understanding of how much data you can store in a 16 MB document,
like a "how many SQL rows of a 10-column table would that be?" sort of question.
Thanks
You can calculate the size of various documents using the BSON spec.
For example, a document {a:1} consisting of one key with an integer value would take 5+1+2+4=12 bytes.
You can use various drivers to convert your data to BSON to see how much space it actually takes up:
serene% irb -rbson
irb(main):001:0> {a:1}.to_bson.to_s
=> "\f\x00\x00\x00\x10a\x00\x01\x00\x00\x00\x00"
irb(main):002:0> {a:1}.to_bson.to_s.length
=> 12
If you have, let's say, documents which are flat (non-nested) mappings with keys that are 10 bytes long and 64-bit integer values, each key-value pair takes up 1+10+1+8=20 bytes. You can fit about 800,000 such key-value pairs in a single document.
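The same check works in Python with the bson package that ships with PyMongo (assuming a recent PyMongo; bson.encode was added in version 3.9):
import bson

print(len(bson.encode({"a": 1})))  # 12 bytes, matching the manual calculation

# 20 bytes per pair (1 type byte + 10-byte key + NUL + int64 value),
# inside a 16 MB document with 5 bytes of framing overhead:
print((16 * 1024 * 1024 - 5) // 20)  # 838860 pairs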
As Wernfried Domscheit said, https://ourworldindata.org/coronavirus-source-data provides a JSON file with a lot of data to view and test with.
I suggest importing the data into MongoDB Compass to view all the documents clearly.

Attempting to read raw database file

I'm attempting to read data from a database file (which employs a c-tree data structure). This is a very old product, and for various reasons the ODBC drivers are no longer available to me.
What I have found is that the data is basically just a line-by-line "flat file". So my plan is to simply read the raw binary data from the file and, in effect, fashion my own custom-built ODBC.
Using a tool provided by the c-tree company themselves, I have even been able to get the details of each field: its address (i.e. where it starts), its length (the length of the byte array), and a column that I assume tells me how the field is encoded (see below):
ADDRESS LENGTH TYPE(encoding?) FIELD NAME
0 8 (128-0x80) CT_ARRAY Reserved
8 4 (59-0x3B) CT_INT4U Record_ID
12 2 (41-0x29) CT_INT2U Type
14 2 (41-0x29) CT_INT2U Changes
16 52 (144-0x90) CT_FSTRING Name
Am I correct in assuming that something like "(128-0x80)" is the only information I need to decode the field into actual text? Or is it likely there's some further encryption that I'm not considering here?
Also, could anyone tell me what exactly "(128-0x80)" is? I recognise the 0x80 as hex, but what does the 128 mean? With, at the very least, some kind of terminology to describe this thing, I could do some more Google research.
Thanks in advance!
This type encoding is purely internal; it is our byte-value representation of that data type. For example: 0x80 (hex) = 128 (decimal) = data type CT_ARRAY, and likewise for the other rows in your table above.
You can view the data type descriptions in the online documentation: https://docs.faircom.com/doc/ctreeplus/28396.htm
It is likely this is a fixed-length record. Variable-length records contain a 10-byte header on each record, which will need to be accounted for. Also, this appears to be a 1- or 2-byte packed-aligned data record (from a 16-bit application), which should always be considered; other compiler-defined C-structure alignments will complicate data extraction.
The "Reserved" field could simply be the placeholder for our 1-byte deleted-record mark and deleted-record stack value (also described in our documentation). However, since it is 8 bytes long, it could also contain application-specific data relevant only to that application.
There should be no other encryption or encoding of the data (certainly no Unicode).
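Assuming a fixed-length, packed, little-endian record with single-byte text (all assumptions you should verify against your file), a minimal Python sketch of decoding one record from the layout above:
import struct

# Layout from the field table: 8-byte reserved array, uint32 Record_ID,
# uint16 Type, uint16 Changes, then a 52-byte fixed-length string Name.
RECORD_FORMAT = "<8sIHH52s"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 68 bytes

with open("data.dat", "rb") as f:  # hypothetical file name
    raw = f.read(RECORD_SIZE)

reserved, record_id, rec_type, changes, name = struct.unpack(RECORD_FORMAT, raw)
print(record_id, rec_type, changes, name.rstrip(b"\x00").decode("ascii", "replace"))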
Unfortunately, this ODBC Driver has been discontinued. There is a way for you to easily extract all data as it was intended with the c-treeACE database, but you will need to contact FairCom support for more information. The link for support is faircom.com/support.

What would be the best database server to store the value of pi?

Say 100 million digits, one string.
The purpose is to query the DB to find recurrences of a search string.
While I know that the LONGTEXT type in MySQL would allow the string to be stored, I am not sure that querying for a substring would actually give acceptable performance.
Would a NoSQL key-value model perform better?
Any suggestions or experience welcome (it does not have to be pi).
This may be taking you in the wrong direction but...
Using MySQL seems to be a high-overhead way of solving the specific problem of finding a string in a file.
100M digits, one byte each, is only 100MB. In Python, you could write the file as a sequence of bytes, each byte representing an ASCII digit. Or you could pack them into nibbles (4 bits cover the digits 0-9), halving the file.
In Python, you can read the file using:
fInput = open(<yourfilenamehere>, "rb")
fInput.seek(number_of_digit_you_want)
fInput.read(1) # to get a single byte
From this, it is easy to build a search solution that seeks out a specific string.
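For example, a minimal sketch that scans the whole file for a digit sequence (assuming one ASCII digit per byte and a hypothetical file name):
import mmap

with open("pi_digits.txt", "rb") as f:  # hypothetical file of ASCII digits
    digits = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    pos = digits.find(b"141592")        # search for a digit sequence
    while pos != -1:
        print("found at digit offset", pos)
        pos = digits.find(b"141592", pos + 1)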
Again, MySQL might be the way to go but, for your application, a lower-overhead method might be the ticket.

Allowed characters in AppEngine Datastore key name

If I create a named key for use in Google AppEngine, what kind of String is the key-name? Does it use Unicode characters or is it a binary string?
More specifically, if I want to have my key-name made up of 8-bit binary data, is there a way I can do it? If not, can I at least use 7-bit binary data? Or are there any reserved values? Does it use NULL as the End-Of-String marker, for example?
The GAE docs do not specify any restrictions on the key-name String, so a String with any content should be valid.
If you want to use binary data as an identifier, then you should encode it into a String. You can use any of the binary-to-text encodings: the most used seem to be Base64 (3 bytes = 4 chars) and plain hex (1 byte = 2 chars).
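A quick Python sketch of that round trip (using Base64; any binary-to-text encoding works the same way):
import base64

raw_key = bytes([0, 19, 127, 255])                    # arbitrary 8-bit data
key_name = base64.b64encode(raw_key).decode("ascii")  # printable key name
print(key_name)                                       # ABN//w==
assert base64.b64decode(key_name) == raw_key          # lossless round trip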
I meanwhile had some time to actually test this out by generating a bunch of keys with binary names and then performing a kind-only query to get all the keys back. Here are the results:
Any binary character is fine. If you create an entity with key name "\x00\x13\x127\x255", a query will find this entity, and its key name will return that same string.
The AppEngine Dashboard, Database Viewer, and other tools will simply omit characters that aren't displayable, so the key names "\x00test" and "\x00\x00test" will both show up as separate entities, but their keys are both shown as "test".
I have not tested all available AppEngine tools, just some of the basics in the Console, so there may be other tools that get confused by such keys...
Keys are UTF-8 encoded, so any character between 128 and 255 takes up 2 bytes of storage.
From this, I would derive the following recommendations:
If you need to be able to work with individual entities from the AppEngine console and identify them by key, you are limited to printable characters and thus need to encode the binary key name into a String, either in Base16 (hex; 100% overhead), Base64 (33% overhead), or Base85 (25% overhead)
If you don't care about key readability but need to pack as much data as possible into the key name with minimal storage use, use a Base128-style encoding (i.e. 7 bits only; 14% overhead) to avoid the implicit UTF-8 encoding (50% overhead on average!) of 8-bit data; the sketch below checks these figures
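These overhead figures are easy to verify in Python (a quick sketch using the standard library):
import base64

data = bytes(range(128, 256))                    # 128 bytes, all >= 0x80
as_utf8 = data.decode("latin-1").encode("utf-8")
print(len(data), len(as_utf8))                   # 128 256: every byte doubles
print(len(data.hex()))                           # 256: Base16, 100% overhead
print(len(base64.b64encode(data)))               # 172: Base64, ~33% overhead
print(len(base64.b85encode(data)))               # 160: Base85, 25% overhead
For random 8-bit data only about half the bytes fall in the 128-255 range, which is where the ~50% average UTF-8 overhead comes from.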
Asides:
I will accept @PeterKnego's answer instead of this one, since this one basically only confirms and expands on what he already assumed correctly.
From looking through the source code of the Java API, I think the UTF-8 encoding of the key name happens in the API (while building the protocol buffer) rather than in BigTable, so if you really want to go nuts on storage-space maximization, it may be possible to build your own protocol buffers and store full 8-bit data without overhead. But this is probably asking for trouble...

Which SQL Server 2008 data type should be used for image and text data?

I have a table in SQL Server, and one of the fields is expecting rather mixed data; sometimes it will receive a text value, let's say for the sake of argument:
ASCII ART PICTURE OF A CAT WITH A POORLY SPELT CAPTION
Then sometimes it will contain an actual JPEG which contains images of cats with poorly spelt captions. What's the best datatype to use in this instance? A big, long, object.
We can assume that the ASCII art pictures are fairly small, say 16 chars x 16 chars. You could store them in a VARCHAR(256) field, at least if it wasn't for those pesky big 2MB JPEGs that people send.
Additional bit
Semantically, we can say that both of these are images of cats, albeit in very different formats. The question I guess I'm asking is whether there is an appropriate datatype to handle values which are semantically the same thing but can come in very different forms.
Additional bit #2
In this case, the amount of data about our cat varies wildly. Sometimes the cat caption image is very small, and we only need ~256 bytes of information to be stored. But sometimes it is very large: as a plaintext ASCII cat image it might take up ~256 megabytes, but we can shove it in a JPEG and it only takes up 2 megabytes. Yay, moar cat images can be stored on our hard drive! But naturally, I do not want to convert those ~256-byte ASCII images to 2-megabyte JPEGs, because then I can't fit as many pictures of cats on my hard drive.
In case there's a suggestion along these lines: I can't make a new format for those 2MB JPEGs; the output will always be 2MB for it to be compatible with the legacy, third-party, closed-source cat-caption generator. That generates captions of such incredible awesomeness that rewriting it is a difficult task, and outside the scope of the software being written.
NB: The actual real-life use case involves data coming off a genetic-analysis machine, if you're interested, but I'd rather the question be a more general use case of cat pictures. This being the internet and all.
Additional bit #3
We can say that the database structure looks a bit like
MyCatContent
idCatContent INT IDENTITY NOT NULL PRIMARY KEY,
catContentType INT NOT NULL REFERENCES CatContentType (catContentType),
content {whatSortOfFieldIsThis?}
CatContentType
catContentType INT IDENTITY NOT NULL PRIMARY KEY,
description VARCHAR(20) NOT NULL
