Why is the LMDB database file larger than the actual data size?

I put around 11K key/value pairs into an LMDB database.
The LMDB database file size became 21 MB.
For the same data, LevelDB takes only 8 MB (with Snappy compression).
LMDB env info:
VERSION=3
format=bytevalue
type=btree
mapsize=1073741824
maxreaders=126
db_pagesize=4096
To check why the LMDB file size is larger, I iterated through all keys and values inside
the database. The total size of all keys and values is 10 MB.
But the actual size of the file is 21 MB.
What is the remaining 11 MB (21 MB - 10 MB) of the file used for?
If I compress the data before the put operation, the file only shrinks by 2 MB.
Why is the LMDB database file size larger than the actual data size?
Is there any way to shrink it?

The database is bigger than the original data because LMDB needs to do some bookkeeping to keep the data sorted. There is also overhead because, even if your record (key + value) is, say, 1 KB, LMDB allocates a fixed-size chunk of space to store it. I don't know the exact value, but this overhead is always expected.
Compression doesn't work well on small records.
LMDB doesn't support prefix or block compression. Your best bet is to use a key-value store that does, like WiredTiger.
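If you want to see where the extra space goes, the page statistics make the overhead visible. Below is a minimal sketch using the Python lmdb bindings (an assumption; the question doesn't say which language is used, and the environment path is hypothetical): env.stat() reports the page size and the branch/leaf/overflow page counts, so pages-used times page-size can be compared against the raw key+value bytes.

```python
import lmdb

# Path is hypothetical; point it at the existing LMDB environment directory.
env = lmdb.open("/path/to/lmdb-dir", readonly=True)

stat = env.stat()            # B-tree statistics for the main database
psize = stat["psize"]        # page size, e.g. 4096

# Raw payload: the sum of every key and value length.
payload = 0
with env.begin() as txn:
    for key, value in txn.cursor():
        payload += len(key) + len(value)

pages_used = stat["branch_pages"] + stat["leaf_pages"] + stat["overflow_pages"]
print("entries:            ", stat["entries"])
print("payload bytes:      ", payload)
print("pages * page size:  ", pages_used * psize)
# The gap between the two numbers is B-tree structure (branch pages),
# per-page headers, and partially filled 4 KiB leaf pages.
```

As for shrinking the file, writing a compacted copy of the environment (mdb_copy -c on the command line, or env.copy(path, compact=True) in the Python bindings) is one way to reclaim space, since LMDB's copy-on-write design otherwise keeps freed pages around for reuse rather than returning them to the filesystem.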

Related

Why does SQLite store hundreds of null bytes?

In a database I'm creating, I was curious why the size was so much larger than the contents, so I checked out the hex. In a 4 kB file (a single row as a test), there are two major chunks of roughly 900 and 1000 bytes, along with a couple of smaller ones, that are all null bytes (0x00).
I can't think of any logical reason why it would be advantageous to store thousands of null bytes, significantly increasing the size of the database.
Can someone explain this to me? I've tried searching and haven't been able to find anything.
The structure of a SQLite database file (`*.sqlite`) is described on this page:
https://www.sqlite.org/fileformat.html
SQLite files are partitioned into "pages" that are between 512 and 65536 bytes long - in your case I imagine the page size is probably 1 KiB. If you're storing data that's smaller than 1 KiB (as you are with your single test row, which I imagine is maybe 100 bytes long), that leaves roughly 900 bytes left over - and unused (deallocated) space is usually zeroed out before (and after) use.
It's the same way computer working memory (RAM) works, as RAM also uses paging.
I imagine you expected the file to be very compact with a terse internal representation; this is the case with some file formats - such as old-school OLE-based Office documents - but others (and especially database files) require a different file layout that is optimized simultaneously for quick access and quick insertion of new data, and is also arranged to help prevent internal fragmentation - this comes at the cost of some wasted space.
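A small experiment makes the page effect concrete. This is just a sketch with Python's built-in sqlite3 module (the file and table names are made up): even a single tiny row costs whole pages, and the file size is exactly page_size * page_count.

```python
import os
import sqlite3

db_path = "tiny.sqlite"                                  # arbitrary file name
con = sqlite3.connect(db_path)
con.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, name TEXT)")
con.execute("INSERT INTO t VALUES (1, 'hello')")         # a few bytes of payload
con.commit()

page_size  = con.execute("PRAGMA page_size").fetchone()[0]
page_count = con.execute("PRAGMA page_count").fetchone()[0]
con.close()

print("file size on disk:     ", os.path.getsize(db_path))
print("page_size * page_count:", page_size * page_count)  # the same number
# The row itself is a few dozen bytes; the rest of each page is zero-filled.
```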
A quick thought-experiment will demonstrate why mutable (i.e. non-read-only) databases cannot use a compact internal file structure:
Think of a single database table as being like a CSV file (and CSVs themselves are compact enough with very little wasted space).
You can INSERT new rows by appending to the end of the file.
You can DELETE an existing row by simply overwriting the row's space in the file with zeroes. Note that you cannot actually "delete" the space by "moving" data (like using the Backspace key in Notepad) because that means copying all of the data in the file around - this is largely a bad idea.
You can UPDATE a row by checking whether the new row's width will fit in the current space (and overwriting the remaining space with zeros); if not, you append the new version at the end and zero out the existing row (à la INSERT then DELETE).
But what if you have two database tables (with different columns) and need to store them in the same file? One approach is to simply mix each table's rows in the same flat file - but for other reasons that's a bad idea. So instead, inside your entire *.sqlite file, you create "sub-files" that have a known, fixed size (e.g. 4 KiB) and store rows for only a single table until the sub-file is full; each also stores a pointer (like a linked list) to the next sub-file that contains the rest of the data, if any. Then you simply create new sub-files as you need more space inside the file and set up their next-file pointers. These sub-files are what a "page" is in a database file, and this is how you can have multiple read/write database tables contained within the same parent filesystem file.
Then, in addition to these pages that store table data, you also need to store the indexes (which are what allow you to locate a table row near-instantly without scanning the entire table or file) and other metadata, such as the column definitions themselves - and often these are stored in pages too. Relational (tabular) database files can be considered filesystems in their own right (just encapsulated in a parent filesystem... which could be inside a *.vhd file... which could be buried inside a varbinary database column... inside another filesystem), and database systems themselves have been compared to operating systems: they offer an environment for programs (stored procedures) to run, they offer IO services, and so on. It's almost circular if you look at the old COBOL-based mainframes from the 1970s, when all of your IO operations were restricted to record-management operations (insert, update, delete).
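Deleted data illustrates the same point: pages released by a DELETE go onto SQLite's internal free list and the file does not shrink until you run VACUUM. A rough check, continuing the sqlite3 sketch above (behaviour assumes the default auto_vacuum=OFF):

```python
import sqlite3

con = sqlite3.connect("tiny.sqlite")                   # file from the snippet above
con.executemany("INSERT INTO t VALUES (?, ?)",
                ((i, "x" * 100) for i in range(2, 5000)))
con.commit()
print("pages after inserts: ", con.execute("PRAGMA page_count").fetchone()[0])

con.execute("DELETE FROM t WHERE id >= 2")             # drop almost all the data
con.commit()
print("pages after delete:  ", con.execute("PRAGMA page_count").fetchone()[0])
print("pages on free list:  ", con.execute("PRAGMA freelist_count").fetchone()[0])

con.execute("VACUUM")                                  # only now is the file rewritten compactly
print("pages after VACUUM:  ", con.execute("PRAGMA page_count").fetchone()[0])
con.close()
```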

How should I store this image?

I currently have a bunch of tiny images that I am saving to a folder on the server.
Would it be more convenient to store the Base64 string in the database and pull it out when needed?
I am ultimately looking for a way to conserve disk space, and the method I choose needs to withstand heavy traffic: 4M users minimum.
All 4M users will be looking at the image.
Would it be bad to retrieve data like that so often?
If I use the database, I will be using mysqli with phpMyAdmin.
EXPERIMENT:
Testing with image: Cute.jpeg
Size: 2.65kb
Size on Disk: 4.00kb
Base64 string:
/9j/4AAQSkZJRgABAQAAAQABAAD//gA7Q1JFQVRPUjogZ2QtanBlZyB2MS4wICh1c2luZyBJSkcgSlBFRyB2NjIpLCBxdWFsaXR5ID0gNzUK/9sAQwAIBgYHBgUIBwcHCQkICgwUDQwLCwwZEhMPFB0aHx4dGhwcICQuJyAiLCMcHCg3KSwwMTQ0NB8nOT04MjwuMzQy/9sAQwEJCQkMCwwYDQ0YMiEcITIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIy/8AAEQgAZABkAwEiAAIRAQMRAf/EAB8AAAEFAQEBAQEBAAAAAAAAAAABAgMEBQYHCAkKC//EALUQAAIBAwMCBAMFBQQEAAABfQECAwAEEQUSITFBBhNRYQcicRQygZGhCCNCscEVUtHwJDNicoIJChYXGBkaJSYnKCkqNDU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6g4SFhoeIiYqSk5SVlpeYmZqio6Slpqeoqaqys7S1tre4ubrCw8TFxsfIycrS09TV1tfY2drh4uPk5ebn6Onq8fLz9PX29/j5+v/EAB8BAAMBAQEBAQEBAQEAAAAAAAABAgMEBQYHCAkKC//EALURAAIBAgQEAwQHBQQEAAECdwABAgMRBAUhMQYSQVEHYXETIjKBCBRCkaGxwQkjM1LwFWJy0QoWJDThJfEXGBkaJicoKSo1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoKDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uLj5OXm5+jp6vLz9PX29/j5+v/aAAwDAQACEQMRAD8A9OsbK00eyS3gRUVR+J/Gmy6gN2AwzWbd3sbZ3PxRYFHO6KMAH+IjNZ899FsaxpqKNa3lkY5Ix9auI+fwrK88ltu5T9Tg1ejf5R7U4smSL2/jrTTLjvVNpiFOO9Rm44x3xQ5iUS75o9acJsjrWYZjuPPFPWf+VJSG0aBk96TzsVTEw6mlJ4zmtYyM2iaS5IP1pqTqTy1Z8knm5AfBFUXmkjYjzC30ccU+dgopm+0MMrbix/OiufFwSP8Aj4k/F1/wop+2YvYozLS08+UPON7/AN3sK2GdYrdsYUfWs60l3DYDy3U+1Ra7frZW6LjczdFzgfjXAnZHazQtXkLFjtYD0/8Ar1e+2cVkaYsjWgklI3NzgdKLpmRSQGrW7SMXqzSa5zzmo/Oz3rOs7gTblPOOfwonuAm456c/lWd7lWNET/Nin+eN1YyXYOWzzipUuFlH3ufShSFJGoLnJz0FTC8QjGQPqa5XUL8wzBNwUHv6U6C5MoGxsf7R61pGo9jOSNm8kELiVJNp/OoZvst8o8whJF/iXisnULqe0t94KOR3ORU9hMdQsxI6FH6EZpylqOK0CSzl38TBx2O7H9aKV51hbZKp3DvntRSuaD9LIeYP2GQPrXOeM74QSxynJRHAIH1ro7WRYEUjkL94+/8AnNcZ4y/eXHk88zKAPxzXP1SNXomdXDrcS2kbHKrtHbisnUfFdklm7zX0MJ/hR2wx/wCAjnHvitnTbOJ7JAwU/KBzXm3jjwbNLqT3llHu3/6xR2966eV312Oe91oa3h7xjbzXFwgffjoR3zW0L43AZsnnPFeaeGfCup/2sQEMQUAksMAivXYNGFvbA7cttFZTVnoXHbUyHu3VsL0A5qtLq72z7yw2itFLM+aWI+UnGKh1Lw+bi3k8o4YjjI79qhIbOYvvFdve6osLuqAYyzsFH611lvqa2ccPnLH5T/dbIOfx6GvGtb0XULbWJYZYXZ1xyo4rp/Bnhu/kDm5keOB/+WR7++PWt+WyutzF67not5e201jKA46cDdgil0C5/wCJf7ryD6jNc5rOhm1sn2s23B71vaEI49OiAOV8sEe/FZyb3ZULW0NKd/Mk3AjGO4orJtrpJYiZpTHIGIZSKKV2aG4sSCzRF+7xz/WvPdfvBda47oS0ccg5FburanPb6etpa75ruXEaKvJJNY11or6baIs53Tn5pCPX0rmdRXujVrSx32hFZbKM9iB3rRuNLWc7lP1rL8KOn2CM7ug5zXRvMC2xe/XFdyldHMk7la0023i5WMH1NSXCoi7QAoNTbhjG4DHbtVK5uYQCGccVmzQjNrD5YwB1qVIlcFcKM+tZ/wDaNuzbQ4wD0HerdvdRP8qyc+hFCd2EkZ9/o0V0/wB1RKPunHWnafoxt8mQgn0xWo+7BCkHHaoTd4UpKhz2NbRsjGRzviVALSQYy2CAKztJkYaWMdUCj8O9P8UXAaPEeR2PrVHQrryleKQdFHX0NY1J3ZVPY3DZWvBcnJGeCKKrPeALGsZBVVxk0VNzWx2mm+HLHSS0ygyXDDBkfkgeg9K57xHp6S7mAPNdldPhTXO3su9irdKVaCSsgi29Wc14bmlguzalRtzxXbuUjwqjBNcfOghnWeMcq2c10sd9FcWyyLycU6GqsTLRkd5cwbDulOR6Vymra/a29u8JcPJnk5rbnKNIQ/fgGvDfEd3Lp+tXdqwZsOcM3ce1bQpqT94mU2lod3beI7RGwqMSOc+tb9lrVpeYw4TH3+a8Qk1eRxjGB6A8V03gKQ3mssXysccZJPYk8CtXRpqLsZ+0m3qeyW15CYiFl80jrzzRcHdavcJvQjnHas20EdtNHMOVf5SCO/Y/zH41b1bUIrXTWTIDzDGP/r1hBXVn0ByOTuvNvbv94SRmt+Pww97aRyWsoSdR9x+Aw+tVtNEAweDz3rrbGZcDaB+FVToxesiXNrY5+Pwrq6IF+z5+jCiu8im/djkUVr9Wj3D6zIp3khweawbxgM5ao9Tu5pEmaFyVxtP+ww6Ef1H+Ti/2uLmFfMASQ/Kwz0cdR/UVx1ZJs2jNbEk8qr2HPQVnxay9nO25iY3PQdj7f41nXl+fKaTcdozk+lZk0jSKSW5bgY9Kx5ralyaOnutWjniLxNkDPANebeJmW+lOUYOp+9jk1tC/+zK7KMgdQe4qJ44NRh8w/KwJ4PvWlPEWXvGT1OFSxleVY8cE4zXaaDI2lA2+0BVOGbqSaprpMYm3qudhB+8auXCmNpT6sP581f1m8rIizvY6+XXYV09kkl4wffHv+FYw1CW9ujPM7FyAB7D0965+WZjKtsh5PJI7GtW0VYUXPAUcZ71lOo73HbW50NgztOTHnYgAI9z/AIf1rsNMlIAzXDadIokgXdkmRmbH0/xrrLPzXbKtxW9GaJZ1i3DKoCjIxRVSCUCIDIyOtFd3OYM5SyuZTYXJ3HIuCM/XmuWtp5G1KS2Y5jZS3uCASOaKK8OX2fQuXwonKK8ZVgCGyD+ZFVYVBvojyNkoVQOw6UUVz1G7ItvY5q4P+n3Ef8IxgfjViMlYEwcZkFFFdP2V8gRo2iKyMxHJ5P50l8AdNuhjoSQe/Siio/5eL+uxfUy7BAbKOT+PzCM9+9bskjShS/JYZP1oorKu9fmzN/C
XLS2j+0IBkYUng49K3tJZ/socuxJBJyaKK66Px/15DluWLad5TMzHkyf0FFFFaJuxys//2Q==
Base64 string length: 3628
Test:
Varchar(3628) takes 3629 bytes, with 1 byte reserved for the length prefix.
So the total size for the string is ~3.5 KB.
Conclusion: I save about 0.5 KB.
Things I might not have considered: MySQL's per-row storage overhead. If the table reserves 0.5 KB or more per row, this would be pointless and I should store my images locally.
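The ~3.5 KB figure follows directly from how Base64 works: every 3 bytes of binary become 4 text characters, so the encoded string is roughly a third larger than the original image. A quick sanity check (a sketch; 2715 bytes is just 2.65 KB rounded):

```python
import base64
import math

raw_bytes = 2715                      # ~2.65 KB, the size of Cute.jpeg
encoded_len = 4 * math.ceil(raw_bytes / 3)
print(encoded_len)                    # 3620 -- close to the 3628 measured above

# Or measure directly on the actual file:
# with open("Cute.jpeg", "rb") as f:
#     print(len(base64.b64encode(f.read())))
```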
Storing the Base64 image in a database will not conserve disk space; however, if you are considering storing the image in a database, you should look into BLOB objects.
First, if you're storing binary data in the database, you should really look at the BLOB family of data types (LONGBLOB, TINYBLOB, etc.). Forget about saving 0.5 KB of storage space; use the proper type and minimize the amount of conversion you have to do. Just my two cents.
If you're worried about heavy traffic, I don't think storing the file itself in the database is the way to go at all. You could run some benchmarks yourself to determine the overhead involved, but from what I've seen, the standard approach in these situations is to store the file on disk and store only the path in the database (plus metadata or whatever else you need). That way you don't overburden the database (and if you need to scale, I find it easier to scale the file system and/or web server than to increase database capacity).
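For illustration, here is a minimal sketch of the store-the-path approach described above. It uses Python and sqlite3 purely for brevity (the question's stack is PHP/mysqli), and every name in it - the directory, the table, the functions - is made up: the image bytes stay on disk and the database records only where to find them.

```python
import shutil
import sqlite3
from pathlib import Path

IMAGE_DIR = Path("uploads/images")            # hypothetical storage directory
IMAGE_DIR.mkdir(parents=True, exist_ok=True)

con = sqlite3.connect("app.sqlite")
con.execute("""CREATE TABLE IF NOT EXISTS images (
                   id   INTEGER PRIMARY KEY,
                   path TEXT NOT NULL)""")

def save_image(src: str) -> int:
    """Copy the file into IMAGE_DIR and record only its path in the DB."""
    dest = IMAGE_DIR / Path(src).name
    shutil.copy(src, dest)
    cur = con.execute("INSERT INTO images (path) VALUES (?)", (str(dest),))
    con.commit()
    return cur.lastrowid

def load_image(image_id: int) -> bytes:
    """Look up the path, then read the bytes from disk."""
    (path,) = con.execute("SELECT path FROM images WHERE id = ?",
                          (image_id,)).fetchone()
    return Path(path).read_bytes()

# usage: image_id = save_image("Cute.jpeg"); data = load_image(image_id)
```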

How do I calculate the row size of unstructured data?

In a classical RDBMS it's relatively easy to calculate the maximum row size by adding up the maximum size of each field defined in a table. That value multiplied by the predicted number of rows gives the maximum table size, excluding indexes, logs, etc.
Today, in the era of storing unstructured data in a structured way, it's relatively hard to tell what the optimal table size will be.
Is there any way to calculate or predict table or even database growth and storage requirements without loading sample data?
How do you calculate row size and plan storage capacity for an unstructured database?
It is pretty much the same. Find the average size of the data you need to persist and multiply it by your estimated transaction count per time unit.
Database engines may allocate datafile chunks exponentially (first 16 MB, then 32 MB, etc.), so you need to know how your DBMS engine works to translate the data size into physical storage space.
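Put as arithmetic: estimated storage = average record size x records per time unit x planning horizon, times a fudge factor for indexes and datafile pre-allocation. A back-of-the-envelope sketch (every number below is a made-up planning input):

```python
# All figures are hypothetical planning inputs, not measurements.
avg_record_bytes = 2_000          # average size of one unstructured document
inserts_per_day  = 50_000
days             = 365
overhead_factor  = 1.5            # indexes, page padding, datafile pre-allocation

raw = avg_record_bytes * inserts_per_day * days
estimated = raw * overhead_factor
print(f"raw data:  {raw / 1e9:.1f} GB")        # ~36.5 GB
print(f"estimated: {estimated / 1e9:.1f} GB")  # ~54.8 GB
```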

File fragmentation when data changes

If I add some letters to the middle of a file, does that operation cause file fragmentation?
I just want to store data in a file and update it occasionally. So... if I make many changes, could that become a performance issue?
It depends on the size of the changes.
- If you add only a few letters and the file does not grow beyond the space already covered by its allocated clusters, fragmentation will not be affected.
- If, after adding the letters, the system needs to allocate an additional cluster on disk, fragmentation could increase.

How do databases deal with data tables that cannot fit in memory?

Suppose you have a really large table, say a few billion unordered rows, and now you want to index it for fast lookups. Or maybe you are going to bulk load it and order it on the disk with a clustered index. Obviously, when you get to a quantity of data this size you have to stop assuming that you can do things like sorting in memory (well, not without going to virtual memory and taking a massive performance hit).
Can anyone give me some clues about how databases handle large quantities of data like this under the hood? I'm guessing there are algorithms that use some form of smart disk caching to handle all the data but I don't know where to start. References would be especially welcome. Maybe an advanced databases textbook?
Multiway merge sort is the keyword for sorting huge amounts of data.
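As a concrete illustration of the idea (a sketch only, not how any particular engine implements it): sort chunks that fit in memory, spill each sorted run to a temporary file, then stream a k-way merge over the runs.

```python
import heapq
import tempfile

def _spill(sorted_lines):
    """Write one sorted run to a temporary file and rewind it for merging."""
    f = tempfile.TemporaryFile(mode="w+t")
    f.writelines(line if line.endswith("\n") else line + "\n" for line in sorted_lines)
    f.seek(0)
    return f

def external_sort(lines, chunk_size=100_000):
    """Sort an arbitrarily large iterable of text lines using bounded memory."""
    runs, buf = [], []
    for line in lines:
        buf.append(line)
        if len(buf) >= chunk_size:          # chunk full: sort it and spill to disk
            runs.append(_spill(sorted(buf)))
            buf = []
    if buf:
        runs.append(_spill(sorted(buf)))
    # k-way merge of the sorted runs; heapq.merge streams, so memory stays bounded.
    return heapq.merge(*runs)

# usage: for line in external_sort(open("huge.txt")): ...
```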
As far as I know, most indexes use some form of B-tree, which does not need to keep everything in memory. You can simply put the nodes of the tree in a file and then jump to various positions in the file. This can also be used for sorting.
Are you building a database engine?
Edit: I built a disk-based database system back in the mid '90s.
Fixed-size records are the easiest to work with because the file offset for locating a record can be calculated as a simple multiple of the record size. I also had some with variable record sizes.
My system needed to be optimized for reading. The data was actually stored on CD-ROM, so it was read-only. I created binary search tree files for each column I wanted to search on. I took an open-source in-memory binary search tree implementation and converted it to do random access on a disk file. Sorted reads from each index file were easy, and then reading each data record from the main data file in the indexed order was also easy. I didn't need to do any in-memory sorting, and the system was far faster than any of the available RDBMSs that would run on a client machine at the time.
For fixed-size records, the index can just keep track of the record number. For variable-length records, the index needs to store the offset within the file where the record starts, and each record needs to begin with a structure that specifies its length.
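The fixed-size case really is just offset = record_number * record_size. A small sketch (the 32-byte record layout is invented) of seeking straight to the n-th record without scanning the file:

```python
import struct

# Hypothetical record layout: 4-byte int id + 28-byte name = 32 bytes per record.
RECORD = struct.Struct("<i28s")

def write_record(f, n, rec_id, name):
    """Store the record at its computed offset; no scan of the file is needed."""
    f.seek(n * RECORD.size)
    f.write(RECORD.pack(rec_id, name.encode()))

def read_record(f, n):
    """Seek directly to the n-th record and decode it."""
    f.seek(n * RECORD.size)
    rec_id, name = RECORD.unpack(f.read(RECORD.size))
    return rec_id, name.rstrip(b"\x00").decode()

# usage:
# with open("data.bin", "w+b") as f:
#     write_record(f, 5, 42, "alice")
#     print(read_record(f, 5))          # (42, 'alice')
```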
You would have to partition your data set in some way and spread each partition across separate servers' RAM. If I had a billion 32-bit ints, that's 4 GB of RAM right there - and that's only the index.
For low-cardinality data, such as gender (which has only two values, male and female), you can represent each index entry in less than a byte. Oracle uses a bitmap index in such cases.
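The low-cardinality idea can be sketched with plain Python integers used as bit sets (an illustration of the concept, not Oracle's actual implementation): one bitmap per distinct value, one bit per row.

```python
# Toy bitmap index over a 'gender' column; row numbers are bit positions.
rows = ["M", "F", "F", "M", "F", "M", "M"]

bitmaps = {}                                  # value -> integer used as a bit set
for row_no, value in enumerate(rows):
    bitmaps[value] = bitmaps.get(value, 0) | (1 << row_no)

# "WHERE gender = 'F'": decode the set bits back into row numbers.
matches = [i for i in range(len(rows)) if (bitmaps["F"] >> i) & 1]
print(matches)                                # [1, 2, 4]

# Each row costs one bit per distinct value, far less than a full index entry.
```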
Hmm... interesting question.
I think most commonly used database management systems rely on the operating system's memory-management mechanisms, and when physical memory runs out, in-memory tables go to swap.
