Does anyone know of a way to shrink/compact a db4o database?
What do you mean by compact/shrink? Make an existing database smaller? Or do you want to compress the database?
One option for this is defragmentation, which frees up unused space in the database. When you delete objects, the database file doesn't shrink; instead the space is marked as free for new objects. Over time, as the database is manipulated, the file gets fragmented. Defragmentation reclaims that unused space.
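In db4o this is done with the Defragment helper. A minimal sketch, assuming the file is the same database.db4o used in the configuration example below and is not currently open:

import java.io.IOException;
import com.db4o.defragment.Defragment;

// Close the ObjectContainer first; Defragment rewrites the file
// and keeps a backup copy of the original next to it.
try {
    Defragment.defrag("database.db4o");
} catch (IOException e) {
    e.printStackTrace();
}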
When you store a lot of strings, you should consider using UTF-8 encoding instead of the default full Unicode. This saves a lot of space, because common (ASCII) characters then use only one byte.
EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
config.common().stringEncoding(StringEncodings.utf8());
ObjectContainer database = Db4oEmbedded.openFile(config,"database.db4o");
Note that you cannot change this setting for existing databases! You need to defragment the database in order to change the string encoding.
To compress the database you could use a compressing storage implementation. However, I don't know of any available implementation that does this.
Related
In a database I'm creating, I was curious why the size was so much larger than the contents, so I checked out the hex dump. In a 4 kB file (with a single row as a test), there are two major chunks of roughly 900 and 1000 bytes, along with a couple of smaller ones, that are all null bytes (0x00).
I can't think of any logical reason it would be advantageous to store thousands of null bytes, increasing the size of the database significantly.
Can someone explain this to me? I've tried searching, and haven't been able to find anything.
The structure of a SQLite database file (`*.sqlite`) is described on this page:
https://www.sqlite.org/fileformat.html
SQLite files are partitioned into "pages" which are between 512 and 65536 bytes long - in your case I imagine the page size is probably 1KiB. If you're storing data that's smaller than 1KiB (as you are with your single test row, which I imagine is maybe 100 bytes long?) then that leaves roughly 900 bytes over - and unused (deallocated) space is usually zeroed-out before (and after) use.
It's the same way computer working memory (RAM) works - as RAM also uses paging.
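If you want to confirm the page layout for your own file, the page size and page count can be read with PRAGMA statements. A quick sketch, assuming the xerial sqlite-jdbc driver and a file called test.sqlite:

import java.sql.*;

try (Connection conn = DriverManager.getConnection("jdbc:sqlite:test.sqlite");
     Statement st = conn.createStatement()) {
    try (ResultSet rs = st.executeQuery("PRAGMA page_size")) {
        rs.next();
        System.out.println("page_size  = " + rs.getInt(1));  // bytes per page
    }
    try (ResultSet rs = st.executeQuery("PRAGMA page_count")) {
        rs.next();
        System.out.println("page_count = " + rs.getInt(1));  // total pages in the file
    }
}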
I imagine you expected the file to be very compact with a terse internal representation; this is the case with some file formats - such as old-school OLE-based Office documents - but others (and especially database files) require a different file layout, one that is optimized simultaneously for quick access and quick insertion of new data, and that is also arranged to help prevent internal fragmentation - all of which comes at the cost of some wasted space.
A quick thought-experiment will demonstrate why mutable (i.e. non-read-only) databases cannot use a compact internal file structure:
Think of a single database table as being like a CSV file (and CSVs themselves are compact enough with very little wasted space).
You can INSERT new rows by appending to the end of the file.
You can DELETE an existing row by simply overwriting the row's space in the file with zeroes. Note that you cannot actually "delete" the space by "moving" data (like using the Backspace key in Notepad) because that means copying all of the data in the file around - this is largely a bad idea.
You can UPDATE a row by checking whether the new version will fit in the current space (overwriting any remaining space with zeroes); if not, you append the new version at the end and zero out the old row (a-la INSERT then DELETE).
But what if you have two database tables (with different columns) and need to store them in the same file? One approach is to simply mix each table's rows in the same flat file - but for other reasons that's a bad idea. So instead, inside your entire *.sqlite file, you create "sub-files", that have a known, fixed size (e.g. 4KiB) that store only rows for a single table until the sub-file is full; they also store a pointer (like a linked-list) to the next sub-file that contains the rest of the data, if any. Then you simply create new sub-files as you need more space inside the file and set-up their next-file pointers. These sub-files are what a "page" is in a database file, and is how you can have multiple read/write database tables contained within the same parent filesystem file.
Then in addition to these pages that store table data, you also need to store the indexes (which are what allow you to locate a table row near-instantly without needing to scan the entire table or file) and other metadata, such as the column definitions themselves - and often these are stored in pages too. Relational (tabular) database files can be considered filesystems in their own right (just encapsulated in a parent filesystem... which could be inside a *.vhd file... which could be buried inside a varbinary database column... inside another filesystem), and even the database systems themselves have been compared to operating systems, as they offer an environment for programs (stored procedures) to run, they offer IO services, and so on. It's almost circular if you look at the old COBOL-based mainframes from the 1970s, when all of your IO operations were restricted to record-management operations (insert, update, delete).
I have an SQLite database.
I created the tables and filled them with a considerable amount of data.
Then I cleared the database by deleting and recreating the tables. I confirmed that all the data had been removed and the tables were empty by looking at them using SQLite Administrator.
The problem is that the size of the database file (*.db3) remained the same after it had been cleared.
This is of course not desirable as I would like to regain the space that was taken up by the data once I clear it.
Did anyone make a similar observation and/or know what is going on?
What can be done about it?
From here:
When an object (table, index, trigger, or view) is dropped from the database, it leaves behind empty space. This empty space will be reused the next time new information is added to the database. But in the meantime, the database file might be larger than strictly necessary. Also, frequent inserts, updates, and deletes can cause the information in the database to become fragmented - scattered out all across the database file rather than clustered together in one place.
The VACUUM command cleans the main database by copying its contents to a temporary database file and reloading the original database file from the copy. This eliminates free pages, aligns table data to be contiguous, and otherwise cleans up the database file structure.
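So after clearing your tables you can reclaim the space by running VACUUM once. A minimal sketch, again assuming a JDBC SQLite driver and a file called mydata.db3:

import java.sql.*;

// VACUUM rewrites the whole file, so it needs enough free disk space for a
// temporary copy, and no other connection may hold an open transaction.
try (Connection conn = DriverManager.getConnection("jdbc:sqlite:mydata.db3");
     Statement st = conn.createStatement()) {
    st.executeUpdate("VACUUM");
}

SQLite also has PRAGMA auto_vacuum, but it only takes effect if it is enabled before the first table is created (or is followed by a VACUUM).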
Database sizes work like water marks, e.g. if the water rises the water mark goes up; when the water recedes the water mark stays where it was
You should look into shrinking databases
My application needs to store millions of binary blocks in a Postgres database (a few thousands of those might arrive per second). Most of the blocks are 16K in size though some might be smaller. I understand that I can use text, bytea or blob columns or I can store the binary data in files outside the database and put their paths in the table.
Considering that high write-throughput is my most important goal, which option is the most appropriate for my situation?
bytea is the sensible option here - almost the only option.
There is no advantage to using text, varchar etc. Don't store encoded binary in them. That's an option you should immediately disregard.
There is no blob type in PostgreSQL. I think you might mean lo (from the lo extension), which is a wrapper for oid used for looking up "large objects" in the pg_largeobject table. This is useful when you need virtual "files" in the database where you can seek, overwrite, append, etc, but it's not at all suited to your use case.
You could store paths or filenames then look them up externally, but you're going to have a lot of very small files. You're also going to need a sidechannel for clients to read and write them, since you can't use the PostgreSQL protocol directly. You'll need to handle backup/restore and replication for them separately. They won't get deleted if a transaction rolls back or if the corresponding database tuple gets deleted so you'll need a cleanup system to remove no-longer-needed files. It'll get messy. This is worth doing when the files are big, long-lived, and mostly static, but it doesn't sound like that's the case for you.
Store the binary in bytea columns directly, and preferably use the binary protocol support in PgJDBC or libpq to exchange the bytea values between client and server without needing encoding. Have minimal indexes on the table you write to. (Under some circumstances you can even go without defining a primary key, but that's kind of an expert level option). If you don't mind losing the data in the table on unplanned restart, use an unlogged table. Otherwise batch writes and use asynchronous commits and/or a commit delay.
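A rough sketch of that approach with PgJDBC - the table name blocks, the connection details, and the fetchPendingBlocks() helper are all made up for illustration, and the table would be created as something like CREATE UNLOGGED TABLE blocks (data bytea):

import java.sql.*;
import java.util.List;

// binaryTransfer=true asks PgJDBC to use the binary wire format for parameters.
String url = "jdbc:postgresql://localhost/mydb?binaryTransfer=true";
try (Connection conn = DriverManager.getConnection(url, "user", "password")) {
    conn.setAutoCommit(false);
    try (Statement st = conn.createStatement()) {
        st.execute("SET synchronous_commit = off");   // asynchronous commit, as mentioned above
    }
    List<byte[]> incomingBlocks = fetchPendingBlocks();   // hypothetical source of your ~16K chunks
    try (PreparedStatement ps =
             conn.prepareStatement("INSERT INTO blocks (data) VALUES (?)")) {
        for (byte[] block : incomingBlocks) {
            ps.setBytes(1, block);                     // sent as bytea, no text encoding
            ps.addBatch();
        }
        ps.executeBatch();
    }
    conn.commit();                                     // one commit per batch, not per row
}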
See also How to speed up insertion performance in PostgreSQL
Try all your options, benchmark them, and figure out which one works best for you.
I currently have a bunch of tiny images that I am saving to the server in a folder.
Would it be more convenient to store the base64 string in the database and pull it out when needed?
I am ultimately looking for a way to conserve disk space, and the method I am looking for will need to be able to withstand heavy traffic. 4M users minimum.
All 4M users will be looking at the image.
Would it be bad to retrieve data like that so often?
If I am using the database, I will be using mysqli with phpMyAdmin.
EXPERIMENT:
Testing with image: Cute.jpeg
Size: 2.65 KB
Size on disk: 4.00 KB
Base64 string:
/9j/4AAQSkZJRgABAQAAAQABAAD//gA7Q1JFQVRPUjogZ2QtanBlZyB2MS4wICh1c2luZyBJSkcgSlBFRyB2NjIpLCBxdWFsaXR5ID0gNzUK/9sAQwAIBgYHBgUIBwcHCQkICgwUDQwLCwwZEhMPFB0aHx4dGhwcICQuJyAiLCMcHCg3KSwwMTQ0NB8nOT04MjwuMzQy/9sAQwEJCQkMCwwYDQ0YMiEcITIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIy/8AAEQgAZABkAwEiAAIRAQMRAf/EAB8AAAEFAQEBAQEBAAAAAAAAAAABAgMEBQYHCAkKC//EALUQAAIBAwMCBAMFBQQEAAABfQECAwAEEQUSITFBBhNRYQcicRQygZGhCCNCscEVUtHwJDNicoIJChYXGBkaJSYnKCkqNDU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6g4SFhoeIiYqSk5SVlpeYmZqio6Slpqeoqaqys7S1tre4ubrCw8TFxsfIycrS09TV1tfY2drh4uPk5ebn6Onq8fLz9PX29/j5+v/EAB8BAAMBAQEBAQEBAQEAAAAAAAABAgMEBQYHCAkKC//EALURAAIBAgQEAwQHBQQEAAECdwABAgMRBAUhMQYSQVEHYXETIjKBCBRCkaGxwQkjM1LwFWJy0QoWJDThJfEXGBkaJicoKSo1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoKDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uLj5OXm5+jp6vLz9PX29/j5+v/aAAwDAQACEQMRAD8A9OsbK00eyS3gRUVR+J/Gmy6gN2AwzWbd3sbZ3PxRYFHO6KMAH+IjNZ899FsaxpqKNa3lkY5Ix9auI+fwrK88ltu5T9Tg1ejf5R7U4smSL2/jrTTLjvVNpiFOO9Rm44x3xQ5iUS75o9acJsjrWYZjuPPFPWf+VJSG0aBk96TzsVTEw6mlJ4zmtYyM2iaS5IP1pqTqTy1Z8knm5AfBFUXmkjYjzC30ccU+dgopm+0MMrbix/OiufFwSP8Aj4k/F1/wop+2YvYozLS08+UPON7/AN3sK2GdYrdsYUfWs60l3DYDy3U+1Ra7frZW6LjczdFzgfjXAnZHazQtXkLFjtYD0/8Ar1e+2cVkaYsjWgklI3NzgdKLpmRSQGrW7SMXqzSa5zzmo/Oz3rOs7gTblPOOfwonuAm456c/lWd7lWNET/Nin+eN1YyXYOWzzipUuFlH3ufShSFJGoLnJz0FTC8QjGQPqa5XUL8wzBNwUHv6U6C5MoGxsf7R61pGo9jOSNm8kELiVJNp/OoZvst8o8whJF/iXisnULqe0t94KOR3ORU9hMdQsxI6FH6EZpylqOK0CSzl38TBx2O7H9aKV51hbZKp3DvntRSuaD9LIeYP2GQPrXOeM74QSxynJRHAIH1ro7WRYEUjkL94+/8AnNcZ4y/eXHk88zKAPxzXP1SNXomdXDrcS2kbHKrtHbisnUfFdklm7zX0MJ/hR2wx/wCAjnHvitnTbOJ7JAwU/KBzXm3jjwbNLqT3llHu3/6xR2966eV312Oe91oa3h7xjbzXFwgffjoR3zW0L43AZsnnPFeaeGfCup/2sQEMQUAksMAivXYNGFvbA7cttFZTVnoXHbUyHu3VsL0A5qtLq72z7yw2itFLM+aWI+UnGKh1Lw+bi3k8o4YjjI79qhIbOYvvFdve6osLuqAYyzsFH611lvqa2ccPnLH5T/dbIOfx6GvGtb0XULbWJYZYXZ1xyo4rp/Bnhu/kDm5keOB/+WR7++PWt+WyutzF67not5e201jKA46cDdgil0C5/wCJf7ryD6jNc5rOhm1sn2s23B71vaEI49OiAOV8sEe/FZyb3ZULW0NKd/Mk3AjGO4orJtrpJYiZpTHIGIZSKKV2aG4sSCzRF+7xz/WvPdfvBda47oS0ccg5FburanPb6etpa75ruXEaKvJJNY11or6baIs53Tn5pCPX0rmdRXujVrSx32hFZbKM9iB3rRuNLWc7lP1rL8KOn2CM7ug5zXRvMC2xe/XFdyldHMk7la0023i5WMH1NSXCoi7QAoNTbhjG4DHbtVK5uYQCGccVmzQjNrD5YwB1qVIlcFcKM+tZ/wDaNuzbQ4wD0HerdvdRP8qyc+hFCd2EkZ9/o0V0/wB1RKPunHWnafoxt8mQgn0xWo+7BCkHHaoTd4UpKhz2NbRsjGRzviVALSQYy2CAKztJkYaWMdUCj8O9P8UXAaPEeR2PrVHQrryleKQdFHX0NY1J3ZVPY3DZWvBcnJGeCKKrPeALGsZBVVxk0VNzWx2mm+HLHSS0ygyXDDBkfkgeg9K57xHp6S7mAPNdldPhTXO3su9irdKVaCSsgi29Wc14bmlguzalRtzxXbuUjwqjBNcfOghnWeMcq2c10sd9FcWyyLycU6GqsTLRkd5cwbDulOR6Vymra/a29u8JcPJnk5rbnKNIQ/fgGvDfEd3Lp+tXdqwZsOcM3ce1bQpqT94mU2lod3beI7RGwqMSOc+tb9lrVpeYw4TH3+a8Qk1eRxjGB6A8V03gKQ3mssXysccZJPYk8CtXRpqLsZ+0m3qeyW15CYiFl80jrzzRcHdavcJvQjnHas20EdtNHMOVf5SCO/Y/zH41b1bUIrXTWTIDzDGP/r1hBXVn0ByOTuvNvbv94SRmt+Pww97aRyWsoSdR9x+Aw+tVtNEAweDz3rrbGZcDaB+FVToxesiXNrY5+Pwrq6IF+z5+jCiu8im/djkUVr9Wj3D6zIp3khweawbxgM5ao9Tu5pEmaFyVxtP+ww6Ef1H+Ti/2uLmFfMASQ/Kwz0cdR/UVx1ZJs2jNbEk8qr2HPQVnxay9nO25iY3PQdj7f41nXl+fKaTcdozk+lZk0jSKSW5bgY9Kx5ralyaOnutWjniLxNkDPANebeJmW+lOUYOp+9jk1tC/+zK7KMgdQe4qJ44NRh8w/KwJ4PvWlPEWXvGT1OFSxleVY8cE4zXaaDI2lA2+0BVOGbqSaprpMYm3qudhB+8auXCmNpT6sP581f1m8rIizvY6+XXYV09kkl4wffHv+FYw1CW9ujPM7FyAB7D0965+WZjKtsh5PJI7GtW0VYUXPAUcZ71lOo73HbW50NgztOTHnYgAI9z/AIf1rsNMlIAzXDadIokgXdkmRmbH0/xrrLPzXbKtxW9GaJZ1i3DKoCjIxRVSCUCIDIyOtFd3OYM5SyuZTYXJ3HIuCM/XmuWtp5G1KS2Y5jZS3uCASOaKK8OX2fQuXwonKK8ZVgCGyD+ZFVYVBvojyNkoVQOw6UUVz1G7ItvY5q4P+n3Ef8IxgfjViMlYEwcZkFFFdP2V8gRo2iKyMxHJ5P50l8AdNuhjoSQe/Siio/5eL+uxfUy7BAbKOT+PzCM9+9bskjShS/JYZP1oorKu9fmzN/C
XLS2j+0IBkYUng49K3tJZ/socuxJBJyaKK66Px/15DluWLad5TMzHkyf0FFFFaJuxys//2Q==
Base64 string length: 3628
Test:
Varchar(3628) takes 3630 bytes: 3628 for the string plus a 2-byte length prefix (MySQL uses a 2-byte prefix for VARCHAR columns that can exceed 255 bytes).
So the total size for the string is ~3.5 KB.
Conclusion: I save about 0.5 KB.
Things I might not have considered: the per-row overhead of the MySQL table. If that overhead is 0.5 KB or more, then this would be pointless, and I should store my images locally.
Storing the base64 image in a database will not conserve disk space; however, if you are considering storing the image in a database, you should look into BLOB objects.
First, if you're storing binary data in the database, you should really look at the BLOB family of data types (LONGBLOB, TINYBLOB, etc). Forget worries about saving 0.5kb of storage space; use the proper type and minimize the amount of other conversion you have to do. Just my two cents.
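As a rough sketch (table and connection details are made up; assumes the MySQL Connector/J driver), storing the raw bytes in a BLOB column looks like this, and the 2.65 KB image goes in as ~2.7 KB of bytes instead of a ~3.6 KB base64 string:

import java.nio.file.*;
import java.sql.*;

// Hypothetical table: CREATE TABLE images (id INT AUTO_INCREMENT PRIMARY KEY, data BLOB);
// BLOB holds up to 64 KB, more than enough for a small thumbnail.
byte[] raw = Files.readAllBytes(Paths.get("Cute.jpeg"));  // raw bytes, no base64 step

try (Connection conn = DriverManager.getConnection(
         "jdbc:mysql://localhost/mydb", "user", "password");
     PreparedStatement ps =
         conn.prepareStatement("INSERT INTO images (data) VALUES (?)")) {
    ps.setBytes(1, raw);
    ps.executeUpdate();
}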
If you're worried about heavy traffic, I don't think storing the file itself in the database is the way to go at all. You could run some benchmark tests yourself to determine the amount of overhead involved, but from what I've seen it's often pretty standard in those situations to store the file on disk and store the path only in the database (and metadata or whatever else you need). That way you don't overburden the database (and if you need to scale, I find it easier to scale up your file system and/or webserver than to increase your database capacity).
We have 20,000,000 generated text files every year; the average size is approx. 250 KB each (35 KB zipped).
We must put these files in some kind of archive for 10 years. There is no need to search inside the text files, but we must be able to find one text file by searching on 5-10 metadata fields such as "productname", "creationdate", etc.
I'm considering zipping each file and storing them in a SQL Server database with 5-10 searchable (indexed) columns and a varbinary(MAX) column for the zipped file data.
The database will grow huge over the years: 5-10 TB. So I think we need to partition the data, for example by keeping one database per year.
I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the data, but it seems this is more suitable for blobs > 1 MB?
Any other suggestions on how to manage such data volumes?
I'd say keeping the files in the filesystem would be a better idea. And you can keep file name and path in the DB. Here's a similar question.
Filestream is definitely more suited to larger blobs (750kB-1MB) as the overhead required to open the external file begins to impact read and write performance vs. vb(max) blob storage for small files. If this is not so much of an issue (ie. reads of blob data after the initial write are infrequent, and the blobs are effectively immutable) then it's definitely an option.
I would probably suggest keeping the files directly in a vb(max) column if you can guarantee they won't get much larger in size, but have this table stored in a separate filegroup using the TEXTIMAGE_ON option, which would allow you to move it to different storage from the rest of the metadata if necessary. Also, make sure to design your schema so the actual storage of blobs can be split over multiple filegroups, either using partitions or via some multiple-table scheme, so you can scale to different disks if necessary in the future.
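For illustration, a sketch of what that might look like - the filegroup, table, column, and connection names are all made up, and the ARCHIVE_BLOBS filegroup would have to be added to the database beforehand:

import java.sql.*;

// Hypothetical schema: metadata rows live on PRIMARY, the zipped blobs on ARCHIVE_BLOBS.
// TEXTIMAGE_ON moves the varbinary(MAX) (LOB) storage onto its own filegroup.
String ddl =
    "CREATE TABLE dbo.ArchivedFile (" +
    "  FileId       bigint IDENTITY PRIMARY KEY," +
    "  ProductName  nvarchar(100) NOT NULL," +
    "  CreationDate date NOT NULL," +
    "  ZippedData   varbinary(MAX) NOT NULL" +
    ") ON [PRIMARY] TEXTIMAGE_ON [ARCHIVE_BLOBS]";

try (Connection conn = DriverManager.getConnection(
         "jdbc:sqlserver://localhost;databaseName=Archive;integratedSecurity=true");
     Statement st = conn.createStatement()) {
    st.executeUpdate(ddl);
    // Indexes on the searchable metadata columns only; the blob column stays unindexed.
    st.executeUpdate("CREATE INDEX IX_ArchivedFile_ProductName ON dbo.ArchivedFile (ProductName)");
    st.executeUpdate("CREATE INDEX IX_ArchivedFile_CreationDate ON dbo.ArchivedFile (CreationDate)");
}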
Keeping the blobs directly associated with the SQL metadata either via Filestream or direct vb(max) storage has many advantages over dealing with filesystem / SQL inconsistencies not limited to ease of backup and other management operations.
I assume by "generated" you mean something like data being injected into document templates, so there's much repetition of text content, i.e. "boilerplate"?
20 million of such "generated" files per year is ~55,000 per day, ~2300 per hour!
I would manage such volume by not generating text files in the first place, and instead by creating database abstracts that contain the data that are pumped into the generated text, so that you can reconstitute the full document if necessary.
If you mean something else by "generated" could you elaborate?