I currently have a bunch of tiny images that I am saving to a folder on the server.
Would it be more convenient to store the base64 string in the database and pull it out when needed?
I am ultimately looking for a way to conserve disk space, and the method will need to withstand heavy traffic: 4M users minimum, all of whom will be looking at the image.
Would it be bad to retrieve data like that so often?
If I use the database, I will be using mysqli with phpMyAdmin.
EXPERIMENT:
Testing with image: Cute.jpeg
Size: 2.65kb
Size on Disk: 4.00kb
Base64 string:
/9j/4AAQSkZJRgABAQAAAQABAAD//gA7Q1JFQVRPUjogZ2QtanBlZyB2MS4wICh1c2luZyBJSkcgSlBFRyB2NjIpLCBxdWFsaXR5ID0gNzUK/9sAQwAIBgYHBgUIBwcHCQkICgwUDQwLCwwZEhMPFB0aHx4dGhwcICQuJyAiLCMcHCg3KSwwMTQ0NB8nOT04MjwuMzQy/9sAQwEJCQkMCwwYDQ0YMiEcITIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIyMjIy/8AAEQgAZABkAwEiAAIRAQMRAf/EAB8AAAEFAQEBAQEBAAAAAAAAAAABAgMEBQYHCAkKC//EALUQAAIBAwMCBAMFBQQEAAABfQECAwAEEQUSITFBBhNRYQcicRQygZGhCCNCscEVUtHwJDNicoIJChYXGBkaJSYnKCkqNDU2Nzg5OkNERUZHSElKU1RVVldYWVpjZGVmZ2hpanN0dXZ3eHl6g4SFhoeIiYqSk5SVlpeYmZqio6Slpqeoqaqys7S1tre4ubrCw8TFxsfIycrS09TV1tfY2drh4uPk5ebn6Onq8fLz9PX29/j5+v/EAB8BAAMBAQEBAQEBAQEAAAAAAAABAgMEBQYHCAkKC//EALURAAIBAgQEAwQHBQQEAAECdwABAgMRBAUhMQYSQVEHYXETIjKBCBRCkaGxwQkjM1LwFWJy0QoWJDThJfEXGBkaJicoKSo1Njc4OTpDREVGR0hJSlNUVVZXWFlaY2RlZmdoaWpzdHV2d3h5eoKDhIWGh4iJipKTlJWWl5iZmqKjpKWmp6ipqrKztLW2t7i5usLDxMXGx8jJytLT1NXW19jZ2uLj5OXm5+jp6vLz9PX29/j5+v/aAAwDAQACEQMRAD8A9OsbK00eyS3gRUVR+J/Gmy6gN2AwzWbd3sbZ3PxRYFHO6KMAH+IjNZ899FsaxpqKNa3lkY5Ix9auI+fwrK88ltu5T9Tg1ejf5R7U4smSL2/jrTTLjvVNpiFOO9Rm44x3xQ5iUS75o9acJsjrWYZjuPPFPWf+VJSG0aBk96TzsVTEw6mlJ4zmtYyM2iaS5IP1pqTqTy1Z8knm5AfBFUXmkjYjzC30ccU+dgopm+0MMrbix/OiufFwSP8Aj4k/F1/wop+2YvYozLS08+UPON7/AN3sK2GdYrdsYUfWs60l3DYDy3U+1Ra7frZW6LjczdFzgfjXAnZHazQtXkLFjtYD0/8Ar1e+2cVkaYsjWgklI3NzgdKLpmRSQGrW7SMXqzSa5zzmo/Oz3rOs7gTblPOOfwonuAm456c/lWd7lWNET/Nin+eN1YyXYOWzzipUuFlH3ufShSFJGoLnJz0FTC8QjGQPqa5XUL8wzBNwUHv6U6C5MoGxsf7R61pGo9jOSNm8kELiVJNp/OoZvst8o8whJF/iXisnULqe0t94KOR3ORU9hMdQsxI6FH6EZpylqOK0CSzl38TBx2O7H9aKV51hbZKp3DvntRSuaD9LIeYP2GQPrXOeM74QSxynJRHAIH1ro7WRYEUjkL94+/8AnNcZ4y/eXHk88zKAPxzXP1SNXomdXDrcS2kbHKrtHbisnUfFdklm7zX0MJ/hR2wx/wCAjnHvitnTbOJ7JAwU/KBzXm3jjwbNLqT3llHu3/6xR2966eV312Oe91oa3h7xjbzXFwgffjoR3zW0L43AZsnnPFeaeGfCup/2sQEMQUAksMAivXYNGFvbA7cttFZTVnoXHbUyHu3VsL0A5qtLq72z7yw2itFLM+aWI+UnGKh1Lw+bi3k8o4YjjI79qhIbOYvvFdve6osLuqAYyzsFH611lvqa2ccPnLH5T/dbIOfx6GvGtb0XULbWJYZYXZ1xyo4rp/Bnhu/kDm5keOB/+WR7++PWt+WyutzF67not5e201jKA46cDdgil0C5/wCJf7ryD6jNc5rOhm1sn2s23B71vaEI49OiAOV8sEe/FZyb3ZULW0NKd/Mk3AjGO4orJtrpJYiZpTHIGIZSKKV2aG4sSCzRF+7xz/WvPdfvBda47oS0ccg5FburanPb6etpa75ruXEaKvJJNY11or6baIs53Tn5pCPX0rmdRXujVrSx32hFZbKM9iB3rRuNLWc7lP1rL8KOn2CM7ug5zXRvMC2xe/XFdyldHMk7la0023i5WMH1NSXCoi7QAoNTbhjG4DHbtVK5uYQCGccVmzQjNrD5YwB1qVIlcFcKM+tZ/wDaNuzbQ4wD0HerdvdRP8qyc+hFCd2EkZ9/o0V0/wB1RKPunHWnafoxt8mQgn0xWo+7BCkHHaoTd4UpKhz2NbRsjGRzviVALSQYy2CAKztJkYaWMdUCj8O9P8UXAaPEeR2PrVHQrryleKQdFHX0NY1J3ZVPY3DZWvBcnJGeCKKrPeALGsZBVVxk0VNzWx2mm+HLHSS0ygyXDDBkfkgeg9K57xHp6S7mAPNdldPhTXO3su9irdKVaCSsgi29Wc14bmlguzalRtzxXbuUjwqjBNcfOghnWeMcq2c10sd9FcWyyLycU6GqsTLRkd5cwbDulOR6Vymra/a29u8JcPJnk5rbnKNIQ/fgGvDfEd3Lp+tXdqwZsOcM3ce1bQpqT94mU2lod3beI7RGwqMSOc+tb9lrVpeYw4TH3+a8Qk1eRxjGB6A8V03gKQ3mssXysccZJPYk8CtXRpqLsZ+0m3qeyW15CYiFl80jrzzRcHdavcJvQjnHas20EdtNHMOVf5SCO/Y/zH41b1bUIrXTWTIDzDGP/r1hBXVn0ByOTuvNvbv94SRmt+Pww97aRyWsoSdR9x+Aw+tVtNEAweDz3rrbGZcDaB+FVToxesiXNrY5+Pwrq6IF+z5+jCiu8im/djkUVr9Wj3D6zIp3khweawbxgM5ao9Tu5pEmaFyVxtP+ww6Ef1H+Ti/2uLmFfMASQ/Kwz0cdR/UVx1ZJs2jNbEk8qr2HPQVnxay9nO25iY3PQdj7f41nXl+fKaTcdozk+lZk0jSKSW5bgY9Kx5ralyaOnutWjniLxNkDPANebeJmW+lOUYOp+9jk1tC/+zK7KMgdQe4qJ44NRh8w/KwJ4PvWlPEWXvGT1OFSxleVY8cE4zXaaDI2lA2+0BVOGbqSaprpMYm3qudhB+8auXCmNpT6sP581f1m8rIizvY6+XXYV09kkl4wffHv+FYw1CW9ujPM7FyAB7D0965+WZjKtsh5PJI7GtW0VYUXPAUcZ71lOo73HbW50NgztOTHnYgAI9z/AIf1rsNMlIAzXDadIokgXdkmRmbH0/xrrLPzXbKtxW9GaJZ1i3DKoCjIxRVSCUCIDIyOtFd3OYM5SyuZTYXJ3HIuCM/XmuWtp5G1KS2Y5jZS3uCASOaKK8OX2fQuXwonKK8ZVgCGyD+ZFVYVBvojyNkoVQOw6UUVz1G7ItvY5q4P+n3Ef8IxgfjViMlYEwcZkFFFdP2V8gRo2iKyMxHJ5P50l8AdNuhjoSQe/Siio/5eL+uxfUy7BAbKOT+PzCM9+9bskjShS/JYZP1oorKu9fmzN/C
XLS2j+0IBkYUng49K3tJZ/socuxJBJyaKK66Px/15DluWLad5TMzHkyf0FFFFaJuxys//2Q==
Base64 string length: 3628
Test:
Varchar(3628) storing the full string takes 3630 bytes: a 2-byte length prefix is used once the column can hold more than 255 bytes.
So the total size for the string is ~3.5kb.
Conclusion: compared with the 4.00 KB size on disk, I save about 0.5 KB. (That 4.00 KB is just the filesystem's 4 KB allocation unit, though; the base64 string itself is about a third larger than the raw 2.65 KB of image data.)
Things I might not have considered: MySQL's per-row storage overhead. If that overhead is 0.5 KB or more per row, this would be irrational and I should store my images locally.
Storing the base64 image in a database will not conserve disk space; however, if you are considering storing the image in a database, you should look into BLOB columns.
First, if you're storing binary data in the database, you should really look at the BLOB family of data types (LONGBLOB, TINYBLOB, etc). Forget worries about saving 0.5kb of storage space; use the proper type and minimize the amount of other conversion you have to do. Just my two cents.
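For illustration, a minimal sketch of inserting the raw image bytes into a MEDIUMBLOB column, written in Java/JDBC rather than PHP/mysqli for brevity (the table, column and connection details are invented for this example):

import java.nio.file.Files;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class BlobInsert {
    public static void main(String[] args) throws Exception {
        // Hypothetical table: CREATE TABLE images (id INT AUTO_INCREMENT PRIMARY KEY, data MEDIUMBLOB NOT NULL)
        byte[] raw = Files.readAllBytes(Paths.get("Cute.jpeg"));   // raw bytes, no base64 inflation
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO images (data) VALUES (?)")) {
            ps.setBytes(1, raw);                                   // driver sends the binary data directly
            ps.executeUpdate();
        }
    }
}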
If you're worried about heavy traffic, I don't think storing the file itself in the database is the way to go at all. You could run some benchmark tests yourself to determine the amount of overhead involved, but from what I've seen it's often pretty standard in those situations to store the file on disk and store the path only in the database (and metadata or whatever else you need). That way you don't overburden the database (and if you need to scale, I find it easier to scale up your file system and/or webserver than to increase your database capacity).
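And a matching sketch of the path-in-the-database approach (again Java/JDBC, with an invented upload directory, table and column names):

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;

public class PathInsert {
    public static void main(String[] args) throws Exception {
        // Hypothetical table: CREATE TABLE images (id INT AUTO_INCREMENT PRIMARY KEY, path VARCHAR(255), mime VARCHAR(64))
        Path target = Paths.get("/var/www/uploads/Cute.jpeg");     // the file itself stays on disk
        Files.copy(Paths.get("Cute.jpeg"), target);
        try (Connection con = DriverManager.getConnection(
                 "jdbc:mysql://localhost/mydb", "user", "pass");
             PreparedStatement ps = con.prepareStatement(
                 "INSERT INTO images (path, mime) VALUES (?, ?)")) {
            ps.setString(1, target.toString());                    // only the path and metadata hit the database
            ps.setString(2, "image/jpeg");
            ps.executeUpdate();
        }
    }
}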
I put around 11K key/value pairs into an LMDB database.
The LMDB database file size became 21 MB.
For the same data, LevelDB takes only 8 MB (with Snappy compression).
LMDB env info:
VERSION=3
format=bytevalue
type=btree
mapsize=1073741824
maxreaders=126
db_pagesize=4096
To check why the LMDB file size is larger, I iterated through all keys and values inside the database. The total size of all keys and values is 10 MB,
but the actual size of the file is 21 MB.
What is the remaining 11 MB of file size (21 MB - 10 MB) used for?
If I compress the data before the put operation, the file only shrinks by 2 MB.
Why is the LMDB database file size larger than the actual data size?
Is there any way to shrink it?
The database is bigger than the original data because LMDB has to do some bookkeeping to keep the data sorted. There is also overhead because space is allocated in fixed-size pages: even if your record (key + value) is, say, 1 KB, LMDB allocates page space to store it. I don't know the exact figures, but this overhead is always expected.
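As a rough, hand-waving illustration (the page-fill figure below is an assumption about typical B+tree behaviour, not a measured LMDB number), the reported sizes are roughly consistent with page overhead alone:

payload:  ~11,000 records totalling ~10 MB  ->  ~0.9 KB per record on average
pages:    4 KB each (db_pagesize=4096), and B+tree pages are typically only ~50-70 % full
estimate: 10 MB / 0.7 ≈ 14 MB  up to  10 MB / 0.5 ≈ 20 MB, plus branch pages and the freelist

which lands in the same ballpark as the observed 21 MB.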
Compression doesn't work well on small records.
LMDB doesn't support prefix or block compression. Your best bet is to use a key-value store that does, like WiredTiger.
I have to store a lot of crawl and log data in a datastore with an efficient compression ratio.
So far I have tried and installed Cassandra, Couchbase, MySQL and a flat-file format, and I have read the architectural overviews of Google Bigtable, Hypertable and the LevelDB file layout.
Cassandra and Couchbase are about 1/5 the disk size of the uncompressed MySQL database, but I want better results.
So I need a simple data store with high compression features as in the Vertica, Teradata, Oracle and SQL Server products (page-level compression).
The actual flat-file dataset looks like this:
/oil_type/gas_station/2014-03/2014-03-05-23.csv
/oil_type/gas_station/2014-03/2014-03-06-00.csv
/oil_type/gas_station/2014-03/2014-03-06-01.csv
Each file contains about 400 highly redundant entries of roughly 5 KB each.
A file can be compressed from 1722 KB down to 39 KB, so a compression ratio of 44:1 up to 100:1, depending on the compression chunk size, should be possible.
Defining the use case:
I have to poll all relevant gas_station webpages/APIs every 30 seconds to get up-to-the-minute pricing information. Because it is not possible to write a parser for every gas station, a generic solution is required for index creation. With a database holding all crawled gas station pages, a generic parser can easily be developed and backtested, and with this raw data model, data loss through broken station-specific converters should be avoided.
With keys like "oil_type-gas_station-timestamp-content", it is easy and efficient to compare two gas stations' pricing over time. For reading a time series that is smaller than the compression chunk size, only 2 to 4 chunks should need to be decompressed.
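A minimal, store-agnostic sketch of that key layout (the station name and the two-hour window are made up): because the timestamp is ISO-8601 in UTC, the keys sort lexicographically in time order, so one station's history is a contiguous key range and a time window maps to a simple range scan.

import java.time.Instant;

public class CrawlKey {
    // Compose keys as "oil_type-gas_station-timestamp"; the crawled page content is the value.
    static String key(String oilType, String station, Instant ts) {
        return oilType + "-" + station + "-" + ts;
    }

    public static void main(String[] args) {
        String from = key("diesel", "gas_station_42", Instant.parse("2014-03-05T23:00:00Z"));
        String to   = key("diesel", "gas_station_42", Instant.parse("2014-03-06T01:00:00Z"));
        // A range scan over [from, to) in any SSTable-style store returns that station's
        // crawled pages for the two-hour window, touching only a few compression chunks.
        System.out.println("[" + from + ", " + to + ")");
    }
}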
So the following features would be ideal:
SSTables
Configurable compression options (level, compression engine, chunk size from 64 KB to 10 MB)
Range Scans
Java Bindings
column datastore for better compression
Nice to have:
Replication
Multi Master
write quorum of 1
Forward and backward iteration over the data. (to compare two time series)
configurable replica distribution
few dependencies
Question:
Which free database is able to hold archived, highly redundant crawl data (only a few bytes change between entries), compresses it well, and does not take too much time to query a random record?
(As opposed to the MySQL ARCHIVE format, which has to decompress the whole table up to the requested row.)
Maybe there is a log database that is able to index a lot of log lines and compress them internally? (Along the lines of Logstash, Fluentd, Flume.)
If someone knows of benchmarks or numbers on this topic, it would help a lot in evaluating the right technology.
I would be glad for your help!
Assuming you are in a multithreaded, possibly multi-process, environment, LevelDB is NOT a good idea.
Cassandra is written in Java, therefore you'll see excessive memory consumption when handling a big load of big files, at least without tweaking the JVM. Also, since it's written in Java, it probably won't be fast enough for really good compression.
I use HyperTable on my Linux-box to store photos and movies.
You can use HyperTable from any language with Thrift support.
Additionally, if you need it, you can use the C++ drivers for extra speed.
One thing that is nice about HyperTable is that it doesn't add a dependency on Java, since it's written in C++, which also means it's blazing fast and not garbage-collected (no memory overhead).
HyperTable does, however, ship with a Java client out of the box.
I use my own C# Thrift-client, which I ported from Java.
See >here< for the code.
Since HyperTable operates on byte arrays, you can simply put your file into the Thrift client as a byte array, and HyperTable will compress it for you automagically, if you have told it to do so in the column definition.
You could also try MongoDb, if you absolutely want to.
Mongo actually derives from humongous, by the way.
However, I must say I have never "really" used it.
We have 20,000,000 generated text files every year; the average size is approx. 250 KB each (35 KB zipped).
We must put these files in some kind of archive for 10 years. There is no need to search inside the text files, but we must be able to find one text file by searching on 5-10 metadata fields such as "productname", "creationdate", etc.
I'm considering zipping each file and storing them in a SQL Server database with 5-10 searchable (indexed) columns and a varbinary(MAX) column for the zipped file data.
The database will grow huge over the years, 5-10 TB, so I think we need to partition the data, for example by keeping one database per year.
I've been looking into using FILESTREAM in SQL Server for the varbinary column that holds the data, but it seems this is more suitable for blobs > 1 MB?
Any other suggestions on how to manage such data volumes?
I'd say keeping the files in the filesystem would be a better idea, and you can keep the file name and path in the DB. Here's a similar question.
Filestream is definitely more suited to larger blobs (750kB-1MB) as the overhead required to open the external file begins to impact read and write performance vs. vb(max) blob storage for small files. If this is not so much of an issue (ie. reads of blob data after the initial write are infrequent, and the blobs are effectively immutable) then it's definitely an option.
I would probably suggest keeping the files directly in a vb(max) column if you can guarantee they won't get much larger in size, but have this table stored in a separate filegroup using the TEXTIMAGE_ON option, which would allow you to move it to different storage from the rest of the metadata if necessary. Also, make sure to design your schema so the actual storage of blobs can be split over multiple filegroups, either using partitions or via some multiple-table scheme, so you can scale to different disks if necessary in the future.
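A hedged sketch of what such a schema might look like, issued over JDBC (the table, column and filegroup names are invented for illustration, the [BlobFileGroup] filegroup is assumed to already exist, and the Microsoft mssql-jdbc driver is assumed to be on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class CreateArchiveTable {
    public static void main(String[] args) throws Exception {
        String ddl =
            "CREATE TABLE dbo.ArchivedFiles (" +
            "  FileId        BIGINT IDENTITY PRIMARY KEY," +
            "  ProductName   NVARCHAR(200) NOT NULL," +
            "  CreationDate  DATETIME2 NOT NULL," +
            "  ZippedContent VARBINARY(MAX) NOT NULL" +
            ") ON [PRIMARY] TEXTIMAGE_ON [BlobFileGroup]";   // LOB data lives on its own filegroup
        try (Connection con = DriverManager.getConnection(
                 "jdbc:sqlserver://localhost;databaseName=Archive;encrypt=false",   // adjust for your environment
                 "user", "pass");
             Statement st = con.createStatement()) {
            st.execute(ddl);
            // Index the metadata columns that searches will hit
            st.execute("CREATE INDEX IX_ArchivedFiles_ProductName ON dbo.ArchivedFiles (ProductName)");
        }
    }
}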
Keeping the blobs directly associated with the SQL metadata either via Filestream or direct vb(max) storage has many advantages over dealing with filesystem / SQL inconsistencies not limited to ease of backup and other management operations.
I assume by "generated" you mean something like data being injected into document templates, so there is much repetition of text content, i.e. "boilerplate"?
20 million of such "generated" files per year is ~55,000 per day, ~2300 per hour!
I would manage such volume by not generating text files in the first place, and instead by creating database abstracts that contain the data that are pumped into the generated text, so that you can reconstitute the full document if necessary.
If you mean something else by "generated" could you elaborate?
Ok everyone, I have an excellent challenge for you. Here is the format of my data :
ID-1 COL-11 COL-12 ... COL-1P
...
ID-N COL-N1 COL-N2 ... COL-NP
ID is my primary key and index. I just use ID to query my database. The datamodel is very simple.
My problem is as follows:
I have 64 GB+ of data as defined above, and in a real-time application I need to query my database and retrieve the data instantly. I was thinking about two solutions, but neither has proved workable to set up.
First, use SQLite or MySQL. One table is needed, with one index on the ID column. The problem is that the database will be too large to have good performance, especially for SQLite.
Second, store everything in memory in a huge hashtable. RAM is the limit.
Do you have another suggestion? What about serializing everything to the filesystem and then, on each query, loading the queried data into a cache system?
When I say real-time, I mean about 100-200 queries/second.
A thorough answer would take into account data access patterns. Since we don't have these, we just have to assume that each row is equally likely to be accessed next.
I would first try using a real RDBMS, either embedded or a local server, and measure the performance. If this gives 100-200 queries/sec then you're done.
Otherwise, if the format is simple, then you could create a memory mapped file and handle the reading yourself using a binary search on the sorted ID column. The OS will manage pulling pages from disk into memory, and so you get free use of caching for frequently accessed pages.
Cache use can be optimized further by creating a separate index and grouping the rows by access pattern, such that rows that are often read are grouped together (e.g. placed first), and rows that are often read in succession are placed close to each other. This will ensure that you get the most back for a cache miss.
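A minimal sketch of the memory-mapped approach, assuming fixed-size records sorted by a 64-bit ID (the record layout and file name are assumptions, not anything from the question):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class MappedLookup {
    static final int RECORD_SIZE = 8 + 512;   // assumed layout: 8-byte ID followed by 512 bytes of payload

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile raf = new RandomAccessFile("data.bin", "r");
             FileChannel ch = raf.getChannel()) {
            // NOTE: a single mapping is limited to ~2 GB, so a 64 GB file would need
            // several mappings (or FileChannel.read with long offsets); kept simple here.
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            long recordCount = ch.size() / RECORD_SIZE;
            long wanted = 123456789L;

            // Binary search over records sorted by ID; the OS page cache keeps hot pages in RAM.
            long lo = 0, hi = recordCount - 1;
            while (lo <= hi) {
                long mid = (lo + hi) >>> 1;
                long id = buf.getLong((int) (mid * RECORD_SIZE));
                if (id == wanted) {
                    byte[] payload = new byte[RECORD_SIZE - 8];
                    buf.position((int) (mid * RECORD_SIZE) + 8);
                    buf.get(payload);                              // payload now holds the row's data
                    System.out.println("found record " + wanted);
                    return;
                } else if (id < wanted) lo = mid + 1;
                else hi = mid - 1;
            }
            System.out.println("not found");
        }
    }
}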
Given the way the data is used, you should do the following:
Create a record structure (fixed size) that is large enough to contain one full row of data
Export the original data to a flat file that follows the format defined in step 1, ordering the data by ID (incremental)
Do a direct access on the file and leave caching to the OS. To get record number N (0-based), you multiply N by the size of a record (in bytes) and read the record directly from that offset in the file.
Since you're in read-only mode, and assuming you're storing your file on a random-access medium, this scales very well and doesn't depend on the size of the data: each fetch is a single read from the file. You could try some fancy caching system, but I doubt it would gain you much in terms of performance unless you have a lot of requests for the same data row (and the OS you're using is doing poor caching). Make sure you open the file in read-only mode, though, as this should help the OS figure out the optimal caching mechanism.
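A minimal sketch of that direct-access scheme (the record size and file name are again assumptions made for illustration):

import java.io.RandomAccessFile;

public class DirectAccess {
    static final int RECORD_SIZE = 8 + 512;   // assumed fixed record layout: 8-byte ID + 512-byte payload

    // Fetch record number n (0-based): one seek + one read at offset n * RECORD_SIZE.
    static byte[] fetch(RandomAccessFile file, long n) throws Exception {
        byte[] record = new byte[RECORD_SIZE];
        file.seek(n * RECORD_SIZE);
        file.readFully(record);
        return record;
    }

    public static void main(String[] args) throws Exception {
        try (RandomAccessFile file = new RandomAccessFile("data.bin", "r")) {   // read-only, as suggested
            byte[] row = fetch(file, 1_000_000L);
            System.out.println("first payload byte: " + row[8]);
        }
    }
}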
Does anyone know of a way to shrink/compact a db4o database?
What do you mean by compact/shrink? Make an existing database smaller? Or do you want to compress the database?
One option for this is defragmentation. This frees up unused space in the database. When you delete objects, the database file doesn't shrink; instead, the space is marked as free for new objects. However, over time, as the database is manipulated, the file gets fragmented, so defragmentation reclaims that unused space.
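A minimal sketch, assuming the com.db4o.defragment.Defragment helper that ships with db4o (check the exact API of your db4o version):

import com.db4o.defragment.Defragment;

public class Shrink {
    public static void main(String[] args) throws Exception {
        // Rewrites the file without the free/fragmented space; as far as I recall,
        // a backup copy of the original file is kept by default.
        Defragment.defrag("database.db4o");
    }
}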
When you store a lot of strings, you should consider using UTF-8 encoding instead of the default full-Unicode encoding. This can save a lot of space, because most characters (all ASCII characters, in particular) then use only one byte.
EmbeddedConfiguration config = Db4oEmbedded.newConfiguration();
config.common().stringEncoding(StringEncodings.utf8());
ObjectContainer database = Db4oEmbedded.openFile(config,"database.db4o");
Note that you cannot change this setting for existing databases! You need to defragment the database in order to change the string encoding.
To compress the database you could use a compressing storage implementation. However, I don't know of any available implementation that does this.