I have an application that produces a large amount of data, all of which is written once, is then unchangeable (by law), and is rarely ever read. When it is read, it is always read in its entirety: all the data for 2012, say, is read in one shot and either processed for reporting or output in a different format for export (or, gasp, printed). The only way to access the data is to read an entire day's worth of data, or more than one day.
This data is easily represented as either two or three relational tables, or as a long list of self-contained documents.
What is the most storage-space-efficient way to store such data? Specifically, we're thinking of using Amazon S3 (object storage), though we could use something like RDS (Amazon's managed MySQL).
My current best bet is a gzipped file with JSON data for the entire day, one file per day.
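For illustration, here is a minimal sketch of that idea, assuming boto3 and a placeholder bucket name (both just assumptions for the example):

```python
import gzip
import json
from datetime import date

import boto3  # assumption: boto3 installed and AWS credentials configured

BUCKET = "my-archive-bucket"  # placeholder bucket name


def write_day_archive(day: date, records: list) -> None:
    """Serialize one day's records as gzipped JSON Lines and upload a single S3 object."""
    body = gzip.compress("\n".join(json.dumps(r) for r in records).encode("utf-8"))
    key = f"{day.year}/{day.isoformat()}.json.gz"  # e.g. 2012/2012-03-15.json.gz
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body)


def read_day_archive(day: date) -> list:
    """Reading a whole day back is one GET plus decompression."""
    key = f"{day.year}/{day.isoformat()}.json.gz"
    obj = boto3.client("s3").get_object(Bucket=BUCKET, Key=key)
    text = gzip.decompress(obj["Body"].read()).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line]
```

Reading a full year would then just be iterating over the keys under that year's prefix.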
Unless my data were pure ASCII (and even if it were), I would probably choose a binary storage format such as one of the following (see the sketch after this list):
BSON
Protocol Buffers
Bencode
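As a rough illustration of the size difference, here is a minimal sketch assuming the bson module that ships with pymongo (not the standalone "bson" package); the record contents are made up:

```python
import json

import bson  # assumption: the bson module bundled with pymongo (pip install pymongo)

record = {"invoice": 1042, "amount_cents": 129900, "payload": b"\x00\x01 raw binary"}

as_bson = bson.encode(record)  # compact binary encoding, stores raw bytes natively
as_json = json.dumps({**record, "payload": record["payload"].hex()})  # JSON must re-encode bytes as text

print(len(as_bson), "bytes as BSON vs", len(as_json.encode("utf-8")), "bytes as JSON")
print(bson.decode(as_bson) == record)  # round-trips back to the original dict
```

Protocol Buffers would typically be smaller still, but require compiling a .proto schema first.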
I would use Windows Azure's Table Storage because it allows heterogeneous structured data to be stored in a single table. Having database-like storage lets you append data as needed, and you can easily create a new table for each year.
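A minimal sketch of that layout, assuming the azure-data-tables SDK; the connection string, table name and partition scheme are just placeholders:

```python
from azure.data.tables import TableServiceClient  # assumption: azure-data-tables is installed

service = TableServiceClient.from_connection_string("<storage-account-connection-string>")
table = service.create_table_if_not_exists("records2012")  # one table per year

# Entities in the same table may carry different sets of properties (heterogeneous data).
table.upsert_entity({
    "PartitionKey": "2012-03-15",  # partitioning by day makes "read a whole day" one partition query
    "RowKey": "000001",
    "payload": '{"some": "json"}',
})
```

Querying with a filter on PartitionKey then returns a whole day's worth of entities in one go.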
According to GitHub, SeaweedFS is intended to be a simple and highly scalable distributed file system which enables you to store and fetch billions of files, fast. However, I don't understand the point of SeaweedFS Filer since it requires an external data store on top of SeaweedFS:
On top of the object store, optional Filer can support directories and POSIX attributes. Filer is a separate linearly-scalable stateless server with customizable metadata stores, e.g., MySql, Postgres, Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB, Sqlite, MemSql, TiDB, Etcd, CockroachDB, etc.
For the Filer to work, it first needs to "lookup metadata from Filer Store, which can be Cassandra/Mysql/Postgres/Redis/LevelDB/etcd/Sqlite" and then read the data from volume servers.
Since the SeaweedFS Filer needs to retrieve the file metadata from another data store (such as Cassandra, ScyllaDB or HBase) before it can retrieve the actual file, why not use that same data store to store the actual file? What is gained by storing the file metadata in one data store and the actual file in SeaweedFS?
GlusterFS, for example, stores metadata as xattrs in the underlying file system so there is no need for external data stores.
Doesn't requiring an external data store defeat the whole purpose of using SeaweedFS, since it requires two hops (round trips) instead of one? We now need to 1) get the file metadata from the external store and 2) get the actual file. If we had stored the actual file in the external data store, we could get it in one step instead of two.
The metadata includes per-file metadata and also the directory structure.
The former is similar to xattrs as you mentioned.
The latter is more like a graph database, which can be implemented on top of a key-value store or an SQL store.
For a key-value store or SQL store, saving large chunks of file content as values is not efficient: each key may be read and rewritten many times in order to maintain the data ordering needed for efficient lookups. That kind of write amplification is bad, especially when file sizes are in the GB/TB/PB range.
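To make the split concrete, here is an illustrative sketch (not actual SeaweedFS client code; the URLs and file ids are placeholders): the metadata store holds only small, index-friendly records, while the bulk bytes stay on the volume servers and are fetched in a second hop.

```python
import urllib.request

# Stand-in for the Filer store (Cassandra/MySQL/Redis/...): tiny rows, cheap to index and reorder.
filer_store = {
    "/2012/03/15/report.pdf": {
        "size": 734_003,
        "chunks": ["http://volume-1.example:8080/3,01637037d6"],  # placeholder volume URL + file id
    }
}


def read_file(path: str) -> bytes:
    meta = filer_store[path]                      # hop 1: metadata lookup, a few hundred bytes
    parts = [urllib.request.urlopen(url).read()   # hop 2: bulk reads from the volume server(s)
             for url in meta["chunks"]]
    return b"".join(parts)
```

Only the small metadata record ever churns inside the B-tree/LSM of the metadata store; the multi-megabyte (or multi-gigabyte) content is written once to a volume file and never rewritten.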
I have very simple data that I need to retrieve as quickly as possible:
I have json data that is associated with a hash of an email. So the table looks like this:
email_sha256, json
and has millions of rows.
I was wondering if one of the following two options would be faster:
1. Split the single large table into many smaller ones (split by alphabetical order).
2. Do not use a DB at all and serve the data as files, i.e. every email hash is the name of a separate file that contains the JSON data.
Creating a file for each user (for each email address) looks wrong in several respects:
For good performance you need to keep the number of files per directory small.
Databases were created for exactly this: an index lets you retrieve the information very quickly (see the sketch after this list).
Without a DB you need your own locking/synchronization mechanism.
If you are using a DB, why use JSON to store the data? If you are looking for performance, do not serialize the data to JSON.
What do you mean by "fast"? Can you quantify the acceptable duration/delay?
The exception might be if the information associated with each user is huge (much larger than a disk sector), but again, in that case, what do you mean by fast?
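A minimal sketch of the single-indexed-table approach, using sqlite3 purely for illustration (the same shape works in MySQL/Postgres; table and column names are placeholders):

```python
import hashlib
import json
import sqlite3

db = sqlite3.connect("emails.db")
db.execute("CREATE TABLE IF NOT EXISTS user_data (email_sha256 TEXT PRIMARY KEY, json TEXT)")


def put(email: str, payload: dict) -> None:
    key = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()
    db.execute("INSERT OR REPLACE INTO user_data VALUES (?, ?)", (key, json.dumps(payload)))
    db.commit()


def get(email: str):
    key = hashlib.sha256(email.lower().encode("utf-8")).hexdigest()
    row = db.execute("SELECT json FROM user_data WHERE email_sha256 = ?", (key,)).fetchone()
    return json.loads(row[0]) if row else None  # primary-key lookup: one index probe, regardless of row count


put("alice@example.com", {"plan": "pro"})
print(get("alice@example.com"))
```

Whether the payload stays JSON or becomes proper columns is a separate decision; the point is that a single indexed lookup already handles millions of rows without splitting the table.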
So, for example, let's say I wanted to set up a SQLite database that contains some data on invoices. Let's say each invoice has a date, invoice number, and company associated with it, for simplicity. Is there a good way for the database to access or store a PDF file (~300-700 KB per file) of the specified invoice? If this wouldn't work, any alternative ideas on what might work well?
Any help is greatly appreciated
You could store the data (each file) as a BLOB, which is a byte array/stream, so the file can basically be stored as-is within a BLOB.
However, it may be more efficient (see linked article) to just store the path to the file, or perhaps just the file name (depending upon standards) and then use that to retrieve and view the invoice.
That said, up to around 100 KB it can be more efficient to store files as BLOBs. You may find this document useful to read: SQLite 35% Faster Than The Filesystem.
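A minimal sketch of the BLOB approach with Python's sqlite3 (file names and column layout are placeholders):

```python
import sqlite3

db = sqlite3.connect("invoices.db")
db.execute("""CREATE TABLE IF NOT EXISTS invoices (
                  invoice_no   TEXT PRIMARY KEY,
                  invoice_date TEXT,
                  company      TEXT,
                  pdf          BLOB)""")

# Store: a 300-700 KB PDF fits comfortably in a BLOB column.
with open("INV-2023-0042.pdf", "rb") as f:
    db.execute("INSERT INTO invoices VALUES (?, ?, ?, ?)",
               ("INV-2023-0042", "2023-06-01", "Acme Ltd", f.read()))
db.commit()

# Retrieve: an ordinary SELECT; write the bytes back out to view the invoice.
pdf_bytes, = db.execute("SELECT pdf FROM invoices WHERE invoice_no = ?",
                        ("INV-2023-0042",)).fetchone()
with open("copy-of-invoice.pdf", "wb") as f:
    f.write(pdf_bytes)
```

Swapping the BLOB column for a file-path column (the alternative above) changes only the INSERT and SELECT payload, not the overall schema.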
SQLite does support a BLOB data type, which stores data exactly as it is entered. From the documentation:
The current implementation will only support a string or BLOB length up to 2³¹-1 or 2147483647
This limit is much larger than your expected need of 300-700 KB per file, so what you want should be possible. The other thing to consider is the size of your database. Unless you expect to have well north of around 100 TB, then the database size limit also should not pose a problem.
I am searching for a key-value store that can handle values with a size of some gigabytes. I have had a look at Riak, Redis, CouchDB, and MongoDB.
I want to store a user's workspace (equivalent to a directory in a filesystem, recursively with subdirectories and files in it) in this DB. Of course I could use the file system, but then I don't have features such as caching in RAM, failover, backup, and replication/clustering that are supported by Redis, for instance.
This implies that most of the values saved will be binary data, possibly several gigabytes in size, as one file in a workspace is mapped to one key-value tuple.
Has anyone some experiences with any of these products?
First off, computing an MD5 or CRC32 over gigabytes of data is going to be computationally expensive. It's probably better to avoid that. How about storing the data in a file and indexing the filename?
If you insist, though, my suggestion is still to store just the hash, not the entire data value, with a lookup array/table pointing to the final data location. The collision risk of this approach grows with the number of large samples you store; the longer the hash (32-bit vs 64-bit vs 1024-bit, etc.), the safer it gets. Almost any dictionary system in a programming language, or any database engine, will have a binary data storage mechanism. Failing that, you could store the hex string of the hash value in a char column.
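A minimal sketch of that split, assuming SHA-256 as the (longer, safer) hash and a plain directory as the bulk store; the directory name and the in-memory index are placeholders for whatever store you end up using:

```python
import hashlib
import os
import shutil

DATA_DIR = "blob_store"   # placeholder directory holding the bulk data
index = {}                # stand-in for the KV store / lookup table: hash -> location


def store_blob(src_path: str) -> str:
    """Hash the file in streaming fashion, copy it to a content-addressed location,
    and record only the small hash->path pair in the index."""
    h = hashlib.sha256()
    with open(src_path, "rb") as f:
        for chunk in iter(lambda: f.read(1024 * 1024), b""):  # 1 MiB chunks, never GB-sized buffers
            h.update(chunk)
    key = h.hexdigest()
    os.makedirs(DATA_DIR, exist_ok=True)
    dest = os.path.join(DATA_DIR, key)
    shutil.copyfile(src_path, dest)
    index[key] = dest
    return key  # this short key is all that needs to live in the key-value store
```

The store then only ever holds 64-character keys and short paths, never the multi-gigabyte values themselves.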
We are now using MongoDB, as it supports large binary values, is very popular, and has a large user base. Maybe we will switch to another store later, but currently it looks very good!
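For what it's worth, a single MongoDB document is capped at 16 MB, so values of several gigabytes go through GridFS, which chunks them across many smaller documents. A minimal pymongo sketch (connection string, database and file names are placeholders):

```python
import gridfs
from pymongo import MongoClient  # assumption: pymongo installed, MongoDB reachable locally

db = MongoClient("mongodb://localhost:27017")["workspaces"]
fs = gridfs.GridFS(db)

# Store: the file-like object is streamed in chunks, so multi-GB values are fine.
with open("big_dataset.bin", "rb") as f:
    file_id = fs.put(f, filename="alice/workspace/big_dataset.bin")

# Retrieve: fs.get() returns a file-like object that can be read whole or in pieces.
data = fs.get(file_id).read()
```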
We have an 18 GB table in MySQL with a column "html_view" that stores HTML source data, which we display on the page. Fetching the HTML from the "html_view" column is now taking too much time, which makes the page load slowly.
We want an approach that simplifies our existing structure and loads the HTML data faster, whether from the DB or in some other way.
One idea we are considering is to store the HTML data in .txt files and keep only the path of each file in the DB, fetching the data by reading that file. But we fear this will cause extensive read/write operations on our server and may slow the server down.
Is there a better approach for making this faster?
First of all, why store HTML in database? Why not render it on demand?
For big text tables, you could store compressed text in a byte array (BLOB) column, or compressed and base64-encoded as plain text (see the sketch below).
When you have a table with a large text column, how many other columns does it have? If not too many, you could split the table and keep the large column in a separate two-column key-value table. That should be faster and simpler than reading files from disk.
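A minimal sketch of the compression idea with Python's zlib; the compressed bytes would go into a BLOB/MEDIUMBLOB column instead of the current huge TEXT column (the sample markup is made up):

```python
import zlib

html = "<html>" + "<div class='row'>repetitive markup compresses well</div>" * 5000 + "</html>"

blob = zlib.compress(html.encode("utf-8"), 6)        # what you would INSERT into the BLOB column
restored = zlib.decompress(blob).decode("utf-8")     # what you do after SELECT, before rendering

print(len(html.encode("utf-8")), "->", len(blob), "bytes")  # markup typically shrinks several-fold
assert restored == html
```

MySQL also has built-in COMPRESS()/UNCOMPRESS() functions if you would rather keep the compression on the database side.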
Have a look at the Apache Caching guide.
It explains disk and memory caching. From my point of view, if the content is static (as the database table indicates), you should use Apache's capabilities instead of writing your own slower mechanisms, because you would be adding multiple layers on top.
The usual advice to measure instead of estimating still applies, though ;-).