NoSQL database for large values

I am searching for a key-value store that can handle values several gigabytes in size. I have looked at Riak, Redis, CouchDB and MongoDB.
I want to store a user's workspace (equivalent to a directory in a filesystem, recursively including its subdirectories and files) in this DB. Of course I could use the file system, but then I don't have features such as caching in RAM, failover, backup and replication/clustering, which Redis, for instance, supports.
This implies that most of the stored values will be binary data, possibly several gigabytes each, since each file in a workspace is mapped to one key-value tuple.
Does anyone have experience with any of these products?

First off, computing an MD5 or CRC32 over gigabytes of data is going to be painfully expensive computationally, so it is probably better to avoid that. How about storing the data in a file and indexing the filename?
If you insist, though, my suggestion is still to store just the hash, not the entire data value, with a lookup array/table pointing to the final data location. The risk of this approach (two different values producing the same hash) grows with the number of values stored. The longer the hash you create (32-bit vs. 64-bit vs. 1024-bit, etc.), the safer it gets. Almost any dictionary type in a programming language, or any database engine, has a binary data storage mechanism. Failing that, you could store the hash as a hex string in a char column.
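A minimal sketch of the "store the hash, look up the location" idea, assuming the data already lives in ordinary files. The names `file_digest`, `register` and `index` are hypothetical, and SHA-256 is just one possible choice of hash. The file is read in fixed-size chunks so a multi-gigabyte value never has to fit in memory.

```python
import hashlib
import os

def file_digest(path, chunk_size=1024 * 1024):
    """Return the SHA-256 hex digest of a file, read in fixed-size chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel keeps calling f.read() until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

# Hypothetical lookup table: hex digest -> absolute path of the data on disk.
# A real system would keep this in a database rather than in a dict.
index = {}

def register(path):
    """Hash the file at `path` and record where its data lives."""
    digest = file_digest(path)
    index[digest] = os.path.abspath(path)
    return digest
```

Only the short digest and the path go into the index, so the index stays small no matter how large the files are.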

We are now using MongoDB, as it supports large binary values, is very popular and has a large user base. Maybe we are going to switch to another store, but currently it looks very good!

Related

Best Way to Relate a File to a Database

For example, let's say I want to set up a SQLite database that contains some data on invoices. For simplicity, let's say each invoice has a date, an invoice number and a company associated with it. Is there a good way for the database to access or store a PDF file (~300-700 KB per file) of the specified invoice? If this wouldn't work, are there any alternative ideas that might work well?
Any help is greatly appreciated.
You could store the data (each file) as a BLOB which is a byte array/stream so the file could basically be stored as it is within a BLOB.
However, it may be more efficient (see linked article) to just store the path to the file, or perhaps just the file name (depending upon standards) and then use that to retrieve and view the invoice.
Up to around 100 KB it can be more efficient to store files as BLOBs. You may find this document useful: "SQLite 35% Faster Than The Filesystem".
SQLite does support a BLOB data type, which stores data exactly as it is entered. From the documentation:
The current implementation will only support a string or BLOB length up to 2^31-1 or 2147483647
This limit is much larger than your expected need of 300-700 KB per file, so what you want should be possible. The other thing to consider is the size of your database. Unless you expect to have well north of around 100 TB, then the database size limit also should not pose a problem.
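As a rough sketch of the BLOB approach, using Python's standard `sqlite3` module: the table layout and the helper names `save_invoice` and `load_pdf` are assumptions made for illustration, not something taken from the question. An in-memory database is used so the example is self-contained; a real application would pass a file path to `sqlite3.connect`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for a persistent database
conn.execute(
    """
    CREATE TABLE invoice (
        invoice_number TEXT PRIMARY KEY,
        invoice_date   TEXT NOT NULL,
        company        TEXT NOT NULL,
        pdf            BLOB
    )
    """
)

def save_invoice(conn, number, date, company, pdf_bytes):
    """Insert one invoice; a Python bytes value is stored as a BLOB."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO invoice VALUES (?, ?, ?, ?)",
            (number, date, company, pdf_bytes),
        )

def load_pdf(conn, number):
    """Return the stored PDF bytes, or None if the invoice does not exist."""
    row = conn.execute(
        "SELECT pdf FROM invoice WHERE invoice_number = ?", (number,)
    ).fetchone()
    return None if row is None else row[0]
```

To follow the store-the-path suggestion instead, the `pdf` column would become a TEXT column holding the file name, and the application would open that file itself.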

How to store write-once, read-rarely data

I have an application which produces a large amount of data, that is all written once and then unchangeable (by law), and is rarely ever read. When it is read, it is always read in its entirety, as in, all the data for 2012 is read in one shot, and either processed for reporting or output in a different format for export (or gasp printed). The only way to access the data is to access an entire day's worth of data, or more than one day.
This data is easily represented as either two or three relational tables, or as a long list of self-contained documents.
What is the most storage-space-efficient way to store such data in a file system? Specifically, we're thinking of using Amazon S3 (File storage) for storage, though we could use something like RDS (their version of MySQL).
My current best bet is a gzipped file with JSON data for the entire day, one file per day.
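The one-gzipped-JSON-file-per-day idea could look roughly like this. The helper names `write_day` and `read_day` and the `<day>.json.gz` naming scheme are assumptions for illustration; the sketch writes to a local directory, and uploading the resulting file to S3 is left out.

```python
import gzip
import json
import os

def write_day(directory, day, records):
    """Write all records for one day to <directory>/<day>.json.gz."""
    path = os.path.join(directory, f"{day}.json.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        # Compact separators drop the whitespace json.dump adds by default.
        json.dump(records, f, separators=(",", ":"))
    return path

def read_day(path):
    """Read a whole day's records back in one shot."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return json.load(f)
```

Because the data is always read a whole day at a time, nothing is lost by not being able to seek inside the compressed file.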
Unless my data were pure ASCII (and even if it were), I would probably choose a binary storage format such as one of:
BSON
Protocol Buffers
Bencode
I would use Windows Azure's Table Storage because it allows heterogeneous structured data to be stored in a single table. Having database-like storage will allow you to append data as needed. You can easily create a new table for each year.

How to save a giant hash table to disk and load it back?

I am trying to write a search-engine for a large collection, for learning purposes. I started with my own intuitions. Then I researched and am finally arriving at a working model.
I am constructing a giant hash table to hold all the terms in my collection. It is very expensive to construct this from the collection. Once I have computed the table, I want to save it to disk so that whenever I want to access it in my program later, I can load it again from disk.
Is there any standard way of doing it or do I have to invent my own file-format and hacks to do this?
Note: The hash table is only for storing all term occurrences. I am planning to store the main ranking data in a postings file and keep a pointer to it in the corresponding term's entry in the hash table.
I am working in C.
BDB (Berkeley DB) is a library for efficiently managing flat-file databases. In particular, a hash table format is supported. B-Trees are also available in case ordered access is required.
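The asker works in C, where Berkeley DB would be called through its own C API. As a language-neutral illustration of the same idea (a key-value table that lives on disk and is reopened instead of rebuilt), here is a sketch using Python's standard `dbm.dumb` module. This is a stand-in, not Berkeley DB itself, and the helper names and the choice of an 8-byte offset into a postings file are assumptions.

```python
import dbm.dumb
import struct

def save_terms(path, term_offsets):
    """Persist a {term: postings-file offset} mapping to an on-disk table."""
    with dbm.dumb.open(path, "n") as db:  # "n": always create a new, empty table
        for term, offset in term_offsets.items():
            # Keys and values are bytes; the offset is packed as an
            # unsigned 64-bit little-endian integer.
            db[term.encode("utf-8")] = struct.pack("<Q", offset)

def lookup(path, term):
    """Return the stored offset for `term`, or None if the term is unknown."""
    with dbm.dumb.open(path, "r") as db:  # "r": open read-only
        raw = db.get(term.encode("utf-8"))
    return None if raw is None else struct.unpack("<Q", raw)[0]
```

The expensive construction step runs once in `save_terms`; later runs only call `lookup`, which reads individual entries from disk without loading the whole table.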

How to store videos in a PostgreSQL database?

I am storing image files (like jpg, png) in a PostgreSQL database. I found information on how to do that here.
Likewise, I want to store videos in a PostgreSQL database. I searched the net - some say one should use a data type such as bytea to store binary data.
Can you tell me how to use a bytea column to store videos?
I would generally not recommend storing huge blobs (binary large objects) inside PostgreSQL unless referential integrity is your paramount requirement. Storing huge files in the filesystem is much more efficient:
Much faster, less disk space used, easier backups.
I have written a more comprehensive assessment of the options you've got in a previous answer to a similar question. (With deep links to the manual.)
We ran some tests on the practical limits of the bytea data type. The theoretical limit is 1 GB, but the practical limit is about 20 MB. Processing larger bytea values eats too much RAM, and encoding and decoding take some time too. Personally, I don't think storing videos in the database is a good idea, but if you need to, use large objects (blobs).
Without knowing what programming language you are using, I can only give a general approach:
Create a table with a column of type 'bytea'.
Get the contents of the video file into a variable.
Insert a row into that table with that variable as the data for the bytea column.

Storing binary data in Cassandra, like a MySQL BLOB

Can we store binary data in Apache Cassandra?
I'm thinking about storing images in Apache Cassandra.
"Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values. "
From: CassandraLimitations
It depends on the size. Cassandra is not suitable for large binary objects: it can store up to 2 GB per column, split into 1 MB pieces. You can store the files in the filesystem (or on a CDN for the web) and store the links, perhaps along with previews, in Cassandra, or you can take a look at MongoDB with GridFS.
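The manual chunking workaround described in the quote can be sketched without any database at all. In this illustration a plain dict stands in for one Cassandra row, with the chunk index as the column name and the chunk bytes as the column value; `iter_chunks`, `store_file` and `load_file` are hypothetical names, and the actual driver calls are deliberately left out.

```python
import io

def iter_chunks(fileobj, chunk_size=1024 * 1024):
    """Yield (index, chunk) pairs, reading at most chunk_size bytes at a time."""
    index = 0
    while True:
        chunk = fileobj.read(chunk_size)
        if not chunk:
            break
        yield index, chunk
        index += 1

def store_file(fileobj, chunk_size=1024 * 1024):
    """Build a {chunk index: bytes} mapping standing in for one row."""
    return dict(iter_chunks(fileobj, chunk_size))

def load_file(row):
    """Reassemble the original bytes by joining the chunks in index order."""
    return b"".join(row[i] for i in sorted(row))

# Example with a small in-memory "file" and a tiny chunk size.
row = store_file(io.BytesIO(b"abcdefghij"), chunk_size=4)
restored = load_file(row)
```

With a real store, each `(index, chunk)` pair would be written as it is produced, so only one chunk needs to be in memory at a time.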
