Can we store binary data in Apache Cassandra?
I'm thinking about storing images in Apache Cassandra.
"Cassandra's public API is based on Thrift, which offers no streaming abilities -- any value written or fetched has to fit in memory. This is inherent to Thrift's design and is therefore unlikely to change. So adding large object support to Cassandra would need a special API that manually split the large objects up into pieces. A potential approach is described in http://issues.apache.org/jira/browse/CASSANDRA-265. As a workaround in the meantime, you can manually split files into chunks of whatever size you are comfortable with -- at least one person is using 64MB -- and making a file correspond to a row, with the chunks as column values. "
From: CassandraLimitations
It depends on the size. Cassandra is not suitable for large binary objects; it can store up to 2 GB per column, split into chunks of about 1 MB. You can store the files in the filesystem (or a CDN for the web) and keep the links, and maybe previews, in Cassandra, or you can take a look at MongoDB + GridFS.
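To make the chunking workaround concrete, here is a minimal Python sketch using the DataStax cassandra-driver. The keyspace, table name, and 1 MB chunk size are assumptions for illustration, not anything Cassandra prescribes.

```python
# Minimal sketch: store a file in Cassandra as 1 MB chunks, one partition per
# file, one clustering row per chunk. Keyspace/table names are assumptions.
import uuid
from cassandra.cluster import Cluster

CHUNK_SIZE = 1024 * 1024  # 1 MB per chunk

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("media")  # assumes a keyspace named "media" exists

session.execute("""
    CREATE TABLE IF NOT EXISTS file_chunks (
        file_id uuid,
        chunk_index int,
        data blob,
        PRIMARY KEY (file_id, chunk_index)
    )
""")

insert = session.prepare(
    "INSERT INTO file_chunks (file_id, chunk_index, data) VALUES (?, ?, ?)"
)

def store_file(path):
    """Split a file into chunks and write each chunk as its own row."""
    file_id = uuid.uuid4()
    with open(path, "rb") as f:
        index = 0
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            session.execute(insert, (file_id, index, chunk))
            index += 1
    return file_id

def load_file(file_id):
    """Reassemble the file by reading the chunks back in clustering order."""
    rows = session.execute(
        "SELECT data FROM file_chunks WHERE file_id = %s", (file_id,)
    )
    return b"".join(row.data for row in rows)
```

Keeping chunks around 1 MB keeps each individual read and write small, which is the whole point of the workaround.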
Related
According to GitHub, SeaweedFS is intended to be a simple and highly scalable distributed file system which enables you to store and fetch billions of files, fast. However, I don't understand the point of SeaweedFS Filer since it requires an external data store on top of SeaweedFS:
On top of the object store, optional Filer can support directories and
POSIX attributes. Filer is a separate linearly-scalable stateless
server with customizable metadata stores, e.g., MySql, Postgres,
Redis, Cassandra, HBase, Mongodb, Elastic Search, LevelDB, RocksDB,
Sqlite, MemSql, TiDB, Etcd, CockroachDB, etc.
For the Filer to work, it first needs to "lookup metadata from Filer Store, which can be Cassandra/Mysql/Postgres/Redis/LevelDB/etcd/Sqlite" and then read the data from volume servers.
Since the SeaweedFS Filer needs to retrieve the file metadata from another data store (such as Cassandra, ScyllaDB or HBase) before it can retrieve the actual file, why not use the same data store to store the actual file? What is gained by storing the file metadata in one data store and storing the actual file in SeaweedFS?
GlusterFS, for example, stores metadata as xattrs in the underlying file system so there is no need for external data stores.
Doesn't requiring an external data store defeat the whole purpose of using SeaweedFS, since it requires two hops (round trips) instead of one? We now need to 1) get the file metadata from the external store and 2) get the actual file. If we had stored the actual file in the external data store, we could get it in one step instead of two.
The metadata includes per-file metadata and also the directory structure.
The former is similar to xattrs as you mentioned.
The latter is more like a graph database, which can be implemented on a key-value store or SQL store.
For a key-value store or SQL store, saving large chunks of file content is not efficient: each key may be read and rewritten many times in order to maintain the data ordering needed for efficient lookups. This kind of write amplification is not good, especially when file sizes are in the GB/TB/PB range.
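To illustrate the split the question asks about, here is a rough conceptual sketch of the two-hop read path: a small metadata lookup in a key-value/SQL store, followed by a bulk read from the object store. The schema, URLs, and the SQLite stand-in are made up for illustration; a real deployment would go through the SeaweedFS Filer and volume-server HTTP APIs.

```python
# Conceptual sketch of the two-hop read path:
# 1) look up file metadata (here: a content URL) in a small metadata store,
# 2) fetch the actual bytes from the volume/object server over HTTP.
import sqlite3
import urllib.request

# Stand-in for Cassandra/MySQL/Redis/...: path -> location of the content.
meta = sqlite3.connect("filer_meta.db")
meta.execute(
    "CREATE TABLE IF NOT EXISTS entries (path TEXT PRIMARY KEY, content_url TEXT)"
)

def read_file(path):
    # Hop 1: metadata lookup (small values -> fine for a KV/SQL store).
    row = meta.execute(
        "SELECT content_url FROM entries WHERE path = ?", (path,)
    ).fetchone()
    if row is None:
        raise FileNotFoundError(path)
    # Hop 2: bulk data read from the object/volume store (large values).
    with urllib.request.urlopen(row[0]) as resp:
        return resp.read()
```

The point of the split is that the store handling many small, frequently rewritten keys never has to absorb the gigabyte-sized values.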
So, for example, let's say I wanted to set up a SQLite database that contains some data on invoices. Let's say each invoice has a date, invoice number, and company associated with it, for simplicity. Is there a good way for the database to store or access a PDF file (~300-700 KB per file) of the specified invoice? If this wouldn't work, any alternative ideas on what might work well?
Any help is greatly appreciated
You could store the data (each file) as a BLOB, which is a byte array/stream, so the file can basically be stored as-is within a BLOB.
However, it may be more efficient (see the linked article) to just store the path to the file, or perhaps just the file name (depending upon your standards), and then use that to retrieve and view the invoice.
Up to around 100 KB, it can be more efficient to store files as BLOBs. You may find this document useful: SQLite 35% Faster Than The Filesystem.
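For example, here is a minimal Python sketch with the standard sqlite3 module; the table layout is just an assumption based on the date/number/company fields you described.

```python
# Minimal sketch: store invoice PDFs as BLOBs in SQLite.
# Table layout and file paths are assumptions for illustration.
import sqlite3

conn = sqlite3.connect("invoices.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS invoices (
        invoice_number TEXT PRIMARY KEY,
        invoice_date   TEXT,
        company        TEXT,
        pdf            BLOB
    )
""")

def add_invoice(number, date, company, pdf_path):
    with open(pdf_path, "rb") as f:
        pdf_bytes = f.read()          # ~300-700 KB fits comfortably in a BLOB
    conn.execute(
        "INSERT INTO invoices VALUES (?, ?, ?, ?)",
        (number, date, company, pdf_bytes),
    )
    conn.commit()

def get_invoice_pdf(number):
    row = conn.execute(
        "SELECT pdf FROM invoices WHERE invoice_number = ?", (number,)
    ).fetchone()
    return row[0] if row else None
```

At ~300-700 KB per PDF you are above the ~100 KB sweet spot mentioned above, so it's worth benchmarking the BLOB approach against storing paths for your workload.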
SQLite does support a BLOB data type, which stores data exactly as it is entered. From the documentation:
The current implementation will only support a string or BLOB length up to 2^31-1 or 2147483647
This limit is much larger than your expected need of 300-700 KB per file, so what you want should be possible. The other thing to consider is the size of your database. Unless you expect to have well north of around 100 TB, then the database size limit also should not pose a problem.
I am storing image files (like jpg, png) in a PostgreSQL database. I found information on how to do that here.
Likewise, I want to store videos in a PostgreSQL database. I searched the net - some say one should use a data type such as bytea to store binary data.
Can you tell me how to use a bytea column to store videos?
I would generally not recommend storing huge blobs (binary large objects) inside PostgreSQL unless referential integrity is your paramount requirement. Storing huge files in the filesystem is much more efficient: much faster, less disk space used, easier backups.
I have written a more comprehensive assessment of the options you've got in a previous answer to a similar question. (With deep links to the manual.)
We did some tests on the practical limits of the bytea data type. The theoretical limit is 1 GB, but the practical limit is about 20 MB: processing larger bytea values eats too much RAM, and encoding and decoding take time too. Personally, I don't think storing videos in the database is a good idea, but if you need to, then use large objects (blobs).
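If you do go the large-object route, a minimal psycopg2 sketch might look like the following; the connection string and the idea of keeping the returned OID in your own table are assumptions for illustration.

```python
# Sketch of the large-object route in PostgreSQL with psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=media user=app")  # assumed DSN

def store_video(path):
    """Write the file into a PostgreSQL large object and return its OID."""
    lobj = conn.lobject(0, "wb")       # 0 -> let the server pick an OID
    with open(path, "rb") as f:
        lobj.write(f.read())           # for very large files, write in chunks
    oid = lobj.oid
    lobj.close()
    conn.commit()
    return oid                         # keep this OID in your own table

def load_video(oid):
    lobj = conn.lobject(oid, "rb")
    data = lobj.read()
    lobj.close()
    return data
```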
Without knowing what programming language you are using, I can only give a general approach:
Create a table with a column of type 'bytea'.
Get the contents of the video file into a variable.
Insert a row into that table with that variable as the data for the bytea column.
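For instance, with Python and psycopg2 those three steps might look like this; the table, column, and file names are assumptions.

```python
# Minimal sketch of the three steps above, using psycopg2.
import psycopg2

conn = psycopg2.connect("dbname=media user=app")  # assumed DSN
conn.cursor().execute(
    "CREATE TABLE IF NOT EXISTS videos (id serial PRIMARY KEY, data bytea)"
)
conn.commit()

# 2) get the contents of the video file into a variable
with open("clip.mp4", "rb") as f:
    video_bytes = f.read()

# 3) insert a row with that variable as the data for the bytea column
cur = conn.cursor()
cur.execute(
    "INSERT INTO videos (data) VALUES (%s) RETURNING id",
    (psycopg2.Binary(video_bytes),),
)
video_id = cur.fetchone()[0]
conn.commit()
```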
I am searching for a key-value store that can handle values several gigabytes in size. I have had a look at Riak, Redis, CouchDB, MongoDB.
I want to store a user's workspace (equivalent to a directory in a filesystem, recursively with subdirectories and files in it) in this DB. Of course I could use the file system, but then I don't have features such as caching in RAM, failover, backup, and replication/clustering that are supported by Redis, for instance.
This implies that most of the values saved will be binary data, possibly several gigabytes big, as one file in a workspace is mapped to one key-value tuple.
Has anyone some experiences with any of these products?
First off, computing an MD5 or CRC32 over gigabytes of data is going to be painfully expensive. Probably better to avoid that. How about storing the data in a file and indexing the filename?
If you insist, though, my suggestion is still to just store the hash, not the entire data value, with a lookup array/table pointing to the final data location. The safety of this approach (the possibility of non-unique hashes) varies directly with the number of large samples; the longer the hash you create -- 32-bit vs 64-bit vs 1024-bit, etc. -- the safer it gets. Most any dictionary system in a programming language, or a database engine, will have a binary data storage mechanism. Failing that, you could store a string of the hex value corresponding to the hashed number in a char column.
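A small Python sketch of that "store the hash plus a lookup table to the data location" idea, with an in-memory dict standing in for whatever store (Redis, a SQL table, ...) you actually use; paths and names are placeholders.

```python
# Sketch: the big value lives on disk, the key-value store keeps only a hash
# and the path to it.
import hashlib
import os

DATA_DIR = "blob_store"           # assumed location for the big files
index = {}                        # key -> {"sha256": ..., "path": ...}

def put(key, data: bytes):
    digest = hashlib.sha256(data).hexdigest()
    path = os.path.join(DATA_DIR, digest)
    os.makedirs(DATA_DIR, exist_ok=True)
    if not os.path.exists(path):  # identical content is stored only once
        with open(path, "wb") as f:
            f.write(data)
    index[key] = {"sha256": digest, "path": path}

def get(key) -> bytes:
    entry = index[key]
    with open(entry["path"], "rb") as f:
        return f.read()
```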
We are now using MongoDB, as it supports large binary values, is very popular and has a large user base. Maybe we are going to switch to another store, but currently it looks very good!
We have an 18 GB table in MySQL with a column "html_view" that stores HTML source data, which we display on the page, but now it's taking too much time to fetch the HTML data from the "html_view" column, which makes the page load slowly.
We want an approach that simplifies our existing structure so we can load the HTML data faster, from the DB or any other way.
One idea we are considering is to store the HTML data in .txt files, keep just the path of the .txt file in the DB, and fetch the data by reading that particular file. But we fear this will cause extensive read/write operations on our server and may slow the server down.
Is there any better approach for making this faster?
First of all, why store HTML in the database at all? Why not render it on demand?
For big text tables, you could store compressed text in a byte array, or compressed and encoded in base64 as plain text.
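A minimal Python sketch of that compress-before-storing idea, using zlib and optionally base64 for a plain-text column; the function names are illustrative.

```python
# Compress HTML before storing it; HTML usually compresses well because of
# its repetitive markup.
import base64
import zlib

def pack(html: str) -> bytes:
    """Compress HTML for a binary (BLOB/VARBINARY) column."""
    return zlib.compress(html.encode("utf-8"), 6)

def pack_text(html: str) -> str:
    """Compress and base64-encode HTML for a plain-text column."""
    return base64.b64encode(pack(html)).decode("ascii")

def unpack(blob: bytes) -> str:
    return zlib.decompress(blob).decode("utf-8")

def unpack_text(encoded: str) -> str:
    return unpack(base64.b64decode(encoded))
```

MySQL also has built-in COMPRESS()/UNCOMPRESS() functions if you prefer to do this on the server side.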
When you have a table with a large text column, how many other columns does the table have? If it's not too many, you could partition the table and create a two-column key-value store. That should be faster and simpler than reading files from disk.
Have a look at the Apache Caching guide.
It explains disk and memory caching. From my point of view, if the content is static (as the database table indicates), you should use Apache's capabilities instead of writing your own slower mechanisms, because you would be adding multiple layers on top.
The usual "measure instead of estimating" advice still applies, though ;-).