I am fairly new to Cassandra. I was trying to chunk files for storage in Cassandra and came across the term CFS, but I was unable to find any implementation of it, and in some places I even saw it described as deprecated.
The files are expected to be at most around 100 MB.
Could anyone help with this?
CFS was an attempt to store binary data in Cassandra, and it failed miserably. DSEFS was a better approach, but it stored only metadata in Cassandra; the actual data was stored as binary blocks on disk (similar to HDFS).
But you should reconsider your decision. Cassandra isn't optimized for storing big binary blobs, and although you can chunk them into smaller blocks, you'll run into all sorts of problems with repairs, bootstrapping of new nodes, etc.
A better approach is to store only metadata in Cassandra (including the location of the data), and to store the actual files on something like AWS S3 or Azure Blob Storage, or, if you're on-premises, something like MinIO.
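To make the split concrete, here is a minimal sketch of the "metadata in the database, bytes in an object store" pattern. Everything in it is an assumption for illustration: `FileMetadata` is a hypothetical row shape rather than a real Cassandra schema, and `InMemoryObjectStore` is a stand-in for S3, Azure Blob Storage, or MinIO so that the example runs without any external service.

```python
import hashlib
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class FileMetadata:
    """The small row that would live in Cassandra (hypothetical schema)."""
    file_id: str
    name: str
    size_bytes: int
    sha256: str
    location: str  # where the bytes actually live, e.g. "s3://bucket/key"


class InMemoryObjectStore:
    """Stand-in for S3 / Azure Blob Storage / MinIO, for illustration only."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self._objects = {}

    def put(self, key: str, data: bytes) -> str:
        # Store the bytes and return a location string for the metadata row.
        self._objects[key] = data
        return f"s3://{self.bucket}/{key}"

    def get(self, location: str) -> bytes:
        prefix = f"s3://{self.bucket}/"
        if not location.startswith(prefix):
            raise KeyError(location)
        return self._objects[location[len(prefix):]]


def store_file(store, metadata_table: dict, name: str, data: bytes) -> FileMetadata:
    """Write the bytes to the object store, then record only metadata."""
    file_id = str(uuid.uuid4())
    location = store.put(file_id, data)
    meta = FileMetadata(
        file_id=file_id,
        name=name,
        size_bytes=len(data),
        sha256=hashlib.sha256(data).hexdigest(),
        location=location,
    )
    # A plain dict plays the role of the metadata table here.
    metadata_table[file_id] = meta
    return meta
```

The database row stays a few hundred bytes no matter how large the file is, so repairs and bootstrapping only ever move metadata.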
I need to create a storage system for files that will mostly be under 16MB but I want the benefits of GridFS like versioning, custom metadata, easy backup (with mongodump), etc. I'd say maybe 10% of my files would be over 16MB so I can't rely on storing in single documents, and I don't want to recreate the API for the benefits I'm looking for. I'm also already using a mongoDB system.
Should I use GridFS?
Without further details, I'd start by suggesting you read the recommendations provided here.
Given that not all of your files will fit within the maximum document size when stored as BinData in a BSON document, I'd recommend using GridFS for a consistent programming and data management experience (for developers and IT). Depending on how the files are consumed, you may also be able to stream file contents to clients more efficiently when they are stored in GridFS, by reading and writing in chunks.
It is recommended to store files smaller than 16 MB within a single document, using the BinData data type for binary data. So what is the problem with storing files under 16 MB within a single document and using GridFS only for files that exceed 16 MB?
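The mixed approach suggested here comes down to a size check at write time. A rough sketch follows; `choose_storage` and the headroom value are hypothetical, and the only hard number is MongoDB's 16 MiB BSON document limit. The headroom is an assumption that the document also carries a filename and custom metadata, so the binary payload cannot use the full 16 MiB.

```python
# MongoDB's maximum BSON document size is 16 MiB.
BSON_MAX_DOCUMENT_BYTES = 16 * 1024 * 1024

# Assumption: space reserved for the other fields in the same document
# (filename, version, custom metadata). Tune this for your own schema.
METADATA_HEADROOM_BYTES = 64 * 1024


def choose_storage(file_size_bytes: int) -> str:
    """Return "bindata" for files that fit in one document, else "gridfs"."""
    if file_size_bytes < 0:
        raise ValueError("file size cannot be negative")
    if file_size_bytes <= BSON_MAX_DOCUMENT_BYTES - METADATA_HEADROOM_BYTES:
        return "bindata"
    return "gridfs"
```

The cost of this design is two read and write code paths, which is the inconsistency the previous answer avoids by using GridFS for everything.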
We've been discussing the design of a data warehouse strategy within our group to meet testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than re-implementing much of the same functionality on a file system. I don't know whether a NoSQL approach is even the best fit for what we're trying to accomplish, but perhaps if I describe what we need and want, you all can help.
Most of our files are large, 50+ GB in size, held in a proprietary, third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value style lookup.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this.) That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server. We'd like complete replication.
We also have other smaller files that have a tree type relationship with the Big files. One file's content will point to the next and so on, and so on. It's not a "spoked wheel," it's a full blown tree.
We'd prefer a Python, C, or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS?
You can query files by the default metadata, plus your own additional metadata. Files are broken out into small chunks and you can specify which portions you want. Also, files are stored in a collection (similar to a RDBMS table) and you get Mongo's replication features to boot.
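Because the file is split into fixed-size chunks, reading a portion means fetching only the chunks that overlap the requested byte range. The arithmetic can be sketched as below. `chunks_for_range` is a hypothetical helper and not part of any driver; the 255 KiB figure is the GridFS default chunk size, and the returned indices correspond to the `n` field of documents in the `fs.chunks` collection.

```python
DEFAULT_CHUNK_SIZE = 255 * 1024  # GridFS default chunk size (255 KiB)


def chunks_for_range(offset: int, length: int, file_size: int,
                     chunk_size: int = DEFAULT_CHUNK_SIZE) -> range:
    """Return the chunk indices covering bytes [offset, offset + length).

    The range is clipped to the file size; an empty range is returned
    when the request starts at or beyond the end of the file.
    """
    if offset < 0 or length < 0 or chunk_size <= 0:
        raise ValueError("offset, length and chunk_size must be non-negative/positive")
    end = min(offset + length, file_size)
    if offset >= end:
        return range(0)
    first = offset // chunk_size
    last = (end - 1) // chunk_size  # index of the chunk holding the final byte
    return range(first, last + 1)
```

For a 50+ GB file this means a small read touches only a handful of chunks instead of the whole file, which matches the requirement of not loading the file into memory.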
What's wrong with a proven cluster file system? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and similar databases would have as a file system backend.
Performance could obviously be one. What about space usage? Consistency?
I've read some posts on this subject, but I still don't understand what the best solution is in my case.
I'm starting to write a new web app, and the backend is going to serve about 1-10 million images (average size 200-500 kB per image).
My site will provide content and images to 100-1000 users at the same time.
I'd also like to keep provider costs as low as possible (but this is a secondary requirement).
I'm thinking that file system space is less expensive than database storage.
Personally I like the idea of having all my images in the DB but any suggestion will be really appreciated :)
Do you think that in my case the DB approach is the right choice?
Putting all of those images in your database will make it very, very large. This means your DB engine will be busy caching all those images (a task it's not really designed for) when it could be caching hot application data instead.
Leave the file caching up to the OS and/or your reverse proxy - they'll be better at it.
Some other reasons to store images on the file system:
Image servers can run even when the database is busy or down.
File systems are made to store files and are quite efficient at it.
Dumping data in your database means slower backups and other operations.
No server-side code is needed to serve an image, just plain old IIS/Apache.
You can scale up faster with dirt-cheap web servers, or potentially to a CDN.
You can perform related work (generating thumbnails, etc.) without involving the database.
Your database server can keep more of the "real" table data in memory, which is where you get your database speed for queries. If it uses its precious memory to keep image files cached, that buys you hardly anything speed-wise compared with having more of the photo index in memory.
Most large sites use the filesystem.
See Store pictures as files or in the database for a web app?
When dealing with binary objects, follow a document-centric approach to the architecture and do not store documents such as PDFs and images in the database; you will eventually have to refactor them out when you start seeing all kinds of performance issues with your database. Just store the file on the file system and keep its path in a table in your database. There is also a physical limit on the size of the data type you would use to serialize and save the file in the database. Just store it on the file system and access it.
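As a rough illustration of "file on the file system, path in a table", the sketch below uses SQLite purely as a stand-in for whatever database you run. The `images` table, its columns, and the helper names are all hypothetical.

```python
import sqlite3
import tempfile
import uuid
from pathlib import Path


def init_db(conn: sqlite3.Connection) -> None:
    # The table holds only a name and a path, never the image bytes.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS images ("
        "id INTEGER PRIMARY KEY, name TEXT NOT NULL, path TEXT NOT NULL UNIQUE)"
    )


def save_image(conn: sqlite3.Connection, root: Path, name: str, data: bytes) -> int:
    """Write the bytes to disk, then record the path in the database."""
    # A random file name avoids collisions between uploads with the same name.
    path = root / f"{uuid.uuid4().hex}{Path(name).suffix}"
    path.write_bytes(data)
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT INTO images (name, path) VALUES (?, ?)", (name, str(path))
        )
    return cur.lastrowid


def load_image(conn: sqlite3.Connection, image_id: int) -> bytes:
    """Look up the path in the database and read the file from disk."""
    row = conn.execute(
        "SELECT path FROM images WHERE id = ?", (image_id,)
    ).fetchone()
    if row is None:
        raise KeyError(image_id)
    return Path(row[0]).read_bytes()


def demo() -> bytes:
    """Round-trip one small payload through a temporary directory."""
    with tempfile.TemporaryDirectory() as tmp:
        conn = sqlite3.connect(":memory:")
        try:
            init_db(conn)
            image_id = save_image(conn, Path(tmp), "cat.png", b"\x89PNG-demo")
            return load_image(conn, image_id)
        finally:
            conn.close()
```

In a real deployment the web server or CDN would serve the file directly from the stored path, and `load_image` would only be needed for server-side processing such as thumbnail generation.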
Your first sentence says that you've read some posts on the subject, so I won't bother putting in links to articles that cover this. In my experience, and based on what you've posted as far as the number of images and sizes of the images, you're going to pay dearly in DB performance if you store them in the DB. I'd store them on the file system.
What database are you using? MS SQL Server 2008 provides FILESTREAM storage, which
allows storage of and efficient access to BLOB data using a combination of SQL Server 2008 and the NTFS file system. Microsoft's FILESTREAM white paper covers choices for BLOB storage, configuring Windows and SQL Server for using FILESTREAM data, considerations for combining FILESTREAM with other features, and implementation details such as partitioning and performance.
We use FileNet, a server optimized for imaging. It's very expensive. A cheaper solution is to use a file server.
Please don't consider storing large files on a database server.
As others have mentioned, store references to the large files in the database.
Recently my colleagues and I have been discussing how to build a huge storage system that can store billions of pictures and let them be searched and downloaded quickly.
Something like Flickr, but not an online gallery, which means most of these pictures will never be downloaded.
My colleagues suggest that we save all these files directly in the database. I really feel that it's not a good idea, and I think a database is not designed for storing a huge number of binary files. But I don't have a very strong reason why it's not a good idea.
What do you think about it?
If you are really talking about billions of images, I would store them in the file system, because retrieval will be faster than serializing and deserializing the images.
The answers above appear to assume the database is an RDBMS. If your database is a document-oriented database with support for binary documents of the size you expect, then it may be perfectly wise to store them in the database.
It's not a good idea. The point of a database is that you can quickly resolve complex queries to retrieve textual data. While binary data can be stored in a database, it can slow transactions. This is especially true when the database is on a separate server from the running application. In the database, store meta-data and the location/filename of the images. Images themselves should be on static server(s).
What's the best way to store large JSON files in a database? I know about CouchDB, but I'm pretty sure that won't support files of the size I'll be using.
I'm reluctant to just read them off of disk, because of the time required to read and then update them. The file is an array of ~30,000 elements, so I think storing each element separately in a traditional database would kill me when I try to select them all.
I have lots of documents in CouchDB that exceed 2megs and it handles them fine. Those limits are outdated.
The only caveat is that the default JavaScript view server has a pretty slow JSON parser, so view generation can take a while with large documents. You can use my Python view server with a C-based JSON library (jsonlib2, simplejson, yajl), or use the built-in Erlang views, which don't even hit JSON serialization, and view generation will be plenty fast.
If you intend to access specific elements one (or several) at a time, there's no way around breaking the big JSON into traditional DB rows and columns.
If you'd like to access it in one shot, you can convert it to XML and store that in the DB (maybe even compressed - XMLs are highly compressible). Most DB engines support storing an XML object. You can then read it in one shot, and if needed, translate back to JSON, using forward-read approaches like SAX, or any other efficient XML-reading technology.
But as @therefromhere commented, you could always save it as one big string (again, I would check whether compressing it helps).
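Checking whether compression helps is quick to do. The sketch below serializes the whole array to one compact JSON string and compresses it with zlib, giving a single blob that could be stored in a binary or text column. `pack_json` and `unpack_json` are hypothetical helper names, and the compression level is an arbitrary choice.

```python
import json
import zlib


def pack_json(obj) -> bytes:
    """Serialize to compact JSON and compress into a single blob."""
    raw = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw, 6)  # 6 is zlib's usual speed/size trade-off


def unpack_json(blob: bytes):
    """Reverse of pack_json: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))
```

Comparing `len(pack_json(data))` with the length of the uncompressed JSON on your real data shows whether the saving is worth the extra CPU on every read and write. The trade-off stays the same as for any single-blob storage: updating one element means rewriting the whole blob.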
You don't really have a variety of choices here: you can cache the documents in RAM using something like memcached, or push them to disk, reading and writing them with a database (an RDBMS like PostgreSQL/MySQL or a document-oriented database like CouchDB). The only real alternative is a hybrid system that caches the most frequently accessed documents in memcached for reading, which is how a lot of sites operate.
2+ MB isn't a massive deal to a database, and provided you have plenty of RAM, it will do an intelligent enough job of caching and using your RAM effectively. Do you know how frequently these documents are accessed and how many users you have to serve?
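The hybrid setup described above is usually a read-through cache in front of the database. A minimal sketch follows, with a bounded in-process dictionary standing in for memcached and a caller-supplied function standing in for the database read. `ReadThroughCache` and its parameters are hypothetical and not the API of any memcached client.

```python
from collections import OrderedDict
from typing import Callable


class ReadThroughCache:
    """Serve hot documents from memory and fall back to the database on a miss."""

    def __init__(self, load_from_db: Callable[[str], bytes], max_items: int = 1000):
        self._load = load_from_db
        self._max = max_items
        self._items = OrderedDict()  # oldest entry first, most recent last

    def get(self, key: str) -> bytes:
        if key in self._items:
            # Cache hit: mark the entry as most recently used.
            self._items.move_to_end(key)
            return self._items[key]
        # Cache miss: read from the database and remember the result.
        value = self._load(key)
        self._items[key] = value
        if len(self._items) > self._max:
            self._items.popitem(last=False)  # evict the least recently used
        return value

    def invalidate(self, key: str) -> None:
        """Call after updating a document so the next read goes to the database."""
        self._items.pop(key, None)
```

With a real memcached deployment the eviction is handled by the cache server itself; the part that carries over is the order of operations: check the cache, fall back to the database, and invalidate on write.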