I need to set up a data store that can hold petabytes of files (the files are mostly small JSON, image, and CSV files, but some can be ~100MB binaries).
I am looking into distributed data storage that is masterless and has no single point of failure.
I found Riak and GlusterFS.
I want to ask whether any of you have used both of them before.
I know that their interfaces (DB/Map) are very different.
But it seems to me that they both use hashing and similar distribution techniques.
Will they have similar performance, consistency and availability?
We are running a 17-node (24GB RAM, 2TB disk) Riak cluster with a Bitcask backend, storing around 1 billion 3KB objects. This setup is performant but very resource-intensive. We are considering moving away from Riak to GlusterFS, as performance is not that important for us. Perhaps using LevelDB as a backend would also mitigate our worries.
At the moment the self-healing properties of Riak seem stronger and the configuration seems a tad easier. In your case I'd be more comfortable storing the 100MB files on GlusterFS.
Storing larger files like the 100MB files you mention would not be the right choice for plain OSS Riak.
What you really should use in that case is the newly announced Riak CS (http://basho.com/products/riakcs/) from Basho instead.
The choice depends mostly on your requirements. Generally I'd recommend Riak if you do not actually need a real filesystem (with mount points, ACL management and so on) and are just going to use or serve files programmatically, and GlusterFS otherwise.
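To make the "serve files programmatically" case concrete: Riak exposes a plain HTTP interface, so storing and fetching an object is just a PUT and a GET. This is a minimal sketch assuming a node on the default HTTP port; the host, port, bucket, and key below are illustrative, not from the original post.

```python
# Sketch of storing/fetching a JSON object through Riak's classic HTTP API
# (PUT/GET on /riak/<bucket>/<key>). Host and port are assumptions.
import json
import urllib.request

RIAK_BASE = "http://localhost:8098/riak"

def object_url(bucket, key):
    # Riak's HTTP interface addresses each object by bucket and key.
    return "{}/{}/{}".format(RIAK_BASE, bucket, key)

def put_json(bucket, key, payload):
    # PUT stores the object; the Content-Type header tells Riak its type.
    req = urllib.request.Request(
        object_url(bucket, key),
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    return urllib.request.urlopen(req)

def get_json(bucket, key):
    with urllib.request.urlopen(object_url(bucket, key)) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With GlusterFS, by contrast, you would simply mount the volume and use ordinary file I/O.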
I hope I can find a distributed filesystem that is easy to configure, easy to use, and easy to learn.
Can anyone help with this?
As the details of your usage are not mentioned, from what I can infer from the question you should try MogileFS (it is easy to set up and maintain). It is from the makers of memcached and is used to serve images, etc.
Please refer to the link below for a better explanation.
http://code.google.com/p/mogilefs/
Lustre, Gluster or MogileFS for video storage, encoding and streaming?
I suggest you consider using Apache Hadoop. It has a lot of services and technologies to work with (Cassandra, HBase, etc.). Quote from the official site:
Apache Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using a simple programming model. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.
Basically, Hadoop is a large framework. You can use Karmasphere Studio with Hadoop. I suppose with its help you can learn Hadoop much more quickly and get deeper into distributed systems.
About HDFS: read the article "GridGain and Hadoop". Short quote from there:
today HDFS is probably the most economical way to keep very large static data set of TB and PB scale in distributed file system for a long term storage
Check out Amazon Simple Storage Service (Amazon S3).
It has (practically) unlimited storage and very high availability, and it ticks most of the boxes needed for most situations. It isn't free, but it is very cheap considering what you get.
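For a sense of how simple the programming model is, here is a hedged sketch of storing and fetching objects with a boto3-style S3 client. The functions take the client as a parameter; the bucket and key names are placeholders, and in practice you would pass in `boto3.client("s3")`.

```python
# Sketch of S3 object storage via a boto3-style client. put_object and
# get_object match boto3's S3 client API; bucket/key names are illustrative.

def store(s3, bucket, key, data):
    # put_object uploads the bytes under the given key.
    s3.put_object(Bucket=bucket, Key=key, Body=data)

def fetch(s3, bucket, key):
    # get_object returns a dict whose "Body" is a readable stream.
    return s3.get_object(Bucket=bucket, Key=key)["Body"].read()
```

The same two calls cover small JSON files and 100MB binaries alike (multipart upload helps above that).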
I'm writing a document editing web service, in which documents can be edited via a website, or locally and pushed via git. I'm trying to decide if the documents should be stored as individual files on the filesystem, or in a database. The points I'm wondering about are:
If they're in a database, is there any way for git to see the documents?
How much higher are the overheads using the filesystem? I assume the OS is doing a lot more work. How can I alleviate some of this? For example, the web editor autosaves, what would the best way to cache the save data be, to minimise writes?
Does one scale significantly better or worse than the other? If all goes according to plan, this will be a service with many thousands of documents being accessed and edited.
If the documents go into a database, git can't directly see the documents. git will see the backing storage file(s) for the database, but have no way of correlating changes there to changes to files.
The overhead of using a database is higher than using a filesystem, as Carlos answered. Databases are optimized for transactions, which they'll do in memory, but they still have to hit the file. Unless you program the application to do database transactions at a sub-document level (e.g. changing only modified lines), the database will give you no performance improvement. Most modern filesystems do caching, and you can 'write' in a way that sits in RAM rather than going to your backing storage as well. You'll need to manage the granularity of the 'autosaves' in your application (every change? every 30 seconds? every 5 minutes?), but really, doing it at the same granularity with a database will cause the same amount of traffic to the backing store.
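One way to manage that autosave granularity is to keep the latest content in RAM and only touch the disk once per interval. This is an illustrative sketch, not taken from the post; the 30-second interval and class name are assumptions.

```python
# Throttled autosave: every edit lands in memory, but the filesystem is
# written at most once per `interval` seconds.
import time

class ThrottledSaver:
    def __init__(self, path, interval=30.0, clock=time.monotonic):
        self.path = path
        self.interval = interval
        self.clock = clock            # injectable for testing
        self._pending = None
        self._last_flush = -float("inf")

    def autosave(self, content):
        # Buffer the newest content; flush only if the interval has passed.
        self._pending = content
        if self.clock() - self._last_flush >= self.interval:
            self.flush()

    def flush(self):
        if self._pending is not None:
            with open(self.path, "w") as f:
                f.write(self._pending)
            self._pending = None
            self._last_flush = self.clock()
```

Call `flush()` explicitly on events like the user closing the document, so the final state always reaches disk.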
I think you intended to ask "does the filesystem scale as well as the database?" :) If you have some way to organize your files per user, and you work out the security issue of a particular user only being able to access/modify the files they should be able to (both of which are doable, IMO), the filesystem should be fine.
The filesystem will always be faster than a DB, because after all, DBs store their data in the filesystem!
Git is quite efficient on its own, as proven on GitHub, so I say stick with git and work around it.
After all, Linus should know something... ;)
I need to choose a database for storing statistical data (in fact, a series of timestamp-value pairs). I understand that virtually any database can handle this, but there are a couple of requirements:
- it should be fast;
- it should be able to handle A LOT of data (tens of gigabytes) and slice it fast;
- it should have a stable, maintained and handy interface to Erlang;
- it should be available from Python;
- it should support something like mongodb's "capped collections": a collection with a capped size, where old data is overwritten if the size reaches the limit.
I thought about mongo, but emongo seems to be a little dead - the last commit was made 7 months ago.
Riak may be a good choice (here's a comparison of Riak to MongoDB). It's written in Erlang, is distributed, fault-tolerant and scales linearly. It has clients for Erlang, JavaScript, Java, PHP, Python and Ruby, a REST interface, a protobuf interface and many other goodies (MapReduce, links, replication, pre/post-commit hooks, ...). It's open source and is created and maintained by Basho. Basho has a commercial offering of Riak as well, with some extra features (like multi-site replication, SNMP monitoring, etc.), but there's awesome value in the open-source version.
Depending on your needs it may make sense to combine a couple of technologies. For example you could front your system with an in memory store like Redis for speed and use Riak to persist the data. Redis + Riak is a pretty sweet stack.
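The Redis-in-front, Riak-behind idea reduces to a small cache-aside/write-through pattern. Below is a hedged sketch: the two stores are passed in as plain get/set objects so the pattern is independent of the actual client libraries (in practice they would be a redis-py client and a Riak client).

```python
# Cache-aside read plus write-through write, with an in-memory cache
# (e.g. Redis) fronting a durable store (e.g. Riak).

def cached_get(cache, durable, key):
    value = cache.get(key)
    if value is None:
        value = durable.get(key)      # miss: fall back to the durable store
        if value is not None:
            cache.set(key, value)     # warm the cache for next time
    return value

def cached_put(cache, durable, key, value):
    durable.set(key, value)           # write through for durability
    cache.set(key, value)
```

One caveat of this sketch: a `None` value is indistinguishable from a miss, so real deployments usually reserve a sentinel for "known absent" keys.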
I think PostgreSQL with the pgsql driver will be the best solution for you.
Files on disk, rotated, will serve your demands fine, as long as you don't need to search the data quickly.
Redis is quite a close contender.
The only current limitation is the size of the dataset: it either has to be stored in full in memory, or you use the VM method, in which only the key space has to fit in memory (though a bit of spare room for actual data would be nice), but which has a very slow startup time.
Antirez, the developer, is rewriting the backend into something called diskstore, which should solve your issue. It's not fully baked yet, but I have a lot of confidence in this project.
As for capped collections, Redis does not have a direct way of handling that, but the LTRIM command can help you out.
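The usual idiom is LPUSH followed by LTRIM: push the newest sample onto the head of a list, then trim the list so it never grows past the cap. A sketch, where `r` is any redis-py-style client and the key name is illustrative:

```python
# Emulating a capped collection with a Redis list: newest element at the
# head, everything past `cap` dropped on each insert.

def capped_push(r, key, value, cap):
    r.lpush(key, value)          # prepend the new sample
    r.ltrim(key, 0, cap - 1)     # keep only the `cap` most recent samples
```

Both commands are O(1)/O(removed), so this stays cheap even at high write rates.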
We've been discussing the design of a data warehouse strategy within our group for meeting testing, reproducibility, and data syncing requirements. One of the suggested ideas is to adopt a NoSQL approach using an existing tool rather than try to re-implement a lot of the same functionality on a file system. I don't know if a NoSQL approach is even the best approach to what we're trying to accomplish, but perhaps if I describe what we need/want you all can help.
Most of our files are large, 50+ GB in size, held in a proprietary third-party format. We need to be able to access each file by a name/date/source/time/artifact combination. Essentially a key-value pair style look-up.
When we query for a file, we don't want to have to load all of it into memory. They're really too large and would swamp our server. We want to be able to somehow get a reference to the file and then use a proprietary, third-party API to ingest portions of it.
We want to easily add, remove, and export files from storage.
We'd like to set up automatic file replication between two servers (we can write a script for this.) That is, sync the contents of one server with another. We don't need a distributed system where it only appears as if we have one server. We'd like complete replication.
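The replication script you mention can be as small as a wrapper around rsync, which already handles incremental transfer and deletion propagation. A sketch, where the directory paths and destination host are placeholders:

```python
# Mirror one server's data directory to another with rsync.
# -a preserves metadata; --delete makes the destination an exact mirror.
import subprocess

def build_command(source_dir, dest):
    # Trailing slash on the source means "contents of", not the dir itself.
    return ["rsync", "-a", "--delete", source_dir.rstrip("/") + "/", dest]

def sync(source_dir, dest):
    # check=True raises CalledProcessError if rsync reports a failure.
    return subprocess.run(build_command(source_dir, dest), check=True)
```

Run it from cron (or a systemd timer) on one server to get the "complete replication" behavior you describe.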
We also have other smaller files that have a tree-type relationship with the big files. One file's content will point to the next, and so on. It's not a "spoked wheel", it's a full-blown tree.
We'd prefer a Python, C or C++ API to work with a system like this, but most of us are experienced with a variety of languages. We don't mind as long as it works, gets the job done, and saves us time. What do you think? Is there something out there like this?
Have you had a look at MongoDB's GridFS?
http://www.mongodb.org/display/DOCS/GridFS+Specification
You can query files by the default metadata plus your own additional metadata. Files are broken into small chunks, and you can specify which portions you want. Also, files are stored in a collection (similar to an RDBMS table), and you get Mongo's replication features to boot.
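That "which portions you want" point matters for your 50+ GB files: the GridOut objects that pymongo's gridfs module returns support seek() and read(), so you can pull a byte range without loading the whole file. The helper below works on any seekable file-like object; the gridfs call in the comment assumes a running MongoDB and is illustrative only.

```python
# Read a byte range from a seekable file-like object, such as the GridOut
# returned by gridfs, e.g.:
#   grid_out = gridfs.GridFS(db).find_one({"filename": "big.dat"})

def read_slice(file_like, offset, length):
    # Only `length` bytes are read; the rest of the file stays on the server.
    file_like.seek(offset)
    return file_like.read(length)
```

You would hand the resulting bytes to your proprietary third-party API for ingestion.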
What's wrong with a proven cluster filesystem? Lustre and Ceph are good candidates.
If you're looking for an object store, Hadoop was built with this in mind. In my experience Hadoop is a pain to work with and maintain.
For me, both Lustre and Ceph have some problems that databases like Cassandra don't have. I think the core question here is what disadvantages Cassandra and other databases like it would have as an FS backend.
Performance could obviously be one. What about space usage? Consistency?
A web app I'm working on requires frequent parsing of diverse web resources (HTML, XML, RSS, etc). Once downloaded, I need to cache these resources to minimize network load. The app requires a very straightforward cache policy: only re-download a cached resource when more than X minutes have passed since the access time.
Should I:
1. Store both the access time (e.g. 6/29/09 at 10:50 am) and the resource itself in the database.
2. Store the access time and a unique identifier in the database. The unique identifier is the filename of the resource, stored on the local disk.
3. Use another approach or third-party software solution.
Essentially, this question can be re-written as, "Which is better for storing moderate amounts of data - a database or flat files?"
Thanks for your help! :)
NB: The app is running on a VPS, so size restrictions on the database/flat files do not apply.
To answer your question "Which is better for storing moderate amounts of data - a database or flat files?":
The answer is (in my opinion) flat files. Flat files are easier to back up and easier to remove.
However, you have extra information that isn't encapsulated in this question: mainly the fact that you will need to access this stored data to determine whether a resource has gone stale.
Given this need, it makes more sense to store it in a database. Flat files do not lend themselves well to random access and search, compared to a relational DB.
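As a rough illustration of the database option, a single SQLite table covers the whole policy: store each resource's body and last access time, and treat an entry as stale once it exceeds the age limit. The table, column names, and 15-minute limit below are assumptions for the sketch.

```python
# SQLite-backed resource cache with a time-based staleness check.
import sqlite3
import time

MAX_AGE_MINUTES = 15   # placeholder for the "X minutes" in the question

def init(conn):
    conn.execute("""CREATE TABLE IF NOT EXISTS cache
                    (url TEXT PRIMARY KEY, body BLOB, accessed REAL)""")

def store(conn, url, body, now=None):
    now = time.time() if now is None else now
    conn.execute("INSERT OR REPLACE INTO cache VALUES (?, ?, ?)",
                 (url, body, now))

def fresh_body(conn, url, now=None):
    # Returns the cached body, or None if missing or older than the limit
    # (in which case the caller should re-download).
    now = time.time() if now is None else now
    row = conn.execute("SELECT body, accessed FROM cache WHERE url = ?",
                       (url,)).fetchone()
    if row is None or now - row[1] > MAX_AGE_MINUTES * 60:
        return None
    return row[0]
```

The flat-file variant would keep the same table but store only a filename in place of the body column.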
It depends on the platform. If you use .NET, the answer is option 3: use the Cache object, which is ideally suited for this in ASP.NET.
You can set time and dependency expiration.
This doc explains the Cache object:
https://web.archive.org/web/1/http://articles.techrepublic%2ecom%2ecom/5100-10878_11-5034946.html
Neither.
Have a look at memcached to see if it works with your server/client platform. This is easier to set up and performs much better than filesystem/rdbms based caching, provided you can spare the RAM needed for the data being cached.
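One nice property of the memcached route is that the expiry policy lives in the cache itself: set each entry with a TTL and memcached evicts it for you. A hedged sketch, where `mc` is any memcached-style client with set/get (e.g. python-memcached) and the 15-minute TTL is an assumed value of X:

```python
# Memcached-style caching where the TTL enforces the X-minute policy.

TTL_SECONDS = 15 * 60   # assumed value of "X minutes"

def cache_resource(mc, url, body):
    # memcached drops the entry automatically once the TTL elapses.
    mc.set(url, body, time=TTL_SECONDS)

def get_resource(mc, url, download):
    body = mc.get(url)
    if body is None:             # miss or expired: fetch and re-cache
        body = download(url)
        cache_resource(mc, url, body)
    return body
```

`download` here is whatever function fetches and parses the remote resource; it is only called on a miss.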
All of the proposed solutions are reasonable. However, for my particular needs, I went with flat files. Oddly enough, though, I did so for reasons not mentioned in some of the other answers. It doesn't really matter to me that flat files are easier to back up and remove, and both DB and flat-file solutions allow for easy checking of whether or not the cached data has gone stale. I went with flat files first and foremost because, on my mid-sized one-box VPS LAMP architecture, I think it will be faster than a third-party cache or DB-based solution.
Thanks to all for your thoughts! :)