Distributed File Systems: GridFS vs. GlusterFS vs Ceph vs HekaFS Benchmarks [closed] - filesystems

I am currently searching for a good distributed file system.
It should:
be open-source
be horizontally scalable (replication and sharding)
have no single point of failure
have a relatively small footprint
Here are the four most promising candidates in my opinion:
GridFS (based on MongoDB)
GlusterFS
Ceph
HekaFS
The filesystem will be used mainly for media files (images and audio). The files range from very small to medium sized (1 KB - 10 MB), and there should be several million of them.
Are there any benchmarks regarding performance, CPU-load, memory-consumption and scalability? What are your experiences using these or other distributed filesystems?

I'm not sure your list is quite correct. It depends on what you mean by a file system.
If you mean a file system that is mountable in an operating system and usable by any application that reads and writes files using POSIX calls, then GridFS doesn't really qualify. It is just how MongoDB stores BSON-formatted objects. It is an Object system rather than a File system.
There is a project to make GridFS mountable, but it is a little weird because GridFS doesn't have concepts for things like hierarchical directories, although paths are allowed. Also, I'm not sure how distributed writes on gridfs-fuse would be.
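To illustrate the point about paths without directories, here is a minimal sketch of a flat object namespace. It uses a plain Python dict as a stand-in for an object store; it is not the GridFS API, and `FlatStore` and its methods are hypothetical names. Listing a "directory" is just a prefix scan over all keys.

```python
class FlatStore:
    """Toy object store: keys are opaque strings, there are no directory objects."""

    def __init__(self):
        self._objects = {}

    def put(self, name, data):
        # "images/2012/cat.jpg" is just one key; no "images/" entry is created.
        self._objects[name] = data

    def get(self, name):
        return self._objects[name]

    def list_prefix(self, prefix):
        # Emulates "ls images/" by scanning every key: O(number of objects).
        return sorted(k for k in self._objects if k.startswith(prefix))


store = FlatStore()
store.put("images/2012/cat.jpg", b"...")
store.put("images/2013/dog.jpg", b"...")
store.put("audio/intro.ogg", b"...")
print(store.list_prefix("images/"))
```

A mountable file system has to emulate things like mkdir, rename of a directory, or an empty directory on top of such a namespace, which is where the mismatch shows.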
GlusterFS and Ceph are comparable and are distributed, replicable mountable file systems. You can read a comparison between the two here (and followup update of comparison), although keep in mind that the benchmarks are done by someone who is a little biased. You can also watch this debate on the topic.
As for HekaFS, it is GlusterFS that is set up for cloud computing, adding encryption and multitenancy as well as an administrative UI.

After working with Ceph for 11 months I came to the conclusion that it utterly sucks, so I suggest avoiding it. I tried XtreemFS, RozoFS and QuantcastFS but found them not good enough either.
I wholeheartedly recommend LizardFS which is a fork of now proprietary MooseFS. LizardFS features data integrity, monitoring and superior performance with very few dependencies.
2019 update: the situation has changed and LizardFS is not actively maintained any more.
MooseFS is stronger than ever and free from most LizardFS bugs. MooseFS is well maintained and it is faster than LizardFS.
RozoFS has matured and may be worth a try.
GfarmFS has its niche, but today I would choose MooseFS for most applications.

OrangeFS, anyone?
I am looking for an HPC DFS and found this discussion here:
http://forums.gentoo.org/viewtopic-t-901744-start-0.html
Lots of good data and comparisons :)
After some talk the OP decided on OrangeFS, quoting:
"OrangeFS. It does not support quotas nor file locks (though all i/o operations are atomic and this
way consistency is kept without locks). But it works, and works well and stable. Furthermore this is
not a general file storage oriented system, but HPC dedicated one, targeted on parallel I/O including
ROMIO support. All test were done for stripe data distribution.
a) No quotas — to hell quotas. I gave up on them anyway, even glusterfs supports not common
uid/gid based quotas, but directory size limitations, more like LVM works.
b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata
storage (single node) this gives +50% performance on small files and no significant difference on
large ones.
c) Excellent performance on large data chunks (dd bs=1M). It is limited by a sum of local hard drive
(do not forget each node participates as a data server as well) speed and available network bandwidth.
CPU consumption on such load is decent and is about 50% of single core on a client node and about
10% percents on each other data server nodes.
d) Fair performance on large sets of small files. For the test I untared linux kernel 3.1. It took 5 minutes
over OrangeFS (with tuned parameters) and almost 2 minutes over NFSv4 (tuned as well) for comparison.
CPU load is about 50% of single core (of course, it is actually distributed between cores) on the client and
about several percents on each node.
e) Support of ROMIO MPI I/O API. This is a sweet yummy for MPI aware applications, which allows to use
PVFS2/OrangeFS parallel input-output features directly from applications.
f) No support for special files (sockets, fifo, block devices). Thus can't be safely used as /home and I use
NFSv4 for that task providing users quota-restricted small home space. Though most distributed
filesystems don't support special files anyway. "

I do not know about the other systems you posted but I have made a comparison of 3 PHP CMS/Frameworks on local storage vs GlusterFS to see if it does better on real world tests than raw benchmarks. Sadly not.
http://blog.lavoie.sl/2013/12/glusterfs-performance-on-different-frameworks.html

Related

How do I decide between LZ4 and Snappy compression? [closed]

I need to select a compression algorithm when configuring a "well-known application".
Also, as part of my day job, my company is developing a distributed application that deals with a fair amount of data. We've been looking into compressing data to try to reduce network bandwidth, but we are hitting a wall on which algorithm to use. There are too many choices.
How do I decide between LZ4 and Snappy?
TL;DR The answer is always LZ4.
First, let's discuss what they have in common
They are both algorithms that are designed to operate at "wire" speed (order of 1 GB/s per core) when compressing and decompressing.
The main use case is to apply compression before writing data to disk or to the network (which usually operate nowhere near GB/s). Compressing data reduces I/O, and it is effectively transparent because the compression algorithm is faster than reading from or writing to the medium.
Both algorithms appeared in early 2010s and can be considered relatively recent. It takes a good decade for newer technologies to gain adoption and for optimized stable libraries to emerge in all popular languages.
They are both widely usable now and have good libraries available (I am writing this in 2021); however, that wasn't the case a few years ago.
They both compress at a similar speed and with a similar compression ratio (the exception is decompression speed, where LZ4 is much faster).
For historical reference, there is a third algorithm called LZO that plays in the same league. It's much older (paper from 1996) and not widely used.
Second, let's discuss the differences.
While they are both extremely fast, LZ4 is (slightly) faster and stronger, hence it should be preferred.
In particular when it comes to decompression speed, LZ4 is multiple times faster.
LZ algorithms are generally extremely fast at decompression (they can operate in constant time), that's one of the reasons they are popular. LZ4 was constructed to fully take advantage of that property and saturate CPU/memory bandwidth.
In addition, LZ4 is tunable: the compression level can be finely tuned from 1 to 16, which allows stronger compression if you have CPU to spare. It would be great if all software supporting LZ4 exposed the compression level as a setting, but not all do.
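The level trade-off is easy to measure on your own data. The sketch below is only a harness: it uses `zlib` from the Python standard library as a stand-in codec (not LZ4 or snappy), and the sample payload is an assumption. Swap in whichever LZ4 binding you use and a representative chunk of real data to get meaningful numbers.

```python
import time
import zlib


def measure(data, level):
    """Return (compression ratio, seconds spent compressing) for one zlib level."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check so a broken codec setup is caught early.
    assert zlib.decompress(packed) == data
    return len(data) / len(packed), elapsed


# Assumed sample payload; replace with a representative chunk of your own data.
sample = b"GET /media/img/0001.jpg HTTP/1.1 200 5120\n" * 20000

for level in (1, 6, 9):
    ratio, seconds = measure(sample, level)
    print(f"level={level} ratio={ratio:.1f} time={seconds * 1000:.2f} ms")
```

Timings from a single run are noisy; repeat the measurement and take the best or median value before drawing conclusions.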
"Faster is better", of course, yet you may be tempted to ask whether it really matters at these kinds of speeds. Do we care whether we do 1 GB/s or 2 GB/s per core?
The answer is yes, because the effect is noticeable and on-the-wire compression should keep up with the hardware it's running on, including NVMe SSD (750+ MB/s) and local network (1.25+ GB/s).
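A back-of-the-envelope model makes the point concrete. This is a simplified sketch, assuming compression and transfer are fully pipelined so the slower stage sets the pace; the compression ratio of 2.0 is an assumed value, and `effective_throughput` is a name made up for this example.

```python
def effective_throughput(codec_gb_s, link_gb_s, ratio):
    """Uncompressed GB/s delivered when compression and transfer are pipelined.

    The slower stage wins: either the codec itself, or the link, which
    carries `ratio` bytes of original data for every byte sent.
    """
    return min(codec_gb_s, link_gb_s * ratio)


LINK = 1.25   # GB/s, the local-network figure quoted above
RATIO = 2.0   # assumed compression ratio; measure it on real data

for codec in (1.0, 2.0):
    print(codec, "GB/s codec ->", effective_throughput(codec, LINK, RATIO), "GB/s")
```

Under these assumptions a 1 GB/s codec delivers less than the uncompressed 1.25 GB/s link would, so compression slows the transfer down, while a 2 GB/s codec comes out ahead.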
For client-server applications where the server will be receiving and decompressing many streams from many clients, the cost of decompression can add up real quick. One practical example is distributed queues like Kafka that have to decompress/recompress data on the fly, adapting to whatever formats the many clients can send/receive.
Another major use case is databases, where data can be compressed before being stored on disk. A well-known example is ElasticSearch: data is compressed with LZ4 out of the box (internally, data is immutable/append-only, which works very well with compression and logs). When you run a query on the last month of logs, it could be decompressing terabytes of data on the fly (1 GB/s doesn't sound that quick anymore ;) ).
Third, compatibility and availability of libraries
Last but not least, you will need to find some libraries to support whichever compression you intend to use.
Or, if we are talking about tuning a third party application/database, you will need to see what algorithms can be configured.
As of 2021 when I am writing this answer, there are mature libraries available in all popular languages for LZ4 (and snappy (and ZSTD)).
If you're the one developing software that could benefit from wire-speed compression, you should use LZ4. If you are looking for stronger, albeit slower, compression, you can look into ZSTD instead. Forget about snappy.
One exception may be some Java software that may support snappy but not lz4.
A Bit of History and Software Archeology
There is one edge case around java software. Snappy had an optimized java implementation for much longer, notably driven by Kafka. There's a good chance you ended up on this post because you were looking into tuning Kafka compression.
Kafka settled on snappy compression early-on and required all kafka clients (in all languages) to support snappy. It drove snappy adoption and further optimization.
You may see old comparisons that put snappy ahead, for example this extensive Kafka benchmark from CloudFlare from 2018. The reason is that the article is old and LZ4 wasn't equally supported/optimized at the time (CloudFlare couldn't use lz4 in the end because not all clients supported it at the time).
It's hard work to retrofit more compression algorithms into existing systems. LZ4 (and ZSTD) should be supported now, but your mileage may vary. You may need to upgrade your cluster and upgrade your client libraries. You may find that some client libraries don't support it. Since the difference between snappy and lz4 is thin, it's not worth the hassle to tune if you've got either working fine.
On a side note: if you run across multiple datacenters and find yourself heavily limited by the network, you should have a look at ZSTD, which has much stronger compression (it can reduce network traffic by a factor of 2 or 3).
LZ4 is mature and widely usable now; it wasn't as much pre-2020 (same with snappy outside of Java). Many software projects are seeing noticeable performance improvements by adopting LZ4, and then further improvements as the libraries are deeply optimized.

Is there any high performance POSIX-like filesystem without a single point of failure? [closed]

We have a web service that needs a somewhat POSIX-compatible shared filesystem for the application servers (multiple redundant systems running in parallel behind redundant load balancers). We're currently running GlusterFS as the shared filesystem for the application servers but I'm not happy with the performance of the system. Compared to actual raw performance of the storage servers running GlusterFS, it starts to look more sensible to run DRBD and single NFS server with all the other GlusterFS servers (currently 3 servers) waiting in hot-standby role.
Our workload is highly read oriented and usually deals with small files and I'd be happy to use "eventually consistent" system as long as a client can request sync for a single file if needed (that is, client is prepared to wait until the file has been successfully stored in the backend storage). I'd even accept a system where such "sync" requires querying the state of the file via some other way than POSIX fdatasync(). File metadata such as modification times is not important, only filename and the contents.
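For reference, the per-file "sync" described above looks roughly like this on a local POSIX file system. It is a minimal sketch: `durable_write` is a name made up for this example, and whether a given distributed file system client actually waits for the backend on this call is something to verify for each candidate.

```python
import os
import tempfile

# os.fdatasync is missing on some platforms (e.g. macOS); fall back to os.fsync.
_sync = getattr(os, "fdatasync", os.fsync)


def durable_write(path, data):
    """Write data to path and block until the kernel reports it flushed."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(data)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        _sync(fd)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "upload.bin")
    durable_write(target, b"payload")
    with open(target, "rb") as fh:
        print(fh.read())
```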
I'm currently aware of possible candidates and the problems each one currently has:
GlusterFS: overall performance is pretty poor in practice, performance goes down while adding new servers/bricks.
Ceph: highly complex to configure/administrate, POSIX compatibility sacrifices performance a lot as far as I know.
MooseFS: partially obfuscated open source (huge dumps of internally written code published seldomly with intentionally lost patch history), documentation leaves lots to desire.
SeaweedFS: pretty simple design and supposedly high performance, future of this project is unclear because pretty much all code is written and maintained by Chris Lu - what happens if he no longer writes any code? Unclear if the "Filer" component supports no single point of failure.
I know that the CAP theorem prevents ever having a truly consistent and always available system. Is there any good distributed file system where writes must be durable, but read performance is really good and the system has no single point of failure?
I am Chris Lu working on SeaweedFS. There are plans to commercialize it. (By adding more advanced features.)
The filer does not have a single point of failure: you can have multiple filer instances. The filer store can be any key-value store. If you need no SPOF, you can use Cassandra, Redis cluster, CockroachDB, TiDB, or Etcd. Or you can add your own key-value store option, which is pretty easy.

Fast distributed file system for small file [closed]

Our company has five million users, and we store users' code files. Users can edit and add files, as in a web IDE, and the web IDE lists each user's files. We implement these operations with PHP functions such as readdir, file_get_contents and file_put_contents. We used MooseFS, but when we read the files in the program, loading is particularly slow.
So we need to replace the file system. We have a huge number of small files; which distributed file system should we use? I hope someone can give me some advice.
Five million entries is small to a relational database. I'd wonder why you feel the need to store these in a file system.
Does every user require that all files be loaded on startup? If yes, I'd wonder about the design of the system. That operation is O(N) no matter how you design it.
If you put those five million small files into a relational or NoSQL database, and then let each user connect to it and query for the particular ones they want, then you eliminate the need to load them repeatedly on startup. Problem solved.
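A minimal sketch of that idea, using SQLite from the Python standard library purely for illustration. The table layout and the function names are assumptions, not something from the question; a production setup would more likely use a server database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path or a server database in practice
conn.execute(
    """CREATE TABLE files (
           user_id INTEGER NOT NULL,
           path    TEXT    NOT NULL,
           content BLOB    NOT NULL,
           PRIMARY KEY (user_id, path)
       )"""
)


def put_file(user_id, path, content):
    # Upsert so that saving an existing file replaces its content.
    conn.execute(
        "INSERT OR REPLACE INTO files (user_id, path, content) VALUES (?, ?, ?)",
        (user_id, path, content),
    )
    conn.commit()


def list_files(user_id):
    # Served by the primary-key index; no directory scan involved.
    rows = conn.execute(
        "SELECT path FROM files WHERE user_id = ? ORDER BY path", (user_id,)
    )
    return [path for (path,) in rows]


def get_file(user_id, path):
    row = conn.execute(
        "SELECT content FROM files WHERE user_id = ? AND path = ?", (user_id, path)
    ).fetchone()
    return None if row is None else row[0]


put_file(42, "src/main.php", b"<?php echo 'hi';")
put_file(42, "README.txt", b"notes")
print(list_files(42))
print(get_file(42, "src/main.php"))
```

These three functions correspond to what the question does with readdir, file_get_contents and file_put_contents.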
In any distributed filesystem, one of the most crucial aspects for operations on small files is network latency: it should be as small as possible (like 0.1 ms) between the distributed filesystem components. The best way to achieve this is to use a reliable switch and connect all machines to the same switch.
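A rough estimate shows why latency dominates for small files. The sketch below assumes files are fetched one after another and that each file costs a fixed number of network round trips; the workload numbers and the function name are made up for illustration.

```python
def serial_load_time_ms(n_files, round_trips_per_file, latency_us):
    """Lower bound on wall time when files are fetched one after another."""
    return n_files * round_trips_per_file * latency_us / 1000


# Assumed workload: 1,000 small files, 2 round trips each (metadata + data).
for latency_us in (100, 1000):  # 0.1 ms (same switch) vs 1 ms
    print(latency_us, "us ->", serial_load_time_ms(1000, 2, latency_us), "ms")
```

Transfer time for a few kilobytes is negligible next to these figures, so a tenfold increase in latency translates almost directly into a tenfold slower listing or load.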
Also, in distributed filesystems (especially in MooseFS) the best thing is scalability: the more nodes you have (and the more your calculations are distributed, i.e. done simultaneously on more than one mount), the faster the cluster is.
If you use MooseFS, please check out MooseFS 3.0, because operations on small files have been improved since version 3.0. This is the easy way for now, because you don't have to make a "revolution" (before upgrading, remember to back up /var/lib/mfs, i.e. the metadata, on the Master Server). MooseFS can handle small files well, so maybe there's a problem in the configuration?
In MooseFS additionally (still considering small-file operations), one of the most important things for the Master Server is a high CPU clock (e.g. 3.7 GHz) with a small number of CPU cores and energy-saving options disabled in the BIOS (because the Master Server is a single-threaded process). For Chunkservers and Clients the situation is different: they are multi-threaded, so you'll get better results using multicore CPUs.
Additionally, as stated in MooseFS Best practices in paragraph 4. "Virtual Machines and MooseFS":
[...] we do not recommend running MooseFS components (especially Master Server(s)) on Virtual Machines.
So if you run MFS on VMs, you may in fact get poor results.

Creating a Cluster Fileserver System

I currently have 3 file servers, each with a RAID 6 array of 24 disks.
The question is: is there any way to make them work as one big drive rather than 3 separate systems? I need more throughput and I was thinking this was a possibility. Maybe a distributed filesystem like Hadoop?
The answer depends on the intended usage of the data on this hardware.
The Hadoop file system, HDFS, is suited to the very specific needs of Map-Reduce processing. Its main limitations, which are fine for its intended use but problematic for others, are:
a) Files cannot be edited, only appended.
b) Storing many small files is a problem. It is designed for files of 64 MB and more. The cause of this limitation is that all metadata is stored in memory.
c) It is not a POSIX-compliant FS, so you cannot mount it and use it as a regular file system from applications unaware of HDFS.
I would consider options like GlusterFS, Ceph or Lustre, which are built for cases similar to the one you describe. More information is needed to give good advice on selecting one of them.

Best C language key/value database around for massive amounts of entries

I am trying to create a key/value database with 300,000,000 key/value pairs of 8 bytes each (both for the key and the value). The requirement is to have a very fast key/value mechanism which can query about 500,000 entries per second.
I tried BDB, Tokyo DB, Kyoto DB, and levelDB and they all perform very badly when it comes to databases of that size. (Their performance is not even close to their benchmarked rate at 1,000,000 entries.)
I cannot store my database in memory because of hardware limitations (32 bit software), so memcached is out of the question.
I cannot use external server software as well (only a database module), and there is no need for multi-user support at all. Of course server software cannot hold 500,000 queries per second from a single endpoint anyways, so that leaves out Redis, Tokyo tyrant, etc.
David Segleau, here. Product Manager for Berkeley DB.
The most common problem with BDB performance is that people don't configure the cache size, leaving it at the default, which is pretty small. The second most common problem is that people write application behavior emulators that do random look-ups (even though their application is not really completely random), which forces them to read data from outside the cache. The random I/O then takes them down a path of conclusions about performance that are based on the simulated application rather than the actual application behavior.
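The effect of the access pattern on cache behavior can be shown with a small simulation. This is a generic LRU model written for illustration, not Berkeley DB's cache, and the sizes and the 90/5 skew are assumed numbers.

```python
import random
from collections import OrderedDict


def hit_rate(keys, cache_size):
    """Fraction of lookups served from an LRU cache of cache_size entries."""
    cache = OrderedDict()
    hits = 0
    for key in keys:
        if key in cache:
            hits += 1
            cache.move_to_end(key)         # mark as most recently used
        else:
            cache[key] = True
            if len(cache) > cache_size:
                cache.popitem(last=False)  # evict least recently used
    return hits / len(keys)


rng = random.Random(0)
N_KEYS, CACHE, LOOKUPS = 100_000, 10_000, 200_000

# Emulator-style workload: every key equally likely.
uniform = [rng.randrange(N_KEYS) for _ in range(LOOKUPS)]
# Assumed "real" workload: 90% of lookups go to a hot 5% of the keys.
hot = N_KEYS // 20
skewed = [
    rng.randrange(hot) if rng.random() < 0.9 else rng.randrange(N_KEYS)
    for _ in range(LOOKUPS)
]

print("uniform:", round(hit_rate(uniform, CACHE), 2))
print("skewed: ", round(hit_rate(skewed, CACHE), 2))
```

With a cache that holds a tenth of the keys, the uniform workload misses on most lookups, while the skewed one is served mostly from memory, so a benchmark built on uniform random keys can badly understate what the real application would see.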
From your description, I'm not sure if you're running into these common problems or maybe into something else entirely. In any case, our experience is that Berkeley DB tends to perform and scale very well. We'd be happy to help you identify any bottlenecks and improve your BDB application throughput. The best place to get help in this regard would be on the BDB forums at: http://forums.oracle.com/forums/forum.jspa?forumID=271. When you post to the forum it would be useful to show the critical query segments of your application code and the db_stat output showing the performance of the database environment.
It's likely that you will want to use BDB HA/Replication in order to load balance the queries across multiple servers. 500K queries/second is probably going to require a larger multi-core server or a series of smaller replicated servers. We've frequently seen BDB applications with 100-200K queries/second on commodity hardware, but 500K queries per second on 300M records in a 32-bit application is likely going to require some careful tuning. I'd suggest focusing on optimizing the performance of the queries on the BDB application running on a single node, and then using HA to distribute that load across multiple systems in order to scale your query/second throughput.
I hope that helps.
Good luck with your application.
Regards,
Dave
I found a good benchmark comparison web page that basically compares 5 renowned databases:
LevelDB
Kyoto TreeDB
SQLite3
MDB
BerkeleyDB
You should check it out before making your choice: http://symas.com/mdb/microbench/.
P.S - I know you've already tested them, but you should also consider that your configuration for each of these tests was not optimized as the benchmark shows otherwise.
Try ZooLib.
It provides a database with a C++ API, that was originally written for a high-performance multimedia database for educational institutions called Knowledge Forum. It could handle 3,000 simultaneous Mac and Windows clients (also written in ZooLib - it's a cross-platform application framework), all of them streaming audio, video and working with graphically rich documents created by the teachers and students.
It has two low-level APIs for actually writing your bytes to disk. One is very fast but is not fault-tolerant. The other is fault-tolerant but not as fast.
I'm one of ZooLib's developers, but I don't have much experience with ZooLib's database component. There is also no documentation - you'd have to read the source to figure out how it works. That's my own damn fault, as I took on the job of writing ZooLib's manual over ten years ago, but barely started it.
ZooLib's primary developer Andy Green is a great guy and always happy to answer questions. What I suggest you do is subscribe to ZooLib's developer list at SourceForge, then ask on the list how to use the database. Most likely Andy will answer you himself, but maybe one of our other developers will.
ZooLib is Open Source under the MIT License, and is really high-quality, mature code. It has been under continuous development since 1990 or so, and was placed in Open Source in 2000.
Don't be concerned that we haven't released a tarball since 2003. We probably should, as this leads lots of potential users to think it's been abandoned, but it is very actively used and maintained. Just get the source from Subversion.
Andy is a self-employed consultant. If you don't have time but you do have a budget, he would do a very good job of writing custom, maintainable top-quality C++ code to suit your needs.
I would too, if it were any part of ZooLib other than the database, which as I said I am unfamiliar with. I've done a lot of my own consulting work with ZooLib's UI framework.
300 M * 8 bytes = 2.4GB. That will probably fit into memory (if the OS does not restrict the address space to 31 bits)
Since you'll also need to handle overflow (either by a rehashing scheme or by chaining), memory gets even tighter. For linear probing you probably need > 400M slots; chaining will increase the size of an item to 12 bytes (bit fiddling might gain you a few bits). That would increase the total footprint to circa 3.6 GB.
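The arithmetic behind these figures, spelled out. It follows the estimate above in counting 8 bytes per entry; if key and value are 8 bytes each, double `ENTRY_BYTES`. The 4-byte link for chaining is an assumption that matches the 12-byte item size mentioned above.

```python
PAIRS = 300_000_000
ENTRY_BYTES = 8            # per-entry size used in the estimate above
ADDRESS_SPACE = 2 ** 32    # bytes addressable by a 32-bit process, before the kernel's share

plain = PAIRS * ENTRY_BYTES             # densely packed, no overflow handling
probing = 400_000_000 * ENTRY_BYTES     # linear probing: 400M slots for 300M entries
chaining = PAIRS * (ENTRY_BYTES + 4)    # assumed 4-byte "next" link per item

for name, size in (("plain", plain), ("probing", probing), ("chaining", chaining)):
    print(f"{name:9s} {size / 1e9:.1f} GB, {size / ADDRESS_SPACE:.0%} of 4 GiB")
```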
In any case you will need a specially crafted kernel that restricts its own "reserved" address space to a few hundred MB. Not impossible, but a major operation. Escaping to a disk-based thing would be too slow, in all cases. (PAE could save you, but it is tricky.)
IMHO your best choice would be to migrate to a 64 bits platform.
500,000 entries per second without holding the working set in memory? Wow.
In the general case this is not possible using HDDs, and it is difficult even with SSDs.
Have you any locality properties that might help to make the task a bit easier? What kind of queries do you have?
We use Redis. Written in C, it's only slightly more complicated than memcached by design. We have never tried to use that many rows, but for us latency is very important, and it handles those latencies well and lets us store the data on disk.
Here is a bench mark blog entry, comparing redis and memcached.
Berkeley DB could do it for you.
I achieved 50000 inserts per second about 8 years ago and a final database of 70 billion records.
