Distributed File System For High Concurrent Access Of Small Files - filesystems

What DFS technologies are out there for high concurrent access (say, by 10,000 remote threads on a local 1 Gb/s network) to 1,000,000 files that are only in the MB size range, where the DFS should provide highly concurrent streams of them to users?

Common HPC filesystems such as Lustre or GPFS often do not provide good support for the scenario you describe; they are instead optimized towards high bandwidth on large file accesses. In the HPC context you should consider using I/O middleware such as MPI-IO, or high-level I/O libraries such as HDF5, rather than interfacing with the file system directly. Those libraries can hide the complexity of optimizing accesses to specific file systems from your application; which one is suitable depends on the structure of your application scenario.
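For illustration, here is a minimal MPI-IO sketch in C (the file name, record size, and write pattern are invented for this example): each rank writes its own fixed-size record into a shared file at a rank-derived offset, and the MPI implementation takes care of the file-system-specific optimizations.

```c
/* Minimal MPI-IO sketch: each rank writes one fixed-size record into a
 * shared file at an offset derived from its rank. File name and record
 * size are arbitrary choices for this example. */
#include <mpi.h>
#include <string.h>

#define RECORD_SIZE 4096  /* illustrative record size */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char record[RECORD_SIZE];
    memset(record, 'A' + (rank % 26), sizeof(record));

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "shared_records.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Each rank writes to its own, non-overlapping region. */
    MPI_Offset offset = (MPI_Offset)rank * RECORD_SIZE;
    MPI_File_write_at(fh, offset, record, RECORD_SIZE, MPI_CHAR,
                      MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    MPI_Finalize();
    return 0;
}
```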
On the other hand, for highly concurrent and unstructured small accesses, you might want to look into cloud-related technologies, e.g. the Google File System, distributed key-value stores, or Cassandra, just to give a few pointers for further research.
The general "file" abstraction and access approach (POSIX interface) was not designed for highly concurrent access which makes it difficult to conform with the interface and provide high concurrency at the same time.
If you want more specific hints for suitable technology, please provide some more specific information about your use-case(s).

Related

How are the individual SAN elements controlled?

In a SAN environment, we would have multiple storage devices (say each of them with 1 TB), so cumulatively the SAN would provide hundreds of GBs of storage capacity.
Which software is responsible for slicing this storage capacity up for each VM (say 500 GB per VM)? Where does it reside?
I am finding it hard to picture this concept.
Depending on the technologies involved, there are multiple ways to do this. For example, in block-storage environments, LUNs from different storage systems can be concatenated/striped/mirrored/RAIDed by volume manager software on the target server. The same effect can be achieved by hardware virtualization on the storage systems: for example, one of the storage devices can act as a "roof" for all the rest of the devices (also look at the thin-provisioning topic). In the NAS world, it is possible to build big trees of filesystems using different mount points for different storage systems.

Understanding KeyValue embedded datastore vs FileSystem

I have a basic question regarding filesystem usage.
I want to use an embedded, persistent key-value store that is very write-oriented. Say my value size is
a) 10 KB
b) 1 MB
and reads and updates are equal in number.
Can't I simply create files containing the values, with their names acting as keys?
Won't that be as fast as using a key-value store such as LevelDB or RocksDB?
Can anybody please help me understand.
In principle, yes, a filesystem can be used as a key-value store. The differences only come in when you look at individual use cases and limitations in the implementations.
Without going into too much detail here, there are some things that are likely to be very different:
A filesystem splits data into fixed-size blocks. Two files can't typically occupy parts of the same block. Common block sizes are 4-16 KiB; with 4 KiB blocks, for example, a 10 KiB value occupies three blocks (12 KiB), about 20% overhead. Key/value stores tend to account for smaller-sized pieces of data.
Directory indexes in filesystems are often not capable of efficiently iterating over the filenames/keys in sort order. You can efficiently look up a specific key, but you can't retrieve ranges without reading pretty much all of the directory entries. Some key/value stores, including LevelDB, support efficient ordered iterating.
Some key/value stores, including LevelDB, are transactional. This means you can bundle several updates together, and LevelDB will make sure that either all of these updates make it through, or none of them do. This is very important to prevent your data getting inconsistent. Filesystems make this much harder to implement, especially when multiple files are involved.
Key/value stores usually try to keep data contiguous on disk (so data can be retrieved with less seeking), whereas modern filesystems deliberately do not do this across files. This can impact performance rather severely when reading many records. It's not an issue on solid-state disks, though.
While some filesystems do offer compression features, they are usually either per-file or per-block. As far as I can see, LevelDB compresses entire chunks of records, potentially yielding better compression (though they biased their compression strategy towards performance over compression efficiency).
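To make the comparison concrete, here is a small sketch using LevelDB's C API (the path /tmp/kvdemo and the keys and values are arbitrary for this example); the write batch illustrates the atomic-update point above, which is awkward to replicate with a one-file-per-key layout.

```c
/* Sketch: storing values in LevelDB via its C API, with a write batch
 * to show atomic multi-key updates. Paths and keys are illustrative. */
#include <leveldb/c.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    char *err = NULL;

    leveldb_options_t *opts = leveldb_options_create();
    leveldb_options_set_create_if_missing(opts, 1);
    leveldb_t *db = leveldb_open(opts, "/tmp/kvdemo", &err);
    if (err != NULL) {
        fprintf(stderr, "open failed: %s\n", err);
        leveldb_free(err);
        return 1;
    }

    /* Single put: comparable to writing one small file named after the key. */
    leveldb_writeoptions_t *wopts = leveldb_writeoptions_create();
    const char *key = "user:42", *val = "some 10 KiB payload ...";
    leveldb_put(db, wopts, key, strlen(key), val, strlen(val), &err);

    /* Atomic batch: both updates land together or not at all, which is
     * hard to get right with two separate files on a filesystem. */
    leveldb_writebatch_t *batch = leveldb_writebatch_create();
    leveldb_writebatch_put(batch, "user:42", 7, "new value", 9);
    leveldb_writebatch_put(batch, "index:42", 8, "user:42", 7);
    leveldb_write(db, wopts, batch, &err);

    leveldb_writebatch_destroy(batch);
    leveldb_writeoptions_destroy(wopts);
    leveldb_close(db);
    leveldb_options_destroy(opts);
    return 0;
}
```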
Let's try to build a minimal NoSQL DB server using Linux and a modern filesystem in 2022, just for fun, not for a serious environment.
DO NOT TRY THIS IN PRODUCTION
—————————————————————————————————————————————
POSIX file API for reads and writes.
POSIX ACLs for native user-account and group permission management.
POSIX filename as key: (root db folder)/(tablename folder)/(partition folder)/(64bitkey). Per DB and table we can define read/write permissions using POSIX ACLs. The (64bitkey) is generated in a compute function (a sketch of the write path follows this list).
Mount Btrfs/OpenZFS/F2FS as the filesystem to provide compression (LZ4/zstd) and encryption (fscrypt) natively. F2FS is particularly suitable, since it is log-structured, much like the LSM trees many NoSQL DBs use in their low-level architecture.
Metadata is handled by the filesystem, so there is no need to implement it.
Use Linux and/or the filesystem to configure page, file, or disk block caching according to the read/write patterns of the business logic written in the compute function or DB procedure.
Use RAID and sshfs for remote replication to create master/slave high availability and/or backups.
The compute function or DB procedure holding the write logic could be a Node.js file, a Go binary, or whatever, along with a standard HTTP/TCP/WebSocket server module that reads and writes content to the DB.
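Here is a hedged C sketch of the write path for this toy design (the root path /srv/db, the table/partition names, and the put() helper are all invented for illustration); it writes to a temporary file, fsyncs, and then renames, so a crash never leaves a half-written value visible under its key.

```c
/* Toy write path for the filename-as-key design: the value goes into a
 * temp file, is fsync'ed, then renamed to .../<table>/<partition>/<64bitkey>.
 * Paths and the key scheme are invented for this sketch. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define DB_ROOT "/srv/db"   /* hypothetical root db folder */

int put(const char *table, const char *partition,
        uint64_t key, const void *val, size_t len)
{
    char final_path[512], tmp_path[512];
    snprintf(final_path, sizeof(final_path), DB_ROOT "/%s/%s/%016" PRIx64,
             table, partition, key);
    snprintf(tmp_path, sizeof(tmp_path), "%s.tmp", final_path);

    int fd = open(tmp_path, O_WRONLY | O_CREAT | O_TRUNC, 0640);
    if (fd < 0)
        return -1;

    if (write(fd, val, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp_path);
        return -1;
    }
    close(fd);

    /* rename() is atomic on POSIX filesystems, so readers either see the
     * old value or the complete new one, never a partial write. */
    return rename(tmp_path, final_path);
}
```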

Distributed File Systems: GridFS vs. GlusterFS vs Ceph vs HekaFS Benchmarks [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Closed 8 years ago.
I am currently searching for a good distributed file system.
It should:
be open-source
be horizontally scalable (replication and sharding)
have no single point of failure
have a relatively small footprint
Here are the four most promising candidates in my opinion:
GridFS (based on MongoDB)
GlusterFS
Ceph
HekaFS
The filesystem will be used mainly for media files (images and audio). There are very small as well as medium-sized files (1 KB - 10 MB). The number of files will be around several million.
Are there any benchmarks regarding performance, CPU-load, memory-consumption and scalability? What are your experiences using these or other distributed filesystems?
I'm not sure your list is quite correct. It depends on what you mean by a file system.
If you mean a file system that is mountable in an operating system and usable by any application that reads and writes files using POSIX calls, then GridFS doesn't really qualify. It is just how MongoDB stores BSON-formatted objects: it is an object store rather than a file system.
There is a project to make GridFS mountable, but it is a little weird because GridFS doesn't have concepts for things like hierarchical directories, although paths are allowed. Also, I'm not sure how well distributed writes would work with gridfs-fuse.
GlusterFS and Ceph are comparable: both are distributed, replicable, mountable file systems. You can read a comparison between the two here (and a follow-up update of the comparison), although keep in mind that the benchmarks were done by someone who is a little biased. You can also watch this debate on the topic.
As for HekaFS, it is GlusterFS that is set up for cloud computing, adding encryption and multitenancy as well as an administrative UI.
After working with Ceph for 11 months, I came to the conclusion that it utterly sucks, so I suggest avoiding it. I tried XtreemFS, RozoFS and QuantcastFS but found them not good enough either.
I wholeheartedly recommend LizardFS, which is a fork of the now-proprietary MooseFS. LizardFS offers data integrity, monitoring and superior performance with very few dependencies.
2019 update: the situation has changed and LizardFS is not actively maintained any more.
MooseFS is stronger than ever and free from most LizardFS bugs. MooseFS is well maintained and it is faster than LizardFS.
RozoFS has matured and may be worth a try.
GfarmFS has its niche, but today I would choose MooseFS for most applications.
OrangeFS, anyone?
I am looking for an HPC DFS and found this discussion here:
http://forums.gentoo.org/viewtopic-t-901744-start-0.html
Lots of good data and comparisons :)
After some talk the OP decided for OrangeFS, quoting:
"OrangeFS. It does not support quotas nor file locks (though all i/o operations are atomic and this
way consistency is kept without locks). But it works, and works well and stable. Furthermore this is
not a general file storage oriented system, but HPC dedicated one, targeted on parallel I/O including
ROMIO support. All test were done for stripe data distribution.
a) No quotas — to hell quotas. I gave up on them anyway, even glusterfs supports not common
uid/gid based quotas, but directory size limitations, more like LVM works.
b) Multiple active metadata servers are supported and stable. Compared to dedicated metadata
storage (single node) this gives +50% performance on small files and no significant difference on
large ones.
c) Excellent performance on large data chunks (dd bs=1M). It is limited by a sum of local hard drive
(do not forget each node participates as a data server as well) speed and available network bandwidth.
CPU consumption on such load is decent and is about 50% of single core on a client node and about
10% percents on each other data server nodes.
d) Fair performance on large sets of small files. For the test I untared linux kernel 3.1. It took 5 minutes
over OrangeFS (with tuned parameters) and almost 2 minutes over NFSv4 (tuned as well) for comparison.
CPU load is about 50% of single core (of course, it is actually distributed between cores) on the client and
about several percents on each node.
e) Support of ROMIO MPI I/O API. This is a sweet yummy for MPI aware applications, which allows to use
PVFS2/OrangeFS parallel input-output features directly from applications.
f) No support for special files (sockets, fifo, block devices). Thus can't be safely used as /home and I use
NFSv4 for that task providing users quota-restricted small home space. Though most distributed
filesystems don't support special files anyway. "
I do not know about the other systems you posted, but I made a comparison of three PHP CMSes/frameworks on local storage vs. GlusterFS to see if it does better in real-world tests than in raw benchmarks. Sadly, it does not:
http://blog.lavoie.sl/2013/12/glusterfs-performance-on-different-frameworks.html

Is a database transaction a good way to manage memory in an operating system?

I am looking at doing some low-level programming, writing a basic operating system. I am familiar with relational databases, so I am wondering whether copying the way data is stored by web apps (MySQL, SQL Server, etc.) is a good way to store data in memory for an operating system. This would all be just for the operating system's own tasks and such. By "good way" I mostly mean speed, but of course elegance and good architecture are factors too. I assume that most operating systems favour speed over design patterns for this? I have no experience with low-level Linux, so I want to know whether a relational database is a sound starting point for writing a memory manager.
No.
Memory management in an OS is a whole service for managing memory. That includes interfaces to the applications, management of the bare-metal memory regions, memory protection control, and virtual memory mapping management, to list a few.
A database transaction, on the other hand, is a concept built around ACID (atomicity, consistency, isolation, durability). Its purpose is to provide interfaces to users who want ACID properties. It might be interesting, though, if your OS provided a database-transaction-like interface to applications, where applicable.
The reason most, if not all, OSes favor speed is that all available CPU power exists to run applications; there is no reason to manage hardware or provide services if there are no applications to run. But this is not always true: OS designers sometimes favor design or clarity over a bit of speed, for manageability. So the answer is, as always, "it depends."

Choice of embedded database?

We are building an application on an embedded platform that needs a reasonably high-performance database (very low SELECT latencies on tables with > 500,000 entries).
The database needs to be able to :
Store atomic commit information in NVRAM so that such information is preserved if power fails before the commit finishes.
Be written to NAND flash in such a way as to level wear across the memory (this could be done using, e.g., JFFS2 or YAFFS2).
Currently our options appear to be a "roll-your own" approach, or possibly SQLite.
Any other options, or pointers about the details of "rolling your own" or working with SQLite appreciated!
Edit: The target has 32MB of RAM, 1MB of NVRAM and 64MB of NAND Flash. The rest of the code is C, so that is the preferred language. The target processor is an ARM. In general, the queries that need to have the most performance are pretty simple. Complex queries don't need to have the same level of performance.
Apple's iPhone (and iPod Touch) uses the SQLite DB for a lot of its functions, so there's definitely a proven flash-based platform there. However, I doubt the amount of data in any of those tables has > 500k rows.
I think this Wikipedia RDBMS comparison might help you in making your choice.
But I don't understand why you have your specific NVRAM requirement.
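For reference, here is a minimal SQLite sketch in C with an explicit transaction (the database path, table, and pragma choices are just example values; whether WAL mode and synchronous=FULL are the right settings for your NVRAM/NAND layout is something you would need to verify on your board).

```c
/* Minimal SQLite usage sketch: one table, one transactional insert.
 * Path, schema, and pragma choices are illustrative only. */
#include <sqlite3.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    sqlite3 *db;
    if (sqlite3_open("/data/app.db", &db) != SQLITE_OK) {
        fprintf(stderr, "open failed: %s\n", sqlite3_errmsg(db));
        return 1;
    }

    /* Durability/layout knobs worth experimenting with on flash. */
    sqlite3_exec(db, "PRAGMA journal_mode=WAL;", 0, 0, 0);
    sqlite3_exec(db, "PRAGMA synchronous=FULL;", 0, 0, 0);

    sqlite3_exec(db,
        "CREATE TABLE IF NOT EXISTS readings(id INTEGER PRIMARY KEY, payload BLOB);",
        0, 0, 0);

    const char blob[] = "sensor sample";

    sqlite3_exec(db, "BEGIN;", 0, 0, 0);
    sqlite3_stmt *stmt;
    sqlite3_prepare_v2(db, "INSERT INTO readings(payload) VALUES(?);",
                       -1, &stmt, NULL);
    sqlite3_bind_blob(stmt, 1, blob, (int)strlen(blob), SQLITE_STATIC);
    if (sqlite3_step(stmt) != SQLITE_DONE)
        fprintf(stderr, "insert failed: %s\n", sqlite3_errmsg(db));
    sqlite3_finalize(stmt);
    sqlite3_exec(db, "COMMIT;", 0, 0, 0);

    sqlite3_close(db);
    return 0;
}
```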
Codebase provides a solid, portable, lightweight, fast ISAM with transactions.
If your embedded system has access to the .NET framework, you can embed VistaDB.

Resources