What is the difference between POSIX storage and NFS? - filesystems

In bioinformatics, we have been working more and more with cluster-based deployments like Kubernetes, Spark, and Hadoop. The term POSIX storage keeps coming up in documentation.
What is the difference between POSIX storage, NFS, and block storage (EBS)? Are the terms interchangeable? Does it basically mean anything that isn't object storage (S3) or a Microsoft protocol (SMB, CIFS)?

My understanding is:
POSIX storage refers to any storage that can be accessed using POSIX filesystem functions (i.e. the usual open/fopen calls) and that complies with POSIX filesystem requirements: it must provide several facilities, such as POSIX attributes and atomic file locking, that strictly follow POSIX semantics (see the locking sketch below).
This is normally storage that is attached to the host (either directly or via a SAN) through a POSIX operating system. In addition, the filesystem has to be POSIX-capable.
NFS, CIFS, and other NAS filesystems, as well as HDFS (Hadoop), are not POSIX compatible. They work on top of network protocols, usually backed by some other filesystem, and their access semantics don't allow for full POSIX compatibility (but see SteveLoughran's note about NFS).
NTFS and FAT are filesystems, but they are not POSIX capable (they don't support locking with the same semantics). Windows doesn't provide POSIX-compatible functions either, and even Linux cannot be fully POSIX-compliant on top of these filesystems. They are not "POSIX storage".
Amazon EBS volumes are block storage (SAN), so once a volume is attached to your host, if the filesystem you use is POSIX, and you are running a POSIX operating system, you can consider it "POSIX storage".
S3 is not a filesystem, it has its own object access API, and hence it cannot support POSIX file functions.
Most typical Linux filesystems (e.g. ext3, ext4, XFS, ZFS) are POSIX capable when mounted directly by a POSIX host.
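To make "POSIX semantics" concrete, here is a minimal sketch of POSIX advisory record locking with fcntl() (the path and region size are made up for illustration). A POSIX-capable filesystem must apply this lock atomically and make it visible to every process on the host, which is exactly the kind of guarantee the network filesystems above cannot reliably give:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/data/shared.db", O_RDWR);      /* illustrative path */
    if (fd < 0) { perror("open"); return 1; }

    /* POSIX advisory write lock on the first 4096 bytes of the file.
       On POSIX storage this is atomic and host-wide; on NFS/CIFS/HDFS
       the same call may be unsupported or only loosely enforced. */
    struct flock lk = {
        .l_type   = F_WRLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 4096,
    };
    if (fcntl(fd, F_SETLKW, &lk) < 0) { perror("fcntl"); return 1; }

    /* ... read/modify/write the locked region ... */

    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);
    close(fd);
    return 0;
}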

Related

POSIX way to do O_DIRECT?

Direct I/O is the most performant way to copy larger files, so I wanted to add that ability to a program.
Windows offers FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING in Win32's CreateFileA(). Linux, since 2.4.10, has the O_DIRECT flag for open().
Is there a way to achieve the same result portably within POSIX? Just as the Win32 API above works from Windows XP through Windows 11, it would be nice to have one reliably portable way to do direct IO across all UNIX-like systems.
No, there is no POSIX standard for direct IO.
There are at least two different APIs and behaviors that exist as of January 2023. Linux, FreeBSD, and apparently IBM's AIX use an O_DIRECT flag to open(), while Oracle's Solaris uses a directio() function on an already-opened file descriptor.
The Linux use of the O_DIRECT flag to the POSIX open() function is documented on the Linux open(2) man page (https://man7.org/linux/man-pages/man2/open.2.html):
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from
user-space buffers. The O_DIRECT flag on its own makes an
effort to transfer data synchronously, but does not give
the guarantees of the O_SYNC flag that data and necessary
metadata are transferred. To guarantee synchronous I/O,
O_SYNC must be used in addition to O_DIRECT. See NOTES
below for further discussion.
Linux does not clearly specify how direct IO interacts with other descriptors open on the same file, or what happens when the file is mapped using mmap(); nor does it specify alignment or size restrictions on direct IO read or write operations. In my experience these are all filesystem-specific and have been improving/becoming less restrictive over time, but most Linux filesystems require page-aligned IO buffers, and many (most? all?) used to require, and may still require, page-sized reads or writes.
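A minimal Linux/FreeBSD-style sketch of what those restrictions mean in practice (the path and transfer size are illustrative): the buffer is page-aligned with posix_memalign() and the read length is a multiple of the page size.

#define _GNU_SOURCE          /* O_DIRECT is not in POSIX; glibc hides it otherwise */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    long pagesz = sysconf(_SC_PAGESIZE);
    void *buf;

    /* O_DIRECT generally requires the user buffer to be aligned,
       typically to the page size or the device's logical block size. */
    if (posix_memalign(&buf, (size_t)pagesz, (size_t)pagesz * 16) != 0)
        return 1;

    int fd = open("/tmp/bigfile", O_RDONLY | O_DIRECT);   /* illustrative path */
    if (fd < 0) { perror("open(O_DIRECT)"); return 1; }

    /* The length (and the file offset) should also be block-aligned. */
    ssize_t n = read(fd, buf, (size_t)pagesz * 16);
    if (n < 0) perror("read");

    close(fd);
    free(buf);
    return 0;
}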
FreeBSD follows the Linux model: passing an O_DIRECT flag to open():
O_DIRECT may be used to minimize or eliminate the cache effects
of reading and writing. The system will attempt to avoid caching
the data you read or write. If it cannot avoid caching the data,
it will minimize the impact the data has on the cache. Use of this
flag can drastically reduce performance if not used with care.
OpenBSD does not support direct IO. There's no mention of direct IO in either the OpenBSD open() or fcntl() man pages.
IBM's AIX appears to support a Linux-type O_DIRECT flag to open(), but actual published IBM AIX man pages don't seem to be generally available.
SGI's Irix also supported the Linux-style O_DIRECT flag to open():
O_DIRECT
If set, all reads and writes on the resulting file descriptor will
be performed directly to or from the user program buffer, provided
appropriate size and alignment restrictions are met. Refer to the
F_SETFL and F_DIOINFO commands in the fcntl(2) manual entry for
information about how to determine the alignment constraints.
O_DIRECT is a Silicon Graphics extension and is only supported on
local EFS and XFS file systems, and remote BDS file systems.
Of interest, the XFS file system on Linux originated with SGI's Irix.
Solaris uses a completely different interface: a directio() function that enables or disables direct IO on a per-file basis for an already-open file descriptor:
Description
The directio() function provides advice to the system about the
expected behavior of the application when accessing the data in the
file associated with the open file descriptor fildes. The system
uses this information to help optimize accesses to the file's data.
The directio() function has no effect on the semantics of the other
operations on the data, though it may affect the performance of other
operations.
The advice argument is kept per file; the last caller of directio()
sets the advice for all applications using the file associated with
fildes.
Values for advice are defined in <sys/fcntl.h>.
DIRECTIO_OFF
Applications get the default system behavior when accessing file data.
When an application reads data from a file, the data is first cached
in system memory and then copied into the application's buffer (see
read(2)). If the system detects that the application is reading
sequentially from a file, the system will asynchronously "read ahead"
from the file into system memory so the data is immediately available
for the next read(2) operation.
When an application writes data into a file, the data is first cached
in system memory and is written to the device at a later time (see
write(2)). When possible, the system increases the performance of
write(2) operations by cacheing the data in memory pages. The data
is copied into system memory and the write(2) operation returns
immediately to the application. The data is later written
asynchronously to the device. When possible, the cached data is
"clustered" into large chunks and written to the device in a single
write operation.
The system behavior for DIRECTIO_OFF can change without notice.
DIRECTIO_ON
The system behaves as though the application is not going to reuse the
file data in the near future. In other words, the file data is not
cached in the system's memory pages.
When possible, data is read or written directly between the
application's memory and the device when the data is accessed with
read(2) and write(2) operations. When such transfers are not
possible, the system switches back to the default behavior, but just
for that operation. In general, the transfer is possible when the
application's buffer is aligned on a two-byte (short) boundary, the
offset into the file is on a device sector boundary, and the size of
the operation is a multiple of device sectors.
This advisory is ignored while the file associated with fildes is
mapped (see mmap(2)).
The system behavior for DIRECTIO_ON can change without notice.
Notice also that the behavior on Solaris is different: if direct IO is enabled on a file by any process, all processes accessing that file will do so via direct IO (Solaris 10+ has no alignment or size restrictions on direct IO, so switching between direct IO and "normal" IO won't break anything*). And if a file is mapped via mmap(), direct IO on that file is disabled entirely.
* That's not quite true: if you're using a SAMFS or QFS filesystem in shared mode and access data from the filesystem's active metadata controller (which, by design, must mount the filesystem with the Solaris forcedirectio option so that all access on that one system in the cluster goes through direct IO), then disabling direct IO on a file with directio(fd, DIRECTIO_OFF) will corrupt the filesystem. Oracle's own top-end RAC database would do exactly that if you ran a database restore on the QFS metadata controller, and you'd wind up with a corrupt filesystem.
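For completeness, a minimal sketch of the Solaris calling convention quoted above (Solaris only; the path is illustrative): directio() is applied to an already-open descriptor rather than being a flag to open().

#include <sys/types.h>
#include <sys/fcntl.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int open_with_directio(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return -1; }

    /* The advice is kept per file, so this affects every process
       that has the file open, not just this descriptor. */
    if (directio(fd, DIRECTIO_ON) < 0) {
        perror("directio");
        close(fd);
        return -1;
    }
    return fd;
}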
The short answer is no.
IEEE 1003.1-2017 (the current POSIX standard afaik) doesn't mention any directives for direct I/O like O_DIRECT. That being said, a cursory glance tells me that GNU/Linux and FreeBSD support the O_DIRECT flag, while OpenBSD doesn't.
Beyond that, it appears that not all filesystems support O_DIRECT so even on a GNU/Linux system where you know your implementation of open() will recognize that directive, there's still no guarantee that you can use it.
At the end of the day, the only way I can see to get portable direct I/O is a runtime check for whether the platform your program is running on supports it; you could do compile-time checks, but I don't recommend relying on them alone, since filesystems can change and your destination may not be on the OS drive. You might get super lucky and find a project out there that has already started doing this, but I kind of doubt it exists.
My recommendation is to start by writing your program to check for direct I/O support on your platform and act accordingly, adding checks and support for the kernels and filesystems you know your program will run on.
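A hedged sketch of that approach (the function name and structure are mine, not a standard API): compile-time checks select whichever interface exists, and a runtime fallback re-opens the file without O_DIRECT when the filesystem refuses it, which Linux typically reports as EINVAL.

#define _GNU_SOURCE                       /* for O_DIRECT on Linux/glibc */
#include <errno.h>
#include <fcntl.h>

/* Open 'path' for reading with direct I/O when the platform and the
   underlying filesystem support it, otherwise fall back to buffered I/O.
   Sets *direct to 1 only when direct I/O is actually in effect. */
int open_maybe_direct(const char *path, int *direct)
{
    int fd;
    *direct = 0;

#ifdef O_DIRECT                           /* Linux, FreeBSD, AIX, Irix */
    fd = open(path, O_RDONLY | O_DIRECT);
    if (fd >= 0) {
        *direct = 1;
        return fd;
    }
    if (errno != EINVAL)                  /* EINVAL: this filesystem refused O_DIRECT */
        return -1;
#endif

    fd = open(path, O_RDONLY);            /* buffered fallback */
    if (fd < 0)
        return -1;

#if defined(__sun) && defined(DIRECTIO_ON)   /* Solaris: per-file advice instead */
    if (directio(fd, DIRECTIO_ON) == 0)
        *direct = 1;
#endif

    return fd;
}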
Wish I could be more help,
--K

Are there any distributed high-availability filesystems (for Linux) that are actively-developed?

Are there any distributed, high-availability filesystems (for Linux) that are actively-developed?
Let me be more specific:
Distributed means it deals gracefully with client-to-server latencies like you'd find over the public worldwide internet (300ms and up being commonplace) and occasional connectivity flakiness. This means really good client-side caching (i.e. with callbacks) is required. NFS does not do this. It also means encryption of on-the-wire data without needing an IPSEC VPN.
High availability means that data can be stored on multiple servers and the client is smart enough to try another server if it encounters problems. Putting that intelligence in the client is really important, and it's why this sort of thing can't just be grafted onto NFS. At a minimum this needs to be possible for read-only data. It would be nice for read-write data but I know that's hard.
Filesystem means a kernel driver exporting a POSIX interface and permissions and access control are enforced in the face of untrustworthy clients. SAN systems often assume the clients are trustworthy.
I'm an OpenAFS refugee. I love it but at this point I can no longer accept its requirement that all the file servers effectively "have root" on all other file servers. The proprietary disk format and overhead of having to run Kerberos infrastructure (which I wouldn't otherwise need) are also becoming increasingly problematic.
Are there any systems other than OpenAFS with these properties? Intermezzo and Coda probably qualify but aren't active projects any longer. Lustre is cool but seems to be designed for ultra-low-latency data centres. Ceph is awesome but not really a filesystem, more of a thing that runs under a filesystem (yes, there's CephFS, but it's really a showcase for Ceph and explicitly not production-ready and there's no timetable for that). Tahoe-LAFS is cool but it and GoogleFS aren't really filesystems in that they don't export a POSIX interface through a kernel module. My understanding of GFS (Global Filesystem) is that the clients can manipulate the on-disk data structures directly, so they're implicitly root-level trusted (and this is part of why it's fast) -- correct me if I'm wrong here.
Needs to be open source since I can't afford to have my data locked up in something proprietary. I don't mind paying for software, but I can't be held hostage in this situation.
Thanks,
First of all, you can use a local file system (mounted with -o user_xattr) to cache NFS (mounted with -o fsc) via the fscache facility, using cachefilesd (provided by the cachefilesd package on Debian).
Although the file system you are looking for probably does not exist, IMHO two projects come pretty close, with fairly good FUSE client implementations:
LizardFS (GPL-3 licensed, hosted on GitHub), a fork of the now-proprietary MooseFS.
Gfarm file system (BSD/Apache-2.0, hosted at SourceForge)
After evaluating Ceph for quite a while, I came to the conclusion that it is flawed (with no hope for improvement in the foreseeable future) and not suitable for serious use. XtreemFS is a disappointment too. I hope the upcoming OrangeFS version 3 (with promised data integrity checks) might not be too bad, but that remains to be seen...

Exposing a custom filesystem over SMB or other network protocol

We've written a library that implements a filesystem-like API on top of a custom database system.
We'd like to be able to mount this filesystem as an ordinary filesystem from other machines across a network.
Are there any libraries that can run in user space and let other machines on the network mount this and treat it like an ordinary file system? (Preferably in Python or C++.)
One option is to use our Callback File System and expose the filesystem as a virtual disk, which can be shared using regular Windows sharing mechanisms. Callback File System includes the kernel-mode driver needed to do the job and offers a user-mode C++ API that you use to expose your data as a filesystem.
There was once a .NET implementation of an SMB server called WinFUSE, but it's long gone and almost no traces of it are left.
Update: On Linux you can use FUSE to implement a filesystem and mount it, and then use some mechanism (not necessarily a library) to expose the mounted filesystem as an SMB share.
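As a rough sketch of the FUSE route on Linux (assuming libfuse 3, compiled with the flags from pkg-config fuse3; the single /hello file and its in-memory contents stand in for calls into your database library), this is roughly the shape of a minimal read-only filesystem, which you could then export as an SMB share with something like Samba:

#define FUSE_USE_VERSION 31
#include <fuse.h>
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_data = "data served from the custom backend\n";

/* Report a root directory containing one read-only regular file. */
static int fs_getattr(const char *path, struct stat *st, struct fuse_file_info *fi)
{
    (void)fi;
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
    } else if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = (off_t)strlen(hello_data);
    } else {
        return -ENOENT;
    }
    return 0;
}

static int fs_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                      off_t offset, struct fuse_file_info *fi,
                      enum fuse_readdir_flags flags)
{
    (void)offset; (void)fi; (void)flags;
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0, 0);
    filler(buf, "..", NULL, 0, 0);
    filler(buf, hello_path + 1, NULL, 0, 0);
    return 0;
}

static int fs_open(const char *path, struct fuse_file_info *fi)
{
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((fi->flags & O_ACCMODE) != O_RDONLY)
        return -EACCES;
    return 0;
}

/* A real implementation would fetch the requested range from the database. */
static int fs_read(const char *path, char *buf, size_t size, off_t offset,
                   struct fuse_file_info *fi)
{
    (void)fi;
    size_t len = strlen(hello_data);
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t)offset >= len)
        return 0;
    if (offset + size > len)
        size = len - (size_t)offset;
    memcpy(buf, hello_data + offset, size);
    return (int)size;
}

static const struct fuse_operations ops = {
    .getattr = fs_getattr,
    .readdir = fs_readdir,
    .open    = fs_open,
    .read    = fs_read,
};

int main(int argc, char *argv[])
{
    return fuse_main(argc, argv, &ops, NULL);
}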

libevent2 and file io

I've been toying around with libevent2, and I've got reading files working, but it blocks. Is there any way to make file reading non-blocking within libevent alone? Or do I need to use another IO library for files and have it pump the events I need?
int fd = open("/tmp/hello_world", O_RDONLY);
evbuffer_read(buf, fd, 4096);   /* blocks: a regular file is always "readable" */
The O_NONBLOCK flag doesn't work either.
In POSIX, disks are considered "fast devices", meaning that they always block (which is why O_NONBLOCK didn't work for you). Only "slow" devices such as network sockets can be non-blocking.
There is POSIX AIO, but e.g. on Linux that comes with a bunch of restrictions making it unsuitable for general-purpose usage (only for O_DIRECT, I/O must be sector-aligned).
If you want to integrate normal POSIX IO into an asynchronous event loop, it seems people resort to thread pools, where the blocking syscalls are executed in the background by worker threads. One example of such a library is libeio; a sketch of the general pattern follows.
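A hedged sketch of that thread-pool pattern using libevent itself (reusing the /tmp/hello_world file from the question; synchronization is kept minimal for brevity): a worker thread performs the blocking read() and then writes one byte to a pipe, and the pipe, unlike a regular file, is something the event loop can actually poll.

#include <event2/event.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static int notify_pipe[2];
static char filebuf[4096];
static ssize_t nread;

/* Worker thread: do the blocking file I/O off the event loop.
   A real program would hand the result over with proper synchronization. */
static void *reader_thread(void *arg)
{
    (void)arg;
    int fd = open("/tmp/hello_world", O_RDONLY);
    nread = (fd >= 0) ? read(fd, filebuf, sizeof(filebuf)) : -1;
    if (fd >= 0) close(fd);
    write(notify_pipe[1], "x", 1);   /* wake the event loop */
    return NULL;
}

/* Runs inside the libevent loop once the worker signals completion. */
static void on_file_ready(evutil_socket_t fd, short events, void *arg)
{
    char c;
    (void)events;
    read(fd, &c, 1);                 /* drain the notification byte */
    printf("read %zd bytes\n", nread);
    event_base_loopbreak((struct event_base *)arg);
}

int main(void)
{
    struct event_base *base = event_base_new();
    pthread_t tid;

    if (pipe(notify_pipe) < 0)
        return 1;
    struct event *ev = event_new(base, notify_pipe[0], EV_READ,
                                 on_file_ready, base);
    event_add(ev, NULL);

    pthread_create(&tid, NULL, reader_thread, NULL);
    event_base_dispatch(base);       /* pipes are pollable, unlike regular files */

    pthread_join(tid, NULL);
    event_free(ev);
    event_base_free(base);
    return 0;
}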
No.
I've yet to see a *nix where you can do non-blocking I/O on regular files without resorting to the special-purpose AIO library (though on some, e.g. Solaris, O_NONBLOCK has an effect if someone else holds a lock on the file).
Please take a look at libuv, which is used by node.js / io.js: https://github.com/libuv/libuv
It's a good alternative to libeio, because it does perform well on all major operating systems, from Windows to the BSDs, Mac OS X and of course Linux.
It supports I/O completion ports, which makes it a better choice than libeio if you are targeting Windows.
The C code is also very readable and I highly recommend this tutorial: https://nikhilm.github.io/uvbook/
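For comparison, a short sketch of the same read done through libuv's filesystem API, which runs the blocking calls on its internal thread pool and delivers completions back on the loop thread (again using /tmp/hello_world from the question):

#include <fcntl.h>
#include <stdio.h>
#include <uv.h>

static uv_fs_t open_req, read_req, close_req;
static char buffer[4096];
static uv_buf_t iov;

static void on_read(uv_fs_t *req)
{
    if (req->result >= 0)
        printf("read %zd bytes\n", (ssize_t)req->result);
    uv_fs_req_cleanup(req);
    /* NULL callback makes the close synchronous. */
    uv_fs_close(uv_default_loop(), &close_req, (uv_file)open_req.result, NULL);
}

static void on_open(uv_fs_t *req)
{
    if (req->result < 0) {
        fprintf(stderr, "open error: %s\n", uv_strerror((int)req->result));
        return;
    }
    iov = uv_buf_init(buffer, sizeof(buffer));
    /* The read runs on libuv's thread pool and completes on the loop thread. */
    uv_fs_read(uv_default_loop(), &read_req, (uv_file)req->result,
               &iov, 1, -1, on_read);
}

int main(void)
{
    uv_fs_open(uv_default_loop(), &open_req, "/tmp/hello_world",
               O_RDONLY, 0, on_open);
    uv_run(uv_default_loop(), UV_RUN_DEFAULT);
    uv_fs_req_cleanup(&open_req);
    uv_fs_req_cleanup(&close_req);
    return 0;
}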

Is there a network file system for mainframes (System Z)?

Network file systems are offered by Windows and (L)Unix. Is there one for IBM mainframes? (Hard to believe there isn't.) Does it offer standard Unix-style access (binary? ASCII? EBCDIC?)
to mainframe data areas? How are datasets/partitioned data sets treated?
How is it configured into z/OS? How do I find out more about such network file systems for mainframes?
NFS is supported on System z.
So is SMB.
