POSIX way to do O_DIRECT? - c

Direct I/O is the most performant way to copy larger files, so I wanted to add that ability to a program.
Windows offers FILE_FLAG_WRITE_THROUGH and FILE_FLAG_NO_BUFFERING in the Win32's CreateFileA(). Linux, since 2.4.10, has the O_DIRECT flag for open().
Is there a way to achieve the same result portably within POSIX? Like how the Win32 API here works from Windows XP to Windows 11, it would be nice to do direct IO across all UNIX-like systems in one reliably portable way.

No, there is no POSIX standard for direct IO.
There are at least two different APIs and behaviors that exist as of January 2023. Linux, FreeBSD, and apparently IBM's AIX use an O_DIRECT flag to open(), while Oracle's Solaris uses a directio() function on an already-opened file descriptor.
The Linux use of the O_DIRECT flag to the POSIX open() function is documented on the Linux open() man page:
O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from thishttps://man7.org/linux/man-pages/man2/open.2.html
file. In general this will degrade performance, but it is
useful in special situations, such as https://en.wikipedia.org/wiki/QFSwhen applications do
their own caching. File I/O is done directly to/from
user-space buffers. The O_DIRECT flag on its own makes an
effort to transfer data synchronously, but does not give
the guarantees of the O_SYNC flag that data and necessary
metadata are transferred. To guarantee synchronous I/O,
O_SYNC must be used in addition to O_DIRECT. See NOTES
below for further discussion.
Linux does not clearly specify how direct IO interacts with other descriptors open on the same file, or what happens when the file is mapped using mmap(); nor any alignment or size restrictions on direct IO read or write operations. In my experience, these are all file-system specific and have been improving/becoming less restrictive over time, but most Linux filesystems require page-aligned IO buffers, and many (most? all?) (did? still do?) require page-sized reads or writes.
FreeBSD follows the Linux model: passing an O_DIRECT flag to open():
O_DIRECT may be used to minimize or eliminate the cache effects
of reading and writing. The system will attempt to avoid caching the
data you
read or write. If it cannot avoid caching the data, it will minimize the
impact the data has on the cache. Use of this flag can drastically reduce performance if not used with care.
OpenBSD does not support direct IO. There's no mention of direct IO in either the OpenBSD open() or the OpenBSD 'fcntl()` man pages.
IBM's AIX appears to support a Linux-type O_DIRECT flag to open(), but actual published IBM AIX man pages don't seem to be generally available.
SGI's Irix also supported the Linux-style O_DIRECT flag to open():
O_DIRECT
If set, all reads and writes on the resulting file descriptor will
be performed directly to or from the user program buffer, provided
appropriate size and alignment restrictions are met. Refer to the
F_SETFL and F_DIOINFO commands in the fcntl(2) manual entry for
information about how to determine the alignment constraints.
O_DIRECT is a Silicon Graphics extension and is only supported on
local EFS and XFS file systems, and remote BDS file systems.
Of interest, the XFS file system on Linux originated with SGI's Irix.
Solaris uses a completely different interface. Solaris uses a specific directio() function to set direct IO on a per-file basis:
Description
The directio() function provides advice to the system about the
expected behavior of the application when accessing the data in the
file associated with the open file descriptor fildes. The system
uses this information to help optimize accesses to the file's data.
The directio() function has no effect on the semantics of the other
operations on the data, though it may affect the performance of other
operations.
The advice argument is kept per file; the last caller of directio()
sets the advice for all applications using the file associated with
fildes.
Values for advice are defined in <sys/fcntl.h>.
DIRECTIO_OFF
Applications get the default system behavior when accessing file data.
When an application reads data from a file, the data is first cached
in system memory and then copied into the application's buffer (see
read(2)). If the system detects that the application is reading
sequentially from a file, the system will asynchronously "read ahead"
from the file into system memory so the data is immediately available
for the next read(2) operation.
When an application writes data into a file, the data is first cached
in system memory and is written to the device at a later time (see
write(2)). When possible, the system increases the performance of
write(2) operations by cacheing the data in memory pages. The data
is copied into system memory and the write(2) operation returns
immediately to the application. The data is later written
asynchronously to the device. When possible, the cached data is
"clustered" into large chunks and written to the device in a single
write operation.
The system behavior for DIRECTIO_OFF can change without notice.
DIRECTIO_ON
The system behaves as though the application is not going to reuse the
file data in the near future. In other words, the file data is not
cached in the system's memory pages.
When possible, data is read or written directly between the
application's memory and the device when the data is accessed with
read(2) and write(2) operations. When such transfers are not
possible, the system switches back to the default behavior, but just
for that operation. In general, the transfer is possible when the
application's buffer is aligned on a two-byte (short) boundary, the
offset into the file is on a device sector boundary, and the size of
the operation is a multiple of device sectors.
This advisory is ignored while the file associated with fildes is
mapped (see mmap(2)).
The system behavior for DIRECTIO_ON can change without notice.
Notice also the behavior on Solaris is different: if direct IO is enabled on a file by any process, all processes accessing that file will do so via direct IO (Solaris 10+ has no alignment or size restrictions on direct IO, so switching between direct IO and "normal" IO won't break anything.*). And if a file is mapped via mmap(), direct IO on that file is disabled entirely.
* - That's not quite true - if you're using a SAMFS or QFS filesystem in shared mode and access data from the filesystem's active metadata controller (where the filesystem must be mounted by design with the Solaris forcedirectio mount option so all access is done via direct IO on that one system in the cluster), if you disable direct IO for a file using directio( fd, DIRECTIO_OFF ), you will corrupt the filesystem. Oracle's own top-end RAC database would do that if you did a database restore on the QFS metadata controller, and you'd wind up with a corrupt filesystem.

The short answer is no.
IEEE 1003.1-2017 (the current POSIX standard afaik) doesn't mention any directives for direct I/O like O_DIRECT. That being said, a cursory glance tells me that GNU/Linux and FreeBSD support the O_DIRECT flag, while OpenBSD doesn't.
Beyond that, it appears that not all filesystems support O_DIRECT so even on a GNU/Linux system where you know your implementation of open() will recognize that directive, there's still no guarantee that you can use it.
At the end of the day, the only way I can see portable, direct I/O is runtime checks for whether or not the platform your program is running on supports it; you could do compile time checks, but I don't recommend it since filesystems can change, or your destination may not be on the OS drive. You might get super lucky and find a project out there that's already started to do this, but I kind of doubt it exists.
My recommendation for you is to start by writing your program to check for direct I/O support for your platform and act accordingly, adding checks and support for kernels and file systems you know your program will run on.
Wish I could be more help,
--K

Related

fflush, fsync and sync vs memory layers

I know there are already similar questions and I gave them a look but I couldn't find an explicit univocal answer to my question. I was just investigating online about these functions and their relationship with memory layers. In particular I found this beautiful article that gave me a good insight about memory layers
It seems that fflush() moves data from the application to kernel filesystem buffer and it's ok, everyone seems to agree on this point. The only thing that left me puzzled was that in the same article they assumed a write-back cache saying that with fsync() "the data is saved to the stable storage layer" and after they added that "the storage may itself store the data in a write-back cache, so fsync() is still required for files opened with O_DIRECT in order to save the data to stable storage"
Reading here and there it seems like the truth is that fsync() and sync() let the data enter the storage device but if this one has caching layers it is just moved here, not at once to permanent storage and data may even be lost if there is a power failure. Unless we have a filesystem with barriers enabled and then "sync()/fsync() and some other operations will cause the appropriate CACHE FLUSH (ATA) or SYNCHRONIZE CACHE (SCSI) commands to be sent to the device" [from your website answer]
Questions:
if the data to be updated are already in the kernel buffers and my device has a volatile cache layer in write-back mode is it true, like said by the article, that operations like fsync() [and sync() I suppose] synchronize data to the stable memory layer skipping the volatile one? I think this is what happens with a write-through cache, not a write-back one. From what I read I understood that with a write-back cache on fsync() can just send data to the device that will put them in the volatile cache and they will enter the permanent memory only after
I read that fsync() works with a file descriptor and then with a single file while sync() causes a total deployment for the buffers so it applies to every data to be updated. And from this page also that fsync() waits for the end of the writing to the disk while sync() doesn't wait for the end of the actual writing to the disk. Are there other differences connected to memory data transfers between the two?
Thanks to those who will try to help
1. As you correctly concluded from your research fflush synchronizes the user-space buffered data to kernel-level cache (since it's working with FILE objects that reside at user-level and are invisible to kernel), whereas fsync or sync (working directly with file descriptors) synchronize kernel cached data with device. However, the latter comes without a guarantee that the data has been actually written to the storage device — as these usually come with their own caches as well. I would expect the same holds for msync called with MS_SYNC flag as well.
Relatedly, I find the distinction between synchronized and synchronous operations very useful when talking about the topic. Here's how Robert Love puts it succinctly:
A synchronous write operation does not return until the written data is—at least—stored in the kernel’s buffer cache. [...] A synchronized operation is more restrictive and safer than a merely synchronous operation. A synchronized write operation flushes the data to disk, ensuring that the on-disk data is always synchronized vis-à-vis the corresponding kernel buffers.
With that in mind you can call open with O_SYNC flag (together with some other flag that opens the file with a write permission) to enforce synchronized write operations. Again, as you correctly assumed this will work only with WRITE THROUGH disk caching policy, which effectively amounts to disabling disk caching.
You can read this answer about how to disable disk caching on Linux. Be sure to also check this website which also covers SCSI-based in addition to ATA-based devices (to read about different types of disks see this page on Microsoft SQL Server 2005, last updated: Apr 19, 2018).
Speaking of which, it is very informative to read about how the issue is dealt with on Windows machines:
To open a file for unbuffered I/O, call the CreateFile function with the FILE_FLAG_NO_BUFFERING and FILE_FLAG_WRITE_THROUGH flags. This prevents the file contents from being cached and flushes the metadata to disk with each write. For more information, see CreateFile.
Apparently, this is how Microsoft SQL Server 2005 family ensures data integrity:
All versions of SQL Server open the log and data files using the Win32 CreateFile function. The dwFlagsAndAttributes member includes the FILE_FLAG_WRITE_THROUGH option when opened by SQL Server. [...]
This option instructs the system to write through any intermediate cache and go directly to disk. The system can still cache write operations, but cannot lazily flush them.
I'm saying this is informative in particular because of this blog post from 2012 showing that some SATA disks ignore the FILE_FLAG_WRITE_THROUGH! I don't know what the current state of affairs is, but it seems that in order to ensure that writing to a disk is truly synchronized, you need to:
Disable disk caching using your device drivers.
Make sure that the specific device you're using supports write-through/no-caching policy.
However, if you're looking for a guarantee of data integrity you could just buy a disk with its own battery-based power supply that goes beyond capacitors (which is usually only enough for completing the on-going write processes). As put in the conclusion in the blog article mentioned above:
Bottom-line, use Enterprise-Class disks for your data and transaction log files. [...] Actually, the situation is not as dramatic as it seems. Many RAID controllers have battery-backed cache and do not need to honor the write-through requirement.
2. To (partially) answer the second question, this is from the man pages SYNC(2):
According to the standard specification (e.g., POSIX.1-2001), sync() schedules the writes, but may return before the actual writing is done. However, since version 1.3.20 Linux does actually wait. (This still does not guarantee data integrity: modern disks have large caches.)
This would imply that fsync and sync work differently, however, note they're both implemented in unistd.h which suggests some consistency between them. However, I would follow Robert Love who does not recommend using sync syscall when writing your own code.
The only real use for sync() is in the implementation of the sync utility. Applications should use fsync() and fdatasync() to commit to disk the data of only the requisite file descriptors. Note that sync() may take several minutes or longer to complete on a busy system.
"I don't have any solution, but certainly admire the problem."
From all I read from your good references, is that there is no standard. The standard ends somewhere in the kernel. The kernel controls the device driver and the device driver (possibly supplied by the disk manufacturer) controls the disk through an API (device has small computer on board). The manufacturer may have added capacitors/battery with just enough power to flush its device buffers in case of power failure, or he may have not. The device may provide a sync function but whether this truely syncs (flushes) the device buffers is not known (device dependent). So unless you select and install a device according to your specifications (and verify those specs), you are never sure.
This is a fair problem. Even after handling error conditions, you are not safe of the data presence in your storage.
man page of fsync explains this issue clearly!! :)
For applications that require tighter guarantees about the integrity of
their data, Mac OS X provides the F_FULLFSYNC fcntl. The F_FULLFSYNC fcntl asks the drive to flush all buffered data to permanent storage.
Applications, such as databases, that require a strict ordering of writes
should use F_FULLFSYNC to ensure that their data is written in the order they expect. Please see fcntl(2) for more detail.
Yes, fflush() ensures the data leaves the process memory space, but it may be in dirty pages of RAM awaiting write back. This is proof against app abort, but not system crash or power failure. Even if the power is backed up, the system could crash due to some software vulnerability! As mentioned in other answers/comments, getting the data from dirty pages written to disk magnetically or whatever SSD do, not stuck in some volatile buffer in the disk controller or drive, is a combination of the right calls or open options and the right controllers and devices! Calls give you more control over the overhead, writing more in bulk at the end of a transaction.
RDBMS, for instance, need to worry not only about the database holding files but even more about the log files that allow recovery, both after disk loss and on any RDBMS restart after a crash. In fact, some may be more sync'd in the log than the database, to preserve speed, since recovery is not a frequent process and not usually a long one. Things written to the log by transactions are guaranteed to be recoverable if the log is intact.

Why does "read" have to be a system call run in "Kernel Mode"?

As I understood, the UNIX function read() will cause an interrupt(TRAP) and invoke the system call read. I also remembered that it has to switch to "Kernel Mode" before invoking the system call read and the switching is expensive..
I was wondering that why the read operation has to be delegated to system call in "Kernel Mode", instead of being done in "User Mode" completely.
For example, if there could be a service in "User Mode" which manages the access permissions of files, the read operation can just request this service, not disturbing the Kernel..
And for the disk driver, it is said in this link that
Device drivers can run in either user or kernel mode
Does anyone have ideas about this? Why does read have to be in Kernel Mode?
Is not the way Operating Systems are designed. The definition of OS is to handle the computers' hardware and to bring resources to their users. Operating Sysmtes also have the concept of user mode and kernel mode (as you said).
By having these concepts, OS define an specific line to what a user might do and what not. Letting them manage hardware is definitely something OS don't want users to do.
read usually involves a hardware access. Accessing hardware is cumbersome and error prone and can leave the computer in an unusable state. Operating System uses drivers to control the computer's hardware.
Issuing a read (assuming a hard disk IO) generally makes a driver to send a set of commands to the disk controller, read it's output, pass it to main memory, etc. This are dangerous operations that shouldn't be trusted to User Mode.
If there would be a service in user mode to handle this. Context switch still would be needed to be done, because the service would be running as another process.
Sure thing it can be done an Operating System that allows this. But modern operating systems aren't design to fulfill this behavior.
There are other approaches to building operating systems that relies on microkernels. A microkernel just do the minimum to get the pc started and leave everything else to other modules. Meaning that if a module crashes, the system still up. That's the case of specific drivers, filesystems, etc. I don't know if microkernels let these run in user space though.
Hope this helps!
First of all: it's no longer true that calling kernel is very expensive. It used to be when causing an exception/trap/fault/interrupt were the only way to switch from user mode to kernel mode in x86 systems, but that all changed with the addition of the systenter/sysexit machine code instructions, which perform a more lightweight transition.
Even if it is/were expensive, in terms of time consumed, system calls that deal with character and block device drivers should run in kernel mode because dealing with hardware devices involves reading and writting to hardware registers, which could be memory mapped or accessed thru I/O ports.
These registers must be protected from any access from userspace process. Not doing so may lead to any process to not to use the established API for reading a file, and directly use the hardware registers to read and write to the device. In the case of a disk with file, this would allow the userprocess to bypass the filesystem entirely, and hence, all the security and permission system.
So, if we need to protect these hardware registers so no user process can use them, code that does use them cannot run at the same priviledge level as any other user process. Hence, they run in another (more priviledged) mode, which is what is called "kernel mode".
Think on what would happen if you configure a Linux system so /dev/sda (usually the main harddisk in which the root filesystem lives) is read/write to anybody and everybody:
# chmod 666 /dev/sda
Having done this is more or less the equivalent of exposing the hard disk device to any user process. You can effectively write a program that could open, read, and write files stored within this device, but at the same time, you can write a program that open, read and write ANY files within the partition, no matter which permissions files have.
That said, there are cases in which a system runs only trusted applications. This kind of system doesn't need the level of protection that is present in a general purpose system, and hence it can benefit from the increased speed that comes by not depending on layers of APIs to isolate the process from the hardware. The most widely known example would be a videoconsole system. I recall that Windows CE used to run all its programs and device drivers at the same privilege too.

Guarantees in write ahead logging implementation

If one were to issue a sequential series of write(2) in Linux/Unix seperated by fdatasync(2) or fsync(2) or sync(2) is it guaranteed that the first write() will be committed to disk before your second write()? The following SO post seems to say that such guarantees cannot be given, since there are multiple caching layers involved. For database systems which guarantee consistency this seems to be important, since in WAL (Write Ahead Logging) recovery, you'd need your logs to be persisted on disk before actually changing your data, so that in the event of an application/system failure you can revert back to your last known consistent state. How is this ensured/implemented in an actual database system?
The sync() system call is practically no help whatsoever; it promises to schedule the write-to-disk operations, but that's about all.
The normal technique used is to set the correct options when you open() the file descriptor for the disk file: O_DSYNC, O_RSYNC, O_SYNC. However, the fsync() and fdatasync() get pretty close to the same effects. You can also look at O_DIRECTIO which is often supported, though it is not standardized at all by POSIX.
Ultimately, the DBMS relies on the O/S to undertake that data written and synchronized to one disk is secure. As long as the device will always return what the DBMS last wrote, even if it is not on actual disk yet because of caching (because it is backed up in non-volatile cache, or something like that), then it isn't critical. If, on the other, you have NAS (network attached storage) that doesn't guarantee that what you last wrote (and were told was safe on disk) is returned when you read it, then your DBMS can suffer if it has to do recovery. So, you choose where you store your DBMS with care, making sure the storage works sensibly. If the storage does not work sufficiently like the hypothetical disk, you can end up losing data.
Yes, fsync in modern versions of the kernel does both flush memory (buffer cache) to disk and disk hardware buffer to platter. Man page says older kernels used to only do the first thing.
DESCRIPTION fsync() transfers ("flushes") all modified in-core data
of (i.e., modi‐ fied buffer cache pages for) the file referred to
by the file descrip‐ tor fd to the disk device (or other permanent
storage device) so that all changed information can be retrieved
even after the system crashed or was rebooted. This includes
writing through or flushing a disk cache if present. The
call blocks until the device reports that the transfer has
completed. It also flushes metadata information associ‐ ated
with the file (see stat(2)).
The fsync() implementations in older kernels and lesser used
filesys‐ tems does not know how to flush disk caches. In these
cases disk caches need to be disabled using hdparm(8) or
sdparm(8) to guarantee safe operation.
This refers to what applications can request. Fsync is an interface that filesystems provide to applications, filesystems themselves use something else underneath. Filesystems use barriers, or rather explicit flushes and FUA requests to commit the journal. Look at LWN post.

libevent2 and file io

I've been toying around with libevent2, and I've got reading files working, but it blocks. Is there any way to make file reading not block just within libevent. Or, do I need to use another IO library for files and make it pump events that I need.
fd = open("/tmp/hello_world",O_RDONLY);
evbuffer_read(buf,fd,4096);
The O_NONBLOCK flag doesn't work either.
In POSIX disks are considered "fast devices" meaning that they always block (which is why O_NONBLOCK didn't work for you). Only network sockets can be non-blocking.
There is POSIX AIO, but e.g. on Linux that comes with a bunch of restrictions making it unsuitable for general-purpose usage (only for O_DIRECT, I/O must be sector-aligned).
If you want to integrate normal POSIX IO into an asynchronous event loop it seems people resort to thread pools, where the blocking syscalls are executed in the background by one of the worker threads. One example of such a library is libeio
No.
I've yet to see a *nix where you can do non-blocking i/o on regular files without resorting to the more special AIO library (Though for some, e.g. solaris, O_NONBLOCK has an effect if e.g. someone else holds a lock on the file)
Please take a look at libuv, which is used by node.js / io.js: https://github.com/libuv/libuv
It's a good alternative to libeio, because it does perform well on all major operating systems, from Windows to the BSDs, Mac OS X and of course Linux.
It supports I/O completion ports, which makes it a better choice than libeio if you are targeting Windows.
The C code is also very readable and I highly recommend this tutorial: https://nikhilm.github.io/uvbook/

fcntl, lockf, which is better to use for file locking?

Looking for information regarding the advantages and disadvantages of both fcntl and lockf for file locking. For example which is better to use for portability? I am currently coding a linux daemon and wondering which is better suited to use for enforcing mutual exclusion.
What is the difference between lockf and fcntl:
On many systems, the lockf() library routine is just a wrapper around fcntl(). That is to say lockf offers a subset of the functionality that fcntl does.
Source
But on some systems, fcntl and lockf locks are completely independent.
Source
Since it is implementation dependent, make sure to always use the same convention. So either always use lockf from both your processes or always use fcntl. There is a good chance that they will be interchangeable, but it's safer to use the same one.
Which one you chose doesn't matter.
Some notes on mandatory vs advisory locks:
Locking in unix/linux is by default advisory, meaning other processes don't need to follow the locking rules that are set. So it doesn't matter which way you lock, as long as your co-operating processes also use the same convention.
Linux does support mandatory locking, but only if your file system is mounted with the option on and the file special attributes set. You can use mount -o mand to mount the file system and set the file attributes g-x,g+s to enable mandatory locks, then use fcntl or lockf. For more information on how mandatory locks work see here.
Note that locks are applied not to the individual file, but to the inode. This means that 2 filenames that point to the same file data will share the same lock status.
In Windows on the other hand, you can actively exclusively open a file, and that will block other processes from opening it completely. Even if they want to. I.e., the locks are mandatory. The same goes for Windows and file locks. Any process with an open file handle with appropriate access can lock a portion of the file and no other process will be able to access that portion.
How mandatory locks work in Linux:
Concerning mandatory locks, if a process locks a region of a file with a read lock, then other processes are permitted to read but not write to that region. If a process locks a region of a file with a write lock, then other processes are not permitted to read nor write to the file. What happens when a process is not permitted to access the part of the file depends on if you specified O_NONBLOCK or not. If blocking is set it will wait to perform the operation. If no blocking is set you will get an error code of EAGAIN.
NFS warning:
Be careful if you are using locking commands on an NFS mount. The behavior is undefined and the implementation widely varies whether to use a local lock only or to support remote locking.
Both interfaces are part of the POSIX standard, and nowadays both interfaces are available on most systems (I just checked Linux, FreeBSD, Mac OS X, and Solaris). Therefore, choose the one that fits better your requirements and use it.
One word of caution: it is unspecified what happens when one process locks a file using fcntl and another using lockf. In most systems these are equivalent operations (in fact under Linux lockf is implemented on top of fcntl), but POSIX says their interaction is unspecified. So, if you are interoperating with another process that uses one of the two interfaces, choose the same one.
Others have written that the locks are only advisory: you are responsible for checking whether a region is locked. Also, don't use stdio functions, if you want the to use the locking functionality.
Your main concerns, in this case (i.e. when "coding a Linux daemon and wondering which is better suited to use for enforcing mutual exclusion"), should be:
will the locked file be local or can it be on NFS?
e.g. can the user trick you into creating and locking your daemon's pid file on NFS?
how will the lock behave when forking, or when the daemon process is terminated with extreme prejudice e.g. kill -9?
The flock and fcntl commands behave differently in both cases.
My recommendation would be to use fcntl. You may refer to the File locking article on Wikipedia for an in-depth discussion of the problems involved with both solutions:
Both flock and fcntl have quirks which
occasionally puzzle programmers from
other operating systems. Whether flock
locks work on network filesystems,
such as NFS, is implementation
dependent. On BSD systems flock calls
are successful no-ops. On Linux prior
to 2.6.12 flock calls on NFS files
would only act locally. Kernel 2.6.12
and above implement flock calls on NFS
files using POSIX byte range locks.
These locks will be visible to other
NFS clients that implement
fcntl()/POSIX locks.1 Lock upgrades
and downgrades release the old lock
before applying the new lock. If an
application downgrades an exclusive
lock to a shared lock while another
application is blocked waiting for an
exclusive lock, the latter application
will get the exclusive lock and the
first application will be locked out.
All fcntl locks associated with a file
for a given process are removed when
any file descriptor for that file is
closed by that process, even if a lock
was never requested for that file
descriptor. Also, fcntl locks are not
inherited by a child process. The
fcntl close semantics are particularly
troublesome for applications which
call subroutine libraries that may
access files.
I came across an issue while using fcntl and flock recently that I felt I should report here as searching for either term shows this page near the top on both.
Be advised BSD locks, as mentioned above, are advisory. For those who do not know OSX (darwin) is BSD. This must be remembered when opening a file to write into.
To use fcntl/flock you must first open the file and get its ID. However if you have opened the file with "w" the file will instantly be zeroed out. If your process then fails to get the lock as the file is in use elsewhere, it will most likely return, leaving the file as 0kb. The process which had the lock will now find the file has vanished from underneath it, catastrophic results normally follow.
To remedy this situation, when using file locking, never open the file "w", but instead open it "a", to append. Then if the lock is successfully acquired, you can then safely clear the file as "w" would have, ie. :
fseek(fileHandle, 0, SEEK_SET);//move to the start
ftruncate(fileno((FILE *) fileHandle), 0);//clear it out
This was an unpleasant lesson for me.
As you're only coding a daemon which uses it for mutual exclusion, they are equivalent, after all, your application only needs to be compatible with itself.
The trick with the file locking mechanisms is to be consistent - use one and stick to it. Varying them is a bad idea.
I am assuming here that the filesystem will be a local one - if it isn't, then all bets are off, NFS / other network filesystems handle locking with varying degrees of effectiveness (in some cases none)

Resources