Difference between punch hole and zero range - file

I am looking at the man page of fallocate and I do not understand the difference between these two. One seems to allocate blocks without writing them; the other seems to deallocate blocks without overwriting them. Either way, the effect seems indistinguishable from a user perspective. Please explain this to me.
Zeroing file space
Specifying the FALLOC_FL_ZERO_RANGE flag (available since Linux 3.15)
in mode zeroes space in the byte range starting at offset and
continuing for len bytes. Within the specified range, blocks are
preallocated for the regions that span the holes in the file. After
a successful call, subsequent reads from this range will return
zeroes.
Zeroing is done within the filesystem preferably by converting the
range into unwritten extents. This approach means that the specified
range will not be physically zeroed out on the device (except for
partial blocks at either end of the range), and I/O is
(otherwise) required only to update metadata.
If the FALLOC_FL_KEEP_SIZE flag is additionally specified in mode,
the behavior of the call is similar, but the file size will not be
changed even if offset+len is greater than the file size. This
behavior is the same as when preallocating space with
FALLOC_FL_KEEP_SIZE specified.
Not all filesystems support FALLOC_FL_ZERO_RANGE; if a filesystem
doesn't support the operation, an error is returned. The operation
is supported on at least the following filesystems:
* XFS (since Linux 3.15)
* ext4, for extent-based files (since Linux 3.15)
* SMB3 (since Linux 3.17)
Increasing file space
Specifying the FALLOC_FL_INSERT_RANGE flag (available since Linux
4.1) in mode increases the file space by inserting a hole within the
file size without overwriting any existing data. The hole will start
at offset and continue for len bytes. When inserting a hole inside a
file, the contents of the file starting at offset will be shifted
upward (i.e., to a higher file offset) by len bytes. Inserting a
hole inside a file increases the file size by len bytes.
This mode has the same limitations as FALLOC_FL_COLLAPSE_RANGE
regarding the granularity of the operation. If the granularity
requirements are not met, fallocate() will fail with the error
EINVAL. If the offset is equal to or greater than the end of file,
an error is returned. For such operations (i.e., inserting a hole at
the end of file), ftruncate(2) should be used.
No other flags may be specified in mode in conjunction with
FALLOC_FL_INSERT_RANGE.
FALLOC_FL_INSERT_RANGE requires filesystem support. Filesystems that
support this operation include XFS (since Linux 4.1) and ext4 (since
Linux 4.2).

It depends on what the application wants w.r.t. the disk space consumed by a file as a result of using either.
The FALLOC_FL_PUNCH_HOLE flag deallocates blocks. Since it must be ORed with FALLOC_FL_KEEP_SIZE, this means you end up with a sparse file.
FALLOC_FL_ZERO_RANGE, on the other hand, allocates blocks for the (offset, length) range if they are not already present and zeroes them out. So in effect you lose some of the file's sparseness if it had holes to begin with. It is also a way of zeroing out regions of a file without the application having to manually write(2) zeroes.
All these flags to fallocate(2) are typically used by virtualization software like qemu.

Related

Are seeks cheaper than reads, and does forward-seeking fall foul of the sequential-access optimization?

Consider SetFilePointer. The documentation on MSDN (now learn.microsoft.com) does not explain whether a forward seek constitutes sequential access or not; this has implications for applications' I/O performance.
For example, if you use CreateFile with FILE_FLAG_RANDOM_ACCESS then Win32 will use a different buffering and caching strategy compared to FILE_FLAG_SEQUENTIAL_SCAN - if you're reading a file from start to finish then you can expect better performance than with the random-access option.
However, suppose the file format you're reading does not require that every byte (or even buffer-page) be read into memory; for example, a flag in the file's header indicates that the first 100 bytes (or 100 kilobytes) contain no useful data. Is it wise to call ReadFile to read the next 100 bytes (or 100 kilobytes, or more)? Or will it always be faster to call SetFilePointer( file, 100, NULL, FILE_CURRENT ) to skip over those 100 bytes?
If it is generally faster to use SetFilePointer, does the random-access vs sequential option make a difference? I would think that seeking forward constitutes a form of random-access because you could seek forward beyond the currently cached buffer (and any future buffers that the OS might have pre-loaded for you behind the scenes) but in that case will Windows always discard the cached buffers and re-read from disk? Is there a way to find out the maximum amount one can seek-forward without triggering a buffer reload?
(I would try to profile and benchmark to test my hypothesis, but all my computers at hand have NVMe SSDs; obviously things will be very different on platter drives.)
First, about SetFilePointer.
SetFilePointer internally calls ZwSetInformationFile with FilePositionInformation. This is handled entirely by the I/O manager; the file system is not even called. All that happens on this call is that CurrentByteOffset in the FILE_OBJECT is set to the given position.
So this call is completely independent of the file buffering and caching strategy. More than that, it is a pointless call that only wastes time: we can always pass an explicit offset to ReadFile or WriteFile via the Offset and OffsetHigh fields of OVERLAPPED. SetEndOfFile? It is much better and more efficient to call ZwSetInformationFile with FileEndOfFileInformation, or SetFileInformationByHandle with FileEndOfFileInfo. (SetEndOfFile internally calls ZwSetInformationFile with FileEndOfFileInformation, and before that calls ZwQueryInformationFile with FilePositionInformation to read CurrentByteOffset from the FILE_OBJECT, so with SetEndOfFile you simply make two or three unnecessary extra kernel calls.) There is no situation where a call to SetFilePointer is really needed.
So the file position is only a software variable (CurrentByteOffset in the FILE_OBJECT), used primarily by the I/O manager:
the filesystem always gets read/write requests with an explicit offset, either as an argument to FastIoRead or in IO_STACK_LOCATION.Parameters.Read.ByteOffset;
the I/O manager takes this offset either from the explicit ByteOffset value passed to NtReadFile, or from CurrentByteOffset in the FILE_OBJECT if ByteOffset is not present (a NULL pointer);
ReadFile passes a NULL ByteOffset when its OVERLAPPED pointer is NULL; otherwise it passes a pointer to OVERLAPPED.Offset.
Now to the question: does it make sense to read all bytes sequentially, or just to read from the needed offset?
If the file is opened without caching (FILE_NO_INTERMEDIATE_BUFFERING), we have no choice: the offset and length passed to ReadFile or WriteFile must be a multiple of the sector size.
If the cache is used, we still gain nothing by reading additional (unneeded) bytes before reading the bytes we actually want. The file system will have to read the wanted bytes from disk anyway if they are not yet cached; reading other bytes does not accelerate that.
With FILE_FLAG_SEQUENTIAL_SCAN the cache manager reads more sectors from disk than are needed to complete the current request, so the next read at a sequential offset will (at least partially) hit the cache, and the number of direct disk reads (the most expensive operation) will be smaller. But when you need to read a file at a specific offset, sequentially reading the bytes before that offset does not help in any way; those bytes would just be read needlessly.
In other words, you have to read the required bytes (at the specific offset) from the file in any case, and reading other bytes first does not increase performance; it only diminishes it.
So if you need to read a file at some offset, just read at that offset. Do not use SetFilePointer; use an explicit offset in OVERLAPPED.
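The advice above can be sketched as follows (Windows-only, so shown untested; the file name and offset are made up for the demo): read at an explicit offset by filling in the OVERLAPPED structure, with no SetFilePointer call at all:

```c
#include <windows.h>
#include <stdio.h>

int main(void) {
    /* Made-up file name for the demo. */
    HANDLE h = CreateFileA("data.bin", GENERIC_READ, FILE_SHARE_READ,
                           NULL, OPEN_EXISTING, FILE_ATTRIBUTE_NORMAL, NULL);
    if (h == INVALID_HANDLE_VALUE) return 1;

    char buf[256];
    DWORD got = 0;
    OVERLAPPED ov = {0};
    ov.Offset = 100;      /* low 32 bits of the byte offset */
    ov.OffsetHigh = 0;    /* high 32 bits */

    /* One call: the offset travels with the request itself. */
    if (ReadFile(h, buf, sizeof buf, &got, &ov))
        printf("read %lu bytes at offset 100\n", (unsigned long)got);

    CloseHandle(h);
    return 0;
}
```

This works on a handle opened without FILE_FLAG_OVERLAPPED too: the read is still synchronous, but the offset comes from the OVERLAPPED structure instead of the file position.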

How to implement or emulate MADV_ZERO?

I would like to be able to zero out a range of a file memory-mapping without invoking any I/O (in order to efficiently sequentially overwrite huge files without incurring any disk read I/O).
Doing std::memset(ptr, 0, length) will cause pages to be read from disk if they are not already in memory even if the entire pages are overwritten thus totally trashing disk performance.
I would like to be able to do something like madvise(ptr, length, MADV_ZERO) which would zero out the range (similar to FALLOC_FL_ZERO_RANGE) in order to cause zero fill page faults instead of regular io page faults when accessing the specified range.
Unfortunately MADV_ZERO does not exist, even though the corresponding flag FALLOC_FL_ZERO_RANGE does exist in fallocate and can be used with fwrite to achieve a similar effect, though without instant cross-process coherency.
One possible alternative, I would guess, is to use MADV_REMOVE. However, from my understanding that can cause file fragmentation, and it also blocks other operations while completing, which makes me unsure of its long-term performance implications. My experience with Windows is that the similar FSCTL_SET_ZERO_DATA command can incur significant performance spikes when invoked.
My question is how one could implement or emulate MADV_ZERO for shared mappings, preferably in user mode?
1. /dev/zero
I have read suggestions to simply read /dev/zero into the selected range, though I am not quite sure what "reading into the range" means or how to do it. Is it like an fread from /dev/zero into the memory range? I am not sure how that would avoid a regular page fault on access.
For Linux, simply read /dev/zero into the selected range. The
kernel already optimises this case for anonymous mappings.
If doing it in general turns out to be too hard to implement, I
propose MADV_ZERO should have this effect: exactly like reading
/dev/zero into the range, but always efficient.
EDIT: Following the thread a bit further it turns out that it will actually not work.
It does not do tricks when you are dealing with a shared mapping.
2. MADV_REMOVE
One guess at implementing it in Linux (i.e. not in the user application, which is what I would prefer) could be to simply copy and modify MADV_REMOVE, i.e. madvise_remove, to use FALLOC_FL_ZERO_RANGE instead of FALLOC_FL_PUNCH_HOLE. Though I am a bit over my head in guessing this, especially as I don't quite understand what the code around the vfs_fallocate call is doing:
// mm/madvise.c
static long madvise_remove(...)
{
        ...
        /*
         * Filesystem's fallocate may need to take i_mutex.  We need to
         * explicitly grab a reference because the vma (and hence the
         * vma's reference to the file) can go away as soon as we drop
         * mmap_sem.
         */
        get_file(f);                         // Increment ref count.
        up_read(&current->mm->mmap_sem);     // Release a read lock? Why?
        error = vfs_fallocate(f,
                        FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,  // FALLOC_FL_ZERO_RANGE?
                        offset, end - start);
        fput(f);                             // Decrement ref count.
        down_read(&current->mm->mmap_sem);   // Reacquire the read lock. Why?
        return error;
}
You probably cannot do what you want (in user space, without hacking the kernel). Notice that writing zero pages might not incur physical disk IO because of the page cache.
You might want to replace a file segment by a file hole (but this is not exactly what you want) in a sparse file, but some file systems (e.g. VFAT) don't have holes or sparse files. See lseek(2) with SEEK_HOLE, and ftruncate(2).

How to check if a file of given length can be created?

I want to create a non-sparse file of a given length (e.g. 2GB), but I want to check whether that is possible before actually writing anything to disk.
In other words I want to avoid getting ENOSPC (No space left on device) while writing. I'd prefer not to create a "test file" of size 2GB or things like that just to check that there is enough space left.
Is that possible?
Use posix_fallocate(3).
From the description:
The function posix_fallocate() ensures that disk space is allocated
for the file referred to by the descriptor fd for the bytes in the
range starting at offset and continuing for len bytes. After a
successful call to posix_fallocate(), subsequent writes to bytes in
the specified range are guaranteed not to fail because of lack of
disk space
You can use the statvfs function to determine how many free bytes (and inodes) a given filesystem has.
That should be enough for a quick check, but do remember that it's not a guarantee that you'll be able to write as much (or, for that matter, that writing more than that would have failed) - other applications could also be writing to (or deleting from) the same filesystem. So do continue to check for various write errors.
fallocate or posix_fallocate can be used to allocate (and deallocate) a large chunk. Probably a better option for your use-case. (Check the man page, there's a lot of options for space management that you might find interesting.)

Reading file using fread in C

I lack formal knowledge in Operating systems and C. My questions are as follows.
When I try to read the first byte of a file using fread in C, is the entire disk block containing that byte brought into memory, or just the byte?
If the entire block is brought into memory, what happens on reading the second byte, since the block containing it is already in memory?
Is there significance in reading the file in multiples of the disk block size?
Where is the read file block kept in memory?
Here are my answers:
More than one byte: stdio reads a full buffer at a time (the default buffer size is implementation-defined). setvbuf can change that.
On the second read, there's no I/O. The data comes from the stdio buffer (and, below that, the OS disk cache).
No; a file is usually smaller than its allocated disk space. Reads past the file size hit end-of-file even if you're still within the allocated disk space.
The buffer is part of the FILE structure. This is implementation (compiler) specific, so don't touch it.
Note that the above buffering is done by the C runtime library, not the OS. The OS may or may not do its own disk caching; that is a separate mechanism.

When should I use mmap for file access?

POSIX environments provide at least two ways of accessing files. There's the standard system calls open(), read(), write(), and friends, but there's also the option of using mmap() to map the file into virtual memory.
When is it preferable to use one over the other? What're their individual advantages that merit including two interfaces?
mmap is great if you have multiple processes accessing data in a read only fashion from the same file, which is common in the kind of server systems I write. mmap allows all those processes to share the same physical memory pages, saving a lot of memory.
mmap also allows the operating system to optimize paging operations. For example, consider two programs: program A, which reads a 1MB file into a buffer created with malloc, and program B, which mmaps the 1MB file into memory. If the operating system has to swap part of A's memory out, it must write the contents of the buffer to swap before it can reuse the memory. In B's case any unmodified mmap'd pages can be reused immediately because the OS knows how to restore them from the existing file they were mmap'd from. (The OS can detect which pages are unmodified by initially marking writable mmap'd pages as read only and catching seg faults, similar to a copy-on-write strategy.)
mmap is also useful for inter process communication. You can mmap a file as read / write in the processes that need to communicate and then use synchronization primitives in the mmap'd region (this is what the MAP_HASSEMAPHORE flag is for).
One place mmap can be awkward is if you need to work with very large files on a 32 bit machine. This is because mmap has to find a contiguous block of addresses in your process's address space that is large enough to fit the entire range of the file being mapped. This can become a problem if your address space becomes fragmented, where you might have 2 GB of address space free, but no individual range of it can fit a 1 GB file mapping. In this case you may have to map the file in smaller chunks than you would like to make it fit.
Another potential awkwardness with mmap as a replacement for read / write is that you have to start your mapping on offsets of the page size. If you just want to get some data at offset X you will need to fixup that offset so it's compatible with mmap.
And finally, read / write are the only way you can work with some types of files. mmap can't be used on things like pipes and ttys.
One area where I found mmap() to not be an advantage was when reading small files (under 16K). The overhead of page faulting to read the whole file was very high compared with just doing a single read() system call. This is because the kernel can sometimes satisfy a read entirely within your time slice, meaning your code doesn't switch away. With a page fault, it seemed more likely that another program would be scheduled, making the file operation have a higher latency.
mmap has the advantage when you have random access on big files. Another advantage is that you access it with memory operations (memcpy, pointer arithmetic), without bothering with the buffering. Normal I/O can sometimes be quite difficult when using buffers when you have structures bigger than your buffer. The code to handle that is often difficult to get right, mmap is generally easier. This said, there are certain traps when working with mmap.
As people have already mentioned, mmap is quite costly to set up, so it is worth using only for a given size (varying from machine to machine).
For pure sequential accesses to the file, it is also not always the better solution, though an appropriate call to madvise can mitigate the problem.
You have to be careful with the alignment restrictions of your architecture (SPARC, Itanium); with read/write I/O the buffers are often properly aligned and do not trap when dereferencing a casted pointer.
You also have to be careful not to access outside of the map. That can easily happen if you use string functions on your map and your file does not contain a \0 at the end. It will work most of the time when your file size is not a multiple of the page size, as the last page is filled with 0 (the mapped area is always a multiple of your page size).
In addition to other nice answers, a quote from Linux system programming written by Google's expert Robert Love:
Advantages of mmap( )
Manipulating files via mmap( ) has a handful of advantages over the
standard read( ) and write( ) system calls. Among them are:
Reading from and writing to a memory-mapped file avoids the
extraneous copy that occurs when using the read( ) or write( ) system
calls, where the data must be copied to and from a user-space buffer.
Aside from any potential page faults, reading from and writing to a memory-mapped file does not incur any system call or context switch
overhead. It is as simple as accessing memory.
When multiple processes map the same object into memory, the data is shared among all the processes. Read-only and shared writable
mappings are shared in their entirety; private writable mappings have
their not-yet-COW (copy-on-write) pages shared.
Seeking around the mapping involves trivial pointer manipulations. There is no need for the lseek( ) system call.
For these reasons, mmap( ) is a smart choice for many applications.
Disadvantages of mmap( )
There are a few points to keep in mind when using mmap( ):
Memory mappings are always an integer number of pages in size. Thus, the difference between the size of the backing file and an
integer number of pages is "wasted" as slack space. For small files, a
significant percentage of the mapping may be wasted. For example, with
4 KB pages, a 7 byte mapping wastes 4,089 bytes.
The memory mappings must fit into the process' address space. With a 32-bit address space, a very large number of various-sized mappings
can result in fragmentation of the address space, making it hard to
find large free contiguous regions. This problem, of course, is much
less apparent with a 64-bit address space.
There is overhead in creating and maintaining the memory mappings and associated data structures inside the kernel. This overhead is
generally obviated by the elimination of the double copy mentioned in
the previous section, particularly for larger and frequently accessed
files.
For these reasons, the benefits of mmap( ) are most greatly realized
when the mapped file is large (and thus any wasted space is a small
percentage of the total mapping), or when the total size of the mapped
file is evenly divisible by the page size (and thus there is no wasted
space).
Memory mapping has a potential for a huge speed advantage compared to traditional IO. It lets the operating system read the data from the source file as the pages in the memory mapped file are touched. This works by creating faulting pages, which the OS detects and then the OS loads the corresponding data from the file automatically.
This works the same way as the paging mechanism and is usually optimized for high speed I/O by reading data on system page boundaries and sizes (usually 4K) - a size for which most file system caches are optimized to.
An advantage that isn't listed yet is the ability of mmap() to keep a read-only mapping as clean pages. If one allocates a buffer in the process's address space, then uses read() to fill the buffer from a file, the memory pages corresponding to that buffer are now dirty since they have been written to.
Dirty pages cannot be dropped from RAM by the kernel. If there is swap space, then they can be paged out to swap. But this is costly, and on some systems, such as small embedded devices with only flash memory, there is no swap at all. In that case, the buffer will be stuck in RAM until the process exits, or perhaps gives it back with madvise().
Pages of an mmap() mapping that have not been written to are clean. If the kernel needs RAM, it can simply drop them and reuse the RAM they were in. If the process that had the mapping accesses it again, that causes a page fault and the kernel re-loads the pages from the file they came from originally, the same way they were populated in the first place.
This doesn't require more than one process using the mapped file to be an advantage.
