How to implement or emulate MADV_ZERO? - c

I would like to be able to zero out a range of a file memory-mapping without invoking any io (in order to efficiently sequentially overwrite huge files without incurring any disk read io).
Doing std::memset(ptr, 0, length) will cause pages to be read from disk if they are not already in memory even if the entire pages are overwritten thus totally trashing disk performance.
I would like to be able to do something like madvise(ptr, length, MADV_ZERO) which would zero out the range (similar to FALLOC_FL_ZERO_RANGE) in order to cause zero fill page faults instead of regular io page faults when accessing the specified range.
Unfortunately MADV_ZERO does not exists. Even though the corresponding flag FALLOC_FL_ZERO_RANGE does exists in fallocate and can be used with fwrite to achieve a similar effect, though without instant cross process coherency.
One possible alternative I would guess is to use MADV_REMOVE. However, that can from my understanding cause file fragmentation and also blocks other operations while completing which makes me unsure of its long term performance implications. My experience with Windows is that the similar FSCTL_SET_ZERO_DATA command can incur significant performance spikes when invoked.
My question is how one could implement or emulate MADV_ZERO for shared mappings, preferably in user mode?
1. /dev/zero/
I have read it being suggested to simply read /dev/zero into the selected range. Though I am not quite sure what "reading into the range" means and how to do it. Is it like a fread from /dev/zero into the memory range? Not sure how that would avoid a regular page fault on access?
For Linux, simply read /dev/zero into the selected range. The
kernel already optimises this case for anonymous mappings.
If doing it in general turns out to be too hard to implement, I
propose MADV_ZERO should have this effect: exactly like reading
/dev/zero into the range, but always efficient.
EDIT: Following the thread a bit further it turns out that it will actually not work.
It does not do tricks when you are dealing with a shared mapping.
2. MADV_REMOVE
One guess of implementing it in Linux (i.e. not in user application which is what I would prefer) could be by simply copying and modifying MADV_REMOVE, i.e. madvise_remove to use FALLOC_FL_ZERO_RANGE instead of FALLOC_FL_PUNCH_HOLE. Though I am bit over my head in guessing this, especially as I don't quite understand what the code around the vfs_allocate is doing:
// madvice.c
static long madvise_remove(...)
...
/*
* Filesystem's fallocate may need to take i_mutex. We need to
* explicitly grab a reference because the vma (and hence the
* vma's reference to the file) can go away as soon as we drop
* mmap_sem.
*/
get_file(f); // Increment ref count.
up_read(&current->mm->mmap_sem); // Release a read lock? Why?
error = vfs_fallocate(f,
FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE, // FALLOC_FL_ZERO_RANGE?
offset, end - start);
fput(f); // Decrement ref count.
down_read(&current->mm->mmap_sem); // Acquire read lock. Why?
return error;
}

You probably cannot do what you want (in user space, without hacking the kernel). Notice that writing zero pages might not incur physical disk IO because of the page cache.
You might want to replace a file segment by a file hole (but this is not exactly what you want) in a sparse file, but some file systems (e.g. VFAT) don't have holesĀ or sparse files. See lseek(2) with SEEK_HOLE, ftruncate(2)

Related

In a POSIX unix system, is it possible to mmap() a file in such a way that it will never be swapped out to disk in favor of other files?

If not using mmap(), it seems like there should be a way to give certain files "priority", so that the only time they're swapped out is for page faults trying to bring in, e.g., executing code, or memory that was malloc()'d by some process, but never other files. One can think of situations where this could be useful. Consider search engines, which should keep their index files in cache, but which may be simultaneously writing new files (not being used for search).
There are a few ways.
The best way is with madvise(), which allows you to inform the kernel that you will need a particular range of memory soon, which gives it priority over other memory. You can also use it to say that a particular range will not be needed soon, so it should be swapped out sooner.
The hack way is with mlock(), which forces a range of memory to stay in RAM. This is generally not a good idea, and should only be used in special cases. The most common case is to store passwords in RAM so that the password cannot be recovered from the swap file after the computer is powered off. I would not use mlock() for performance tuning unless I had exhausted other options.
The worst way is to constantly poke memory, forcing it to stay fresh.

Using memory mapping in C to read binary files

While processing a very large binary file can using memory mapping in C make any difference when compared to fread ? Even if there are small differences in time it would be fine. And if it does make the process fsater any idea how to use memory mapping on a large binary file and extract data from it ?
Thanks!!
If you're going to read the entire file beginning to end, the most important thing is to let the platform know this. This will allow it to do aggressive read ahead and it will allow it to avoid polluting the cache with data that will not be read again anyway. You can do this either with memory mapping or without it. The key functions are posix_fadvise and posix_madvise.
Memory mapping is a huge win when you have random, small accesses. This is especially true when you have multiple writes to the same page. Without memory mapping, each read or write requires a user/kernel transition and a copy. With memory mapping, most operations don't.
But with sequential access, all will save is the copy. Oddly, the user/kernel transitions may be even worse. With large sequential reads, you get one user/kernel transition per read, which could be per 256KB if the reads are large. With large sequential access to a memory mapped file, you may fault every page (4KB). It depends on the kernel's "fault ahead" optimizations.
However, with memory mapping, you will save the copy, assuming you don't need to do the copy anyway. If you have to copy out of the mapped pages for any reason, then you might as well let a read operation copy them into place for you. However, if you can operate on the data in place, memory mapping may be a win.
It generally doesn't make as much of a difference as people tend to think it does. Especially when you think about how slow the disk is in comparison to all this stuff.

what's the proper buffer size for 'write' function?

I am using the low-level I/O function 'write' to write some data to disk in my code (C language on Linux). First, I accumulate the data in a memory buffer, and then I use 'write' to write the data to disk when the buffer is full. So what's the best buffer size for 'write'? According to my tests it isn't the bigger the faster, so I am here to look for the answer.
There is probably some advantage in doing writes which are multiples of the filesystem block size, especially if you are updating a file in place. If you write less than a partial block to a file, the OS has to read the old block, combine in the new contents and then write it out. This doesn't necessarily happen if you rapidly write small pieces in sequence because the updates will be done on buffers in memory which are flushed later. Still, once in a while you could be triggering some inefficiency if you are not filling a block (and a properly aligned one: multiple of block size at an offset which is a multiple of the block size) with each write operation.
This issue of transfer size does not necessarily go away with mmap. If you map a file, and then memcpy some data into the map, you are making a page dirty. That page has to be flushed at some later time: it is indeterminate when. If you make another memcpy which touches the same page, that page could be clean now and you're making it dirty again. So it gets written twice. Page-aligned copies of multiples-of a page size will be the way to go.
You'll want it to be a multiple of the CPU page size, in order to use memory as efficiently as possible.
But ideally you want to use mmap instead, so that you never have to deal with buffers yourself.
You could use BUFSIZ defined in <stdio.h>
Otherwise, use a small multiple of the page size sysconf(_SC_PAGESIZE) (e.g. twice that value). Most Linux systems have 4Kbytes pages (which is often the same as or a small multiple of the filesystem block size).
As other replied, using the mmap(2) system call could help. GNU systems (e.g. Linux) have an extension: the second mode string of fopen may contain the latter m and when that happens, the GNU libc try to mmap.
If you deal with data nearly as large as your RAM (or half of it), you might want to also use madvise(2) to fine-tune performance of mmap.
See also this answer to a question quite similar to yours. (You could use 64Kbytes as a reasonable buffer size).
The "best" size depends a great deal on the underlying file system.
The stat and fstat calls fill in a data structure, struct stat, that includes the following field:
blksize_t st_blksize; /* blocksize for file system I/O */
The OS is responsible for filling this field with a "good size" for write() blocks. However, it's also important to call write() with memory that is "well aligned" (e.g., the result of malloc calls). The easiest way to get this to happen is to use the provided <stdio.h> stream interface (with FILE * objects).
Using mmap, as in other answers here, can also be very fast for many cases. Note that it's not well suited to some kinds of streams (e.g., sockets and pipes) though.
It depends on the amount of RAM, VM, etc. as well as the amount of data being written. The more general answer is to benchmark what buffer works best for the load you're dealing with, and use what works the best.

Why is sequentially reading a large file row by row with mmap and madvise sequential slower than fgets?

Overview
I have a program bounded significantly by IO and am trying to speed it up.
Using mmap seemed to be a good idea, but it actually degrades the performance relative to just using a series of fgets calls.
Some demo code
I've squeezed down demos to just the essentials, testing against an 800mb file with about 3.5million lines:
With fgets:
char buf[4096];
FILE * fp = fopen(argv[1], "r");
while(fgets(buf, 4096, fp) != 0) {
// do stuff
}
fclose(fp);
return 0;
Runtime for 800mb file:
[juhani#xtest tests]$ time ./readfile /r/40/13479/14960
real 0m25.614s
user 0m0.192s
sys 0m0.124s
The mmap version:
struct stat finfo;
int fh, len;
char * mem;
char * row, *end;
if(stat(argv[1], &finfo) == -1) return 0;
if((fh = open(argv[1], O_RDONLY)) == -1) return 0;
mem = (char*)mmap(NULL, finfo.st_size, PROT_READ, MAP_SHARED, fh, 0);
if(mem == (char*)-1) return 0;
madvise(mem, finfo.st_size, POSIX_MADV_SEQUENTIAL);
row = mem;
while((end = strchr(row, '\n')) != 0) {
// do stuff
row = end + 1;
}
munmap(mem, finfo.st_size);
close(fh);
Runtime varies quite a bit, but never faster than fgets:
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m28.891s
user 0m0.252s
sys 0m0.732s
[juhani#xtest tests]$ time ./readfile_map /r/40/13479/14960
real 0m42.605s
user 0m0.144s
sys 0m0.472s
Other notes
Watching the process run in top, the memmapped version generated a few thousand page faults along the way.
CPU and memory usage are both very low for the fgets version.
Questions
Why is this the case? Is it just because the buffered file access implemented by fopen/fgets is better than the aggressive prefetching that mmap with madvise POSIX_MADV_SEQUENTIAL?
Is there an alternative method of possibly making this faster(Other than on-the-fly compression/decompression to shift IO load to the processor)? Looking at the runtime of 'wc -l' on the same file, I'm guessing this might not be the case.
POSIX_MADV_SEQUENTIAL is only a hint to the system and may be completely ignored by a particular POSIX implementation.
The difference between your two solutions is that mmap requires the file to be mapped into the virtual address space entierly, whereas fgets has the IO entirely done in kernel space and just copies the pages into a buffer that doesn't change.
This also has more potential for overlap, since the IO is done by some kernel thread.
You could perhaps increase the perceived performance of the mmap implementation by having one (or more) independent threads reading the first byte of each page. This (or these) thread then would have all the page faults and the time your application thread would come at a particular page it would already be loaded.
Reading the man pages of mmap reveals that the page faults could be prevented by adding MAP_POPULATE to mmap's flags:
MAP_POPULATE (since Linux 2.5.46): Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.
This way a page faulting pre-load thread (as suggested by Jens) will become obsolete.
Edit:
First of all the benchmarks you make should be done with the page cache flushed to get meaningful results:
echo 3 | sudo tee /proc/sys/vm/drop_caches
Additionally: The MADV_WILLNEED advice with madvise will pre-fault the required pages in (same as the POSIX_FADV_WILLNEED with fadvise). Currently unfortunately these calls block until the requested pages are faulted in, even if the documentation tells differently. But there are kernel patches underway which queue the pre-fault requests into a kernel work-queue to make these calls asynchronous as one would expect - making a separate read-ahead user space thread obsolete.
What you're doing - reading through the entire mmap space - is supposed to trigger a series of page faults. with mmap, the OS only lazily loads pages of the mmap'd data into memory (loads them when you access them). So this approach is an optimization. Although you interface with mmap as if the entire thing is in RAM, it is not all in RAM - it is just a chunk set aside in virtual memory.
In contrast, when you do a read of a file into a buffer the OS pulls the entire structure into RAM (into your buffer). This can apply alot of memory pressure, crowding out other pages, forcing them to be written back to disk. It can lead to thrashing if you're low on memory.
A common optimization technique when using mmap is to page-walk the data into memory: loop through the mmap space, incrementing your pointer by the page size, accessing a single byte per page and triggering the OS to pull all the mmap's pages into memory; triggering all these page faults. This is an optimization technique to "prime the RAM", pulling the mmap in and readying it for future use. With this approach, the OS won't need to do as much lazy loading. You can do this on a separate thread to lead the pages in prior to your main threads access - just make sure you don't run out of RAM or get too far ahead of the main thread, you'll actually begin to degrade performance.
What is the difference between page walking w/ mmap and read() into a large buffer? That's kind of complicated.
Older versions of UNIX, and some current versions, don't always use demand-paging (where the memory is divided up into chunks and swapped in / out as needed). Instead, in some cases, the OS uses traditional swapping - it treats data structures in memory as monolithic, and the entire structure is swapped in / out as needed. This may be more efficient when dealing with large files, where demand-paging requires copying pages into the universal buffer cache, and may lead to frequent swapping or even thrashing. Swapping may avoid use of the universal buffer cache - reducing memory consumption, avoiding an extra copy operation and avoiding frequent writes. Downside is you can't benefit from demand-paging.
With mmap, you're guaranteed demand-paging; with read() you are not.
Also bear in mind that page-walking in a full mmap memory space is always about 60% slower than a flat out read (not counting if you use MADV_SEQUENTIAL or other optimizations).
One note when using mmap w/ MADV_SEQUENTIAL - when you use this, you must be absolutely sure your data IS stored sequentially, otherwise this will actually slow down the paging in of the file by about 10x. Usually your data is not mapped to a continuous section of the disk, it's written to blocks that are spread around the disk. So I suggest you be careful and look closely into this.
Remember, too much data in RAM will pollute the RAM, making page faults alot more common elsewhere. One common misconception about performance is that CPU optimization is more important than memory footprint. Not true - the time it takes to travel to disk exceeds the time of CPU operations by something like 8 orders of magnitude, even with todays SSDs. Therefor, when program execution speed is a concern, memory footprint and utilization is far more important.
A nice thing about read() is the data can be stored on the stack (assuming the stack is large enough), which will further speed up processing.
Using read() with a streaming approach is a good alternative to mmap, if it fits your use case. This is kind of what you're doing with fgets/fputs (fgets/fputs is internally implemented with read). Here what you do is, in a loop, read into a buffer, process the data, & then read in the next section / overwrite the old data. Streaming like this can keep your memory consumption very low, and can be the most efficient way of doing I/O. The only downside is that you never have the entire file in memory at once, and it doesn't persist in memory. So it's a one-off approach. If you can use it - great, do it. If not... use mmap.
So whether read or mmap is faster... it depends on many factors. Testing is probably what you need to do. Generally speaking, mmap is nice if you plan on using the data for an extended period, where you will benefit from demand-paging; or if you just can't handle that amount of data in memory at once. Read() is better if you are using a streaming approach - the data doesn't have to persist, or the data can fit in memory so memory pressure isn't a concern. Also if the data won't be in memory for very long, read() may be preferable.
Now, with your current implementation - which is a sort of streaming approach - you are using fgets() and stopping on \n. Large, bulk reads are more efficient than calling read() repeatedly a million times (which is what fgets does). You don't have to use a giant buffer - you don't want excess memory pressure (which can pollute your cache & other things), & the system also has some internal buffering it uses. But you do want to be reading into a buffer of... lets say 64k in size. You definitely dont want to be calling read line by line.
You could multithread the parsing of that buffer. Just make sure the threads access data in different cache blocks - so find the size of the cache block, get your threads working on different portions of the buffer distanced by at least the cache block size.
Some more specific suggestions for your particular problem:
You might try reformatting the data into some binary format. For example, try changing the file encoding to a custom format instead of UTF-8 or whatever it is. That could reduce its size. 3.5 million lines is quite alot of characters to loop through... it's probably ~150 million character comparisons that you are doing.
If you can sort the file by line length prior to the program running... you can write an algorithm to much more quickly parse the lines - just increment a pointer and test the character you arrive at, making sure it's '\n'. Then do whatever processing you need to do.
You'll need to find a way to maintain the sorted file by inserting new data into appropriate places with this approach.
You can take this a step further - after sorting your file, maintain a list of how many lines of a given length are in the file. Use that to guide your parsing of lines - jump right to the end of each line w/out having to do character comparisons.
If you can't sort the file, just create a list of all the offsets from the start of each line to its terminating newline. 3.5 million offsets.
Write algorithms to update that list on insertion/deletion of lines from the file
When you get into file processing algorithms such as this... it begins to resemble the implementation of a noSQL database. An alternative might just be to insert all this data into a noSQL database. Depends on what you need to do: believe it or not, sometimes just raw custom file manipulation & maintenance described above is faster than any database implementation, even noSQL databases.
A few more things:
When you use this streaming approach with read() you must take care to handle the edge cases - where you reach the end of one buffer, and start a new buffer - appropriately. That's called buffer-stitching.
Lastly, on most modern systems when you use read() the data still gets stored in the universal buffer cache and then copied into your process. That's an extra copy operation. You can disable the buffer cache to speed up the IO in certain cases where you're handling big files. Beware, this will disable paging. But if the data is only in memory for a brief time, this doesn't matter.
The buffer cache is important - find a way to reenable it after the IO was finished. Maybe disable it just for the particular process, do your IO in a separate process, or something... I'm not sure about the details, but this is something that can be done.
I don't think that's actually your problem, though, tbh I think the character comparisons - once you fix that it should just be fine.
That's the best I've got, maybe the experts will have other ideas.
Carry onward!

Is there a way to pre-emptively avoid a segfault?

Here's the situation:
I'm analysing a programs' interaction with a driver by using an LD_PRELOADed module that hooks the ioctl() system call. The system I'm working with (embedded Linux 2.6.18 kernel) luckily has the length of the data encoded into the 'request' parameter, so I can happily dump the ioctl data with the right length.
However quite a lot of this data has pointers to other structures, and I don't know the length of these (this is what I'm investigating, after all). So I'm scanning the data for pointers, and dumping the data at that position. I'm worried this could leave my code open to segfaults if the pointer is close to a segment boundary (and my early testing seems to show this is the case).
So I was wondering what I can do to pre-emptively check whether the current process owns a particular offset before trying to dereference? Is this even possible?
Edit: Just an update as I forgot to mention something that could be very important, the target system is MIPS based, although I'm also testing my module on my x86 machine.
Open a file descriptor to /dev/null and try write(null_fd, ptr, size). If it returns -1 with errno set to EFAULT, the memory is invalid. If it returns size, the memory is safe to read. There may be a more elegant way to query memory validity/permissions with some POSIX invention, but this is the classic simple way.
If your embedded linux has the /proc/ filesystem mounted, you can parse the /proc/self/maps file and validate the pointer/offsets against that. The maps file contains the memory mappings of the process, see here
I know of no such possibility. But you may be able to achieve something similar. As man 7 signal mentions, SIGSEGV can be caught. Thus, I think you could
Start with dereferencing a byte sequence known to be a pointer
Access one byte after the other, at some time triggering SIGSEGV
In SIGSEGV's handler, mark a variable that is checked in the loop of step 2
Quit the loop, this page is done.
There's several problems with that.
Since several buffers may live in the same page, you might output what you think is one buffer that are, in reality, several. You may be able to help with that by also LD_PRELOADing electric fence which would, AFAIK cause the application to allocate a whole page for every dynamically allocated buffer. So you would not output several buffers thinking it is only one, but you still don't know where the buffer ends and would output much garbage at the end. Also, stack based buffers can't be helped by this method.
You don't know where the buffers end.
Untested.
Can't you just check for the segment boundaries? (I'm guessing by segment boundaries you mean page boundaries?)
If so, page boundaries are well delimited (either 4K or 8K) so simple masking of the address should deal with it.

Resources