I'm trying to patch the entry point of an ELF file directly via the e_entry field:
Elf64_Ehdr *ehdr = NULL;
Elf64_Phdr *phdr = NULL;
Elf64_Shdr *shdr = NULL;
if (argc < 2)
{
printf("Usage: %s <executable>\n", argv[0]);
exit(EXIT_SUCCESS);
}
fd = open(argv[1], O_RDWR);
if (fd < 0)
{
perror("open");
exit(EXIT_FAILURE);
}
if (fstat(fd, &st) < 0)
{
perror("fstat");
exit(EXIT_FAILURE);
}
/* map whole executable into memory */
mapped_file = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (mapped_file < 0)
{
perror("mmap");
exit(EXIT_FAILURE);
}
// check for an ELF file
check_elf(mapped_file, argv);
ehdr = (Elf64_Ehdr *) mapped_file;
phdr = (Elf64_Phdr *) &mapped_file[ehdr->e_phoff];
shdr = (Elf64_Shdr *) &mapped_file[ehdr->e_shoff];
mprotect((void *)((uintptr_t)&ehdr->e_entry & ~(uintptr_t)4095), 4096, PROT_READ | PROT_WRITE);
if (ehdr->e_type != ET_EXEC)
{
fprintf(stderr, "%s is not an ELF executable.\n", argv[1]);
exit(EXIT_FAILURE);
}
printf("Program entry point: %08x\n", ehdr->e_entry);
int text_found = 0;
uint64_t test_addr;
uint64_t text_end;
size_t test_len = strlen(shellcode);
int text_idx;
for (i = 0; i < ehdr->e_phnum; ++i)
{
if (text_found)
{
phdr[i].p_offset += PAGE_SIZE;
continue;
}
if (phdr[i].p_type == PT_LOAD && phdr[i].p_flags == ( PF_R | PF_X))
{
test_addr = phdr[i].p_vaddr + phdr[i].p_filesz;
text_end = phdr[i].p_vaddr + phdr[i].p_filesz;
printf("TEXT SEGMENT ends at 0x%x\n", text_end);
puts("Changing entry point...");
ehdr->e_entry = (Elf64_Addr *) test_addr;
memmove(test_addr, shellcode, test_len);
phdr[i].p_filesz += test_len;
phdr[i].p_memsz += test_len;
text_found++;
}
}
//patch sections
for (i = 0; i < ehdr->e_shnum; ++i)
{
if (shdr->sh_offset >= test_addr)
shdr->sh_offset += PAGE_SIZE;
else
if (shdr->sh_size + shdr->sh_addr == test_addr)
shdr->sh_size += test_len;
}
ehdr->e_shoff += PAGE_SIZE;
close(fd);
}
The shellcode in this case is just a bunch of NOPs with an int3 instruction at the end.
I made sure to adjust the segments and sections that come after this new code, but the problem is that as soon as I patch the entry point the program crashes, why is that?
changing:
memmove(test_addr, shellcode, test_len);
to:
memmove(mapped_file + phdr[i].p_offset + phdr[i].p_filesz, shellcode, test_len);
Seems to fix your problem. test_addr is a virtual address belong to the file you have mapped; you cannot use that directly as a pointer. The bits you want to muck with are the file map address, p_offset and p_filesz.
I suspect that you haven't enable write-access to program's header. You can do this via something like
const uintptr_t page_size = 4096;
mprotect((void *)((uintptr_t)&ehdr->e_entry & ~(uintptr_t)4095), 4096, PROT_READ | PROT_WRITE);
ehdr->e_entry = test_addr;
I have a file with some data, which is also memory-mapped. So that I have both file descriptor and the pointer to the mapped pages. Mostly the data is only read from the mapping, but eventually it's also modified.
The modification consists of modifying some data within the file (sort of headers update), plus appending some new data (i.e. writing post the current end of the file).
This data structure is accessed from different threads, and to prevent collisions I synchronize access to it (mutex and friends).
During the modification I use both the file mapping and the file descriptor. Headers are updated implicitly by modifying the mapped memory, whereas the new data is written to the file by the appropriate API (WriteFile on windows, write on posix). Worth to note that the new data and the headers belong to different pages.
Since the modification changes the file size, the memory mapping is re-initialized after every such a modification. That is, it's unmapped, and then mapped again (with the new size).
I realize that writes to the mapped memory are "asynchronous" wrt file system, and order is not guaranteed, but I thought there was no problem because I explicitly close the file mapping, which should (IMHO) act as a sort of a flushing point.
Now this works without problem on windows, but on linux (android to be exact) eventually the mapped data turns-out to be inconsistent temporarily (i.e. data is ok when retrying). Seems like it doesn't reflect the newly-appended data.
Do I have to call some synchronization API to ensure the data if flushed properly? If so, which one should I use: sync, msync, syncfs or something different?
Thanks in advance.
EDIT:
This is a pseudo-code that illustrates the scenario I'm dealing with.
(The real code is more complex of course)
struct CompressedGrid
{
mutex m_Lock;
int m_FileHandle;
void* m_pMappedMemory;
Hdr* get_Hdr() { return /* the mapped memory with some offset*/; }
void SaveGridCell(int idx, const Cell& cCompressed)
{
AutoLock scope(m_Lock);
// Write to mapped memory
get_Hdr()->m_pCellOffset[Idx] = /* current end of file */;
// Append the data
lseek64(m_FileHandle, 0, FILE_END);
write(m_FileHandle, cCompressed.pPtr, cCompressed.nSize);
// re-map
munmap(...);
m_pMappedMemory = mmap(...); // specify the new file size of course
}
bool DecodeGridCell(int idx, Cell& cRaw)
{
AutoLock scope(m_Lock);
uint64_t nOffs = get_Hdr()->m_pCellOffset[Idx] = /* ;
if (!nOffs)
return false; // unavail
const uint8_t* p = m_pMappedMemory + nOffs;
cRaw.DecodeFrom(p); // This is where the problem appears!
return true;
}
Use addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, offset) to map the file.
If the size of the file changes, use newaddr = mremap(addr, len, newlen, MREMAP_MAYMOVE) to update the mapping to reflect it. To extend the file, use ftruncate(fd, newlen) before remapping the file.
You can use mprotect(addr, len, protflags) to change the protection (read/write) on any pages in the mapping (both must be aligned on a page boundary). You can also tell the kernel about your future accesses via madvise(), if the mapping is too large to fit in memory at once, but the kernel seems pretty darned good at managing readahead etc. even without those.
When you make changes to the mapping, use msync(partaddr, partlen, MS_SYNC | MS_INVALIDATE) or msync(partaddr, partlen, MS_ASYNC | MS_INVALIDATE) to ensure the changes int partlen chars from partaddr forward are visible to other mappings and file readers. If you use MS_SYNC, the call returns only when the update is complete. The MS_ASYNC call tells the kernel to do the update, but won't wait until it is done. If there are no other memory maps of the file, the MS_INVALIDATE does nothing; but if there are, that tells the kernel to ensure the changes are reflected in those too.
In Linux kernels since 2.6.19, MS_ASYNC does nothing, as the kernel tracks the changes properly anyway (no msync() is needed, except possibly before munmap()). I don't know if Android kernels have patches that change that behaviour; I suspect not. It is still a good idea to keep them in the code, for portability across POSIXy systems.
mapped data turns-out to be inconsistent temporarily
Well, unless you do use msync(partaddr, partlen, MS_SYNC | MS_INVALIDATE), the kernel will do the update when it sees best.
So, if you need some changes to be visible to file readers before proceeding, use msync(areaptr, arealen, MS_SYNC | MS_INVALIDATE) in the process doing those updates.
If you don't care about the exact moment, use msync(areaptr, arealen, MS_ASYNC | MS_INVALIDATE). It'll be a no-op on current Linux kernels, but it's a good idea to keep them for portability (perhaps commented out, if necessary for performance) and to remind developers about the (lack of) synchronization expectations.
As I commented to OP, I cannot observe the synchronization issues on Linux at all. (That does not mean it does not happen on Android, because Android kernels are derivatives of Linux kernels, not exactly the same.)
I do believe the msync() call is not needed on Linux kernels since 2.6.19 at all, as long as the mapping uses flags MAP_SHARED | MAP_NORESERVE, and the underlying file is not opened using the O_DIRECT flag. The reason for this belief is that in this case, both mapping and file accesses should use the exact same page cache pages.
Here are two test programs, that can be used to explore this on Linux. First, a single-process test, test-single.c:
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
static inline int read_from(const int fd, void *const to, const size_t len, const off_t offset)
{
char *p = (char *)to;
char *const q = (char *)to + len;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = read(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
static inline int write_to(const int fd, const void *const from, const size_t len, const off_t offset)
{
const char *const q = (const char *)from + len;
const char *p = (const char *)from;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = write(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
int main(int argc, char *argv[])
{
unsigned long tests, n, merrs = 0, werrs = 0;
size_t page;
long *map, data[2];
int fd;
char dummy;
if (argc != 3) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s FILENAME COUNT\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program will test synchronization between a memory map\n");
fprintf(stderr, "and reading/writing the underlying file, COUNT times.\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (sscanf(argv[2], " %lu %c", &tests, &dummy) != 1 || tests < 1) {
fprintf(stderr, "%s: Invalid number of tests to run.\n", argv[2]);
return EXIT_FAILURE;
}
/* Create the file. */
page = sysconf(_SC_PAGESIZE);
fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
if (fd == -1) {
fprintf(stderr, "%s: Cannot create file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
if (ftruncate(fd, page) == -1) {
fprintf(stderr, "%s: Cannot resize file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
return EXIT_FAILURE;
}
/* Map it. */
map = mmap(NULL, page, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_NORESERVE, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "%s: Cannot map file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
close(fd);
return EXIT_FAILURE;
}
/* Test loop. */
for (n = 0; n < tests; n++) {
/* Update map. */
map[0] = (long)(n + 1);
map[1] = (long)(~n);
/* msync(map, 2 * sizeof map[0], MAP_SYNC | MAP_INVALIDATE); */
/* Check the file contents. */
if (read_from(fd, data, sizeof data, 0)) {
fprintf(stderr, "read_from() failed: %s.\n", strerror(errno));
munmap(map, page);
unlink(argv[1]);
close(fd);
return EXIT_FAILURE;
}
werrs += (data[0] != (long)(n + 1) || data[1] != (long)(~n));
/* Update data. */
data[0] = (long)(n * 386131);
data[1] = (long)(n * -257);
if (write_to(fd, data, sizeof data, 0)) {
fprintf(stderr, "write_to() failed: %s.\n", strerror(errno));
munmap(map, page);
unlink(argv[1]);
close(fd);
return EXIT_FAILURE;
}
merrs += (map[0] != (long)(n * 386131) || map[1] != (long)(n * -257));
}
munmap(map, page);
unlink(argv[1]);
close(fd);
if (!werrs && !merrs)
printf("No errors detected.\n");
else {
if (!werrs)
printf("Detected %lu times (%.3f%%) when file contents were incorrect.\n",
werrs, 100.0 * (double)werrs / (double)tests);
if (!merrs)
printf("Detected %lu times (%.3f%%) when mapping was incorrect.\n",
merrs, 100.0 * (double)merrs / (double)tests);
}
return EXIT_SUCCESS;
}
Compile and run using e.g.
gcc -Wall -O2 test-single -o single
./single temp 1000000
to test a million times, whether the mapping and the file contents stay in sync, when both accesses are done in the same process. Note that the msync() call is commented out, because on my machine it is not needed: I never see any errors/desynchronization during testing even without it.
The test rate on my machine is about 550,000 tests per second. Note that each tests does it both ways, so includes a read and a write. I just cannot get this to detect any errors. It is written to be quite sensitive to errors, too.
The second test program uses two child processes and a POSIX realtime signal to tell the other process to check the contents. test-multi.c:
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>
#define NOTIFY_SIGNAL (SIGRTMIN+0)
int mapper_process(const int fd, const size_t len)
{
long value = 1, count[2] = { 0, 0 };
long *data;
siginfo_t info;
sigset_t sigs;
int signum;
if (fd == -1) {
fprintf(stderr, "mapper_process(): Invalid file descriptor.\n");
return EXIT_FAILURE;
}
data = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
if (data == MAP_FAILED) {
fprintf(stderr, "mapper_process(): Cannot map file.\n");
return EXIT_FAILURE;
}
sigemptyset(&sigs);
sigaddset(&sigs, NOTIFY_SIGNAL);
sigaddset(&sigs, SIGINT);
sigaddset(&sigs, SIGHUP);
sigaddset(&sigs, SIGTERM);
while (1) {
/* Wait for the notification. */
signum = sigwaitinfo(&sigs, &info);
if (signum == -1) {
if (errno == EINTR)
continue;
fprintf(stderr, "mapper_process(): sigwaitinfo() failed: %s.\n", strerror(errno));
munmap(data, len);
return EXIT_FAILURE;
}
if (signum != NOTIFY_SIGNAL)
break;
/* A notify signal was received. Check the write counter. */
count[ (data[0] == value) ]++;
/* Update. */
data[0] = value++;
data[1] = -(value++);
/* Synchronize */
/* msync(data, 2 * sizeof (data[0]), MS_SYNC | MS_INVALIDATE); */
/* And let the writer know. */
kill(info.si_pid, NOTIFY_SIGNAL);
}
/* Print statistics. */
printf("mapper_process(): %lu errors out of %lu cycles (%.3f%%)\n",
count[0], count[0] + count[1], 100.0 * (double)count[0] / (double)(count[0] + count[1]));
fflush(stdout);
munmap(data, len);
return EXIT_SUCCESS;
}
static inline int read_from(const int fd, void *const to, const size_t len, const off_t offset)
{
char *p = (char *)to;
char *const q = (char *)to + len;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = read(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
static inline int write_to(const int fd, const void *const from, const size_t len, const off_t offset)
{
const char *const q = (const char *)from + len;
const char *p = (const char *)from;
ssize_t n;
if (lseek(fd, offset, SEEK_SET) != offset)
return errno = EIO;
while (p < q) {
n = write(fd, p, (size_t)(q - p));
if (n > 0)
p += n;
else
if (n != -1)
return errno = EIO;
else
if (errno != EINTR)
return errno;
}
return 0;
}
int writer_process(const int fd, const size_t len, const pid_t other)
{
long data[2] = { 0, 0 }, count[2] = { 0, 0 };
long value = 0;
siginfo_t info;
sigset_t sigs;
int signum;
sigemptyset(&sigs);
sigaddset(&sigs, NOTIFY_SIGNAL);
sigaddset(&sigs, SIGINT);
sigaddset(&sigs, SIGHUP);
sigaddset(&sigs, SIGTERM);
while (1) {
/* Update. */
data[0] = ++value;
data[1] = -(value++);
/* then write the data. */
if (write_to(fd, data, sizeof data, 0)) {
fprintf(stderr, "writer_process(): write_to() failed: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
/* Let the mapper know. */
kill(other, NOTIFY_SIGNAL);
/* Wait for the notification. */
signum = sigwaitinfo(&sigs, &info);
if (signum == -1) {
if (errno == EINTR)
continue;
fprintf(stderr, "writer_process(): sigwaitinfo() failed: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
if (signum != NOTIFY_SIGNAL || info.si_pid != other)
break;
/* Reread the file. */
if (read_from(fd, data, sizeof data, 0)) {
fprintf(stderr, "writer_process(): read_from() failed: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
/* Check the read counter. */
count[ (data[1] == -value) ]++;
}
/* Print statistics. */
printf("writer_process(): %lu errors out of %lu cycles (%.3f%%)\n",
count[0], count[0] + count[1], 100.0 * (double)count[0] / (double)(count[0] + count[1]));
fflush(stdout);
return EXIT_SUCCESS;
}
int main(int argc, char *argv[])
{
struct timespec duration;
double seconds;
pid_t mapper, writer, p;
size_t page;
siginfo_t info;
sigset_t sigs;
int fd, status;
char dummy;
if (argc != 3) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s FILENAME SECONDS\n", argv[0]);
fprintf(stderr, "\n");
fprintf(stderr, "This program will test synchronization between a memory map\n");
fprintf(stderr, "and reading/writing the underlying file.\n");
fprintf(stderr, "The test will run for the specified time, or indefinitely\n");
fprintf(stderr, "if SECONDS is zero, but you can also interrupt it with\n");
fprintf(stderr, "Ctrl+C (INT signal).\n");
fprintf(stderr, "\n");
return EXIT_FAILURE;
}
if (sscanf(argv[2], " %lf %c", &seconds, &dummy) != 1) {
fprintf(stderr, "%s: Invalid number of seconds to run.\n", argv[2]);
return EXIT_FAILURE;
}
if (seconds > 0) {
duration.tv_sec = (time_t)seconds;
duration.tv_nsec = (long)(1000000000 * (seconds - (double)(duration.tv_sec)));
} else {
duration.tv_sec = 0;
duration.tv_nsec = 0;
}
/* Block INT, HUP, CHLD, and the notification signal. */
sigemptyset(&sigs);
sigaddset(&sigs, SIGINT);
sigaddset(&sigs, SIGHUP);
sigaddset(&sigs, SIGCHLD);
sigaddset(&sigs, NOTIFY_SIGNAL);
if (sigprocmask(SIG_BLOCK, &sigs, NULL) == -1) {
fprintf(stderr, "Cannot block the necessary signals: %s.\n", strerror(errno));
return EXIT_FAILURE;
}
/* Create the file. */
page = sysconf(_SC_PAGESIZE);
fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0644);
if (fd == -1) {
fprintf(stderr, "%s: Cannot create file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
if (ftruncate(fd, page) == -1) {
fprintf(stderr, "%s: Cannot resize file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
return EXIT_FAILURE;
}
close(fd);
fd = -1;
/* Ensure streams are flushed before forking. They should be, we're just paranoid here. */
fflush(stdout);
fflush(stderr);
/* Fork the mapper child process. */
mapper = fork();
if (mapper == -1) {
fprintf(stderr, "Cannot fork mapper child process: %s.\n", strerror(errno));
unlink(argv[1]);
return EXIT_FAILURE;
}
if (!mapper) {
fd = open(argv[1], O_RDWR);
if (fd == -1) {
fprintf(stderr, "mapper_process(): %s: Cannot open file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
status = mapper_process(fd, page);
close(fd);
return status;
}
/* For the writer child process. (mapper contains the PID of the mapper process.) */
writer = fork();
if (writer == -1) {
fprintf(stderr, "Cannot fork writer child process: %s.\n", strerror(errno));
unlink(argv[1]);
kill(mapper, SIGKILL);
return EXIT_FAILURE;
}
if (!writer) {
fd = open(argv[1], O_RDWR);
if (fd == -1) {
fprintf(stderr, "writer_process(): %s: Cannot open file: %s.\n", argv[1], strerror(errno));
return EXIT_FAILURE;
}
status = writer_process(fd, page, mapper);
close(fd);
return status;
}
/* Wait for a signal. */
if (duration.tv_sec || duration.tv_nsec)
status = sigtimedwait(&sigs, &info, &duration);
else
status = sigwaitinfo(&sigs, &info);
/* Whatever it was, we kill the child processes. */
kill(mapper, SIGHUP);
kill(writer, SIGHUP);
do {
p = waitpid(-1, NULL, 0);
} while (p != -1 || errno == EINTR);
/* Cleanup. */
unlink(argv[1]);
printf("Done.\n");
return EXIT_SUCCESS;
}
Note that the child processes open the temporary file separately. To compile and run, use e.g.
gcc -Wall -O2 test-multi.c -o multi
./multi temp 10
The second parameter is the duration of the test, in seconds. (You can interrupt the testing safely using SIGINT (Ctrl+C) or SIGHUP.)
On my machine, the test rate is roughly 120,000 tests per second; the msync() call is commented out here also, because I don't ever see any errors/desynchronization even without it. (Plus, msync(ptr, len, MS_SYNC) and msync(ptr, len, MS_SYNC | MS_INVALIDATE) are horribly slow; with either, I can get less than 1000 tests per second, with absolutely no difference in the results. That's a 100x slowdown.)
The MAP_NORESERVE flag to mmap tells it to use the file itself as backing storage when under memory pressure, rather than swap. If you compile the code on a system that does not recognize that flag, you can omit it. As long as the mapping is not evicted from RAM, the flag does not affect the operation at all.
The idea is relatively simple, but I see some complications for implementations, so I'm wondering if it's even possible right now.
An example of what I'd like to do is to generate some data in a
buffer, then map the contents of this buffer to a file. Instead of
having the memory space virtually populated with the contents of the
file, I'd like the contents of the original buffer to be transferred
to the system cache (which should be a zero-copy operation) and
dirtied immediately (which would flush the data out to disk
eventually).
Of course the complication I mentioned is that the buffer should be deallocated and unmapped (since the data is now under the responsibility of the system cache), and I don't know how to do that either.
The important aspects are that:
The program can control when the file is created linked.
The program isn't required to anticipate the size of the file nor does it have to remap it as the dataset grows. Instead it can realloc the initial buffer (using an efficient memory allocator for this) until it is satisfied (it knows for sure that the dataset won't grow anymore) before finally mapping it to the file.
The data remains accessible through the same virtual memory address even after being mapped to the file, still without a single intra-memory copy.
One assumption is that:
We can use an arbitrary memory allocator (or memory management scheme in general) that can manage dynamic buffers more efficiently than mmap/mremap can for the memory space it manages, because the latter must deal with the filesystem to grow/shrink the file, which would always be slower.
So, (1) are these requirements too constrained? (2) Is this assumption correct?
PS: I had to arbitrarily pick the tags for this question, but I'm also interested in hearing how BSDs and Windows would do this. Of course if the POSIX API allows to do this already, that would be great.
Update: I call a buffer a space of private memory (private to the process/task in any OS with normal VMM) allocated in primary memory. The high-level goal involves generating a dataset of an arbitrary size using another input (in my case the network), then once it's generated, make it accessible for long periods of time (to the network and to the process itself), saving it to disk in the process.
If I keep the datasets in private memory and write them out normally, they'll just be swapped when the OS needs the space, which is a bit stupid since they're already on disk.
If I map another region then I have to copy the contents of the buffer to that region (which resides in the system cache), which, again, is a tad stupid since I won't use that buffer after that.
The alternative that I see is to write or use a full-blown userland cache reading and writing to the disk itself to ensure that (a) pages don't get uselessly swapped out and (b) the process doesn't hold too much memory for itself, which is never possible to do optimally anyway (better let the kernel do its job), and which is simply not a road I think is worth going down (too complex for less gains).
Update: Requirements 2 and 3 are non-issues considering Nominal Animal's answer. Of course this implies that the assumption is incorrect, as he proved is almost the case (overhead is minimal). I also relaxed requirement 1, O_TMPFILE is indeed perfect for this.
Update: A recent article on LWN mentions, somewhere in the middle: "That could possibly be done with a special write operation that would not actually cause I/O, or with a system call that would transfer a physical page into the page cache". That suggests that indeed, there is currently (April 2014) no way to do this at least with Linux (and likely other operating systems), much less with a standard API. The article is about PostgreSQL, but the issue in question is identical, except perhaps for the specific requirements to this question, which aren't defined in the article.
This is not a satisfactory answer to the question; is is more of a continuation of the comment chain.
Here is a test program one can use to measure the overhead of using a file-backed memory map, instead of an anonymous memory map.
Note that the work() function listed just fills in the memory map with random data. To be more realistic, it should simulate at least the access patterns expected from real-world usage.
#define _POSIX_C_SOURCE 200809L
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <time.h>
#include <stdint.h>
#include <string.h>
#include <errno.h>
#include <stdio.h>
/* Xorshift random number generator.
*/
static uint32_t xorshift_state[4] = {
123456789U,
362436069U,
521288629U,
88675123U
};
static int xorshift_setseed(const void *const data, const size_t len)
{
uint32_t state[4] = { 0 };
if (len < 1)
return ENOENT;
else
if (len < sizeof state)
memcpy(state, data, len);
else
memcpy(state, data, sizeof state);
if (state[0] || state[1] || state[2] || state[3]) {
xorshift_state[0] = state[0];
xorshift_state[1] = state[1];
xorshift_state[2] = state[2];
xorshift_state[3] = state[3];
return 0;
}
return EINVAL;
}
static uint32_t xorshift_u32(void)
{
const uint32_t temp = xorshift_state[0] ^ (xorshift_state[0] << 11U);
xorshift_state[0] = xorshift_state[1];
xorshift_state[1] = xorshift_state[2];
xorshift_state[2] = xorshift_state[3];
return xorshift_state[3] ^= (temp >> 8U) ^ temp ^ (xorshift_state[3] >> 19U);
}
/* Wallclock timing functions.
*/
static struct timespec wallclock_started;
static void wallclock_start(void)
{
clock_gettime(CLOCK_REALTIME, &wallclock_started);
}
static double wallclock_stop(void)
{
struct timespec wallclock_stopped;
clock_gettime(CLOCK_REALTIME, &wallclock_stopped);
return difftime(wallclock_stopped.tv_sec, wallclock_started.tv_sec)
+ (double)(wallclock_stopped.tv_nsec - wallclock_started.tv_nsec) / 1000000000.0;
}
/* Accessor function. This needs to read/modify/write the mapping,
* simulating the actual work done onto the mapping.
*/
static void work(void *const area, size_t const length)
{
uint32_t *const data = (uint32_t *)area;
size_t size = length / sizeof data[0];
size_t i;
/* Add xorshift data. */
for (i = 0; i < size; i++)
data[i] += xorshift_u32();
}
int main(int argc, char *argv[])
{
long page, size, delta, maxsize, steps;
int fd, result;
void *map, *old;
char dummy;
double seconds;
page = sysconf(_SC_PAGESIZE);
if (argc < 5 || argc > 6 || !strcmp(argv[1], "-h") || !strcmp(argv[1], "--help")) {
fprintf(stderr, "\n");
fprintf(stderr, "Usage: %s [ -h | --help ]\n", argv[0]);
fprintf(stderr, " %s MAPFILE SIZE DELTA MAXSIZE [ SEEDSTRING ]\n", argv[0]);
fprintf(stderr, "Where:\n");
fprintf(stderr, " MAPFILE backing file, '-' for none\n");
fprintf(stderr, " SIZE initial map size\n");
fprintf(stderr, " DELTA map size change\n");
fprintf(stderr, " MAXSIZE final size of the map\n");
fprintf(stderr, " SEEDSTRING seeds the Xorshift PRNG\n");
fprintf(stderr, "Note: sizes must be page aligned, each page being %ld bytes.\n", (long)page);
fprintf(stderr, "\n");
return 1;
}
if (argc >= 6) {
if (xorshift_setseed(argv[5], strlen(argv[5]))) {
fprintf(stderr, "%s: Invalid seed string for the Xorshift generator.\n", argv[5]);
return 1;
} else {
fprintf(stderr, "Xorshift initialized with { %lu, %lu, %lu, %lu }.\n",
(unsigned long)xorshift_state[0],
(unsigned long)xorshift_state[1],
(unsigned long)xorshift_state[2],
(unsigned long)xorshift_state[3]);
fflush(stderr);
}
}
if (sscanf(argv[2], " %ld %c", &size, &dummy) != 1) {
fprintf(stderr, "%s: Invalid map size.\n", argv[2]);
return 1;
} else
if (size < page || size % page) {
fprintf(stderr, "%s: Map size must be a multiple of page size (%ld).\n", argv[2], page);
return 1;
}
if (sscanf(argv[3], " %ld %c", &delta, &dummy) != 1) {
fprintf(stderr, "%s: Invalid map size change.\n", argv[2]);
return 1;
} else
if (delta % page) {
fprintf(stderr, "%s: Map size change must be a multiple of page size (%ld).\n", argv[3], page);
return 1;
}
if (delta) {
if (sscanf(argv[4], " %ld %c", &maxsize, &dummy) != 1) {
fprintf(stderr, "%s: Invalid final map size.\n", argv[3]);
return 1;
} else
if (maxsize < page || maxsize % page) {
fprintf(stderr, "%s: Final map size must be a multiple of page size (%ld).\n", argv[4], page);
return 1;
}
steps = (maxsize - size) / delta;
if (steps < 0L)
steps = -steps;
} else {
maxsize = size;
steps = 0L;
}
/* Time measurement includes the file open etc. overheads.
*/
wallclock_start();
if (strlen(argv[1]) < 1 || !strcmp(argv[1], "-"))
fd = -1;
else {
do {
fd = open(argv[1], O_RDWR | O_CREAT | O_EXCL, 0600);
} while (fd == -1 && errno == EINTR);
if (fd == -1) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
return 1;
}
do {
result = ftruncate(fd, (off_t)size);
} while (result == -1 && errno == EINTR);
if (result == -1) {
fprintf(stderr, "%s: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
return 1;
}
result = posix_fadvise(fd, 0, size, POSIX_FADV_RANDOM);
}
/* Initial mapping. */
if (fd == -1)
map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, fd, 0);
else
map = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE, fd, 0);
if (map == MAP_FAILED) {
fprintf(stderr, "Memory map failed: %s.\n", strerror(errno));
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
return 1;
}
result = posix_madvise(map, size, POSIX_MADV_RANDOM);
work(map, size);
while (steps-->0L) {
if (fd != -1) {
do {
result = ftruncate(fd, (off_t)(size + delta));
} while (result == -1 && errno == EINTR);
if (result == -1) {
fprintf(stderr, "%s: Cannot grow file: %s.\n", argv[1], strerror(errno));
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
return 1;
}
result = posix_fadvise(fd, 0, size, POSIX_FADV_RANDOM);
}
old = map;
map = mremap(map, size, size + delta, MREMAP_MAYMOVE);
if (map == MAP_FAILED) {
fprintf(stderr, "Cannot remap memory map: %s.\n", strerror(errno));
munmap(old, size);
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
return 1;
}
size += delta;
result = posix_madvise(map, size, POSIX_MADV_RANDOM);
work(map, size);
}
/* Timing does not include file renaming.
*/
seconds = wallclock_stop();
munmap(map, size);
if (fd != -1) {
unlink(argv[1]);
do {
result = close(fd);
} while (result == -1 && errno == EINTR);
}
printf("%.9f seconds elapsed.\n", seconds);
return 0;
}
If you save the above as bench.c, you can compile it using
gcc -W -Wall -O3 bench.c -lrt -o bench
Run it without parameters to see the usage.
On my machine, on ext4 filesystem, running tests
./bench - 4096 4096 4096000
./bench testfile 4096 4096 4096000
yields 1.307 seconds wall clock time for the anonymous memory map, and 1.343 seconds for the file-backed memory map, meaning the file backed mapping is about 2.75% slower.
This test starts with one page memory map, then enlarges it by one page a thousand times. For tests like 4096000 4096 8192000 the difference is even smaller. The time measured does include constructing the initial file (and using posix_fallocate() to allocate the blocks on disk for the file).
Running the test on tmpfs, on ext4 over swRAID0, and on ext4 over swRAID1, on the same machine, does not seem to affect the results; all differences are lost in the noise.
While I would prefer to test this on multiple machines and kernel versions before making any sweeping statements, I do know something about how the kernel manages these memory maps. Therefore, I shall make the following claim, based on above and my own experience:
Using a file-backed memory map will not cause a significant slowdown compared to an anonymous memory map, or even compared to malloc()/realloc()/free(). I expect the difference to be under 5% in all real-world use cases, and at most 1% for typical real-world use cases; less, if the resizes are rare compared to how often the map is accessed.
To user2266481 the above means it should be acceptable to just create a temporary file on the target filesystem, to hold the memory map. (Note that it is possible to create the temporary file without allowing anyone access to it, mode 0, as access mode is only checked when opening the file.) When the contents are in final form, ftruncate() and msync() the contents, then hard-link the final file to the temporary file using link(). Finally, unlink the temporary file and close the temporary file descriptor, and the task should be completed with near-optimal efficiency.