I want to use mmap to implement persistence of certain portions of program state in a C program running under Linux by associating a fixed-size struct with a well known file name using mmap() with the MAP_SHARED flag set. For performance reasons, I would prefer not to call msync() at all, and no other programs will be accessing this file. When my program terminates and is restarted, it will map the same file again and do some processing on it to recover the state that it was in before the termination. My question is this: if I never call msync() on the file descriptor, will the kernel guarantee that all updates to the memory will get written to disk and be subsequently recoverable even if my process is terminated with SIGKILL? Also, will there be general system overhead from the kernel periodically writing the pages to disk even if my program never calls msync()?
EDIT: I've settled the question of whether the data is written, but I'm still not sure whether this approach causes more system load than handling the problem with open()/write()/fsync() and accepting the risk that some data might be lost if the process gets hit by KILL/SEGV/ABRT/etc. I added a 'linux-kernel' tag in hopes that some knowledgeable person might chime in.
I found a comment from Linus Torvalds that answers this question
http://www.realworldtech.com/forum/?threadid=113923&curpostid=114068
The mapped pages are part of the filesystem cache, which means that even if the user process that made a change to a page dies, the page is still managed by the kernel. Since all concurrent accesses to that file go through the kernel, other processes will be served from that cache.
In some old Linux kernels it was different, which is why some kernel documents still advise forcing msync().
EDIT: Thanks to RobH for correcting the link.
EDIT:
A new flag, MAP_SYNC, was introduced in Linux 4.15; it can guarantee this coherence.
Shared file mappings with this flag provide the guarantee that while some memory is writably mapped in the address space of the process, it will be visible in the same file at the same offset even after the system crashes or is rebooted.
References:
http://man7.org/linux/man-pages/man2/mmap.2.html (search for MAP_SYNC on the page)
https://lwn.net/Articles/731706/
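For illustration, here is a minimal sketch of requesting such a mapping. It assumes a recent glibc (which exposes MAP_SYNC and MAP_SHARED_VALIDATE from <sys/mman.h> under _GNU_SOURCE) and a file living on a DAX-capable filesystem; the file name and size are placeholders. MAP_SYNC must be combined with MAP_SHARED_VALIDATE, and the call fails with EOPNOTSUPP if the filesystem cannot honour it:
#define _GNU_SOURCE            /* for MAP_SYNC and MAP_SHARED_VALIDATE in <sys/mman.h> */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* "state.bin" and the 4096-byte size are placeholders; the file must live on
       a DAX-capable filesystem (e.g. ext4/xfs mounted with -o dax) for MAP_SYNC. */
    int fd = open("state.bin", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, 4096) < 0) {
        perror("open/ftruncate");
        exit(1);
    }
    void *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_SYNC)");   /* EOPNOTSUPP typically means: not a DAX filesystem */
        exit(1);
    }
    /* Stores through p are now guaranteed to be durable in the file across crashes. */
    munmap(p, 4096);
    close(fd);
    return 0;
}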
I decided to be less lazy and answer the question of whether the data is written to disk definitively by writing some code. The answer is that it will be written.
Here is a program that kills itself abruptly after writing some data to an mmap'd file:
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
typedef struct {
char data[100];
uint16_t count;
} state_data;
const char *test_data = "test";
int main(int argc, const char *argv[]) {
int fd = open("test.mm", O_RDWR|O_CREAT|O_TRUNC, (mode_t)0700);
if (fd < 0) {
perror("Unable to open file 'test.mm'");
exit(1);
}
size_t data_length = sizeof(state_data);
if (ftruncate(fd, data_length) < 0) {
perror("Unable to truncate file 'test.mm'");
exit(1);
}
state_data *data = (state_data *)mmap(NULL, data_length, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, fd, 0);
if (MAP_FAILED == data) {
perror("Unable to mmap file 'test.mm'");
close(fd);
exit(1);
}
memset(data, 0, data_length);
for (data->count = 0; data->count < 5; ++data->count) {
data->data[data->count] = test_data[data->count];
}
kill(getpid(), SIGKILL);  /* terminate abruptly, with no chance to flush anything */
}
Here is a program that validates the resulting file after the previous program is dead:
#include <stdint.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <assert.h>
typedef struct {
char data[100];
uint16_t count;
} state_data;
const char *test_data = "test";
int main(int argc, const char *argv[]) {
int fd = open("test.mm", O_RDONLY);
if (fd < 0) {
perror("Unable to open file 'test.mm'");
exit(1);
}
size_t data_length = sizeof(state_data);
state_data *data = (state_data *)mmap(NULL, data_length, PROT_READ, MAP_SHARED|MAP_POPULATE, fd, 0);
if (MAP_FAILED == data) {
perror("Unable to mmap file 'test.mm'");
close(fd);
exit(1);
}
assert(5 == data->count);
unsigned index;
for (index = 0; index < 4; ++index) {
assert(test_data[index] == data->data[index]);
}
printf("Validated\n");
}
I found something adding to my confusion:
munmap does not affect the object that was mapped; that is, the call to munmap does not cause the contents of the mapped region to be written to the disk file. The updating of the disk file for a MAP_SHARED region happens automatically by the kernel's virtual memory algorithm as we store into the memory-mapped region.
This is excerpted from Advanced Programming in the UNIX® Environment.
From the Linux manpage:
MAP_SHARED Share this mapping with all other processes that map this object. Storing to the region is equivalent to writing to the file. The file may not actually be updated until msync(2) or munmap(2) are called.
The two seem contradictory. Is APUE wrong?
I did not find a very precise answer to your question, so I decided to add one more:
Firstly, about losing data: whether you use write() or mmap()/memcpy(), both write into the page cache and are synced to the underlying storage in the background by the OS according to its writeback settings. For example, Linux has vm.dirty_expire_centisecs, which determines when dirty pages are considered old enough to be flushed to disk (and vm.dirty_writeback_centisecs, which controls how often the flusher threads wake up). Even if your process dies after the write call has succeeded, the data is not lost, because it is already present in kernel pages that will eventually be written to storage. The only case in which you would lose data is if the OS itself crashes (kernel panic, power loss, etc.). The way to make absolutely sure your data has reached storage is to call fsync(), or msync() for mmapped regions, as the case may be.
About the system-load concern: yes, calling msync()/fsync() for each request will slow your throughput drastically, so do that only if you have to. Remember that you are really protecting against losing data on OS crashes, which I would assume is rare and probably something most could live with. One common optimization is to issue the sync at regular intervals, say every second, to get a good balance.
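If you want that middle ground, a minimal sketch of the "sync roughly once per second" idea for an mmapped region could look like the following; the helper name, the region arguments and the one-second interval are all assumptions, and the loop would typically run on its own thread:
#include <stddef.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: flush a MAP_SHARED region about once per second instead
   of after every single update. */
void periodic_flush(void *mapped_region, size_t length) {
    for (;;) {
        sleep(1);   /* the interval is an arbitrary throughput/durability trade-off */
        /* MS_SYNC blocks until the dirty pages have reached storage; MS_ASYNC
           would merely schedule writeback and return immediately. */
        if (msync(mapped_region, length, MS_SYNC) != 0)
            perror("msync");
    }
}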
Either the Linux manpage information is incorrect or Linux is horribly non-conformant. msync is not supposed to have anything to do with whether the changes are committed to the logical state of the file, or whether other processes using mmap or read to access the file see the changes; it's purely an analogue of fsync and should be treated as a no-op except for the purposes of ensuring data integrity in the event of power failure or other hardware-level failure.
According to the manpage,
The file may not actually be updated until msync(2) or munmap() is called.
So you will need to make sure you call munmap() prior to exiting at the very least.
Related
I have the below code where I'm intentionally creating a page fault in one of the threads in file.c
util.c
#include "util.h"
// to use as a fence() instruction
extern inline __attribute__((always_inline))
CYCLES rdtscp(void) {
CYCLES cycles;
asm volatile ("rdtscp" : "=a" (cycles));
return cycles;
}
// initialize address
void init_ram_address(char* FILE_NAME){
char *filename = FILE_NAME;
int fd = open(filename, O_RDWR);
if(fd == -1) {
printf("Could not open file .\n");
exit(0);
}
void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, 0);
ram_address = (int *) file_address;
}
// initialize address
void init_disk_address(char* FILE_NAME){
char *filename = FILE_NAME;
int fd = open(filename, O_RDWR);
if(fd == -1) {
printf("Could not open file .\n");
exit(0);
}
void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
disk_address = (int *) file_address;
}
file.c
#include "util.h"
void *f1(void *arg);
void *f2(void *arg);
pthread_barrier_t barrier;
pthread_mutex_t mutex;
int main(int argc, char **argv)
{
pthread_t t1, t2;
// in ram
init_ram_address(RAM_FILE_NAME);
// in disk
init_disk_address(DISK_FILE_NAME);
pthread_create(&t1, NULL, &f1, NULL);
pthread_create(&t2, NULL, &f2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
return 0;
}
void *f1(void *arg)
{
    rdtscp();
    int load = *(ram_address);
    rdtscp();
    printf("Expecting this to be run first.\n");
    return NULL;
}
void *f2(void *arg)
{
    rdtscp();
    int load = *(disk_address);
    rdtscp();
    printf("Expecting this to be run second.\n");
    return NULL;
}
I've used rdtscp() in the above code for fencing purposes (to ensure that the print statement gets executed only after the load operation is done).
Since t2 will incur a page fault, I expect t1 to finish executing its print statement first.
To run both the threads on the same core, I run taskset -c 10 ./file.
I see that t2 prints its statement before t1. What could be the reason for this?
I think you're expecting t2's int load = *(disk_address); to cause a context switch to t1, and since you're pinning everything to the same CPU core, that would give t1 time to win the race to take the lock for stdout.
A soft page fault doesn't need to context-switch, just update the page tables with a file page from the pagecache. Even though the mapping is backed by a disk file rather than anonymous memory or copy-on-write tricks, if the file has been read or written recently it will be hot in the pagecache and not require I/O (which is what would make it a hard page fault).
Maybe try evicting disk cache before a test run, like with echo 3 | sudo tee /proc/sys/vm/drop_caches if this is Linux, so access to the mmap region without MAP_POPULATE will be a hard page fault (requiring I/O).
(See https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache; sync first, at least on the disk file, if it was recently written, to make sure its page(s) are clean and able to be evicted, aka dropped. Dropping caches is mainly useful for benchmarking.)
Or programmatically, you can hint the kernel with the madvise(2) system call, like madvise(MADV_DONTNEED) on a page, encouraging it to evict it from pagecache soon. (Or at least hint that your process doesn't need it; other processes might keep it hot).
In Linux kernel 5.4 and later, MADV_COLD works as a hint to evict the specified page(s) on memory pressure. ("Deactivate" probably means remove from HW page tables, so next access will at least be a soft page fault.) Or MADV_PAGEOUT is apparently supposed to get the kernel to reclaim the specified page(s) right away, I guess before the system call returns. After that, the next access should be a hard page fault.
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure. This is a nondestructive operation. The advice might be ignored for some pages in the range when it is not applicable.
MADV_PAGEOUT (since Linux 5.4)
Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous, it will be swapped out. If a page is file-backed and dirty, it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable.
These madvise args are Linux-specific. The madvise system call itself (as opposed to posix_madvise) is not guaranteed portable, but the man page gives the impression that some other systems have their own madvise system calls supporting some standard "advice" hints to the kernel.
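As an illustration, here is a sketch of evicting one page of an existing file-backed mapping with these hints. It assumes addr is page-aligned and points into a mapping created elsewhere, a kernel of 5.4 or later for MADV_PAGEOUT, and a libc whose headers define that constant (otherwise it comes from <linux/mman.h>):
#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumes 'addr' is the page-aligned start of a page inside a file-backed mapping. */
static void evict_page(void *addr) {
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    /* Ask the kernel to reclaim this page now (Linux 5.4+); the next access to it
       should then be a hard page fault that has to read from disk. */
    if (madvise(addr, page, MADV_PAGEOUT) != 0) {
        if (errno == EINVAL)
            madvise(addr, page, MADV_DONTNEED);   /* older kernel: fall back to the weaker hint */
        else
            perror("madvise");
    }
}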
You haven't shown the declaration of ram_address or disk_address.
If it's not a pointer-to-volatile like volatile int *disk_address, the loads may be optimized away at compile time. Writes to non-escaped local vars like int load don't have to respect "memory" clobbers because nothing else could possibly have a reference to them.
If you compiled without optimization or something, then yes the load will still happen even without volatile.
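To make that concrete, here is a tiny self-contained sketch of the pointer-to-volatile declaration; the variable name mirrors the question, but the real declarations live in the unshown util.h, so treat this as an assumption:
#include <stdio.h>

/* Hypothetical stand-in for util.h: pointer-to-volatile, so the dereference
   below must be emitted even at -O2, although its result is never used. */
volatile int *disk_address;

int main(void) {
    static int backing;            /* stand-in for the mmap()ed file page */
    disk_address = &backing;
    int load = *disk_address;      /* this load survives optimization because of volatile */
    (void)load;                    /* silence the unused-variable warning */
    return 0;
}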
I am using pread to obtain a huge amount of data at one time.
But if I try to gather a huge amount of data (for instance 100 MB) and save it into an array, I get a segfault...
Is there a hard limit on the max number of bytes a pread can read?
#define _FILE_OFFSET_BITS 64
#define BLKGETSIZE64 _IOR(0x12,114,size_t)
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <inttypes.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
int readdata(int fp,uint64_t seekpoint, uint64_t seekwidth) {
int16_t buf[seekwidth];
if (pread(fp,buf,seekwidth,seekpoint)==seekwidth) {
printf("SUCCES READING AT: %"PRIu64"| WITH READ WIDTH: %"PRIu64"\n",seekpoint,seekwidth);
return 1;
} else {
printf("ERROR READING AT: %"PRIu64"| WITH READ WIDTH: %"PRIu64"\n",seekpoint,seekwidth);
return 2;
}
}
int main() {
uint64_t readwith,
offset;
int fp=open("/dev/sdc",O_RDWR);
readwith=10000; offset=0;
readdata(fp,offset,readwith);
readwith=100000; offset=0;
readdata(fp,offset,readwith);
readwith=1000000; offset=0;
readdata(fp,offset,readwith);
readwith=10000000; offset=0;
readdata(fp,offset,readwith);
readwith=10000000; offset=0;
readdata(fp,offset,readwith);
readwith=100000000; offset=0;
readdata(fp,offset,readwith);
readwith=1000000000; offset=0;
readdata(fp,offset,readwith);
readwith=10000000000; offset=0;
readdata(fp,offset,readwith);
readwith=100000000000; offset=0;
readdata(fp,offset,readwith);
readwith=1000000000000; offset=0;
readdata(fp,offset,readwith);
readwith=10000000000000; offset=0;
readdata(fp,offset,readwith);
readwith=100000000000000; offset=0;
readdata(fp,offset,readwith);
readwith=1000000000000000; offset=0;
readdata(fp,offset,readwith);
close(fp);
}
There is no hard limit on the maximum number of bytes that pread can read. However, reading that large an amount of data in one contiguous block is probably a bad idea. There are a few alternatives I'll describe later.
In your particular case, the problem is likely that you are trying to stack allocate the buffer. There is a limited amount of space available to the stack; if you run cat /proc/<pid>/limits, you can see what that is for a particular process (or just cat /proc/self/limits to check for the shell that you're running). In my case, that happens to be 8388608, or 8 MB. If you try to use more than this limit, you will get a segfault.
You can increase the maximum stack size using setrlimit or the Linux-specific prlimit, but that's generally not considered a good thing to do; your stack is something that is permanently allocated to each thread, so increasing the size increases how much address space each thread has allocated to it. 32-bit systems are becoming less relevant, but they are still out there (as are 32-bit applications on 64-bit systems), and on them this address space is fairly limited, so just a few threads with a large amount of stack space allocated could exhaust your address space. It would be better to take an alternate approach.
One such alternate approach is to use malloc to dynamically allocate your buffer. Then you will only use this space when you need it, not all the time for your whole stack. Yes, you do have to remember to free the data afterwards, but that's not all that hard with a little bit of careful thought in your programming.
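For example, here is a sketch of the questioner's readdata() using a heap buffer instead of a variable-length array on the stack; the function name and parameters mirror the question's code, and error handling is kept minimal:
#include <inttypes.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int readdata(int fd, uint64_t seekpoint, uint64_t seekwidth) {
    void *buf = malloc(seekwidth);              /* heap, not the (limited) stack */
    if (buf == NULL) {
        perror("malloc");
        return 2;
    }
    ssize_t got = pread(fd, buf, seekwidth, seekpoint);
    if (got == (ssize_t)seekwidth)
        printf("SUCCESS READING AT: %" PRIu64 "| WITH READ WIDTH: %" PRIu64 "\n",
               seekpoint, seekwidth);
    else
        printf("ERROR READING AT: %" PRIu64 "| WITH READ WIDTH: %" PRIu64 "\n",
               seekpoint, seekwidth);
    free(buf);                                  /* remember to release the buffer */
    return got == (ssize_t)seekwidth ? 1 : 2;
}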
Another approach, which can be good for large amounts of data like this, is to use mmap to map the file into your address space instead of trying to read the whole thing into a buffer. This allocates a region of address space, and any time you access that address space, the data is read from the file to populate the page you are reading from. This can be very handy when you want random access to the file but will not actually be reading the whole thing; you will instead be skipping around the file. Only the pages that you actually access will be read, rather than wasting time reading the whole file into a buffer and then accessing only portions of it.
If you use mmap, you will need to remember to munmap the address space afterwards, though if you're on a 64 bit system, forgetting to munmap matters a lot less than forgetting to free allocated memory (on a 32 bit system, address space is actually at a premium, so leaving around large mappings can still cause problems). mmap will only use up address space, not actual memory; since it's backed by a file, if there's memory pressure the kernel can just write out any dirty data to disk and stop keeping the contents around in memory, while for an allocated buffer, it needs to actually preserve the data in RAM or swap space, which are generally fairly limited resources. And if you're just using it to read data, it doesn't even have to flush out dirty data to disk; it can just free up the page and reuse it, and if you access the page again, it will read it back in.
If you don't need random access to all of that data at once, it's probably better to just read and process the data in smaller chunks, in a loop. Then you can use stack allocation for its simplicity, without worrying about increasing the amount of address space allocated to your stack.
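Here is a sketch of that chunked approach, processing the file through one modest fixed buffer; the 1 MiB chunk size is an arbitrary choice:
#include <sys/types.h>
#include <unistd.h>

/* Read 'total' bytes starting at 'offset' in 1 MiB pieces and process each piece. */
static int process_in_chunks(int fd, off_t offset, off_t total) {
    static char buf[1 << 20];                   /* 1 MiB in static storage, so no stack pressure */
    while (total > 0) {
        size_t want = total > (off_t)sizeof buf ? sizeof buf : (size_t)total;
        ssize_t got = pread(fd, buf, want, offset);
        if (got <= 0)
            return -1;                          /* error or unexpected end of file */
        /* ... process buf[0 .. got-1] here ... */
        offset += got;
        total  -= got;
    }
    return 0;
}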
Edit to add: Based on your sample code and your other question, you seem to be trying to read an entire 2 TB disk as a single array. In this case, you will definitely need to use mmap, as you likely don't have enough RAM to hold the entire contents in memory. Here's an example; note that this particular example is specific to Linux:
#include <stdio.h>
#include <err.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <sys/ioctl.h>
#include <linux/fs.h>
#include <unistd.h>
int main(int argc, char **argv) {
if (argc != 2)
errx(1, "Wrong number of arguments");
int fd = open(argv[1], O_RDONLY);
if (fd < 0)
err(2, "Failed to open %s", argv[1]);
struct stat statbuf;
if (fstat(fd, &statbuf) != 0)
err(3, "Failed to stat %s", argv[1]);
size_t size;
if (S_ISREG(statbuf.st_mode)) {
    size = statbuf.st_size;
} else if (S_ISBLK(statbuf.st_mode)) {
    if (ioctl(fd, BLKGETSIZE64, &size) != 0)
        err(4, "Failed to get size of block device %s", argv[1]);
} else {
    errx(4, "%s is neither a regular file nor a block device", argv[1]);
}
printf("Size: %zu\n", size);
char *mapping = mmap(0, size, PROT_READ, MAP_SHARED, fd, 0);
if (MAP_FAILED == mapping)
err(5, "Failed to map %s", argv[1]);
/* do something with `mapping` */
munmap(mapping, size);
return 0;
}
I need to make a shared memory segment so that I can have multiple readers and writers access it. I think I know what I am doing as far as the semaphores and readers and writers go...
BUT I am clueless as to how to even create a shared memory segment. I want the segment to hold an array of 20 structs. Each struct will hold a first name, an int, and another int.
Can anyone help me at least start this? I am desperate and everything I read online just confuses me more.
EDIT: Okay, so I do something like this to start
int memID = shmget(IPC_PRIVATE, sizeof(startData[0])*20, IPC_CREAT);
with startData as the initialized array of structs holding my data, and I get an error saying
"Segmentation Fault (core dumped)"
The modern way to obtain shared memory is to use the POSIX shared memory API provided by the Single UNIX Specification. Here is an example with two processes: one creates a shared memory object and puts some data inside it, the other one reads it.
First process:
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#define SHM_NAME "/test"
typedef struct
{
int item;
} DataItem;
int main (void)
{
int smfd, i;
DataItem *smarr;
size_t size = 20*sizeof(DataItem);
// Create a shared memory object
smfd = shm_open(SHM_NAME, O_RDWR | O_CREAT, 0600);
// Resize to fit
ftruncate(smfd, size);
// Map the object
smarr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, smfd, 0);
// Put in some data
for (i = 0; i < 20; i++)
smarr[i].item = i;
printf("Press Enter to remove the shared memory object\n");
getc(stdin);
// Unmap the object
munmap(smarr, size);
// Close the shared memory object handle
close(smfd);
// Remove the shared memory object
shm_unlink(SHM_NAME);
return 0;
}
The process creates a shared memory object with shm_open(). The object is created with an initial size of zero, so it is enlarged using ftruncate(). Then the object is memory mapped into the virtual address space of the process using mmap(). The important thing here is that the mapping is read/write (PROT_READ | PROT_WRITE) and shared (MAP_SHARED). Once the mapping is done, it can be accessed like regular dynamically allocated memory (as a matter of fact, malloc() in glibc on Linux uses anonymous memory mappings for larger allocations). The process then writes data into the array and waits until Enter is pressed, after which it unmaps the object using munmap(), closes its file handle and unlinks the object with shm_unlink().
Second process:
#include <stdio.h>
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#define SHM_NAME "/test"
typedef struct
{
int item;
} DataItem;
int main (void)
{
int smfd, i;
DataItem *smarr;
size_t size = 20*sizeof(DataItem);
// Open the shared memory object
smfd = shm_open(SHM_NAME, O_RDONLY, 0600);
// Map the object
smarr = mmap(NULL, size, PROT_READ, MAP_SHARED, smfd, 0);
// Read the data
for (i = 0; i < 20; i++)
printf("Item %d is %d\n", i, smarr[i].item);
// Unmap the object
munmap(smarr, size);
// Close the shared memory object handle
close(smfd);
return 0;
}
This one opens the shared memory object read-only and memory maps it read-only as well. Any attempt to write to the elements of the smarr array would result in a segmentation fault.
Compile and run the first process. Then in a separate console run the second process and observe the output. When the second process has finished, go back to the first one and press Enter to clean up the shared memory block.
For more information consult the man pages of each function or the memory management portion of the SUS (it's better to consult the man pages as they document the system-specific behaviour of these functions).
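One practical note: depending on the glibc version, both programs may also need to be linked with librt, since shm_open() and shm_unlink() lived there before glibc 2.34 (for example, cc writer.c -o writer -lrt; the file name here is just a placeholder).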
I want to know what parts of a huge file are cached in memory. I'm using some code from fincore for that, which works this way: the file is mmapped, then fincore loops over the address space and checks pages with mincore, but it takes very long (several minutes) because of the file size (several TB).
Is there a way to loop on used RAM pages instead? It would be much faster, but that means I should get the list of used pages from somewhere... However I can't find a convenient system call that would allow that.
Here comes the code:
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/sysinfo.h>
void
fincore(char *filename) {
int fd;
struct stat st;
struct sysinfo info;
if (sysinfo(& info)) {
perror("sysinfo");
return;
}
void *pa = (char *)0;
char *vec = (char *)0;
size_t pageSize = getpagesize();
register size_t pageIndex;
fd = open(filename, 0);
if (0 > fd) {
perror("open");
return;
}
if (0 != fstat(fd, &st)) {
perror("fstat");
close(fd);
return;
}
pa = mmap((void *)0, st.st_size, PROT_NONE, MAP_SHARED, fd, 0);
if (MAP_FAILED == pa) {
perror("mmap");
close(fd);
return;
}
/* vec = calloc(1, 1+st.st_size/pageSize); */
/* 2.2 sec for 8 TB */
vec = calloc(1, (st.st_size+pageSize-1)/pageSize);
if ((void *)0 == vec) {
perror("calloc");
close(fd);
return;
}
/* 48 sec for 8 TB */
if (0 != mincore(pa, st.st_size, vec)) {
fprintf(stderr, "mincore(%p, %lu, %p): %s\n",
pa, (unsigned long)st.st_size, vec, strerror(errno));
free(vec);
close(fd);
return;
}
/* handle the results */
/* 2m45s for 8 TB */
for (pageIndex = 0; pageIndex < (st.st_size+pageSize-1)/pageSize; pageIndex++) {
if (vec[pageIndex]&1) {
printf("%zd\n", pageIndex);
}
}
free(vec);
vec = (char *)0;
munmap(pa, st.st_size);
close(fd);
return;
}
int main(int argc, char *argv[]) {
fincore(argv[1]);
return 0;
}
The amount of information needed to represent a list is, for the pessimistic case when all or almost all pages are indeed in RAM, much higher than the bitmap: at least 64 bits vs 1 bit per entry. If there were such an API, when querying it about your 2 billion pages you would have to be prepared to get 16 GB of data in the reply. Additionally, handling variable-length structures such as lists is more complex than handling a fixed-length array, so library functions, especially low-level system ones, tend to avoid the hassle.
I am also not quite sure about the implementation (how the OS interacts with the TLB and Co in this case), but it may well be that (even size difference aside) filling out the bitmap can be performed faster than creating a list due to the OS- and hardware-level structures the information is extracted from.
If you are not concerned about very fine granularity, you could have a look at /proc/<PID>/smaps. For each mapped region it shows some stats, including how much is loaded into memory (Rss field). If for the purpose of debugging you map some regions of a file with a separate mmap() call (in addition to the main mapping used for performing the actual task), you will probably get separate entries in smaps and thus see separate statistics for these regions. You almost certainly can't make billions of mappings without killing your system, but if the file is structured well, maybe having separate statistics for just a few dozen well-chosen regions can help you find the answers you are looking for.
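As a rough sketch of that idea, the Rss figures can be pulled out of /proc/self/smaps for every mapping whose header line mentions a given path; the simple string matching here is an assumption, but it is usually good enough for debugging:
#include <stdio.h>
#include <string.h>

/* Print the Rss line of every mapping of 'path' in the current process. */
static void print_rss_for(const char *path) {
    FILE *f = fopen("/proc/self/smaps", "r");
    if (!f) { perror("/proc/self/smaps"); return; }
    char line[512];
    int in_target = 0;
    while (fgets(line, sizeof line, f)) {
        /* Header lines start with a lowercase hex address; detail lines start
           with a capitalized field name such as "Rss:". */
        int is_header = (line[0] >= '0' && line[0] <= '9') ||
                        (line[0] >= 'a' && line[0] <= 'f');
        if (is_header)
            in_target = (strstr(line, path) != NULL);
        else if (in_target && strncmp(line, "Rss:", 4) == 0)
            printf("%s", line);                  /* e.g. "Rss:        1234 kB" */
    }
    fclose(f);
}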
Cached by whom?
Consider after a boot the file sits on disk. No part of it is in memory.
Now the file is opened and random reads are performed.
The file system (e.g. the kernel) will be caching.
The C standard library will be caching.
The kernel will be caching in kernel-mode memory, the C standard library in user-mode memory.
If you could issue a query, it could also be that instantly after the query - before it returns to you - the cached data in question is removed from the cache.
I've been working with large sparse files on openSUSE 11.2 x86_64. When I try to mmap() a 1TB sparse file, it fails with ENOMEM. I would have thought that the 64 bit address space would be adequate to map in a terabyte, but it seems not. Experimenting further, a 1GB file works fine, but a 2GB file (and anything bigger) fails. I'm guessing there might be a setting somewhere to tweak, but an extensive search turns up nothing.
Here's some sample code that shows the problem - any clues?
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
char * filename = argv[1];
int fd;
off_t size = 1UL << 40; // 30 == 1GB, 40 == 1TB
fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666);
ftruncate(fd, size);
printf("Created %ld byte sparse file\n", size);
char * buffer = (char *)mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if ( buffer == MAP_FAILED ) {
perror("mmap");
exit(1);
}
printf("Done mmap - returned 0x0%lx\n", (unsigned long)buffer);
strcpy( buffer, "cafebabe" );
printf("Wrote to start\n");
strcpy( buffer + (size - 9), "deadbeef" );
printf("Wrote to end\n");
if ( munmap(buffer, (size_t)size) < 0 ) {
perror("munmap");
exit(1);
}
close(fd);
return 0;
}
The problem was that the per-process virtual memory limit was set to only 1.7GB. ulimit -v 1610612736 set it to 1.5TB and my mmap() call succeeded. Thanks, bmargulies, for the hint to try ulimit -a!
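The same limit can also be inspected, and raised where the hard limit allows it, from inside the program with getrlimit()/setrlimit() on RLIMIT_AS (which is what ulimit -v controls, in kilobytes); a sketch using the figure from above:
#include <stdio.h>
#include <sys/resource.h>

int main(void) {
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0)
        printf("current address-space limit: %llu bytes\n",
               (unsigned long long)rl.rlim_cur);
    rl.rlim_cur = (rlim_t)1610612736 * 1024;     /* ~1.5 TB, matching "ulimit -v 1610612736" */
    if (setrlimit(RLIMIT_AS, &rl) != 0)          /* fails if this exceeds the hard limit */
        perror("setrlimit");
    return 0;
}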
Is there some sort of per-user quota, limiting the amount of memory available to a user process?
My guess is that the kernel is having difficulty allocating the memory it needs to keep up with this memory mapping. I don't know how swapped-out pages are tracked in the Linux kernel (and I assume that most of the file would be in the swapped-out state most of the time), but it may end up needing an entry in a table for each page of memory that the file takes up. Since this file might be mmapped by more than one process, the kernel has to keep up with the mapping from each process's point of view, which maps to another point of view, which maps to secondary storage (and includes fields for device and location).
This would fit into your addressable space, but might not fit (at least contiguously) within physical memory.
If anyone knows more about how Linux does this I'd be interested to hear about it.