mmap() an entire large file - C

I am trying to "mmap" a binary file (~ 8Gb) using the following code (test.c).
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#define handle_error(msg) \
    do { perror(msg); exit(EXIT_FAILURE); } while (0)

int main(int argc, char *argv[])
{
    const char *memblock;
    int fd;
    struct stat sb;

    fd = open(argv[1], O_RDONLY);
    fstat(fd, &sb);
    printf("Size: %lu\n", (uint64_t)sb.st_size);

    memblock = mmap(NULL, sb.st_size, PROT_WRITE, MAP_PRIVATE, fd, 0);
    if (memblock == MAP_FAILED) handle_error("mmap");

    for (uint64_t i = 0; i < 10; i++)
    {
        printf("[%lu]=%X ", i, memblock[i]);
    }
    printf("\n");
    return 0;
}
test.c is compiled with gcc -std=c99 test.c -o test, and file test returns: test: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked (uses shared libs), for GNU/Linux 2.6.15, not stripped
Although this works fine for small files, I get a segmentation fault when I try to load a big one. The program actually returns:
Size: 8274324021
mmap: Cannot allocate memory
I managed to map the whole file using boost::iostreams::mapped_file but I want to do it using C and system calls. What is wrong with my code?

MAP_PRIVATE mappings require a memory reservation, as writing to these pages may result in copy-on-write allocations. This means that you can't map something too much larger than your physical ram + swap. Try using a MAP_SHARED mapping instead. This means that writes to the mapping will be reflected on disk - as such, the kernel knows it can always free up memory by doing writeback, so it won't limit you.
I also note that you're mapping with PROT_WRITE, but you then go on and read from the memory mapping. You also opened the file with O_RDONLY - this itself may be another problem for you; you must specify O_RDWR if you want to use PROT_WRITE with MAP_SHARED.
As for PROT_WRITE only, this happens to work on x86, because x86 doesn't support write-only mappings, but may cause segfaults on other platforms. Request PROT_READ|PROT_WRITE - or, if you only need to read, PROT_READ.
On my system (VPS with 676MB RAM, 256MB swap), I reproduced your problem; changing to MAP_SHARED results in an EPERM error (since I'm not allowed to write to the backing file opened with O_RDONLY). Changing to PROT_READ and MAP_SHARED allows the mapping to succeed.
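For read-only access, a minimal corrected sketch of the question's open/mmap calls (reusing fd, sb, memblock and handle_error from test.c) would be:
fd = open(argv[1], O_RDONLY);
if (fd == -1) handle_error("open");
if (fstat(fd, &sb) == -1) handle_error("fstat");
memblock = mmap(NULL, sb.st_size, PROT_READ, MAP_SHARED, fd, 0);
if (memblock == MAP_FAILED) handle_error("mmap");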
If you need to modify bytes in the file, one option would be to make private just the ranges of the file you're going to write to. That is, munmap and remap with MAP_PRIVATE the areas you intend to write to. Of course, if you intend to write to the entire file then you need 8GB of memory to do so.
Alternately, you can write 1 to /proc/sys/vm/overcommit_memory. This will allow the mapping request to succeed; however, keep in mind that if you actually try to use the full 8GB of COW memory, your program (or some other program!) will be killed by the OOM killer.
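If you want to flip that switch from inside a program rather than from the shell, a minimal sketch (this just writes the character '1' to the sysctl file; it needs <fcntl.h>/<unistd.h> and root privileges):
/* sketch: enable unconditional overcommit (mode 1); requires root */
int ofd = open("/proc/sys/vm/overcommit_memory", O_WRONLY);
if (ofd != -1) {
    write(ofd, "1", 1);
    close(ofd);
}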

Linux (and apparently a few other UNIX systems) have the MAP_NORESERVE flag for mmap(2), which can be used to explicitly enable swap space overcommitting. This can be useful when you wish to map a file larger than the amount of free memory available on your system.
This is particularly handy when you use MAP_PRIVATE and only intend to write to a small portion of the memory-mapped range, since this would otherwise trigger swap space reservation for the entire file (or cause mmap to fail with ENOMEM, if system-wide overcommitting hasn't been enabled and you exceed the free memory of the system).
The issue to watch out for is that if you do write to a large portion of this memory, the lazy swap space reservation may cause your application to consume all the free RAM and swap on the system, eventually triggering the OOM killer (Linux) or causing your app to receive a SIGSEGV.
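A hedged sketch of such a mapping, reusing fd and sb from the question's code but with a non-const pointer p so it can be written through:
/* private, lazily-reserved mapping: writes stay in memory as COW pages and
   are never written back to the file; no up-front swap reservation */
char *p = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE,
               MAP_PRIVATE | MAP_NORESERVE, fd, 0);
if (p == MAP_FAILED) handle_error("mmap");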

You don't have enough virtual memory to handle that mapping.
As an example, I have a machine here with 8G RAM, and ~8G swap (so 16G total virtual memory available).
If I run your code on a VirtualBox snapshot that is ~8G, it works fine:
$ ls -lh /media/vms/.../snap.vdi
-rw------- 1 me users 9.2G Aug 6 16:02 /media/vms/.../snap.vdi
$ ./a.out /media/vms/.../snap.vdi
Size: 9820000256
[0]=3C [1]=3C [2]=3C [3]=20 [4]=4F [5]=72 [6]=61 [7]=63 [8]=6C [9]=65
Now, if I drop the swap, I'm left with 8G total memory. (Don't run this on an active server.) And the result is:
$ sudo swapoff -a
$ ./a.out /media/vms/.../snap.vdi
Size: 9820000256
mmap: Cannot allocate memory
So make sure you have enough virtual memory to hold that mapping (even if you only touch a few pages in that file).

Related

Why does the page fault not cause the thread to finish its execution later?

I have the below code where I'm intentionally creating a page fault in one of the threads in file.c
util.c
#include "util.h"
// to use as a fence() instruction
extern inline __attribute__((always_inline))
CYCLES rdtscp(void) {
CYCLES cycles;
asm volatile ("rdtscp" : "=a" (cycles));
return cycles;
}
// initialize address
void init_ram_address(char* FILE_NAME){
char *filename = FILE_NAME;
int fd = open(filename, O_RDWR);
if(fd == -1) {
printf("Could not open file .\n");
exit(0);
}
void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_POPULATE, fd, 0);
ram_address = (int *) file_address;
}
// initialize address
void init_disk_address(char* FILE_NAME){
char *filename = FILE_NAME;
int fd = open(filename, O_RDWR);
if(fd == -1) {
printf("Could not open file .\n");
exit(0);
}
void *file_address = mmap(NULL, DEFAULT_FILE_SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
disk_address = (int *) file_address;
}
file.c
#include "util.h"
void *f1();
void *f2();
pthread_barrier_t barrier;
pthread_mutex_t mutex;
int main(int argc, char **argv)
{
pthread_t t1, t2;
// in ram
init_ram_address(RAM_FILE_NAME);
// in disk
init_disk_address(DISK_FILE_NAME);
pthread_create(&t1, NULL, &f1, NULL);
pthread_create(&t2, NULL, &f2, NULL);
pthread_join(t1, NULL);
pthread_join(t2, NULL);
return 0;
}
void *f1()
{
rdtscp();
int load = *(ram_address);
rdtscp();
printf("Expecting this to be run first.\n");
}
void *f2()
{
rdtscp();
int load = *(disk_address);
rdtscp();
printf("Expecting this to be run second.\n");
}
I've used rdtscp() in the above code for fencing purposes (to ensure that the print statement gets executed only after the load operation is done).
Since t2 will incur a page fault, I expect t1 to finish executing its print statement first.
To run both the threads on the same core, I run taskset -c 10 ./file.
I see that t2 prints its statement before t1. What could be the reason for this?
I think you're expecting t2's int load = *(disk_address); to cause a context switch to t1, and since you're pinning everything to the same CPU core, that would give t1 time to win the race to take the lock for stdout.
A soft page fault doesn't need to context-switch, just update the page tables with a file page from the pagecache. Even though the mapping is backed by a disk file (not anonymous memory or copy-on-write tricks), if the file has been read or written recently it will be hot in the pagecache and not require I/O (which would make it a hard page fault).
Maybe try evicting disk cache before a test run, like with echo 3 | sudo tee /proc/sys/vm/drop_caches if this is Linux, so access to the mmap region without MAP_POPULATE will be a hard page fault (requiring I/O).
(See https://unix.stackexchange.com/questions/17936/setting-proc-sys-vm-drop-caches-to-clear-cache; sync first, at least on the disk file, if it was recently written, to make sure its pages are clean and able to be evicted aka dropped. Dropping caches is mainly useful for benchmarking.)
Or programmatically, you can hint the kernel with the madvise(2) system call, like madvise(MADV_DONTNEED) on a page, encouraging it to evict it from pagecache soon. (Or at least hint that your process doesn't need it; other processes might keep it hot).
In Linux kernel 5.4 and later, MADV_COLD works as a hint to evict the specified page(s) on memory pressure. ("Deactivate" probably means remove from HW page tables, so next access will at least be a soft page fault.) Or MADV_PAGEOUT is apparently supposed to get the kernel to reclaim the specified page(s) right away, I guess before the system call returns. After that, the next access should be a hard page fault.
MADV_COLD (since Linux 5.4)
Deactivate a given range of pages. This will make the pages a more probable reclaim target should there be a memory pressure. This is a nondestructive operation. The advice might be ignored for some pages in the range when it is not applicable.
MADV_PAGEOUT (since Linux 5.4)
Reclaim a given range of pages. This is done to free up memory occupied by these pages. If a page is anonymous, it will be swapped out. If a page is file-backed and dirty, it will be written back to the backing storage. The advice might be ignored for some pages in the range when it is not applicable.
These madvise args are Linux-specific. The madvise system call itself (as opposed to posix_madvise) is not guaranteed portable, but the man page gives the impression that some other systems have their own madvise system calls supporting some standard "advice" hints to the kernel.
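As a hedged sketch, assuming disk_address still points at the start of the mapping created in init_disk_address (so it is page-aligned), you could try something like this before starting the threads:
/* ask the kernel to reclaim the first page of the disk-backed mapping,
   so the next access should be a hard page fault; MADV_PAGEOUT needs
   Linux 5.4+ and fails with EINVAL on older kernels */
long pagesz = sysconf(_SC_PAGESIZE);
if (madvise((void *)disk_address, pagesz, MADV_PAGEOUT) == -1)
    perror("madvise");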
You haven't shown the declaration of ram_address or disk_address.
If it's not a pointer-to-volatile like volatile int *disk_address, the loads may be optimized away at compile time. Writes to non-escaped local vars like int load don't have to respect "memory" clobbers because nothing else could possibly have a reference to them.
If you compiled without optimization or something, then yes the load will still happen even without volatile.
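For example, assuming the pointers are globals shared via util.h (the question doesn't show their declarations), the volatile versions might look like:
/* util.h -- declarations (assumed; not shown in the question) */
extern volatile int *ram_address;
extern volatile int *disk_address;

/* util.c -- definitions */
volatile int *ram_address;
volatile int *disk_address;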

Memcpy Complete After Segfault

I have a PCIe endpoint (EP) device connected to the host. The EP's 512MB BAR is mmapped and memcpy is used to transfer data. The memcpy is quite slow (~2.5s). When I map only 100 bytes of the BAR but run memcpy for the full 512MB, I get a segfault within 0.5s; however, when reading back the end of the BAR, it shows the correct data, the same as if I had mmapped the whole BAR space.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
Code to map the whole BAR (takes 2.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 536870912, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);

memcpy(map_base, rand_base, 536870912);

if (munmap(map_base, map_size) == -1)
{
    PRINT_ERROR;
}
close(fd);
Code to map only 100 bytes (takes 0.5s):
fd = open(filename, O_RDWR | O_SYNC);
map_base = mmap(NULL, 100, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

int rand_fd = open(infile, O_RDONLY);
rand_base = mmap(0, 536870912, PROT_READ, MAP_SHARED, rand_fd, 0);

memcpy(map_base, rand_base, 536870912);

if (munmap(map_base, map_size) == -1)
{
    PRINT_ERROR;
}
close(fd);
To check the written data, I am using pcimem
https://github.com/billfarrow/pcimem
Edit: I was being dumb. While consistent data was being 'written' after the segfault, it was not the data that it should have been. Therefore my conclusion that memcpy was completing after the segfault was false. I am accepting the answer as it provided me useful information.
Assuming filename is just an ordinary file (to save the data), leave off O_SYNC. It will just slow things down [possibly, a lot].
When opening the BAR device, consider using O_DIRECT. This may minimize caching effects. That is, if the BAR device does its own caching, eliminate caching by the kernel, if possible.
How is the data being written and why is it so much faster than doing it the correct way (without the segfault)?
The "short" mmap/read is not working. The extra data comes from the prior "full" mapping. So, your test isn't valid.
To ensure consistent results, do unlink on the output file. Do open with O_CREAT. Then, use ftruncate to extend the file to the full size.
Here is some code to try:
#define SIZE (512 * 1024 * 1024)

// remove the output file
unlink(filename);

// open output file (create it)
int ofile_fd = open(filename, O_RDWR | O_CREAT, 0644);

// prevent segfault by providing space in the file
ftruncate(ofile_fd, SIZE);

map_base = mmap(NULL, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, ofile_fd, 0);

// use O_DIRECT to minimize caching effects when accessing the BAR device
#if 0
int rand_fd = open(infile, O_RDONLY);
#else
int rand_fd = open(infile, O_RDONLY | O_DIRECT);
#endif

rand_base = mmap(0, SIZE, PROT_READ, MAP_SHARED, rand_fd, 0);

memcpy(map_base, rand_base, SIZE);

if (munmap(map_base, SIZE) == -1) {
    PRINT_ERROR;
}

// close the output file
close(ofile_fd);
Depending upon the characteristics of the BAR device, to minimize the number of PCIe read/fetch/transaction requests, it may be helpful to ensure that it is being accessed as 32 bit (or 64 bit) elements.
Does the BAR space allow/support/encourage access as "ordinary" memory?
Usually, memcpy is smart enough to switch to "wide" memory access automatically (if memory addresses are aligned, which they are here). That is, memcpy will automatically use 64 bit fetches, with movq or possibly by using some XMM instructions, such as movdqa.
It would help to know exactly which BAR device(s) you have. The datasheet/appnote should give enough information.
UPDATE:
Thanks for the sample code. Unfortunately, aarch64-gcc gives 'O_DIRECT undeclared' for some reason. Without using that flag, the speed is the same as my original code.
Add #define _GNU_SOURCE above any #include to resolve O_DIRECT
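That is, at the very top of the source file:
#define _GNU_SOURCE   /* must come before any #include so <fcntl.h> exposes O_DIRECT */
#include <fcntl.h>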
The PCIe device is an FPGA that we are developing. The bitstream is currently the Xilinx DMA example code. The BAR is just 512MB of memory for the system to R/W to.
Serendipitously, my answer was based on my experience with access to the BAR space of a Xilinx FPGA device (it's been a while, circa 2010).
When we were diagnosing speed issues, we used a PCIe bus analyzer. This can show the byte width of the bus requests the CPU has requested. It also shows the turnaround time (e.g. Bus read request time until data packet from device is returned).
We also had to adjust the parameters in the PCIe config registers (e.g. transfer size, transaction replay) for the device/BAR. This was trial and error, and we (I) tried some 27 different combinations before deciding on the optimum config.
On an unrelated arm system (e.g. nVidia Jetson) about 3 years ago, I had to do memcpy to/from the GPU memory. It may have just been the particular cross-compiler I was using, but the disassembly of memcpy showed that it only used bytewide transfers. That is, it wasn't as smart as its x86 counterpart. I wrote/rewrote a version that used unsigned long long [and/or unsigned __int128] transfers. This sped things up considerably. See below.
So, you may wish to disassemble the generated memcpy code. Either the library function and/or code that it may inline into your function.
Just a thought ... If you're just wanting a bulk transfer, you may wish to have the device driver for the device program the DMA engine on the FPGA. This might be handled more effectively with a custom ioctl call to the device driver that accepts a custom struct describing the desired transfer (vs. read or mmap from userspace).
Are you writing a custom device driver for the device? Or, are you just using some generic device driver?
Here's what I had to do to get a fast memcpy on arm. It generates ldp/stp asm instructions.
// qcpy.c -- fast memcpy
#include <string.h>
#include <stddef.h>

#ifndef OPT_QMEMCPY
#define OPT_QMEMCPY 128
#endif

#ifndef OPT_QCPYIDX
#define OPT_QCPYIDX 1
#endif

// atomic type for qmemcpy
#if OPT_QMEMCPY == 32
typedef unsigned int qmemcpy_t;
#elif OPT_QMEMCPY == 64
typedef unsigned long long qmemcpy_t;
#elif OPT_QMEMCPY == 128
typedef unsigned __int128 qmemcpy_t;
#else
#error qmemcpy.c: unknown/unsupported OPT_QMEMCPY
#endif

typedef qmemcpy_t *qmemcpy_p;
typedef const qmemcpy_t *qmemcpy_pc;

// _qmemcpy -- fast memcpy
// RETURNS: number of bytes transferred
size_t
_qmemcpy(qmemcpy_p dst, qmemcpy_pc src, size_t size)
{
    size_t cnt;
    size_t idx;

    cnt = size / sizeof(qmemcpy_t);
    size = cnt * sizeof(qmemcpy_t);

    if (OPT_QCPYIDX) {
        for (idx = 0; idx < cnt; ++idx)
            dst[idx] = src[idx];
    }
    else {
        for (; cnt > 0; --cnt, ++dst, ++src)
            *dst = *src;
    }

    return size;
}

// qmemcpy -- fast memcpy
void
qmemcpy(void *dst, const void *src, size_t size)
{
    size_t xlen;

    // use fast memcpy for aligned size
    if (OPT_QMEMCPY > 0) {
        xlen = _qmemcpy(dst, src, size);
        src += xlen;
        dst += xlen;
        size -= xlen;
    }

    // copy remainder with ordinary memcpy
    if (size > 0)
        memcpy(dst, src, size);
}
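Usage is a drop-in replacement for memcpy on the mapped regions, e.g. with the buffers from the earlier snippet:
qmemcpy(map_base, rand_base, SIZE);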
UPDATE #2:
Speaking of serendipity, I am using a Jetson Orin. That is very interesting about the byte-wise behavior.
Just a thought ... If you have a Jetson in the same system as the FPGA, you might get DMA action by judicious use of CUDA.
Due to requirements, I cannot use any custom kernel modules so I am trying to do it all in userspace.
That is a harsh mistress to serve ... With custom H/W, it is almost axiomatic that you can have a custom device driver. So, the requirement sounds like a marketing/executive one rather than a technical one. If it's something like not being able to ship a .ko file because you don't know the target kernel version, it is possible to ship the driver as a .o and defer the .ko creation to the install script.
We want to use the DMA engine, but I am hiking up the learning curve on this one. We are using DMA in the FPGA, but I thought that as long as we could write to the address specified in the dtb, that meant the DMA engine was set up and working. Now I'm wondering if I have completely misunderstood that part.
You probably will not get DMA doing that. If you start the memcpy, how does the DMA engine know the transfer length?
You might have better luck using read/write vs mmap to get DMA going, depending upon the driver.
But, if it were me, I'd keep the custom driver option open:
If you have to tweak/modify the BAR config registers on driver/system startup, I can't recall if it's even possible to map the config registers to userspace.
When doing mmap, the device may be treated as the "backing store" for the mapping. That is, there is still an extra layer of kernel buffering [just like there is when mapping an ordinary file]. The device memory is only updated periodically from the kernel [buffer] memory.
A custom driver can set up a [guaranteed] direct mapping, using some trickery that only the kernel/driver has access to.
Historical note:
When I last worked with the Xilinx FPGA (12 years ago), the firmware loader utility (provided by Xilinx in both binary and source form) would read in bytes from the firmware/microcode .xsvf file using (e.g.) fscanf(fi, "%c", &myint) to get the bytes.
This was horrible. I refactored the utility to fix that and the processing of the state machine and reduced the load time from 15 minutes to 45 seconds.
Hopefully, Xilinx has fixed the utility by now.

Not able to write in /dev/mem

The issue that I am experiencing is not related to the open() or mmap() functions, which execute properly. I have disabled CONFIG_STRICT_DEVMEM in the kernel so I can read from /dev/mem without problems. Actually, I can do the following:
const char *path = "/dev/mem";
int fd = open(path, O_RDWR);                                           /* read and write flags */
p = mmap(0, SIZE, PROT_READ | PROT_WRITE, MAP_SHARED, fd, BASE_ADDR);  /* read and write flags */
And the code does not fail. Nonetheless, I am using this code to write in the PCI address space. So, basically the BASE_ADDR is 0xc000000, and the size is 256 MiB (0x10000000, all the PCI address space).
Said that, when I try to write on these positions (with a specific offset, BDF format), nothing is written; again the code does not fail, it just does not write anything.
In case my code was wrong, I tried BusyBox, with the following parameters:
[horro# ~]$ sudo busybox devmem 0xc00b0a8c w 0xffffffff
[horro# ~]$ sudo busybox devmem 0xc00b0a8c
0x00000000
So, basically it is not writing anything.
There is a CONFIG_STRICT_DEVMEM kernel config option. My understanding is that it must be set at compile-time as CONFIG_STRICT_DEVMEM=n. It exists for security reasons.

Copying files using memory map

I want to implement an effective file copying technique in C for my process, which runs on a BSD OS. As of now the functionality is implemented using a read/write technique. I am trying to optimize it by using a memory-mapped file copying technique.
Basically I will fork a process which mmaps both the src and dst files and does a memcpy() of the specified bytes from src to dst. The process exits after the memcpy() returns. Is msync() required here? When I actually called msync() with the MS_SYNC flag, the function took a lot of time to return, and the same behavior is seen with the MS_ASYNC flag as well.
i) So to summarize: is it safe to avoid msync()?
ii) Is there any other better way of copying files in BSD? BSD does not seem to support sendfile() or splice(); are there any other equivalents?
iii) Is there any simple method for implementing our own zero-copy-like technique for this requirement?
My code
/* mmcopy.c

   Copy the contents of one file to another file, using memory mappings.

   Usage: mmcopy source-file dest-file
*/
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"

int
main(int argc, char *argv[])
{
    char *src, *dst;
    int fdSrc, fdDst;
    struct stat sb;

    if (argc != 3)
        usageErr("%s source-file dest-file\n", argv[0]);

    fdSrc = open(argv[1], O_RDONLY);
    if (fdSrc == -1)
        errExit("open");

    /* Use fstat() to obtain size of file: we use this to specify the
       size of the two mappings */

    if (fstat(fdSrc, &sb) == -1)
        errExit("fstat");

    /* Handle zero-length file specially, since specifying a size of
       zero to mmap() will fail with the error EINVAL */

    if (sb.st_size == 0)
        exit(EXIT_SUCCESS);

    src = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fdSrc, 0);
    if (src == MAP_FAILED)
        errExit("mmap");

    fdDst = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
    if (fdDst == -1)
        errExit("open");

    if (ftruncate(fdDst, sb.st_size) == -1)
        errExit("ftruncate");

    dst = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fdDst, 0);
    if (dst == MAP_FAILED)
        errExit("mmap");

    memcpy(dst, src, sb.st_size);       /* Copy bytes between mappings */

    if (msync(dst, sb.st_size, MS_SYNC) == -1)
        errExit("msync");
    exit(EXIT_SUCCESS);
}
Short answer: msync() is not required.
If you do not call msync(), the operating system flushes the dirty memory-mapped pages to disk in the background after the process has terminated. This is reliable on any POSIX-compliant operating system.
To answer the secondary questions:
Typically the method of copying a file on any POSIX-compliant operating system (such as BSD) is to use open() / read() / write() and a buffer of some size (16kb, 32kb, or 64kb, for example). Read data into buffer from src, write data from buffer into dest. Repeat until read(src_fd) returns 0 bytes (EOF).
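A minimal sketch of that loop, reusing fdSrc/fdDst and errExit() from the question's program (error handling trimmed, and the 64kb buffer size is an arbitrary choice):
char buf[64 * 1024];
ssize_t numRead;

while ((numRead = read(fdSrc, buf, sizeof(buf))) > 0) {
    if (write(fdDst, buf, numRead) != numRead)
        errExit("write");       /* treat a short write as an error here */
}
if (numRead == -1)
    errExit("read");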
However, depending on your goals, using mmap() to copy a file in this fashion is probably a perfectly viable solution, so long as the files being copied are relatively small (relative to the expected memory constraints of your target hardware and your application). The mmap copy operation will require roughly twice the file's size in physical memory. So if you're trying to copy a file that's 8MB, your application will use 16MB to perform the copy. If you expect to be working with even larger files then that duplication could become very costly.
So does using mmap() have other advantages? Actually, no.
The OS will often be much slower about flushing mmap pages than writing data directly to a file using write(). This is because the OS will intentionally prioritize other things ahead of page flushes so to keep the system 'responsive' for foreground tasks/apps.
During the time the mmap pages are being flushed to disk (in the background), the chance of sudden loss of power to the system will cause loss of data. Of course this can happen when using write() as well but if write() finishes faster then there's less chance for unexpected interruption.
The long delay you observe when calling msync() is roughly the time it takes the OS to flush your copied file to disk. When you don't call msync(), that flush happens in the background instead (and also takes even longer for that reason).

Why does fopen/fgets use both mmap and read system calls to access the data?

I have a small example program which simply fopens a file and uses fgets to read it. Using strace, I notice that the first call to fgets runs a mmap system call, and then read system calls are used to actually read the contents of the file. On fclose, the region is munmapped. If I instead open and read the file with open/read directly, this obviously does not occur. I'm curious as to what the purpose of this mmap is, and what it is accomplishing.
On my Linux 2.6.31 based system, when under heavy virtual memory demand these mmaps will sometimes hang for several seconds, and appear to me to be unnecessary.
The example code:
#include <stdlib.h>
#include <stdio.h>

int main ()
{
    FILE *f;
    if ( NULL == ( f=fopen( "foo.txt","r" )))
    {
        printf ("Fail to open\n");
    }
    char buf[256];
    fgets(buf,256,f);
    fclose(f);
}
And here is the relevant strace output when the above code is run:
open("foo.txt", O_RDONLY) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=9, ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb8039000
read(3, "foo\nbar\n\n"..., 4096) = 9
close(3) = 0
munmap(0xb8039000, 4096) = 0
It's not the file that is mmap'ed - in this case mmap is used anonymously (not on a file), probably to allocate memory for the buffer that the subsequent reads will use.
malloc in fact results in such a call to mmap. Similarly, the munmap corresponds to a call to free.
The mmap is not mapping the file; instead it's allocating memory for the stdio FILE buffering. Normally malloc would not use mmap to service such a small allocation, but it seems glibc's stdio implementation is using mmap directly to get the buffer. This is probably to ensure it's page-aligned (though posix_memalign could achieve the same thing) and/or to make sure closing the file returns the buffer memory to the kernel. I question the usefulness of page-aligning the buffer. Presumably it's for performance, but I can't see any way it would help unless the file offset you're reading from is also page-aligned, and even then it seems like a dubious micro-optimization.
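If that buffer allocation matters for your workload, one standard workaround (not something the above requires, just a stdio hook) is to hand the stream your own buffer with setvbuf() before the first read, which ought to make the library's own allocation unnecessary; a minimal sketch:
static char iobuf[BUFSIZ];

FILE *f = fopen("foo.txt", "r");
if (f != NULL)
    setvbuf(f, iobuf, _IOFBF, sizeof(iobuf));   /* must be called before any I/O on f */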
From what I have read, memory mapping functions are useful when handling large files (though I have no precise definition of "large"); for large files they are significantly faster compared to 'buffered' I/O calls.
In the example that you have posted, I think the file is opened by the open() function and mmap is used for allocating memory or something else.
From the syntax of the mmap function this can be seen clearly:
void *mmap(void *addr, size_t len, int prot, int flags, int fildes, off_t off);
The second-to-last parameter takes the file descriptor, which should be non-negative for a file mapping, while in the strace output it is -1 (i.e. an anonymous mapping).
The source code of fopen in glibc shows that mmap can indeed be used:
https://sourceware.org/git/?p=glibc.git;a=blob;f=libio/iofopen.c;h=965d21cd978f3acb25ca23152993d9cac9f120e3;hb=HEAD#l36
