Using memory mapping in C to read binary files

While processing a very large binary file, can using memory mapping in C make any difference compared to fread? Even a small difference in time would be fine. And if it does make the process faster, any idea how to use memory mapping on a large binary file and extract data from it?
Thanks!!

If you're going to read the entire file beginning to end, the most important thing is to let the platform know this. This allows it to do aggressive read-ahead, and it allows it to avoid polluting the cache with data that will not be read again anyway. You can do this either with memory mapping or without it. The key functions are posix_fadvise and posix_madvise.
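For the non-mmap route, here is a minimal sketch (assuming a POSIX system, with error handling kept short) of how those advice calls might be used for a front-to-back read:

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the kernel we will read the whole file front to back
       (offset 0, length 0 means "to the end of the file"). */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[1 << 16];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* ... process buf[0..n) here ... */
    }

    /* We won't touch this data again: let the kernel drop it from the
       page cache instead of letting it crowd out more useful pages. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    close(fd);
    return 0;
}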
Memory mapping is a huge win when you have random, small accesses. This is especially true when you have multiple writes to the same page. Without memory mapping, each read or write requires a user/kernel transition and a copy. With memory mapping, most operations don't.
But with sequential access, all it will save is the copy. Oddly, the user/kernel transitions may be even worse. With large sequential reads, you get one user/kernel transition per read, which could be one per 256KB if the reads are large. With large sequential access to a memory mapped file, you may fault on every page (4KB). It depends on the kernel's "fault ahead" optimizations.
However, with memory mapping, you will save the copy, assuming you don't need to do the copy anyway. If you have to copy out of the mapped pages for any reason, then you might as well let a read operation copy them into place for you. However, if you can operate on the data in place, memory mapping may be a win.
It generally doesn't make as much of a difference as people tend to think it does. Especially when you think about how slow the disk is in comparison to all this stuff.
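As for the "how" part of the question, here is a minimal sketch (assuming a POSIX system; the byte-sum loop is just a stand-in for whatever extraction you actually need) of mapping a large binary file read-only and scanning it in place:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s FILE\n", argv[0]);
        return 1;
    }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    /* Map the whole file read-only; nothing is read from disk yet. */
    const unsigned char *data = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); return 1; }

    /* Hint that the access pattern will be sequential. */
    posix_madvise((void *)data, st.st_size, POSIX_MADV_SEQUENTIAL);

    /* Touch every byte once; pages are faulted in lazily as we go. */
    uint64_t sum = 0;
    for (off_t i = 0; i < st.st_size; i++)
        sum += data[i];
    printf("byte sum: %llu\n", (unsigned long long)sum);

    munmap((void *)data, st.st_size);
    close(fd);
    return 0;
}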

Related

why mmap is faster than traditional file io

I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If so, why is it faster?
1) mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does.
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
I heard (read it on the internet somewhere) that mmap() is faster than sequential IO. Is this correct? If so, why is it faster?
It can be - there are pros and cons, listed below. When you really have reason to care, always benchmark both.
Quite apart from the actual IO efficiency, there are implications for the way the application code tracks when it needs to do the I/O, and does data processing/generation, that can sometimes impact performance quite dramatically.
1) mmap() is not reading sequentially.
2) mmap() has to fetch from the disk itself same as read() does
3) The mapped area is not sequential - so no DMA (?).
So mmap() should actually be slower than read() from a file? Which of my assumptions above are wrong?
1) is wrong... mmap() assigns a region of virtual address space corresponding to the file content... whenever a page in that address space is accessed, physical RAM is found to back the virtual addresses and the corresponding disk content is faulted into that RAM. So, the order in which reads are done from the disk matches the order of access. It's a "lazy" I/O mechanism. If, for example, you needed to index into a huge hash table that was to be read from disk, then mmaping the file and starting to do accesses means the disk I/O is not done sequentially and may therefore result in a longer elapsed time until the entire file is read into memory, but while that's happening lookups are succeeding and dependent work can be undertaken, and if parts of the file are never actually needed they're not read at all (allow for the granularity of disk and memory pages). Further, even when using memory mapping, many OSes allow you to specify performance-enhancing / memory-efficiency hints about your planned access patterns, so they can proactively read ahead or release memory more aggressively knowing you're unlikely to return to it.
2) is absolutely true.
"The mapped area is not sequential" is vague. Memory mapped regions are "contiguous" (sequential) in virtual address space. We've discussed disk I/O being sequential above. Or, are you thinking of something else? Anyway, while pages are being faulted in, they may indeed be transferred using DMA.
Further, there are other reasons why memory mapping may outperform usual I/O:
there's less copying:
often OS & library level routines pass data through one or more buffers before it reaches an application-specified buffer, the application then dynamically allocates storage, then copies from the I/O buffer to that storage so the data's usable after the file reading completes
memory mapping allows (but doesn't force) in-place usage (you can just record a pointer and possibly a length) - see the sketch after this list
continuing to access data in-place risks increased cache misses and/or swapping later: the file/memory-map could be more verbose than data structures into which it could be parsed, so access patterns on data therein could have more delays to fault in more memory pages
memory mapping can simplify the application's parsing job by letting the application treat the entire file content as accessible, rather than worrying about when to read another buffer full
the application defers more to the OS's wisdom re number of pages that are in physical RAM at any single point in time, effectively sharing a direct-access disk cache with the application
as well-wisher comments below, "using memory mapping you typically use less system calls"
if multiple processes are accessing the same file, they should be able to share the physical backing pages
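To illustrate the "record a pointer and possibly a length" point, here is a small sketch; the newline-delimited record format and the names (record_view, index_lines) are just assumptions for the example:

#include <stddef.h>
#include <string.h>

struct record_view {
    const char *start;  /* points straight into the mmap'd region, no copy */
    size_t      len;
};

/* Split a mapped region into line views; returns the number of views filled. */
size_t index_lines(const char *map, size_t size,
                   struct record_view *out, size_t max_out)
{
    size_t n = 0;
    const char *p = map, *end = map + size;
    while (p < end && n < max_out) {
        const char *nl = memchr(p, '\n', (size_t)(end - p));
        size_t len = nl ? (size_t)(nl - p) : (size_t)(end - p);
        out[n].start = p;   /* no allocation, no copy */
        out[n].len = len;
        n++;
        p += len + 1;       /* skip past the newline (or past the end) */
    }
    return n;
}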
There are also reasons why mmap may be slower - do read Linus Torvalds' post here, which says of mmap:
...page table games along with the fault (and even just TLB miss)
overhead is easily more than the cost of copying a page in a nice
streaming manner...
And from another of his posts:
quite noticeable setup and teardown costs. And I mean noticeable. It's things like following the page tables to unmap everything cleanly. It's the book-keeping for maintaining a list of all the mappings. It's the TLB flush needed after unmapping stuff.
page faulting is expensive. That's how the mapping gets populated, and it's quite slow.
Linux does have "hugepages" (so one TLB entry per 2MB, instead of per 4KB) and even Transparent Huge Pages, where the OS attempts to use them even if the application code wasn't written to explicitly utilise them.
FWIW, the last time this arose for me at work, memory mapped input was 80% faster than fread et al for reading binary database records into a proprietary database, on 64 bit Linux with ~170GB files.
mmap() can be shared between processes.
DMA will be used whenever possible. DMA does not require contiguous memory -- many high end cards support scatter-gather DMA.
The memory area may be shared with the kernel block cache if possible. So there is less copying.
Memory for mmap is allocated by kernel, it is always aligned.
"Faster" in absolute terms doesn't exist. You'd have to specify constraints and circumstances.
mmap() is not reading sequentially.
what makes you think that? If you really access the mapped memory sequentially, the system will usually fetch the pages in that order.
mmap() has to fetch from the disk itself same as read() does
sure, but the OS determines the time and buffer size
The mapped area is not sequential - so no DMA (?).
see above
What mmap helps with is that there is no extra user space buffer involved; the "read" takes place where the OS kernel sees fit, in chunks that can be optimized. This may be an advantage in speed, but first of all it is simply an interface that is easier to use.
If you want to know about speed for a particular setup (hardware, OS, use pattern) you'd have to measure.

Read mmapped data memory efficiently

I want to mmap a big file into memory and parse it sequentially. As I understand it, once bytes have been lazily read into memory, they stay there. Is there a way to periodically tell the system to release the previously read contents?
This understanding is only a very superficial view.
To understand what really happens you have to take into account the difference between the virtual memory of your process and the actual real memory of the machine. Mapping a huge file means reserving space in your virtual address space. Whether anything is already read at this point is probably platform-dependent.
When you actually access the data, the OS has to fill an actual page of memory. When you access other parts, those parts have to be brought into memory. It's completely up to the OS when it will re-use that memory. Normally this happens when some data is accessed by you or another process and no free memory is available, but it could happen at any time. If you access it again later it might still be in memory, or it will be brought back by the OS. There is no way for your process to tell the difference.
In short: You don't need to care about that. The OS manages all that in the background.
One point might be that if you map a really huge file, this takes up space in your virtual address space, which is limited. So if you deal with many huge mappings and/or huge allocations, you might want to only map parts of the file at a given time.
ADDITION: after thinking a bit about it, I came up with a reason why it might be smarter to do it blockwise-sequential. Although I doubt you will be able to measure that.
Any reasonable OS will look for a block to unload, when memory is needed, in something like the following order:
1. unmapped files (not needed anymore)
2. LRU unmodified mapped file (can be retrieved from disc)
3. LRU modified mapped file (same as 2. but needs to be updated on disc before unload)
4. LRU allocated memory (needs to be written to swap)
So by unmapping blocks known to never be used again as you go, you give the OS a hint that these should be freed earlier. This gives data that has been used less recently, but might be accessed in the future, a bigger chance of staying in memory.
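A sketch of that blockwise idea, under the assumption of a strictly front-to-back parse (the 16MB block size is arbitrary but page-aligned; if your records can span block boundaries you would need to keep the previous block around a little longer):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define BLOCK (16 * 1024 * 1024)   /* 16 MB, a multiple of the page size */

int main(int argc, char *argv[])
{
    if (argc < 2) { fprintf(stderr, "usage: %s FILE\n", argv[0]); return 1; }

    int fd = open(argv[1], O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    for (off_t off = 0; off < st.st_size; off += BLOCK) {
        size_t len = (size_t)(st.st_size - off) < (size_t)BLOCK
                         ? (size_t)(st.st_size - off) : (size_t)BLOCK;

        /* ... parse map[off .. off + len) here ... */

        /* Done with this block forever: unmap it so the OS can reclaim
           these pages before it touches anything still in use. */
        munmap(map + off, len);
    }

    close(fd);
    return 0;
}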

In a POSIX unix system, is it possible to mmap() a file in such a way that it will never be swapped out to disk in favor of other files?

If not using mmap(), it seems like there should be a way to give certain files "priority", so that the only time they're swapped out is for page faults trying to bring in, e.g., executing code, or memory that was malloc()'d by some process, but never other files. One can think of situations where this could be useful. Consider search engines, which should keep their index files in cache, but which may be simultaneously writing new files (not being used for search).
There are a few ways.
The best way is with madvise(), which allows you to inform the kernel that you will need a particular range of memory soon, which gives it priority over other memory. You can also use it to say that a particular range will not be needed soon, so it should be swapped out sooner.
The hack way is with mlock(), which forces a range of memory to stay in RAM. This is generally not a good idea, and should only be used in special cases. The most common case is to store passwords in RAM so that the password cannot be recovered from the swap file after the computer is powered off. I would not use mlock() for performance tuning unless I had exhausted other options.
The worst way is to constantly poke memory, forcing it to stay fresh.
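A sketch of the first two approaches (the file name index.dat is made up; mlock() is subject to RLIMIT_MEMLOCK and, as said above, usually the wrong tool for performance tuning):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("index.dat", O_RDONLY);   /* hypothetical index file */
    if (fd < 0) { perror("open"); return 1; }
    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) { perror("mmap"); return 1; }

    /* Preferred: hint that these pages will be needed soon, so the
       kernel prioritises them over other cached data. */
    posix_madvise(map, st.st_size, POSIX_MADV_WILLNEED);

    /* Heavy-handed alternative: pin the pages in RAM. */
    if (mlock(map, st.st_size) != 0)
        perror("mlock");   /* non-fatal; fall back to the hint above */

    /* ... serve lookups from map ... */

    munlock(map, st.st_size);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}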

reading data from filesystem vs compiling the data directly into program

I have a file (10-20MB) containing data, where each line is a single piece of data.
I have a C program that reads the file from the filesystem and then, based on command line input, reads each line of the file, does a calculation on each line to determine if that line should be returned, and then returns a subset of the data.
Assume that the program does an fread and reads the entire file into memory at the beginning, and then parses it directly from memory.
Would the program execute faster if, instead of reading it from the filesystem, I compiled the data into the program directly, by creating an array such as the following?
char *dataArray[] = {"data1", "data2", "data3"....};
Since the OS needs to read the entire binary from the filesystem, my gut feeling is that the execution time of both techniques would be similar, since reading from the filesystem would be the high order bit. However, would anyone have more definitive ideas on this?
Defining everything as a program literal will certainly be faster.
You do not need the relatively slow "open" call for the data file and you don't need to move the data from the buffer to your storage.
This was a common optimization circa 1970, and every programming/coding style book since then strongly recommends you do not do this. The actual performance increase is minimal, and what you gain in performance you lose in maintainability and flexibility.
Should you want a quick maintainable optimisation for this type of problem then look at the "mmap" call which makes the buffer directly available to your program and minimises data movement.
I doubt the difference in execution time will be significant, but from a memory utilization standpoint, putting the data in the executable (and qualifying it const appropriately) will make a big difference.
If you read 10-20 megs of data from a file into memory allocated (e.g. via malloc) in your program, the data initially exists in two places in memory: the filesystem cache, and your program's private memory. The former copy can be discarded if memory is tight, but the latter occupies physical memory or swap permanently until it's freed.
If on the other hand the 10-20 megs of data are part of your program's image (in the executable file), the data will be demand-paged, and can be discarded whenever needed because the OS knows it can reload the pages if it needs them again.
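To make the const point concrete, a sketch of what the compiled-in table could look like (the names mirror the question's example; the initializers are placeholders for data generated at build time):

#include <stddef.h>

/* const: the strings and the pointer table both end up in a read-only,
   demand-paged section of the executable rather than in writable memory. */
static const char *const dataArray[] = {
    "data1",
    "data2",
    "data3",
    /* ... generated from the data file at build time ... */
};
static const size_t dataCount = sizeof dataArray / sizeof dataArray[0];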

How to copy a ram-based file to disk efficiently

I want to copy a large ram-based file (located in the /dev/shm directory) to local disk. Is there some way to do an efficient copy instead of reading chars one by one or creating another piece of memory? I can only use the C language here. Is there any way I can put the memory file directly to disk? Thanks!
I would mmap() the files and do memcpy() between them.
Thank you guys for the help! I did it by mmap-ing the ram-based file and writing the entire block directly to the destination. memcpy was not used because I am actually writing to a parallel file system (pvfs), which does not support the mmap operation.
/dev/shm is shared memory, so one way to copy it would be to open it as shared memory, but frankly I don't think you will gain anything.
when writing your memory file to disk, the bottleneck will be the disk.
just be sure to write data in big chunks, and you should be fine.
You can just copy it like any other file:
cp /dev/shm/tmp ~/tmp
So, a quick, simple way is to issue a cp command via system().
You could try to see if the splice system call works for this. I'm not sure if it will, since it has some restrictions about the types of files it can work with, but if it did work you would call it repeatedly with page-sized (or some multiple of the page size) requests until it finished, and the kernel would handle it very efficiently.
If this doesn't work you'll need to do either mmap or do plain old read/write.
Reading and writing in memory-page-sized chunks makes things much more efficient. It can be even more efficient if your buffers are memory page aligned, since that opens up the opportunity for the kernel to just move the data to/from your process's memory via memory management trickery rather than actually copying the data around.
The only thing you can do is read() in page size aligned chunks. I'm assuming you need to guarantee the data as written, which is going to mean bypassing buffers via posix_fadvise() or using O_DIRECT (I typically use posix_fadvise(), but O_DIRECT is appropriate here).
In this case, the speed of the media being written to alone dictates how quickly this will happen.
If you don't need to bypass buffers, the operation will complete faster, but there's no guarantee that the data will actually be written in the event of a reboot / power outage / etc. Since the source of the data is in shared memory, I'm (again) guessing you want the write to be guaranteed.
The only thing you can optimize is how long it takes read() to get data from shared memory into your own address space, which page size aligned chunks will improve.
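Putting the suggestions above together, a rough sketch of the mmap-the-source, write-in-big-chunks approach (the paths are made up; there is no O_DIRECT or fsync here, so the data is not guaranteed to be on the media until the kernel flushes it):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)   /* 4 MB writes, a multiple of the page size */

int main(void)
{
    int in = open("/dev/shm/tmp", O_RDONLY);
    if (in < 0) { perror("open source"); return 1; }
    struct stat st;
    if (fstat(in, &st) < 0) { perror("fstat"); return 1; }

    char *src = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, in, 0);
    if (src == MAP_FAILED) { perror("mmap"); return 1; }

    int out = open("copy.bin", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (out < 0) { perror("open destination"); return 1; }

    for (off_t off = 0; off < st.st_size; ) {
        size_t want = (size_t)(st.st_size - off) < (size_t)CHUNK
                          ? (size_t)(st.st_size - off) : (size_t)CHUNK;
        /* Write straight out of the mapping: no intermediate user buffer. */
        ssize_t n = write(out, src + off, want);
        if (n < 0) { perror("write"); return 1; }
        off += n;   /* handles short writes by retrying the remainder */
    }

    close(out);
    munmap(src, st.st_size);
    close(in);
    return 0;
}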
