mmap() vs read() - C

I'm writing a bulk ID3 tag editor in C. ID3 tags are usually at the beginning of an MP3-encoded file, although older (version 1) tags are at the end. The app is designed to accept a directory and a frame ID list from the command line, then recurse the directory structure updating all the ID3 tags it finds. The user may additionally choose to remove all older (version 1) tags. Another option is to simply display the current tags, without performing an update. The directory might contain 2 files or 2 million.

If the user means to update the files, I was planning to load the entire file into memory, perform the updates, then save it (the file may be renamed as well). However, if the user only means to print the current ID3 tags, then loading the entire file seems excessive. After all, the file could be 200 MB.
I've read through this thread, which was insightful - mmap() vs. reading blocks
So my question is, what is the most efficient way to go about this -- read(), mmap() or some combination? Design ideas welcome.
Edit: It's my understanding that mmap essentially delegates loading a file into memory to the virtual memory subsystem. It seems to me the VMM would be highly optimized on most systems, as it's critical for system performance.

It really depends on what you're trying to do. If all you need to do is hop to a known offset and read out a small tag, read() may be faster (mmap() has to do some rather complex internal accounting). If you are planning on copying out all 200 MB of the MP3, however, or scanning it for a tag that may appear at an unknown offset, then mmap() is likely the faster approach.
For example, if you need to shift the entire file down a few hundred bytes in order to insert an ID3 tag, one simple approach would be to expand the file with ftruncate(), mmap the file, then memmove() the contents down a bit. This, however, will destroy the file if your program crashes while it's running. You could also copy the contents of the file into a new file - this is another place where mmap() really shines; you can simply mmap() the old file, then copy all of its data into the new file with a single write().
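For illustration, here is a minimal sketch of that expand-and-shift approach, assuming a fixed amount of padding for the new tag and with error handling abbreviated; as noted, it will corrupt the file if the program dies part-way through:

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Shift the whole file down by `pad` bytes to make room for a new
 * ID3v2 tag at the front.  Destructive if interrupted mid-way. */
int make_room_at_front(const char *path, size_t pad)
{
    int fd = open(path, O_RDWR);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0) { close(fd); return -1; }
    size_t oldsize = (size_t)st.st_size;

    /* Grow the file first so the mapping covers the new length. */
    if (ftruncate(fd, (off_t)(oldsize + pad)) < 0) { close(fd); return -1; }

    void *map = mmap(NULL, oldsize + pad, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) { close(fd); return -1; }

    /* Move the existing contents down; the first `pad` bytes are now
     * free for the new tag. */
    memmove((char *)map + pad, map, oldsize);

    munmap(map, oldsize + pad);
    close(fd);
    return 0;
}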
In short, mmap() is great if you're doing a large amount of IO in terms of total bytes transferred; this is because it reduces the number of copies needed, and can significantly reduce the number of kernel entries needed for reading cached data. However mmap() requires a minimum of two trips into the kernel (three if you clean up the mapping when you're done!) and does some complex internal kernel accounting, and so the fixed overhead can be high.
read() on the other hand involves an extra memory-to-memory copy, and can thus be inefficient for large I/O operations, but is simple, and so the fixed overhead is relatively low. In short, use mmap() for large bulk I/O, and read() or pread() for one-off, small I/Os.
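For the display-only path of your tag editor, a single small pread() at a known offset is all that's needed. A rough sketch, assuming the standard ID3v1 layout (a 128-byte tag at the very end of the file, starting with "TAG"):

#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Read the 128-byte ID3v1 tag from the end of the file, if present. */
int read_id3v1(const char *path, unsigned char tag[128])
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size < 128) { close(fd); return -1; }

    /* One small read at a known offset: no need to map or load the file. */
    ssize_t n = pread(fd, tag, 128, st.st_size - 128);
    close(fd);

    if (n != 128 || memcmp(tag, "TAG", 3) != 0)
        return -1;                      /* no ID3v1 tag present */
    return 0;
}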

Don't bother with mmap unless your code is CPU bound, specifically due to lots of small reads and writes. mmap may sound nice, but it isn't the awesome "why isn't everyone using this?" alternative it looks like.
Given that you're recursing through potentially large directory structures, your bottleneck will be directory IO and concurrency. mmap is not going to help.
Update0
Reading the linked question turns up this answer, which supports my experience:
mmap() vs. reading blocks

If you're not normally going to be streaming the file in and then processing it, but rather hopping around (e.g. reading the tags at the front and then jumping to the end), then I would use mmap, simply because your code will be cleaner and easier to maintain treating the file as a large buffer, without having to actually manage the buffering and paging yourself.
As has been mentioned, if you're processing a lot of data, disk I/O is likely going to dominate your processing anyway. mmap may be faster than read, but for reasonable implementations it's likely not THAT much faster, especially on today's hardware, which has continually gotten faster while disk drives have been stuck at 7,200 and 10,000 RPM for years.
So, go with mmap and make your code easy and neat.

I don't know whether standard POSIX functions fall within what you're allowed (or willing) to use for this development, but think about these two functions:
int ftruncate(int fildes, off_t length);
int truncate(const char *path, off_t length);
defined in unistd.h, which can be used to truncate a file to a specified length. In this way you could easily:
find where the ID3 tag frame begins (I don't know whether you can compute it easily just by reading the MP3 file's header, but I would guess so)
save the offset
close the file
truncate the file with the provided function
open the file in append binary mode and write new tags
I'm not sure about the performance - you should test this method - but it should load far less into RAM while providing a sensible way of doing it.
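For what it's worth, here is a rough sketch of that idea applied to the "remove ID3v1 tag" option, assuming the tag is the last 128 bytes of the file and starts with "TAG" (as said above, the performance would need testing):

#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Strip a trailing ID3v1 tag by truncating the last 128 bytes. */
int strip_id3v1(const char *path)
{
    struct stat st;
    if (stat(path, &st) < 0 || st.st_size < 128)
        return -1;

    char magic[3];
    FILE *f = fopen(path, "rb");
    if (!f) return -1;

    /* Check for the "TAG" marker 128 bytes before the end. */
    if (fseek(f, -128, SEEK_END) != 0 ||
        fread(magic, 1, 3, f) != 3) { fclose(f); return -1; }
    fclose(f);

    if (memcmp(magic, "TAG", 3) != 0)
        return 0;                        /* nothing to strip */

    /* Cut the file just before the tag. */
    return truncate(path, st.st_size - 128);
}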

Related

Reading a file of arbitrary length in C

What's the most idiomatic/efficient way to read a file of arbitrary length in C?
Get the filesize of the file in bytes and issue a single fread()
Keep fread()ing a constant size buffer until getting EOF
Anything else?
Avoid using any technique which requires knowing the size of the file in advance. That leaves exactly one technique: read the file a bit at a time, in blocks of a convenient size.
Here's why you don't want to try to find the filesize in advance:
If it is not a regular file, there may not be any way to tell. For example, you might be reading directly from a console, or taking piped input from a previous data generator. If your program requires the filesize to be knowable, these useful input mechanisms will not be available to your users, who will complain or choose a different tool.
Even if you can figure out the filesize, you have no way of preventing it from changing while you are reading the file. If you are not careful about how you read the file, you might open a vulnerability which could be exploited by adversarial programs.
For example, if you allocate a buffer of the "correct" size and then read until you get an end-of-file condition, you may end up overwriting random memory. (Multiple reads may be necessary if you use an interface like read() which might read less data than requested.) Or you might find that the file has been truncated; if you don't check the amount of data read, you might end up processing uninitialised memory leading to information leakage.
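Here is a minimal sketch of the block-at-a-time technique, growing the buffer as data arrives instead of sizing it from the file length (the 64 KB starting size is arbitrary):

#include <stdio.h>
#include <stdlib.h>

/* Read an entire stream in fixed-size blocks, growing the buffer as we go.
 * Works on pipes and consoles too, since it never asks for the size up front. */
char *slurp(FILE *in, size_t *out_len)
{
    size_t cap = 65536, len = 0;
    char *buf = malloc(cap);
    if (!buf) return NULL;

    for (;;) {
        if (len == cap) {                       /* grow when full */
            cap *= 2;
            char *tmp = realloc(buf, cap);
            if (!tmp) { free(buf); return NULL; }
            buf = tmp;
        }
        size_t n = fread(buf + len, 1, cap - len, in);
        len += n;
        if (n == 0) {                           /* EOF or error */
            if (ferror(in)) { free(buf); return NULL; }
            break;
        }
    }
    *out_len = len;
    return buf;
}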
In practice, you usually don't need to keep the entire file content in memory. You'll often parse the file (notably if it is textual), or at least read the file in smaller pieces, and for that you don't need it entirely in memory. For a textual file, reading it line-by-line (perhaps with some state inside your parser) is often enough (using fgets or getline).
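For the textual, line-by-line case, a small sketch using POSIX getline, which grows its own buffer, so there is no line-length limit and no need to hold the whole file:

#define _POSIX_C_SOURCE 200809L   /* for getline() */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Process a text file one line at a time; getline() manages the buffer. */
void process_lines(FILE *in)
{
    char *line = NULL;
    size_t cap = 0;
    ssize_t len;

    while ((len = getline(&line, &cap, in)) != -1) {
        /* ... parse `line` here (len bytes, including the newline) ... */
    }
    free(line);
}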
Files exist (notably on disks or SSDs) because they can usually be much "bigger" than your computer's RAM. In fact, files were invented (more than 50 years ago) to deal with data larger than memory. Distributed file systems can also be very big (and accessed remotely, even from a laptop, e.g. by NFS, CIFS, etc.).
Some file systems are capable of storing petabytes of data (on supercomputers), with individual files of many terabytes (much larger than available RAM).
You're also likely to use some databases. These routinely have terabytes of data. See also this answer (about realistic sizes of sqlite databases).
If you really want to read a file entirely into memory using stdio (but you should avoid doing that, because you generally want your program to be able to handle a lot of file data, so reading the entire file into memory is usually a design error), you could indeed loop on fread (or fscanf, or even fgetc) until end-of-file. Notice that feof is useful only after some input operation.
On current laptop or desktop computers, you might prefer (for efficiency) to use buffers of a few megabytes, and you certainly can deal with big files of several hundred gigabytes (much larger than your RAM).
On POSIX file systems, you might do memory mapped IO with e.g. mmap(2) - but that might not be faster than read(2) with large buffers (of a few megabytes). You could use readahead(2) (Linux specific) and posix_fadvise(2) (or madvise(2) if using mmap) to tune performance by hinting your OS kernel.
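A sketch of that combination (mmap plus madvise with MADV_SEQUENTIAL); whether it actually beats read(2) with large buffers is something you would have to measure:

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a file read-only and hint the kernel that we'll scan it sequentially. */
const char *map_for_scan(const char *path, size_t *out_len)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0) return NULL;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) { close(fd); return NULL; }

    void *p = mmap(NULL, (size_t)st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                       /* the mapping stays valid after close */
    if (p == MAP_FAILED) return NULL;

    /* Encourage aggressive readahead for a sequential pass. */
    madvise(p, (size_t)st.st_size, MADV_SEQUENTIAL);

    *out_len = (size_t)st.st_size;
    return (const char *)p;
}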
If you have to code for Microsoft Windows, you could study its WinAPI and find some way to do memory mapped IO.
In practice, file data (notably if it was accessed recently) often stays in the page cache, which is of paramount importance for performance. When that is not the case, your hardware (disk, controller, ...) becomes the bottleneck and your program becomes I/O bound (in that case, no software trick could improve significantly the performance).

pread/pwrite, buffers and disk cache

If my code does something like fd = open("/dev/sdXY", ...) and pwrite(fd, ...)/pread(fd, ...), do the I/O operations skip the buffers or disk cache? Suppose /dev/sdXY is an unmounted, formatted disk partition (ext4, ufs, etc.).
I ask because I need to guarantee contiguous file storage in an application I'm working on, and I read that the only way to achieve that is to do something like what I described. However, I may drop the contiguous-storage requirement if it would mean losing buffering, the disk cache, or some other useful feature.
I'm also confused about whether I would need to re-implement low-level stuff, since the partition would already be formatted with a file system; I read that would be the case for RAW disks/partitions. I already know I will need to handle which blocks are free or in use, file and folder structures, etc. - I'm already working on that.
Another question: I have only seen buffers mentioned when reading about fopen()/fread()/fwrite() and C++'s file streams. Is it right that only these streams and the f* family of functions have some kind of buffer, unlike open/write/read/pwrite/pread/etc.? Is this buffer the same as the disk cache, or something different?
A last one: is the HDD cache handled by the drive itself or by the file system (ext4, ufs, etc.)?
The simple answer is 'it depends'. What's hard is characterizing what it depends on.
Simply using open() doesn't avoid the kernel disk buffer pool. To do that, you need special options (O_DIRECT) on Linux. However, using open() does avoid using hidden application buffers; you get to choose where the data is read from or written to without any intermediate copies. By contrast, the f* family of functions do have a 'hidden' application buffer; the data is frequently read into an I/O buffer associated with the FILE * file stream, and then copied into your application buffers.
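For illustration, a Linux-specific sketch of bypassing the page cache with O_DIRECT; note that O_DIRECT requires the buffer, offset and length to be suitably aligned (the 4096-byte alignment used here is an assumption about the device's logical block size):

#define _GNU_SOURCE             /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read one 4 KiB block from a raw partition, bypassing the page cache. */
int read_raw_block(const char *dev, off_t offset, void **out)
{
    int fd = open(dev, O_RDONLY | O_DIRECT);
    if (fd < 0) return -1;

    /* O_DIRECT wants an aligned buffer; 4096 is assumed here. */
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return -1; }

    ssize_t n = pread(fd, buf, 4096, offset);
    close(fd);

    if (n != 4096) { free(buf); return -1; }
    *out = buf;
    return 0;
}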
If your /dev/sdXY device is already formatted with a file system but you want to ensure contiguous file storage for a file, you are going to have to replicate a significant portion of the file system driver to ensure you allocate the space correctly. It is unlikely to be a sensible use of your time or energy. Yes, you would need to reimplement all sorts of low-level disk space management - it would be entirely non-trivial. Further, the implementation for ext4 would be quite different from the implementation for ufs, etc. - so you'd really have your work cut out for you.

Alter a file (without fseek or + modes) or concatenate two files with minimal copying

I am writing an audio file to an SD/MMC storage card in real time, in WAVE format, working on an ARM board. Said card is (and must remain) in FAT32 format. I can write a valid WAVE file just fine, provided I know how much I'm going to write beforehand.
I want to be able to put placeholder data in the Chunk Data Size field of the RIFF and data chunks, write my audio data, and then go back and update the Chunk Data Size field in those two chunks so that they have correct values, but...
I have a working filesystem and some stdio functions, with some caveats:
fopen() supports r, w, and a modes, but not any + modes.
fseek() does not work in write mode.
I did not write the implementations of the above functions (I am using ARM's RL-FlashFS), and I am not certain what the justification for the restrictions/partial implementations is. Adding the missing functionality myself is probably an option, but I would like to avoid it if possible (I have no other need of those features, do not foresee any, and can't really afford to spend too much time on it). Switching to a different implementation is also not an option here.
I have very limited memory available, and I do not know how much audio data will be received, except that it will almost certainly be more than I can keep in memory at any one time.
I could write a file with the raw interleaved audio data in it while keeping track of how many bytes I write, close it, then open it again for reading, open a second file for writing, write the header into the second file, and copy the audio data over. That is, I could post-process it into a properly formatted valid WAVE file. I have done this and it works fine. But I want to avoid post-processing large amounts of data if at all possible.
Perhaps I could somehow concatenate two files in place? (I.e. write the data, then write the chunks to a separate file, then join them in the filesystem, avoiding much of the time spent copying potentially vast amounts of data.) My understanding is that, if this is possible at all, it would still involve some copying due to the block orientation of the storage.
Suggestions?
EDIT:
I really should have mentioned this, but there is no OS running here. I have some stdio functions running on top of a hardware abstraction layer, and that's about it.
This should be possible, but it involves writing a set of FAT table manipulation routines.
The concept of FAT is simple: A file is stored in a chain of "clusters" - fixed size blocks. The clusters do not have to be contiguous on the disk. The Directory entry for a file includes the ID of the first cluster. The FAT contains one value for each cluster, which is either the ID of the next cluster in the chain, or an "End-Of-Chain" (EOC) marker.
So you can concatenate files together by altering the first file's EOC marker to point to the head cluster of the second file.
For your application you could write all the data, rewrite the first cluster (with the correct header) into a new file, then do FAT surgery to graft the new head onto the old tail:
Determine the FAT cluster size (S)
Determine the size of the WAV header up to the first data byte (F)
Write the incoming data to a temp file. Close when stream ends.
Create a new file with the desired name.
Open the temp file for reading, and copy the header to the new file while filling in the size field(s) correctly (as you have done before).
Write min(S - F, bytes_remaining) bytes of audio data to the new file.
Close the new file.
If there are no bytes remaining, you are done,
else,
Read the FAT and Directory into memory.
Read the Directory to get
the first cluster of the temp file (T1) (with all the data),
the first cluster of the WAV file (W1) (with the correct header).
Read the FAT entry for T1 to find the second temp cluster (T2).
Change the FAT entry for W1 from "EOC" to T2.
Change the FAT entry for T1 from T2 to "EOC".
Swap the FileSize entries for the two files in the Directory.
Write the FAT and Directory back to disk.
Delete the Temp file.
Of course, by the time you do this, you will probably understand the file system well enough to implement fseek(fp,0,SEEK_SET), which should give you enough functionality to do the header fixup through standard library calls.
We are working with exactly the same scenario as you in our project - a recorder application. Since the length of the file is unknown, we write a RIFF header with 0 length at the beginning (to reserve space), and on closing we go back to position 0 (with fseek) and write the correct header. Thus, I think you have to debug why fseek doesn't work in write mode, otherwise you will not be able to perform this task efficiently.
By the way, you'd be better off staying away from filesystem-internal workarounds like concatenating blocks, etc. - this is hardly feasible, will not be portable, and can bring you new problems. Use standard and proven methods instead.
Update
(After finding out that your FS is ARM's RL-FlashFS) Why not use rewind (http://www.keil.com/support/man/docs/rlarm/rlarm_rewind.htm) instead of fseek?
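A sketch of that reserve-then-patch pattern using rewind; the callback type and the header offsets are placeholders, and whether RL-FlashFS actually allows rewriting after rewind in "w" mode is exactly what you would need to verify:

#include <stdio.h>

/* Hypothetical callback that yields the next chunk of captured audio;
 * returns 0 when the stream ends. */
typedef size_t (*fill_fn)(unsigned char *buf, size_t cap);

/* Write the header with placeholder sizes, stream the audio, then rewind
 * and rewrite the header with the real sizes filled in. */
int write_wav(FILE *f, unsigned char *hdr, size_t hdr_len, fill_fn next_chunk)
{
    unsigned char buf[512];
    size_t n, data_len = 0;

    if (fwrite(hdr, 1, hdr_len, f) != hdr_len)     /* placeholder sizes */
        return -1;

    while ((n = next_chunk(buf, sizeof buf)) > 0) {
        if (fwrite(buf, 1, n, f) != n)
            return -1;
        data_len += n;
    }

    /* ... patch the RIFF chunk size (hdr_len + data_len - 8) and the data
     * chunk size (data_len) into hdr at their little-endian offsets ... */

    rewind(f);                    /* back to position 0 without fseek */
    return fwrite(hdr, 1, hdr_len, f) == hdr_len ? 0 : -1;
}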

High performance reading - linux/pthreads

I have a moderately large binary file consisting of independent blocks, like this:
header1
data1
header2
data2
header3
data3
...
The number of blocks, the size of each block and the total size of the file vary quite a lot, but typical numbers are ~1000 blocks with an average block size of 100 KB. The files are generated by an external application which I have no control over, but I want to read them as fast as possible. In many cases I am only interested in a fraction (i.e. 10%) of the blocks, and that is the case I will optimize for.
My current implementation is like this:
Open the file and read all the headers - using information in the header to fseek() to the next header location; retain an open FILE * pointer.
When data is requested use fseek() to locate the data block, read all the data and return it.
This works fine - but I was thinking maybe(?) it was possible to speed things up using e.g. aio, mmap or other techniques I have only heard of.
Any thoughts?
Joakim
The speed difference between mmap and read is not that big (both need to read the data from disk); the biggest advantage of mmap is avoiding double buffering.
If you are only interested in 10% of the contents, your biggest saving will be not reading the other 90%. This could be done by reading only the headers, and seeking to the next header or to the data block wanted. But it all depends on the file format, which the OP did not show in detail.
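As a sketch of that index-only pass (the header struct here is hypothetical, since the real layout depends on the file format):

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical fixed-size block header; the real layout depends on the
 * file format, which isn't shown in the question. */
struct blk_header {
    uint32_t data_size;      /* size of the data that follows, in bytes */
    /* ... other header fields ... */
};

/* Index pass: record each block's data offset, skipping the data itself. */
long *index_blocks(FILE *f, size_t *count)
{
    size_t cap = 1024, n = 0;
    long *offsets = malloc(cap * sizeof *offsets);
    struct blk_header h;

    *count = 0;
    if (!offsets)
        return NULL;

    while (fread(&h, sizeof h, 1, f) == 1) {
        if (n == cap) {                        /* grow the offset table */
            cap *= 2;
            long *tmp = realloc(offsets, cap * sizeof *offsets);
            if (!tmp) { free(offsets); return NULL; }
            offsets = tmp;
        }
        offsets[n++] = ftell(f);               /* data starts right here */
        if (fseek(f, (long)h.data_size, SEEK_CUR) != 0)   /* skip the data */
            break;
    }
    *count = n;
    return offsets;
}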
Most of the time is probably spent in accessing the disk. So perhaps buying an SSD is sensible. (Whatever you do, your application is I/O bound).
Apparently, your file is only about 100 MB. You could get it into the kernel's file cache just by reading it, e.g. with cat yourfile > /dev/null before running your program. For such a small file (on a reasonable machine it fits in RAM), I wouldn't worry that much.
You could pre-process the file, e.g. to make a database (for sqlite, or a real RDBMS like PostgreSQL) or just a gdbm indexed file.
If using <stdio.h> you might set a bigger buffer with setbuffer, or call fopen with a "rmt" mode (the m is a GNU glibc extension that asks for the file to be mmap-ed).
You could use mmap with madvise.
You could (perhaps in a separate thread) use the readahead syscall.
But your file seems small enough that you should not bother that much. Are you sure it is really a performance issue? Do you read that file many thousand times per day, or do you have many hundreds of such files?

What posix_fadvise() args for sequential file write?

I am working on an application which sequentially writes a large file (and does not read it at all), and I would like to use posix_fadvise() to optimize the filesystem behavior.
The function description in the manpage suggests that the most appropriate strategy would be POSIX_FADV_SEQUENTIAL. However, the description of the Linux implementation casts doubt on that:
Under Linux, POSIX_FADV_NORMAL sets the readahead window to the default size for the backing device; POSIX_FADV_SEQUENTIAL doubles this size, and POSIX_FADV_RANDOM disables file readahead entirely.
As I'm only writing data (overwriting files possibly too), I don't expect any readahead. Should I then stick with my POSIX_FADV_SEQUENTIAL or rather use POSIX_FADV_RANDOM to disable it?
How about other options, such as POSIX_FADV_NOREUSE? Or maybe do not use posix_fadvise() for writing at all?
Most of the posix_fadvise() flags (e.g. POSIX_FADV_SEQUENTIAL and POSIX_FADV_RANDOM) are hints about readahead rather than writing.
There's some advice from Linus here and here about getting good sequential write performance. The idea is to break the file into large-ish (8MB) windows, then loop around doing:
Write out window N with write();
Request asynchronous write-out of window N with sync_file_range(..., SYNC_FILE_RANGE_WRITE)
Wait for the write-out of window N-1 to complete with sync_file_range(..., SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER)
Drop window N-1 from the pagecache with posix_fadvise(..., POSIX_FADV_DONTNEED)
This way you never have more than two windows worth of data in the page cache, but you still get the kernel writing out part of the pagecache to disk while you fill the next part.
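A sketch of that loop; it is Linux-specific (sync_file_range is not portable), and the 8 MB window size is taken from the advice above:

#define _GNU_SOURCE                    /* for sync_file_range() */
#include <fcntl.h>
#include <unistd.h>

#define WINDOW (8 * 1024 * 1024)       /* 8 MB windows, per the advice above */

/* Write one window and keep at most two windows of dirty data in the
 * page cache: start write-out of window N, wait for N-1, then drop N-1. */
int write_window(int fd, const char *buf, size_t len, off_t off)
{
    if (pwrite(fd, buf, len, off) != (ssize_t)len)
        return -1;

    /* Kick off asynchronous write-back of the window we just wrote. */
    sync_file_range(fd, off, len, SYNC_FILE_RANGE_WRITE);

    if (off >= WINDOW) {
        off_t prev = off - WINDOW;
        /* Wait for the previous window to hit disk... */
        sync_file_range(fd, prev, WINDOW,
                        SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE |
                        SYNC_FILE_RANGE_WAIT_AFTER);
        /* ...then drop it from the page cache. */
        posix_fadvise(fd, prev, WINDOW, POSIX_FADV_DONTNEED);
    }
    return 0;
}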
It all depends on the temporal locality of your data. If your application won't need the data soon after it is written, then you can go with POSIX_FADV_NOREUSE to avoid writing to the buffer cache (in a similar way to the O_DIRECT flag of open()).
As far as writes go, I think you can just rely on the OS's disk I/O scheduler to do the right thing.
You should keep in mind that, while posix_fadvise is there specifically to give the kernel hints about future file usage patterns, the kernel also has other data to help it out.
If you don't open the file for reading, then the kernel would only need to read blocks in when they are partially written. If you were to truncate the file to 0 first, it wouldn't even have to do that (you said you are overwriting).
