I want to copy a large ram-based file (located in the /dev/shm directory) to local disk. Is there an efficient way to do the copy instead of reading it character by character or allocating another piece of memory? I can only use C here. Is there any way to put the memory file directly onto disk? Thanks!
I would mmap() the files and do memcpy() between them.
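For example, a minimal sketch of that approach, assuming both filesystems support mmap (paths, mode bits and all error handling are illustrative only):

#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int copy_via_mmap(const char *src_path, const char *dst_path)
{
    int src = open(src_path, O_RDONLY);
    struct stat st;
    fstat(src, &st);

    int dst = open(dst_path, O_RDWR | O_CREAT | O_TRUNC, 0644);
    ftruncate(dst, st.st_size);            /* destination must be sized before mapping */

    void *s = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, src, 0);
    void *d = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, dst, 0);

    memcpy(d, s, st.st_size);              /* kernel pages data in/out as needed */
    msync(d, st.st_size, MS_SYNC);         /* push the dirty pages to disk */

    munmap(s, st.st_size);
    munmap(d, st.st_size);
    close(src);
    close(dst);
    return 0;
}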
Thank you guys for the help! I made it work by mmap'ing the ram-based file and writing the entire block directly to the destination. memcpy was not used because I am actually writing to a parallel file system (PVFS), which does not support the mmap operation.
/dev/shm is shared memory, so one way to copy it would be to open it as shared memory, but frankly I don't think you will gain anything.
When writing your memory file to disk, the bottleneck will be the disk.
Just be sure to write the data in big chunks and you should be fine.
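For instance, a plain read()/write() loop with a large buffer would look roughly like this (the 1 MiB chunk size is just a placeholder; tune it for your disk):

#include <stdlib.h>
#include <unistd.h>

#define CHUNK (1024 * 1024)   /* 1 MiB per request */

/* hypothetical helper: copy an already-open source fd to a destination fd */
static void copy_big_chunks(int in, int out)
{
    char *buf = malloc(CHUNK);
    ssize_t n;
    while ((n = read(in, buf, CHUNK)) > 0) {
        ssize_t off = 0;
        while (off < n)                       /* write() may be partial */
            off += write(out, buf + off, n - off);
    }
    free(buf);
}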
You can just copy it like any other file:
cp /dev/shm/tmp ~/tmp
So, a quick, simple way is to issue a cp command via system().
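For example (assuming the file really is at /dev/shm/tmp):

#include <stdlib.h>

system("cp /dev/shm/tmp ~/tmp");   /* the shell expands ~, and cp copies in large blocks */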
You could try the splice system call. I'm not sure whether it will work here, since it has some restrictions on the types of file descriptors it accepts, but if it does, you would call it repeatedly with page-sized (or some multiple of the page size) requests until it finishes, and the kernel would handle the transfer very efficiently.
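If you want to experiment with it, the usual pattern routes the data through a pipe, since each splice call needs a pipe on one side. A rough sketch (the 64 KiB chunk size is arbitrary and error handling is omitted):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* rough sketch: copy 'len' bytes from fd 'in' to fd 'out' through a pipe */
static void copy_with_splice(int in, int out, size_t len)
{
    int p[2];
    pipe(p);
    while (len > 0) {
        ssize_t n = splice(in, NULL, p[1], NULL, 64 * 1024, SPLICE_F_MOVE);
        if (n <= 0)
            break;
        ssize_t moved = 0;
        while (moved < n)                   /* drain the pipe into the output file */
            moved += splice(p[0], NULL, out, NULL, n - moved, SPLICE_F_MOVE);
        len -= n;
    }
    close(p[0]);
    close(p[1]);
}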
If this doesn't work you'll need to do either mmap or do plain old read/write.
Reading and writing in memory-page-sized chunks makes things much more efficient. It can be even more efficient if your buffers are page-aligned, since that opens up the opportunity for the kernel to move the data to/from your process's memory via memory-management trickery rather than actually copying it around.
The only thing you can do is read() in page size aligned chunks. I'm assuming you need to guarantee the data as written, which is going to mean bypassing buffers via posix_fadvise() or using O_DIRECT (I typically use posix_fadvise(), but O_DIRECT is appropriate here).
In this case, the speed of the media being written to alone dictates how quickly this will happen.
If you don't need to bypass buffers, the operation will complete faster, but there's no guarantee that the data will actually be written in the event of a reboot / power outage / etc. Since the source of the data is in shared memory, I'm (again) guessing you want the write to be guaranteed.
The only thing you can optimize is how long it takes read() to get data from shared memory into your own address space, which page size aligned chunks will improve.
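A rough sketch of the O_DIRECT route (the chunk size, destination path and missing error handling are placeholders; the final partial chunk may need a non-direct write since O_DIRECT wants block-aligned sizes):

#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (64 * 1024)      /* multiple of the page/block size */

static void copy_direct(int in, const char *dst_path)
{
    int out = open(dst_path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    void *buf;
    posix_memalign(&buf, 4096, CHUNK);   /* O_DIRECT needs aligned buffers */

    ssize_t n;
    while ((n = read(in, buf, CHUNK)) > 0)
        write(out, buf, n);              /* n stays block-aligned except at EOF */

    fsync(out);                          /* make sure metadata reaches disk too */
    close(out);
    free(buf);
}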
Related
In the case that memory is allocated and it's known that it (almost certainly / probably) won't be used for a long time, it could be useful to tag this memory to be more aggressively moved into swap space.
Is there some command to tell the kernel of this?
Failing that, it may be better to dump these out to temp files, but I was curious about the ability to send-to-swap (or something similar).
Of course if there is no swap-space, this would do nothing, and in that case writing temp files may be better.
You can use the madvise call to tell the kernel what you will likely be doing with the memory in the future. For example:
madvise(base, length, MADV_PAGEOUT);   /* Linux 5.4 and later */

tells the kernel to reclaim the pages in question right away: dirty anonymous pages are written out to swap, and clean file-backed pages are simply dropped (they can be reread from the file later). The related MADV_COLD is a gentler hint that only marks the range as a preferred reclaim candidate.
There's also MADV_DONTNEED which allows the kernel to drop the contents even if modified (so when you next access the memory, if you do, it might be zeroed or reread from the original mapped file).
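In context, a hypothetical sketch (MADV_PAGEOUT needs Linux 5.4 or later; the region and its size are arbitrary):

#include <stddef.h>
#include <sys/mman.h>

/* hypothetical: a big anonymous region we know will sit idle for a while */
static void *make_parked_region(size_t length)
{
    void *base = mmap(NULL, length, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    /* ... fill it, finish using it for now ... */
    madvise(base, length, MADV_PAGEOUT);   /* push it toward swap immediately */
    return base;                           /* touching it later pages it back in */
}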
The closest thing I can think of would be mmap; see Memory-mapped I/O. This does not write to the Linux swap partition, but it allows complete pages of memory to be paged out to disk and back on access. Temporary files and directories are also available with tmpfile, mkstemp and mkdtemp, but again these do not write to the swap partition; they live on the normal filesystem.
Other than features similar to the above, I do not believe there is anything that allows direct access to the swap partition (other than exhausting system memory).
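If you go the temp-file route, a minimal sketch (the path template and helper name are hypothetical):

#include <stdlib.h>
#include <unistd.h>

/* hypothetical helper: spill a rarely-used buffer to a temp file */
static int spill_to_tempfile(const void *data, size_t len)
{
    char path[] = "/tmp/spillXXXXXX";   /* mkstemp replaces the Xs */
    int fd = mkstemp(path);
    write(fd, data, len);
    unlink(path);                       /* file disappears once fd is closed */
    return fd;                          /* pread() it back on demand */
}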
While processing a very large binary file, can using memory mapping in C make any difference compared to fread? Even if the time difference is small, that would be fine. And if it does make the process faster, any idea how to use memory mapping on a large binary file and extract data from it?
Thanks!!
If you're going to read the entire file beginning to end, the most important thing is to let the platform know this. This will allow it to do aggressive read ahead and it will allow it to avoid polluting the cache with data that will not be read again anyway. You can do this either with memory mapping or without it. The key functions are posix_fadvise and posix_madvise.
Memory mapping is a huge win when you have random, small accesses. This is especially true when you have multiple writes to the same page. Without memory mapping, each read or write requires a user/kernel transition and a copy. With memory mapping, most operations don't.
But with sequential access, all it will save is the copy. Oddly, the user/kernel transitions may even be worse. With large sequential reads, you get one user/kernel transition per read, which could be one per 256KB if the reads are large. With large sequential access to a memory-mapped file, you may fault on every page (4KB). It depends on the kernel's "fault ahead" optimizations.
However, with memory mapping, you will save the copy, assuming you don't need to do the copy anyway. If you have to copy out of the mapped pages for any reason, then you might as well let a read operation copy them into place for you. However, if you can operate on the data in place, memory mapping may be a win.
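As a rough illustration of both routes (the path, the record-parsing step and the helper name are only assumptions):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* hypothetical: map a large file for one sequential pass over its records */
static const void *map_for_streaming(const char *path, size_t *len)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);
    *len = st.st_size;

    /* If you stick with read()/fread(), at least hint the access pattern: */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    /* The mmap route: map read-only and give the same hint to the mapping. */
    void *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    posix_madvise(p, st.st_size, POSIX_MADV_SEQUENTIAL);
    return p;       /* parse records directly out of the mapping, no copies */
}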
It generally doesn't make as much of a difference as people tend to think it does. Especially when you think about how slow the disk is in comparison to all this stuff.
I am using the low-level I/O function write to write some data to disk in my code (C on Linux). First I accumulate the data in a memory buffer, and then I call write when the buffer is full. So what's the best buffer size for write? According to my tests, bigger is not always faster, so I'm here looking for the answer.
There is probably some advantage in doing writes that are multiples of the filesystem block size, especially if you are updating a file in place. If you write less than a full block, the OS has to read the old block, merge in the new contents, and then write it out. This doesn't necessarily happen if you rapidly write small pieces in sequence, because the updates are done on buffers in memory that are flushed later. Still, once in a while you could be triggering this inefficiency if you are not filling a block (and a properly aligned one: a multiple of the block size at an offset which is a multiple of the block size) with each write operation.
This issue of transfer size does not necessarily go away with mmap. If you map a file and then memcpy some data into the mapping, you are dirtying a page. That page has to be flushed at some later time: exactly when is indeterminate. If you do another memcpy that touches the same page, that page may have been cleaned by then and you're making it dirty again, so it gets written twice. Page-aligned copies in multiples of the page size are the way to go.
You'll want it to be a multiple of the CPU page size, in order to use memory as efficiently as possible.
But ideally you want to use mmap instead, so that you never have to deal with buffers yourself.
You could use BUFSIZ defined in <stdio.h>.
Otherwise, use a small multiple of the page size sysconf(_SC_PAGESIZE) (e.g. twice that value). Most Linux systems have 4-kilobyte pages (which is often the same as, or a small multiple of, the filesystem block size).
As others replied, using the mmap(2) system call could help. GNU systems (e.g. Linux with glibc) have an extension: the mode string (the second argument of fopen) may contain the letter m, and when it does, the GNU libc tries to mmap the file.
If you deal with data nearly as large as your RAM (or half of it), you might want to also use madvise(2) to fine-tune performance of mmap.
See also this answer to a question quite similar to yours. (You could use 64Kbytes as a reasonable buffer size).
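Concretely, something along these lines (the factor of 16 and the helper name are just examples):

#include <stdlib.h>
#include <unistd.h>

/* hypothetical: pick a write-buffer size of a few pages */
static void *make_write_buffer(size_t *out_size)
{
    long page = sysconf(_SC_PAGESIZE);      /* typically 4096 on Linux */
    *out_size = 16 * (size_t)page;          /* e.g. 64 KiB, as suggested above */
    void *buf = NULL;
    posix_memalign(&buf, (size_t)page, *out_size);  /* page-aligned for write() */
    return buf;
}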
The "best" size depends a great deal on the underlying file system.
The stat and fstat calls fill in a data structure, struct stat, that includes the following field:
blksize_t st_blksize; /* blocksize for file system I/O */
The OS is responsible for filling this field with a "good size" for write() blocks. However, it's also important to call write() with memory that is "well aligned" (e.g., the result of malloc calls). The easiest way to get this to happen is to use the provided <stdio.h> stream interface (with FILE * objects).
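For example, a small sketch of querying that hint yourself (the helper name is hypothetical; error handling omitted):

#include <stdlib.h>
#include <sys/stat.h>

/* hypothetical: allocate a write buffer sized and aligned per the fs hint */
static void *alloc_write_buffer(int fd, size_t *out_size)
{
    struct stat st;
    fstat(fd, &st);
    *out_size = st.st_blksize;              /* preferred block size for this fs */
    void *buf = NULL;
    posix_memalign(&buf, st.st_blksize, st.st_blksize);
    return buf;
}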
Using mmap, as in other answers here, can also be very fast for many cases. Note that it's not well suited to some kinds of streams (e.g., sockets and pipes) though.
It depends on the amount of RAM, VM, etc. as well as the amount of data being written. The more general answer is to benchmark what buffer works best for the load you're dealing with, and use what works the best.
I need the fastest way to periodically sync file with memory.
What I think I would like is to have an mmap'd file, which is only sync'd to disk manually. I'm not sure how to prevent any automatic syncing from happening.
The file cannot be modified except at the times I manually specify. The point is to have a checkpoint file which keeps a snapshot of the state in memory. I would like to avoid copying as much as possible, since this will be need to called fairly frequently and speed is important.
Anything you write to the memory within a MAP_SHARED mapping of a file is considered as being written to the file at that time, as surely as if you had used write(). msync() in this sense is completely analogous to fsync(): it merely ensures that changes you have already made to the file are actually pushed out to permanent storage. You can't change this; it's how mmap() is defined to work.
In general, the safe way to do this is to write a complete consistent copy of the data to a temporary file, sync the temporary file, then atomically rename it over the prior checkpoint file. This is the only way to ensure that a crash between checkpoints doesn't leave you with an inconsistent file. Any solution that does less copying is going to require both a more complicated transaction-log style file format, and be more intrusive to the rest of your application (requiring specific hooks to be invoked in each place that the in-memory state is changed).
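A minimal sketch of that pattern (the helper name, fixed-size path buffer and missing error handling are assumptions; for full durability you would also fsync the containing directory):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* hypothetical checkpoint routine: 'state'/'len' is the in-memory snapshot */
static int checkpoint(const void *state, size_t len, const char *path)
{
    char tmp[4096];
    snprintf(tmp, sizeof tmp, "%s.tmp", path);

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    write(fd, state, len);        /* write the whole consistent snapshot */
    fsync(fd);                    /* make sure the bytes are on disk... */
    close(fd);

    return rename(tmp, path);     /* ...before atomically replacing the old file */
}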
You could mmap() the file as copy on write so that any updates you do in memory are not written back to the file, then when you want to sync, you could:
A) Make a new memory mapping that is not copy on write and copy just the pages you modified into it.
Or
B) Open the file (a regular file open) with direct I/O (block-size-aligned reads and writes) and write only the pages you modified. Direct I/O would be nice and fast because you're writing whole pages (the memory page size is a multiple of the disk block size) and there's no buffering. This method has the benefit of not using address space, in case your mmap() is large and there's no room to mmap() another huge file.
After the sync, your copy on write mmap() is the same as your disk file, but the kernel still has the pages you needed to sync marked as non shared (with the disk). So you can then close and recreate the mmap() (still copy on write) that way the kernel can discard your pages if necessary (instead of paging them out to swap space) if there's memory pressure.
Of course, you'd have to keep track of which pages you had modified yourself because I can't think of how you'd get access to where the OS keeps that info. (wouldn't that be a handy syscall()?)
-- edit --
Actually, see Can the dirtiness of pages of a mmap be found from userspace? for ideas on how to see which pages are dirty.
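A bare-bones sketch of option B, assuming you track dirty page indexes yourself (the direct I/O flags and error handling are left out for brevity):

#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

/* hypothetical: 'dirty' lists the indexes of pages you know you modified */
static void sync_dirty_pages(const char *path, uint8_t *map, long page,
                             const size_t *dirty, size_t ndirty)
{
    int fd = open(path, O_WRONLY);
    for (size_t i = 0; i < ndirty; i++) {
        off_t off = (off_t)dirty[i] * page;
        pwrite(fd, map + off, page, off);   /* copy just that page back to the file */
    }
    fsync(fd);
    close(fd);
}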
mmap can't be used for this purpose. There's no way to prevent data from being written to disk. In practice, using mlock() to make the memory unswappable might have a side effect of preventing it from getting written to disk except when you ask for it to be written, but there's no guarantee. Certainly if another process opens the file, it's going to see the copy cached in memory (with your latest changes), not the copy on physical disk. In many ways, what you should do depends on whether you're trying to do synchronization with other processes or just for safety in case of crash or power failure.
If your data size is small, you might try a number of other methods for atomic syncing to disk. One way is to store the entire dataset in a filename and create an empty file by that name, then delete the old file. If 2 files exist at startup (due to extremely unlikely crash time), delete the older one and resume from the newer one. write() may also be atomic if your data size is smaller than a filesystem block, page size, or disk block, but I don't know of any guarantee to that effect right off. You'd have to do some research.
Another very standard approach that works as long as your data isn't so big that 2 copies won't fit on disk: just create a second copy with a temporary name, then rename() it over top of the old one. rename() is always atomic. This is probably the best approach unless you have a reason not to do it that way.
As the other respondents have suggested, I don't think there's a portable way to do what you want without copying. If you're looking to do this in a special-purpose environment where you can control the OS etc, you may be able to do it under Linux with the btrfs filesystem.
btrfs supports a new reflink() operation which is essentially a copy-on-write filesystem copy. You could reflink() your file to a temporary on start-up, mmap() the temporary, then msync() and reflink() the temporary back to the original to checkpoint.
I strongly suspect no OS actually takes advantage of this, but it would be possible for an OS to notice and optimize the following pattern:
int fd = open("file", O_RDWR | O_SYNC | O_DIRECT);
size_t length = get_length(fd);   /* placeholder: e.g. from fstat() */

/* Private (copy-on-write) mapping: stores through map_addr never reach the
 * file on their own; only the explicit write() below updates it. */
uint8_t *map_addr = mmap(NULL, length, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

...

// This represents all of the changes that could possibly happen before you
// want to update the on-disk file.
change_various_data(map_addr);

if (is_time_to_update()) {
    /* O_DIRECT requires the length (and the buffer, which mmap page-aligns)
     * to be suitably aligned for the underlying block device. */
    write(fd, map_addr, length);
    lseek(fd, 0, SEEK_SET);   // you could have just used pwrite here and not seeked
}
The reason an OS could possibly take advantage of this is that, until you write to a particular page (and nobody else does either), the OS would probably just use the actual file's page at that location as the backing store for that page.
Then when you wrote to some set of those pages the OS would Copy On Write those pages for your process, but still keep the unwritten pages backed up by the original file.
Then, upon calling write the OS could notice that the write was block aligned both in memory and on disk, and then it could notice that some of the source memory pages were already synched up with those exact file system pages that they were being written to and only write out the pages which had changed.
All of that being said, it wouldn't surprise me if this optimization isn't done by any OS, and this type of code ends up being really slow and causes lots of disk writing when you call 'write'. It would be cool if it was taken advantage of.
I'm implementing persistent large constant arrays via mmap. Is there any tips and tricks or gotchas one should be aware when using mmap?
All pointers that are stored inside the mmap'd region should be done as offsets from the base of the mmap'd region, not as real pointers! You won't necessarily be getting the same base address when you mmap the region on the next run of the program. (I have had to clean up code that made incorrect assumptions about mmap region base address constancy).
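For example, instead of storing a pointer in the mapped region, store an offset and convert it on access (a sketch; the struct and helper are hypothetical):

#include <stdint.h>

/* stored inside the mmap'd file: offsets, never raw pointers */
struct node {
    uint64_t next_off;        /* offset of the next node from the mapping base */
    int      value;
};

static struct node *next_node(void *base, const struct node *n)
{
    return n->next_off ? (struct node *)((char *)base + n->next_off) : NULL;
}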
This is the most straightforward use case for mmap(), so there shouldn't be much to trip you up.
You are effectively just loading a large constant array. Being constants you shouldn't need to worry about synchronization. It would be advisable to make sure the prot parameter is set to PROT_READ only since you won't be writing.
If one or more programs using the constants are going to run continually, it might be worthwhile to have a separate program that loads the data and keeps it resident. Runs of the other programs are then essentially just doing a shared-memory attach rather than reading the file into memory each time.
Make sure you check for restrictions on open file size or memory usage. On Linux there is a built in shell command ulimit. Run as ulimit -a to see the current settings.
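A minimal read-only mapping of such an array might look like this (the file name, element type and helper name are assumptions):

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* hypothetical: map a file of doubles as a read-only constant table */
static const double *load_constants(const char *path, size_t *count)
{
    int fd = open(path, O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    /* PROT_READ only: the pages stay clean and can be shared across processes. */
    const double *table = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);                     /* the mapping stays valid after close() */
    *count = st.st_size / sizeof *table;
    return table;
}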
Flush writes to the in-memory array to the file with the msync(2) syscall or else they may stay in memory until munmap(2) and there may be a power outage or something before then!
If multiple processes are mmap'ing the same memory region shared with read and write privileges, make sure that only one is writing to it at a time to avoid corrupting your data. Or use file locking or some other means of synchronization.