mmap file with larger fixed length with zero padding?

mmap file with larger fixed length with zero padding? - c

I want to use mmap() to read a file with fixed length (eg. 64MB), but there also some files < 64MB.
I mmap this files (<64MB, eg.30MB) with length = 64MB, when read file data beyond file size(30MB - 64MB), the program got a bus-error.
I want mmap these files with fixed length, and read 0x00 when pointer beyond file size. How to do that?
One method I can think is ftruncate file first, and ftruncate back to ori size, but I don't think this method is perfect.

This is one of the few reasonable use cases for MAP_FIXED, to remap part of an existing mapping to use a new backing file.
A simple solution here is to unconditionally mmap 64 MB of anonymous memory (or explicitly mmap /dev/zero), without MAP_FIXED and store the resulting pointer.
Next, mmap 64 MB or your actual file size (whichever is less) of your actual file, passing in the result of the anonymous/zero mmap and passing the MAP_FIXED flag. The pages corresponding to your file will no longer be anonymous/zero mapped, and instead will be backed by your file's data; the remaining pages will be backed by the anonymous/zero pages.
When you're done, a single munmap call will unmap all 64 MB at once (you don't need to separately unmap the real file pages and the zero backed pages).
Extremely simple example (no error checking, please add it yourself):
// Reserve 64 MB of contiguous addresses; anonymous mappings are always zero backed
void *mapping = mmap(NULL, 64 * 1024 * 1024, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
// Open file and check size
struct stat sb;
int fd = open(myfilename, O_RDONLY);
fstat(fd, &sb);
// Use smaller of file size or 64 MB
size_t filemapsize = sb.st_size > 64 * 1024 * 1024 ? 64 * 1024 * 1024 : sb.st_size;
// Remap up to 64 MB of pages, replacing some or all of original anonymous pages
mapping = mmap(mapping, filemapsize, PROT_READ, MAP_SHARED | MAP_FIXED, fd, 0);
close(fd);
// ... do stuff with mapping ...
munmap(mapping, 64 * 1024 * 1024);

Related

Mmap a large 1TB file and write 1's to it?

I am very new to mmap and memset. I have been assigned a task to create a large file (1TB) and write 1's to it as we are trying to understand the performance.
Now from what I understand, I can basically fallocate a file with 1Tb, then in a C function, I can mmap it with PROT_READ, PROT_WRITE, MAP_SHARED and then memset that mmap'ed pointed with memset, like this :
int fd = open(FILEPATH, O_RDWE, (mode_t)0700);
size_t data_length = 1000000000000;
char *data = (char*)mmap(NULL, data_length, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_POPULATE, fd , 0);
memset(data, '1', data_length);
Is this correct?
Do I need to sync or anything to make this data persistent?
If so, I'm basically writing a 1TB file then why does my function run within split seconds.
I've tried to cat the output file and there indeed are 1's, its just that my terminal craps out after a few seconds, but the data is actually getting written.
Am I doing this correctly or should I actually go and write data to memory rather than memset. If so, how should I do it?
Thanks for any help.

copy whole of a file into memory using mmap

i want to copy whole of a file to memory using mmap in C.i write this code:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>
int main(int arg, char *argv[])
{
char c ;
int numOfWs = 0 ;
int numOfPr = 0 ;
int numberOfCharacters ;
int i=0;
int k;
int pageSize = getpagesize();
char *data;
float wsP = 0;
float prP = 0;
int fp = open("2.txt", O_RDWR);
data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_SHARED, fp,pageSize);
printf("%s\n", data);
exit(0);
}
when i execute the code i get the Bus error message.
next, i want to iterate this copied file and do some thing on it.
how can i copy the file correctly?

2 things.
The second parameter of mmap() is the size of the portion of file you want to make visible in your address space. The last one is the offset in the file from which you want the map. This means that as you have called mmap() you will see only 1 page (on x86 and ARM it's 4096 bytes) starting at offset 4096 in your file. If your file is smaller than 4096 bytes, then there will be no mapping and mmap() will return MAP_FAILED (i.e. (caddr_t)-1). You didn't check the return value of the function so the following printf() dereferences an illegal pointer => BUS ERROR.
Using a memory map with string functions can be difficult. If the file doesn't contain binary 0. It can happen that these functions then try to access past the mapped size of the file and touch unmapped memory => SEGFAULT.
To open a memory for a file, you have to know the size of the file.
struct stat filestat;
if(fstat(fd, &filestat) !=0) {
perror("stat failed");
exit(1);
}
data = mmap(NULL, filestat.st_size, PROT_READ, MAP_SHARED, fp, 0);
if(data == MAP_FAILED) {
perror("mmap failed");
exit(2);
}
EDIT: The memory map will always be opened with a size that is a multiple of the pagesize. This means that the last page will be filled with 0 up to the next multiple of the pagesize. Often programs using memory mapped files with string functions (like your printf()) will work most of the time, but will suddenly crash when mapping a file whith a size exactly a multiple of the page size (4096, 8192, 12288 etc.). The often seen advice to pass to mmap() a size bigger than real file size works on Linux but is not portable and is even in violation of Posix, which explicitly states that mapping beyond the file size is undefined behaviour. The only portable way is to not use string functions on memory maps.

The last parameter of mmap is the offset within the file, where the part of file mapped to memory starts. It shall be 0 in your case
data = mmap(NULL, pageSize, PROT_READ, MAP_SHARED, fp,0);
If your file is shorter than pageSize, you will not be able to use addresses beyond the end of file. To use the full size, you shall expand the size to pageSize before calling mmap. Use something like:
ftruncate(fp, pageSize);
If you want to write to the memory (file) you shall use flag PROT_WRITE as well. I.e.
data = mmap(NULL, pageSize, PROT_READ|PROT_WRITE, MAP_SHARED, fp,0);
If your file does not contain 0 character (as end of string) and you want to print it as a string, you shall use printf with explicitly specified maximum size:
printf("%.*s\n", pageSize, data);
Also, of course, as pointed by #Jongware, you shall test result of open for -1 and mmap for MAP_FAILED.

Copying files using memory map

I want to implement an effective file copying technique in C for my process which runs on BSD OS. As of now the functionality is implemented using read-write technique. I am trying to make it optimized by using memory map file copying technique.
Basically I will fork a process which mmaps both src and dst file and do memcpy() of the specified bytes from src to dst. The process exits after the memcpy() returns. Is msync() required here, because when I actually called msync with MS_SYNC flag, the function took lot of time to return. Same behavior is seen with MS_ASYNC flag as well?
i) So to summarize is it safe to avoid msync()?
ii) Is there any other better way of copying files in BSD. Because bsd seems to be does not support sendfile() or splice()? Any other equivalents?
iii) Is there any simple method for implementing our own zero-copy like technique for this requirement?
My code
/* mmcopy.c
Copy the contents of one file to another file, using memory mappings.
Usage mmcopy source-file dest-file
*/
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include "tlpi_hdr.h"
int
main(int argc, char *argv[])
{
char *src, *dst;
int fdSrc, fdDst;
struct stat sb;
if (argc != 3)
usageErr("%s source-file dest-file\n", argv[0]);
fdSrc = open(argv[1], O_RDONLY);
if (fdSrc == -1)
errExit("open");
/* Use fstat() to obtain size of file: we use this to specify the
size of the two mappings */
if (fstat(fdSrc, &sb) == -1)
errExit("fstat");
/* Handle zero-length file specially, since specifying a size of
zero to mmap() will fail with the error EINVAL */
if (sb.st_size == 0)
exit(EXIT_SUCCESS);
src = mmap(NULL, sb.st_size, PROT_READ, MAP_PRIVATE, fdSrc, 0);
if (src == MAP_FAILED)
errExit("mmap");
fdDst = open(argv[2], O_RDWR | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
if (fdDst == -1)
errExit("open");
if (ftruncate(fdDst, sb.st_size) == -1)
errExit("ftruncate");
dst = mmap(NULL, sb.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fdDst, 0);
if (dst == MAP_FAILED)
errExit("mmap");
memcpy(dst, src, sb.st_size); /* Copy bytes between mappings */
if (msync(dst, sb.st_size, MS_SYNC) == -1)
errExit("msync");
enter code here
exit(EXIT_SUCCESS);
}

Short answer: msync() is not required.
When you do not specify msync(), the operating system flushes the memory-mapped pages in the background after the process has been terminated. This is reliable on any POSIX-compliant operating system.
To answer the secondary questions:
Typically the method of copying a file on any POSIX-compliant operating system (such as BSD) is to use open() / read() / write() and a buffer of some size (16kb, 32kb, or 64kb, for example). Read data into buffer from src, write data from buffer into dest. Repeat until read(src_fd) returns 0 bytes (EOF).
However, depending on your goals, using mmap() to copy a file in this fashion is probably a perfectly viable solution, so long as the files being coped are relatively small (relative to the expected memory constraints of your target hardware and your application). The mmap copy operation will require roughly 2x the total physical memory of the file. So if you're trying to copy a file that's a 8MB, your application will use 16MB to perform the copy. If you expect to be working with even larger files then that duplication could become very costly.
So does using mmap() have other advantages? Actually, no.
The OS will often be much slower about flushing mmap pages than writing data directly to a file using write(). This is because the OS will intentionally prioritize other things ahead of page flushes so to keep the system 'responsive' for foreground tasks/apps.
During the time the mmap pages are being flushed to disk (in the background), the chance of sudden loss of power to the system will cause loss of data. Of course this can happen when using write() as well but if write() finishes faster then there's less chance for unexpected interruption.
the long delay you observe when calling msync() is roughly the time it takes the OS to flush your copied file to disk. When you don't call msync() it happens in the background instead (and also takes even longer for that reason).

reading and writing in chunks on linux using c

I have a ASCII file where every line contains a record of variable length. For example
Record-1:15 characters
Record-2:200 characters
Record-3:500 characters
...
...
Record-n: X characters
As the file sizes is about 10GB, i would like to read the record in chunks. Once read, i need to transform them, write them into another file in binary format.
So, for reading, my first reaction was to create a char array such as
FILE *stream;
char buffer[104857600]; //100 MB char array
fread(buffer, sizeof(buffer), 104857600, stream);
Is it correct to assume, that linux will issue one system call and fetch the entire 100MB?
As the records are separated by new line, i search for character by character for a new line character in the buffer and reconstruct each record.
My question is that is this how i should read in chunks or is there a better alternative to read data in chunks and reconstitute each record? Is there an alternative way to read x number of variable sized lines from an ASCII file in one call ?
Next during write, i do the same. I have a write char buffer, which i pass to fwrite to write a whole set of records in one call.
fwrite(buffer, sizeof(buffer), 104857600, stream);
UPDATE: If i setbuf(stream, buffer), where buffer is my 100MB char buffer, would fgets return from buffer or cause a disk IO?

Yes, fread will fetch the entire thing at once. (Assuming it's a regular file.) But it won't read 105 MB unless the file itself is 105 MB, and if you don't check the return value you have no way of knowing how much data was actually read, or if there was an error.
Use fgets (see man fgets) instead of fread. This will search for the line breaks for you.
char linebuf[1000];
FILE *file = ...;
while (fgets(linebuf, sizeof(linebuf), file) {
// decode one line
}
There is a problem with your code.
char buffer[104857600]; // too big
If you try to allocate a large buffer (105 MB is certainly large) on the stack, then it will fail and your program will crash. If you need a buffer that big, you will have to allocate it on the heap with malloc or similar. I'd certainly keep stack usage for a single function in the tens of KB at most, although you could probably get away with a few MB on most stock Linux systems.
As an alternative, you could just mmap the entire file into memory. This will not improve or degrade performance in most cases, but it easier to work with.
int r, fdes;
struct stat st;
void *ptr;
size_t sz;
fdes = open(filename, O_RDONLY);
if (fdes < 0) abort();
r = fstat(fdes, &st);
if (r) abort();
if (st.st_size > (size_t) -1) abort(); // too big to map
sz = st.st_size;
ptr = mmap(NULL, sz, PROT_READ, MAP_SHARED, fdes, 0);
if (ptr == MAP_FAILED) abort();
close(fdes); // file no longer needed
// now, ptr has the data, sz has the data length
// you can use ordinary string functions
The advantage of using mmap is that your program won't run out of memory. On a 64-bit system, you can put the entire file into your address space at the same time (even a 10 GB file), and the system will automatically read new chunks as your program accesses the memory. The old chunks will be automatically discarded, and re-read if your program needs them again.
It's a very nice way to plow through large files.

If you can, you might find that mmaping the file will be easiest. mmap maps a (portion of a) file into memory so the whole file can be accessed essentially as an array of bytes. In your case, you might not be able to map the whole file at once it would look something like:
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>
/* ... */
struct stat stat_buf;
long pagesz = sysconf(_SC_PAGESIZE);
int fd = fileno(stream);
off_t line_start = 0;
char *file_chunk = NULL;
char *input_line;
off_t cur_off = 0;
off_t map_offset = 0;
/* map 16M plus pagesize to ensure any record <= 16M will always fit in the mapped area */
size_t map_size = 16*1024*1024+pagesz;
if (map_offset + map_size > stat_buf.st_size) {
map_size = stat_buf.st_size - map_offset;
}
fstat(fd, &stat_buf);
/* map the first chunk of the file */
file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
// until we reach the end of the file
while (cur_off < stat_buf.st_size) {
/* check if we're about to read outside the current chunk */
if (!(cur_off-map_offset < map_size)) {
// destroy the previous mapping
munmap(file_chunk, map_size);
// round down to the page before line_start
map_offset = (line_start/pagesz)*pagesz;
// limit mapped region to size of file
if (map_offset + map_size > stat_buf.st_size) {
map_size = stat_buf.st_size - map_offset;
}
// map the next chunk
file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
// adjust the line start for the new mapping
input_line = &file_chunk[line_start-map_offset];
}
if (file_chunk[cur_off-map_offset] == '\n') {
// found a new line, process the current line
process_line(input_line, cur_off-line_start);
// set up for the next one
line_start = cur_off+1;
input_line = &file_chunk[line_start-map_offset];
}
cur_off++;
}
Most of the complication is to avoid making too huge a mapping. You might be able to map the whole file using
char *file_data = mmap(NULL, stat_buf.st_size, PROT_READ, MAP_SHARED, fd, 0);

my opinion is using fgets(buff) for auto detect new line.
and then use strlen(buff) for counting the buffer size,
if( (total+strlen(buff)) > 104857600 )
then write in new chunk..
But the chunk's size will hardly be 104857600 bytes.
CMIIW

Is there really no mremap in Darwin?

I'm trying to find out how to remap memory-mapped files on a Mac (when I want to expand the available space).
I see our friends in the Linux world have mremap but I can find no such function in the headers on my Mac. /Developer/SDKs/MacOSX10.6.sdk/usr/include/sys/mman.h has the following:
mmap
mprotect
msync
munlock
munmap
but no mremap
man mremap confirms my fears.
I'm currently having to munmap and mmmap if I want to resize the size of the mapped file, which involves invalidating all the loaded pages. There must be a better way. Surely?
I'm trying to write code that will work on Mac OS X and Linux. I could settle for a macro to use the best function in each case if I had to but I'd rather do it properly.

If you need to shrink the map, just munmap the part at the end you want to remove.
If you need to enlarge the map, you can mmap the proper offset with MAP_FIXED to the addresses just above the old map, but you need to be careful that you don't map over something else that's already there...
The above text under strikeout is an awful idea; MAP_FIXED is fundamentally wrong unless you already know what's at the target address and want to atomically replace it. If you are trying to opportunistically map something new if the address range is free, you need to use mmap with a requested address but without MAP_FIXED and see if it succeeds and gives you the requested address; if it succeeds but with a different address you'll want to unmap the new mapping you just created and assume that allocation at the requested address was not possible.

If you expand in large enough chunks (say, 64 MB, but it depends on how fast it grows) then the cost of invalidating the old map is negligible. As always, benchmark before assuming a problem.

You can ftruncate the file to a large size (creating a hole) and mmap all of it. If the file is persistent I recommend filling the hole with write calls rather than by writing in the mapping, as otherwise the file's blocks may get unnecessarily fragmented on the disk.

I have no experience with memory mapping, but it looks like you can temporarily map the same file twice as a means to expand the mapping without losing anything.
int main() {
int fd;
char *fp, *fp2, *pen;
/* create 1K file */
fd = open( "mmap_data.txt", O_RDWR | O_CREAT, 0777 );
lseek( fd, 1000, SEEK_SET );
write( fd, "a", 1 );
/* map and populate it */
fp = mmap( NULL, 1000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
pen = memset( fp, 'x', 1000 );
/* expand to 8K and establish overlapping mapping */
lseek( fd, 8000, SEEK_SET );
write( fd, "b", 1 );
fp2 = mmap( NULL, 7000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
/* demonstrate that mappings alias */
*fp = 'z';
printf( "%c ", *fp2 );
/* eliminate first mapping */
munmap( fp, 1000 );
/* populate second mapping */
pen = memset( fp2+10, 'y', 7000 );
/* wrap up */
munmap( fp2, 7000 );
close( fd );
printf( "%d\n", errno );
}
The output is zxxxxxxxxxyyyyyy.....
I suppose, if you pound on this, it may be possible to run out of address space faster than with mremap. But nothing is guaranteed either way and it might on the other hand be just as safe.