copy whole of a file into memory using mmap - c

I want to copy the whole of a file into memory using mmap in C. I wrote this code:
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <errno.h>
int main(int arg, char *argv[])
{
    char c;
    int numOfWs = 0;
    int numOfPr = 0;
    int numberOfCharacters;
    int i = 0;
    int k;
    int pageSize = getpagesize();
    char *data;
    float wsP = 0;
    float prP = 0;
    int fp = open("2.txt", O_RDWR);
    data = mmap((caddr_t)0, pageSize, PROT_READ, MAP_SHARED, fp, pageSize);
    printf("%s\n", data);
    exit(0);
}
When I execute the code I get a bus error.
Next, I want to iterate over the copied file and do something with it.
How can I copy the file correctly?

Two things:
1. The second parameter of mmap() is the size of the portion of the file you want to make visible in your address space. The last one is the offset in the file from which you want the map. This means that, as you have called mmap(), you will see only one page (4096 bytes on x86 and ARM) starting at offset 4096 in your file. If your file is smaller than 4096 bytes, there will be no mapping and mmap() will return MAP_FAILED (i.e. (caddr_t)-1). You didn't check the return value of the function, so the following printf() dereferences an illegal pointer => BUS ERROR.
2. Using a memory map with string functions can be difficult if the file doesn't contain a binary 0: these functions may then read past the mapped size of the file and touch unmapped memory => SEGFAULT.
To map a whole file, you have to know its size:
struct stat filestat;
if (fstat(fp, &filestat) != 0) {
    perror("stat failed");
    exit(1);
}
data = mmap(NULL, filestat.st_size, PROT_READ, MAP_SHARED, fp, 0);
if (data == MAP_FAILED) {
    perror("mmap failed");
    exit(2);
}
EDIT: The memory map will always be created with a size that is a multiple of the page size. This means that the last page will be filled with 0 up to the next multiple of the page size. Programs that use memory-mapped files with string functions (like your printf()) will often work most of the time, but will suddenly crash when mapping a file whose size is exactly a multiple of the page size (4096, 8192, 12288, etc.). The often-seen advice to pass mmap() a size bigger than the real file size works on Linux, but it is not portable and is even in violation of POSIX, which explicitly states that mapping beyond the file size is undefined behaviour. The only portable way is to not use string functions on memory maps.
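For reference, a minimal sketch of the whole corrected program (keeping the question's 2.txt, assuming the file is non-empty and fits in the address space), which avoids string functions by writing exactly st_size bytes:

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>

int main(void)
{
    int fd = open("2.txt", O_RDONLY);
    if (fd == -1) { perror("open"); exit(1); }

    struct stat filestat;
    if (fstat(fd, &filestat) != 0) { perror("fstat"); exit(1); }

    /* map the whole file, starting at offset 0 */
    char *data = mmap(NULL, filestat.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (data == MAP_FAILED) { perror("mmap"); exit(2); }

    /* write exactly st_size bytes instead of relying on a terminating 0 */
    fwrite(data, 1, (size_t)filestat.st_size, stdout);

    munmap(data, filestat.st_size);
    close(fd);
    return 0;
}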

The last parameter of mmap is the offset within the file at which the mapped part of the file starts. It should be 0 in your case:
data = mmap(NULL, pageSize, PROT_READ, MAP_SHARED, fp,0);
If your file is shorter than pageSize, you will not be able to use addresses beyond the end of the file. To use the full size, you should expand the file to pageSize before calling mmap. Use something like:
ftruncate(fp, pageSize);
If you want to write to the memory (and hence the file), you should use the PROT_WRITE flag as well, i.e.
data = mmap(NULL, pageSize, PROT_READ|PROT_WRITE, MAP_SHARED, fp,0);
If your file does not contain a 0 character (as end of string) and you want to print it as a string, you should use printf with an explicitly specified maximum size:
printf("%.*s\n", pageSize, data);
Also, of course, as pointed out by @Jongware, you should test the result of open for -1 and of mmap for MAP_FAILED.
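Putting those pieces together, a rough sketch (reusing the question's variables; note that padding 2.txt to one page with ftruncate does modify the file on disk):

int fp = open("2.txt", O_RDWR);
if (fp == -1) { perror("open"); exit(1); }

int pageSize = getpagesize();

/* pad the file to one page so the whole mapping is backed by file data */
if (ftruncate(fp, pageSize) == -1) { perror("ftruncate"); exit(1); }

char *data = mmap(NULL, pageSize, PROT_READ | PROT_WRITE, MAP_SHARED, fp, 0);
if (data == MAP_FAILED) { perror("mmap"); exit(1); }

/* print at most pageSize bytes even if there is no terminating 0 */
printf("%.*s\n", pageSize, data);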

Related

How to change characters in a text file using C's mmap()?

Let's say I have the standard "Hello, World! \n" saved to a text file called hello.txt. If I want to change the 'H' to an 'R' or something, can I achieve this with mmap()?
mmap does not exist in the standard C99 (or C11) specification. It is defined in POSIX.
So assuming you have a POSIX system (e.g. Linux), you could first open(2) the file for read & write:
int myfd = open("hello.txt", O_RDWR);
if (myfd<0) { perror("hello.txt open"); exit(EXIT_FAILURE); };
Then you get the size (and other meta-data) of the file with fstat(2):
struct stat mystat = {};
if (fstat(myfd,&mystat)) { perror("fstat"); exit(EXIT_FAILURE); };
Now the size of the file is in mystat.st_size.
off_t myfsz = mystat.st_size;
Now we can call mmap(2) and we need to share the mapping (to be able to write inside the file thru the virtual address space)
void*ad = mmap(NULL, myfsz, PROT_READ|PROT_WRITE, MAP_SHARED,
myfd, 0);
if (ad == MAP_FAILED) { perror("mmap"); exit(EXIT_FAILURE); };
Then we can overwrite the first byte (and we check that indeed the first byte in that file is H since you promised so):
assert(*(char*)ad == 'H');   // needs <assert.h>
*(char*)ad = 'R';
We might call msync(2) to ensure the file is updated right now on the disk. If we don't, it could be updated later.
Notably, for very large mappings (those much larger than available RAM), we can assist the kernel (and its page cache) with hints given through madvise(2) or posix_madvise(3)...
Notice that a mapping remains in effect even after a close(2). Use munmap & mprotect, or mmap with MAP_FIXED on the same address range, to change it.
On Linux, you could use proc(5) to query the address space. So your program could read (e.g. after fopen, using fgets in a loop) the pseudo /proc/self/maps file (or /proc/1234/maps for process of pid 1234).
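For example, a tiny Linux-specific sketch that dumps the calling process's own mappings:

#include <stdio.h>

/* print this process's memory mappings by reading /proc/self/maps (Linux-specific) */
void dump_maps(void)
{
    char line[256];
    FILE *maps = fopen("/proc/self/maps", "r");
    if (!maps) { perror("fopen /proc/self/maps"); return; }
    while (fgets(line, sizeof line, maps))
        fputs(line, stdout);
    fclose(maps);
}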
BTW, mmap is used by dlopen(3); it can be called a lot of times, and my manydl.c program demonstrates that on Linux you could have many hundreds of thousands of dlopen-ed shared objects (so many hundreds of thousands of memory mappings).
Here's a working example.
#include <stdio.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>

int main(void)
{
    int myFile = open("hello.txt", O_RDWR);
    if (myFile < 0) {
        printf("open error\n");
        return 1;
    }

    struct stat myStat;
    if (fstat(myFile, &myStat)) {
        printf("fstat error\n");
        return 1;
    }
    off_t size = myStat.st_size;

    /* map the whole file read/write and shared, so the change reaches the file */
    char *addr = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, myFile, 0);
    if (addr == MAP_FAILED) {
        printf("mmap error\n");
        return 1;
    }

    if (addr[0] != 'H') {
        printf("Error: first char in file not H\n");
        return 1;
    }
    addr[0] = 'J';
    return 0;
}
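If you also want the change flushed to disk and the resources released, you could add something like the following right before the final return 0; (a sketch; close needs <unistd.h>):

    /* push the modified bytes back to hello.txt synchronously */
    if (msync(addr, size, MS_SYNC) == -1)
        printf("msync error\n");

    /* release the mapping and the descriptor */
    munmap(addr, size);
    close(myFile);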

Segmentation fault with Posix-C program using mmap and mapfile

Well, I have this program and I get a segmentation fault: 11 (core dumped). After lots of checks I found it happens when the for loop gets to i=1024 and it tries to do mapfile[i]=0. The program is about making a server and a client program that communicate by reading/writing a common file created in the server program. This is the server program, and it prints the value inside before and after the change. I would like to see what's going on: whether it's a problem with the mapping or just a memory problem with *mapfile. Thanks!
#include <sys/shm.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <errno.h>
#include <math.h>

int main()
{
    int ret, i;
    int *mapfile;

    system("dd if=/dev/zero of=/tmp/c4 bs=4 count=500");
    ret = open("/tmp/c4", O_RDWR | (mode_t)0600);
    if (ret == -1)
    {
        perror("File");
        return 0;
    }
    mapfile = mmap(NULL, 2000, PROT_READ | PROT_WRITE, MAP_SHARED, ret, 0);
    for (i=1; i<=2000; i++)
    {
        mapfile[i] = 0;
    }
    while (mapfile[0] != 555)
    {
        mapfile = mmap(NULL, 2000, PROT_READ | PROT_WRITE, MAP_SHARED, ret, 0);
        if (mapfile[0] != 0)
        {
            printf("Readed from file /tmp/c4 (before): %d\n", mapfile[0]);
            mapfile[0] = mapfile[0] + 5;
            printf("Readed from file /tmp/c4 (after) : %d\n", mapfile[0]);
            mapfile[0] = 0;
        }
        sleep(1);
    }
    ret = munmap(mapfile, 2000);
    if (ret == -1)
    {
        perror("munmap");
        return 0;
    }
    close(ret);
    return 0;
}
mapfile = mmap(NULL, 2000, PROT_READ | PROT_WRITE, MAP_SHARED, ret, 0);
for (i=1; i<=2000; i++)
{
    mapfile[i] = 0;
}
In this code you are requesting 2000 bytes of memory. mmap takes a size_t, meaning it expects a size in bytes, not a count of elements. As @Mat mentioned, you will need to use the sizeof(int) operator in order to feed mmap the proper size it requires.
The other issue that may cause a problem for you down the road is starting your loop index at i=1 and running it up to i<=2000. Starting at 0 and stopping before 2000 ensures that you touch indices 0 to 1999, which corresponds to the memory you are trying to map.
Overall, it looks like what you're trying to do is initialize the memory to 0. Perhaps you could do this more easily with the standard library function memset:
void *memset(void *str, int c, size_t n)
Your code then becomes:
mapfile = mmap(NULL, 2000*sizeof(int), PROT_READ | PROT_WRITE, MAP_SHARED, ret, 0);
void *returnedPointer = memset(mapfile, 0, 2000*sizeof(int));
docs for memset can be found here:
http://www.tutorialspoint.com/c_standard_library/c_function_memset.htm
You're requesting 2000 bytes from mmap, but treating the returned value as an array of 2000 ints. That can't work; an int is usually 4 or 8 bytes these days. You'll be writing past the end of the reserved memory in your loop.
Change the mmap calls to use 2000*sizeof(int). And while you're at it, give that 2000 constant a name (e.g. const int num_elems = 2000; near the top) and don't repeat the magic constant all over the place. And once that's done change it to 1024 or 2048 so that the resulting size is a multiple of the page size (if you're not sure of your page size, getconf PAGE_SIZE on the command line).
Also change your dd command to create a large-enough file. It currently creates a 2000-byte file; you'll need to increase that as well.
And validate the return value of mmap - it can fail, and you should detect that.
Finally, don't continuously remap: you're using MAP_SHARED, so modifications made through other shared mappings of the same file and offset will be visible to your process. (It must really be the same file; if the other process also does a dd, that might not work. Only one process should have the responsibility of creating that file.)
If you do want to remap, you must also unmap each time. Otherwise you're leaking mappings.
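A rough sketch of what the corrected setup might look like (keeping the original /tmp/c4 path, naming the element count once, and sizing everything with sizeof(int); the dd count assumes sizeof(int) == 4 on this machine):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

int main(void)
{
    const int num_elems = 2000;                 /* named once, reused everywhere */
    const size_t map_len = num_elems * sizeof(int);

    /* create a file large enough for num_elems ints (assumes sizeof(int) == 4) */
    system("dd if=/dev/zero of=/tmp/c4 bs=4 count=2000");

    int fd = open("/tmp/c4", O_RDWR);
    if (fd == -1) { perror("open"); return 1; }

    int *mapfile = mmap(NULL, map_len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (mapfile == MAP_FAILED) { perror("mmap"); return 1; }

    memset(mapfile, 0, map_len);                /* zero the whole region once */

    /* ... poll mapfile[0] here as in the original loop, without remapping ... */

    munmap(mapfile, map_len);
    close(fd);
    return 0;
}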

reading and writing in chunks on linux using c

I have an ASCII file where every line contains a record of variable length. For example:
Record-1:15 characters
Record-2:200 characters
Record-3:500 characters
...
...
Record-n: X characters
As the file size is about 10 GB, I would like to read the records in chunks. Once read, I need to transform them and write them into another file in binary format.
So, for reading, my first reaction was to create a char array such as:
FILE *stream;
char buffer[104857600]; //100 MB char array
fread(buffer, sizeof(buffer), 104857600, stream);
Is it correct to assume that Linux will issue one system call and fetch the entire 100 MB?
As the records are separated by newlines, I search character by character for a newline in the buffer and reconstruct each record.
My question is: is this how I should read in chunks, or is there a better alternative to read data in chunks and reconstitute each record? Is there an alternative way to read x variable-sized lines from an ASCII file in one call?
Next, during writing, I do the same: I have a write char buffer which I pass to fwrite to write a whole set of records in one call.
fwrite(buffer, sizeof(buffer), 104857600, stream);
UPDATE: If I setbuf(stream, buffer), where buffer is my 100 MB char buffer, would fgets return data from buffer or cause disk I/O?
Yes, fread will fetch the entire thing at once. (Assuming it's a regular file.) But it won't read 105 MB unless the file itself is 105 MB, and if you don't check the return value you have no way of knowing how much data was actually read, or if there was an error.
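For instance, a minimal sketch of checking how much data fread actually delivered (reusing the question's buffer and stream; element size 1 so the return value is a byte count):

size_t got = fread(buffer, 1, sizeof(buffer), stream);
if (got < sizeof(buffer) && ferror(stream)) {
    perror("fread");
    exit(EXIT_FAILURE);
}
/* only the first 'got' bytes of buffer hold file data */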
Use fgets (see man fgets) instead of fread. This will search for the line breaks for you.
char linebuf[1000];
FILE *file = ...;
while (fgets(linebuf, sizeof(linebuf), file)) {
// decode one line
}
There is a problem with your code.
char buffer[104857600]; // too big
If you try to allocate a large buffer (105 MB is certainly large) on the stack, then it will fail and your program will crash. If you need a buffer that big, you will have to allocate it on the heap with malloc or similar. I'd certainly keep stack usage for a single function in the tens of KB at most, although you could probably get away with a few MB on most stock Linux systems.
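A minimal sketch of the heap-allocated alternative (error handling pared down to the essentials; malloc and free need <stdlib.h>):

/* heap allocation instead of a 100 MB stack array */
char *buffer = malloc(104857600);
if (buffer == NULL) {
    perror("malloc");
    exit(EXIT_FAILURE);
}
/* ... read into and process buffer ... */
free(buffer);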
As an alternative, you could just mmap the entire file into memory. This will not improve or degrade performance in most cases, but it is easier to work with.
int r, fdes;
struct stat st;
void *ptr;
size_t sz;
fdes = open(filename, O_RDONLY);
if (fdes < 0) abort();
r = fstat(fdes, &st);
if (r) abort();
if (st.st_size > (size_t) -1) abort(); // too big to map
sz = st.st_size;
ptr = mmap(NULL, sz, PROT_READ, MAP_SHARED, fdes, 0);
if (ptr == MAP_FAILED) abort();
close(fdes); // file no longer needed
// now, ptr has the data, sz has the data length
// you can use ordinary string functions
The advantage of using mmap is that your program won't run out of memory. On a 64-bit system, you can put the entire file into your address space at the same time (even a 10 GB file), and the system will automatically read new chunks as your program accesses the memory. The old chunks will be automatically discarded, and re-read if your program needs them again.
It's a very nice way to plow through large files.
If you can, you might find that mmapping the file will be easiest. mmap maps a (portion of a) file into memory so the whole file can be accessed essentially as an array of bytes. In your case, you might not be able to map the whole file at once; it would look something like:
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>
#include <sys/mman.h>
/* ... */
struct stat stat_buf;
long pagesz = sysconf(_SC_PAGESIZE);
int fd = fileno(stream);
off_t line_start = 0;
char *file_chunk = NULL;
char *input_line;
off_t cur_off = 0;
off_t map_offset = 0;
/* map 16M plus pagesize to ensure any record <= 16M will always fit in the mapped area */
size_t map_size = 16*1024*1024+pagesz;

/* we need the file size before we can clamp the mapping to it */
fstat(fd, &stat_buf);
if (map_offset + map_size > stat_buf.st_size) {
    map_size = stat_buf.st_size - map_offset;
}
/* map the first chunk of the file */
file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
input_line = file_chunk;
// until we reach the end of the file
while (cur_off < stat_buf.st_size) {
    /* check if we're about to read outside the current chunk */
    if (!(cur_off-map_offset < map_size)) {
        // destroy the previous mapping
        munmap(file_chunk, map_size);
        // round down to the page before line_start
        map_offset = (line_start/pagesz)*pagesz;
        // limit mapped region to size of file
        if (map_offset + map_size > stat_buf.st_size) {
            map_size = stat_buf.st_size - map_offset;
        }
        // map the next chunk
        file_chunk = mmap(NULL, map_size, PROT_READ, MAP_SHARED, fd, map_offset);
        // adjust the line start for the new mapping
        input_line = &file_chunk[line_start-map_offset];
    }
    if (file_chunk[cur_off-map_offset] == '\n') {
        // found a new line, process the current line
        process_line(input_line, cur_off-line_start);
        // set up for the next one
        line_start = cur_off+1;
        input_line = &file_chunk[line_start-map_offset];
    }
    cur_off++;
}
Most of the complication is to avoid making too huge a mapping. You might be able to map the whole file using
char *file_data = mmap(NULL, stat_buf.st_size, PROT_READ, MAP_SHARED, fd, 0);
My opinion is to use fgets(buff) to detect newlines automatically,
and then use strlen(buff) to keep track of how full the buffer is;
if (total + strlen(buff)) > 104857600,
then write out the current chunk and start a new one.
The chunk size will rarely be exactly 104857600 bytes, though; see the sketch below.
CMIIW
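A rough sketch of that idea (assuming FILE *in and FILE *out are already open, no line exceeds the 1000-byte line buffer, and the 100 MB chunk lives on the heap):

enum { CHUNK_SIZE = 104857600 };            /* 100 MB output chunk */

char line[1000];
char *chunk = malloc(CHUNK_SIZE);
size_t total = 0;

while (fgets(line, sizeof line, in)) {
    size_t len = strlen(line);
    if (total + len > CHUNK_SIZE) {         /* next line would overflow the chunk */
        fwrite(chunk, 1, total, out);       /* flush what we have */
        total = 0;
    }
    memcpy(chunk + total, line, len);       /* append the line to the chunk */
    total += len;
}
if (total > 0)
    fwrite(chunk, 1, total, out);           /* flush the final partial chunk */
free(chunk);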

Segfault while using mmap in C for reading binary files

I am trying to use mmap in C just to see exactly how it works. Currently I am trying to read a binary file byte by byte using mmap. My code is like this:
#include <unistd.h>
#include <sys/types.h>
#include <sys/mman.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>

int main(int argc, char *argv[]) {
    int fd;
    char *data;
    char notation;

    for (int i = 1; i < argc; i++)
    {
        if (strcmp(argv[i], "-i") == 0)
            fd = open(argv[i+1], O_RDONLY);
    }
    data = mmap(NULL, 4000, PROT_READ, MAP_SHARED, fd, 8000);

    int i = 0;
    notation = data[i];
    // ......
}
My problem occurs when I try notation = data[0] and I get a segfault. I am sure that the first byte in the binary file is a character. My for loop checks whether there is an -i flag among the command-line arguments; if there is, the next argument should be the file name.
It appears that mmap fails because the offset is not a multiple of the page size. You can test this with perror and see that the problem is an invalid argument. If you write:
data = mmap(NULL, 4000, PROT_READ, MAP_SHARED, fd, 8000);
perror("Error");
At least on my OS X the following error is printed:
Error: Invalid argument
Changing offset from 8000 to 4096 or 8192 works. 6144 doesn't, so it has to be a multiple of 4096 on this platform. Incidentally,
printf("%d\n",getpagesize());
prints 4096. You should round your offset down to the nearest multiple of this for mmap and add the remainder to the index when accessing the area. Of course, get the page size for your particular platform from that function. It's declared in unistd.h, which you already include.
Here's how to handle the offset correctly and deal with possible errors. It prints the byte at position 8000:
int offset = 8000;
int pageoffset = offset % getpagesize();
data = mmap(NULL, 4000 + pageoffset, PROT_READ, MAP_SHARED, fd, offset - pageoffset);
if ( data == MAP_FAILED ) {
perror ( "mmap" );
exit ( EXIT_FAILURE );
}
i = 0;
printf("%c\n",data [i + pageoffset]);

Why does mmap() fail with ENOMEM on a 1TB sparse file?

I've been working with large sparse files on openSUSE 11.2 x86_64. When I try to mmap() a 1TB sparse file, it fails with ENOMEM. I would have thought that the 64 bit address space would be adequate to map in a terabyte, but it seems not. Experimenting further, a 1GB file works fine, but a 2GB file (and anything bigger) fails. I'm guessing there might be a setting somewhere to tweak, but an extensive search turns up nothing.
Here's some sample code that shows the problem - any clues?
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>
int main(int argc, char *argv[]) {
    char * filename = argv[1];
    int fd;
    off_t size = 1UL << 40; // 30 == 1GB, 40 == 1TB

    fd = open(filename, O_RDWR | O_CREAT | O_TRUNC, 0666);
    ftruncate(fd, size);
    printf("Created %ld byte sparse file\n", size);

    char * buffer = (char *)mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if ( buffer == MAP_FAILED ) {
        perror("mmap");
        exit(1);
    }
    printf("Done mmap - returned 0x0%lx\n", (unsigned long)buffer);

    strcpy( buffer, "cafebabe" );
    printf("Wrote to start\n");
    strcpy( buffer + (size - 9), "deadbeef" );
    printf("Wrote to end\n");

    if ( munmap(buffer, (size_t)size) < 0 ) {
        perror("munmap");
        exit(1);
    }
    close(fd);
    return 0;
}
The problem was that the per-process virtual memory limit was set to only 1.7 GB. ulimit -v 1610612736 set it to 1.5 TB (ulimit -v takes a value in kilobytes) and my mmap() call succeeded. Thanks, bmargulies, for the hint to try ulimit -a!
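You can also inspect (and raise, up to the hard limit) the same limit from inside the program with getrlimit/setrlimit; a small sketch, assuming RLIMIT_AS is the limit being hit:

#include <stdio.h>
#include <sys/resource.h>

/* print the address-space limit and raise the soft limit to the hard limit */
static void raise_as_limit(void)
{
    struct rlimit rl;
    if (getrlimit(RLIMIT_AS, &rl) == 0) {
        printf("RLIMIT_AS: soft=%llu hard=%llu\n",
               (unsigned long long)rl.rlim_cur,
               (unsigned long long)rl.rlim_max);
        rl.rlim_cur = rl.rlim_max;
        if (setrlimit(RLIMIT_AS, &rl) != 0)
            perror("setrlimit");
    } else {
        perror("getrlimit");
    }
}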
Is there some sort of per-user quota, limiting the amount of memory available to a user process?
My guess is that the kernel is having difficulty allocating the memory it needs to keep track of this memory mapping. I don't know how swapped-out pages are tracked in the Linux kernel (and I assume that most of the file would be in the swapped-out state most of the time), but it may end up needing a table entry for each page of memory that the file takes up. Since this file might be mmapped by more than one process, the kernel has to keep track of the mapping from the process's point of view, which would map to another point of view, which would map to secondary storage (and include fields for device and location).
This would fit into your addressable space, but might not fit (at least contiguously) within physical memory.
If anyone knows more about how Linux does this I'd be interested to hear about it.
