This looks like a simple question, but I didn't find anything similar here.
Since there is no file copy function in C, we have to implement file copying ourselves, but I don't like reinventing the wheel even for trivial stuff like that, so I'd like to ask the cloud:
What code would you recommend for file copying using fopen()/fread()/fwrite()?
What code would you recommend for file copying using open()/read()/write()?
This code should be portable (Windows/Mac/Linux/BSD/QNX/you name it), stable, time-tested, fast, memory efficient, and so on. Getting into a specific system's internals to squeeze out some more performance is welcome (for example, querying the filesystem cluster size).
This seems like a trivial question but, for example, the source code for the cp command isn't 10 lines of C code.
This is the function I use when I need to copy from one file to another - with test harness:
/*
#(#)File: $RCSfile: fcopy.c,v $
#(#)Version: $Revision: 1.11 $
#(#)Last changed: $Date: 2008/02/11 07:28:06 $
#(#)Purpose: Copy the rest of file1 to file2
#(#)Author: J Leffler
#(#)Modified: 1991,1997,2000,2003,2005,2008
*/
/*TABSTOP=4*/
#include "jlss.h"
#include "stderr.h"
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_fcopy_c[] = "#(#)$Id: fcopy.c,v 1.11 2008/02/11 07:28:06 jleffler Exp $";
#endif /* lint */
void fcopy(FILE *f1, FILE *f2)
{
char buffer[BUFSIZ];
size_t n;
while ((n = fread(buffer, sizeof(char), sizeof(buffer), f1)) > 0)
{
if (fwrite(buffer, sizeof(char), n, f2) != n)
err_syserr("write failed\n");
}
}
#ifdef TEST
int main(int argc, char **argv)
{
FILE *fp1;
FILE *fp2;
err_setarg0(argv[0]);
if (argc != 3)
err_usage("from to");
if ((fp1 = fopen(argv[1], "rb")) == 0)
err_syserr("cannot open file %s for reading\n", argv[1]);
if ((fp2 = fopen(argv[2], "wb")) == 0)
err_syserr("cannot open file %s for writing\n", argv[2]);
fcopy(fp1, fp2);
return(0);
}
#endif /* TEST */
Clearly, this version uses file pointers from standard I/O and not file descriptors, but it is reasonably efficient and about as portable as it can be.
Well, except the error function - that's peculiar to me. As long as you handle errors cleanly, you should be OK. The "jlss.h" header declares fcopy(); the "stderr.h" header declares err_syserr() amongst many other similar error reporting functions. A simple version of the function follows - the real one adds the program name and does some other stuff.
#include "stderr.h"
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
void err_syserr(const char *fmt, ...)
{
int errnum = errno;
va_list args;
va_start(args, fmt);
vfprintf(stderr, fmt, args);
va_end(args);
if (errnum != 0)
fprintf(stderr, "(%d: %s)\n", errnum, strerror(errnum));
exit(1);
}
The code above may be treated as having a modern BSD license or GPL v3 at your choice.
As far as the actual I/O goes, the code I've written a million times in various guises for copying data from one stream to another goes something like this. It returns 0 on success, or -1 with errno set on error (in which case any number of bytes might have been copied).
Note that for copying regular files, you can skip the EAGAIN stuff, since regular files are always blocking I/O. But inevitably if you write this code, someone will use it on other types of file descriptors, so consider it a freebie.
There's a file-specific optimisation that GNU cp does, which I haven't bothered with here, that for long blocks of 0 bytes instead of writing you just extend the output file by seeking off the end.
void block(int fd, int event) {
struct pollfd topoll;
topoll.fd = fd;
topoll.events = event;
poll(&topoll, 1, -1);
// no need to check errors - if the stream is bust then the
// next read/write will tell us
}
int copy_data_buffer(int fdin, int fdout, void *buf, size_t bufsize) {
for(;;) {
char *pos; // char *, so the pointer arithmetic below is standard C
// read data to buffer
ssize_t bytestowrite = read(fdin, buf, bufsize);
if (bytestowrite == 0) break; // end of input
if (bytestowrite == -1) {
if (errno == EINTR) continue; // signal handled
if (errno == EAGAIN) {
block(fdin, POLLIN);
continue;
}
return -1; // error
}
// write data from buffer
pos = buf;
while (bytestowrite > 0) {
ssize_t bytes_written = write(fdout, pos, bytestowrite);
if (bytes_written == -1) {
if (errno == EINTR) continue; // signal handled
if (errno == EAGAIN) {
block(fdout, POLLOUT);
continue;
}
return -1; // error
}
bytestowrite -= bytes_written;
pos += bytes_written;
}
}
return 0; // success
}
// Default value. I think it will get close to maximum speed on most
// systems, short of using mmap etc. But porters / integrators
// might want to set it smaller, if the system is very memory
// constrained and they don't want this routine to starve
// concurrent ops of memory. And they might want to set it larger
// if I'm completely wrong and larger buffers improve performance.
// It's worth trying several MB at least once, although with huge
// allocations you have to watch for the linux
// "crash on access instead of returning 0" behaviour for failed malloc.
#ifndef FILECOPY_BUFFER_SIZE
#define FILECOPY_BUFFER_SIZE (64*1024)
#endif
int copy_data(int fdin, int fdout) {
// optional exercise for reader: take the file size as a parameter,
// and don't use a buffer any bigger than that. This prevents
// memory-hogging if FILECOPY_BUFFER_SIZE is very large and the file
// is small.
for (size_t bufsize = FILECOPY_BUFFER_SIZE; bufsize >= 256; bufsize /= 2) {
void *buffer = malloc(bufsize);
if (buffer != NULL) {
int result = copy_data_buffer(fdin, fdout, buffer, bufsize);
free(buffer);
return result;
}
}
// could use a stack buffer here instead of failing, if desired.
// 128 bytes ought to fit on any stack worth having, but again
// this could be made configurable.
return -1; // errno is ENOMEM
}
To open the input file:
int fdin = open(infile, O_RDONLY|O_BINARY, 0);
if (fdin == -1) return -1;
Opening the output file is tricksy. As a basis, you want:
int fdout = open(outfile, O_WRONLY|O_BINARY|O_CREAT|O_TRUNC, 0x1ff);
if (fdout == -1) {
close(fdin);
return -1;
}
But there are confounding factors:
you need to special-case when the files are the same, and I can't remember how to do that portably.
if the output filename is a directory, you might want to copy the file into the directory.
if the output file already exists (open with O_EXCL to determine this and check for EEXIST on error), you might want to do something different, as cp -i does.
you might want the permissions of the output file to reflect those of the input file.
you might want other platform-specific meta-data to be copied.
you may or may not wish to unlink the output file on error.
Obviously the answers to all these questions could be "do the same as cp". In which case the answer to the original question is "ignore everything I or anyone else has said, and use the source of cp".
Btw, getting the filesystem's cluster size is next to useless. You'll almost always see speed increasing with buffer size long after you've passed the size of a disk block.
The size of each read needs to be a multiple of 512 (the sector size); 4096 is a good choice.
Here is a very easy and clear example: Copy a file. Since it is written in ANSI-C without any particular function calls I think this one would be pretty much portable.
Depending on what you mean by copying a file, it is certainly far from trivial. If you mean copying the content only, then there is almost nothing to do. But generally, you need to copy the metadata of the file, and that's surely platform dependent. I don't know of any C library which does what you want in a portable manner. Just handling the filename by itself is no trivial matter if you care about portability.
In C++, there is the file library in boost
One thing I found when implementing my own file copy, and it seems obvious but it's not: I/O operations are slow. You can pretty much predict your copy's speed from how many of them you do. So clearly you need to do as few of them as possible.
The best results I found were when I got myself a ginormous buffer, read the entire source file into it in one I/O, then wrote the entire buffer back out in one I/O. If I had to do it in even 10 batches, it got way slow. Trying to read and write each byte individually, like a naive coder might try first, was just painful.
The accepted answer, written by Steve Jessop, does not answer the first part of the question. Jonathan Leffler's answer does, but gets it wrong: the code should be written as
while ((n = fread(buffer, 1, sizeof(buffer), f1)) > 0)
if (fwrite(buffer, n, 1, f2) != 1)
/* we got write error here */
/* test ferror(f1) for a read errors */
Explanation:
sizeof(char) is 1 by definition, always. It does not matter how many bits a char has, whether 8 (in most cases), 9, 11 or 32 (on some DSPs, for example): the size of char is one. Note that this is not an error here, just redundant code.
The fwrite function writes up to nmemb (third argument) elements of the specified size (second argument); it is not required to write exactly nmemb elements. To fix this, you must either write the rest of the data that was read, or write just one element of size n and let fwrite do all the work. (Whether fwrite must write all the data or not is open to question, but in my version short writes are impossible unless an error occurs.)
You should test for read errors too: just test ferror(f1) after the loop.
Note, you probably need to disable buffering on both input and output files to prevent triple buffering: first on read to f1 buffer, second in our code, third on write to f2 buffer:
setvbuf(f1, NULL, _IONBF, 0);
setvbuf(f2, NULL, _IONBF, 0);
(Internal buffers should, probably, be of size BUFSIZ.)
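Putting those points together, a copy function that reports both read and write errors might look like the sketch below. The name fcopy_checked is hypothetical and appears in neither answer. It keeps the fwrite(buffer, 1, n, f2) != n form; since anything short of a full write is treated as failure, the choice between the two fwrite forms does not change the outcome here.

```c
#include <stdio.h>

/* Copies the rest of in to out.
 * Returns 0 on success, -1 on a read or write error. */
int fcopy_checked(FILE *in, FILE *out)
{
    char buffer[BUFSIZ];
    size_t n;

    while ((n = fread(buffer, 1, sizeof buffer, in)) > 0) {
        if (fwrite(buffer, 1, n, out) != n)
            return -1;          /* short write: treat as an error */
    }
    if (ferror(in))
        return -1;              /* fread stopped on an error, not EOF */
    if (fflush(out) == EOF)
        return -1;              /* buffered data could not be written */
    return 0;
}
```

The fflush() call matters when out is buffered: without it, a write failure may only surface later, at fclose().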
I want to take all characters past location 900 from a file called WWW, and put all of these in an array:
//Keep track of all characters past position 900 in WWW.
int Seek900InWWW = lseek(WWW, 900, 0); //goes to position 900 in WWW
printf("%d \n", Seek900InWWW);
if(Seek900InWWW < 0)
printf("Error seeking to position 900 in WWW.txt");
char EverythingPast900[appropriatesize];
int NextRead;
char NextChar[1];
int i = 0;
while((NextRead = read(WWW, NextChar, sizeof(NextChar))) > 0) {
EverythingPast900[i] = NextChar[0];
printf("%c \n", NextChar[0]);
i++;
}
I create a char array of length 1 because the read system call requires a pointer, so I cannot use a regular char. The above code does not work. In fact, it does not print any characters to the terminal as the loop should. I think my logic is correct, but perhaps a misunderstanding of what's going on behind the scenes is making this hard for me. Or maybe I missed something simple (hope not).
If you already know how many bytes to read (e.g. in appropriatesize) then just read in that many bytes at once, rather than reading in bytes one at a time.
char everythingPast900[appropriatesize];
ssize_t bytesRead = read(WWW, everythingPast900, sizeof everythingPast900);
if (bytesRead > 0 && bytesRead != appropriatesize)
{
// only everythingPast900[0] to everythingPast900[bytesRead - 1] is valid
}
I made a test version of your code and added bits you left out. Why did you leave them out?
I also made a file named www.txt that has a hundred lines of "This is a test line." in it.
And I found a potential problem, depending on how big your appropriatesize value is and how big the file is. If you write past the end of EverythingPast900 it is possible for you to kill your program and crash it before you ever produce any output to display. That might happen on Windows where stdout may not be line buffered depending on which libraries you used.
See the MSDN setvbuf page, in particular "For some systems, this provides line buffering. However, for Win32, the behavior is the same as _IOFBF - Full Buffering."
This seems to work:
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
int main()
{
int WWW = open("www.txt", O_RDONLY);
if(WWW < 0)
printf("Error opening www.txt\n");
//Keep track of all characters past position 900 in WWW.
int Seek900InWWW = lseek(WWW, 900, 0); //goes to position 900 in WWW
printf("%d \n", Seek900InWWW);
if(Seek900InWWW < 0)
printf("Error seeking to position 900 in WWW.txt");
int appropriatesize = 1000;
char EverythingPast900[appropriatesize];
int NextRead;
char NextChar[1];
int i = 0;
while(i < appropriatesize && (NextRead = read(WWW, NextChar, sizeof(NextChar))) > 0) {
EverythingPast900[i] = NextChar[0];
printf("%c \n", NextChar[0]);
i++;
}
return 0;
}
As stated in another answer, read more than one byte. The theory behind "buffers" is to reduce the amount of read/write operations due to how slow disk I/O (or network I/O) is compared to memory speed and CPU speed. Look at it as if it is code and consider which is faster: adding 1 to the file size N times and writing N bytes individually, or adding N to the file size once and writing N bytes at once?
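To see the effect of buffer size for yourself, you can count how many read() calls it takes to reach end of file. This is a small sketch; count_reads is a hypothetical helper and the 64 KiB cap on the buffer is an arbitrary choice.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads fd to end of file using requests of bufsize bytes (capped at
 * the size of the internal buffer) and returns the number of read()
 * calls made, including the final one that reports end of file.
 * Returns -1 on a read error. */
long count_reads(int fd, size_t bufsize)
{
    char buf[65536];
    long calls = 0;
    ssize_t n;

    if (bufsize == 0 || bufsize > sizeof buf)
        bufsize = sizeof buf;
    do {
        n = read(fd, buf, bufsize);
        calls++;
    } while (n > 0);
    return n < 0 ? -1 : calls;
}
```

Calling it on the same file with bufsize 1 and then, after seeking back to the start, with bufsize 4096 shows how many system calls the larger buffer saves.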
Another thing worth mentioning is the fact that read may read fewer than the number of bytes you requested, even if there is more to read. The answer written by @dreamlax illustrates this fact. If you want, you can use a loop to read as many bytes as possible, filling the buffer. Note that I used a function, but you can do the same thing in your main code:
#include <sys/types.h>
#include <unistd.h>
/* Read from a file descriptor, filling the buffer with the requested
* number of bytes. If the end-of-file is encountered, the number of
* bytes returned may be less than the requested number of bytes.
* On error, -1 is returned. See read(2) or read(3) for possible
* values of errno.
* Otherwise, the number of bytes read is returned.
*/
ssize_t
read_fill (int fd, char *readbuf, ssize_t nrequested)
{
  ssize_t nread = 0, nsum = 0;
  while (nrequested > 0
         && (nread = read (fd, readbuf, nrequested)) > 0)
    {
      nsum += nread;
      nrequested -= nread;
      readbuf += nread;
    }
  /* Report a read error as -1, as documented above. */
  return nread < 0 ? -1 : nsum;
}
Note that the buffer is not null-terminated as not all data is necessarily text. You can pass buffer_size - 1 as the requested number of bytes and use the return value to add a null terminator where necessary. This is useful primarily when interacting with functions that will expect a null-terminated string:
char readbuf[4096];
ssize_t n;
int fd;
fd = open ("WWW", O_RDONLY);
if (fd == -1)
{
perror ("unable to open WWW");
exit (1);
}
n = lseek (fd, 900, SEEK_SET);
if (n == -1)
{
fprintf (stderr,
"warning: seek operation failed: %s\n"
" reading 900 bytes instead\n",
strerror (errno));
n = read_fill (fd, readbuf, 900);
if (n < 900)
{
fprintf (stderr, "error: fewer than 900 bytes in file\n");
close (fd);
exit (1);
}
}
/* Read a file, printing its contents to the screen.
*
* Caveat:
* Not safe for UTF-8 or other variable-width/multibyte
* encodings since required bytes may get cut off.
*/
while ((n = read_fill (fd, readbuf, (ssize_t) sizeof readbuf - 1)) > 0)
{
readbuf[n] = 0;
printf ("Read\n****\n%s\n****\n", readbuf);
}
if (n == -1)
{
close (fd);
perror ("error reading from WWW");
exit (1);
}
close (fd);
I could also have avoided the null termination operation and filled all 4096 bytes of the buffer, electing to use the precision part of the format specifiers of printf in this case, changing the format specification from %s to %.4096s. However, this may not be feasible with unusually large buffers (perhaps allocated by malloc to avoid stack overflow) because the buffer size may not be representable with the int type.
Also, you can use a regular char just fine:
char c;
nread = read (fd, &c, 1);
Apparently you didn't know that the unary & operator gets the address of whatever variable is its operand, creating a value of type pointer-to-{typeof var}? Either way, it takes up the same amount of memory, but reading 1 byte at a time is something that normally isn't done as I've explained.
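Mixing declarations and code is a no-no in C89 (C99 and later allow it). Also, that array declaration is not valid C89: the size is not a constant, so an older compiler should complain about the array being variably sized. C99 accepts it as a variable-length array.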
What you want is dynamically allocating the memory for your char buffer[]. You'll have to use pointers.
http://www.ontko.com/pub/rayo/cs35/pointers.html
Then read this one.
http://www.cprogramming.com/tutorial/c/lesson6.html
Then research a function called memcpy().
Enjoy.
Read through that guide, then you should be able to solve your problem in an entirely different way.
Pseudocode:
declare a buffer of char(pointer related)
allocate memory for said buffer(dynamic memory related)
Find location of where you want to start at
point to it(pointer related)
Figure out how much you want to store(technically a part of allocating memory^^^)
Use memcpy() to store what you want in the buffer
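In C, those steps could look something like the sketch below. It assumes a POSIX system and a regular file, and it fills the buffer with read() rather than memcpy(), because the data starts out in a file rather than in memory. The name read_past_offset is hypothetical.

```c
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Returns a malloc'd buffer holding everything in fd from offset to
 * the end of the file, and stores the number of bytes in *len.
 * Returns NULL on error or if offset is at or past the end of file.
 * The caller frees the buffer. */
char *read_past_offset(int fd, off_t offset, size_t *len)
{
    struct stat sb;

    /* Figure out how much to store. */
    if (fstat(fd, &sb) == -1 || offset < 0 || offset >= sb.st_size)
        return NULL;
    size_t want = (size_t)(sb.st_size - offset);

    /* Allocate the buffer dynamically. */
    char *buf = malloc(want);
    if (buf == NULL)
        return NULL;

    /* Find the starting location. */
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1) {
        free(buf);
        return NULL;
    }

    /* Fill the buffer; read() may return less than requested. */
    size_t got = 0;
    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n < 0) {
            free(buf);
            return NULL;
        }
        if (n == 0)
            break;              /* file shrank while reading */
        got += (size_t)n;
    }
    *len = got;
    return buf;
}
```

For the question above, the call would be something like read_past_offset(WWW, 900, &len), with WWW being the open descriptor.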
OK I know questions like this have been asked in various forms before and I have read them all and tried everything that has been suggested but I still cannot create a file that is more than 2GB on a 64bit system using malloc, open, lseek, blah blah every trick under the sun.
Clearly I'm writing C here. I'm running Fedora 20. I'm actually trying to mmap the file, but that is not where it fails. My original method was to use open(), then lseek to the position where the file should end, which in this case is at 3GB (edit: and then write a byte at the file end position to actually create the file of that size), and then mmap the file. I cannot lseek past 2GB. I cannot malloc more than 2GB either. ulimit -a etc. all show unlimited, /etc/security/limits.conf shows nothing, ....
When I try to lseek past 2GB I get EINVAL for errno and the return value of lseek is -1. edit: The offset parameter to lseek is of type off_t, which is defined as a long int (64-bit signed), not size_t as I said previously.
edit:
I've already tried defining _LARGEFILE64_SOURCE & _FILE_OFFSET_BITS 64 and it made no difference.
I'm also compiling specifically for 64bit i.e. -m64
I'm lost. I have no idea why I cant do this.
Any help would be greatly appreciated.
Thanks.
edit: I've removed a lot of completely incorrect babbling on my part and some other unimportant ramblings that have been dealt with later on.
My 2GB problem was in the horribly sloppy interchanging of multiple different types, with the mixing of signed and unsigned being the problem. Essentially the 3GB position I was passing to lseek was being interpreted/turned into a position of -1GB, and clearly lseek didn't like that. So my bad. Totally stupid.
I am going to change to using posix_fallocate() as p_l suggested. While it does remove one function call (only posix_fallocate is needed instead of an lseek and then a write), that isn't significant for me; what matters is that posix_fallocate does exactly what I want directly, which the lseek method doesn't. So thanks in particular to p_l for suggesting that, and a special thanks to NominalAnimal, whose persistence that he knew better indirectly led me to the realisation that I can't count, which in turn led me to accept that posix_fallocate would work and so change to using it.
Regardless of the end method I used, the problem of 2GB was entirely my own crap coding. Thanks again to EOF, chux, p_l and Jonathan Leffler, who all contributed information and suggestions that led me to the problem I had created for myself.
I've included a shorter version of this in an answer.
My 2GB problem was in the horribly sloppy interchanging of multiple different types, with the mixing of signed and unsigned being the problem. Essentially the 3GB position I was passing to lseek was being interpreted/turned into a position of -1GB, and clearly lseek didn't like that. So my bad. Totally stupid crap coding.
Thanks again to EOF, chux, p_l and Jonathan Leffler, who all contributed information and suggestions that led me to the problem I'd created and its solution.
Thanks again to p_l for suggesting posix_fallocate(), and a special thanks to NominalAnimal, whose persistence that he knew better indirectly led me to the realisation that I can't count, which in turn led me to accept that posix_fallocate would work and so change to using it.
@p_l although the solution to my actual problem wasn't in your answer, I'd still upvote your answer that suggested using posix_fallocate, but I don't have enough points to do that.
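The answer does not show the offending code, so the following is only a guess at one way such a mix-up can happen: a 3GB position that passes through a 32-bit signed int on its way to a 64-bit offset. It assumes int is 32 bits wide, and both function names are hypothetical.

```c
#include <stdint.h>

/* Buggy path: the position is squeezed through a signed int. */
int64_t offset_via_int(void)
{
    /* 3221225472 fits in a 32-bit unsigned int... */
    unsigned int pos = 3U * 1024 * 1024 * 1024;
    /* ...but not in a signed one. The conversion result is
     * implementation-defined; on common compilers it is -1073741824. */
    int narrowed = (int)pos;
    /* Widening preserves the (now negative) value. */
    return narrowed;
}

/* Fixed path: use a 64-bit signed type for the whole calculation. */
int64_t offset_direct(void)
{
    return INT64_C(3) * 1024 * 1024 * 1024;
}
```

Passing the first value to lseek() with SEEK_SET asks for a negative file position, which is consistent with the EINVAL reported in the question.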
First of all, try:
//Before any includes:
#define _LARGEFILE64_SOURCE
#define _FILE_OFFSET_BITS 64
If that doesn't work, change lseek to lseek64 like this
lseek64(fd, 3221225472, SEEK_SET);
A better option than lseek might be posix_fallocate():
posix_fallocate(fd, 0, 3221225472);
before the call to mmap();
I recommend keeping the defines, though :)
This is a test program I created (a2b.c):
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <inttypes.h>
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>
static void err_exit(const char *fmt, ...);
int main(void)
{
char const filename[] = "big.file";
int fd = open(filename, O_RDONLY);
if (fd < 0)
err_exit("Failed to open file %s for reading", filename);
struct stat sb;
fstat(fd, &sb);
uint64_t size = sb.st_size;
printf("File: %s; size %" PRIu64 "\n", filename, size);
assert(size > UINT64_C(3) * 1024 * 1024 * 1024);
off_t offset = UINT64_C(3) * 1024 * 1024 * 1024;
if (lseek(fd, offset, SEEK_SET) < 0)
err_exit("lseek failed");
close(fd);
_Static_assert(sizeof(size_t) > 4, "sizeof(size_t) is too small");
size = UINT64_C(3) * 1024 * 1024 * 1024;
void *space = malloc(size);
if (space == 0)
err_exit("failed to malloc %" PRIu64 " bytes", size);
*((char *)space + size - 1) = '\xFF';
printf("All OK\n");
return 0;
}
static void err_exit(const char *fmt, ...)
{
int errnum = errno;
va_list args;
va_start(args, fmt);
vfprintf(stderr, fmt, args);
va_end(args);
if (errnum != 0)
fprintf(stderr, ": (%d) %s", errnum, strerror(errnum));
putc('\n', stderr);
exit(1);
}
When compiled and run on a Mac (Mac OS X 10.9.2 Mavericks, GCC 4.8.2, 16 GiB physical RAM), with command line:
gcc -O3 -g -std=c11 -Wall -Wextra -Wmissing-prototypes -Wstrict-prototypes \
-Wold-style-definition -Werror a2b.c -o a2b
and having created big.file with:
dd if=/dev/zero of=big.file bs=1048576 count=5000
I got the reassuring output:
File: big.file; size 5242880000
All OK
I had to use _Static_assert rather than static_assert because the Mac <assert.h> header doesn't define static_assert. When I compiled with -m32, the static assert triggered.
When I ran it on an Ubuntu 13.10 64-bit VM with 1 GiB virtual physical memory (or is that tautological?), I not very surprisingly got the output:
File: big.file; size 5242880000
failed to malloc 3221225472 bytes: (12) Cannot allocate memory
I used exactly the same command line to compile the code; it compiled OK on Linux with static_assert in place of _Static_assert. The output of ulimit -a indicated that the maximum memory size was unlimited, but that means 'no limit smaller than that imposed by the amount of virtual memory on the machine' rather than anything bigger.
Note that my compilations did not explicitly include -m64 but they were automatically 64-bit compilations.
What do you get? Can dd create the big file? Does the code compile? (If you don't have C11 support in your compiler, then you'll need to replace the static assert with a normal 'dynamic' assert, removing the error message.) Does the code run? What result do you get?
Here is an example program, example.c:
/* Not required on 64-bit architectures; recommended anyway. */
#define _FILE_OFFSET_BITS 64
/* Tell the compiler we do need POSIX.1-2001 features. */
#define _POSIX_C_SOURCE 200112L
/* Needed to get MAP_NORESERVE. */
#define _GNU_SOURCE
#include <unistd.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#ifndef FILE_NAME
#define FILE_NAME "data.map"
#endif
#ifndef FILE_SIZE
#define FILE_SIZE 3221225472UL
#endif
int main(void)
{
const size_t size = FILE_SIZE;
const char *const file = FILE_NAME;
size_t page;
unsigned char *data;
int descriptor;
int result;
/* First, obtain the normal page size. */
page = (size_t)sysconf(_SC_PAGESIZE);
if (page < 1) {
fprintf(stderr, "BUG: sysconf(_SC_PAGESIZE) returned an invalid value!\n");
return EXIT_FAILURE;
}
/* Verify the map size is a multiple of page size. */
if (size % page) {
fprintf(stderr, "Map size (%lu) is not a multiple of page size (%lu)!\n",
(unsigned long)size, (unsigned long)page);
return EXIT_FAILURE;
}
/* Create backing file. */
do {
descriptor = open(file, O_RDWR | O_CREAT | O_EXCL, 0600);
} while (descriptor == -1 && errno == EINTR);
if (descriptor == -1) {
fprintf(stderr, "Cannot create backing file '%s': %s.\n", file, strerror(errno));
return EXIT_FAILURE;
}
#ifdef FILE_ALLOCATE
/* Allocate disk space for backing file. */
/* posix_fallocate() returns the error number itself; it does not set errno. */
do {
result = posix_fallocate(descriptor, (off_t)0, (off_t)size);
} while (result == EINTR);
if (result != 0) {
fprintf(stderr, "Cannot resize and allocate %lu bytes for backing file '%s': %s.\n",
(unsigned long)size, file, strerror(result));
unlink(file);
return EXIT_FAILURE;
}
#else
/* Backing file is sparse; disk space is not allocated. */
do {
result = ftruncate(descriptor, (off_t)size);
} while (result == -1 && errno == EINTR);
if (result == -1) {
fprintf(stderr, "Cannot resize backing file '%s' to %lu bytes: %s.\n",
file, (unsigned long)size, strerror(errno));
unlink(file);
return EXIT_FAILURE;
}
#endif
/* Map the file.
* If MAP_NORESERVE is not used, then the mapping size is limited
* to the amount of available RAM and swap combined in Linux.
* MAP_NORESERVE means that no swap is allocated for the mapping;
* the file itself acts as the backing store. That's why MAP_SHARED
* is also used. */
do {
data = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED | MAP_NORESERVE,
descriptor, (off_t)0);
} while ((void *)data == MAP_FAILED && errno == EINTR);
if ((void *)data == MAP_FAILED) {
fprintf(stderr, "Cannot map file '%s': %s.\n", file, strerror(errno));
unlink(file);
return EXIT_FAILURE;
}
/* Notify of success. */
fprintf(stdout, "Mapped %lu bytes of file '%s'.\n", (unsigned long)size, file);
fflush(stdout);
#if defined(FILE_FILL)
memset(data, ~0UL, size);
#elif defined(FILE_ZERO)
memset(data, 0, size);
#elif defined(FILE_MIDDLE)
data[size/2] = 1; /* One byte in the middle set to one. */
#else
/*
* Do something with the mapping, data[0] .. data[size-1]
*/
#endif
/* Unmap. */
do {
result = munmap(data, size);
} while (result == -1 && errno == EINTR);
if (result == -1)
fprintf(stderr, "munmap(): %s.\n", strerror(errno));
/* Close the backing file. */
result = close(descriptor);
if (result)
fprintf(stderr, "close(): %s.\n", strerror(errno));
#ifndef FILE_KEEP
/* Remove the backing file. */
result = unlink(file);
if (result)
fprintf(stderr, "unlink(): %s.\n", strerror(errno));
#endif
/* We keep the file. */
fprintf(stdout, "Done.\n");
fflush(stdout);
return EXIT_SUCCESS;
}
To compile and run, use e.g.
gcc -W -Wall -O3 -DFILE_KEEP -DFILE_MIDDLE example.c -o example
./example
The above will create a three-gigabyte (3 × 1024^3 bytes) sparse file data.map, and set the middle byte in it to 1 (\x01). All other bytes in the file remain zeroes. You can then run
du -h data.map
to see how much such a sparse file actually takes on-disk, and
hexdump -C data.map
if you wish to verify the file contents are what I claim they are.
There are a few compile-time flags (macros) you can use to change how the example program behaves:
'-DFILE_NAME="filename"'
Use file name filename instead of data.map. Note that the entire value is defined inside single quotes, so that the shell does not parse the double quotes. (The double quotes are part of the macro value.)
'-DFILE_SIZE=(1024*1024*1024)'
Use a 1024^3 = 1073741824 byte mapping instead of the default 3221225472. If the expression contains special characters the shell would try to evaluate, it is best to enclose it all in single or double quotes.
-DFILE_ALLOCATE
Allocate actual disk space for the entire mapping. By default, a sparse file is used instead.
-DFILE_FILL
Fill the entire mapping with (unsigned char)(~0UL), typically 255.
-DFILE_ZERO
Clear the entire mapping to zero.
-DFILE_MIDDLE
Set the middle byte in the mapping to 1. All other bytes are unchanged.
-DFILE_KEEP
Do not delete the data file. This is useful to explore how much data the mapping actually requires on disk; use e.g. du -h data.map.
There are three key limitations to consider when using memory-mapped files in Linux:
File size limits
Older file systems like FAT (MS-DOS) do not support large files, or sparse files. Sparse files are useful if the dataset is sparse (contains large holes); in that case the unset parts are not stored on disk, and simply read as zeroes.
Because many filesystems have problems with reads and writes larger than 2^31-1 bytes (2147483647 bytes), current Linux kernels internally limit each single operation to 2^31-1 bytes. The read or write call does not fail, it just returns a short count. I am not aware of any filesystem similarly limiting the llseek() syscall, but since the C library is responsible for mapping the lseek()/lseek64() functions to the proper syscalls, it is quite possible the C library (and not the kernel) limits the functionality. (In the case of the GNU C library and Embedded GNU C library, such syscall mapping is dependent on the compile-time flags. For example, see man 7 feature_test_macros, man 2 lseek and man 3 lseek64.)
Finally, file position handling is not atomic in most Linux kernels. (Patches are upstream, but I'm not sure which releases contain them.) This means that if more than one thread uses the same descriptor in a way that modifies the file position, it is possible the file position gets completely garbled.
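One way to sidestep the shared file position altogether is pread(), which takes the offset as an argument and leaves the descriptor's position alone. A minimal sketch follows, assuming a POSIX system; read_at is a hypothetical wrapper name.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Reads up to n bytes from fd starting at absolute offset off,
 * looping over short reads. Does not use or change the file position.
 * Returns the number of bytes read (less than n only at end of file),
 * or -1 on error. */
ssize_t read_at(int fd, void *buf, size_t n, off_t off)
{
    char *p = buf;
    size_t got = 0;

    while (got < n) {
        ssize_t r = pread(fd, p + got, n - got, off + (off_t)got);
        if (r < 0)
            return -1;
        if (r == 0)
            break;              /* end of file */
        got += (size_t)r;
    }
    return (ssize_t)got;
}
```

Because each call carries its own offset, several threads can share one descriptor this way without coordinating seeks. pwrite() is the matching call for writes.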
Memory limits
By default, file-backed memory maps are still subject to available memory and swap limits. That is, default mmap() behaviour is to assume that at memory pressure, dirty pages are swapped, not flushed to disk. You'll need to use the Linux-specific MAP_NORESERVE flag to avoid those limits.
Address space limits
On 32-bit Linux systems, the address space available to a userspace process is typically less than 4 GiB; it is a kernel compile-time option.
On 64-bit Linux systems, large mappings consume significant amounts of RAM, even if the mapping contents themselves are not yet faulted in. Typically, each single page requires 8 bytes of metadata ("page table entry") in memory, or more, depending on architecture. Using 4096-byte pages, this means a minimum overhead of 0.1953125%, and setting up e.g. a terabyte map requires two gigabytes of RAM just in page table structures!
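The arithmetic above is easy to reproduce. The sketch below counts only the lowest-level page table entries, using the 8-byte entry and 4096-byte page figures from the text; real page tables also have upper levels, so treat the result as a lower bound. The function name is hypothetical.

```c
#include <stdint.h>

/* Minimum bytes of page table entries needed to map map_bytes,
 * given the page size and the size of one entry. */
uint64_t page_table_bytes(uint64_t map_bytes, uint64_t page_size,
                          uint64_t pte_size)
{
    uint64_t pages = (map_bytes + page_size - 1) / page_size; /* round up */
    return pages * pte_size;
}
```

For a one-terabyte map, page_table_bytes(UINT64_C(1) << 40, 4096, 8) gives 2147483648 bytes, the two gigabytes mentioned above; 8/4096 is the 0.1953125% overhead.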
Many 64-bit systems in Linux support huge pages to avoid that overhead. In most cases, huge pages are of limited use due to the configuration and tweaking and limitations. Kernels also may have limitations on what a process can do with a huge page mapping; a robust application would need thorough fallbacks to normal page mappings.
The kernel may impose stricter limits than resource availability to user-space processes. Run bash -c 'ulimit -a' to see the currently-imposed limits. (Details are available in the ulimit section in man bash-builtins.)
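The same limits can be queried from inside a program with getrlimit(). This is a small sketch that prints only the address-space limit (RLIMIT_AS), which is the one most relevant to large mappings; the function name is hypothetical.

```c
#include <stdio.h>
#include <sys/resource.h>

/* Prints one limit value, or "unlimited" for RLIM_INFINITY. */
static void print_limit(const char *label, rlim_t value)
{
    if (value == RLIM_INFINITY)
        printf("%s: unlimited\n", label);
    else
        printf("%s: %llu bytes\n", label, (unsigned long long)value);
}

/* Prints the soft and hard address-space limits of this process.
 * Returns 0 on success, -1 if getrlimit() fails. */
int print_address_space_limit(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_AS, &rl) == -1)
        return -1;
    print_limit("soft limit", rl.rlim_cur);
    print_limit("hard limit", rl.rlim_max);
    return 0;
}
```

Other resources, such as RLIMIT_FSIZE for the maximum file size, can be queried the same way.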
I'm trying to make a program that would copy 512 bytes from 1 file to another using said system calls (I could make a couple buffers, memcpy() and then fwrite() but I want to practice with Unix specific low level I/O). Here is the beginning of the code:
#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <fcntl.h>
int main(int argc, char **argv)
{
int src, dest, bytes_read;
char tmp_buf[512];
if (argc < 3)
printf("Needs 2 arguments.");
printf("And this message I for some reason don't see.... o_O");
if ((src = open(argv[1], O_RDWR, 0)) == -1 || (dest = open(argv[2], O_CREAT, 0)) == -1)
perror("Error");
while ((bytes_read = read(src, tmp_buf, 512)) != -1)
write(dest, tmp_buf, 512);
return 0;
}
I know I didn't deal with the fact that the file read from isn't going to be a multiple of 512 in size. But first I really need to figure out 2 things:
Why isn't my message showing up? No segmentation fault either, so I end up having to just C-c out of the program
How exactly do those low level functions work? Is there a pointer which shifts with each system call, like say if we were using FILE *file with fwrite, where our *file would automatically increment, or do we have to increment the file pointer by hand? If so, how would we access it assuming that open() and etc. never specify a file pointer, rather just the file ID?
Any help would be great. Please. Thank you!
The reason you don't see the printed message is that you don't flush the output buffer. The text would show up once the program finishes, but it never does; why is explained in a comment by trojanfoe and in an answer by paxdiablo. Simply add a newline at the end of the strings to see them.
And you have a serious error in the read/write loop. If you read less than the requested 512 bytes, you will still write 512 bytes.
Also, while you do check for errors when opening, you don't know which of the open calls failed. And you still continue the program even if you get an error.
And finally, the functions are very simple: They call a function in the kernel which handles everything for you. If you read X bytes the file pointer is moved forward X bytes after the call is done.
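To see the offset move for yourself, you can ask the kernel for it with lseek(fd, 0, SEEK_CUR), which moves nothing and just returns the current position. A small sketch follows; `offset_after_read` and `demo_offsets` are names invented for this example, and the scratch file comes from tmpfile().

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: read up to n bytes from fd, then ask the kernel
   where the file offset now stands. Returns -1 on error. */
off_t offset_after_read(int fd, void *buf, size_t n)
{
    if (read(fd, buf, n) == -1)
        return -1;
    return lseek(fd, 0, SEEK_CUR);   /* moves by 0 bytes: a pure query */
}

/* Demo on a scratch file holding 10 bytes. Returns 0 if three 4-byte
   reads left the offset at 4, 8 and 10 in turn, -1 otherwise. */
int demo_offsets(void)
{
    char buf[4];
    FILE *scratch = tmpfile();
    if (scratch == NULL)
        return -1;

    int ok = fwrite("0123456789", 1, 10, scratch) == 10
          && fflush(scratch) == 0;
    int fd = fileno(scratch);        /* descriptor underneath the FILE */

    ok = ok && lseek(fd, 0, SEEK_SET) == 0
            && offset_after_read(fd, buf, sizeof buf) == 4
            && offset_after_read(fd, buf, sizeof buf) == 8
            && offset_after_read(fd, buf, sizeof buf) == 10;

    fclose(scratch);
    return ok ? 0 : -1;
}
```

The third read only gets the 2 bytes that are left, so the offset stops at 10. The offset belongs to the open file description, which is why you never pass a pointer around: the descriptor is all you need.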
The reason you don't see the message is that standard output is line-buffered. It will only be flushed when it sees a newline character.
As to why it's waiting forever, you'll only get -1 on an error.
Successfully reading to end of file will give you a 0 return value.
A better loop would be along the lines of:
int bytes_left = 512;
while (bytes_left > 0) {
bytes_read = read(src, tmp_buf, bytes_left);
if (bytes_read < 1) break; // 0 is end of file, -1 is an error
write(dest, tmp_buf, bytes_read);
bytes_left -= bytes_read;
}
if (bytes_read < 0)
; // error of some sort
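For what it's worth, write() can also transfer fewer bytes than requested, so a more defensive copy loop retries until everything has gone out. The following is a sketch, not a drop-in: `write_all` and `copy_bytes` are hypothetical names, and retrying on EINTR is my own choice of policy.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Write all n bytes of buf, retrying after short writes and EINTR.
   Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t n)
{
    while (n > 0) {
        ssize_t w = write(fd, buf, n);
        if (w == -1) {
            if (errno == EINTR)
                continue;            /* interrupted: just try again */
            return -1;
        }
        buf += w;
        n -= (size_t)w;
    }
    return 0;
}

/* Copy at most limit bytes from src to dest.
   Returns the number of bytes copied, or -1 on error. */
ssize_t copy_bytes(int src, int dest, size_t limit)
{
    char buf[512];
    size_t copied = 0;

    while (copied < limit) {
        size_t want = limit - copied;
        if (want > sizeof buf)
            want = sizeof buf;

        ssize_t r = read(src, buf, want);
        if (r == -1) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (r == 0)
            break;                   /* end of file */
        if (write_all(dest, buf, (size_t)r) == -1)
            return -1;
        copied += (size_t)r;
    }
    return (ssize_t)copied;
}
```

Calling copy_bytes(src, dest, 512) does what the question asks for, and it only ever writes as many bytes as were actually read.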
I have to write C code for reading large files. The code is below:
int read_from_file_open(char *filename,long size)
{
long read1=0;
int result=1;
int fd;
int check=0;
long *buffer=(long*) malloc(size * sizeof(int));
fd = open(filename, O_RDONLY|O_LARGEFILE);
if (fd == -1)
{
printf("\nFile Open Unsuccessful\n");
exit (0);;
}
long chunk=0;
lseek(fd,0,SEEK_SET);
printf("\nCurrent Position%d\n",lseek(fd,size,SEEK_SET));
while ( chunk < size )
{
printf ("the size of chunk read is %d\n",chunk);
if ( read(fd,buffer,1048576) == -1 )
{
result=0;
}
if (result == 0)
{
printf("\nRead Unsuccessful\n");
close(fd);
return(result);
}
chunk=chunk+1048576;
lseek(fd,chunk,SEEK_SET);
free(buffer);
}
printf("\nRead Successful\n");
close(fd);
return(result);
}
The issue I am facing here is that as long as the argument passed (size parameter) is less than 264000000 bytes, it seems to be able to read. I am getting the increasing sizes of the chunk variable with each cycle.
When I pass 264000000 bytes or more, the read fails, i.e.: according to the check used read returns -1.
Can anyone point me to why this is happening? I am compiling using cc in normal mode, not using DD64.
In the first place, why do you need lseek() in your loop? read() will advance the cursor in the file by the number of bytes read.
And, to the topic: on a platform where long is 32 bits, long (and therefore chunk) has a maximum value of 2147483647; any number greater than that will wrap around and become negative.
You want to use off_t to declare chunk: off_t chunk, and size as size_t.
That's the main reason why lseek() fails.
And, then again, as other people have noticed, you do not want to free() your buffer inside the loop.
Note also that you will overwrite the data you have already read.
Additionally, read() will not necessarily read as much as you have asked it to, so it is better to advance chunk by the number of bytes actually read, rather than by the number of bytes you asked for.
Taking all of this into account, the correct code should probably look something like this:
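// Edited: note comments after the code
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#ifndef O_LARGEFILE
#define O_LARGEFILE 0
#endif
int read_from_file_open(char *filename,size_t size)
{
int fd;
char *buffer=malloc(size); /* room for exactly size bytes */
if (buffer == NULL)
{
printf("\nAllocation Unsuccessful\n");
return 0;
}
fd = open(filename, O_RDONLY|O_LARGEFILE);
if (fd == -1)
{
printf("\nFile Open Unsuccessful\n");
free (buffer);
exit (EXIT_FAILURE);
}
off_t chunk=0; /* bytes read so far; open() already starts at offset 0 */
while ( (size_t)chunk < size )
{
printf ("the size of chunk read is %lld\n",(long long)chunk);
size_t want=size-(size_t)chunk; /* never read past the end of buffer */
if (want > 1048576)
want=1048576;
ssize_t readnow; /* signed, so that -1 can be detected */
readnow=read(fd,buffer+chunk,want);
if (readnow < 0 )
{
printf("\nRead Unsuccessful\n");
free (buffer);
close (fd);
return 0;
}
if (readnow == 0)
break; /* end of file reached before size bytes */
chunk=chunk+readnow;
}
printf("\nRead Successful\n");
free(buffer);
close(fd);
return 1;
}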
I also took the liberty of removing the result variable and all related logic since, I believe, it can be simplified.
Edit: I have noted that some systems (most notably, BSD) do not have O_LARGEFILE, since it is not needed there. So, I have added an #ifndef at the beginning, which makes the code more portable.
The lseek function may have difficulty supporting big file sizes. Try using lseek64.
Please check the link to see the associated macros which need to be defined when you use the lseek64 function.
If it's a 32-bit machine, reading a file larger than 2 GB (the limit of a signed 32-bit off_t) will cause problems. So if you are using the gcc compiler, try passing the options -D_LARGEFILE_SOURCE=1 and -D_FILE_OFFSET_BITS=64.
Please check this link also.
If you are using any other compiler, check for similar compiler options.
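The macro can also be set in the source file itself, as long as it comes before the first #include. Here is a small sketch assuming a glibc-style system where _FILE_OFFSET_BITS is honoured; `file_size` is a made-up helper name.

```c
/* Must come before any #include so that the headers see it. */
#define _FILE_OFFSET_BITS 64

#include <sys/types.h>
#include <unistd.h>

/* Compilation fails here if off_t is still narrower than 64 bits. */
_Static_assert(sizeof(off_t) >= 8, "off_t is not 64 bits wide");

/* Return the size of the file behind fd and leave the offset at the
   start of the file. Returns -1 on error (e.g. fd is a pipe). */
off_t file_size(int fd)
{
    off_t end = lseek(fd, 0, SEEK_END);
    if (end == (off_t)-1)
        return -1;
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return -1;
    return end;
}
```

With a 64-bit off_t, plain lseek() handles large files, so there should be no need to call lseek64 directly.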
I could use a set of eyes (or more) on this code. I'm trying to read in a set amount of bytes from a filestream (f1) to an array/buffer (file is a text file, array is of char type). If I read in size "buffer - 1" I want to "realloc" the array and then continue to read, starting at where I left off. Basically I'm trying to dynamically expand the buffer for a file of unknown size. What I'm wondering:
Am I implementing this wrong?
How would I check failure conditions on something like "realloc" with the code the way it is?
I'm getting a lot of warnings when I compile about "implicit declaration of built-in function realloc..." (I'm seeing that warning for my use of read, malloc, strlen, etc. as well).
When "read()" gets called a second time (and third, fourth, etc.) does it read from the beginning of the stream each time? That could be my issue, as I only seem to return the first "buff_size" chars.
Here's the snippet:
//read_buffer is of size buff_size
n_read = read(f1, read_buffer, buff_size - 1);
read_count = n_read;
int new_size = buff_size;
while (read_count == (buff_size - 1))
{
new_size *= 2;
read_buffer = realloc(read_buffer, new_size);
n_read = read(f1, read_buffer[read_count], buff_size - 1);
read_count += n_read;
}
As I am learning how to do this type of dynamic read, I'm wondering if someone could state a few brief facts about best practices with this sort of thing. I'm assuming this comes up a TON in the professional world (reading files of unknown size)? Thanks for your time. ALSO: As you guys find good ways of doing things (ie a technique for this type of problem), do you find yourselves memorizing how you did it, or maybe saving it to reference in the future (ie is a solution fairly static)?
If you're going to expand the buffer for the entire file anyway, it's probably easiest to seek to the end, get the current offset, then seek back to the beginning and read everything in one swoop:
size = lseek(f1, 0, SEEK_END); // get offset at end of file
lseek(f1, 0, SEEK_SET); // seek back to beginning
buffer = malloc(size+1); // allocate enough memory.
read(f1, buffer, size); // read in the file
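The four lines above skip all error handling, and read() may return fewer bytes than requested. A slightly more careful sketch of the same approach might look like this; `slurp_fd` is a hypothetical name, and NUL-terminating the buffer is my own addition so that text can be used as a C string.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read the whole file behind fd into a freshly allocated,
   NUL-terminated buffer. On success stores the length in *len and
   returns the buffer (the caller frees it); returns NULL on error. */
char *slurp_fd(int fd, size_t *len)
{
    off_t size = lseek(fd, 0, SEEK_END);      /* offset at end of file */
    if (size == (off_t)-1 || lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return NULL;                          /* not seekable: pipe, socket */

    char *buffer = malloc((size_t)size + 1);  /* +1 for the terminator */
    if (buffer == NULL)
        return NULL;

    size_t done = 0;
    while (done < (size_t)size) {
        ssize_t n = read(fd, buffer + done, (size_t)size - done);
        if (n == -1) {
            free(buffer);
            return NULL;
        }
        if (n == 0)
            break;                            /* file shrank while reading */
        done += (size_t)n;
    }
    buffer[done] = '\0';
    *len = done;
    return buffer;
}
```

This only works on seekable descriptors; for pipes and sockets you still need a growing buffer.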
Alternatively, on any reasonably modern POSIX-like system, consider using mmap.
Here's a cool trick: use mmap instead (man mmap).
In a nutshell, say you have your file descriptor f1, on a file of nb bytes. You simply call
char *map = mmap(NULL, nb, PROT_READ, MAP_PRIVATE, f1, 0);
if (map == MAP_FAILED) {
return -1; // handle failure
}
Done.
You can read from the file as if it was already in memory, and the OS will read pages into memory as necessary. When you're done, you can simply call
munmap(map, nb);
and the mapping goes away.
edit: I just re-read your post and saw you don't know the file size. Why?
You can use lseek to seek to the end of the file and learn its current length.
If instead it's because someone else is writing to the file while you're reading, you can read from your current mapping until it runs out, then call lseek again to get the new length, and use mremap to increase the size. Or, you could simply munmap what you have, and mmap with a new "offset" (the number I set to 0, which is how many bytes from the file to skip).
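Putting the pieces of this answer together, a complete read-only use of mmap might look like the sketch below. It counts newline characters just to have something to do with the mapped bytes; `count_newlines` is a name invented for the example.

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Count '\n' bytes in the file at path by mapping it read-only.
   Returns the count, or -1 on error. */
long count_newlines(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd == -1)
        return -1;

    off_t end = lseek(fd, 0, SEEK_END);   /* learn the current length */
    if (end == (off_t)-1) {
        close(fd);
        return -1;
    }
    if (end == 0) {                       /* mmap rejects a zero length */
        close(fd);
        return 0;
    }

    size_t nb = (size_t)end;
    char *map = mmap(NULL, nb, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                            /* the mapping outlives the fd */
    if (map == MAP_FAILED)
        return -1;

    long count = 0;
    for (size_t i = 0; i < nb; i++)       /* pages are faulted in on demand */
        if (map[i] == '\n')
            count++;

    munmap(map, nb);
    return count;
}
```

Note the special case for empty files: mmap fails with a length of zero, so that has to be handled before mapping.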
#include <stdlib.h> /* for realloc() */
#include <string.h> /* for memcpy() */
#include <unistd.h> /* for read() */
char buff[512] ; /* anything goes */
size_t done, size;
char *result = NULL;
int fd; /* assumed to be opened elsewhere before the loop */
done = size = 0;
while (1) {
int n_read;
n_read = read(fd, buff, sizeof buff);
if (n_read <=0) {
/* for network connections, (n_read == -1 && errno == EAGAIN) */
/* should be handled special (by a continue) here. */
break;
}
if (done+n_read > size) {
/* start at sizeof buff, not n_read: after a short first read, */
/* doubling n_read might not make room for the next full read */
size_t new_size = size ? 2*size : sizeof buff;
char *tmp = realloc(result, new_size);
if (tmp == NULL) break; /* out of memory; result still holds done bytes */
result = tmp;
size = new_size;
}
memcpy(result+done, buff, n_read);
done += n_read;
}
/* ... and maybe shave down result a bit here ... */
Note: this is more or less the vanilla way. Another way would be to malloc a real big array first, and realloc to the right size later. That will reduce the number of reallocs, and it might be more gentle for the malloc arena, wrt fragmentation. YMMV.
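A sketch of that other way, for comparison: allocate one generous buffer up front, read straight into it (so no memcpy is needed), grow only if it fills up, and shrink to fit at the end. The name `read_all` and the `initial` capacity parameter are assumptions made for the example, and EINTR/EAGAIN handling is left out.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Read everything from fd into one allocation that starts at `initial`
   bytes, doubles when full, and is trimmed to fit at the end.
   Stores the length in *len; returns NULL on error. Caller frees. */
char *read_all(int fd, size_t initial, size_t *len)
{
    size_t size = initial ? initial : 1;   /* avoid a zero-byte malloc */
    size_t done = 0;
    char *result = malloc(size);
    if (result == NULL)
        return NULL;

    for (;;) {
        if (done == size) {                /* full: double the capacity */
            char *bigger = realloc(result, 2 * size);
            if (bigger == NULL) {
                free(result);
                return NULL;
            }
            result = bigger;
            size *= 2;
        }
        ssize_t n = read(fd, result + done, size - done);
        if (n == -1) {
            free(result);
            return NULL;
        }
        if (n == 0)
            break;                         /* end of input */
        done += (size_t)n;
    }

    /* Shrink to fit; keep one byte so realloc is never asked for 0. */
    char *fit = realloc(result, done ? done : 1);
    if (fit != NULL)
        result = fit;
    *len = done;
    return result;
}
```

With something like read_all(fd, 1 << 20, &len), most files need no realloc at all until the final shrink.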