Reading a file of text into array with malloc in C

Reading a file of text into array with malloc in C - c

I could use a set of eyes (or more) on this code. I'm trying to read in a set amount of bytes from a filestream (f1) to an array/buffer (file is a text file, array is of char type). If I read in size "buffer - 1" I want to "realloc" the array and the continue to read, starting at where I left off. Basically I'm trying to dynamically expand the buffer for the file of unknown size. What I'm wondering:
Am I implementing this wrong?
How would I check failure conditions on something like "realloc"
with the code the way it is?
I'm getting a lot of warnings when I compile about "implicit declaration of built-in function realloc..." (I'm seeing that warning for my use of read, malloc, strlen, etc. as well.
When "read()" get's called a second time (and third, fourth, etc.) does it read from the beginning of the stream each time? That could be my issue is I only seem to return the first "buff_size" char's.
Here's the snippet:
//read_buffer is of size buff_size
n_read = read(f1, read_buffer, buff_size - 1);
read_count = n_read;
int new_size = buff_size;
while (read_count == (buff_size - 1))
{
new_size *= 2;
read_buffer = realloc(read_buffer, new_size);
n_read = read(f1, read_buffer[read_count], buff_size - 1);
read_count += n_read;
}
As I am learning how to do this type of dynamic read, I'm wondering if someone could state a few brief facts about best practices with this sort of thing. I'm assuming this comes up a TON in the professional world (reading files of unknown size)? Thanks for your time. ALSO: As you guys find good ways of doing things (ie a technique for this type of problem), do you find yourselves memorizing how you did it, or maybe saving it to reference in the future (ie is a solution fairly static)?

If you're going to expand the buffer for the entire file anyway, it's probably easiest to seek to the end, get the current offset, then seek back to the beginning and read in swoop:
size = lseek(f1, 0, SEEK_END); // get offset at end of file
lseek(f1, 0, SEEK_SET); // seek back to beginning
buffer = malloc(size+1); // allocate enough memory.
read(f1, buffer, size); // read in the file
Alternatively, on any reasonably modern POSIX-like system, consider using mmap.

Here's a cool trick: use mmap instead (man mmap).
In a nutshell, say you have your file descriptor f1, on a file of nb bytes. You simply call
char *map = mmap(NULL, nb, PROT_READ, MAP_PRIVATE, f1, 0);
if (map == MAP_FAILED) {
return -1; // handle failure
}
Done.
You can read from the file as if it was already in memory, and the OS will read pages into memory as necessary. When you're done, you can simply call
munmap(map, nb);
and the mapping goes away.
edit: I just re-read your post and saw you don't know the file size. Why?
You can use lseek to seek to the end of the file and learn its current length.
If instead it's because someone else is writing to the file while you're reading, you can read from your current mapping until it runs out, then call lseek again to get the new length, and use mremap to increase the size. Or, you could simply munmap what you have, and mmap with a new "offset" (the number I set to 0, which is how many bytes from the file to skip).

#include <stdlib.h> /* for realloc() */
#include <string.h> /* for memcpy() */
#include <unistd.h> /* for read() */
char buff[512] ; /* anything goes */
size_t done, size;
char *result = NULL;
int fd;
done = size = 0;
while (1) {
int n_read;
n_read = read(fd, buff, sizeof buff);
if (n_read <=0) {
... for network connections, (n_read == -1 && errno == EAGAIN)
... should be handled special (by a continue) here.
break;
}
if (done+n_read > size) {
result = realloc(result, size ? 2*size : n_read );
... maybe handle NULL return from realloc here ...
size = size ? 2*size : n_read;
}
memcpy(result+done, buff, n_read);
done += n_read;
}
... and maybe shave down result a bit here ...
Note: this is more or less the vanilla way. Another way would be to malloc a real big array first, and realloc to the right size later. That will reduce the number of reallocs, and it might be more gentle for the malloc arena, wrt fragmentation. YMMV.

Related

Proper way to get file size in C

I am working on an assignment in socket programming in which I have to send a file between sparc and linux machine. Before sending the file in char stream I have to get the file size and tell the client. Here are some of the ways I tried to get the size but I am not sure which one is the proper one.
For testing purpose, I created a file with content " test" (space + (string)test)
Method 1 - Using fseeko() and ftello()
This is a method I found on https://www.securecoding.cert.org/confluence/display/c/FIO19-C.+Do+not+use+fseek()+and+ftell()+to+compute+the+size+of+a+regular+file
While the fssek() has a problem of "Setting the file position indicator to end-of-file, as with fseek(file, 0, SEEK_END), has undefined behavior for a binary stream", fseeko() is said to have tackled this problem but it only works on POSIX system (which is fine because the environment I am using is sparc and linux)
fd = open(file_path, O_RDONLY);
fp = fopen(file_path, "rb");
/* Ensure that the file is a regular file */
if ((fstat(fd, &st) != 0) || (!S_ISREG(st.st_mode))) {
/* Handle error */
}
if (fseeko(fp, 0 , SEEK_END) != 0) {
/* Handle error */
}
file_size = ftello(fp);
fseeko(fp, 0, SEEK_SET);
printf("file size %zu\n", file_size);
This method works fine and get the size correctly. However, it is limited to regular files only. I tried to google the term "regular file" but I still not quite understand it thoroughly. And I do not know if this function is reliable for my project.
Method 2 - Using strlen()
Since the max. size of a file in my project is 4MB, so I can just calloc a 4MB buffer. After that, the file is read into the buffer, and I tried to use the strlen to get the file size (or more correctly the length of content). Since strlen() is portable, can I use this method instead? The code snippet is like this
fp = fopen(file_path, "rb");
fread(file_buffer, 1024*1024*4, 1, fp);
printf("strlen %zu\n", strlen(file_buffer));
This method works too and returns
strlen 8
However, I couldn't see any similar approach on the Internet using this method. So I am thinking maybe I have missed something or there are some limitations of this approach which I haven't realized.

Regular file means that it is nothing special like device, socket, pipe etc. but "normal" file.
It seems that by your task description before sending you must retrieve size of normal file.
So your way is right:
FILE* fp = fopen(...);
if(fp) {
fseek(fp, 0 , SEEK_END);
long fileSize = ftell(fp);
fseek(fp, 0 , SEEK_SET);// needed for next read from beginning of file
...
fclose(fp);
}
but you can do it without opening file:
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
struct stat buffer;
int status;
status = stat("path to file", &buffer);
if(status == 0) {
// size of file is in member buffer.st_size;
}

OP can do it the easy way as "max. size of a file in my project is 4MB".
Rather than using strlen(), use the return value from fread(). stlen() stops on the first null character, so may report too small a value. #Sami Kuhmonen Also we do not know the data read contains any null character, so it may not be a string. Append a null character (and allocate +1) if code needs to use data as a string. But in that case, I'd expect the file needed to be open in text mode.
Note that many OS's do not even use allocated memory until it is written.
Why is malloc not "using up" the memory on my computer?
fp = fopen(file_path, "rb");
if (fp) {
#define MAX_FILE_SIZE 4194304
char *buf = malloc(MAX_FILE_SIZE);
if (buf) {
size_t numread = fread(buf, sizeof *buf, MAX_FILE_SIZE, fp);
// shrink if desired
char *tmp = realloc(buf, numread);
if (tmp) {
buf = tmp;
// Use buf with numread char
}
free(buf);
}
fclose(fp);
}
Note: Reading the entire file into memory may not be the best idea to begin with.

Getting characters past a certain point in a file in C

I want to take all characters past location 900 from a file called WWW, and put all of these in an array:
//Keep track of all characters past position 900 in WWW.
int Seek900InWWW = lseek(WWW, 900, 0); //goes to position 900 in WWW
printf("%d \n", Seek900InWWW);
if(Seek900InWWW < 0)
printf("Error seeking to position 900 in WWW.txt");
char EverythingPast900[appropriatesize];
int NextRead;
char NextChar[1];
int i = 0;
while((NextRead = read(WWW, NextChar, sizeof(NextChar))) > 0) {
EverythingPast900[i] = NextChar[0];
printf("%c \n", NextChar[0]);
i++;
}
I try to create a char array of length 1, since the read system call requires a pointer, I cannot use a regular char. The above code does not work. In fact, it does not print any characters to the terminal as expected by the loop. I think my logic is correct, but perhaps a misunderstanding of whats going on behind the scenes is what is making this hard for me. Or maybe i missed something simple (hope not).

If you already know how many bytes to read (e.g. in appropriatesize) then just read in that many bytes at once, rather than reading in bytes one at a time.
char everythingPast900[appropriatesize];
ssize_t bytesRead = read(WWW, everythingPast900, sizeof everythingPast900);
if (bytesRead > 0 && bytesRead != appropriatesize)
{
// only everythingPast900[0] to everythingPast900[bytesRead - 1] is valid
}

I made a test version of your code and added bits you left out. Why did you leave them out?
I also made a file named www.txt that has a hundred lines of "This is a test line." in it.
And I found a potential problem, depending on how big your appropriatesize value is and how big the file is. If you write past the end of EverythingPast900 it is possible for you to kill your program and crash it before you ever produce any output to display. That might happen on Windows where stdout may not be line buffered depending on which libraries you used.
See the MSDN setvbuf page, in particular "For some systems, this provides line buffering. However, for Win32, the behavior is the same as _IOFBF - Full Buffering."
This seems to work:
#include <fcntl.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <unistd.h>
#include <stdio.h>
int main()
{
int WWW = open("www.txt", O_RDONLY);
if(WWW < 0)
printf("Error opening www.txt\n");
//Keep track of all characters past position 900 in WWW.
int Seek900InWWW = lseek(WWW, 900, 0); //goes to position 900 in WWW
printf("%d \n", Seek900InWWW);
if(Seek900InWWW < 0)
printf("Error seeking to position 900 in WWW.txt");
int appropriatesize = 1000;
char EverythingPast900[appropriatesize];
int NextRead;
char NextChar[1];
int i = 0;
while(i < appropriatesize && (NextRead = read(WWW, NextChar, sizeof(NextChar))) > 0) {
EverythingPast900[i] = NextChar[0];
printf("%c \n", NextChar[0]);
i++;
}
return 0;
}

As stated in another answer, read more than one byte. The theory behind "buffers" is to reduce the amount of read/write operations due to how slow disk I/O (or network I/O) is compared to memory speed and CPU speed. Look at it as if it is code and consider which is faster: adding 1 to the file size N times and writing N bytes individually, or adding N to the file size once and writing N bytes at once?
Another thing worth mentioning is the fact that read may read fewer than the number of bytes you requested, even if there is more to read. The answer written by #dreamlax illustrates this fact. If you want, you can use a loop to read as many bytes as possible, filling the buffer. Note that I used a function, but you can do the same thing in your main code:
#include <sys/types.h>
/* Read from a file descriptor, filling the buffer with the requested
* number of bytes. If the end-of-file is encountered, the number of
* bytes returned may be less than the requested number of bytes.
* On error, -1 is returned. See read(2) or read(3) for possible
* values of errno.
* Otherwise, the number of bytes read is returned.
*/
ssize_t
read_fill (int fd, char *readbuf, ssize_t nrequested)
{
ssize_t nread, nsum = 0;
while (nrequested > 0
&& (nread = read (fd, readbuf, nrequested)) > 0)
{
nsum += nread;
nrequested -= nread;
readbuf += nread;
}
return nsum;
}
Note that the buffer is not null-terminated as not all data is necessarily text. You can pass buffer_size - 1 as the requested number of bytes and use the return value to add a null terminator where necessary. This is useful primarily when interacting with functions that will expect a null-terminated string:
char readbuf[4096];
ssize_t n;
int fd;
fd = open ("WWW", O_RDONLY);
if (fd == -1)
{
perror ("unable to open WWW");
exit (1);
}
n = lseek (fd, 900, SEEK_SET);
if (n == -1)
{
fprintf (stderr,
"warning: seek operation failed: %s\n"
" reading 900 bytes instead\n",
strerror (errno));
n = read_fill (fd, readbuf, 900);
if (n < 900)
{
fprintf (stderr, "error: fewer than 900 bytes in file\n");
close (fd);
exit (1);
}
}
/* Read a file, printing its contents to the screen.
*
* Caveat:
* Not safe for UTF-8 or other variable-width/multibyte
* encodings since required bytes may get cut off.
*/
while ((n = read_fill (fd, readbuf, (ssize_t) sizeof readbuf - 1)) > 0)
{
readbuf[n] = 0;
printf ("Read\n****\n%s\n****\n", readbuf);
}
if (n == -1)
{
close (fd);
perror ("error reading from WWW");
exit (1);
}
close (fd);
I could also have avoided the null termination operation and filled all 4096 bytes of the buffer, electing to use the precision part of the format specifiers of printf in this case, changing the format specification from %s to %.4096s. However, this may not be feasible with unusually large buffers (perhaps allocated by malloc to avoid stack overflow) because the buffer size may not be representable with the int type.
Also, you can use a regular char just fine:
char c;
nread = read (fd, &c, 1);
Apparently you didn't know that the unary & operator gets the address of whatever variable is its operand, creating a value of type pointer-to-{typeof var}? Either way, it takes up the same amount of memory, but reading 1 byte at a time is something that normally isn't done as I've explained.

Mixing declarations and code is a no no. Also, no, that is not a valid declaration. C should complain about it along the lines of it being variably defined.
What you want is dynamically allocating the memory for your char buffer[]. You'll have to use pointers.
http://www.ontko.com/pub/rayo/cs35/pointers.html
Then read this one.
http://www.cprogramming.com/tutorial/c/lesson6.html
Then research a function called memcpy().
Enjoy.
Read through that guide, then you should be able to solve your problem in an entirely different way.
Psuedo code.
declare a buffer of char(pointer related)
allocate memory for said buffer(dynamic memory related)
Find location of where you want to start at
point to it(pointer related)
Figure out how much you want to store(technically a part of allocating memory^^^)
Use memcpy() to store what you want in the buffer

Reading a large file using C (greater than 4GB) using read function, causing problems

I have to write C code for reading large files. The code is below:
int read_from_file_open(char *filename,long size)
{
long read1=0;
int result=1;
int fd;
int check=0;
long *buffer=(long*) malloc(size * sizeof(int));
fd = open(filename, O_RDONLY|O_LARGEFILE);
if (fd == -1)
{
printf("\nFile Open Unsuccessful\n");
exit (0);;
}
long chunk=0;
lseek(fd,0,SEEK_SET);
printf("\nCurrent Position%d\n",lseek(fd,size,SEEK_SET));
while ( chunk < size )
{
printf ("the size of chunk read is %d\n",chunk);
if ( read(fd,buffer,1048576) == -1 )
{
result=0;
}
if (result == 0)
{
printf("\nRead Unsuccessful\n");
close(fd);
return(result);
}
chunk=chunk+1048576;
lseek(fd,chunk,SEEK_SET);
free(buffer);
}
printf("\nRead Successful\n");
close(fd);
return(result);
}
The issue I am facing here is that as long as the argument passed (size parameter) is less than 264000000 bytes, it seems to be able to read. I am getting the increasing sizes of the chunk variable with each cycle.
When I pass 264000000 bytes or more, the read fails, i.e.: according to the check used read returns -1.
Can anyone point me to why this is happening? I am compiling using cc in normal mode, not using DD64.

In the first place, why do you need lseek() in your cycle? read() will advance the cursor in the file by the number of bytes read.
And, to the topic: long, and, respectively, chunk, have a maximum value of 2147483647, any number greater than that will actually become negative.
You want to use off_t to declare chunk: off_t chunk, and size as size_t.
That's the main reason why lseek() fails.
And, then again, as other people have noticed, you do not want to free() your buffer inside the cycle.
Note also that you will overwrite the data you have already read.
Additionally, read() will not necessarily read as much as you have asked it to, so it is better to advance chunk by the amount of the bytes actually read, rather than amount of bytes you want to read.
Taking everything in regards, the correct code should probably look something like this:
// Edited: note comments after the code
#ifndef O_LARGEFILE
#define O_LARGEFILE 0
#endif
int read_from_file_open(char *filename,size_t size)
{
int fd;
long *buffer=(long*) malloc(size * sizeof(long));
fd = open(filename, O_RDONLY|O_LARGEFILE);
if (fd == -1)
{
printf("\nFile Open Unsuccessful\n");
exit (0);;
}
off_t chunk=0;
lseek(fd,0,SEEK_SET);
printf("\nCurrent Position%d\n",lseek(fd,size,SEEK_SET));
while ( chunk < size )
{
printf ("the size of chunk read is %d\n",chunk);
size_t readnow;
readnow=read(fd,((char *)buffer)+chunk,1048576);
if (readnow < 0 )
{
printf("\nRead Unsuccessful\n");
free (buffer);
close (fd);
return 0;
}
chunk=chunk+readnow;
}
printf("\nRead Successful\n");
free(buffer);
close(fd);
return 1;
}
I also took the liberty of removing result variable and all related logic since, I believe, it can be simplified.
Edit: I have noted that some systems (most notably, BSD) do not have O_LARGEFILE, since it is not needed there. So, I have added an #ifdef in the beginning, which would make the code more portable.

The lseek function may have difficulty in supporting big file sizes. Try using lseek64
Please check the link to see the associated macros which needs to be defined when you use lseek64 function.

If its 32 bit machine, it will cause some problem for reading a file of larger than 4gb. So if you are using gcc compiler try to use the macro -D_LARGEFILE_SOURCE=1 and -D_FILE_OFFSET_BITS=64.
Please check this link also
If you are using any other compiler check for similar types of compiler option.

What is the most efficient way to print a file in C to stdout?

I'm stuck on this. Currently I'm using:
FILE *a = fopen("sample.txt", "r");
int n;
while ((n = fgetc(a)) != EOF) {
putchar(n);
}
However this method seems to be a bit inefficient. Is there any better way? I tried using fgets:
char *s;
fgets(s, 600, a);
puts(s);
There's one thing I find wrong about this second method, which is that you would need a really large number for the second argument of fgets.
Thanks for all the suggestions. I found a way (someone on IRC told me this) using open(), read(), and write().
char *filename = "sample.txt";
char buf[8192];
int r = -1;
int in = open(filename, O_RDONLY), out = 0;
if (in == -1)
return -1;
while (1) {
r = read(in, buf, sizeof(buf));
if (r == -1 || r == 0) { break; }
r = write(out, buf, r);
if (r == -1 || r == 0) { break; }
}

The second code is broken. You need to allocate a buffer, e.g.:
char s[4096];
fgets(s, sizeof(s), a);
Of course, this doesn't solve your problem.
Read fix-size chunks from the input and write out whatever gets read in:
int n;
char s[65536];
while ((n = fread(s, 1, sizeof(s), a))) {
fwrite(s, 1, n, stdout);
}
You might also want to check ferror(a) in case it stopped for some other reason than reaching EOF.
Notes
I originally used a 4096 byte buffer because it is a fairly common page size for memory allocation and block size for the file system. However, the sweet-spot on my Linux system seems to be around the 64 kB mark, which surprised me. Perhaps CPU cache is a factor here, but I'm just guessing.
For a cold cache, it makes almost no difference, since I/O paging will dominate; even one byte at a time runs at about the same speed.

The most efficient method will depend greatly on the operating system. For example, in Linux, you can use sendfile:
struct stat buf;
int fd = open(filename, O_RDONLY);
fstat(fd, &buf);
sendfile(0, fd, NULL, buf.st_size);
This does the copy directly in the kernel, minimizing unnecessary memory-to-memory copies. Other platforms may have similar approaches, such as write()ing to stdout from a mmaped buffer.

I believe the FILE returned by fopen is tipically (always?) buffered, so you first example is not so inefficient as you may think.
The second might perform a little better... if you correct the errors: remember to allocate the buffer, and remember that puts add a newline!.
Other option is to use binary reads (fread).

It all depends on what you want to do with the data.
this will crash though:
char *s;
fgets(s, 600, a);
puts(s);
since s is no buffer, just a pointer somewhere.
One way is to read in the whole file into a buffer and work with that
by using fread()
filebuffer = malloc(filelength);
fread( buffer, 1, filelength, fp );

What you're doing is plenty good enough in 99% of applications. Granted, in most C libraries, stdio performs badly, and you'd be better off with Phong Vo's sfio library. If you have measurements showing this is a bottleneck, the natural next step is to allocate a buffer and use fread/fwrite. You don't want fgets because you don't care about newlines.
First make it run, then make it right. You probably don't have to make it fast.

Tried and true simple file copying code in C?

This looks like a simple question, but I didn't find anything similar here.
Since there is no file copy function in C, we have to implement file copying ourselves, but I don't like reinventing the wheel even for trivial stuff like that, so I'd like to ask the cloud:
What code would you recommend for file copying using fopen()/fread()/fwrite()?
What code would you recommend for file copying using open()/read()/write()?
This code should be portable (windows/mac/linux/bsd/qnx/younameit), stable, time tested, fast, memory efficient and etc. Getting into specific system's internals to squeeze some more performance is welcomed (like getting filesystem cluster size).
This seems like a trivial question but, for example, source code for CP command isn't 10 lines of C code.

This is the function I use when I need to copy from one file to another - with test harness:
/*
#(#)File: $RCSfile: fcopy.c,v $
#(#)Version: $Revision: 1.11 $
#(#)Last changed: $Date: 2008/02/11 07:28:06 $
#(#)Purpose: Copy the rest of file1 to file2
#(#)Author: J Leffler
#(#)Modified: 1991,1997,2000,2003,2005,2008
*/
/*TABSTOP=4*/
#include "jlss.h"
#include "stderr.h"
#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_fcopy_c[] = "#(#)$Id: fcopy.c,v 1.11 2008/02/11 07:28:06 jleffler Exp $";
#endif /* lint */
void fcopy(FILE *f1, FILE *f2)
{
char buffer[BUFSIZ];
size_t n;
while ((n = fread(buffer, sizeof(char), sizeof(buffer), f1)) > 0)
{
if (fwrite(buffer, sizeof(char), n, f2) != n)
err_syserr("write failed\n");
}
}
#ifdef TEST
int main(int argc, char **argv)
{
FILE *fp1;
FILE *fp2;
err_setarg0(argv[0]);
if (argc != 3)
err_usage("from to");
if ((fp1 = fopen(argv[1], "rb")) == 0)
err_syserr("cannot open file %s for reading\n", argv[1]);
if ((fp2 = fopen(argv[2], "wb")) == 0)
err_syserr("cannot open file %s for writing\n", argv[2]);
fcopy(fp1, fp2);
return(0);
}
#endif /* TEST */
Clearly, this version uses file pointers from standard I/O and not file descriptors, but it is reasonably efficient and about as portable as it can be.
Well, except the error function - that's peculiar to me. As long as you handle errors cleanly, you should be OK. The "jlss.h" header declares fcopy(); the "stderr.h" header declares err_syserr() amongst many other similar error reporting functions. A simple version of the function follows - the real one adds the program name and does some other stuff.
#include "stderr.h"
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
void err_syserr(const char *fmt, ...)
{
int errnum = errno;
va_list args;
va_start(args, fmt);
vfprintf(stderr, fmt, args);
va_end(args);
if (errnum != 0)
fprintf(stderr, "(%d: %s)\n", errnum, strerror(errnum));
exit(1);
}
The code above may be treated as having a modern BSD license or GPL v3 at your choice.

As far as the actual I/O goes, the code I've written a million times in various guises for copying data from one stream to another goes something like this. It returns 0 on success, or -1 with errno set on error (in which case any number of bytes might have been copied).
Note that for copying regular files, you can skip the EAGAIN stuff, since regular files are always blocking I/O. But inevitably if you write this code, someone will use it on other types of file descriptors, so consider it a freebie.
There's a file-specific optimisation that GNU cp does, which I haven't bothered with here, that for long blocks of 0 bytes instead of writing you just extend the output file by seeking off the end.
void block(int fd, int event) {
pollfd topoll;
topoll.fd = fd;
topoll.events = event;
poll(&topoll, 1, -1);
// no need to check errors - if the stream is bust then the
// next read/write will tell us
}
int copy_data_buffer(int fdin, int fdout, void *buf, size_t bufsize) {
for(;;) {
void *pos;
// read data to buffer
ssize_t bytestowrite = read(fdin, buf, bufsize);
if (bytestowrite == 0) break; // end of input
if (bytestowrite == -1) {
if (errno == EINTR) continue; // signal handled
if (errno == EAGAIN) {
block(fdin, POLLIN);
continue;
}
return -1; // error
}
// write data from buffer
pos = buf;
while (bytestowrite > 0) {
ssize_t bytes_written = write(fdout, pos, bytestowrite);
if (bytes_written == -1) {
if (errno == EINTR) continue; // signal handled
if (errno == EAGAIN) {
block(fdout, POLLOUT);
continue;
}
return -1; // error
}
bytestowrite -= bytes_written;
pos += bytes_written;
}
}
return 0; // success
}
// Default value. I think it will get close to maximum speed on most
// systems, short of using mmap etc. But porters / integrators
// might want to set it smaller, if the system is very memory
// constrained and they don't want this routine to starve
// concurrent ops of memory. And they might want to set it larger
// if I'm completely wrong and larger buffers improve performance.
// It's worth trying several MB at least once, although with huge
// allocations you have to watch for the linux
// "crash on access instead of returning 0" behaviour for failed malloc.
#ifndef FILECOPY_BUFFER_SIZE
#define FILECOPY_BUFFER_SIZE (64*1024)
#endif
int copy_data(int fdin, int fdout) {
// optional exercise for reader: take the file size as a parameter,
// and don't use a buffer any bigger than that. This prevents
// memory-hogging if FILECOPY_BUFFER_SIZE is very large and the file
// is small.
for (size_t bufsize = FILECOPY_BUFFER_SIZE; bufsize >= 256; bufsize /= 2) {
void *buffer = malloc(bufsize);
if (buffer != NULL) {
int result = copy_data_buffer(fdin, fdout, buffer, bufsize);
free(buffer);
return result;
}
}
// could use a stack buffer here instead of failing, if desired.
// 128 bytes ought to fit on any stack worth having, but again
// this could be made configurable.
return -1; // errno is ENOMEM
}
To open the input file:
int fdin = open(infile, O_RDONLY|O_BINARY, 0);
if (fdin == -1) return -1;
Opening the output file is tricksy. As a basis, you want:
int fdout = open(outfile, O_WRONLY|O_BINARY|O_CREAT|O_TRUNC, 0x1ff);
if (fdout == -1) {
close(fdin);
return -1;
}
But there are confounding factors:
you need to special-case when the files are the same, and I can't remember how to do that portably.
if the output filename is a directory, you might want to copy the file into the directory.
if the output file already exists (open with O_EXCL to determine this and check for EEXIST on error), you might want to do something different, as cp -i does.
you might want the permissions of the output file to reflect those of the input file.
you might want other platform-specific meta-data to be copied.
you may or may not wish to unlink the output file on error.
Obviously the answers to all these questions could be "do the same as cp". In which case the answer to the original question is "ignore everything I or anyone else has said, and use the source of cp".
Btw, getting the filesystem's cluster size is next to useless. You'll almost always see speed increasing with buffer size long after you've passed the size of a disk block.

the size of each read need to be a multiple of 512 ( sector size ) 4096 is a good one

Here is a very easy and clear example: Copy a file. Since it is written in ANSI-C without any particular function calls I think this one would be pretty much portable.

Depending on what you mean by copying a file, it is certainly far from trivial. If you mean copying the content only, then there is almost nothing to do. But generally, you need to copy the metadata of the file, and that's surely platform dependent. I don't know of any C library which does what you want in a portable manner. Just handling the filename by itself is no trivial matter if you care about portability.
In C++, there is the file library in boost

One thing I found when implementing my own file copy, and it seems obvious but it's not: I/O's are slow. You can pretty much time your copy's speed by how many of them you do. So clearly you need to do as few of them as possible.
The best results I found were when I got myself a ginourmous buffer, read the entire source file into it in one I/O, then wrote the entire buffer back out of it in one I/O. If I even had to do it in 10 batches, it got way slow. Trying to read and write out each byte, like a naieve coder might try first, was just painful.

The accepted answer written by Steve Jessop does not answer to the first part of the quession, Jonathan Leffler do it, but do it wrong: code should be written as
while ((n = fread(buffer, 1, sizeof(buffer), f1)) > 0)
if (fwrite(buffer, n, 1, f2) != 1)
/* we got write error here */
/* test ferror(f1) for a read errors */
Explanation:
sizeof(char) = 1 by definition, always: it does not matter how many bits in it, 8 (in most cases), 9, 11 or 32 (on some DSP, for example) — size of char is one. Note, it is not an error here, but an extra code.
The fwrite function writes upto nmemb (second argument) elements of specified size (third argument), it does not required to write exactly nmemb elements. To fix this you must write the rest of the data readed or just write one element of size n — let fwrite do all his work. (This item is in question, should fwrite write all data or not, but in my version short writes impossible until error occurs.)
You should test for a read errors too: just test ferror(f1) at the end of loop.
Note, you probably need to disable buffering on both input and output files to prevent triple buffering: first on read to f1 buffer, second in our code, third on write to f2 buffer:
setvbuf(f1, NULL, _IONBF, 0);
setvbuf(f2, NULL, _IONBF, 0);
(Internal buffers should, probably, be of size BUFSIZ.)

Develop Reference

c reactjs sql-server angularjs arrays wpf database batch-file google-app-engine silverlight

Reading a file of text into array with malloc in C - c

Related

Proper way to get file size in C

Getting characters past a certain point in a file in C

Reading a large file using C (greater than 4GB) using read function, causing problems

What is the most efficient way to print a file in C to stdout?

Tried and true simple file copying code in C?

Categories

Resources