The program works correctly on Linux, but I get extra characters after the end of the file when running on Windows or through Wine. Not garbage, but repeated text that was already written. The issue persists whether I write to stdout or to a file, but it doesn't occur with small files; a few hundred KB are needed.
I nailed down the issue to this function:
static unsigned long read_file(const char *filename, const char **output)
{
    struct stat file_stats;
    int fdescriptor;
    unsigned long file_sz;
    static char *file;

    fdescriptor = open(filename, O_RDONLY);
    if (fdescriptor < 0 || (fstat(fdescriptor, &file_stats) < 0))
    {
        printf("Error opening file: %s \n", filename);
        return (0);
    }
    if (file_stats.st_size < 0)
    {
        printf("file %s reports an incorrect size", filename);
        return (0);
    }
    file_sz = (unsigned long)file_stats.st_size;
    file = malloc(file_sz * sizeof(*file));
    if (!file)
    {
        printf("Error allocating memory for file %s of size %lu\n", filename, file_sz);
        return (0);
    }
    read(fdescriptor, file, file_sz);
    *output = file;
    write(STDOUT_FILENO, file, file_sz), exit(1); // this statement added for debugging
    return (file_sz);
}
I can't debug through Wine, much less on Windows, but by using printf statements I can tell the file size is correct. The issue is either in the reading or the writing, and without a debugger I can't inspect the contents of the buffer in memory.
The program was compiled with x86_64-w64-mingw32-gcc, version 8.3, which is the same version of gcc on my system.
At this point I'm just perplexed; I would love to hear any ideas you may have.
Thank you.
Edit: The issue was that fewer bytes were being read than the reported file size, and I was writing more than necessary. Thanks to Mat for telling me where to look.
read can return a size different from that reported by fstat. I was writing the reported file size instead of the actual number of bytes read, which led to the issue. When writing, one should use the byte count returned by read to avoid this.
It is always best both to check the return value of read/write for failure and to make sure all bytes have been read, since read can return fewer bytes than the total when reading from a pipe or when interrupted by a signal, in which case multiple calls are necessary.
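For reference, here is a minimal sketch of such a read loop (the helper name and error convention are mine, not part of the program above):

#include <errno.h>
#include <unistd.h>

/* Minimal sketch: read up to `want` bytes, looping until done.
   `read_fully` is a hypothetical helper, not part of the program above.
   Returns the number of bytes actually read (short only at end of file),
   or (size_t)-1 on a read error. */
static size_t read_fully(int fd, char *buf, size_t want)
{
    size_t got = 0;

    while (got < want) {
        ssize_t n = read(fd, buf + got, want - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;          /* interrupted by a signal: retry */
            return (size_t)-1;     /* real error */
        }
        if (n == 0)
            break;                 /* end of file */
        got += (size_t)n;
    }
    return got;
}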
Thanks to Mat and Felix for the answer.
I have a simple function to print file size and file name:
void *mystat(void *filename) {
    struct stat fileStat;
    if (lstat(filename, &fileStat) < 0) {
        fprintf(stderr, "no such file or directory: %s\n", filename);
        return NULL;
    }
    printf(" %'d", fileStat.st_size);
    printf(" %s\n", filename);
}
It works fine for small files, but when the file is large (a couple of GB) it prints size 0.
Why is this not working for large files?
EDIT
Actually, it only prints file size 0 when the file size is a multiple of 4 GB. In other cases, when the file is large but not an exact multiple, it prints a negative number.
But when I capture the return code of lstat and print it, it is 0:
ret = lstat(filename, &fileStat);
I am compiling and running my code on a 64-bit system.
Obviously, the fileStat.st_size is overflowing, but why?
lstat is giving you the right answer; it's printf that's the problem. Use %'ld instead of %'d to be good enough in practice, or, if you want to be pedantically correct, do this instead:
printf(" %'jd", (intmax_t)fileStat.st_size);
You may also need to #include <stdint.h> if you get an error that intmax_t doesn't exist.
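Putting the fix into context, the corrected function might look like this (a sketch; the cast, the matching format specifier, and the explicit return are the only substantive changes):

#include <stdio.h>
#include <stdint.h>     /* intmax_t */
#include <sys/stat.h>   /* lstat */

void *mystat(void *filename) {
    struct stat fileStat;
    if (lstat(filename, &fileStat) < 0) {
        fprintf(stderr, "no such file or directory: %s\n", (char *)filename);
        return NULL;
    }
    /* cast to intmax_t so the format matches off_t whatever its width */
    printf(" %'jd", (intmax_t)fileStat.st_size);
    printf(" %s\n", (char *)filename);
    return NULL;
}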
I am trying to use fread and fwrite to read and write data pertaining to a structure in a file. Here's my code:
#include <stdio.h>
#include <time.h>
#include <stdlib.h>
#include <string.h>

typedef struct book book;
struct book
{
    char title[200];
    char auth[200];
    char publi[200];
    int p_year;
    int price;
    int edition;
    int isbn;
};

int main()
{
    int i;
    FILE *fp = fopen("this.dat", "w");
    book *a = calloc(1000000, sizeof(book));

    srand(time(NULL));
    for (i = 0; i < 1000000; i++)
    {
        a[i].price = rand() % 1000;
        a[i].p_year = 1500 + rand() % 518;
        a[i].isbn = 10000 + rand() % 100000;
        a[i].edition = i % 15;
        strcpy(a[i].title, "title");
        strcpy(a[i].auth, "author");
        strcpy(a[i].publi, "publication");
    }
    if ((i = fwrite(a, sizeof(*a), 1000000, fp)) != 1000000)
    {
        printf("ERROR - Only %d records written\n", i);
        printf("feof:%d\nferror:%d", feof(fp), ferror(fp));
        return EXIT_FAILURE;
    }
    if (ferror(fp))
    {
        printf("ERROR");
        return EXIT_FAILURE;
    }
    if (fclose(fp) != 0)
    {
        printf("ERROR while closing the stream");
        return EXIT_FAILURE;
    }
    if ((fp = fopen("this.dat", "r")) == NULL)
    {
        printf("ERROR reopening");
        return EXIT_FAILURE;
    }
    if ((i = fread(a, sizeof(book), 100, fp)) != 100)
    {
        printf("ERROR - Only %d records read\n", i);
        printf("feof:%d\nferror:%d", feof(fp), ferror(fp));
        return EXIT_FAILURE;
    }
    if (ferror(fp))
    {
        printf("~ERROR");
        return EXIT_FAILURE;
    }
    for (i = 0; i < 100; i++)
        printf("price:%d\nedition:%d\nisbn:%d\np_year:%d\n\n\n",
               a[i].price, a[i].edition, a[i].isbn, a[i].p_year);
    fclose(fp);
    return EXIT_SUCCESS;
}
The thing is, occasionally it executes successfully, but most of the time it doesn't: I get an error while reading back from the file using fread. It ends up reading a variable number of records each time, and fewer than it's supposed to (i.e. 100). The following is one of the outputs of an unsuccessful run:
ERROR - Only 25 records read
feof:16
ferror:0
Question 1: Why is EOF reached after reading just 25 records when more than 25 were written? (I've tried using rewind/fseek after reopening the file, but the issue persisted.)
Question 2: In such cases, is it normal for the data contained in the array a beyond a[x-1] to get corrupted when x (< 100) records are read? Would the data beyond a[99] still have been corrupted even if 100 records had been read successfully? (I know the data gets corrupted, since printing the fields of elements of a beyond the xth element yields inappropriate values, like price > 1000 or price < 0, and so on.)
You shouldn't open your files in text mode when reading/writing binary structures.
Whereas it has no effect on Linux/Unix, on Windows this has serious consequences. And it makes your files non-shareable between Windows and Linux.
Depending on the data, LF <=> CR/LF conversion can corrupt/shift it (removing a carriage return or inserting one).
In text mode on Windows, each LF (ASCII 10) byte is replaced by CR+LF (ASCII 13+10) bytes when writing (and the reverse when reading: 13+10 => 10). Those 10 bytes do occur; for instance, writing the year 1802 (hex: 0x70A) as binary produces one.
Solution: use binary mode:
if((fp = fopen("this.dat","rb")) == NULL)
and
FILE* fp = fopen("this.dat","wb");
Note: In "text" mode, specifying a block size doesn't work since the size depends on the data. That probably answers your second question: last 100th record read is corrupt because you're reading too few bytes. I'm not sure about the details but since the system adds/removes bytes when writing/reading, block size can be buggy.
I have to write C code for reading large files. The code is below:
int read_from_file_open(char *filename, long size)
{
    long read1 = 0;
    int result = 1;
    int fd;
    int check = 0;
    long *buffer = (long *)malloc(size * sizeof(int));

    fd = open(filename, O_RDONLY | O_LARGEFILE);
    if (fd == -1)
    {
        printf("\nFile Open Unsuccessful\n");
        exit(0);
    }
    long chunk = 0;
    lseek(fd, 0, SEEK_SET);
    printf("\nCurrent Position%d\n", lseek(fd, size, SEEK_SET));
    while (chunk < size)
    {
        printf("the size of chunk read is %d\n", chunk);
        if (read(fd, buffer, 1048576) == -1)
        {
            result = 0;
        }
        if (result == 0)
        {
            printf("\nRead Unsuccessful\n");
            close(fd);
            return (result);
        }
        chunk = chunk + 1048576;
        lseek(fd, chunk, SEEK_SET);
        free(buffer);
    }
    printf("\nRead Successful\n");
    close(fd);
    return (result);
}
The issue I am facing here is that as long as the argument passed (the size parameter) is less than 264000000 bytes, it seems to be able to read: I see the chunk variable increasing with each cycle.
When I pass 264000000 bytes or more, the read fails, i.e. according to the check used, read returns -1.
Can anyone point me to why this is happening? I am compiling with cc in normal mode, not using DD64.
In the first place, why do you need lseek() in your cycle? read() advances the file position by the number of bytes read.
And, to the topic: long, and hence chunk, has a maximum value of 2147483647 where long is 32 bits; any number greater than that wraps to a negative value.
You want to declare chunk as off_t, and size as size_t.
That's the main reason why lseek() fails.
And, then again, as other people have noticed, you do not want to free() your buffer inside the cycle.
Note also that you will overwrite the data you have already read.
Additionally, read() will not necessarily read as much as you asked it to, so it is better to advance chunk by the number of bytes actually read rather than the number you wanted to read.
Taking all of this into account, the correct code should probably look something like this:
// Edited: note comments after the code
#ifndef O_LARGEFILE
#define O_LARGEFILE 0
#endif

int read_from_file_open(char *filename, size_t size)
{
    int fd;
    char *buffer = malloc(size);

    if (buffer == NULL)
    {
        printf("\nMemory Allocation Unsuccessful\n");
        return 0;
    }
    fd = open(filename, O_RDONLY | O_LARGEFILE);
    if (fd == -1)
    {
        printf("\nFile Open Unsuccessful\n");
        free(buffer);
        return 0;
    }
    off_t chunk = 0;
    while (chunk < (off_t)size)
    {
        printf("read so far: %ld\n", (long)chunk);
        /* never ask for more than fits in the rest of the buffer */
        size_t want = (size - (size_t)chunk < 1048576) ? (size - (size_t)chunk) : 1048576;
        ssize_t readnow = read(fd, buffer + chunk, want);  /* ssize_t, not size_t: must hold -1 */
        if (readnow < 0)
        {
            printf("\nRead Unsuccessful\n");
            free(buffer);
            close(fd);
            return 0;
        }
        if (readnow == 0)   /* end of file before `size` bytes: stop */
            break;
        chunk = chunk + readnow;
    }
    printf("\nRead Successful\n");
    free(buffer);
    close(fd);
    return 1;
}
I also took the liberty of removing the result variable and all related logic since, I believe, it can be simplified.
Edit: I have noted that some systems (most notably, BSD) do not have O_LARGEFILE, since it is not needed there. So I have added an #ifndef at the beginning, which makes the code more portable.
The lseek function may have difficulty supporting big file sizes. Try using lseek64.
Please check the link to see the associated macros which need to be defined when you use the lseek64 function.
If it's a 32-bit machine, there will be problems reading a file larger than 4 GB. So if you are using the gcc compiler, try the macros -D_LARGEFILE_SOURCE=1 and -D_FILE_OFFSET_BITS=64.
Please check this link too.
If you are using any other compiler, check for similar compiler options.
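To verify that the macros took effect, one quick check (assuming glibc-style large file support) is to print the width of off_t. Compiled with gcc -D_LARGEFILE_SOURCE=1 -D_FILE_OFFSET_BITS=64, this sketch should print 8 even on a 32-bit machine:

#include <stdio.h>
#include <sys/types.h>   /* off_t */

int main(void)
{
    /* 8 means 64-bit file offsets are in effect; 4 means plain 32-bit off_t */
    printf("sizeof(off_t) = %zu\n", sizeof(off_t));
    return 0;
}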
I am reading from a file like this:
#include <stdio.h>
#include <stdlib.h>

int main() {
    FILE *fp = fopen("sorted_hits", "r+");
    while (!feof(fp)) {
        int item_read;
        int *buffer = (int *)malloc(sizeof(int));
        item_read = fread(buffer, sizeof(int), 1, fp);
        if (item_read == 0) {
            printf("at file %ld\n", ftell(fp));
            perror("read error:");
        }
    }
}
This file is big and I sometimes get a "Bad file descriptor" error. ftell indicates the file position at which the error occurred.
I don't know why it happens only sometimes. Is that normal? Does the problem lie in my code or in my hard disk? How should I handle this?
perror prints whatever is in errno as a descriptive string. errno gets set to an error code whenever a system call returns an error; but if a system call DOESN'T fail, errno is not modified and continues to hold whatever it held before. Now, if fread returns 0, that means either there was an error OR you reached the end of the file. In the latter case, errno is not set and might contain any random garbage from before.
So in this case, the "Bad file descriptor" message you're getting probably just means there hasn't been an error at all. You should be checking ferror(fp) to see whether an error has actually occurred.
You seem to be mixing text and binary modes when reading the file.
Normally when you use fread you read from a binary file, i.e. fread reads a number of bytes matching the buffer size, but you seem to be opening the file in text mode (r+). ftell doesn't work reliably on files opened in text mode because newlines are treated differently from other characters.
Open the file in binary mode (untranslated) instead:
FILE *fp = fopen("sorted_hits", "rb+");
If that's really what your loop looks like, my guess would be that you're getting a more or less spurious error because your process is simply running out of memory: your loop leaks it badly, calling malloc on every iteration with no matching call to free anywhere.
It's also possible (but a lot less likely) that you're running into a little problem from your (common but nearly always incorrect) use of while (!feof(fp)).
Your call to printf also gives undefined behavior because you've mismatched the conversion and the type (though on many current systems it's irrelevant because long and int are the same size).
Fixing those may or may not remove the problem you've observed, but at least if you still see it, you'll have narrowed down the possibilities of what may be causing the problem.
int main() {
    FILE *fp = fopen("sorted_hits", "r+");
    int buffer;

    while (0 != fread(&buffer, sizeof(int), 1, fp))
        ;   // read file but ignore contents
    if (ferror(fp)) {
        printf("At file: %ld\n", ftell(fp));
        perror("read error: ");
    }
}
This looks like a simple question, but I didn't find anything similar here.
Since there is no file copy function in C, we have to implement file copying ourselves, but I don't like reinventing the wheel even for trivial stuff like that, so I'd like to ask the crowd:
What code would you recommend for file copying using fopen()/fread()/fwrite()?
What code would you recommend for file copying using open()/read()/write()?
This code should be portable (Windows/Mac/Linux/BSD/QNX/you name it), stable, time tested, fast, memory efficient, and so on. Getting into a specific system's internals to squeeze out more performance is welcome (like getting the filesystem's cluster size).
This seems like a trivial question but, for example, the source code for the cp command isn't 10 lines of C code.
This is the function I use when I need to copy from one file to another - with test harness:
/*
@(#)File:           $RCSfile: fcopy.c,v $
@(#)Version:        $Revision: 1.11 $
@(#)Last changed:   $Date: 2008/02/11 07:28:06 $
@(#)Purpose:        Copy the rest of file1 to file2
@(#)Author:         J Leffler
@(#)Modified:       1991,1997,2000,2003,2005,2008
*/
/*TABSTOP=4*/

#include "jlss.h"
#include "stderr.h"

#ifndef lint
/* Prevent over-aggressive optimizers from eliminating ID string */
const char jlss_id_fcopy_c[] = "@(#)$Id: fcopy.c,v 1.11 2008/02/11 07:28:06 jleffler Exp $";
#endif /* lint */

void fcopy(FILE *f1, FILE *f2)
{
    char buffer[BUFSIZ];
    size_t n;

    while ((n = fread(buffer, sizeof(char), sizeof(buffer), f1)) > 0)
    {
        if (fwrite(buffer, sizeof(char), n, f2) != n)
            err_syserr("write failed\n");
    }
}

#ifdef TEST

int main(int argc, char **argv)
{
    FILE *fp1;
    FILE *fp2;

    err_setarg0(argv[0]);
    if (argc != 3)
        err_usage("from to");
    if ((fp1 = fopen(argv[1], "rb")) == 0)
        err_syserr("cannot open file %s for reading\n", argv[1]);
    if ((fp2 = fopen(argv[2], "wb")) == 0)
        err_syserr("cannot open file %s for writing\n", argv[2]);
    fcopy(fp1, fp2);
    return(0);
}
#endif /* TEST */
Clearly, this version uses file pointers from standard I/O and not file descriptors, but it is reasonably efficient and about as portable as it can be.
Well, except the error function - that's peculiar to me. As long as you handle errors cleanly, you should be OK. The "jlss.h" header declares fcopy(); the "stderr.h" header declares err_syserr() amongst many other similar error reporting functions. A simple version of the function follows - the real one adds the program name and does some other stuff.
#include "stderr.h"
#include <stdarg.h>
#include <stdlib.h>
#include <string.h>
#include <errno.h>
void err_syserr(const char *fmt, ...)
{
int errnum = errno;
va_list args;
va_start(args, fmt);
vfprintf(stderr, fmt, args);
va_end(args);
if (errnum != 0)
fprintf(stderr, "(%d: %s)\n", errnum, strerror(errnum));
exit(1);
}
The code above may be treated as having a modern BSD license or GPL v3 at your choice.
As far as the actual I/O goes, the code I've written a million times in various guises for copying data from one stream to another goes something like this. It returns 0 on success, or -1 with errno set on error (in which case any number of bytes might have been copied).
Note that for copying regular files, you can skip the EAGAIN stuff, since regular files are always blocking I/O. But inevitably if you write this code, someone will use it on other types of file descriptors, so consider it a freebie.
There's a file-specific optimisation that GNU cp does, which I haven't bothered with here: for long blocks of 0 bytes, instead of writing, you just extend the output file by seeking off the end (there's a sketch of this after copy_data below).
void block(int fd, int event) {
    struct pollfd topoll;
    topoll.fd = fd;
    topoll.events = event;
    poll(&topoll, 1, -1);
    // no need to check errors - if the stream is bust then the
    // next read/write will tell us
}
int copy_data_buffer(int fdin, int fdout, void *buf, size_t bufsize) {
    for (;;) {
        // read data to buffer
        ssize_t bytestowrite = read(fdin, buf, bufsize);
        if (bytestowrite == 0) break; // end of input
        if (bytestowrite == -1) {
            if (errno == EINTR) continue; // signal handled
            if (errno == EAGAIN) {
                block(fdin, POLLIN);
                continue;
            }
            return -1; // error
        }

        // write data from buffer
        char *pos = buf;   // char * so the pointer arithmetic below is standard C
        while (bytestowrite > 0) {
            ssize_t bytes_written = write(fdout, pos, bytestowrite);
            if (bytes_written == -1) {
                if (errno == EINTR) continue; // signal handled
                if (errno == EAGAIN) {
                    block(fdout, POLLOUT);
                    continue;
                }
                return -1; // error
            }
            bytestowrite -= bytes_written;
            pos += bytes_written;
        }
    }
    return 0; // success
}
// Default value. I think it will get close to maximum speed on most
// systems, short of using mmap etc. But porters / integrators
// might want to set it smaller, if the system is very memory
// constrained and they don't want this routine to starve
// concurrent ops of memory. And they might want to set it larger
// if I'm completely wrong and larger buffers improve performance.
// It's worth trying several MB at least once, although with huge
// allocations you have to watch for the linux
// "crash on access instead of returning 0" behaviour for failed malloc.
#ifndef FILECOPY_BUFFER_SIZE
#define FILECOPY_BUFFER_SIZE (64*1024)
#endif

int copy_data(int fdin, int fdout) {
    // optional exercise for reader: take the file size as a parameter,
    // and don't use a buffer any bigger than that. This prevents
    // memory-hogging if FILECOPY_BUFFER_SIZE is very large and the file
    // is small.
    for (size_t bufsize = FILECOPY_BUFFER_SIZE; bufsize >= 256; bufsize /= 2) {
        void *buffer = malloc(bufsize);
        if (buffer != NULL) {
            int result = copy_data_buffer(fdin, fdout, buffer, bufsize);
            free(buffer);
            return result;
        }
    }
    // could use a stack buffer here instead of failing, if desired.
    // 128 bytes ought to fit on any stack worth having, but again
    // this could be made configurable.
    return -1; // errno is ENOMEM
}
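For the curious, the zero-block trick mentioned above could be sketched like this (write_block_sparse is my name, not part of the code above; a real version must also ftruncate() the output at the end so a trailing hole isn't lost, and should only seek when the output is a seekable regular file):

#include <unistd.h>     /* lseek, write */

/* Sketch of the sparse-file optimisation: if a block is entirely zero
   bytes, seek forward instead of writing, leaving a hole in the output.
   Returns 0 on success, -1 on error; treats a short write as an error
   for brevity. */
static int write_block_sparse(int fdout, const char *buf, size_t len)
{
    size_t i;
    for (i = 0; i < len && buf[i] == 0; i++)
        ;
    if (i == len)   /* all zeros: extend the file by seeking past them */
        return lseek(fdout, (off_t)len, SEEK_CUR) == (off_t)-1 ? -1 : 0;
    return write(fdout, buf, len) == (ssize_t)len ? 0 : -1;
}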
To open the input file:
int fdin = open(infile, O_RDONLY|O_BINARY, 0);
if (fdin == -1) return -1;
Opening the output file is tricksy. As a basis, you want:
int fdout = open(outfile, O_WRONLY|O_BINARY|O_CREAT|O_TRUNC, 0x1ff);
if (fdout == -1) {
    close(fdin);
    return -1;
}
But there are confounding factors:
you need to special-case when the files are the same, and I can't remember how to do that portably.
if the output filename is a directory, you might want to copy the file into the directory.
if the output file already exists (open with O_EXCL to determine this and check for EEXIST on error), you might want to do something different, as cp -i does.
you might want the permissions of the output file to reflect those of the input file.
you might want other platform-specific meta-data to be copied.
you may or may not wish to unlink the output file on error.
Obviously the answers to all these questions could be "do the same as cp". In which case the answer to the original question is "ignore everything I or anyone else has said, and use the source of cp".
Btw, getting the filesystem's cluster size is next to useless. You'll almost always see speed increasing with buffer size long after you've passed the size of a disk block.
The size of each read should be a multiple of 512 (the sector size); 4096 is a good choice.
Here is a very easy and clear example: Copy a file. Since it is written in ANSI C without any special function calls, I think this one would be pretty portable.
Depending on what you mean by copying a file, it is certainly far from trivial. If you mean copying the content only, then there is next to nothing to do. But generally, you need to copy the metadata of the file too, and that's surely platform dependent. I don't know of any C library that does this in a portable manner. Just handling the filename by itself is no trivial matter if you care about portability.
In C++, there is the filesystem library in Boost.
One thing I found when implementing my own file copy, and it seems obvious but it's not: I/O is slow. You can pretty much estimate your copy's speed by how many I/O operations you do, so clearly you need to do as few of them as possible.
The best results I found were when I got myself a ginormous buffer, read the entire source file into it in one I/O, then wrote the entire buffer back out in one I/O. If I even had to do it in 10 batches, it got way slow. Trying to read and write each byte, as a naive coder might try first, was just painful.
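For what it's worth, a sketch of that whole-file approach might look like this (copy_whole_file is my name; it assumes the file fits comfortably in memory, and ftell limits it to LONG_MAX bytes):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: copy a file with one big read and one big write.
   Returns 0 on success, -1 on any failure. */
static int copy_whole_file(const char *src, const char *dst)
{
    int rc = -1;
    FILE *in = fopen(src, "rb");
    if (in == NULL)
        return -1;

    long size = -1;
    if (fseek(in, 0, SEEK_END) == 0)
        size = ftell(in);               /* file size, if the stream is seekable */
    rewind(in);
    if (size < 0) {
        fclose(in);
        return -1;
    }

    char *buf = malloc(size > 0 ? (size_t)size : 1);
    if (buf != NULL && fread(buf, 1, (size_t)size, in) == (size_t)size) {
        FILE *out = fopen(dst, "wb");
        if (out != NULL) {
            if (fwrite(buf, 1, (size_t)size, out) == (size_t)size)
                rc = 0;
            if (fclose(out) != 0)       /* flush errors surface here */
                rc = -1;
        }
    }
    free(buf);
    fclose(in);
    return rc;
}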
The accepted answer written by Steve Jessop does not answer the first part of the question; Jonathan Leffler's does, but does it wrong: the code should be written as
while ((n = fread(buffer, 1, sizeof(buffer), f1)) > 0)
    if (fwrite(buffer, n, 1, f2) != 1)
        /* we got a write error here */;
/* test ferror(f1) for read errors */
Explanation:
sizeof(char) = 1 by definition, always: it does not matter how many bits are in it (8 in most cases, but also 9, 11 or 32 on some DSPs, for example); the size of char is one. Note, this is not an error here, just extra code.
The fwrite function writes up to nmemb (second argument) elements of the specified size (third argument); it is not required to write exactly nmemb elements. To fix this, you must either write the rest of the data read or just write one element of size n, letting fwrite do all the work. (This item is the point in question: must fwrite write all the data or not? Either way, in my version a short write is impossible until an error occurs.)
You should test for read errors too: just test ferror(f1) at the end of the loop.
Note, you probably need to disable buffering on both input and output files to prevent triple buffering: first into the f1 buffer, second in our own buffer, third into the f2 buffer:
setvbuf(f1, NULL, _IONBF, 0);
setvbuf(f2, NULL, _IONBF, 0);
(Internal buffers should probably be of size BUFSIZ.)
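Putting those points together, the corrected loop as a whole might read like this (a sketch; it returns -1 instead of calling err_syserr, so the error convention differs from the original):

#include <stdio.h>

int fcopy(FILE *f1, FILE *f2)
{
    char buffer[BUFSIZ];
    size_t n;

    while ((n = fread(buffer, 1, sizeof(buffer), f1)) > 0)
    {
        /* one element of size n: fwrite either writes it all or fails */
        if (fwrite(buffer, n, 1, f2) != 1)
            return -1;              /* write error */
    }
    return ferror(f1) ? -1 : 0;     /* distinguish read error from EOF */
}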