In another question, the accepted answer shows a method for reading the contents of a file into memory.
I have been trying to use this method to read in the content of a text file and then copy it to a new file. When I write the contents of the buffer to the new file, however, there is always some extra garbage at the end of the file. Here is an example of my code:
inputFile = fopen("D:\\input.txt", "r");
outputFile = fopen("D:\\output.txt", "w");
if(inputFile)
{
//Get size of inputFile
fseek(inputFile, 0, SEEK_END);
inputFileLength = ftell(inputFile);
fseek(inputFile, 0, SEEK_SET);
//Allocate memory for inputBuffer
inputBuffer = malloc(inputFileLength);
if(inputBuffer)
{
fread (inputBuffer, 1, inputFileLength, inputFile);
}
fclose(inputFile);
if(inputBuffer)
{
fprintf(outputFile, "%s", inputBuffer);
}
//Cleanup
free(inputBuffer);
fclose(outputFile);
}
The output file always contains an exact copy of the input file, but then has the text "MPUTERNAM2" appended to the end. Can anyone shed some light as to why this might be happening?
You may be happier with
int numBytesRead = 0;
if(inputBuffer)
{
numBytesRead = fread (inputBuffer, 1, inputFileLength, inputFile);
}
fclose(inputFile);
if(inputBuffer)
{
fwrite( inputBuffer, 1, numBytesRead, outputFile );
}
It doesn't need a null-terminated string (and therefore will work properly on binary data containing zeroes)
Because you are writing the buffer as if it were a string. Strings end with a null character ('\0'); the file you read does not.
You could null-terminate your string, but a better solution is to use fwrite() instead of fprintf(). This would also let you copy files that contain null characters.
Unless you know the input file will always be small, you might consider reading/writing in a loop so that you can copy files larger than memory.
You haven't allocated enough space for the terminating null character in your buffer (and you also forgot to actually set it), so your fprintf is effectively reading past the end of the buffer into some other memory. Your buffer is exactly the same size as the file and is filled with its content. fprintf, however, scans its argument looking for the terminating null, which isn't there, and only stops a few characters later where, coincidentally, there is one.
EDIT
You're actually mixing two types of io, fread (which is paired with fwrite) and fprintf (which is paired with fscanf). You should probably be doing fwrite with the number of bytes to write; or conversely, use fscanf, which would null-terminate your string (although, this wouldn't allow nulls in your string).
Allocating memory to fit the file is actually quite a bad way of doing it, especially the way it's done here. If the malloc() fails, no data is written to the output file (and it fails silently). It also means you can't copy files greater than a few gigabytes on a 32-bit platform, due to the address space limitations.
It's actually far better to use a smaller memory chunk (allocated or on the stack) and read/write the file in chunks. The reads and writes will be buffered anyway and, as long as you make the chunks relatively large, the overhead of function calls to the C runtime libraries is minimal.
You should also always copy files in binary mode. It's faster, since there's no chance of translation, and the copy is byte-for-byte identical.
Something like:
FILE *fin = fopen ("infile","rb"); // make sure you check these for NULL return
FILE *fout = fopen ("outfile","wb");
char buff[1000000]; // or malloc/check-null if you don't have much stack space.
size_t count; // fread returns a size_t, never a negative value.
while ((count = fread (buff, 1, sizeof(buff), fin)) > 0) {
    if (fwrite (buff, 1, count, fout) != count) {
        break; // short write: report the error and stop.
    }
}
// fread returns 0 for both end-of-file and error; ferror (fin) tells them apart.
fclose (fout);
fclose (fin);
This is from memory but provides the general idea of how to do it. And you should always have copious error checking.
fprintf expects inputBuffer to be null-terminated, which it isn't. So it's reading past the end of inputBuffer and printing whatever's there (into your new file) until it finds a null character.
In this case you could malloc an extra byte and put a null as the last character in inputBuffer.
In addition to what others have said: you should also open your files in binary mode. Otherwise, you might get unexpected results on Windows (or other non-POSIX systems).
You can use
fwrite (inputBuffer , 1 , inputFileLength , outputFile );
instead of fprintf, to avoid the zero-terminated string problem. It also "matches better" with fread :)
Try using fgets instead; it adds the null terminator for you at the end of the string. Also, as was said above, you need one more byte for the null terminator.
For example:
The string "Davy" is stored as the array D, a, v, y, \0. Your array therefore needs at least (string length + 1) bytes to hold the null terminator. Also, fread will not add the terminator automatically, which is why you get garbage even if your file is much shorter than the buffer.
Note that a lazy alternative is to use calloc, which zero-fills the buffer. You must still fread at most one byte fewer than the allocated size, so that the final zero byte survives.
Related
Consider this code to read a text based file. This sort of fread() usage was briefly touched upon in the excellent book C Programming: A Modern Approach by K.N. King.
There are other methods of reading text based files, but here I am concerned with fread() only.
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
// Declare file stream pointer.
FILE *fp = fopen("Note.txt", "r");
// fopen() call successful.
if(fp != NULL)
{
// Navigate through to end of the file.
fseek(fp, 0, SEEK_END);
// Calculate the total bytes navigated.
long filesize = ftell(fp);
// Navigate to the beginning of the file so
// it can be read.
rewind(fp);
// Declare array of char with appropriate size.
char content[filesize + 1];
// Set last char of array to contain NULL char.
content[filesize] = '\0';
// Read the file content.
fread(content, filesize, 1, fp);
// Close file stream pointer.
fclose(fp);
// Print file content.
printf("%s\n", content);
}
// fopen() call unsuccessful.
else
{
printf("File could not be read.\n");
}
return 0;
}
There are some problems I have with this method. My opinion is that this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
To circumvent this issue, we may use a buffer size and keep on reading into a char array of that size. If filesize is less than the buffer size, then we simply perform fread() once as described in the above code. Otherwise, we divide the total file size by the buffer size and use the integer part of the result as the number of times to iterate a loop in which we invoke fread(), each time appending the read buffer to a larger string (which we would have malloc-ed with filesize + 1 beforehand). The final fread(), performed after the loop, then has to read exactly (filesize % buffersize) bytes of data into an array of that size, which is finally appended to the larger string. I find that if we perform fread() for the last chunk of data using buffersize as its second parameter, then extra garbage data of size (buffersize - chunksize) will be read in and the data might become corrupted. Are my assumptions here correct? Please explain if/how I have overlooked something.
Also, there is the issue that non-ASCII characters might not have size of 1 byte. In that case I would assume the proper amount is being read, but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?
this is not a safe method of performing fread() since there might be an overflow if we try to read an extremely large string. Is this opinion valid?
fread() does not care about strings (null-character-terminated arrays). It reads data as multiples of unsigned char (*1), with no special concern for the data content if the stream was opened in binary mode, and perhaps with some data processing (e.g. end-of-line, byte-order-mark) in text mode.
Are my assumptions here correct?
Failed assumptions:
Assuming ftell() return value equals the sum of fread() bytes.
The assumption can be false in text mode (which is how OP opened the file), and fseek() to the end is technically undefined behavior in binary mode.
Assuming not checking the return value of fread() is OK. Use the return value of fread() to know if an error occurred, end-of-file and how many multiples of bytes were read.
Assuming error checking is not required. ftell(), fread(), and fseek() (used instead of rewind()) all deserve error checks. In particular, ftell() readily fails on streams that have no certain end.
Assuming no null characters are read. Reading a whole text file and appending a null character does not necessarily produce one string. Robust code detects and/or copes with embedded null characters.
Multi-byte: assuming input meets the encoding requirements. Example: robust code detects (and rejects) invalid UTF8 sequences - perhaps after reading the entire file.
Extreme: Assuming a file length <= LONG_MAX, the max value returned from ftell(). Files may be larger.
but each byte is being read as a single char, so the text is distorted somehow? How is fread() handling reading of multi-byte chars?
fread() does not function on multi-byte boundaries, only multiples of unsigned char. A given fread() may end with a portion of a multi-byte and the next fread() will continue from mid-multi-byte.
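One way to cope with a chunk that ends mid-character is to measure how many trailing bytes belong to an unfinished sequence and carry them over to the next read. The sketch below assumes the input is UTF-8; the helper name utf8_incomplete_tail is hypothetical, and it only measures a truncated tail rather than validating the whole buffer.

```c
#include <stddef.h>

/* Hypothetical helper: returns how many bytes at the end of buf belong to a
   UTF-8 sequence that has been cut off, or 0 if the buffer ends on a
   character boundary. Invalid input is left for a real validator. */
size_t utf8_incomplete_tail(const unsigned char *buf, size_t len)
{
    size_t back = 0;

    /* Walk back over at most 3 continuation bytes (bit pattern 10xxxxxx). */
    while (back < len && back < 3 && (buf[len - 1 - back] & 0xC0) == 0x80)
        back++;
    if (back == len)
        return 0;               /* no lead byte found in this buffer */

    unsigned char lead = buf[len - 1 - back];
    size_t need;
    if ((lead & 0x80) == 0x00)      need = 1;   /* ASCII */
    else if ((lead & 0xE0) == 0xC0) need = 2;
    else if ((lead & 0xF0) == 0xE0) need = 3;
    else if ((lead & 0xF8) == 0xF0) need = 4;
    else return 0;              /* not a valid lead byte */

    size_t have = back + 1;     /* lead byte plus continuation bytes seen */
    return (have < need) ? have : 0;
}
```

A caller would process len minus the returned count, move the leftover bytes to the front of the buffer, and let the next fread() append after them.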
Instead of a 2-pass approach, consider a single pass:
// Pseudo code
total_read = 0
Allocate buffer, say 4096
forever
    if buffer full
        double buffer_size (`realloc()`)
    u = unused portion of buffer
    fread u bytes into unused portion of buffer
    total_read += number_just_read
    if (number_just_read < u)
        quit loop
Resize buffer to total_read (+ 1 if appending a '\0')
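The pseudo code above could be sketched in C roughly as follows. The function name read_all and its out_len parameter are assumptions made for illustration, and this variant always reserves one byte for a trailing '\0'.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: reads everything remaining in fp into a malloc'ed,
   null-terminated buffer and stores the byte count in *out_len.
   Returns NULL on a read error or allocation failure. */
char *read_all(FILE *fp, size_t *out_len)
{
    size_t cap = 4096, total = 0;
    char *buf = malloc(cap);
    if (buf == NULL)
        return NULL;

    for (;;) {
        /* Keep one byte spare for the terminator; double when full. */
        if (total + 1 >= cap) {
            if (cap > (size_t)-1 / 2) { free(buf); return NULL; }
            char *bigger = realloc(buf, cap * 2);
            if (bigger == NULL) { free(buf); return NULL; }
            buf = bigger;
            cap *= 2;
        }
        size_t want = cap - total - 1;
        size_t got = fread(buf + total, 1, want, fp);
        total += got;
        if (got < want)
            break;              /* end-of-file or error */
    }

    if (ferror(fp)) { free(buf); return NULL; }

    buf[total] = '\0';
    /* Shrink to fit; if that fails, the larger block is still valid. */
    char *fit = realloc(buf, total + 1);
    if (fit != NULL)
        buf = fit;
    if (out_len != NULL)
        *out_len = total;
    return buf;
}
```

Because the length is returned separately, a caller can still detect embedded null characters by comparing strlen() of the result with *out_len.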
Alternatively consider the need to read the entire file in before processing the data. I do not know the higher level goal, but often processing data as it arrives makes for less resource impact and faster throughput.
Advanced
Text files may be simple ASCII only, 8-bit code page defined, or one of various UTF encodings (with a byte-order-mark, etc.). The last line may or may not end with a '\n'. Robust text processing beyond simple ASCII is non-trivial.
ASCII and UTF-8 are the most common. IMO, handle 1 or both of those and error out on anything that does not meet their requirements.
*1 fread() reads in multiple of bytes as per the 3rd argument, which is 1 in OP's case.
// v --- multiple of 1 byte
fread(content, filesize, 1, fp);
I want to read the whole contents of a text file into a string. But at the end of my string, after the correct contents, some extra characters appear that I don't want, and the \0 I added seems to be lost.
char*
textFileInput(char* filename)
{
char* text;
long lSize;
FILE *pf = fopen(filename, "r");
if (pf == NULL) return NULL;
fseek(pf,0,SEEK_END);
lSize = ftell(pf);
text = (char*)malloc(lSize+1);
rewind(pf);
fread(text, sizeof(char), lSize, pf);
fclose(pf);
text[lSize] = '\0';//this sentence doesn't work well
return text;
}
The wrong characters appear after the '\n'.
This mistake appears on Windows. When I run the code on Linux, it seems to work well.
fseek followed by ftell counts the Windows newline as two bytes (because that's what it is).
But when you read the file it will translate the two-byte \r\n to plain newline \n.
Therefore the actual data in the buffer will seem shorter than the length you allocated, and you will set the null-terminator in the wrong position.
Always use the value returned by fread as the actual length, and use it to set the null-terminator.
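A minimal sketch of that fix, keeping the shape of the question's function. The name text_file_input is a stand-in, and the code assumes, as the question does, that ftell() at the end of the file gives a usable upper bound on the size.

```c
#include <stdio.h>
#include <stdlib.h>

/* Sketch: like the question's function, but the terminator is placed using
   the count that fread() reports rather than the size from ftell(). */
char *text_file_input(const char *filename)
{
    FILE *pf = fopen(filename, "r");
    if (pf == NULL)
        return NULL;

    /* In text mode on Windows the translated data can be shorter than this
       size, so treat it only as an upper bound. */
    if (fseek(pf, 0, SEEK_END) != 0) { fclose(pf); return NULL; }
    long size = ftell(pf);
    if (size < 0) { fclose(pf); return NULL; }
    rewind(pf);

    char *text = malloc((size_t)size + 1);
    if (text == NULL) { fclose(pf); return NULL; }

    size_t nread = fread(text, 1, (size_t)size, pf);
    fclose(pf);

    text[nread] = '\0';     /* terminate right after the bytes actually read */
    return text;
}
```

The caller owns the returned buffer and should free() it.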
Thanks for the comment.
I checked the return value of fread() and found that it is less than lSize, the variable I defined to hold the size of the file. I think the reason is that the end of a file on Windows contains some special marker, or that the \r in \r\n is counted but not read in.
I changed malloc to calloc, which automatically zero-initializes the buffer, and the problem was solved. I will be more careful next time.
So i have this code
FILE* file = fopen("file.txt", "r");
if(file == NULL)
{
printf("Failed to open file.\n");
return NULL;
}
fseek(file, 0L, SEEK_END);
long bufferSize = ftell(file);
fseek(file, 0L, SEEK_SET);
char* buffer = (char*) malloc(bufferSize);
if(buffer == NULL)
{
printf("Failed to allocate memory for buffer.\n");
return NULL;
}
fread(buffer, sizeof(char), bufferSize, file);
fclose(file);
This seems to work perfectly fine when printing to the console with printf("%s", buffer), but I am wondering whether it should be causing a buffer overflow, or whether it's wrong, since there seemingly isn't a null terminator character at the end.
Let's assume that file.txt has exactly 4 characters in it. When bufferSize is calculated, it will be a long with the value 4. So when I call malloc(bufferSize), I am creating a buffer of 4 bytes, which does not account for a null terminator character. Everywhere I have seen examples of people reading an entire text file they use code like this, but shouldn't this create a char* with the characters from the file and no terminating null character? Should I be allocating this buffer using malloc(bufferSize + 1) and adding a null terminator character?
This seems to work perfectly fine when printing to console with printf("%s", buffer)
Seeming to work perfectly fine is a perfect manifestation of undefined behavior.
should i be allocating this buffer using malloc(bufferSize + 1) and adding a null terminator character?
If you wish to use %s printf format specifier with the pointer to a consecutive bytes of printable characters, these bytes need to be terminated with a zero byte. Or the other way, %s printf format specifier needs a zero terminated sequence of bytes. Otherwise, undefined behavior happens.
So:
Either your input file contains a zero byte, so that %s stops outputting there,
or you need to supply a zero terminating byte yourself, to make sure that %s knows where to stop,
or you can iterate over the bytes yourself, for (...) { printf("%c", buffer[i]); }, or (assuming bufferSize is lower than INT_MAX, which it probably is) just tell printf when to stop by specifying the precision of the format specifier, like: printf("%.*s", (int)bufferSize, buffer);
or undefined behavior will happen.
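The precision option can be sketched like this. The helper name format_bytes is hypothetical, and snprintf() is used instead of printf() only so that the result is easy to inspect.

```c
#include <limits.h>
#include <stdio.h>

/* Hypothetical helper: formats the first len bytes of an unterminated buffer
   into out. Returns the snprintf() result, or -1 if len does not fit in the
   int that the '*' precision requires. */
int format_bytes(char *out, size_t out_size, const char *buf, size_t len)
{
    if (len > INT_MAX)
        return -1;
    /* The precision caps how many bytes %s may read from buf, so buf does
       not need a terminating zero byte. */
    return snprintf(out, out_size, "%.*s", (int)len, buf);
}
```

Output still stops early at an embedded zero byte, so fwrite() remains the safer choice for arbitrary data.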
Depending upon the size of the buffer you allocate and the size of the allocation unit your OS provides, there are often extra bytes at the end of the allocation. This means that, depending on how you later use the memory, an exact buffer allocation may lead to failure, or there may be spare byte(s) at the end of the allocation which your fread() would not overwrite. The result? You may test your program with files that have serendipitous sizes, but the program may fail intermittently once shipped.
Quick fix? Always allocate a bit more space at the end of your buffer - depending upon how your program interprets the bytes (char, short, int, long, long long, struct).
Note that the size of the allocation unit is less likely to save you if the string is nested in a struct, where struct elements are snuggled close together. But odd sized strings would still have spare space, depending upon compiler flags.
Note that your specific usage is finding the end of the file, and slurping the entire file into memory. Likely your OS provides memory in 16, 32, or 64 byte chunks. Which means that you have 1/16, 1/32, or 1/64 chance of accidentally strolling off the end of your allocated buffer.
Suggestions:
(0) Always allocate extra padding, to cushion running into walls.
(1) Consider using fstat() rather than ftell()?
(2) Consider memory mapping the file, rather than using malloc/free and fread.
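Suggestion (1) might look like the following on a POSIX system. The helper name stream_size is an assumption, and fileno() and fstat() are POSIX rather than standard C.

```c
#define _POSIX_C_SOURCE 200809L   /* for fileno() and fstat() */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Hypothetical helper (POSIX only): returns the size in bytes of the regular
   file behind fp, or -1 if it cannot be determined. Data still sitting in
   the stdio buffer of a stream being written is not counted until flushed. */
off_t stream_size(FILE *fp)
{
    struct stat st;
    if (fstat(fileno(fp), &st) != 0)
        return -1;
    if (!S_ISREG(st.st_mode))
        return -1;      /* pipes, terminals, etc. have no meaningful size */
    return st.st_size;
}
```

The size is still only a hint for text-mode reads, so the count returned by fread() should remain the authority on how many bytes are in the buffer.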
I am writing an academic project in C and I can use only the <fcntl.h> and <unistd.h> libraries for file operations.
I have the function to read file line by line. The algorithm is:
Set pointer at the beginning of the file and get current position.
Read data to the buffer (char buf[100]) with constant size, iterate character by character and detect end of line '\n'.
Increment current position: curr_pos = curr_pos + length_of_read_line;
Set pointer to current position using lseek(fd, current_position, SEEK_SET);
SEEK_SET - set pointer to given offset from the beginning of the file. In my pseudo code current_position is the offset.
And actually it works fine, but I always move the pointer starting at the beginning of the file - I use SEEK_SET - it isn't optimized.
lseek also accepts the argument SEEK_CUR, which refers to the current position. How can I move the pointer back relative to its current position (SEEK_CUR)? I tried a negative offset, but it didn't work.
The most efficient way to read lines of data from a file is typically to read a large chunk of data that may span multiple lines, process lines of data from the chunk until one reaches the end, move any partial line from the end of the buffer to the start, and then read another chunk of data. Depending upon the target system and task to be performed, it may be better to read enough to fill whatever space remains after the partial line, or it may be better to always read a power-of-two number of bytes and make the buffer large enough to accommodate a chunk that size plus a maximum-length partial line (left over from the previous read). The one difficulty with this approach is that all data must be read from the stream using the same buffer. In cases where that is practical, however, it will often allow better performance than using many separate calls to fread, and may be nicer than using fgets.
While it should be possible for a standard-library function to facilitate line input, the design of fgets is rather needlessly hostile, since it provides no convenient indication of how much data it has read. After reading each line, code that wants a string containing the printable portion will have to use strlen to try to ascertain how much data was read (hopefully the input won't contain any zero bytes) and then check the byte before the trailing zero to see if it's a newline. Not impossible, but awkward at the very least. If the fread-and-buffer approach will satisfy an application's needs, it's likely to be at least as efficient as using fgets, if not more so, and since the effort required to use fgets() robustly will be comparable to that required to use a buffering approach, one may as well use the latter.
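The chunk-and-carry-over approach described above could be sketched with read() as below, which also avoids seeking backwards at all. The line_reader type and its two functions are hypothetical names; <string.h> is used only for memchr() and memmove(), not for file operations, and a line longer than the buffer is handed out in pieces as a simplification.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical reader state: one buffer shared by every read on this fd. */
typedef struct {
    int fd;
    char buf[4096];
    size_t start;   /* offset of the first unconsumed byte */
    size_t end;     /* offset one past the last valid byte */
    int eof;
} line_reader;

void line_reader_init(line_reader *r, int fd)
{
    r->fd = fd;
    r->start = r->end = 0;
    r->eof = 0;
}

/* Returns 1 and sets *line / *len (newline excluded) when a line is ready,
   0 at end-of-file, -1 on a read error. *line points into the buffer and
   is only valid until the next call. */
int line_reader_next(line_reader *r, const char **line, size_t *len)
{
    for (;;) {
        /* Look for a newline in the data already buffered. */
        char *nl = memchr(r->buf + r->start, '\n', r->end - r->start);
        if (nl != NULL) {
            *line = r->buf + r->start;
            *len = (size_t)(nl - *line);
            r->start += *len + 1;
            return 1;
        }
        if (r->eof || r->end - r->start == sizeof r->buf) {
            /* Final unterminated line, or a line too long for the buffer. */
            if (r->end == r->start)
                return 0;
            *line = r->buf + r->start;
            *len = r->end - r->start;
            r->start = r->end;
            return 1;
        }
        /* Move the partial line to the front, then refill the rest. */
        memmove(r->buf, r->buf + r->start, r->end - r->start);
        r->end -= r->start;
        r->start = 0;
        ssize_t n = read(r->fd, r->buf + r->end, sizeof r->buf - r->end);
        if (n < 0)
            return -1;
        if (n == 0)
            r->eof = 1;
        r->end += (size_t)n;
    }
}
```

Typical use is to call line_reader_init() once on an open descriptor and then loop while line_reader_next() returns 1.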
Since your question is tagged as posix, I would go with getline(), without having to manually take care of moving the file pointer.
Example:
#include <stdio.h>
#include <stdlib.h>
int main(void)
{
FILE* fp;
char* line = NULL;
size_t len = 0;
ssize_t read;
fp = fopen("input.txt", "r");
if(fp == NULL)
return -1;
while((read = getline(&line, &len, fp)) != -1)
{
printf("Read line of length %zd:\n", read); /* read is an ssize_t, so use %zd */
printf("%s", line);
}
fclose(fp);
if(line)
free(line);
return 0;
}
Output with custom input:
Read line of length 11:
first line
Read line of length 12:
second line
Read line of length 11:
third line
I'm writing an application that deals with very large user-generated input files. The program will copy about 95 percent of the file, effectively duplicating it and switching a few words and values in the copy, and then appending the copy (in chunks) to the original file, such that each block (consisting of between 10 and 50 lines) in the original is followed by the copied and modified block, and then the next original block, and so on. The user-generated input conforms to a certain format, and it is highly unlikely that any line in the original file is longer than 100 characters long.
Which would be the better approach?
To use one file pointer and use variables that hold the current position of how much has been read and where to write to, seeking the file pointer back and forth to read and write; or
To use multiple file pointers, one for reading and one for writing.
I am mostly concerned with the efficiency of the program, as the input files will reach up to 25,000 lines, each about 50 characters long.
If you have memory constraints, or you want a generic approach, read bytes into a buffer from one file pointer, make changes, and write out the buffer to a second file pointer when the buffer is full. If you reach EOF on the first pointer, make your changes and just flush whatever is in the buffer to the output pointer. If you intend to replace the original file, copy the output file to the input file and remove the output file. This "atomic" approach lets you check that the copy operation took place correctly before deleting anything.
For example, to deal with generically copying over any number of bytes, say, 1 MiB at a time:
#define COPY_BUFFER_MAXSIZE 1048576
/* ... */
unsigned char *buffer = NULL;
buffer = malloc(COPY_BUFFER_MAXSIZE);
if (!buffer)
exit(-1);
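FILE *inFp = fopen(inFilename, "rb"); /* binary mode, so ftell() matches what fread() returns */
if (!inFp) { free(buffer); exit(-1); }
fseek(inFp, 0, SEEK_END);
long endPos = ftell(inFp); /* ftell() returns -1L on failure */
if (endPos < 0) { fclose(inFp); free(buffer); exit(-1); }
uint64_t fileSize = (uint64_t) endPos; /* uint64_t needs <stdint.h> */
rewind(inFp);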
FILE *outFp = stdout; /* change this if you don't want to write to standard output */
uint64_t outFileSizeCounter = fileSize;
/* we fread() bytes from inFp in COPY_BUFFER_MAXSIZE increments, until there is nothing left to fread() */
do {
if (outFileSizeCounter > COPY_BUFFER_MAXSIZE) {
fread(buffer, 1, (size_t) COPY_BUFFER_MAXSIZE, inFp);
/* -- make changes to buffer contents at this stage
-- if you resize the buffer, then copy the buffer and
change the following statement to fwrite() the number of
bytes in the copy of the buffer */
fwrite(buffer, 1, (size_t) COPY_BUFFER_MAXSIZE, outFp);
outFileSizeCounter -= COPY_BUFFER_MAXSIZE;
}
else {
fread(buffer, 1, (size_t) outFileSizeCounter, inFp);
/* -- make changes to buffer contents at this stage
-- again, make a copy of buffer if it needs resizing,
and adjust the fwrite() statement to change the number
of bytes that need writing */
fwrite(buffer, 1, (size_t) outFileSizeCounter, outFp);
outFileSizeCounter = 0ULL;
}
} while (outFileSizeCounter > 0);
free(buffer);
An efficient way to deal with a resized buffer is to keep a second pointer, say, unsigned char *copyBuffer, which is realloc()-ed to twice the size, if necessary, to deal with accumulated edits. That way, you keep expensive realloc() calls to a minimum.
Not sure why this got downvoted, but it's a pretty solid approach for doing things with a generic amount of data. Hope this helps someone who comes across this question, in any case.
25000 lines * 100 characters = 2.5MB, that's not really a huge file. The fastest will probably be to read the whole file in memory and write your results to a new file and replace the original with that.
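The "write a new file, then replace the original" step could be sketched as follows. The helper name replace_file and the caller-supplied temporary path are assumptions, and the modified contents are taken to be already assembled in memory.

```c
#include <stdio.h>

/* Hypothetical helper: writes len bytes to tmp_path, then renames it over
   path. Returns 0 on success, -1 on failure, in which case the original
   file is left untouched. */
int replace_file(const char *path, const char *tmp_path,
                 const void *data, size_t len)
{
    FILE *out = fopen(tmp_path, "wb");
    if (out == NULL)
        return -1;

    size_t written = fwrite(data, 1, len, out);
    /* fclose() flushes; a failed flush also means the data was not stored. */
    if (fclose(out) != 0 || written != len) {
        remove(tmp_path);
        return -1;
    }

    /* On POSIX systems rename() replaces an existing path; on other
       platforms the destination may have to be removed first. */
    if (rename(tmp_path, path) != 0) {
        remove(tmp_path);
        return -1;
    }
    return 0;
}
```

Keeping the temporary file in the same directory as the original avoids a rename() across file systems, which generally fails.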